Xiaolong Wang

Papers: 72

Total Citations: 973

Papers (72)

Designing Deep Networks for Surface Normal Estimation

TD-MPC2: Scalable, Robust World Models for Continuous Control

GenSim: Generating Robotic Simulation Tasks via Large Language Models

One-Minute Video Generation with Test-Time Training

Investigating and Mitigating the Side Effects of Noisy Views for Self-Supervised Clustering Algorithms in Practical Multi-View Scenarios

WorldModelBench: Judging Video Generation Models As World Models

Hierarchical World Models as Visual Whole-Body Humanoid Controllers

Editable Image Elements for Controllable Synthesis

Consistent Flow Distillation for Text-to-3D Generation

Parallel Sequence Modeling via Generalized Spatial Propagation Network

3D-Spatial Multimodal Memory

3D Human Pose Estimation in the Wild by Adversarial Learning

Zero-Shot Recognition via Semantic Embeddings and Knowledge Graphs

Non-Local Neural Networks

Learning Correspondence From the Cycle-Consistency of Time

Putting Humans in a Scene: Learning Affordance in 3D Indoor Environments

Something-Else: Compositional Action Recognition With Spatial-Temporal Interaction Networks

Semi-Supervised 3D Hand-Object Poses Estimation With Interactions in Time

Synthesizing Long-Term 3D Human Motion and Interaction in 3D Scenes

Learning Continuous Image Representation With Local Implicit Image Function

CoordGAN: Self-Supervised Dense Correspondences Emerge From GANs

VideoINR: Learning Video Implicit Neural Representation for Continuous Space-Time Super-Resolution

GIFS: Neural Implicit Function for General Shape Representation

Look Outside the Room: Synthesizing a Consistent Long-Term 3D Scene Video From a Single Image

GroupViT: Semantic Segmentation Emerges From Text Supervision

Joint Hand Motion and Interaction Hotspots Prediction From Egocentric Videos

DexArt: Benchmarking Generalizable Dexterous Manipulation With Articulated Objects

Dynamic Inference With Grounding Based Vision and Language Models

Open-Vocabulary Panoptic Segmentation With Text-to-Image Diffusion Models

Zero-Shot Pose Transfer for Unrigged Stylized 3D Characters

Policy Adaptation From Foundation Model Feedback

Neural Volumetric Memory for Visual Locomotion Control

Unsupervised Learning of Visual Representations Using Videos

Transitive Invariance for Self-Supervised Visual Representation Learning

Temporal Dynamic Graph LSTM for Action-Driven Video Object Detection

Rethinking Self-Supervised Correspondence Learning: A Video Frame-Level Similarity Perspective

Video Autoencoder: Self-Supervised Disentanglement of Static 3D Structure and Motion

Contrastive Learning of Image Representations With Cross-Video Cycle-Consistency

Robust Object Detection via Instance-Level Temporal Cycle Confusion

A-SDF: Learning Disentangled Signed Distance Functions for Articulated Shape Representation

Meta-Baseline: Exploring Simple Meta-Learning for Few-Shot Learning

Region Similarity Representation Learning

Hand-Object Contact Consistency Reasoning for Human Grasps Generation

Rethinking Preventing Class-Collapsing in Metric Learning With Margin-Based Losses

ActorsNeRF: Animatable Few-shot Human Rendering with Generalizable NeRFs

FeatureNeRF: Learning Generalizable NeRFs by Distilling Foundation Models

Hierarchical Style-based Networks for Motion Synthesis

Scraping Textures from Natural Images for Synthesis and Editing

Transformers As Meta-Learners for Implicit Neural Representations

Learning Implicit Feature Alignment Function for Semantic Segmentation

DexMV: Imitation Learning for Dexterous Manipulation from Human Videos

COLMAP-Free 3D Gaussian Splatting

HomoMatcher: Achieving Dense Feature Matching with Semi-Dense Efficiency by Homography Estimation

EditAR: Unified Conditional Generation with Autoregressive Models

Image Neural Field Diffusion Models

HOIDiffusion: Generating Realistic 3D Hand-Object Interaction Data

Pixel-Aligned Language Model

CyberDemo: Augmenting Simulated Human Demonstration for Real-World Dexterous Manipulation

RGBD Objects in the Wild: Scaling Real-World 3D Object Learning from RGB-D Videos

Actions ~ Transformations

Binge Watching: Scaling Affordance Learning From Sitcoms

A-Fast-RCNN: Hard Positive Generation via Adversary for Object Detection

Joint-task Self-supervised Learning for Temporal Correspondence

Multi-Task Reinforcement Learning with Soft Modularization

Online Adaptation for Consistent Mesh Reconstruction in the Wild

Test-Time Personalization with a Transformer for Human Pose Estimation

Stabilizing Deep Q-Learning with ConvNets and Vision Transformers under Data Augmentation

Multi-Person 3D Motion Prediction with Multi-Range Transformers

NovelD: A Simple yet Effective Exploration Criterion

Category-Level 6D Object Pose Estimation in the Wild: A Semi-Supervised Learning Approach and A New Dataset

Fine-Grained Cross-View Geo-Localization Using a Correlation-Aware Homography Estimator

Elastic Decision Transformer