Siyuan Huang

51
Papers
364
Total Citations

Papers (51)

GUIOdyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices

ICCV 2025
96
citations

Move as You Say, Interact as You Can: Language-guided Human Motion Generation with Scene Affordance

CVPR 2024
78
citations

EnerVerse: Envisioning Embodied Future Space for Robotics Manipulation

NeurIPS 2025
34
citations

PixWizard: Versatile Image-to-Image Visual Assistant with Open-Language Instructions

ICLR 2025
26
citations

Move to Understand a 3D Scene: Bridging Visual Grounding and Exploration for Efficient and Versatile Embodied Navigation

ICCV 2025
24
citations

F-HOI: Toward Fine-grained Semantic-Aligned 3D Human-Object Interactions

ECCV 2024
22
citations

Decompositional Neural Scene Reconstruction with Generative Diffusion Prior

CVPR 2025
18
citations

Unveiling the Mist over 3D Vision-Language Understanding: Object-centric Evaluation with Chain-of-Analysis

CVPR 2025
17
citations

Neural-Symbolic Recursive Machine for Systematic Generalization

ICLR 2024
14
citations

TACO: Taming Diffusion for in-the-wild Video Amodal Completion

ICCV 2025
9
citations

SceneWeaver: All-in-One 3D Scene Synthesis with an Extensible and Self-Reflective Agent

NeurIPS 2025
8
citations

Trace3D: Consistent Segmentation Lifting via Gaussian Instance Tracing

ICCV 2025
7
citations

InteractAnything: Zero-shot Human Object Interaction Synthesis via LLM Feedback and Object Affordance Parsing

CVPR 2025
6
citations

Object-Centric Prompt-Driven Vision-Language-Action Model for Robotic Manipulation

CVPR 2025
4
citations

PrimHOI: Compositional Human-Object Interaction via Reusable Primitives

ICCV 2025
1
citation

Infrared Invisible Clothing: Hiding From Infrared Detectors at Multiple Angles in Real World

CVPR 2022
0
citations

Adversarial Texture for Fooling Person Detectors in the Physical World

CVPR 2022
0
citations

Prompt, Generate, Then Cache: Cascade of Foundation Models Makes Strong Few-Shot Learners

CVPR 2023
0
citations

GAPartNet: Cross-Category Domain-Generalizable Object Perception and Manipulation via Generalizable and Actionable Parts

CVPR 2023
0
citations

Diffusion-Based Generation, Optimization, and Planning in 3D Scenes

CVPR 2023
0
citations

Predicting Human Activities Using Stochastic Grammar

ICCV 2017
0
citations

Understanding Human Gaze Communication by Spatio-Temporal Graph Reasoning

ICCV 2019
0
citations

Holistic++ Scene Understanding: Single-View 3D Holistic Scene Parsing and Human Pose Estimation With Human-Object Interaction and Physical Commonsense

ICCV 2019
0
citations

YouRefIt: Embodied Reference Understanding With Language and Gesture

ICCV 2021
0
citations

VLGrammar: Grounded Grammar Induction of Vision and Language

ICCV 2021
0
citations

3D-VisTA: Pre-trained Transformer for 3D Vision and Text Alignment

ICCV 2023
0
citations

ARNOLD: A Benchmark for Language-Grounded Task Learning with Continuous States in Realistic 3D Scenes

ICCV 2023
0
citations

Full-Body Articulated Human-Object Interaction

ICCV 2023
0
citations

A Competence-aware Curriculum for Visual Concepts Learning via Question Answering

ECCV 2020
0
citations

LEMMA: A Multi-view Dataset for LEarning Multi-agent Multi-task Activities

ECCV 2020
0
citations

Spatio-Temporal Self-Supervised Representation Learning for 3D Point Clouds

ICCV 2021
0
citations

Masked Point-Entity Contrast for Open-Vocabulary 3D Scene Understanding

CVPR 2025
0
citations

MOVIS: Enhancing Multi-Object Novel View Synthesis for Indoor Scenes

CVPR 2025
0
citations

METASCENES: Towards Automated Replica Creation for Real-world 3D Scans

CVPR 2025
0
citations

Dynamic Motion Blending for Versatile Motion Editing

CVPR 2025
0
citations

ManipTrans: Efficient Dexterous Bimanual Manipulation Transfer via Residual Learning

CVPR 2025
0
citations

GROVE: A Generalized Reward for Learning Open-Vocabulary Physical Skill

CVPR 2025
0
citations

GWM: Towards Scalable Gaussian World Models for Robotic Manipulation

ICCV 2025
0
citations

AnySkill: Learning Open-Vocabulary Physical Skill for Interactive Agents

CVPR 2024
0
citations

Scaling Up Dynamic Human-Scene Interaction Modeling

CVPR 2024
0
citations

PhyScene: Physically Interactable 3D Scene Synthesis for Embodied AI

CVPR 2024
0
citations

An Embodied Generalist Agent in 3D World

ICML 2024
0
citations

SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models

ICML 2024
0
citations

Human-Centric Indoor Scene Synthesis Using Stochastic Grammar

CVPR 2018
0
citations

Learning Neural Representation of Camera Pose with Matrix Representation of Pose Shift via View Synthesis

CVPR 2021
0
citations

Cooperative Holistic Scene Understanding: Unifying 3D Object, Layout, and Camera Pose Estimation

NeurIPS 2018
0
citations

PerspectiveNet: 3D Object Detection from a Single RGB Image via Perspective Points

NeurIPS 2019
0
citations

EgoTaskQA: Understanding Human Tasks in Egocentric Videos

NeurIPS 2022
0
citations

HUMANISE: Language-conditioned Human Motion Generation in 3D Scenes

NeurIPS 2022
0
citations

ProBio: A Protocol-guided Multimodal Dataset for Molecular Biology Lab

NeurIPS 2023
0
citations

Tailoring Self-Attention for Graph via Rooted Subtrees

NeurIPS 2023
0
citations