Xiaojian Ma

Papers: 12

Total Citations: 145

Papers (12)

CLOVA: A Closed-LOop Visual Assistant with Tool Usage and Update

Multi-modal Agent Tuning: Building a VLM-Driven Agent for Efficient Tool Usage

Move to Understand a 3D Scene: Bridging Visual Grounding and Exploration for Efficient and Versatile Embodied Navigation

Bongard-OpenWorld: Few-Shot Reasoning for Free-form Visual Concepts in the Real World

ROCKET-1: Mastering Open-World Interaction with Visual-Temporal Context Prompting

Embodied VideoAgent: Persistent Memory from Egocentric Videos and Embodied Sensors Enables Dynamic Scene Understanding

An Embodied Generalist Agent in 3D World

Bongard-HOI: Benchmarking Few-Shot Visual Reasoning for Human-Object Interactions

Open-World Multi-Task Control Through Goal-Aware Representation Learning and Adaptive Horizon Prediction

3D-VisTA: Pre-trained Transformer for 3D Vision and Text Alignment

Imitation Learning from Observations by Minimizing Inverse Dynamics Disagreement

Unsupervised Foreground Extraction via Deep Region Competition