LLaRA: Supercharging Robot Learning Data for Vision-Language Policy

ICLR 2025

Abstract

Vision Language Models (VLMs) have recently been leveraged to generate robotic actions, forming Vision-Language-Action (VLA) models. However, directly adapting a pretrained VLM for robotic control remains challenging, particularly when constrained by a limited number of robot demonstrations. In this work, we introduce LLaRA: Large Language and Robotics Assistant, a framework that formulates robot action policy as visuo-textual conversations and enables efficient transfer of a pretrained VLM into a powerful VLA, motivated by the success of visual instruction tuning in computer vision. First, we present an automated pipeline to generate conversation-style instruction tuning data for robots from existing behavior cloning datasets, aligning robotic actions with image pixel coordinates. Further, we enhance this dataset in a self-supervised manner by defining six auxiliary tasks, without requiring any additional action annotations. We show that a VLM finetuned on a limited amount of such data can produce meaningful action decisions for robotic control. Through experiments across multiple simulated and real-world tasks, we demonstrate that LLaRA achieves state-of-the-art performance while preserving the generalization capabilities of large language models. The code, datasets, and pretrained models are available at https://github.com/LostXine/LLaRA.
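
The data-generation step described in the abstract can be pictured as turning each recorded demonstration step into an instruction-tuning conversation whose answer encodes the action as image pixel coordinates. The sketch below is a minimal, hypothetical illustration of that idea: the prompt wording, the field names (pick_xy, place_xy, conversations), and the coordinate normalization are assumptions for illustration, not LLaRA's actual schema.

```python
import json


def demo_to_conversation(image_path, instruction, pick_xy, place_xy, image_size):
    """Convert one behavior-cloning step into a conversation-style training sample.

    Hypothetical sketch: the action is expressed as 2D pixel coordinates,
    normalized by the image size so the textual answer is resolution-independent.
    """
    w, h = image_size
    pick = (round(pick_xy[0] / w, 3), round(pick_xy[1] / h, 3))
    place = (round(place_xy[0] / w, 3), round(place_xy[1] / h, 3))
    return {
        "image": image_path,
        "conversations": [
            {"from": "human",
             "value": f"<image>\nWhat action should the robot take to {instruction}?"},
            {"from": "gpt",
             "value": f"Pick at {pick}, then place at {place}."},
        ],
    }


if __name__ == "__main__":
    sample = demo_to_conversation(
        image_path="episode_000/step_003.png",   # illustrative path
        instruction="put the red block in the bowl",
        pick_xy=(412, 288),
        place_xy=(150, 96),
        image_size=(640, 480),
    )
    print(json.dumps(sample, indent=2))
```

Expressing actions in (normalized) pixel coordinates keeps them in the same visual frame the VLM already reasons about, which is why the conversation format transfers naturally from visual instruction tuning to robot control.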
