Aniruddha Kembhavi

47 Papers

198 Total Citations

Papers (47)

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

SPOC: Imitating Shortest Paths in Simulation Enables Effective Navigation and Manipulation in the Real World

One Diffusion to Generate Them All

Iterated Learning Improves Compositionality in Large Vision-Language Models

Holodeck: Language Guided Generation of 3D Embodied AI Environments

Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision Language Audio and Action

Seeing the Unseen: Visual Common Sense for Semantic Placement

Are You Smarter Than a Sixth Grader? Textbook Question Answering for Multimodal Machine Comprehension

Structured Set Matching Networks for One-Shot Part Labeling

IQA: Visual Question Answering in Interactive Environments

Don't Just Assume; Look and Answer: Overcoming Priors for Visual Question Answering

ELASTIC: Improving CNNs With Dynamic Scaling Policies

Two Body Problem: Collaborative Visual Task Completion

RoboTHOR: An Open Simulation-to-Real Embodied AI Platform

What's Hidden in a Randomly Weighted Neural Network?

ManipulaTHOR: A Framework for Visual Object Manipulation

Visual Room Rearrangement

Visual Semantic Role Labeling for Video Understanding

Towards General Purpose Vision Systems: An End-to-End Task-Agnostic Vision-Language Architecture

What Do Navigation Agents Learn About Their Environment?

Simple but Effective: CLIP Embeddings for Embodied AI

Visual Programming: Compositional Visual Reasoning Without Training

EXCALIBUR: Encouraging and Evaluating Embodied Exploration

Objaverse: A Universe of Annotated 3D Objects

Phone2Proc: Bringing Robust Robots Into Our Chaotic World

RobustNav: Towards Benchmarking Robustness in Embodied Navigation

Scene Graph Contrastive Learning for Embodied Navigation

I Can't Believe There's No Images! Learning Visual Tasks Using only Language Supervision

SatlasPretrain: A Large-Scale Dataset for Remote Sensing Image Understanding

Grounded Situation Recognition

A Cordial Sync: Going Beyond Marginal Policies for Multi-Agent Embodied Tasks

Webly Supervised Concept Expansion for General Purpose Vision Models

Object Manipulation via Visual Target Localization

GridToPix: Training Embodied Agents With Minimal Supervision

ReSpec: Relevance and Specificity Grounded Online Filtering for Learning on Video-Text Data Streams

Eval3D: Interpretable and Fine-grained Evaluation for 3D Generation

Promptable Behaviors: Personalizing Multi-Objective Rewards from Human Preferences

Learning About Objects by Learning to Interact with Them

Supermasks in Superposition

Bridging the Imitation Gap by Adaptive Insubordination

Container: Context Aggregation Networks

🏘️ ProcTHOR: Large-Scale Embodied AI Using Procedural Generation

Ask4Help: Learning to Leverage an Expert for Embodied Tasks

OBJECT 3DIT: Language-guided 3D-aware Image Editing

SugarCrepe: Fixing Hackable Benchmarks for Vision-Language Compositionality

Objaverse-XL: A Universe of 10M+ 3D Objects

Neural Priming for Sample-Efficient Adaptation