Ranjay Krishna
43
Papers
358
Total Citations
Papers (43)
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models
CVPR 2025
96
citations
AHA: A Vision-Language-Model for Detecting and Reasoning Over Failures in Robotic Manipulation
ICLR 2025
80
citations
SPOC: Imitating Shortest Paths in Simulation Enables Effective Navigation and Manipulation in the Real World
CVPR 2024
52
citations
One Diffusion to Generate Them All
CVPR 2025
34
citations
Efficient Inference of Vision Instruction-Following Models with Elastic Cache
ECCV 2024
25
citations
Videoshop: Localized Semantic Video Editing with Noise-Extrapolated Diffusion Inversion
ECCV 2024
23
citations
Modeling Collaborator: Enabling Subjective Vision Classification With Minimal Human Effort via LLM Tool-Use
CVPR 2024
20
citations
Iterated Learning Improves Compositionality in Large Vision-Language Models
CVPR 2024
16
citations
Coarse Correspondences Boost Spatial-Temporal Reasoning in Multimodal Language Model
CVPR 2025 (arXiv)
9
citations
Convergent Functions, Divergent Forms
NeurIPS 2025 (arXiv)
3
citations
Visual Program Distillation: Distilling Tools and Programmatic Reasoning into Vision-Language Models
CVPR 2024
0
citations
Quilt-LLaVA: Visual Instruction Tuning by Extracting Localized Narratives from Open-Source Histopathology Videos
CVPR 2024
0
citations
Offline Training of Language Model Agents with Functions as Learnable Weights
ICML 2024
0
citations
Image Retrieval Using Scene Graphs
CVPR 2015
0
citations
A Hierarchical Approach for Generating Descriptive Image Paragraphs
CVPR 2017 (arXiv)
0
citations
Referring Relationships
CVPR 2018 (arXiv)
0
citations
Information Maximizing Visual Question Generation
CVPR 2019
0
citations
Action Genome: Actions As Compositions of Spatio-Temporal Scene Graphs
CVPR 2020
0
citations
AGQA: A Benchmark for Compositional Spatio-Temporal Reasoning
CVPR 2021
0
citations
Measuring Compositional Consistency for Video Question Answering
CVPR 2022 (arXiv)
0
citations
CREPE: Can Vision-Language Foundation Models Reason Compositionally?
CVPR 2023 (arXiv)
0
citations
Dense-Captioning Events in Videos
ICCV 2017 (arXiv)
0
citations
Scene Graph Prediction With Limited Labels
ICCV 2019
0
citations
Agile Modeling: From Concept to Classifier in Minutes
ICCV 2023 (arXiv)
0
citations
TIFA: Accurate and Interpretable Text-to-Image Faithfulness Evaluation with Question Answering
ICCV 2023 (arXiv)
0
citations
One Trajectory, One Token: Grounded Video Tokenization via Panoptic Sub-object Trajectory
ICCV 2025
0
citations
RealEdit: Reddit Edits As a Large-scale Empirical Dataset for Image Transformations
CVPR 2025
0
citations
Semantic and Expressive Variations in Image Captions Across Languages
CVPR 2025
0
citations
NVILA: Efficient Frontier Visual Language Models
CVPR 2025
0
citations
Perception Tokens Enhance Visual Reasoning in Multimodal Language Models
CVPR 2025
0
citations
Synthetic Visual Genome
CVPR 2025
0
citations
Eval3D: Interpretable and Fine-grained Evaluation for 3D Generation
CVPR 2025
0
citations
PathFinder: A Multi-Modal Multi-Agent System for Medical Diagnostic Decision-Making Applied to Histopathology
ICCV 2025
0
citations
Contrastive Flow Matching
ICCV 2025
0
citations
Holodeck: Language Guided Generation of 3D Embodied AI Environments
CVPR 2024
0
citations
HYPE: A Benchmark for Human eYe Perceptual Evaluation of Generative Models
NeurIPS 2019
0
citations
ELIGN: Expectation Alignment as a Multi-Agent Intrinsic Reward
NeurIPS 2022
0
citations
OBJECT 3DIT: Language-guided 3D-aware Image Editing
NeurIPS 2023
0
citations
DataComp: In search of the next generation of multimodal datasets
NeurIPS 2023
0
citations
SugarCrepe: Fixing Hackable Benchmarks for Vision-Language Compositionality
NeurIPS 2023
0
citations
Quilt-1M: One Million Image-Text Pairs for Histopathology
NeurIPS 2023
0
citations
Cola: A Benchmark for Compositional Text-to-image Retrieval
NeurIPS 2023
0
citations
Large Language Model as Attributed Training Data Generator: A Tale of Diversity and Bias
NeurIPS 2023
0
citations