Mohit Bansal

22
Papers
167
Total Citations

Papers (22)

ReAgent-V: A Reward-Driven Multi-Agent Framework for Video Understanding

NeurIPS 2025 (arXiv)
28
citations

Self-Consistency Preference Optimization

ICML 2025
23
citations

ECoFLaP: Efficient Coarse-to-Fine Layer-Wise Pruning for Vision-Language Models

ICLR 2024
20
citations

CAPTURE: Evaluating Spatial Reasoning in Vision Language Models via Occluded Object Counting

ICCV 2025
19
citations

CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion

ICLR 2025
15
citations

Rephrase, Augment, Reason: Visual Grounding of Questions for Vision-Language Models

ICLR 2024
13
citations

Unbounded: A Generative Infinite Game of Character Life Simulation

ICLR 2025
12
citations

VLN-Video: Utilizing Driving Videos for Outdoor Vision-and-Language Navigation

AAAI 2024 (arXiv)
10
citations

Bootstrapping Language-Guided Navigation Learning with Self-Refining Data Flywheel

ICLR 2025
9
citations

VEDIT: Latent Prediction Architecture For Procedural Video Representation Learning

ICLR 2025
7
citations

Bifrost-1: Bridging Multimodal LLMs and Diffusion Models with Patch-level CLIP Latents

NeurIPS 2025 (arXiv)
6
citations

LASeR: Learning to Adaptively Select Reward Models with Multi-Arm Bandits

NeurIPS 2025
5
citations

Position: TrustLLM: Trustworthiness in Large Language Models

ICML 2024
0
citations

Motion-Grounded Video Reasoning: Understanding and Perceiving Motion at Pixel Level

CVPR 2025
0
citations

VEGGIE: Instructional Editing and Reasoning Video Concepts with Grounded Generation

ICCV 2025
0
citations

SAME: Learning Generic Language-Guided Visual Navigation with State-Adaptive Mixture of Experts

ICCV 2025
0
citations

Multimodal Representation Learning by Alternating Unimodal Adaptation

CVPR 2024
0
citations

CoDi-2: In-Context Interleaved and Interactive Any-to-Any Generation

CVPR 2024
0
citations

Rethinking Interactive Image Segmentation with Low Latency High Quality and Diverse Prompts

CVPR 2024
0
citations

MAGDi: Structured Distillation of Multi-Agent Interaction Graphs Improves Reasoning in Smaller Language Models

ICML 2024
0
citations

ReGAL: Refactoring Programs to Discover Generalizable Abstractions

ICML 2024
0
citations

VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos

CVPR 2025
0
citations