Mohit Bansal

Papers: 22

Total Citations: 167

Papers (22)

ReAgent-V: A Reward-Driven Multi-Agent Framework for Video Understanding

NeurIPS 2025, arXiv

Self-Consistency Preference Optimization

ECoFLaP: Efficient Coarse-to-Fine Layer-Wise Pruning for Vision-Language Models

CAPTURE: Evaluating Spatial Reasoning in Vision Language Models via Occluded Object Counting

CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion

Rephrase, Augment, Reason: Visual Grounding of Questions for Vision-Language Models

Unbounded: A Generative Infinite Game of Character Life Simulation

VLN-Video: Utilizing Driving Videos for Outdoor Vision-and-Language Navigation

Bootstrapping Language-Guided Navigation Learning with Self-Refining Data Flywheel

VEDIT: Latent Prediction Architecture For Procedural Video Representation Learning

Bifrost-1: Bridging Multimodal LLMs and Diffusion Models with Patch-level CLIP Latents

NeurIPS 2025, arXiv

LASeR: Learning to Adaptively Select Reward Models with Multi-Armed Bandits

Position: TrustLLM: Trustworthiness in Large Language Models

Motion-Grounded Video Reasoning: Understanding and Perceiving Motion at Pixel Level

VEGGIE: Instructional Editing and Reasoning Video Concepts with Grounded Generation

SAME: Learning Generic Language-Guided Visual Navigation with State-Adaptive Mixture of Experts

Multimodal Representation Learning by Alternating Unimodal Adaptation

CoDi-2: In-Context Interleaved and Interactive Any-to-Any Generation

Rethinking Interactive Image Segmentation with Low Latency, High Quality, and Diverse Prompts

MAGDi: Structured Distillation of Multi-Agent Interaction Graphs Improves Reasoning in Smaller Language Models

ReGAL: Refactoring Programs to Discover Generalizable Abstractions

VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos