Mohit Bansal

58 Papers · 168 Total Citations

Papers (58)

ReAgent-V: A Reward-Driven Multi-Agent Framework for Video Understanding

NeurIPS 2025 · 29 citations

Self-Consistency Preference Optimization

ICML 2025 · 23 citations

ECoFLaP: Efficient Coarse-to-Fine Layer-Wise Pruning for Vision-Language Models

ICLR 2024 · 20 citations

CAPTURE: Evaluating Spatial Reasoning in Vision Language Models via Occluded Object Counting

ICCV 2025 · 19 citations

CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion

ICLR 2025 · 15 citations

Rephrase, Augment, Reason: Visual Grounding of Questions for Vision-Language Models

ICLR 2024 · 13 citations

Unbounded: A Generative Infinite Game of Character Life Simulation

ICLR 2025 · 12 citations

VLN-Video: Utilizing Driving Videos for Outdoor Vision-and-Language Navigation

AAAI 2024 (arXiv) · 10 citations

Bootstrapping Language-Guided Navigation Learning with Self-Refining Data Flywheel

ICLR 2025 · 9 citations

VEDIT: Latent Prediction Architecture For Procedural Video Representation Learning

ICLR 2025 · 7 citations

Bifrost-1: Bridging Multimodal LLMs and Diffusion Models with Patch-level CLIP Latents

NeurIPS 2025 · 6 citations

LASeR: Learning to Adaptively Select Reward Models with Multi-Arm Bandits

NeurIPS 2025 · 5 citations

MAttNet: Modular Attention Network for Referring Expression Comprehension

CVPR 2018 (arXiv) · 0 citations

Multi-Target Embodied Question Answering

CVPR 2019 · 0 citations

Less Is More: ClipBERT for Video-and-Language Learning via Sparse Sampling

CVPR 2021 (arXiv) · 0 citations

VL-Adapter: Parameter-Efficient Transfer Learning for Vision-and-Language Tasks

CVPR 2022 · 0 citations

EnvEdit: Environment Editing for Vision-and-Language Navigation

CVPR 2022 (arXiv) · 0 citations

Hierarchical Video-Moment Retrieval and Step-Captioning

CVPR 2023 (arXiv) · 0 citations

Unifying Vision, Text, and Layout for Universal Document Processing

CVPR 2023 (arXiv) · 0 citations

Vision Transformers Are Parameter-Efficient Audio-Visual Learners

CVPR 2023 (arXiv) · 0 citations

VindLU: A Recipe for Effective Video-and-Language Pretraining

CVPR 2023 (arXiv) · 0 citations

Improving Vision-and-Language Navigation by Generating Future-View Image Semantics

CVPR 2023 (arXiv) · 0 citations

Unified Coarse-to-Fine Alignment for Video-Text Retrieval

ICCV 2023 (arXiv) · 0 citations

Scaling Data Generation in Vision-and-Language Navigation

ICCV 2023 (arXiv) · 0 citations

DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generation Models

ICCV 2023 · 0 citations

TVR: A Large-Scale Dataset for Video-Subtitle Moment Retrieval

ECCV 2020 · 0 citations

ECLIPSE: Efficient Long-Range Video Retrieval Using Sight and Sound

ECCV 2022 · 0 citations

StoryDALL-E: Adapting Pretrained Text-to-Image Transformers for Story Continuation

ECCV 2022 · 0 citations

CoDi-2: In-Context Interleaved and Interactive Any-to-Any Generation

CVPR 2024 · 0 citations

Motion-Grounded Video Reasoning: Understanding and Perceiving Motion at Pixel Level

CVPR 2025 · 0 citations

VEGGIE: Instructional Editing and Reasoning Video Concepts with Grounded Generation

ICCV 2025 · 0 citations

SAME: Learning Generic Language-Guided Visual Navigation with State-Adaptive Mixture of Experts

ICCV 2025 · 0 citations

Multimodal Representation Learning by Alternating Unimodal Adaptation

CVPR 2024 · 0 citations

VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos

CVPR 2025 · 0 citations

Rethinking Interactive Image Segmentation with Low Latency, High Quality, and Diverse Prompts

CVPR 2024 · 0 citations

MAGDi: Structured Distillation of Multi-Agent Interaction Graphs Improves Reasoning in Smaller Language Models

ICML 2024 · 0 citations

ReGAL: Refactoring Programs to Discover Generalizable Abstractions

ICML 2024 · 0 citations

Position: TrustLLM: Trustworthiness in Large Language Models

ICML 2024 · 0 citations

We Are Humor Beings: Understanding and Predicting Visual Humor

CVPR 2016 · 0 citations

A Joint Speaker-Listener-Reinforcer Model for Referring Expressions

CVPR 2017 (arXiv) · 0 citations

The Out-of-Distribution Problem in Explainability and Search Methods for Feature Importance Explanations

NeurIPS 2021 · 0 citations

Detecting Moments and Highlights in Videos via Natural Language Queries

NeurIPS 2021 · 0 citations

VidLanKD: Improving Language Understanding via Video-Distilled Knowledge Transfer

NeurIPS 2021 · 0 citations

Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning

NeurIPS 2022 · 0 citations

Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners

NeurIPS 2022 · 0 citations

TVLT: Textless Vision-Language Transformer

NeurIPS 2022 · 0 citations

LST: Ladder Side-Tuning for Parameter and Memory Efficient Transfer Learning

NeurIPS 2022 · 0 citations

VisFIS: Visual Feature Importance Supervision with Right-for-the-Right-Reason Objectives

NeurIPS 2022 · 0 citations

WinoGAViL: Gamified Association Benchmark to Challenge Vision-and-Language Models

NeurIPS 2022 · 0 citations

Visual Programming for Step-by-Step Text-to-Image Generation and Evaluation

NeurIPS 2023 · 0 citations

TIES-Merging: Resolving Interference When Merging Models

NeurIPS 2023 · 0 citations

Any-to-Any Generation via Composable Diffusion

NeurIPS 2023 · 0 citations

Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models

NeurIPS 2023 · 0 citations

Paxion: Patching Action Knowledge in Video-Language Foundation Models

NeurIPS 2023 · 0 citations

PanoGen: Text-Conditioned Panoramic Environment Generation for Vision-and-Language Navigation

NeurIPS 2023 · 0 citations

Adaptive Contextual Perception: How To Generalize To New Backgrounds and Ambiguous Objects

NeurIPS 2023 · 0 citations

Can Language Models Teach? Teacher Explanations Improve Student Performance via Personalization

NeurIPS 2023 · 0 citations

Self-Chained Image-Language Model for Video Localization and Question Answering

NeurIPS 2023 · 0 citations