Mohit Bansal
58
Papers
168
Total Citations
Papers (58)
ReAgent-V: A Reward-Driven Multi-Agent Framework for Video Understanding
NeurIPS 2025
29
citations
Self-Consistency Preference Optimization
ICML 2025
23
citations
ECoFLaP: Efficient Coarse-to-Fine Layer-Wise Pruning for Vision-Language Models
ICLR 2024
20
citations
CAPTURE: Evaluating Spatial Reasoning in Vision Language Models via Occluded Object Counting
ICCV 2025
19
citations
CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion
ICLR 2025
15
citations
Rephrase, Augment, Reason: Visual Grounding of Questions for Vision-Language Models
ICLR 2024
13
citations
Unbounded: A Generative Infinite Game of Character Life Simulation
ICLR 2025
12
citations
VLN-Video: Utilizing Driving Videos for Outdoor Vision-and-Language Navigation
AAAI 2024
10
citations
Bootstrapping Language-Guided Navigation Learning with Self-Refining Data Flywheel
ICLR 2025
9
citations
VEDIT: Latent Prediction Architecture For Procedural Video Representation Learning
ICLR 2025
7
citations
Bifrost-1: Bridging Multimodal LLMs and Diffusion Models with Patch-level CLIP Latents
NeurIPS 2025
6
citations
LASeR: Learning to Adaptively Select Reward Models with Multi-Arm Bandits
NeurIPS 2025
5
citations
MAttNet: Modular Attention Network for Referring Expression Comprehension
CVPR 2018
0
citations
Multi-Target Embodied Question Answering
CVPR 2019
0
citations
Less Is More: ClipBERT for Video-and-Language Learning via Sparse Sampling
CVPR 2021
0
citations
VL-Adapter: Parameter-Efficient Transfer Learning for Vision-and-Language Tasks
CVPR 2022
0
citations
EnvEdit: Environment Editing for Vision-and-Language Navigation
CVPR 2022
0
citations
Hierarchical Video-Moment Retrieval and Step-Captioning
CVPR 2023
0
citations
Unifying Vision, Text, and Layout for Universal Document Processing
CVPR 2023
0
citations
Vision Transformers Are Parameter-Efficient Audio-Visual Learners
CVPR 2023
0
citations
VindLU: A Recipe for Effective Video-and-Language Pretraining
CVPR 2023
0
citations
Improving Vision-and-Language Navigation by Generating Future-View Image Semantics
CVPR 2023
0
citations
Unified Coarse-to-Fine Alignment for Video-Text Retrieval
ICCV 2023
0
citations
Scaling Data Generation in Vision-and-Language Navigation
ICCV 2023
0
citations
DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generation Models
ICCV 2023
0
citations
TVR: A Large-Scale Dataset for Video-Subtitle Moment Retrieval
ECCV 2020
0
citations
ECLIPSE: Efficient Long-Range Video Retrieval Using Sight and Sound
ECCV 2022
0
citations
StoryDALL-E: Adapting Pretrained Text-to-Image Transformers for Story Continuation
ECCV 2022
0
citations
CoDi-2: In-Context Interleaved and Interactive Any-to-Any Generation
CVPR 2024
0
citations
Motion-Grounded Video Reasoning: Understanding and Perceiving Motion at Pixel Level
CVPR 2025
0
citations
VEGGIE: Instructional Editing and Reasoning Video Concepts with Grounded Generation
ICCV 2025
0
citations
SAME: Learning Generic Language-Guided Visual Navigation with State-Adaptive Mixture of Experts
ICCV 2025
0
citations
Multimodal Representation Learning by Alternating Unimodal Adaptation
CVPR 2024
0
citations
VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos
CVPR 2025
0
citations
Rethinking Interactive Image Segmentation with Low Latency High Quality and Diverse Prompts
CVPR 2024
0
citations
MAGDi: Structured Distillation of Multi-Agent Interaction Graphs Improves Reasoning in Smaller Language Models
ICML 2024
0
citations
ReGAL: Refactoring Programs to Discover Generalizable Abstractions
ICML 2024
0
citations
Position: TrustLLM: Trustworthiness in Large Language Models
ICML 2024
0
citations
We Are Humor Beings: Understanding and Predicting Visual Humor
CVPR 2016
0
citations
A Joint Speaker-Listener-Reinforcer Model for Referring Expressions
CVPR 2017
0
citations
The Out-of-Distribution Problem in Explainability and Search Methods for Feature Importance Explanations
NeurIPS 2021
0
citations
Detecting Moments and Highlights in Videos via Natural Language Queries
NeurIPS 2021
0
citations
VidLanKD: Improving Language Understanding via Video-Distilled Knowledge Transfer
NeurIPS 2021
0
citations
Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning
NeurIPS 2022
0
citations
Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners
NeurIPS 2022
0
citations
TVLT: Textless Vision-Language Transformer
NeurIPS 2022
0
citations
LST: Ladder Side-Tuning for Parameter and Memory Efficient Transfer Learning
NeurIPS 2022
0
citations
VisFIS: Visual Feature Importance Supervision with Right-for-the-Right-Reason Objectives
NeurIPS 2022
0
citations
WinoGAViL: Gamified Association Benchmark to Challenge Vision-and-Language Models
NeurIPS 2022
0
citations
Visual Programming for Step-by-Step Text-to-Image Generation and Evaluation
NeurIPS 2023
0
citations
TIES-Merging: Resolving Interference When Merging Models
NeurIPS 2023
0
citations
Any-to-Any Generation via Composable Diffusion
NeurIPS 2023
0
citations
Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models
NeurIPS 2023
0
citations
Paxion: Patching Action Knowledge in Video-Language Foundation Models
NeurIPS 2023
0
citations
PanoGen: Text-Conditioned Panoramic Environment Generation for Vision-and-Language Navigation
NeurIPS 2023
0
citations
Adaptive Contextual Perception: How To Generalize To New Backgrounds and Ambiguous Objects
NeurIPS 2023
0
citations
Can Language Models Teach? Teacher Explanations Improve Student Performance via Personalization
NeurIPS 2023
0
citations
Self-Chained Image-Language Model for Video Localization and Question Answering
NeurIPS 2023
0
citations