Visual Question Answering
Answering questions about images
Top Papers
MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts
Pan Lu, Hritik Bansal, Tony Xia et al.
V?: Guided Visual Search as a Core Mechanism in Multimodal LLMs
Penghao Wu, Saining Xie
NuScenes-QA: A Multi-Modal Visual Question Answering Benchmark for Autonomous Driving
Tianwen Qian, Jingjing Chen, Linhai Zhuo et al.
BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions
Wenbo Hu, Yifan Xu, Yi Li et al.
Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction
Yiheng Xu, Zekun Wang, Junli Wang et al.
ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts
Mu Cai, Haotian Liu, Siva Mustikovela et al.
SCLIP: Rethinking Self-Attention for Dense Vision-Language Inference
Feng Wang, Jieru Mei, Alan Yuille
The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World
Weiyun Wang, Min Shi, Qingyun Li et al.
Can I Trust Your Answer? Visually Grounded Video Question Answering
Junbin Xiao, Angela Yao, Yicong Li et al.
How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs
Haoqin Tu, Chenhang Cui, Zijun Wang et al.
TableBench: A Comprehensive and Complex Benchmark for Table Question Answering
Xianjie Wu, Jian Yang, Linzheng Chai et al.
Depicting Beyond Scores: Advancing Image Quality Assessment through Multi-modal Language Models
Zhiyuan You, Zheyuan Li, Jinjin Gu et al.
Towards Open-ended Visual Quality Comparison
Haoning Wu, Hanwei Zhu, Zicheng Zhang et al.
The All-Seeing Project V2: Towards General Relation Comprehension of the Open World
Weiyun Wang Weiyun, yiming ren, Haowen Luo et al.
DynaMath: A Dynamic Visual Benchmark for Evaluating Mathematical Reasoning Robustness of Vision Language Models
Chengke Zou, Xingang Guo, Rui Yang et al.
MLLMs Know Where to Look: Training-free Perception of Small Visual Details with Multimodal LLMs
jiarui zhang, Mahyar Khayatkhoei, Prateek Chhikara et al.
CricaVPR: Cross-image Correlation-aware Representation Learning for Visual Place Recognition
Feng Lu, Xiangyuan Lan, Lijun Zhang et al.
MV-Adapter: Multi-View Consistent Image Generation Made Easy
Zehuan Huang, Yuan-Chen Guo, Haoran Wang et al.
VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents
Xiao Liu, Tianjie Zhang, Yu Gu et al.
Unifying 3D Vision-Language Understanding via Promptable Queries
ziyu zhu, Zhuofan Zhang, Xiaojian Ma et al.
Image and Video Tokenization with Binary Spherical Quantization
Yue Zhao, Yuanjun Xiong, Philipp Krähenbühl
DocFormerv2: Local Features for Document Understanding
Srikar Appalaraju, Peng Tang, Qi Dong et al.
Context-I2W: Mapping Images to Context-Dependent Words for Accurate Zero-Shot Composed Image Retrieval
Yuanmin Tang, Jing Yu, Keke Gai et al.
Q-Insight: Understanding Image Quality via Visual Reinforcement Learning
Weiqi Li, Xuanyu Zhang, Shijie Zhao et al.
VLCounter: Text-Aware Visual Representation for Zero-Shot Object Counting
Seunggu Kang, WonJun Moon, Euiyeon Kim et al.
Wonderland: Navigating 3D Scenes from a Single Image
Hanwen Liang, Junli Cao, Vidit Goel et al.
Visual In-Context Prompting
Feng Li, Qing Jiang, Hao Zhang et al.
Describing Differences in Image Sets with Natural Language
Lisa Dunlap, Yuhui Zhang, Xiaohan Wang et al.
Discovering and Mitigating Visual Biases through Keyword Explanation
Younghyun Kim, Sangwoo Mo, Minkyu Kim et al.
Contrastive Region Guidance: Improving Grounding in Vision-Language Models without Training
David Wan, Jaemin Cho, Elias Stengel-Eskin et al.
EarthVQA: Towards Queryable Earth via Relational Reasoning-Based Remote Sensing Visual Question Answering
Junjue Wang, Zhuo Zheng, Zihang Chen et al.
Grounded Question-Answering in Long Egocentric Videos
Shangzhe Di, Weidi Xie
CLOVA: A Closed-LOop Visual Assistant with Tool Usage and Update
Zhi Gao, Yuntao Du., Xintong Zhang et al.
Knowledge-Enhanced Dual-stream Zero-shot Composed Image Retrieval
Yucheng Suo, Fan Ma, Linchao Zhu et al.
TEOChat: A Large Vision-Language Assistant for Temporal Earth Observation Data
Jeremy Irvin, Emily Liu, Joyce Chen et al.
ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding
Xingyu Fu, Minqian Liu, Zhengyuan Yang et al.
Visual Agents as Fast and Slow Thinkers
Guangyan Sun, Mingyu Jin, Zhenting Wang et al.
MET3R: Measuring Multi-View Consistency in Generated Images
Mohammad Asim, Christopher Wewer, Thomas Wimmer et al.
Articulate-Anything: Automatic Modeling of Articulated Objects via a Vision-Language Foundation Model
Long Le, Jason Xie, William Liang et al.
One Prompt Word is Enough to Boost Adversarial Robustness for Pre-trained Vision-Language Models
Lin Li, Haoyan Guan, Jianing Qiu et al.
A-Bench: Are LMMs Masters at Evaluating AI-generated Images?
Zicheng Zhang, Haoning Wu, Chunyi Li et al.
Revisiting Single Image Reflection Removal In the Wild
Yurui Zhu, Bo Li, Xueyang Fu et al.
How to Configure Good In-Context Sequence for Visual Question Answering
Li Li, Jiawei Peng, huiyi chen et al.
Question Aware Vision Transformer for Multimodal Reasoning
Roy Ganz, Yair Kittenplon, Aviad Aberdam et al.
TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models
Ziyao Shangguan, Chuhan Li, Yuxuan Ding et al.
Transformer-Based No-Reference Image Quality Assessment via Supervised Contrastive Learning
Jinsong Shi, Pan Gao, Jie Qin
Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing
Xiangyu Zhao, Peiyuan Zhang, Kexian Tang et al.
Exploring Sparse Visual Prompt for Domain Adaptive Dense Prediction
Senqiao Yang, Jiarui Wu, Jiaming Liu et al.
Cross-view image geo-localization with Panorama-BEV Co-Retrieval Network
ye junyan, Zhutao Lv, Li Weijia et al.
Harnessing Large Language Models for Knowledge Graph Question Answering via Adaptive Multi-Aspect Retrieval-Augmentation
Derong Xu, Xinhang Li, Ziheng Zhang et al.
Open-World Human-Object Interaction Detection via Multi-modal Prompts
Jie Yang, Bingliang Li, Ailing Zeng et al.
Visual Agentic AI for Spatial Reasoning with a Dynamic API
Damiano Marsili, Rohun Agrawal, Yisong Yue et al.
VisualQuality-R1: Reasoning-Induced Image Quality Assessment via Reinforcement Learning to Rank
Tianhe Wu, Jian Zou, Jie Liang et al.
Visual Haystacks: A Vision-Centric Needle-In-A-Haystack Benchmark
Tsung-Han Wu, Giscard Biamby, Jerome Quenum et al.
Spot the Fake: Large Multimodal Model-Based Synthetic Image Detection with Artifact Explanation
Siwei Wen, junyan ye, Peilin Feng et al.
MRAG-Bench: Vision-Centric Evaluation for Retrieval-Augmented Multimodal Models
Wenbo Hu, Jia-Chen Gu, Zi-Yi Dou et al.
HaloQuest: A Visual Hallucination Dataset for Advancing Multimodal Reasoning
Zhecan Wang, Garrett Bingham, Adams Wei Yu et al.
EgoTextVQA: Towards Egocentric Scene-Text Aware Video Question Answering
Sheng Zhou, Junbin Xiao, Qingyun Li et al.
VideoWorld: Exploring Knowledge Learning from Unlabeled Videos
Zhongwei Ren, Yunchao Wei, Xun Guo et al.
Blind Image Quality Assessment Based on Geometric Order Learning
Nyeong-Ho Shin, Seon-Ho Lee, Chang-Su Kim
Chain-of-Action: Faithful and Multimodal Question Answering through Large Language Models
Zhenyu Pan, Haozheng Luo, Manling Li et al.
Bridging the Gap between 2D and 3D Visual Question Answering: A Fusion Approach for 3D VQA
Wentao Mo, Yang Liu
SlideChat: A Large Vision-Language Assistant for Whole-Slide Pathology Image Understanding
Ying Chen, Guoan Wang, Yuanfeng Ji et al.
Doubly Abductive Counterfactual Inference for Text-based Image Editing
Xue Song, Jiequan Cui, Hanwang Zhang et al.
MagicQuill: An Intelligent Interactive Image Editing System
Zichen Liu, Yue Yu, Hao Ouyang et al.
Your ViT is Secretly an Image Segmentation Model
Tommie Kerssies, Niccolò Cavagnero, Alexander Hermans et al.
XLRS-Bench: Could Your Multimodal LLMs Understand Extremely Large Ultra-High-Resolution Remote Sensing Imagery?
Fengxiang Wang, hongzhen wang, Zonghao Guo et al.
Answer, Assemble, Ace: Understanding How LMs Answer Multiple Choice Questions
Sarah Wiegreffe, Oyvind Tafjord, Yonatan Belinkov et al.
An Intelligent Agentic System for Complex Image Restoration Problems
Kaiwen Zhu, Jinjin Gu, Zhiyuan You et al.
Object-Aware Adaptive-Positivity Learning for Audio-Visual Question Answering
Zhangbin Li, Jinxing Zhou, Dan Guo et al.
WSI-VQA: Interpreting Whole Slide Images by Generative Visual Question Answering
Pingyi Chen, Chenglu Zhu, Sunyi Zheng et al.
MMQA: Evaluating LLMs with Multi-Table Multi-Hop Complex Questions
Jian Wu, Linyi Yang, Dongyuan Li et al.
KRIS-Bench: Benchmarking Next-Level Intelligent Image Editing Models
Yongliang Wu, Zonghui Li, Xinting Hu et al.
Ref-AVS: Refer and Segment Objects in Audio-Visual Scenes
Yaoting Wang, Peiwen Sun, Dongzhan Zhou et al.
HYDRA: A Hyper Agent for Dynamic Compositional Visual Reasoning
Fucai Ke, Zhixi Cai, Simindokht Jahangard et al.
MindTuner: Cross-Subject Visual Decoding with Visual Fingerprint and Semantic Correction
Zixuan Gong, Qi Zhang, Guangyin Bao et al.
GUI-World: A Video Benchmark and Dataset for Multimodal GUI-oriented Understanding
Dongping Chen, Yue Huang, Siyuan Wu et al.
ViLA: Efficient Video-Language Alignment for Video Question Answering
Xijun Wang, Junbang Liang, Chun-Kai Wang et al.
Consistency and Uncertainty: Identifying Unreliable Responses From Black-Box Vision-Language Models for Selective Visual Question Answering
Zaid Khan, Yun Fu
Question Calibration and Multi-Hop Modeling for Temporal Question Answering
Chao Xue, Di Liang, Pengfei Wang et al.
Automated Generation of Challenging Multiple-Choice Questions for Vision Language Model Evaluation
Yuhui Zhang, Yuchang Su, Yiming Liu et al.
VisualCloze: A Universal Image Generation Framework via Visual In-Context Learning
Zhong-Yu Li, Ruoyi Du, Juncheng Yan et al.
Do Vision & Language Decoders use Images and Text equally? How Self-consistent are their Explanations?
Letitia Parcalabescu, Anette Frank
Zero-shot Referring Expression Comprehension via Structural Similarity Between Images and Captions
Zeyu Han, Fangrui Zhu, Qianru Lao et al.
Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge
Boyu Gou, Zanming Huang, Yuting Ning et al.
ColorPeel: Color Prompt Learning with Diffusion Models via Color and Shape Disentanglement
Muhammad Atif Butt, Kai Wang, Javier Vazquez-Corral et al.
Improving Video Segmentation via Dynamic Anchor Queries
Yikang Zhou, Tao Zhang, Xiangtai Li et al.
LAKE-RED: Camouflaged Images Generation by Latent Background Knowledge Retrieval-Augmented Diffusion
Pancheng Zhao, Peng Xu, Pengda Qin et al.
VTQA: Visual Text Question Answering via Entity Alignment and Cross-Media Reasoning
Kang Chen, Xiangqian Wu
AMEGO: Active Memory from long EGOcentric videos
Gabriele Goletto, Tushar Nagarajan, Giuseppe Averta et al.
KGARevion: An AI Agent for Knowledge-Intensive Biomedical QA
Xiaorui Su, Yibo Wang, Shanghua Gao et al.
StoryImager: A Unified and Efficient Framework for Coherent Story Visualization and Completion
Ming Tao, BINGKUN BAO, Hao Tang et al.
WebVLN: Vision-and-Language Navigation on Websites
Qi Chen, Dileepa Pitawela, Chongyang Zhao et al.
MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large Vision-Language Models
Ziyu Liu, Yuhang Zang, Xiaoyi Dong et al.
KiVA: Kid-inspired Visual Analogies for Testing Large Multimodal Models
Eunice Yiu, Maan Qraitem, Anisa Majhi et al.
Decomposing Semantic Shifts for Composed Image Retrieval
Xingyu Yang, Daqing Liu, Heng Zhang et al.
Bongard-OpenWorld: Few-Shot Reasoning for Free-form Visual Concepts in the Real World
Rujie Wu, Xiaojian Ma, Zhenliang Zhang et al.
One-stage Prompt-based Continual Learning
Youngeun Kim, YUHANG LI, Priyadarshini Panda
Evidential Active Recognition: Intelligent and Prudent Open-World Embodied Perception
Lei Fan, Mingfu Liang, Yunxuan Li et al.
PromptIQA: Boosting the Performance and Generalization for No-Reference Image Quality Assessment via Prompts
Zewen Chen, Haina Qin, Juan Wang et al.