🧬 Multimodal

Vision-Language Models

Models that understand both images and text

100 papers · 19,595 total citations
[Activity chart: Mar '24 – Feb '26, 1,645 papers]
Also includes: vision-language models, vlm, vlms, vision language, multimodal llm, multimodal large language models

Top Papers

#1

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

Zhe Chen, Jiannan Wu, Wenhai Wang et al.

CVPR 2024
2,210
citations
#2

Scaling Autoregressive Models for Content-Rich Text-to-Image Generation

Xin Li, Jing Yu Koh, Alexander Ku et al.

ICLR 2024
1,366
citations
#3

MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts

Pan Lu, Hritik Bansal, Tony Xia et al.

ICLR 2024
1,171
citations
#4

VILA: On Pre-training for Visual Language Models

Ji Lin, Hongxu Yin, Wei Ping et al.

CVPR 2024
685
citations
#5

Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs

Shengbang Tong, Zhuang Liu, Yuexiang Zhai et al.

CVPR 2024
570
citations
#6

SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities

Boyuan Chen, Zhuo Xu, Sean Kirmani et al.

CVPR 2024
550
citations
#7

Show-o: One Single Transformer to Unify Multimodal Understanding and Generation

Jinheng Xie, Weijia Mao, Zechen Bai et al.

ICLR 2025 · arXiv:2408.12528
multimodal understanding, multimodal generation, unified transformer, autoregressive modeling (+4 more)
455
citations
#8

Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding

Sicong Leng, Hang Zhang, Guanzheng Chen et al.

CVPR 2024
449
citations
#9

Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models

Zhang Li, Biao Yang, Qiang Liu et al.

CVPR 2024
384
citations
#10

HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models

Tianrui Guan, Fuxiao Liu, Xiyang Wu et al.

CVPR 2024
354
citations
#11

Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding

Peng Jin, Ryuichi Takanobu, Cai Zhang et al.

CVPR 2024
354
citations
#12

An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models

Liang Chen, Haozhe Zhao, Tianyu Liu et al.

ECCV 2024 · arXiv:2403.06764
attention mechanism, vision-language models, inference acceleration, computational efficiency (+4 more)
343
citations
#13

Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces

Jihan Yang, Shusheng Yang, Anjali W. Gupta et al.

CVPR 2025
342
citations
#14

LLaVA-CoT: Let Vision Language Models Reason Step-by-Step

Guowei Xu, Peng Jin, Ziang Wu et al.

ICCV 2025
338
citations
#15

V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs

Penghao Wu, Saining Xie

CVPR 2024
327
citations
#16

Vision-Language Foundation Models as Effective Robot Imitators

Xinghang Li, Minghuan Liu, Hanbo Zhang et al.

ICLR 2024
310
citations
#17

BLINK: Multimodal Large Language Models Can See but Not Perceive

Xingyu Fu, Yushi Hu, Bangzheng Li et al.

ECCV 2024
307
citations
#18

Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model

Chunting Zhou, Lili Yu, Arun Babu et al.

ICLR 2025
294
citations
#19

FigStep: Jailbreaking Large Vision-Language Models via Typographic Visual Prompts

Yichen Gong, Delong Ran, Jinyuan Liu et al.

AAAI 2025
283
citations
#20

Detecting and Preventing Hallucinations in Large Vision Language Models

Anisha Gunjal, Jihan Yin, Erhan Bas

AAAI 2024 · arXiv:2308.06394
vision language models, visual question answering, hallucination detection, multimodal datasets (+4 more)
256
citations
#21

On Scaling Up a Multilingual Vision and Language Model

Xi Chen, Josip Djolonga, Piotr Padlewski et al.

CVPR 2024
254
citations
#22

mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models

Jiabo Ye, Haiyang Xu, Haowei Liu et al.

ICLR 2025 · arXiv:2408.04840
multi-modal large language models, long image-sequence understanding, vision-language integration, hyper attention blocks (+3 more)
237
citations
#23

SkySense: A Multi-Modal Remote Sensing Foundation Model Towards Universal Interpretation for Earth Observation Imagery

Xin Guo, Jiangwei Lao, Bo Dang et al.

CVPR 2024
236
citations
#24

Sequential Modeling Enables Scalable Learning for Large Vision Models

Yutong Bai, Xinyang Geng, Karttikeya Mangalam et al.

CVPR 2024
230
citations
#25

CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models

Qingqing Zhao, Yao Lu, Moo Jin Kim et al.

CVPR 2025
203
citations
#26

OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation

Kepan Nan, Rui Xie, Penghao Zhou et al.

ICLR 2025 · arXiv:2407.02371
text-to-video generation, video diffusion transformer, high-quality video datasets, multi-modality models (+3 more)
200
citations
#27

BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions

Wenbo Hu, Yifan Xu, Yi Li et al.

AAAI 2024 · arXiv:2308.09936
vision language models, visual question answering, multimodal large language models, text-rich image understanding (+4 more)
190
citations
#28

SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference

Yuan Zhang, Chun-Kai Fan, Junpeng Ma et al.

ICML 2025
190
citations
#29

Revisiting Feature Prediction for Learning Visual Representations from Video

Quentin Garrido, Yann LeCun, Michael Rabbat et al.

ICLR 2025
178
citations
#30

StableVITON: Learning Semantic Correspondence with Latent Diffusion Model for Virtual Try-On

Jeongho Kim, Gyojung Gu, Minho Park et al.

CVPR 2024
176
citations
#31

LLaVA-UHD: an LMM Perceiving any Aspect Ratio and High-Resolution Images

Zonghao Guo, Ruyi Xu, Yuan Yao et al.

ECCV 2024
170
citations
#32

Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction

Yiheng Xu, Zekun Wang, Junli Wang et al.

ICML 2025
165
citations
#33

Uni3D: Exploring Unified 3D Representation at Scale

Junsheng Zhou, Jinsheng Wang, Baorui Ma et al.

ICLR 2024
165
citations
#34

ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts

Mu Cai, Haotian Liu, Siva Mustikovela et al.

CVPR 2024
153
citations
#35

Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding

Yan Shu, Zheng Liu, Peitian Zhang et al.

CVPR 2025 · arXiv:2409.14485
long video understanding, multi-modal large language models, visual token compression, key-value sparsification (+3 more)
144
citations
#36

Multimodal Web Navigation with Instruction-Finetuned Foundation Models

Hiroki Furuta, Kuang-Huei Lee, Ofir Nachum et al.

ICLR 2024
141
citations
#37

SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models

Yuzhou Huang, Liangbin Xie, Xintao Wang et al.

CVPR 2024
139
citations
#38

Vision-Language Models are Zero-Shot Reward Models for Reinforcement Learning

Juan Rocamonde, Victoriano Montesinos, Elvis Nava et al.

ICLR 2024
133
citations
#39

VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction

Chaoyou Fu, Haojia Lin, Xiong Wang et al.

NeurIPS 2025 · arXiv:2501.01957
multimodal large language models, vision and speech interaction, speech-to-speech dialogue, visual and speech modalities (+3 more)
130
citations
#40

GSVA: Generalized Segmentation via Multimodal Large Language Models

Zhuofan Xia, Dongchen Han, Yizeng Han et al.

CVPR 2024
127
citations
#41

ShowUI: One Vision-Language-Action Model for GUI Visual Agent

Kevin Qinghong Lin, Linjie Li, Difei Gao et al.

CVPR 2025
123
citations
#42

AnyText: Multilingual Visual Text Generation and Editing

Yuxiang Tuo, Wangmeng Xiang, Jun-Yan He et al.

ICLR 2024
121
citations
#43

Paying More Attention to Images: A Training-Free Method for Alleviating Hallucination in LVLMs

Shi Liu, Kecheng Zheng, Wei Chen

ECCV 2024
121
citations
#44

SCLIP: Rethinking Self-Attention for Dense Vision-Language Inference

Feng Wang, Jieru Mei, Alan Yuille

ECCV 2024
120
citations
#45

The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World

Weiyun Wang, Min Shi, Qingyun Li et al.

ICLR 2024
118
citations
#46

Hallucination Augmented Contrastive Learning for Multimodal Large Language Model

Chaoya Jiang, Haiyang Xu, Mengfan Dong et al.

CVPR 2024
116
citations
#47

Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders

Min Shi, Fuxiao Liu, Shihao Wang et al.

ICLR 2025 · arXiv:2408.15998
multimodal large language models, vision encoders, optical character recognition, document analysis (+4 more)
116
citations
#48

Efficient Test-Time Adaptation of Vision-Language Models

Adilbek Karmanov, Dayan Guan, Shijian Lu et al.

CVPR 2024
109
citations
#49

How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs

Haoqin Tu, Chenhang Cui, Zijun Wang et al.

ECCV 2024
103
citations
#50

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

Matt Deitke, Christopher Clark, Sangho Lee et al.

CVPR 2025
96
citations
#51

An Empirical Study of CLIP for Text-Based Person Search

Min Cao, Yang Bai, Ziyin Zeng et al.

AAAI 2024 · arXiv:2308.10045
text-based person search, contrastive language image pretraining, cross-modal retrieval, vision-language pre-training (+3 more)
94
citations
#52

Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models

Yifan Li, Hangyu Guo, Kun Zhou et al.

ECCV 2024
93
citations
#53

Towards Open-ended Visual Quality Comparison

Haoning Wu, Hanwei Zhu, Zicheng Zhang et al.

ECCV 2024 · arXiv:2402.16641
image quality assessment, large multi-modality models, visual quality comparison, open-ended evaluation (+4 more)
93
citations
#54

ColPali: Efficient Document Retrieval with Vision Language Models

Manuel Faysse, Hugues Sibille, Tony Wu et al.

ICLR 2025 · arXiv:2407.01449
document retrieval, vision language models, multi-vector embeddings, visual document retrieval (+3 more)
91
citations
#55

Show-o2: Improved Native Unified Multimodal Models

Jinheng Xie, Zhenheng Yang, Mike Zheng Shou

NeurIPS 2025
90
citations
#56

LVSM: A Large View Synthesis Model with Minimal 3D Inductive Bias

Haian Jin, Hanwen Jiang, Hao Tan et al.

ICLR 2025 · arXiv:2410.17242
novel view synthesis, transformer architecture, sparse-view inputs, scene representation learning (+4 more)
90
citations
#57

Brain decoding: toward real-time reconstruction of visual perception

Yohann Benchetrit, Hubert Banville, Jean-Remi King

ICLR 2024
90
citations
#58

Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models

Haoran Wei, Lingyu Kong, Jinyue Chen et al.

ECCV 2024
89
citations
#59

OmniEdit: Building Image Editing Generalist Models Through Specialist Supervision

Cong Wei, Zheyang Xiong, Weiming Ren et al.

ICLR 2025 · arXiv:2411.07199
instruction-guided image editing, diffusion models, specialist model supervision, large multimodal models (+4 more)
88
citations
#60

The All-Seeing Project V2: Towards General Relation Comprehension of the Open World

Weiyun Wang, Yiming Ren, Haowen Luo et al.

ECCV 2024 · arXiv:2402.19474
relation comprehension, object localization, multimodal large language models, relation conversation task (+3 more)
86
citations
#61

VIGC: Visual Instruction Generation and Correction

Bin Wang, Fan Wu, Xiao Han et al.

AAAI 2024 · arXiv:2308.12714
visual instruction generation, multimodal large language models, instruction-tuning data, vision-language tasks (+3 more)
84
citations
#62

MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training

Pavan Kumar Anasosalu Vasu, Hadi Pouransari, Fartash Faghri et al.

CVPR 2024
84
citations
#63

ImgEdit: A Unified Image Editing Dataset and Benchmark

Yang Ye, Xianyi He, Zongjian Li et al.

NeurIPS 2025 · arXiv:2505.20275
image editing dataset, text-to-image generation, vision-language model, multi-turn editing (+1 more)
84
citations
#64

T-Rex2: Towards Generic Object Detection via Text-Visual Prompt Synergy

Qing Jiang, Feng Li, Zhaoyang Zeng et al.

ECCV 2024 · arXiv:2403.14610
open-set object detection, text prompts, visual prompts, contrastive learning (+2 more)
83
citations
#65

PSALM: Pixelwise Segmentation with Large Multi-modal Model

Zheng Zhang, Yeyao Ma, Enming Zhang et al.

ECCV 2024 · arXiv:2403.14598
large multimodal models, image segmentation, mask decoder, referring expression segmentation (+4 more)
82
citations
#66

DynaMath: A Dynamic Visual Benchmark for Evaluating Mathematical Reasoning Robustness of Vision Language Models

Chengke Zou, Xingang Guo, Rui Yang et al.

ICLR 2025
82
citations
#67

Octopus: Embodied Vision-Language Programmer from Environmental Feedback

Jingkang Yang, Yuhao Dong, Shuai Liu et al.

ECCV 2024
81
citations
#68

General Object Foundation Model for Images and Videos at Scale

Junfeng Wu, Yi Jiang, Qihao Liu et al.

CVPR 2024
79
citations
#69

Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages

Jinyi Hu, Yuan Yao, Chongyi Wang et al.

ICLR 2024
77
citations
#70

Learning Multi-Dimensional Human Preference for Text-to-Image Generation

Sixian Zhang, Bohan Wang, Junqiang Wu et al.

CVPR 2024
76
citations
#71

Grounding Everything: Emerging Localization Properties in Vision-Language Transformers

Walid Bousselham, Felix Petersen, Vittorio Ferrari et al.

CVPR 2024
74
citations
#72

MAVIS: Mathematical Visual Instruction Tuning with an Automatic Data Engine

Renrui Zhang, Xinyu Wei, Dongzhi Jiang et al.

ICLR 2025
74
citations
#73

Towards 3D Molecule-Text Interpretation in Language Models

Sihang Li, Zhiyuan Liu, Yanchen Luo et al.

ICLR 2024
73
citations
#74

Divide, Conquer and Combine: A Training-Free Framework for High-Resolution Image Perception in Multimodal Large Language Models

Wenbin Wang, Liang Ding, Minyan Zeng et al.

AAAI 2025
73
citations
#75

Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation

Fanqing Meng, Jiaqi Liao, Xinyu Tan et al.

ICML 2025
72
citations
#76

Improving Diffusion Models for Authentic Virtual Try-on in the Wild

Yisol Choi, Sangkyung Kwak, Kyungmin Lee et al.

ECCV 2024
72
citations
#77

ClearCLIP: Decomposing CLIP Representations for Dense Vision-Language Inference

Mengcheng Lan, Chaofeng Chen, Yiping Ke et al.

ECCV 2024
68
citations
#78

Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training

Gen Luo, Xue Yang, Wenhan Dou et al.

CVPR 2025
68
citations
#79

VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents

Xiao Liu, Tianjie Zhang, Yu Gu et al.

ICLR 2025
67
citations
#80

Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence

Diankun Wu, Fangfu Liu, Yi-Hsin Hung et al.

NeurIPS 2025
67
citations
#81

Unifying 3D Vision-Language Understanding via Promptable Queries

Ziyu Zhu, Zhuofan Zhang, Xiaojian Ma et al.

ECCV 2024
64
citations
#82

HealthGPT: A Medical Large Vision-Language Model for Unifying Comprehension and Generation via Heterogeneous Knowledge Adaptation

Tianwei Lin, Wenqiao Zhang, Sijing Li et al.

ICML 2025
63
citations
#83

Text Is MASS: Modeling as Stochastic Embedding for Text-Video Retrieval

Jiamian Wang, Guohao Sun, Pichao Wang et al.

CVPR 2024
63
citations
#84

Boosting Consistency in Story Visualization with Rich-Contextual Conditional Diffusion Models

Fei Shen, Hu Ye, Sibo Liu et al.

AAAI 2025
62
citations
#85

Self-Introspective Decoding: Alleviating Hallucinations for Large Vision-Language Models

Fushuo Huo, Wenchao Xu, Zhong Zhang et al.

ICLR 2025 · arXiv:2408.02032
hallucination mitigation, vision-language models, contrastive decoding, token selection strategy (+2 more)
61
citations
#86

A Closer Look at the Few-Shot Adaptation of Large Vision-Language Models

Julio Silva-Rodríguez, Sina Hajimiri, Ismail Ben Ayed et al.

CVPR 2024
61
citations
#87

PerceptionGPT: Effectively Fusing Visual Perception into LLM

Renjie Pi, Lewei Yao, Jiahui Gao et al.

CVPR 2024
59
citations
#88

Towards Semantic Equivalence of Tokenization in Multimodal LLM

Shengqiong Wu, Hao Fei, Xiangtai Li et al.

ICLR 2025 · arXiv:2406.05127
vision tokenization, semantic alignment, multimodal llms, dynamic clustering (+2 more)
58
citations
#89

DocFormerv2: Local Features for Document Understanding

Srikar Appalaraju, Peng Tang, Qi Dong et al.

AAAI 2024 · arXiv:2306.01733
visual document understanding, multi-modal transformer, local-feature alignment, document information extraction (+4 more)
58
citations
#90

MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?

Zhaorun Chen, Zichen Wen, Yichao Du et al.

NeurIPS 2025 · arXiv:2407.04842
multimodal reward models, text-to-image generation, preference dataset, image generation models (+4 more)
57
citations
#91

VCP-CLIP: A visual context prompting model for zero-shot anomaly segmentation

Zhen Qu, Xian Tao, Mukesh Prasad et al.

ECCV 2024 · arXiv:2407.12276
zero-shot anomaly segmentation, vision-language models, visual context prompting, clip model (+4 more)
55
citations
#92

FakeInversion: Learning to Detect Images from Unseen Text-to-Image Models by Inverting Stable Diffusion

George Cazenavette, Avneesh Sud, Thomas Leung et al.

CVPR 2024
53
citations
#93

See What You Are Told: Visual Attention Sink in Large Multimodal Models

Seil Kang, Jinyeong Kim, Junhyeok Kim et al.

ICLR 2025 · arXiv:2503.03321
attention mechanism, visual attention sink, multimodal models, vision-language tasks (+4 more)
52
citations
#94

Towards Realistic UAV Vision-Language Navigation: Platform, Benchmark, and Methodology

Xiangyu Wang, Donglin Yang, Ziqin Wang et al.

ICLR 2025 · arXiv:2410.07087
vision-language navigation, uav navigation, trajectory generation, multimodal understanding (+4 more)
52
citations
#95

Describing Differences in Image Sets with Natural Language

Lisa Dunlap, Yuhui Zhang, Xiaohan Wang et al.

CVPR 2024
51
citations
#96

MultiPLY: A Multisensory Object-Centric Embodied Large Language Model in 3D World

Yining Hong, Zishuo Zheng, Peihao Chen et al.

CVPR 2024
51
citations
#97

RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics

Enshen Zhou, Jingkun An, Cheng Chi et al.

NeurIPS 2025 · arXiv:2506.04308
spatial referring, vision-language models, depth encoder integration, supervised fine-tuning (+4 more)
51
citations
#98

Jack of All Tasks, Master of Many: Designing General-Purpose Coarse-to-Fine Vision-Language Model

Shraman Pramanick, Guangxing Han, Rui Hou et al.

CVPR 2024
50
citations
#99

Structure-CLIP: Towards Scene Graph Knowledge to Enhance Multi-Modal Structured Representations

Yufeng Huang, Jiji Tang, Zhuo Chen et al.

AAAI 2024 · arXiv:2305.06152
scene graph knowledge, multi-modal structured representations, vision-language pre-training, image-text matching (+3 more)
49
citations
#100

ReMamber: Referring Image Segmentation with Mamba Twister

Yuhuan Yang, Chaofan Ma, Jiangchao Yao et al.

ECCV 2024
49
citations