Vision-Language Models
Models that understand both images and text
Top Papers
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
Zhe Chen, Jiannan Wu, Wenhai Wang et al.
Scaling Autoregressive Models for Content-Rich Text-to-Image Generation
Xin Li, Jing Yu Koh, Alexander Ku et al.
MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts
Pan Lu, Hritik Bansal, Tony Xia et al.
VILA: On Pre-training for Visual Language Models
Ji Lin, Hongxu Yin, Wei Ping et al.
Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs
Shengbang Tong, Zhuang Liu, Yuexiang Zhai et al.
SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities
Boyuan Chen, Zhuo Xu, Sean Kirmani et al.
Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding
Sicong Leng, Hang Zhang, Guanzheng Chen et al.
Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models
Zhang Li, Biao Yang, Qiang Liu et al.
HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models
Tianrui Guan, Fuxiao Liu, Xiyang Wu et al.
Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding
Peng Jin, Ryuichi Takanobu, Cai Zhang et al.
An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models
Liang Chen, Haozhe Zhao, Tianyu Liu et al.
Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces
Jihan Yang, Shusheng Yang, Anjali W. Gupta et al.
LLaVA-CoT: Let Vision Language Models Reason Step-by-Step
Guowei Xu, Peng Jin, Ziang Wu et al.
V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs
Penghao Wu, Saining Xie
Vision-Language Foundation Models as Effective Robot Imitators
Xinghang Li, Minghuan Liu, Hanbo Zhang et al.
BLINK: Multimodal Large Language Models Can See but Not Perceive
Xingyu Fu, Yushi Hu, Bangzheng Li et al.
Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model
Chunting Zhou, Lili Yu, Arun Babu et al.
FigStep: Jailbreaking Large Vision-Language Models via Typographic Visual Prompts
Yichen Gong, Delong Ran, Jinyuan Liu et al.
Detecting and Preventing Hallucinations in Large Vision Language Models
Anisha Gunjal, Jihan Yin, Erhan Bas
On Scaling Up a Multilingual Vision and Language Model
Xi Chen, Josip Djolonga, Piotr Padlewski et al.
SkySense: A Multi-Modal Remote Sensing Foundation Model Towards Universal Interpretation for Earth Observation Imagery
Xin Guo, Jiangwei Lao, Bo Dang et al.
Sequential Modeling Enables Scalable Learning for Large Vision Models
Yutong Bai, Xinyang Geng, Karttikeya Mangalam et al.
mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models
Jiabo Ye, Haiyang Xu, Haowei Liu et al.
CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models
Qingqing Zhao, Yao Lu, Moo Jin Kim et al.
SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference
Yuan Zhang, Chun-Kai Fan, Junpeng Ma et al.
BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions
Wenbo Hu, Yifan Xu, Yi Li et al.
Revisiting Feature Prediction for Learning Visual Representations from Video
Quentin Garrido, Yann LeCun, Michael Rabbat et al.
StableVITON: Learning Semantic Correspondence with Latent Diffusion Model for Virtual Try-On
Jeongho Kim, Gyojung Gu, Minho Park et al.
LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images
Zonghao Guo, Ruyi Xu, Yuan Yao et al.
Uni3D: Exploring Unified 3D Representation at Scale
Junsheng Zhou, Jinsheng Wang, Baorui Ma et al.
Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction
Yiheng Xu, Zekun Wang, Junli Wang et al.
ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts
Mu Cai, Haotian Liu, Siva Mustikovela et al.
Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding
Yan Shu, Zheng Liu, Peitian Zhang et al.
Multimodal Web Navigation with Instruction-Finetuned Foundation Models
Hiroki Furuta, Kuang-Huei Lee, Ofir Nachum et al.
SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models
Yuzhou Huang, Liangbin Xie, Xintao Wang et al.
Vision-Language Models are Zero-Shot Reward Models for Reinforcement Learning
Juan Rocamonde, Victoriano Montesinos, Elvis Nava et al.
VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
Chaoyou Fu, Haojia Lin, Xiong Wang et al.
GSVA: Generalized Segmentation via Multimodal Large Language Models
Zhuofan Xia, Dongchen Han, Yizeng Han et al.
ShowUI: One Vision-Language-Action Model for GUI Visual Agent
Kevin Qinghong Lin, Linjie Li, Difei Gao et al.
Paying More Attention to Images: A Training-Free Method for Alleviating Hallucination in LVLMs
Shi Liu, Kecheng Zheng, Wei Chen
AnyText: Multilingual Visual Text Generation and Editing
Yuxiang Tuo, Wangmeng Xiang, Jun-Yan He et al.
SCLIP: Rethinking Self-Attention for Dense Vision-Language Inference
Feng Wang, Jieru Mei, Alan Yuille
The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World
Weiyun Wang, Min Shi, Qingyun Li et al.
Hallucination Augmented Contrastive Learning for Multimodal Large Language Model
Chaoya Jiang, Haiyang Xu, Mengfan Dong et al.
Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders
Min Shi, Fuxiao Liu, Shihao Wang et al.
Efficient Test-Time Adaptation of Vision-Language Models
Adilbek Karmanov, Dayan Guan, Shijian Lu et al.
How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs
Haoqin Tu, Chenhang Cui, Zijun Wang et al.
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models
Matt Deitke, Christopher Clark, Sangho Lee et al.
An Empirical Study of CLIP for Text-Based Person Search
Min Cao, Yang Bai, Ziyin Zeng et al.
Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models
Yifan Li, Hangyu Guo, Kun Zhou et al.
Towards Open-ended Visual Quality Comparison
Haoning Wu, Hanwei Zhu, Zicheng Zhang et al.
ColPali: Efficient Document Retrieval with Vision Language Models
Manuel Faysse, Hugues Sibille, Tony Wu et al.
Brain decoding: toward real-time reconstruction of visual perception
Yohann Benchetrit, Hubert Banville, Jean-Remi King
Show-o2: Improved Native Unified Multimodal Models
Jinheng Xie, Zhenheng Yang, Mike Zheng Shou
Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models
Haoran Wei, Lingyu Kong, Jinyue Chen et al.
LVSM: A Large View Synthesis Model with Minimal 3D Inductive Bias
Haian Jin, Hanwen Jiang, Hao Tan et al.
The All-Seeing Project V2: Towards General Relation Comprehension of the Open World
Weiyun Wang, Yiming Ren, Haowen Luo et al.
VIGC: Visual Instruction Generation and Correction
Bin Wang, Fan Wu, Xiao Han et al.
MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training
Pavan Kumar Anasosalu Vasu, Hadi Pouransari, Fartash Faghri et al.
DynaMath: A Dynamic Visual Benchmark for Evaluating Mathematical Reasoning Robustness of Vision Language Models
Chengke Zou, Xingang Guo, Rui Yang et al.
Octopus: Embodied Vision-Language Programmer from Environmental Feedback
Jingkang Yang, Yuhao Dong, Shuai Liu et al.
General Object Foundation Model for Images and Videos at Scale
Junfeng Wu, Yi Jiang, Qihao Liu et al.
Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages
Jinyi Hu, Yuan Yao, Chongyi Wang et al.
Learning Multi-Dimensional Human Preference for Text-to-Image Generation
Sixian Zhang, Bohan Wang, Junqiang Wu et al.
MAVIS: Mathematical Visual Instruction Tuning with an Automatic Data Engine
Renrui Zhang, Xinyu Wei, Dongzhi Jiang et al.
Grounding Everything: Emerging Localization Properties in Vision-Language Transformers
Walid Bousselham, Felix Petersen, Vittorio Ferrari et al.
Divide, Conquer and Combine: A Training-Free Framework for High-Resolution Image Perception in Multimodal Large Language Models
Wenbin Wang, Liang Ding, Minyan Zeng et al.
Towards 3D Molecule-Text Interpretation in Language Models
Sihang Li, Zhiyuan Liu, Yanchen Luo et al.
Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation
Fanqing Meng, Jiaqi Liao, Xinyu Tan et al.
Improving Diffusion Models for Authentic Virtual Try-on in the Wild
Yisol Choi, Sangkyung Kwak, Kyungmin Lee et al.
ClearCLIP: Decomposing CLIP Representations for Dense Vision-Language Inference
Mengcheng Lan, Chaofeng Chen, Yiping Ke et al.
Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training
Gen Luo, Xue Yang, Wenhan Dou et al.
VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents
Xiao Liu, Tianjie Zhang, Yu Gu et al.
Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence
Diankun Wu, Fangfu Liu, Yi-Hsin Hung et al.
Unifying 3D Vision-Language Understanding via Promptable Queries
Ziyu Zhu, Zhuofan Zhang, Xiaojian Ma et al.
Text Is MASS: Modeling as Stochastic Embedding for Text-Video Retrieval
Jiamian Wang, Guohao Sun, Pichao Wang et al.
HealthGPT: A Medical Large Vision-Language Model for Unifying Comprehension and Generation via Heterogeneous Knowledge Adaptation
Tianwei Lin, Wenqiao Zhang, Sijing Li et al.
Boosting Consistency in Story Visualization with Rich-Contextual Conditional Diffusion Models
Fei Shen, Hu Ye, Sibo Liu et al.
A Closer Look at the Few-Shot Adaptation of Large Vision-Language Models
Julio Silva-Rodríguez, Sina Hajimiri, Ismail Ben Ayed et al.
PerceptionGPT: Effectively Fusing Visual Perception into LLM
Renjie Pi, Lewei Yao, Jiahui Gao et al.
DocFormerv2: Local Features for Document Understanding
Srikar Appalaraju, Peng Tang, Qi Dong et al.
Towards Semantic Equivalence of Tokenization in Multimodal LLM
Shengqiong Wu, Hao Fei, Xiangtai Li et al.
MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?
Zhaorun Chen, Zichen Wen, Yichao Du et al.
FakeInversion: Learning to Detect Images from Unseen Text-to-Image Models by Inverting Stable Diffusion
George Cazenavette, Avneesh Sud, Thomas Leung et al.
Towards Realistic UAV Vision-Language Navigation: Platform, Benchmark, and Methodology
Xiangyu Wang, Donglin Yang, Ziqin Wang et al.
MultiPLY: A Multisensory Object-Centric Embodied Large Language Model in 3D World
Yining Hong, Zishuo Zheng, Peihao Chen et al.
Describing Differences in Image Sets with Natural Language
Lisa Dunlap, Yuhui Zhang, Xiaohan Wang et al.
Jack of All Tasks Master of Many: Designing General-Purpose Coarse-to-Fine Vision-Language Model
Shraman Pramanick, Guangxing Han, Rui Hou et al.
See What You Are Told: Visual Attention Sink in Large Multimodal Models
Seil Kang, Jinyeong Kim, Junhyeok Kim et al.
Structure-CLIP: Towards Scene Graph Knowledge to Enhance Multi-Modal Structured Representations
Yufeng Huang, Jiji Tang, Zhuo Chen et al.
ReMamber: Referring Image Segmentation with Mamba Twister
Yuhuan Yang, Chaofan Ma, Jiangchao Yao et al.
SocialCounterfactuals: Probing and Mitigating Intersectional Social Biases in Vision-Language Models with Counterfactual Examples
Phillip Howard, Avinash Madasu, Tiep Le et al.
VCoder: Versatile Vision Encoders for Multimodal Large Language Models
Jitesh Jain, Jianwei Yang, Humphrey Shi
Towards Interpreting Visual Information Processing in Vision-Language Models
Clement Neo, Luke Ong, Philip Torr et al.
JeDi: Joint-Image Diffusion Models for Finetuning-Free Personalized Text-to-Image Generation
Yu Zeng, Vishal M. Patel, Haochen Wang et al.
MultiBooth: Towards Generating All Your Concepts in an Image from Text
Chenyang Zhu, Kai Li, Yue Ma et al.
Unveiling and Mitigating Memorization in Text-to-image Diffusion Models through Cross Attention
Jie Ren, Yaxin Li, Shenglai Zeng et al.
TEOChat: A Large Vision-Language Assistant for Temporal Earth Observation Data
Jeremy Irvin, Emily Liu, Joyce Chen et al.
Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing
Junfei Wu, Jian Guan, Kaituo Feng et al.
EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions
Kai Chen, Yunhao Gou, Runhui Huang et al.