Multimodal Learning
Learning from multiple modalities, such as vision, language, and audio
Top Papers
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
Chaoyou Fu, Peixian Chen, Yunhang Shen et al.
Grounding Multimodal Large Language Models to the World
Zhiliang Peng, Wenhui Wang, Li Dong et al.
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
Chaoyou Fu, Yuhan Dai, Yongdong Luo et al.
mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration
Qinghao Ye, Haiyang Xu, Jiabo Ye et al.
Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs
Shengbang Tong, Zhuang Liu, Yuexiang Zhai et al.
SALMONN: Towards Generic Hearing Abilities for Large Language Models
Changli Tang, Wenyi Yu, Guangzhi Sun et al.
Generative Multimodal Models are In-Context Learners
Quan Sun, Yufeng Cui, Xiaosong Zhang et al.
Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models
Zhang Li, Biao Yang, Qiang Liu et al.
OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation
Qidong Huang, Xiaoyi Dong, Pan Zhang et al.
Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces
Jihan Yang, Shusheng Yang, Anjali W. Gupta et al.
V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs
Penghao Wu, Saining Xie
Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model
Chunting Zhou, Lili Yu, Arun Babu et al.
DreamLLM: Synergistic Multimodal Comprehension and Creation
Runpei Dong, Chunrui Han, Yuang Peng et al.
R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization
Yi Yang, Xiaoxuan He, Hongkun Pan et al.
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
Brandon McKinzie, Zhe Gan, Jean-Philippe Fauconnier et al.
mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models
Jiabo Ye, Haiyang Xu, Haowei Liu et al.
ULIP-2: Towards Scalable Multimodal Pre-training for 3D Understanding
Le Xue, Ning Yu, Shu Zhang et al.
MSGNet: Learning Multi-Scale Inter-series Correlations for Multivariate Time Series Forecasting
Wanlin Cai, Yuxuan Liang, Xianggen Liu et al.
ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts
Mu Cai, Haotian Liu, Siva Mustikovela et al.
Multimodal Web Navigation with Instruction-Finetuned Foundation Models
Hiroki Furuta, Kuang-Huei Lee, Ofir Nachum et al.
SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models
Yuzhou Huang, Liangbin Xie, Xintao Wang et al.
VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
Chaoyou Fu, Haojia Lin, Xiong Wang et al.
UniIR: Training and Benchmarking Universal Multimodal Information Retrievers
Cong Wei, Yang Chen, Haonan Chen et al.
Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders
Min Shi, Fuxiao Liu, Shihao Wang et al.
Hallucination Augmented Contrastive Learning for Multimodal Large Language Model
Chaoya Jiang, Haiyang Xu, Mengfan Dong et al.
Binding Touch to Everything: Learning Unified Multimodal Tactile Representations
Fengyu Yang, Chao Feng, Ziyang Chen et al.
Show-o2: Improved Native Unified Multimodal Models
Jinheng Xie, Zhenheng Yang, Mike Zheng Shou
Reliable Conflictive Multi-View Learning
Cai Xu, Jiajun Si, Ziyu Guan et al.
MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training
Pavan Kumar Anasosalu Vasu, Hadi Pouransari, Fartash Faghri et al.
PSALM: Pixelwise Segmentation with Large Multi-modal Model
Zheng Zhang, Yeyao Ma, Enming Zhang et al.
MM-Embed: Universal Multimodal Retrieval with Multimodal LLMs
Sheng-Chieh Lin, Chankyu Lee, Mohammad Shoeybi et al.
Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages
Jinyi Hu, Yuan Yao, Chongyi Wang et al.
Strengthening Multimodal Large Language Model with Bootstrapped Preference Optimization
Renjie Pi, Tianyang Han, Wei Xiong et al.
Divide, Conquer and Combine: A Training-Free Framework for High-Resolution Image Perception in Multimodal Large Language Models
Wenbin Wang, Liang Ding, Minyan Zeng et al.
Frequency-Spatial Entanglement Learning for Camouflaged Object Detection
Yanguang Sun, Chunyan Xu, Jian Yang et al.
Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training
Gen Luo, Xue Yang, Wenhan Dou et al.
VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents
Xiao Liu, Tianjie Zhang, Yu Gu et al.
ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation
Cheng Yang, Chufan Shi, Yaxin Liu et al.
Attention-Challenging Multiple Instance Learning for Whole Slide Image Classification
Yunlong Zhang, Honglin Li, Yuxuan Sun et al.
PerceptionGPT: Effectively Fusing Visual Perception into LLM
Renjie Pi, Lewei Yao, Jiahui Gao et al.
Math-PUMA: Progressive Upward Multimodal Alignment to Enhance Mathematical Reasoning
Wenwen Zhuang, Xin Huang, Xiantao Zhang et al.
Matryoshka Multimodal Models
Mu Cai, Jianwei Yang, Jianfeng Gao et al.
MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?
Zhaorun Chen, Zichen Wen, Yichao Du et al.
Towards Semantic Equivalence of Tokenization in Multimodal LLM
Shengqiong Wu, Hao Fei, Xiangtai Li et al.
DrFuse: Learning Disentangled Representation for Clinical Multi-Modal Fusion with Missing Modality and Modal Inconsistency
Wenfang Yao, Kejing Yin, William Cheung et al.
Delving into Multimodal Prompting for Fine-Grained Visual Classification
Xin Jiang, Hao Tang, Junyao Gao et al.
OmniSat: Self-Supervised Modality Fusion for Earth Observation
Guillaume Astruc, Nicolas Gonthier, Clément Mallet et al.
Enhancing Multimodal Cooperation via Sample-level Modality Valuation
Yake Wei, Ruoxuan Feng, Zihe Wang et al.
OmniBench: Towards The Future of Universal Omni-Language Models
Yizhi Li, Ge Zhang, Yinghao Ma et al.
MENTOR: Multi-level Self-supervised Learning for Multimodal Recommendation
Jinfeng Xu, Zheyu Chen, Shuo Yang et al.
Magic Tokens: Select Diverse Tokens for Multi-modal Object Re-Identification
Pingping Zhang, Yuhao Wang, Yang Liu et al.
Structure-CLIP: Towards Scene Graph Knowledge to Enhance Multi-Modal Structured Representations
Yufeng Huang, Jiji Tang, Zhuo Chen et al.
MM-Narrator: Narrating Long-form Videos with Multimodal In-Context Learning
Chaoyi Zhang, Kevin Lin, Zhengyuan Yang et al.
VCoder: Versatile Vision Encoders for Multimodal Large Language Models
Jitesh Jain, Jianwei Yang, Humphrey Shi
V2Xum-LLM: Cross-Modal Video Summarization with Temporal Prompt Instruction Tuning
Hang Hua, Yunlong Tang, Chenliang Xu et al.
Multimodal Industrial Anomaly Detection by Crossmodal Feature Mapping
Alex Costanzino, Pierluigi Zama Ramirez, Giuseppe Lisanti et al.
BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models
Rizhao Cai, Zirui Song, Dayan Guan et al.
DLF: Disentangled-Language-Focused Multimodal Sentiment Analysis
Pan Wang, Qiang Zhou, Yawen Wu et al.
Debiasing Multimodal Sarcasm Detection with Contrastive Learning
Mengzhao Jia, Can Xie, Liqiang Jing
CC-OCR: A Comprehensive and Challenging OCR Benchmark for Evaluating Large Multimodal Models in Literacy
Zhibo Yang, Jun Tang, Zhaohai Li et al.
Continual Self-supervised Learning: Towards Universal Multi-modal Medical Data Representation Learning
Yiwen Ye, Yutong Xie, Jianpeng Zhang et al.
The Semantic Hub Hypothesis: Language Models Share Semantic Representations Across Languages and Modalities
Zhaofeng Wu, Xinyan Yu, Dani Yogatama et al.
Frequency Spectrum Is More Effective for Multimodal Representation and Fusion: A Multimodal Spectrum Rumor Detector
An Lao, Qi Zhang, Chongyang Shi et al.
XKD: Cross-Modal Knowledge Distillation with Domain Alignment for Video Representation Learning
Pritam Sarkar, Ali Etemad
Amodal Completion via Progressive Mixed Context Diffusion
Katherine Xu, Lingzhi Zhang, Jianbo Shi
TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models
Ziyao Shangguan, Chuhan Li, Yuxuan Ding et al.
Multi-Modal Latent Space Learning for Chain-of-Thought Reasoning in Language Models
Liqi He, Zuchao Li, Xiantao Cai et al.
Token-Level Contrastive Learning with Modality-Aware Prompting for Multimodal Intent Recognition
Qianrui Zhou, Hua Xu, Hao Li et al.
CreatiLayout: Siamese Multimodal Diffusion Transformer for Creative Layout-to-Image Generation
Hui Zhang, Dexiang Hong, Yitong Wang et al.
Multi-Prompts Learning with Cross-Modal Alignment for Attribute-Based Person Re-identification
Yajing Zhai, Yawen Zeng, Zhiyong Huang et al.
COMBO: Compositional World Models for Embodied Multi-Agent Cooperation
Hongxin Zhang, Zeyuan Wang, Qiushi Lyu et al.
Harmonizing Visual Representations for Unified Multimodal Understanding and Generation
Size Wu, Wenwei Zhang, Lumin Xu et al.
Graph-Aware Contrasting for Multivariate Time-Series Classification
Yucheng Wang, Yuecong Xu, Jianfei Yang et al.
Jointly Training Large Autoregressive Multimodal Models
Emanuele Aiello, Lili Yu, Yixin Nie et al.
Open-World Human-Object Interaction Detection via Multi-modal Prompts
Jie Yang, Bingliang Li, Ailing Zeng et al.
RadOcc: Learning Cross-Modality Occupancy Knowledge through Rendering Assisted Distillation
Haiming Zhang, Xu Yan, Dongfeng Bai et al.
Boosting Transferability in Vision-Language Attacks via Diversification along the Intersection Region of Adversarial Trajectory
Sensen Gao, Xiaojun Jia, Xuhong Ren et al.
What to align in multimodal contrastive learning?
Benoit Dufumier, Javiera Castillo Navarro, Devis Tuia et al.
UMIE: Unified Multimodal Information Extraction with Instruction Tuning
Lin Sun, Kai Zhang, Qingyuan Li et al.
DC-NAS: Divide-and-Conquer Neural Architecture Search for Multi-Modal Classification
Xinyan Liang, Pinhan Fu, Qian Guo et al.
LAMM: Label Alignment for Multi-Modal Prompt Learning
Jingsheng Gao, Jiacheng Ruan, Suncheng Xiang et al.
Learning Modality-agnostic Representation for Semantic Segmentation from Any Modalities
Xu Zheng, Yuanhuiyi Lyu, Lin Wang
Jailbreaking Multimodal Large Language Models via Shuffle Inconsistency
Shiji Zhao, Ranjie Duan, Fengxiang Wang et al.
Inst3D-LMM: Instance-Aware 3D Scene Understanding with Multi-modal Instruction Tuning
Hanxun Yu, Wentong Li, Song Wang et al.
A Comprehensive Overhaul of Multimodal Assistant with Small Language Models
Minjie Zhu, Yichen Zhu, Ning Liu et al.
Dispel Darkness for Better Fusion: A Controllable Visual Enhancer based on Cross-modal Conditional Adversarial Learning
Hao Zhang, Linfeng Tang, Xinyu Xiang et al.
LMAct: A Benchmark for In-Context Imitation Learning with Long Multimodal Demonstrations
Anian Ruoss, Fabio Pardo, Harris Chan et al.
Contrasting Intra-Modal and Ranking Cross-Modal Hard Negatives to Enhance Visio-Linguistic Compositional Understanding
Le Zhang, Rabiul Awal, Aishwarya Agrawal
EventBind: Learning a Unified Representation to Bind Them All for Event-based Open-world Understanding
Jiazhou Zhou, Xu Zheng, Yuanhuiyi Lyu et al.
Multi-modal Learning for Geospatial Vegetation Forecasting
Vitus Benson, Claire Robin, Christian Requena-Mesa et al.
Gramian Multimodal Representation Learning and Alignment
Giordano Cicchetti, Eleonora Grassucci, Luigi Sigillo et al.
Distilling Multi-modal Large Language Models for Autonomous Driving
Deepti Hegde, Rajeev Yasarla, Hong Cai et al.
Graphic Design with Large Multimodal Model
Yutao Cheng, Zhao Zhang, Maoke Yang et al.
Multimodal Patient Representation Learning with Missing Modalities and Labels
Zhenbang Wu, Anant Dadu, Nicholas Tustison et al.
UrBench: A Comprehensive Benchmark for Evaluating Large Multimodal Models in Multi-View Urban Scenarios
Baichuan Zhou, Haote Yang, Dairong Chen et al.
CAD-GPT: Synthesising CAD Construction Sequence with Spatial Reasoning-Enhanced Multimodal LLMs
Siyu Wang, Cailian Chen, Xinyi Le et al.
The Curse of Multi-Modalities: Evaluating Hallucinations of Large Multimodal Models across Language, Visual, and Audio
Sicong Leng, Yun Xing, Zesen Cheng et al.
UMBRAE: Unified Multimodal Brain Decoding
Weihao Xia, Raoul de Charette, Cengiz Öztireli et al.
Mirasol3B: A Multimodal Autoregressive Model for Time-Aligned and Contextual Modalities
AJ Piergiovanni, Isaac Noble, Dahun Kim et al.
Exploring Conditional Multi-Modal Prompts for Zero-shot HOI Detection
Ting Lei, Shaofeng Yin, Yuxin Peng et al.