🧬 Multimodal

Audio-Visual Learning

Learning from audio and visual signals

100 papers · 3,917 total citations
[Chart: publications over time, Feb '24 – Jan '26 (344 papers)]
Also includes: audio-visual learning, audio visual, speech and vision, sound and video
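
For orientation, here is a minimal sketch of the paradigm the topic name refers to: two modality-specific encoders embed paired audio and video clips, and a contrastive (CLIP-style InfoNCE) objective pulls matching pairs together in a shared space. This is an illustrative assumption about the common setup, not code from any paper listed below; the function and variable names are invented for the example.

```python
# Illustrative sketch of contrastive audio-visual alignment (assumed setup,
# not taken from any specific paper below). The embeddings stand in for the
# outputs of separate audio and video encoders.
import torch
import torch.nn.functional as F

def av_contrastive_loss(audio_emb: torch.Tensor,
                        video_emb: torch.Tensor,
                        temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE: the i-th audio clip should match the i-th video clip."""
    a = F.normalize(audio_emb, dim=-1)   # (batch, dim), unit norm
    v = F.normalize(video_emb, dim=-1)
    logits = a @ v.t() / temperature     # (batch, batch) cosine similarities
    targets = torch.arange(a.size(0), device=a.device)
    # Cross-entropy in both retrieval directions: audio->video and video->audio.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy usage with random tensors in place of real encoder outputs.
audio = torch.randn(8, 512)
video = torch.randn(8, 512)
print(av_contrastive_loss(audio, video).item())
```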

Top Papers

#1

Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs

Shengbang Tong, Zhuang Liu, Yuexiang Zhai et al.

CVPR 2024
570 citations
#2

SALMONN: Towards Generic Hearing Abilities for Large Language Models

Changli Tang, Wenyi Yu, Guangzhi Sun et al.

ICLR 2024
447 citations
#3

Listen, Think, and Understand

Yuan Gong, Hongyin Luo, Alexander Liu et al.

ICLR 2024
221 citations
#4

EMO: Emote Portrait Alive - Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions

Linrui Tian, Qi Wang, Bang Zhang et al.

ECCV 2024
218 citations
#5

HIVE: Harnessing Human Feedback for Instructional Visual Editing

Shu Zhang, Xinyi Yang, Yihao Feng et al.

CVPR 2024
164 citations
#6

Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners

Yazhou Xing, Yingqing He, Zeyue Tian et al.

CVPR 2024
109 citations
#7

Decoding Natural Images from EEG for Object Recognition

Yonghao Song, Bingchuan Liu, Xiang Li et al.

ICLR 2024
92 citations
#8

Brain decoding: toward real-time reconstruction of visual perception

Yohann Benchetrit, Hubert Banville, Jean-Rémi King

ICLR 2024
90 citations
#9

Loopy: Taming Audio-Driven Portrait Avatar with Long-Term Motion Dependency

Jianwen Jiang, Chao Liang, Jiaqi Yang et al.

ICLR 2025
89 citations
#10

EMAGE: Towards Unified Holistic Co-Speech Gesture Generation via Expressive Masked Audio Gesture Modeling

Haiyang Liu, Zihao Zhu, Giorgio Becherini et al.

CVPR 2024
78 citations
#11

Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models

Sreyan Ghosh, Arushi Goel, Jaehyeon Kim et al.

NeurIPS 2025
74 citations
#12

V2A-Mapper: A Lightweight Solution for Vision-to-Audio Generation by Connecting Foundation Models

Heng Wang, Jianbo Ma, Santiago Pascual et al.

AAAI 2024 · arXiv:2308.09300
Keywords: vision-to-audio generation, cross-modal generation, foundation models, latent space alignment (+4 more)
74 citations
#13

From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations

Evonne Ng, Javier Romero, Timur Bagautdinov et al.

CVPR 2024
71 citations
#14

MMAR: A Challenging Benchmark for Deep Reasoning in Speech, Audio, Music, and Their Mix

Ziyang Ma, Yinghao Ma, Yanqiao Zhu et al.

NeurIPS 2025 · arXiv:2505.13032
Keywords: audio-language models, multimodal audio reasoning, chain-of-thought rationale, audio question answering (+4 more)
52 citations
#15

FaceTalk: Audio-Driven Motion Diffusion for Neural Parametric Head Models

Shivangi Aneja, Justus Thies, Angela Dai et al.

CVPR 2024
52 citations
#16

V2Xum-LLM: Cross-Modal Video Summarization with Temporal Prompt Instruction Tuning

Hang Hua, Yunlong Tang, Chenliang Xu et al.

AAAI 2025
47 citations
#17

Articulate-Anything: Automatic Modeling of Articulated Objects via a Vision-Language Foundation Model

Long Le, Jason Xie, William Liang et al.

ICLR 2025
42 citations
#18

EmoVIT: Revolutionizing Emotion Insights with Visual Instruction Tuning

Hongxia Xie, Chu-Jun Peng, Yu-Wen Tseng et al.

CVPR 2024
38 citations
#19

Prompting Segmentation with Sound Is Generalizable Audio-Visual Source Localizer

Yaoting Wang, Weisong Liu, Guangyao Li et al.

AAAI 2024 · arXiv:2309.07929
Keywords: audio-visual localization, audio-visual segmentation, zero-shot learning, few-shot learning (+4 more)
38 citations
#20

XKD: Cross-Modal Knowledge Distillation with Domain Alignment for Video Representation Learning

Pritam Sarkar, Ali Etemad

AAAI 2024 · arXiv:2211.13929
Keywords: cross-modal knowledge distillation, masked data reconstruction, domain alignment strategy, video representation learning (+4 more)
38 citations
#21

Video-Guided Foley Sound Generation with Multimodal Controls

Ziyang Chen, Prem Seetharaman, Bryan Russell et al.

CVPR 2025 · arXiv:2411.17698
Keywords: video-guided sound generation, multimodal conditioning, foley sound synthesis, audio-visual synchronization (+4 more)
38 citations
#22

VLAS: Vision-Language-Action Model with Speech Instructions for Customized Robot Manipulation

Wei Zhao, Pengxiang Ding, Min Zhang et al.

ICLR 2025
37 citations
#23

Real Acoustic Fields: An Audio-Visual Room Acoustics Dataset and Benchmark

Ziyang Chen, Israel D. Gebru, Christian Richardt et al.

CVPR 2024
36 citations
#24

Unraveling Instance Associations: A Closer Look for Audio-Visual Segmentation

Yuanhong Chen, Yuyuan Liu, Hu Wang et al.

CVPR 2024
34 citations
#25

Audio-Synchronized Visual Animation

Lin Zhang, Shentong Mo, Yijing Zhang et al.

ECCV 2024
33 citations
#26

It's Never Too Late: Fusing Acoustic Information into Large Language Models for Automatic Speech Recognition

Chen Chen, Ruizhe Li, Yuchen Hu et al.

ICLR 2024
32 citations
#27

LongVALE: Vision-Audio-Language-Event Benchmark Towards Time-Aware Omni-Modal Perception of Long Videos

Tiantian Geng, Jinrui Zhang, Qingni Wang et al.

CVPR 2025 · arXiv:2411.19772
Keywords: omni-modal perception, multi-modal video understanding, event boundary detection, vision-audio-language benchmark (+4 more)
32 citations
#28

Unveiling the Power of Audio-Visual Early Fusion Transformers with Dense Interactions through Masked Modeling

Shentong Mo, Pedro Morgado

CVPR 2024
31 citations
#29

Masked Generative Video-to-Audio Transformers with Enhanced Synchronicity

Santiago Pascual, Chunghsin Yeh, Ioannis Tsiamas et al.

ECCV 2024 · arXiv:2407.10387
Keywords: video-to-audio generation, audio-visual synchronization, generative audio codec, masked generative model (+2 more)
31 citations
#30

Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction

Rui Qian, Shuangrui Ding, Xiaoyi Dong et al.

CVPR 2025 · arXiv:2501.03218
Keywords: video large language models, active real-time interaction, streaming video processing, disentangled system architecture (+4 more)
31 citations
#31

Let Them Talk: Audio-Driven Multi-Person Conversational Video Generation

Zhe Kong, Feng Gao, Yong Zhang et al.

NeurIPS 2025 · arXiv:2505.22647
Keywords: audio-driven human animation, talking head generation, talking body generation, multi-person video generation (+3 more)
30 citations
#32

Audio-Visual Segmentation via Unlabeled Frame Exploitation

Jinxiang Liu, Yikun Liu, Ferenas et al.

CVPR 2024
27 citations
#33

Object-Aware Adaptive-Positivity Learning for Audio-Visual Question Answering

Zhangbin Li, Jinxing Zhou, Dan Guo et al.

AAAI 2024 · arXiv:2312.12816
Keywords: audio-visual question answering, object-level clues, multi-modal relations, question-conditioned discovery (+4 more)
24 citations
#34

AIGV-Assessor: Benchmarking and Evaluating the Perceptual Quality of Text-to-Video Generation with LMM

Jiarui Wang, Huiyu Duan, Guangtao Zhai et al.

CVPR 2025
24 citations
#35

Empowering LLMs with Pseudo-Untrimmed Videos for Audio-Visual Temporal Understanding

Yunlong Tang, Daiki Shimada, Jing Bi et al.

AAAI 2025
24 citations
#36

V2Meow: Meowing to the Visual Beat via Video-to-Music Generation

Kun Su, Judith Li, Qingqing Huang et al.

AAAI 2024 · arXiv:2305.06594
Keywords: video-to-music generation, autoregressive model, visual-audio correspondence, audio codecs (+4 more)
23 citations
#37

Ref-AVS: Refer and Segment Objects in Audio-Visual Scenes

Yaoting Wang, Peiwen Sun, Dongzhan Zhou et al.

ECCV 2024
23 citations
#38

NatureLM-audio: an Audio-Language Foundation Model for Bioacoustics

David Robinson, Marius Miron, Masato Hagiwara et al.

ICLR 2025 · arXiv:2411.07186
Keywords: audio-language foundation model, bioacoustics tasks, zero-shot classification, animal vocalization detection (+3 more)
23 citations
#39

OphCLIP: Hierarchical Retrieval-Augmented Learning for Ophthalmic Surgical Video-Language Pretraining

Ming Hu, Kun Yuan, Yaling Shen et al.

ICCV 2025
23 citations
#40

Audio Large Language Models Can Be Descriptive Speech Quality Evaluators

Chen Chen, Yuchen Hu, Siyin Wang et al.

ICLR 2025 · arXiv:2501.17202
Keywords: speech quality evaluation, audio large language models, multimodal agents, natural language evaluation (+3 more)
22 citations
#41

Seeing Far and Clearly: Mitigating Hallucinations in MLLMs with Attention Causal Decoding

Feilong Tang, Chengzhi Liu, Zhongxing Xu et al.

CVPR 2025 · arXiv:2505.16652
Keywords: attention mechanism, multimodal large language models, visual question answering, hallucination mitigation (+3 more)
22 citations
#42

Audio Entailment: Assessing Deductive Reasoning for Audio Understanding

Soham Deshmukh, Shuo Han, Hazim Bukhari et al.

AAAI 2025
22 citations
#43

AdvWave: Stealthy Adversarial Jailbreak Attack against Large Audio-Language Models

Mintong Kang, Chejian Xu, Bo Li

ICLR 2025
21 citations
#44

Attention Distillation: A Unified Approach to Visual Characteristics Transfer

Yang Zhou, Xu Gao, Zichong Chen et al.

CVPR 2025
21 citations
#45

Navigation Instruction Generation with BEV Perception and Large Language Models

Sheng Fan, Rui Liu, Wenguan Wang et al.

ECCV 2024
20 citations
#46

Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos

Changan Chen, Puyuan Peng, Ami Baid et al.

ECCV 2024
19 citations
#47

FaceChain-ImagineID: Freely Crafting High-Fidelity Diverse Talking Faces from Disentangled Audio

Chao Xu, Yang Liu, Jiazheng Xing et al.

CVPR 2024
18 citations
#48

Dense Audio-Visual Event Localization Under Cross-Modal Consistency and Multi-Temporal Granularity Collaboration

Ziheng Zhou, Jinxing Zhou, Wei Qian et al.

AAAI 2025
18 citations
#49

AVHBench: A Cross-Modal Hallucination Benchmark for Audio-Visual Large Language Models

Kim Sung-Bin, Oh Hyun-Bin, Lee Jung-Mok et al.

ICLR 2025 · arXiv:2410.18325
Keywords: audio-visual LLMs, cross-modal hallucination, multimodal understanding, audio-visual perception (+3 more)
17 citations
#50

VITA-Audio: Fast Interleaved Audio-Text Token Generation for Efficient Large Speech-Language Model

Zuwei Long, Yunhang Shen, Chaoyou Fu et al.

NeurIPS 2025
Keywords: audio-text token generation, large speech-language model, multiple cross-modal prediction, streaming speech synthesis (+4 more)
17 citations
#51

INFP: Audio-Driven Interactive Head Generation in Dyadic Conversations

Yongming Zhu, Longhao Zhang, Zhengkun Rong et al.

CVPR 2025 · arXiv:2412.04037
Keywords: audio-driven head generation, dyadic conversation modeling, motion latent space, denoising motion generation (+4 more)
17 citations
#52

Listen to Look into the Future: Audio-Visual Egocentric Gaze Anticipation

Bolin Lai, Fiona Ryan, Wenqi Jia et al.

ECCV 2024
17 citations
#53

ThinkSound: Chain-of-Thought Reasoning in Multimodal LLMs for Audio Generation and Editing

Huadai Liu, Kaicheng Luo, Jialei Wang et al.

NeurIPS 2025
16 citations
#54

AV2AV: Direct Audio-Visual Speech to Audio-Visual Speech Translation with Unified Audio-Visual Speech Representation

Jeongsoo Choi, Se Jin Park, Minsu Kim et al.

CVPR 2024
16 citations
#55

The Audio-Visual Conversational Graph: From an Egocentric-Exocentric Perspective

Wenqi Jia, Miao Liu, Hao Jiang et al.

CVPR 2024
15 citations
#56

LeVo: High-Quality Song Generation with Multi-Preference Alignment

Shun Lei, Yaoxun Xu, Zhiwei Lin et al.

NeurIPS 2025 · arXiv:2506.07520
Keywords: lyrics-to-song generation, audio language models, vocal-instrument harmony, parallel token modeling (+4 more)
15 citations
#57

Learning to Visually Localize Sound Sources from Mixtures without Prior Source Knowledge

Dongjin Kim, Sung Jin Um, Sangmin Lee et al.

CVPR 2024
15 citations
#58

Cyclic Learning for Binaural Audio Generation and Localization

Zhaojian Li, Bin Zhao, Yuan Yuan

CVPR 2024
15 citations
#59

Multichannel AV-wav2vec2: A Framework for Learning Multichannel Multi-Modal Speech Representation

Qiushi Zhu, Jie Zhang, Yu Gu et al.

AAAI 2024 · arXiv:2401.03468
Keywords: self-supervised learning, multichannel speech processing, audio-visual speech recognition, contrastive learning (+4 more)
15 citations
#60

Discrete to Continuous: Generating Smooth Transition Poses from Sign Language Observations

Shengeng Tang, Jiayi He, Lechao Cheng et al.

CVPR 2025
15 citations
#61

Towards Efficient and Effective Text-to-Video Retrieval with Coarse-to-Fine Visual Representation Learning

Kaibin Tian, Yanhua Cheng, Yi Liu et al.

AAAI 2024 · arXiv:2401.00701
Keywords: text-to-video retrieval, coarse-to-fine representation, multi-granularity features, cross-modal alignment (+4 more)
14 citations
#62

DiffSal: Joint Audio and Video Learning for Diffusion Saliency Prediction

Junwen Xiong, Peng Zhang, Tao You et al.

CVPR 2024
14 citations
#63

Dynamic-VLM: Simple Dynamic Visual Token Compression for VideoLLM

Han Wang, Yuxiang Nie, Yongjie Ye et al.

ICCV 2025
14 citations
#64

Finding Visual Task Vectors

Alberto Hojel, Yutong Bai, Trevor Darrell et al.

ECCV 2024
14 citations
#65

Learning to Learn Better Visual Prompts

Fengxiang Wang, Wanrong Huang, Shaowu Yang et al.

AAAI 2024
14 citations
#66

Multimodal Class-aware Semantic Enhancement Network for Audio-Visual Video Parsing

Pengcheng Zhao, Jinxing Zhou, Yang Zhao et al.

AAAI 2025
13 citations
#67

ReSyncer: Rewiring Style-based Generator for Unified Audio-Visually Synced Facial Performer

Jiazhi Guan, Zhiliang Xu, Hang Zhou et al.

ECCV 2024 · arXiv:2408.03284
Keywords: style-based generator, audio-visual lip-syncing, 3D facial dynamics, style-injected transformer (+4 more)
13 citations
#68

EvSign: Sign Language Recognition and Translation with Streaming Events

Pengyu Zhang, Hao Yin, Zeren Wang et al.

ECCV 2024
13 citations
#69

OmniSync: Towards Universal Lip Synchronization via Diffusion Transformers

Ziqiao Peng, Jiwen Liu, Haoxian Zhang et al.

NeurIPS 2025
12 citations
#70

VinTAGe: Joint Video and Text Conditioning for Holistic Audio Generation

Saksham Singh Kushwaha, Yapeng Tian

CVPR 2025
12 citations
#71

Audio-Visual Instance Segmentation

Ruohao Guo, Xianghua Ying, Yaru Chen et al.

CVPR 2025
11 citations
#72

Audio-visual Controlled Video Diffusion with Masked Selective State Spaces Modeling for Natural Talking Head Generation

Fa-Ting Hong, Zunnan Xu, Zixiang Zhou et al.

ICCV 2025 · arXiv:2504.02542
Keywords: talking head synthesis, video diffusion framework, multi-modal control, Mamba structure (+3 more)
11 citations
#73

SoundingActions: Learning How Actions Sound from Narrated Egocentric Videos

Changan Chen, Kumar Ashutosh, Rohit Girdhar et al.

CVPR 2024
11 citations
#74

Tri-Ergon: Fine-Grained Video-to-Audio Generation with Multi-Modal Conditions and LUFS Control

Bingliang Li, Fengyu Yang, Yuxin Mao et al.

AAAI 2025
11 citations
#75

ViSpeak: Visual Instruction Feedback in Streaming Videos

Shenghao Fu, Qize Yang, Yuan-Ming Li et al.

ICCV 2025
11 citations
#76

Step Differences in Instructional Video

Tushar Nagarajan, Lorenzo Torresani

CVPR 2024
10 citations
#77

SSLAM: Enhancing Self-Supervised Models with Audio Mixtures for Polyphonic Soundscapes

Tony Alex, Sara Atito, Armin Mustafa et al.

ICLR 2025
10 citations
#78

Text-to-CAD Generation Through Infusing Visual Feedback in Large Language Models

Ruiyu Wang, Yu Yuan, Shizhao Sun et al.

ICML 2025
10 citations
#79

MemoNav: Working Memory Model for Visual Navigation

Hongxin Li, Zeyu Wang, Xu Yang et al.

CVPR 2024
10 citations
#80

Perceptual Evaluation of Audio-Visual Synchrony Grounded in Viewers’ Opinion Scores

Lucas Goncalves, Prashant Mathur, Chandrashekhar Lavania et al.

ECCV 2024
9 citations
#81

Selective Visual Prompting in Vision Mamba

Yifeng Yao, Zichen Liu, Zhenyu Cui et al.

AAAI 2025
9 citations
#82

ADIFF: Explaining audio difference using natural language

Soham Deshmukh, Shuo Han, Rita Singh et al.

ICLR 2025 · arXiv:2502.04476
Keywords: audio difference explanation, audio captioning datasets, cross-projection module, prefix tuning (+4 more)
9 citations
#83

Circumventing Shortcuts in Audio-visual Deepfake Detection Datasets with Unsupervised Learning

Stefan Smeu, Dragos-Alexandru Boldisor, Dan Oneata et al.

CVPR 2025
9 citations
#84

Interpretable Vision-Language Survival Analysis with Ordinal Inductive Bias for Computational Pathology

Pei Liu, Luping Ji, Jiaxiang Gou et al.

ICLR 2025
8 citations
#85

Aligned Better, Listen Better for Audio-Visual Large Language Models

Yuxin Guo, Shuailei Ma, Shijie Ma et al.

ICLR 2025
8 citations
#86

A Study of Dropout-Induced Modality Bias on Robustness to Missing Video Frames for Audio-Visual Speech Recognition

Yusheng Dai, Hang Chen, Jun Du et al.

CVPR 2024
8 citations
#87

RTFS-Net: Recurrent Time-Frequency Modelling for Efficient Audio-Visual Speech Separation

Samuel Pegg, Kai Li, Xiaolin Hu

ICLR 2024
8 citations
#88

Learning Spatial Features from Audio-Visual Correspondence in Egocentric Videos

Sagnik Majumder, Ziad Al-Halah, Kristen Grauman

CVPR 2024
8 citations
#89

ViG: Linear-complexity Visual Sequence Learning with Gated Linear Attention

Bencheng Liao, Xinggang Wang, Lianghui Zhu et al.

AAAI 2025
8 citations
#90

LoR-VP: Low-Rank Visual Prompting for Efficient Vision Model Adaptation

Can Jin, Ying Li, Mingyu Zhao et al.

ICLR 2025
8 citations
#91

Learning Adaptive Lighting via Channel-Aware Guidance

Qirui Yang, Peng-Tao Jiang, Hao Zhang et al.

ICML 2025
7 citations
#92

Detours for Navigating Instructional Videos

Kumar Ashutosh, Zihui Xue, Tushar Nagarajan et al.

CVPR 2024
7 citations
#93

BearLLM: A Prior Knowledge-Enhanced Bearing Health Management Framework with Unified Vibration Signal Representation

Haotian Peng, Jiawei Liu, Jinsong Du et al.

AAAI 2025
7 citations
#94

Audio-visual Generalized Zero-shot Learning the Easy Way

Shentong Mo, Pedro Morgado

ECCV 2024
7 citations
#95

Contextual AD Narration with Interleaved Multimodal Sequence

Hanlin Wang, Zhan Tong, Kecheng Zheng et al.

CVPR 2025 · arXiv:2403.12922
Keywords: audio description generation, multimodal sequence modeling, video feature alignment, character bank modeling (+3 more)
7 citations
#96

Self-Supervised Audio-Visual Soundscape Stylization

Tingle Li, Renhao Wang, Po-Yao Huang et al.

ECCV 2024
7 citations
#97

SAE-V: Interpreting Multimodal Models for Enhanced Alignment

Hantao Lou, Changye Li, Jiaming Ji et al.

ICML 2025
6 citations
#98

Language-Guided Audio-Visual Learning for Long-Term Sports Assessment

Huangbiao Xu, Xiao Ke, Huanqi Wu et al.

CVPR 2025
6 citations
#99

Poison as Cure: Visual Noise for Mitigating Object Hallucinations in LVMs

Kejia Zhang, Keda Tao, Jiasheng Tang et al.

NeurIPS 2025
5 citations
#100

Object-aware Sound Source Localization via Audio-Visual Scene Understanding

Sung Jin Um, Dongjin Kim, Sangmin Lee et al.

CVPR 2025 · arXiv:2506.18557
Keywords: sound source localization, audio-visual correspondence, multimodal large language models, object-aware contrastive alignment (+2 more)
5 citations