Audio-Visual Learning
Learning from audio and visual signals
Top Papers
Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs
Shengbang Tong, Zhuang Liu, Yuexiang Zhai et al.
SALMONN: Towards Generic Hearing Abilities for Large Language Models
Changli Tang, Wenyi Yu, Guangzhi Sun et al.
Listen, Think, and Understand
Yuan Gong, Hongyin Luo, Alexander Liu et al.
EMO: Emote Portrait Alive - Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions
Linrui Tian, Qi Wang, Bang Zhang et al.
HIVE: Harnessing Human Feedback for Instructional Visual Editing
Shu Zhang, Xinyi Yang, Yihao Feng et al.
Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners
Yazhou Xing, Yingqing He, Zeyue Tian et al.
Decoding Natural Images from EEG for Object Recognition
Yonghao Song, Bingchuan Liu, Xiang Li et al.
Brain decoding: toward real-time reconstruction of visual perception
Yohann Benchetrit, Hubert Banville, Jean-Remi King
Loopy: Taming Audio-Driven Portrait Avatar with Long-Term Motion Dependency
Jianwen Jiang, Chao Liang, Jiaqi Yang et al.
EMAGE: Towards Unified Holistic Co-Speech Gesture Generation via Expressive Masked Audio Gesture Modeling
Haiyang Liu, Zihao Zhu, Giorgio Becherini et al.
Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models
Sreyan Ghosh, Arushi Goel, Jaehyeon Kim et al.
V2A-Mapper: A Lightweight Solution for Vision-to-Audio Generation by Connecting Foundation Models
Heng Wang, Jianbo Ma, Santiago Pascual et al.
From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations
Evonne Ng, Javier Romero, Timur Bagautdinov et al.
MMAR: A Challenging Benchmark for Deep Reasoning in Speech, Audio, Music, and Their Mix
Ziyang Ma, Yinghao Ma, Yanqiao Zhu et al.
FaceTalk: Audio-Driven Motion Diffusion for Neural Parametric Head Models
Shivangi Aneja, Justus Thies, Angela Dai et al.
V2Xum-LLM: Cross-Modal Video Summarization with Temporal Prompt Instruction Tuning
Hang Hua, Yunlong Tang, Chenliang Xu et al.
Articulate-Anything: Automatic Modeling of Articulated Objects via a Vision-Language Foundation Model
Long Le, Jason Xie, William Liang et al.
EmoVIT: Revolutionizing Emotion Insights with Visual Instruction Tuning
Hongxia Xie, Chu-Jun Peng, Yu-Wen Tseng et al.
Prompting Segmentation with Sound Is Generalizable Audio-Visual Source Localizer
Yaoting Wang, Weisong Liu, Guangyao Li et al.
XKD: Cross-Modal Knowledge Distillation with Domain Alignment for Video Representation Learning
Pritam Sarkar, Ali Etemad
Video-Guided Foley Sound Generation with Multimodal Controls
Ziyang Chen, Prem Seetharaman, Bryan Russell et al.
VLAS: Vision-Language-Action Model with Speech Instructions for Customized Robot Manipulation
Wei Zhao, Pengxiang Ding, Min Zhang et al.
Real Acoustic Fields: An Audio-Visual Room Acoustics Dataset and Benchmark
Ziyang Chen, Israel D. Gebru, Christian Richardt et al.
Unraveling Instance Associations: A Closer Look for Audio-Visual Segmentation
Yuanhong Chen, Yuyuan Liu, Hu Wang et al.
Audio-Synchronized Visual Animation
Lin Zhang, Shentong Mo, Yijing Zhang et al.
It's Never Too Late: Fusing Acoustic Information into Large Language Models for Automatic Speech Recognition
Chen Chen, Ruizhe Li, Yuchen Hu et al.
LongVALE: Vision-Audio-Language-Event Benchmark Towards Time-Aware Omni-Modal Perception of Long Videos
Tiantian Geng, Jinrui Zhang, Qingni Wang et al.
Unveiling the Power of Audio-Visual Early Fusion Transformers with Dense Interactions through Masked Modeling
Shentong Mo, Pedro Morgado
Masked Generative Video-to-Audio Transformers with Enhanced Synchronicity
Santiago Pascual, Chunghsin YEH, Ioannis Tsiamas et al.
Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction
Rui Qian, Shuangrui Ding, Xiaoyi Dong et al.
Let Them Talk: Audio-Driven Multi-Person Conversational Video Generation
Zhe Kong, Feng Gao, Yong Zhang et al.
Audio-Visual Segmentation via Unlabeled Frame Exploitation
Jinxiang Liu, Yikun Liu, Ferenas et al.
Object-Aware Adaptive-Positivity Learning for Audio-Visual Question Answering
Zhangbin Li, Jinxing Zhou, Dan Guo et al.
AIGV-Assessor: Benchmarking and Evaluating the Perceptual Quality of Text-to-Video Generation with LMM
Jiarui Wang, Huiyu Duan, Guangtao Zhai et al.
Empowering LLMs with Pseudo-Untrimmed Videos for Audio-Visual Temporal Understanding
Yunlong Tang, Daiki Shimada, Jing Bi et al.
V2Meow: Meowing to the Visual Beat via Video-to-Music Generation
Kun Su, Judith Li, Qingqing Huang et al.
Ref-AVS: Refer and Segment Objects in Audio-Visual Scenes
Yaoting Wang, Peiwen Sun, Dongzhan Zhou et al.
NatureLM-audio: an Audio-Language Foundation Model for Bioacoustics
David Robinson, Marius Miron, Masato Hagiwara et al.
OphCLIP: Hierarchical Retrieval-Augmented Learning for Ophthalmic Surgical Video-Language Pretraining
Ming Hu, Kun Yuan, Yaling Shen et al.
Audio Large Language Models Can Be Descriptive Speech Quality Evaluators
Chen Chen, Yuchen Hu, Siyin Wang et al.
Seeing Far and Clearly: Mitigating Hallucinations in MLLMs with Attention Causal Decoding
Feilong Tang, Chengzhi Liu, Zhongxing Xu et al.
Audio Entailment: Assessing Deductive Reasoning for Audio Understanding
Soham Deshmukh, Shuo Han, Hazim Bukhari et al.
AdvWave: Stealthy Adversarial Jailbreak Attack against Large Audio-Language Models
Mintong Kang, Chejian Xu, Bo Li
Attention Distillation: A Unified Approach to Visual Characteristics Transfer
Yang Zhou, Xu Gao, Zichong Chen et al.
Navigation Instruction Generation with BEV Perception and Large Language Models
Sheng Fan, Rui Liu, Wenguan Wang et al.
Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos
Changan Chen, Puyuan Peng, Ami Baid et al.
FaceChain-ImagineID: Freely Crafting High-Fidelity Diverse Talking Faces from Disentangled Audio
Chao Xu, Yang Liu, Jiazheng Xing et al.
Dense Audio-Visual Event Localization Under Cross-Modal Consistency and Multi-Temporal Granularity Collaboration
Ziheng Zhou, Jinxing Zhou, Wei Qian et al.
AVHBench: A Cross-Modal Hallucination Benchmark for Audio-Visual Large Language Models
Kim Sung-Bin, Oh Hyun-Bin, Lee Jung-Mok et al.
VITA-Audio: Fast Interleaved Audio-Text Token Generation for Efficient Large Speech-Language Model
Zuwei Long, Yunhang Shen, Chaoyou Fu et al.
INFP: Audio-Driven Interactive Head Generation in Dyadic Conversations
Yongming Zhu, Longhao Zhang, Zhengkun Rong et al.
Listen to Look into the Future: Audio-Visual Egocentric Gaze Anticipation
Bolin Lai, Fiona Ryan, Wenqi Jia et al.
ThinkSound: Chain-of-Thought Reasoning in Multimodal LLMs for Audio Generation and Editing
Huadai Liu, Kaicheng Luo, Jialei Wang et al.
AV2AV: Direct Audio-Visual Speech to Audio-Visual Speech Translation with Unified Audio-Visual Speech Representation
Jeongsoo Choi, Se Jin Park, Minsu Kim et al.
The Audio-Visual Conversational Graph: From an Egocentric-Exocentric Perspective
Wenqi Jia, Miao Liu, Hao Jiang et al.
LeVo: High-Quality Song Generation with Multi-Preference Alignment
Shun Lei, Yaoxun Xu, Zhiwei Lin et al.
Learning to Visually Localize Sound Sources from Mixtures without Prior Source Knowledge
Dongjin Kim, Sung Jin Um, Sangmin Lee et al.
Cyclic Learning for Binaural Audio Generation and Localization
Zhaojian Li, Bin Zhao, Yuan Yuan
Multichannel AV-wav2vec2: A Framework for Learning Multichannel Multi-Modal Speech Representation
Qiushi Zhu, Jie Zhang, Yu Gu et al.
Discrete to Continuous: Generating Smooth Transition Poses from Sign Language Observations
Shengeng Tang, Jiayi He, Lechao Cheng et al.
Towards Efficient and Effective Text-to-Video Retrieval with Coarse-to-Fine Visual Representation Learning
Kaibin Tian, Yanhua Cheng, Yi Liu et al.
DiffSal: Joint Audio and Video Learning for Diffusion Saliency Prediction
Junwen Xiong, Peng Zhang, Tao You et al.
Dynamic-VLM: Simple Dynamic Visual Token Compression for VideoLLM
Han Wang, Yuxiang Nie, Yongjie Ye et al.
Finding Visual Task Vectors
Alberto Hojel, Yutong Bai, Trevor Darrell et al.
Learning to Learn Better Visual Prompts
Fengxiang Wang, Wanrong Huang, Shaowu Yang et al.
Multimodal Class-aware Semantic Enhancement Network for Audio-Visual Video Parsing
Pengcheng Zhao, Jinxing Zhou, Yang Zhao et al.
ReSyncer: Rewiring Style-based Generator for Unified Audio-Visually Synced Facial Performer
Jiazhi Guan, Zhiliang Xu, Hang Zhou et al.
EvSign: Sign Language Recognition and Translation with Streaming Events
Pengyu Zhang, Hao Yin, Zeren Wang et al.
OmniSync: Towards Universal Lip Synchronization via Diffusion Transformers
Ziqiao Peng, Jiwen Liu, Haoxian Zhang et al.
VinTAGe: Joint Video and Text Conditioning for Holistic Audio Generation
Saksham Singh Kushwaha, Yapeng Tian
Audio-Visual Instance Segmentation
Ruohao Guo, Xianghua Ying, Yaru Chen et al.
Audio-visual Controlled Video Diffusion with Masked Selective State Spaces Modeling for Natural Talking Head Generation
Fa-Ting Hong, Zunnan Xu, Zixiang Zhou et al.
SoundingActions: Learning How Actions Sound from Narrated Egocentric Videos
Changan Chen, Kumar Ashutosh, Rohit Girdhar et al.
Tri-Ergon: Fine-Grained Video-to-Audio Generation with Multi-Modal Conditions and LUFS Control
Bingliang Li, Fengyu Yang, Yuxin Mao et al.
ViSpeak: Visual Instruction Feedback in Streaming Videos
Shenghao Fu, Qize Yang, Yuan-Ming Li et al.
Step Differences in Instructional Video
Tushar Nagarajan, Lorenzo Torresani
SSLAM: Enhancing Self-Supervised Models with Audio Mixtures for Polyphonic Soundscapes
Tony Alex, Sara Atito, Armin Mustafa et al.
Text-to-CAD Generation Through Infusing Visual Feedback in Large Language Models
Ruiyu Wang, Yu Yuan, Shizhao Sun et al.
MemoNav: Working Memory Model for Visual Navigation
Hongxin Li, Zeyu Wang, Xu Yang et al.
Perceptual Evaluation of Audio-Visual Synchrony Grounded in Viewers’ Opinion Scores
Lucas Goncalves, Prashant Mathur, Chandrashekhar Lavania et al.
Selective Visual Prompting in Vision Mamba
Yifeng Yao, Zichen Liu, Zhenyu Cui et al.
ADIFF: Explaining audio difference using natural language
Soham Deshmukh, Shuo Han, Rita Singh et al.
Circumventing Shortcuts in Audio-visual Deepfake Detection Datasets with Unsupervised Learning
Stefan Smeu, Dragos-Alexandru Boldisor, Dan Oneata et al.
Interpretable Vision-Language Survival Analysis with Ordinal Inductive Bias for Computational Pathology
Pei Liu, Luping Ji, Jiaxiang Gou et al.
Aligned Better, Listen Better for Audio-Visual Large Language Models
Yuxin Guo, Shuailei Ma, Shijie Ma et al.
A Study of Dropout-Induced Modality Bias on Robustness to Missing Video Frames for Audio-Visual Speech Recognition
Yusheng Dai, Hang Chen, Jun Du et al.
RTFS-Net: Recurrent Time-Frequency Modelling for Efficient Audio-Visual Speech Separation
Samuel Pegg, Kai Li, Xiaolin Hu
Learning Spatial Features from Audio-Visual Correspondence in Egocentric Videos
Sagnik Majumder, Ziad Al-Halah, Kristen Grauman
ViG: Linear-complexity Visual Sequence Learning with Gated Linear Attention
Bencheng Liao, Xinggang Wang, Lianghui Zhu et al.
LoR-VP: Low-Rank Visual Prompting for Efficient Vision Model Adaptation
Can Jin, Ying Li, Mingyu Zhao et al.
Learning Adaptive Lighting via Channel-Aware Guidance
Qirui Yang, Peng-Tao Jiang, Hao Zhang et al.
Detours for Navigating Instructional Videos
Kumar Ashutosh, Zihui Xue, Tushar Nagarajan et al.
BearLLM: A Prior Knowledge-Enhanced Bearing Health Management Framework with Unified Vibration Signal Representation
Haotian Peng, Jiawei Liu, Jinsong Du et al.
Audio-visual Generalized Zero-shot Learning the Easy Way
Shentong Mo, Pedro Morgado
Contextual AD Narration with Interleaved Multimodal Sequence
Hanlin Wang, Zhan Tong, Kecheng Zheng et al.
Self-Supervised Audio-Visual Soundscape Stylization
Tingle Li, Renhao Wang, Po-Yao Huang et al.
SAE-V: Interpreting Multimodal Models for Enhanced Alignment
Hantao Lou, Changye Li, Jiaming Ji et al.
Language-Guided Audio-Visual Learning for Long-Term Sports Assessment
Huangbiao Xu, Xiao Ke, Huanqi Wu et al.
Poison as Cure: Visual Noise for Mitigating Object Hallucinations in LVMs
Kejia Zhang, Keda Tao, Jiasheng Tang et al.
Object-aware Sound Source Localization via Audio-Visual Scene Understanding
Sung Jin Um, Dongjin Kim, Sangmin Lee et al.