2025 "multimodal understanding" Papers

20 papers found

ACT as Human: Multimodal Large Language Model Data Annotation with Critical Thinking

Lequan Lin, Dai Shi, Andi Han et al.

NeurIPS 2025 | poster | arXiv:2511.09833

ALMGuard: Safety Shortcuts and Where to Find Them as Guardrails for Audio–Language Models

Weifei Jin, Yuxin Cao, Junjie Su et al.

NeurIPS 2025 | poster | arXiv:2510.26096

AVHBench: A Cross-Modal Hallucination Benchmark for Audio-Visual Large Language Models

Kim Sung-Bin, Oh Hyun-Bin, Lee Jung-Mok et al.

ICLR 2025 | poster | arXiv:2410.18325 | 17 citations

Can LLMs Understand Time Series Anomalies?

Zihao Zhou, Rose Yu

ICLR 2025 | poster | arXiv:2410.05440 | 32 citations

CoMM: A Coherent Interleaved Image-Text Dataset for Multimodal Understanding and Generation

Wei Chen, Lin Li, Yongqi Yang et al.

CVPR 2025 | highlight | arXiv:2406.10462 | 12 citations

Co-Reinforcement Learning for Unified Multimodal Understanding and Generation

Jingjing Jiang, Chongjie Si, Jun Luo et al.

NeurIPS 2025 | spotlight | arXiv:2505.17534 | 5 citations

Harmonizing Visual Representations for Unified Multimodal Understanding and Generation

Size Wu, Wenwei Zhang, Lumin Xu et al.

ICCV 2025 | poster | arXiv:2503.21979 | 37 citations

HMVLM: Human Motion-Vision-Language Model via MoE LoRA

Lei Hu, Yongjing Ye, Shihong Xia

NeurIPS 2025 | poster

LD-RPS: Zero-Shot Unified Image Restoration via Latent Diffusion Recurrent Posterior Sampling

Li Huaqiu, Yong Wang, Tongwen Huang et al.

ICCV 2025 | poster | arXiv:2507.00790 | 3 citations

METEOR: Multi-Encoder Collaborative Token Pruning for Efficient Vision Language Models

Yuchen Liu, Yaoming Wang, Bowen Shi et al.

ICCV 2025 | poster | arXiv:2507.20842 | 1 citation

MUSE-VL: Modeling Unified VLM through Semantic Discrete Encoding

Rongchang Xie, Chen Du, Ping Song et al.

ICCV 2025 | poster | arXiv:2411.17762 | 25 citations

One Head to Rule Them All: Amplifying LVLM Safety through a Single Critical Attention Head

Junhao Xia, Haotian Zhu, Shuchao Pang et al.

NeurIPS 2025 | poster

OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation

Pengfei Zhou, Xiaopeng Peng, Jiajun Song et al.

CVPR 2025 | poster | arXiv:2411.18499 | 19 citations

Show-o2: Improved Native Unified Multimodal Models

Jinheng Xie, Zhenheng Yang, Mike Zheng Shou

NeurIPS 2025 | oral | arXiv:2506.15564 | 95 citations

Show-o: One Single Transformer to Unify Multimodal Understanding and Generation

Jinheng Xie, Weijia Mao, Zechen Bai et al.

ICLR 2025 | poster | arXiv:2408.12528 | 455 citations

Teaching Human Behavior Improves Content Understanding Abilities Of VLMs

Somesh Singh, Harini S I, Yaman Singla et al.

ICLR 2025 | poster | 2 citations

TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation

Liao Qu, Huichao Zhang, Yiheng Liu et al.

CVPR 2025 | poster | arXiv:2412.03069 | 120 citations

Towards Realistic UAV Vision-Language Navigation: Platform, Benchmark, and Methodology

Xiangyu Wang, Donglin Yang, Ziqin Wang et al.

ICLR 2025 | poster | arXiv:2410.07087 | 52 citations

Two Causally Related Needles in a Video Haystack

Miaoyu Li, Qin Chao, Boyang Li

NeurIPS 2025 | poster | arXiv:2505.19853

X2I: Seamless Integration of Multimodal Understanding into Diffusion Transformer via Attention Distillation

Jian Ma, Qirong Peng, Xu Guo et al.

ICCV 2025 | poster | arXiv:2503.06134 | 5 citations