2025 Poster "multimodal understanding" Papers
17 papers found
ACT as Human: Multimodal Large Language Model Data Annotation with Critical Thinking
Lequan Lin, Dai Shi, Andi Han et al.
NEURIPS 2025 poster, arXiv:2511.09833
ALMGuard: Safety Shortcuts and Where to Find Them as Guardrails for Audio–Language Models
Weifei Jin, Yuxin Cao, Junjie Su et al.
NEURIPS 2025 poster, arXiv:2510.26096
AVHBench: A Cross-Modal Hallucination Benchmark for Audio-Visual Large Language Models
Kim Sung-Bin, Oh Hyun-Bin, Lee Jung-Mok et al.
ICLR 2025 poster, arXiv:2410.18325
17 citations
Can LLMs Understand Time Series Anomalies?
Zihao Zhou, Rose Yu
ICLR 2025 poster, arXiv:2410.05440
32 citations
Harmonizing Visual Representations for Unified Multimodal Understanding and Generation
Size Wu, Wenwei Zhang, Lumin Xu et al.
ICCV 2025 poster, arXiv:2503.21979
37 citations
HMVLM: Human Motion-Vision-Language Model via MoE LoRA
Lei Hu, Yongjing Ye, Shihong Xia
NEURIPS 2025 poster
LD-RPS: Zero-Shot Unified Image Restoration via Latent Diffusion Recurrent Posterior Sampling
Li Huaqiu, Yong Wang, Tongwen Huang et al.
ICCV 2025 poster, arXiv:2507.00790
3 citations
METEOR: Multi-Encoder Collaborative Token Pruning for Efficient Vision Language Models
Yuchen Liu, Yaoming Wang, Bowen Shi et al.
ICCV 2025 poster, arXiv:2507.20842
1 citation
MUSE-VL: Modeling Unified VLM through Semantic Discrete Encoding
Rongchang Xie, Chen Du, Ping Song et al.
ICCV 2025 poster, arXiv:2411.17762
25 citations
One Head to Rule Them All: Amplifying LVLM Safety through a Single Critical Attention Head
Junhao Xia, Haotian Zhu, Shuchao Pang et al.
NEURIPS 2025 poster
OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation
Pengfei Zhou, Xiaopeng Peng, Jiajun Song et al.
CVPR 2025 poster, arXiv:2411.18499
19 citations
Show-o: One Single Transformer to Unify Multimodal Understanding and Generation
Jinheng Xie, Weijia Mao, Zechen Bai et al.
ICLR 2025 poster, arXiv:2408.12528
469 citations
Teaching Human Behavior Improves Content Understanding Abilities Of VLMs
Somesh Singh, Harini S I, Yaman Singla et al.
ICLR 2025 poster
2 citations
TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation
Liao Qu, Huichao Zhang, Yiheng Liu et al.
CVPR 2025 poster, arXiv:2412.03069
120 citations
Towards Realistic UAV Vision-Language Navigation: Platform, Benchmark, and Methodology
Xiangyu Wang, Donglin Yang, Ziqin Wang et al.
ICLR 2025 poster, arXiv:2410.07087
52 citations
Two Causally Related Needles in a Video Haystack
Miaoyu Li, Qin Chao, Boyang Li
NEURIPS 2025 poster, arXiv:2505.19853
X2I: Seamless Integration of Multimodal Understanding into Diffusion Transformer via Attention Distillation
Jian Ma, Qirong Peng, Xu Guo et al.
ICCV 2025 poster, arXiv:2503.06134
5 citations