"multimodal alignment" Papers
14 papers found
Aligning by Misaligning: Boundary-aware Curriculum Learning for Multimodal Alignment
Hua Ye, Hang Ding, Siyuan Chen et al.
NeurIPS 2025posterarXiv:2511.08399
AToM: Aligning Text-to-Motion Model at Event-Level with GPT-4Vision Reward
Haonan Han, Xiangzuo Wu, Huan Liao et al.
CVPR 2025posterarXiv:2411.18654
5
citations
EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions
Kai Chen, Yunhao Gou, Runhui Huang et al.
CVPR 2025posterarXiv:2409.18042
44
citations
Harnessing Frozen Unimodal Encoders for Flexible Multimodal Alignment
Mayug Maniparambil, Raiymbek Akshulakov, YASSER ABDELAZIZ DAHOU DJILALI et al.
CVPR 2025posterarXiv:2409.19425
2
citations
OpenOmni: Advancing Open-Source Omnimodal Large Language Models with Progressive Multimodal Alignment and Real-time Emotional Speech Synthesis
Run Luo, Ting-En Lin, Haonan Zhang et al.
NeurIPS 2025poster
Self-Supervised Spatial Correspondence Across Modalities
Ayush Shrivastava, Andrew Owens
CVPR 2025posterarXiv:2506.03148
2
citations
A Touch, Vision, and Language Dataset for Multimodal Alignment
Letian Fu, Gaurav Datta, Huang Huang et al.
ICML 2024poster
Bias-Conflict Sample Synthesis and Adversarial Removal Debias Strategy for Temporal Sentence Grounding in Video
Zhaobo Qi, Yibo Yuan, Xiaowen Ruan et al.
AAAI 2024paperarXiv:2401.07567
11
citations
Dense Multimodal Alignment for Open-Vocabulary 3D Scene Understanding
Ruihuang Li, Zhengqiang ZHANG, Chenhang He et al.
ECCV 2024posterarXiv:2407.09781
11
citations
STELLA: Continual Audio-Video Pre-training with SpatioTemporal Localized Alignment
Jaewoo Lee, Jaehong Yoon, Wonjae Kim et al.
ICML 2024oral
SyCoCa: Symmetrizing Contrastive Captioners with Attentive Masking for Multimodal Alignment
Ziping Ma, Furong Xu, Jian liu et al.
ICML 2024poster
TiMix: Text-Aware Image Mixing for Effective Vision-Language Pre-training
Chaoya Jiang, Wei Ye, Haiyang Xu et al.
AAAI 2024paperarXiv:2312.08846
6
citations
Towards More Faithful Natural Language Explanation Using Multi-Level Contrastive Learning in VQA
Chengen Lai, Shengli Song, Shiqi Meng et al.
AAAI 2024paperarXiv:2312.13594
9
citations
V2Meow: Meowing to the Visual Beat via Video-to-Music Generation
Kun Su, Judith Li, Qingqing Huang et al.
AAAI 2024paperarXiv:2305.06594
23
citations