"multimodal alignment" Papers

14 papers found

Aligning by Misaligning: Boundary-aware Curriculum Learning for Multimodal Alignment

Hua Ye, Hang Ding, Siyuan Chen et al.

NeurIPS 2025posterarXiv:2511.08399

AToM: Aligning Text-to-Motion Model at Event-Level with GPT-4Vision Reward

Haonan Han, Xiangzuo Wu, Huan Liao et al.

CVPR 2025posterarXiv:2411.18654
5
citations

EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions

Kai Chen, Yunhao Gou, Runhui Huang et al.

CVPR 2025posterarXiv:2409.18042
44
citations

Harnessing Frozen Unimodal Encoders for Flexible Multimodal Alignment

Mayug Maniparambil, Raiymbek Akshulakov, YASSER ABDELAZIZ DAHOU DJILALI et al.

CVPR 2025posterarXiv:2409.19425
2
citations

OpenOmni: Advancing Open-Source Omnimodal Large Language Models with Progressive Multimodal Alignment and Real-time Emotional Speech Synthesis

Run Luo, Ting-En Lin, Haonan Zhang et al.

NeurIPS 2025poster

Self-Supervised Spatial Correspondence Across Modalities

Ayush Shrivastava, Andrew Owens

CVPR 2025posterarXiv:2506.03148
2
citations

A Touch, Vision, and Language Dataset for Multimodal Alignment

Letian Fu, Gaurav Datta, Huang Huang et al.

ICML 2024poster

Bias-Conflict Sample Synthesis and Adversarial Removal Debias Strategy for Temporal Sentence Grounding in Video

Zhaobo Qi, Yibo Yuan, Xiaowen Ruan et al.

AAAI 2024paperarXiv:2401.07567
11
citations

Dense Multimodal Alignment for Open-Vocabulary 3D Scene Understanding

Ruihuang Li, Zhengqiang ZHANG, Chenhang He et al.

ECCV 2024posterarXiv:2407.09781
11
citations

STELLA: Continual Audio-Video Pre-training with SpatioTemporal Localized Alignment

Jaewoo Lee, Jaehong Yoon, Wonjae Kim et al.

ICML 2024oral

SyCoCa: Symmetrizing Contrastive Captioners with Attentive Masking for Multimodal Alignment

Ziping Ma, Furong Xu, Jian liu et al.

ICML 2024poster

TiMix: Text-Aware Image Mixing for Effective Vision-Language Pre-training

Chaoya Jiang, Wei Ye, Haiyang Xu et al.

AAAI 2024paperarXiv:2312.08846
6
citations

Towards More Faithful Natural Language Explanation Using Multi-Level Contrastive Learning in VQA

Chengen Lai, Shengli Song, Shiqi Meng et al.

AAAI 2024paperarXiv:2312.13594
9
citations

V2Meow: Meowing to the Visual Beat via Video-to-Music Generation

Kun Su, Judith Li, Qingqing Huang et al.

AAAI 2024paperarXiv:2305.06594
23
citations