2025 Poster "cross-modal alignment" Papers
28 papers found
Aligning Vision to Language: Annotation-Free Multimodal Knowledge Graph Construction for Enhanced LLMs Reasoning
Junming Liu, Siyuan Meng, Yanting Gao et al.
AlignMamba: Enhancing Multimodal Mamba with Local and Global Cross-modal Alignment
Yan Li, Yifei Xing, Xiangyuan Lan et al.
AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Document Understanding
Ahmed Masry, Juan Rodriguez, Tianyu Zhang et al.
Amplifying Prominent Representations in Multimodal Learning via Variational Dirichlet Process
Tsai Hor Chan, Feng Wu, Yihang Chen et al.
Beyond Modality Collapse: Representation Blending for Multimodal Dataset Distillation
Xin Zhang, Ziruo Zhang, Jiawei Du et al.
Causal Disentanglement and Cross-Modal Alignment for Enhanced Few-Shot Learning
Tianjiao Jiang, Zhen Zhang, Yuhang Liu et al.
CAV-MAE Sync: Improving Contrastive Audio-Visual Mask Autoencoders via Fine-Grained Alignment
Edson Araujo, Andrew Rouditchenko, Yuan Gong et al.
CF-VLM: CounterFactual Vision-Language Fine-tuning
Jusheng Zhang, Kaitong Cai, Yijia Fan et al.
CHiP: Cross-modal Hierarchical Direct Preference Optimization for Multimodal LLMs
Jinlan Fu, Shenzhen Huangfu, Hao Fei et al.
DenseGrounding: Improving Dense Language-Vision Semantics for Ego-centric 3D Visual Grounding
Henry Zheng, Hao Shi, Qihang Peng et al.
Enhanced OoD Detection through Cross-Modal Alignment of Multi-Modal Representations
Jeonghyeon Kim, Sangheum Hwang
Evaluating Semantic Variation in Text-to-Image Synthesis: A Causal Perspective
Xiangru Zhu, Penglei Sun, Yaoxian Song et al.
Harnessing Text-to-Image Diffusion Models for Point Cloud Self-Supervised Learning
Yiyang Chen, Shanshan Zhao, Lunhao Duan et al.
Hierarchical Cross-modal Prompt Learning for Vision-Language Models
Hao Zheng, Shunzhi Yang, Zhuoxin He et al.
It’s a (Blind) Match! Towards Vision-Language Correspondence without Parallel Data
Dominik Schnaus, Nikita Araslanov, Daniel Cremers
Learning Fine-Grained Representations through Textual Token Disentanglement in Composed Video Retrieval
Yue Wu, Zhaobo Qi, Yiling Wu et al.
Learning Source-Free Domain Adaptation for Visible-Infrared Person Re-Identification
Yongxiang Li, Yanglin Feng, Yuan Sun et al.
Mitigate the Gap: Improving Cross-Modal Alignment in CLIP
Sedigheh Eslami, Gerard de Melo
Multimodal Prompt Alignment for Facial Expression Recognition
Fuyan Ma, Yiran He, Bin Sun et al.
Preacher: Paper-to-Video Agentic System
Jingwei Liu, Ling Yang, Hao Luo et al.
ReCon: Enhancing True Correspondence Discrimination through Relation Consistency for Robust Noisy Correspondence Learning
Quanxing Zha, Xin Liu, Shu-Juan Peng et al.
Robust Cross-modal Alignment Learning for Cross-Scene Spatial Reasoning and Grounding
Yanglin Feng, Hongyuan Zhu, Dezhong Peng et al.
Seg4Diff: Unveiling Open-Vocabulary Semantic Segmentation in Text-to-Image Diffusion Transformers
Chaehyun Kim, Heeseong Shin, Eunbeen Hong et al.
Semi-Supervised CLIP Adaptation by Enforcing Semantic and Trapezoidal Consistency
Kai Gan, Bo Ye, Min-Ling Zhang et al.
SGAR: Structural Generative Augmentation for 3D Human Motion Retrieval
Jiahang Zhang, Lilang Lin, Shuai Yang et al.
TAViS: Text-bridged Audio-Visual Segmentation with Foundation Models
Ziyang Luo, Nian Liu, Xuguang Yang et al.
The Indra Representation Hypothesis
Jianglin Lu, Hailing Wang, Kuo Yang et al.
When Kernels Multiply, Clusters Unify: Fusing Embeddings with the Kronecker Product
Youqi Wu, Jingwei Zhang, Farzan Farnia