"cross-modal alignment" Papers
29 papers found
AlignMamba: Enhancing Multimodal Mamba with Local and Global Cross-modal Alignment
Yan Li, Yifei Xing, Xiangyuan Lan et al.
AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Document Understanding
Ahmed Masry, Juan Rodriguez, Tianyu Zhang et al.
Amplifying Prominent Representations in Multimodal Learning via Variational Dirichlet Process
Tsai Hor Chan, Feng Wu, Yihang Chen et al.
Beyond Modality Collapse: Representation Blending for Multimodal Dataset Distillation
xin zhang, Ziruo Zhang, JIAWEI DU et al.
Causal Disentanglement and Cross-Modal Alignment for Enhanced Few-Shot Learning
Tianjiao Jiang, Zhen Zhang, Yuhang Liu et al.
CrossOver: 3D Scene Cross-Modal Alignment
Sayan Deb Sarkar, Ondrej Miksik, Marc Pollefeys et al.
DenseGrounding: Improving Dense Language-Vision Semantics for Ego-centric 3D Visual Grounding
Henry Zheng, Hao Shi, Qihang Peng et al.
Learning Fine-Grained Representations through Textual Token Disentanglement in Composed Video Retrieval
Yue Wu, Zhaobo Qi, Yiling Wu et al.
Learning Source-Free Domain Adaptation for Visible-Infrared Person Re-Identification
Yongxiang Li, Yanglin Feng, Yuan Sun et al.
Phantom: Subject-Consistent Video Generation via Cross-Modal Alignment
Lijie Liu, Tianxiang Ma, Bingchuan Li et al.
Robust Cross-modal Alignment Learning for Cross-Scene Spatial Reasoning and Grounding
Yanglin Feng, Hongyuan Zhu, Dezhong Peng et al.
Seg4Diff: Unveiling Open-Vocabulary Semantic Segmentation in Text-to-Image Diffusion Transformers
Chaehyun Kim, Heeseong Shin, Eunbeen Hong et al.
Semi-Supervised CLIP Adaptation by Enforcing Semantic and Trapezoidal Consistency
Kai Gan, Bo Ye, Min-Ling Zhang et al.
SGAR: Structural Generative Augmentation for 3D Human Motion Retrieval
Jiahang Zhang, Lilang Lin, Shuai Yang et al.
Amend to Alignment: Decoupled Prompt Tuning for Mitigating Spurious Correlation in Vision-Language Models
Jie ZHANG, Xiaosong Ma, Song Guo et al.
Audio-visual Generalized Zero-shot Learning the Easy Way
Shentong Mo, Pedro Morgado
Augmented Commonsense Knowledge for Remote Object Grounding
Bahram Mohammadi, Yicong Hong, Yuankai Qi et al.
Detection-Based Intermediate Supervision for Visual Question Answering
Yuhang Liu, Daowan Peng, Wei Wei et al.
Improving Cross-Modal Alignment with Synthetic Pairs for Text-Only Image Captioning
Zhiyue Liu, Jinyuan Liu, Fanrong Ma
Integration of Global and Local Representations for Fine-grained Cross-modal Alignment
Seungwan Jin, Hoyoung Choi, Taehyung Noh et al.
Learning to Localize Actions in Instructional Videos with LLM-Based Multi-Pathway Text-Video Alignment
Yuxiao Chen, Kai Li, Wentao Bao et al.
Multi-Level Cross-Modal Alignment for Image Clustering
Liping Qiu, Qin Zhang, Xiaojun Chen et al.
Multi-Prompts Learning with Cross-Modal Alignment for Attribute-Based Person Re-identification
Yajing Zhai, Yawen Zeng, Zhiyong Huang et al.
Position: The Platonic Representation Hypothesis
Minyoung Huh, Brian Cheung, Tongzhou Wang et al.
Self-supervised Feature Adaptation for 3D Industrial Anomaly Detection
Yuanpeng Tu, Boshen Zhang, Liang Liu et al.
Towards Balanced Alignment: Modal-Enhanced Semantic Modeling for Video Moment Retrieval
Zhihang Liu, Jun Li, Hongtao Xie et al.
Towards Efficient and Effective Text-to-Video Retrieval with Coarse-to-Fine Visual Representation Learning
Kaibin Tian, Yanhua Cheng, Yi Liu et al.
Unlocking the Power of Spatial and Temporal Information in Medical Multimodal Pre-training
Jinxia Yang, Bing Su, Xin Zhao et al.
Visual-Text Cross Alignment: Refining the Similarity Score in Vision-Language Models
Jinhao Li, Haopeng Li, Sarah Erfani et al.