2024 "cross-modal alignment" Papers

16 papers found

Amend to Alignment: Decoupled Prompt Tuning for Mitigating Spurious Correlation in Vision-Language Models

Jie ZHANG, Xiaosong Ma, Song Guo et al.

ICML 2024 · poster

Audio-visual Generalized Zero-shot Learning the Easy Way

Shentong Mo, Pedro Morgado

ECCV 2024 · poster · arXiv:2407.13095 · 7 citations

Augmented Commonsense Knowledge for Remote Object Grounding

Bahram Mohammadi, Yicong Hong, Yuankai Qi et al.

AAAI 2024 · paper · arXiv:2406.01256

CMTA: Cross-Modal Temporal Alignment for Event-guided Video Deblurring

Taewoo Kim, Hoonhee Cho, Kuk-Jin Yoon

ECCV 2024 · poster · arXiv:2408.14930 · 12 citations

Detection-Based Intermediate Supervision for Visual Question Answering

Yuhang Liu, Daowan Peng, Wei Wei et al.

AAAI 2024 · paper · arXiv:2312.16012 · 3 citations

Improving Cross-Modal Alignment with Synthetic Pairs for Text-Only Image Captioning

Zhiyue Liu, Jinyuan Liu, Fanrong Ma

AAAI 2024 · paper · arXiv:2312.08865 · 20 citations

Integration of Global and Local Representations for Fine-grained Cross-modal Alignment

Seungwan Jin, Hoyoung Choi, Taehyung Noh et al.

ECCV 2024 · poster · 1 citation

Learning to Localize Actions in Instructional Videos with LLM-Based Multi-Pathway Text-Video Alignment

Yuxiao Chen, Kai Li, Wentao Bao et al.

ECCV 2024 · poster · arXiv:2409.16145 · 5 citations

Multi-Level Cross-Modal Alignment for Image Clustering

Liping Qiu, Qin Zhang, Xiaojun Chen et al.

AAAI 2024 · paper · arXiv:2401.11740 · 6 citations

Multi-Prompts Learning with Cross-Modal Alignment for Attribute-Based Person Re-identification

Yajing Zhai, Yawen Zeng, Zhiyong Huang et al.

AAAI 2024 · paper · arXiv:2312.16797 · 33 citations

Position: The Platonic Representation Hypothesis

Minyoung Huh, Brian Cheung, Tongzhou Wang et al.

ICML 2024 · poster

Self-supervised Feature Adaptation for 3D Industrial Anomaly Detection

Yuanpeng Tu, Boshen Zhang, Liang Liu et al.

ECCV 2024 · poster · arXiv:2401.03145 · 24 citations

Towards Balanced Alignment: Modal-Enhanced Semantic Modeling for Video Moment Retrieval

Zhihang Liu, Jun Li, Hongtao Xie et al.

AAAI 2024 · paper · arXiv:2312.12155 · 40 citations

Towards Efficient and Effective Text-to-Video Retrieval with Coarse-to-Fine Visual Representation Learning

Kaibin Tian, Yanhua Cheng, Yi Liu et al.

AAAI 2024 · paper · arXiv:2401.00701 · 14 citations

Unlocking the Power of Spatial and Temporal Information in Medical Multimodal Pre-training

Jinxia Yang, Bing Su, Xin Zhao et al.

ICML 2024 · oral · arXiv:2405.19654

Visual-Text Cross Alignment: Refining the Similarity Score in Vision-Language Models

Jinhao Li, Haopeng Li, Sarah Erfani et al.

ICML 2024 · poster · arXiv:2406.02915