🧬Multimodal

Image-Text Matching

Aligning images with text descriptions

100 papers · 2,821 total citations
Feb '24 to Jan '26 · 464 papers
Also includes: image-text matching, cross-modal alignment, vision-language pre-training, clip

Top Papers

#1

Grounding Image Matching in 3D with MASt3R

Vincent Leroy, Yohann Cabon, Jerome Revaud

ECCV 2024
499
citations
#2

RoMa: Robust Dense Feature Matching

Johan Edstedt, Qiyu Sun, Georg Bökman et al.

CVPR 2024
238
citations
#3

Text-IF: Leveraging Semantic Text Guidance for Degradation-Aware and Interactive Image Fusion

Xunpeng Yi, Han Xu, Hao Zhang et al.

CVPR 2024
123
citations
#4

FoundationStereo: Zero-Shot Stereo Matching

Bowen Wen, Matthew Trepte, Oluwaseun Joseph Aribido et al.

CVPR 2025
98
citations
#5

Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models

Yifan Li, Hangyu Guo, Kun Zhou et al.

ECCV 2024
93
citations
#6

MoCha-Stereo: Motif Channel Attention Network for Stereo Matching

Ziyang Chen, Wei Long, He Yao et al.

CVPR 2024
72
citations
#7

OmniGlue: Generalizable Feature Matching with Foundation Model Guidance

Hanwen Jiang, Arjun Karpur, Bingyi Cao et al.

CVPR 2024
68
citations
#8

DreamMatcher: Appearance Matching Self-Attention for Semantically-Consistent Text-to-Image Personalization

Jisu Nam, Heesu Kim, DongJae Lee et al.

CVPR 2024
62
citations
#9

Context-I2W: Mapping Images to Context-Dependent Words for Accurate Zero-Shot Composed Image Retrieval

Yuanmin Tang, Jing Yu, Keke Gai et al.

AAAI 2024 · arXiv:2309.16137
zero-shot learning, composed image retrieval, image representation learning, context-dependent mapping (+4 more)
57
citations
#10

Phantom: Subject-Consistent Video Generation via Cross-Modal Alignment

Lijie Liu, Tianxiang Ma, Bingchuan Li et al.

ICCV 2025 · arXiv:2502.11079
subject-consistent video generation, cross-modal alignment, text-to-video architecture, image-to-video architecture (+4 more)
55
citations
#11

Text-Image Alignment for Diffusion-Based Perception

Neehar Kondapaneni, Markus Marks, Manuel Knott et al.

CVPR 2024
53
citations
#12

Describing Differences in Image Sets with Natural Language

Lisa Dunlap, Yuhui Zhang, Xiaohan Wang et al.

CVPR 2024
51
citations
#13

ReMamber: Referring Image Segmentation with Mamba Twister

Yuhuan Yang, Chaofan Ma, Jiangchao Yao et al.

ECCV 2024
49
citations
#14

Neural Markov Random Field for Stereo Matching

Tongfan Guan, Chen Wang, Yun-Hui Liu

CVPR 2024
48
citations
#15

MultiBooth: Towards Generating All Your Concepts in an Image from Text

Chenyang Zhu, Kai Li, Yue Ma et al.

AAAI 2025
46
citations
#16

Improving Image Restoration through Removing Degradations in Textual Representations

Jingbo Lin, Zhilu Zhang, Yuxiang Wei et al.

CVPR 2024
45
citations
#17

Improved Probabilistic Image-Text Representations

Sanghyuk Chun

ICLR 2024
43
citations
#18

CoCoCo: Improving Text-Guided Video Inpainting for Better Consistency, Controllability and Compatibility

Bojia Zi, Shihao Zhao, Xianbiao Qi et al.

AAAI 2025
38
citations
#19

Control4D: Efficient 4D Portrait Editing with Text

Ruizhi Shao, Jingxiang Sun, Cheng Peng et al.

CVPR 2024
36
citations
#20

Add-it: Training-Free Object Insertion in Images With Pretrained Diffusion Models

Yoad Tewel, Rinon Gal, Dvir Samuel et al.

ICLR 2025 · arXiv:2411.07232
attention mechanism, diffusion models, semantic image editing, object insertion (+3 more)
34
citations
#21

Spherical Linear Interpolation and Text-Anchoring for Zero-shot Composed Image Retrieval

Young Kyun Jang, Dat B Huynh, Ashish Shah et al.

ECCV 2024
32
citations
#22

Unifying Multi-Modal Uncertainty Modeling and Semantic Alignment for Text-to-Image Person Re-identification

Zhiwei Zhao, Bin Liu, Yan Lu et al.

AAAI 2024
29
citations
#23

UniGarmentManip: A Unified Framework for Category-Level Garment Manipulation via Dense Visual Correspondence

Ruihai Wu, Haoran Lu, Yiyan Wang et al.

CVPR 2024
29
citations
#24

EventBind: Learning a Unified Representation to Bind Them All for Event-based Open-world Understanding

Jiazhou Zhou, Xu Zheng, Yuanhuiyi Lyu et al.

ECCV 2024 · arXiv:2308.03135
vision-language models, modality gap bridging, event-based recognition, temporal information modeling (+4 more)
28
citations
#25

VideoCon: Robust Video-Language Alignment via Contrast Captions

Hritik Bansal, Yonatan Bitton, Idan Szpektor et al.

CVPR 2024
28
citations
#26

Perception-Guided Jailbreak Against Text-to-Image Models

Yihao Huang, Le Liang, Tianlin Li et al.

AAAI 2025
26
citations
#27

Image-to-Image Matching via Foundation Models: A New Perspective for Open-Vocabulary Semantic Segmentation

Yuan Wang, Rui Sun, Naisong Luo et al.

CVPR 2024
25
citations
#28

Doubly Abductive Counterfactual Inference for Text-based Image Editing

Xue Song, Jiequan Cui, Hanwang Zhang et al.

CVPR 2024
25
citations
#29

CLIM: Contrastive Language-Image Mosaic for Region Representation

Size Wu, Wenwei Zhang, Lumin Xu et al.

AAAI 2024 · arXiv:2312.11376
vision-language alignment, open-vocabulary object detection, contrastive learning, region-text alignment (+3 more)
24
citations
#30

MindTuner: Cross-Subject Visual Decoding with Visual Fingerprint and Semantic Correction

Zixuan Gong, Qi Zhang, Guangyin Bao et al.

AAAI 2025
23
citations
#31

Image Clustering Conditioned on Text Criteria

Sehyun Kwon, Jaden Park, Minkyu Kim et al.

ICLR 2024
21
citations
#32

Aligning Geometric Spatial Layout in Cross-View Geo-Localization via Feature Recombination

Qingwang Zhang, Yingying Zhu

AAAI 2024
20
citations
#33

Improving Cross-Modal Alignment with Synthetic Pairs for Text-Only Image Captioning

Zhiyue Liu, Jinyuan Liu, Fanrong Ma

AAAI 2024 · arXiv:2312.08865
cross-modal alignment, text-only image captioning, synthetic image-text pairs, clip embedding space (+3 more)
20
citations
#34

Image-Text Co-Decomposition for Text-Supervised Semantic Segmentation

Ji-Jia Wu, Andy Chia-Hao Chang, Chieh-Yu Chuang et al.

CVPR 2024
19
citations
#35

You'll Never Walk Alone: A Sketch and Text Duet for Fine-Grained Image Retrieval

Subhadeep Koley, Ayan Kumar Bhunia, Aneeshan Sain et al.

CVPR 2024
19
citations
#36

VTQA: Visual Text Question Answering via Entity Alignment and Cross-Media Reasoning

Kang Chen, Xiangqian Wu

CVPR 2024
19
citations
#37

CLIBD: Bridging Vision and Genomics for Biodiversity Monitoring at Scale

ZeMing Gong, Austin Wang, Xiaoliang Huo et al.

ICLR 2025 · arXiv:2405.17537
contrastive learning, multimodal fusion, biodiversity monitoring, taxonomic classification (+3 more)
18
citations
#38

MagDiff: Multi-Alignment Diffusion for High-Fidelity Video Generation and Editing

Haoyu Zhao, Tianyi Lu, Jiaxi Gu et al.

ECCV 2024
18
citations
#39

A Dual-Way Enhanced Framework from Text Matching Point of View for Multimodal Entity Linking

Shezheng Song, Shan Zhao, ChengYu Wang et al.

AAAI 2024 · arXiv:2312.11816
multimodal entity linking, neural text matching, cross-modal enhancement, fine-grained image attributes (+3 more)
18
citations
#40

Composing Object Relations and Attributes for Image-Text Matching

Khoi Pham, Chuong Huynh, Ser-Nam Lim et al.

CVPR 2024
18
citations
#41

OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation

Pengfei Zhou, Xiaopeng Peng, Jiajun Song et al.

CVPR 2025
18
citations
#42

Text Image Inpainting via Global Structure-Guided Diffusion Models

Shipeng Zhu, Pengfei Fang, Chenjie Zhu et al.

AAAI 2024 · arXiv:2401.14832
text image inpainting, diffusion models, scene text recognition, handwritten text images (+4 more)
18
citations
#43

Decomposing Semantic Shifts for Composed Image Retrieval

Xingyu Yang, Daqing Liu, Heng Zhang et al.

AAAI 2024 · arXiv:2309.09531
composed image retrieval, semantic shift decomposition, visual prototype generation, degradation and upgradation (+2 more)
17
citations
#44

Mismatch Quest: Visual and Textual Feedback for Image-Text Misalignment

Brian Gordon, Yonatan Bitton, Yonatan Shafir et al.

ECCV 2024 · arXiv:2312.03766
image-text alignment, misalignment explanation, visual grounding, vision language models (+3 more)
17
citations
#45

MESA: Matching Everything by Segmenting Anything

Yesheng Zhang, Xu Zhao

CVPR 2024
17
citations
#46

LRANet: Towards Accurate and Efficient Scene Text Detection with Low-Rank Approximation

Yuchen Su, Zhineng Chen, Zhiwen Shao et al.

AAAI 2024 · arXiv:2306.15142
scene text detection, low-rank approximation, arbitrary-shaped text, shape representation (+4 more)
17
citations
#47

AlignMamba: Enhancing Multimodal Mamba with Local and Global Cross-modal Alignment

Yan Li, Yifei Xing, Xiangyuan Lan et al.

CVPR 2025 · arXiv:2412.00833
multimodal fusion, cross-modal alignment, mamba models, optimal transport (+3 more)
17
citations
#48

ODM: A Text-Image Further Alignment Pre-training Approach for Scene Text Detection and Spotting

Chen Duan, Pei Fu, Shan Guo et al.

CVPR 2024
16
citations
#49

FineLIP: Extending CLIP’s Reach via Fine-Grained Alignment with Longer Text Inputs

Mothilal Asokan, Kebin Wu, Fatima Albreiki

CVPR 2025
14
citations
#50

FineMatch: Aspect-based Fine-grained Image and Text Mismatch Detection and Correction

Hang Hua, Jing Shi, Kushal Kafle et al.

ECCV 2024 · arXiv:2404.14715
vision-language models, image-text matching, mismatch detection, compositional reasoning (+2 more)
14
citations
#51

AdapEdit: Spatio-Temporal Guided Adaptive Editing Algorithm for Text-Based Continuity-Sensitive Image Editing

Zhiyuan Ma, Guoli Jia, Bowen Zhou

AAAI 2024 · arXiv:2312.08019
text-conditioned diffusion models, text-driven image editing, continuity-sensitive editing, spatio-temporal guidance (+3 more)
13
citations
#52

Partial-to-Partial Shape Matching with Geometric Consistency

Viktoria Ehm, Maolin Gao, Paul Roetzer et al.

CVPR 2024
13
citations
#53

Personalized Federated Learning for Spatio-Temporal Forecasting: A Dual Semantic Alignment-Based Contrastive Approach

Qingxiang Liu, Sheng Sun, Yuxuan Liang et al.

AAAI 2025
13
citations
#54

CoMM: A Coherent Interleaved Image-Text Dataset for Multimodal Understanding and Generation

Wei Chen, Lin Li, Yongqi Yang et al.

CVPR 2025
12
citations
#55

Image Over Text: Transforming Formula Recognition Evaluation with Character Detection Matching

Bin Wang, Fan Wu, Linke Ouyang et al.

CVPR 2025
11
citations
#56

Deep Kernel Relative Test for Machine-generated Text Detection

Yiliao Song, Zhenqiao Yuan, Shuhai Zhang et al.

ICLR 2025
11
citations
#57

Unlocking Attributes' Contribution to Successful Camouflage: A Combined Textual and Visual Analysis Strategy

Hong Zhang, Yixuan Lyu, Qian Yu et al.

ECCV 2024
11
citations
#58

Bridging the Gap Between End-to-End and Two-Step Text Spotting

Mingxin Huang, Hongliang Li, Yuliang Liu et al.

CVPR 2024
11
citations
#59

LBM: Latent Bridge Matching for Fast Image-to-Image Translation

Clément Chadebec, Onur Tasar, Sanjeev Sreetharan et al.

ICCV 2025
11
citations
#60

DELTA: Pre-Train a Discriminative Encoder for Legal Case Retrieval via Structural Word Alignment

Haitao Li, Qingyao Ai, Xinyan Han et al.

AAAI 2025
11
citations
#61

The Double-Ellipsoid Geometry of CLIP

Meir Yossef Levi, Guy Gilboa

ICML 2025
10
citations
#62

DiscoMatch: Fast Discrete Optimisation for Geometrically Consistent 3D Shape Matching

Paul Roetzer, Ahmed Abbas, Dongliang Cao et al.

ECCV 2024 · arXiv:2310.08230
3d shape matching, geometric consistency, combinatorial optimization, discrete optimization (+3 more)
10
citations
#63

InsightEdit: Towards Better Instruction Following for Image Editing

Yingjing Xu, Jie Kong, Jiazhi Wang et al.

CVPR 2025
10
citations
#64

Characteristics Matching Based Hash Codes Generation for Efficient Fine-grained Image Retrieval

Zhen-Duo Chen, Li-Jun Zhao, Zi-Chao Zhang et al.

CVPR 2024
10
citations
#65

Beyond Walking: A Large-Scale Image-Text Benchmark for Text-based Person Anomaly Search

Shuyu Yang, Yaxiong Wang, Li Zhu et al.

ICCV 2025
10
citations
#66

Safeguard Text-to-Image Diffusion Models with Human Feedback Inversion

Sanghyun Kim, Seohyeon Jung, Balhae Kim et al.

ECCV 2024 · arXiv:2407.21032
text-to-image diffusion, harmful content generation, human feedback alignment, concept removal (+2 more)
9
citations
#67

TraF-Align: Trajectory-aware Feature Alignment for Asynchronous Multi-agent Perception

Zhiying Song, Lei Yang, Fuxi Wen et al.

CVPR 2025
9
citations
#68

DIAL: Dense Image-text ALignment for Weakly Supervised Semantic Segmentation

Soojin Jang, JungMin Yun, JuneHyoung Kwon et al.

ECCV 2024
8
citations
#69

Improving Visual Recognition with Hyperbolical Visual Hierarchy Mapping

Hyeongjun Kwon, Jinhyun Jang, Jin Kim et al.

CVPR 2024
8
citations
#70

ACE: Anti-Editing Concept Erasure in Text-to-Image Models

Zihao Wang, Yuxiang Wei, Fan Li et al.

CVPR 2025
8
citations
#71

GeoReF: Geometric Alignment Across Shape Variation for Category-level Object Pose Refinement

Linfang Zheng, Tze Ho Elden Tse, Chen Wang et al.

CVPR 2024
8
citations
#72

VCD-Texture: Variance Alignment based 3D-2D Co-Denoising for Text-Guided Texturing

Shang Liu, Chaohui Yu, Chenjie Cao et al.

ECCV 2024
7
citations
#73

Out of Length Text Recognition with Sub-String Matching

Yongkun Du, Zhineng Chen, Caiyan Jia et al.

AAAI 2025
7
citations
#74

HiFi-Score: Fine-grained Image Description Evaluation with Hierarchical Parsing Graphs

Ziwei Yao, Ruiping Wang, Xilin Chen

ECCV 2024
7
citations
#75

PairingNet: A Learning-based Pair-searching and -matching Network for Image Fragments

Rixin Zhou, Ding Xia, Yi Zhang et al.

ECCV 2024 · arXiv:2312.08704
image fragment restoration, contour shape matching, texture feature extraction, graph-based networks (+4 more)
7
citations
#76

Text-Guided Video Masked Autoencoder

David Fan, Jue Wang, Shuai Liao et al.

ECCV 2024
7
citations
#77

Removing Distributional Discrepancies in Captions Improves Image-Text Alignment

Mu Cai, Haotian Liu, Yuheng Li et al.

ECCV 2024
7
citations
#78

CLIPtone: Unsupervised Learning for Text-based Image Tone Adjustment

Hyeongmin Lee, Kyoungkook Kang, Jungseul Ok et al.

CVPR 2024
7
citations
#79

CrossOver: 3D Scene Cross-Modal Alignment

Sayan Deb Sarkar, Ondrej Miksik, Marc Pollefeys et al.

CVPR 2025 · arXiv:2502.15011
cross-modal alignment, 3d scene understanding, modality-agnostic embedding, scene retrieval (+3 more)
7
citations
#80

Fine-Grained Erasure in Text-to-Image Diffusion-based Foundation Models

Kartik Thakral, Tamar Glaser, Tal Hassner et al.

CVPR 2025
7
citations
#81

HarmonySet: A Comprehensive Dataset for Understanding Video-Music Semantic Alignment and Temporal Synchronization

Zitang Zhou, Ke Mei, Yu Lu et al.

CVPR 2025
7
citations
#82

Knowledge-Enhanced Historical Document Segmentation and Recognition

En-Hao Gao, Yu-Xuan Huang, Wen-Chao Hu et al.

AAAI 2024
7
citations
#83

VE-Bench: Subjective-Aligned Benchmark Suite for Text-Driven Video Editing Quality Assessment

Shangkun Sun, Xiaoyu Liang, Songlin Fan et al.

AAAI 2025
6
citations
#84

CLIPDrag: Combining Text-based and Drag-based Instructions for Image Editing

Ziqi Jiang, Zhen Wang, Long Chen

ICLR 2025
6
citations
#85

Refining CLIP's Spatial Awareness: A Visual-Centric Perspective

Congpei Qiu, Yanhao Wu, Wei Ke et al.

ICLR 2025
6
citations
#86

Multi-Level Cross-Modal Alignment for Image Clustering

Liping Qiu, Qin Zhang, Xiaojun Chen et al.

AAAI 2024 · arXiv:2401.11740
cross-modal alignment, image clustering, semantic space learning, instance-level alignment (+2 more)
6
citations
#87

Chat-based Person Retrieval via Dialogue-Refined Cross-Modal Alignment

Yang Bai, Yucheng Ji, Min Cao et al.

CVPR 2025
6
citations
#88

MIEB: Massive Image Embedding Benchmark

Chenghao Xiao, Isaac Chung, Imene Kerboua et al.

ICCV 2025 · arXiv:2504.10471
image embedding models, multimodal evaluation benchmark, vision-language tasks, text-to-image retrieval (+3 more)
6
citations
#89

SimMotionEdit: Text-Based Human Motion Editing with Motion Similarity Prediction

Zhengyuan Li, Kai Cheng, Anindita Ghosh et al.

CVPR 2025
6
citations
#90

CMA: A Chromaticity Map Adapter for Robust Detection of Screen-Recapture Document Images

Changsheng Chen, Liangwei Lin, Yongqi Chen et al.

CVPR 2024
6
citations
#91

Human-centered Interactive Learning via MLLMs for Text-to-Image Person Re-identification

Yang Qin, Chao Chen, Zhihang Fu et al.

CVPR 2025
6
citations
#92

Novel Class Discovery in Chest X-rays via Paired Images and Text

Jiaying Zhou, Yang Liu, Qingchao Chen

AAAI 2024
5
citations
#93

Text Grouping Adapter: Adapting Pre-trained Text Detector for Layout Analysis

Tianci Bi, Xiaoyi Zhang, Zhizheng Zhang et al.

CVPR 2024
5
citations
#94

Text-Anchored Score Composition: Tackling Condition Misalignment in Text-to-Image Diffusion Models

Luozhou Wang, Guibao Shen, Wenhang Ge et al.

ECCV 2024 · arXiv:2306.14408
text-to-image generation, diffusion models, condition misalignment, controllable generation (+3 more)
5
citations
#95

Refine-by-Align: Reference-Guided Artifacts Refinement through Semantic Alignment

Yizhi Song, Liu He, Zhifei Zhang et al.

ICLR 2025
5
citations
#96

Harmonizing Visual and Textual Embeddings for Zero-Shot Text-to-Image Customization

Yeji Song, Jimyeong Kim, Wonhark Park et al.

AAAI 2025
5
citations
#97

EcoMatcher: Efficient Clustering Oriented Matcher for Detector-free Image Matching

Peiqi Chen, Lei Yu, Yi Wan et al.

ECCV 2024
4
citations
#98

InstructOCR: Instruction Boosting Scene Text Spotting

Chen Duan, Qianyi Jiang, Pei Fu et al.

AAAI 2025
4
citations
#99

Foreground-Covering Prototype Generation and Matching for SAM-Aided Few-Shot Segmentation

Suho Park, SuBeen Lee, Hyun Seok Seong et al.

AAAI 2025
4
citations
#100

Text and Image Are Mutually Beneficial: Enhancing Training-Free Few-Shot Classification with CLIP

Yayuan Li, Jintao Guo, Lei Qi et al.

AAAI 2025
4
citations