Image-Text Matching
Aligning images with text descriptions
Top Papers
Grounding Image Matching in 3D with MASt3R
Vincent Leroy, Yohann Cabon, Jerome Revaud
RoMa: Robust Dense Feature Matching
Johan Edstedt, Qiyu Sun, Georg Bökman et al.
Text-IF: Leveraging Semantic Text Guidance for Degradation-Aware and Interactive Image Fusion
Xunpeng Yi, Han Xu, HAO ZHANG et al.
FoundationStereo: Zero-Shot Stereo Matching
Bowen Wen, Matthew Trepte, Oluwaseun Joseph Aribido et al.
Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models
Yifan Li, hangyu guo, Kun Zhou et al.
MoCha-Stereo: Motif Channel Attention Network for Stereo Matching
Ziyang Chen, Wei Long, He Yao et al.
OmniGlue: Generalizable Feature Matching with Foundation Model Guidance
Hanwen Jiang, Arjun Karpur, Bingyi Cao et al.
DreamMatcher: Appearance Matching Self-Attention for Semantically-Consistent Text-to-Image Personalization
Jisu Nam, Heesu Kim, DongJae Lee et al.
Context-I2W: Mapping Images to Context-Dependent Words for Accurate Zero-Shot Composed Image Retrieval
Yuanmin Tang, Jing Yu, Keke Gai et al.
Phantom: Subject-Consistent Video Generation via Cross-Modal Alignment
Lijie Liu, Tianxiang Ma, Bingchuan Li et al.
Text-Image Alignment for Diffusion-Based Perception
Neehar Kondapaneni, Markus Marks, Manuel Knott et al.
Describing Differences in Image Sets with Natural Language
Lisa Dunlap, Yuhui Zhang, Xiaohan Wang et al.
ReMamber: Referring Image Segmentation with Mamba Twister
Yuhuan Yang, Chaofan Ma, Jiangchao Yao et al.
Neural Markov Random Field for Stereo Matching
Tongfan Guan, Chen Wang, Yun-Hui Liu
MultiBooth: Towards Generating All Your Concepts in an Image from Text
Chenyang Zhu, Kai Li, Yue Ma et al.
Improving Image Restoration through Removing Degradations in Textual Representations
Jingbo Lin, Zhilu Zhang, Yuxiang Wei et al.
Improved Probabilistic Image-Text Representations
Sanghyuk Chun
CoCoCo: Improving Text-Guided Video Inpainting for Better Consistency, Controllability and Compatibility
Bojia Zi, Shihao Zhao, Xianbiao Qi et al.
Control4D: Efficient 4D Portrait Editing with Text
Ruizhi Shao, Jingxiang Sun, Cheng Peng et al.
Add-it: Training-Free Object Insertion in Images With Pretrained Diffusion Models
Yoad Tewel, Rinon Gal, Dvir Samuel et al.
Spherical Linear Interpolation and Text-Anchoring for Zero-shot Composed Image Retrieval
Young Kyun Jang, Dat B Huynh, Ashish Shah et al.
Unifying Multi-Modal Uncertainty Modeling and Semantic Alignment for Text-to-Image Person Re-identification
Zhiwei Zhao, Bin Liu, Yan Lu et al.
UniGarmentManip: A Unified Framework for Category-Level Garment Manipulation via Dense Visual Correspondence
Ruihai Wu, Haoran Lu, Yiyan Wang et al.
EventBind: Learning a Unified Representation to Bind Them All for Event-based Open-world Understanding
jiazhou zhou, Xu Zheng, Yuanhuiyi Lyu et al.
VideoCon: Robust Video-Language Alignment via Contrast Captions
Hritik Bansal, Yonatan Bitton, Idan Szpektor et al.
Perception-Guided Jailbreak Against Text-to-Image Models
Yihao Huang, Le Liang, Tianlin Li et al.
Image-to-Image Matching via Foundation Models: A New Perspective for Open-Vocabulary Semantic Segmentation
Yuan Wang, Rui Sun, Naisong Luo et al.
Doubly Abductive Counterfactual Inference for Text-based Image Editing
Xue Song, Jiequan Cui, Hanwang Zhang et al.
CLIM: Contrastive Language-Image Mosaic for Region Representation
Size Wu, Wenwei Zhang, Lumin XU et al.
MindTuner: Cross-Subject Visual Decoding with Visual Fingerprint and Semantic Correction
Zixuan Gong, Qi Zhang, Guangyin Bao et al.
Image Clustering Conditioned on Text Criteria
Sehyun Kwon, Jaden Park, Minkyu Kim et al.
Aligning Geometric Spatial Layout in Cross-View Geo-Localization via Feature Recombination
Qingwang Zhang, Yingying Zhu
Improving Cross-Modal Alignment with Synthetic Pairs for Text-Only Image Captioning
Zhiyue Liu, Jinyuan Liu, Fanrong Ma
Image-Text Co-Decomposition for Text-Supervised Semantic Segmentation
Ji-Jia Wu, Andy Chia-Hao Chang, Chieh-Yu Chuang et al.
You'll Never Walk Alone: A Sketch and Text Duet for Fine-Grained Image Retrieval
Subhadeep Koley, Ayan Kumar Bhunia, Aneeshan Sain et al.
VTQA: Visual Text Question Answering via Entity Alignment and Cross-Media Reasoning
Kang Chen, Xiangqian Wu
CLIBD: Bridging Vision and Genomics for Biodiversity Monitoring at Scale
ZeMing Gong, Austin Wang, Xiaoliang Huo et al.
MagDiff: Multi-Alignment Diffusion for High-Fidelity Video Generation and Editing
Haoyu Zhao, Tianyi Lu, Jiaxi Gu et al.
A Dual-Way Enhanced Framework from Text Matching Point of View for Multimodal Entity Linking
Shezheng Song, Shan Zhao, ChengYu Wang et al.
Composing Object Relations and Attributes for Image-Text Matching
Khoi Pham, Chuong Huynh, Ser-Nam Lim et al.
OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation
Pengfei Zhou, Xiaopeng Peng, Jiajun Song et al.
Text Image Inpainting via Global Structure-Guided Diffusion Models
Shipeng Zhu, Pengfei Fang, Chenjie Zhu et al.
Decomposing Semantic Shifts for Composed Image Retrieval
Xingyu Yang, Daqing Liu, Heng Zhang et al.
Mismatch Quest: Visual and Textual Feedback for Image-Text Misalignment
Brian Gordon, Yonatan Bitton, Yonatan Shafir et al.
MESA: Matching Everything by Segmenting Anything
Yesheng Zhang, Xu Zhao
LRANet: Towards Accurate and Efficient Scene Text Detection with Low-Rank Approximation
Yuchen Su, Zhineng Chen, Zhiwen Shao et al.
AlignMamba: Enhancing Multimodal Mamba with Local and Global Cross-modal Alignment
Yan Li, Yifei Xing, Xiangyuan Lan et al.
ODM: A Text-Image Further Alignment Pre-training Approach for Scene Text Detection and Spotting
Chen Duan, Pei Fu, Shan Guo et al.
FineLIP: Extending CLIP’s Reach via Fine-Grained Alignment with Longer Text Inputs
Mothilal Asokan, Kebin wu, Fatima Albreiki
FineMatch: Aspect-based Fine-grained Image and Text Mismatch Detection and Correction
Hang Hua, Jing Shi, Kushal Kafle et al.
AdapEdit: Spatio-Temporal Guided Adaptive Editing Algorithm for Text-Based Continuity-Sensitive Image Editing
Zhiyuan Ma, Guoli Jia, Bowen Zhou
Partial-to-Partial Shape Matching with Geometric Consistency
Viktoria Ehm, Maolin Gao, Paul Roetzer et al.
Personalized Federated Learning for Spatio-Temporal Forecasting: A Dual Semantic Alignment-Based Contrastive Approach
Qingxiang Liu, Sheng Sun, Yuxuan Liang et al.
CoMM: A Coherent Interleaved Image-Text Dataset for Multimodal Understanding and Generation
Wei Chen, Lin Li, Yongqi Yang et al.
Image Over Text: Transforming Formula Recognition Evaluation with Character Detection Matching
Bin Wang, Fan Wu, Linke Ouyang et al.
Deep Kernel Relative Test for Machine-generated Text Detection
Yiliao Song, Zhenqiao Yuan, Shuhai Zhang et al.
Unlocking Attributes' Contribution to Successful Camouflage: A Combined Textual and Visual Analysis Strategy
Hong Zhang, Yixuan Lyu, Qian Yu et al.
Bridging the Gap Between End-to-End and Two-Step Text Spotting
Mingxin Huang, Hongliang Li, Yuliang Liu et al.
LBM: Latent Bridge Matching for Fast Image-to-Image Translation
Clément Chadebec, Onur Tasar, Sanjeev Sreetharan et al.
DELTA: Pre-Train a Discriminative Encoder for Legal Case Retrieval via Structural Word Alignment
Haitao Li, Qingyao Ai, Xinyan Han et al.
The Double-Ellipsoid Geometry of CLIP
Meir Yossef Levi, Guy Gilboa
DiscoMatch: Fast Discrete Optimisation for Geometrically Consistent 3D Shape Matching
Paul Roetzer, Ahmed Abbas, Dongliang Cao et al.
InsightEdit: Towards Better Instruction Following for Image Editing
Yingjing Xu, Jie Kong, Jiazhi Wang et al.
Characteristics Matching Based Hash Codes Generation for Efficient Fine-grained Image Retrieval
Zhen-Duo Chen, Li-Jun Zhao, Zi-Chao Zhang et al.
Beyond Walking: A Large-Scale Image-Text Benchmark for Text-based Person Anomaly Search
Shuyu Yang, Yaxiong Wang, Li Zhu et al.
Safeguard Text-to-Image Diffusion Models with Human Feedback Inversion
Sanghyun Kim, Seohyeon Jung, Balhae Kim et al.
TraF-Align: Trajectory-aware Feature Alignment for Asynchronous Multi-agent Perception
Zhiying Song, Lei Yang, Fuxi Wen et al.
DIAL: Dense Image-text ALignment for Weakly Supervised Semantic Segmentation
Soojin Jang, JungMin Yun, JuneHyoung Kwon et al.
Improving Visual Recognition with Hyperbolical Visual Hierarchy Mapping
Hyeongjun Kwon, Jinhyun Jang, Jin Kim et al.
ACE: Anti-Editing Concept Erasure in Text-to-Image Models
Zihao Wang, Yuxiang Wei, Fan Li et al.
GeoReF: Geometric Alignment Across Shape Variation for Category-level Object Pose Refinement
Linfang Zheng, Tze Ho Elden Tse, Chen Wang et al.
VCD-Texture: Variance Alignment based 3D-2D Co-Denoising for Text-Guided Texturing
Shang Liu, Chaohui Yu, Chenjie Cao et al.
Out of Length Text Recognition with Sub-String Matching
Yongkun Du, Zhineng Chen, Caiyan Jia et al.
HiFi-Score: Fine-grained Image Description Evaluation with Hierarchical Parsing Graphs
Ziwei Yao, Ruiping Wang, Xilin CHEN
PairingNet: A Learning-based Pair-searching and -matching Network for Image Fragments
rixin zhou, Ding Xia, YI ZHANG et al.
Text-Guided Video Masked Autoencoder
David Fan, Jue Wang, Shuai Liao et al.
Removing Distributional Discrepancies in Captions Improves Image-Text Alignment
Mu Cai, Haotian Liu, Yuheng Li et al.
CLIPtone: Unsupervised Learning for Text-based Image Tone Adjustment
Hyeongmin Lee, Kyoungkook Kang, Jungseul Ok et al.
CrossOver: 3D Scene Cross-Modal Alignment
Sayan Deb Sarkar, Ondrej Miksik, Marc Pollefeys et al.
Fine-Grained Erasure in Text-to-Image Diffusion-based Foundation Models
Kartik Thakral, Tamar Glaser, Tal Hassner et al.
HarmonySet: A Comprehensive Dataset for Understanding Video-Music Semantic Alignment and Temporal Synchronization
Zitang Zhou, Ke Mei, Yu Lu et al.
Knowledge-Enhanced Historical Document Segmentation and Recognition
En-Hao Gao, Yu-Xuan Huang, Wen-Chao Hu et al.
VE-Bench: Subjective-Aligned Benchmark Suite for Text-Driven Video Editing Quality Assessment
Shangkun Sun, Xiaoyu Liang, Songlin Fan et al.
CLIPDrag: Combining Text-based and Drag-based Instructions for Image Editing
Ziqi Jiang, Zhen Wang, Long Chen
Refining CLIP's Spatial Awareness: A Visual-Centric Perspective
Congpei Qiu, Yanhao Wu, Wei Ke et al.
Multi-Level Cross-Modal Alignment for Image Clustering
Liping Qiu, Qin Zhang, Xiaojun Chen et al.
Chat-based Person Retrieval via Dialogue-Refined Cross-Modal Alignment
Yang Bai, Yucheng Ji, Min Cao et al.
MIEB: Massive Image Embedding Benchmark
Chenghao Xiao, Isaac Chung, Imene Kerboua et al.
SimMotionEdit: Text-Based Human Motion Editing with Motion Similarity Prediction
Zhengyuan Li, Kai Cheng, Anindita Ghosh et al.
CMA: A Chromaticity Map Adapter for Robust Detection of Screen-Recapture Document Images
Changsheng Chen, Liangwei Lin, Yongqi Chen et al.
Human-centered Interactive Learning via MLLMs for Text-to-Image Person Re-identification
Yang Qin, Chao Chen, Zhihang Fu et al.
Novel Class Discovery in Chest X-rays via Paired Images and Text
Jiaying Zhou, Yang Liu, Qingchao Chen
Text Grouping Adapter: Adapting Pre-trained Text Detector for Layout Analysis
Tianci Bi, Xiaoyi Zhang, Zhizheng Zhang et al.
Text-Anchored Score Composition: Tackling Condition Misalignment in Text-to-Image Diffusion Models
Luozhou Wang, Guibao Shen, Wenhang Ge et al.
Refine-by-Align: Reference-Guided Artifacts Refinement through Semantic Alignment
Yizhi Song, Liu He, Zhifei Zhang et al.
Harmonizing Visual and Textual Embeddings for Zero-Shot Text-to-Image Customization
Yeji Song, Jimyeong Kim, Wonhark Park et al.
EcoMatcher: Efficient Clustering Oriented Matcher for Detector-free Image Matching
Peiqi Chen, Lei Yu, Yi Wan et al.
InstructOCR: Instruction Boosting Scene Text Spotting
Chen Duan, Qianyi Jiang, Pei Fu et al.
Foreground-Covering Prototype Generation and Matching for SAM-Aided Few-Shot Segmentation
Suho Park, SuBeen Lee, Hyun Seok Seong et al.
Text and Image Are Mutually Beneficial: Enhancing Training-Free Few-Shot Classification with CLIP
Yayuan Li, Jintao Guo, Lei Qi et al.