Image-Text Matching
Aligning images with text descriptions
Top Papers
Grounding Image Matching in 3D with MASt3R
Vincent Leroy, Yohann Cabon, Jerome Revaud
RoMa: Robust Dense Feature Matching
Johan Edstedt, Qiyu Sun, Georg Bökman et al.
Text-IF: Leveraging Semantic Text Guidance for Degradation-Aware and Interactive Image Fusion
Xunpeng Yi, Han Xu, HAO ZHANG et al.
FoundationStereo: Zero-Shot Stereo Matching
Bowen Wen, Matthew Trepte, Oluwaseun Joseph Aribido et al.
Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models
Yifan Li, hangyu guo, Kun Zhou et al.
MoCha-Stereo: Motif Channel Attention Network for Stereo Matching
Ziyang Chen, Wei Long, He Yao et al.
OmniGlue: Generalizable Feature Matching with Foundation Model Guidance
Hanwen Jiang, Arjun Karpur, Bingyi Cao et al.
DreamMatcher: Appearance Matching Self-Attention for Semantically-Consistent Text-to-Image Personalization
Jisu Nam, Heesu Kim, DongJae Lee et al.
Context-I2W: Mapping Images to Context-Dependent Words for Accurate Zero-Shot Composed Image Retrieval
Yuanmin Tang, Jing Yu, Keke Gai et al.
Phantom: Subject-Consistent Video Generation via Cross-Modal Alignment
Lijie Liu, Tianxiang Ma, Bingchuan Li et al.
Text-Image Alignment for Diffusion-Based Perception
Neehar Kondapaneni, Markus Marks, Manuel Knott et al.
Describing Differences in Image Sets with Natural Language
Lisa Dunlap, Yuhui Zhang, Xiaohan Wang et al.
ReMamber: Referring Image Segmentation with Mamba Twister
Yuhuan Yang, Chaofan Ma, Jiangchao Yao et al.
OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text
Qingyun Li, Zhe Chen, Weiyun Wang et al.
Neural Markov Random Field for Stereo Matching
Tongfan Guan, Chen Wang, Yun-Hui Liu
MultiBooth: Towards Generating All Your Concepts in an Image from Text
Chenyang Zhu, Kai Li, Yue Ma et al.
Improving Image Restoration through Removing Degradations in Textual Representations
Jingbo Lin, Zhilu Zhang, Yuxiang Wei et al.
Improved Probabilistic Image-Text Representations
Sanghyuk Chun
CoCoCo: Improving Text-Guided Video Inpainting for Better Consistency, Controllability and Compatibility
Bojia Zi, Shihao Zhao, Xianbiao Qi et al.
Control4D: Efficient 4D Portrait Editing with Text
Ruizhi Shao, Jingxiang Sun, Cheng Peng et al.
Add-it: Training-Free Object Insertion in Images With Pretrained Diffusion Models
Yoad Tewel, Rinon Gal, Dvir Samuel et al.
Spherical Linear Interpolation and Text-Anchoring for Zero-shot Composed Image Retrieval
Young Kyun Jang, Dat B Huynh, Ashish Shah et al.
Unifying Multi-Modal Uncertainty Modeling and Semantic Alignment for Text-to-Image Person Re-identification
Zhiwei Zhao, Bin Liu, Yan Lu et al.
UniGarmentManip: A Unified Framework for Category-Level Garment Manipulation via Dense Visual Correspondence
Ruihai Wu, Haoran Lu, Yiyan Wang et al.
EventBind: Learning a Unified Representation to Bind Them All for Event-based Open-world Understanding
jiazhou zhou, Xu Zheng, Yuanhuiyi Lyu et al.
VideoCon: Robust Video-Language Alignment via Contrast Captions
Hritik Bansal, Yonatan Bitton, Idan Szpektor et al.
Perception-Guided Jailbreak Against Text-to-Image Models
Yihao Huang, Le Liang, Tianlin Li et al.
Image-to-Image Matching via Foundation Models: A New Perspective for Open-Vocabulary Semantic Segmentation
Yuan Wang, Rui Sun, Naisong Luo et al.
Doubly Abductive Counterfactual Inference for Text-based Image Editing
Xue Song, Jiequan Cui, Hanwang Zhang et al.
CLIM: Contrastive Language-Image Mosaic for Region Representation
Size Wu, Wenwei Zhang, Lumin XU et al.
MindTuner: Cross-Subject Visual Decoding with Visual Fingerprint and Semantic Correction
Zixuan Gong, Qi Zhang, Guangyin Bao et al.
Q-Eval-100K: Evaluating Visual Quality and Alignment Level for Text-to-Vision Content
Zicheng Zhang, Tengchuan Kou, Chunyi Li et al.
Image Clustering Conditioned on Text Criteria
Sehyun Kwon, Jaden Park, Minkyu Kim et al.
Improving Cross-Modal Alignment with Synthetic Pairs for Text-Only Image Captioning
Zhiyue Liu, Jinyuan Liu, Fanrong Ma
Aligning Geometric Spatial Layout in Cross-View Geo-Localization via Feature Recombination
Qingwang Zhang, Yingying Zhu
Image-Text Co-Decomposition for Text-Supervised Semantic Segmentation
Ji-Jia Wu, Andy Chia-Hao Chang, Chieh-Yu Chuang et al.
VTQA: Visual Text Question Answering via Entity Alignment and Cross-Media Reasoning
Kang Chen, Xiangqian Wu
You'll Never Walk Alone: A Sketch and Text Duet for Fine-Grained Image Retrieval
Subhadeep Koley, Ayan Kumar Bhunia, Aneeshan Sain et al.
MagDiff: Multi-Alignment Diffusion for High-Fidelity Video Generation and Editing
Haoyu Zhao, Tianyi Lu, Jiaxi Gu et al.
Text Image Inpainting via Global Structure-Guided Diffusion Models
Shipeng Zhu, Pengfei Fang, Chenjie Zhu et al.
Composing Object Relations and Attributes for Image-Text Matching
Khoi Pham, Chuong Huynh, Ser-Nam Lim et al.
A Dual-Way Enhanced Framework from Text Matching Point of View for Multimodal Entity Linking
Shezheng Song, Shan Zhao, ChengYu Wang et al.
CLIBD: Bridging Vision and Genomics for Biodiversity Monitoring at Scale
ZeMing Gong, Austin Wang, Xiaoliang Huo et al.
OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation
Pengfei Zhou, Xiaopeng Peng, Jiajun Song et al.
LRANet: Towards Accurate and Efficient Scene Text Detection with Low-Rank Approximation
Yuchen Su, Zhineng Chen, Zhiwen Shao et al.
MESA: Matching Everything by Segmenting Anything
Yesheng Zhang, Xu Zhao
Mismatch Quest: Visual and Textual Feedback for Image-Text Misalignment
Brian Gordon, Yonatan Bitton, Yonatan Shafir et al.
AlignMamba: Enhancing Multimodal Mamba with Local and Global Cross-modal Alignment
Yan Li, Yifei Xing, Xiangyuan Lan et al.
Decomposing Semantic Shifts for Composed Image Retrieval
Xingyu Yang, Daqing Liu, Heng Zhang et al.
ODM: A Text-Image Further Alignment Pre-training Approach for Scene Text Detection and Spotting
Chen Duan, Pei Fu, Shan Guo et al.
Mitigate the Gap: Improving Cross-Modal Alignment in CLIP
Sedigheh Eslami, Gerard de Melo
FineMatch: Aspect-based Fine-grained Image and Text Mismatch Detection and Correction
Hang Hua, Jing Shi, Kushal Kafle et al.
SleeperMark: Towards Robust Watermark against Fine-Tuning Text-to-image Diffusion Models
Zilan Wang, Junfeng Guo, Jiacheng Zhu et al.
FineLIP: Extending CLIP’s Reach via Fine-Grained Alignment with Longer Text Inputs
Mothilal Asokan, Kebin wu, Fatima Albreiki
Image Over Text: Transforming Formula Recognition Evaluation with Character Detection Matching
Bin Wang, Fan Wu, Linke Ouyang et al.
Personalized Federated Learning for Spatio-Temporal Forecasting: A Dual Semantic Alignment-Based Contrastive Approach
Qingxiang Liu, Sheng Sun, Yuxuan Liang et al.
Partial-to-Partial Shape Matching with Geometric Consistency
Viktoria Ehm, Maolin Gao, Paul Roetzer et al.
AdapEdit: Spatio-Temporal Guided Adaptive Editing Algorithm for Text-Based Continuity-Sensitive Image Editing
Zhiyuan Ma, Guoli Jia, Bowen Zhou
CoMM: A Coherent Interleaved Image-Text Dataset for Multimodal Understanding and Generation
Wei Chen, Lin Li, Yongqi Yang et al.
Unlocking Attributes' Contribution to Successful Camouflage: A Combined Textual and Visual Analysis Strategy
Hong Zhang, Yixuan Lyu, Qian Yu et al.
DELTA: Pre-Train a Discriminative Encoder for Legal Case Retrieval via Structural Word Alignment
Haitao Li, Qingyao Ai, Xinyan Han et al.
Bridging the Gap Between End-to-End and Two-Step Text Spotting
Mingxin Huang, Hongliang Li, Yuliang Liu et al.
LBM: Latent Bridge Matching for Fast Image-to-Image Translation
Clément Chadebec, Onur Tasar, Sanjeev Sreetharan et al.
MTA-CLIP: Language-Guided Semantic Segmentation with Mask-Text Alignment
Anurag Das, Xinting Hu, Li Jiang et al.
Deep Kernel Relative Test for Machine-generated Text Detection
Yiliao Song, Zhenqiao Yuan, Shuhai Zhang et al.
The Double-Ellipsoid Geometry of CLIP
Meir Yossef Levi, Guy Gilboa
Characteristics Matching Based Hash Codes Generation for Efficient Fine-grained Image Retrieval
Zhen-Duo Chen, Li-Jun Zhao, Zi-Chao Zhang et al.
DiscoMatch: Fast Discrete Optimisation for Geometrically Consistent 3D Shape Matching
Paul Roetzer, Ahmed Abbas, Dongliang Cao et al.
InsightEdit: Towards Better Instruction Following for Image Editing
Yingjing Xu, Jie Kong, Jiazhi Wang et al.
Beyond Walking: A Large-Scale Image-Text Benchmark for Text-based Person Anomaly Search
Shuyu Yang, Yaxiong Wang, Li Zhu et al.
Safeguard Text-to-Image Diffusion Models with Human Feedback Inversion
Sanghyun Kim, Seohyeon Jung, Balhae Kim et al.
TraF-Align: Trajectory-aware Feature Alignment for Asynchronous Multi-agent Perception
Zhiying Song, Lei Yang, Fuxi Wen et al.
DIAL: Dense Image-text ALignment for Weakly Supervised Semantic Segmentation
Soojin Jang, JungMin Yun, JuneHyoung Kwon et al.
GeoReF: Geometric Alignment Across Shape Variation for Category-level Object Pose Refinement
Linfang Zheng, Tze Ho Elden Tse, Chen Wang et al.
Improving Visual Recognition with Hyperbolical Visual Hierarchy Mapping
Hyeongjun Kwon, Jinhyun Jang, Jin Kim et al.
Self-Cross Diffusion Guidance for Text-to-Image Synthesis of Similar Subjects
Weimin Qiu, Jieke Wang, Meng Tang
ACE: Anti-Editing Concept Erasure in Text-to-Image Models
Zihao Wang, Yuxiang Wei, Fan Li et al.
HiFi-Score: Fine-grained Image Description Evaluation with Hierarchical Parsing Graphs
Ziwei Yao, Ruiping Wang, Xilin CHEN
PairingNet: A Learning-based Pair-searching and -matching Network for Image Fragments
rixin zhou, Ding Xia, YI ZHANG et al.
Out of Length Text Recognition with Sub-String Matching
Yongkun Du, Zhineng Chen, Caiyan Jia et al.
Text-Guided Video Masked Autoencoder
David Fan, Jue Wang, Shuai Liao et al.
CLIPtone: Unsupervised Learning for Text-based Image Tone Adjustment
Hyeongmin Lee, Kyoungkook Kang, Jungseul Ok et al.
Removing Distributional Discrepancies in Captions Improves Image-Text Alignment
Mu Cai, Haotian Liu, Yuheng Li et al.
CrossOver: 3D Scene Cross-Modal Alignment
Sayan Deb Sarkar, Ondrej Miksik, Marc Pollefeys et al.
Fine-Grained Erasure in Text-to-Image Diffusion-based Foundation Models
Kartik Thakral, Tamar Glaser, Tal Hassner et al.
Knowledge-Enhanced Historical Document Segmentation and Recognition
En-Hao Gao, Yu-Xuan Huang, Wen-Chao Hu et al.
VCD-Texture: Variance Alignment based 3D-2D Co-Denoising for Text-Guided Texturing
Shang Liu, Chaohui Yu, Chenjie Cao et al.
HarmonySet: A Comprehensive Dataset for Understanding Video-Music Semantic Alignment and Temporal Synchronization
Zitang Zhou, Ke Mei, Yu Lu et al.
MIEB: Massive Image Embedding Benchmark
Chenghao Xiao, Isaac Chung, Imene Kerboua et al.
Multi-Level Cross-Modal Alignment for Image Clustering
Liping Qiu, Qin Zhang, Xiaojun Chen et al.
Chat-based Person Retrieval via Dialogue-Refined Cross-Modal Alignment
Yang Bai, Yucheng Ji, Min Cao et al.
VE-Bench: Subjective-Aligned Benchmark Suite for Text-Driven Video Editing Quality Assessment
Shangkun Sun, Xiaoyu Liang, Songlin Fan et al.
CLIPDrag: Combining Text-based and Drag-based Instructions for Image Editing
Ziqi Jiang, Zhen Wang, Long Chen
Refining CLIP's Spatial Awareness: A Visual-Centric Perspective
Congpei Qiu, Yanhao Wu, Wei Ke et al.
CMA: A Chromaticity Map Adapter for Robust Detection of Screen-Recapture Document Images
Changsheng Chen, Liangwei Lin, Yongqi Chen et al.
Human-centered Interactive Learning via MLLMs for Text-to-Image Person Re-identification
Yang Qin, Chao Chen, Zhihang Fu et al.
SimMotionEdit: Text-Based Human Motion Editing with Motion Similarity Prediction
Zhengyuan Li, Kai Cheng, Anindita Ghosh et al.
Refine-by-Align: Reference-Guided Artifacts Refinement through Semantic Alignment
Yizhi Song, Liu He, Zhifei Zhang et al.
Exploring Reliable Matching with Phase Enhancement for Night-time Semantic Segmentation
Yuwen Pan, Rui Sun, Naisong Luo et al.
Text Grouping Adapter: Adapting Pre-trained Text Detector for Layout Analysis
Tianci Bi, Xiaoyi Zhang, Zhizheng Zhang et al.