🧬 Multimodal

Image Captioning

Generating text descriptions of images

248 papers (showing top 100) · 5,431 total citations
Mar '24 – Feb '26 · 203 papers
Also includes: image captioning, visual captioning, image description

Top Papers

#1

Scaling Autoregressive Models for Content-Rich Text-to-Image Generation

Xin Li, Jing Yu Koh, Alexander Ku et al.

ICLR 2024
1,366 citations
#2

CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

Zhuoyi Yang, Jiayan Teng, Wendi Zheng et al.

ICLR 2025 · arXiv:2408.06072
text-to-video generation · diffusion transformer · 3d variational autoencoder · expert transformer · +4
1,318 citations
#3

InstantBooth: Personalized Text-to-Image Generation without Test-Time Finetuning

Jing Shi, Wei Xiong, Zhe Lin et al.

CVPR 2024 · arXiv:2304.03411
369 citations
#4

Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers

Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace et al.

CVPR 2024 · arXiv:2402.19479
341 citations
#5

Follow Your Pose: Pose-Guided Text-to-Video Generation Using Pose-Free Videos

Yue Ma, Yingqing HE, Xiaodong Cun et al.

AAAI 2024 · arXiv:2304.01186
pose-guided generation · text-to-video generation · character video synthesis · pose-controllable generation · +4
276 citations
#6

Text-IF: Leveraging Semantic Text Guidance for Degradation-Aware and Interactive Image Fusion

Xunpeng Yi, Han Xu, HAO ZHANG et al.

CVPR 2024 · arXiv:2403.16387
123 citations
#7

MIGC: Multi-Instance Generation Controller for Text-to-Image Synthesis

Dewei Zhou, You Li, Fan Ma et al.

CVPR 2024 · arXiv:2402.05408
109 citations
#8

AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark

Wenhao Chai, Enxin Song, Yilun Du et al.

ICLR 2025 · arXiv:2410.03051
video detailed captioning · large multimodal model · token merging strategy · temporal modeling · +4
97 citations
#9

MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training

Pavan Kumar Anasosalu Vasu, Hadi Pouransari, Fartash Faghri et al.

CVPR 2024 · arXiv:2311.17049
84 citations
#10

Video ReCap: Recursive Captioning of Hour-Long Videos

Md Mohaiminul Islam, Vu Bao Ngan Ho, Xitong Yang et al.

CVPR 2024 · arXiv:2402.13250
82 citations
#11

Learning Multi-Dimensional Human Preference for Text-to-Image Generation

Sixian Zhang, Bohan Wang, Junqiang Wu et al.

CVPR 2024 · arXiv:2405.14705
76 citations
#12

Language-Image Pre-training with Long Captions

Kecheng Zheng, Yifei Zhang, Wei Wu et al.

ECCV 2024 · arXiv:2403.17007
63 citations
#13

VeCLIP: Improving CLIP Training via Visual-enriched Captions

Zhengfeng Lai, Haotian Zhang, Bowen Zhang et al.

ECCV 2024
59 citations
#14

SECap: Speech Emotion Captioning with Large Language Model

Yaoxun Xu, Hangting Chen, Jianwei Yu et al.

AAAI 2024 · arXiv:2312.10381
speech emotion captioning · large language models · audio feature extraction · mutual information learning · +4
56 citations
#15

Harnessing the Power of MLLMs for Transferable Text-to-Image Person ReID

Wentao Tan, Changxing Ding, Jiayu Jiang et al.

CVPR 2024 · arXiv:2405.04940
55 citations
#16

A Recipe for Scaling up Text-to-Video Generation with Text-free Videos

Xiang Wang, Shiwei Zhang, Hangjie Yuan et al.

CVPR 2024 · arXiv:2312.15770
53 citations
#17

Describe Anything: Detailed Localized Image and Video Captioning

Long Lian, Yifan Ding, Yunhao Ge et al.

ICCV 2025 · arXiv:2504.16072
49 citations
#18

JeDi: Joint-Image Diffusion Models for Finetuning-Free Personalized Text-to-Image Generation

Yu Zeng, Vishal M. Patel, Haochen Wang et al.

CVPR 2024 · arXiv:2407.06187
47 citations
#19

MultiBooth: Towards Generating All Your Concepts in an Image from Text

Chenyang Zhu, Kai Li, Yue Ma et al.

AAAI 2025 · arXiv:2404.14239
46 citations
#20

T-MARS: Improving Visual Representations by Circumventing Text Feature Learning

Pratyush Maini, Sachin Goyal, Zachary Lipton et al.

ICLR 2024 · arXiv:2307.03132
41 citations
#21

Visual Fact Checker: Enabling High-Fidelity Detailed Caption Generation

Yunhao Ge, Xiaohui Zeng, Jacob Huffman et al.

CVPR 2024 · arXiv:2404.19752
33 citations
#22

Make It Count: Text-to-Image Generation with an Accurate Number of Objects

Lital Binyamin, Yoad Tewel, Hilit Segev et al.

CVPR 2025 · arXiv:2406.10210
text-to-image generation · diffusion models · object counting · instance identity · +3
32 citations
#23

LLM4GEN: Leveraging Semantic Representation of LLMs for Text-to-Image Generation

Mushui Liu, Yuhang Ma, Zhen Yang et al.

AAAI 2025 · arXiv:2407.00737
31 citations
#24

View Selection for 3D Captioning via Diffusion Ranking

Tiange Luo, Justin Johnson, Honglak Lee

ECCV 2024 · arXiv:2404.07984
29 citations
#25

Enriching Multimodal Sentiment Analysis Through Textual Emotional Descriptions of Visual-Audio Content

Sheng Wu, Dongxiao He, Xiaobao Wang et al.

AAAI 2025 · arXiv:2412.10460
28 citations
#26

VideoCon: Robust Video-Language Alignment via Contrast Captions

Hritik Bansal, Yonatan Bitton, Idan Szpektor et al.

CVPR 2024 · arXiv:2311.10111
28 citations
#27

Learning Disentangled Identifiers for Action-Customized Text-to-Image Generation

Siteng Huang, Biao Gong, Yutong Feng et al.

CVPR 2024 · arXiv:2311.15841
23 citations
#28

DIBS: Enhancing Dense Video Captioning with Unlabeled Videos via Pseudo Boundary Enrichment and Online Refinement

Hao Wu, Huabin Liu, Yu Qiao et al.

CVPR 2024 · arXiv:2404.02755
20 citations
#29

Zero-shot Referring Expression Comprehension via Structural Similarity Between Images and Captions

Zeyu Han, Fangrui Zhu, Qianru Lao et al.

CVPR 2024 · arXiv:2311.17048
20 citations
#30

Improving Cross-Modal Alignment with Synthetic Pairs for Text-Only Image Captioning

Zhiyue Liu, Jinyuan Liu, Fanrong Ma

AAAI 2024 · arXiv:2312.08865
cross-modal alignment · text-only image captioning · synthetic image-text pairs · clip embedding space · +3
20 citations
#31

STIV: Scalable Text and Image Conditioned Video Generation

Zongyu Lin, Wei Liu, Chen Chen et al.

ICCV 2025
20 citations
#32

SmartControl: Enhancing ControlNet for Handling Rough Visual Conditions

XIAOYU LIU, Yuxiang WEI, Ming LIU et al.

ECCV 2024
19 citations
#33

Image-Text Co-Decomposition for Text-Supervised Semantic Segmentation

Ji-Jia Wu, Andy Chia-Hao Chang, Chieh-Yu Chuang et al.

CVPR 2024 · arXiv:2404.04231
19 citations
#34

OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation

Pengfei Zhou, Xiaopeng Peng, Jiajun Song et al.

CVPR 2025 · arXiv:2411.18499
18 citations
#35

Benchmarking Large Vision-Language Models via Directed Scene Graph for Comprehensive Image Captioning

Fan Lu, Wei Wu, Kecheng Zheng et al.

CVPR 2025
18 citations
#36

Composing Object Relations and Attributes for Image-Text Matching

Khoi Pham, Chuong Huynh, Ser-Nam Lim et al.

CVPR 2024 · arXiv:2406.11820
18 citations
#37

FINECAPTION: Compositional Image Captioning Focusing on Wherever You Want at Any Granularity

Hang Hua, Qing Liu, Lingzhi Zhang et al.

CVPR 2025 · arXiv:2411.15411
vision-language models · compositional image captioning · fine-grained image understanding · segmentation mask alignment · +4
17 citations
#38

Mismatch Quest: Visual and Textual Feedback for Image-Text Misalignment

Brian Gordon, Yonatan Bitton, Yonatan Shafir et al.

ECCV 2024 · arXiv:2312.03766
image-text alignment · misalignment explanation · visual grounding · vision language models · +3
17 citations
#39

Painting with Words: Elevating Detailed Image Captioning with Benchmark and Alignment Learning

Qinghao Ye, Xianhan Zeng, Fu Li et al.

ICLR 2025 · arXiv:2503.07906
15 citations
#40

Exploiting Auxiliary Caption for Video Grounding

Hongxiang Li, Meng Cao, Xuxin Cheng et al.

AAAI 2024 · arXiv:2301.05997
video grounding · dense video captioning · cross-modal contrastive learning · semantic relation projection · +3
14 citations
#41

Hyperbolic Learning with Synthetic Captions for Open-World Detection

Fanjie Kong, Yanbei Chen, Jiarui Cai et al.

CVPR 2024 · arXiv:2404.05016
14 citations
#42

Generating Multi-Image Synthetic Data for Text-to-Image Customization

Nupur Kumari, Xi Yin, Jun-Yan Zhu et al.

ICCV 2025 · arXiv:2502.01720
14 citations
#43

Customization Assistant for Text-to-Image Generation

Yufan Zhou, Ruiyi Zhang, Jiuxiang Gu et al.

CVPR 2024 · arXiv:2312.03045
14 citations
#44

CoBIT: A Contrastive Bi-directional Image-Text Generation Model

Haoxuan You, Xiaoyue Guo, Zhecan Wang et al.

ICLR 2024 · arXiv:2303.13455
14 citations
#45

Cycle-Consistency Learning for Captioning and Grounding

Ning Wang, Jiajun Deng, Mingbo Jia

AAAI 2024 · arXiv:2312.15162
visual grounding · image captioning · cyclic-consistent learning · semi-weakly supervised training · +3
13 citations
#46

BRIDGE: Bridging Gaps in Image Captioning Evaluation with Stronger Visual Cues

Sara Sarto, Marcella Cornia, Lorenzo Baraldi et al.

ECCV 2024 · arXiv:2407.20341
12 citations
#47

CoMM: A Coherent Interleaved Image-Text Dataset for Multimodal Understanding and Generation

Wei Chen, Lin Li, Yongqi Yang et al.

CVPR 2025 · arXiv:2406.10462
multimodal large language models · interleaved image-text generation · multimodal in-context learning · narrative coherence · +4
12 citations
#48

BlobGEN-Vid: Compositional Text-to-Video Generation with Blob Video Representations

Weixi Feng, Chao Liu, Sifei Liu et al.

CVPR 2025 · arXiv:2501.07647
11 citations
#49

Image Over Text: Transforming Formula Recognition Evaluation with Character Detection Matching

Bin Wang, Fan Wu, Linke Ouyang et al.

CVPR 2025 · arXiv:2409.03643
formula recognition · evaluation metrics · character detection matching · latex rendering · +3
11 citations
#50

Light-T2M: A Lightweight and Fast Model for Text-to-motion Generation

Ling-An Zeng, Guohong Huang, Gaojie Wu et al.

AAAI 2025 · arXiv:2412.11193
10 citations
#51

VIXEN: Visual Text Comparison Network for Image Difference Captioning

Alexander Black, Jing Shi, Yifei Fan et al.

AAAI 2024 · arXiv:2402.19119
image difference captioning · visual text comparison · pairwise image features · soft prompt construction · +4
9 citations
#52

PreciseCam: Precise Camera Control for Text-to-Image Generation

Edurne Bernal-Berdun, Ana Serrano, Belen Masia et al.

CVPR 2025
9 citations
#53

BizGen: Advancing Article-level Visual Text Rendering for Infographics Generation

Yuyang Peng, Shishi Xiao, Keming Wu et al.

CVPR 2025 · arXiv:2503.20672
visual text rendering · infographics generation · layout-guided attention · business content generation · +4
8 citations
#54

COCONut-PanCap: Joint Panoptic Segmentation and Grounded Captions for Fine-Grained Understanding and Generation

Xueqing Deng, Linjie Yang, Qihang Yu et al.

NeurIPS 2025
8 citations
#55

Progress-Aware Video Frame Captioning

Zihui Xue, Joungbin An, Xitong Yang et al.

CVPR 2025 · arXiv:2412.02071
7 citations
#56

Removing Distributional Discrepancies in Captions Improves Image-Text Alignment

Mu Cai, Haotian Liu, Yuheng Li et al.

ECCV 2024 · arXiv:2410.00905
7 citations
#57

L-MAGIC: Language Model Assisted Generation of Images with Coherence

zhipeng cai, Matthias Mueller, Reiner Birkl et al.

CVPR 2024 · arXiv:2406.01843
7 citations
#58

Instruction-guided Multi-Granularity Segmentation and Captioning with Large Multimodal Model

Xu Yuan, Li Zhou, Zenghui Sun et al.

AAAI 2025 · arXiv:2409.13407
7 citations
#59

Prompt Augmentation for Self-supervised Text-guided Image Manipulation

Rumeysa Bodur, Binod Bhattarai, Tae-Kyun Kim

CVPR 2024 · arXiv:2412.13081
6 citations
#60

CompCap: Improving Multimodal Large Language Models with Composite Captions

Xiaohui Chen, Satya Narayan Shukla, Mahmoud Azab et al.

ICCV 2025 · arXiv:2412.05243
multimodal large language models · composite image understanding · vision-language alignment · synthetic data generation · +3
6 citations
#61

Fix-CLIP: Dual-Branch Hierarchical Contrastive Learning via Synthetic Captions for Better Understanding of Long Text

Bingchao Wang, Zhiwei Ning, Jianyu Ding et al.

ICCV 2025 · arXiv:2507.10095
contrastive learning · long-text retrieval · vision-language alignment · synthetic caption generation · +4
5 citations
#62

Modeling Thousands of Human Annotators for Generalizable Text-to-Image Person Re-identification

Jiayu Jiang, Changxing Ding, Wentao Tan et al.

CVPR 2025
5 citations
#63

EvdCLIP: Improving Vision-Language Retrieval with Entity Visual Descriptions from Large Language Models

GuangHao Meng, Sunan He, Jinpeng Wang et al.

AAAI 2025 · arXiv:2505.18594
5 citations
#64

Text and Image Are Mutually Beneficial: Enhancing Training-Free Few-Shot Classification with CLIP

Yayuan Li, Jintao Guo, Lei Qi et al.

AAAI 2025 · arXiv:2412.11375
4 citations
#65

HiCM²: Hierarchical Compact Memory Modeling for Dense Video Captioning

Minkuk Kim, Hyeon Bae Kim, Jinyoung Moon et al.

AAAI 2025
4 citations
#66

DECap: Towards Generalized Explicit Caption Editing via Diffusion Mechanism

Zhen Wang, Xinyun Jiang, Jun Xiao et al.

ECCV 2024 · arXiv:2311.14920
explicit caption editing · diffusion models · denoising process · caption generation · +3
4 citations
#67

TF-TI2I: Training-Free Text-and-Image-to-Image Generation via Multi-Modal Implicit-Context Learning In Text-to-Image Models

Teng-Fang Hsiao, Bo-Kai Ruan, Yi-Lun Wu et al.

ICCV 2025 · arXiv:2503.15283
4 citations
#68

A Comprehensive Study of Decoder-Only LLMs for Text-to-Image Generation

Andrew Z Wang, Songwei Ge, Tero Karras et al.

CVPR 2025 · arXiv:2506.08210
text-to-image generation · decoder-only llms · diffusion models · text encoders · +3
4 citations
#69

MultiGen: Zero-shot Image Generation from Multi-modal Prompts

Zhi-Fan Wu, Lianghua Huang, Wei Wang et al.

ECCV 2024
4 citations
#70

Zero-shot Text-guided Infinite Image Synthesis with LLM guidance

Soyeong Kwon, TAEGYEONG LEE, Taehwan Kim

ECCV 2024 · arXiv:2407.12642
zero-shot learning · text-guided image synthesis · diffusion models · large language models · +4
3 citations
#71

SC-Captioner: Improving Image Captioning with Self-Correction by Reinforcement Learning

Lin Zhang, Xianfang Zeng, Kangcong Li et al.

ICCV 2025 · arXiv:2508.06125
3 citations
#72

Shot-by-Shot: Film-Grammar-Aware Training-Free Audio Description Generation

Junyu Xie, Tengda Han, Max Bain et al.

ICCV 2025 · arXiv:2504.01020
3 citations
#73

Fair Generation without Unfair Distortions: Debiasing Text-to-Image Generation with Entanglement-Free Attention

Jeonghoon Park, Juyoung Lee, Chaeyeon Chung et al.

ICCV 2025 · arXiv:2506.13298
3 citations
#74

Focus-N-Fix: Region-Aware Fine-Tuning for Text-to-Image Generation

Xiaoying Xing, Avinab Saha, Junfeng He et al.

CVPR 2025 · arXiv:2501.06481
text-to-image generation · reward model fine-tuning · region-aware fine-tuning · human preference alignment · +4
3 citations
#75

BACON: Improving Clarity of Image Captions via Bag-of-Concept Graphs

Zhantao Yang, Ruili Feng, Keyu Yan et al.

CVPR 2025 · arXiv:2407.03314
3 citations
#76

Adaptive Markup Language Generation for Contextually-Grounded Visual Document Understanding

Han Xiao, yina xie, Guanxin tan et al.

CVPR 2025
3 citations
#77

Image Regeneration: Evaluating Text-to-Image Model via Generating Identical Image with Multimodal Large Language Models

Chutian Meng, Fan Ma, Jiaxu Miao et al.

AAAI 2025 · arXiv:2411.09449
3 citations
#78

Zero-Shot Image Captioning with Multi-type Entity Representations

Delong Zeng, Ying Shen, Man Lin et al.

AAAI 2025
3 citations
#79

CAPability: A Comprehensive Visual Caption Benchmark for Evaluating Both Correctness and Thoroughness

Zhihang Liu, Chen-Wei Xie, Bin Wen et al.

NeurIPS 2025 · arXiv:2502.14914
3 citations
#80

Describe, Don't Dictate: Semantic Image Editing with Natural Language Intent

En Ci, Shanyan Guan, Yanhao Ge et al.

ICCV 2025
2 citations
#81

ConText-CIR: Learning from Concepts in Text for Composed Image Retrieval

Eric Xing, Pranavi Kolouju, Robert Pless et al.

CVPR 2025 · arXiv:2505.20764
2 citations
#82

Unleashing Text-to-Image Diffusion Prior for Zero-Shot Image Captioning

Jianjie Luo, Jingwen Chen, Yehao Li et al.

ECCV 2024 · arXiv:2501.00437
2 citations
#83

PhyS-EdiT: Physics-aware Semantic Image Editing with Text Description

Ziqi Cai, Shuchen Weng, Yifei Xia et al.

CVPR 2025
2 citations
#84

Unbiasing through Textual Descriptions: Mitigating Representation Bias in Video Benchmarks

Nina Shvetsova, Arsha Nagrani, Bernt Schiele et al.

CVPR 2025 · arXiv:2503.18637
1 citation
#85

Embodied Image Captioning: Self-supervised Learning Agents for Spatially Coherent Image Descriptions

Tommaso Galliena, Tommaso Apicella, Stefano Rosa et al.

ICCV 2025 · arXiv:2504.08531
1 citation
#86

Type-R: Automatically Retouching Typos for Text-to-Image Generation

Wataru Shimoda, Naoto Inoue, Daichi Haraguchi et al.

CVPR 2025
1 citation
#87

SGDiff: Scene Graph Guided Diffusion Model for Image Collaborative SegCaptioning

Xu Zhang, Jin Yuan, Hanwang Zhang et al.

AAAI 2025 · arXiv:2512.01975
1 citation
#88

Query-centric Audio-Visual Cognition Network for Moment Retrieval, Segmentation and Step-Captioning

Yunbin Tu, Liang Li, Li Su et al.

AAAI 2025 · arXiv:2412.13543
1 citation
#89

TextCenGen: Attention-Guided Text-Centric Background Adaptation for Text-to-Image Generation

Tianyi Liang, Jiangqi Liu, Yifei Huang et al.

ICML 2025 · arXiv:2404.11824
1 citation
#90

Localizing and Editing Knowledge In Text-to-Image Generative Models

Samyadeep Basu, Nanxuan Zhao, Vlad Morariu et al.

ICLR 2024
Citations not collected
#91

BrainSCUBA: Fine-Grained Natural Language Captions of Visual Cortex Selectivity

Andrew Luo, Maggie Henderson, Michael Tarr et al.

ICLR 2024 · arXiv:2310.04420
Citations not collected
#92

MAGICK: A Large-scale Captioned Dataset from Matting Generated Images using Chroma Keying

Ryan Burgert, Brian Price, Jason Kuen et al.

CVPR 2024
Citations not collected
#93

Alt-Text with Context: Improving Accessibility for Images on Twitter

Nikita Srivatsan, Sofia Samaniego, Omar Florez et al.

ICLR 2024
Citations not collected
#94

Fine-Grained Captioning of Long Videos through Scene Graph Consolidation

Sanghyeok Chu, Seonguk Seo, Bohyung Han

ICML 2025
Citations not collected
#95

Polos: Multimodal Metric Learning from Human Feedback for Image Captioning

Yuiga Wada, Kanta Kaneda, Daichi Saito et al.

CVPR 2024 · arXiv:2402.18091
Citations not collected
#96

LEDITS++: Limitless Image Editing using Text-to-Image Models

Manuel Brack, Felix Friedrich, Katharina Kornmeier et al.

CVPR 2024
Citations not collected
#97

Tag2Text: Guiding Vision-Language Model via Image Tagging

Xinyu Huang, Youcai Zhang, Jinyu Ma et al.

ICLR 2024 · arXiv:2303.05657
Citations not collected
#98

EVCap: Retrieval-Augmented Image Captioning with External Visual-Name Memory for Open-World Comprehension

Jiaxuan Li, Duc Minh Vo, Akihiro Sugimoto et al.

CVPR 2024
Citations not collected
#99

Learning Continuous 3D Words for Text-to-Image Generation

Ta-Ying Cheng, Matheus Gadelha, Thibault Groueix et al.

CVPR 2024 · arXiv:2402.08654
Citations not collected
#100

A Picture is Worth More Than 77 Text Tokens: Evaluating CLIP-Style Models on Dense Captions

Jack Urbanek, Florian Bordes, Pietro Astolfi et al.

CVPR 2024
Citations not collected