🧬Multimodal

Image Captioning

Generating text descriptions of images

100 papers · 8,197 total citations


Feb '24 to Jan '26: 751 papers

Top Conferences

CVPR: 43 · AAAI: 21 · ECCV: 15 · ICLR: 11 · ICCV: 5 · NeurIPS: 5

Top Papers

#1

Scaling Autoregressive Models for Content-Rich Text-to-Image Generation

Xin Li, Jing Yu Koh, Alexander Ku et al.

CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

Zhuoyi Yang, Jiayan Teng, Wendi Zheng et al.

Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models

Zhang Li, Biao Yang, Qiang Liu et al.

InstantBooth: Personalized Text-to-Image Generation without Test-Time Finetuning

Jing Shi, Wei Xiong, Zhe Lin et al.

Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers

Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace et al.

Follow Your Pose: Pose-Guided Text-to-Video Generation Using Pose-Free Videos

Yue Ma, Yingqing He, Xiaodong Cun et al.

AAAI 2024 · arXiv:2304.01186

pose-guided generation, text-to-video generation, character video synthesis, pose-controllable generation (+4 more)

276 citations

#7

Emu Edit: Precise Image Editing via Recognition and Generation Tasks

Shelly Sheynin, Adam Polyak, Uriel Singer et al.

Demystifying CLIP Data

Hu Xu, Saining Xie, Xiaoqing Tan et al.

SatCLIP: Global, General-Purpose Location Embeddings with Satellite Imagery

Konstantin Klemmer, Esther Rolf, Caleb Robinson et al.

Text-IF: Leveraging Semantic Text Guidance for Degradation-Aware and Interactive Image Fusion

Xunpeng Yi, Han Xu, Hao Zhang et al.

SCLIP: Rethinking Self-Attention for Dense Vision-Language Inference

Feng Wang, Jieru Mei, Alan Yuille

MIGC: Multi-Instance Generation Controller for Text-to-Image Synthesis

Dewei Zhou, You Li, Fan Ma et al.

AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark

Wenhao Chai, Enxin Song, Yilun Du et al.

ICLR 2025 · arXiv:2410.03051

video detailed captioning, large multimodal model, token merging strategy, temporal modeling (+4 more)

102 citations

#14

An Empirical Study of CLIP for Text-Based Person Search

Cao Min, Yang Bai, Ziyin Zeng et al.

AAAI 2024 · arXiv:2308.10045

text-based person search, contrastive language image pretraining, cross-modal retrieval, vision-language pre-training (+3 more)

94 citations

#15

MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training

Pavan Kumar Anasosalu Vasu, Hadi Pouransari, Fartash Faghri et al.

Video ReCap: Recursive Captioning of Hour-Long Videos

Md Mohaiminul Islam, Vu Bao Ngan Ho, Xitong Yang et al.

Learning Multi-Dimensional Human Preference for Text-to-Image Generation

Sixian Zhang, Bohan Wang, Junqiang Wu et al.

Paint-it: Text-to-Texture Synthesis via Deep Convolutional Texture Map Optimization and Physically-Based Rendering

Kim Youwang, Tae-Hyun Oh, Gerard Pons-Moll

PromptTTS 2: Describing and Generating Voices with Text Prompt

Yichong Leng, Zhifang Guo, Kai Shen et al.

Text Prompt with Normality Guidance for Weakly Supervised Video Anomaly Detection

Zhiwei Yang, Jing Liu, Peng Wu

MV-Adapter: Multi-View Consistent Image Generation Made Easy

Zehuan Huang, Yuan-Chen Guo, Haoran Wang et al.

TC4D: Trajectory-Conditioned Text-to-4D Generation

Sherwin Bahmani, Xian Liu, Wang Yifan et al.

ECCV 2024 · arXiv:2403.17920

text-to-4d generation, trajectory-conditioned generation, dynamic 3d scenes, neural representations (+4 more)

64 citations

#23

Language-Image Pre-training with Long Captions

Kecheng Zheng, Yifei Zhang, Wei Wu et al.

Text Is MASS: Modeling as Stochastic Embedding for Text-Video Retrieval

Jiamian Wang, Guohao Sun, Pichao Wang et al.

VeCLIP: Improving CLIP Training via Visual-enriched Captions

Zhengfeng Lai, Haotian Zhang, Bowen Zhang et al.

MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?

Zhaorun Chen, Zichen Wen, Yichao Du et al.

NeurIPS 2025 · arXiv:2407.04842

multimodal reward models, text-to-image generation, preference dataset, image generation models (+4 more)

57 citations

#27

Context-I2W: Mapping Images to Context-Dependent Words for Accurate Zero-Shot Composed Image Retrieval

Yuanmin Tang, Jing Yu, Keke Gai et al.

AAAI 2024 · arXiv:2309.16137

zero-shot learning, composed image retrieval, image representation learning, context-dependent mapping (+4 more)

57 citations

#28

SECap: Speech Emotion Captioning with Large Language Model

Yaoxun Xu, Hangting Chen, Jianwei Yu et al.

AAAI 2024 · arXiv:2312.10381

speech emotion captioning, large language models, audio feature extraction, mutual information learning (+4 more)

56 citations

#29

Harnessing the Power of MLLMs for Transferable Text-to-Image Person ReID

Wentao Tan, Changxing Ding, Jiayu Jiang et al.

Watermarking Conditional Text Generation for AI Detection: Unveiling Challenges and a Semantic-Aware Watermark Remedy

Yu Fu, Deyi Xiong, Yue Dong

AAAI 2024 · arXiv:2307.13808

watermarking text generation, ai detection, conditional text generation, semantic-aware watermarking (+4 more)

54 citations

#31

A Recipe for Scaling up Text-to-Video Generation with Text-free Videos

Xiang Wang, Shiwei Zhang, Hangjie Yuan et al.

Describing Differences in Image Sets with Natural Language

Lisa Dunlap, Yuhui Zhang, Xiaohan Wang et al.

Discovering and Mitigating Visual Biases through Keyword Explanation

Younghyun Kim, Sangwoo Mo, Minkyu Kim et al.

Structure-CLIP: Towards Scene Graph Knowledge to Enhance Multi-Modal Structured Representations

Yufeng Huang, Jiji Tang, Zhuo Chen et al.

AAAI 2024 · arXiv:2305.06152

scene graph knowledge, multi-modal structured representations, vision-language pre-training, image-text matching (+3 more)

49 citations

#35

Describe Anything: Detailed Localized Image and Video Captioning

Long Lian, Yifan Ding, Yunhao Ge et al.

CoMo: Controllable Motion Generation through Language Guided Pose Code Editing

Yiming Huang, Weilin Wan, Yue Yang et al.

ECCV 2024 · arXiv:2403.13900

text-to-motion generation, controllable motion editing, discrete pose codes, large language models (+4 more)

48 citations

#37

JeDi: Joint-Image Diffusion Models for Finetuning-Free Personalized Text-to-Image Generation

Yu Zeng, Vishal M. Patel, Haochen Wang et al.

MultiBooth: Towards Generating All Your Concepts in an Image from Text

Chenyang Zhu, Kai Li, Yue Ma et al.

TEOChat: A Large Vision-Language Assistant for Temporal Earth Observation Data

Jeremy Irvin, Emily Liu, Joyce Chen et al.

ICLR 2025 · arXiv:2410.06234

vision-language assistant, temporal earth observation, instruction-following dataset, change detection (+4 more)

45 citations

#40

EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions

Kai Chen, Yunhao Gou, Runhui Huang et al.

Style2Talker: High-Resolution Talking Head Generation with Emotion Style and Art Style

Shuai Tan, Bin Ji, Ye Pan

AAAI 2024 · arXiv:2403.06365

talking head generation, emotion style transfer, art style transfer, audio-driven animation (+4 more)

43 citations

#42

T-MARS: Improving Visual Representations by Circumventing Text Feature Learning

Pratyush Maini, Sachin Goyal, Zachary Lipton et al.

TOD3Cap: Towards 3D Dense Captioning in Outdoor Scenes

Bu Jin, Yupeng Zheng, Pengfei Li et al.

ECCV 2024 · arXiv:2403.19589

3d dense captioning, outdoor scene understanding, lidar point cloud, bev representation (+4 more)

40 citations

#44

CoCoCo: Improving Text-Guided Video Inpainting for Better Consistency, Controllability and Compatibility

Bojia Zi, Shihao Zhao, Xianbiao Qi et al.

Control4D: Efficient 4D Portrait Editing with Text

Ruizhi Shao, Jingxiang Sun, Cheng Peng et al.

Infinite-ID: Identity-preserved Personalization via ID-semantics Decoupling Paradigm

Yi Wu, Ziqiang Li, Heliang Zheng et al.

Add-it: Training-Free Object Insertion in Images With Pretrained Diffusion Models

Yoad Tewel, Rinon Gal, Dvir Samuel et al.

ICLR 2025 · arXiv:2411.07232

attention mechanism, diffusion models, semantic image editing, object insertion (+3 more)

34 citations

#48

Visual Fact Checker: Enabling High-Fidelity Detailed Caption Generation

Yunhao Ge, Xiaohui Zeng, Jacob Huffman et al.

LaMP: Language-Motion Pretraining for Motion Generation, Retrieval, and Captioning

Zhe Li, Weihao Yuan, Yisheng He et al.

MAGIC: Generating Self-Correction Guideline for In-Context Text-to-SQL

Arian Askari, Christian Poelitz, Xinye Tang

Make It Count: Text-to-Image Generation with an Accurate Number of Objects

Lital Binyamin, Yoad Tewel, Hilit Segev et al.

Spherical Linear Interpolation and Text-Anchoring for Zero-shot Composed Image Retrieval

Young Kyun Jang, Dat B Huynh, Ashish Shah et al.

LLM4GEN: Leveraging Semantic Representation of LLMs for Text-to-Image Generation

Mushui Liu, Yuhang Ma, Zhen Yang et al.

StarVector: Generating Scalable Vector Graphics Code from Images and Text

Juan Rodriguez, Abhay Puri, Shubham Agarwal et al.

CVPR 2025 · arXiv:2312.11556

svg generation, multimodal language models, image vectorization, vector graphics code (+4 more)

30 citations

#55

PosterLlama: Bridging Design Ability of Langauge Model to Content-Aware Layout Generation

Jaejung Seol, Seojun Kim, Jaejun Yoo

View Selection for 3D Captioning via Diffusion Ranking

Tiange Luo, Justin Johnson, Honglak Lee

Perceive Anything: Recognize, Explain, Caption, and Segment Anything in Images and Videos

Weifeng Lin, Xinyu Wei, Ruichuan An et al.

NeurIPS 2025 · arXiv:2506.05302

region-level visual understanding, object segmentation, large language models, semantic perceiver (+4 more)

29 citations

#58

VideoCon: Robust Video-Language Alignment via Contrast Captions

Hritik Bansal, Yonatan Bitton, Idan Szpektor et al.

Enriching Multimodal Sentiment Analysis Through Textual Emotional Descriptions of Visual-Audio Content

Sheng Wu, Dongxiao He, Xiaobao Wang et al.

SSMG: Spatial-Semantic Map Guided Diffusion Model for Free-Form Layout-to-Image Generation

Chengyou Jia, Minnan Luo, Zhuohang Dang et al.

Perception-Guided Jailbreak Against Text-to-Image Models

Yihao Huang, Le Liang, Tianlin Li et al.

SlideChat: A Large Vision-Language Assistant for Whole-Slide Pathology Image Understanding

Ying Chen, Guoan Wang, Yuanfeng Ji et al.

OpenS2V-Nexus: A Detailed Benchmark and Million-Scale Dataset for Subject-to-Video Generation

Shenghai Yuan, Xianyi He, Yufan Deng et al.

Doubly Abductive Counterfactual Inference for Text-based Image Editing

Xue Song, Jiequan Cui, Hanwang Zhang et al.

XVerse: Consistent Multi-Subject Control of Identity and Semantic Attributes via DiT Modulation

Bowen Chen, Brynn Zhao, Haomiao Sun et al.

MagicQuill: An Intelligent Interactive Image Editing System

Zichen Liu, Yue Yu, Hao Ouyang et al.

AIGV-Assessor: Benchmarking and Evaluating the Perceptual Quality of Text-to-Video Generation with LMM

Wang Jiarui, Huiyu Duan, Guangtao Zhai et al.

Learning Disentangled Identifiers for Action-Customized Text-to-Image Generation

Siteng Huang, Biao Gong, Yutong Feng et al.

MindTuner: Cross-Subject Visual Decoding with Visual Fingerprint and Semantic Correction

Zixuan Gong, Qi Zhang, Guangyin Bao et al.

GigaTok: Scaling Visual Tokenizers to 3 Billion Parameters for Autoregressive Image Generation

Tianwei Xiong, Jun Hao Liew, Zilong Huang et al.

Paper2Poster: Towards Multimodal Poster Automation from Scientific Papers

Wei Pang, Kevin Qinghong Lin, Xiangru Jian et al.

ControlNet-XS: Rethinking the Control of Text-to-Image Diffusion Models as Feedback-Control Systems

Denis Zavadski, Johann-Friedrich Feiden, Carsten Rother

Summarize the Past to Predict the Future: Natural Language Descriptions of Context Boost Multimodal Object Interaction Anticipation

Razvan Pasca, Alexey Gavryushin, Muhammad Hamza et al.

STAR: Spatial-Temporal Augmentation with Text-to-Video Models for Real-World Video Super-Resolution

Rui Xie, Yinhong Liu, Penghao Zhou et al.

StegoGAN: Leveraging Steganography for Non-Bijective Image-to-Image Translation

Sidi Wu, Yizi Chen, Loic Landrieu et al.

Image Clustering Conditioned on Text Criteria

Sehyun Kwon, Jaden Park, Minkyu Kim et al.

STIV: Scalable Text and Image Conditioned Video Generation

Zongyu Lin, Wei Liu, Chen Chen et al.

DIBS: Enhancing Dense Video Captioning with Unlabeled Videos via Pseudo Boundary Enrichment and Online Refinement

Hao Wu, Huabin Liu, Yu Qiao et al.

Zero-shot Referring Expression Comprehension via Structural Similarity Between Images and Captions

Zeyu Han, Fangrui Zhu, Qianru Lao et al.

Improving Cross-Modal Alignment with Synthetic Pairs for Text-Only Image Captioning

Zhiyue Liu, Jinyuan Liu, Fanrong Ma

AAAI 2024 · arXiv:2312.08865

cross-modal alignment, text-only image captioning, synthetic image-text pairs, clip embedding space (+3 more)

20 citations

#81

ColorPeel: Color Prompt Learning with Diffusion Models via Color and Shape Disentanglement

Muhammad Atif Butt, Kai Wang, Javier Vazquez-Corral et al.

ART: Anonymous Region Transformer for Variable Multi-Layer Transparent Image Generation

Yifan Pu, Yiming Zhao, Zhicong Tang et al.

You'll Never Walk Alone: A Sketch and Text Duet for Fine-Grained Image Retrieval

Subhadeep Koley, Ayan Kumar Bhunia, Aneeshan Sain et al.

SmartControl: Enhancing ControlNet for Handling Rough Visual Conditions

Xiaoyu Liu, Yuxiang Wei, Ming Liu et al.

InstructAvatar: Text-Guided Emotion and Motion Control for Avatar Generation

Yuchi Wang, Junliang Guo, Jianhong Bai et al.

Image-Text Co-Decomposition for Text-Supervised Semantic Segmentation

Ji-Jia Wu, Andy Chia-Hao Chang, Chieh-Yu Chuang et al.

LAKE-RED: Camouflaged Images Generation by Latent Background Knowledge Retrieval-Augmented Diffusion

Pancheng Zhao, Peng Xu, Pengda Qin et al.

Image Clustering via the Principle of Rate Reduction in the Age of Pretrained Models

Tianzhe Chu, Shengbang Tong, Tianjiao Ding et al.

StoryImager: A Unified and Efficient Framework for Coherent Story Visualization and Completion

Ming Tao, Bingkun Bao, Hao Tang et al.

Benchmarking Large Vision-Language Models via Directed Scene Graph for Comprehensive Image Captioning

Fan Lu, Wei Wu, Kecheng Zheng et al.

OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation

Pengfei Zhou, Xiaopeng Peng, Jiajun Song et al.

Composing Object Relations and Attributes for Image-Text Matching

Khoi Pham, Chuong Huynh, Ser-Nam Lim et al.

ViTEraser: Harnessing the Power of Vision Transformers for Scene Text Removal with SegMIM Pretraining

Dezhi Peng, Chongyu Liu, Yuliang Liu et al.

AAAI 2024 · arXiv:2306.12106

scene text removal, vision transformers, masked image modeling, text box segmentation (+3 more)

18 citations

#94

Text Image Inpainting via Global Structure-Guided Diffusion Models

Shipeng Zhu, Pengfei Fang, Chenjie Zhu et al.

AAAI 2024 · arXiv:2401.14832

text image inpainting, diffusion models, scene text recognition, handwritten text images (+4 more)

18 citations

#95

Towards Effective Usage of Human-Centric Priors in Diffusion Models for Text-based Human Image Generation

Junyan Wang, Zhenhong Sun, Stewart Tan et al.

Condition-Aware Neural Network for Controlled Image Generation

Han Cai, Muyang Li, Qinsheng Zhang et al.

Decomposing Semantic Shifts for Composed Image Retrieval

Xingyu Yang, Daqing Liu, Heng Zhang et al.

AAAI 2024 · arXiv:2309.09531

composed image retrieval, semantic shift decomposition, visual prototype generation, degradation and upgradation (+2 more)

17 citations

#98

FINECAPTION: Compositional Image Captioning Focusing on Wherever You Want at Any Granularity

Hang Hua, Qing Liu, Lingzhi Zhang et al.

Mismatch Quest: Visual and Textual Feedback for Image-Text Misalignment

Brian Gordon, Yonatan Bitton, Yonatan Shafir et al.

ECCV 2024 · arXiv:2312.03766

image-text alignment, misalignment explanation, visual grounding, vision language models (+3 more)

17 citations

#100

CADTalk: An Algorithm and Benchmark for Semantic Commenting of CAD Programs

Haocheng Yuan, Jing Xu, Hao Pan et al.

CVPR 2024

16 citations
