Image Captioning
Generating text descriptions of images
Top Papers
Scaling Autoregressive Models for Content-Rich Text-to-Image Generation
Xin Li, Jing Yu Koh, Alexander Ku et al.
CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer
Zhuoyi Yang, Jiayan Teng, Wendi Zheng et al.
InstantBooth: Personalized Text-to-Image Generation without Test-Time Finetuning
Jing Shi, Wei Xiong, Zhe Lin et al.
Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers
Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace et al.
Follow Your Pose: Pose-Guided Text-to-Video Generation Using Pose-Free Videos
Yue Ma, Yingqing He, Xiaodong Cun et al.
Text-IF: Leveraging Semantic Text Guidance for Degradation-Aware and Interactive Image Fusion
Xunpeng Yi, Han Xu, Hao Zhang et al.
MIGC: Multi-Instance Generation Controller for Text-to-Image Synthesis
Dewei Zhou, You Li, Fan Ma et al.
AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark
Wenhao Chai, Enxin Song, Yilun Du et al.
MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training
Pavan Kumar Anasosalu Vasu, Hadi Pouransari, Fartash Faghri et al.
Video ReCap: Recursive Captioning of Hour-Long Videos
Md Mohaiminul Islam, Vu Bao Ngan Ho, Xitong Yang et al.
Learning Multi-Dimensional Human Preference for Text-to-Image Generation
Sixian Zhang, Bohan Wang, Junqiang Wu et al.
Language-Image Pre-training with Long Captions
Kecheng Zheng, Yifei Zhang, Wei Wu et al.
VeCLIP: Improving CLIP Training via Visual-enriched Captions
Zhengfeng Lai, Haotian Zhang, Bowen Zhang et al.
SECap: Speech Emotion Captioning with Large Language Model
Yaoxun Xu, Hangting Chen, Jianwei Yu et al.
Harnessing the Power of MLLMs for Transferable Text-to-Image Person ReID
Wentao Tan, Changxing Ding, Jiayu Jiang et al.
A Recipe for Scaling up Text-to-Video Generation with Text-free Videos
Xiang Wang, Shiwei Zhang, Hangjie Yuan et al.
Describe Anything: Detailed Localized Image and Video Captioning
Long Lian, Yifan Ding, Yunhao Ge et al.
JeDi: Joint-Image Diffusion Models for Finetuning-Free Personalized Text-to-Image Generation
Yu Zeng, Vishal M. Patel, Haochen Wang et al.
MultiBooth: Towards Generating All Your Concepts in an Image from Text
Chenyang Zhu, Kai Li, Yue Ma et al.
T-MARS: Improving Visual Representations by Circumventing Text Feature Learning
Pratyush Maini, Sachin Goyal, Zachary Lipton et al.
Visual Fact Checker: Enabling High-Fidelity Detailed Caption Generation
Yunhao Ge, Xiaohui Zeng, Jacob Huffman et al.
Make It Count: Text-to-Image Generation with an Accurate Number of Objects
Lital Binyamin, Yoad Tewel, Hilit Segev et al.
LLM4GEN: Leveraging Semantic Representation of LLMs for Text-to-Image Generation
Mushui Liu, Yuhang Ma, Zhen Yang et al.
View Selection for 3D Captioning via Diffusion Ranking
Tiange Luo, Justin Johnson, Honglak Lee
Enriching Multimodal Sentiment Analysis Through Textual Emotional Descriptions of Visual-Audio Content
Sheng Wu, Dongxiao He, Xiaobao Wang et al.
VideoCon: Robust Video-Language Alignment via Contrast Captions
Hritik Bansal, Yonatan Bitton, Idan Szpektor et al.
Learning Disentangled Identifiers for Action-Customized Text-to-Image Generation
Siteng Huang, Biao Gong, Yutong Feng et al.
DIBS: Enhancing Dense Video Captioning with Unlabeled Videos via Pseudo Boundary Enrichment and Online Refinement
Hao Wu, Huabin Liu, Yu Qiao et al.
Zero-shot Referring Expression Comprehension via Structural Similarity Between Images and Captions
Zeyu Han, Fangrui Zhu, Qianru Lao et al.
Improving Cross-Modal Alignment with Synthetic Pairs for Text-Only Image Captioning
Zhiyue Liu, Jinyuan Liu, Fanrong Ma
STIV: Scalable Text and Image Conditioned Video Generation
Zongyu Lin, Wei Liu, Chen Chen et al.
SmartControl: Enhancing ControlNet for Handling Rough Visual Conditions
Xiaoyu Liu, Yuxiang Wei, Ming Liu et al.
Image-Text Co-Decomposition for Text-Supervised Semantic Segmentation
Ji-Jia Wu, Andy Chia-Hao Chang, Chieh-Yu Chuang et al.
OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation
Pengfei Zhou, Xiaopeng Peng, Jiajun Song et al.
Benchmarking Large Vision-Language Models via Directed Scene Graph for Comprehensive Image Captioning
Fan Lu, Wei Wu, Kecheng Zheng et al.
Composing Object Relations and Attributes for Image-Text Matching
Khoi Pham, Chuong Huynh, Ser-Nam Lim et al.
FINECAPTION: Compositional Image Captioning Focusing on Wherever You Want at Any Granularity
Hang Hua, Qing Liu, Lingzhi Zhang et al.
Mismatch Quest: Visual and Textual Feedback for Image-Text Misalignment
Brian Gordon, Yonatan Bitton, Yonatan Shafir et al.
Painting with Words: Elevating Detailed Image Captioning with Benchmark and Alignment Learning
Qinghao Ye, Xianhan Zeng, Fu Li et al.
Exploiting Auxiliary Caption for Video Grounding
Hongxiang Li, Meng Cao, Xuxin Cheng et al.
Hyperbolic Learning with Synthetic Captions for Open-World Detection
Fanjie Kong, Yanbei Chen, Jiarui Cai et al.
Generating Multi-Image Synthetic Data for Text-to-Image Customization
Nupur Kumari, Xi Yin, Jun-Yan Zhu et al.
Customization Assistant for Text-to-Image Generation
Yufan Zhou, Ruiyi Zhang, Jiuxiang Gu et al.
CoBIT: A Contrastive Bi-directional Image-Text Generation Model
Haoxuan You, Xiaoyue Guo, Zhecan Wang et al.
Cycle-Consistency Learning for Captioning and Grounding
Ning Wang, Jiajun Deng, Mingbo Jia
BRIDGE: Bridging Gaps in Image Captioning Evaluation with Stronger Visual Cues
Sara Sarto, Marcella Cornia, Lorenzo Baraldi et al.
CoMM: A Coherent Interleaved Image-Text Dataset for Multimodal Understanding and Generation
Wei Chen, Lin Li, Yongqi Yang et al.
BlobGEN-Vid: Compositional Text-to-Video Generation with Blob Video Representations
Weixi Feng, Chao Liu, Sifei Liu et al.
Image Over Text: Transforming Formula Recognition Evaluation with Character Detection Matching
Bin Wang, Fan Wu, Linke Ouyang et al.
Light-T2M: A Lightweight and Fast Model for Text-to-motion Generation
Ling-An Zeng, Guohong Huang, Gaojie Wu et al.
VIXEN: Visual Text Comparison Network for Image Difference Captioning
Alexander Black, Jing Shi, Yifei Fan et al.
PreciseCam: Precise Camera Control for Text-to-Image Generation
Edurne Bernal-Berdun, Ana Serrano, Belen Masia et al.
BizGen: Advancing Article-level Visual Text Rendering for Infographics Generation
Yuyang Peng, Shishi Xiao, Keming Wu et al.
COCONut-PanCap: Joint Panoptic Segmentation and Grounded Captions for Fine-Grained Understanding and Generation
Xueqing Deng, Linjie Yang, Qihang Yu et al.
Progress-Aware Video Frame Captioning
Zihui Xue, Joungbin An, Xitong Yang et al.
Removing Distributional Discrepancies in Captions Improves Image-Text Alignment
Mu Cai, Haotian Liu, Yuheng Li et al.
L-MAGIC: Language Model Assisted Generation of Images with Coherence
Zhipeng Cai, Matthias Mueller, Reiner Birkl et al.
Instruction-guided Multi-Granularity Segmentation and Captioning with Large Multimodal Model
Xu Yuan, Li Zhou, Zenghui Sun et al.
Prompt Augmentation for Self-supervised Text-guided Image Manipulation
Rumeysa Bodur, Binod Bhattarai, Tae-Kyun Kim
CompCap: Improving Multimodal Large Language Models with Composite Captions
Xiaohui Chen, Satya Narayan Shukla, Mahmoud Azab et al.
Fix-CLIP: Dual-Branch Hierarchical Contrastive Learning via Synthetic Captions for Better Understanding of Long Text
Bingchao Wang, Zhiwei Ning, Jianyu Ding et al.
Modeling Thousands of Human Annotators for Generalizable Text-to-Image Person Re-identification
Jiayu Jiang, Changxing Ding, Wentao Tan et al.
EvdCLIP: Improving Vision-Language Retrieval with Entity Visual Descriptions from Large Language Models
Guanghao Meng, Sunan He, Jinpeng Wang et al.
Text and Image Are Mutually Beneficial: Enhancing Training-Free Few-Shot Classification with CLIP
Yayuan Li, Jintao Guo, Lei Qi et al.
HiCM²: Hierarchical Compact Memory Modeling for Dense Video Captioning
Minkuk Kim, Hyeon Bae Kim, Jinyoung Moon et al.
DECap: Towards Generalized Explicit Caption Editing via Diffusion Mechanism
Zhen Wang, Xinyun Jiang, Jun Xiao et al.
TF-TI2I: Training-Free Text-and-Image-to-Image Generation via Multi-Modal Implicit-Context Learning In Text-to-Image Models
Teng-Fang Hsiao, Bo-Kai Ruan, Yi-Lun Wu et al.
A Comprehensive Study of Decoder-Only LLMs for Text-to-Image Generation
Andrew Z Wang, Songwei Ge, Tero Karras et al.
MultiGen: Zero-shot Image Generation from Multi-modal Prompts
Zhi-Fan Wu, Lianghua Huang, Wei Wang et al.
Zero-shot Text-guided Infinite Image Synthesis with LLM guidance
Soyeong Kwon, Taegyeong Lee, Taehwan Kim
SC-Captioner: Improving Image Captioning with Self-Correction by Reinforcement Learning
Lin Zhang, Xianfang Zeng, Kangcong Li et al.
Shot-by-Shot: Film-Grammar-Aware Training-Free Audio Description Generation
Junyu Xie, Tengda Han, Max Bain et al.
Fair Generation without Unfair Distortions: Debiasing Text-to-Image Generation with Entanglement-Free Attention
Jeonghoon Park, Juyoung Lee, Chaeyeon Chung et al.
Focus-N-Fix: Region-Aware Fine-Tuning for Text-to-Image Generation
Xiaoying Xing, Avinab Saha, Junfeng He et al.
BACON: Improving Clarity of Image Captions via Bag-of-Concept Graphs
Zhantao Yang, Ruili Feng, Keyu Yan et al.
Adaptive Markup Language Generation for Contextually-Grounded Visual Document Understanding
Han Xiao, Yina Xie, Guanxin Tan et al.
Image Regeneration: Evaluating Text-to-Image Model via Generating Identical Image with Multimodal Large Language Models
Chutian Meng, Fan Ma, Jiaxu Miao et al.
Zero-Shot Image Captioning with Multi-type Entity Representations
Delong Zeng, Ying Shen, Man Lin et al.
CAPability: A Comprehensive Visual Caption Benchmark for Evaluating Both Correctness and Thoroughness
Zhihang Liu, Chen-Wei Xie, Bin Wen et al.
Describe, Don't Dictate: Semantic Image Editing with Natural Language Intent
En Ci, Shanyan Guan, Yanhao Ge et al.
ConText-CIR: Learning from Concepts in Text for Composed Image Retrieval
Eric Xing, Pranavi Kolouju, Robert Pless et al.
Unleashing Text-to-Image Diffusion Prior for Zero-Shot Image Captioning
Jianjie Luo, Jingwen Chen, Yehao Li et al.
PhyS-EdiT: Physics-aware Semantic Image Editing with Text Description
Ziqi Cai, Shuchen Weng, Yifei Xia et al.
Unbiasing through Textual Descriptions: Mitigating Representation Bias in Video Benchmarks
Nina Shvetsova, Arsha Nagrani, Bernt Schiele et al.
Embodied Image Captioning: Self-supervised Learning Agents for Spatially Coherent Image Descriptions
Tommaso Galliena, Tommaso Apicella, Stefano Rosa et al.
Type-R: Automatically Retouching Typos for Text-to-Image Generation
Wataru Shimoda, Naoto Inoue, Daichi Haraguchi et al.
SGDiff: Scene Graph Guided Diffusion Model for Image Collaborative SegCaptioning
Xu Zhang, Jin Yuan, Hanwang Zhang et al.
Query-centric Audio-Visual Cognition Network for Moment Retrieval, Segmentation and Step-Captioning
Yunbin Tu, Liang Li, Li Su et al.
TextCenGen: Attention-Guided Text-Centric Background Adaptation for Text-to-Image Generation
Tianyi Liang, Jiangqi Liu, Yifei Huang et al.
Localizing and Editing Knowledge In Text-to-Image Generative Models
Samyadeep Basu, Nanxuan Zhao, Vlad Morariu et al.
BrainSCUBA: Fine-Grained Natural Language Captions of Visual Cortex Selectivity
Andrew Luo, Maggie Henderson, Michael Tarr et al.
MAGICK: A Large-scale Captioned Dataset from Matting Generated Images using Chroma Keying
Ryan Burgert, Brian Price, Jason Kuen et al.
Alt-Text with Context: Improving Accessibility for Images on Twitter
Nikita Srivatsan, Sofia Samaniego, Omar Florez et al.
Fine-Grained Captioning of Long Videos through Scene Graph Consolidation
Sanghyeok Chu, Seonguk Seo, Bohyung Han
Polos: Multimodal Metric Learning from Human Feedback for Image Captioning
Yuiga Wada, Kanta Kaneda, Daichi Saito et al.
LEDITS++: Limitless Image Editing using Text-to-Image Models
Manuel Brack, Felix Friedrich, Katharina Kornmeier et al.
Tag2Text: Guiding Vision-Language Model via Image Tagging
Xinyu Huang, Youcai Zhang, Jinyu Ma et al.
EVCap: Retrieval-Augmented Image Captioning with External Visual-Name Memory for Open-World Comprehension
Jiaxuan Li, Duc Minh Vo, Akihiro Sugimoto et al.
Learning Continuous 3D Words for Text-to-Image Generation
Ta-Ying Cheng, Matheus Gadelha, Thibault Groueix et al.
A Picture is Worth More Than 77 Text Tokens: Evaluating CLIP-Style Models on Dense Captions
Jack Urbanek, Florian Bordes, Pietro Astolfi et al.