"image captioning" Papers
30 papers found
AdaDARE-gamma: Balancing Stability and Plasticity in Multi-modal LLMs through Efficient Adaptation
Jingyi Xie, Jintao Yang, Zhunchen Luo et al.
Advancing Myopia To Holism: Fully Contrastive Language-Image Pre-training
Haicheng Wang, Chen Ju, Weixiong Lin et al.
BACON: Improving Clarity of Image Captions via Bag-of-Concept Graphs
Zhantao Yang, Ruili Feng, Keyu Yan et al.
BEEM: Boosting Performance of Early Exit DNNs using Multi-Exit Classifiers as Experts
Divya Jyoti Bajpai, Manjesh Kumar Hanawal
Controlling Multimodal LLMs via Reward-guided Decoding
Oscar Mañas, Pierluca D'Oro, Koustuv Sinha et al.
Correlating instruction-tuning (in multimodal models) with vision-language processing (in the brain)
Subba Reddy Oota, Akshett Rai Jindal, Ishani Mondal et al.
Edit Flows: Variable Length Discrete Flow Matching with Sequence-Level Edit Operations
Marton Havasi, Brian Karrer, Itai Gat et al.
Embodied Image Captioning: Self-supervised Learning Agents for Spatially Coherent Image Descriptions
Tommaso Galliena, Tommaso Apicella, Stefano Rosa et al.
Flowing from Words to Pixels: A Noise-Free Framework for Cross-Modality Evolution
Qihao Liu, Xi Yin, Alan L. Yuille et al.
HalLoc: Token-level Localization of Hallucinations for Vision Language Models
Eunkyu Park, Minyeong Kim, Gunhee Kim
Hand1000: Generating Realistic Hands from Text with Only 1,000 Images
Haozhuo Zhang, Bin Zhu, Yu Cao et al.
INTER: Mitigating Hallucination in Large Vision-Language Models by Interaction Guidance Sampling
Xin Dong, Shichao Dong, Jin Wang et al.
Mimic In-Context Learning for Multimodal Tasks
Yuchu Jiang, Jiale Fu, Chenduo Hao et al.
Vision-Language-Vision Auto-Encoder: Scalable Knowledge Distillation from Diffusion Models
Tiezheng Zhang, Yitong Li, Yu-Cheng Chou et al.
VL-ICL Bench: The Devil in the Details of Multimodal In-Context Learning
Yongshuo Zong, Ondrej Bohdal, Timothy Hospedales
Cycle-Consistency Learning for Captioning and Grounding
Ning Wang, Jiajun Deng, Mingbo Jia
DECap: Towards Generalized Explicit Caption Editing via Diffusion Mechanism
Zhen Wang, Xinyun Jiang, Jun Xiao et al.
Differentially Private Representation Learning via Image Captioning
Tom Sander, Yaodong Yu, Maziar Sanjabi et al.
Getting it Right: Improving Spatial Consistency in Text-to-Image Models
Agneet Chatterjee, Gabriela Ben Melech Stan, Estelle Guez Aflalo et al.
Image Captioning with Multi-Context Synthetic Data
Feipeng Ma, Y. Zhou, Fengyun Rao et al.
Introducing Routing Functions to Vision-Language Parameter-Efficient Fine-Tuning with Low-Rank Bottlenecks
Tingyu Qu, Tinne Tuytelaars, Marie-Francine Moens
LookupViT: Compressing visual information to a limited number of tokens
Rajat Koner, Gagan Jain, Sujoy Paul et al.
MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training
Pavan Kumar Anasosalu Vasu, Hadi Pouransari, Fartash Faghri et al.
Model Tailor: Mitigating Catastrophic Forgetting in Multi-modal Large Language Models
Didi Zhu, Zhongyi Sun, Zexi Li et al.
Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models
Zhang Li, Biao Yang, Qiang Liu et al.
On the Robustness of Large Multimodal Models Against Image Adversarial Attacks
Xuanming Cui, Alejandro Aparcedo, Young Kyun Jang et al.
Summarize the Past to Predict the Future: Natural Language Descriptions of Context Boost Multimodal Object Interaction Anticipation
Razvan Pasca, Alexey Gavryushin, Muhammad Hamza et al.
SyCoCa: Symmetrizing Contrastive Captioners with Attentive Masking for Multimodal Alignment
Ziping Ma, Furong Xu, Jian Liu et al.
TrojVLM: Backdoor Attack Against Vision Language Models
Weimin Lyu, Lu Pang, Tengfei Ma et al.
Visual Fact Checker: Enabling High-Fidelity Detailed Caption Generation
Yunhao Ge, Xiaohui Zeng, Jacob Huffman et al.