"image captioning" Papers

30 papers found

AdaDARE-gamma: Balancing Stability and Plasticity in Multi-modal LLMs through Efficient Adaptation

Jingyi Xie, Jintao Yang, Zhunchen Luo et al.

CVPR 2025

Advancing Myopia To Holism: Fully Contrastive Language-Image Pre-training

Haicheng Wang, Chen Ju, Weixiong Lin et al.

CVPR 2025arXiv:2412.00440
10
citations

BACON: Improving Clarity of Image Captions via Bag-of-Concept Graphs

Zhantao Yang, Ruili Feng, Keyu Yan et al.

CVPR 2025arXiv:2407.03314
3
citations

BEEM: Boosting Performance of Early Exit DNNs using Multi-Exit Classifiers as Experts

Divya Jyoti Bajpai, Manjesh Kumar Hanawal

ICLR 2025arXiv:2502.00745
3
citations

Controlling Multimodal LLMs via Reward-guided Decoding

Oscar Mañas, Pierluca D'Oro, Koustuv Sinha et al.

ICCV 2025arXiv:2508.11616

Correlating instruction-tuning (in multimodal models) with vision-language processing (in the brain)

SUBBA REDDY OOTA, Akshett Rai Jindal, Ishani Mondal et al.

ICLR 2025arXiv:2505.20029
5
citations

Edit Flows: Variable Length Discrete Flow Matching with Sequence-Level Edit Operations

Marton Havasi, Brian Karrer, Itai Gat et al.

NEURIPS 2025

Embodied Image Captioning: Self-supervised Learning Agents for Spatially Coherent Image Descriptions

Tommaso Galliena, Tommaso Apicella, Stefano Rosa et al.

ICCV 2025highlightarXiv:2504.08531
1
citations

Flowing from Words to Pixels: A Noise-Free Framework for Cross-Modality Evolution

Qihao Liu, Xi Yin, Alan L. Yuille et al.

CVPR 2025highlightarXiv:2412.15213
12
citations

HalLoc: Token-level Localization of Hallucinations for Vision Language Models

Eunkyu Park, Minyeong Kim, Gunhee Kim

CVPR 2025arXiv:2506.10286
3
citations

Hand1000: Generating Realistic Hands from Text with Only 1,000 Images

Haozhuo Zhang, Bin Zhu, Yu Cao et al.

AAAI 2025paperarXiv:2408.15461
7
citations

INTER: Mitigating Hallucination in Large Vision-Language Models by Interaction Guidance Sampling

Xin Dong, Shichao Dong, Jin Wang et al.

ICCV 2025arXiv:2507.05056
3
citations

Mimic In-Context Learning for Multimodal Tasks

Yuchu Jiang, Jiale Fu, chenduo hao et al.

CVPR 2025arXiv:2504.08851
9
citations

Vision‑Language‑Vision Auto‑Encoder: Scalable Knowledge Distillation from Diffusion Models

Tiezheng Zhang, Yitong Li, Yu-Cheng Chou et al.

NEURIPS 2025arXiv:2507.07104
2
citations

VL-ICL Bench: The Devil in the Details of Multimodal In-Context Learning

Yongshuo Zong, Ondrej Bohdal, Timothy Hospedales

ICLR 2025arXiv:2403.13164
18
citations

Cycle-Consistency Learning for Captioning and Grounding

Ning Wang, Jiajun Deng, Mingbo Jia

AAAI 2024paperarXiv:2312.15162
14
citations

DECap: Towards Generalized Explicit Caption Editing via Diffusion Mechanism

Zhen Wang, Xinyun Jiang, Jun Xiao et al.

ECCV 2024arXiv:2311.14920
5
citations

Differentially Private Representation Learning via Image Captioning

Tom Sander, Yaodong Yu, Maziar Sanjabi et al.

ICML 2024arXiv:2403.02506
7
citations

Getting it Right: Improving Spatial Consistency in Text-to-Image Models

Agneet Chatterjee, Gabriela Ben Melech Stan, Estelle Guez Aflalo et al.

ECCV 2024arXiv:2404.01197
26
citations

Image Captioning with Multi-Context Synthetic Data

Feipeng Ma, Y. Zhou, Fengyun Rao et al.

AAAI 2024paperarXiv:2305.18072
18
citations

Introducing Routing Functions to Vision-Language Parameter-Efficient Fine-Tuning with Low-Rank Bottlenecks

Tingyu Qu, Tinne Tuytelaars, Marie-Francine Moens

ECCV 2024arXiv:2403.09377
4
citations

LookupViT: Compressing visual information to a limited number of tokens

Rajat Koner, Gagan Jain, Sujoy Paul et al.

ECCV 2024arXiv:2407.12753
16
citations

MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training

Pavan Kumar Anasosalu Vasu, Hadi Pouransari, Fartash Faghri et al.

CVPR 2024arXiv:2311.17049
87
citations

Model Tailor: Mitigating Catastrophic Forgetting in Multi-modal Large Language Models

Didi Zhu, Zhongyi Sun, Zexi Li et al.

ICML 2024arXiv:2402.12048
48
citations

Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models

Zhang Li, Biao Yang, Qiang Liu et al.

CVPR 2024highlightarXiv:2311.06607
391
citations

On the Robustness of Large Multimodal Models Against Image Adversarial Attacks

Xuanming Cui, Alejandro Aparcedo, Young Kyun Jang et al.

CVPR 2024arXiv:2312.03777
89
citations

Summarize the Past to Predict the Future: Natural Language Descriptions of Context Boost Multimodal Object Interaction Anticipation

Razvan Pasca, Alexey Gavryushin, Muhammad Hamza et al.

CVPR 2024arXiv:2301.09209
22
citations

SyCoCa: Symmetrizing Contrastive Captioners with Attentive Masking for Multimodal Alignment

Ziping Ma, Furong Xu, Jian liu et al.

ICML 2024arXiv:2401.02137
7
citations

TrojVLM: Backdoor Attack Against Vision Language Models

Weimin Lyu, Lu Pang, Tengfei Ma et al.

ECCV 2024arXiv:2409.19232
25
citations

Visual Fact Checker: Enabling High-Fidelity Detailed Caption Generation

Yunhao Ge, Xiaohui Zeng, Jacob Huffman et al.

CVPR 2024arXiv:2404.19752
35
citations