"vision-language tasks" Papers

17 papers found

3DGraphLLM: Combining Semantic Graphs and Large Language Models for 3D Scene Understanding

Tatiana Zemskova, Dmitry Yudin

ICCV 2025 (poster) · arXiv:2412.18450
11 citations

Argus: Vision-Centric Reasoning with Grounded Chain-of-Thought

Yunze Man, De-An Huang, Guilin Liu et al.

CVPR 2025 (poster) · arXiv:2505.23766
19 citations

Dynamic Mixture of Experts: An Auto-Tuning Approach for Efficient Transformer Models

Yongxin Guo, Zhenglin Cheng, Xiaoying Tang et al.

ICLR 2025 (poster) · arXiv:2405.14297
33 citations

Gatekeeper: Improving Model Cascades Through Confidence Tuning

Stephan Rabanser, Nathalie Rauschmayr, Achin Kulshrestha et al.

NeurIPS 2025 (poster) · arXiv:2502.19335
4 citations

Head Pursuit: Probing Attention Specialization in Multimodal Transformers

Lorenzo Basile, Valentino Maiorca, Diego Doimo et al.

NeurIPS 2025 (spotlight) · arXiv:2510.21518
2 citations

MIEB: Massive Image Embedding Benchmark

Chenghao Xiao, Isaac Chung, Imene Kerboua et al.

ICCV 2025 (poster) · arXiv:2504.10471
6 citations

Release the Powers of Prompt Tuning: Cross-Modality Prompt Transfer

Ningyuan Zhang, Jie Lu, Keqiuyin Li et al.

ICLR 2025 (poster)
1 citation

See What You Are Told: Visual Attention Sink in Large Multimodal Models

Seil Kang, Jinyeong Kim, Junhyeok Kim et al.

ICLR 2025 (poster) · arXiv:2503.03321
52 citations

Show-o: One Single Transformer to Unify Multimodal Understanding and Generation

Jinheng Xie, Weijia Mao, Zechen Bai et al.

ICLR 2025 (poster) · arXiv:2408.12528
455 citations

Towards Semantic Equivalence of Tokenization in Multimodal LLM

Shengqiong Wu, Hao Fei, Xiangtai Li et al.

ICLR 2025 (poster) · arXiv:2406.05127
58 citations

UFO: A Unified Approach to Fine-grained Visual Perception via Open-ended Language Interface

Hao Tang, Chen-Wei Xie, Haiyang Wang et al.

NeurIPS 2025 (spotlight) · arXiv:2503.01342
14 citations

Cycle-Consistency Learning for Captioning and Grounding

Ning Wang, Jiajun Deng, Mingbo Jia

AAAI 2024 (paper) · arXiv:2312.15162
13 citations

Differentially Private Representation Learning via Image Captioning

Tom Sander, Yaodong Yu, Maziar Sanjabi et al.

ICML 2024 (poster)

FedDAT: An Approach for Foundation Model Finetuning in Multi-Modal Heterogeneous Federated Learning

Haokun Chen, Yao Zhang, Denis Krompass et al.

AAAI 2024 (paper) · arXiv:2308.12305
86 citations

Introducing Routing Functions to Vision-Language Parameter-Efficient Fine-Tuning with Low-Rank Bottlenecks

Tingyu Qu, Tinne Tuytelaars, Marie-Francine Moens

ECCV 2024 (poster) · arXiv:2403.09377
4 citations

SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models

Dongyang Liu, Renrui Zhang, Longtian Qiu et al.

ICML 2024 (poster)

VIGC: Visual Instruction Generation and Correction

Bin Wang, Fan Wu, Xiao Han et al.

AAAI 2024 (paper) · arXiv:2308.12714
84 citations