"multimodal large language models" Papers

310 papers found • Page 7 of 7

SemGrasp: Semantic Grasp Generation via Language Aligned Discretization

Kailin Li, Jingbo Wang, Lixin Yang et al.

ECCV 2024 • arXiv:2404.03590 • 34 citations

SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models

Yuzhou Huang, Liangbin Xie, Xintao Wang et al.

CVPR 2024 (highlight) • arXiv:2312.06739 • 143 citations

The All-Seeing Project V2: Towards General Relation Comprehension of the Open World

Weiyun Wang, Yiming Ren, Haowen Luo et al.

ECCV 2024 • arXiv:2402.19474 • 89 citations

UniCode: Learning a Unified Codebook for Multimodal Large Language Models

Sipeng Zheng, Bohan Zhou, Yicheng Feng et al.

ECCV 2024 • arXiv:2403.09072 • 14 citations

Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization

Yang Jin, Zhicheng Sun, Kun Xu et al.

ICML 2024 (oral) • arXiv:2402.03161 • 82 citations

Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition

Hao Fei, Shengqiong Wu, Wei Ji et al.

ICML 2024 (oral) • arXiv:2501.03230 • 146 citations

VIGC: Visual Instruction Generation and Correction

Théo Delemazure, Jérôme Lang, Grzegorz Pierczyński

AAAI 2024 • arXiv:2308.12714 • 88 citations

WebLINX: Real-World Website Navigation with Multi-Turn Dialogue

Xing Han Lù, Zdeněk Kasner, Siva Reddy

ICML 2024 (spotlight) • arXiv:2402.05930 • 121 citations

When Do We Not Need Larger Vision Models?

Baifeng Shi, Ziyang Wu, Maolin Mao et al.

ECCV 2024 • arXiv:2403.13043 • 71 citations

X-Former: Unifying Contrastive and Reconstruction Learning for MLLMs

Swetha Sirnam, Jinyu Yang, Tal Neiman et al.

ECCV 2024 • arXiv:2407.13851 • 11 citations