Poster "multimodal large language models" Papers

228 papers found • Page 2 of 5

Everything to the Synthetic: Diffusion-driven Test-time Adaptation via Synthetic-Domain Alignment

Jiayi Guo, Zhao Junhao, Chaoqun Du et al.

CVPR 2025 • arXiv:2406.04295

FaceBench: A Multi-View Multi-Level Facial Attribute VQA Dataset for Benchmarking Face Perception MLLMs

Xiaoqin Wang, Xusen Ma, Xianxu Hou et al.

CVPR 2025 • arXiv:2503.21457 • 8 citations

FALCON: Resolving Visual Redundancy and Fragmentation in High-resolution Multimodal Large Language Models via Visual Registers

Renshan Zhang, Rui Shao, Gongwei Chen et al.

ICCV 2025 • arXiv:2501.16297 • 11 citations

Feast Your Eyes: Mixture-of-Resolution Adaptation for Multimodal Large Language Models

Gen Luo, Yiyi Zhou, Yuxin Zhang et al.

ICLR 2025 • arXiv:2403.03003 • 100 citations

FinMMR: Make Financial Numerical Reasoning More Multimodal, Comprehensive, and Challenging

Zichen Tang, Haihong E, Jiacheng Liu et al.

ICCV 2025 • arXiv:2508.04625 • 2 citations

Fit the Distribution: Cross-Image/Prompt Adversarial Attacks on Multimodal Large Language Models

Hai Yan, Haijian Ma, Xiaowen Cai et al.

NEURIPS 2025

FlashSloth: Lightning Multimodal Large Language Models via Embedded Visual Compression

Bo Tong, Bokai Lai, Yiyi Zhou et al.

CVPR 2025 • arXiv:2412.04317 • 4 citations

FlexAC: Towards Flexible Control of Associative Reasoning in Multimodal Large Language Models

Shengming Yuan, Xinyu Lyu, Shuailong Wang et al.

NEURIPS 2025 • arXiv:2510.11190

FreeCus: Free Lunch Subject-driven Customization in Diffusion Transformers

Yanbing Zhang, Zhe Wang, Qin Zhou et al.

ICCV 2025 • arXiv:2507.15249 • 1 citation

From Imitation to Innovation: The Emergence of AI's Unique Artistic Styles and the Challenge of Copyright Protection

Zexi Jia, Chuanwei Huang, Hongyan Fei et al.

ICCV 2025 • arXiv:2507.04769 • 3 citations

From Objects to Anywhere: A Holistic Benchmark for Multi-level Visual Grounding in 3D Scenes

Tianxu Wang, Zhuofan Zhang, Ziyu Zhu et al.

NEURIPS 2025 • arXiv:2506.04897 • 1 citation

Generative Multimodal Pretraining with Discrete Diffusion Timestep Tokens

Kaihang Pan, Wang Lin, Zhongqi Yue et al.

CVPR 2025 • arXiv:2504.14666 • 18 citations

GenHancer: Imperfect Generative Models are Secretly Strong Vision-Centric Enhancers

Shijie Ma, Yuying Ge, Teng Wang et al.

ICCV 2025 • arXiv:2503.19480 • 9 citations

GenieBlue: Integrating both Linguistic and Multimodal Capabilities for Large Language Models on Mobile Devices

Xudong Lu, Yinghao Chen, Renshou Wu et al.

ICCV 2025 • arXiv:2503.06019

G-LLaVA: Solving Geometric Problem with Multi-Modal Large Language Model

Jiahui Gao, Renjie Pi, Jipeng Zhang et al.

ICLR 2025 • arXiv:2312.11370 • 170 citations

GoT: Unleashing Reasoning Capability of MLLM for Visual Generation and Editing

Rongyao Fang, Chengqi Duan, Kun Wang et al.

NEURIPS 2025 • 60 citations

GRAPHGPT-O: Synergistic Multimodal Comprehension and Generation on Graphs

Yi Fang, Bowen Jin, Jiacheng Shen et al.

CVPR 2025 • arXiv:2502.11925 • 3 citations

GraspCoT: Integrating Physical Property Reasoning for 6-DoF Grasping under Flexible Language Instructions

Xiaomeng Chu, Jiajun Deng, Guoliang You et al.

ICCV 2025 • arXiv:2503.16013 • 2 citations

Grounding Multimodal Large Language Model in GUI World

Weixian Lei, Difei Gao, Mike Zheng Shou

ICLR 2025

Guard Me If You Know Me: Protecting Specific Face-Identity from Deepfakes

Kaiqing Lin, Zhiyuan Yan, Ke-Yue Zhang et al.

NEURIPS 2025 • arXiv:2505.19582 • 2 citations

Guiding Cross-Modal Representations with MLLM Priors via Preference Alignment

Pengfei Zhao, Rongbo Luan, Wei Zhang et al.

NEURIPS 2025 • arXiv:2506.06970 • 1 citation

Hallucination at a Glance: Controlled Visual Edits and Fine-Grained Multimodal Learning

Tianyi Bai, Yuxuan Fan, Qiu Jiantao et al.

NEURIPS 2025 • arXiv:2506.07227 • 2 citations

Harnessing Webpage UIs for Text-Rich Visual Understanding

Junpeng Liu, Tianyue Ou, Yifan Song et al.

ICLR 2025 • arXiv:2410.13824 • 22 citations

HEIE: MLLM-Based Hierarchical Explainable AIGC Image Implausibility Evaluator

Fan Yang, Ru Zhen, Jianing Wang et al.

CVPR 2025 • arXiv:2411.17261 • 11 citations

Heuristic-Induced Multimodal Risk Distribution Jailbreak Attack for Multimodal Large Language Models

Ma Teng, Xiaojun Jia, Ranjie Duan et al.

ICCV 2025 • arXiv:2412.05934 • 21 citations

HierarQ: Task-Aware Hierarchical Q-Former for Enhanced Video Understanding

Shehreen Azad, Vibhav Vineet, Yogesh S. Rawat

CVPR 2025 • arXiv:2503.08585 • 12 citations

Hints of Prompt: Enhancing Visual Representation for Multimodal LLMs in Autonomous Driving

Hao Zhou, Zhanning Gao, Zhili Chen et al.

ICCV 2025 • arXiv:2411.13076 • 4 citations

HOIGen-1M: A Large-scale Dataset for Human-Object Interaction Video Generation

Kun Liu, Qi Liu, Xinchen Liu et al.

CVPR 2025 • arXiv:2503.23715 • 13 citations

HoloLLM: Multisensory Foundation Model for Language-Grounded Human Sensing and Reasoning

Chuhao Zhou, Jianfei Yang

NEURIPS 2025 • arXiv:2505.17645

How Can Objects Help Video-Language Understanding?

Zitian Tang, Shijie Wang, Junho Cho et al.

ICCV 2025 • arXiv:2504.07454 • 3 citations

Human-centered Interactive Learning via MLLMs for Text-to-Image Person Re-identification

Yang Qin, Chao Chen, Zhihang Fu et al.

CVPR 2025 • arXiv:2506.11036 • 8 citations

HyperGLM: HyperGraph for Video Scene Graph Generation and Anticipation

Trong-Thuan Nguyen, Pha Nguyen, Jackson Cothren et al.

CVPR 2025 • arXiv:2411.18042 • 9 citations

IDEA-Bench: How Far are Generative Models from Professional Designing?

Chen Liang, Lianghua Huang, Jingwu Fang et al.

CVPR 2025 • arXiv:2412.11767 • 4 citations

ImageGen-CoT: Enhancing Text-to-Image In-context Learning with Chain-of-Thought Reasoning

Jiaqi Liao, Zhengyuan Yang, Linjie Li et al.

ICCV 2025 • arXiv:2503.19312 • 21 citations

IMG: Calibrating Diffusion Models via Implicit Multimodal Guidance

Jiayi Guo, Chuanhao Yan, Xingqian Xu et al.

ICCV 2025 • arXiv:2509.26231 • 1 citation

InsightEdit: Towards Better Instruction Following for Image Editing

Yingjing Xu, Jie Kong, Jiazhi Wang et al.

CVPR 2025 • arXiv:2411.17323 • 10 citations

INS-MMBench: A Comprehensive Benchmark for Evaluating LVLMs' Performance in Insurance

Chenwei Lin, Hanjia Lyu, Xian Xu et al.

ICCV 2025 • arXiv:2406.09105 • 4 citations

Interpretable Bilingual Multimodal Large Language Model for Diverse Biomedical Tasks

Lehan Wang, Haonan Wang, Honglong Yang et al.

ICLR 2025 • arXiv:2410.18387 • 17 citations

Intervening Anchor Token: Decoding Strategy in Alleviating Hallucinations for MLLMs

Barrett Tang, Zile Huang, Chengzhi Liu et al.

ICLR 2025 • 20 citations

Is 'Right' Right? Enhancing Object Orientation Understanding in Multimodal Large Language Models through Egocentric Instruction Tuning

JiHyeok Jung, EunTae Kim, SeoYeon Kim et al.

CVPR 2025 • arXiv:2411.16761 • 3 citations

Is Your Multimodal Language Model Oversensitive to Safe Queries?

Xirui Li, Hengguang Zhou, Ruochen Wang et al.

ICLR 2025 • arXiv:2406.17806 • 20 citations

Jailbreak-AudioBench: In-Depth Evaluation and Analysis of Jailbreak Threats for Large Audio Language Models

Hao Cheng, Erjia Xiao, Jing Shao et al.

NEURIPS 2025 • arXiv:2501.13772 • 6 citations

Jailbreaking Multimodal Large Language Models via Shuffle Inconsistency

Shiji Zhao, Ranjie Duan, Fengxiang Wang et al.

ICCV 2025 • arXiv:2501.04931 • 28 citations

Janus-Pro-R1: Advancing Collaborative Visual Comprehension and Generation via Reinforcement Learning

Kaihang Pan, Yang Wu, Wendong Bu et al.

NEURIPS 2025 • arXiv:2506.01480 • 7 citations

Know "No" Better: A Data-Driven Approach for Enhancing Negation Awareness in CLIP

Junsung Park, Jungbeom Lee, Jongyoon Song et al.

ICCV 2025 • arXiv:2501.10913 • 14 citations

Learning from Videos for 3D World: Enhancing MLLMs with 3D Vision Geometry Priors

Duo Zheng, Shijia Huang, Yanyang Li et al.

NEURIPS 2025 • arXiv:2505.24625 • 24 citations

Learning to Instruct for Visual Instruction Tuning

Zhihan Zhou, Feng Hong, Jiaan Luo et al.

NEURIPS 2025 • arXiv:2503.22215 • 3 citations

Lie Detector: Unified Backdoor Detection via Cross-Examination Framework

Xuan Wang, Siyuan Liang, Dongping Liao et al.

NEURIPS 2025 • arXiv:2503.16872 • 4 citations

LLaVA-KD: A Framework of Distilling Multimodal Large Language Models

Yuxuan Cai, Jiangning Zhang, Haoyang He et al.

ICCV 2025 • arXiv:2410.16236 • 25 citations

LLaVA-SP: Enhancing Visual Representation with Visual Spatial Tokens for MLLMs

Haoran Lou, Chunxiao Fan, Ziyan Liu et al.

ICCV 2025 • arXiv:2507.00505