"multimodal reasoning" Papers

47 papers found

Accelerating Multimodal Large Language Models via Dynamic Visual-Token Exit and the Empirical Findings

Qiong Wu, Wenhao Lin, Yiyi Zhou et al.

NEURIPS 2025 · arXiv:2411.19628
5 citations

Aligning Vision to Language: Annotation-Free Multimodal Knowledge Graph Construction for Enhanced LLMs Reasoning

Junming Liu, Siyuan Meng, Yanting Gao et al.

ICCV 2025 · arXiv:2503.12972
13 citations

Argus: Vision-Centric Reasoning with Grounded Chain-of-Thought

Yunze Man, De-An Huang, Guilin Liu et al.

CVPR 2025 · arXiv:2505.23766
21 citations

Assessing Modality Bias in Video Question Answering Benchmarks with Multimodal Large Language Models

Jean Park, Kuk Jin Jang, Basam Alasaly et al.

AAAI 2025 · arXiv:2408.12763
15 citations

AVHBench: A Cross-Modal Hallucination Benchmark for Audio-Visual Large Language Models

Kim Sung-Bin, Oh Hyun-Bin, Lee Jung-Mok et al.

ICLR 2025 · arXiv:2410.18325
19 citations

Bifrost-1: Bridging Multimodal LLMs and Diffusion Models with Patch-level CLIP Latents

Han Lin, Jaemin Cho, Amir Zadeh et al.

NEURIPS 2025 · arXiv:2508.05954
6 citations

ChartSketcher: Reasoning with Multimodal Feedback and Reflection for Chart Understanding

Muye Huang, Lingling Zhang, Jie Ma et al.

NEURIPS 2025 · arXiv:2505.19076
5 citations

Crafting Dynamic Virtual Activities with Advanced Multimodal Models

Changyang Li, Qingan Yan, Minyoung Kim et al.

ISMAR 2025 · arXiv:2406.17582

Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning

Di Zhang, Jingdi Lei, Junxian Li et al.

CVPR 2025 · arXiv:2411.18203
33 citations

Cross-Lingual Text-Rich Visual Comprehension: An Information Theory Perspective

Xinmiao Yu, Xiaocheng Feng, Yun Li et al.

AAAI 2025 · arXiv:2412.17787

Do Vision-Language Models Represent Space and How? Evaluating Spatial Frame of Reference under Ambiguities

Zheyuan Zhang, Fengyuan Hu, Jayjun Lee et al.

ICLR 2025 · arXiv:2410.17385
41 citations

DreamPRM: Domain-reweighted Process Reward Model for Multimodal Reasoning

Qi Cao, Ruiyi Wang, Ruiyi Zhang et al.

NEURIPS 2025 · arXiv:2505.20241
9 citations

ExpertAF: Expert Actionable Feedback from Video

Kumar Ashutosh, Tushar Nagarajan, Georgios Pavlakos et al.

CVPR 2025 · arXiv:2408.00672
11 citations

Fast-in-Slow: A Dual-System VLA Model Unifying Fast Manipulation within Slow Reasoning

Hao Chen, Jiaming Liu, Chenyang Gu et al.

NEURIPS 2025
27 citations

Language-Bias-Resilient Visual Question Answering via Adaptive Multi-Margin Collaborative Debiasing

Huanjia Zhu, Shuyuan Zheng, Yishu Liu et al.

NEURIPS 2025

LongPerceptualThoughts: Distilling System-2 Reasoning for System-1 Perception

Yuan-Hong Liao, Sven Elflein, Liu He et al.

COLM 2025 · arXiv:2504.15362
9 citations

Mellow: a small audio language model for reasoning

Soham Deshmukh, Satvik Dixit, Rita Singh et al.

NEURIPS 2025 · arXiv:2503.08540
19 citations

MemEIC: A Step Toward Continual and Compositional Knowledge Editing

Jin Seong, Jiyun Park, Wencke Liermann et al.

NEURIPS 2025 · arXiv:2510.25798

MINT-CoT: Enabling Interleaved Visual Tokens in Mathematical Chain-of-Thought Reasoning

Xinyan Chen, Renrui Zhang, Dongzhi Jiang et al.

NEURIPS 2025 · arXiv:2506.05331
24 citations

MMAT-1M: A Large Reasoning Dataset for Multimodal Agent Tuning

Tianhong Gao, Yannian Fu, Weiqun Wu et al.

ICCV 2025 · arXiv:2507.21924
1 citation

MMMG: A Massive, Multidisciplinary, Multi-Tier Generation Benchmark for Text-to-Image Reasoning

Yuxuan Luo, Ryan Yuan, Junwen Chen et al.

NEURIPS 2025 · arXiv:2506.10963
3 citations

MTBBench: A Multimodal Sequential Clinical Decision-Making Benchmark in Oncology

Kiril Vasilev, Alexandre Misrahi, Eeshaan Jain et al.

NEURIPS 2025 · arXiv:2511.20490
1 citation

NL-Eye: Abductive NLI For Images

Mor Ventura, Michael Toker, Nitay Calderon et al.

ICLR 2025 · arXiv:2410.02613
3 citations

NoisyGRPO: Incentivizing Multimodal CoT Reasoning via Noise Injection and Bayesian Estimation

Longtian Qiu, Shan Ning, Jiaxuan Sun et al.

NEURIPS 2025 · arXiv:2510.21122
1 citation

OpenVLThinker: Complex Vision-Language Reasoning via Iterative SFT-RL Cycles

Yihe Deng, Hritik Bansal, Fan Yin et al.

NEURIPS 2025 · arXiv:2503.17352
15 citations

Point-RFT: Improving Multimodal Reasoning with Visually Grounded Reinforcement Finetuning

Minheng Ni, Zhengyuan Yang, Linjie Li et al.

NEURIPS 2025 · arXiv:2505.19702
13 citations

Scientists' First Exam: Probing Cognitive Abilities of MLLM via Perception, Understanding, and Reasoning

Yuhao Zhou, Yiheng Wang, Xuming He et al.

NEURIPS 2025 · arXiv:2506.10521
18 citations

Sherlock: Self-Correcting Reasoning in Vision-Language Models

Yi Ding, Ruqi Zhang

NEURIPS 2025 · arXiv:2505.22651
7 citations

Taming the Untamed: Graph-Based Knowledge Retrieval and Reasoning for MLLMs to Conquer the Unknown

Bowen Wang, Zhouqiang Jiang, Yasuaki Susumu et al.

ICCV 2025 · arXiv:2506.17589
1 citation

Temporal Reasoning Transfer from Text to Video

Lei Li, Yuanxin Liu, Linli Yao et al.

ICLR 2025 (oral) · arXiv:2410.06166
21 citations

Towards Omnimodal Expressions and Reasoning in Referring Audio-Visual Segmentation

Kaining Ying, Henghui Ding, Guangquan Jie et al.

ICCV 2025 · arXiv:2507.22886
6 citations

Towards Physics-informed Spatial Intelligence with Human Priors: An Autonomous Driving Pilot Study

Guanlin (Frank) Wu, Boyan Su, Yang Zhao et al.

NEURIPS 2025 (spotlight) · arXiv:2510.21160

Training-Free Personalization via Retrieval and Reasoning on Fingerprints

Deepayan Das, Davide Talon, Yiming Wang et al.

ICCV 2025 · arXiv:2503.18623
2 citations

TWIST & SCOUT: Grounding Multimodal LLM-Experts by Forget-Free Tuning

Aritra Bhowmik, Mohammad Mahdi Derakhshani, Dennis Koelma et al.

ICCV 2025 · arXiv:2410.10491

ViLLa: Video Reasoning Segmentation with Large Language Model

Rongkun Zheng, Lu Qi, Xi Chen et al.

ICCV 2025 · arXiv:2407.14500
17 citations

VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning

Haozhe Wang, Chao Qu, Zuming Huang et al.

NEURIPS 2025 (spotlight) · arXiv:2504.08837
183 citations

Boosting the Power of Small Multimodal Reasoning Models to Match Larger Models with Self-Consistency Training

Cheng Tan, Jingxuan Wei, Zhangyang Gao et al.

ECCV 2024 · arXiv:2311.14109
29 citations

Exploring the Transferability of Visual Prompting for Multimodal Large Language Models

Yichi Zhang, Yinpeng Dong, Siyuan Zhang et al.

CVPR 2024 (highlight) · arXiv:2404.11207
18 citations

Image Content Generation with Causal Reasoning

Xiaochuan Li, Baoyu Fan, Run Zhang et al.

AAAI 2024 · arXiv:2312.07132
12 citations

KAM-CoT: Knowledge Augmented Multimodal Chain-of-Thoughts Reasoning

Debjyoti Mondal, Suraj Modi, Subhadarshi Panda et al.

AAAI 2024 · arXiv:2401.12863
82 citations

OpenIns3D: Snap and Lookup for 3D Open-vocabulary Instance Segmentation

Zhening Huang, Xiaoyang Wu, Xi Chen et al.

ECCV 2024 · arXiv:2309.00616
83 citations

PARIS3D: Reasoning-based 3D Part Segmentation Using Large Multimodal Model

Amrin Kareem, Jean Lahoud, Hisham Cholakkal

ECCV 2024 · arXiv:2404.03836
7 citations

PathMMU: A Massive Multimodal Expert-Level Benchmark for Understanding and Reasoning in Pathology

Yuxuan Sun, Hao Wu, Chenglu Zhu et al.

ECCV 2024 · arXiv:2401.16355
36 citations

Question Aware Vision Transformer for Multimodal Reasoning

Roy Ganz, Yair Kittenplon, Aviad Aberdam et al.

CVPR 2024 (highlight) · arXiv:2402.05472
37 citations

Reflective Instruction Tuning: Mitigating Hallucinations in Large Vision-Language Models

Jinrui Zhang, Teng Wang, Haigang Zhang et al.

ECCV 2024 · arXiv:2407.11422
11 citations

RoboCodeX: Multimodal Code Generation for Robotic Behavior Synthesis

Yao Mu, Junting Chen, Qing-Long Zhang et al.

ICML 2024 · arXiv:2402.16117
46 citations

Vamos: Versatile Action Models for Video Understanding

Shijie Wang, Qi Zhao, Minh Quan et al.

ECCV 2024 · arXiv:2311.13627
36 citations