Jifeng Dai

21 Papers
2,706 Total Citations

Papers (21)

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

CVPR 2024
2,210
citations

The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World

ICLR 2024
118
citations

The All-Seeing Project V2: Towards General Relation Comprehension of the Open World

ECCV 2024
86
citations

Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training

CVPR 2025
68
citations

GoT: Unleashing Reasoning Capability of MLLM for Visual Generation and Editing

NeurIPS 2025
60
citations

Point2RBox: Combine Knowledge from Synthetic Visual Patterns for End-to-end Oriented Object Detection with Single Point Supervision

CVPR 2024
43
citations

Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy

ICCV 2025
35
citations

SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding

CVPR 2025
34
citations

PUMA: Empowering Unified MLLM with Multi-granular Visual Generation

ICCV 2025
17
citations

Docopilot: Improving Multimodal Models for Document-Level Understanding

CVPR 2025
14
citations

MI-DETR: An Object Detection Model with Multi-time Inquiries Mechanism

CVPR 2025
7
citations

MuLan: Adapting Multilingual Diffusion Models for Hundreds of Languages with Negligible Cost

ICML 2025
6
citations

HoVLE: Unleashing the Power of Monolithic Vision-Language Models with Holistic Vision-Language Embedding

CVPR 2025
5
citations

OWMM-Agent: Open World Mobile Manipulation With Multi-modal Agentic Data Synthesis

NeurIPS 2025
2
citations

Point or Line? Using Line-based Representation for Panoptic Symbol Spotting in CAD Drawings

NeurIPS 2025
1
citation

RoboCodeX: Multimodal Code Generation for Robotic Behavior Synthesis

ICML 2024
0
citations

V2PE: Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding

ICCV 2025
0
citations

LangBridge: Interpreting Image as a Combination of Language Embeddings

ICCV 2025
0
citations

Auto MC-Reward: Automated Dense Reward Design with Large Language Models for Minecraft

CVPR 2024
0
citations

Efficient Deformable ConvNets: Rethinking Dynamic and Sparse Operator for Vision Applications

CVPR 2024
0
citations

PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models

CVPR 2025
0
citations