Jifeng Dai

65
Papers
8,642
Total Citations

Papers (65)

R-FCN: Object Detection via Region-based Fully Convolutional Networks

NeurIPS 2016 (arXiv)
5,936
citations

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

CVPR 2024
2,210
citations

The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World

ICLR 2024
118
citations

The All-Seeing Project V2: Towards General Relation Comprehension of the Open World

ECCV 2024
86
citations

Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training

CVPR 2025
68
citations

GoT: Unleashing Reasoning Capability of MLLM for Visual Generation and Editing

NeurIPS 2025
60
citations

Point2RBox: Combine Knowledge from Synthetic Visual Patterns for End-to-end Oriented Object Detection with Single Point Supervision

CVPR 2024
43
citations

Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy

ICCV 2025
35
citations

SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding

CVPR 2025
34
citations

PUMA: Empowering Unified MLLM with Multi-granular Visual Generation

ICCV 2025
17
citations

Docopilot: Improving Multimodal Models for Document-Level Understanding

CVPR 2025
14
citations

MI-DETR: An Object Detection Model with Multi-time Inquiries Mechanism

CVPR 2025
7
citations

MuLan: Adapting Multilingual Diffusion Models for Hundreds of Languages with Negligible Cost

ICML 2025
6
citations

HoVLE: Unleashing the Power of Monolithic Vision-Language Models with Holistic Vision-Language Embedding

CVPR 2025
5
citations

OWMM-Agent: Open World Mobile Manipulation With Multi-modal Agentic Data Synthesis

NeurIPS 2025
2
citations

Point or Line? Using Line-based Representation for Panoptic Symbol Spotting in CAD Drawings

NeurIPS 2025
1
citations

Unsupervised Object Detection With LIDAR Clues

CVPR 2021 (arXiv)
0
citations

Uni-Perceiver: Pre-Training Unified Architecture for Generic Perception for Zero-Shot and Few-Shot Tasks

CVPR 2022
0
citations

AutoLoss-Zero: Searching Loss Functions From Scratch for Generic Tasks

CVPR 2022
0
citations

Exploring the Equivalence of Siamese Self-Supervised Learning via a Unified Gradient Framework

CVPR 2022 (arXiv)
0
citations

Towards All-in-One Pre-Training via Maximizing Multi-Modal Mutual Information

CVPR 2023 (arXiv)
0
citations

Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks

CVPR 2023 (arXiv)
0
citations

BEVFormer v2: Adapting Modern Image Backbones to Bird's-Eye-View Recognition via Perspective Supervision

CVPR 2023
0
citations

FlowFormer++: Masked Cost Volume Autoencoding for Pretraining Optical Flow Estimation

CVPR 2023
0
citations

Siamese Image Modeling for Self-Supervised Vision Representation Learning

CVPR 2023 (arXiv)
0
citations

Video Dehazing via a Multi-Range Temporal Alignment Network With Physical Prior

CVPR 2023 (arXiv)
0
citations

InternImage: Exploring Large-Scale Vision Foundation Models With Deformable Convolutions

CVPR 2023 (arXiv)
0
citations

Planning-Oriented Autonomous Driving

CVPR 2023 (arXiv)
0
citations

Learning Weather-General and Weather-Specific Features for Image Restoration Under Multiple Adverse Weather Conditions

CVPR 2023
0
citations

BoxSup: Exploiting Bounding Boxes to Supervise Convolutional Networks for Semantic Segmentation

ICCV 2015
0
citations

Flow-Guided Feature Aggregation for Video Object Detection

ICCV 2017 (arXiv)
0
citations

Deformable Convolutional Networks

ICCV 2017 (arXiv)
0
citations

An Empirical Study of Spatial Attention Mechanisms in Deep Networks

ICCV 2019
0
citations

PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models

CVPR 2025
0
citations

FuseFormer: Fusing Fine-Grained Information in Transformers for Video Inpainting

ICCV 2021 (arXiv)
0
citations

Exploring Cross-Image Pixel Contrast for Semantic Segmentation

ICCV 2021 (arXiv)
0
citations

Fast Convergence of DETR With Spatially Modulated Co-Attention

ICCV 2021
0
citations

VideoFlow: Exploiting Temporal Cues for Multi-frame Optical Flow Estimation

ICCV 2023 (arXiv)
0
citations

Mining Cross-Image Semantics for Weakly Supervised Semantic Segmentation

ECCV 2020
0
citations

BEVFormer: Learning Bird’s-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers

ECCV 2022
0
citations

FlowFormer: A Transformer Architecture for Optical Flow

ECCV 2022
0
citations

VL-LTR: Learning Class-Wise Visual-Linguistic Representation for Long-Tailed Visual Recognition

ECCV 2022
0
citations

Frozen CLIP Models Are Efficient Video Learners

ECCV 2022
0
citations

Tip-Adapter: Training-Free Adaption of CLIP for Few-Shot Classification

ECCV 2022
0
citations

Influence Selection for Active Learning

ICCV 2021 (arXiv)
0
citations

V2PE: Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding

ICCV 2025
0
citations

LangBridge: Interpreting Image as a Combination of Language Embeddings

ICCV 2025
0
citations

Auto MC-Reward: Automated Dense Reward Design with Large Language Models for Minecraft

CVPR 2024
0
citations

Efficient Deformable ConvNets: Rethinking Dynamic and Sparse Operator for Vision Applications

CVPR 2024
0
citations

RoboCodeX: Multimodal Code Generation for Robotic Behavior Synthesis

ICML 2024
0
citations

Convolutional Feature Masking for Joint Object and Stuff Segmentation

CVPR 2015
0
citations

Instance-Aware Semantic Segmentation via Multi-Task Network Cascades

CVPR 2016
0
citations

ScribbleSup: Scribble-Supervised Convolutional Networks for Semantic Segmentation

CVPR 2016
0
citations

Deep Feature Flow for Video Recognition

CVPR 2017 (arXiv)
0
citations

Fully Convolutional Instance-Aware Semantic Segmentation

CVPR 2017 (arXiv)
0
citations

Relation Networks for Object Detection

CVPR 2018 (arXiv)
0
citations

Towards High Performance Video Object Detection

CVPR 2018 (arXiv)
0
citations

Deformable ConvNets V2: More Deformable, Better Results

CVPR 2019
0
citations

Resolution Adaptive Networks for Efficient Inference

CVPR 2020 (arXiv)
0
citations

Hierarchical Human Parsing With Typed Part-Relation Reasoning

CVPR 2020 (arXiv)
0
citations

Uni-Perceiver-MoE: Learning Sparse Generalist Models with Conditional MoEs

NeurIPS 2022
0
citations

MCMAE: Masked Convolution Meets Masked Autoencoders

NeurIPS 2022
0
citations

EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought

NeurIPS 2023
0
citations

JourneyDB: A Benchmark for Generative Image Understanding

NeurIPS 2023
0
citations

VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks

NeurIPS 2023
0
citations