Jifeng Dai
65
Papers
8,642
Total Citations
Papers (65)
R-FCN: Object Detection via Region-based Fully Convolutional Networks
NeurIPS 2016 (arXiv)
5,936
citations
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
CVPR 2024
2,210
citations
The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World
ICLR 2024
118
citations
The All-Seeing Project V2: Towards General Relation Comprehension of the Open World
ECCV 2024
86
citations
Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training
CVPR 2025
68
citations
GoT: Unleashing Reasoning Capability of MLLM for Visual Generation and Editing
NeurIPS 2025
60
citations
Point2RBox: Combine Knowledge from Synthetic Visual Patterns for End-to-end Oriented Object Detection with Single Point Supervision
CVPR 2024
43
citations
Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy
ICCV 2025
35
citations
SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding
CVPR 2025
34
citations
PUMA: Empowering Unified MLLM with Multi-granular Visual Generation
ICCV 2025
17
citations
Docopilot: Improving Multimodal Models for Document-Level Understanding
CVPR 2025
14
citations
MI-DETR: An Object Detection Model with Multi-time Inquiries Mechanism
CVPR 2025
7
citations
MuLan: Adapting Multilingual Diffusion Models for Hundreds of Languages with Negligible Cost
ICML 2025
6
citations
HoVLE: Unleashing the Power of Monolithic Vision-Language Models with Holistic Vision-Language Embedding
CVPR 2025
5
citations
OWMM-Agent: Open World Mobile Manipulation With Multi-modal Agentic Data Synthesis
NeurIPS 2025
2
citations
Point or Line? Using Line-based Representation for Panoptic Symbol Spotting in CAD Drawings
NeurIPS 2025
1
citations
Unsupervised Object Detection With LIDAR Clues
CVPR 2021 (arXiv)
0
citations
Uni-Perceiver: Pre-Training Unified Architecture for Generic Perception for Zero-Shot and Few-Shot Tasks
CVPR 2022
0
citations
AutoLoss-Zero: Searching Loss Functions From Scratch for Generic Tasks
CVPR 2022
0
citations
Exploring the Equivalence of Siamese Self-Supervised Learning via a Unified Gradient Framework
CVPR 2022 (arXiv)
0
citations
Towards All-in-One Pre-Training via Maximizing Multi-Modal Mutual Information
CVPR 2023 (arXiv)
0
citations
Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks
CVPR 2023 (arXiv)
0
citations
BEVFormer v2: Adapting Modern Image Backbones to Bird's-Eye-View Recognition via Perspective Supervision
CVPR 2023
0
citations
FlowFormer++: Masked Cost Volume Autoencoding for Pretraining Optical Flow Estimation
CVPR 2023
0
citations
Siamese Image Modeling for Self-Supervised Vision Representation Learning
CVPR 2023 (arXiv)
0
citations
Video Dehazing via a Multi-Range Temporal Alignment Network With Physical Prior
CVPR 2023 (arXiv)
0
citations
InternImage: Exploring Large-Scale Vision Foundation Models With Deformable Convolutions
CVPR 2023 (arXiv)
0
citations
Planning-Oriented Autonomous Driving
CVPR 2023 (arXiv)
0
citations
Learning Weather-General and Weather-Specific Features for Image Restoration Under Multiple Adverse Weather Conditions
CVPR 2023
0
citations
BoxSup: Exploiting Bounding Boxes to Supervise Convolutional Networks for Semantic Segmentation
ICCV 2015
0
citations
Flow-Guided Feature Aggregation for Video Object Detection
ICCV 2017 (arXiv)
0
citations
Deformable Convolutional Networks
ICCV 2017 (arXiv)
0
citations
An Empirical Study of Spatial Attention Mechanisms in Deep Networks
ICCV 2019
0
citations
PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models
CVPR 2025
0
citations
FuseFormer: Fusing Fine-Grained Information in Transformers for Video Inpainting
ICCV 2021 (arXiv)
0
citations
Exploring Cross-Image Pixel Contrast for Semantic Segmentation
ICCV 2021 (arXiv)
0
citations
Fast Convergence of DETR With Spatially Modulated Co-Attention
ICCV 2021
0
citations
VideoFlow: Exploiting Temporal Cues for Multi-frame Optical Flow Estimation
ICCV 2023 (arXiv)
0
citations
Mining Cross-Image Semantics for Weakly Supervised Semantic Segmentation
ECCV 2020
0
citations
BEVFormer: Learning Bird’s-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers
ECCV 2022
0
citations
FlowFormer: A Transformer Architecture for Optical Flow
ECCV 2022
0
citations
VL-LTR: Learning Class-Wise Visual-Linguistic Representation for Long-Tailed Visual Recognition
ECCV 2022
0
citations
Frozen CLIP Models Are Efficient Video Learners
ECCV 2022
0
citations
Tip-Adapter: Training-Free Adaption of CLIP for Few-Shot Classification
ECCV 2022
0
citations
Influence Selection for Active Learning
ICCV 2021 (arXiv)
0
citations
V2PE: Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding
ICCV 2025
0
citations
LangBridge: Interpreting Image as a Combination of Language Embeddings
ICCV 2025
0
citations
Auto MC-Reward: Automated Dense Reward Design with Large Language Models for Minecraft
CVPR 2024
0
citations
Efficient Deformable ConvNets: Rethinking Dynamic and Sparse Operator for Vision Applications
CVPR 2024
0
citations
RoboCodeX: Multimodal Code Generation for Robotic Behavior Synthesis
ICML 2024
0
citations
Convolutional Feature Masking for Joint Object and Stuff Segmentation
CVPR 2015
0
citations
Instance-Aware Semantic Segmentation via Multi-Task Network Cascades
CVPR 2016
0
citations
ScribbleSup: Scribble-Supervised Convolutional Networks for Semantic Segmentation
CVPR 2016
0
citations
Deep Feature Flow for Video Recognition
CVPR 2017 (arXiv)
0
citations
Fully Convolutional Instance-Aware Semantic Segmentation
CVPR 2017 (arXiv)
0
citations
Relation Networks for Object Detection
CVPR 2018 (arXiv)
0
citations
Towards High Performance Video Object Detection
CVPR 2018 (arXiv)
0
citations
Deformable ConvNets V2: More Deformable, Better Results
CVPR 2019
0
citations
Resolution Adaptive Networks for Efficient Inference
CVPR 2020 (arXiv)
0
citations
Hierarchical Human Parsing With Typed Part-Relation Reasoning
CVPR 2020 (arXiv)
0
citations
Uni-Perceiver-MoE: Learning Sparse Generalist Models with Conditional MoEs
NeurIPS 2022
0
citations
MCMAE: Masked Convolution Meets Masked Autoencoders
NeurIPS 2022
0
citations
EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought
NeurIPS 2023
0
citations
JourneyDB: A Benchmark for Generative Image Understanding
NeurIPS 2023
0
citations
VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks
NeurIPS 2023
0
citations