Xinlei Chen
33
Papers
165
Total Citations
Papers (33)
Transformers without Normalization
CVPR 2025 (arXiv)
96
citations
Scaling Language-Free Visual Representation Learning
ICCV 2025 (arXiv)
39
citations
R-MAE: Regions Meet Masked Autoencoders
ICLR 2024
16
citations
PRE-Mamba: A 4D State Space Model for Ultra-High-Frequent Event Camera Deraining
ICCV 2025
8
citations
LLMs can see and hear without any training
ICML 2025
6
citations
Multi-Target Embodied Question Answering
CVPR 2019
0
citations
Grounded Video Description
CVPR 2019
0
citations
Cycle-Consistency for Robust Visual Question Answering
CVPR 2019
0
citations
Towards VQA Models That Can Read
CVPR 2019
0
citations
ImVoteNet: Boosting 3D Object Detection in Point Clouds With Image Votes
CVPR 2020 (arXiv)
0
citations
In Defense of Grid Features for Visual Question Answering
CVPR 2020 (arXiv)
0
citations
Exploring Simple Siamese Representation Learning
CVPR 2021 (arXiv)
0
citations
Masked Autoencoders Are Scalable Vision Learners
CVPR 2022 (arXiv)
0
citations
On the Importance of Asymmetry for Siamese Representation Learning
CVPR 2022 (arXiv)
0
citations
Point-Level Region Contrast for Object Detection Pre-Training
CVPR 2022 (arXiv)
0
citations
ConvNeXt V2: Co-Designing and Scaling ConvNets With Masked Autoencoders
CVPR 2023 (arXiv)
0
citations
Improving Selective Visual Question Answering by Learning From Your Peers
CVPR 2023
0
citations
Webly Supervised Learning of Convolutional Networks
ICCV 2015
0
citations
Spatial Memory for Context Reasoning in Object Detection
ICCV 2017 (arXiv)
0
citations
Order-Aware Generative Modeling Using the 3D-Craft Dataset
ICCV 2019
0
citations
Embodied Amodal Recognition: Learning to Move to Perceive Objects
ICCV 2019
0
citations
TensorMask: A Foundation for Dense Object Segmentation
ICCV 2019
0
citations
nocaps: novel object captioning at scale
ICCV 2019
0
citations
Prior-Aware Neural Network for Partially-Supervised Multi-Organ Segmentation
ICCV 2019
0
citations
An Empirical Study of Training Self-Supervised Vision Transformers
ICCV 2021 (arXiv)
0
citations
UniT3D: A Unified Transformer for 3D Dense Captioning and Visual Grounding
ICCV 2023 (arXiv)
0
citations
Seeing the Un-Scene: Learning Amodal Semantic Maps for Room Navigation
ECCV 2020
0
citations
KRISP: Integrating Implicit and Symbolic Knowledge for Open-Domain Knowledge-Based VQA
CVPR 2021 (arXiv)
0
citations
MetaMorph: Multimodal Understanding and Generation via Instruction Tuning
ICCV 2025
0
citations
Mind's Eye: A Recurrent Visual Representation for Image Caption Generation
CVPR 2015
0
citations
Sense Discovery via Co-Clustering on Images and Text
CVPR 2015
0
citations
Iterative Visual Reasoning Beyond Convolutions
CVPR 2018 (arXiv)
0
citations
Test-Time Training with Masked Autoencoders
NeurIPS 2022
0
citations