Xinlei Chen

33 Papers · 165 Total Citations

Papers (33)

Transformers without Normalization
CVPR 2025 · arXiv · 96 citations

Scaling Language-Free Visual Representation Learning
ICCV 2025 · arXiv · 39 citations

R-MAE: Regions Meet Masked Autoencoders
ICLR 2024 · 16 citations

PRE-Mamba: A 4D State Space Model for Ultra-High-Frequent Event Camera Deraining
ICCV 2025 · 8 citations

LLMs can see and hear without any training
ICML 2025 · 6 citations

Multi-Target Embodied Question Answering
CVPR 2019 · 0 citations

Grounded Video Description
CVPR 2019 · 0 citations

Cycle-Consistency for Robust Visual Question Answering
CVPR 2019 · 0 citations

Towards VQA Models That Can Read
CVPR 2019 · 0 citations

ImVoteNet: Boosting 3D Object Detection in Point Clouds With Image Votes
CVPR 2020 · arXiv · 0 citations

In Defense of Grid Features for Visual Question Answering
CVPR 2020 · arXiv · 0 citations

Exploring Simple Siamese Representation Learning
CVPR 2021 · arXiv · 0 citations

Masked Autoencoders Are Scalable Vision Learners
CVPR 2022 · arXiv · 0 citations

On the Importance of Asymmetry for Siamese Representation Learning
CVPR 2022 · arXiv · 0 citations

Point-Level Region Contrast for Object Detection Pre-Training
CVPR 2022 · arXiv · 0 citations

ConvNeXt V2: Co-Designing and Scaling ConvNets With Masked Autoencoders
CVPR 2023 · arXiv · 0 citations

Improving Selective Visual Question Answering by Learning From Your Peers
CVPR 2023 · 0 citations

Webly Supervised Learning of Convolutional Networks
ICCV 2015 · 0 citations

Spatial Memory for Context Reasoning in Object Detection
ICCV 2017 · arXiv · 0 citations

Order-Aware Generative Modeling Using the 3D-Craft Dataset
ICCV 2019 · 0 citations

Embodied Amodal Recognition: Learning to Move to Perceive Objects
ICCV 2019 · 0 citations

TensorMask: A Foundation for Dense Object Segmentation
ICCV 2019 · 0 citations

nocaps: novel object captioning at scale
ICCV 2019 · 0 citations

Prior-Aware Neural Network for Partially-Supervised Multi-Organ Segmentation
ICCV 2019 · 0 citations

An Empirical Study of Training Self-Supervised Vision Transformers
ICCV 2021 · arXiv · 0 citations

UniT3D: A Unified Transformer for 3D Dense Captioning and Visual Grounding
ICCV 2023 · arXiv · 0 citations

Seeing the Un-Scene: Learning Amodal Semantic Maps for Room Navigation
ECCV 2020 · 0 citations

KRISP: Integrating Implicit and Symbolic Knowledge for Open-Domain Knowledge-Based VQA
CVPR 2021 · arXiv · 0 citations

MetaMorph: Multimodal Understanding and Generation via Instruction Tuning
ICCV 2025 · 0 citations

Mind's Eye: A Recurrent Visual Representation for Image Caption Generation
CVPR 2015 · 0 citations

Sense Discovery via Co-Clustering on Images and Text
CVPR 2015 · 0 citations

Iterative Visual Reasoning Beyond Convolutions
CVPR 2018 · arXiv · 0 citations

Test-Time Training with Masked Autoencoders
NeurIPS 2022 · 0 citations