Saining Xie

38
Papers
1,615
Total Citations

Papers (38)

Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs

CVPR 2024
570
citations

Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces

CVPR 2025
342
citations

V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs

CVPR 2024
327
citations

Demystifying CLIP Data

ICLR 2024
205
citations

REPA-E: Unlocking VAE for End-to-End Tuning of Latent Diffusion Transformers

ICCV 2025 (arXiv)
73
citations

Scaling Language-Free Visual Representation Learning

ICCV 2025 (arXiv)
39
citations

MoDE: CLIP Data Experts via Clustering

CVPR 2024
25
citations

DiffusionGuard: A Robust Defense Against Malicious Diffusion-based Image Editing

ICLR 2025
14
citations

Scaling Inference Time Compute for Diffusion Models

CVPR 2025
13
citations

Fast Encoding and Decoding for Implicit Video Representation

ECCV 2024
7
citations

Exploring Data-Efficient 3D Scene Understanding With Contrastive Scene Contexts

CVPR 2021 (arXiv)
0
citations

Masked Feature Prediction for Self-Supervised Visual Pre-Training

CVPR 2022 (arXiv)
0
citations

Masked Autoencoders Are Scalable Vision Learners

CVPR 2022 (arXiv)
0
citations

A ConvNet for the 2020s

CVPR 2022 (arXiv)
0
citations

ConvNeXt V2: Co-Designing and Scaling ConvNets With Masked Autoencoders

CVPR 2023 (arXiv)
0
citations

Holistically-Nested Edge Detection

ICCV 2015
0
citations

Exploring Randomly Wired Neural Networks for Image Recognition

ICCV 2019
0
citations

Order-Aware Generative Modeling Using the 3D-Craft Dataset

ICCV 2019
0
citations

On Network Design Spaces for Visual Recognition

ICCV 2019
0
citations

Pri3D: Can 3D Priors Help 2D Representation Learning?

ICCV 2021 (arXiv)
0
citations

An Empirical Study of Training Self-Supervised Vision Transformers

ICCV 2021 (arXiv)
0
citations

CiT: Curation in Training for Effective Vision-Language Data

ICCV 2023 (arXiv)
0
citations

Going Denser with Open-Vocabulary Part Segmentation

ICCV 2023 (arXiv)
0
citations

Scalable Diffusion Models with Transformers

ICCV 2023 (arXiv)
0
citations

PointContrast: Unsupervised Pre-training for 3D Point Cloud Understanding

ECCV 2020
0
citations

Are Labels Necessary for Neural Architecture Search?

ECCV 2020
0
citations

SLIP: Self-Supervision Meets Language-Image Pre-training

ECCV 2022
0
citations

Momentum Contrast for Unsupervised Visual Representation Learning

CVPR 2020 (arXiv)
0
citations

Science-T2I: Addressing Scientific Illusions in Image Synthesis

CVPR 2025
0
citations

Exploring the Deep Fusion of Large Language Models and Diffusion Transformers for Text-to-Image Synthesis

CVPR 2025
0
citations

MetaMorph: Multimodal Understanding and Generation via Instruction Tuning

ICCV 2025
0
citations

Dynamic Test-Time Compute Scaling in Control Policy: Difficulty-Aware Stochastic Interpolant Policy

NeurIPS 2025
0
citations

Image Sculpting: Precise Object Editing with 3D Geometry Control

CVPR 2024
0
citations

Hyper-Class Augmented and Regularized Deep Learning for Fine-Grained Image Classification

CVPR 2015
0
citations

Aggregated Residual Transformations for Deep Neural Networks

CVPR 2017 (arXiv)
0
citations

Attentional ShapeContextNet for Point Cloud Recognition

CVPR 2018
0
citations

FBNetV2: Differentiable Neural Architecture Search for Spatial and Channel Dimensions

CVPR 2020 (arXiv)
0
citations

On Interaction Between Augmentations and Corruptions in Natural Corruption Robustness

NeurIPS 2021
0
citations