Saining Xie
38
Papers
1,615
Total Citations
Papers (38)
Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs
CVPR 2024
570
citations
Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces
CVPR 2025
342
citations
V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs
CVPR 2024
327
citations
Demystifying CLIP Data
ICLR 2024
205
citations
REPA-E: Unlocking VAE for End-to-End Tuning of Latent Diffusion Transformers
ICCV 2025
73
citations
Scaling Language-Free Visual Representation Learning
ICCV 2025
39
citations
MoDE: CLIP Data Experts via Clustering
CVPR 2024
25
citations
DiffusionGuard: A Robust Defense Against Malicious Diffusion-based Image Editing
ICLR 2025
14
citations
Scaling Inference Time Compute for Diffusion Models
CVPR 2025
13
citations
Fast Encoding and Decoding for Implicit Video Representation
ECCV 2024
7
citations
Exploring Data-Efficient 3D Scene Understanding With Contrastive Scene Contexts
CVPR 2021
0
citations
Masked Feature Prediction for Self-Supervised Visual Pre-Training
CVPR 2022
0
citations
Masked Autoencoders Are Scalable Vision Learners
CVPR 2022
0
citations
A ConvNet for the 2020s
CVPR 2022
0
citations
ConvNeXt V2: Co-Designing and Scaling ConvNets With Masked Autoencoders
CVPR 2023
0
citations
Holistically-Nested Edge Detection
ICCV 2015
0
citations
Exploring Randomly Wired Neural Networks for Image Recognition
ICCV 2019
0
citations
Order-Aware Generative Modeling Using the 3D-Craft Dataset
ICCV 2019
0
citations
On Network Design Spaces for Visual Recognition
ICCV 2019
0
citations
Pri3D: Can 3D Priors Help 2D Representation Learning?
ICCV 2021
0
citations
An Empirical Study of Training Self-Supervised Vision Transformers
ICCV 2021
0
citations
CiT: Curation in Training for Effective Vision-Language Data
ICCV 2023
0
citations
Going Denser with Open-Vocabulary Part Segmentation
ICCV 2023
0
citations
Scalable Diffusion Models with Transformers
ICCV 2023
0
citations
PointContrast: Unsupervised Pre-training for 3D Point Cloud Understanding
ECCV 2020
0
citations
Are Labels Necessary for Neural Architecture Search?
ECCV 2020
0
citations
SLIP: Self-Supervision Meets Language-Image Pre-training
ECCV 2022
0
citations
Momentum Contrast for Unsupervised Visual Representation Learning
CVPR 2020
0
citations
Science-T2I: Addressing Scientific Illusions in Image Synthesis
CVPR 2025
0
citations
Exploring the Deep Fusion of Large Language Models and Diffusion Transformers for Text-to-Image Synthesis
CVPR 2025
0
citations
MetaMorph: Multimodal Understanding and Generation via Instruction Tuning
ICCV 2025
0
citations
Dynamic Test-Time Compute Scaling in Control Policy: Difficulty-Aware Stochastic Interpolant Policy
NeurIPS 2025
0
citations
Image Sculpting: Precise Object Editing with 3D Geometry Control
CVPR 2024
0
citations
Hyper-Class Augmented and Regularized Deep Learning for Fine-Grained Image Classification
CVPR 2015
0
citations
Aggregated Residual Transformations for Deep Neural Networks
CVPR 2017
0
citations
Attentional ShapeContextNet for Point Cloud Recognition
CVPR 2018
0
citations
FBNetV2: Differentiable Neural Architecture Search for Spatial and Channel Dimensions
CVPR 2020
0
citations
On Interaction Between Augmentations and Corruptions in Natural Corruption Robustness
NeurIPS 2021
0
citations