🧬Architectures

Vision Transformers

Transformers for computer vision

100 papers13,494 total citations
Compare with other topics
Feb '24 Jan '262251 papers
Also includes: vision transformer, vision transformers, vit, image transformer

Top Papers

#1

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

Zhe Chen, Jiannan Wu, Wenhai Wang et al.

CVPR 2024
2,210
citations
#2

VBench: Comprehensive Benchmark Suite for Video Generative Models

Ziqi Huang, Yinan He, Jiashuo Yu et al.

CVPR 2024
996
citations
#3

Scaffold-GS: Structured 3D Gaussians for View-Adaptive Rendering

Tao Lu, Mulin Yu, Linning Xu et al.

CVPR 2024
589
citations
#4

Grounding Image Matching in 3D with MASt3R

Vincent Leroy, Yohann Cabon, Jerome Revaud

ECCV 2024
499
citations
#5

RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation

Songming Liu, Lingxuan Wu, Bangguo Li et al.

ICLR 2025
365
citations
#6

GS-SLAM: Dense Visual SLAM with 3D Gaussian Splatting

Chi Yan, Delin Qu, Dong Wang et al.

CVPR 2024
359
citations
#7

Triplane Meets Gaussian Splatting: Fast and Generalizable Single-View 3D Reconstruction with Transformers

Zi-Xin Zou, Zhipeng Yu, Yuan-Chen Guo et al.

CVPR 2024
266
citations
#8

DL3DV-10K: A Large-Scale Scene Dataset for Deep Learning-based 3D Vision

Lu Ling, Yichen Sheng, Zhi Tu et al.

CVPR 2024
266
citations
#9

EquiformerV2: Improved Equivariant Transformer for Scaling to Higher-Degree Representations

Yi-Lun Liao, Brandon Wood, Abhishek Das et al.

ICLR 2024
254
citations
#10

EfficientSAM: Leveraged Masked Image Pretraining for Efficient Segment Anything

Yunyang Xiong, Balakrishnan Varadarajan, Lemeng Wu et al.

CVPR 2024
241
citations
#11

SkySense: A Multi-Modal Remote Sensing Foundation Model Towards Universal Interpretation for Earth Observation Imagery

Xin Guo, Jiangwei Lao, Bo Dang et al.

CVPR 2024
236
citations
#12

Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation

Hongtao Wu, Ya Jing, Chilam Cheang et al.

ICLR 2024
236
citations
#13

Sequential Modeling Enables Scalable Learning for Large Vision Models

Yutong Bai, Xinyang Geng, Karttikeya Mangalam et al.

CVPR 2024
230
citations
#14

Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction

Yiheng Xu, Zekun Wang, Junli Wang et al.

ICML 2025
165
citations
#15

Uni3D: Exploring Unified 3D Representation at Scale

Junsheng Zhou, Jinsheng Wang, Baorui Ma et al.

ICLR 2024
165
citations
#16

GPS-Gaussian: Generalizable Pixel-wise 3D Gaussian Splatting for Real-time Human Novel View Synthesis

Shunyuan Zheng, Boyao ZHOU, Ruizhi Shao et al.

CVPR 2024
160
citations
#17

Street Gaussians: Modeling Dynamic Urban Scenes with Gaussian Splatting

Yunzhi Yan, Haotong Lin, Chenxu Zhou et al.

ECCV 2024
149
citations
#18

Gaussian Head Avatar: Ultra High-fidelity Head Avatar via Dynamic Gaussians

Yuelang Xu, Benwang Chen, Zhe Li et al.

CVPR 2024
147
citations
#19

Osprey: Pixel Understanding with Visual Instruction Tuning

Yuqian Yuan, Wentong Li, Jian liu et al.

CVPR 2024
147
citations
#20

Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding

Yan Shu, Zheng Liu, Peitian Zhang et al.

CVPR 2025
142
citations
#21

Rotary Position Embedding for Vision Transformer

Byeongho Heo, Song Park, Dongyoon Han et al.

ECCV 2024
135
citations
#22

Transcending Forgery Specificity with Latent Space Augmentation for Generalizable Deepfake Detection

Zhiyuan Yan, Yuhao Luo, Siwei Lyu et al.

CVPR 2024
133
citations
#23

SGS-SLAM: Semantic Gaussian Splatting For Neural Dense SLAM

Mingrui Li, Shuhong Liu, Heng Zhou et al.

ECCV 2024arXiv:2402.03246
gaussian splattingvisual slamsemantic segmentationneural implicit slam+4
131
citations
#24

ViT-CoMer: Vision Transformer with Convolutional Multi-scale Feature Interaction for Dense Predictions

Chunlong Xia, Xinliang Wang, Feng Lv et al.

CVPR 2024
131
citations
#25

SCTNet: Single Branch CNN with Transformer Semantic Information for Real-Time Segmentation

Authors: Zhengze Xu, Dongyue Wu, Changqian Yu et al.

AAAI 2024arXiv:2312.17071
real-time segmentationsemantic segmentationsingle branch cnntransformer semantic information+3
126
citations
#26

Real-IAD: A Real-World Multi-View Dataset for Benchmarking Versatile Industrial Anomaly Detection

Chengjie Wang, wenbing zhu, Bin-Bin Gao et al.

CVPR 2024
120
citations
#27

The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World

Weiyun Wang, Min Shi, Qingyun Li et al.

ICLR 2024
118
citations
#28

Pinwheel-shaped Convolution and Scale-based Dynamic Loss for Infrared Small Target Detection

Jiangnan Yang, Shuangli Liu, Jingjun Wu et al.

AAAI 2025
115
citations
#29

CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction

Size Wu, Wenwei Zhang, Lumin Xu et al.

ICLR 2024
104
citations
#30

OmniRe: Omni Urban Scene Reconstruction

Ziyu Chen, Jiawei Yang, Jiahui Huang et al.

ICLR 2025
103
citations
#31

PLGSLAM: Progressive Neural Scene Represenation with Local to Global Bundle Adjustment

Tianchen Deng, Guole Shen, Tong Qin et al.

CVPR 2024
101
citations
#32

Stronger Fewer & Superior: Harnessing Vision Foundation Models for Domain Generalized Semantic Segmentation

ZHIXIANG WEI, Lin Chen, Xiaoxiao Ma et al.

CVPR 2024
100
citations
#33

FoundationStereo: Zero-Shot Stereo Matching

Bowen Wen, Matthew Trepte, Oluwaseun Joseph Aribido et al.

CVPR 2025
98
citations
#34

SHViT: Single-Head Vision Transformer with Memory Efficient Macro Design

Seokju Yun, Youngmin Ro

CVPR 2024
97
citations
#35

Transformers without Normalization

Jiachen Zhu, Xinlei Chen, Kaiming He et al.

CVPR 2025arXiv:2503.10622
transformer architecturenormalization layersdynamic tanhself-supervised learning+3
96
citations
#36

FLARE: Feed-forward Geometry, Appearance and Camera Estimation from Uncalibrated Sparse Views

Shangzhan Zhang, Jianyuan Wang, Yinghao Xu et al.

CVPR 2025
92
citations
#37

Decoding Natural Images from EEG for Object Recognition

Yonghao Song, Bingchuan Liu, Xiang Li et al.

ICLR 2024
92
citations
#38

Brain decoding: toward real-time reconstruction of visual perception

Yohann Benchetrit, Hubert Banville, Jean-Remi King

ICLR 2024
90
citations
#39

Rotated Multi-Scale Interaction Network for Referring Remote Sensing Image Segmentation

Sihan liu, Yiwei Ma, Xiaoqing Zhang et al.

CVPR 2024
89
citations
#40

LVSM: A Large View Synthesis Model with Minimal 3D Inductive Bias

Haian Jin, Hanwen Jiang, Hao Tan et al.

ICLR 2025
86
citations
#41

Vision-LSTM: xLSTM as Generic Vision Backbone

Benedikt Alkin, Maximilian Beck, Korbinian Pöppel et al.

ICLR 2025
85
citations
#42

ZeroNVS: Zero-Shot 360-Degree View Synthesis from a Single Image

Kyle Sargent, Zizhang Li, Tanmay Shah et al.

CVPR 2024
85
citations
#43

Octopus: Embodied Vision-Language Programmer from Environmental Feedback

Jingkang Yang, Yuhao Dong, Shuai Liu et al.

ECCV 2024
81
citations
#44

MV-DUSt3R+: Single-Stage Scene Reconstruction from Sparse Views In 2 Seconds

Zhenggang Tang, Yuchen Fan, Dilin Wang et al.

CVPR 2025
80
citations
#45

FlashAvatar: High-fidelity Head Avatar with Efficient Gaussian Embedding

Jun Xiang, Xuan Gao, Yudong Guo et al.

CVPR 2024
78
citations
#46

AC3D: Analyzing and Improving 3D Camera Control in Video Diffusion Transformers

Sherwin Bahmani, Ivan Skorokhodov, Guocheng Qian et al.

CVPR 2025
78
citations
#47

Rethinking Transformers Pre-training for Multi-Spectral Satellite Imagery

Mubashir Noman, Muzammal Naseer, Hisham Cholakkal et al.

CVPR 2024
78
citations
#48

Transcending the Limit of Local Window: Advanced Super-Resolution Transformer with Adaptive Token Dictionary

Leheng Zhang, Yawei Li, Xingyu Zhou et al.

CVPR 2024
76
citations
#49

CricaVPR: Cross-image Correlation-aware Representation Learning for Visual Place Recognition

Feng Lu, Xiangyuan Lan, Lijun Zhang et al.

CVPR 2024
75
citations
#50

BAD-Gaussians: Bundle Adjusted Deblur Gaussian Splatting

Lingzhe Zhao, Peng Wang, Peidong Liu

ECCV 2024arXiv:2403.11831
3d gaussian splattingmotion deblurringneural renderingbundle adjustment+4
74
citations
#51

Grounding Everything: Emerging Localization Properties in Vision-Language Transformers

Walid Bousselham, Felix Petersen, Vittorio Ferrari et al.

CVPR 2024
74
citations
#52

COTR: Compact Occupancy TRansformer for Vision-based 3D Occupancy Prediction

Qihang Ma, Xin Tan, Yanyun Qu et al.

CVPR 2024
73
citations
#53

MaskBit: Embedding-free Image Generation via Bit Tokens

Mark Weber, Lijun Yu, Qihang Yu et al.

ICLR 2025
72
citations
#54

UniReal: Universal Image Generation and Editing via Learning Real-world Dynamics

Xi Chen, Zhifei Zhang, He Zhang et al.

CVPR 2025
70
citations
#55

MV-Adapter: Multi-View Consistent Image Generation Made Easy

Zehuan Huang, Yuan-Chen Guo, Haoran Wang et al.

ICCV 2025
69
citations
#56

DiT4Edit: Diffusion Transformer for Image Editing

Kunyu Feng, Yue Ma, Bingyuan Wang et al.

AAAI 2025
69
citations
#57

GAvatar: Animatable 3D Gaussian Avatars with Implicit Mesh Learning

Ye Yuan, Xueting Li, Yangyi Huang et al.

CVPR 2024
69
citations
#58

Frequency-Spatial Entanglement Learning for Camouflaged Object Detection

Yanguang Sun, Chunyan Xu, Jian Yang et al.

ECCV 2024
68
citations
#59

TRAM: Global Trajectory and Motion of 3D Humans from in-the-wild Videos

Yufu Wang, Ziyun Wang, Lingjie Liu et al.

ECCV 2024arXiv:2403.17346
human motion reconstructionglobal trajectory estimationslam robustificationvideo transformer model+4
66
citations
#60

Unifying 3D Vision-Language Understanding via Promptable Queries

ziyu zhu, Zhuofan Zhang, Xiaojian Ma et al.

ECCV 2024
64
citations
#61

HARDVS: Revisiting Human Activity Recognition with Dynamic Vision Sensors

Xiao Wang, Zongzhen Wu, Bo Jiang et al.

AAAI 2024arXiv:2211.09648
human activity recognitiondynamic vision sensorsevent camerasspatial-temporal feature learning+3
64
citations
#62

Accelerating Diffusion Transformers with Token-wise Feature Caching

Chang Zou, Xuyang Liu, Ting Liu et al.

ICLR 2025
62
citations
#63

Image and Video Tokenization with Binary Spherical Quantization

Yue Zhao, Yuanjun Xiong, Philipp Krähenbühl

ICLR 2025
60
citations
#64

PerceptionGPT: Effectively Fusing Visual Perception into LLM

Renjie Pi, Lewei Yao, Jiahui Gao et al.

CVPR 2024
59
citations
#65

DocFormerv2: Local Features for Document Understanding

Srikar Appalaraju, Peng Tang, Qi Dong et al.

AAAI 2024arXiv:2306.01733
visual document understandingmulti-modal transformerlocal-feature alignmentdocument information extraction+4
58
citations
#66

Referred by Multi-Modality: A Unified Temporal Transformer for Video Object Segmentation

Shilin Yan, Renrui Zhang, Ziyu Guo et al.

AAAI 2024arXiv:2305.16318
video object segmentationmulti-modal referencetemporal transformersemantic alignment+4
58
citations
#67

Neural Parametric Gaussians for Monocular Non-Rigid Object Reconstruction

Devikalyan Das, Christopher Wewer, Raza Yunus et al.

CVPR 2024
57
citations
#68

Magnushammer: A Transformer-Based Approach to Premise Selection

Maciej Mikuła, Szymon Tworkowski, Szymon Antoniak et al.

ICLR 2024
57
citations
#69

FBRT-YOLO: Faster and Better for Real-Time Aerial Image Detection

Yao Xiao, Tingfa Xu, Yu Xin et al.

AAAI 2025
55
citations
#70

Self-Distilled Masked Auto-Encoders are Efficient Video Anomaly Detectors

Nicolae Ristea, Florinel Croitoru, Radu Tudor Ionescu et al.

CVPR 2024
55
citations
#71

SynCamMaster: Synchronizing Multi-Camera Video Generation from Diverse Viewpoints

Jianhong Bai, Menghan Xia, Xintao WANG et al.

ICLR 2025
55
citations
#72

VLCounter: Text-Aware Visual Representation for Zero-Shot Object Counting

Seunggu Kang, WonJun Moon, Euiyeon Kim et al.

AAAI 2024arXiv:2312.16580
zero-shot object countingsemantic-patch embeddingsvisual-language representationsemantic-conditioned prompt tuning+3
54
citations
#73

Scaling Transformers for Low-Bitrate High-Quality Speech Coding

Julian Parker, Anton Smirnov, Jordi Pons et al.

ICLR 2025
54
citations
#74

FakeInversion: Learning to Detect Images from Unseen Text-to-Image Models by Inverting Stable Diffusion

George Cazenavette, Avneesh Sud, Thomas Leung et al.

CVPR 2024
53
citations
#75

HiT-SR: Hierarchical Transformer for Efficient Image Super-Resolution

Xiang Zhang, Yulun Zhang, Fisher Yu

ECCV 2024
52
citations
#76

AvatarGPT: All-in-One Framework for Motion Understanding Planning Generation and Beyond

Zixiang Zhou, Yu Wan, Baoyuan Wang

CVPR 2024
52
citations
#77

AgentTrek: Agent Trajectory Synthesis via Guiding Replay with Web Tutorials

Yiheng Xu, Dunjie Lu, Zhennan Shen et al.

ICLR 2025
50
citations
#78

FRESCO: Spatial-Temporal Correspondence for Zero-Shot Video Translation

Shuai Yang, Yifan Zhou, Ziwei Liu et al.

CVPR 2024
49
citations
#79

ReMamber: Referring Image Segmentation with Mamba Twister

Yuhuan Yang, Chaofan Ma, Jiangchao Yao et al.

ECCV 2024
49
citations
#80

TranSplat: Generalizable 3D Gaussian Splatting from Sparse Multi-View Images with Transformers

Chuanrui Zhang, Yingshuang Zou, Zhuoling Li et al.

AAAI 2025
49
citations
#81

SubT-MRS Dataset: Pushing SLAM Towards All-weather Environments

Shibo Zhao, Yuanjun Gao, Tianhao Wu et al.

CVPR 2024
49
citations
#82

You See it, You Got it: Learning 3D Creation on Pose-Free Videos at Scale

Baorui Ma, Huachen Gao, Haoge Deng et al.

CVPR 2025arXiv:2412.06699
3d generation modelsmulti-view diffusion modelpose-free videoslarge-scale video data+4
49
citations
#83

GPAvatar: Generalizable and Precise Head Avatar from Image(s)

Xuangeng Chu, Yu Li, Ailing Zeng et al.

ICLR 2024
48
citations
#84

Cross-Domain Few-Shot Object Detection via Enhanced Open-Set Object Detector

Yuqian Fu, Yu Wang, Yixuan Pan et al.

ECCV 2024
48
citations
#85

Feature Fusion from Head to Tail for Long-Tailed Visual Recognition

Mengke Li, Zhikai HU, Yang Lu et al.

AAAI 2024arXiv:2306.06963
long-tailed recognitionfeature fusionclass imbalancedecision boundary optimization+3
48
citations
#86

An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels

Duy-Kien Nguyen, Mahmoud Assran, Unnat Jain et al.

ICLR 2025
48
citations
#87

Frozen Transformers in Language Models Are Effective Visual Encoder Layers

Ziqi Pang, Ziyang Xie, Yunze Man et al.

ICLR 2024
48
citations
#88

Neural Markov Random Field for Stereo Matching

Tongfan Guan, Chen Wang, Yun-Hui Liu

CVPR 2024
48
citations
#89

LogFormer: A Pre-train and Tuning Pipeline for Log Anomaly Detection

hongcheng Guo, Jian Yang, Jiaheng Liu et al.

AAAI 2024arXiv:2401.04749
log anomaly detectiontransformer-based frameworkadapter-based tuningmulti-domain logs+3
47
citations
#90

UniTraj: A Unified Framework for Scalable Vehicle Trajectory Prediction

Lan Feng, Mohammadhossein Bahari, Kaouther Messaoud et al.

ECCV 2024arXiv:2403.15098
vehicle trajectory predictionunified frameworkdataset unificationcross-dataset generalization+4
47
citations
#91

Digital Life Project: Autonomous 3D Characters with Social Intelligence

Zhongang Cai, Jianping Jiang, Zhongfei Qing et al.

CVPR 2024
46
citations
#92

ODEFormer: Symbolic Regression of Dynamical Systems with Transformers

Stéphane d'Ascoli, Sören Becker, Philippe Schwaller et al.

ICLR 2024
46
citations
#93

Portrait4D-v2: Pseudo Multi-View Data Creates Better 4D Head Synthesizer

Yu Deng, Duomin Wang, Baoyuan Wang

ECCV 2024
45
citations
#94

TEOChat: A Large Vision-Language Assistant for Temporal Earth Observation Data

Jeremy Irvin, Emily Liu, Joyce Chen et al.

ICLR 2025arXiv:2410.06234
vision-language assistanttemporal earth observationinstruction-following datasetchange detection+4
45
citations
#95

Pre-trained Vision and Language Transformers Are Few-Shot Incremental Learners

Keon Hee Park, Kyungwoo Song, Gyeong-Moon Park

CVPR 2024
45
citations
#96

SEPT: Towards Efficient Scene Representation Learning for Motion Prediction

Zhiqian Lan, Yuxuan Jiang, Yao Mu et al.

ICLR 2024
45
citations
#97

TOP-ReID: Multi-Spectral Object Re-identification with Token Permutation

Yuhao Wang, Xuehu Liu, Pingping Zhang et al.

AAAI 2024arXiv:2312.09612
multi-spectral object re-identificationvision transformerstoken permutationfeature aggregation+4
45
citations
#98

Visual Agents as Fast and Slow Thinkers

Guangyan Sun, Mingyu Jin, Zhenting Wang et al.

ICLR 2025
44
citations
#99

DrivingGPT: Unifying Driving World Modeling and Planning with Multi-modal Autoregressive Transformers

Yuntao Chen, Yuqi Wang, Zhaoxiang Zhang

ICCV 2025
44
citations
#100

MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for Accelerating Vision-Language Transformer

Jianjian Cao, Peng Ye, Shengze Li et al.

CVPR 2024
44
citations