🧬Architectures

Vision Transformers

Transformers for computer vision

100 papers13,494 total citations

Compare with other topics

Feb '24 — Jan '262251 papers

Top Conferences

CVPR: 46 ICLR: 25 ECCV: 14 AAAI: 12 ICCV: 2 ICML: 1

Top Papers

#1

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

Zhe Chen, Jiannan Wu, Wenhai Wang et al.

VBench: Comprehensive Benchmark Suite for Video Generative Models

Ziqi Huang, Yinan He, Jiashuo Yu et al.

Scaffold-GS: Structured 3D Gaussians for View-Adaptive Rendering

Tao Lu, Mulin Yu, Linning Xu et al.

Grounding Image Matching in 3D with MASt3R

Vincent Leroy, Yohann Cabon, Jerome Revaud

RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation

Songming Liu, Lingxuan Wu, Bangguo Li et al.

GS-SLAM: Dense Visual SLAM with 3D Gaussian Splatting

Chi Yan, Delin Qu, Dong Wang et al.

Triplane Meets Gaussian Splatting: Fast and Generalizable Single-View 3D Reconstruction with Transformers

Zi-Xin Zou, Zhipeng Yu, Yuan-Chen Guo et al.

DL3DV-10K: A Large-Scale Scene Dataset for Deep Learning-based 3D Vision

Lu Ling, Yichen Sheng, Zhi Tu et al.

EquiformerV2: Improved Equivariant Transformer for Scaling to Higher-Degree Representations

Yi-Lun Liao, Brandon Wood, Abhishek Das et al.

EfficientSAM: Leveraged Masked Image Pretraining for Efficient Segment Anything

Yunyang Xiong, Balakrishnan Varadarajan, Lemeng Wu et al.

SkySense: A Multi-Modal Remote Sensing Foundation Model Towards Universal Interpretation for Earth Observation Imagery

Xin Guo, Jiangwei Lao, Bo Dang et al.

Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation

Hongtao Wu, Ya Jing, Chilam Cheang et al.

Sequential Modeling Enables Scalable Learning for Large Vision Models

Yutong Bai, Xinyang Geng, Karttikeya Mangalam et al.

Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction

Yiheng Xu, Zekun Wang, Junli Wang et al.

Uni3D: Exploring Unified 3D Representation at Scale

Junsheng Zhou, Jinsheng Wang, Baorui Ma et al.

GPS-Gaussian: Generalizable Pixel-wise 3D Gaussian Splatting for Real-time Human Novel View Synthesis

Shunyuan Zheng, Boyao ZHOU, Ruizhi Shao et al.

Street Gaussians: Modeling Dynamic Urban Scenes with Gaussian Splatting

Yunzhi Yan, Haotong Lin, Chenxu Zhou et al.

Gaussian Head Avatar: Ultra High-fidelity Head Avatar via Dynamic Gaussians

Yuelang Xu, Benwang Chen, Zhe Li et al.

Osprey: Pixel Understanding with Visual Instruction Tuning

Yuqian Yuan, Wentong Li, Jian liu et al.

Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding

Yan Shu, Zheng Liu, Peitian Zhang et al.

Rotary Position Embedding for Vision Transformer

Byeongho Heo, Song Park, Dongyoon Han et al.

Transcending Forgery Specificity with Latent Space Augmentation for Generalizable Deepfake Detection

Zhiyuan Yan, Yuhao Luo, Siwei Lyu et al.

SGS-SLAM: Semantic Gaussian Splatting For Neural Dense SLAM

Mingrui Li, Shuhong Liu, Heng Zhou et al.

ECCV 2024arXiv:2402.03246

gaussian splattingvisual slamsemantic segmentationneural implicit slam+4

131

citations

#24

ViT-CoMer: Vision Transformer with Convolutional Multi-scale Feature Interaction for Dense Predictions

Chunlong Xia, Xinliang Wang, Feng Lv et al.

SCTNet: Single Branch CNN with Transformer Semantic Information for Real-Time Segmentation

Authors: Zhengze Xu, Dongyue Wu, Changqian Yu et al.

AAAI 2024arXiv:2312.17071

real-time segmentationsemantic segmentationsingle branch cnntransformer semantic information+3

126

citations

#26

Real-IAD: A Real-World Multi-View Dataset for Benchmarking Versatile Industrial Anomaly Detection

Chengjie Wang, wenbing zhu, Bin-Bin Gao et al.

The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World

Weiyun Wang, Min Shi, Qingyun Li et al.

Pinwheel-shaped Convolution and Scale-based Dynamic Loss for Infrared Small Target Detection

Jiangnan Yang, Shuangli Liu, Jingjun Wu et al.

CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction

Size Wu, Wenwei Zhang, Lumin Xu et al.

OmniRe: Omni Urban Scene Reconstruction

Ziyu Chen, Jiawei Yang, Jiahui Huang et al.

PLGSLAM: Progressive Neural Scene Represenation with Local to Global Bundle Adjustment

Tianchen Deng, Guole Shen, Tong Qin et al.

Stronger Fewer & Superior: Harnessing Vision Foundation Models for Domain Generalized Semantic Segmentation

ZHIXIANG WEI, Lin Chen, Xiaoxiao Ma et al.

FoundationStereo: Zero-Shot Stereo Matching

Bowen Wen, Matthew Trepte, Oluwaseun Joseph Aribido et al.

SHViT: Single-Head Vision Transformer with Memory Efficient Macro Design

Seokju Yun, Youngmin Ro

Transformers without Normalization

Jiachen Zhu, Xinlei Chen, Kaiming He et al.

CVPR 2025arXiv:2503.10622

transformer architecturenormalization layersdynamic tanhself-supervised learning+3

96

citations

#36

FLARE: Feed-forward Geometry, Appearance and Camera Estimation from Uncalibrated Sparse Views

Shangzhan Zhang, Jianyuan Wang, Yinghao Xu et al.

Decoding Natural Images from EEG for Object Recognition

Yonghao Song, Bingchuan Liu, Xiang Li et al.

Brain decoding: toward real-time reconstruction of visual perception

Yohann Benchetrit, Hubert Banville, Jean-Remi King

Rotated Multi-Scale Interaction Network for Referring Remote Sensing Image Segmentation

Sihan liu, Yiwei Ma, Xiaoqing Zhang et al.

LVSM: A Large View Synthesis Model with Minimal 3D Inductive Bias

Haian Jin, Hanwen Jiang, Hao Tan et al.

Vision-LSTM: xLSTM as Generic Vision Backbone

Benedikt Alkin, Maximilian Beck, Korbinian Pöppel et al.

ZeroNVS: Zero-Shot 360-Degree View Synthesis from a Single Image

Kyle Sargent, Zizhang Li, Tanmay Shah et al.

Octopus: Embodied Vision-Language Programmer from Environmental Feedback

Jingkang Yang, Yuhao Dong, Shuai Liu et al.

MV-DUSt3R+: Single-Stage Scene Reconstruction from Sparse Views In 2 Seconds

Zhenggang Tang, Yuchen Fan, Dilin Wang et al.

FlashAvatar: High-fidelity Head Avatar with Efficient Gaussian Embedding

Jun Xiang, Xuan Gao, Yudong Guo et al.

AC3D: Analyzing and Improving 3D Camera Control in Video Diffusion Transformers

Sherwin Bahmani, Ivan Skorokhodov, Guocheng Qian et al.

Rethinking Transformers Pre-training for Multi-Spectral Satellite Imagery

Mubashir Noman, Muzammal Naseer, Hisham Cholakkal et al.

Transcending the Limit of Local Window: Advanced Super-Resolution Transformer with Adaptive Token Dictionary

Leheng Zhang, Yawei Li, Xingyu Zhou et al.

CricaVPR: Cross-image Correlation-aware Representation Learning for Visual Place Recognition

Feng Lu, Xiangyuan Lan, Lijun Zhang et al.

BAD-Gaussians: Bundle Adjusted Deblur Gaussian Splatting

Lingzhe Zhao, Peng Wang, Peidong Liu

ECCV 2024arXiv:2403.11831

3d gaussian splattingmotion deblurringneural renderingbundle adjustment+4

74

citations

#51

Grounding Everything: Emerging Localization Properties in Vision-Language Transformers

Walid Bousselham, Felix Petersen, Vittorio Ferrari et al.

COTR: Compact Occupancy TRansformer for Vision-based 3D Occupancy Prediction

Qihang Ma, Xin Tan, Yanyun Qu et al.

MaskBit: Embedding-free Image Generation via Bit Tokens

Mark Weber, Lijun Yu, Qihang Yu et al.

UniReal: Universal Image Generation and Editing via Learning Real-world Dynamics

Xi Chen, Zhifei Zhang, He Zhang et al.

MV-Adapter: Multi-View Consistent Image Generation Made Easy

Zehuan Huang, Yuan-Chen Guo, Haoran Wang et al.

DiT4Edit: Diffusion Transformer for Image Editing

Kunyu Feng, Yue Ma, Bingyuan Wang et al.

GAvatar: Animatable 3D Gaussian Avatars with Implicit Mesh Learning

Ye Yuan, Xueting Li, Yangyi Huang et al.

Frequency-Spatial Entanglement Learning for Camouflaged Object Detection

Yanguang Sun, Chunyan Xu, Jian Yang et al.

TRAM: Global Trajectory and Motion of 3D Humans from in-the-wild Videos

Yufu Wang, Ziyun Wang, Lingjie Liu et al.

ECCV 2024arXiv:2403.17346

human motion reconstructionglobal trajectory estimationslam robustificationvideo transformer model+4

66

citations

#60

Unifying 3D Vision-Language Understanding via Promptable Queries

ziyu zhu, Zhuofan Zhang, Xiaojian Ma et al.

HARDVS: Revisiting Human Activity Recognition with Dynamic Vision Sensors

Xiao Wang, Zongzhen Wu, Bo Jiang et al.

AAAI 2024arXiv:2211.09648

human activity recognitiondynamic vision sensorsevent camerasspatial-temporal feature learning+3

64

citations

#62

Accelerating Diffusion Transformers with Token-wise Feature Caching

Chang Zou, Xuyang Liu, Ting Liu et al.

Image and Video Tokenization with Binary Spherical Quantization

Yue Zhao, Yuanjun Xiong, Philipp Krähenbühl

PerceptionGPT: Effectively Fusing Visual Perception into LLM

Renjie Pi, Lewei Yao, Jiahui Gao et al.

DocFormerv2: Local Features for Document Understanding

Srikar Appalaraju, Peng Tang, Qi Dong et al.

AAAI 2024arXiv:2306.01733

visual document understandingmulti-modal transformerlocal-feature alignmentdocument information extraction+4

58

citations

#66

Referred by Multi-Modality: A Unified Temporal Transformer for Video Object Segmentation

Shilin Yan, Renrui Zhang, Ziyu Guo et al.

AAAI 2024arXiv:2305.16318

video object segmentationmulti-modal referencetemporal transformersemantic alignment+4

58

citations

#67

Neural Parametric Gaussians for Monocular Non-Rigid Object Reconstruction

Devikalyan Das, Christopher Wewer, Raza Yunus et al.

Magnushammer: A Transformer-Based Approach to Premise Selection

Maciej Mikuła, Szymon Tworkowski, Szymon Antoniak et al.

FBRT-YOLO: Faster and Better for Real-Time Aerial Image Detection

Yao Xiao, Tingfa Xu, Yu Xin et al.

Self-Distilled Masked Auto-Encoders are Efficient Video Anomaly Detectors

Nicolae Ristea, Florinel Croitoru, Radu Tudor Ionescu et al.

SynCamMaster: Synchronizing Multi-Camera Video Generation from Diverse Viewpoints

Jianhong Bai, Menghan Xia, Xintao WANG et al.

VLCounter: Text-Aware Visual Representation for Zero-Shot Object Counting

Seunggu Kang, WonJun Moon, Euiyeon Kim et al.

AAAI 2024arXiv:2312.16580

zero-shot object countingsemantic-patch embeddingsvisual-language representationsemantic-conditioned prompt tuning+3

54

citations

#73

Scaling Transformers for Low-Bitrate High-Quality Speech Coding

Julian Parker, Anton Smirnov, Jordi Pons et al.

FakeInversion: Learning to Detect Images from Unseen Text-to-Image Models by Inverting Stable Diffusion

George Cazenavette, Avneesh Sud, Thomas Leung et al.

HiT-SR: Hierarchical Transformer for Efficient Image Super-Resolution

Xiang Zhang, Yulun Zhang, Fisher Yu

AvatarGPT: All-in-One Framework for Motion Understanding Planning Generation and Beyond

Zixiang Zhou, Yu Wan, Baoyuan Wang

AgentTrek: Agent Trajectory Synthesis via Guiding Replay with Web Tutorials

Yiheng Xu, Dunjie Lu, Zhennan Shen et al.

FRESCO: Spatial-Temporal Correspondence for Zero-Shot Video Translation

Shuai Yang, Yifan Zhou, Ziwei Liu et al.

ReMamber: Referring Image Segmentation with Mamba Twister

Yuhuan Yang, Chaofan Ma, Jiangchao Yao et al.

TranSplat: Generalizable 3D Gaussian Splatting from Sparse Multi-View Images with Transformers

Chuanrui Zhang, Yingshuang Zou, Zhuoling Li et al.

SubT-MRS Dataset: Pushing SLAM Towards All-weather Environments

Shibo Zhao, Yuanjun Gao, Tianhao Wu et al.

You See it, You Got it: Learning 3D Creation on Pose-Free Videos at Scale

Baorui Ma, Huachen Gao, Haoge Deng et al.

CVPR 2025arXiv:2412.06699

3d generation modelsmulti-view diffusion modelpose-free videoslarge-scale video data+4

49

citations

#83

GPAvatar: Generalizable and Precise Head Avatar from Image(s)

Xuangeng Chu, Yu Li, Ailing Zeng et al.

Cross-Domain Few-Shot Object Detection via Enhanced Open-Set Object Detector

Yuqian Fu, Yu Wang, Yixuan Pan et al.

Feature Fusion from Head to Tail for Long-Tailed Visual Recognition

Mengke Li, Zhikai HU, Yang Lu et al.

AAAI 2024arXiv:2306.06963

long-tailed recognitionfeature fusionclass imbalancedecision boundary optimization+3

48

citations

#86

An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels

Duy-Kien Nguyen, Mahmoud Assran, Unnat Jain et al.

Frozen Transformers in Language Models Are Effective Visual Encoder Layers

Ziqi Pang, Ziyang Xie, Yunze Man et al.

Neural Markov Random Field for Stereo Matching

Tongfan Guan, Chen Wang, Yun-Hui Liu

LogFormer: A Pre-train and Tuning Pipeline for Log Anomaly Detection

hongcheng Guo, Jian Yang, Jiaheng Liu et al.

AAAI 2024arXiv:2401.04749

log anomaly detectiontransformer-based frameworkadapter-based tuningmulti-domain logs+3

47

citations

#90

UniTraj: A Unified Framework for Scalable Vehicle Trajectory Prediction

Lan Feng, Mohammadhossein Bahari, Kaouther Messaoud et al.

ECCV 2024arXiv:2403.15098

vehicle trajectory predictionunified frameworkdataset unificationcross-dataset generalization+4

47

citations

#91

Digital Life Project: Autonomous 3D Characters with Social Intelligence

Zhongang Cai, Jianping Jiang, Zhongfei Qing et al.

ODEFormer: Symbolic Regression of Dynamical Systems with Transformers

Stéphane d'Ascoli, Sören Becker, Philippe Schwaller et al.

Portrait4D-v2: Pseudo Multi-View Data Creates Better 4D Head Synthesizer

Yu Deng, Duomin Wang, Baoyuan Wang

TEOChat: A Large Vision-Language Assistant for Temporal Earth Observation Data

Jeremy Irvin, Emily Liu, Joyce Chen et al.

ICLR 2025arXiv:2410.06234

vision-language assistanttemporal earth observationinstruction-following datasetchange detection+4

45

citations

#95

Pre-trained Vision and Language Transformers Are Few-Shot Incremental Learners

Keon Hee Park, Kyungwoo Song, Gyeong-Moon Park

SEPT: Towards Efficient Scene Representation Learning for Motion Prediction

Zhiqian Lan, Yuxuan Jiang, Yao Mu et al.

TOP-ReID: Multi-Spectral Object Re-identification with Token Permutation

Yuhao Wang, Xuehu Liu, Pingping Zhang et al.

AAAI 2024arXiv:2312.09612

multi-spectral object re-identificationvision transformerstoken permutationfeature aggregation+4

45

citations

#98

Visual Agents as Fast and Slow Thinkers

Guangyan Sun, Mingyu Jin, Zhenting Wang et al.

DrivingGPT: Unifying Driving World Modeling and Planning with Multi-modal Autoregressive Transformers

Yuntao Chen, Yuqi Wang, Zhaoxiang Zhang

MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for Accelerating Vision-Language Transformer

Jianjian Cao, Peng Ye, Shengze Li et al.

CVPR 2024

44

citations

Vision Transformers

Top Conferences

Related Topics (Architectures)

Top Papers

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

VBench: Comprehensive Benchmark Suite for Video Generative Models

Scaffold-GS: Structured 3D Gaussians for View-Adaptive Rendering

Grounding Image Matching in 3D with MASt3R

RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation

GS-SLAM: Dense Visual SLAM with 3D Gaussian Splatting

Triplane Meets Gaussian Splatting: Fast and Generalizable Single-View 3D Reconstruction with Transformers

DL3DV-10K: A Large-Scale Scene Dataset for Deep Learning-based 3D Vision

EquiformerV2: Improved Equivariant Transformer for Scaling to Higher-Degree Representations

EfficientSAM: Leveraged Masked Image Pretraining for Efficient Segment Anything

SkySense: A Multi-Modal Remote Sensing Foundation Model Towards Universal Interpretation for Earth Observation Imagery

Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation

Sequential Modeling Enables Scalable Learning for Large Vision Models

Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction

Uni3D: Exploring Unified 3D Representation at Scale

GPS-Gaussian: Generalizable Pixel-wise 3D Gaussian Splatting for Real-time Human Novel View Synthesis

Street Gaussians: Modeling Dynamic Urban Scenes with Gaussian Splatting

Gaussian Head Avatar: Ultra High-fidelity Head Avatar via Dynamic Gaussians

Osprey: Pixel Understanding with Visual Instruction Tuning

Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding

Rotary Position Embedding for Vision Transformer

Transcending Forgery Specificity with Latent Space Augmentation for Generalizable Deepfake Detection

SGS-SLAM: Semantic Gaussian Splatting For Neural Dense SLAM

ViT-CoMer: Vision Transformer with Convolutional Multi-scale Feature Interaction for Dense Predictions

SCTNet: Single Branch CNN with Transformer Semantic Information for Real-Time Segmentation

Real-IAD: A Real-World Multi-View Dataset for Benchmarking Versatile Industrial Anomaly Detection

The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World

Pinwheel-shaped Convolution and Scale-based Dynamic Loss for Infrared Small Target Detection

CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction

OmniRe: Omni Urban Scene Reconstruction

PLGSLAM: Progressive Neural Scene Represenation with Local to Global Bundle Adjustment

Stronger Fewer & Superior: Harnessing Vision Foundation Models for Domain Generalized Semantic Segmentation

FoundationStereo: Zero-Shot Stereo Matching

SHViT: Single-Head Vision Transformer with Memory Efficient Macro Design

Transformers without Normalization

FLARE: Feed-forward Geometry, Appearance and Camera Estimation from Uncalibrated Sparse Views

Decoding Natural Images from EEG for Object Recognition

Brain decoding: toward real-time reconstruction of visual perception

Rotated Multi-Scale Interaction Network for Referring Remote Sensing Image Segmentation

LVSM: A Large View Synthesis Model with Minimal 3D Inductive Bias

Vision-LSTM: xLSTM as Generic Vision Backbone

ZeroNVS: Zero-Shot 360-Degree View Synthesis from a Single Image

Octopus: Embodied Vision-Language Programmer from Environmental Feedback

MV-DUSt3R+: Single-Stage Scene Reconstruction from Sparse Views In 2 Seconds

FlashAvatar: High-fidelity Head Avatar with Efficient Gaussian Embedding

AC3D: Analyzing and Improving 3D Camera Control in Video Diffusion Transformers

Rethinking Transformers Pre-training for Multi-Spectral Satellite Imagery

Transcending the Limit of Local Window: Advanced Super-Resolution Transformer with Adaptive Token Dictionary

CricaVPR: Cross-image Correlation-aware Representation Learning for Visual Place Recognition

BAD-Gaussians: Bundle Adjusted Deblur Gaussian Splatting

Grounding Everything: Emerging Localization Properties in Vision-Language Transformers

COTR: Compact Occupancy TRansformer for Vision-based 3D Occupancy Prediction

MaskBit: Embedding-free Image Generation via Bit Tokens

UniReal: Universal Image Generation and Editing via Learning Real-world Dynamics

MV-Adapter: Multi-View Consistent Image Generation Made Easy

DiT4Edit: Diffusion Transformer for Image Editing

GAvatar: Animatable 3D Gaussian Avatars with Implicit Mesh Learning

Frequency-Spatial Entanglement Learning for Camouflaged Object Detection

TRAM: Global Trajectory and Motion of 3D Humans from in-the-wild Videos

Unifying 3D Vision-Language Understanding via Promptable Queries

HARDVS: Revisiting Human Activity Recognition with Dynamic Vision Sensors

Accelerating Diffusion Transformers with Token-wise Feature Caching

Image and Video Tokenization with Binary Spherical Quantization

PerceptionGPT: Effectively Fusing Visual Perception into LLM

DocFormerv2: Local Features for Document Understanding

Referred by Multi-Modality: A Unified Temporal Transformer for Video Object Segmentation

Neural Parametric Gaussians for Monocular Non-Rigid Object Reconstruction

Magnushammer: A Transformer-Based Approach to Premise Selection

FBRT-YOLO: Faster and Better for Real-Time Aerial Image Detection

Self-Distilled Masked Auto-Encoders are Efficient Video Anomaly Detectors

SynCamMaster: Synchronizing Multi-Camera Video Generation from Diverse Viewpoints

VLCounter: Text-Aware Visual Representation for Zero-Shot Object Counting

Scaling Transformers for Low-Bitrate High-Quality Speech Coding

FakeInversion: Learning to Detect Images from Unseen Text-to-Image Models by Inverting Stable Diffusion

HiT-SR: Hierarchical Transformer for Efficient Image Super-Resolution

AvatarGPT: All-in-One Framework for Motion Understanding Planning Generation and Beyond