Vision Transformers
Transformers for computer vision
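For readers new to the topic, the sketch below illustrates the core idea shared by most of the papers listed here: the image is split into fixed-size patches, each patch is embedded as a token, and a standard transformer encoder processes the resulting token sequence. This is a minimal, hedged sketch assuming PyTorch; the hyperparameters and class name `MiniViT` are illustrative and do not correspond to any specific paper below.

```python
# Minimal Vision Transformer sketch (assumption: PyTorch; values are illustrative).
import torch
import torch.nn as nn

class MiniViT(nn.Module):
    def __init__(self, image_size=224, patch_size=16, dim=384, depth=6,
                 heads=6, num_classes=1000):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # Patch embedding: a strided convolution splits the image into
        # non-overlapping patches and projects each patch to `dim` channels.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim,
            batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):
        # (B, 3, H, W) -> (B, dim, H/ps, W/ps) -> (B, num_patches, dim)
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed
        tokens = self.encoder(tokens)
        return self.head(tokens[:, 0])  # classify from the [CLS] token

if __name__ == "__main__":
    logits = MiniViT()(torch.randn(2, 3, 224, 224))
    print(logits.shape)  # torch.Size([2, 1000])
```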
Top Papers
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
Zhe Chen, Jiannan Wu, Wenhai Wang et al.
VBench: Comprehensive Benchmark Suite for Video Generative Models
Ziqi Huang, Yinan He, Jiashuo Yu et al.
Scaffold-GS: Structured 3D Gaussians for View-Adaptive Rendering
Tao Lu, Mulin Yu, Linning Xu et al.
Grounding Image Matching in 3D with MASt3R
Vincent Leroy, Yohann Cabon, Jerome Revaud
RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation
Songming Liu, Lingxuan Wu, Bangguo Li et al.
GS-SLAM: Dense Visual SLAM with 3D Gaussian Splatting
Chi Yan, Delin Qu, Dong Wang et al.
Triplane Meets Gaussian Splatting: Fast and Generalizable Single-View 3D Reconstruction with Transformers
Zi-Xin Zou, Zhipeng Yu, Yuan-Chen Guo et al.
DL3DV-10K: A Large-Scale Scene Dataset for Deep Learning-based 3D Vision
Lu Ling, Yichen Sheng, Zhi Tu et al.
EquiformerV2: Improved Equivariant Transformer for Scaling to Higher-Degree Representations
Yi-Lun Liao, Brandon Wood, Abhishek Das et al.
EfficientSAM: Leveraged Masked Image Pretraining for Efficient Segment Anything
Yunyang Xiong, Balakrishnan Varadarajan, Lemeng Wu et al.
SkySense: A Multi-Modal Remote Sensing Foundation Model Towards Universal Interpretation for Earth Observation Imagery
Xin Guo, Jiangwei Lao, Bo Dang et al.
Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation
Hongtao Wu, Ya Jing, Chilam Cheang et al.
Sequential Modeling Enables Scalable Learning for Large Vision Models
Yutong Bai, Xinyang Geng, Karttikeya Mangalam et al.
Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction
Yiheng Xu, Zekun Wang, Junli Wang et al.
Uni3D: Exploring Unified 3D Representation at Scale
Junsheng Zhou, Jinsheng Wang, Baorui Ma et al.
GPS-Gaussian: Generalizable Pixel-wise 3D Gaussian Splatting for Real-time Human Novel View Synthesis
Shunyuan Zheng, Boyao Zhou, Ruizhi Shao et al.
Street Gaussians: Modeling Dynamic Urban Scenes with Gaussian Splatting
Yunzhi Yan, Haotong Lin, Chenxu Zhou et al.
Gaussian Head Avatar: Ultra High-fidelity Head Avatar via Dynamic Gaussians
Yuelang Xu, Benwang Chen, Zhe Li et al.
Osprey: Pixel Understanding with Visual Instruction Tuning
Yuqian Yuan, Wentong Li, Jian Liu et al.
Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding
Yan Shu, Zheng Liu, Peitian Zhang et al.
Rotary Position Embedding for Vision Transformer
Byeongho Heo, Song Park, Dongyoon Han et al.
Transcending Forgery Specificity with Latent Space Augmentation for Generalizable Deepfake Detection
Zhiyuan Yan, Yuhao Luo, Siwei Lyu et al.
SGS-SLAM: Semantic Gaussian Splatting For Neural Dense SLAM
Mingrui Li, Shuhong Liu, Heng Zhou et al.
ViT-CoMer: Vision Transformer with Convolutional Multi-scale Feature Interaction for Dense Predictions
Chunlong Xia, Xinliang Wang, Feng Lv et al.
SCTNet: Single Branch CNN with Transformer Semantic Information for Real-Time Segmentation
Zhengze Xu, Dongyue Wu, Changqian Yu et al.
Real-IAD: A Real-World Multi-View Dataset for Benchmarking Versatile Industrial Anomaly Detection
Chengjie Wang, Wenbing Zhu, Bin-Bin Gao et al.
The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World
Weiyun Wang, Min Shi, Qingyun Li et al.
Pinwheel-shaped Convolution and Scale-based Dynamic Loss for Infrared Small Target Detection
Jiangnan Yang, Shuangli Liu, Jingjun Wu et al.
CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction
Size Wu, Wenwei Zhang, Lumin Xu et al.
OmniRe: Omni Urban Scene Reconstruction
Ziyu Chen, Jiawei Yang, Jiahui Huang et al.
PLGSLAM: Progressive Neural Scene Representation with Local to Global Bundle Adjustment
Tianchen Deng, Guole Shen, Tong Qin et al.
Stronger, Fewer, & Superior: Harnessing Vision Foundation Models for Domain Generalized Semantic Segmentation
Zhixiang Wei, Lin Chen, Xiaoxiao Ma et al.
FoundationStereo: Zero-Shot Stereo Matching
Bowen Wen, Matthew Trepte, Oluwaseun Joseph Aribido et al.
SHViT: Single-Head Vision Transformer with Memory Efficient Macro Design
Seokju Yun, Youngmin Ro
Transformers without Normalization
Jiachen Zhu, Xinlei Chen, Kaiming He et al.
FLARE: Feed-forward Geometry, Appearance and Camera Estimation from Uncalibrated Sparse Views
Shangzhan Zhang, Jianyuan Wang, Yinghao Xu et al.
Decoding Natural Images from EEG for Object Recognition
Yonghao Song, Bingchuan Liu, Xiang Li et al.
Brain decoding: toward real-time reconstruction of visual perception
Yohann Benchetrit, Hubert Banville, Jean-Remi King
Rotated Multi-Scale Interaction Network for Referring Remote Sensing Image Segmentation
Sihan Liu, Yiwei Ma, Xiaoqing Zhang et al.
LVSM: A Large View Synthesis Model with Minimal 3D Inductive Bias
Haian Jin, Hanwen Jiang, Hao Tan et al.
Vision-LSTM: xLSTM as Generic Vision Backbone
Benedikt Alkin, Maximilian Beck, Korbinian Pöppel et al.
ZeroNVS: Zero-Shot 360-Degree View Synthesis from a Single Image
Kyle Sargent, Zizhang Li, Tanmay Shah et al.
Octopus: Embodied Vision-Language Programmer from Environmental Feedback
Jingkang Yang, Yuhao Dong, Shuai Liu et al.
MV-DUSt3R+: Single-Stage Scene Reconstruction from Sparse Views In 2 Seconds
Zhenggang Tang, Yuchen Fan, Dilin Wang et al.
FlashAvatar: High-fidelity Head Avatar with Efficient Gaussian Embedding
Jun Xiang, Xuan Gao, Yudong Guo et al.
AC3D: Analyzing and Improving 3D Camera Control in Video Diffusion Transformers
Sherwin Bahmani, Ivan Skorokhodov, Guocheng Qian et al.
Rethinking Transformers Pre-training for Multi-Spectral Satellite Imagery
Mubashir Noman, Muzammal Naseer, Hisham Cholakkal et al.
Transcending the Limit of Local Window: Advanced Super-Resolution Transformer with Adaptive Token Dictionary
Leheng Zhang, Yawei Li, Xingyu Zhou et al.
CricaVPR: Cross-image Correlation-aware Representation Learning for Visual Place Recognition
Feng Lu, Xiangyuan Lan, Lijun Zhang et al.
BAD-Gaussians: Bundle Adjusted Deblur Gaussian Splatting
Lingzhe Zhao, Peng Wang, Peidong Liu
Grounding Everything: Emerging Localization Properties in Vision-Language Transformers
Walid Bousselham, Felix Petersen, Vittorio Ferrari et al.
COTR: Compact Occupancy TRansformer for Vision-based 3D Occupancy Prediction
Qihang Ma, Xin Tan, Yanyun Qu et al.
MaskBit: Embedding-free Image Generation via Bit Tokens
Mark Weber, Lijun Yu, Qihang Yu et al.
UniReal: Universal Image Generation and Editing via Learning Real-world Dynamics
Xi Chen, Zhifei Zhang, He Zhang et al.
MV-Adapter: Multi-View Consistent Image Generation Made Easy
Zehuan Huang, Yuan-Chen Guo, Haoran Wang et al.
DiT4Edit: Diffusion Transformer for Image Editing
Kunyu Feng, Yue Ma, Bingyuan Wang et al.
GAvatar: Animatable 3D Gaussian Avatars with Implicit Mesh Learning
Ye Yuan, Xueting Li, Yangyi Huang et al.
Frequency-Spatial Entanglement Learning for Camouflaged Object Detection
Yanguang Sun, Chunyan Xu, Jian Yang et al.
TRAM: Global Trajectory and Motion of 3D Humans from in-the-wild Videos
Yufu Wang, Ziyun Wang, Lingjie Liu et al.
Unifying 3D Vision-Language Understanding via Promptable Queries
Ziyu Zhu, Zhuofan Zhang, Xiaojian Ma et al.
HARDVS: Revisiting Human Activity Recognition with Dynamic Vision Sensors
Xiao Wang, Zongzhen Wu, Bo Jiang et al.
Accelerating Diffusion Transformers with Token-wise Feature Caching
Chang Zou, Xuyang Liu, Ting Liu et al.
Image and Video Tokenization with Binary Spherical Quantization
Yue Zhao, Yuanjun Xiong, Philipp Krähenbühl
PerceptionGPT: Effectively Fusing Visual Perception into LLM
Renjie Pi, Lewei Yao, Jiahui Gao et al.
DocFormerv2: Local Features for Document Understanding
Srikar Appalaraju, Peng Tang, Qi Dong et al.
Referred by Multi-Modality: A Unified Temporal Transformer for Video Object Segmentation
Shilin Yan, Renrui Zhang, Ziyu Guo et al.
Neural Parametric Gaussians for Monocular Non-Rigid Object Reconstruction
Devikalyan Das, Christopher Wewer, Raza Yunus et al.
Magnushammer: A Transformer-Based Approach to Premise Selection
Maciej Mikuła, Szymon Tworkowski, Szymon Antoniak et al.
FBRT-YOLO: Faster and Better for Real-Time Aerial Image Detection
Yao Xiao, Tingfa Xu, Yu Xin et al.
Self-Distilled Masked Auto-Encoders are Efficient Video Anomaly Detectors
Nicolae Ristea, Florinel Croitoru, Radu Tudor Ionescu et al.
SynCamMaster: Synchronizing Multi-Camera Video Generation from Diverse Viewpoints
Jianhong Bai, Menghan Xia, Xintao Wang et al.
VLCounter: Text-Aware Visual Representation for Zero-Shot Object Counting
Seunggu Kang, WonJun Moon, Euiyeon Kim et al.
Scaling Transformers for Low-Bitrate High-Quality Speech Coding
Julian Parker, Anton Smirnov, Jordi Pons et al.
FakeInversion: Learning to Detect Images from Unseen Text-to-Image Models by Inverting Stable Diffusion
George Cazenavette, Avneesh Sud, Thomas Leung et al.
HiT-SR: Hierarchical Transformer for Efficient Image Super-Resolution
Xiang Zhang, Yulun Zhang, Fisher Yu
AvatarGPT: All-in-One Framework for Motion Understanding, Planning, Generation and Beyond
Zixiang Zhou, Yu Wan, Baoyuan Wang
AgentTrek: Agent Trajectory Synthesis via Guiding Replay with Web Tutorials
Yiheng Xu, Dunjie Lu, Zhennan Shen et al.
FRESCO: Spatial-Temporal Correspondence for Zero-Shot Video Translation
Shuai Yang, Yifan Zhou, Ziwei Liu et al.
ReMamber: Referring Image Segmentation with Mamba Twister
Yuhuan Yang, Chaofan Ma, Jiangchao Yao et al.
TranSplat: Generalizable 3D Gaussian Splatting from Sparse Multi-View Images with Transformers
Chuanrui Zhang, Yingshuang Zou, Zhuoling Li et al.
SubT-MRS Dataset: Pushing SLAM Towards All-weather Environments
Shibo Zhao, Yuanjun Gao, Tianhao Wu et al.
You See it, You Got it: Learning 3D Creation on Pose-Free Videos at Scale
Baorui Ma, Huachen Gao, Haoge Deng et al.
GPAvatar: Generalizable and Precise Head Avatar from Image(s)
Xuangeng Chu, Yu Li, Ailing Zeng et al.
Cross-Domain Few-Shot Object Detection via Enhanced Open-Set Object Detector
Yuqian Fu, Yu Wang, Yixuan Pan et al.
Feature Fusion from Head to Tail for Long-Tailed Visual Recognition
Mengke Li, Zhikai Hu, Yang Lu et al.
An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels
Duy-Kien Nguyen, Mahmoud Assran, Unnat Jain et al.
Frozen Transformers in Language Models Are Effective Visual Encoder Layers
Ziqi Pang, Ziyang Xie, Yunze Man et al.
Neural Markov Random Field for Stereo Matching
Tongfan Guan, Chen Wang, Yun-Hui Liu
LogFormer: A Pre-train and Tuning Pipeline for Log Anomaly Detection
Hongcheng Guo, Jian Yang, Jiaheng Liu et al.
UniTraj: A Unified Framework for Scalable Vehicle Trajectory Prediction
Lan Feng, Mohammadhossein Bahari, Kaouther Messaoud et al.
Digital Life Project: Autonomous 3D Characters with Social Intelligence
Zhongang Cai, Jianping Jiang, Zhongfei Qing et al.
ODEFormer: Symbolic Regression of Dynamical Systems with Transformers
Stéphane d'Ascoli, Sören Becker, Philippe Schwaller et al.
Portrait4D-v2: Pseudo Multi-View Data Creates Better 4D Head Synthesizer
Yu Deng, Duomin Wang, Baoyuan Wang
TEOChat: A Large Vision-Language Assistant for Temporal Earth Observation Data
Jeremy Irvin, Emily Liu, Joyce Chen et al.
Pre-trained Vision and Language Transformers Are Few-Shot Incremental Learners
Keon Hee Park, Kyungwoo Song, Gyeong-Moon Park
SEPT: Towards Efficient Scene Representation Learning for Motion Prediction
Zhiqian Lan, Yuxuan Jiang, Yao Mu et al.
TOP-ReID: Multi-Spectral Object Re-identification with Token Permutation
Yuhao Wang, Xuehu Liu, Pingping Zhang et al.
Visual Agents as Fast and Slow Thinkers
Guangyan Sun, Mingyu Jin, Zhenting Wang et al.
DrivingGPT: Unifying Driving World Modeling and Planning with Multi-modal Autoregressive Transformers
Yuntao Chen, Yuqi Wang, Zhaoxiang Zhang
MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for Accelerating Vision-Language Transformer
Jianjian Cao, Peng Ye, Shengze Li et al.