🧬 Architectures

Transformer Architecture

Transformer and attention-based architectures

100 papers · 4,855 total citations
[Publication trend chart, Feb '24 to Jan '26: 1,224 papers]
Also includes: transformer architecture, transformer, transformers, attention mechanism, self-attention
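
For orientation, the papers ranked below all build on, accelerate, or analyze the same core operation: scaled dot-product self-attention. The NumPy sketch below is purely illustrative and is not taken from any paper in this list; the function name, shapes, and single-head setup are assumptions made for the example.

```python
# Minimal sketch of scaled dot-product self-attention (single head, no masking).
# Illustrative only; shapes and names are assumptions, not any listed paper's code.
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """x: (seq_len, d_model); w_q/w_k/w_v: (d_model, d_head)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v              # project tokens to queries/keys/values
    scores = q @ k.T / np.sqrt(k.shape[-1])          # pairwise similarity, scaled by sqrt(d_head)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ v                               # attention-weighted sum of values

# Tiny usage example with random weights: 4 tokens, d_model = d_head = 8.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)               # shape (4, 8)
```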

Top Papers

#1

RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation

Songming Liu, Lingxuan Wu, Bangguo Li et al.

ICLR 2025
365 citations
#2

EquiformerV2: Improved Equivariant Transformer for Scaling to Higher-Degree Representations

Yi-Lun Liao, Brandon Wood, Abhishek Das et al.

ICLR 2024
254 citations
#3

DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads

Guangxuan Xiao, Jiaming Tang, Jingwei Zuo et al.

ICLR 2025 · arXiv:2410.10819
Keywords: kv cache pruning, long-context inference, attention heads, retrieval heads, +4 more
165 citations
#4

Grounded Text-to-Image Synthesis with Attention Refocusing

Quynh Phung, Songwei Ge, Jia-Bin Huang

CVPR 2024
157 citations
#5

Gated Delta Networks: Improving Mamba2 with Delta Rule

Songlin Yang, Jan Kautz, Ali Hatamizadeh

ICLR 2025
141 citations
#6

SCTNet: Single Branch CNN with Transformer Semantic Information for Real-Time Segmentation

Zhengze Xu, Dongyue Wu, Changqian Yu et al.

AAAI 2024 · arXiv:2312.17071
Keywords: real-time segmentation, semantic segmentation, single branch cnn, transformer semantic information, +3 more
126 citations
#7

An Attentive Inductive Bias for Sequential Recommendation beyond the Self-Attention

Yehjin Shin, Jeongwhan Choi, Hyowon Wi et al.

AAAI 2024 · arXiv:2312.10325
Keywords: sequential recommendation, self-attention mechanism, oversmoothing problem, transformer-based models, +4 more
99 citations
#8

HyperAttention: Long-context Attention in Near-Linear Time

Insu Han, Rajesh Jayaram, Amin Karbasi et al.

ICLR 2024
94 citations
#9

When Attention Sink Emerges in Language Models: An Empirical View

Xiangming Gu, Tianyu Pang, Chao Du et al.

ICLR 2025 · arXiv:2410.10781
Keywords: attention sink phenomenon, language model pre-training, softmax normalization, key biases, +4 more
90 citations
#10

How Many Pretraining Tasks Are Needed for In-Context Learning of Linear Regression?

Jingfeng Wu, Difan Zou, Zixiang Chen et al.

ICLR 2024
85 citations
#11

The Hedgehog & the Porcupine: Expressive Linear Attentions with Softmax Mimicry

Michael Zhang, Kush Bhatia, Hermann Kumbong et al.

ICLR 2024
84 citations
#12

Kolmogorov-Arnold Transformer

Xingyi Yang, Xinchao Wang

ICLR 2025
83 citations
#13

RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval

Di Liu, Meng Chen, Baotong Lu et al.

NeurIPS 2025
81 citations
#14

Real-Time Video Generation with Pyramid Attention Broadcast

Xuanlei Zhao, Xiaolong Jin, Kai Wang et al.

ICLR 2025
79 citations
#15

Salience DETR: Enhancing Detection Transformer with Hierarchical Salience Filtering Refinement

Xiuquan Hou, Meiqin Liu, Senlin Zhang et al.

CVPR 2024
78 citations
#16

Dynamic Graph Representation with Knowledge-aware Attention for Histopathology Whole Slide Image Analysis

Jiawen Li, Yuxuan Chen, Hongbo Chu et al.

CVPR 2024
75 citations
#17

EasyControl: Adding Efficient and Flexible Control for Diffusion Transformer

Yuxuan Zhang, Yirui Yuan, Yiren Song et al.

ICCV 2025
69 citations
#18

FINER: Flexible Spectral-bias Tuning in Implicit NEural Representation by Variable-periodic Activation Functions

Zhen Liu, Hao Zhu, Qi Zhang et al.

CVPR 2024
66 citations
#19

Focus on Your Instruction: Fine-grained and Multi-instruction Image Editing by Attention Modulation

Qin Guo, Tianwei Lin

CVPR 2024
63 citations
#20

MagicPIG: LSH Sampling for Efficient LLM Generation

Zhuoming Chen, Ranajoy Sadhukhan, Zihao Ye et al.

ICLR 2025 · arXiv:2410.16179
Keywords: attention approximation methods, kv cache bottleneck, locality sensitive hashing, sampling-based approximation, +3 more
62 citations
#21

Accelerating Diffusion Transformers with Token-wise Feature Caching

Chang Zou, Xuyang Liu, Ting Liu et al.

ICLR 2025
62 citations
#22

Attention-Challenging Multiple Instance Learning for Whole Slide Image Classification

Yunlong Zhang, Honglin Li, Yuxuan Sun et al.

ECCV 2024
61 citations
#23

HPNet: Dynamic Trajectory Forecasting with Historical Prediction Attention

Xiaolong Tang, Meina Kan, Shiguang Shan et al.

CVPR 2024
61 citations
#24

Zero-TPrune: Zero-Shot Token Pruning through Leveraging of the Attention Graph in Pre-Trained Transformers

Hongjie Wang, Bhishma Dedhia, Niraj Jha

CVPR 2024
59 citations
#25

Magnushammer: A Transformer-Based Approach to Premise Selection

Maciej Mikuła, Szymon Tworkowski, Szymon Antoniak et al.

ICLR 2024
57 citations
#26

Scaling Transformers for Low-Bitrate High-Quality Speech Coding

Julian Parker, Anton Smirnov, Jordi Pons et al.

ICLR 2025
54 citations
#27

See What You Are Told: Visual Attention Sink in Large Multimodal Models

Seil Kang, Jinyeong Kim, Junhyeok Kim et al.

ICLR 2025 · arXiv:2503.03321
Keywords: attention mechanism, visual attention sink, multimodal models, vision-language tasks, +4 more
52 citations
#28

HiT-SR: Hierarchical Transformer for Efficient Image Super-Resolution

Xiang Zhang, Yulun Zhang, Fisher Yu

ECCV 2024
52 citations
#29

An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels

Duy-Kien Nguyen, Mahmoud Assran, Unnat Jain et al.

ICLR 2025
48 citations
#30

Feature Fusion from Head to Tail for Long-Tailed Visual Recognition

Mengke Li, Zhikai Hu, Yang Lu et al.

AAAI 2024 · arXiv:2306.06963
Keywords: long-tailed recognition, feature fusion, class imbalance, decision boundary optimization, +3 more
48 citations
#31

LogFormer: A Pre-train and Tuning Pipeline for Log Anomaly Detection

Hongcheng Guo, Jian Yang, Jiaheng Liu et al.

AAAI 2024 · arXiv:2401.04749
Keywords: log anomaly detection, transformer-based framework, adapter-based tuning, multi-domain logs, +3 more
47 citations
#32

Simplifying Transformer Blocks

Bobby He, Thomas Hofmann

ICLR 2024
47 citations
#33

EraseAnything: Enabling Concept Erasure in Rectified Flow Transformers

Daiheng Gao, Shilin Lu, Wenbo Zhou et al.

ICML 2025
47 citations
#34

Unveiling and Mitigating Memorization in Text-to-image Diffusion Models through Cross Attention

Jie Ren, Yaxin Li, Shenglai Zeng et al.

ECCV 2024
46 citations
#35

S2WAT: Image Style Transfer via Hierarchical Vision Transformer Using Strips Window Attention

Chiyu Zhang, Xiaogang Xu, Lei Wang et al.

AAAI 2024 · arXiv:2210.12381
Keywords: style transfer, vision transformer, attention mechanism, window attention, +4 more
46 citations
#36

JoMA: Demystifying Multilayer Transformers via Joint Dynamics of MLP and Attention

Yuandong Tian, Yiping Wang, Zhenyu Zhang et al.

ICLR 2024
46 citations
#37

Pre-trained Vision and Language Transformers Are Few-Shot Incremental Learners

Keon Hee Park, Kyungwoo Song, Gyeong-Moon Park

CVPR 2024
45 citations
#38

TabM: Advancing tabular deep learning with parameter-efficient ensembling

Yury Gorishniy, Akim Kotelnikov, Artem Babenko

ICLR 2025
45 citations
#39

DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation

Minghong Cai, Xiaodong Cun, Xiaoyu Li et al.

CVPR 2025
44 citations
#40

TC-LIF: A Two-Compartment Spiking Neuron Model for Long-Term Sequential Modelling

Shimin Zhang, Qu Yang, Chenxiang Ma et al.

AAAI 2024 · arXiv:2308.13250
Keywords: spiking neural networks, temporal classification tasks, neuromorphic computing systems, long-term temporal dependency, +4 more
41 citations
#41

On the Role of Attention Heads in Large Language Model Safety

Zhenhong Zhou, Haiyang Yu, Xinghua Zhang et al.

ICLR 2025 · arXiv:2410.13708
Keywords: attention mechanisms, model safety, mechanistic interpretability, safety representations, +4 more
40 citations
#42

Scene Adaptive Sparse Transformer for Event-based Object Detection

Yansong Peng, Li Hebei, Yueyi Zhang et al.

CVPR 2024
40 citations
#43

Trajectory attention for fine-grained video motion control

Zeqi Xiao, Wenqi Ouyang, Yifan Zhou et al.

ICLR 2025
38 citations
#44

Combining Induction and Transduction for Abstract Reasoning

Wen-Ding Li, Keya Hu, Carter Larsen et al.

ICLR 2025
38 citations
#45

DiG: Scalable and Efficient Diffusion Models with Gated Linear Attention

Lianghui Zhu, Zilong Huang, Bencheng Liao et al.

CVPR 2025
38 citations
#46

MoH: Multi-Head Attention as Mixture-of-Head Attention

Peng Jin, Bo Zhu, Li Yuan et al.

ICML 2025
37 citations
#47

PolaFormer: Polarity-aware Linear Attention for Vision Transformers

Weikang Meng, Yadan Luo, Xin Li et al.

ICLR 2025 · arXiv:2501.15061
Keywords: linear attention, vision transformers, attention mechanism, query-key interactions, +2 more
36 citations
#48

Question Aware Vision Transformer for Multimodal Reasoning

Roy Ganz, Yair Kittenplon, Aviad Aberdam et al.

CVPR 2024
36 citations
#49

LION: Implicit Vision Prompt Tuning

Haixin Wang, Jianlong Chang, Yihang Zhai et al.

AAAI 2024 · arXiv:2303.09992
Keywords: vision prompt tuning, implicit layers, vision transformers, computational efficiency, +4 more
35 citations
#50

Transformer-Based No-Reference Image Quality Assessment via Supervised Contrastive Learning

Jinsong Shi, Pan Gao, Jie Qin

AAAI 2024 · arXiv:2312.06995
Keywords: image quality assessment, no-reference iqa, supervised contrastive learning, transformer architecture, +4 more
34 citations
#51

Theory, Analysis, and Best Practices for Sigmoid Self-Attention

Jason Ramapuram, Federico Danieli, Eeshan Gunesh Dhekane et al.

ICLR 2025
33 citations
#52

CreatiLayout: Siamese Multimodal Diffusion Transformer for Creative Layout-to-Image Generation

Hui Zhang, Dexiang Hong, Yitong Wang et al.

ICCV 2025 · arXiv:2412.03859
Keywords: layout-to-image generation, multimodal diffusion transformers, siamese network architecture, creative layout planning, +1 more
33 citations
#53

On the Emergence of Position Bias in Transformers

Xinyi Wu, Yifei Wang, Stefanie Jegelka et al.

ICML 2025
33 citations
#54

Looped Transformers for Length Generalization

Ying Fan, Yilun Du, Kannan Ramchandran et al.

ICLR 2025
33 citations
#55

Attentive Eraser: Unleashing Diffusion Model’s Object Removal Potential via Self-Attention Redirection Guidance

Wenhao Sun, Xue-Mei Dong, Benlei Cui et al.

AAAI 2025
32 citations
#56

TCI-Former: Thermal Conduction-Inspired Transformer for Infrared Small Target Detection

Tianxiang Chen, Zhentao Tan, Qi Chu et al.

AAAI 2024 · arXiv:2402.02046
Keywords: infrared small target detection, thermal conduction analogy, pixel movement differential equation, thermal conduction-inspired attention, +3 more
31 citations
#57

Attention Guided CAM: Visual Explanations of Vision Transformer Guided by Self-Attention

Saebom Leem, Hyunseok Seo

AAAI 2024 · arXiv:2402.04563
Keywords: vision transformer, attention mechanism, visual explanations, weakly-supervised localization, +4 more
31 citations
#58

Unveiling the Power of Audio-Visual Early Fusion Transformers with Dense Interactions through Masked Modeling

Shentong Mo, Pedro Morgado

CVPR 2024
31 citations
#59

GTA: A Geometry-Aware Attention Mechanism for Multi-View Transformers

Takeru Miyato, Bernhard Jaeger, Max Welling et al.

ICLR 2024
31 citations
#60

Logical Languages Accepted by Transformer Encoders with Hard Attention

Pablo Barcelo, Alexander Kozachinskiy, Anthony W. Lin et al.

ICLR 2024
29 citations
#61

Analyzing Feed-Forward Blocks in Transformers through the Lens of Attention Maps

Goro Kobayashi, Tatsuki Kuribayashi, Sho Yokoi et al.

ICLR 2024
28 citations
#62

Vision Transformer Off-the-Shelf: A Surprising Baseline for Few-Shot Class-Agnostic Counting

Zhicheng Wang, Liwen Xiao, Zhiguo Cao et al.

AAAI 2024 · arXiv:2305.04440
Keywords: class-agnostic counting, vision transformer, few-shot learning, self-attention mechanism, +3 more
28 citations
#63

Revisiting Non-Autoregressive Transformers for Efficient Image Synthesis

Zanlin Ni, Yulin Wang, Renping Zhou et al.

CVPR 2024
28 citations
#64

Sparse autoencoders reveal selective remapping of visual concepts during adaptation

Hyesu Lim, Jinho Choi, Jaegul Choo et al.

ICLR 2025
27 citations
#65

Transformer-VQ: Linear-Time Transformers via Vector Quantization

Lucas D. Lingle

ICLR 2024
26 citations
#66

PYRA: Parallel Yielding Re-Activation for Training-Inference Efficient Task Adaptation

Yizhe Xiong, Hui Chen, Tianxiang Hao et al.

ECCV 2024
26 citations
#67

Understanding Factual Recall in Transformers via Associative Memories

Eshaan Nichani, Jason Lee, Alberto Bietti

ICLR 2025
25 citations
#68

ACT: Empowering Decision Transformer with Dynamic Programming via Advantage Conditioning

Chen-Xiao Gao, Chenyang Wu, Mingjun Cao et al.

AAAI 2024 · arXiv:2309.05915
Keywords: decision transformer, offline policy optimization, advantage conditioning, dynamic programming, +3 more
25 citations
#69

SparX: A Sparse Cross-Layer Connection Mechanism for Hierarchical Vision Mamba and Transformer Networks

Meng Lou, Yunxiang Fu, Yizhou Yu

AAAI 2025
24 citations
#70

Answer, Assemble, Ace: Understanding How LMs Answer Multiple Choice Questions

Sarah Wiegreffe, Oyvind Tafjord, Yonatan Belinkov et al.

ICLR 2025
24 citations
#71

Attention Disturbance and Dual-Path Constraint Network for Occluded Person Re-identification

Jiaer Xia, Lei Tan, Pingyang Dai et al.

AAAI 2024 · arXiv:2303.10976
Keywords: occluded person re-identification, attention mechanism, transformer architecture, generalization enhancement, +3 more
24 citations
#72

AesFA: An Aesthetic Feature-Aware Arbitrary Neural Style Transfer

AAAI 2024 · arXiv:2312.05928
Keywords: neural style transfer, aesthetic feature extraction, frequency decomposition, feature disentanglement, +3 more
24 citations
#73

SGFormer: Semantic Graph Transformer for Point Cloud-Based 3D Scene Graph Generation

Changsheng Lv, Mengshi Qi, Xia Li et al.

AAAI 2024 · arXiv:2303.11048
Keywords: 3d scene graph generation, point cloud parsing, semantic graph transformer, global information passing, +4 more
23 citations
#74

Differentiation and Specialization of Attention Heads via the Refined Local Learning Coefficient

George Wang, Jesse Hoogland, Stan van Wingerden et al.

ICLR 2025
23 citations
#75

A Percolation Model of Emergence: Analyzing Transformers Trained on a Formal Language

Ekdeep Singh Lubana, Kyogo Kawaguchi, Robert Dick et al.

ICLR 2025
23 citations
#76

Unifying Top-down and Bottom-up Scanpath Prediction Using Transformers

Zhibo Yang, Sounak Mondal, Seoyoung Ahn et al.

CVPR 2024
22 citations
#77

DyFADet: Dynamic Feature Aggregation for Temporal Action Detection

Le Yang, Ziwei Zheng, Yizeng Han et al.

ECCV 2024
21 citations
#78

Hyper-Connections

Defa Zhu, Hongzhi Huang, Zihao Huang et al.

ICLR 2025
20 citations
#79

ART: Anonymous Region Transformer for Variable Multi-Layer Transparent Image Generation

Yifan Pu, Yiming Zhao, Zhicong Tang et al.

CVPR 2025
20 citations
#80

RealViformer: Investigating Attention for Real-World Video Super-Resolution

Yuehan Zhang, Angela Yao

ECCV 2024
20 citations
#81

The Scalability of Simplicity: Empirical Analysis of Vision-Language Learning with a Single Transformer

Weixian Lei, Jiacong Wang, Haochen Wang et al.

ICCV 2025
20 citations
#82

CLEAR: Conv-Like Linearization Revs Pre-Trained Diffusion Transformers Up

Songhua Liu, Zhenxiong Tan, Xinchao Wang

NeurIPS 2025
20 citations
#83

Rope to Nope and Back Again: A New Hybrid Attention Strategy

Bowen Yang, Bharat Venkitesh, Dwaraknath Gnaneshwar Talupuru et al.

NeurIPS 2025
20 citations
#84

Task-driven Image Fusion with Learnable Fusion Loss

Haowen Bai, Jiangshe Zhang, Zixiang Zhao et al.

CVPR 2025
19 citations
#85

Low-Rank Rescaled Vision Transformer Fine-Tuning: A Residual Design Approach

Wei Dong, Xing Zhang, Bihui Chen et al.

CVPR 2024
19 citations
#86

Distinguished In Uniform: Self-Attention Vs. Virtual Nodes

Eran Rosenbluth, Jan Tönshoff, Martin Ritzert et al.

ICLR 2024
19 citations
#87

Neuroformer: Multimodal and Multitask Generative Pretraining for Brain Data

Antonis Antoniades, Yiyi Yu, Joe Canzano et al.

ICLR 2024
19 citations
#88

Emergence of meta-stable clustering in mean-field transformer models

Giuseppe Bruno, Federico Pasqualotto, Andrea Agazzi

ICLR 2025
19 citations
#89

Occlusion-Embedded Hybrid Transformer for Light Field Super-Resolution

Zeyu Xiao, Zhuoyuan Li, Wei Jia

AAAI 2025
19 citations
#90

Blur-aware Spatio-temporal Sparse Transformer for Video Deblurring

Huicong Zhang, Haozhe Xie, Hongxun Yao

CVPR 2024
18 citations
#91

Transformers Learn to Implement Multi-step Gradient Descent with Chain of Thought

Jianhao Huang, Zixuan Wang, Jason Lee

ICLR 2025 · arXiv:2502.21212
Keywords: attention mechanism, chain of thought prompting, in-context learning, transformer architecture, +4 more
18 citations
#92

Boosting Neural Combinatorial Optimization for Large-Scale Vehicle Routing Problems

Fu Luo, Xi Lin, Yaoxin Wu et al.

ICLR 2025
18 citations
#93

LMUFormer: Low Complexity Yet Powerful Spiking Model With Legendre Memory Units

Zeyu Liu, Gourav Datta, Anni Li et al.

ICLR 2024
17 citations
#94

Adapting Short-Term Transformers for Action Detection in Untrimmed Videos

Min Yang, Huan Gao, Ping Guo et al.

CVPR 2024
17 citations
#95

PreRoutGNN for Timing Prediction with Order Preserving Partition: Global Circuit Pre-training, Local Delay Learning and Attentional Cell Modeling

Ruizhe Zhong, Junjie Ye, Zhentao Tang et al.

AAAI 2024 · arXiv:2403.00012
Keywords: timing prediction, graph auto-encoder, graph neural networks, circuit netlist embedding, +4 more
17 citations
#96

Condition-Aware Neural Network for Controlled Image Generation

Han Cai, Muyang Li, Qinsheng Zhang et al.

CVPR 2024
17 citations
#97

Crystalformer: Infinitely Connected Attention for Periodic Structure Encoding

Tatsunori Taniai, Ryo Igarashi, Yuta Suzuki et al.

ICLR 2024
17 citations
#98

OATS: Outlier-Aware Pruning Through Sparse and Low Rank Decomposition

Stephen Zhang, Vardan Papyan

ICLR 2025
16 citations
#99

Token Transformation Matters: Towards Faithful Post-hoc Explanation for Vision Transformer

Junyi Wu, Bin Duan, Weitai Kang et al.

CVPR 2024
16 citations
#100

Spiking Transformer with Spatial-Temporal Attention

Donghyun Lee, Yuhang Li, Youngeun Kim et al.

CVPR 2025
16 citations