🧬 Architectures

Transformer Architecture

Transformer and attention-based architectures

100 papers · 4,855 total citations
[Publication trend chart, Feb '24 to Jan '26: 1,224 papers]
Also includes: transformer architecture, transformer, transformers, attention mechanism, self-attention
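
For orientation, the papers ranked below all build on, accelerate, or analyze the same core operation: scaled dot-product self-attention. The NumPy sketch below is purely illustrative and is not taken from any paper in this list; the function name, shapes, and single-head setup are assumptions made for the example.

```python
# Minimal sketch of scaled dot-product self-attention (single head, no masking).
# Illustrative only; shapes and names are assumptions, not any listed paper's code.
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """x: (seq_len, d_model); w_q/w_k/w_v: (d_model, d_head)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v              # project tokens to queries/keys/values
    scores = q @ k.T / np.sqrt(k.shape[-1])          # pairwise similarity, scaled by sqrt(d_head)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ v                               # attention-weighted sum of values

# Tiny usage example with random weights: 4 tokens, d_model = d_head = 8.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)               # shape (4, 8)
```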

Top Papers

#1

RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation

Songming Liu, Lingxuan Wu, Bangguo Li et al.

ICLR 2025
365 citations
#2

EquiformerV2: Improved Equivariant Transformer for Scaling to Higher-Degree Representations

Yi-Lun Liao, Brandon Wood, Abhishek Das et al.

ICLR 2024
254 citations
#3

DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads

Guangxuan Xiao, Jiaming Tang, Jingwei Zuo et al.

ICLR 2025 · arXiv:2410.10819
Keywords: kv cache pruning, long-context inference, attention heads, retrieval heads, +4 more
165 citations
#4

Grounded Text-to-Image Synthesis with Attention Refocusing

Quynh Phung, Songwei Ge, Jia-Bin Huang

CVPR 2024
157 citations
#5

Gated Delta Networks: Improving Mamba2 with Delta Rule

Songlin Yang, Jan Kautz, Ali Hatamizadeh

ICLR 2025
141 citations
#6

SCTNet: Single Branch CNN with Transformer Semantic Information for Real-Time Segmentation

Zhengze Xu, Dongyue Wu, Changqian Yu et al.

AAAI 2024 · arXiv:2312.17071
Keywords: real-time segmentation, semantic segmentation, single branch cnn, transformer semantic information, +3 more
126 citations
#7

An Attentive Inductive Bias for Sequential Recommendation beyond the Self-Attention

Yehjin Shin, Jeongwhan Choi, Hyowon Wi et al.

AAAI 2024 · arXiv:2312.10325
Keywords: sequential recommendation, self-attention mechanism, oversmoothing problem, transformer-based models, +4 more
99 citations
#8

HyperAttention: Long-context Attention in Near-Linear Time

Insu Han, Rajesh Jayaram, Amin Karbasi et al.

ICLR 2024
94 citations
#9

When Attention Sink Emerges in Language Models: An Empirical View

Xiangming Gu, Tianyu Pang, Chao Du et al.

ICLR 2025 · arXiv:2410.10781
Keywords: attention sink phenomenon, language model pre-training, softmax normalization, key biases, +4 more
90 citations
#10

How Many Pretraining Tasks Are Needed for In-Context Learning of Linear Regression?

Jingfeng Wu, Difan Zou, Zixiang Chen et al.

ICLR 2024
85 citations
#11

The Hedgehog & the Porcupine: Expressive Linear Attentions with Softmax Mimicry

Michael Zhang, Kush Bhatia, Hermann Kumbong et al.

ICLR 2024
84 citations
#12

Kolmogorov-Arnold Transformer

Xingyi Yang, Xinchao Wang

ICLR 2025
83 citations
#13

RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval

Di Liu, Meng Chen, Baotong Lu et al.

NeurIPS 2025
81 citations
#14

Real-Time Video Generation with Pyramid Attention Broadcast

Xuanlei Zhao, Xiaolong Jin, Kai Wang et al.

ICLR 2025
79 citations
#15

Salience DETR: Enhancing Detection Transformer with Hierarchical Salience Filtering Refinement

Xiuquan Hou, Meiqin Liu, Senlin Zhang et al.

CVPR 2024
78 citations
#16

Dynamic Graph Representation with Knowledge-aware Attention for Histopathology Whole Slide Image Analysis

Jiawen Li, Yuxuan Chen, Hongbo Chu et al.

CVPR 2024
75 citations
#17

EasyControl: Adding Efficient and Flexible Control for Diffusion Transformer

Yuxuan Zhang, Yirui Yuan, Yiren Song et al.

ICCV 2025
69 citations
#18

FINER: Flexible Spectral-bias Tuning in Implicit NEural Representation by Variable-periodic Activation Functions

Zhen Liu, Hao Zhu, Qi Zhang et al.

CVPR 2024
66 citations
#19

Focus on Your Instruction: Fine-grained and Multi-instruction Image Editing by Attention Modulation

Qin Guo, Tianwei Lin

CVPR 2024
63 citations
#20

MagicPIG: LSH Sampling for Efficient LLM Generation

Zhuoming Chen, Ranajoy Sadhukhan, Zihao Ye et al.

ICLR 2025 · arXiv:2410.16179
Keywords: attention approximation methods, kv cache bottleneck, locality sensitive hashing, sampling-based approximation, +3 more
62 citations
#21

Accelerating Diffusion Transformers with Token-wise Feature Caching

Chang Zou, Xuyang Liu, Ting Liu et al.

ICLR 2025
62 citations
#22

Attention-Challenging Multiple Instance Learning for Whole Slide Image Classification

Yunlong Zhang, Honglin Li, Yuxuan Sun et al.

ECCV 2024
61 citations
#23

HPNet: Dynamic Trajectory Forecasting with Historical Prediction Attention

Xiaolong Tang, Meina Kan, Shiguang Shan et al.

CVPR 2024
61 citations
#24

Zero-TPrune: Zero-Shot Token Pruning through Leveraging of the Attention Graph in Pre-Trained Transformers

Hongjie Wang, Bhishma Dedhia, Niraj Jha

CVPR 2024
59 citations
#25

Magnushammer: A Transformer-Based Approach to Premise Selection

Maciej Mikuła, Szymon Tworkowski, Szymon Antoniak et al.

ICLR 2024
57 citations
#26

Scaling Transformers for Low-Bitrate High-Quality Speech Coding

Julian Parker, Anton Smirnov, Jordi Pons et al.

ICLR 2025
54 citations
#27

See What You Are Told: Visual Attention Sink in Large Multimodal Models

Seil Kang, Jinyeong Kim, Junhyeok Kim et al.

ICLR 2025 · arXiv:2503.03321
Keywords: attention mechanism, visual attention sink, multimodal models, vision-language tasks, +4 more
52 citations
#28

HiT-SR: Hierarchical Transformer for Efficient Image Super-Resolution

Xiang Zhang, Yulun Zhang, Fisher Yu

ECCV 2024
52 citations
#29

An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels

Duy-Kien Nguyen, Mahmoud Assran, Unnat Jain et al.

ICLR 2025
48 citations
#30

Feature Fusion from Head to Tail for Long-Tailed Visual Recognition

Mengke Li, Zhikai Hu, Yang Lu et al.

AAAI 2024 · arXiv:2306.06963
Keywords: long-tailed recognition, feature fusion, class imbalance, decision boundary optimization, +3 more
48 citations
#31

LogFormer: A Pre-train and Tuning Pipeline for Log Anomaly Detection

Hongcheng Guo, Jian Yang, Jiaheng Liu et al.

AAAI 2024 · arXiv:2401.04749
Keywords: log anomaly detection, transformer-based framework, adapter-based tuning, multi-domain logs, +3 more
47 citations
#32

Simplifying Transformer Blocks

Bobby He, Thomas Hofmann

ICLR 2024
47 citations
#33

EraseAnything: Enabling Concept Erasure in Rectified Flow Transformers

Daiheng Gao, Shilin Lu, Wenbo Zhou et al.

ICML 2025
47 citations
#34

Unveiling and Mitigating Memorization in Text-to-image Diffusion Models through Cross Attention

Jie Ren, Yaxin Li, Shenglai Zeng et al.

ECCV 2024
46 citations
#35

S2WAT: Image Style Transfer via Hierarchical Vision Transformer Using Strips Window Attention

Chiyu Zhang, Xiaogang Xu, Lei Wang et al.

AAAI 2024 · arXiv:2210.12381
Keywords: style transfer, vision transformer, attention mechanism, window attention, +4 more
46 citations
#36

JoMA: Demystifying Multilayer Transformers via Joint Dynamics of MLP and Attention

Yuandong Tian, Yiping Wang, Zhenyu Zhang et al.

ICLR 2024
46 citations
#37

Pre-trained Vision and Language Transformers Are Few-Shot Incremental Learners

Keon Hee Park, Kyungwoo Song, Gyeong-Moon Park

CVPR 2024
45 citations
#38

TabM: Advancing tabular deep learning with parameter-efficient ensembling

Yury Gorishniy, Akim Kotelnikov, Artem Babenko

ICLR 2025
45 citations
#39

DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation

Minghong Cai, Xiaodong Cun, Xiaoyu Li et al.

CVPR 2025
44 citations
#40

TC-LIF: A Two-Compartment Spiking Neuron Model for Long-Term Sequential Modelling

Shimin Zhang, Qu Yang, Chenxiang Ma et al.

AAAI 2024 · arXiv:2308.13250
Keywords: spiking neural networks, temporal classification tasks, neuromorphic computing systems, long-term temporal dependency, +4 more
41 citations
#41

On the Role of Attention Heads in Large Language Model Safety

Zhenhong Zhou, Haiyang Yu, Xinghua Zhang et al.

ICLR 2025 · arXiv:2410.13708
Keywords: attention mechanisms, model safety, mechanistic interpretability, safety representations, +4 more
40 citations
#42

Scene Adaptive Sparse Transformer for Event-based Object Detection

Yansong Peng, Li Hebei, Yueyi Zhang et al.

CVPR 2024
40 citations
#43

Trajectory attention for fine-grained video motion control

Zeqi Xiao, Wenqi Ouyang, Yifan Zhou et al.

ICLR 2025
38 citations
#44

Combining Induction and Transduction for Abstract Reasoning

Wen-Ding Li, Keya Hu, Carter Larsen et al.

ICLR 2025
38 citations
#45

DiG: Scalable and Efficient Diffusion Models with Gated Linear Attention

Lianghui Zhu, Zilong Huang, Bencheng Liao et al.

CVPR 2025
38 citations
#46

MoH: Multi-Head Attention as Mixture-of-Head Attention

Peng Jin, Bo Zhu, Li Yuan et al.

ICML 2025
37 citations
#47

PolaFormer: Polarity-aware Linear Attention for Vision Transformers

Weikang Meng, Yadan Luo, Xin Li et al.

ICLR 2025 · arXiv:2501.15061
Keywords: linear attention, vision transformers, attention mechanism, query-key interactions, +2 more
36 citations
#48

Question Aware Vision Transformer for Multimodal Reasoning

Roy Ganz, Yair Kittenplon, Aviad Aberdam et al.

CVPR 2024
36 citations
#49

LION: Implicit Vision Prompt Tuning

Haixin Wang, Jianlong Chang, Yihang Zhai et al.

AAAI 2024 · arXiv:2303.09992
Keywords: vision prompt tuning, implicit layers, vision transformers, computational efficiency, +4 more
35 citations
#50

Transformer-Based No-Reference Image Quality Assessment via Supervised Contrastive Learning

Jinsong Shi, Pan Gao, Jie Qin

AAAI 2024 · arXiv:2312.06995
Keywords: image quality assessment, no-reference iqa, supervised contrastive learning, transformer architecture, +4 more
34 citations
#51

Theory, Analysis, and Best Practices for Sigmoid Self-Attention

Jason Ramapuram, Federico Danieli, Eeshan Gunesh Dhekane et al.

ICLR 2025
33 citations
#52

CreatiLayout: Siamese Multimodal Diffusion Transformer for Creative Layout-to-Image Generation

Hui Zhang, Dexiang Hong, Yitong Wang et al.

ICCV 2025 · arXiv:2412.03859
Keywords: layout-to-image generation, multimodal diffusion transformers, siamese network architecture, creative layout planning, +1 more
33 citations
#53

On the Emergence of Position Bias in Transformers

Xinyi Wu, Yifei Wang, Stefanie Jegelka et al.

ICML 2025
33 citations
#54

Looped Transformers for Length Generalization

Ying Fan, Yilun Du, Kannan Ramchandran et al.

ICLR 2025
33 citations
#55

Attentive Eraser: Unleashing Diffusion Model’s Object Removal Potential via Self-Attention Redirection Guidance

Wenhao Sun, Xue-Mei Dong, Benlei Cui et al.

AAAI 2025
32 citations
#56

TCI-Former: Thermal Conduction-Inspired Transformer for Infrared Small Target Detection

Tianxiang Chen, Zhentao Tan, Qi Chu et al.

AAAI 2024 · arXiv:2402.02046
Keywords: infrared small target detection, thermal conduction analogy, pixel movement differential equation, thermal conduction-inspired attention, +3 more
31 citations
#57

Attention Guided CAM: Visual Explanations of Vision Transformer Guided by Self-Attention

Saebom Leem, Hyunseok Seo

AAAI 2024 · arXiv:2402.04563
Keywords: vision transformer, attention mechanism, visual explanations, weakly-supervised localization, +4 more
31 citations
#58

Unveiling the Power of Audio-Visual Early Fusion Transformers with Dense Interactions through Masked Modeling

Shentong Mo, Pedro Morgado

CVPR 2024
31 citations
#59

GTA: A Geometry-Aware Attention Mechanism for Multi-View Transformers

Takeru Miyato, Bernhard Jaeger, Max Welling et al.

ICLR 2024
31 citations
#60

Logical Languages Accepted by Transformer Encoders with Hard Attention

Pablo Barcelo, Alexander Kozachinskiy, Anthony W. Lin et al.

ICLR 2024
29 citations
#61

Analyzing Feed-Forward Blocks in Transformers through the Lens of Attention Maps

Goro Kobayashi, Tatsuki Kuribayashi, Sho Yokoi et al.

ICLR 2024
28 citations
#62

Vision Transformer Off-the-Shelf: A Surprising Baseline for Few-Shot Class-Agnostic Counting

Zhicheng Wang, Liwen Xiao, Zhiguo Cao et al.

AAAI 2024 · arXiv:2305.04440
Keywords: class-agnostic counting, vision transformer, few-shot learning, self-attention mechanism, +3 more
28 citations
#63

Revisiting Non-Autoregressive Transformers for Efficient Image Synthesis

Zanlin Ni, Yulin Wang, Renping Zhou et al.

CVPR 2024
28 citations
#64

Sparse autoencoders reveal selective remapping of visual concepts during adaptation

Hyesu Lim, Jinho Choi, Jaegul Choo et al.

ICLR 2025
27 citations
#65

Transformer-VQ: Linear-Time Transformers via Vector Quantization

Lucas D. Lingle

ICLR 2024
26 citations
#66

PYRA: Parallel Yielding Re-Activation for Training-Inference Efficient Task Adaptation

Yizhe Xiong, Hui Chen, Tianxiang Hao et al.

ECCV 2024
26 citations
#67

Understanding Factual Recall in Transformers via Associative Memories

Eshaan Nichani, Jason Lee, Alberto Bietti

ICLR 2025
25 citations
#68

ACT: Empowering Decision Transformer with Dynamic Programming via Advantage Conditioning

Chen-Xiao Gao, Chenyang Wu, Mingjun Cao et al.

AAAI 2024 · arXiv:2309.05915
Keywords: decision transformer, offline policy optimization, advantage conditioning, dynamic programming, +3 more
25 citations
#69

SparX: A Sparse Cross-Layer Connection Mechanism for Hierarchical Vision Mamba and Transformer Networks

Meng Lou, Yunxiang Fu, Yizhou Yu

AAAI 2025
24 citations
#70

Answer, Assemble, Ace: Understanding How LMs Answer Multiple Choice Questions

Sarah Wiegreffe, Oyvind Tafjord, Yonatan Belinkov et al.

ICLR 2025
24 citations
#71

Attention Disturbance and Dual-Path Constraint Network for Occluded Person Re-identification

Jiaer Xia, Lei Tan, Pingyang Dai et al.

AAAI 2024 · arXiv:2303.10976
Keywords: occluded person re-identification, attention mechanism, transformer architecture, generalization enhancement, +3 more
24 citations
#72

AesFA: An Aesthetic Feature-Aware Arbitrary Neural Style Transfer

AAAI 2024 · arXiv:2312.05928
Keywords: neural style transfer, aesthetic feature extraction, frequency decomposition, feature disentanglement, +3 more
24 citations
#73

SGFormer: Semantic Graph Transformer for Point Cloud-Based 3D Scene Graph Generation

Changsheng Lv, Mengshi Qi, Xia Li et al.

AAAI 2024 · arXiv:2303.11048
Keywords: 3d scene graph generation, point cloud parsing, semantic graph transformer, global information passing, +4 more
23 citations
#74

Differentiation and Specialization of Attention Heads via the Refined Local Learning Coefficient

George Wang, Jesse Hoogland, Stan van Wingerden et al.

ICLR 2025
23 citations
#75

A Percolation Model of Emergence: Analyzing Transformers Trained on a Formal Language

Ekdeep Singh Lubana, Kyogo Kawaguchi, Robert Dick et al.

ICLR 2025
23 citations
#76

Unifying Top-down and Bottom-up Scanpath Prediction Using Transformers

Zhibo Yang, Sounak Mondal, Seoyoung Ahn et al.

CVPR 2024
22 citations
#77

DyFADet: Dynamic Feature Aggregation for Temporal Action Detection

Le Yang, Ziwei Zheng, Yizeng Han et al.

ECCV 2024
21 citations
#78

Hyper-Connections

Defa Zhu, Hongzhi Huang, Zihao Huang et al.

ICLR 2025
20 citations
#79

ART: Anonymous Region Transformer for Variable Multi-Layer Transparent Image Generation

Yifan Pu, Yiming Zhao, Zhicong Tang et al.

CVPR 2025
20 citations
#80

RealViformer: Investigating Attention for Real-World Video Super-Resolution

Yuehan Zhang, Angela Yao

ECCV 2024
20 citations
#81

The Scalability of Simplicity: Empirical Analysis of Vision-Language Learning with a Single Transformer

Weixian Lei, Jiacong Wang, Haochen Wang et al.

ICCV 2025
20 citations
#82

CLEAR: Conv-Like Linearization Revs Pre-Trained Diffusion Transformers Up

Songhua Liu, Zhenxiong Tan, Xinchao Wang

NeurIPS 2025
20 citations
#83

Rope to Nope and Back Again: A New Hybrid Attention Strategy

Bowen Yang, Bharat Venkitesh, Dwaraknath Gnaneshwar Talupuru et al.

NeurIPS 2025
20 citations
#84

Task-driven Image Fusion with Learnable Fusion Loss

Haowen Bai, Jiangshe Zhang, Zixiang Zhao et al.

CVPR 2025
19 citations
#85

Low-Rank Rescaled Vision Transformer Fine-Tuning: A Residual Design Approach

Wei Dong, Xing Zhang, Bihui Chen et al.

CVPR 2024
19 citations
#86

Distinguished In Uniform: Self-Attention Vs. Virtual Nodes

Eran Rosenbluth, Jan Tönshoff, Martin Ritzert et al.

ICLR 2024
19 citations
#87

Neuroformer: Multimodal and Multitask Generative Pretraining for Brain Data

Antonis Antoniades, Yiyi Yu, Joe Canzano et al.

ICLR 2024
19 citations
#88

Emergence of meta-stable clustering in mean-field transformer models

Giuseppe Bruno, Federico Pasqualotto, Andrea Agazzi

ICLR 2025
19 citations
#89

Occlusion-Embedded Hybrid Transformer for Light Field Super-Resolution

Zeyu Xiao, Zhuoyuan Li, Wei Jia

AAAI 2025
19 citations
#90

Blur-aware Spatio-temporal Sparse Transformer for Video Deblurring

Huicong Zhang, Haozhe Xie, Hongxun Yao

CVPR 2024
18 citations
#91

Transformers Learn to Implement Multi-step Gradient Descent with Chain of Thought

Jianhao Huang, Zixuan Wang, Jason Lee

ICLR 2025 · arXiv:2502.21212
Keywords: attention mechanism, chain of thought prompting, in-context learning, transformer architecture, +4 more
18 citations
#92

Boosting Neural Combinatorial Optimization for Large-Scale Vehicle Routing Problems

Fu Luo, Xi Lin, Yaoxin Wu et al.

ICLR 2025
18 citations
#93

LMUFormer: Low Complexity Yet Powerful Spiking Model With Legendre Memory Units

Zeyu Liu, Gourav Datta, Anni Li et al.

ICLR 2024
17 citations
#94

Adapting Short-Term Transformers for Action Detection in Untrimmed Videos

Min Yang, Huan Gao, Ping Guo et al.

CVPR 2024
17 citations
#95

PreRoutGNN for Timing Prediction with Order Preserving Partition: Global Circuit Pre-training, Local Delay Learning and Attentional Cell Modeling

Ruizhe Zhong, Junjie Ye, Zhentao Tang et al.

AAAI 2024 · arXiv:2403.00012
Keywords: timing prediction, graph auto-encoder, graph neural networks, circuit netlist embedding, +4 more
17 citations
#96

Condition-Aware Neural Network for Controlled Image Generation

Han Cai, Muyang Li, Qinsheng Zhang et al.

CVPR 2024
17 citations
#97

Crystalformer: Infinitely Connected Attention for Periodic Structure Encoding

Tatsunori Taniai, Ryo Igarashi, Yuta Suzuki et al.

ICLR 2024
17 citations
#98

OATS: Outlier-Aware Pruning Through Sparse and Low Rank Decomposition

Stephen Zhang, Vardan Papyan

ICLR 2025
16 citations
#99

Token Transformation Matters: Towards Faithful Post-hoc Explanation for Vision Transformer

Junyi Wu, Bin Duan, Weitai Kang et al.

CVPR 2024
16 citations
#100

Spiking Transformer with Spatial-Temporal Attention

Donghyun Lee, Yuhang Li, Youngeun Kim et al.

CVPR 2025
16 citations