Many new Transformer architecture improvements have been proposed since my last post on “The Transformer Family” about three years ago. Here I did a big refactoring and enrichment of that 2020 post: restructuring the hierarchy of sections and improving many sections with more recent papers. Version 2.0 is a superset of the old version, about twice the length.
Notations
Symbol | Meaning
---|---
$d$ | The model size / hidden state dimension / positional encoding size.
$h$ | The number of heads in multi-head attention layer.
$L$ | The segment length of input sequence.
$N$ | The total number of attention layers in the model; not considering MoE.
$\mathbf{X} \in \mathbb{R}^{L \times d}$ | The input sequence where each element has been mapped into an embedding vector of shape $d$, same as the model size.
$\mathbf{W}^k \in \mathbb{R}^{d \times d_k}$ | The key weight matrix.
$\mathbf{W}^q \in \mathbb{R}^{d \times d_k}$ | The query weight matrix.
$\mathbf{W}^v \in \mathbb{R}^{d \times d_v}$ | The value weight matrix. Often we have $d_k = d_v = d$.
$\mathbf{W}^q_i, \mathbf{W}^k_i \in \mathbb{R}^{d \times d_k/h}; \mathbf{W}^v_i \in \mathbb{R}^{d \times d_v/h}$ | The weight matrices per head.
$\mathbf{W}^o \in \mathbb{R}^{d_v \times d}$ | The output weight matrix.
$\mathbf{Q} = \mathbf{X}\mathbf{W}^q \in \mathbb{R}^{L \times d_k}$ | The query embedding inputs.
$\mathbf{K} = \mathbf{X}\mathbf{W}^k \in \mathbb{R}^{L \times d_k}$ | The key embedding inputs.
$\mathbf{V} = \mathbf{X}\mathbf{W}^v \in \mathbb{R}^{L \times d_v}$ | The value embedding inputs.
$\mathbf{q}_i, \mathbf{k}_i \in \mathbb{R}^{d_k}; \mathbf{v}_i \in \mathbb{R}^{d_v}$ | Row vectors in the query, key, value matrices $\mathbf{Q}$, $\mathbf{K}$ and $\mathbf{V}$.
$S_i$ | A collection of key positions for the $i$-th query $\mathbf{q}_i$ to attend to.
$\mathbf{A} \in \mathbb{R}^{L \times L}$ | The self-attention matrix between an input sequence of length $L$ and itself; $\mathbf{A} = \text{softmax}(\mathbf{Q}\mathbf{K}^\top / \sqrt{d_k})$.
$a_{ij} \in \mathbf{A}$ | The scalar attention score between query $\mathbf{q}_i$ and key $\mathbf{k}_j$.
$\mathbf{P} \in \mathbb{R}^{L \times d}$ | Position encoding matrix, where the $i$-th row $\mathbf{p}_i$ is the positional encoding for input $\mathbf{x}_i$.
Transformer Basics
The Transformer (which will be referred to as “vanilla Transformer” to distinguish it from other enhanced versions; Vaswani, et al., 2017) model has an encoder-decoder architecture, as commonly used in many NMT models. Later, simplified Transformer models were shown to achieve great performance in language modeling tasks, such as the encoder-only BERT or the decoder-only GPT.
Attention and Self-Attention
Attention is a mechanism in neural networks by which a model learns to make predictions by selectively attending to a given set of data. The amount of attention is quantified by learned weights and thus the output is usually formed as a weighted average.
Self-attention is a type of attention mechanism where the model makes prediction for one part of a data sample using other parts of the observation about the same sample. Conceptually, it feels quite similar to non-local means. Also note that self-attention is permutation-invariant; in other words, it is an operation on sets.
There are various forms of attention / self-attention; the Transformer (Vaswani et al., 2017) relies on the scaled dot-product attention: given a query matrix $\mathbf{Q}$, a key matrix $\mathbf{K}$ and a value matrix $\mathbf{V}$, the output is a weighted sum of the value vectors, where the weight assigned to each value slot is determined by the dot-product between the query and the corresponding key:

$$\text{attn}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\Big(\frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{d_k}}\Big)\mathbf{V}$$

And for a query vector $\mathbf{q}_i$ and a key vector $\mathbf{k}_j$ (row vectors in the query and key matrices), we have a scalar score:

$$a_{ij} = \text{softmax}\Big(\frac{\mathbf{q}_i \mathbf{k}_j^\top}{\sqrt{d_k}}\Big) = \frac{\exp(\mathbf{q}_i \mathbf{k}_j^\top)}{\sqrt{d_k}\sum_{r \in S_i} \exp(\mathbf{q}_i \mathbf{k}_r^\top)}$$

where $S_i$ is a collection of key positions for the $i$-th query to attend to.
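As a sanity check, here is a minimal NumPy sketch of the scaled dot-product attention formula above; the shapes and variable names are only illustrative, not tied to any particular library.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (L_q, d_k), K: (L_k, d_k), V: (L_k, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (L_q, L_k) attention logits
    weights = softmax(scores, axis=-1)   # each query row sums to 1
    return weights @ V                   # (L_q, d_v) weighted average of values

# toy self-attention: Q = K = V = X
L, d = 4, 8
X = np.random.randn(L, d)
out = scaled_dot_product_attention(X, X, X)
```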
See my old post for other types of attention if interested.
Multi-Head Self-Attention
The multi-head self-attention module is a key component in Transformer. Rather than only computing the attention once, the multi-head mechanism splits the inputs into smaller chunks and then computes the scaled dot-product attention over each subspace in parallel. The independent attention outputs are simply concatenated and linearly transformed into expected dimensions.
$$\text{MultiHeadAttn}(\mathbf{X}_q, \mathbf{X}_k, \mathbf{X}_v) = [\text{head}_1; \dots; \text{head}_h]\mathbf{W}^o \quad \text{where head}_i = \text{Attention}(\mathbf{X}_q\mathbf{W}^q_i, \mathbf{X}_k\mathbf{W}^k_i, \mathbf{X}_v\mathbf{W}^v_i)$$

where $[\cdot;\cdot]$ is a concatenation operation, $\mathbf{W}^q_i, \mathbf{W}^k_i \in \mathbb{R}^{d \times d_k/h}$, $\mathbf{W}^v_i \in \mathbb{R}^{d \times d_v/h}$ are the weight matrices per head, and $\mathbf{W}^o \in \mathbb{R}^{d_v \times d}$ is the output linear transformation. All the weights are learned during training.
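The following is a minimal NumPy sketch of multi-head self-attention with the head-splitting done by slicing the projected matrices; it assumes $d_k = d_v = d$ and omits masking and dropout.

```python
import numpy as np

def multi_head_self_attention(X, Wq, Wk, Wv, Wo, h):
    """X: (L, d); Wq, Wk, Wv, Wo: (d, d); h: number of heads (d must be divisible by h)."""
    L, d = X.shape
    d_head = d // h
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                      # (L, d) each
    heads = []
    for i in range(h):                                    # attend within each subspace
        sl = slice(i * d_head, (i + 1) * d_head)
        scores = Q[:, sl] @ K[:, sl].T / np.sqrt(d_head)  # (L, L)
        w = np.exp(scores - scores.max(-1, keepdims=True))
        w /= w.sum(-1, keepdims=True)
        heads.append(w @ V[:, sl])                        # (L, d_head)
    return np.concatenate(heads, axis=-1) @ Wo            # concat heads + output projection
```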

Encoder-Decoder Architecture
The encoder generates an attention-based representation with the capability to locate a specific piece of information in a large context. It consists of a stack of 6 identical modules, each containing two submodules, a multi-head self-attention layer and a point-wise fully connected feed-forward network. By point-wise, it means that it applies the same linear transformation (with the same weights) to each element in the sequence. This can also be viewed as a convolutional layer with filter size 1. Each submodule has a residual connection and layer normalization. All the submodules output data of the same dimension $d$.
The function of Transformer decoder is to retrieve information from the encoded representation. The architecture is quite similar to the encoder, except that the decoder contains two multi-head attention submodules instead of one in each identical repeating module. The first multi-head attention submodule is masked to prevent positions from attending to the future.

Positional Encoding
Because the self-attention operation is permutation invariant, it is important to use proper positional encoding to provide order information to the model. The positional encoding $\mathbf{P} \in \mathbb{R}^{L \times d}$ has the same dimension as the input embedding, so it can be added onto the input directly. The vanilla Transformer considered two types of encodings:
Sinusoidal Positional Encoding
Sinusoidal positional encoding is defined as follows, given the token position $i = 1, \dots, L$ and the dimension $\delta = 1, \dots, d$:

$$\text{PE}(i, \delta) = \begin{cases} \sin\big(\frac{i}{10000^{2\delta'/d}}\big) & \text{if } \delta = 2\delta' \\ \cos\big(\frac{i}{10000^{2\delta'/d}}\big) & \text{if } \delta = 2\delta' + 1 \end{cases}$$

In this way each dimension of the positional encoding corresponds to a sinusoid of a different wavelength, ranging from $2\pi$ to $10000 \cdot 2\pi$.
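A small NumPy sketch of this encoding (vectorized over positions and dimensions; the 0-indexed dimensions are just an implementation convention):

```python
import numpy as np

def sinusoidal_positional_encoding(L, d):
    """Return an (L, d) matrix; even dimensions use sin, odd dimensions use cos."""
    position = np.arange(L)[:, None]                      # (L, 1)
    freq = 10000.0 ** (-(2 * (np.arange(d) // 2)) / d)    # (d,) per-dimension frequencies
    angles = position * freq                              # (L, d)
    pe = np.zeros((L, d))
    pe[:, 0::2] = np.sin(angles[:, 0::2])
    pe[:, 1::2] = np.cos(angles[:, 1::2])
    return pe
```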

Learned Positional Encoding
Learned positional encoding assigns each element a learned column vector which encodes its absolute position (Gehring, et al. 2017), and furthermore this encoding can be learned differently per layer (Al-Rfou et al. 2018).
Relative Position Encoding
Shaw et al. (2018) incorporated relative positional information into the key and value projections $\mathbf{W}^k$ and $\mathbf{W}^v$. The relative position is clipped to a maximum absolute value $k$, so only $2k + 1$ unique relative position labels are considered; this clipping also enables the model to generalize to sequence lengths unseen during training.
Transformer-XL (Dai et al., 2019) proposed a type of relative positional encoding based on a reparametrization of the dot-product of keys and queries. To keep the positional information flowing coherently across segments, Transformer-XL encodes the relative position instead, as it is sufficient to know the position offset $i - j$ between a key at position $j$ and a query at position $i$ for making good predictions.
If we omit the scalar $\frac{1}{\sqrt{d_k}}$ and the normalizing term in softmax but include positional encodings, the attention score between a query at position $i$ and a key at position $j$ can be written as:

$$a_{ij} = \mathbf{q}_i {\mathbf{k}_j}^\top = (\mathbf{x}_i + \mathbf{p}_i)\mathbf{W}^q \big((\mathbf{x}_j + \mathbf{p}_j)\mathbf{W}^k\big)^\top = \mathbf{x}_i\mathbf{W}^q {\mathbf{W}^k}^\top\mathbf{x}_j^\top + \mathbf{x}_i\mathbf{W}^q {\mathbf{W}^k}^\top\mathbf{p}_j^\top + \mathbf{p}_i\mathbf{W}^q {\mathbf{W}^k}^\top\mathbf{x}_j^\top + \mathbf{p}_i\mathbf{W}^q {\mathbf{W}^k}^\top\mathbf{p}_j^\top$$
Transformer-XL reparameterizes the above four terms as follows:
- Replace $\mathbf{p}_j$ with relative positional encoding $\mathbf{r}_{i-j} \in \mathbb{R}^d$;
- Replace $\mathbf{p}_i\mathbf{W}^q$ with two trainable parameters $\mathbf{u}$ (for content) and $\mathbf{v}$ (for location) in two different terms;
- Split $\mathbf{W}^k$ into two matrices, $\mathbf{W}^k_E$ for content information and $\mathbf{W}^k_R$ for location information.
Rotary Position Embedding
Rotary position embedding (RoPE; Su et al. 2021) encodes the absolute position with a rotation matrix and multiplies the key and query matrices of every attention layer with it to inject relative positional information at every layer.
When encoding relative positional information into the inner product of the $i$-th key and the $j$-th query, we would like to formulate the function in such a way that the inner product depends only on the relative position $i - j$.
Given a vector $\mathbf{z}$, if we want to rotate it counterclockwise by angle $\theta$, we multiply it by a rotation matrix to get $R\mathbf{z}$, where the rotation matrix $R$ is defined as:

$$R = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}$$
When generalizing to higher dimensional space, RoPE divides the $d$-dimensional space into $d/2$ subspaces and constructs a block-diagonal rotation matrix of size $d \times d$ for the token at position $i$:

$$R^d_{\Theta, i} = \begin{pmatrix}
\cos i\theta_1 & -\sin i\theta_1 & 0 & 0 & \dots & 0 & 0 \\
\sin i\theta_1 & \cos i\theta_1 & 0 & 0 & \dots & 0 & 0 \\
0 & 0 & \cos i\theta_2 & -\sin i\theta_2 & \dots & 0 & 0 \\
0 & 0 & \sin i\theta_2 & \cos i\theta_2 & \dots & 0 & 0 \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & 0 & \dots & \cos i\theta_{d/2} & -\sin i\theta_{d/2} \\
0 & 0 & 0 & 0 & \dots & \sin i\theta_{d/2} & \cos i\theta_{d/2}
\end{pmatrix}$$

where in the paper we have $\Theta = \{\theta_j = 10000^{-2(j-1)/d}, j \in [1, 2, \dots, d/2]\}$.
Then both key and query matrices incorporate the positional information by multiplying with this rotation matrix. Because rotations compose by adding angles, the resulting query-key inner product depends only on the relative offset $j - i$:

$$\mathbf{q}_i^\top \mathbf{k}_j = (R^d_{\Theta, i}\mathbf{W}^q\mathbf{x}_i)^\top (R^d_{\Theta, j}\mathbf{W}^k\mathbf{x}_j) = \mathbf{x}_i^\top{\mathbf{W}^q}^\top R^d_{\Theta, j-i}\mathbf{W}^k\mathbf{x}_j \quad \text{where } R^d_{\Theta, j-i} = (R^d_{\Theta, i})^\top R^d_{\Theta, j}$$
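Below is a minimal NumPy sketch of applying the rotation; it assumes an even dimension $d$ and pairs adjacent dimensions $(0,1), (2,3), \dots$ (implementations differ in how they pair dimensions, which does not change the math).

```python
import numpy as np

def apply_rope(x, base=10000.0):
    """Rotate each 2D pair of features of x (shape (L, d), d even) by a position-dependent angle."""
    L, d = x.shape
    theta = base ** (-2.0 * np.arange(d // 2) / d)    # (d/2,) per-subspace frequencies
    angles = np.arange(L)[:, None] * theta[None, :]   # (L, d/2): angle grows with position
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin                # 2D rotation of every (x1, x2) pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# queries and keys are rotated before the dot product; values are left untouched:
# Q_rot, K_rot = apply_rope(Q), apply_rope(K)
```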

Longer Context
The length of an input sequence for transformer models at inference time is upper-bounded by the context length used for training. Naively increasing the context length leads to high consumption of both time ($\mathcal{O}(L^2 d)$) and memory ($\mathcal{O}(L^2)$) and may not be supported due to hardware constraints.
This section introduces several improvements in transformer architecture to better support long context at inference, e.g. using additional memory, designing for better context extrapolation, or adding a recurrence mechanism.
Context Memory
The vanilla Transformer has a fixed and limited attention span. The model can only attend to other elements in the same segments during each update step and no information can flow across separated fixed-length segments. This context segmentation causes several issues:
- The model cannot capture very long term dependencies.
- It is hard to predict the first few tokens in each segment given no or thin context.
- The evaluation is expensive. Whenever the segment is shifted to the right by one, the new segment is re-processed from scratch, although there are a lot of overlapping tokens.
Transformer-XL (Dai et al., 2019; “XL” means “extra long”) modifies the architecture to reuse hidden states between segments with an additional memory. The recurrent connection between segments is introduced into the model by continuously using the hidden states from the previous segments.

Let’s label the hidden state of the $n$-th layer for the $(\tau + 1)$-th segment in the model as $\mathbf{h}_{\tau+1}^{(n)} \in \mathbb{R}^{L \times d}$. In addition to the hidden state of the previous layer for the same segment, $\mathbf{h}_{\tau+1}^{(n-1)}$, it also depends on the hidden state of the previous layer for the previous segment, $\mathbf{h}_{\tau}^{(n-1)}$. By incorporating information from the previous hidden states, the model extends the attention span much further into the past, over multiple segments:

$$\tilde{\mathbf{h}}_{\tau+1}^{(n-1)} = [\text{stop-gradient}(\mathbf{h}_{\tau}^{(n-1)}) \circ \mathbf{h}_{\tau+1}^{(n-1)}]$$
$$\mathbf{Q}_{\tau+1}^{(n)} = \mathbf{h}_{\tau+1}^{(n-1)}\mathbf{W}^q, \quad \mathbf{K}_{\tau+1}^{(n)} = \tilde{\mathbf{h}}_{\tau+1}^{(n-1)}\mathbf{W}^k, \quad \mathbf{V}_{\tau+1}^{(n)} = \tilde{\mathbf{h}}_{\tau+1}^{(n-1)}\mathbf{W}^v$$
$$\mathbf{h}_{\tau+1}^{(n)} = \text{transformer-layer}(\mathbf{Q}_{\tau+1}^{(n)}, \mathbf{K}_{\tau+1}^{(n)}, \mathbf{V}_{\tau+1}^{(n)})$$

Note that both keys and values rely on the extended hidden states, while queries only consume hidden states of the current segment. The concatenation operation $[\cdot \circ \cdot]$ is along the sequence length dimension.
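A minimal NumPy sketch of a single attention step with the segment-level memory; in a real implementation the cached memory would be detached from the computational graph (the stop-gradient above) and a causal mask would be applied, both omitted here for brevity.

```python
import numpy as np

def attention_with_segment_memory(h_prev_seg, h_curr, Wq, Wk, Wv):
    """h_prev_seg: (M, d) cached hidden states of the previous segment (treated as constant);
    h_curr: (L, d) hidden states of the current segment."""
    h_ext = np.concatenate([h_prev_seg, h_curr], axis=0)   # (M + L, d) extended context
    Q = h_curr @ Wq                                        # queries: current segment only
    K, V = h_ext @ Wk, h_ext @ Wv                          # keys/values: memory + current
    scores = Q @ K.T / np.sqrt(Q.shape[-1])                # (L, M + L)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return w @ V                                           # (L, d)
```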
Compressive Transformer (Rae et al. 2019) extends Transformer-XL by compressing past memories to support longer sequences. It explicitly adds memory slots of size $m_m$ per layer for storing past activations of that layer, plus a compressive memory of size $m_{cm}$ per layer that holds compressed versions of even older activations.

Both memory and compressed memory are FIFO queues. Given the model context length $L$, a compression function of rate $c$ maps every $c$ oldest activations to one compressed memory slot. There are several choices of compression functions:
- Max/mean pooling with kernel and stride size $c$;
- 1D convolution with kernel and stride size $c$ (need to learn additional parameters);
- Dilated convolution (need to learn additional parameters). In their experiments, convolution compression works out the best on the EnWik8 dataset;
- Most used memories.
Compressive transformer has two additional training losses:
- Auto-encoding loss (lossless compression objective) measures how well we can reconstruct the original memories from the compressed memories, $\mathcal{L}_{ac} = \| \text{old\_mem}^{(i)} - g(\text{new\_cm}^{(i)}) \|_2$, where $g(\cdot)$ reverses the compression function $f_c(\cdot)$.
- Attention-reconstruction loss (lossy objective) reconstructs content-based attention over the memory vs. the compressed memory and minimizes the difference: $\mathcal{L}_{ar} = \| \text{attn}(\mathbf{h}^{(i)}, \text{old\_mem}^{(i)}) - \text{attn}(\mathbf{h}^{(i)}, \text{new\_cm}^{(i)}) \|_2$.
Transformer-XL with a memory of size $m$ has a maximum temporal range of $m \times N$, where $N$ is the number of layers in the model, and an attention cost of $\mathcal{O}(L^2 + Lm)$. In comparison, the Compressive Transformer has a temporal range of $(m_m + c \cdot m_{cm}) \times N$ and an attention cost of $\mathcal{O}(L^2 + L(m_m + m_{cm}))$. A larger compression rate $c$ therefore gives a better tradeoff between temporal range and attention cost.
Attention weights, from oldest to newest, are stored in three locations: compressed memory → memory → causally masked sequence. In the experiments, they observed an increase in attention weights from oldest activations stored in the regular memory, to activations stored in the compressed memory, implying that the network is learning to preserve salient information.

Non-Differentiable External Memory
$k$NN-LM (Khandelwal et al. 2020) augments a pretrained language model with a key-value datastore built from the training data, where keys are context representations produced by the LM and values are the corresponding next tokens. At inference time, the next token probability is a weighted sum of two predictions: the standard LM softmax and a distribution formed by retrieving the $k$ nearest neighbors of the current context representation from the datastore,

$$p(y \vert \mathbf{x}) = \lambda\, p_\text{kNN}(y \vert \mathbf{x}) + (1 - \lambda)\, p_\text{LM}(y \vert \mathbf{x})$$

where $\lambda$ is an interpolation hyperparameter and $p_\text{kNN}$ aggregates the retrieved neighbors, weighting each neighbor by the exponentiated negative distance between its key and the current context representation.
According to the experiments, a larger datastore size or a larger number of neighbors $k$ results in better perplexity.
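A small NumPy sketch of this interpolation step, assuming the $k$ nearest neighbors have already been retrieved as (token id, distance) pairs; `lam` and the distance weighting are illustrative choices.

```python
import numpy as np

def knn_lm_interpolate(p_lm, neighbor_tokens, neighbor_dists, vocab_size, lam=0.25):
    """Combine the LM distribution with a kNN distribution built from retrieved neighbors."""
    weights = np.exp(-np.asarray(neighbor_dists, dtype=float))   # closer neighbors weigh more
    p_knn = np.zeros(vocab_size)
    for tok, w in zip(neighbor_tokens, weights):
        p_knn[tok] += w                                          # aggregate weight per vocabulary id
    p_knn /= p_knn.sum()
    return lam * p_knn + (1 - lam) * p_lm                        # weighted sum of two predictions
```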
SPALM (Adaptive semiparametric language models; Yogatama et al. 2021) incorporates both (1) Transformer-XL style memory for hidden states from external context as short-term memory and (2) a $k$NN-LM style key-value datastore as long-term memory.

SPALM runs $k$NN search to fetch the $k$ tokens with the most relevant context from the long-term memory. Their embeddings are aggregated into a single memory vector, which is combined with the current hidden state through a learned sigmoid gate; the gate controls how much information comes from the long-term memory versus the local context and is trained together with the rest of the model.
During training, the key representations in the long-term memory stay constant, produced by a pretrained LM, but the value encoder, aka the word embedding matrix, gets updated.
Memorizing Transformer (Wu et al. 2022) adds a $k$NN-augmented attention layer near the top of a decoder-only Transformer. This special layer maintains a Transformer-XL style FIFO cache of past (key, value) pairs as an external memory.
The same QKV values are used for both local attention and the $k$NN mechanism. The $k$NN lookup returns the top-$k$ (key, value) pairs in the external memory for each query in the input sequence; they are attended over to produce a memory-based output, which is then combined with the local attention output via a learned per-head gate.
What they found during experiments with Memorizing Transformer:
- It is observed in some experiments that training a model with a small memory and then fine-tuning it with a larger memory works better than training with a large memory from scratch.
- The smaller Memorizing Transformer with just 8k tokens in memory can match the perplexity of a larger vanilla Transformer with 5X more trainable parameters.
- Increasing the size of external memory provided consistent gains up to a size of 262K.
- A non-memory transformer can be finetuned to use memory.

Distance-Enhanced Attention Scores
Distance Aware Transformer (DA-Transformer; Wu, et al. 2021) and Attention with Linear Biases (ALiBi; Press et al. 2022) are motivated by similar ideas: in order to encourage the model to extrapolate over longer context than what it was trained on, we can explicitly attach positional information to every pair-wise attention score based on the distance between the key and query tokens.
Note that the default positional encoding in vanilla Transformer only adds positional information to the input sequence, while later improved encoding mechanisms alter the attention scores of every layer, such as rotary position embedding, and those take a form very similar to distance-enhanced attention scores.
DA-Transformer (Wu, et al. 2021) multiplies attention scores at each layer by a learnable bias that is formulated as a function of the distance between key and query. Different attention heads use different parameters to distinguish diverse preferences for short-term vs long-term context. Given two positions $i$ and $j$, the relative distance $|i - j|$ is rescaled by a per-head learnable coefficient and then mapped through a learnable sigmoid-like function to produce a positive multiplier on the attention score, so that each head learns its own preference for nearby or distant tokens.
Instead of multipliers, ALiBi (Press et al. 2022) adds a constant bias term on query-key attention scores, proportional to pairwise distances. The bias introduces a strong recency preference and penalizes keys that are too far away. The penalties are increased at different rates within different heads.

With ALiBi, Press et al. (2022) trained a 1.3B model on context length 1024 and extrapolated to 2048 at inference time.
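A minimal NumPy sketch of building the ALiBi biases; the slope schedule $m_h = 2^{-8h/H}$ follows the geometric sequence recommended in the paper, and a causal mask is assumed to be applied separately.

```python
import numpy as np

def alibi_bias(L, n_heads):
    """Return (n_heads, L, L) additive biases: -m_h * (i - j) for keys j at or before query i."""
    slopes = 2.0 ** (-8.0 * np.arange(1, n_heads + 1) / n_heads)   # head-specific penalty rates
    distance = np.arange(L)[:, None] - np.arange(L)[None, :]       # i - j
    return -slopes[:, None, None] * np.maximum(distance, 0)        # added to attention scores before softmax
```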

Make it Recurrent
Universal Transformer (Dehghani, et al. 2019) combines self-attention in Transformer with the recurrent mechanism in RNN, aiming to benefit from both the long-term global receptive field of Transformer and the learned inductive biases of RNN. Rather than going through a fixed number of layers, Universal Transformer dynamically adjusts the number of steps using adaptive computation time. If we fix the number of steps, a Universal Transformer is equivalent to a multi-layer Transformer with shared parameters across layers.
On a high level, the universal transformer can be viewed as a recurrent function for learning the hidden state representation per token. The recurrent function evolves in parallel across token positions and the information between positions is shared through self-attention.

Given an input sequence of length $L$, Universal Transformer iteratively updates the representation $\mathbf{h}^t \in \mathbb{R}^{L \times d}$ at step $t$ for an adjustable number of steps. At step 0, $\mathbf{h}^0$ is initialized to be the same as the input embedding matrix. All the positions are processed in parallel by the multi-head self-attention mechanism and then go through a recurrent transition function:

$$\mathbf{A}^t = \text{LayerNorm}\big(\mathbf{h}^{t-1} + \text{MultiHeadAttention}(\mathbf{h}^{t-1} + \mathbf{P}^t)\big) \qquad \mathbf{h}^t = \text{LayerNorm}\big(\mathbf{A}^t + \text{Transition}(\mathbf{A}^t)\big)$$

where $\text{Transition}(\cdot)$ is either a separable convolution or a fully-connected network consisting of two position-wise (i.e. applied to each row of $\mathbf{A}^t$ individually) affine transformations with one ReLU in between.
The positional encoding $\mathbf{P}^t$ uses sinusoidal position signals but with an additional time-step dimension, so it depends on both the token position $i$ and the recurrent step $t$.

In the adaptive version of Universal Transformer, the number of recurrent steps $T$ is dynamically determined per token by ACT (Adaptive Computation Time; Graves, 2016).
Adaptive Modeling
Adaptive modeling refers to a mechanism that can adjust the amount of computation according to different inputs. For example, some tokens may only need local information and thus demand a shorter attention span; Or some tokens are relatively easier to predict and do not need to be processed through the entire attention stack.
Adaptive Attention Span
One key advantage of Transformer is the capability of capturing long-term dependencies. Depending on the context, the model may prefer to attend further back at some times than at others, and different attention heads may have different attention patterns. If the attention span could adapt its length flexibly and only attend further back when needed, it would help reduce both the computation and memory cost needed to support a longer maximum context size in the model.
This is the motivation for Adaptive Attention Span. Sukhbaatar et al (2019) proposed a self-attention mechanism that seeks an optimal attention span. They hypothesized that different attention heads might assign scores differently within the same context window (See Fig. 14) and thus the optimal span would be trained separately per head.

Given the $i$-th token, we need to compute the attention weights between this token and the other keys within its attention span of size $s$.
A soft mask function $m_z$ maps the distance between a query and a key to a value in $[0, 1]$ and is parameterized by a learnable span parameter $z \in [0, s]$:

$$m_z(x) = \text{clip}\Big(\frac{1}{R}(R + z - x), 0, 1\Big)$$

where $R$ is a hyperparameter that defines the softness of the ramp.

The soft mask function is applied to the softmax elements in the attention weights (with $s_{ij}$ denoting the unnormalized attention score):

$$a_{ij} = \frac{m_z(i - j)\exp(s_{ij})}{\sum_{r=i-s}^{i-1} m_z(i - r)\exp(s_{ir})}$$

In the above equation, $m_z$ is differentiable, so it is trained jointly with other parts of the model. The span parameters $z_i$, $i = 1, \dots, h$, are learned separately per head, and an extra $\ell_1$ penalty on $\sum_i z_i$ is added to the loss function to encourage shorter spans.
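A tiny NumPy sketch of the soft mask; the re-normalization of the masked softmax terms is shown in the usage comment.

```python
import numpy as np

def soft_span_mask(distances, z, R=32):
    """m_z(x) = clip((R + z - x) / R, 0, 1): 1 inside the span z, a linear ramp of width R, then 0."""
    return np.clip((R + z - distances) / R, 0.0, 1.0)

# usage inside one attention head with learnable span z:
#   w = soft_span_mask(query_pos - key_pos, z) * np.exp(scores)
#   w /= w.sum(-1, keepdims=True)
```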
Using Adaptive Computation Time, the approach can be further enhanced to have a flexible attention span length, adaptive to the current input dynamically. The span parameter $z_t$ at time step $t$ then becomes a sigmoidal function of the input, e.g. $z_t = S\,\sigma(\mathbf{v} \cdot \mathbf{x}_t + b)$, where the vector $\mathbf{v}$ and the bias scalar $b$ are learned jointly with the other parameters.
In the experiments of Transformer with adaptive attention span, Sukhbaatar, et al. (2019) found a general tendency that lower layers do not require very long attention spans, while a few attention heads in higher layers may use exceptionally long spans. Adaptive attention span also helps greatly reduce the number of FLOPS, especially in a big model with many attention layers and a large context length.
Depth-Adaptive Transformer
At inference time, it is natural to assume that some tokens are easier to predict and thus do not require as much computation as others. Therefore we may process their predictions through only a limited number of layers to achieve a good balance between speed and performance.
Both Depth-Adaptive Transformer (Elabyad et al. 2020) and Confident Adaptive Language Model (CALM; Schuster et al. 2022) are motivated by this idea and learn to predict optimal numbers of layers needed for different input tokens.
Depth-adaptive transformer (Elabyad et al. 2020) attaches an output classifier to every layer to produce exit predictions based on the activations of that layer. The classifier weight matrices can be different per layer or shared across layers. During training, the model samples different sequences of exits so that it is optimized with hidden states of different layers. The learning objective incorporates the likelihood probabilities predicted at different layers $n = 1, \dots, N$.
An adaptive depth classifier outputs a parametric distribution $q_t$ over exit layers. It is trained with a cross-entropy loss against an oracle distribution $q_t^*$. The paper explored three configurations for such a classifier $q_t$:

(Image source: Elabyad et al. 2020).
- Sequence-specific depth classifier: All tokens of the same sequence share the same exit block. It depends on the average of the encoder representations of the sequence. Given an input sequence $\mathbf{x}$ of length $L$, the classifier takes $\bar{\mathbf{x}} = \frac{1}{L}\sum_{t=1}^L \mathbf{x}_t$ as input and outputs a multinomial distribution of $N$ dimensions, corresponding to the $N$ layers. The oracle places a Dirac delta (unit impulse) on the chosen exit layer, combined with a regularization term to encourage lower-layer exits; the ground truth exit can be prepared in two ways, based on maximum likelihood $q^*_{\text{lik}}$ or correctness $q^*_{\text{corr}}$.
- Token-specific depth classifier (multinomial): Each token is decoded with a different exit block, predicted conditioned on the first decoder hidden state $\mathbf{h}_t^1$ of that token.
- Token-specific depth classifier (geometric-like): A binary exit prediction distribution is made per layer per token, $\mathcal{X}_t^n$. An RBF kernel is used to smooth the predictions in order to incorporate the impact of the current decision on future time steps.
At inference time, the confidence threshold for making an exit decision needs to be calibrated. Depth-adaptive transformer finds such a threshold on a validation set via grid search. CALM (Schuster et al. 2022) applied the Learn then Test (LTT) framework (Angelopoulos et al. 2021) to identify a subset of statistically valid thresholds and chose the minimum value as the threshold for inference. Besides training per-layer exit classifiers, CALM also explored other methods for adaptive depth prediction, including the softmax response (i.e. the difference between the top two softmax outputs) and hidden state saturation (i.e. the cosine similarity $\cos(\mathbf{h}_t^n, \mathbf{h}_t^{n+1})$ between consecutive layers) as confidence scores for exit decisions.
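A schematic Python sketch of confidence-based early exiting at decoding time; `layers`, `exit_confidence` and `lm_head` are hypothetical callables standing in for the decoder blocks, the per-token confidence measure (e.g. an exit classifier or softmax response), and the output head, not any specific library API.

```python
def decode_token_with_early_exit(h, layers, exit_confidence, lm_head, threshold=0.9):
    """Run one token's hidden state through the stack, exiting once confidence is high enough."""
    for layer in layers:
        h = layer(h)                      # one decoder block
        if exit_confidence(h) >= threshold:
            break                         # skip the remaining layers for this token
    return lm_head(h)                     # predict the next token from the exited hidden state
```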
Efficient Attention
The computation and memory cost of the vanilla Transformer grows quadratically with sequence length, and hence it is hard to apply to very long sequences. Many efficiency improvements for the Transformer architecture have something to do with the self-attention module - making it cheaper, smaller or faster to run. See the survey paper on Efficient Transformers (Tay et al. 2020).
Sparse Attention Patterns
Fixed Local Context
A simple alteration to make self-attention less expensive is to restrict the attention span of each token to local context only, so that the cost of self-attention grows linearly with the sequence length.
The idea was introduced by Image Transformer (Parmar, et al 2018), which formulates image generation as sequence modeling using an encoder-decoder transformer architecture:
- The encoder generates a contextualized, per-pixel-channel representation of the source image;
- Then the decoder autoregressively generates an output image, one channel per pixel at each time step.
Let’s label the representation of the current pixel to be generated as the query $\mathbf{q}$. The other positions whose representations are used for computing $\mathbf{q}$'s output are key vectors, and together they form a memory matrix $\mathbf{M}$. The scope of $\mathbf{M}$ defines the context window of the query pixel.
Image Transformer introduced two types of localized $\mathbf{M}$, as illustrated below.

- 1D Local Attention: The input image is flattened in raster scanning order, that is, from left to right and top to bottom. The linearized image is then partitioned into non-overlapping query blocks. The context window consists of pixels in the same query block as $\mathbf{q}$ and a fixed number of additional pixels generated before this query block.
- 2D Local Attention: The image is partitioned into multiple non-overlapping rectangular query blocks. The query pixel can attend to all other pixels in the same memory block. To make sure the pixel at the top-left corner can also have a valid context window, the memory block is extended to the top, left and right by a fixed amount, respectively.
Strided Context
Sparse Transformer (Child et al., 2019) introduced factorized self-attention, through sparse matrix factorization, making it possible to train dense attention networks with hundreds of layers on sequence length up to 16,384, which would be infeasible on modern hardware otherwise.
Given a set of attention connectivity patterns $\mathcal{S} = \{S_1, \dots, S_L\}$, where each $S_i$ records the set of key positions that the $i$-th query vector attends to,

$$\text{attend}(\mathbf{X}, \mathcal{S}) = \Big(a(\mathbf{x}_i, S_i)\Big)_{i \in \{1, \dots, L\}} \quad \text{where } a(\mathbf{x}_i, S_i) = \text{softmax}\Big(\frac{(\mathbf{x}_i\mathbf{W}^q)\big((\mathbf{x}_j\mathbf{W}^k)_{j \in S_i}\big)^\top}{\sqrt{d_k}}\Big)(\mathbf{x}_j\mathbf{W}^v)_{j \in S_i}$$

Note that although the size of $S_i$ is not fixed, $a(\mathbf{x}_i, S_i)$ is always of size $d_v$, and therefore the output is of size $L \times d_v$.
In auto-regressive models, one attention span is defined as $S_i = \{j : j \leq i\}$, as it allows each token to attend to all the positions in the past.
In factorized self-attention, the set $S_i$ is decomposed into a tree of dependencies, such that for every pair $(i, j)$ with $j \leq i$, there is a path connecting $i$ back to $j$ and $i$ can attend to $j$ either directly or indirectly.
Precisely, the set $S_i$ is divided into $p$ non-overlapping subsets, where the $m$-th subset is denoted as $A^{(m)}_i \subset S_i$, $m = 1, \dots, p$. Therefore the path between the output position $i$ and any $j$ has a maximum length of $p + 1$.
Sparse Factorized Attention
Sparse Transformer proposed two types of factorized attention. It is easier to understand the concepts as illustrated in Fig. 10 with 2D image inputs as examples.

- Strided attention with stride $\ell \sim \sqrt{n}$. This works well with image data, as the structure is aligned with strides. In the image case, each pixel attends to all the previous $\ell$ pixels in the raster scanning order (naturally covering the entire width of the image) and then those pixels attend to others in the same column (defined by the other attention connectivity subset):

$$A^{(1)}_i = \{t, t+1, \dots, i\} \text{ where } t = \max(0, i - \ell), \qquad A^{(2)}_i = \{j : (i - j) \bmod \ell = 0\}$$

- Fixed attention. A small set of tokens summarize previous locations and propagate that information to all future locations:

$$A^{(1)}_i = \{j : \lfloor j/\ell \rfloor = \lfloor i/\ell \rfloor\}, \qquad A^{(2)}_i = \{j : j \bmod \ell \in \{\ell - c, \dots, \ell - 1\}\}$$

where $c$ is a hyperparameter. If $c = 1$, the representation is restricted such that many positions depend on only a few positions. The paper chose $c \in \{8, 16, 32\}$ for $\ell \in \{128, 256\}$. A sketch of the strided masks follows right after this list.
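Below is a minimal NumPy sketch that materializes the two strided connectivity subsets as boolean masks; a real implementation would never build the dense $L \times L$ masks, but they make the pattern easy to inspect.

```python
import numpy as np

def strided_attention_masks(L, stride):
    """Return two (L, L) boolean masks: A^(1) covers the previous `stride` positions,
    A^(2) covers every stride-th earlier position ('same column' in the image view)."""
    i = np.arange(L)[:, None]
    j = np.arange(L)[None, :]
    causal = j <= i
    a1 = causal & (i - j < stride)              # local window
    a2 = causal & ((i - j) % stride == 0)       # strided / column pattern
    return a1, a2
```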
Use Factorized Self-Attention in Transformer
There are three ways to use sparse factorized attention patterns in Transformer architecture:
- One attention type per residual block, interleaved across blocks: $\text{attn}(\mathbf{X}) = \text{attend}(\mathbf{X}, A^{(n \bmod p)})\mathbf{W}^o$, where $n$ is the index of the current residual block.
- Set up a single head which attends to the locations that all the factorized heads attend to: $\text{attn}(\mathbf{X}) = \text{attend}(\mathbf{X}, \cup_{m=1}^p A^{(m)})\mathbf{W}^o$.
- Use a multi-head attention mechanism, but different from vanilla Transformer, each head may adopt one of the patterns presented above, 1 or 2. This option often performs the best.
Sparse Transformer also proposed a set of changes so as to train the Transformer up to hundreds of layers, including gradient checkpointing, recomputing attention & FF layers during the backward pass, mixed precision training, efficient block-sparse implementation, etc. Please check the paper for more details or my previous post on techniques for scaling up model training.
Blockwise Attention (Qiu et al. 2019) introduces a sparse block matrix to only allow each token to attend to a small set of other tokens. Each attention matrix of size $L \times L$ is partitioned into $n \times n$ smaller blocks of size $\frac{L}{n} \times \frac{L}{n}$, and a permutation $\pi$ of $\{1, \dots, n\}$ decides which key/value block each query block attends to, defining a sparse block mask.
The actual implementation of Blockwise Attention only stores the query, key and value matrices block by block, with each query block attending to a single permuted key/value block, so the memory cost of the attention matrix drops from $\mathcal{O}(L^2)$ to roughly $\mathcal{O}(L^2/n)$.
Combination of Local and Global Context
ETC (Extended Transformer Construction; Ainslie et al. 2019), Longformer (Beltagy et al. 2020) and Big Bird (Zaheer et al. 2020) models combine both local and global context when building an attention matrix. All these models can be initialized from existing pretrained models.
Global-Local Attention of ETC (Ainslie et al. 2019) takes two inputs, (1) the long input $\mathbf{x}^l$, which is a regular input sequence, and (2) the global input $\mathbf{x}^g$, which contains a much smaller number of auxiliary tokens. Attention is thus split into four pieces: global-to-global (g2g), global-to-local (g2l), local-to-global (l2g), and local-to-local (l2l). Because the l2l piece would be very large, it is restricted to a fixed-size local attention span of radius $w$.
ETC utilizes four binary mask matrices, $\mathbf{M}^{g2g}$, $\mathbf{M}^{g2l}$, $\mathbf{M}^{l2g}$ and $\mathbf{M}^{l2l}$, to handle structured inputs; wherever a mask entry is off, a very large negative constant is added to the corresponding attention logit so that the position receives effectively zero attention weight after softmax, and learnable relative position encodings are added to the attention logits.

One more update in ETC is to incorporate a CPC (contrastive predictive coding) task using NCE loss into the pretraining stage, besides the MLM task: The representation of one sentence should be similar to the representation of context around it when this sentence is masked.
The global input $\mathbf{x}^g$ is constructed from auxiliary tokens that each summarize a segment of the long input (e.g. one global token per sentence), so that every pair of long-input tokens can exchange information within a couple of hops through the global tokens.
Attention pattern in Longformer contains three components:
- Local attention: Similar to ETC, local attention is controlled by a sliding window of fixed size $w$;
- Global attention of preselected tokens: Longformer has a few pre-selected tokens (e.g. the `[CLS]` token) assigned with a global attention span, that is, attending to all other tokens in the input sequence;
- Dilated attention: A dilated sliding window of fixed size $r$ with gaps of dilation size $d$, similar to Sparse Transformer.
Big Bird is quite similar to Longformer, equipped with both local attention and a few preselected tokens with global attention span, but Big Bird replaces dilated attention with a new mechanism where all tokens attend to a set of random tokens. The design is motivated by the fact that attention pattern can be viewed as a directed graph and a random graph has the property that information is able to rapidly flow between any pair of nodes.
Longformer uses smaller window size at lower layers and larger window sizes at higher layers. Ablation studies showed that this setup works better than reversed or fixed size config. Lower layers do not have dilated sliding windows to better learn to use immediate local context. Longformer also has a staged training procedure where initially the model is trained with small window size to learn from local context and then subsequent stages of training have window sizes increased and learning rate decreased.
Content-based Attention
The improvements proposed by Reformer (Kitaev, et al. 2020) aim to solve the following pain points in vanilla Transformer:
- Quadratic time and memory complexity within self-attention module.
- Memory in a model with $N$ layers is $N$-times larger than in a single-layer model because we need to store activations for back-propagation.
- The intermediate FF layers are often quite large.
Reformer proposed two main changes:
- Replace the dot-product attention with locality-sensitive hashing (LSH) attention, reducing the complexity from $\mathcal{O}(L^2)$ to $\mathcal{O}(L\log L)$.
- Replace the standard residual blocks with reversible residual layers, which allows storing activations only once during training instead of $N$ times (i.e. proportional to the number of layers).
Locality-Sensitive Hashing Attention
In the $\mathbf{Q}\mathbf{K}^\top$ part of the attention formula, we are only interested in the largest elements, since only large attention weights contribute much after softmax. For each query $\mathbf{q}_i$, we are looking for the row vectors in $\mathbf{K}$ closest to $\mathbf{q}_i$. In order to find nearest neighbors quickly in a high-dimensional space, Reformer incorporates Locality-Sensitive Hashing (LSH) into its attention mechanism.
A hashing scheme $\mathbf{x} \mapsto h(\mathbf{x})$ is locality-sensitive if it preserves the distance information between data points, such that close vectors obtain similar hashes while distant vectors have very different ones. Reformer adopts such a scheme: given a fixed random matrix $\mathbf{R} \in \mathbb{R}^{d \times b/2}$ (where $b$ is a hyperparameter controlling the number of buckets), the hash function is $h(\mathbf{x}) = \arg\max([\mathbf{x}\mathbf{R}; -\mathbf{x}\mathbf{R}])$.

In LSH attention, a query can only attend to positions in the same hashing bucket, $S_i = \{j : h(\mathbf{q}_i) = h(\mathbf{k}_j)\}$. This is carried out in the following process, as illustrated below:
- (a) The attention matrix for full attention is often sparse.
- (b) Using LSH, we can sort the keys and queries to be aligned according to their hash buckets.
- (c) Set $\mathbf{Q} = \mathbf{K}$ (precisely $\mathbf{k}_j = \mathbf{q}_j / \|\mathbf{q}_j\|$), so that there are equal numbers of keys and queries in each bucket, which makes batching easier. Interestingly, this “shared-QK” config does not affect the performance of the Transformer.
- (d) Apply batching where chunks of $m$ consecutive queries are grouped together.

Reversible Residual Network
Another improvement by Reformer is to use reversible residual layers (Gomez et al. 2017). The motivation for reversible residual network is to design the architecture in a way that activations at any given layer can be recovered from the activations at the following layer, using only the model parameters. Hence, we can save memory by recomputing the activation during backprop rather than storing all the activations.
Given a layer $x \mapsto y$, a normal residual layer computes $y = x + F(x)$, while a reversible layer splits both inputs and outputs into pairs $(x_1, x_2) \mapsto (y_1, y_2)$ and then executes:

$$y_1 = x_1 + F(x_2), \quad y_2 = x_2 + G(y_1)$$

and reversing is easy:

$$x_2 = y_2 - G(y_1), \quad x_1 = y_1 - F(x_2)$$
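A tiny NumPy check of this forward/inverse pair, with arbitrary functions standing in for the two sublayers:

```python
import numpy as np

def rev_forward(x1, x2, F, G):
    y1 = x1 + F(x2)               # first half: residual through F
    y2 = x2 + G(y1)               # second half: residual through G
    return y1, y2

def rev_inverse(y1, y2, F, G):
    x2 = y2 - G(y1)               # recover the inputs from the outputs only
    x1 = y1 - F(x2)
    return x1, x2

F, G = np.tanh, lambda v: 0.5 * v             # placeholder sublayers
x1, x2 = np.random.randn(3), np.random.randn(3)
assert np.allclose((x1, x2), rev_inverse(*rev_forward(x1, x2, F, G), F, G))
```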
Reformer applies the same idea to Transformer by combining the attention ($F$) and feed-forward ($G$) layers within a reversible block:

$$Y_1 = X_1 + \text{Attention}(X_2), \quad Y_2 = X_2 + \text{FeedForward}(Y_1)$$

The memory can be further reduced by chunking the feed-forward computation:

$$Y_2 = [Y_2^{(1)}; \dots; Y_2^{(c)}] = [X_2^{(1)} + \text{FeedForward}(Y_1^{(1)}); \dots; X_2^{(c)} + \text{FeedForward}(Y_1^{(c)})]$$
The resulting reversible Transformer does not need to store activations in every layer.
Routing Transformer (Roy et al. 2021) is also built on content-based clustering of keys and queries. Instead of using a static hashing function like LSH, it utilizes online $k$-means clustering and combines it with local, temporal sparse attention to reduce the attention complexity from $\mathcal{O}(L^2)$ to $\mathcal{O}(L^{1.5})$.
Within routing attention, both keys and queries are clustered with the $k$-means method against the same set of centroid vectors, and each query is routed to (i.e. only attends to) the keys assigned to the same centroid.
In the experiments for Routing Transformer, the best configurations only enable routing attention in the last two layers of the model and for half of the attention heads, while the other half utilizes local attention. They also observed that local attention is a pretty strong baseline and a larger attention window always leads to better results.
Low-Rank Attention
Linformer (Wang et al. 2020) approximates the full attention matrix with a low-rank matrix, reducing the time & space complexity to be linear. Instead of using an expensive SVD to identify the low-rank decomposition, Linformer adds two linear projections $\mathbf{E}_i, \mathbf{F}_i \in \mathbb{R}^{k \times L}$ for the key and value matrices respectively, reducing their dimension from $L \times d$ to $k \times d$. As long as $k \ll L$, the attention memory can be greatly reduced.
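A minimal single-head NumPy sketch of this projection step; `k` is the projected length and the softmax is written inline.

```python
import numpy as np

def linformer_attention(Q, K, V, E, F):
    """Q, K, V: (L, d); E, F: (k, L) learned projections with k << L."""
    d = Q.shape[-1]
    K_proj, V_proj = E @ K, F @ V                     # (k, d): compressed keys and values
    scores = Q @ K_proj.T / np.sqrt(d)                # (L, k) instead of (L, L)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return w @ V_proj                                 # (L, d)
```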
Additional techniques can be applied to further improve efficiency of Linformer:
- Parameter sharing between projection layers, such as head-wise, key-value and layer-wise (across all layers) sharing.
- Use different $k$ at different layers, as heads in higher layers tend to have a more skewed distribution (lower rank) and thus can use a smaller $k$.
- Use different types of projections; e.g. mean/max pooling, or a convolution layer with kernel size and stride $L/k$.

Random Feature Attention (RFA; Peng et al. 2021) relies on random feature methods (Rahimi & Recht, 2007) to approximate softmax operation in self-attention with low rank feature maps in order to achieve linear time and space complexity. Performers (Choromanski et al. 2021) also adopts random feature attention with improvements on the kernel construction to further reduce the kernel approximation error.
The main theorem behind RFA is from Rahimi & Recht, 2007:
Let $\phi: \mathbb{R}^d \to \mathbb{R}^{2D}$ be a nonlinear transformation:

$$\phi(\mathbf{x}) = \frac{1}{\sqrt{D}}\big[\sin(\mathbf{w}_1^\top\mathbf{x}), \dots, \sin(\mathbf{w}_D^\top\mathbf{x}), \cos(\mathbf{w}_1^\top\mathbf{x}), \dots, \cos(\mathbf{w}_D^\top\mathbf{x})\big]^\top$$

When the $d$-dimensional random vectors $\mathbf{w}_i$ are i.i.d. from $\mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I}_d)$,

$$\mathbb{E}_{\mathbf{w}_i}\big[\phi(\mathbf{x})\cdot\phi(\mathbf{y})\big] = \exp\Big(-\frac{\|\mathbf{x} - \mathbf{y}\|^2}{2\sigma^2}\Big)$$

An unbiased estimation of $\exp(\mathbf{x}\cdot\mathbf{y}/\sigma^2)$ follows by factoring out the norms:

$$\exp\Big(\frac{\mathbf{x}\cdot\mathbf{y}}{\sigma^2}\Big) = \exp\Big(\frac{\|\mathbf{x}\|^2}{2\sigma^2}\Big)\exp\Big(\frac{\|\mathbf{y}\|^2}{2\sigma^2}\Big)\exp\Big(-\frac{\|\mathbf{x} - \mathbf{y}\|^2}{2\sigma^2}\Big) \approx \exp\Big(\frac{\|\mathbf{x}\|^2}{2\sigma^2}\Big)\exp\Big(\frac{\|\mathbf{y}\|^2}{2\sigma^2}\Big)\phi(\mathbf{x})\cdot\phi(\mathbf{y})$$

Then we can write the softmax attention for a single query as follows, where $\otimes$ is the outer product operation and $\sigma^2$ acts as the softmax temperature:

$$\text{attn}(\mathbf{q}_t, \{\mathbf{k}_i\}, \{\mathbf{v}_i\}) = \sum_i \frac{\exp(\mathbf{q}_t\cdot\mathbf{k}_i / \sigma^2)}{\sum_j \exp(\mathbf{q}_t\cdot\mathbf{k}_j / \sigma^2)}\mathbf{v}_i^\top \approx \frac{\phi(\mathbf{q}_t)^\top \sum_i \phi(\mathbf{k}_i)\otimes\mathbf{v}_i}{\phi(\mathbf{q}_t)\cdot\sum_j \phi(\mathbf{k}_j)} = \text{RFA}(\mathbf{q}_t, \{\mathbf{k}_i\}, \{\mathbf{v}_i\})$$

Causal attention RFA has the token at time step $t$ attend only to earlier keys and values $\{\mathbf{k}_i\}_{i \leq t}, \{\mathbf{v}_i\}_{i \leq t}$. A pair of summary variables $(\mathbf{S}_t \in \mathbb{R}^{2D \times d}, \mathbf{z}_t \in \mathbb{R}^{2D})$ is used to track the history up to time step $t$, similar to the hidden state of an RNN:

$$\text{causal-RFA}(\mathbf{q}_t, \{\mathbf{k}_i\}_{i\leq t}, \{\mathbf{v}_i\}_{i\leq t}) = \frac{\phi(\mathbf{q}_t)^\top \mathbf{S}_t}{\phi(\mathbf{q}_t)\cdot\mathbf{z}_t}$$

where $\mathbf{S}_t = \mathbf{S}_{t-1} + \phi(\mathbf{k}_t)\otimes\mathbf{v}_t$ and $\mathbf{z}_t = \mathbf{z}_{t-1} + \phi(\mathbf{k}_t)$.
RFA leads to a significant speedup in autoregressive decoding, and the memory complexity mainly depends on the choice of $D$ when constructing the feature map $\phi(\cdot)$.
Performer modifies the random feature attention with positive random feature maps to reduce the estimation error. It also keeps the randomly sampled vectors $\mathbf{w}_1, \dots, \mathbf{w}_D$ orthogonal to further reduce the variance of the estimator.

Transformers for Reinforcement Learning
The self-attention mechanism avoids compressing the whole past into a fixed-size hidden state and does not suffer from vanishing or exploding gradients as much as RNNs. Reinforcement learning tasks can surely benefit from these traits. However, it is quite difficult to train Transformer even in supervised learning, let alone in the RL context. After all, it can be quite challenging to stabilize and train an LSTM agent by itself.
The Gated Transformer-XL (GTrXL; Parisotto, et al. 2019) is one attempt to use Transformer for RL. GTrXL succeeded in stabilizing training with two changes on top of Transformer-XL:
- The layer normalization is only applied on the input stream in a residual module, but NOT on the shortcut stream. A key benefit to this reordering is to allow the original input to flow from the first to last layer.
- The residual connection is replaced with a GRU-style (Gated Recurrent Unit; Chung et al., 2014) gating mechanism.
The gating function parameters are explicitly initialized to be close to an identity map; this is why there is a bias term $b_g$ in the gating function, and setting $b_g > 0$ was found to greatly speed up learning.

Decision Transformer (DT; Chen et al 2021) formulates Reinforcement Learning problems as a process of conditional sequence modeling, outputting the optimal actions conditioned on the desired return, past states and actions. It therefore becomes straightforward to use Transformer architecture. Decision Transformer is for off-policy RL, where the model only has access to a fixed collection of trajectories collected by other policies.
To encourage the model to learn how to act in order to achieve a desired return, it feeds the model with the desired future return, i.e. the return-to-go $\hat{R}_t = \sum_{t'=t}^T r_{t'}$, rather than the current reward. A trajectory thus consists of a list of triplets (return-to-go, state, action) and is used as the input sequence of the Transformer:

$$\tau = (\hat{R}_1, s_1, a_1, \hat{R}_2, s_2, a_2, \dots, \hat{R}_T, s_T, a_T)$$

Three linear layers are added and trained for return-to-go, state and action respectively to extract token embeddings. The prediction head attached to the state token $s_t$ learns to predict the corresponding action $a_t$.
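A small NumPy sketch of how one trajectory could be turned into this interleaved token sequence; `embed_r`, `embed_s` and `embed_a` are hypothetical stand-ins for the three learned linear embedding layers.

```python
import numpy as np

def build_decision_transformer_tokens(rewards, states, actions, embed_r, embed_s, embed_a):
    """Interleave (return-to-go, state, action) embeddings for one trajectory."""
    rewards = np.asarray(rewards, dtype=float)
    returns_to_go = np.cumsum(rewards[::-1])[::-1]        # R_hat_t = sum of rewards from t to T
    tokens = []
    for rtg, s, a in zip(returns_to_go, states, actions):
        tokens += [embed_r(rtg), embed_s(s), embed_a(a)]  # (R_hat_t, s_t, a_t) triplet
    return np.stack(tokens)   # fed to a causal Transformer that predicts a_t from tokens up to s_t
```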
The experiments compared DT with several model-free RL algorithm baselines and showed that:
- DT is more efficient than behavior cloning in low data regime;
- DT can model the distribution of returns very well;
- Having a long context is crucial for obtaining good results;
- DT can work with sparse rewards.
Citation
Cited as:
Weng, Lilian. (Jan 2023). The transformer family version 2.0. Lil’Log. https://lilianweng.github.io/posts/2023-01-27-the-transformer-family-v2/.
Or
@article{weng2023transformer,
title = "The Transformer Family Version 2.0",
author = "Weng, Lilian",
journal = "lilianweng.github.io",
year = "2023",
month = "Jan",
url = "https://lilianweng.github.io/posts/2023-01-27-the-transformer-family-v2/"
}
References
[1] Ashish Vaswani, et al. “Attention is all you need.” NIPS 2017.
[2] Rami Al-Rfou, et al. “Character-level language modeling with deeper self-attention.” AAAI 2019.
[3] Olah & Carter, “Attention and Augmented Recurrent Neural Networks”, Distill, 2016.
[4] Sainbayar Sukhbaatar, et al. “Adaptive Attention Span in Transformers”. ACL 2019.
[5] Rewon Child, et al. “Generating Long Sequences with Sparse Transformers” arXiv:1904.10509 (2019).
[6] Nikita Kitaev, et al. “Reformer: The Efficient Transformer” ICLR 2020.
[7] Alex Graves. “Adaptive Computation Time for Recurrent Neural Networks.” arXiv:1603.08983 (2016).
[8] Niki Parmar, et al. “Image Transformer” ICML 2018.
[9] Zihang Dai, et al. “Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context.” ACL 2019.
[10] Aidan N. Gomez, et al. “The Reversible Residual Network: Backpropagation Without Storing Activations” NIPS 2017.
[11] Mostafa Dehghani, et al. “Universal Transformers” ICLR 2019.
[12] Emilio Parisotto, et al. “Stabilizing Transformers for Reinforcement Learning” arXiv:1910.06764 (2019).
[13] Rae et al. “Compressive Transformers for Long-Range Sequence Modelling.” 2019.
[14] Press et al. “Train Short, Test Long: Attention With Linear Biases Enables Input Length Extrapolation.” ICLR 2022.
[15] Wu, et al. “DA-Transformer: Distance Aware Transformer” 2021.
[16] Elabyad et al. “Depth-Adaptive Transformer.” ICLR 2020.
[17] Schuster et al. “Confident Adaptive Language Modeling” 2022.
[18] Qiu et al. “Blockwise self-attention for long document understanding” 2019
[19] Roy et al. “Efficient Content-Based Sparse Attention with Routing Transformers.” 2021.
[20] Ainslie et al. “ETC: Encoding Long and Structured Inputs in Transformers.” EMNLP 2019.
[21] Beltagy et al. “Longformer: The long-document transformer.” 2020.
[22] Zaheer et al. “Big Bird: Transformers for Longer Sequences.” 2020.
[23] Wang et al. “Linformer: Self-Attention with Linear Complexity.” arXiv preprint arXiv:2006.04768 (2020).
[24] Tay et al. “Sparse Sinkhorn Attention.” ICML 2020.
[25] Peng et al. “Random Feature Attention.” ICLR 2021.
[26] Choromanski et al. “Rethinking Attention with Performers.” ICLR 2021.
[27] Khandelwal et al. “Generalization through memorization: Nearest neighbor language models.” ICLR 2020.
[28] Yogatama et al. “Adaptive semiparametric language models.” ACL 2021.
[29] Wu et al. “Memorizing Transformers.” ICLR 2022.
[30] Su et al. “Roformer: Enhanced transformer with rotary position embedding.” arXiv preprint arXiv:2104.09864 (2021).
[31] Shaw et al. “Self-attention with relative position representations.” arXiv preprint arXiv:1803.02155 (2018).
[32] Tay et al. “Efficient Transformers: A Survey.” ACM Computing Surveys 55.6 (2022): 1-28.
[33] Chen et al., “Decision Transformer: Reinforcement Learning via Sequence Modeling” arXiv preprint arXiv:2106.01345 (2021).