Recommended reading before approaching this post:🔗
 RNN – Andrej Karpathy’s blog The Unreasonable Effectiveness of Recurrent Neural Networks
 Seq2Seq  Nathan Lintz Sequence Modeling With Neural Networks (Part 1): Language & Seq2Seq, Part2 Sequence modeling with attention
 LSTM – Christopher Olah’s blog Understanding LSTM Networks and R2Rt.com Written Memories: Understanding, Deriving and Extending the LSTM.
 Attention – Christopher Olah Attention and Augmented Recurrent Neural Networks
Objective or goal for the algorithm🔗

Parallelization of Seq2Seq: RNN/CNN handle sequences wordbyword sequentially which is an obstacle to parallelize. Transformer achieve parallelization by replacing recurrence with attention and encoding the symbol position in sequence. This in turn leads to significantly shorter training time.

Reduce sequential computation: Constant \(O(1)\) number of operations to learn dependency between two symbols independently of their position distance in sequence.
TL;DR🔗
RNN:
 Advantages: are popular and successful for variablelength representations such as sequences (e.g. languages), images, etc. RNN are considered core of seq2seq (with attention). The gating models such as LSTM or GRU are for longrange error propagation.
 Problems: The sequentiality prohibits parallelization within instances. Longrange dependencies still tricky, despite gating. Sequencealigned states in RNN are wasteful. Hard to model hierarchicalalike domains such as languages.
CNN:
 Advantages: Trivial to parallelize (per layer) and fit intuition that most dependencies are local.
 Problems: Path length between positions can be logarithmic when using dilated convolutions, leftpadding for text. (autoregressive CNNs WaveNet, ByteNET )
Solution: Multihead selfattention mechanism. Why attention? Table 2 of the paper shows that such attention networks can save 2–3 orders of magnitude of operations!
Intro🔗
In this post I will elaborate on lately presented paper introducing the Transformer architecture. Paper: ArXiv.
As the paper assumes the indepth prior knowledge of some topics, I will try to explain the ideas of the paper so that they can be understood by a DL beginner.
When RNN’s (or CNN) takes a sequence as an input, it handles sentences word by word. This sequentiality is an obstacle toward parallelization of the process. What is more, in cases when such sequences are too long, the model is prone to forgetting the content of distant positions in sequence or mix it with following positions’ content.
One solution to this was Convolution seq2seq. Convolution enables parallelization for GPU processing. Thus Gehring et al, 2017 (Facebook AI) present a 100% convolutional architecture to represent hierarchical representation of input sequence. The crux is that close input elements interact at lower layers while distant interacts at higher layers. The stacking of convolution layers is being used to evaluate long range dependencies between words. This is all possible despite fixed width kernels of convolution filters. The authors has also included two tricks:
 Positional embeddings  that are added to the input embeddings, capturing a sense of order in a sequence
 Multistep attention  attention is computed with current decoder state and embedding of previous target word token
Figure 1. Multistep attention form ConvS2S.
As an alternative to convolutions, a new approach is presented by the Transformer. It proposes to encode each position and applying the attention mechanism, to relate two distant words, which then can be parallelized thus, accelerating the training.
Currently, in NLP the SOTA (stateoftheart) performance achieved by seq2seq models is focused around the idea of encoding the input sentence into a fixedsize vector representation. The vector size is fixed, regardless of the length of the input sentence. In obvious way this must loose some information. To face this issue Transformer employs an alternative approach based on attention.
Due to multiple referred work it is beneficial to also read the mentioned research
Motivation🔗
The Transformer architecture is aimed at the problem of sequence transduction (by Alex Graves), meaning any task where input sequences are transformed into output sequences. This includes speech recognition, texttospeech transformation, machine translation, protein secondary structure prediction, Turing machines etc. Basically the goal is to design a single framework to handle as many sequences as possible.
Currently, complex RNN and CNN based on encoderdecoder scheme are dominating transduction models (language modeling and machine learning). Recurrent models due to sequential nature (computations focused on the position of symbol in input and output) are not allowing for parallelization along training, thus have a problem with learning longterm dependencies (Hochreiter,, et al.) from memory. The bigger the memory is, the better, but the memory eventually constrains batching across learning examples for long sequences, and this is why parallelization can not help.
Reducing this fundamental constraint of sequential computation has been target for numerous research like Wavenet / Bytenet or ConvS2S. However, in those CNNbased approaches, the number of calculations in parallel computation of the hidden representation, for input\(\rightarrow\)output position in sequence, grows with the distance between those positions. The complexity of \(O(n)\) for ConvS2S and \(O(nlogn)\) for ByteNet makes it harder to learn dependencies on distant positions.
Transformer reduces the number of sequential operations to relate two symbols from input/output sequences to a constant \(O(1)\) number of operations. Transformer achieves this with the multihead attention mechanism that allows to model dependencies regardless of their distance in input or output sentence.
Up till now, most of the research including attention is used along with RNNs. The novel approach of Transformer is however, to eliminate recurrence completely and replace it with attention to handle the dependencies between input and output. The Transformer moves the sweet spot of current ideas toward attention entirely. It eliminates the not only recurrence but also convolution in favor of applying selfattention (a.k.a intraattention). Additionally Transformer gives more space for parallelization (details present in paper).
The top performance in the paper is achieved while applying the attention mechanism connecting encoder and decoder.
Transformer is claimed by authors to be the first to rely entirely on selfattention to compute representations of input and output.
Information on processing strategy of the algorithm🔗
Transformer is based on sequencetosequence model for Statistical Machine Translation (SMT) as introduced in Cho et al., 2014. It includes two RNNs, one for encoder to process the input and the other as a decoder, for generating the output.
In general, transformer’s encoder maps input sequence to its continuous representation \(z\) which in turn is used by decoder to generate output, one symbol at a time.
The final state of the encoder is a fixed size vector \(z\) that must encode entire source sentence which includes the sentence meaning. This final state is therefore called sentence embedding^{1}.
The encoderdecoder model is designed at its each step to be autoregressive  i.e. use previously generated symbols as extra input while generating next symbol. Thus, \( x_i + y_{i1}\rightarrow y_i\)
Transformer is based on EncoderDecoder🔗
In Transformer (as in ByteNet or ConvS2S) the decoder is stacked directly on top of encoder. Encoder and decoder both are composed of stack of identical layers. Each of those stacked layers is composed out of two general types of sublayers:
 multihead selfattention mechanism, and

positionwise fully connected FFN.
In contrast to ConvS2S however, where input representation considered each input element combined with its absolute position number in sequence (providing sense of ordering; ByteNet have dilated convolutions and no positionwise FNN), transformer introduces two different NN for these two types of information.
The way the attention mechanism is applied and customized is what makes the Transformer novel.
One can find the reference Transformer model implementation from authors is present in Tensor2Tensor (T2T) library
Figure 2. Single layer of Encoder (left) and Decoder (right) that is build out of \(N=6\) identical layers.
Encoder🔗

Stage 1 – Encoder Input Information on sequence ordering is very important. As there is no recurrence, nor convolution, this information on absolute (or relative) position of each token in a sequence is represented with use of "positional encodings" (Read more). The input for the encoder is therefore, represented as:
 positional encodings added \(\oplus\) to
 embedded inputs

\(N=6\) layers. In practice the \(N=6\) means more than 6 layers. Each of those “layers” are actually composed of two layers: positionwise FNN and one (encoder), or two (decoder), attentionbased sublayers. Each of those additionally contains 4 linear projections and the attention logic. Thus, providing effectively deeper than 6 layer architecture.
 Stage 2 – Multihead attention
 Stage 3 – positionwise FFN
Stages 2 and 3 use the residual connection (thus, all employ \(d_{model}=512\)) followed by normalization layer at its output.
Thus, encoder works like this:
Stage1_out = Embedding512 + TokenPositionEncoding512
Stage2_out = layer_normalization(multihead_attention(Stage1_out) + Stage1_out)
Stage3_out = layer_normalization(FFN(Stage2_out) + Stage2_out)
out_enc = Stage3_out
Decoder🔗
Decoder’s architecture is similar however, it employs additional layer in Stage 3 with mask multihead attention over encoder output.

Stage 1 – Decoder input The input is the output embedding, offset by one position to ensure that the prediction for position \(i\) is only dependent on positions previous to/less than \(i\).

Stage 2 Masked Multihead attention Modified to prevent positions to attend to subsequent positions.
Stages 2, 3 and 4 also use the residual connection followed by normalization layer at its output.
The details of each mechanism applied in the mentioned layers is more elaborated in following section.
Put together decoder works as follows:
Stage1_out = OutputEmbedding512 + TokenPositionEncoding512
Stage2_Mask = masked_multihead_attention(Stage1_out)
Stage2_Norm1 = layer_normalization(Stage2_Mask) + Stage1_out
Stage2_Multi = multihead_attention(Stage2_Norm1 + out_enc) + Stage2_Norm1
Stage2_Norm2 = layer_normalization(Stage2_Multi) + Stage2_Multi
Stage3_FNN = FNN(Stage2_Norm2)
Stage3_Norm = layer_normalization(Stage3_FNN) + Stage2_Norm2
out_dec = Stage3_Norm
Mechanisms used to compose Transformer architecture🔗
There are couple types of layers that transformer consists of. Their details are depict in following sections.
Positional Encoding – PE🔗
In RNN (LSTM), the notion of time step is encoded in the sequence as inputs/outputs flow one at a time. In FNN, the positional encoding must be preserved to represent the time in some way to preserve the positional encoding. In case of the Transformer authors propose to encode time as \(sine\) wave, as an added extra input. Such signal is added to inputs and outputs to represent time passing^{2}.
In general, adding positional encodings to the input embeddings is a quite interesting topic. One way is to embed the absolute position of input elements (as in ConvS2S). However, authors use "sine and cosine functions of different frequencies". The "sinusoidal" version is quite complicated, while giving similar performance to the absolute position version. The crux is however, that it may allow the model to produce better translation on longer sentences at test time (at least longer than the sentences in the training data). This way sinusoidal method allows the model to extrapolate to longer sequence lengths^{3}.
This encoding gives the model a sense of which portion of the sequence of the input (or output) it is currently dealing with. The positional encoding can be learned, or fixed. Authors made tests (PPL, BLEU) showing that both: learned and fixed positional encodings perform similarly.
In paper authors have decided on fixed variant using \(sin\) and \(cos\) functions to enable the network to learn information about tokens relative positions to the sequence.
Of course authors motivate the use of sinusoidal functions due to enabling model to generalize to sequences longer than ones encountered during training.
Attention🔗
Attention between encoder and decoder is crucial in NMT. Authors formulate the definition of attention that has already been elaborated in Attention primer. Attention is a function that maps the 2element input (query, keyvalue pairs) to an output. The output given by the mapping function is a weighted sum of the values. Where weights for each value measures how much each input key interacts with (or answers) the query. While the attention is a goal for many research, the novelty about transformer attention is that it is multihead selfattention.
Scaled DotProduct Attention🔗
In terms of encoderdecoder, the query is usually the hidden state of the decoder. Whereas key, is the hidden state of the encoder, and the corresponding value is normalized weight, representing how much attention a key gets. Output is calculated as a wighted sum – here the dot product of query and key is used to get a value.
It is assumed that queries and keys are of \(d_k\) dimension and values are of \(d_v\) dimension. Those dimensions are imposed by the linear projection discussed in the multihead attention section. The input is represented by three matrices: queries’ matrix \(Q\), keys’ matrix \(K\) and values’ matrix \(V\).
The compatibility function (see Attention primer) is considered in terms of two, additive and multiplicative (dotproduct) variants Bahdanau et al. 2014 with similar theoretical complexity. However, the dotproduct (\(q \cdot k = \sum_{i=1}^{d_k}q_i k_i\)) with scaling factor \(1/\sqrt{d_k}\) is chosen due to being much faster and spaceefficient, as it uses optimized matrix multiplication code^{4}.
Using NumPy:
def attention(Q, K, V):
num = np.dot(Q, K.T)
denum = np.sqrt(K.shape[0])
return np.dot(softmax(num / denum), V)
Multihead attention🔗
Transformer reduces the number of operations required to relate (especially distant) positions in input and output sequence to a \(O(1)\). However, this comes at cost of reduced effective resolution because of averaging attentionweighted positions.
Figure 3. MultiHead Attention consists of \(h\) attention layers running in parallel.
To reduce this cost authors propose the multihead attention:

\(h=8\) attention layers (aka “heads”): that represent linear projection (for the purpose of dimension reduction) of key \(K\) and query \(Q\) into \(d_k\)dimension and value \(V\) into \(d_v\) dimension:
$$head_i = Attention(Q W^Q_i, K W^K_i, V W^V_i) , i=1,\dots,h$$where projections are parameter matrices \(W^Q_i, W^K_i\in\mathbb{R}^{d_{model}\times d_k}, W^V_i\in\mathbb{R}^{d_{model}\times d_v}\), for \(d_k=d_v=d_{model}/h = 64\) 
scaleddot attention applied in parallel on each layer (different linear projections of \(k, q, v\)) results in \(d_v\)dimensional output.
 concatenate outputs of each layer (different linear projection; also referred as ”head”): \(Concat(head_1,\dots,head_h)\)

linearly project the concatenation result form the previous step:
$$ MultiHeadAttention(Q,K,V) = Concat(head_1,\dots,head_h) W^O$$where \(W^0\in\mathbb{R}^{d_{hd_v}\times d_{model}}\)
Transformer use multihead (\(d_{model}/h\) parallel attention functions) attention instead of single (\(d_{model}\)dimensional) attention function (i.e. \(q,k,v\) all \(d_{model}\)dimensional). It is at similar computational cost as in the case of singlehead attention due to reduced dimensions of each head.
Transformer imitates the classical attention mechanism (known e.g. from Bahdanau et al., 2014 or Conv2S2) where in encoderdecoder attention layers queries are form previous decoder layer, and the (memory) keys and values are from output of the encoder. Therefore, each position in decoder can attend over all positions in the input sequence.
SelfAttention (SA)🔗
See Attention Primer for basics on attention.
In encoder, selfattention layers process input \(queries, keys\) and \(values\) that comes form same place i.e. the output of previous layer in encoder. Each position in encoder can attend to all positions from previous layer of the encoder
In decoder, selfattention layer enable each position to attend to all previous positions in the decoder, including the current position. To preserve autoregressive property, the leftward information flow is presented inside the dotproduct attention by masking out (set to \( \infty\)) all \(values\) that are input for softmax which correspond to this illegal connections.
Authors motivates the use of selfattention layers instead of recurrent or convolutional layers with three desiderata:

Minimize total computational complexity per layer
 Pros: selfattention layers connects all positions with \(O(1)\) number of sequentially executed operations (eg. vs \(O(n)\) in RNN)

Maximize amount of parallelizable computations, measured by minimum number of sequential operations required
 Pros: for sequence length \(n\) < representation dimensionality \(d\) (true for SOTA sequence representation models like wordpiece, bytepair). For very long sequences \(n>d\) selfattention can consider only neighborhood of some size \(r\) in the input sequence centered around the respective output position, thus increasing the max path length to \(O(n/r)\)

Minimize maximum path length between any two input and output positions in network composed of the different layer types . The shorter the path between any combination of positions in the input and output sequences, the easier to learn longrange dependencies. (See why Hochreiter et al, 2001)
Positionwise FFN🔗
In encoder and decoder the attention sublayers is being processed by a fully connected FNN. It is applied to each position separately and identically meaning two linear transformations and a ReLU
Linear transformations are the same for each position, but use different parameters from layer to layer. It works similarly to two convolutions of kernel size 1. The input/output dimension is \(d_{model}=512\) while inner0layer is \(d_{ff}=2048\)
Pseudocode or flowchart description of the algorithm.🔗
Figure 4. Transformer stepbystep sequence transduction in form of EnglishtoFrench translation. Adopted from Google Blog
In encoder phase (shown in the Figure 1.), transformer first generates initial representation/embedding for each word in input sentence (empty circle). Next, for each word, selfattention aggregates information form all other words in context of sentence, and creates new representation (filled circles). The process is repeated for each word in sentence. Successively building new representations, based on previous ones is repeated multiple times and in parallel for each word (next layers of filled circles).
Decoder acts similarly generating one word at a time in a lefttorightpattern. It attends to previously generated words of decoder and final representation of encoder.
It is worth noting that this selfattention strategy allows to face the issue of coreference resolution where e.g. word “it” in a sentence can refer to different noun of the sentence depending on context.
Figure 5. Coreference resolution. The it in both cases relates to different token. Adopted from Google Blog
Heuristics or rules of thumb.🔗
Authors have conducted a series of tests (Table 3 of the paper) where they discuss recommendation of \(N=6\) layers with model size 512 based on \(h=8\) heads with key, values dimensions of 64 using 100K steps.
It is also stated that dotproduct compatibility function might be further optimized due to model quality is decreased with smaller \(d_k\) (row B).
The proposed, fixed sinusoidal positional encodings are claimed to produce nearly equal score comparing to learned positional encodings.
What classes of problem is the algorithm well suited?🔗
 sequence transduction (language translation)
 classic language analysis task of syntactic constituency parsing
 different inputs and outputs modalities, such as images and video
 coreference resolution
Common benchmark or example datasets used to demonstrate the algorithm.🔗
 Perplexity (PPL) BLEU
 EnglishtoGerman translation development set WMT 2014 EnglishtoGerman and WMT 2014 EnglishtoFrench translation tasks
 newstest2013
 English constituency parsing
Useful resources for learning more about the algorithm.🔗
There are also some more general learning opensource software frameworks for NMT:
Tensorflow: Neural Machine Translation (seq2seq) Tutorial
Groundup explaining NMT concepts tutorial with code, based on Tensorflow by Google Research. Based on Google Research blog.
Torch: FirSeq seq2seq and ConvS2S from Facebook AI.
* Tensorflow: Tensor2Tensor From used by Transformer authors.
Thoughts on the idea.🔗
To my limited knowledge there are some statements that might benefit form more explanation:
 How the scaling factor (Equation 3) makes an impact?
 How actually the positional encoding work? Why they have chosen the sin/cos functions and why the position and dimension are in this relation? Finally how sinusoidal helps translate long sentences?
 Does having separate positionwise FFNs help? (comparing to ConvS2S).
 The “cost of reduced effective resolution due to averaging attentionweighted position” is claimed to be a motivation for multihead attention. How to understand better what is the issue and how multihead attention helps?
 The Transformer brings a significantly improvement over ConvS2S, but where does the improvement come from? It is not clear from the work. ConvS2S lacks the selfattention, is it what brings the advantage?
 Masked Attention. The problem of using same parts of input on different decoding step is claimed to be solved by penalizing (maskout to \(\infty\)) input tokens that have obtained high attention scores in the past decoding steps – a bit vague. How does it work? Maybe explicitly having a positionwise FFN automatically fixes that problem?
 Applying multihead attention might improve performance due to better parallelization. However, Table 3 also show increasing \(h=1 to 8\) improves accuracy. Why? Moving \(h\) to 16 or 32 is not that beneficial. How to interpret this correctly?
 How important the autoregression is in context of this architecture?
Please leave a comment if you have any other question, or would like to get more explanation on any of the paper’s particularities.
Primary references or resources in which the algorithm was first described.🔗
 Paper: ArXiv
Some interesting future research🔗
 Devising more sophisticated compatibility function
 Increase maximum path length to \(O(n/r)\), where \(r\) would be only a neighborhood size of positions to be considered by selfattention instead of all positions in sequence
References:🔗
 While applying dimensionality reduction techniques (e.g. PCA, tSNE) on embeddings, the outcome plot gathers the semantically close sentences together Sutskever et al., 2014. ↩
 It is interesting how this resembles the brain waves and neural oscillations. ↩
 It reminds a bit the Pointer Networks that address similar problem ↩
 The additional scaling factor is advised for large \(d_k\) where dot product grow large in magnitude, as softmax is suspected to be pushed to vanishing gradient area, thus making the additive attention perform better. ↩
Comments
comments powered by Disqus