# Attention Is All You Need

[Close Reading] Transformer

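The Transformer is one of the best-known models in artificial intelligence and deep learning today. It was proposed by a Google team in June 2017 and published at NeurIPS (Conference on Neural Information Processing Systems). It was originally designed for machine translation in natural language processing (NLP), yet it turned out to outperform recurrent neural networks (RNNs) using only an encoder-decoder structure and the attention mechanism.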

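The Transformer was tailored to NLP, but researchers later found that images and other data, once encoded and serialized into sequences, can also be fed into a Transformer for training, with results that compare surprisingly well against convolutional neural networks (CNNs) and deep neural networks (DNNs). That is what made the Transformer take off in computer vision.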

## Abstract

The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.0 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature.

## 7 Conclusion

In this work, we presented the Transformer, the first sequence transduction model based entirely on attention, replacing the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention.

For translation tasks, the Transformer can be trained significantly faster than architectures based on recurrent or convolutional layers. On both WMT 2014 English-to-German and WMT 2014 English-to-French translation tasks, we achieve a new state of the art. In the former task our best model outperforms even all previously reported ensembles.

We are excited about the future of attention-based models and plan to apply them to other tasks. We plan to extend the Transformer to problems involving input and output modalities other than text and to investigate local, restricted attention mechanisms to efficiently handle large inputs and outputs such as images, audio and video. Making generation less sequential is another research goal of ours.

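1. The Transformer was the first sequence transduction model based entirely on attention at the time; it replaces the previously common recurrent layers with multi-headed self-attention.
2. On machine translation, the Transformer trains much faster than architectures based on recurrent or convolutional layers.
3. The Transformer could later be applied to data types other than text, such as images, audio, and video. In hindsight, the authors had more or less anticipated where research would go, which I find very impressive!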

## 1 Introduction

Recurrent neural networks, long short-term memory and gated recurrent neural networks in particular, have been firmly established as state of the art approaches in sequence modeling and transduction problems such as language modeling and machine translation. Numerous efforts have since continued to push the boundaries of recurrent language models and encoder-decoder architectures.

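• LSTM (Long Short-Term Memory): a kind of recurrent neural network designed specifically to address the long-term dependency problem of ordinary RNNs. All RNNs take the form of a chain of repeating neural network modules.
• GRU (Gated Recurrent Unit): a well-performing variant of the LSTM. Its structure is simpler than the LSTM's and it also works well, so it is a very popular network today.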

Recurrent models typically factor computation along the symbol positions of the input and output sequences.

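The defining trait of an RNN is that it processes a sequence step by step from left to right. The hidden state at time step $t$, written $h_t$, is determined by the previous hidden state $h_{t-1}$ and the input at time step $t$. This is why an RNN can handle sequential information. The same trait also causes the following problems:

• Computation is hard to parallelize: even on mainstream, massively multi-threaded GPUs, the time steps have to be computed one after another.
• There is a tension between the sequence length and the size of $h_t$. If the sequence is very long and $h_t$ is not large enough, earlier information is likely to be lost; but if $h_t$ is made very large, the memory cost becomes too high.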
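The sequential dependency described above can be sketched in a few lines. This is a toy illustration rather than any real RNN implementation: it assumes scalar weights `w_h` and `w_x` and a `tanh` nonlinearity, none of which come from the note itself (real RNNs use weight matrices and vector states).

```python
import math

def rnn_forward(inputs, w_h, w_x, h0=0.0):
    """Toy scalar RNN: h_t = tanh(w_h * h_{t-1} + w_x * x_t)."""
    h = h0
    states = []
    for x_t in inputs:
        # h_t needs h_{t-1}, so step t cannot start before step t-1 finishes.
        h = math.tanh(w_h * h + w_x * x_t)
        states.append(h)
    return states

# Hypothetical inputs and weights, chosen only for illustration.
states = rnn_forward([1.0, 0.5, -0.3], w_h=0.8, w_x=0.5)
```

Because each iteration reads the `h` written by the previous one, the loop cannot be split across time steps, which is the parallelization problem listed above.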

Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their distance in the input or output sequences.

In this work we propose the Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output. The Transformer allows for significantly more parallelization and can reach a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs.

## 2 Background

The goal of reducing sequential computation also forms the foundation of the Extended Neural GPU, ByteNet and ConvS2S, all of which use convolutional neural networks as basic building block, computing hidden representations in parallel for all input and output positions.

In the Transformer this is reduced to a constant number of operations, albeit at the cost of reduced effective resolution due to averaging attention-weighted positions, an effect we counteract with Multi-Head Attention as described in section 3.2.

Self-attention, sometimes called intra-attention is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence.

## 3 Model Architecture

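• $\left(x_1, x_2, \dots, x_n\right)$: the input sequence. If the sequence is an English sentence, $x_t$ is the $t$-th word.
• $\textbf{z} = \left(z_1, z_2, \dots, z_n\right)$: the encoder output. $z_t$ is a vector representation of $x_t$.
• $\left(y_1, y_2, \dots, y_m\right)$: the decoder output, a sequence of length $m$. Unlike the encoder, the decoder generates its words one at a time, which is called auto-regressive: each output is fed back as input when producing the next output. In other words, the translation comes out one word at a time.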
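The auto-regressive loop can be sketched as follows. The names `greedy_decode`, `step_fn`, and `toy_step` are hypothetical helpers introduced only for this illustration; `toy_step` is a stand-in for a trained decoder and simply copies the encoder "memory" before emitting an end token.

```python
def greedy_decode(step_fn, memory, bos, eos, max_len=10):
    """Generate tokens one at a time, feeding each output back as input."""
    output = [bos]
    for _ in range(max_len):
        # The step sees the encoder output and everything generated so far.
        next_token = step_fn(memory, output)
        output.append(next_token)
        if next_token == eos:
            break
    return output[1:]  # drop the begin-of-sequence marker

def toy_step(memory, prefix):
    """Stand-in for a trained decoder: copy the memory, then stop."""
    position = len(prefix) - 1
    return memory[position] if position < len(memory) else "<eos>"

decoded = greedy_decode(toy_step, ["ich", "bin", "hier"], "<bos>", "<eos>")
```

The point of the sketch is only the control flow: the number of loop iterations equals the output length $m$, and each iteration depends on the previous ones.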

### 3.1 Encoder and Decoder Stacks

Encoder

• layers: $N=6$
• sub-layers:
  • multi-head self-attention mechanism
  • position-wise fully connected feed-forward network: essentially an MLP (Multilayer Perceptron)
• output: $\textrm{LayerNorm}(x + \textrm{Sublayer}(x))$
• dimension: $d_{model} = 512$

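The difference between BatchNorm and LayerNorm

Both apply the standardization below; they differ in the axis over which $\mu$ and $\sigma$ are computed. BatchNorm normalizes each feature across the samples in a batch, while LayerNorm normalizes each sample across its own features.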

\begin{align} Y = \frac{X - \mu}{\sigma} \end{align}
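A minimal sketch of the two normalizations on a 2-D batch (rows are samples, columns are features) may make the axis difference concrete. It is a simplification under stated assumptions: the learnable scale and shift parameters are omitted, a small `eps` is added for numerical safety, and plain Python lists are used instead of a tensor library. In the Transformer itself the input also has a sequence dimension, and LayerNorm is applied to each position's feature vector.

```python
import math

def normalize(values, eps=1e-5):
    """Return (v - mean) / sqrt(var + eps) for a list of floats."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

def layer_norm(batch):
    # Statistics per sample, computed across that sample's features (rows).
    return [normalize(row) for row in batch]

def batch_norm(batch):
    # Statistics per feature, computed across the samples in the batch (columns).
    columns = [normalize(list(col)) for col in zip(*batch)]
    return [list(row) for row in zip(*columns)]

batch = [[1.0, 2.0, 3.0],
         [10.0, 20.0, 60.0]]
ln = layer_norm(batch)   # each row now has mean ~0
bn = batch_norm(batch)   # each column now has mean ~0
```

LayerNorm does not depend on the other samples in the batch, which is convenient when sequences in a batch have different lengths.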

Decoder

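• layers: $N=6$
• sub-layers:
  • multi-head self-attention mechanism: same as in the encoder
  • position-wise fully connected feed-forward network: same as in the encoder
• masking: ensures that the prediction at position $i$ can depend only on the known outputs at positions less than $i$. During training the decoder's input is the output sequence from earlier time steps, so it must not see inputs from later time steps.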

### 3.2 Attention

An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.

#### 3.2.1 Scaled Dot-Product Attention

\begin{align} \textrm{Attention}\left(Q, K, V\right) = \textrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V \end{align}

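$Q$ is the query and $K$ is the key, so $QK^T$ takes the inner product of every query with every key. The authors' reasoning is that the larger the inner product of two vectors, the more similar they are. Dividing by $\sqrt{d_k}$ rescales the scores (the paper's stated reason is that large dot products would otherwise push the softmax into regions with extremely small gradients), and softmax then turns them into weights. The idea is close to cosine similarity (cosine distance) in machine learning, although the vectors here are not normalized to unit length: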

\begin{align} \textrm{similarity} = \cos{\theta} = \frac{\alpha \cdot \beta}{||\alpha|| \cdot ||\beta||} \end{align}
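The scaled dot-product attention formula can be written out directly. The sketch below is one possible reading, assuming matrices stored as lists of rows and tiny hypothetical values for `Q`, `K`, and `V`; it is not optimized and is not the authors' code.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for list-of-rows matrices."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # Scaled dot product of this query with every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# One query that matches the first key more closely than the second.
Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
result = attention(Q, K, V)
```

Since the query aligns with the first key, the output is pulled toward the first value vector, while the weights still sum to one.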

#### 3.2.2 Multi-Head Attention

\begin{align} \textrm{MultiHead}\left(Q, K, V\right) &= \textrm{Concat}\left(\textrm{head}_1, ..., \textrm{head}_h\right)W^O \\\\ \textbf{where}\quad\textrm{head}_i &= \textrm{Attention}\left(QW^Q_i, KW^K_i, VW^V_i\right) \end{align}
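As a worked example of the shapes involved, the projection matrices have the dimensions below. The concrete numbers use $h = 8$ heads as reported in the original paper (this note does not state $h$), together with $d_{model} = 512$:

\begin{align} W^Q_i, W^K_i &\in \mathbb{R}^{d_{model} \times d_k}, \quad W^V_i \in \mathbb{R}^{d_{model} \times d_v}, \quad W^O \in \mathbb{R}^{h d_v \times d_{model}} \\\\ d_k &= d_v = \frac{d_{model}}{h} = \frac{512}{8} = 64 \end{align}

Concatenating the $h$ heads therefore gives a vector of size $h \cdot d_v = 512$, and $W^O$ maps it back to $d_{model}$, so the total cost stays close to that of a single head with full dimensionality.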

#### 3.2.3 Applications of Attention in our Model

In “encoder-decoder attention” layers, the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder. This allows every position in the decoder to attend over all positions in the input sequence.

The encoder contains self-attention layers. In a self-attention layer all of the keys, values and queries come from the same place, in this case, the output of the previous layer in the encoder. Each position in the encoder can attend to all positions in the previous layer of the encoder.

Similarly, self-attention layers in the decoder allow each position in the decoder to attend to all positions in the decoder up to and including that position. We need to prevent leftward information flow in the decoder to preserve the auto-regressive property. We implement this inside of scaled dot-product attention by masking out (setting to $-\infty$) all values in the input of the softmax which correspond to illegal connections.
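The masking step can be illustrated with a small sketch. It assumes an additive mask (0 for allowed positions, $-\infty$ for illegal ones) applied to hypothetical pre-softmax scores; the helper names are introduced only for this example.

```python
import math

def causal_mask(n):
    """mask[i][j] is 0.0 where position i may attend to j (j <= i), else -inf."""
    return [[0.0 if j <= i else -math.inf for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask_row):
    """Softmax over scores after adding the mask; -inf entries get weight 0."""
    masked = [s + m for s, m in zip(scores, mask_row)]
    peak = max(masked)  # the diagonal is never masked, so peak is finite
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical pre-softmax scores for a sequence of length 3.
scores = [[0.5, 1.0, 2.0] for _ in range(3)]
weights = [masked_softmax(row, m) for row, m in zip(scores, causal_mask(3))]
```

The first position can only attend to itself, so its weight row is `[1.0, 0.0, 0.0]`; the last position sees all three positions.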

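1. The encoder's Multi-Head Attention takes key, value, and query as input. The arrow in the figure splits into three, meaning the same data is copied three times; this is what self-attention means. The output has the same dimension as the input.
2. The decoder's Masked Multi-Head Attention is similar to the encoder's Multi-Head Attention, except that it must mask out later inputs, as described in detail above.
3. The decoder's Multi-Head Attention is no longer self-attention as in the encoder: the key and value come from the encoder output, and the query comes from the output of the preceding masked attention sub-layer in the decoder.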

### 3.3 Position-wise Feed-Forward Networks

In addition to attention sub-layers, each of the layers in our encoder and decoder contains a fully connected feed-forward network, which is applied to each position separately and identically. This consists of two linear transformations with a ReLU activation in between.

\begin{align} \textrm{FFN}\left(x\right) = \max \left(0, xW_1 + b_1\right)W_2 + b_2 \end{align}
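"Position-wise" means the same two-layer network is applied to every position independently. A minimal sketch, assuming tiny hypothetical weights (model dimension 2, inner dimension 3) chosen only so the arithmetic is easy to follow:

```python
def linear(x, W, b):
    """Compute x W + b for a vector x and a matrix W stored as rows."""
    return [sum(xi * W[i][j] for i, xi in enumerate(x)) + b[j]
            for j in range(len(b))]

def ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2 for a single position."""
    hidden = [max(0.0, h) for h in linear(x, W1, b1)]  # ReLU
    return linear(hidden, W2, b2)

# Hypothetical weights, not taken from any trained model.
W1 = [[1.0, -1.0, 0.5], [0.0, 2.0, -1.0]]
b1 = [0.0, 0.0, 0.0]
W2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b2 = [0.1, -0.1]

sequence = [[1.0, 1.0], [2.0, -1.0]]
# The same weights are reused at every position of the sequence.
outputs = [ffn(x, W1, b1, W2, b2) for x in sequence]
```

No information flows between positions inside this sub-layer; mixing across positions happens only in the attention sub-layers.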

### 3.4 Embeddings and Softmax

Similarly to other sequence transduction models, we use learned embeddings to convert the input tokens and output tokens to vectors of dimension $d_{model}$. We also use the usual learned linear transformation and softmax function to convert the decoder output to predicted next-token probabilities.

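Embeddings map each input word (token) to a vector of dimension $d_{model}$. Softmax normalizes the output of the final linear layer into a probability distribution over the next token.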

### 3.5 Positional Encoding

Since our model contains no recurrence and no convolution, in order for the model to make use of the order of the sequence, we must inject some information about the relative or absolute position of the tokens in the sequence. To this end, we add “positional encodings” to the input embeddings at the bottoms of the encoder and decoder stacks. The positional encodings have the same dimension $d_{model}$ as the embeddings, so that the two can be summed.
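The quoted passage does not give the encoding itself. The original paper uses fixed sinusoids, $PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d_{model}})$ and $PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d_{model}})$. The sketch below follows that definition; the function name, the small sizes, and the constant `embedding` vector are assumptions made for illustration.

```python
import math

def positional_encoding(max_len, d_model):
    """Return a max_len x d_model table of sinusoidal position encodings."""
    table = []
    for pos in range(max_len):
        row = []
        for k in range(d_model):
            # Dimensions 2i and 2i+1 share the frequency 1 / 10000^(2i / d_model).
            angle = pos / (10000 ** ((k - k % 2) / d_model))
            row.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

pe = positional_encoding(max_len=4, d_model=8)
embedding = [0.1] * 8                            # hypothetical token embedding
x0 = [e + p for e, p in zip(embedding, pe[0])]   # embedding plus encoding at position 0
```

Because the encoding has the same dimension as the embedding, the two are combined by element-wise addition, as the passage above describes.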

## 4 Why Self-Attention

| Layer Type | Complexity per Layer | Sequential Operations | Maximum Path Length |
| --- | --- | --- | --- |
| Self-Attention | $O(n^2 \cdot d)$ | $O(1)$ | $O(1)$ |
| Recurrent | $O(n \cdot d^2)$ | $O(n)$ | $O(n)$ |
| Convolutional | $O(k \cdot n \cdot d^2)$ | $O(1)$ | $O(\log_k{n})$ |
| Self-Attention (restricted) | $O(r \cdot n \cdot d)$ | $O(1)$ | $O(n/r)$ |
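To get a feel for the per-layer complexities, the leading terms can be evaluated for concrete sizes. This is only a rough illustration: constants are dropped, and the values $n = 50$, $d = 512$, and kernel width $k = 3$ are assumptions chosen for the example rather than figures from the note.

```python
def per_layer_cost(n, d, k=3, r=None):
    """Leading-order operation counts from the complexity table (constants dropped)."""
    costs = {
        "self_attention": n * n * d,
        "recurrent": n * d * d,
        "convolutional": k * n * d * d,
    }
    if r is not None:
        # Restricted self-attention only looks at a neighborhood of size r.
        costs["self_attention_restricted"] = r * n * d
    return costs

# Sentence-length sequence with the base model dimension.
costs = per_layer_cost(n=50, d=512)
```

With these sizes self-attention needs about 1.3 million operations per layer versus about 13.1 million for a recurrent layer, which reflects the general rule that self-attention is cheaper whenever the sequence length $n$ is smaller than the representation dimension $d$.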

## 6 Results

### 6.2 Model Variations

