Introduction
- The Transformer is an encoder-decoder model based solely on attention mechanisms.
- It uses no recurrence or convolutions, is more parallelizable, and connects any two positions in a constant ($O(1)$) number of operations, capturing long-distance context. Therefore, it requires less time to train.

- One layer contains two sub-layers: self-attention and a feed-forward network (FFN).
- Employ a residual connection around each of the two sub-layers, followed by layer normalization.
- The output of each sub-layer is $\text{LayerNorm}(x + \text{Sublayer}(x))$.
- Multiple layers are stacked, e.g., Llama-3-8B has 32 layers.
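
A minimal NumPy sketch of this sub-layer pattern (post-layer-norm, as in the original Transformer). `self_attention` and `ffn` are stand-in callables, and LayerNorm's learnable scale/shift parameters are omitted for brevity.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each position's vector to zero mean and unit variance
    # (learnable gain/bias omitted for brevity).
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def encoder_layer(x, self_attention, ffn):
    # Output of each sub-layer is LayerNorm(x + Sublayer(x)).
    x = layer_norm(x + self_attention(x))  # attention sub-layer + residual
    x = layer_norm(x + ffn(x))             # FFN sub-layer + residual
    return x
```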
Input preparation
- Input is a natural language sequence with $n$ tokens.
- Input is first converted to a matrix $X$ of dimension $n \times d_{\text{model}}$. $d_{\text{model}}$ is the dimension of the token embedding vector, e.g., 768 or 4096.
- Add positional embeddings to the input matrix to provide position info for each token. The dimension doesn’t change.
- The original Transformer uses fixed sinusoid functions; learned positional embeddings can also be used. A sketch of the sinusoidal version follows.
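
A small sketch of the fixed sinusoidal encoding, assuming an even `d_model`; the result is simply added to the $n \times d_{\text{model}}$ embedding matrix, so the shape is unchanged.

```python
import numpy as np

def sinusoidal_positional_encoding(n, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    pos = np.arange(n)[:, None]                    # (n, 1)
    i = np.arange(0, d_model, 2)[None, :]          # (1, d_model / 2)
    angles = pos / np.power(10000.0, i / d_model)  # (n, d_model / 2)
    pe = np.zeros((n, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# x: (n, d_model) token embeddings
# x = x + sinusoidal_positional_encoding(n, d_model)
```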
Self-attention
While processing a sequence, the model automatically learns the context and the relations between its different parts.

Calculate $Q$ (query), $K$ (key), $V$ (value) using the input matrix $X$ and projection matrices $W^Q$, $W^K$, $W^V$.
- $Q = X W^Q$, the query: for each position, what context it wants to search for.
- $W^Q$ is of dimension $d_{\text{model}} \times d_k$. $Q$ is of dimension $n \times d_k$.
- $K = X W^K$, the key: for each position, what info it can provide for the search, like a search index.
- $W^K$ is of dimension $d_{\text{model}} \times d_k$. $K$ is of dimension $n \times d_k$.
- $V = X W^V$, the value: for each position, if selected, what detailed info it can provide, like the payload.
- $W^V$ is of dimension $d_{\text{model}} \times d_v$. $V$ is of dimension $n \times d_v$.
With training, $W^Q$ and $W^K$ learn about searching and matching contexts; $W^V$ learns about providing information.
\[\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V\]
- $QK^T$ calculates the similarity between each query and all keys.
- $\sqrt{d_k}$ scales the scores to keep the softmax from saturating, preventing vanishing gradients.
- Softmax normalizes the similarity scores into probabilities that sum to 1.
- Multiplying by $V$ gets the final value: values with higher similarities occupy a larger portion of the result. The output dimension is $n \times d_v$ (see the sketch below).
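
A NumPy sketch of the projections and the attention formula above; `W_q`, `W_k`, `W_v` stand for $W^Q$, $W^K$, $W^V$, and masking is omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(X, W_q, W_k, W_v):
    # X: (n, d_model); W_q, W_k: (d_model, d_k); W_v: (d_model, d_v)
    Q, K, V = X @ W_q, X @ W_k, X @ W_v  # (n, d_k), (n, d_k), (n, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n, n) scaled similarities
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # (n, d_v) weighted sum of values
```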
Multi-head attention
Learn the same sequence from multiple aspects. Multi-head attention allows the model to learn these aspects independently and in parallel; there could be shallow heads focusing on local info and deep heads focusing on global info.
- Linear projection for heads: project the initial $Q$, $K$, $V$ for each head $i$ with $W_i^Q$, $W_i^K$, $W_i^V$.
- The dimension is usually split equally across the $h$ heads: $d_k = d_v = d_{\text{model}} / h$. So the dimension of each $W_i^Q$, $W_i^K$, $W_i^V$ is $d_{\text{model}} \times d_k$ (or $d_{\text{model}} \times d_v$), and multi-head attention won't increase the computation cost.
- Parallel attention: $\text{head}_i = \text{Attention}(Q W_i^Q, K W_i^K, V W_i^V)$.
- The dimension of each head's output is $n \times d_v$.
- Concatenation: concatenate the outputs of all heads. The dimension is $n \times h d_v = n \times d_{\text{model}}$.
- Final linear: use a $W^O$ of dimension $h d_v \times d_{\text{model}}$ to project the output back to the original dimension and to mix info from all heads.
- Diversity: different heads learn different aspects of the sequence.
- Robustness: if one head fails due to noise, other heads still work.
- Sub-space learning: Easier to capture complex info by learning in multiple lower-dimensional spaces.
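
A minimal sketch of the per-head projection, parallel attention, concatenation, and final linear steps described above; it reuses `scaled_dot_product_attention` from the previous sketch and assumes $d_k = d_v = d_{\text{model}} / h$.

```python
import numpy as np

def multi_head_attention(X, heads_W_q, heads_W_k, heads_W_v, W_o):
    # heads_W_q/k/v: lists of h per-head matrices, each (d_model, d_model // h);
    # W_o: (d_model, d_model) final projection that mixes info from all heads.
    head_outputs = [
        scaled_dot_product_attention(X, W_q, W_k, W_v)  # each (n, d_model // h)
        for W_q, W_k, W_v in zip(heads_W_q, heads_W_k, heads_W_v)
    ]
    concat = np.concatenate(head_outputs, axis=-1)      # (n, d_model)
    return concat @ W_o                                 # (n, d_model)
```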
Feed Forward network (FFN)
The FFN comes after the attention sub-layer. It is position-wise (point-wise): each position only looks at itself, learning deeper information. This layer adds nonlinearity to the overall model.
\[\text{FFN}\left(x\right) = \text{activation}\left(xW_1+b_1\right)W_2+b_2\]
- Dimension expansion
- Use $W_1$ of dimension $d_{\text{model}} \times d_{ff}$ to expand the input dimension to $d_{ff}$.
- $d_{ff}$ is usually larger (~4×) than $d_{\text{model}}$. It's easier to identify and extract complex details in higher-dimensional spaces.
- Dimension reduction
- Use $W_2$ of dimension $d_{ff} \times d_{\text{model}}$ to reduce the output back to the original dimension.
- Filter noise and compress useful info.
- The activation function is nonlinear, e.g., ReLU.
- Tokens have gained context info from the attention layer; the FFN further digests this info.
- Dimension expansion: stores knowledge and memories in the FFN weight matrices.
- ~2/3 of the weights in an LLM are in the FFN.
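
A direct sketch of the FFN formula above with ReLU as the activation; $d_{ff}$ would typically be about $4 \times d_{\text{model}}$.

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    # W1: (d_model, d_ff) expands; W2: (d_ff, d_model) reduces back.
    hidden = np.maximum(0.0, x @ W1 + b1)  # ReLU adds nonlinearity
    return hidden @ W2 + b2
```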
Training
- Adam optimizer
- Regularization
- Residual dropout on (1) the output of each sub-layer and (2) the sum of the input embeddings and positional embeddings.
- Label smoothing
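
A sketch of one common label-smoothing formulation (the original Transformer uses $\epsilon_{ls} = 0.1$); the exact distribution over non-target tokens varies by implementation.

```python
import numpy as np

def smooth_labels(targets, vocab_size, eps=0.1):
    # Replace one-hot targets with (1 - eps) on the true token and
    # spread eps uniformly over the remaining tokens.
    n = len(targets)
    smoothed = np.full((n, vocab_size), eps / (vocab_size - 1))
    smoothed[np.arange(n), targets] = 1.0 - eps
    return smoothed

# targets: integer token ids of shape (n,)
# Each smoothed row sums to 1 and is used as the soft training target.
```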
