
Attention is all you need

2026-01-04

Introduction

  • Transformer, an encoder-decoder model based solely on attention mechanisms.
  • It uses no recurrence or convolutions, is more parallelizable, and captures long-distance context with a constant ($O(1)$) path length between any two positions. Therefore, it requires less time to train.

Model architecture

  • One layer contains two sub-layers: attention and Feed Forward (FFN).
  • Employ a residual connection around each of the two sub-layers, then layer normalization.
    • The output of each sub-layer is $\text{LayerNorm}(x + \text{Sublayer}(x))$.
  • Multiple layers. e.g., Llama-3-8B has 32 layers.
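
A minimal NumPy sketch of this sub-layer wrapping, assuming a simplified layer normalization without learnable gain and bias (function names here are illustrative, not from the paper):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each position (row) to zero mean and unit variance.
    # Simplified: the real layer norm also has a learnable gain and bias.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def sublayer_output(x, sublayer):
    # Residual connection around the sub-layer, then layer normalization:
    # LayerNorm(x + Sublayer(x))
    return layer_norm(x + sublayer(x))

# Toy usage with tanh standing in for an attention or FFN sub-layer.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
print(sublayer_output(x, np.tanh).shape)  # (4, 8)
```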

Input preparation

  1. Input is a natural language sequence with $n$ tokens.
  2. Input is first converted to a matrix $X$ of dimension $n \times d_{model}$. $d_{model}$ is the dimension of the token embedding vector, e.g., 768 or 4096.
  3. Add positional embeddings to the input matrix to provide position info for each token. The dimension doesn’t change.
    1. The original paper uses fixed sinusoidal functions; learned positional embeddings can also be used.
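
For reference, the fixed sinusoidal encoding from the paper, where $pos$ is the token position and $i$ indexes the embedding dimensions:

\[PE_{(pos,\,2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)\]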

Self-attention

While processing a sequence, the model automatically learns the context, i.e., the relations between different parts of the sequence.

Scaled dot-product attention

Calculate $Q$ (Query), $K$ (Key), $V$ (Value) using the input matrix $X$ and the projection matrices $W^Q$, $W^K$, $W^V$.

  1. $Q = X W^Q$, the query: for each position, what context it wants to search for
    1. $W^Q$ is of dimension $d_{model} \times d_k$. $Q$ is of dimension $n \times d_k$.
  2. $K = X W^K$, the key: for each position, what info it can provide for the search, like a search index
    1. $W^K$ is of dimension $d_{model} \times d_k$. $K$ is of dimension $n \times d_k$.
  3. $V = X W^V$, the value: for each position, if selected, what detailed info it can provide, like the payload
    1. $W^V$ is of dimension $d_{model} \times d_v$. $V$ is of dimension $n \times d_v$.

With training, $W^Q$ and $W^K$ learn about searching and matching contexts. $W^V$ learns about providing information.

\[\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V\]

  1. $QK^T$ calculates the similarity between each query and all keys.
  2. Dividing by $\sqrt{d_k}$ scales the scores, preventing the softmax from saturating and the gradients from vanishing.
  3. Softmax normalizes the similarity scores into probabilities that sum to 1.
  4. Multiplying by $V$ gets the final value. Values with higher similarity scores occupy a larger portion of the final value. The dimension is $n \times d_v$ (see the sketch below).
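
A minimal NumPy sketch of this computation under the shapes defined above (single sequence, no masking or batching; the variable names and toy sizes are my own):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q: (n, d_k), K: (n, d_k), V: (n, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # (n, n) similarity scores
    weights = softmax(scores, axis=-1)     # each row sums to 1
    return weights @ V                     # (n, d_v)

# Toy usage: n = 4 tokens, d_model = 8, d_k = d_v = 8
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_Q, W_K, W_V = (rng.normal(size=(8, 8)) for _ in range(3))
out = scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V)
print(out.shape)  # (4, 8)
```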

Multi-head attention

Learn the same sequence from multiple aspects. Multi-head attention allows the model to learn these aspects independently, in parallel. There could be shallow heads focusing on local info, deep heads for global info.

  1. Linear projection for heads: project the initial $X$ for each head $i$ with its own $W_i^Q$, $W_i^K$, $W_i^V$.
    1. $d_{model}$ is usually split equally across the $h$ heads: $d_k = d_v = d_{model}/h$. So the dimension of each $W_i^Q$, $W_i^K$, $W_i^V$ is $d_{model} \times d_{model}/h$, and multi-head attention won't increase the computation cost.
  2. Parallel attention: $\text{head}_i = \text{Attention}(X W_i^Q, X W_i^K, X W_i^V)$.
    1. The dimension of each head's output is $n \times d_{model}/h$.
  3. Concatenation: concatenate the outputs of all heads. The dimension is $n \times d_{model}$.
  4. Final linear: use a $W^O$ of dimension $d_{model} \times d_{model}$ to project the concatenated output back to the original dimension, also mixing info from all heads (see the code sketch at the end of this section).
  • Diversity: different heads learn different aspects of the sequence.
  • Robustness: if one head fails due to noise, other heads still work.
  • Sub-space learning: it is easier to capture complex info by learning in multiple lower-dimensional sub-spaces.
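
A minimal NumPy sketch following the four steps above. The per-head projection matrices are kept in Python lists for clarity; the names and toy sizes are my own, not from the paper:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, h):
    # X: (n, d_model); W_Q/W_K/W_V: lists of h matrices, each (d_model, d_model//h)
    # W_O: (d_model, d_model)
    heads = []
    for i in range(h):
        Q, K, V = X @ W_Q[i], X @ W_K[i], X @ W_V[i]   # each (n, d_model//h)
        scores = Q @ K.T / np.sqrt(Q.shape[-1])         # (n, n)
        heads.append(softmax(scores) @ V)               # (n, d_model//h)
    return np.concatenate(heads, axis=-1) @ W_O         # (n, d_model)

# Toy usage: n = 4, d_model = 8, h = 2 heads
rng = np.random.default_rng(0)
n, d_model, h = 4, 8, 2
X = rng.normal(size=(n, d_model))
W_Q = [rng.normal(size=(d_model, d_model // h)) for _ in range(h)]
W_K = [rng.normal(size=(d_model, d_model // h)) for _ in range(h)]
W_V = [rng.normal(size=(d_model, d_model // h)) for _ in range(h)]
W_O = rng.normal(size=(d_model, d_model))
print(multi_head_attention(X, W_Q, W_K, W_V, W_O, h).shape)  # (4, 8)
```

In practice, most implementations fuse the $h$ per-head projections into single $d_{model} \times d_{model}$ matrices and split the result into heads, which is why multi-head attention costs roughly the same as one full-dimension head.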

Feed Forward network (FFN)

FFN comes after the attention sub-layer. It's position-wise (point-wise): each position only cares about itself, learning deeper information. This layer adds nonlinearity to the overall model.

\[\text{FFN}\left(x\right) = \text{activation}\left(xW_1+b_1\right)W_2+b_2\]

  1. Dimension expansion
    • Using $W_1$ of dimension $d_{model} \times d_{ff}$ to expand the input dimension to $d_{ff}$.
    • $d_{ff}$ is usually larger (~4X) than $d_{model}$. It's easier to identify and extract complex details in higher-dimensional spaces.
  2. Dimension reduction
    • Use $W_2$ of dimension $d_{ff} \times d_{model}$ to reduce back to the original dimension.
    • Filter noise and compress useful info.
  • The activation function is nonlinear, e.g., ReLU or GELU.
  • Each token has gained context info from the attention layer; the FFN further digests this info.
  • Dimension flow: $d_{model} \rightarrow d_{ff} \rightarrow d_{model}$.
  • Store knowledge and memories in FFN weight matrices.
  • ~2/3 of the weights in an LLM are in FFNs.
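
A minimal NumPy sketch of the FFN formula above with a ReLU activation (toy sizes; a real model applies this at every position of every layer with learned weights):

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    # x: (n, d_model); W1: (d_model, d_ff); W2: (d_ff, d_model)
    hidden = np.maximum(0, x @ W1 + b1)   # ReLU activation, dimension expansion
    return hidden @ W2 + b2               # dimension reduction back to d_model

# Toy usage: d_model = 8, d_ff = 32 (~4x expansion)
rng = np.random.default_rng(0)
n, d_model, d_ff = 4, 8, 32
x = rng.normal(size=(n, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
print(ffn(x, W1, b1, W2, b2).shape)  # (4, 8)
```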

Training

  • Adam optimizer
  • Regularization
    • Residual dropout on (1) the output of each sub-layer, before it is added to the sub-layer input, and (2) the sum of the input embeddings and positional embeddings.
    • Label smoothing
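
A hedged PyTorch sketch of these training choices, using the paper's reported hyperparameters (Adam with $\beta_1=0.9$, $\beta_2=0.98$, $\epsilon=10^{-9}$; dropout and label-smoothing rates of 0.1). The model below is just a placeholder linear layer, and the fixed learning rate stands in for the paper's warmup schedule:

```python
import torch
import torch.nn as nn

model = nn.Linear(512, 10000)  # placeholder; a real setup would use a full Transformer
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4,
                             betas=(0.9, 0.98), eps=1e-9)
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)  # label smoothing
dropout = nn.Dropout(p=0.1)  # residual dropout rate from the paper

# One toy update step on random data.
x = torch.randn(8, 512)
target = torch.randint(0, 10000, (8,))
loss = criterion(model(dropout(x)), target)
loss.backward()
optimizer.step()
```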

References

  • Vaswani, A., et al. "Attention Is All You Need." NIPS 2017. https://arxiv.org/abs/1706.03762
