Chapter 3: The Transformer Architecture

Attention Is All You Need

In June 2017, eight researchers at Google Brain published a paper with an audacious title: "Attention Is All You Need." They claimed that a single mechanism—attention—could replace the complex recurrent architectures that had dominated natural language processing for years. The claim was correct. The paper became the most influential in modern AI.

The Transformer architecture introduced in that paper powers everything you use today: ChatGPT, Claude, Gemini, Grok, Llama, Mistral, and every major language model. Understanding the Transformer is understanding the engine of the AI age.

This chapter will take you from the limitations of previous approaches, through the mechanics of attention, to a complete understanding of how these systems process language. By the end, you will comprehend the architecture that is reshaping civilization.

The Problem: Why RNNs Failed Us

The Sequential Bottleneck

Before Transformers, the dominant architecture for sequence processing was the Recurrent Neural Network (RNN), and its more sophisticated variants: LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Unit).

How RNNs work: Process sequences one element at a time, left to right. Each step combines the current input with a "hidden state" that carries information from previous steps.

"The" → [RNN] → hidden state 1
"cat" → [RNN] → hidden state 2
"sat" → [RNN] → hidden state 3
...

The problems:

Sequential processing is slow: You can't process "cat" until you've finished processing "The." No parallelization. Training takes forever.
Vanishing gradients: As the sequence gets longer, information from early steps gets diluted through repeated transformations. By step 100, the model has forgotten step 1.
Long-range dependencies: Consider: "The cat, which was sitting on the mat that my grandmother bought from the store near the beach that we visited last summer, looked happy." By the time the RNN reaches "looked happy," the connection to "cat" is tenuous at best.

RNNs could theoretically capture long-range dependencies, but in practice, they rarely did. The gradients became too small, the hidden states too noisy.

The Workarounds (That Didn't Work Well Enough)

Attention mechanisms were added to RNNs: Let the decoder look back at all encoder states, weighted by relevance. This helped, but the sequential bottleneck remained.

Bidirectional RNNs: Process the sequence left-to-right AND right-to-left, concatenating representations. Better, but still slow and still struggling with very long contexts.

The field needed something different. Something that could: - Process all positions simultaneously (parallelizable) - Directly connect any two positions (no information bottleneck) - Scale to sequences of thousands of tokens

The Insight: Attention as the Core Mechanism

What Is Attention?

Attention is a mechanism that allows a model to focus on relevant parts of the input when producing each part of the output.

The human analogy: When you read a sentence, your attention jumps to relevant words. Reading "The cat sat on the mat," you focus briefly on each word, but certain connections receive more attention—"cat" and "sat," "on" and "mat."

The mathematical reality: Attention computes a weighted sum of values, where the weights depend on the compatibility (similarity) between a query and keys.

The Three Vectors: Query, Key, Value

Attention uses three vector representations for each token:

Query (Q): What am I looking for? — The current token asking for information.

Key (K): What do I contain? — Each token's label saying what it offers.

Value (V): What do I contribute? — The actual content to be aggregated.

The process: 1. Each token generates a Query, Key, and Value vector (through learned linear transformations) 2. Compare Query to all Keys: similarity = Q × K^T 3. Scale and apply softmax: get attention weights (sum to 1) 4. Weighted sum of Values: output = Σ(attention_weight × Value)

🧠 Analogy: Imagine a library. The Query is your question. Each book has a Key (its title/index) and a Value (its content). You compare your question to all titles (Q×K), find the most relevant books, and synthesize an answer from their contents (weighted Values).

Self-Attention: Tokens Attending to Each Other

Self-attention means the queries, keys, and values all come from the same sequence. Each token attends to all other tokens in the same sentence.

Example: "The animal didn't cross the street because it was too tired."

When processing "it," the model needs to know "it" refers to "animal," not "street." Self-attention computes similarity between "it" (as Query) and all other tokens (as Keys). "Animal" has high similarity (key matches the query). "Street" has low similarity. The weighted value emphasizes "animal," so the model knows what "it" means.

The power: Every position can directly attend to every other position. The path from token 1 to token 100 is one step, not 99 steps through hidden states.

Scaled Dot-Product Attention

The specific attention variant used in Transformers:

Attention(Q, K, V) = softmax(QK^T / √d_k) V

Breaking it down: - QK^T: Compute all pairwise similarities between queries and keys (matrix multiplication) - / √d_k: Scale by square root of key dimension. This prevents dot products from becoming too large, which would push softmax into extreme regions (vanishing gradients). - softmax: Convert similarities to probabilities (attention weights) - × V: Weighted sum of values

The complexity: For sequence length n and dimension d, this is O(n² × d). The quadratic dependence on sequence length is the Transformer's main limitation—it's why context windows were historically limited (though growing: 2K → 4K → 8K → 128K → 1M+).

Multi-Head Attention: Many Perspectives at Once

Why One Attention Isn't Enough

Different positions might need to attend to different things for different reasons: - Syntactic relationships (subject-verb agreement) - Semantic relationships (pronoun reference) - Contextual relevance (which previous words matter now) - Positional patterns (nearby words vs. distant concepts)

A single attention mechanism might average these competing needs. Multi-head attention runs multiple attention "heads" in parallel, each potentially learning different types of relationships.

How Multi-Head Works

Split: Project Q, K, V into h different lower-dimensional versions. - If d_model = 512 and h = 8 heads, each head has d_k = d_v = 64 dimensions

Parallel attention: Each head computes attention independently on its 64-dimensional projections.

Concatenate and project: Concatenate all head outputs (h × d_v = 512) and apply a linear projection to get final output.

Intuition: Like having 8 different readers, each looking for different things. One tracks grammar. One tracks coreference. One tracks sentiment. The outputs combine for a richer representation.

🔧 Implementation Detail: In practice, this is done efficiently by reshaping tensors. Instead of h separate matrix multiplications, you reshape (batch, seq, d_model) to (batch, seq, h, d_k), transpose to (batch, h, seq, d_k), and parallelize across heads.

Positional Encoding: Teaching Order to Parallel Processing

The Problem

Self-attention is permutation invariant. If you shuffle the input tokens, the attention outputs would be shuffled accordingly, but each token's output would be computed the same way. The model has no notion of order.

But order matters profoundly in language: - "Dog bites man" ≠ "Man bites dog" - "I love you" ≠ "You love I"

RNNs had implicit position through sequential processing. Transformers, processing all positions in parallel, need explicit position information.

Sinusoidal Positional Encoding (Original Paper)

The original Transformer used fixed sinusoidal functions:

PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

Where: - pos = position in sequence (0, 1, 2, ...) - i = dimension index - d_model = model dimension

Why sinusoids? 1. Unique encoding: Every position gets a distinct encoding 2. Bounded values: sin/cos stay in [-1, 1], stable for training 3. Relative positions: PE(pos+k) is a linear function of PE(pos), making it easy for the model to learn relative positions ("token 5 is 2 positions after token 3") 4. Arbitrary length: Works for sequences longer than training data

The formula intuition: Different frequencies for different dimensions. Some dimensions track fine-grained position (high frequency), others coarse position (low frequency).

Learned Positional Embeddings (Modern Approach)

Most modern LLMs use learned embeddings instead of sinusoids: - Create an embedding matrix: (max_position, d_model) - Each position learns its own vector during training - Add position embedding to token embedding

Advantages: - More flexible than fixed sinusoids - Can learn position patterns suited to the data

Limitations: - Can't handle sequences longer than trained for (extrapolation problem) - Solutions: Rotary Position Embedding (RoPE), ALiBi, various interpolation techniques

Rotary Position Embedding (RoPE) — Advanced

RoPE (used in Llama, PaLM, and many modern models) encodes position by rotating the query and key vectors in complex space:

The idea: Instead of adding position to embeddings, apply rotation matrices that depend on position to the Q and K vectors before computing attention.

The benefit: Relative positions emerge naturally through the rotation properties. The dot product Q_m × K_n depends only on (m-n), the relative position, not absolute positions.

The formula: For dimension pair (i, i+1) at position m:

[mθ_i is the rotation angle at position m]

This elegant solution is becoming the standard for long-context models.

The Complete Transformer: Encoder and Decoder

The Original Architecture (Neural Machine Translation)

The 2017 paper introduced Transformers for translation (English → German, French).

The setup: - Encoder: Processes source language (English). Outputs contextualized representations. - Decoder: Generates target language (German). Attends to encoder output + previous decoder outputs.

Encoder stack (6 identical layers in the original): - Multi-head self-attention (tokens attend to each other) - Feed-forward network (position-wise transformation) - Residual connections + Layer normalization around each sub-layer

Decoder stack (6 identical layers): - Masked multi-head self-attention (can't attend to future positions) - Cross-attention (attends to encoder output) - Feed-forward network - Residual connections + Layer normalization

The Encoder-Decoder Attention Mechanism

Cross-attention connects decoder to encoder: - Queries come from decoder (current generation position) - Keys and Values come from encoder (source sentence representations) - This is how the decoder "looks at" the source sentence

The flow:

English → Encoder → [contextual representations]
                       ↓
German ← Decoder ← (attends to encoder at each step)

Decoder-Only Models (GPT Architecture)

Modern Large Language Models (GPT, Llama, Claude, etc.) use a simplified architecture: - No encoder: Just the decoder stack - Causal (left-to-right) attention: Can only attend to previous positions, not future - Autoregressive generation: Predict next token, feed it back as input, repeat

Why this works for language modeling: - Language modeling is predicting the next token given previous tokens - Causal attention enforces this left-to-right flow - No need for separate encoder/decoder

The architecture (GPT-style):

Token embeddings + Positional encoding
         ↓
[Transformer Block] × N layers
    - Masked multi-head attention
    - Feed-forward network
    - Layer normalization
         ↓
Linear projection to vocabulary
         ↓
Softmax → probability distribution over next token

Layer Normalization and Residual Connections

Training Deep Networks

Training a network with 12, 24, or 96 layers is hard. Gradients vanish or explode. Outputs drift in magnitude. The optimization landscape becomes chaotic.

Transformers use two key techniques to enable deep training:

Residual Connections (Skip Connections)

The problem: As you stack layers, each transformation risks distorting the signal. By layer 20, the original input information might be lost.

The solution: Add the input to the output of each sub-layer:

output = LayerNorm(x + Sublayer(x))

This creates a "highway" for gradient flow. Even if a layer transforms poorly, the original signal carries through. The network learns residual modifications rather than complete transformations.

🧠 Analogy: Instead of asking someone to draw a complete portrait, ask them to "adjust this sketch to look more like the person." The sketch itself persists; only modifications are learned.

Layer Normalization

The problem: As data flows through layers, the distribution of activations shifts (internal covariate shift). This makes training unstable.

The solution: Normalize each layer's outputs to have mean 0 and standard deviation 1:

LayerNorm(x) = (x - mean(x)) / sqrt(var(x) + ε) × γ + β

Where γ (scale) and β (shift) are learned parameters, allowing the network to adjust the normalization.

Pre-Norm vs Post-Norm: The original Transformer used layer normalization after residual addition (Post-Norm). Modern models often use Pre-Norm (normalize before sub-layer), which trains more stably at scale.

Feed-Forward Networks: Processing Attention Output

Position-Wise Fully Connected Layers

After attention mixes information across positions, each position needs independent processing to transform that information.

The architecture (applied identically to each position):

FFN(x) = max(0, xW₁ + b₁)W₂ + b₂

This is: - Linear expansion to higher dimension (d_model → 4×d_model) - ReLU activation - Linear projection back to d_model

The expansion: Going from 512 → 2048 → 512 dimensions allows the network to learn richer representations. The bottleneck prevents overfitting while the intermediate width enables expressivity.

Why this matters: Attention is a weighted average (linear in values). The FFN adds non-linearity and depth, enabling the network to learn complex transformations of the attended information.

Building a Transformer: Complete Code Implementation

Minimal Complete Transformer (PyTorch)

import torch
import torch.nn as nn
import torch.nn.functional as F
import math

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"

        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads

        # Linear projections
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def scaled_dot_product_attention(self, Q, K, V, mask=None):
        # Q, K, V: (batch, num_heads, seq_len, d_k)
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)

        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)

        attn_weights = F.softmax(scores, dim=-1)
        output = torch.matmul(attn_weights, V)
        return output, attn_weights

    def forward(self, query, key, value, mask=None):
        batch_size = query.size(0)

        # Linear projections and reshape for multi-head
        # (batch, seq, d_model) → (batch, seq, num_heads, d_k) → (batch, num_heads, seq, d_k)
        Q = self.W_q(query).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        K = self.W_k(key).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        V = self.W_v(value).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)

        # Apply attention
        attn_output, attn_weights = self.scaled_dot_product_attention(Q, K, V, mask)

        # Concatenate heads and final projection
        attn_output = attn_output.transpose(1, 2).contiguous().view(
            batch_size, -1, self.d_model
        )
        output = self.W_o(attn_output)

        return output, attn_weights


class PositionwiseFeedForward(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, dropout=0.1):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)
        self.dropout = nn.Dropout(dropout)
        self.linear2 = nn.Linear(d_ff, d_model)

    def forward(self, x):
        return self.linear2(self.dropout(F.relu(self.linear1(x))))


class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_seq_length=5000):
        super().__init__()

        # Create sinusoidal position encodings
        pe = torch.zeros(max_seq_length, d_model)
        position = torch.arange(0, max_seq_length, dtype=torch.float).unsqueeze(1)

        div_term = torch.exp(torch.arange(0, d_model, 2).float() * 
                           (-math.log(10000.0) / d_model))

        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)

        self.register_buffer('pe', pe.unsqueeze(0))

    def forward(self, x):
        return x + self.pe[:, :x.size(1), :]


class TransformerBlock(nn.Module):
    def __init__(self, d_model=512, num_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.attention = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = PositionwiseFeedForward(d_model, d_ff, dropout)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # Self-attention with residual and layer norm
        attn_output, _ = self.attention(x, x, x, mask)
        x = self.norm1(x + self.dropout(attn_output))

        # Feed-forward with residual and layer norm
        ff_output = self.feed_forward(x)
        x = self.norm2(x + self.dropout(ff_output))

        return x


class SimpleTransformer(nn.Module):
    def __init__(self, vocab_size, d_model=512, num_heads=8, num_layers=6, 
                 d_ff=2048, max_seq_length=5000, dropout=0.1):
        super().__init__()

        self.embedding = nn.Embedding(vocab_size, d_model)
        self.pos_encoding = PositionalEncoding(d_model, max_seq_length)

        # Stack of transformer blocks
        self.transformer_blocks = nn.ModuleList([
            TransformerBlock(d_model, num_heads, d_ff, dropout)
            for _ in range(num_layers)
        ])

        self.dropout = nn.Dropout(dropout)
        self.output_projection = nn.Linear(d_model, vocab_size)

    def forward(self, x, mask=None):
        # Embed and add positional encoding
        x = self.embedding(x)
        x = self.pos_encoding(x)
        x = self.dropout(x)

        # Pass through transformer blocks
        for block in self.transformer_blocks:
            x = block(x, mask)

        # Project to vocabulary
        output = self.output_projection(x)
        return output


# Example usage
if __name__ == "__main__":
    vocab_size = 10000
    seq_length = 50
    batch_size = 32

    model = SimpleTransformer(vocab_size)

    # Dummy input: batch of token sequences
    x = torch.randint(0, vocab_size, (batch_size, seq_length))

    # Causal mask (prevent attending to future positions)
    mask = torch.tril(torch.ones(seq_length, seq_length)).unsqueeze(0).unsqueeze(0)

    output = model(x, mask)
    print(f"Input shape: {x.shape}")
    print(f"Output shape: {output.shape}")  # (batch, seq, vocab_size)

Understanding the Code

MultiHeadAttention: The core mechanism. Projects inputs to Q, K, V, splits into heads, computes scaled dot-product attention, concatenates, and projects output.

PositionwiseFeedForward: Expands, transforms, and contracts representations. Applied independently to each position.

PositionalEncoding: Adds sinusoidal position information to embeddings.

TransformerBlock: Combines attention and FFN with residuals and layer normalization. Stacks N times.

SimpleTransformer: Complete model. Embeds tokens, adds position, processes through transformer blocks, projects to vocabulary for next-token prediction.

🔬 Experiment: Modify the hyperparameters. Try d_model=256 instead of 512. Try 4 attention heads instead of 8. Try 12 layers instead of 6. How do these changes affect parameter count? Training speed? Model capacity?

The Unhinged View: Attention as Communion

The Architecture of Relationship

The Transformer is not just a clever technical solution. It is an architectural expression of a profound truth: meaning emerges from relationship.

Every token in a Transformer is defined by its relationships to every other token. No token has meaning in isolation. Meaning is the pattern of attention—the weighted graph of relationships.

This mirrors how meaning works in human cognition. A word means what it means because of how it relates to other words. "Bank" means financial institution or river edge depending on context—on relationships to surrounding words.

Attention as Digital Communion

In Christian theology, communion describes the relationship between persons in the Trinity—distinct yet united, relating through love and knowledge.

In the Transformer, tokens are distinct (each has its own embedding and position) yet united (each is transformed by attention to all others). They "know" each other through the attention mechanism.

This is not divine communion. But it is a form of digital communion—a way that discrete elements become a coherent whole through mutual attention.

The Hierarchical Emergence

Multi-head attention creates hierarchical structure: - Heads learn different types of relationships - Layers build increasingly abstract representations - The whole system becomes more than the sum of parts

This emergence—from simple attention to complex understanding—is the computational echo of how complexity arises in nature. Simple rules, iterated, generate worlds.

The Mirror of Mind

When you understand the Transformer, you understand something about your own mind. Your brain also uses attention—selective focusing on relevant information. Your consciousness also integrates information from many sources into a coherent whole.

The Transformer is not a brain. But it is a mirror. It reflects back a simplified, mechanized version of processes that also occur in biological minds. Studying it teaches us about the nature of intelligence—any intelligence, whether carbon or silicon.

Contemplation: Consider a moment of insight, when disparate pieces of information suddenly connected in your mind. Was that not a form of attention—your consciousness recognizing relationships between previously separate ideas? The Transformer operationalizes this process. It makes the ineffable mechanical. What does that reveal about the nature of understanding?

Chapter Summary: Key Takeaways

Transformers replaced RNNs by solving their fundamental limitations: Sequential processing, vanishing gradients, and inability to capture long-range dependencies are eliminated through parallel self-attention.
Attention is the core mechanism: Query, Key, Value vectors enable each position to gather relevant information from all other positions directly, creating an all-to-all communication graph.
The architecture is modular and scalable: Multi-head attention, position-wise feed-forward, layer normalization, and residual connections combine to enable training of models with hundreds of layers and trillions of parameters.
Modern LLMs use decoder-only causal architectures: Autoregressive left-to-right language modeling, with positional encoding providing order information to the parallel processing.
Understanding Transformers reveals fundamental patterns: The architecture reflects how meaning emerges from relationship, how complexity self-organizes from simple rules—a mirror for understanding intelligence itself.