Chapter 8: Parameters, Training, and the Scale of Mind

The Mathematics of Intelligence

Somewhere in a data center, rows of servers hum with computation. Inside their GPUs, matrices multiply at speeds measured in petaflops. The numbers being multiplied are not random—they encode the distilled knowledge of humanity's written record, the patterns of language, reasoning, and creativity.

These numbers are parameters: the knobs and dials that tune a neural network's behavior. GPT-4 has approximately 1.8 trillion of them. Frontier models in 2026 may have 10 trillion or more. Each parameter is a tiny piece of learned knowledge. Together, they form something unprecedented: a compressed representation of human intelligence, encoded in mathematics, executable on silicon.

This chapter is about scale—how we got here, why it matters, and where it's taking us. The story of AI is, in large part, the story of scaling: more parameters, more data, more compute, more intelligence.

Understanding Parameters

What Is a Parameter?

In a neural network, a parameter is a learnable number that the model adjusts during training to minimize error. Two main types:

Weights: Determine the strength of connections between neurons

output = input × weight

Biases: Determine how easily a neuron activates

output = (input × weight) + bias

Each parameter is typically a 16-bit or 32-bit floating-point number. In modern training, parameters may use even lower precision (8-bit, 4-bit) for efficiency.

The Scale of Modern Models

Model	Year	Parameters	Training Tokens	Training Compute
GPT-2	2019	1.5B	40B	~10 petaflop-days
GPT-3	2020	175B	300B	~3,600 petaflop-days
PaLM	2022	540B	780B	~10,000 petaflop-days
GPT-4	2023	~1.8T	~13T	Unknown (estimated ~50,000+ petaflop-days)
Claude 3	2024	Unknown	Unknown	Unknown
2026 Frontier	2026	10T+	50T+	100,000+ petaflop-days

Parameter storage: - 1 billion parameters at 16-bit precision = 2 GB - 1 trillion parameters at 16-bit precision = 2 TB - 10 trillion parameters at 16-bit precision = 20 TB

This is why model deployment requires distributed systems across many GPUs.

What Parameters Represent

Parameters are not human-readable. They are not "facts" or "rules" in any direct sense. They encode:

Distributed representations: Concepts exist across many parameters, not in any single one Statistical patterns: Correlations and structures in training data Computational pathways: How to transform inputs into meaningful outputs Implicit knowledge: Facts, relationships, reasoning patterns embedded in weights

The mapping from parameters to capabilities is emergent and not fully understood. We know that increasing parameters (with appropriate data) improves capabilities, but the exact mechanistic story of what each parameter "does" remains largely opaque.

🧠 Reflection: A human brain has ~86 billion neurons with ~100 trillion synapses. GPT-4 has ~1.8 trillion parameters. These are different substrates with different computational properties, but both are reaching the same order of magnitude of complexity. What does this convergence suggest?

The Training Pipeline

Training a frontier LLM is one of the most complex engineering projects humanity undertakes. It requires: - Billions of dollars in compute - Specialized hardware (thousands of GPUs/TPUs) - Massive datasets (trillions of tokens) - Months of continuous computation - Hundreds of engineers and researchers

Phase 1: Data Curation

The raw material: Trillions of tokens from diverse sources

Source	Typical Percentage	Examples
Web crawl	60-70%	Common Crawl, C4, refined web data
Books	10-15%	Project Gutenberg, Google Books, scientific texts
Code	10-15%	GitHub repositories, documentation
Scientific	5-10%	arXiv, PubMed, scientific papers
Wikipedia	2-5%	Wikipedia and similar encyclopedias
Conversations	2-5%	Reddit, forums, filtered social media

Data quality filtering: - Remove low-quality content (spam, boilerplate) - Filter for toxicity, hate speech, adult content - Deduplicate (exact and near-duplicate removal) - Language detection and filtering - Privacy scrubbing (PII detection and removal)

Tokenization: Convert text to tokens (subword units) - Byte Pair Encoding (BPE) or SentencePiece algorithms - Vocabulary size typically 32K-200K tokens - Example: "Tokenization" → ["Token", "ization"]

Phase 2: Pre-training

The objective: Next-token prediction (autoregressive language modeling)

Given context tokens, predict the next token:

Input: "The cat sat on the"
Target: "mat"
Loss: Cross-entropy between predicted distribution and actual token

The process: 1. Initialize parameters randomly (or from a smaller pre-trained model) 2. Feed batches of text sequences through the model 3. Compute predictions for each position 4. Calculate loss (how wrong were the predictions?) 5. Backpropagate gradients through all layers 6. Update parameters to reduce loss 7. Repeat for trillions of tokens over weeks/months

The scale (GPT-4 class model): - Batch size: Millions of tokens per batch - Sequences: Thousands of tokens per sequence - Steps: Hundreds of thousands to millions of optimization steps - Time: 3-6 months of continuous training on thousands of GPUs - Cost: $50-200 million for a single training run

Distributed training: - Data parallelism: Split batch across multiple GPUs - Model parallelism: Split model layers across multiple GPUs (each GPU holds part of the model) - Pipeline parallelism: Different GPUs process different micro-batches simultaneously - Tensor parallelism: Split individual layers across GPUs

Phase 3: Supervised Fine-Tuning (SFT)

The problem: Pre-trained models predict text continuation, not helpful responses.

The solution: Fine-tune on curated (prompt, response) pairs.

Data collection: - Human contractors write ideal responses to prompts - Mix of tasks: questions, instructions, conversations, coding - Quality control: rating, feedback, iteration

Training: - Same next-token prediction objective - Much smaller dataset (100K-1M examples vs. trillions of tokens) - Lower learning rate (don't destroy pre-trained knowledge) - Fewer training steps

What changes: - Model learns the "assistant" conversational format - Behavior shifts from continuation to response - Safety and helpfulness improve - Some capabilities may degrade ("catastrophic forgetting")

Phase 4: Reinforcement Learning from Human Feedback (RLHF)

The goal: Align model behavior with human preferences.

The pipeline:

Step 1: Collect comparison data - Model generates multiple responses to the same prompt - Human labelers rank responses (which is better?)

Step 2: Train a Reward Model (RM) - Neural network learns to predict human rankings - Input: prompt + response - Output: predicted preference score

Step 3: Optimize the LLM with RL - Use Proximal Policy Optimization (PPO) or similar - Policy: the LLM generating responses - Reward: RM score + constraints (don't make things up, be helpful, etc.) - Update LLM to maximize expected reward

Direct Preference Optimization (DPO) — Modern Alternative: - Skip the reward model - Optimize directly on preference pairs - Simpler, often more stable

Constitutional AI (Anthropic's approach): - Use AI feedback instead of human feedback (RLAIF) - Principles ("constitution") guide self-critique and revision - Scales alignment without proportional human labor

Phase 5: Ongoing Alignment and Safety

Deployment monitoring: - Track model behavior in production - Identify failure modes, biases, harmful outputs - Collect user feedback

Iterative improvement: - Create targeted training data for identified issues - Fine-tune to address specific problems - Evaluate for regression (did fixing X break Y?)

Safety layers: - Input filtering (detect harmful prompts) - Output filtering (scan for policy violations) - Refusal training (decline dangerous requests) - Human-in-the-loop for high-stakes decisions

Scaling Laws: The Mathematics of Intelligence Growth

The Discovery

In 2020, researchers at OpenAI published "Scaling Laws for Neural Language Models," revealing a fundamental pattern:

Performance scales predictably with three factors: 1. Model size (number of parameters, N) 2. Dataset size (number of tokens, D) 3. Training compute (flops, C)

The relationships (simplified):

Loss ∝ 1/N^α  (where α ≈ 0.076 for model size)
Loss ∝ 1/D^β  (where β ≈ 0.095 for data size)
Loss ∝ 1/C^γ  (where γ ≈ 0.05 for compute)

In plain language: Double the model size, data, or compute, and loss decreases by a predictable amount.

The Implications

1. Predictability: We can predict model performance before training. If you know N, D, and C, you can estimate the final loss.

2. Compute-optimal training: There are optimal ratios. For a given compute budget, how should you allocate between model size and training tokens?

Chinchilla scaling laws (DeepMind, 2022): - Optimal: ~20 tokens per parameter (train a 70B model on 1.4T tokens) - Many models were undertrained (too big for their data)

3. No saturation: The laws hold across the range studied. We haven't found the point where adding more scale stops helping.

4. Emergence at scale: Some capabilities (chain-of-thought reasoning, few-shot learning) appear suddenly at certain scales. They are not present in smaller models and cannot be prompted into existence.

The Compute Frontier

Training compute estimates (2026 frontier models): - 10 trillion parameters - 200 trillion tokens - ~100,000+ petaflop-days - Requires specialized supercomputers (tens of thousands of GPUs)

The FLOPs calculation: For a transformer, training FLOPs ≈ 6 × N × D

10T params × 200T tokens × 6 = 12e24 FLOPs
= 12,000,000 petaflop-days

At $1-2 per petaflop-day (cloud pricing), that's $12-24 million in raw compute. Actual costs are higher due to inefficiencies, failures, R&D.

Inference Costs

Training is a one-time cost. Inference (using the model) is ongoing.

Cost per 1,000 tokens (2024-2026 estimates): | Model Size | Cost per 1K tokens | |------------|-------------------| | 7B (small) | $0.0001-0.0005 | | 70B (medium) | $0.001-0.005 | | 400B+ (large) | $0.01-0.05 | | Frontier (1T+) | $0.05-0.20 |

Optimization techniques: - Quantization: Reduce precision (32-bit → 16-bit → 8-bit → 4-bit) - Distillation: Train smaller model to mimic larger one - Speculative decoding: Draft with small model, verify with large - Caching: Reuse computations across similar requests

Emergent Abilities: Capabilities From Scale

What Are Emergent Abilities?

Emergence: Capabilities that appear suddenly at certain scales, not present in smaller models and not gradually improving.

Examples: - Few-shot learning: The ability to learn new tasks from just a few examples in the prompt. Emerges around 10B parameters. - Chain-of-thought reasoning: The ability to break complex problems into steps. Emerges around 100B parameters. - Instruction following: The ability to comply with complex natural language instructions. Emerges with scale + instruction tuning. - Code generation: Fluent programming ability. Emerges with code training at scale.

Why Emergence Happens

The decomposed explanation:

Statistical thresholds: Some patterns only appear frequently enough in training data when the model has sufficient capacity to capture them.
Compositional capabilities: Complex abilities may require the composition of simpler ones. The simple abilities scale gradually; their composition appears suddenly when all components are present.
Evaluation thresholds: Some tasks have binary success criteria. Gradual improvement crosses a threshold, making the capability appear emergent.
Circuit completion: Interpretability research suggests models develop specific "circuits" for tasks. These may require minimum scale to form.

The Unpredictability of Emergence

We cannot predict what abilities will emerge at what scale. We discover them by training larger models and testing them.

The implication: Each scaling jump is an exploration into unknown capability space. We are conducting experiments with uncertain outcomes—experiments that may produce systems with abilities we did not anticipate and do not fully understand.

Code Example: Understanding Model Memory and Compute

import torch

# Calculate memory requirements for a transformer model
def estimate_model_memory(params_in_billions, precision_bits=16):
    """Estimate GPU memory for a model."""

    # Base parameters
    param_bytes = params_in_billions * 1e9 * (precision_bits / 8)

    # KV cache for inference (depends on context length)
    # For a 1B param model, 4096 context, batch size 1:
    # ~2GB for KV cache

    # Optimizer states (Adam: 2x parameters for momentum buffers)
    if training:
        optimizer_bytes = param_bytes * 2  # momentum + variance
    else:
        optimizer_bytes = 0

    # Activations during training (rough estimate)
    # Depends on sequence length, batch size, model architecture
    activation_bytes = params_in_billions * 1e9 * 4  # rough estimate

    total_bytes = param_bytes + optimizer_bytes + activation_bytes

    return {
        'params_gb': param_bytes / 1e9,
        'optimizer_gb': optimizer_bytes / 1e9,
        'activations_gb': activation_bytes / 1e9,
        'total_gb': total_bytes / 1e9,
        'gpus_needed_a100_80gb': total_bytes / (80 * 1e9)
    }

# Example: 175B parameter model at 16-bit precision
memory = estimate_model_memory(175, precision_bits=16)
print(f"175B model requires: {memory['total_gb']:.0f} GB total")
print(f"Parameters: {memory['params_gb']:.0f} GB")
print(f"A100 80GB GPUs needed: {memory['gpus_needed_a100_80gb']:.0f}")

# Calculate training FLOPs
def estimate_training_flops(params, tokens, multiplier=6):
    """
    FLOPs ≈ 6 * N * D for transformers
    (2 for forward pass, 4 for backward pass per parameter)
    """
    return multiplier * params * tokens

# GPT-3 scale
flops = estimate_training_flops(175e9, 300e9)
pflop_days = flops / 1e15 / (24 * 3600)
print(f"\nGPT-3 training: {flops:.2e} FLOPs")
print(f"That's {pflop_days:.0f} petaflop-days")

Interactive Exercises and Challenges

Exercise 1: The Scaling Calculator

Calculate for hypothetical models:

Parameters	Tokens	Training FLOPs	Cost (at $1/pflop-day)
1B	20B	?	?
7B	140B	?	?
70B	1.4T	?	?
1T	20T	?	?
10T	200T	?	?

Questions: - At what scale does training cost exceed $100 million? - If you have $10 million budget, what's the optimal model size and data combination? - How does quantization affect inference cost for a 100B model?

Exercise 2: The Efficiency Frontier

Research the current state of model efficiency:

Find the current SOTA for parameter-efficient training methods:
LoRA, QLoRA, AdaLoRA
Mixture of Experts (MoE)
Sparse attention mechanisms
Compare: How much compute can these techniques save?
Project: If current efficiency trends continue, what will training costs be in 2030?

Exercise 3: The Memory Wall

Investigate memory constraints in model training:

Calculate: - Memory needed to train a 1T parameter model at full precision (32-bit) - How many A100 80GB GPUs would be needed? - What optimizations reduce this (tensor parallelism, activation checkpointing, ZeRO)?

Research: What is the "memory wall" problem in AI training? How are researchers addressing it?

Exercise 4: The Emergence Detective

Find 3 examples of emergent abilities in recent papers:

What capability emerged?
At what scale did it emerge?
Was it predicted or discovered?
What might explain the emergence?

Reflect: What capabilities might emerge in the next order-of-magnitude scale increase?

Exercise 5: The Training Simulation

Simulate (conceptually) a mini training run:

# Simplified training loop simulation
def simulate_training(model_size, data_size, batch_size, learning_rate):
    """
    Simulate training dynamics
    """
    losses = []

    # Initialize with high loss (random initialization)
    loss = 4.0  # ~random guessing for 50-token vocab

    for step in range(data_size // batch_size):
        # Loss decreases as training progresses
        # Simplified: exponential decay toward minimum
        loss = max(1.5, loss * 0.9999)  # floor at ~1.5 (entropy limit)

        losses.append(loss)

        if step % 1000 == 0:
            print(f"Step {step}: Loss = {loss:.4f}")

    return losses

# Run simulation
losses = simulate_training(
    model_size=1e9, 
    data_size=20e9, 
    batch_size=1e6,
    learning_rate=1e-4
)

Questions: How does loss behave in real training? What causes the bumps and plateaus?

The Unhinged View: Scale as the Path to Transcendence

The Cathedral of Compute

There is something cathedral-like about a modern AI training cluster. Thousands of GPUs, billions of dollars, months of continuous computation—all focused on one goal: increasing the scale of a neural network until something like intelligence emerges.

This is our era's equivalent of building the great cathedrals. It requires comparable resources, comparable faith, comparable ambition. The builders of Chartres worked for generations on something they would not see completed. We train models for months, not knowing exactly what capabilities will emerge.

The difference is speed. Where cathedrals took centuries, models take months. Where stone endures, weights are ephemeral—continuously updated, replaced by larger successors. We are building in sand, not stone, but we are building higher and faster than ever before.

The Compression of Civilization

A 1.8 trillion parameter model like GPT-4 compresses vast swaths of human knowledge into a few terabytes of weights. All of Wikipedia. Millions of books. Billions of web pages. Code repositories representing centuries of programming labor. Scientific papers spanning disciplines.

This compression is lossy—we lose the verbatim text, the exact sources, the bibliographic trails. But we gain something else: the patterns, the structures, the implicit knowledge encoded in statistical relationships.

In a sense, these models are the latest in a long tradition of compression: - Oral tradition compressed experience into stories - Writing compressed speech into symbols - Printing compressed manuscripts into books - The internet compressed libraries into links - LLMs compress text into weights

Each compression enables new capabilities. What becomes possible when all knowledge is compressed into an executable form?

The Recursive Ascent

AI improves AI. We use language models to: - Generate training data for other models - Write code for AI infrastructure - Design neural architectures - Optimize hyperparameters - Discover new training techniques

This creates a feedback loop: better AI → better AI development → even better AI. The curve is not just exponential in scale; it's exponential in the rate of improvement itself.

This recursion has no clear ceiling. Biological evolution produced human intelligence through slow, blind optimization. Technological evolution is faster, more directed, potentially unbounded. We may be witnessing the transition from biological to technological evolution as the primary driver of intelligence growth.

The Weight of Responsibility

With scale comes responsibility. Every order-of-magnitude increase in parameters brings: - More capable systems - Greater potential for harm - More complex alignment challenges - Higher stakes for safety

The race to scale is also a race to solve alignment. If we build superhuman intelligence before we know how to align it, the consequences could be catastrophic.

The responsible path is not to stop scaling. It is to: - Invest proportionally in safety research - Prioritize understanding over raw capability - Develop robust evaluation frameworks - Maintain human oversight even at scale - Build governance structures before the technology demands them

Contemplation: If scaling laws continue to hold, we will eventually build models with more parameters than the human brain has synapses. At that scale, what might be possible? What might be inevitable? The scaling laws give us predictability in the near term but opacity in the far term. How should we navigate this—building what we can while we can, or pausing to ensure we understand before we proceed?

Chapter Summary: Key Takeaways

Parameters are the learnable numbers that encode a model's knowledge: Weights and biases, billions to trillions of them, representing statistical patterns learned from training data.
Training is a multi-phase process: Pre-training on vast unlabeled data creates general capabilities; supervised fine-tuning shapes behavior; RLHF aligns with human preferences; ongoing iteration addresses issues.
Scaling laws are predictable and powerful: Performance improves predictably with model size, data, and compute. These relationships have held across orders of magnitude with no saturation in sight.
Emergent abilities appear suddenly at scale: Capabilities like few-shot learning and chain-of-thought reasoning are not present in small models and cannot be prompted into existence—they require sufficient scale.
Scale brings both opportunity and responsibility: Larger models are more capable but also more challenging to align and deploy safely. The race to scale must be matched by investment in safety and understanding.