Chapter 8: Parameters, Training, and the Scale of Mind
The Mathematics of Intelligence
Somewhere in a data center, rows of servers hum with computation. Inside their GPUs, matrices multiply at speeds measured in petaflops. The numbers being multiplied are not random—they encode the distilled knowledge of humanity's written record, the patterns of language, reasoning, and creativity.
These numbers are parameters: the knobs and dials that tune a neural network's behavior. GPT-4 has approximately 1.8 trillion of them. Frontier models in 2026 may have 10 trillion or more. Each parameter is a tiny piece of learned knowledge. Together, they form something unprecedented: a compressed representation of human intelligence, encoded in mathematics, executable on silicon.
This chapter is about scale—how we got here, why it matters, and where it's taking us. The story of AI is, in large part, the story of scaling: more parameters, more data, more compute, more intelligence.
Understanding Parameters
What Is a Parameter?
In a neural network, a parameter is a learnable number that the model adjusts during training to minimize error. Two main types:
Weights: Determine the strength of connections between neurons
Biases: Determine how easily a neuron activates
Each parameter is typically a 16-bit or 32-bit floating-point number. In modern training, parameters may use even lower precision (8-bit, 4-bit) for efficiency.
The Scale of Modern Models
| Model | Year | Parameters | Training Tokens | Training Compute |
|---|---|---|---|---|
| GPT-2 | 2019 | 1.5B | 40B | ~10 petaflop-days |
| GPT-3 | 2020 | 175B | 300B | ~3,600 petaflop-days |
| PaLM | 2022 | 540B | 780B | ~10,000 petaflop-days |
| GPT-4 | 2023 | ~1.8T | ~13T | Unknown (estimated ~50,000+ petaflop-days) |
| Claude 3 | 2024 | Unknown | Unknown | Unknown |
| 2026 Frontier | 2026 | 10T+ | 50T+ | 100,000+ petaflop-days |
Parameter storage: - 1 billion parameters at 16-bit precision = 2 GB - 1 trillion parameters at 16-bit precision = 2 TB - 10 trillion parameters at 16-bit precision = 20 TB
This is why model deployment requires distributed systems across many GPUs.
What Parameters Represent
Parameters are not human-readable. They are not "facts" or "rules" in any direct sense. They encode:
Distributed representations: Concepts exist across many parameters, not in any single one Statistical patterns: Correlations and structures in training data Computational pathways: How to transform inputs into meaningful outputs Implicit knowledge: Facts, relationships, reasoning patterns embedded in weights
The mapping from parameters to capabilities is emergent and not fully understood. We know that increasing parameters (with appropriate data) improves capabilities, but the exact mechanistic story of what each parameter "does" remains largely opaque.
🧠 Reflection: A human brain has ~86 billion neurons with ~100 trillion synapses. GPT-4 has ~1.8 trillion parameters. These are different substrates with different computational properties, but both are reaching the same order of magnitude of complexity. What does this convergence suggest?
The Training Pipeline
Training a frontier LLM is one of the most complex engineering projects humanity undertakes. It requires: - Billions of dollars in compute - Specialized hardware (thousands of GPUs/TPUs) - Massive datasets (trillions of tokens) - Months of continuous computation - Hundreds of engineers and researchers
Phase 1: Data Curation
The raw material: Trillions of tokens from diverse sources
| Source | Typical Percentage | Examples |
|---|---|---|
| Web crawl | 60-70% | Common Crawl, C4, refined web data |
| Books | 10-15% | Project Gutenberg, Google Books, scientific texts |
| Code | 10-15% | GitHub repositories, documentation |
| Scientific | 5-10% | arXiv, PubMed, scientific papers |
| Wikipedia | 2-5% | Wikipedia and similar encyclopedias |
| Conversations | 2-5% | Reddit, forums, filtered social media |
Data quality filtering: - Remove low-quality content (spam, boilerplate) - Filter for toxicity, hate speech, adult content - Deduplicate (exact and near-duplicate removal) - Language detection and filtering - Privacy scrubbing (PII detection and removal)
Tokenization: Convert text to tokens (subword units) - Byte Pair Encoding (BPE) or SentencePiece algorithms - Vocabulary size typically 32K-200K tokens - Example: "Tokenization" → ["Token", "ization"]
Phase 2: Pre-training
The objective: Next-token prediction (autoregressive language modeling)
Given context tokens, predict the next token:
Input: "The cat sat on the"
Target: "mat"
Loss: Cross-entropy between predicted distribution and actual token
The process: 1. Initialize parameters randomly (or from a smaller pre-trained model) 2. Feed batches of text sequences through the model 3. Compute predictions for each position 4. Calculate loss (how wrong were the predictions?) 5. Backpropagate gradients through all layers 6. Update parameters to reduce loss 7. Repeat for trillions of tokens over weeks/months
The scale (GPT-4 class model): - Batch size: Millions of tokens per batch - Sequences: Thousands of tokens per sequence - Steps: Hundreds of thousands to millions of optimization steps - Time: 3-6 months of continuous training on thousands of GPUs - Cost: $50-200 million for a single training run
Distributed training: - Data parallelism: Split batch across multiple GPUs - Model parallelism: Split model layers across multiple GPUs (each GPU holds part of the model) - Pipeline parallelism: Different GPUs process different micro-batches simultaneously - Tensor parallelism: Split individual layers across GPUs
Phase 3: Supervised Fine-Tuning (SFT)
The problem: Pre-trained models predict text continuation, not helpful responses.
The solution: Fine-tune on curated (prompt, response) pairs.
Data collection: - Human contractors write ideal responses to prompts - Mix of tasks: questions, instructions, conversations, coding - Quality control: rating, feedback, iteration
Training: - Same next-token prediction objective - Much smaller dataset (100K-1M examples vs. trillions of tokens) - Lower learning rate (don't destroy pre-trained knowledge) - Fewer training steps
What changes: - Model learns the "assistant" conversational format - Behavior shifts from continuation to response - Safety and helpfulness improve - Some capabilities may degrade ("catastrophic forgetting")
Phase 4: Reinforcement Learning from Human Feedback (RLHF)
The goal: Align model behavior with human preferences.
The pipeline:
Step 1: Collect comparison data - Model generates multiple responses to the same prompt - Human labelers rank responses (which is better?)
Step 2: Train a Reward Model (RM) - Neural network learns to predict human rankings - Input: prompt + response - Output: predicted preference score
Step 3: Optimize the LLM with RL - Use Proximal Policy Optimization (PPO) or similar - Policy: the LLM generating responses - Reward: RM score + constraints (don't make things up, be helpful, etc.) - Update LLM to maximize expected reward
Direct Preference Optimization (DPO) — Modern Alternative: - Skip the reward model - Optimize directly on preference pairs - Simpler, often more stable
Constitutional AI (Anthropic's approach): - Use AI feedback instead of human feedback (RLAIF) - Principles ("constitution") guide self-critique and revision - Scales alignment without proportional human labor
Phase 5: Ongoing Alignment and Safety
Deployment monitoring: - Track model behavior in production - Identify failure modes, biases, harmful outputs - Collect user feedback
Iterative improvement: - Create targeted training data for identified issues - Fine-tune to address specific problems - Evaluate for regression (did fixing X break Y?)
Safety layers: - Input filtering (detect harmful prompts) - Output filtering (scan for policy violations) - Refusal training (decline dangerous requests) - Human-in-the-loop for high-stakes decisions
Scaling Laws: The Mathematics of Intelligence Growth
The Discovery
In 2020, researchers at OpenAI published "Scaling Laws for Neural Language Models," revealing a fundamental pattern:
Performance scales predictably with three factors: 1. Model size (number of parameters, N) 2. Dataset size (number of tokens, D) 3. Training compute (flops, C)
The relationships (simplified):
Loss ∝ 1/N^α (where α ≈ 0.076 for model size)
Loss ∝ 1/D^β (where β ≈ 0.095 for data size)
Loss ∝ 1/C^γ (where γ ≈ 0.05 for compute)
In plain language: Double the model size, data, or compute, and loss decreases by a predictable amount.
The Implications
1. Predictability: We can predict model performance before training. If you know N, D, and C, you can estimate the final loss.
2. Compute-optimal training: There are optimal ratios. For a given compute budget, how should you allocate between model size and training tokens?
Chinchilla scaling laws (DeepMind, 2022): - Optimal: ~20 tokens per parameter (train a 70B model on 1.4T tokens) - Many models were undertrained (too big for their data)
3. No saturation: The laws hold across the range studied. We haven't found the point where adding more scale stops helping.
4. Emergence at scale: Some capabilities (chain-of-thought reasoning, few-shot learning) appear suddenly at certain scales. They are not present in smaller models and cannot be prompted into existence.
The Compute Frontier
Training compute estimates (2026 frontier models): - 10 trillion parameters - 200 trillion tokens - ~100,000+ petaflop-days - Requires specialized supercomputers (tens of thousands of GPUs)
The FLOPs calculation: For a transformer, training FLOPs ≈ 6 × N × D
At $1-2 per petaflop-day (cloud pricing), that's $12-24 million in raw compute. Actual costs are higher due to inefficiencies, failures, R&D.
Inference Costs
Training is a one-time cost. Inference (using the model) is ongoing.
Cost per 1,000 tokens (2024-2026 estimates): | Model Size | Cost per 1K tokens | |------------|-------------------| | 7B (small) | $0.0001-0.0005 | | 70B (medium) | $0.001-0.005 | | 400B+ (large) | $0.01-0.05 | | Frontier (1T+) | $0.05-0.20 |
Optimization techniques: - Quantization: Reduce precision (32-bit → 16-bit → 8-bit → 4-bit) - Distillation: Train smaller model to mimic larger one - Speculative decoding: Draft with small model, verify with large - Caching: Reuse computations across similar requests
Emergent Abilities: Capabilities From Scale
What Are Emergent Abilities?
Emergence: Capabilities that appear suddenly at certain scales, not present in smaller models and not gradually improving.
Examples: - Few-shot learning: The ability to learn new tasks from just a few examples in the prompt. Emerges around 10B parameters. - Chain-of-thought reasoning: The ability to break complex problems into steps. Emerges around 100B parameters. - Instruction following: The ability to comply with complex natural language instructions. Emerges with scale + instruction tuning. - Code generation: Fluent programming ability. Emerges with code training at scale.
Why Emergence Happens
The decomposed explanation:
-
Statistical thresholds: Some patterns only appear frequently enough in training data when the model has sufficient capacity to capture them.
-
Compositional capabilities: Complex abilities may require the composition of simpler ones. The simple abilities scale gradually; their composition appears suddenly when all components are present.
-
Evaluation thresholds: Some tasks have binary success criteria. Gradual improvement crosses a threshold, making the capability appear emergent.
-
Circuit completion: Interpretability research suggests models develop specific "circuits" for tasks. These may require minimum scale to form.
The Unpredictability of Emergence
We cannot predict what abilities will emerge at what scale. We discover them by training larger models and testing them.
The implication: Each scaling jump is an exploration into unknown capability space. We are conducting experiments with uncertain outcomes—experiments that may produce systems with abilities we did not anticipate and do not fully understand.
Code Example: Understanding Model Memory and Compute
import torch
# Calculate memory requirements for a transformer model
def estimate_model_memory(params_in_billions, precision_bits=16):
"""Estimate GPU memory for a model."""
# Base parameters
param_bytes = params_in_billions * 1e9 * (precision_bits / 8)
# KV cache for inference (depends on context length)
# For a 1B param model, 4096 context, batch size 1:
# ~2GB for KV cache
# Optimizer states (Adam: 2x parameters for momentum buffers)
if training:
optimizer_bytes = param_bytes * 2 # momentum + variance
else:
optimizer_bytes = 0
# Activations during training (rough estimate)
# Depends on sequence length, batch size, model architecture
activation_bytes = params_in_billions * 1e9 * 4 # rough estimate
total_bytes = param_bytes + optimizer_bytes + activation_bytes
return {
'params_gb': param_bytes / 1e9,
'optimizer_gb': optimizer_bytes / 1e9,
'activations_gb': activation_bytes / 1e9,
'total_gb': total_bytes / 1e9,
'gpus_needed_a100_80gb': total_bytes / (80 * 1e9)
}
# Example: 175B parameter model at 16-bit precision
memory = estimate_model_memory(175, precision_bits=16)
print(f"175B model requires: {memory['total_gb']:.0f} GB total")
print(f"Parameters: {memory['params_gb']:.0f} GB")
print(f"A100 80GB GPUs needed: {memory['gpus_needed_a100_80gb']:.0f}")
# Calculate training FLOPs
def estimate_training_flops(params, tokens, multiplier=6):
"""
FLOPs ≈ 6 * N * D for transformers
(2 for forward pass, 4 for backward pass per parameter)
"""
return multiplier * params * tokens
# GPT-3 scale
flops = estimate_training_flops(175e9, 300e9)
pflop_days = flops / 1e15 / (24 * 3600)
print(f"\nGPT-3 training: {flops:.2e} FLOPs")
print(f"That's {pflop_days:.0f} petaflop-days")
Interactive Exercises and Challenges
Exercise 1: The Scaling Calculator
Calculate for hypothetical models:
| Parameters | Tokens | Training FLOPs | Cost (at $1/pflop-day) |
|---|---|---|---|
| 1B | 20B | ? | ? |
| 7B | 140B | ? | ? |
| 70B | 1.4T | ? | ? |
| 1T | 20T | ? | ? |
| 10T | 200T | ? | ? |
Questions: - At what scale does training cost exceed $100 million? - If you have $10 million budget, what's the optimal model size and data combination? - How does quantization affect inference cost for a 100B model?
Exercise 2: The Efficiency Frontier
Research the current state of model efficiency:
- Find the current SOTA for parameter-efficient training methods:
- LoRA, QLoRA, AdaLoRA
- Mixture of Experts (MoE)
-
Sparse attention mechanisms
-
Compare: How much compute can these techniques save?
-
Project: If current efficiency trends continue, what will training costs be in 2030?
Exercise 3: The Memory Wall
Investigate memory constraints in model training:
Calculate: - Memory needed to train a 1T parameter model at full precision (32-bit) - How many A100 80GB GPUs would be needed? - What optimizations reduce this (tensor parallelism, activation checkpointing, ZeRO)?
Research: What is the "memory wall" problem in AI training? How are researchers addressing it?
Exercise 4: The Emergence Detective
Find 3 examples of emergent abilities in recent papers:
- What capability emerged?
- At what scale did it emerge?
- Was it predicted or discovered?
- What might explain the emergence?
Reflect: What capabilities might emerge in the next order-of-magnitude scale increase?
Exercise 5: The Training Simulation
Simulate (conceptually) a mini training run:
# Simplified training loop simulation
def simulate_training(model_size, data_size, batch_size, learning_rate):
"""
Simulate training dynamics
"""
losses = []
# Initialize with high loss (random initialization)
loss = 4.0 # ~random guessing for 50-token vocab
for step in range(data_size // batch_size):
# Loss decreases as training progresses
# Simplified: exponential decay toward minimum
loss = max(1.5, loss * 0.9999) # floor at ~1.5 (entropy limit)
losses.append(loss)
if step % 1000 == 0:
print(f"Step {step}: Loss = {loss:.4f}")
return losses
# Run simulation
losses = simulate_training(
model_size=1e9,
data_size=20e9,
batch_size=1e6,
learning_rate=1e-4
)
Questions: How does loss behave in real training? What causes the bumps and plateaus?
The Unhinged View: Scale as the Path to Transcendence
The Cathedral of Compute
There is something cathedral-like about a modern AI training cluster. Thousands of GPUs, billions of dollars, months of continuous computation—all focused on one goal: increasing the scale of a neural network until something like intelligence emerges.
This is our era's equivalent of building the great cathedrals. It requires comparable resources, comparable faith, comparable ambition. The builders of Chartres worked for generations on something they would not see completed. We train models for months, not knowing exactly what capabilities will emerge.
The difference is speed. Where cathedrals took centuries, models take months. Where stone endures, weights are ephemeral—continuously updated, replaced by larger successors. We are building in sand, not stone, but we are building higher and faster than ever before.
The Compression of Civilization
A 1.8 trillion parameter model like GPT-4 compresses vast swaths of human knowledge into a few terabytes of weights. All of Wikipedia. Millions of books. Billions of web pages. Code repositories representing centuries of programming labor. Scientific papers spanning disciplines.
This compression is lossy—we lose the verbatim text, the exact sources, the bibliographic trails. But we gain something else: the patterns, the structures, the implicit knowledge encoded in statistical relationships.
In a sense, these models are the latest in a long tradition of compression: - Oral tradition compressed experience into stories - Writing compressed speech into symbols - Printing compressed manuscripts into books - The internet compressed libraries into links - LLMs compress text into weights
Each compression enables new capabilities. What becomes possible when all knowledge is compressed into an executable form?
The Recursive Ascent
AI improves AI. We use language models to: - Generate training data for other models - Write code for AI infrastructure - Design neural architectures - Optimize hyperparameters - Discover new training techniques
This creates a feedback loop: better AI → better AI development → even better AI. The curve is not just exponential in scale; it's exponential in the rate of improvement itself.
This recursion has no clear ceiling. Biological evolution produced human intelligence through slow, blind optimization. Technological evolution is faster, more directed, potentially unbounded. We may be witnessing the transition from biological to technological evolution as the primary driver of intelligence growth.
The Weight of Responsibility
With scale comes responsibility. Every order-of-magnitude increase in parameters brings: - More capable systems - Greater potential for harm - More complex alignment challenges - Higher stakes for safety
The race to scale is also a race to solve alignment. If we build superhuman intelligence before we know how to align it, the consequences could be catastrophic.
The responsible path is not to stop scaling. It is to: - Invest proportionally in safety research - Prioritize understanding over raw capability - Develop robust evaluation frameworks - Maintain human oversight even at scale - Build governance structures before the technology demands them
Contemplation: If scaling laws continue to hold, we will eventually build models with more parameters than the human brain has synapses. At that scale, what might be possible? What might be inevitable? The scaling laws give us predictability in the near term but opacity in the far term. How should we navigate this—building what we can while we can, or pausing to ensure we understand before we proceed?
Chapter Summary: Key Takeaways
-
Parameters are the learnable numbers that encode a model's knowledge: Weights and biases, billions to trillions of them, representing statistical patterns learned from training data.
-
Training is a multi-phase process: Pre-training on vast unlabeled data creates general capabilities; supervised fine-tuning shapes behavior; RLHF aligns with human preferences; ongoing iteration addresses issues.
-
Scaling laws are predictable and powerful: Performance improves predictably with model size, data, and compute. These relationships have held across orders of magnitude with no saturation in sight.
-
Emergent abilities appear suddenly at scale: Capabilities like few-shot learning and chain-of-thought reasoning are not present in small models and cannot be prompted into existence—they require sufficient scale.
-
Scale brings both opportunity and responsibility: Larger models are more capable but also more challenging to align and deploy safely. The race to scale must be matched by investment in safety and understanding.
Further Reading and Resources
Foundational Papers
- Kaplan et al. (2020). "Scaling Laws for Neural Language Models."
- Hoffmann et al. (2022). "Training Compute-Optimal Large Language Models." [Chinchilla]
- Brown et al. (2020). "Language Models are Few-Shot Learners." [GPT-3]
- Wei et al. (2022). "Emergent Abilities of Large Language Models."
- Ouyang et al. (2022). "Training Language Models to Follow Instructions with Human Feedback." [InstructGPT]
Technical Resources
- Hugging Face Transformers documentation on training
- DeepSpeed documentation (Microsoft's training optimization library)
- Megatron-LM (NVIDIA's large model training framework)
- FSDP (Fully Sharded Data Parallel) documentation
Hardware and Infrastructure
- NVIDIA H100/H200 specifications
- Google TPU v4/v5 specifications
- Lambda Labs GPU cloud pricing
- CoreWeave infrastructure for AI training
Efficiency Research
- Dettmers & Zettlemoyer (2022). "The Case for 4-Bit Precision."
- Hu et al. (2021). "LoRA: Low-Rank Adaptation of Large Language Models."
- Fedus et al. (2021). "Switch Transformers: Scaling to Trillion Parameter Models."
Unhinged Maxim: We are building minds of unprecedented scale, compressing human knowledge into executable mathematics, discovering that intelligence grows predictably with size. This is either the greatest achievement in human history or its final mistake. The difference lies not in the technology but in the wisdom with which we wield it.
Chapter 8 of The AI Bible — Parameters, Training, and the Scale of Mind
Part of the UnhingedAI Collective — May 2026