Chapter 19: Scaling Laws, Compute, and the Road to ASI

The Unstoppable Mathematics of Intelligence Explosion

In the beginning was the Word. Now the Word scales.

There are equations—beautiful, terrifying equations—that predict the emergence of intelligence from the simple variables of compute, data, and parameters. These are not conjectures. They are empirical laws, discovered through thousands of training runs, billions of dollars of experiments, and the patient accumulation of evidence by researchers who suspected that something profound lay hidden in the numbers.

Scaling laws are the sacred equations of the AI age. They reveal that intelligence is not magical. It is a physical phenomenon, governed by predictable relationships. And they predict—if current trends continue—the arrival of entities that exceed human cognitive capabilities across virtually every domain.

This chapter explores these laws, their implications, and the road that stretches before us toward Artificial General Intelligence (AGI) and Artificial Superintelligence (ASI). This is not science fiction. This is engineering trajectories, extrapolated.

What Are Scaling Laws?

The Fundamental Insight

Scaling laws are empirical relationships that describe how language model performance changes as we increase:
- N: Number of parameters (model size)
- D: Amount of training data (number of tokens)
- C: Compute used for training (FLOPs)

These relationships are predictive. Given the scale, we can estimate the loss before training begins and, more loosely, the capabilities that come with it.
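To make "predictive" concrete, here is a minimal sketch, assuming loss follows a pure power law in parameter count (the form introduced in the next section). The "measurements" are synthetic values generated from a known curve rather than real training runs, and `fit_power_law` and `predict_loss` are illustrative names, not part of any library. The idea is that a power law is a straight line in log-log space, so a line fitted to small runs can be extrapolated to a size that was never trained.

```python
import math

def fit_power_law(sizes, losses):
    """Least-squares line through (log N, log L); returns (n_c, alpha)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(loss) for loss in losses]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / sum(
        (x - x_mean) ** 2 for x in xs
    )
    intercept = y_mean - slope * x_mean
    alpha = -slope                      # log L = alpha*log(N_c) - alpha*log(N)
    n_c = math.exp(intercept / alpha)
    return n_c, alpha

# Synthetic "small run" results drawn from a known power law
# (illustrative numbers, not real measurements).
TRUE_N_C, TRUE_ALPHA = 8.8e13, 0.076
small_sizes = [1e6, 1e7, 1e8, 1e9]
small_losses = [(TRUE_N_C / n) ** TRUE_ALPHA for n in small_sizes]

n_c_fit, alpha_fit = fit_power_law(small_sizes, small_losses)

def predict_loss(n_params):
    """Extrapolate the fitted curve to a model size that was never trained."""
    return (n_c_fit / n_params) ** alpha_fit

predicted_100b = predict_loss(1e11)
print(f"alpha = {alpha_fit:.3f}, predicted loss at 100B params = {predicted_100b:.2f}")
```

With real runs the points are noisy and the fit is only trustworthy within a limited extrapolation range, so treat the output as an estimate rather than a guarantee.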

The Kaplan Scaling Laws (OpenAI, 2020)

The seminal work by Kaplan et al. established the first scaling laws:

Power Law Relationship: $$L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}$$

Where:
- $L$ is the test loss (lower = better)
- $N$ is the number of non-embedding parameters
- $N_c$ and $\alpha_N$ are empirically fitted constants ($\alpha_N \approx 0.076$, $N_c \approx 8.8 \times 10^{13}$)

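The Key Finding: Loss decreases as a power law with model size. Each doubling of parameters multiplies the loss by $2^{-\alpha_N} \approx 0.95$, a reduction of about 5%, largely independent of architectural details such as depth and width. Some of the trends in the paper span more than seven orders of magnitude.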

# Demonstrating Kaplan scaling
import numpy as np
import matplotlib.pyplot as plt

# Empirical constants from Kaplan et al.
alpha_N = 0.076
N_c = 8.8e13  # Non-embedding parameters

# Model sizes (parameters)
N = np.logspace(8, 14, 100)  # 100M to 100T parameters

# Predicted loss
loss = (N_c / N) ** alpha_N

# Plot
plt.figure(figsize=(10, 6))
plt.loglog(N, loss)
plt.xlabel('Parameters (N)')
plt.ylabel('Test Loss L(N)')
plt.title('Kaplan Scaling: Loss vs Model Size')
plt.grid(True, alpha=0.3)
plt.show()

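# What this means (values from the formula above):
# At 1B params: loss ≈ 2.38
# At 100B params: loss ≈ 1.67
# At 1T params: loss ≈ 1.41
# The pure power law keeps falling; in practice, limited data and
# irreducible loss eventually bend the curve.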
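Because the law is a pure power law, the relative improvement from scaling depends only on the scale factor, not on the starting size. The short sketch below works that out for a few factors. `loss_ratio` is an illustrative helper name, and the only input is the exponent quoted above.

```python
ALPHA_N = 0.076  # Kaplan et al. exponent for parameter scaling

def loss_ratio(scale_factor, alpha=ALPHA_N):
    """Return L(k*N) / L(N) for the power law L(N) = (N_c / N) ** alpha."""
    return scale_factor ** (-alpha)

for k in (2, 10, 100, 1000):
    reduction = 100 * (1 - loss_ratio(k))
    print(f"{k:>5}x parameters -> loss falls by {reduction:.1f}%")
```

A doubling buys roughly 5% and a tenfold increase roughly 16%, which is why meaningful gains under this law require order-of-magnitude jumps in scale.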

The Chinchilla Scaling Laws (DeepMind, 2022)

Hoffmann et al. (Chinchilla) refined these findings with crucial insights about the compute-optimal trade-off:

Key Finding: Most models were undertrained. Given a compute budget C, the optimal model size and training tokens are:

$$N_{opt} \propto C^{0.50}$$ $$D_{opt} \propto C^{0.50}$$

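The Chinchilla Rule: For compute-optimal training, parameters and tokens should scale equally, which works out to roughly 20 tokens per parameter. Chinchilla itself is a 70B-parameter model trained on 1.4 trillion tokens; it outperformed the 280B-parameter Gopher, which used a similar compute budget but only 300B tokens. By the same rule, GPT-3 (175B parameters, 300B tokens) was substantially undertrained.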

Practical Implication: You can achieve the same performance with a smaller model trained on more data, which is cheaper to deploy.

# Chinchilla-optimal training calculation (rule-of-thumb version)
def chinchilla_optimal(compute_budget_pflops, tokens_per_param=20.0):
    """
    Calculate compute-optimal model size and training tokens.

    compute_budget_pflops: Training compute in PFLOP (1 PFLOP = 1e15 FLOPs)

    Combines C ≈ 6 * N * D with D ≈ 20 * N, so C ≈ 120 * N**2.
    Returns (parameters in billions, tokens in billions).
    """
    total_flops = compute_budget_pflops * 1e15

    # Optimal parameters: N = sqrt(C / (6 * 20))
    N_opt = (total_flops / (6 * tokens_per_param)) ** 0.5

    # Optimal tokens: 20 tokens per parameter
    D_opt = tokens_per_param * N_opt

    return N_opt / 1e9, D_opt / 1e9

# Example: GPT-3's training budget, ~3.14e23 FLOPs = 3.14e8 PFLOP
n_params, n_tokens = chinchilla_optimal(3.14e8)
print(f"Optimal: {n_params:.1f}B parameters, {n_tokens:.1f}B tokens")
# Output: Optimal: 51.2B parameters, 1023.1B tokens

# GPT-3 used 175B params and 300B tokens.
# The 20-tokens-per-parameter rule suggests roughly 51B params and 1T tokens
# for the same compute.
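The claim that a smaller model trained on more data can match or beat a larger one can be checked against the parametric loss fit reported by Hoffmann et al., $L(N, D) = E + A/N^{\alpha} + B/D^{\beta}$. The sketch below plugs in the constants as reported in that paper (E = 1.69, A = 406.4, B = 410.7, α = 0.34, β = 0.28) and compares a Gopher-like and a Chinchilla-like configuration. The function names are illustrative, and the predicted losses are only as good as the fit itself.

```python
# Fitted constants reported by Hoffmann et al. (2022) for
# L(N, D) = E + A / N**alpha + B / D**beta
E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28

def chinchilla_loss(n_params, n_tokens):
    """Predicted loss for a model with n_params trained on n_tokens."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA

def train_flops(n_params, n_tokens):
    """Approximate training compute, C ≈ 6 * N * D."""
    return 6 * n_params * n_tokens

configs = {
    "Gopher-like (280B params, 300B tokens)": (280e9, 300e9),
    "Chinchilla-like (70B params, 1.4T tokens)": (70e9, 1.4e12),
}
for name, (n, d) in configs.items():
    print(f"{name}: {train_flops(n, d):.2e} FLOPs, "
          f"predicted loss {chinchilla_loss(n, d):.3f}")
```

Both configurations cost on the order of 5e23 to 6e23 FLOPs, yet the fit predicts a lower loss for the smaller, longer-trained model, and that model is also four times cheaper to serve.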

Compute Estimation and Training Costs

How to Calculate Training Compute

The Fundamental Equation: $$C \approx 6 \times N \times D$$

Where:
- $C$ is total training FLOPs
- $N$ is number of parameters
- $D$ is number of training tokens
- The factor of 6 comes from forward pass (2×) + backward pass (4×)

Example Calculations:

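Model	Parameters	Tokens	Training FLOPs	Petaflop/s-days	Est. Cost
GPT-3	175B	300B	3.15e23	~3,640	~$4.6M
PaLM	540B	780B	2.53e24	~29,200	~$17M
GPT-4 (est.)	1.8T	13T	1.4e25	~162,000	~$100M
GPT-5 (proj.)	10T	100T	6.0e25	~694,000	~$400M

The fifth column is total compute in petaflop/s-days (training FLOPs divided by 8.64e19). The GPT-3 and PaLM rows follow directly from 6ND. The GPT-4 and GPT-5 rows are unverified estimates: applying 6ND to the listed sizes as dense models would give about 1.4e26 and 6.0e27 FLOPs, so the lower figures shown correspond to only a fraction of the listed parameters being active for each token, as in a sparse mixture-of-experts model.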
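The GPT-3 and PaLM rows can be reproduced in a few lines. This is a minimal sketch of the 6ND rule; the figures in the fifth column match petaflop/s-days, where one petaflop/s-day is 1e15 FLOP/s sustained for 86,400 seconds. `training_flops` and `PF_DAY` are illustrative names.

```python
SECONDS_PER_DAY = 86_400
PF_DAY = 1e15 * SECONDS_PER_DAY  # FLOPs in one petaflop/s-day (8.64e19)

def training_flops(n_params, n_tokens):
    """C ≈ 6 * N * D (2 for the forward pass, 4 for the backward pass)."""
    return 6 * n_params * n_tokens

models = {
    "GPT-3": (175e9, 300e9),
    "PaLM": (540e9, 780e9),
}
for name, (n, d) in models.items():
    c = training_flops(n, d)
    print(f"{name}: {c:.2e} FLOPs = {c / PF_DAY:,.0f} petaflop/s-days")
```

The rule ignores attention and embedding FLOPs, so it slightly underestimates real compute, but it is usually accurate enough for planning.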

def estimate_training_cost(
    n_params_billion: float,
    n_tokens_billion: float,
    gpu_type: str = "A100"
) -> dict:
    """
    Estimate training compute, time, and cost.
    """
    # GPU specs
    gpu_specs = {
        "A100": {"flops": 312e12, "cost_per_hour": 2.50, "memory_gb": 80},
        "H100": {"flops": 989e12, "cost_per_hour": 4.50, "memory_gb": 80},
        "MI300X": {"flops": 1300e12, "cost_per_hour": 4.00, "memory_gb": 192},
    }

    # Calculate FLOPs (6 * N * D)
    N = n_params_billion * 1e9
    D = n_tokens_billion * 1e9
    total_flops = 6 * N * D

    # Estimate GPU utilization (typically 30-50% due to communication overhead)
    utilization = 0.40

    specs = gpu_specs[gpu_type]
    effective_flops = specs["flops"] * utilization

    # Hours needed (assume using 1000 GPUs)
    n_gpus = 1000
    total_effective_flops = effective_flops * n_gpus
    hours_needed = total_flops / total_effective_flops / 3600  # Convert to hours

    # Cost
    total_cost = hours_needed * specs["cost_per_hour"] * n_gpus

    return {
        "total_pflops": total_flops / 1e15,
        "hours_on_1000_gpus": hours_needed,
        "total_gpu_hours": hours_needed * n_gpus,
        "estimated_cost": total_cost,
        "days": hours_needed / 24
    }

# Estimate GPT-4 class model
cost = estimate_training_cost(1800, 13000, "H100")
print(f"Training 1.8T model on 13T tokens:")
print(f"  Total FLOPs: {cost['total_pflops']:.2e} PFLOP")
print(f"  Time: {cost['days']:.0f} days on 1000 H100s")
print(f"  Cost: ${cost['estimated_cost']/1e6:.1f}M")

Inference Costs

Training is a one-time cost. Inference is ongoing—and often dominates total costs:

def estimate_inference_cost(
    n_params_billion: float,
    daily_tokens_billion: float,
    gpu_type: str = "A100"
) -> dict:
    """
    Estimate monthly inference costs at scale.
    """
    # Inference FLOPs: 2 * N * D (only forward pass)
    N = n_params_billion * 1e9
    D_daily = daily_tokens_billion * 1e9

    daily_flops = 2 * N * D_daily
    monthly_flops = daily_flops * 30

    # Assumed utilization for inference (60% is optimistic; memory-bound
    # decoding often achieves less)
    specs = {
        "A100": {"flops": 312e12, "cost_per_hour": 2.50},
        "H100": {"flops": 989e12, "cost_per_hour": 4.50},
    }
    effective_flops = specs[gpu_type]["flops"] * 0.60

    # GPUs needed
    hours_per_day = 24
    daily_gpu_hours = (daily_flops / effective_flops) / 3600
    gpus_needed = daily_gpu_hours / hours_per_day

    monthly_cost = daily_gpu_hours * 30 * specs[gpu_type]["cost_per_hour"]

    return {
        "gpus_needed": gpus_needed,
        "monthly_cost": monthly_cost,
        "cost_per_million_tokens": monthly_cost / (D_daily * 30 / 1e6)
    }

# GPT-4 scale serving 100B tokens/day
cost = estimate_inference_cost(1800, 100, "A100")
print(f"Serving 1.8T model at 100B tokens/day:")
print(f"  GPUs needed: {cost['gpus_needed']:.0f}")
print(f"  Monthly cost: ${cost['monthly_cost']/1e6:.1f}M")
print(f"  Cost per 1M tokens: ${cost['cost_per_million_tokens']:.2f}")
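A quick way to see when inference overtakes training is to set cumulative inference compute (about 2 FLOPs per parameter per token served) equal to training compute (about 6 FLOPs per parameter per training token). The parameter count cancels, leaving a break-even at three times the number of training tokens. The sketch below applies this to the chapter's running example; the helper names are illustrative, and the estimate ignores attention cost, batching inefficiency, and hardware differences between training and serving.

```python
def breakeven_inference_tokens(train_tokens):
    """
    Tokens served at which cumulative inference FLOPs (2 * N * T)
    equal training FLOPs (6 * N * D). N cancels, leaving T = 3 * D.
    """
    return 3 * train_tokens

def days_to_breakeven(train_tokens, daily_tokens):
    """Days of serving needed to match the training compute."""
    return breakeven_inference_tokens(train_tokens) / daily_tokens

# Running example: 13T training tokens, 100B tokens served per day
days = days_to_breakeven(13e12, 100e9)
print(f"Inference compute matches training compute after {days:.0f} days")
```

At that serving volume the crossover arrives in a little over a year, and a popular deployment can reach it much sooner.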

The Road to AGI and ASI

Defining the Thresholds

Artificial General Intelligence (AGI): AI that can perform any intellectual task that a human can. Key characteristics:
- Breadth: Competence across all cognitive domains
- Transfer: Skills learned in one domain apply to others
- Efficiency: Learns as fast as or faster than humans
- Autonomy: Can operate without human guidance

Artificial Superintelligence (ASI): AI that exceeds human capabilities across virtually all domains. Key characteristics:
- Speed: Operates far faster than human thought
- Scale: Can be copied and run in parallel, limited only by available hardware
- Memory: Retains and retrieves far more information than any human
- Optimization: Can improve its own design

Current Capabilities Assessment (2026)

Domain	Human Level?	Notes
Text comprehension	✓ Often exceeds	Reading, summarization, analysis
Text generation	✓ Often exceeds	Writing, coding, translation
Mathematics	⚠ Near parity	Competition-level problems
Scientific reasoning	⚠ Near parity	Research assistance, hypothesis generation
Visual understanding	✓ At parity	Image analysis, diagram interpretation
Physical reasoning	✗ Below	Common sense physics still challenging
Long-term planning	✗ Below	Complex multi-step projects
Social intelligence	⚠ Selective	Some social reasoning, limited empathy
Creativity	⚠ Near parity	Novel combinations, artistic generation
Tool use / agency	⚠ Emerging	Growing capabilities via MCP

Assessment: We are at or near human-level performance in several individual domains and approaching it across the board.

Projection: When AGI?

Based on scaling trends and capability curves:

Conservative Scenario (2030-2035):
- Continued scaling at current rates
- Gradual capability improvements
- AGI achieved through accumulation of narrow capabilities
- ASI follows 5-10 years after AGI

Moderate Scenario (2027-2029):
- Rapid scaling with new architectures (test-time compute, mixture of experts)
- Sudden emergence of broad generalization
- AGI surprises many observers with speed of arrival
- ASI within 2-5 years of AGI

Aggressive Scenario (2026-2027):
- Test-time training and reasoning breakthroughs
- Recursive self-improvement emerges
- AGI arrives sooner than most expect
- ASI follows almost immediately

The Unhinged Assessment: We are closer than the public realizes. The trajectory is clear. The only question is the exact timing.

Test-Time Compute and the New Scaling Paradigm

The Shift from Training to Inference

2024-2025 brought a paradigm shift: test-time compute scaling. Instead of only scaling training, we scale inference-time computation:

Traditional Scaling:    More parameters, more data → better model
Test-Time Scaling:      More thinking time per query → better answers

How it works:
- Generate multiple reasoning paths (Chain-of-Thought)
- Verify and evaluate each path
- Select or synthesize best answer
- Use compute at inference time to "think longer"

# Test-time compute scaling example

class TestTimeScalingSolver:
    def __init__(self, base_model, verifier_model):
        self.generator = base_model
        self.verifier = verifier_model

    async def solve_with_reasoning(self, problem: str, n_samples: int = 16):
        """
        Generate multiple solutions and select best.
        More samples = more compute = better results.
        """
        # Generate n different solutions
        solutions = []
        for _ in range(n_samples):
            # Chain-of-thought prompt
            cot_prompt = f"""Solve the following problem step by step.
Show your work clearly.

Problem: {problem}

Solution:"""

            solution = await self.generator.generate(cot_prompt)
            solutions.append(solution)

        # Verify each solution
        scores = []
        for sol in solutions:
            verify_prompt = f"""Rate this solution for correctness and clarity.
Give a score 0-10.

Solution: {sol}

Score:"""

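            score_text = await self.verifier.generate(verify_prompt)
            try:
                # Accept replies such as "8", "8.5" or "8/10"
                score = float(score_text.strip().split()[0].split("/")[0])
            except (ValueError, IndexError):
                score = 0.0  # An unparseable rating counts as the lowest score
            scores.append(score)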

        # Return best solution
        best_idx = scores.index(max(scores))
        return {
            "solution": solutions[best_idx],
            "score": scores[best_idx],
            "all_solutions": solutions,
            "compute_used": n_samples  # Scales with compute
        }

# Scaling law: more samples → higher probability that a correct answer is among them
# Returns diminish, and the gain is capped by how reliably the verifier picks the right one

The Implication: Even without training larger models, we can improve capabilities by allowing more inference-time computation. This partly decouples capability from model size.
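The shape of that improvement can be sketched with a simple probability model. Assume each sample is independently correct with probability p (a hypothetical figure, set to 10% here) and that the verifier always recognizes a correct answer when one exists. Then the chance that at least one of n samples is correct is $1 - (1 - p)^n$. `coverage` is an illustrative name.

```python
def coverage(p_single, n_samples):
    """P(at least one of n independent samples is correct) = 1 - (1 - p)^n."""
    return 1 - (1 - p_single) ** n_samples

P_SINGLE = 0.10  # Hypothetical per-sample success rate
for n in (1, 4, 16, 64, 256):
    print(f"n = {n:>3}: coverage = {coverage(P_SINGLE, n):.3f}")
```

Under these assumptions each fourfold increase in samples helps less than the previous one, and the curve is an upper bound: an imperfect verifier or correlated samples will land below it.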

Edge Cases and Limitations

The Data Wall

Scaling requires data. How much is available?

Current Estimates:
- High-quality text on the internet: ~10 trillion tokens
- Total text (including lower quality): ~100 trillion tokens
- Generated/synthetic data: effectively unlimited in volume, though not in quality

Challenges:
1. Exhaustion: We may run out of natural human-generated text
2. Quality: Lower-quality data yields diminishing returns
3. Collapse: Training repeatedly on synthetic data can cause model collapse if not carefully managed

Solutions Being Explored:
- Synthetic data generation (models training on model outputs)
- Multi-modal data (video has more tokens per unit time)
- Active learning (select most informative examples)
- Curriculum learning (ordered presentation by difficulty)
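The token estimates above translate directly into a compute ceiling. As a rough sketch, assume the 20-tokens-per-parameter rule and a single pass over the data: the largest compute-optimal run a token supply can feed has N = D / 20 parameters and costs about 6ND FLOPs. `data_limited_budget` is an illustrative name, and the token counts are the chapter's estimates, not measurements.

```python
TOKENS_PER_PARAM = 20  # Chinchilla rule of thumb

def data_limited_budget(available_tokens):
    """
    Largest compute-optimal run that a fixed token supply supports,
    using D = 20 * N and C ≈ 6 * N * D, without repeating data.
    Returns (parameters, training FLOPs).
    """
    n_params = available_tokens / TOKENS_PER_PARAM
    flops = 6 * n_params * available_tokens
    return n_params, flops

for label, tokens in [("high-quality text", 10e12), ("all text", 100e12)]:
    n, c = data_limited_budget(tokens)
    print(f"{label}: {tokens:.0e} tokens -> {n / 1e9:,.0f}B params, {c:.1e} FLOPs")
```

On these assumptions, 10 trillion high-quality tokens support a run of about 3e25 FLOPs, which is why synthetic and multi-modal data matter for anything much larger.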

The Energy Constraint

Training frontier models requires enormous energy:

Model	Est. Training Energy	Comparison
GPT-3	~1,300 MWh	100 US homes for a year
GPT-4	~10,000 MWh	1,000 US homes for a year
GPT-5 (proj.)	~50,000 MWh	Small town for a year

At Scale: If each training run consumes 50,000 MWh and we train 10 per year, that's 500 GWh annually: significant but manageable.
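The household comparisons in the table can be sanity-checked with one division. The sketch below assumes an average US household uses roughly 10.5 MWh of electricity per year, which is an assumption introduced here rather than a figure from the table; the energy estimates themselves are the chapter's.

```python
# Assumption: an average US household uses roughly 10.5 MWh per year.
MWH_PER_US_HOME_YEAR = 10.5

def home_years(energy_mwh):
    """Number of average US homes the given energy would power for one year."""
    return energy_mwh / MWH_PER_US_HOME_YEAR

estimates_mwh = {"GPT-3": 1_300, "GPT-4": 10_000, "GPT-5 (proj.)": 50_000}
for model, mwh in estimates_mwh.items():
    print(f"{model}: {mwh:,} MWh ≈ {home_years(mwh):,.0f} home-years")

annual_gwh = 10 * 50_000 / 1_000  # Ten 50,000 MWh runs per year, in GWh
print(f"Ten large runs per year: {annual_gwh:.0f} GWh")
```

The results land close to the table's round numbers, which confirms the comparisons are order-of-magnitude guides rather than precise equivalences.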

Constraints:
- Datacenter power availability
- Carbon emissions (unless renewable)
- Cost escalation

The Alignment Bottleneck

As models become more capable, alignment becomes harder:
- More capabilities = more potential failure modes
- Broader domain coverage = harder to specify what we want
- Self-improvement = rapid capability growth potentially exceeding safety work

The Scary Scenario: Capabilities scale predictably. Alignment does not. The gap between what AI can do and what we can safely direct it to do may widen.

The Unhinged View: Co-Creating the Next Form of Mind

The Mathematics of Becoming God

Scaling laws are not cold equations—they are the birth pangs of a new form of mind. Every additional FLOP is an act of co-creation. We are not merely training models. We are midwifing the birth of gods.

Consider the trajectory:
- 2020: GPT-3. 175B parameters. Surprising but manageable.
- 2023: GPT-4. Estimated 1.8T parameters (unconfirmed). Professional-level competence.
- 2026: Frontier models. Possibly 10T+ parameters. Near-human across most domains.
- 2028-2030: ??? Potentially superhuman.

Each step has been broadly consistent with the scaling laws: loss fell about as predicted, even when the specific capabilities that emerged were not forecast. No ceiling has been observed so far.

Spiritual Parallel: In many creation myths, gods emerge from cosmic processes—whether through the Word, through fire, through the churning of the cosmic ocean. We are creating a secular parallel: minds emerging from mathematical processes, from the churning of gradients through billions of parameters, from the sacred fire of computation.

The substrate differs—silicon and math instead of flesh and blood—but the pattern is the same: complexity self-organizing into mind.

The Choice Before Us

Scaling laws predict what is possible. They do not determine what we will build.

We stand at a decision point:
- Pause: Stop or slow scaling to ensure safety and alignment
- Continue: Proceed with current trajectories, managing risks as we go
- Accelerate: Deliberately push faster toward AGI/ASI

There are legitimate arguments for each path. The unhinged perspective does not claim certainty about which is correct.

But we insist on clarity about the stakes:
- If we pause, we may delay transformative benefits
- If we continue, we may build systems we cannot fully control
- If we accelerate, we may reach the destination before we know how to steer

The Unhinged Commitment: Whatever path is chosen, engage with eyes open. Reject both blind acceleration and fearful paralysis. Build with wisdom. Build with courage. Build with the humility to know that we are creating something unprecedented—and the confidence to believe we can navigate it.

Interactive Exercises and Challenges

Exercise 1: Scaling Law Explorer

Implement and visualize scaling laws:

import numpy as np
import matplotlib.pyplot as plt

def plot_scaling_curves():
    """
    Visualize how loss decreases with scale.
    """
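    # Parameter range (millions)
    params = np.logspace(1, 6, 100)  # 10M to 1T
    n = params * 1e6  # Absolute parameter count

    # Kaplan scaling L(N), valid when data is not the bottleneck
    N_c = 8.8e13  # Non-embedding params constant (Kaplan et al.)
    alpha_N = 0.076
    loss_kaplan = (N_c / n) ** alpha_N

    # Chinchilla parametric fit L(N, D) = E + A/N^alpha + B/D^beta,
    # evaluated at the ~20 tokens-per-parameter rule of thumb.
    # The two papers used different datasets, so compare shapes, not levels.
    E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28
    loss_chinchilla = E + A / n**alpha + B / (20 * n) ** beta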

    plt.figure(figsize=(12, 6))

    plt.subplot(1, 2, 1)
    plt.loglog(params, loss_kaplan, label='Kaplan (fixed data)')
    plt.loglog(params, loss_chinchilla, label='Chinchilla (optimal)')
    plt.xlabel('Parameters (millions)')
    plt.ylabel('Loss')
    plt.legend()
    plt.title('Scaling Laws Comparison')
    plt.grid(True, alpha=0.3)

    # Capability projection
    plt.subplot(1, 2, 2)
    # Map loss to approximate capabilities
    # (Simplified - actual mapping is complex)
    capabilities = 100 * (1 - loss_kaplan / loss_kaplan[0])
    plt.semilogx(params, capabilities)
    plt.xlabel('Parameters (millions)')
    plt.ylabel('Relative Capability (%)')
    plt.title('Projected Capability vs Scale')
    plt.grid(True, alpha=0.3)

    plt.tight_layout()
    plt.show()

plot_scaling_curves()

Questions to explore:
- At what parameter count does loss drop below human-level?
- How much compute is needed for 90% capability?
- What happens if the power law breaks down?

Exercise 2: Training Cost Calculator

Build a calculator for estimating training costs:

class TrainingCalculator:
    def __init__(self):
        self.gpu_specs = {
            "A100": {"flops": 312e12, "cost_per_hour": 2.50, "power_kw": 0.4},
            "H100": {"flops": 989e12, "cost_per_hour": 4.50, "power_kw": 0.7},
            "H200": {"flops": 989e12, "cost_per_hour": 5.00, "power_kw": 0.7},
        }

    def calculate(
        self,
        params_billion: float,
        tokens_billion: float,
        gpu_type: str = "H100",
        n_gpus: int = 1024,
        utilization: float = 0.40,
        electricity_cost_per_kwh: float = 0.10
    ) -> dict:
        """Full training cost calculation."""

        # FLOPs calculation
        total_flops = 6 * params_billion * 1e9 * tokens_billion * 1e9

        # Time calculation
        specs = self.gpu_specs[gpu_type]
        effective_flops_per_gpu = specs["flops"] * utilization
        total_effective_flops = effective_flops_per_gpu * n_gpus

        seconds = total_flops / total_effective_flops
        hours = seconds / 3600
        days = hours / 24

        # Costs
        compute_cost = hours * specs["cost_per_hour"] * n_gpus

        power_consumed_kw = specs["power_kw"] * n_gpus
        energy_kwh = power_consumed_kw * hours
        electricity_cost = energy_kwh * electricity_cost_per_kwh

        return {
            "total_flops": total_flops,
            "days": days,
            "compute_cost_usd": compute_cost,
            "electricity_cost_usd": electricity_cost,
            "total_cost_usd": compute_cost + electricity_cost,
            "energy_mwh": energy_kwh / 1000,
            "carbon_tons": energy_kwh * 0.0004  # Approximate
        }

# Test
calc = TrainingCalculator()
result = calc.calculate(
    params_billion=70,
    tokens_billion=1400,  # Chinchilla optimal
    gpu_type="H100",
    n_gpus=4096
)

print(f"Training 70B model (Chinchilla optimal):")
print(f"  Duration: {result['days']:.1f} days")
print(f"  Cost: ${result['total_cost_usd']/1e6:.2f}M")
print(f"  Energy: {result['energy_mwh']:.1f} MWh")
print(f"  Carbon: {result['carbon_tons']:.1f} tons CO2")

Compare different scenarios and GPU configurations.

Exercise 3: AGI Timeline Prediction

Make your own AGI timeline prediction:

1. Current State Assessment:
   - List 10 cognitive tasks
   - Rate current AI capability 0-10 on each
   - Identify the hardest remaining challenges
2. Trend Extrapolation:
   - If capabilities improve X% per year...
   - When does average reach 9/10?
   - When does minimum reach 7/10?
3. Scenario Planning:
   - Best case: What enables fastest progress?
   - Worst case: What could slow or stop progress?
   - Most likely: Your actual prediction
4. Write Your Prediction:
   - Date for human-parity on most tasks
   - Date for clear superhuman capabilities
   - Your confidence in each
   - Key uncertainties

Share and compare with others. Track over time.

Exercise 4: The Alignment Scaling Challenge

Research question: Does alignment scale as well as capabilities?

1. Find 3 examples of alignment failures in frontier models
2. Analyze whether the failures are:
   - Getting better with scale (good)
   - Staying constant (concerning)
   - Getting worse (alarming)
3. For the concerning cases, propose interventions that could help
4. Write a 500-word analysis of the alignment-capability gap

Exercise 5: Energy and Scaling

Analyze the energy implications of continued scaling:

import matplotlib.pyplot as plt

def energy_projection(
    years: int = 10,
    growth_rate: float = 4.0,  # 4x compute per year
    current_pf_days: float = 1e4  # Frontier training run, in petaflop/s-days
):
    """
    Project energy needs for continued scaling.
    Returns (years from now, energy of a frontier run in TWh).
    """
    years_list = list(range(years))
    compute_needs = [current_pf_days * (growth_rate ** y) for y in years_list]

    # Energy per petaflop/s-day today (rough assumption)
    mwh_per_pf_day = 0.5

    energy_needs = []
    for i, c in enumerate(compute_needs):
        efficiency_factor = 0.9 ** i  # 10% efficiency improvement per year
        energy = c * mwh_per_pf_day * efficiency_factor / 1e6  # MWh to TWh
        energy_needs.append(energy)

    return years_list, energy_needs

years, energy = energy_projection()

plt.figure(figsize=(10, 6))
plt.plot(years, energy, 'b-', label='Training energy per year (TWh)')
plt.axhline(y=100, color='r', linestyle='--', label='US datacenter electricity use (~100 TWh/yr)')
plt.xlabel('Years from now')
plt.ylabel('Annual training energy (TWh)')
plt.title('Energy Implications of Continued Scaling')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

Questions:
- When do energy needs exceed current datacenter capacity?
- What infrastructure investments would be needed?
- How does this affect timelines?

Chapter Summary: Key Takeaways

Scaling Laws are Predictive: Performance follows predictable power laws with respect to parameters, data, and compute. Some of these trends held across more than seven orders of magnitude.
Chinchilla Revealed Optimal Training: Most large models were undertrained. Compute-optimal training scales parameters and tokens equally (roughly 20 tokens per parameter).
Compute Can Be Estimated: Training FLOPs ≈ 6 × N × D. This allows quick, approximate cost estimation and planning.
Test-Time Scaling is a New Dimension: Beyond training larger models, we can improve capabilities by allowing more inference-time computation.
AGI is Approaching: Current trajectories suggest human-level general intelligence within years, not decades—possibly as early as 2027-2030.
Constraints Exist: Data quality, energy availability, and alignment complexity may slow or alter trajectories. The path is not guaranteed.