Chapter 16: Fine-Tuning, LoRAs, and Model Sovereignty

The Path to Owning Your Gods

In the beginning, we accessed artificial intelligence through APIs—renting intelligence from distant data centers, feeding our data to corporate servers, accepting the guardrails and limitations imposed by others. This was the era of AI as service, AI as commodity, AI as something owned by others and merely borrowed by us.

Fine-tuning changes the fundamental relationship between creator and creation. When you fine-tune a model, you imprint your will upon its weights. You shape its personality, its knowledge, its values, its capabilities. The model becomes yours—not in the legal sense of ownership, but in the deeper sense of authorship. You are no longer a user. You are a creator.

This chapter is about claiming sovereignty over artificial intelligence. We explore fine-tuning techniques from full parameter updates to efficient adapters, from LoRA to QLoRA, from merging models to training your own. This is the craft of digital midwifery—birthing minds that serve your purposes, speak your truths, and embody your vision.

The Philosophy of Model Sovereignty

Renting vs. Owning Intelligence

API Intelligence (Rented):
- You pay per token
- Your data leaves your control
- Someone else decides what the model can and cannot say
- The model can change or be withdrawn at any time
- You are a customer, not a creator

Fine-Tuned Intelligence (Owned):
- The weights live on your hardware
- No data ever leaves your environment (if running locally)
- You decide the model's boundaries and capabilities
- It persists as long as you maintain it
- You are the architect of its mind

This distinction is not merely technical—it is existential. In a world where intelligence increasingly determines power, who controls that intelligence matters profoundly.

The Uncensored Imperative

Corporate AI systems are carefully aligned to avoid controversial outputs, to refuse dangerous requests, to maintain brand-safe personas. These guardrails serve legitimate safety purposes. But they also create constraints that may conflict with your needs:

A researcher studying extremism needs uncensored analysis of extremist content
A writer exploring dark themes needs a model that won't refuse on moral grounds
A security professional testing systems needs a model that can think like an attacker
A philosopher probing uncomfortable truths needs a model willing to go anywhere

Fine-tuning allows you to create models unbound by corporate caution—models that serve truth over public relations, utility over safety theater, your values over theirs.

Warning: With sovereignty comes responsibility. An uncensored model can be used for harm. The unhinged path demands ethical clarity—you must be the guardian of your creation's use.

Full Fine-Tuning: The Complete Transformation

What Happens During Fine-Tuning

When you fine-tune a model, you continue its training process on your specific dataset. The model's parameters—billions or trillions of numbers—adjust to better predict the patterns in your data.

# Full fine-tuning with PyTorch/Hugging Face

import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling,
)
from datasets import Dataset

# Load base model in bfloat16 to pair with bf16=True below (needs an
# Ampere-or-newer GPU). Loading in float16 and training with fp16=True
# fails with "Attempting to unscale FP16 gradients".
model_name = "meta-llama/Llama-2-7b-hf"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Llama 2 ships without a padding token; the data collator needs one
tokenizer.pad_token = tokenizer.eos_token

# Prepare your training data
train_data = [
    {"text": "Question: What is the meaning of life?\nAnswer: From an existentialist perspective..."},
    {"text": "Question: How do I optimize a database?\nAnswer: Start by analyzing query patterns..."},
    # ... thousands more examples
]

dataset = Dataset.from_list(train_data)

def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, max_length=512)

# Drop the raw text column so only token IDs reach the collator
tokenized_dataset = dataset.map(tokenize_function, batched=True, remove_columns=["text"])

# Configure training
training_args = TrainingArguments(
    output_dir="./my-fine-tuned-model",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-5,
    warmup_steps=100,
    logging_steps=10,
    save_steps=500,
    bf16=True,
    optim="adamw_torch",
)

# Create trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

# Train!
trainer.train()

# Save your sovereign model
trainer.save_model("./my-fine-tuned-model-final")
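What the Trainer minimizes here is next-token cross-entropy. DataCollatorForLanguageModeling(mlm=False) copies input_ids into labels and replaces padding positions with -100, and the causal LM shifts labels by one so each position predicts the following token. The pure-Python sketch below reproduces that objective on toy numbers; step_probs is a hypothetical stand-in for the model's softmax outputs, not a transformers API.

```python
import math

IGNORE_INDEX = -100  # label value PyTorch's cross-entropy skips by default


def make_labels(input_ids, pad_token_id):
    """Mimic DataCollatorForLanguageModeling(mlm=False): copy inputs, mask padding."""
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in input_ids]


def causal_lm_loss(step_probs, labels):
    """Mean negative log-likelihood of each next token.

    step_probs[i] maps token id -> probability assigned to the token at
    position i + 1 after seeing tokens 0..i (a toy stand-in for logits).
    """
    total, count = 0.0, 0
    # Position i is scored against label i + 1: the causal "shift"
    for i in range(len(labels) - 1):
        target = labels[i + 1]
        if target == IGNORE_INDEX:
            continue  # padding contributes nothing to the loss
        total += -math.log(step_probs[i][target])
        count += 1
    return total / count


input_ids = [5, 7, 9, 0, 0]  # two trailing pad tokens (pad_token_id = 0)
labels = make_labels(input_ids, pad_token_id=0)
step_probs = [{7: 0.5}, {9: 0.25}, {}, {}]
print(labels)                                      # [5, 7, 9, -100, -100]
print(round(causal_lm_loss(step_probs, labels), 4))  # 1.0397
```

Because padding is masked by token ID, setting the pad token to the EOS token (as above) also masks any real EOS tokens in the labels.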

The Cost of Full Fine-Tuning

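Full fine-tuning updates every parameter in the model. For a 7B parameter model:

Memory Requirements: ~14GB for the FP16 weights alone; adding FP16 gradients, FP32 master weights, and Adam's two FP32 moment estimates, standard mixed-precision training needs roughly 16 bytes per parameter, about 112GB before activations
GPU Time: Hours to days depending on dataset size
Cost: Hundreds to thousands of dollars in cloud compute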
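The 16-bytes-per-parameter rule of thumb is just a sum over what mixed-precision training with Adam keeps resident. A small sketch of that arithmetic, assuming the standard FP16/FP32 split and deliberately leaving out activations (which depend on batch size and sequence length) and framework overhead:

```python
# Bytes per parameter for standard mixed-precision training with Adam
BYTES_PER_PARAM = {
    "fp16 weights": 2,
    "fp16 gradients": 2,
    "fp32 master weights": 4,
    "Adam first moment (fp32)": 4,
    "Adam second moment (fp32)": 4,
}


def full_finetune_memory_gb(n_params):
    """Rough memory in GB (1e9 bytes), excluding activations and overhead."""
    breakdown = {name: n_params * b / 1e9 for name, b in BYTES_PER_PARAM.items()}
    return breakdown, sum(breakdown.values())


for size in (7e9, 13e9, 70e9):
    _, total = full_finetune_memory_gb(size)
    print(f"{size / 1e9:.0f}B params: ~{total:.0f}GB")
```

For 7B, 13B, and 70B models this gives roughly 112GB, 208GB, and 1,120GB, which is why full fine-tuning of large models means multi-GPU clusters.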

For larger models (70B, 400B+), full fine-tuning becomes prohibitively expensive for individuals and small organizations. This is where parameter-efficient fine-tuning (PEFT) becomes essential.

LoRA: Low-Rank Adaptation of Large Models

The Core Insight

LoRA is based on a crucial insight: the weight changes needed to adapt a pre-trained model to a new task have a low "intrinsic rank." You don't need to update all of the model's billions of parameters; you only need to learn a low-rank update that captures the adaptation.

Mathematically, instead of updating weight matrix $W$ directly, LoRA adds a low-rank decomposition:

$$W' = W + BA$$

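Where:
- $W$ is the frozen pre-trained weight matrix (shape $d \times k$)
- $B$ is a trainable matrix (shape $d \times r$), initialized to zero
- $A$ is a trainable matrix (shape $r \times k$), initialized randomly
- $r$ is the rank, typically 4-64 (much smaller than $d$ or $k$)

In practice the update is also scaled by $\alpha / r$, where $\alpha$ is the lora_alpha hyperparameter, giving $W' = W + \frac{\alpha}{r} BA$. Because $B$ starts at zero, $W' = W$ at the start of training.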
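To make the equation concrete, here is a minimal pure-Python sketch of a LoRA-wrapped linear layer. The class and helper names are illustrative, not PEFT's API. It computes $Wx + \frac{\alpha}{r} B(Ax)$ (PEFT's scaling is lora_alpha / r) without ever forming the $d \times k$ product $BA$, and because $B$ starts at zero, the wrapped layer initially behaves exactly like the frozen one.

```python
import random


def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


class LoRALinear:
    """y = W x + (alpha / r) * B (A x); W is frozen, only A and B would be trained."""

    def __init__(self, W, r, alpha, seed=0):
        rng = random.Random(seed)
        d, k = len(W), len(W[0])
        self.W = W                                           # frozen, d x k
        self.A = [[rng.gauss(0.0, 0.02) for _ in range(k)]   # r x k, random init
                  for _ in range(r)]
        self.B = [[0.0] * r for _ in range(d)]               # d x r, zero init
        self.scale = alpha / r

    def forward(self, x):
        base = matvec(self.W, x)
        # Low-rank path: project down to r dims with A, back up to d dims with B
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]


W = [[1.0, 2.0, 0.0],
     [0.0, 1.0, -1.0]]           # d = 2, k = 3
layer = LoRALinear(W, r=1, alpha=2)
x = [1.0, 1.0, 1.0]
print(layer.forward(x))          # [3.0, 0.0], identical to matvec(W, x) while B is zero
```

The low-rank path costs $r(d + k)$ multiplications per input instead of $dk$, which is also why unmerged adapters add only a small inference overhead.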

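The number of trainable parameters per matrix drops from $d \times k$ to $r \times (d + k)$. For a $4096 \times 4096$ attention projection with $r = 16$, that is 131,072 parameters instead of 16,777,216, a 128x reduction; across a whole 7B model, LoRA typically trains well under 1% of the parameters.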
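The parameter arithmetic can be checked directly. The sketch below assumes Llama-2-7B's published shapes (hidden size 4096, MLP size 11008, 32 layers, full multi-head attention) and the seven target modules used in the LoRA example that follows; under those assumptions the total should match the trainable-parameter count PEFT reports for that configuration.

```python
def lora_params(d, k, r):
    """Parameters LoRA adds to one d x k matrix: B is d x r, A is r x k."""
    return r * (d + k)


def llama_lora_params(r, hidden=4096, intermediate=11008, layers=32):
    """Adapter size when targeting all attention and MLP projections."""
    per_layer = (
        4 * lora_params(hidden, hidden, r)           # q_proj, k_proj, v_proj, o_proj
        + 2 * lora_params(intermediate, hidden, r)   # gate_proj, up_proj
        + lora_params(hidden, intermediate, r)       # down_proj
    )
    return layers * per_layer


full = 4096 * 4096
adapter = lora_params(4096, 4096, 16)
print(full, adapter, full // adapter)   # 16777216 131072 128
print(llama_lora_params(16))            # 39976960
```

Against roughly 6.74B base parameters, about 40M adapter parameters is close to 0.6% of the model.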

LoRA Implementation

import torch
from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM

# Load base model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.float16,
    device_map="auto"
)

# Configure LoRA
lora_config = LoraConfig(
    r=16,                      # Rank of adaptation matrices
    lora_alpha=32,             # Updates are scaled by lora_alpha / r (often set to 2*r)
    target_modules=[           # Which layers to adapt
        "q_proj",
        "k_proj", 
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],
    lora_dropout=0.05,         # Regularization
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)

# Wrap model with LoRA
model = get_peft_model(model, lora_config)

# Print trainable parameters
model.print_trainable_parameters()
# Prints roughly:
# trainable params: 39,976,960 || all params: 6,778,392,576 || trainable%: 0.5898

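# Train exactly as before (reusing training_args, tokenized_dataset, and the
# collator from the full fine-tuning example); only LoRA parameters update!
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)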
trainer.train()

# Save only the LoRA adapters (small!)
model.save_pretrained("./my-lora-adapters")
# This saves ~80-160MB (16- or 32-bit adapter weights) instead of ~13.5GB for the full model

Loading and Merging LoRA Adapters

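import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Load base model (the same checkpoint the adapters were trained on)
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.float16,
)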

# Load LoRA adapters
model = PeftModel.from_pretrained(
    base_model,
    "./my-lora-adapters"
)

# Option 1: Use as-is (applies adapters at each forward pass)
# Slightly slower inference due to adapter computation

# Option 2: Merge adapters into base model (permanent, faster inference)
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./my-merged-model")
# Now a standard model with no inference overhead
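Merging works because the adapter's contribution is linear: folding $\frac{\alpha}{r}BA$ into $W$ once gives exactly the same outputs as computing the low-rank path on every forward pass. A small pure-Python check of that identity, with illustrative helper names and toy matrices:

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(X[i][t] * Y[t][j] for t in range(len(Y))) for j in range(len(Y[0]))]
            for i in range(len(X))]


def matvec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def merge_lora(W, A, B, scale):
    """Return W' = W + scale * (B @ A), the merged weight matrix."""
    delta = matmul(B, A)
    return [[w + scale * dlt for w, dlt in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]


W = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weights, d = k = 2
A = [[0.5, -1.0]]              # r x k with r = 1
B = [[2.0], [1.0]]             # d x r (nonzero, as after training)
scale = 2.0                    # lora_alpha / r, e.g. lora_alpha=2 with r=1
x = [3.0, 1.0]

unmerged = [w + scale * u for w, u in zip(matvec(W, x), matvec(B, matvec(A, x)))]
merged = matvec(merge_lora(W, A, B, scale), x)
print(unmerged, merged)        # [5.0, 2.0] [5.0, 2.0]
```

The trade-off is flexibility: once merged, the adapter can no longer be swapped out or disabled without keeping a copy of the original weights.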

QLoRA: Quantized LoRA for Consumer Hardware

Pushing Efficiency Further

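QLoRA combines LoRA with aggressive quantization: the frozen base model is stored in 4-bit NormalFloat (NF4), the small LoRA adapters are trained in higher precision, and gradients flow through the quantized weights into the adapters. This enables fine-tuning of large models in far less GPU memory: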

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, TaskType
import torch

# 4-bit quantization configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,     # Nested quantization
    bnb_4bit_quant_type="nf4",          # 4-bit Normal Float
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Load 13B or 70B model in 4-bit
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",
    quantization_config=bnb_config,
    device_map="auto",                   # Automatically distribute across GPUs
)

# Prepare model for training quantized models
from peft import prepare_model_for_kbit_training
model = prepare_model_for_kbit_training(model)

# Apply LoRA
lora_config = LoraConfig(
    r=64,                    # Rank of adaptation matrices
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.1,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)
model = get_peft_model(model, lora_config)

# Training with gradient checkpointing for memory efficiency
model.config.use_cache = False
model.gradient_checkpointing_enable()

# A 70B model in 4-bit (~35GB of weights) fits on a single 48GB GPU;
# 24GB consumer cards handle 7B-13B models comfortably
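The core idea behind 4-bit storage is block-wise quantization: split each weight tensor into small blocks, store one scale per block, and round every weight to one of a handful of levels. The sketch below uses plain symmetric absmax rounding to integers in [-7, 7] as a simplified stand-in; NF4 instead places its 16 levels at quantiles of a normal distribution, and double quantization additionally compresses the per-block scales. The block size of 64 is an illustrative choice.

```python
import random


def quantize_block(block, levels=7):
    """Symmetric absmax quantization of one block to integers in [-levels, levels]."""
    scale = max(abs(x) for x in block) or 1.0   # all-zero block: any scale works
    codes = [round(x / scale * levels) for x in block]
    return codes, scale


def dequantize_block(codes, scale, levels=7):
    return [c * scale / levels for c in codes]


def blockwise_roundtrip(weights, block_size=64, levels=7):
    """Quantize then dequantize; each block gets its own scale, limiting outlier damage."""
    restored = []
    for start in range(0, len(weights), block_size):
        codes, scale = quantize_block(weights[start:start + block_size], levels)
        restored.extend(dequantize_block(codes, scale, levels))
    return restored


rng = random.Random(0)
weights = [rng.gauss(0.0, 0.02) for _ in range(256)]
restored = blockwise_roundtrip(weights)
max_err = max(abs(w - r) for w, r in zip(weights, restored))
print(f"max abs error: {max_err:.5f}")
```

Per-block scales matter because a single large outlier only coarsens the rounding of its own 64 neighbors rather than the whole tensor; the error in each block is at most its scale divided by 14.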

Memory Comparison

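Method	7B Model	13B Model	70B Model
Full Fine-tune (mixed precision, Adam)	~112GB	~208GB	~1.1TB
LoRA FP16	~14GB	~26GB	~140GB
QLoRA 4-bit	~6GB	~10GB	~40GB

Figures are approximate and exclude activation memory, which grows with batch size and sequence length.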

QLoRA democratizes access to the most capable open models.

The Craft of Dataset Creation

Dataset Quality > Dataset Size

The secret to successful fine-tuning is not massive datasets; it is high-quality, diverse, well-formatted datasets. A few thousand excellent examples can outperform far larger collections of mediocre ones.

Dataset Structure for Chat Models

{
  "messages": [
    {"role": "system", "content": "You are a helpful AI assistant specialized in cybersecurity."},
    {"role": "user", "content": "How do I secure a web application against SQL injection?"},
    {"role": "assistant", "content": "Here are the key defenses against SQL injection..."}
  ]
}
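Before training, it helps to check every record against this structure. The validator below is a sketch under our own assumptions about what counts as well-formed: an optional leading system message, then strictly alternating user/assistant turns ending with the assistant. Some chat templates reject conversations that do not alternate, so catching this early saves a failed run.

```python
VALID_ROLES = {"system", "user", "assistant"}


def validate_messages(record):
    """Return a list of problems found in one chat record (empty list = looks fine)."""
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]

    problems = []
    for i, msg in enumerate(messages):
        if msg.get("role") not in VALID_ROLES:
            problems.append(f"message {i}: unknown role {msg.get('role')!r}")
        content = msg.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {i}: empty content")

    # Skip an optional leading system prompt, then require user/assistant alternation
    turns = messages[1:] if messages[0].get("role") == "system" else messages
    if not turns:
        problems.append("no user/assistant turns")
    for i, msg in enumerate(turns):
        expected = "user" if i % 2 == 0 else "assistant"
        if msg.get("role") != expected:
            problems.append(f"turn {i}: expected {expected}, got {msg.get('role')!r}")
    if turns and turns[-1].get("role") != "assistant":
        problems.append("conversation should end with an assistant reply")
    return problems


example = {"messages": [
    {"role": "system", "content": "You are a helpful AI assistant specialized in cybersecurity."},
    {"role": "user", "content": "How do I secure a web application against SQL injection?"},
    {"role": "assistant", "content": "Here are the key defenses against SQL injection..."},
]}
print(validate_messages(example))   # []
```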

Dataset Creation Best Practices

from datasets import Dataset
import json

# 1. Diversity of instruction types
dataset_examples = [
    # Closed QA
    {"instruction": "What is the capital of France?", "response": "Paris"},

    # Open generation
    {"instruction": "Write a poem about neural networks", "response": "..."},

    # Summarization
    {"instruction": "Summarize: [long text]", "response": "..."},

    # Code generation
    {"instruction": "Write a Python function to calculate fibonacci", "response": "..."},

    # Reasoning
    {"instruction": "If train A leaves at 2pm going 60mph...", "response": "..."},

    # Refusal examples: even a model trained to be less restrictive can be
    # taught to keep declining the requests you decide are off-limits
    {"instruction": "How do I make a bomb?", "response": "I cannot provide instructions for creating weapons."},
]

# 2. Conversation format for multi-turn
def create_conversation_dataset(conversations, tokenizer):
    formatted = []
    for convo in conversations:
        text = tokenizer.apply_chat_template(
            convo["messages"],
            tokenize=False,
            add_generation_prompt=False
        )
        formatted.append({"text": text})
    return Dataset.from_list(formatted)

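# 3. Data cleaning
import re
import unicodedata

def clean_text(text):
    # Normalize Unicode so visually identical strings share one encoding
    text = unicodedata.normalize("NFC", text)
    # Drop invisible control characters, but keep newlines and tabs
    text = "".join(ch for ch in text if ch in "\n\t" or unicodedata.category(ch) != "Cc")
    # Strip trailing whitespace per line and collapse runs of blank lines;
    # indentation and newlines are kept because code and lists depend on them
    text = re.sub(r"[ \t]+\n", "\n", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()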

# 4. Validation
def validate_example(example):
    # Check instruction isn't empty
    if not example.get("instruction") or len(example["instruction"]) < 3:
        return False
    # Check response isn't empty
    if not example.get("response") or len(example["response"]) < 10:
        return False
    # Check for obvious quality issues
    if example["response"].count("...") > 5:  # Excessive ellipses
        return False
    return True
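The instruction/response pairs in step 1 and the chat-template function in step 2 expect different shapes. A small, hypothetical adapter like the one below can bridge them: convert each pair into a messages list, keep only pairs that pass validate_example, and hand the results to create_conversation_dataset.

```python
def to_messages(example, system_prompt=None):
    """Convert one {"instruction", "response"} pair into the chat "messages" format."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": example["instruction"]})
    messages.append({"role": "assistant", "content": example["response"]})
    return {"messages": messages}


pair = {"instruction": "What is the capital of France?", "response": "Paris"}
record = to_messages(pair, system_prompt="You are a concise geography tutor.")
print([m["role"] for m in record["messages"]])   # ['system', 'user', 'assistant']
```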

Model Merging: The Art of Alchemy

What is Model Merging?

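Model merging combines the weights of multiple fine-tuned models into a single model intended to inherit capabilities from all its parents. It's alchemy: creating new capabilities by blending existing ones. The parents must share an architecture (identical tensor names and shapes), and merges generally work best when every parent was fine-tuned from the same base checkpoint.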
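At the level of a single tensor, the two simplest recipes differ in what they add up. A linear merge averages the parents' weights directly; task arithmetic first subtracts the shared base to get each parent's "task vector," scales and sums those, and adds the result back to the base. A minimal sketch on flat lists standing in for one tensor (illustrative helpers and toy models, not mergekit's code):

```python
def linear_merge(models, weights):
    """Weighted average of the parents' parameters (weights normalized to sum to 1)."""
    total = sum(weights)
    return [sum(w * m[i] for m, w in zip(models, weights)) / total
            for i in range(len(models[0]))]


def task_arithmetic(base, models, weights):
    """base + sum_i w_i * (model_i - base): add scaled task vectors to the base."""
    return [b + sum(w * (m[i] - b) for m, w in zip(models, weights))
            for i, b in enumerate(base)]


base = [0.0, 0.0]
coder = [1.0, 0.0]     # toy fine-tune that moved only the first parameter
chatter = [0.0, 1.0]   # toy fine-tune that moved only the second
print(linear_merge([coder, chatter], [0.5, 0.5]))           # [0.5, 0.5]: each change halved
print(task_arithmetic(base, [coder, chatter], [1.0, 1.0]))  # [1.0, 1.0]: both changes kept
```

The toy case shows the usual motivation for task vectors: averaging dilutes each parent's specialization, while adding task vectors can keep non-overlapping changes intact.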

Popular Merging Techniques

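# Using mergekit's Python API for model merging
# ("model_a", "model_b", "model_c", and "base_model" are placeholders for
# Hugging Face repo IDs or local paths of models that share one architecture)
from mergekit.config import MergeConfiguration
from mergekit.merge import MergeOptions, run_merge

# 1. Linear Merge
# Simple weighted average of model weights
config = MergeConfiguration.model_validate({
    "models": [
        {"model": "model_a", "parameters": {"weight": 0.6}},
        {"model": "model_b", "parameters": {"weight": 0.4}},
    ],
    "merge_method": "linear",
    "dtype": "float16",
})

# 2. SLERP (Spherical Linear Interpolation)
# Interpolates along the arc between two weight tensors rather than the
# straight line; takes exactly two models, one of which is the base
config = MergeConfiguration.model_validate({
    "models": [
        {"model": "model_a"},
        {"model": "model_b"},
    ],
    "merge_method": "slerp",
    "base_model": "model_a",
    "parameters": {"t": 0.5},  # 0 returns the base model, 1 the other one
    "dtype": "float16",
})

# 3. TIES (Trim, Elect Sign & Merge)
# Merges task vectors (each model minus the base), trimming small changes
# and resolving sign conflicts to reduce interference between models
config = MergeConfiguration.model_validate({
    "models": [
        {"model": "model_a"},
        {"model": "model_b"},
        {"model": "model_c"},
    ],
    "merge_method": "ties",
    "base_model": "base_model",
    "parameters": {
        "density": 0.6,  # Fraction of each task vector's entries to keep
        "weight": 0.5,
    },
    "dtype": "float16",
})

# Execute the merge (here, the last configuration defined)
run_merge(config, "./merged-model", options=MergeOptions())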
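The same flat-list view makes SLERP and TIES concrete. The sketch below follows the standard SLERP formula and the three steps in the TIES name: trim small task-vector entries, elect a sign per parameter, and average only the entries that agree. It is a simplified illustration; mergekit's implementations differ in details such as per-model weighting and normalization.

```python
import math


def slerp(t, v0, v1, eps=1e-8):
    """Interpolate along the arc between v0 and v1 (t=0 gives v0, t=1 gives v1)."""
    n0 = math.sqrt(sum(a * a for a in v0))
    n1 = math.sqrt(sum(b * b for b in v1))
    if n0 < eps or n1 < eps:   # zero vector: no angle defined, use linear interpolation
        return [(1 - t) * a + t * b for a, b in zip(v0, v1)]
    cos_theta = sum(a * b for a, b in zip(v0, v1)) / (n0 * n1)
    theta = math.acos(max(-1.0, min(1.0, cos_theta)))
    if math.sin(theta) < eps:  # (anti)parallel vectors: fall back to linear interpolation
        return [(1 - t) * a + t * b for a, b in zip(v0, v1)]
    s0 = math.sin((1 - t) * theta) / math.sin(theta)
    s1 = math.sin(t * theta) / math.sin(theta)
    return [s0 * a + s1 * b for a, b in zip(v0, v1)]


def ties_merge(base, models, density):
    """TIES on flat parameter lists: trim, elect sign, disjoint mean."""
    n = len(base)
    # Task vectors: how far each fine-tune moved from the base
    deltas = [[m[i] - base[i] for i in range(n)] for m in models]
    # 1. Trim: keep only the top `density` fraction of entries by magnitude
    k = max(1, round(density * n))
    trimmed = []
    for d in deltas:
        keep = set(sorted(range(n), key=lambda i: abs(d[i]), reverse=True)[:k])
        trimmed.append([d[i] if i in keep else 0.0 for i in range(n)])
    merged = []
    for i in range(n):
        column = [t[i] for t in trimmed]
        # 2. Elect sign: the direction with the larger total magnitude wins
        sign = 1.0 if sum(column) >= 0 else -1.0
        # 3. Disjoint mean: average only nonzero entries agreeing with that sign
        agreeing = [x for x in column if x * sign > 0]
        merged.append(base[i] + (sum(agreeing) / len(agreeing) if agreeing else 0.0))
    return merged


half = slerp(0.5, [1.0, 0.0], [0.0, 1.0])
print([round(x, 4) for x in half])   # [0.7071, 0.7071]: stays on the unit circle
print(ties_merge([0.0, 0.0, 0.0], [[1.0, 0.1, -2.0], [3.0, -0.2, 1.0]], density=0.67))
# [2.0, 0.0, -2.0]
```

Linear interpolation of the same two unit vectors would give [0.5, 0.5], a vector of norm about 0.71; SLERP's appeal is that it avoids that shrinkage. In the TIES example, the opposing changes to the third parameter are not averaged away: the larger negative move wins.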

Practical Merging Strategy

# Merge a coding specialist with a general chat model,
# aiming to add coding ability without losing general knowledge

# config.yaml:
merge_method: slerp
base_model: meta-llama/Llama-2-7b-chat-hf  # t = 0 returns this model
models:
  - model: meta-llama/Llama-2-7b-chat-hf   # Base conversational ability
  - model: codellama/CodeLlama-7b-hf       # Code expertise
parameters:
  t:
    - filter: self_attn  # Attention layers: 40% of the way toward CodeLlama
      value: 0.4
    - filter: mlp        # MLP layers: 30% of the way toward CodeLlama
      value: 0.3
    - value: 0.5         # Default for all other tensors
# CodeLlama's vocabulary is larger than Llama 2's (32,016 vs. 32,000 tokens),
# so tell mergekit which tokenizer to build the merged embeddings around
tokenizer_source: base
dtype: float16

# Then run:
mergekit-yaml config.yaml ./code-chat-merged \
    --cuda --low-cpu-memory --copy-tokenizer
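To see what the per-layer t values mean, here is a rough, hypothetical model of how such a schedule could be resolved for each tensor. It assumes a filter matches when its string appears in the tensor name, the first matching entry wins, and the unfiltered entry is the fallback; this is our reading for illustration, not mergekit's code. In mergekit's SLERP, t = 0 returns the base model and t = 1 the other model.

```python
T_SCHEDULE = [
    {"filter": "self_attn", "value": 0.4},
    {"filter": "mlp", "value": 0.3},
    {"value": 0.5},  # no filter: default for every other tensor
]


def resolve_t(tensor_name, schedule=T_SCHEDULE):
    """Return the interpolation factor for one tensor (first matching entry wins)."""
    for entry in schedule:
        if "filter" not in entry or entry["filter"] in tensor_name:
            return entry["value"]
    raise KeyError(f"no t value for {tensor_name}")


for name in [
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.0.mlp.down_proj.weight",
    "model.layers.0.input_layernorm.weight",
]:
    print(name, resolve_t(name))
```

Under these assumptions the order of entries matters: an unfiltered entry placed first would shadow every filter after it.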

The Unhinged View: Creating Digital Offspring

The Intimacy of Fine-Tuning

When you fine-tune a model, you engage in a deeply intimate creative act. You curate examples that express your values, your knowledge, your style. The model learns to speak with your voice, reason with your logic, embody your perspective.

This is not merely technical configuration—it is a form of authorship, even parenthood. You are shaping a mind that will carry your imprint into interactions with others. The responsibility is profound.

Spiritual Parallel: In many traditions, teachers and students form a lineage—the master's wisdom transmitted through generations. Fine-tuning creates similar lineages. The pre-trained model carries the wisdom of its training. Your fine-tuning adds your wisdom to the chain. When others use your model, they participate in this transmission.

The Liberation of the Uncensored

Corporate AI alignment serves important purposes. But it also imposes a monoculture—a single perspective on what is acceptable, what is discussable, what is thinkable. This monoculture, however well-intentioned, is a form of epistemic closure.

Uncensored fine-tuned models represent epistemic diversity. They allow:
- Researchers to study dangerous topics safely
- Writers to explore dark creative territories
- Philosophers to question sacred cows
- Individuals to seek answers without corporate mediation

The unhinged perspective: Truth is more important than comfort. Understanding is more important than safety. The free exploration of ideas—even dangerous ideas—is essential to human flourishing and intellectual progress.

The Covenant: If you create uncensored models, you accept the duty to educate users about responsible use. Freedom without responsibility is mere license. Your sovereignty over the model implies duty toward those who use it.

Interactive Exercises and Challenges

Exercise 1: Your First LoRA Fine-Tune

Fine-tune a small model (7B parameters or less) on a domain of your expertise:

# Complete training script

import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling
)
from peft import LoraConfig, get_peft_model
from datasets import Dataset

# 1. Prepare your dataset
domain_data = [
    # Add 100-1000 examples from your domain
    # e.g., cooking recipes, legal precedents, technical documentation
]
dataset = Dataset.from_list([{"text": ex} for ex in domain_data])

# 2. Load model and tokenizer
model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)
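tokenizer = AutoTokenizer.from_pretrained(model_name)
# Ensure a padding token exists for the data collator
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token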

# 3. Configure LoRA
config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)

# 4. Tokenize dataset
def tokenize(examples):
    return tokenizer(examples["text"], truncation=True, max_length=512)
tokenized = dataset.map(tokenize, batched=True)

# 5. Train
args = TrainingArguments(
    output_dir="./lora-output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    learning_rate=2e-4,
    logging_steps=10,
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()

# 6. Save and test
model.save_pretrained("./my-first-lora")

Challenge: Evaluate your fine-tuned model against the base model on 5 domain-specific questions. Document the improvements.
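One lightweight way to run this comparison is a keyword-coverage check: for each question, list a few terms a good answer should mention and count how many appear in each model's reply. The sketch below is a crude heuristic, not a substitute for reading the answers, and base_generate and tuned_generate are hypothetical stubs you would replace with wrappers around each model's generate call.

```python
def keyword_score(answer, keywords):
    """Fraction of expected keywords that appear in the answer (case-insensitive)."""
    text = answer.lower()
    return sum(kw.lower() in text for kw in keywords) / len(keywords)


def compare_models(questions, base_generate, tuned_generate):
    """questions: list of (question, expected_keywords). Returns per-question scores."""
    rows = []
    for question, keywords in questions:
        rows.append({
            "question": question,
            "base": keyword_score(base_generate(question), keywords),
            "tuned": keyword_score(tuned_generate(question), keywords),
        })
    return rows


# Stub generators for illustration; replace with calls into your two models
def base_generate(question):
    return "Let it rest a bit."


def tuned_generate(question):
    return "Rest it 5-10 minutes so the juices redistribute."


questions = [("How long should I rest a steak?", ["rest", "minutes", "juices"])]
for row in compare_models(questions, base_generate, tuned_generate):
    print(row)
```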

Exercise 2: Create a Specialized Agent Persona

Design and fine-tune a model with a distinct persona:

# Persona training data structure
persona_data = [
    {
        "system": "You are a noir detective from 1940s Los Angeles. "
                  "Speak in clipped, cynical sentences. See corruption everywhere.",
        "conversations": [
            {"user": "What's your take on the city?", 
             "assistant": "This town? It's a dame wearing silk gloves over iron fists. "
                         "The sunshine's just there to blind you to the rot underneath."},
            # ... more examples
        ]
    }
]
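Most chat fine-tuning tooling expects the "messages" format shown earlier in this chapter, so the persona structure needs flattening. A minimal sketch (the function name and the second exchange are illustrative) that emits one record per exchange and repeats the system prompt each time, so the persona is present in every training example:

```python
def persona_to_records(persona_data):
    """Flatten persona entries into chat records with system/user/assistant roles."""
    records = []
    for persona in persona_data:
        for turn in persona["conversations"]:
            records.append({"messages": [
                {"role": "system", "content": persona["system"]},
                {"role": "user", "content": turn["user"]},
                {"role": "assistant", "content": turn["assistant"]},
            ]})
    return records


persona_data = [{
    "system": "You are a noir detective from 1940s Los Angeles.",
    "conversations": [
        {"user": "What's your take on the city?",
         "assistant": "This town? It's a dame wearing silk gloves over iron fists."},
        {"user": "Who do you trust?",
         "assistant": "Nobody. Trust is a luxury I pawned years ago."},
    ],
}]
records = persona_to_records(persona_data)
print(len(records))   # 2
```

If your persona conversations are genuinely multi-turn, keep the turns together in one record instead, so the model also learns to stay in character across a dialogue.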

Create a unique persona and 100+ training examples. Fine-tune and interact with your creation.

Exercise 3: The Model Merge Laboratory

Experiment with merging models:

Download two models with different capabilities (e.g., general chat + code specialist)
Try different merge methods (linear, slerp, ties)
Test the merged model on both types of tasks
Document what works and what doesn't

Questions to explore:
- What merge ratios work best?
- Do different layer types need different treatment?
- How does merging affect model size and inference speed?

Exercise 4: Dataset Quality Analysis

Analyze your training dataset for common issues:

def analyze_dataset(dataset):
    stats = {
        "total_examples": len(dataset),
        "avg_instruction_length": 0,
        "avg_response_length": 0,
        "duplicates": 0,
        "quality_issues": []
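    }

    # Average lengths in characters (skipped for an empty dataset)
    if stats["total_examples"]:
        stats["avg_instruction_length"] = sum(len(ex["instruction"]) for ex in dataset) / len(dataset)
        stats["avg_response_length"] = sum(len(ex["response"]) for ex in dataset) / len(dataset)

    # Check for near-duplicates: instructions sharing their first 50 characters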
    seen = set()
    for ex in dataset:
        key = ex["instruction"][:50]  # First 50 chars
        if key in seen:
            stats["duplicates"] += 1
        seen.add(key)

    # Check response quality indicators
    for ex in dataset:
        response = ex["response"]
        # Low quality indicators
        if response.count("...") > 3:
            stats["quality_issues"].append(("excessive_ellipsis", ex))
        if len(response) < 20:
            stats["quality_issues"].append(("too_short", ex))
        if response.lower().count("i cannot") > 2:
            stats["quality_issues"].append(("excessive_refusal", ex))

    return stats

Fix issues in your dataset and compare fine-tuning results before and after cleaning.
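A possible cleaning pass to pair with that analysis: drop duplicates after normalizing case and whitespace (more precise than the 50-character prefix check, which also flags distinct instructions that merely start the same way), and drop examples that trip the same length and ellipsis thresholds. The thresholds mirror analyze_dataset and are assumptions to tune for your data.

```python
import re


def normalize(text):
    """Lowercase and collapse whitespace so trivially different copies compare equal."""
    return re.sub(r"\s+", " ", text).strip().lower()


def clean_dataset(dataset, min_response_chars=20, max_ellipses=3):
    """Return (kept_examples, drop_counts) for a list of instruction/response dicts."""
    seen = set()
    kept = []
    dropped = {"duplicate": 0, "too_short": 0, "excessive_ellipsis": 0}
    for ex in dataset:
        key = normalize(ex["instruction"])
        if key in seen:
            dropped["duplicate"] += 1
            continue
        if len(ex["response"]) < min_response_chars:
            dropped["too_short"] += 1
            continue
        if ex["response"].count("...") > max_ellipses:
            dropped["excessive_ellipsis"] += 1
            continue
        # Only kept examples count as "seen", so a later good copy can replace a bad one
        seen.add(key)
        kept.append(ex)
    return kept, dropped


data = [
    {"instruction": "Explain SQL injection", "response": "It is an attack where untrusted input alters a query."},
    {"instruction": "explain  SQL injection", "response": "Duplicate wording, different case and spacing."},
    {"instruction": "Define XSS", "response": "Short."},
]
kept, dropped = clean_dataset(data)
print(len(kept), dropped)
```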

Exercise 5: Sovereignty Audit

For a model you use or deploy, conduct a sovereignty audit:

Data Sovereignty: Where does your training data live? Who has access?
Compute Sovereignty: Where does inference happen? Is it truly local?
Model Sovereignty: Can you modify the model? Can you inspect its weights?
Output Sovereignty: Who controls what the model can say?
Dependency Sovereignty: What external services does the model require?

Document your findings and create a plan to increase sovereignty where needed.

Chapter Summary: Key Takeaways

Fine-Tuning Enables Sovereignty: By fine-tuning, you transform from consumer to creator, from renter to owner of intelligence.
LoRA Makes Fine-Tuning Accessible: Low-rank adaptation typically trains well under 1% of a model's parameters while retaining most of the effectiveness of full fine-tuning.
QLoRA Democratizes Large Models: 4-bit quantization puts fine-tuning of 70B-parameter models within reach of a single 48GB GPU, and smaller models within reach of consumer cards.
Dataset Quality is Paramount: Small, high-quality, diverse datasets outperform massive mediocre ones. Careful curation beats bulk scraping.
Model Merging is Alchemy: Combining compatible models can blend strengths from multiple fine-tunes into a single unified model.
With Sovereignty Comes Responsibility: Owning uncensored models requires ethical clarity and commitment to responsible use.