Learn AI Series (#70) - Running Local Models

in StemSocial · 4 days ago

What will I learn

  • You will learn why running models locally matters: privacy, cost, control, and offline capability;
  • the local inference stack: Ollama, llama.cpp, and vLLM;
  • quantization formats: GGUF, GPTQ, AWQ and what they trade off;
  • model selection: which local model for which task;
  • hardware realities: what actually matters (VRAM is king);
  • running and comparing local models in practice.

Requirements

  • A working modern computer running macOS, Windows or Ubuntu;
  • An installed Python 3(.11+) distribution;
  • The ambition to learn AI and machine learning.

Difficulty

  • Beginner

Curriculum (of the Learn AI Series):

Learn AI Series (#70) - Running Local Models

Solutions to Episode #69 Exercises

Exercise 1: LoRA parameter calculator and comparison tool.

def lora_analysis(hidden_dim, num_layers, ranks, target_modules):
    """Calculate LoRA params, memory, and file size per rank."""
    modules_per_layer = len(target_modules)
    full_params_per_module = hidden_dim * hidden_dim
    total_full = full_params_per_module * modules_per_layer * num_layers

    print(f"Model config: d={hidden_dim}, {num_layers} layers, "
          f"{modules_per_layer} modules/layer")
    print(f"Full model params: {total_full:,}")
    print(f"\n{'Rank':>6} {'LoRA Params':>14} {'% of Full':>10} "
          f"{'Savings (GB)':>13} {'Adapter (MB)':>13}")
    print("-" * 60)

    results = {}
    for r in ranks:
        # Each LoRA module: A is (d_in x r) + B is (r x d_out)
        lora_per_module = hidden_dim * r + r * hidden_dim
        total_lora = lora_per_module * modules_per_layer * num_layers
        pct = total_lora / total_full * 100

        # Memory: full model in fp16 vs LoRA in fp16
        full_gb = total_full * 2 / 1e9
        lora_gb = total_lora * 2 / 1e9
        savings_gb = full_gb - lora_gb

        # Adapter file size (just the LoRA weights)
        adapter_mb = total_lora * 2 / 1e6

        results[r] = {
            "params": total_lora,
            "pct": pct,
            "savings_gb": savings_gb,
            "adapter_mb": adapter_mb,
        }
        print(f"{r:>6} {total_lora:>14,} {pct:>9.3f}% "
              f"{savings_gb:>12.2f} {adapter_mb:>12.1f}")

    return results


def find_optimal_rank(results, budget_mb):
    """Find highest rank that fits within storage budget."""
    best_rank = None
    for rank in sorted(results.keys()):
        if results[rank]["adapter_mb"] <= budget_mb:
            best_rank = rank
    return best_rank


# Test with realistic config
ranks = [4, 8, 16, 32, 64]
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj"]
results = lora_analysis(4096, 32, ranks, target_modules)

# Budget search
print("\nOptimal rank for budget:")
for budget in [10, 50, 100, 500]:
    optimal = find_optimal_rank(results, budget)
    if optimal:
        mb = results[optimal]["adapter_mb"]
        print(f"  {budget:>4} MB budget -> rank {optimal} "
              f"({mb:.1f} MB adapter)")
    else:
        print(f"  {budget:>4} MB budget -> no rank fits")

The key insight is how dramatically LoRA reduces storage. With the config above (attention projections only, d=4096, 32 layers), a rank-16 adapter comes to about 34MB in fp16 -- compare that to roughly 14GB for a full 7B model. That reduction of roughly 400x is what makes serving dozens of fine-tuned variants practical. The find_optimal_rank function is a simple greedy search: pick the biggest rank your storage can handle, since higher rank generally means better adaptation quality (up to a point of diminishing returns, usually around rank 32-64).

Exercise 2: Dataset quality checker for fine-tuning.

import statistics

class DatasetValidator:
    """Validate fine-tuning dataset quality."""

    def __init__(self, examples):
        self.examples = examples
        self.issues = []

    def validate(self):
        """Run all quality checks."""
        self.issues = []
        self._check_required_fields()
        self._check_duplicates()
        self._check_empty_outputs()
        self._check_length_outliers("instruction")
        self._check_length_outliers("output")
        self._check_format_consistency()
        return self.issues

    def _check_required_fields(self):
        required = {"instruction", "output"}
        for i, ex in enumerate(self.examples):
            missing = required - set(ex.keys())
            if missing:
                self.issues.append({
                    "severity": "error",
                    "check": "required_fields",
                    "index": i,
                    "detail": f"Missing: {missing}",
                })

    def _check_duplicates(self):
        seen = {}
        for i, ex in enumerate(self.examples):
            inst = ex.get("instruction", "")
            if inst in seen:
                self.issues.append({
                    "severity": "error",
                    "check": "duplicate",
                    "index": i,
                    "detail": f"Duplicate of index {seen[inst]}",
                })
            else:
                seen[inst] = i

    def _check_empty_outputs(self):
        for i, ex in enumerate(self.examples):
            out = ex.get("output", "")
            if not out or not out.strip():
                self.issues.append({
                    "severity": "error",
                    "check": "empty_output",
                    "index": i,
                    "detail": "Output is empty or whitespace",
                })

    def _check_length_outliers(self, field):
        lengths = []
        for ex in self.examples:
            val = ex.get(field, "")
            if val:
                lengths.append(len(val))
        if len(lengths) < 3:
            return
        mean = statistics.mean(lengths)
        stdev = statistics.stdev(lengths)
        if stdev == 0:
            return
        for i, ex in enumerate(self.examples):
            val = ex.get(field, "")
            if val and abs(len(val) - mean) > 2 * stdev:
                self.issues.append({
                    "severity": "warning",
                    "check": f"{field}_length_outlier",
                    "index": i,
                    "detail": (f"Length {len(val)} is >2 std devs "
                               f"from mean {mean:.0f}"),
                })

    def _check_format_consistency(self):
        endings = {"period": 0, "no_period": 0}
        trailing_ws = 0
        for ex in self.examples:
            out = ex.get("output", "")
            if not out:
                continue
            if out.rstrip() != out:
                trailing_ws += 1
            if out.rstrip().endswith("."):
                endings["period"] += 1
            else:
                endings["no_period"] += 1

        total = endings["period"] + endings["no_period"]
        if total > 0:
            minority = min(endings.values())
            if 0 < minority < total * 0.3:
                for i, ex in enumerate(self.examples):
                    out = ex.get("output", "").rstrip()
                    has_period = out.endswith(".")
                    is_minority = (
                        (has_period and endings["period"] < endings["no_period"])
                        or (not has_period and endings["no_period"] < endings["period"])
                    )
                    if is_minority and out:
                        self.issues.append({
                            "severity": "warning",
                            "check": "format_inconsistency",
                            "index": i,
                            "detail": "Ending punctuation differs from majority",
                        })

        if trailing_ws > 0:
            for i, ex in enumerate(self.examples):
                out = ex.get("output", "")
                if out and out.rstrip() != out:
                    self.issues.append({
                        "severity": "warning",
                        "check": "trailing_whitespace",
                        "index": i,
                        "detail": "Output has trailing whitespace",
                    })

    def report(self):
        """Print quality report."""
        issues = self.validate()
        errors = [i for i in issues if i["severity"] == "error"]
        warnings = [i for i in issues if i["severity"] == "warning"]

        print(f"Dataset Quality Report")
        print(f"  Total examples: {len(self.examples)}")
        print(f"  Errors:   {len(errors)}")
        print(f"  Warnings: {len(warnings)}")

        if errors:
            print(f"\n  ERRORS:")
            for e in errors:
                print(f"    [{e['index']:>3}] {e['check']}: "
                      f"{e['detail']}")
        if warnings:
            print(f"\n  WARNINGS:")
            for w in warnings:
                print(f"    [{w['index']:>3}] {w['check']}: "
                      f"{w['detail']}")
        return issues


# Generate 30 test examples (5 with deliberate issues)
examples = []
for i in range(25):
    examples.append({
        "instruction": f"Summarize the concept of topic_{i}.",
        "output": f"Topic_{i} is a concept that involves "
                  f"specific principles and applications.",
    })

# Issue 1: missing field
examples.append({"instruction": "Explain gravity."})
# Issue 2: duplicate instruction
examples.append({
    "instruction": "Summarize the concept of topic_0.",
    "output": "Duplicate entry here.",
})
# Issue 3: empty output
examples.append({
    "instruction": "What is entropy?",
    "output": "   ",
})
# Issue 4: extreme length
examples.append({
    "instruction": "A" * 2000,
    "output": "Very long instruction above.",
})
# Issue 5: format inconsistency (no period)
examples.append({
    "instruction": "Define neural networks.",
    "output": "Neural networks are computing systems "
              "inspired by biological brains",
})

validator = DatasetValidator(examples)
validator.report()

Garbage in, garbage out. This validator catches the five most common dataset problems before they corrupt your fine-tuning run. The severity distinction matters: errors (missing fields, duplicates, empty outputs) should block training entirely, while warnings (length outliers, format inconsistencies) are worth reviewing but might be intentional. I've seen people burn 4 hours of GPU time on a dataset with duplicate entries that taught the model to repeat itself -- 30 seconds of validation would have caught it.

Exercise 3: Fine-tuning experiment tracker.

import random
import math

class FTExperimentTracker:
    """Track and compare fine-tuning experiments."""

    def __init__(self):
        self.experiments = []

    def log_experiment(self, name, hyperparams, train_log,
                       eval_scores, compute):
        """Log a complete experiment."""
        self.experiments.append({
            "name": name,
            "hyperparams": hyperparams,
            "train_log": train_log,
            "eval_scores": eval_scores,
            "compute": compute,
        })

    def compare(self):
        """Print comparison table ranked by eval loss."""
        sorted_exps = sorted(
            self.experiments,
            key=lambda e: e["train_log"][-1]["eval_loss"])

        print(f"{'Name':<18} {'Rank':>4} {'LR':>8} "
              f"{'Alpha':>6} {'Final Loss':>11} "
              f"{'Eval Loss':>10} {'Params':>10} "
              f"{'GPU-hrs':>8}")
        print("-" * 90)
        for exp in sorted_exps:
            hp = exp["hyperparams"]
            last = exp["train_log"][-1]
            print(f"{exp['name']:<18} {hp['rank']:>4} "
                  f"{hp['lr']:>8.1e} {hp['alpha']:>6} "
                  f"{last['train_loss']:>11.4f} "
                  f"{last['eval_loss']:>10.4f} "
                  f"{hp['total_params']:>10,} "
                  f"{exp['compute']['gpu_hours']:>8.2f}")

    def recommend(self):
        """Pick best experiment considering perf + efficiency."""
        if not self.experiments:
            return None

        sorted_exps = sorted(
            self.experiments,
            key=lambda e: e["train_log"][-1]["eval_loss"])

        best = sorted_exps[0]
        simplest_params = min(
            e["hyperparams"]["total_params"]
            for e in self.experiments)

        # Penalize configs using >2x params of simplest
        # for <5% improvement over next-simplest
        for i, exp in enumerate(sorted_exps):
            params = exp["hyperparams"]["total_params"]
            eval_loss = exp["train_log"][-1]["eval_loss"]

            if params > 2 * simplest_params:
                # Check if the improvement is worth it
                simpler = [e for e in sorted_exps
                           if e["hyperparams"]["total_params"]
                           <= 2 * simplest_params]
                if simpler:
                    simpler_loss = simpler[0]["train_log"][-1]["eval_loss"]
                    improvement = (simpler_loss - eval_loss) / simpler_loss
                    if improvement < 0.05:
                        best = simpler[0]
                        break

        hp = best["hyperparams"]
        last = best["train_log"][-1]
        print(f"\nRecommended: {best['name']}")
        print(f"  Rank: {hp['rank']}, LR: {hp['lr']:.1e}, "
              f"Alpha: {hp['alpha']}")
        print(f"  Eval loss: {last['eval_loss']:.4f}, "
              f"Params: {hp['total_params']:,}")
        return best


# Simulate 4 experiments
tracker = FTExperimentTracker()
random.seed(42)

configs = [
    {"rank": 4,  "lr": 2e-4, "alpha": 8,   "epochs": 3},
    {"rank": 16, "lr": 2e-4, "alpha": 32,  "epochs": 3},
    {"rank": 32, "lr": 1e-4, "alpha": 64,  "epochs": 3},
    {"rank": 64, "lr": 1e-4, "alpha": 128, "epochs": 3},
]

for cfg in configs:
    d = 4096
    modules = 4  # q, k, v, o
    layers = 32
    lora_params = 2 * d * cfg["rank"] * modules * layers
    cfg["total_params"] = lora_params

    # Simulate training: decreasing loss with noise
    train_log = []
    base_loss = 2.5 - (cfg["rank"] / 100)
    for step in range(50):
        t = (step + 1) / 50
        decay = base_loss * math.exp(-3 * t)
        noise = random.gauss(0, 0.02)
        train_loss = max(0.1, decay + 0.15 + noise)
        eval_noise = random.gauss(0, 0.03)
        # Higher rank = slightly lower eval loss
        rank_bonus = cfg["rank"] * 0.0003
        eval_loss = max(0.12, train_loss + 0.05
                        - rank_bonus + eval_noise)
        train_log.append({
            "step": step + 1,
            "train_loss": train_loss,
            "eval_loss": eval_loss,
        })

    # Simulated compute
    gpu_hours = 0.5 + cfg["rank"] * 0.03

    tracker.log_experiment(
        name=f"lora_r{cfg['rank']}_lr{cfg['lr']:.0e}",
        hyperparams=cfg,
        train_log=train_log,
        eval_scores={"final_eval_loss": train_log[-1]["eval_loss"]},
        compute={"gpu_hours": gpu_hours, "peak_mem_gb": 6 + cfg["rank"] * 0.1},
    )

tracker.compare()
tracker.recommend()

The recommend() method is where the practical wisdom lives. Raw performance ranking would always pick the biggest model -- more parameters almost always means slightly better eval loss. But the marginal gain from rank 32 to rank 64 is often tiny (less than 5% improvement) while doubling the parameter count, training time, and adapter storage. The penalty function catches this: if a complex config doesn't meaningfully outperform a simpler one, pick the simpler one. In production, "good enough with half the resources" beats "marginally better at double the cost" almost every time.

On to today's episode

Here we go! Over the last few episodes we've gone deep on working with language models from the outside -- API calls (episode #66), building agent systems on top of them (#67, #68), and customizing them through fine-tuning (#69). But in every single one of those scenarios, the model lives on someone else's server. Your prompts travel over the network to a provider, get processed on their hardware, and the response comes back. You're renting intelligence by the token.

That works for a lot of use cases. But every API call is a request you don't fully control. The provider can change pricing tomorrow, add rate limits, modify the model's behavior, discontinue the endpoint, or (depending on their terms of service) peek at your data. For many applications that's an acceptable trade-off. For others -- medical records, proprietary code, financial data, air-gapped environments, or simply keeping your monthly bill predictable -- you want the model running on YOUR hardware, under YOUR control.

And here's what's remarkable: local inference has gotten really good. Models that needed a data center two years ago now run on a laptop. Let me show you how ;-)

Why local?

Four reasons keep coming up when people move to local inference, and they're all legitimate:

Privacy. Your data never leaves your machine. No terms of service granting the provider some vague rights to your inputs. No wondering whether your prompts end up in a training dataset somewhere. For regulated industries (healthcare, finance, legal), local inference can be a hard compliance requirement rather than a preference.

Cost. API calls add up fast. A busy application making thousands of calls per day can run into hundreds or thousands of dollars monthly. A local model has a one-time hardware cost and near-zero marginal cost per query. Once you own the GPU, every inference is essentially free (minus electricity, which is pennies compared to API pricing).

Control. You pick the model, the version, the quantization level, the serving configuration. No surprise model updates that change output behavior. No dependency on an external service's uptime. Your system works during internet outages. If you need to reproduce results from three months ago, the model file hasn't changed.

Latency. No network round trip. For applications where response time matters (code completion, real-time assistants, interactive tools), local inference on good hardware can be faster than API calls. No waiting for network hops, no queuing behind other users, no provider-side rate limiting.

The tradeoff is capability. Local models are smaller and less capable than the frontier API models. A 7B parameter model running locally won't match GPT-4 or Claude on complex multi-step reasoning tasks. But for focused applications -- code completion, summarization, classification, entity extraction, simple Q&A -- smaller models are often good enough. And "good enough with no network latency and near-zero cost per query" beats "slightly better at $0.01 per call" for many production scenarios.
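The cost argument above boils down to a break-even calculation. Here is a minimal sketch of it; the function name and all dollar figures are made up for illustration, so plug in your own numbers.

```python
def breakeven_months(hardware_cost, monthly_api_cost, monthly_power_cost=0.0):
    """Months until a one-time hardware purchase beats a recurring API bill."""
    monthly_savings = monthly_api_cost - monthly_power_cost
    if monthly_savings <= 0:
        return None  # at these numbers, local inference never pays for itself
    return hardware_cost / monthly_savings


# Hypothetical figures, purely for illustration
months = breakeven_months(hardware_cost=2000,
                          monthly_api_cost=400,
                          monthly_power_cost=25)
print(f"Break-even after ~{months:.1f} months")
```

If the function returns None, your API bill is too small for the hardware to pay off, and the decision comes down to privacy and control rather than cost.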

The inference stack

Three tools dominate local inference right now, and each occupies a different niche:

Ollama is the "just works" option. Install it, pull a model, run it. It handles quantization, memory management, model downloading, and exposes an OpenAI-compatible API. That last part is huge -- if you built API clients in episode #66, your existing code works with Ollama by changing one base URL (and the model name). Nothing else in your code has to change to switch between cloud and local inference.

# Install Ollama (Linux; macOS and Windows installers are on ollama.com)
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run a model
ollama pull llama3.1:8b
ollama run llama3.1:8b "Explain gradient descent in one paragraph"

# Or use the API (OpenAI-compatible!)
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:8b",
    "messages": [{"role": "user", "content": "Hello"}]
  }'

# Your existing OpenAI client code works unchanged
from openai import OpenAI

# Just point it at Ollama instead of OpenAI
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="not-needed",  # Ollama doesn't require a key
)

response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user",
               "content": "What is backpropagation?"}],
    max_tokens=256,
)
print(response.choices[0].message.content)
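Since the only difference between cloud and local is the client configuration, one way to make the switch explicit is a small helper that returns the right keyword arguments. This is a sketch under assumptions: the LLM_BACKEND environment variable and the client_kwargs helper are names I made up here, not part of Ollama or the OpenAI library.

```python
import os

# Connection settings per backend; the "local" entry matches the Ollama example
BACKENDS = {
    "local": {"base_url": "http://localhost:11434/v1", "api_key": "not-needed"},
    "cloud": {"base_url": "https://api.openai.com/v1", "api_key": None},
}


def client_kwargs(backend=None):
    """Return keyword arguments for OpenAI(...) for the chosen backend."""
    # LLM_BACKEND is a hypothetical variable name chosen for this sketch
    backend = backend or os.environ.get("LLM_BACKEND", "local")
    if backend not in BACKENDS:
        raise ValueError(f"Unknown backend: {backend!r}")
    cfg = dict(BACKENDS[backend])
    if cfg["api_key"] is None:
        # Cloud keys come from the environment, never from source code
        cfg["api_key"] = os.environ.get("OPENAI_API_KEY", "")
    return cfg


print(client_kwargs("local"))
```

With that in place, the client becomes OpenAI(**client_kwargs()) and the rest of your code stays identical for both backends (apart from the model name).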

llama.cpp is the engine underneath Ollama (and many other tools). Written in C/C++, it provides maximum performance and flexibility. If you need custom quantization options, batch processing, fine-grained memory allocation control, or embedding generation with specific parameters, llama.cpp is where you go. It's lower-level -- you manage model files and configuration directly.

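# Build from source (recent versions build with CMake; the old make build was removed)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j

# Run inference directly (binaries land in build/bin/)
./build/bin/llama-cli -m models/llama-3.1-8b-q4_k_m.gguf \
  -p "The transformer architecture" -n 256

# Start an API server
./build/bin/llama-server -m models/llama-3.1-8b-q4_k_m.gguf --port 8080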

vLLM is optimized for throughput -- serving many requests concurrently. It uses PagedAttention to manage GPU memory efficiently, achieving 2-4x higher throughput than naive implementations. The key idea: instead of pre-allocating contiguous memory for each request's KV cache (which wastes memory on padding), PagedAttention allocates memory in pages, like a modern operating system manages RAM. Use vLLM when you're serving a model to multiple users simultaneously or processing large batches.

pip install vllm
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --dtype float16 \
  --max-model-len 4096
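To see why paging helps, here is a toy comparison of the two allocation strategies. It only counts reserved token slots; it is not how vLLM is implemented internally, and the request lengths and the page size of 16 tokens are assumptions chosen for illustration.

```python
import math


def contiguous_slots(request_lengths, max_len):
    """Pre-allocate max_len token slots per request, regardless of actual use."""
    return len(request_lengths) * max_len


def paged_slots(request_lengths, page_size=16):
    """Allocate whole pages on demand; waste is under one page per request."""
    return sum(math.ceil(n / page_size) * page_size for n in request_lengths)


lengths = [37, 512, 90, 1200]  # hypothetical tokens actually used per request
used = sum(lengths)

for name, slots in [("contiguous", contiguous_slots(lengths, 4096)),
                    ("paged", paged_slots(lengths))]:
    print(f"{name:<11} reserved={slots:>6} slots, "
          f"utilization={used / slots:.1%}")
```

The contiguous strategy reserves the full context window for every request, so short requests leave most of their reservation empty. With pages, the memory you save can hold more concurrent requests, which is where the throughput gain comes from.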

The practical guidance: Ollama for individual use and development. llama.cpp for maximum control and custom setups. vLLM for production serving with multiple concurrent users.

Quantization: making big models fit

Here's the core problem. A 7B parameter model in float16 (2 bytes per parameter) is about 14GB. A 70B model is 140GB. Most consumer GPUs have 8-24GB of VRAM. The math doesn't work -- the models are simply too big to fit.
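The arithmetic is simple enough to put in a helper. This sketch counts weight storage only; it ignores the KV cache and runtime overhead, and real quantized files come out slightly larger because of scaling factors. The function name is my own.

```python
def weights_gb(params_billion, bits_per_weight):
    """Approximate weight storage in GB: parameters x bits, converted to bytes."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9


for params in (7, 13, 70):
    sizes = ", ".join(
        f"{bits}-bit: {weights_gb(params, bits):6.1f} GB"
        for bits in (16, 8, 4))
    print(f"{params:>3}B  {sizes}")
```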

Quantization compresses model weights by using fewer bits per number. Instead of storing each weight as a 16-bit float, you store it as an 8-bit, 4-bit, or even 2-bit integer (plus some scaling factors). This trades a small amount of output quality for dramatically less memory. And the quality tradeoff is surprisingly small -- the model's actual reasoning ability degrades much less than you'd expect from cutting the precision in half or more.
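To make "integers plus a scaling factor" concrete, here is a minimal sketch of symmetric absmax quantization, the simplest scheme there is. Real formats like GGUF's K-quants are more elaborate (they quantize in small blocks with per-block scales), so treat this as the basic idea only. The weight values are made up.

```python
def quantize_absmax(weights, bits):
    """Map floats to signed integers in [-qmax, qmax] using one shared scale."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax
    if scale == 0:
        return [0] * len(weights), 1.0  # all-zero input: nothing to scale
    return [round(w / scale) for w in weights], scale


def dequantize(q, scale):
    """Recover approximate floats from the integers and the scale."""
    return [v * scale for v in q]


weights = [0.42, -1.30, 0.07, 0.88, -0.15, 0.61, -0.93, 0.02]  # made-up values

for bits in (8, 4, 2):
    q, scale = quantize_absmax(weights, bits)
    restored = dequantize(q, scale)
    max_err = max(abs(a - b) for a, b in zip(weights, restored))
    print(f"{bits}-bit: ints={q}  max error={max_err:.4f}")
```

The rounding error per weight is at most half the scale, and the scale grows as the bit width shrinks. That is why quality holds up well at 8 and 4 bits and falls off quickly below that.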

The key formats you need to know:

GGUF (GPT-Generated Unified Format) is the standard for llama.cpp and Ollama. It supports mixed quantization -- different layers can use different precision levels. Attention layers (where the "reasoning" happens) can stay at higher precision while less critical layers get compressed more aggressively.

Common GGUF quantization levels and what they mean in practice:

# Quantization comparison for a 7B parameter model
quant_levels = [
    ("Q8_0",   8, 8.0,  "Negligible quality loss. Use if it fits."),
    ("Q6_K",   6, 6.0,  "Very close to full precision."),
    ("Q5_K_M", 5, 5.0,  "Good balance of size and quality."),
    ("Q4_K_M", 4, 4.5,  "The sweet spot for most users."),
    ("Q3_K_M", 3, 3.5,  "Noticeable degradation on complex tasks."),
    ("Q2_K",   2, 3.0,  "Significant quality loss. Emergency option."),
]

print(f"{'Format':<10} {'Bits':>5} {'~Size (GB)':>11} Note")
print("-" * 65)
for name, bits, size, note in quant_levels:
    reduction = (1 - size / 14) * 100
    print(f"{name:<10} {bits:>5} {size:>10.1f}  "
          f"({reduction:.0f}% smaller) {note}")

# Using llama-cpp-python to load quantized models
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-3.1-8b-q4_k_m.gguf",
    n_ctx=4096,        # context window size
    n_gpu_layers=-1,   # offload all layers to GPU (-1 = all)
    verbose=False
)

response = llm.create_chat_completion(
    messages=[{
        "role": "user",
        "content": "What is attention in transformers?"
    }],
    max_tokens=512,
    temperature=0.7
)
print(response["choices"][0]["message"]["content"])

GPTQ (Frantar et al., 2022) is a GPU-optimized quantization method. It doesn't just naively round weights to fewer bits -- it runs a calibration pass using real data samples and adjusts the quantized weights to minimize the output error across the entire layer. This compensates for quantization mistakes in one weight by slightly adjusting neighboring weights. GPTQ models are fast on NVIDIA GPUs but require GPU inference -- they don't run on CPU.

AWQ (Activation-aware Weight Quantization) takes a different approach: it analyzes which weights have the biggest impact on activations and preserves those at higher precision. The insight is that a small fraction of weights (roughly 1%) are disproportionately important -- quantization errors in those weights cause much larger output errors than errors in the remaining 99%. AWQ often achieves better quality than GPTQ at the same bit width by being smarter about which weights to protect.
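The "small fraction of weights matters most" observation can be illustrated with a toy example. Note the assumption loudly: this sketch simply leaves the salient channels unquantized, which shows the motivation behind AWQ but is not the AWQ algorithm itself (AWQ rescales channels rather than mixing precisions). All numbers are synthetic.

```python
import math


def quantize_roundtrip(weights, bits):
    """Absmax-quantize and immediately dequantize, returning the lossy floats."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax
    return [round(w / scale) * scale for w in weights]


n = 200
weights = [math.sin(i * 1.7) for i in range(n)]  # synthetic weights
act_scale = [0.1] * n                            # typical activation magnitude
act_scale[17] = act_scale[123] = 10.0            # two "salient" channels (1%)

quantized = quantize_roundtrip(weights, bits=3)


def error_bound(protected):
    """Worst-case output error when channels in `protected` stay unquantized."""
    return sum(a * abs(w - q)
               for i, (a, w, q) in enumerate(zip(act_scale, weights, quantized))
               if i not in protected)


# Pick the 1% of channels with the largest activation magnitude
salient = set(sorted(range(n), key=lambda i: act_scale[i], reverse=True)[: n // 100])

print(f"error bound, everything quantized: {error_bound(set()):.3f}")
print(f"error bound, top 1% protected:     {error_bound(salient):.3f}")
```

Because a weight's quantization error gets multiplied by the activation flowing through it, the few channels with large activations dominate the total error. Looking at activations, not just weights, is what tells you which ones to protect.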

The practical rule: start with Q4_K_M in GGUF format. If quality isn't good enough, move up to Q5 or Q6. If you need to squeeze a bigger model into limited VRAM, try Q3. Below Q3, you're usually better off switching to a smaller model at higher quantization -- a 7B model at Q5 will almost always outperform a 13B model at Q2.
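One way to mechanize this rule is a helper that picks the highest precision that fits your VRAM and refuses to go below Q3. This is a sketch, not an official tool: the sizes are the approximate 7B figures from the comparison table above, and the 1.5 GB overhead for KV cache and runtime buffers is an assumption you should adjust.

```python
# Approximate sizes in GB for a 7B model, ordered from highest to lowest quality
QUANT_SIZES_GB = [
    ("Q8_0", 8.0), ("Q6_K", 6.0), ("Q5_K_M", 5.0),
    ("Q4_K_M", 4.5), ("Q3_K_M", 3.5), ("Q2_K", 3.0),
]


def pick_quant(vram_gb, overhead_gb=1.5, floor="Q3_K_M"):
    """Return the highest-quality level that fits, never going below `floor`."""
    budget = vram_gb - overhead_gb
    for name, size in QUANT_SIZES_GB:
        if size <= budget:
            return name
        if name == floor:
            break  # anything lower is not worth it
    return None  # signal: switch to a smaller model instead


for vram in (4, 6, 8, 24):
    choice = pick_quant(vram)
    print(f"{vram:>2} GB VRAM -> {choice or 'use a smaller model'}")
```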

Model selection: which model for what

The local model landscape changes fast. New models appear almost weekly. But some selection principles hold steady regardless of which specific models are trending this month:

For general chat and instruction following: Llama 3.1 (8B, 70B), Mistral (7B), and Qwen 2.5 are strong choices at the time of writing. The 8B class runs comfortably on consumer hardware with 8-16GB VRAM. The 70B class needs 40-48GB VRAM (quantized) or CPU offloading with plenty of RAM (slow but functional).

For code generation: Code Llama, DeepSeek Coder, and StarCoder2 are purpose-built for code. They outperform general models on coding tasks despite being smaller, because their training data is heavily skewed toward code repositories.

For embedding and retrieval: nomic-embed-text, all-MiniLM, and bge models are small, fast, and designed specifically for generating embeddings. These are directly relevant to the RAG systems we built in episodes #63-65 -- you can run your entire retrieval pipeline locally.

For constrained environments: Phi-3 (3.8B) and Gemma 2 (2B) punch well above their weight class. If you're deploying on edge devices or have very limited VRAM, evaluate these first. Their quality-to-size ratio is impressively high.

import ollama

# Compare models on the same task
models = ["llama3.1:8b", "mistral:7b", "qwen2.5:7b", "phi3:3.8b"]
prompt = ("Explain the difference between LoRA and full "
          "fine-tuning in 3 sentences.")

for model_name in models:
    try:
        response = ollama.chat(
            model=model_name,
            messages=[{"role": "user", "content": prompt}]
        )
        content = response["message"]["content"]
        words = len(content.split())
        print(f"\n--- {model_name} ({words} words) ---")
        print(content[:200])
    except Exception as e:
        print(f"\n--- {model_name}: not available ({e}) ---")

Don't just trust benchmarks. Seriously. A model that scores highest on MMLU or HumanEval might not be the best for YOUR specific classification task or YOUR specific document summarization pipeline. Always evaluate on your actual use case with your actual data. Benchmarks tell you about general capability; your deployment requires specific capability ;-)
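Evaluating on your own data doesn't need much machinery. Here is a minimal sketch of a task-specific evaluation loop; the labeled examples are invented and keyword_baseline is a stand-in predictor so the code runs without any model installed.

```python
def evaluate(predict, labeled_examples):
    """Score a predict(text) -> label function against your own labeled data."""
    correct = 0
    failures = []
    for text, expected in labeled_examples:
        got = predict(text).strip().lower()
        if got == expected:
            correct += 1
        else:
            failures.append((text, expected, got))
    return correct / len(labeled_examples), failures


def keyword_baseline(text):
    """Stand-in for a real model call, so the sketch runs without a server."""
    lowered = text.lower()
    if "great" in lowered or "love" in lowered:
        return "positive"
    if "broken" in lowered or "terrible" in lowered:
        return "negative"
    return "neutral"


# Invented examples; replace with real samples from your own pipeline
examples = [
    ("I love how fast this runs", "positive"),
    ("The installer is broken on my machine", "negative"),
    ("It runs on port 8080", "neutral"),
    ("Great docs, terrible defaults", "neutral"),
]

accuracy, failures = evaluate(keyword_baseline, examples)
print(f"Accuracy: {accuracy:.0%}")
for text, expected, got in failures:
    print(f"  MISS: {text!r} expected={expected} got={got}")
```

To test a real model, swap keyword_baseline for a function that sends the text to your local model and returns its answer. Read the misses, not just the accuracy number: they tell you where a model fails on your data.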

Hardware: what actually matters

VRAM is king. This is the single most important factor for local inference. More VRAM means bigger models and longer context windows, and keeping the whole model on the GPU avoids the slowdown of offloading layers to the CPU. Clock speed and CUDA core count are secondary to raw VRAM capacity; memory bandwidth, covered below, comes second.
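Context length costs VRAM on top of the weights, because the model caches keys and values for every token. Here is a rough estimate of that KV cache; the shape values (32 layers, 8 KV heads, head dimension 128) are what I assume for Llama 3.1 8B, so check your model's config before relying on the numbers.

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, context_len, bytes_per_value=2):
    """KV cache size in GB for one sequence at fp16 (2 bytes per value)."""
    # Factor 2: both keys and values are cached in every layer
    total_bytes = (2 * n_layers * n_kv_heads * head_dim
                   * context_len * bytes_per_value)
    return total_bytes / 1e9


# Assumed shape for Llama 3.1 8B; verify against the model's config file
for ctx in (4096, 32768, 131072):
    gb = kv_cache_gb(n_layers=32, n_kv_heads=8, head_dim=128, context_len=ctx)
    print(f"context {ctx:>7,}: ~{gb:.2f} GB of KV cache")
```

At short contexts the cache is small next to the weights, but it grows linearly with context length, so a long context window can end up costing more memory than the quantized model itself.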

Here are the practical hardware tiers and what you can actually run on them:

# Hardware tiers and what fits
tiers = [
    {
        "vram": "8 GB",
        "fits": "7B models at Q4, some 13B at Q2",
        "gpus": "RTX 4060, M1/M2 (shared memory)",
        "tok_per_sec": "20-40",
    },
    {
        "vram": "16 GB",
        "fits": "7B at Q8, 13B at Q4, some 30B at Q2",
        "gpus": "RTX 4060 Ti 16GB, RTX 4070 Ti Super, M2 Pro",
        "tok_per_sec": "30-60",
    },
    {
        "vram": "24 GB",
        "fits": "13B at Q8, 30B at Q4, 70B at Q2 (partial offload)",
        "gpus": "RTX 4090, RTX 3090, M2 Max",
        "tok_per_sec": "40-100",
    },
    {
        "vram": "48 GB+",
        "fits": "70B at Q4-Q5, multiple smaller models",
        "gpus": "2x RTX 4090, M2 Ultra, A6000",
        "tok_per_sec": "50-120+",
    },
]

print(f"{'VRAM':<10} {'What Fits':<42} {'Example GPUs':<30}")
print("-" * 85)
for t in tiers:
    print(f"{t['vram']:<10} {t['fits']:<42} {t['gpus']:<30}")
    print(f"{'':>10} ~{t['tok_per_sec']} tokens/sec")

Apple Silicon deserves special mention. M1/M2/M3/M4 chips use unified memory -- the CPU and GPU share the same RAM pool. A MacBook Pro with 32GB unified memory can run models that would need a discrete GPU on other platforms. A Mac Mini with an M4 Pro and 64GB can run 70B models quantized to Q4 entirely in memory. Performance is competitive with mid-range NVIDIA GPUs, though a high-end RTX 4090 still wins on raw throughput.

CPU inference works but is significantly slower. llama.cpp and Ollama support CPU-only inference, and it's viable for models up to about 13B if you have enough RAM. Expect 5-15 tokens per second on a modern CPU versus 50-100+ on a good GPU. Usable for batch processing overnight; too slow for interactive chat with larger models. Having said that, CPU inference on a beefy server with 256GB RAM and many cores can serve a 70B model -- just don't expect interactive speeds.

Memory bandwidth is the second most important factor after VRAM capacity. This explains why Apple Silicon performs surprisingly well despite lower raw compute -- its unified memory architecture has high bandwidth (200+ GB/s on higher-end chips). On desktop systems, DDR5 RAM helps CPU inference meaningfully compared to DDR4.
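Bandwidth matters because generating each token means reading (roughly) all the model weights from memory once. That gives a simple upper bound on generation speed: bandwidth divided by model size. The bandwidth figures below are rough, illustrative assumptions, and real throughput lands below this ceiling.

```python
def max_tokens_per_sec(bandwidth_gb_s, model_size_gb):
    """Upper bound on single-stream generation: one full weight read per token."""
    return bandwidth_gb_s / model_size_gb


model_gb = 4.5  # roughly a 7B model at Q4_K_M

# Rough, illustrative bandwidth figures; look up your own hardware
systems = [
    ("Dual-channel DDR4 (CPU)", 50),
    ("Dual-channel DDR5 (CPU)", 90),
    ("Unified memory, 200 GB/s", 200),
    ("High-end GPU, ~1000 GB/s", 1000),
]

for name, bandwidth in systems:
    ceiling = max_tokens_per_sec(bandwidth, model_gb)
    print(f"{name:<26} <= {ceiling:6.1f} tok/s")
```

The same formula shows why quantization speeds things up as well as saving memory: a smaller model means fewer bytes to read per token.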

# Check what your system can handle
import torch
import platform

print(f"Platform: {platform.system()} {platform.machine()}")

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    vram = props.total_memory / 1e9  # total_memory is reported in bytes
    print(f"GPU: {props.name}")
    print(f"VRAM: {vram:.1f} GB")
    print(f"Compute capability: {props.major}.{props.minor}")

    # Estimate what fits
    if vram >= 24:
        print("Can run: 13B at Q8, 30B at Q4")
    elif vram >= 16:
        print("Can run: 7B at Q8, 13B at Q4")
    elif vram >= 8:
        print("Can run: 7B at Q4")
    else:
        print("Limited to very small models or CPU inference")

elif hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
    import subprocess
    result = subprocess.run(
        ["sysctl", "hw.memsize"],
        capture_output=True, text=True)
    total_ram = int(result.stdout.split(":")[1].strip()) / 1e9
    print(f"Apple Silicon -- Unified memory: {total_ram:.0f} GB")
    print(f"Usable for models: ~{total_ram * 0.7:.0f} GB "
          f"(leave ~30% for OS)")
else:
    import psutil  # third-party: pip install psutil
    ram = psutil.virtual_memory().total / 1e9
    print(f"CPU only -- RAM: {ram:.0f} GB")
    print("CPU inference: viable but slow for interactive use")

Practical workflow: from download to deployment

Here's the workflow I actually recommend for getting started with local models. Don't overthink it -- pick a model, run it, benchmark it on your tasks, and decide if it's good enough.

import ollama
import time

def benchmark_model(model_name, prompts):
    """Test a model's speed and quality on actual tasks."""
    results = []
    for prompt in prompts:
        start = time.time()
        response = ollama.chat(
            model=model_name,
            messages=[{"role": "user", "content": prompt}]
        )
        elapsed = time.time() - start
        content = response["message"]["content"]
        # eval_count comes from Ollama's response metadata
        tokens = response.get("eval_count", len(content.split()))
        results.append({
            "prompt": prompt[:50],
            "tokens": tokens,
            "time_sec": elapsed,
            "tok_per_sec": tokens / elapsed if elapsed > 0 else 0,
            "response_preview": content[:100],
        })
    return results


# YOUR actual test prompts -- not benchmarks, your real work
test_prompts = [
    "Classify this text as positive, negative, or neutral: "
    "'The API was decent but documentation was lacking'",

    "Extract the key entities from: 'Apple announced the "
    "M4 chip at WWDC in Cupertino'",

    "Summarize in one sentence: Gradient descent is an "
    "optimization algorithm that iteratively adjusts "
    "parameters by moving in the direction of steepest "
    "descent of the loss function.",
]

# Compare two models
for model in ["llama3.1:8b", "phi3:3.8b"]:
    print(f"\n=== {model} ===")
    try:
        results = benchmark_model(model, test_prompts)
        for r in results:
            print(f"  {r['tok_per_sec']:.1f} tok/s | "
                  f"{r['prompt'][:45]}...")
    except Exception as e:
        print(f"  Not available: {e}")

Test on your actual tasks. Measure tokens per second. Check output quality by reading the responses. Then decide: is the local model good enough for this use case, or do you need the API? Often the answer is "the 8B model handles 80% of my use cases perfectly fine, and I only call the API for the complex 20%." That hybrid approach -- local for the bulk, API for the hard stuff -- gives you the best of both worlds in terms of cost, latency, and quality.
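One caveat on the tokens-per-second numbers: the wall-clock timing used above also includes model loading and prompt processing, so the first request to a model usually looks slower than it really is. Ollama's response metadata reports `eval_duration` (in nanoseconds) next to `eval_count`, which lets you isolate the generation phase. A minimal sketch, assuming those two fields are present in the response; the helper name `generation_speed` is made up for this example, and the test input is a hand-written dict, not real Ollama output:

```python
def generation_speed(response, wall_clock_sec=None):
    """Return tokens/sec for the generation phase, or None if unknown.

    Prefers Ollama's own eval_count / eval_duration (nanoseconds).
    Falls back to wall-clock time when eval_duration is missing.
    """
    eval_count = response.get("eval_count")
    eval_duration_ns = response.get("eval_duration")

    if eval_count and eval_duration_ns:
        return eval_count / (eval_duration_ns / 1e9)
    if eval_count and wall_clock_sec:
        return eval_count / wall_clock_sec
    return None


# Hand-written stand-in for a response: 120 tokens in 3 seconds
fake_response = {"eval_count": 120, "eval_duration": 3_000_000_000}
print(f"{generation_speed(fake_response):.1f} tok/s")
```

If you want comparable numbers across models, it also helps to send one throwaway prompt first so every model is already loaded before you start measuring.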

Building a local model comparison pipeline

Let's put it all together into something you can actually use. A structured comparison pipeline that tests multiple models on your specific tasks and produces a clear recommendation:

import json
import time

class LocalModelEvaluator:
    """Compare local models on your actual tasks."""

    def __init__(self):
        self.results = {}

    def evaluate(self, model_name, test_cases):
        """Run a model through all test cases."""
        model_results = []

        for case in test_cases:
            start = time.time()
            try:
                import ollama
                response = ollama.chat(
                    model=model_name,
                    messages=[{
                        "role": "user",
                        "content": case["prompt"]
                    }]
                )
                elapsed = time.time() - start
                output = response["message"]["content"]
                tokens = response.get("eval_count",
                                      len(output.split()))

                # Score against expected output if provided
                score = self._score(output, case.get("expected"))

                model_results.append({
                    "task": case["name"],
                    "output": output[:200],
                    "tokens": tokens,
                    "time_sec": elapsed,
                    "tok_per_sec": tokens / max(elapsed, 0.001),
                    "score": score,
                })
            except Exception as e:
                model_results.append({
                    "task": case["name"],
                    "error": str(e),
                    "score": 0,
                })

        self.results[model_name] = model_results
        return model_results

    def _score(self, output, expected):
        """Simple keyword-based scoring."""
        if not expected:
            return 1.0 if len(output.strip()) > 10 else 0.0

        # Check if expected keywords appear in output
        keywords = expected.lower().split(",")
        found = sum(1 for kw in keywords
                    if kw.strip() in output.lower())
        return found / len(keywords) if keywords else 0.0

    def summary(self):
        """Print comparison summary."""
        print(f"\n{'Model':<20} {'Avg Score':>10} "
              f"{'Avg tok/s':>10} {'Tasks OK':>10}")
        print("-" * 55)

        for model_name, results in self.results.items():
            scores = [r["score"] for r in results
                      if "error" not in r]
            speeds = [r["tok_per_sec"] for r in results
                      if "error" not in r]
            ok = len(scores)
            avg_score = sum(scores) / len(scores) if scores else 0
            avg_speed = sum(speeds) / len(speeds) if speeds else 0
            print(f"{model_name:<20} {avg_score:>9.2f} "
                  f"{avg_speed:>9.1f} {ok:>6}/{len(results)}")


# Define YOUR test cases
test_cases = [
    {
        "name": "classification",
        "prompt": "Classify as positive/negative/neutral: "
                  "'The new update broke my workflow'",
        "expected": "negative",
    },
    {
        "name": "extraction",
        "prompt": "Extract entities (person, org, location): "
                  "'Satoshi Nakamoto created Bitcoin in Japan'",
        "expected": "satoshi,bitcoin,japan",
    },
    {
        "name": "summarization",
        "prompt": "One-sentence summary: Transformers replaced "
                  "RNNs because self-attention processes all "
                  "tokens in parallel rather than sequentially, "
                  "enabling much faster training on long sequences.",
        "expected": "transformers,attention,parallel",
    },
    {
        "name": "code_generation",
        "prompt": "Write a Python function that checks if a "
                  "string is a palindrome. Return True or False.",
        "expected": "def,palindrome,return,true,false",
    },
]

evaluator = LocalModelEvaluator()

# In practice: run this for each model you're considering
# evaluator.evaluate("llama3.1:8b", test_cases)
# evaluator.evaluate("phi3:3.8b", test_cases)
# evaluator.summary()

# Demo output (simulated for the tutorial)
print("Local Model Comparison Pipeline")
print("================================")
print(f"Test cases defined: {len(test_cases)}")
for tc in test_cases:
    print(f"  - {tc['name']}: {tc['prompt'][:50]}...")
print("\nRun evaluator.evaluate('model', test_cases) "
      "for each model")
print("Then evaluator.summary() for the comparison table")

This is the kind of tooling I wish I had when I started working with local models. You define your tasks ONCE, run every candidate model through them, and get a clear apples-to-apples comparison. No more "I think model X felt better than model Y" -- you have numbers. Numbers you can compare, numbers you can track over time as new models come out.
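To actually track those numbers over time, you need to store each run somewhere. Here is one possible way to do it: append every evaluation run as a single line to a JSON Lines file. This is a sketch, not part of the evaluator itself; the function names `append_run` and `load_runs` and the file name `model_runs.jsonl` are placeholders I picked for the example, and the sample results are made up.

```python
import json
import tempfile
import time
from pathlib import Path


def append_run(results, path="model_runs.jsonl"):
    """Append one evaluation run (all models) as a single JSON line."""
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "results": results,
    }
    with Path(path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record


def load_runs(path="model_runs.jsonl"):
    """Load all stored runs, oldest first. Missing file -> empty list."""
    p = Path(path)
    if not p.exists():
        return []
    with p.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Demo with made-up results, written to a temporary directory
sample_results = {
    "some-model": [{"task": "classification", "score": 1.0,
                    "tok_per_sec": 40.0}],
}
with tempfile.TemporaryDirectory() as tmp:
    log_path = Path(tmp) / "model_runs.jsonl"
    append_run(sample_results, log_path)
    append_run(sample_results, log_path)
    print(f"Stored runs: {len(load_runs(log_path))}")
```

With the evaluator above you would call `append_run(evaluator.results)` after each comparison session. One line per run keeps the file append-only, so old runs are never rewritten when a new model comes out.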

The API vs local decision framework

One more thing before we wrap up. The question isn't "API or local?" -- it's "which tasks go where?" Almost every production system I've seen that uses local models also uses APIs. The right architecture is a hybrid where you route each request to the most cost-effective backend that meets your quality requirements.

# Routing logic for hybrid local + API setup
class InferenceRouter:
    """Route requests to local or API based on task complexity."""

    def __init__(self, local_models, api_threshold=0.7):
        self.local_models = local_models
        self.api_threshold = api_threshold
        self.stats = {"local": 0, "api": 0}

    def route(self, task_type, complexity_score):
        """Decide where to send this request.

        complexity_score: 0.0 (trivial) to 1.0 (very complex)
        """
        if complexity_score < self.api_threshold:
            self.stats["local"] += 1
            return "local", self.local_models.get(
                task_type, "llama3.1:8b")
        else:
            self.stats["api"] += 1
            return "api", "gpt-4"

    def cost_report(self, local_cost_per_query=0.0001,
                    api_cost_per_query=0.03):
        """Estimate cost savings from hybrid routing."""
        local_cost = self.stats["local"] * local_cost_per_query
        api_cost = self.stats["api"] * api_cost_per_query
        all_api = ((self.stats["local"] + self.stats["api"])
                   * api_cost_per_query)
        savings = all_api - (local_cost + api_cost)
        return {
            "total_queries": self.stats["local"] + self.stats["api"],
            "local_queries": self.stats["local"],
            "api_queries": self.stats["api"],
            "hybrid_cost": local_cost + api_cost,
            "all_api_cost": all_api,
            "savings": savings,
            "savings_pct": (savings / all_api * 100
                           if all_api > 0 else 0),
        }


# Simulate 100 requests with varying complexity
router = InferenceRouter(
    local_models={
        "classification": "phi3:3.8b",
        "extraction": "llama3.1:8b",
        "summarization": "llama3.1:8b",
        "reasoning": "llama3.1:8b",
    }
)

import random
random.seed(42)
tasks = ["classification", "extraction", "summarization",
         "reasoning"]

for _ in range(100):
    task = random.choice(tasks)
    # Classification/extraction tend to be simpler
    if task in ("classification", "extraction"):
        complexity = random.uniform(0.1, 0.6)
    else:
        complexity = random.uniform(0.3, 0.95)
    router.route(task, complexity)

report = router.cost_report()
print("Hybrid Routing Report (100 queries)")
print(f"  Local: {report['local_queries']} queries")
print(f"  API:   {report['api_queries']} queries")
print(f"  Hybrid cost:  ${report['hybrid_cost']:.2f}")
print(f"  All-API cost: ${report['all_api_cost']:.2f}")
print(f"  Savings:      ${report['savings']:.2f} "
      f"({report['savings_pct']:.0f}%)")

If that routing sends 70% of your queries to local models, you've just cut your inference bill by roughly 70% while maintaining the same quality on the complex 30% that goes to the API. That's the real power of running local models -- not replacing the API entirely, but dramatically reducing how much you depend on it.
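The "roughly 70%" follows directly from the arithmetic: when a share p of queries runs locally, the fraction saved is p * (1 - local_cost / api_cost), which is almost exactly p as long as the local cost per query is tiny compared to the API cost. A small sketch to check this for different shares, assuming the same placeholder per-query costs as the defaults in `cost_report` (these are illustration values, not real prices), and assuming every query costs the same regardless of complexity; `savings_fraction` is a name made up for this example:

```python
def savings_fraction(local_share, local_cost=0.0001, api_cost=0.03):
    """Fraction of the all-API bill saved when `local_share`
    (0.0 to 1.0) of the queries run on a local model."""
    if not 0.0 <= local_share <= 1.0:
        raise ValueError("local_share must be between 0 and 1")
    if api_cost <= 0:
        raise ValueError("api_cost must be positive")
    # Average cost per query under hybrid routing
    hybrid = local_share * local_cost + (1 - local_share) * api_cost
    return 1 - hybrid / api_cost


for share in (0.5, 0.7, 0.9):
    print(f"{share:.0%} local -> {savings_fraction(share):.1%} saved")
```

With these placeholder costs, 70% local works out to about 69.8% saved. Keep in mind that in a real system the complex queries you send to the API tend to be the longer, more expensive ones, so the actual saving is likely somewhat lower than the share you route locally.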

Summary

  • Local models give you privacy, cost control, and independence from API providers, but they trade off capability compared to frontier models -- choose based on your specific task requirements, not hype;
  • Ollama for ease of use and OpenAI-compatible APIs, llama.cpp for maximum control over memory and quantization, vLLM for production serving with high concurrent throughput;
  • Quantization (GGUF Q4_K_M as the default sweet spot) makes 7B+ models practical on consumer hardware by storing weights in 4 bits instead of 16, with surprisingly small quality degradation;
  • Model selection depends entirely on your task: general chat, code generation, embedding/retrieval, and edge deployment each have specialist models that outperform generalists at a fraction of the size;
  • VRAM is king for hardware -- everything else is secondary. Apple Silicon's unified memory makes it surprisingly competitive for local inference. CPU inference works but is 5-10x slower than GPU inference;
  • Always benchmark on YOUR actual use case rather than trusting leaderboards. Build a structured comparison pipeline (like the one above) and let the numbers decide;
  • The smartest architecture is usually hybrid: route simple tasks to local models (cheap, fast, private) and complex tasks to APIs (powerful, expensive). When most of your traffic is simple enough for a local model, this can cut costs by roughly 70-80% while maintaining quality where it matters.

Exercises

Exercise 1: Build a model memory calculator. Create a function estimate_memory(num_params_billions, quantization_bits, context_length, batch_size) that estimates total GPU memory needed for inference. It should account for: (a) model weights at the given quantization level (params * bits / 8), (b) KV cache memory (2 * num_layers * hidden_dim * context_length * batch_size * 2 bytes, where num_layers and hidden_dim are estimated from the param count using standard ratios: 7B -> 32 layers, 4096 dim; 13B -> 40 layers, 5120 dim; 70B -> 80 layers, 8192 dim), (c) activation memory overhead (~10% of model weights). Test with: 7B at Q4 with 4096 context, 13B at Q4 with 2048 context, 70B at Q4 with 4096 context. Print a table showing each component and the total. Add a fits_in_gpu(vram_gb) method that returns True/False and prints what you'd need to change (lower quantization, shorter context, or smaller model) if it doesn't fit.

Exercise 2: Build an Ollama model manager (no actual Ollama required -- simulate the API). Create a class ModelManager that tracks: which models are "downloaded" (stored in a dict with name, size_gb, quantization, and capabilities list), total "disk" used, and a usage log. Implement: pull(model_name, size_gb, quant, capabilities) to add a model, remove(model_name) to delete one, list_models() to show all with sizes, find_best(task) that picks the smallest model whose capabilities include the requested task, and usage_report() that shows which models were queried most often. Pre-populate with 5 models (e.g., llama3.1:8b, phi3:3.8b, codellama:7b, nomic-embed, mistral:7b) with different capabilities (chat, code, embedding, reasoning). Simulate 50 queries across different task types and print the usage report showing which model handled the most queries and total "disk" usage.

Exercise 3: Build a quantization quality simulator. Create a function simulate_quantization(weights, bits) that takes a NumPy array of float32 weights and quantizes them to the specified bit width using uniform quantization (map the full range to 2^bits levels, then dequantize back to floats). Measure the mean squared error between original and dequantized weights, the max absolute error, and the signal-to-noise ratio in dB (10 * log10(signal_power / noise_power)). Generate a test weight array of 10,000 values drawn from a normal distribution (mean=0, std=0.02 -- realistic for neural network weights). Test with 2, 3, 4, 5, 6, and 8 bits. Print a comparison table. Then implement awq_simulate(weights, bits, importance_scores) that gives higher precision to the top 1% most "important" weights (keep those at float32, quantize the rest). Compare AWQ-simulated vs uniform quantization at 4-bit and show the MSE improvement.

Best regards! Thanks for reading.

@scipio