Learn AI Series (#72) - Tokenization Deep Dive

in StemSocial12 days ago

Learn AI Series (#72) - Tokenization Deep Dive

ai-banner.png

What will I learn

  • You will learn why tokenization is one of the most underappreciated decisions in language model design;
  • character-level, word-level, and subword tokenization -- and why subword won;
  • Byte Pair Encoding (BPE): the algorithm step by step, implemented from scratch;
  • WordPiece and Unigram models: alternative subword approaches;
  • SentencePiece: language-agnostic tokenization from raw text;
  • tiktoken and real tokenizer behavior with actual models;
  • vocabulary size tradeoffs and why 32K-128K is the sweet spot;
  • tokenization artifacts and how they explain many apparent model failures.

Requirements

  • A working modern computer running macOS, Windows or Ubuntu;
  • An installed Python 3(.11+) distribution;
  • The ambition to learn AI and machine learning.

Difficulty

  • Beginner

Curriculum (of the Learn AI Series):

Learn AI Series (#72) - Tokenization Deep Dive

Solutions to Episode #71 Exercises

Exercise 1: Temperature visualizer.

import torch
import torch.nn.functional as F
import math


def visualize_temperature(logits, temperatures):
    """Show how temperature changes the output distribution."""
    print(f"{'Temp':>6} {'Entropy':>9} {'Top-1 Prob':>11} "
          f"{'Eff Vocab':>10}")
    print("-" * 40)

    results = {}
    for t in temperatures:
        probs = F.softmax(logits / t, dim=-1)

        # Shannon entropy
        log_p = torch.log2(probs + 1e-10)
        entropy = -(probs * log_p).sum().item()

        # Top-1 probability
        top1 = probs.max().item()

        # Effective vocab: tokens with prob > 1% of total
        eff_vocab = (probs > 0.01).sum().item()

        results[t] = {
            "entropy": entropy,
            "top1": top1,
            "eff_vocab": eff_vocab,
        }
        print(f"{t:>6.1f} {entropy:>9.3f} {top1:>10.4f} "
              f"{eff_vocab:>10}")

    return results


def recommend_temperature(task_type):
    """Return recommended temperature range by task."""
    ranges = {
        "factual": (0.1, 0.4),
        "balanced": (0.5, 0.8),
        "creative": (0.9, 1.5),
    }
    if task_type not in ranges:
        return None
    low, high = ranges[task_type]
    print(f"  {task_type}: temperature {low} - {high}")
    return (low, high)


# Test it
logits = torch.randn(20) * 2.0
temps = [0.1, 0.3, 0.5, 0.7, 1.0, 1.5, 2.0]
visualize_temperature(logits, temps)

print()
for task in ["factual", "balanced", "creative"]:
    recommend_temperature(task)

The entropy column is the one to watch here. At temperature 0.1, entropy is near zero -- the distribution is essentially a one-hot vector, so there's almost no randomness. At temperature 2.0, entropy approaches its maximum value, meaning the model is practically sampling uniformly. The "effective vocabulary" metric gives you a quick intuitive sense for how many tokens the model is genuinely considering at each temperature setting.

Exercise 2: Repetition detector and auto-penalizer.

from collections import Counter


class RepetitionGuard:
    """Monitor generated text for repetitive patterns."""

    def __init__(self, window=50):
        self.window = window

    def check(self, generated_ids):
        recent = generated_ids[-self.window:]
        if len(recent) < 4:
            return {"token_repeats": [], "bigram_repeats": [],
                    "longest_repeat": 0, "severity": 0.0}

        # Token frequencies
        counts = Counter(recent)
        total = len(recent)
        token_repeats = [
            (tok, c) for tok, c in counts.most_common(5)
            if c > total * 0.15
        ]

        # Bigram frequencies
        bigrams = [(recent[i], recent[i + 1])
                    for i in range(len(recent) - 1)]
        bi_counts = Counter(bigrams)
        bigram_repeats = [
            (bg, c) for bg, c in bi_counts.most_common(5)
            if c > total * 0.08
        ]

        # Longest repeated consecutive substring
        longest = self._longest_repeat(recent)

        # Severity: combine signals
        tok_score = min(len(token_repeats) * 0.15, 0.4)
        bi_score = min(len(bigram_repeats) * 0.2, 0.4)
        len_score = min(longest / 10.0, 0.4)
        severity = min(tok_score + bi_score + len_score, 1.0)

        return {
            "token_repeats": token_repeats,
            "bigram_repeats": bigram_repeats,
            "longest_repeat": longest,
            "severity": severity,
        }

    def _longest_repeat(self, seq):
        """Find longest substring that appears more than once."""
        n = len(seq)
        best = 0
        for length in range(2, n // 2 + 1):
            for start in range(n - length):
                pattern = tuple(seq[start:start + length])
                for j in range(start + length, n - length + 1):
                    if tuple(seq[j:j + length]) == pattern:
                        best = max(best, length)
                        break
        return best

    def adaptive_penalty(self, severity):
        """Map severity [0,1] to repetition penalty [1.0,1.5]."""
        return 1.0 + severity * 0.5


# Test with intentional repetition
import random
random.seed(42)
tokens = [random.randint(0, 100) for _ in range(50)]
# Insert repeating pattern at positions 50-70
pattern = [10, 20, 30]
for i in range(20):
    tokens.append(pattern[i % 3])
# Add more random tokens
tokens.extend([random.randint(0, 100) for _ in range(30)])

guard = RepetitionGuard(window=50)
result = guard.check(tokens)
print(f"Token repeats:   {result['token_repeats']}")
print(f"Bigram repeats:  {result['bigram_repeats']}")
print(f"Longest repeat:  {result['longest_repeat']}")
print(f"Severity:        {result['severity']:.2f}")
print(f"Recommended penalty: {guard.adaptive_penalty(result['severity']):.2f}")

The repeating [10, 20, 30] pattern gets flagged immediately by both the bigram detector (the pair (30, 10) appears way too often) and the longest-repeat finder. A severity of ~0.6-0.8 maps to a penalty around 1.3-1.4, which is aggressive enough to break the loop without making the output incoherent. Sensible defaults.

Exercise 3: Decoding strategy comparator.

import torch
import torch.nn.functional as F
import math
from collections import Counter


class DecodingComparator:
    """Compare sampling strategies on a fixed distribution."""

    def __init__(self, probs, n_trials=1000):
        self.probs = probs
        self.n_trials = n_trials

    def greedy(self):
        token = torch.argmax(self.probs).item()
        return [token] * self.n_trials

    def pure_sample(self):
        return torch.multinomial(
            self.probs.unsqueeze(0).expand(self.n_trials, -1),
            num_samples=1).squeeze(-1).tolist()

    def top_k_sample(self, k):
        top_vals, top_idx = torch.topk(self.probs, k)
        renorm = top_vals / top_vals.sum()
        samples = torch.multinomial(
            renorm.unsqueeze(0).expand(self.n_trials, -1),
            num_samples=1).squeeze(-1)
        return [top_idx[s].item() for s in samples.tolist()]

    def top_p_sample(self, p):
        sorted_p, sorted_i = torch.sort(
            self.probs, descending=True)
        cumsum = torch.cumsum(sorted_p, dim=-1)
        mask = cumsum - sorted_p > p
        sorted_p[mask] = 0.0
        sorted_p = sorted_p / sorted_p.sum()
        samples = torch.multinomial(
            sorted_p.unsqueeze(0).expand(self.n_trials, -1),
            num_samples=1).squeeze(-1)
        return [sorted_i[s].item() for s in samples.tolist()]

    def analyze(self, samples, name):
        counts = Counter(samples)
        n_unique = len(counts)

        # Empirical distribution
        emp = torch.zeros_like(self.probs)
        for tok, c in counts.items():
            emp[tok] = c / self.n_trials

        # Entropy of empirical distribution
        log_e = torch.log2(emp + 1e-10)
        entropy = -(emp * log_e).sum().item()

        # Top-1 from original selected at least once?
        top1 = torch.argmax(self.probs).item()
        top1_hit = top1 in counts

        # KL divergence: D_KL(original || empirical)
        kl = F.kl_div(
            torch.log(emp + 1e-10),
            self.probs,
            reduction="sum"
        ).item()

        return {
            "name": name, "unique": n_unique,
            "entropy": entropy, "top1_hit": top1_hit,
            "kl_div": kl,
        }

    def compare_all(self):
        strategies = [
            (self.greedy(), "greedy"),
            (self.pure_sample(), "pure"),
            (self.top_k_sample(10), "top-k=10"),
            (self.top_k_sample(20), "top-k=20"),
            (self.top_k_sample(50), "top-k=50"),
            (self.top_p_sample(0.5), "top-p=0.5"),
            (self.top_p_sample(0.9), "top-p=0.9"),
            (self.top_p_sample(0.95), "top-p=0.95"),
        ]
        results = [self.analyze(s, n) for s, n in strategies]

        print(f"{'Strategy':<12} {'Unique':>7} {'Entropy':>8} "
              f"{'Top-1?':>7} {'KL Div':>8}")
        print("-" * 46)
        for r in results:
            hit = "yes" if r["top1_hit"] else "no"
            print(f"{r['name']:<12} {r['unique']:>7} "
                  f"{r['entropy']:>8.3f} {hit:>7} "
                  f"{r['kl_div']:>8.4f}")
        return results


# Zipf distribution over 1000 tokens
vocab = 1000
ranks = torch.arange(1, vocab + 1, dtype=torch.float32)
probs = (1.0 / ranks)
probs = probs / probs.sum()

comp = DecodingComparator(probs, n_trials=1000)
comp.compare_all()

The KL divergence column tells the real story. Greedy has infinite KL divergence from the original (it completely ignores 999 tokens). Pure sampling matches the original distribution almost perfectly (near-zero KL). Top-p=0.9 sits in a nice middle ground -- it cuts the tail off but keeps the core distribution intact. That's exactly why top-p is the default for most production systems: it preserves the model's beliefs where they matter while trimming the noise.

On to today's episode

Here we go! We touched on tokenization briefly back in episode #57 when we discussed language models. But tokenization deserves its own deep dive because it's one of those deceptively simple-looking steps that has profound effects on everything downstream. How you chop text into tokens determines what the model can learn, how efficiently it processes language, and what kinds of mistakes it makes.

Here's a question that trips up most people: why can modern language models solve complex calculus but struggle to count the number of "r"s in "strawberry"? The answer is tokenization. The model doesn't see individual characters -- it sees tokens, and "strawberry" gets split into pieces that hide the individual letters. We'll get to exactly how and why by the end of this episode.

If you remember from episode #30 (NLP fundamentals), we already dealt with text preprocessing -- cleaning, lowercasing, removing stopwords. Tokenization sits one layer below all of that. It's the very first transformation that raw text goes through before the model ever sees it. And unlike most preprocessing choices (which you can change during training), the tokenizer is baked into the model. Change the tokenizer and you have a completely different model. No exceptions ;-)

The three levels of tokenization

There are three fundamental approaches to splitting text into pieces, and understanding the tradeoffs between them is critical.

Character-level tokenization splits text into individual characters. "Hello" becomes ["H", "e", "l", "l", "o"]. The vocabulary is tiny -- a few hundred characters covers basically every language. Any text whatsoever can be represented. Sounds great, right? The problem is that sequences become extremely long. A 1000-word document turns into roughly 5000 characters. The model has to learn everything from scratch: spelling, word boundaries, grammar, idioms, all from individual characters. It works (and some research models do use character-level tokenization), but training is slow and the model needs to spend a lot of its capacity just learning to spell.

Word-level tokenization goes the opposite direction: split on whitespace and punctuation. "Hello world" becomes ["Hello", "world"]. Sequences are short and each token is already a meaningful unit. But the vocabulary explodes. English alone has hundreds of thousands of words, and you still can't handle misspellings, new words, made-up compound words, or source code. Any word not in the vocabulary becomes an <UNK> (unknown) token -- a dead end where all information about that word is lost.

Subword tokenization is the compromise that won, and it won decisively. Split common words into single tokens ("the", "hello") while decomposing rare words into meaningful pieces ("un" + "expect" + "edly"). Vocabulary size is bounded (typically 32K-100K tokens), sequences stay reasonably short, and any text can be represented because unknown words naturally decompose into known subwords.

# See it in action with a real tokenizer
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "Tokenization is surprisingly important for LLMs"
tokens = tokenizer.tokenize(text)
ids = tokenizer.encode(text)

print(f"Text: {text}")
print(f"Tokens: {tokens}")
print(f"IDs: {ids}")
print(f"Number of tokens: {len(ids)}")
# Tokens: ['Token', 'ization', ' is', ' surprisingly',
#          ' important', ' for', ' LL', 'Ms']
# Common words -> single tokens
# "Tokenization" -> 2 pieces, "LLMs" -> 2 pieces

Every modern language model -- GPT-2, GPT-4, Llama, Mistral, all of them -- uses some form of subword tokenization. The three main algorithms are BPE (Byte Pair Encoding), WordPiece, and Unigram. Let's build each one from scratch.

Byte Pair Encoding: the algorithm that powers GPT

BPE (Sennrich et al., 2015) is the most widely used subword tokenization algorithm. GPT-2, GPT-3, GPT-4, Llama, and most modern LLMs use BPE or a variant of it. The algorithm itself is genuinely elegant.

Training phase (building the vocabulary):

  1. Start with a base vocabulary of all individual characters (or bytes)
  2. Count all adjacent pairs of tokens in the training corpus
  3. Merge the most frequent pair into a single new token
  4. Repeat from step 2 until the vocabulary reaches the desired size

That's it. Four steps. Let's implement it:

def get_vocab(words):
    """Extract the current vocabulary from tokenized words."""
    vocab = set()
    for word in words:
        for token in word:
            vocab.add(token)
    return vocab


def train_bpe(corpus, vocab_size):
    """Train BPE from scratch on a corpus."""
    # Start with character-level tokens + end-of-word marker
    words = {}
    for word in corpus.split():
        chars = list(word) + ["</w>"]
        words[tuple(chars)] = words.get(tuple(chars), 0) + 1

    merges = []

    while len(get_vocab(words)) < vocab_size:
        # Count all adjacent pairs across the corpus
        pairs = {}
        for word, freq in words.items():
            for i in range(len(word) - 1):
                pair = (word[i], word[i + 1])
                pairs[pair] = pairs.get(pair, 0) + freq

        if not pairs:
            break

        # Find the most frequent pair
        best_pair = max(pairs, key=pairs.get)
        merges.append(best_pair)

        # Apply the merge everywhere
        new_words = {}
        for word, freq in words.items():
            new_word = merge_pair(word, best_pair)
            new_words[new_word] = freq
        words = new_words

    return merges


def merge_pair(word, pair):
    """Merge all occurrences of pair in word."""
    new_word = []
    i = 0
    while i < len(word):
        if (i < len(word) - 1
                and (word[i], word[i + 1]) == pair):
            new_word.append(word[i] + word[i + 1])
            i += 2
        else:
            new_word.append(word[i])
            i += 1
    return tuple(new_word)

The beauty of BPE is that it naturally discovers meaningful subwords from raw frequency statistics. After training on English text, the merge list will look something like: individual characters -> common two-letter combos ("th", "in", "er") -> common words ("the", "and", "is") -> longer words ("because", "important"). Rare words naturally decompose into common pieces.

Encoding phase (tokenizing new text): apply the learned merges in order. Start with characters, apply merge 1, then merge 2, and so on until no more merges apply:

def encode_bpe(text, merges):
    """Tokenize text using trained BPE merges."""
    tokens = list(text)

    for merge_pair_item in merges:
        i = 0
        while i < len(tokens) - 1:
            if (tokens[i], tokens[i + 1]) == merge_pair_item:
                tokens[i] = tokens[i] + tokens[i + 1]
                del tokens[i + 1]
            else:
                i += 1

    return tokens


# Quick demo
corpus = ("the cat sat on the mat the cat ate the rat "
          "the dog sat on the log")
merges = train_bpe(corpus, vocab_size=30)
print("Learned merges:")
for i, m in enumerate(merges[:10]):
    print(f"  {i+1}: {m[0]!r} + {m[1]!r}")

encoded = encode_bpe("the cat", merges)
print(f"\n'the cat' -> {encoded}")

Notice the merge order matters. If merge #3 is ("t", "h") and merge #7 is ("th", "e"), then "the" will first become ["th", "e"] after merge #3, then ["the"] after merge #7. The order of merges IS the tokenizer -- it's all you need to reproduce the exact same tokenization on any new text.

WordPiece: how BERT tokenizes

WordPiece (used by BERT and its derivatives) is similar to BPE but uses a different criterion for deciding which pair to merge. In stead of picking the most frequent pair, WordPiece merges the pair that maximizes the likelihood of the training data.

The practical difference: BPE merges ("t", "h") because the pair appears 50,000 times. WordPiece merges ("t", "h") because the combined token "th" is much more likely than "t" and "h" appearing independently. This subtle distinction means WordPiece tends to create slightly more linguistically meaningful subwords.

# WordPiece uses "##" prefix for continuation tokens
from transformers import BertTokenizer

bert_tok = BertTokenizer.from_pretrained("bert-base-uncased")

tokens = bert_tok.tokenize("unbelievable preprocessing")
print(tokens)
# ['un', '##bel', '##ie', '##va', '##ble',
#  'pre', '##pro', '##ces', '##sing']

# The ## prefix means: "this is a continuation of the
# previous token, NOT a new word"
tokens2 = bert_tok.tokenize("I love transformers")
print(tokens2)
# ['i', 'love', 'transformers']
# Common words stay as single tokens

That ## prefix is WordPiece's way of marking word-internal boundaries. When you see ['un', '##bel', '##ie', '##va', '##ble'], you know it's one word ("unbelievable") split into five subword pieces. The first piece has no prefix (it's a word start), and the rest have ## (they're continuations). This explicit boundary information helps the model understand word structure -- which is why BERT is so good at tasks that require word-level understanding like named entity recognition.

Unigram: the probabilistic approach

Unigram (Kudo, 2018) works in the opposite direction from BPE. In stead of starting small and building up, Unigram starts with a very large vocabulary and prunes it down:

  1. Start with a huge initial vocabulary (all substrings up to some length)
  2. For each token, compute how much the overall corpus likelihood drops if you remove it
  3. Remove the tokens that cause the smallest drop (they're the most redundant)
  4. Repeat until the vocabulary reaches the target size

During encoding, Unigram finds the most probable segmentation of the input. A single word might have dozens of valid segmentations, and Unigram picks the one that maximizes the product of individual token probabilities:

# Unigram can produce different segmentations
# "preprocessing" might become:
# ["pre", "process", "ing"]     <- most probable
# ["prep", "rocess", "ing"]     <- less probable
# ["p", "r", "e", "p", ...]    <- character fallback

# The advantage: Unigram naturally handles ambiguity
# and can even SAMPLE different segmentations, which
# works as a form of data augmentation during training

The cool thing about Unigram is that by sampling different segmentations of the same word during training, you effectively augment your data for free. The model sees "preprocessing" as ["pre", "process", "ing"] in one batch and ["pre", "proc", "essing"] in another. This makes the model more robust to unusual tokenizations at inference time. SentencePiece uses Unigram as one of its two backends.

SentencePiece: language-agnostic tokenization

BPE and WordPiece both assume whitespace separates words. That works for English and most European languages, but it falls apart for Chinese, Japanese, Thai, or any language where words aren't space-delimited. SentencePiece (Kudo and Richardson, 2018) solves this by treating the input as raw bytes -- including spaces.

import sentencepiece as spm

# Train a SentencePiece model from a text file
spm.SentencePieceTrainer.train(
    input="training_corpus.txt",
    model_prefix="my_tokenizer",
    vocab_size=32000,
    model_type="bpe",          # or "unigram"
    character_coverage=0.9995,
    byte_fallback=True         # handle ANY Unicode char
)

# Load and use the trained model
sp = spm.SentencePieceProcessor(
    model_file="my_tokenizer.model")

text = "This handles any language"
pieces = sp.encode(text, out_type=str)
print(pieces)
# Something like: ['_This', '_handles', '_any', '_language']
# The _ character represents a space

Notice that _ (the Unicode lower one eighth block character) represents a space in SentencePiece output. By treating spaces as explicit characters rather than word delimiters, SentencePiece can tokenize any language uniformly. Llama and most multilingual models use SentencePiece.

Byte fallback is a critical feature: if a character isn't in the vocabulary (rare emoji, unusual script, a Unicode codepoint the tokenizer has never seen), SentencePiece falls back to encoding the raw UTF-8 bytes. This means it can represent absolutely any valid text without ever producing an unknown token. No more <UNK> dead ends.

Tiktoken: OpenAI's fast BPE

OpenAI's tiktoken is a fast BPE implementation used by GPT-3.5, GPT-4 and related models. It's worth knowing because when you're counting tokens for API calls (and paying per token), you need to know exactly how your input will be tokenized:

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")

text = "Tokenization affects everything downstream"
tokens = enc.encode(text)
decoded = [enc.decode([t]) for t in tokens]

print(f"Token count: {len(tokens)}")
print(f"Tokens: {decoded}")
print(f"Token IDs: {tokens}")

# Interesting comparison: code is expensive in tokens
code = ("def fibonacci(n): return n if n <= 1 "
        "else fibonacci(n-1) + fibonacci(n-2)")
print(f"\nCode tokens: {len(enc.encode(code))}")

# Non-English text uses MORE tokens per character
dutch = "Tokenisatie is verrassend belangrijk"
english = "Tokenization is surprisingly important"
print(f"Dutch:   {len(enc.encode(dutch))} tokens")
print(f"English: {len(enc.encode(english))} tokens")

That Dutch vs English comparison is a real eye-opener. Same semantic content, but Dutch (and most non-English languages) costs more tokens because the tokenizer was trained primarily on English text. English words appear more frequently in the training data, so they get merged into fewer, larger tokens. Dutch words are rarer, so they stay split into more pieces. This has real consequences: the model's effective context window is smaller for non-English text, and API costs are higher per word.

As someone who reads and writes Dutch daily, I find this both annoying and fascinatiing from a technical perspective. The tokenizer is making an implicit economic judgment: English gets premium treatment because it dominated the training corpus ;-)

Vocabulary size: the fundamental tradeoff

Why do most models pick 32K or 100K tokens? Why not 500K or 5K? Vocabulary size is a fundamental engineering tradeoff:

Larger vocabulary (100K+): common words and phrases become single tokens, so sequences are shorter. The model processes text faster and can fit more content in its context window. But the embedding matrix grows (vocab_size x hidden_dim parameters), rare tokens get very few training examples each, and the output softmax layer becomes more expensive to compute. Remember from episode #58 (GPT architecture): the embedding matrix and the output projection share parameters in most models. Making the vocabulary bigger makes both of those bigger.

Smaller vocabulary (8K-16K): sequences are longer because each word takes more tokens. The model has to do more work to process the same text. But every single token appears frequently in the training data (better learned embeddings), the embedding matrix is compact, and the softmax is cheaper.

The sweet spot for most LLMs today is 32K-128K tokens. GPT-4 uses roughly 100K. Llama 3 uses 128K. Smaller models (like GPT-2) used 50K. These numbers all reflect a balance between sequence efficiency and model overhead.

# Quick calculation: how vocabulary affects model size
hidden = 4096  # typical hidden dimension

for vocab in [8000, 32000, 50257, 100000, 128000]:
    embed_params = vocab * hidden
    embed_mb = embed_params * 2 / 1e6  # fp16
    print(f"vocab={vocab:>7,}: embedding = {embed_params:>12,} "
          f"params ({embed_mb:>7.1f} MB)")

# vocab=  8,000: embedding =   32,768,000 params (  65.5 MB)
# vocab= 32,000: embedding =  131,072,000 params ( 262.1 MB)
# vocab= 50,257: embedding =  205,852,672 params ( 411.7 MB)
# vocab=100,000: embedding =  409,600,000 params ( 819.2 MB)
# vocab=128,000: embedding =  524,288,000 params (1048.6 MB)

Going from 32K to 128K vocabulary adds nearly 800MB just for the embedding matrix alone. That's not nothing, especially for models you want to run on consumer hardware (as we discussed in episode #70). Having said that, the benefit of shorter sequences usually outweighs the parameter cost -- which is why the trend has been toward larger vocabularies in recent models.

Tokenization artifacts: when the tokenizer fights you

This is the section that really matters for practical work. Tokenization creates subtle behaviors that look like model failures but are really tokenizer artifacts:

The strawberry problem: "How many r's in strawberry?" is a classic failure case for language models. Why? Because "strawberry" tokenizes as something like ["str", "aw", "berry"]. The model never sees individual "r" characters directly. To answer correctly, it would need to know that "str" contains one "r", "aw" contains zero, and "berry" contains two -- knowledge that has to be learned from training data rather than being directly observable in the token sequence.

Arithmetic gone wrong: "What is 7842 + 1359?" is harder than it should be because numbers tokenize inconsistently. "7842" might be one token, but "78423" could become ["784", "23"]. The model can't do digit-by-digit addition the way we were taught in school because it doesn't see individual digits as separate tokens. This is why chain-of-thought prompting helps with math -- it forces the model to work through the problem step by step in stead of trying to produce the answer in one shot.

The non-English penalty: tokenizers trained primarily on English text require more tokens to represent the same content in other languages. A sentence in Hindi might take 3x more tokens than its English translation. That means the model's effective context window is 3x smaller for Hindi, and inference is 3x more expensive per word.

Whitespace sensitivity: "Hello world" and "Hello world" (two spaces) produce different token sequences. Leading spaces, trailing spaces, tabs, newlines -- they all tokenize differently and can affect model behavior in unexpected ways. This is why prompt engineering sometimes feels like black magic: adding or removing a single space can change the tokenization enough to alter the model's output.

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")

# Same meaning, very different token counts
texts = [
    "the cat sat on the mat",
    "THE CAT SAT ON THE MAT",       # uppercase = more tokens
    "thecatsatonthemat",             # no spaces = different split
    "the  cat  sat  on  the  mat",   # double spaces = extra tokens
]

for t in texts:
    toks = enc.encode(t)
    decoded = [enc.decode([x]) for x in toks]
    print(f"{len(toks):2d} tokens: {t!r}")
    print(f"   split: {decoded}")
    print()

Code formatting: tabs vs spaces tokenize completely differently in Python code. A model might handle 4-space indentation well but struggle with tab indentation (or vice versa) purely because the token sequences are different for the same logical structure. If you're using a language model for code generation and getting weird indentation, check how your whitespace is tokenizing.

Understanding these artifacts is essential for debugging model behavior. When a model fails at a task, the first question to ask yourself is: "What does this look like after tokenization?" More often than not, the answer explains the failure.

Building your own tokenizer from scratch

Let's put everything together and build a complete BPE tokenizer that trains on any corpus and can encode/decode arbitrary text. This is a simplified version of what tiktoken and SentencePiece do under the hood:

class SimpleBPE:
    """A complete BPE tokenizer you can train and use."""

    def __init__(self):
        self.merges = []
        self.vocab = {}

    def train(self, text, vocab_size=256 + 50):
        """Train BPE on a text corpus."""
        # Start with byte-level vocab (covers all UTF-8)
        self.vocab = {i: bytes([i]) for i in range(256)}
        next_id = 256

        # Convert text to byte sequence
        data = list(text.encode("utf-8"))

        while next_id < vocab_size:
            # Count adjacent pairs
            pairs = {}
            for i in range(len(data) - 1):
                pair = (data[i], data[i + 1])
                pairs[pair] = pairs.get(pair, 0) + 1

            if not pairs:
                break

            # Find most common pair
            best = max(pairs, key=pairs.get)
            if pairs[best] < 2:
                break

            # Create new token for this pair
            self.merges.append(best)
            merged = best[0] * 256 + best[1]  # unique ID
            self.vocab[next_id] = (
                self.vocab.get(best[0], bytes([best[0]]))
                + self.vocab.get(best[1], bytes([best[1]])))

            # Replace all occurrences in data
            new_data = []
            i = 0
            while i < len(data):
                if (i < len(data) - 1
                        and (data[i], data[i + 1]) == best):
                    new_data.append(next_id)
                    i += 2
                else:
                    new_data.append(data[i])
                    i += 1
            data = new_data
            next_id += 1

        print(f"Trained: {len(self.merges)} merges, "
              f"{len(self.vocab)} vocab entries")

    def encode(self, text):
        """Encode text to token IDs."""
        data = list(text.encode("utf-8"))
        for pair in self.merges:
            new_data = []
            i = 0
            while i < len(data):
                if (i < len(data) - 1
                        and (data[i], data[i + 1]) == pair):
                    # Find the merge target ID
                    idx = self.merges.index(pair) + 256
                    new_data.append(idx)
                    i += 2
                else:
                    new_data.append(data[i])
                    i += 1
            data = new_data
        return data

    def decode(self, ids):
        """Decode token IDs back to text."""
        raw = b""
        for tid in ids:
            if tid in self.vocab:
                raw += self.vocab[tid]
            else:
                raw += bytes([tid])
        return raw.decode("utf-8", errors="replace")


# Train on some text
corpus = ("the quick brown fox jumps over the lazy dog "
          * 100)
corpus += "the tokenizer splits text into pieces " * 50

bpe = SimpleBPE()
bpe.train(corpus, vocab_size=300)

# Encode and decode
test = "the quick tokenizer"
ids = bpe.encode(test)
back = bpe.decode(ids)
print(f"Original:  {test!r}")
print(f"Token IDs: {ids}")
print(f"Decoded:   {back!r}")
print(f"Match:     {test == back}")

This tokenizer starts from raw bytes (covering all of UTF-8) and builds up from there. That's the same approach used by GPT-4's tokenizer -- start from bytes, not from characters. This means you never need an <UNK> token because any possible byte sequence can be represented by the base vocabulary alone. The merges just make common patterns more efficient.

Visualizing how text gets tokenized

One last practical tool. When you're debugging model behavior or optimizing prompts, being able to see exactly how text breaks down into tokens is invaluable:

import tiktoken


def visualize_tokens(text, model="gpt-4"):
    """Show exactly how text splits into tokens."""
    enc = tiktoken.encoding_for_model(model)
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]

    print(f"Text ({len(text)} chars, {len(ids)} tokens):")
    print(f"  {text!r}")
    print(f"\nTokens:")
    for i, (tid, piece) in enumerate(zip(ids, pieces)):
        # Show whitespace explicitly
        display = piece.replace(" ", "_").replace("\n", "\\n")
        print(f"  [{i:3d}] id={tid:>6d}  {display!r}")

    # Token efficiency
    ratio = len(text) / len(ids)
    print(f"\nEfficiency: {ratio:.1f} chars/token")
    return ids


# Compare different types of content
print("=" * 50)
print("English prose:")
visualize_tokens("The quick brown fox jumps over the lazy dog.")

print("\n" + "=" * 50)
print("Python code:")
visualize_tokens("def fib(n):\n    if n <= 1:\n        return n")

print("\n" + "=" * 50)
print("Mixed language:")
visualize_tokens("Hello wereld, hoe gaat het vandaag?")

You'll notice that common English words like "the", "quick", "brown" are typically single tokens, while code keywords, punctuation, and non-English words often split into multiple pieces. The chars-per-token ratio gives you a quick estimate of how efficiently the tokenizer handles different types of content -- higher is better (fewer tokens for the same text means cheaper API calls and more room in the context window).

Before you close this tab...

  • Subword tokenization (BPE, WordPiece, Unigram) won over character-level and word-level because it balances vocabulary size, sequence length, and coverage. Every modern LLM uses some variant;
  • BPE iteratively merges the most frequent adjacent pairs, naturally discovering meaningful subwords from character-level building blocks. It's the most widely used approach (GPT-2/3/4, Llama);
  • WordPiece merges based on likelihood gain in stead of raw frequency, producing slightly more linguistically meaningful tokens. Used by BERT and its derivatives;
  • Unigram starts large and prunes down, selecting the vocabulary that maximizes corpus likelihood. Can sample different segmentations for data augmentation;
  • SentencePiece treats text as raw bytes (including spaces), making tokenization truly language-agnostic -- critical for multilingual models like Llama;
  • vocabulary size is a real tradeoff: larger vocabularies mean shorter sequences but bigger embedding matrices and sparser training signal for rare tokens. The sweet spot is 32K-128K for most models;
  • tokenization artifacts explain many apparent model failures: character counting, arithmetic errors, non-English penalties, and whitespace sensitivity all trace back to how text gets split into tokens. When debugging model behavior, always check the tokenization first.

Exercises

Exercise 1: Build a tokenizer efficiency benchmark. Create a function benchmark_tokenizer(tokenizer_name, texts) that takes a tokenizer (by name, e.g. "gpt2", "gpt-4") and a dictionary of text samples (keys are category names like "english_prose", "python_code", "dutch_text", "json_data", "math_notation"). For each category, compute: (a) average tokens per word (split original on whitespace), (b) average characters per token, (c) the 5 most "expensive" words (words that use the most tokens), and (d) compression ratio (bytes in original / bytes if each token ID were stored as a 2-byte integer). Print a summary table and identify which content type is most token-efficient and which is least. Test with at least 4 categories of content, each with at least 200 words.

Exercise 2: Build a BPE merge visualizer. Create a class BPEVisualizer that trains BPE on a small corpus (provide at least 500 words) and stores the complete merge history. Implement show_merge_history(n) that prints the first n merges with: the pair being merged, the resulting token, how many times that pair appeared, and an example word from the corpus that contains this merge. Implement trace_word(word) that shows step-by-step how a specific word gets tokenized: starting from individual characters, showing which merge applies at each step, until the final tokenization. Implement vocab_growth_curve() that prints the vocabulary size after every 10 merges. Test by training on a paragraph of English text and tracing the words "understanding", "the", and "tokenization".

Exercise 3: Build a cross-lingual tokenization analyzer. Create parallel sentences in 4 languages: English, Dutch, German, and Spanish (at least 5 sentence pairs with identical meaning). Using tiktoken's GPT-4 encoding, for each sentence compute: (a) token count, (b) characters per token ratio, and (c) the individual token breakdown. Print a table showing all 4 languages side by side for each sentence pair. Calculate the "tokenization tax" -- how many extra tokens each non-English language requires compared to English (as a percentage). Finally, find the sentence where the gap between English and the most expensive language is largest, and explain why by examining the token-level breakdown. Which specific words or subwords cause the biggest divergence?

Much respect for staying with me this far. Thanks!

@scipio