Learn Python Series (#47) - Generators & Iterators Advanced

in StemSocial, 6 hours ago




What will I learn

  • You will learn how iteration actually works in Python beyond basic for loops;
  • the difference between iterables, iterators, and generators (and why that distinction matters);
  • how to build custom iteration protocols for domain-specific data structures;
  • generator expressions vs. list comprehensions and when each is appropriate;
  • advanced generator patterns like pipelines, delegation with yield from, and stateful generators;
  • practical itertools recipes for composable data processing.

Requirements

  • A working modern computer running macOS, Windows or Ubuntu;
  • An installed Python 3(.11+) distribution, such as (for example) the Anaconda Distribution;
  • The ambition to learn Python programming.

Difficulty

  • Intermediate, advanced


GitHub Account

https://github.com/realScipio

Learn Python Series (#47) - Generators & Iterators Advanced

Here's a question I want you to sit with for a moment: how does Python know how to loop over anything? Lists, dictionaries, files, range(), database query results, the output of zip() — they're all completely different types with completely different internals, yet for item in thing works on all of them. That's not magic. That's not compiler trickery. That's the iterator protocol — and once you understand it, you can make anything iterable.

Generators take this concept further. Instead of building an entire collection in memory and handing it over, a generator produces values one at a time, on demand, keeping only the current state in memory. Need to process a 50GB log file? No problem — you never load more than one line. Need to represent an infinite mathematical sequence? Sure — you just keep yielding values until someone stops asking.

Nota bene: If you've been building lists just to iterate over them, you've been doing extra work that Python doesn't require. Generators and the iterator protocol are how experienced Python developers handle data that's too large, too slow, or too unpredictable to materialize all at once. And once you internalize this way of thinking — lazy evaluation, processing streams instead of collections — you'll write fundamentally different (and better) code.

The iterator protocol: __iter__ and __next__

Let's start from first principles. An iterable is anything you can loop over. An iterator is the object that actually produces the values, one at a time. These are different things, and the distinction matters.

The protocol is simple — just two dunder methods:

class MyIterator:
    def __iter__(self):
        return self  # An iterator returns itself
    
    def __next__(self):
        # Return the next value, or raise StopIteration when done
        raise StopIteration

When you write for item in something, Python does this behind the scenes:

# What 'for item in iterable:' actually does
iterator = iter(iterable)      # Calls iterable.__iter__()
while True:
    try:
        item = next(iterator)  # Calls iterator.__next__()
        # ... body of the for loop ...
    except StopIteration:
        break

That's it. Two function calls: iter() to get the iterator, next() to get each value, and StopIteration to signal "I'm done." Every for loop in every Python program you've ever written works exactly this way. Understanding this protocol is the foundation for everything else in this episode.
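You can drive the protocol by hand on any built-in container to watch it work:

```python
nums = [10, 20, 30]

it = iter(nums)      # ask the list for an iterator
print(next(it))      # 10
print(next(it))      # 20
print(next(it))      # 30

try:
    next(it)
except StopIteration:
    print("exhausted")
```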

Building a custom iterator (the verbose way)

Let's build a simple countdown iterator to see the protocol in action:

class Countdown:
    def __init__(self, start):
        self.current = start
    
    def __iter__(self):
        return self
    
    def __next__(self):
        if self.current <= 0:
            raise StopIteration
        value = self.current
        self.current -= 1
        return value

for num in Countdown(5):
    print(num, end=" ")
# 5 4 3 2 1

The class maintains state (self.current) and produces values on demand. Memory usage is constant — it doesn't matter if you're counting down from 5 or from 5 billion. Only the current number exists in memory at any point.

But there's a subtle gotcha here. Because __iter__ returns self, the iterator is single-use:

cd = Countdown(3)
print(list(cd))  # [3, 2, 1]
print(list(cd))  # [] — exhausted!

Once you've iterated through it, self.current is 0 and stays there. The iterator is spent. This surprises people coming from lists (which you can iterate multiple times), so let's fix it.

Separating iterable from iterator

The cleaner pattern is to separate the iterable (the thing you loop over) from the iterator (the thing that tracks position). Each call to __iter__ returns a fresh iterator:

class Countdown:
    """The iterable — knows the start value, creates fresh iterators."""
    def __init__(self, start):
        self.start = start
    
    def __iter__(self):
        return CountdownIterator(self.start)

class CountdownIterator:
    """The iterator — tracks current position, yields values."""
    def __init__(self, start):
        self.current = start
    
    def __iter__(self):
        return self
    
    def __next__(self):
        if self.current <= 0:
            raise StopIteration
        value = self.current
        self.current -= 1
        return value

cd = Countdown(3)
print(list(cd))  # [3, 2, 1]
print(list(cd))  # [3, 2, 1] — works again!

Now Countdown is reusable. Each for loop gets its own CountdownIterator with its own state. This is exactly how Python's built-in containers work — list.__iter__() returns a list_iterator, dict.__iter__() returns a dict_keyiterator, and so on. The container never IS the iterator; it creates one.
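You can verify this in the REPL (the iterator type name shown is a CPython implementation detail):

```python
nums = [1, 2, 3]
it = iter(nums)

print(type(it).__name__)    # list_iterator (CPython's internal iterator type)
print(iter(it) is it)       # True: an iterator's __iter__ returns itself
print(iter(nums) is nums)   # False: the list creates a fresh iterator each time
```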

But here's the thing — writing two classes for simple iteration is verbose. Wouldn't it be nice if Python could handle the state management for us? That's exactly what generators do ;-)

Generators: the elegant shortcut

A generator function looks like a regular function, but uses yield instead of return:

def countdown(start):
    current = start
    while current > 0:
        yield current
        current -= 1

for num in countdown(5):
    print(num, end=" ")
# 5 4 3 2 1

When you call countdown(5), Python doesn't execute the function body. Instead, it returns a generator object that implements the iterator protocol automatically. Each call to next() executes the function body until it hits a yield, returns that value, and suspends — freezing all local variables in place. The next next() call resumes execution right after the yield, with all local variables intact.

This is fundamentally different from return. A return terminates the function and discards its state. A yield pauses the function and preserves its state. The generator remembers where it stopped and what every local variable was.

Let me demonstrate with some tracing:

def traced_countdown(start):
    print(f"  Generator started with start={start}")
    current = start
    while current > 0:
        print(f"  About to yield {current}")
        yield current
        print(f"  Resumed after yielding {current}")
        current -= 1
    print("  Generator exhausted")

gen = traced_countdown(3)
print("Created generator, nothing executed yet")

print(f"Got: {next(gen)}")
# Generator started with start=3
# About to yield 3
# Got: 3

print(f"Got: {next(gen)}")
# Resumed after yielding 3
# About to yield 2
# Got: 2

print(f"Got: {next(gen)}")
# Resumed after yielding 2
# About to yield 1
# Got: 1

# next(gen) would print "Resumed after yielding 1" then 
# "Generator exhausted" then raise StopIteration

Notice how "Generator started" only prints when we first call next(), not when we call traced_countdown(3). The function body is entirely lazy.

Generator expressions vs. list comprehensions

You already know list comprehensions. Generator expressions use the same syntax but with parentheses:

# List comprehension — builds entire list in memory
squares_list = [x**2 for x in range(1_000_000)]

# Generator expression — lazy, one value at a time
squares_gen = (x**2 for x in range(1_000_000))

The memory difference is dramatic:

import sys

squares_list = [x**2 for x in range(1_000_000)]
squares_gen = (x**2 for x in range(1_000_000))

print(sys.getsizeof(squares_list))  # ~8,448,728 bytes (8 MB!)
print(sys.getsizeof(squares_gen))   #        208 bytes

That's a factor of ~40,000. The generator stores only the state needed to produce the next value — the current position in the range and the expression to evaluate.

When you're feeding values into an aggregation function, always prefer generators:

# Good — generator, processes one value at a time
total = sum(x**2 for x in range(1_000_000))

# Wasteful — builds a million-element list, then sums it, then discards it
total = sum([x**2 for x in range(1_000_000)])

# Good — finds max without materializing everything
biggest = max(len(line) for line in open('data.txt'))

# Good — checks condition lazily, stops at first True
has_errors = any(line.startswith('ERROR') for line in open('app.log'))

That last example is particularly powerful: any() short-circuits, so if the first line of a 50GB log file starts with "ERROR", it reads exactly one line and stops. The generator never produces the remaining billions of lines.

Rule of thumb: use list comprehensions when you need to iterate multiple times or need random access (indexing, slicing). Use generator expressions when you iterate once and the data is large or you want lazy evaluation.

The pipeline pattern

This is where generators really shine. You can compose them into processing pipelines where data flows through multiple transformation stages, one item at a time:

def read_lines(filename):
    """Stage 1: Read lines from file."""
    with open(filename) as f:
        for line in f:
            yield line.rstrip('\n')

def skip_comments(lines):
    """Stage 2: Skip comment lines and blank lines."""
    for line in lines:
        stripped = line.strip()
        if stripped and not stripped.startswith('#'):
            yield line

def parse_csv_row(lines):
    """Stage 3: Split CSV into fields."""
    for line in lines:
        yield line.split(',')

def filter_by_field(rows, field_index, value):
    """Stage 4: Keep rows where a field matches a value."""
    for row in rows:
        if len(row) > field_index and row[field_index].strip() == value:
            yield row

# Compose the pipeline:
lines = read_lines('sales_data.csv')
clean = skip_comments(lines)
rows = parse_csv_row(clean)
electronics = filter_by_field(rows, 2, 'Electronics')

for row in electronics:
    print(row)

Each stage is a generator. No stage materializes the entire dataset. Data flows through the pipeline one record at a time. If sales_data.csv is 100GB, your memory usage is still effectively one line. And adding, removing, or reordering stages is trivial — just change how you connect the generators.

This pattern is conceptually similar to Unix pipes (cat file | grep ERROR | sort | uniq -c). Each generator is a filter or transformation that consumes from upstream and yields downstream.
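If the stages are simple enough, the same pipeline idea can be sketched with generator expressions instead of named functions; here an in-memory list of sample lines stands in for the file:

```python
lines = [
    "# comment",
    "widget,9.99,Electronics",
    "",
    "couch,499.00,Furniture",
    "phone,299.00,Electronics",
]

stripped    = (line.strip() for line in lines)
non_blank   = (line for line in stripped if line and not line.startswith('#'))
rows        = (line.split(',') for line in non_blank)
electronics = (row for row in rows if row[2] == 'Electronics')

print(list(electronics))
# [['widget', '9.99', 'Electronics'], ['phone', '299.00', 'Electronics']]
```

No stage runs until `list()` pulls values through the whole chain.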

yield from: delegating to sub-generators

When you need a generator to yield all values from another iterable, you could write a loop:

def flatten(nested_list):
    for sublist in nested_list:
        for item in sublist:
            yield item

But yield from (Python 3.3+) is cleaner and more efficient:

def flatten(nested_list):
    for sublist in nested_list:
        yield from sublist

result = list(flatten([[1, 2, 3], [4, 5], [6]]))
# [1, 2, 3, 4, 5, 6]

yield from is more than syntactic sugar — it fully delegates the iteration protocol, forwarding .send(), .throw(), and .close() to the sub-generator. This matters when you're using generators as coroutines.
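A small sketch of that delegation: a value passed to the outer generator's .send() arrives directly inside the inner one (the names inner and outer are illustrative):

```python
def inner():
    received = yield "inner ready"        # .send() on the outer generator lands here
    yield f"inner got {received}"

def outer():
    yield from inner()   # delegates next(), send(), throw(), close() to inner()

gen = outer()
print(next(gen))          # inner ready
print(gen.send("hello"))  # inner got hello
```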

For recursive structures, yield from is particularly elegant:

class TreeNode:
    def __init__(self, value, children=None):
        self.value = value
        self.children = children or []

def depth_first(node):
    """Depth-first traversal without explicit stack management."""
    yield node.value
    for child in node.children:
        yield from depth_first(child)

def breadth_first(root):
    """Breadth-first traversal using a queue."""
    from collections import deque
    queue = deque([root])
    while queue:
        node = queue.popleft()
        yield node.value
        queue.extend(node.children)

tree = TreeNode('A', [
    TreeNode('B', [TreeNode('D'), TreeNode('E')]),
    TreeNode('C', [TreeNode('F')])
])

print(list(depth_first(tree)))    # ['A', 'B', 'D', 'E', 'C', 'F']
print(list(breadth_first(tree)))  # ['A', 'B', 'C', 'D', 'E', 'F']

No explicit stack, no accumulator lists — just yield and yield from. The recursive generator manages its own call stack via Python's generator frame objects. Clean, readable, and memory-efficient.
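The same recursive pattern handles arbitrarily nested lists. Here's a hypothetical deep_flatten extending the flatten example from earlier:

```python
def deep_flatten(items):
    """Recursively flatten arbitrarily nested lists, yielding leaf values."""
    for item in items:
        if isinstance(item, list):
            yield from deep_flatten(item)   # recurse into sublists
        else:
            yield item

print(list(deep_flatten([1, [2, [3, [4]], 5], 6])))
# [1, 2, 3, 4, 5, 6]
```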

Sending values into generators with .send()

Generators aren't just one-way streets. You can send values into a running generator using .send():

def running_average():
    """Accepts values via send(), yields running average."""
    total = 0.0
    count = 0
    average = None
    
    while True:
        value = yield average
        if value is None:
            return  # Cleanly terminate
        total += value
        count += 1
        average = total / count

# Usage:
avg = running_average()
next(avg)              # "Prime" the generator (advance to first yield)
print(avg.send(10))    # 10.0
print(avg.send(20))    # 15.0
print(avg.send(30))    # 20.0
print(avg.send(15))    # 18.75

The yield expression works both ways: it sends a value out (the average) and receives a value in (the new data point). The next(avg) call at the beginning is required to advance the generator to the first yield — this is called "priming" the generator.

This pattern was the precursor to Python's async/await system. Before PEP 492 gave us native coroutines, developers used generator-based coroutines with .send() extensively. Modern code should use async/await for concurrent I/O (as we covered in episodes #40 and #41), but understanding .send() explains why async and await work the way they do — they're syntactic sugar over a generator-like protocol.

Generator cleanup with .close() and .throw()

Generators support two more methods for lifecycle management:

def managed_resource():
    print("Opening resource")
    try:
        while True:
            data = yield "ready"
            print(f"Processing: {data}")
    except GeneratorExit:
        print("Cleaning up resource")
    finally:
        print("Finally block executed")

gen = managed_resource()
next(gen)             # Opening resource → 'ready'
gen.send("chunk 1")   # Processing: chunk 1
gen.send("chunk 2")   # Processing: chunk 2
gen.close()           # Cleaning up resource → Finally block executed

.close() throws a GeneratorExit exception into the generator, giving it a chance to clean up. This is called automatically when a generator is garbage collected, which is why generators work well as context-manager-like resources.
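This close/GeneratorExit machinery is exactly what the standard library's contextlib.contextmanager decorator builds on: it turns a one-yield generator into a context manager. A small illustrative example (the tag function is hypothetical):

```python
from contextlib import contextmanager

@contextmanager
def tag(name):
    """Print an opening tag, run the with-block, always print the closing tag."""
    print(f"<{name}>")
    try:
        yield name            # the with-block body runs while suspended here
    finally:
        print(f"</{name}>")   # runs even if the body raises

with tag("section") as t:
    print(f"inside {t}")
# <section>
# inside section
# </section>
```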

.throw() lets you inject an exception into a generator at the point where it's suspended:

def resilient_processor():
    while True:
        try:
            value = yield
            print(f"Processed: {value}")
        except ValueError as e:
            print(f"Skipped bad value: {e}")

gen = resilient_processor()
next(gen)
gen.send("good data")     # Processed: good data
gen.throw(ValueError, "corrupt record")  # Skipped bad value: corrupt record
gen.send("more data")     # Processed: more data

The generator catches the injected exception and continues. This enables error recovery in pipeline architectures without breaking the entire chain.

Infinite generators

Because generators are lazy, they can represent infinite sequences. This is impossible with lists (you'd run out of memory) but trivial with generators:

def fibonacci():
    """Infinite Fibonacci sequence."""
    a, b = 0, 1
    while True:
        yield a
        a, b = b, a + b

def primes():
    """Infinite prime sequence (trial division against primes found so far)."""
    yield 2
    candidate = 3
    found = [2]
    while True:
        if all(candidate % p != 0 for p in found):
            found.append(candidate)
            yield candidate
        candidate += 2

import itertools

# First 10 Fibonacci numbers
print(list(itertools.islice(fibonacci(), 10)))
# [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]

# First 15 primes
print(list(itertools.islice(primes(), 15)))
# [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47]

# Sum of first 1000 Fibonacci numbers
print(sum(itertools.islice(fibonacci(), 1000)))
# (a very large number)

itertools.islice is the standard way to take a finite slice from an infinite generator. You could also use zip with a range or write a custom take() function.
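For completeness, here's what such a custom take() might look like (a minimal sketch, not a standard function):

```python
def take(n, iterable):
    """Collect at most the first n items of any iterable into a list."""
    result = []
    it = iter(iterable)
    for _ in range(n):
        try:
            result.append(next(it))
        except StopIteration:   # source ran out before n items
            break
    return result

def naturals():
    """Infinite 0, 1, 2, ... for demonstration."""
    n = 0
    while True:
        yield n
        n += 1

print(take(5, naturals()))  # [0, 1, 2, 3, 4]
```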

itertools: the standard library's generator toolkit

The itertools module provides a collection of fast, memory-efficient building blocks for iterator-based programming. They're implemented in C, so they're significantly faster than equivalent hand-written Python generators:

import itertools

# chain: concatenate multiple iterables
combined = itertools.chain([1, 2], [3, 4], [5, 6])
print(list(combined))  # [1, 2, 3, 4, 5, 6]

# cycle: repeat an iterable forever
colors = itertools.cycle(['red', 'green', 'blue'])
print(list(itertools.islice(colors, 7)))
# ['red', 'green', 'blue', 'red', 'green', 'blue', 'red']

# count: infinite arithmetic progression
counter = itertools.count(start=10, step=3)
print(list(itertools.islice(counter, 5)))  # [10, 13, 16, 19, 22]

# takewhile / dropwhile: conditional slicing
nums = [1, 3, 5, 7, 2, 4, 6, 8]
print(list(itertools.takewhile(lambda x: x < 6, nums)))  # [1, 3, 5]
print(list(itertools.dropwhile(lambda x: x < 6, nums)))  # [7, 2, 4, 6, 8]

# accumulate: running reduction
data = [1, 2, 3, 4, 5]
print(list(itertools.accumulate(data)))             # [1, 3, 6, 10, 15] (running sum)
print(list(itertools.accumulate(data, max)))        # [1, 2, 3, 4, 5]  (running max)
print(list(itertools.accumulate(data, lambda a, b: a * b)))  # [1, 2, 6, 24, 120] (running product)

And the combinatoric generators:

# product: cartesian product
print(list(itertools.product('AB', [1, 2])))
# [('A', 1), ('A', 2), ('B', 1), ('B', 2)]

# permutations and combinations
print(list(itertools.permutations('ABC', 2)))
# [('A', 'B'), ('A', 'C'), ('B', 'A'), ('B', 'C'), ('C', 'A'), ('C', 'B')]

print(list(itertools.combinations('ABCD', 2)))
# [('A', 'B'), ('A', 'C'), ('A', 'D'), ('B', 'C'), ('B', 'D'), ('C', 'D')]

All of these return lazy iterators, so itertools.product(range(100), range(100), range(100)) doesn't build a list of a million tuples up front — it yields them one at a time.

groupby: the tricky one

itertools.groupby groups consecutive elements that share a key. The "consecutive" part trips people up:

import itertools

# Data MUST be sorted by the grouping key first!
data = [
    ('Alice', 'Engineering'),
    ('Bob', 'Engineering'),
    ('Carol', 'Marketing'),
    ('Dave', 'Marketing'),
    ('Eve', 'Engineering'),
]

# This groups CONSECUTIVE items, so Eve is a separate Engineering group:
for dept, members in itertools.groupby(data, key=lambda x: x[1]):
    print(f"{dept}: {[m[0] for m in members]}")
# Engineering: ['Alice', 'Bob']
# Marketing: ['Carol', 'Dave']
# Engineering: ['Eve']          ← separate group!

# Sort first for true grouping:
sorted_data = sorted(data, key=lambda x: x[1])
for dept, members in itertools.groupby(sorted_data, key=lambda x: x[1]):
    print(f"{dept}: {[m[0] for m in members]}")
# Engineering: ['Alice', 'Bob', 'Eve']
# Marketing: ['Carol', 'Dave']

The reason groupby works on consecutive runs (rather than collecting all matching elements like SQL's GROUP BY) is precisely because it's a generator — it processes the stream item by item and has no lookahead. If you want SQL-style grouping, sort first.
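When the data doesn't arrive sorted and fits in memory, a plain dictionary accumulator gives you SQL-style grouping in a single pass, at the cost of materializing the groups:

```python
from collections import defaultdict

data = [
    ('Alice', 'Engineering'),
    ('Bob', 'Engineering'),
    ('Carol', 'Marketing'),
    ('Dave', 'Marketing'),
    ('Eve', 'Engineering'),
]

groups = defaultdict(list)
for name, dept in data:
    groups[dept].append(name)   # no sorting needed; order of arrival preserved

print(dict(groups))
# {'Engineering': ['Alice', 'Bob', 'Eve'], 'Marketing': ['Carol', 'Dave']}
```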

Real-world example: processing large CSV data

Let's put everything together in a realistic scenario — processing a large dataset with a pipeline of generators:

import csv
import itertools
from collections import defaultdict

def read_csv_rows(filename):
    """Read CSV file lazily, one row at a time."""
    with open(filename, newline='') as f:
        reader = csv.DictReader(f)
        for row in reader:
            yield row

def parse_numeric(rows, fields):
    """Convert specified fields from strings to floats."""
    for row in rows:
        for field in fields:
            try:
                row[field] = float(row[field])
            except (ValueError, KeyError):
                row[field] = 0.0
        yield row

def filter_rows(rows, predicate):
    """Keep only rows matching a predicate function."""
    for row in rows:
        if predicate(row):
            yield row

def running_totals(rows, amount_field, group_field):
    """Compute running total per group."""
    totals = defaultdict(float)
    for row in rows:
        group = row[group_field]
        totals[group] += row[amount_field]
        row['running_total'] = totals[group]
        yield row

# Compose the pipeline:
rows = read_csv_rows('transactions.csv')
parsed = parse_numeric(rows, ['amount'])
large_only = filter_rows(parsed, lambda r: r['amount'] > 100)
with_totals = running_totals(large_only, 'amount', 'category')

# Process — constant memory regardless of file size
for row in itertools.islice(with_totals, 20):  # Preview first 20
    print(f"{row['category']:>15} | ${row['amount']:>10.2f} | "
          f"Running: ${row['running_total']:>12.2f}")

This pipeline can process a 100GB CSV file while holding only the current row in memory (plus one running total per category in the defaultdict). Nothing else is stored or buffered, and the file is never loaded wholesale. That's the power of generator pipelines.

When NOT to use generators

Generators aren't always the answer. Here are the cases where a list is better:

# Bad — need random access
gen = (x**2 for x in range(100))
# gen[50]  ← TypeError! Generators don't support indexing

# Bad — need to iterate multiple times
gen = (x**2 for x in range(100))
total = sum(gen)                  # consumes the generator
average = total / len(list(gen))  # gen is exhausted — list(gen) is [], so this raises ZeroDivisionError

# Bad — need to know length upfront
gen = (x for x in range(100) if x % 7 == 0)
# len(gen)  ← TypeError! Generators don't have length

# Bad — small dataset where list is simpler
names = [name.upper() for name in ['alice', 'bob', 'carol']]
# No point using a generator for 3 items
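If you genuinely need two passes over a generator's output, itertools.tee can duplicate the stream. Beware that it buffers every value that one branch has consumed and the other hasn't, so fully draining one branch first buffers everything:

```python
import itertools

gen = (x**2 for x in range(5))
a, b = itertools.tee(gen)     # two independent iterators over the same stream

total = sum(a)                # first pass: 0 + 1 + 4 + 9 + 16 = 30
count = sum(1 for _ in b)     # second pass: 5 items
print(total / count)          # 6.0
```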

Use lists when:

  • You need random access (indexing, slicing)
  • You need to iterate multiple times
  • You need len(), .sort(), .reverse(), or other list methods
  • The data fits comfortably in memory and is small

Use generators when:

  • You iterate once through large or infinite data
  • You're building processing pipelines
  • You're feeding into sum(), max(), min(), any(), all(), or similar aggregators
  • Memory efficiency matters

In summary

In this episode, we explored the iteration machinery that powers Python:

  • The iterator protocol (__iter__ and __next__) is the foundation of all iteration in Python
  • Iterables create fresh iterators; iterators produce values and track position
  • Generator functions (using yield) are the Pythonic shortcut for writing iterators
  • Generator expressions (x for x in ...) are memory-efficient alternatives to list comprehensions
  • Generator pipelines compose multiple stages for stream processing with constant memory
  • yield from delegates to sub-generators and handles the full iterator protocol
  • .send() enables bidirectional communication with running generators
  • .close() and .throw() support lifecycle management and error injection
  • Infinite generators represent unbounded sequences in finite memory
  • itertools provides C-optimized building blocks: chain, cycle, islice, accumulate, groupby, product, permutations, combinations
  • groupby groups consecutive elements — sort first for SQL-style grouping

The shift from "materialize everything, then process" to "process as you go" is a fundamental change in how you think about data. Generators make this shift natural. They're one of Python's most powerful features — and now you know how the machinery works under the hood.

Thank you very much, until next time!

@scipio