Learn Python Series (#47) - Generators & Iterators Advanced

Repository
What will I learn
- You will learn how iteration actually works in Python beyond basic for loops;
- the difference between iterables, iterators, and generators (and why that distinction matters);
- how to build custom iteration protocols for domain-specific data structures;
- generator expressions vs. list comprehensions and when each is appropriate;
- advanced generator patterns like pipelines, delegation with yield from, and stateful generators;
- practical itertools recipes for composable data processing.
Requirements
- A working modern computer running macOS, Windows or Ubuntu;
- An installed Python 3(.11+) distribution, such as (for example) the Anaconda Distribution;
- The ambition to learn Python programming.
Difficulty
- Intermediate, advanced
Curriculum (of the Learn Python Series):
- Learn Python Series - Intro
- Learn Python Series (#2) - Handling Strings Part 1
- Learn Python Series (#3) - Handling Strings Part 2
- Learn Python Series (#4) - Round-Up #1
- Learn Python Series (#5) - Handling Lists Part 1
- Learn Python Series (#6) - Handling Lists Part 2
- Learn Python Series (#7) - Handling Dictionaries
- Learn Python Series (#8) - Handling Tuples
- Learn Python Series (#9) - Using Import
- Learn Python Series (#10) - Matplotlib Part 1
- Learn Python Series (#11) - NumPy Part 1
- Learn Python Series (#12) - Handling Files
- Learn Python Series (#13) - Mini Project - Developing a Web Crawler Part 1
- Learn Python Series (#14) - Mini Project - Developing a Web Crawler Part 2
- Learn Python Series (#15) - Handling JSON
- Learn Python Series (#16) - Mini Project - Developing a Web Crawler Part 3
- Learn Python Series (#17) - Roundup #2 - Combining and analyzing any-to-any multi-currency historical data
- Learn Python Series (#18) - PyMongo Part 1
- Learn Python Series (#19) - PyMongo Part 2
- Learn Python Series (#20) - PyMongo Part 3
- Learn Python Series (#21) - Handling Dates and Time Part 1
- Learn Python Series (#22) - Handling Dates and Time Part 2
- Learn Python Series (#23) - Handling Regular Expressions Part 1
- Learn Python Series (#24) - Handling Regular Expressions Part 2
- Learn Python Series (#25) - Handling Regular Expressions Part 3
- Learn Python Series (#26) - pipenv & Visual Studio Code
- Learn Python Series (#27) - Handling Strings Part 3 (F-Strings)
- Learn Python Series (#28) - Using Pickle and Shelve
- Learn Python Series (#29) - Handling CSV
- Learn Python Series (#30) - Data Science Part 1 - Pandas
- Learn Python Series (#31) - Data Science Part 2 - Pandas
- Learn Python Series (#32) - Data Science Part 3 - Pandas
- Learn Python Series (#33) - Data Science Part 4 - Pandas
- Learn Python Series (#34) - Working with APIs in 2026: What's Changed
- Learn Python Series (#35) - Working with APIs Part 2: Beyond GET Requests
- Learn Python Series (#36) - Type Hints and Modern Python
- Learn Python Series (#37) - Virtual Environments and Dependency Management
- Learn Python Series (#38) - Testing Your Code Part 1
- Learn Python Series (#39) - Testing Your Code Part 2
- Learn Python Series (#40) - Asynchronous Python Part 1
- Learn Python Series (#41) - Asynchronous Python Part 2
- Learn Python Series (#42) - Building CLI Applications
- Learn Python Series (#43) - Mini Project - Crypto Price Tracker
- Learn Python Series (#44) - Context Managers & Decorators Deep Dive
- Learn Python Series (#45) - Metaclasses & Class Design Patterns
- Learn Python Series (#46) - Descriptors & Properties
- Learn Python Series (#47) - Generators & Iterators Advanced (this post)
GitHub Account
Learn Python Series (#47) - Generators & Iterators Advanced
Here's a question I want you to sit with for a moment: how does Python know how to loop over anything? Lists, dictionaries, files, range(), database query results, the output of zip() — they're all completely different types with completely different internals, yet for item in thing works on all of them. That's not magic. That's not compiler trickery. That's the iterator protocol — and once you understand it, you can make anything iterable.
Generators take this concept further. Instead of building an entire collection in memory and handing it over, a generator produces values one at a time, on demand, keeping only the current state in memory. Need to process a 50GB log file? No problem — you never load more than one line. Need to represent an infinite mathematical sequence? Sure — you just keep yielding values until someone stops asking.
Nota bene: If you've been building lists just to iterate over them, you've been doing extra work that Python doesn't require. Generators and the iterator protocol are how experienced Python developers handle data that's too large, too slow, or too unpredictable to materialize all at once. And once you internalize this way of thinking — lazy evaluation, processing streams instead of collections — you'll write fundamentally different (and better) code.
The iterator protocol: __iter__ and __next__
Let's start from first principles. An iterable is anything you can loop over. An iterator is the object that actually produces the values, one at a time. These are different things, and the distinction matters.
The protocol is simple — just two dunder methods:
class MyIterator:
    def __iter__(self):
        return self  # An iterator returns itself

    def __next__(self):
        # Return the next value, or raise StopIteration when done
        raise StopIteration
When you write for item in something, Python does this behind the scenes:
# What 'for item in iterable:' actually does
iterator = iter(iterable)  # Calls iterable.__iter__()
while True:
    try:
        item = next(iterator)  # Calls iterator.__next__()
        # ... body of the for loop ...
    except StopIteration:
        break
That's it. Two function calls: iter() to get the iterator, next() to get each value, and StopIteration to signal "I'm done." Every for loop in every Python program you've ever written works exactly this way. Understanding this protocol is the foundation for everything else in this episode.
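You can drive these two calls by hand to see the protocol in isolation, no for loop involved:

```python
# Driving the iterator protocol manually on a built-in list
letters = ["a", "b", "c"]
it = iter(letters)   # calls letters.__iter__(), returns a list_iterator

print(next(it))      # a
print(next(it))      # b
print(next(it))      # c

try:
    next(it)         # the iterator is exhausted...
except StopIteration:
    print("exhausted")   # ...so StopIteration signals "I'm done"
```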
Building a custom iterator (the verbose way)
Let's build a simple countdown iterator to see the protocol in action:
class Countdown:
    def __init__(self, start):
        self.current = start

    def __iter__(self):
        return self

    def __next__(self):
        if self.current <= 0:
            raise StopIteration
        value = self.current
        self.current -= 1
        return value

for num in Countdown(5):
    print(num, end=" ")
# 5 4 3 2 1
The class maintains state (self.current) and produces values on demand. Memory usage is constant — it doesn't matter if you're counting down from 5 or from 5 billion. Only the current number exists in memory at any point.
But there's a subtle gotcha here. Because __iter__ returns self, the iterator is single-use:
cd = Countdown(3)
print(list(cd)) # [3, 2, 1]
print(list(cd)) # [] — exhausted!
Once you've iterated through it, self.current is 0 and stays there. The iterator is spent. This surprises people coming from lists (which you can iterate multiple times), so let's fix it.
Separating iterable from iterator
The cleaner pattern is to separate the iterable (the thing you loop over) from the iterator (the thing that tracks position). Each call to __iter__ returns a fresh iterator:
class Countdown:
    """The iterable — knows the start value, creates fresh iterators."""
    def __init__(self, start):
        self.start = start

    def __iter__(self):
        return CountdownIterator(self.start)

class CountdownIterator:
    """The iterator — tracks current position, yields values."""
    def __init__(self, start):
        self.current = start

    def __iter__(self):
        return self

    def __next__(self):
        if self.current <= 0:
            raise StopIteration
        value = self.current
        self.current -= 1
        return value

cd = Countdown(3)
print(list(cd))  # [3, 2, 1]
print(list(cd))  # [3, 2, 1] — works again!
Now Countdown is reusable. Each for loop gets its own CountdownIterator with its own state. This is exactly how Python's built-in containers work — list.__iter__() returns a list_iterator, dict.__iter__() returns a dict_keyiterator, and so on. The container never IS the iterator; it creates one.
But here's the thing — writing two classes for simple iteration is verbose. Wouldn't it be nice if Python could handle the state management for us? That's exactly what generators do ;-)
Generators: the elegant shortcut
A generator function looks like a regular function, but uses yield instead of return:
def countdown(start):
    current = start
    while current > 0:
        yield current
        current -= 1

for num in countdown(5):
    print(num, end=" ")
# 5 4 3 2 1
When you call countdown(5), Python doesn't execute the function body. Instead, it returns a generator object that implements the iterator protocol automatically. Each call to next() executes the function body until it hits a yield, returns that value, and suspends — freezing all local variables in place. The next next() call resumes execution right after the yield, with all local variables intact.
This is fundamentally different from return. A return terminates the function and discards its state. A yield pauses the function and preserves its state. The generator remembers where it stopped and what every local variable was.
Let me demonstrate with some tracing:
def traced_countdown(start):
    print(f"  Generator started with start={start}")
    current = start
    while current > 0:
        print(f"  About to yield {current}")
        yield current
        print(f"  Resumed after yielding {current}")
        current -= 1
    print("  Generator exhausted")

gen = traced_countdown(3)
print("Created generator, nothing executed yet")
print(f"Got: {next(gen)}")
# Generator started with start=3
# About to yield 3
# Got: 3
print(f"Got: {next(gen)}")
# Resumed after yielding 3
# About to yield 2
# Got: 2
print(f"Got: {next(gen)}")
# Resumed after yielding 2
# About to yield 1
# Got: 1
# next(gen) would print "Resumed after yielding 1" then
# "Generator exhausted" then raise StopIteration
Notice how "Generator started" only prints when we first call next(), not when we call traced_countdown(3). The function body is entirely lazy.
Generator expressions vs. list comprehensions
You already know list comprehensions. Generator expressions use the same syntax but with parentheses:
# List comprehension — builds entire list in memory
squares_list = [x**2 for x in range(1_000_000)]
# Generator expression — lazy, one value at a time
squares_gen = (x**2 for x in range(1_000_000))
The memory difference is dramatic:
import sys
squares_list = [x**2 for x in range(1_000_000)]
squares_gen = (x**2 for x in range(1_000_000))
print(sys.getsizeof(squares_list)) # ~8,448,728 bytes (8 MB!)
print(sys.getsizeof(squares_gen)) # 208 bytes
That's a factor of ~40,000. The generator stores only the state needed to produce the next value — the current position in the range and the expression to evaluate.
When you're feeding values into an aggregation function, always prefer generators:
# Good — generator, processes one value at a time
total = sum(x**2 for x in range(1_000_000))
# Wasteful — builds a million-element list, then sums it, then discards it
total = sum([x**2 for x in range(1_000_000)])
# Good — finds max without materializing everything
biggest = max(len(line) for line in open('data.txt'))
# Good — checks condition lazily, stops at first True
has_errors = any(line.startswith('ERROR') for line in open('app.log'))
That last example is particularly powerful: any() short-circuits, so if the first line of a 50GB log file starts with "ERROR", it reads exactly one line and stops. The generator never produces the remaining billions of lines.
Rule of thumb: use list comprehensions when you need to iterate multiple times or need random access (indexing, slicing). Use generator expressions when you iterate once and the data is large or you want lazy evaluation.
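To make the short-circuiting visible, here's a small instrumented sketch (the calls counter attribute is purely illustrative instrumentation, not a standard generator feature):

```python
def counted_squares(n):
    """Yield squares of 0..n-1, counting how many are actually produced."""
    counted_squares.calls = 0
    for x in range(n):
        counted_squares.calls += 1
        yield x * x

# any() stops at the first truthy value, so almost none of the
# million potential squares are ever computed:
found = any(sq > 100 for sq in counted_squares(1_000_000))
print(found)                  # True
print(counted_squares.calls)  # 12 — stopped at 11**2 == 121
```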
The pipeline pattern
This is where generators really shine. You can compose them into processing pipelines where data flows through multiple transformation stages, one item at a time:
def read_lines(filename):
    """Stage 1: Read lines from file."""
    with open(filename) as f:
        for line in f:
            yield line.rstrip('\n')

def skip_comments(lines):
    """Stage 2: Skip comment lines and blank lines."""
    for line in lines:
        stripped = line.strip()
        if stripped and not stripped.startswith('#'):
            yield line

def parse_csv_row(lines):
    """Stage 3: Split CSV into fields."""
    for line in lines:
        yield line.split(',')

def filter_by_field(rows, field_index, value):
    """Stage 4: Keep rows where a field matches a value."""
    for row in rows:
        if len(row) > field_index and row[field_index].strip() == value:
            yield row

# Compose the pipeline:
lines = read_lines('sales_data.csv')
clean = skip_comments(lines)
rows = parse_csv_row(clean)
electronics = filter_by_field(rows, 2, 'Electronics')

for row in electronics:
    print(row)
Each stage is a generator. No stage materializes the entire dataset. Data flows through the pipeline one record at a time. If sales_data.csv is 100GB, your memory usage is still effectively one line. And adding, removing, or reordering stages is trivial — just change how you connect the generators.
This pattern is conceptually similar to Unix pipes (cat file | grep ERROR | sort | uniq -c). Each generator is a filter or transformation that consumes from upstream and yields downstream.
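Because each stage only requires an iterable of lines, the same stages work on in-memory data just as well as on files. Here's a self-contained sketch (the sample rows are made up for illustration):

```python
def skip_comments(lines):
    """Drop blank lines and '#' comment lines."""
    for line in lines:
        stripped = line.strip()
        if stripped and not stripped.startswith('#'):
            yield line

def parse_fields(lines):
    """Split each line into comma-separated fields."""
    for line in lines:
        yield line.split(',')

raw = [
    "# sales export",
    "widget,12,Electronics",
    "",
    "gadget,7,Toys",
]
for row in parse_fields(skip_comments(raw)):
    print(row)
# ['widget', '12', 'Electronics']
# ['gadget', '7', 'Toys']
```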
yield from: delegating to sub-generators
When you need a generator to yield all values from another iterable, you could write a loop:
def flatten(nested_list):
    for sublist in nested_list:
        for item in sublist:
            yield item
But yield from (Python 3.3+) is cleaner and more efficient:
def flatten(nested_list):
    for sublist in nested_list:
        yield from sublist

result = list(flatten([[1, 2, 3], [4, 5], [6]]))
# [1, 2, 3, 4, 5, 6]
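The one-level flatten above only handles lists of lists, but yield from recurses naturally. This hypothetical flatten_deep handles arbitrary nesting depth:

```python
def flatten_deep(items):
    """Recursively flatten arbitrarily nested lists."""
    for item in items:
        if isinstance(item, list):
            yield from flatten_deep(item)  # delegate to the recursive call
        else:
            yield item

print(list(flatten_deep([1, [2, [3, [4]], 5], 6])))
# [1, 2, 3, 4, 5, 6]
```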
yield from is more than syntactic sugar — it properly delegates the iteration protocol, forwarding .send(), .throw(), and .close() to the sub-generator. This matters when you're using generators as coroutines.
For recursive structures, yield from is particularly elegant:
from collections import deque

class TreeNode:
    def __init__(self, value, children=None):
        self.value = value
        self.children = children or []

def depth_first(node):
    """Depth-first traversal without explicit stack management."""
    yield node.value
    for child in node.children:
        yield from depth_first(child)

def breadth_first(root):
    """Breadth-first traversal using a queue."""
    queue = deque([root])
    while queue:
        node = queue.popleft()
        yield node.value
        queue.extend(node.children)

tree = TreeNode('A', [
    TreeNode('B', [TreeNode('D'), TreeNode('E')]),
    TreeNode('C', [TreeNode('F')])
])
print(list(depth_first(tree)))    # ['A', 'B', 'D', 'E', 'C', 'F']
print(list(breadth_first(tree)))  # ['A', 'B', 'C', 'D', 'E', 'F']
No explicit stack, no accumulator lists — just yield and yield from. The recursive generator manages its own call stack via Python's generator frame objects. Clean, readable, and memory-efficient.
Sending values into generators with .send()
Generators aren't just one-way streets. You can send values into a running generator using .send():
def running_average():
    """Accepts values via send(), yields the running average."""
    total = 0.0
    count = 0
    average = None
    while True:
        value = yield average
        if value is None:
            return  # Cleanly terminate
        total += value
        count += 1
        average = total / count

# Usage:
avg = running_average()
next(avg)            # "Prime" the generator (advance to the first yield)
print(avg.send(10))  # 10.0
print(avg.send(20))  # 15.0
print(avg.send(30))  # 20.0
print(avg.send(15))  # 18.75
The yield expression works both ways: it sends a value out (the average) and receives a value in (the new data point). The next(avg) call at the beginning is required to advance the generator to the first yield — this is called "priming" the generator.
This pattern was the precursor to Python's async/await system. Before PEP 492 gave us native coroutines, developers used generator-based coroutines with .send() extensively. Modern code should use async/await for concurrent I/O (as we covered in episodes #40 and #41), but understanding .send() explains why async and await work the way they do — they're syntactic sugar over a generator-like protocol.
Generator cleanup with .close() and .throw()
Generators support two more methods for lifecycle management:
def managed_resource():
    print("Opening resource")
    try:
        while True:
            data = yield "ready"
            print(f"Processing: {data}")
    except GeneratorExit:
        print("Cleaning up resource")
    finally:
        print("Finally block executed")

gen = managed_resource()
next(gen)            # Opening resource → 'ready'
gen.send("chunk 1")  # Processing: chunk 1
gen.send("chunk 2")  # Processing: chunk 2
gen.close()          # Cleaning up resource → Finally block executed
.close() throws a GeneratorExit exception into the generator, giving it a chance to clean up. This is called automatically when a generator is garbage collected, which is why generators work well as context-manager-like resources.
.throw() lets you inject an exception into a generator at the point where it's suspended:
def resilient_processor():
    while True:
        try:
            value = yield
            print(f"Processed: {value}")
        except ValueError as e:
            print(f"Skipped bad value: {e}")

gen = resilient_processor()
next(gen)
gen.send("good data")                     # Processed: good data
gen.throw(ValueError("corrupt record"))   # Skipped bad value: corrupt record
gen.send("more data")                     # Processed: more data
The generator catches the injected exception and continues. This enables error recovery in pipeline architectures without breaking the entire chain.
Infinite generators
Because generators are lazy, they can represent infinite sequences. This is impossible with lists (you'd run out of memory) but trivial with generators:
import itertools

def fibonacci():
    """Infinite Fibonacci sequence."""
    a, b = 0, 1
    while True:
        yield a
        a, b = b, a + b

def primes():
    """Infinite prime sequence (simple trial division)."""
    yield 2
    candidate = 3
    found = [2]
    while True:
        if all(candidate % p != 0 for p in found):
            found.append(candidate)
            yield candidate
        candidate += 2

# First 10 Fibonacci numbers
print(list(itertools.islice(fibonacci(), 10)))
# [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]

# First 15 primes
print(list(itertools.islice(primes(), 15)))
# [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47]

# Sum of the first 1000 Fibonacci numbers
print(sum(itertools.islice(fibonacci(), 1000)))
# (a very large number)
itertools.islice is the standard way to take a finite slice from an infinite generator. You could also use zip with a range or write a custom take() function.
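Such a take() helper is a one-liner wrapping islice (this mirrors the well-known recipe from the itertools documentation):

```python
import itertools

def take(n, iterable):
    """Return the first n items of an iterable as a list."""
    return list(itertools.islice(iterable, n))

def naturals():
    """Infinite sequence 0, 1, 2, ..."""
    n = 0
    while True:
        yield n
        n += 1

print(take(5, naturals()))  # [0, 1, 2, 3, 4]
```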
itertools: the standard library's generator toolkit
The itertools module provides a collection of blazing-fast, memory-efficient building blocks for iterator-based programming. These are implemented in C, so they're significantly faster than writing the equivalent Python generator:
import itertools
# chain: concatenate multiple iterables
combined = itertools.chain([1, 2], [3, 4], [5, 6])
print(list(combined)) # [1, 2, 3, 4, 5, 6]
# cycle: repeat an iterable forever
colors = itertools.cycle(['red', 'green', 'blue'])
print(list(itertools.islice(colors, 7)))
# ['red', 'green', 'blue', 'red', 'green', 'blue', 'red']
# count: infinite arithmetic progression
counter = itertools.count(start=10, step=3)
print(list(itertools.islice(counter, 5))) # [10, 13, 16, 19, 22]
# takewhile / dropwhile: conditional slicing
nums = [1, 3, 5, 7, 2, 4, 6, 8]
print(list(itertools.takewhile(lambda x: x < 6, nums))) # [1, 3, 5]
print(list(itertools.dropwhile(lambda x: x < 6, nums))) # [7, 2, 4, 6, 8]
# accumulate: running reduction
data = [1, 2, 3, 4, 5]
print(list(itertools.accumulate(data))) # [1, 3, 6, 10, 15] (running sum)
print(list(itertools.accumulate(data, max))) # [1, 2, 3, 4, 5] (running max)
print(list(itertools.accumulate(data, lambda a, b: a * b))) # [1, 2, 6, 24, 120] (running product)
And the combinatoric generators:
# product: cartesian product
print(list(itertools.product('AB', [1, 2])))
# [('A', 1), ('A', 2), ('B', 1), ('B', 2)]
# permutations and combinations
print(list(itertools.permutations('ABC', 2)))
# [('A', 'B'), ('A', 'C'), ('B', 'A'), ('B', 'C'), ('C', 'A'), ('C', 'B')]
print(list(itertools.combinations('ABCD', 2)))
# [('A', 'B'), ('A', 'C'), ('A', 'D'), ('B', 'C'), ('B', 'D'), ('C', 'D')]
All of these return generators (lazy evaluation), so itertools.product(range(100), range(100), range(100)) doesn't create a list of a million tuples — it yields them one at a time.
groupby: the tricky one
itertools.groupby groups consecutive elements that share a key. The "consecutive" part trips people up:
import itertools

# Data MUST be sorted by the grouping key first!
data = [
    ('Alice', 'Engineering'),
    ('Bob', 'Engineering'),
    ('Carol', 'Marketing'),
    ('Dave', 'Marketing'),
    ('Eve', 'Engineering'),
]

# This groups CONSECUTIVE items, so Eve is a separate Engineering group:
for dept, members in itertools.groupby(data, key=lambda x: x[1]):
    print(f"{dept}: {[m[0] for m in members]}")
# Engineering: ['Alice', 'Bob']
# Marketing: ['Carol', 'Dave']
# Engineering: ['Eve'] ← separate group!

# Sort first for true grouping:
sorted_data = sorted(data, key=lambda x: x[1])
for dept, members in itertools.groupby(sorted_data, key=lambda x: x[1]):
    print(f"{dept}: {[m[0] for m in members]}")
# Engineering: ['Alice', 'Bob', 'Eve']
# Marketing: ['Carol', 'Dave']
The reason groupby works on consecutive runs (rather than collecting all matching elements like SQL's GROUP BY) is precisely because it's a generator — it processes the stream item by item and has no lookahead. If you want SQL-style grouping, sort first.
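If you'd rather not sort (or can't, because the stream is too large to sort but the number of distinct groups is small), a defaultdict gives you SQL-style grouping in a single pass, at the cost of materializing the group members in memory:

```python
from collections import defaultdict

data = [
    ('Alice', 'Engineering'),
    ('Bob', 'Engineering'),
    ('Carol', 'Marketing'),
    ('Dave', 'Marketing'),
    ('Eve', 'Engineering'),
]

# One pass, no sorting needed — each name lands in its department bucket
groups = defaultdict(list)
for name, dept in data:
    groups[dept].append(name)

print(dict(groups))
# {'Engineering': ['Alice', 'Bob', 'Eve'], 'Marketing': ['Carol', 'Dave']}
```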
Real-world example: processing large CSV data
Let's put everything together in a realistic scenario — processing a large dataset with a pipeline of generators:
import csv
import itertools
from collections import defaultdict

def read_csv_rows(filename):
    """Read CSV file lazily, one row at a time."""
    with open(filename, newline='') as f:
        reader = csv.DictReader(f)
        for row in reader:
            yield row

def parse_numeric(rows, fields):
    """Convert specified fields from strings to floats."""
    for row in rows:
        for field in fields:
            try:
                row[field] = float(row[field])
            except (ValueError, KeyError):
                row[field] = 0.0
        yield row

def filter_rows(rows, predicate):
    """Keep only rows matching a predicate function."""
    for row in rows:
        if predicate(row):
            yield row

def running_totals(rows, amount_field, group_field):
    """Compute running total per group."""
    totals = defaultdict(float)
    for row in rows:
        group = row[group_field]
        totals[group] += row[amount_field]
        row['running_total'] = totals[group]
        yield row

# Compose the pipeline:
rows = read_csv_rows('transactions.csv')
parsed = parse_numeric(rows, ['amount'])
large_only = filter_rows(parsed, lambda r: r['amount'] > 100)
with_totals = running_totals(large_only, 'amount', 'category')

# Process — constant memory regardless of file size
for row in itertools.islice(with_totals, 20):  # Preview first 20
    print(f"{row['category']:>15} | ${row['amount']:>10.2f} | "
          f"Running: ${row['running_total']:>12.2f}")
This pipeline can process a 100GB CSV file using only a few kilobytes of memory. Each generator transforms one row and passes it along. Nothing is stored. Nothing is buffered. The file is never loaded into memory. That's the power of generator pipelines.
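Wiring the stages together by hand gets repetitive. A tiny helper can thread a source through any number of stages (pipeline, doubled, and only_big below are illustrative names, not standard library functions):

```python
def pipeline(source, *stages):
    """Thread an iterable through a sequence of generator stages."""
    for stage in stages:
        source = stage(source)
    return source

def doubled(nums):
    for n in nums:
        yield n * 2

def only_big(nums):
    for n in nums:
        if n > 4:
            yield n

print(list(pipeline(range(5), doubled, only_big)))  # [6, 8]
```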
When NOT to use generators
Generators aren't always the answer. Here are the cases where a list is better:
# Bad — need random access
gen = (x**2 for x in range(100))
# gen[50] ← TypeError! Generators don't support indexing
# Bad — need to iterate multiple times
gen = (x**2 for x in range(100))
total = sum(gen)
average = total / len(list(gen))  # ZeroDivisionError — gen is already exhausted, so list(gen) is empty!
# Bad — need to know length upfront
gen = (x for x in range(100) if x % 7 == 0)
# len(gen) ← TypeError! Generators don't have length
# Bad — small dataset where list is simpler
names = [name.upper() for name in ['alice', 'bob', 'carol']]
# No point using a generator for 3 items
Use lists when:
- You need random access (indexing, slicing)
- You need to iterate multiple times
- You need len(), .sort(), .reverse(), or other list methods
- The data fits comfortably in memory and is small
Use generators when:
- You iterate once through large or infinite data
- You're building processing pipelines
- You're feeding into sum(), max(), min(), any(), all(), or similar aggregators
- Memory efficiency matters
In summary
In this episode, we explored the iteration machinery that powers Python:
- The iterator protocol (__iter__ and __next__) is the foundation of all iteration in Python
- Iterables create fresh iterators; iterators produce values and track position
- Generator functions (using yield) are the Pythonic shortcut for writing iterators
- Generator expressions (x for x in ...) are memory-efficient alternatives to list comprehensions
- Generator pipelines compose multiple stages for stream processing with constant memory
- yield from delegates to sub-generators and handles the full iterator protocol
- .send() enables bidirectional communication with running generators
- .close() and .throw() support lifecycle management and error injection
- Infinite generators represent unbounded sequences in finite memory
- itertools provides C-optimized building blocks: chain, cycle, islice, accumulate, groupby, product, permutations, combinations
- groupby groups consecutive elements — sort first for SQL-style grouping
The shift from "materialize everything, then process" to "process as you go" is a fundamental change in how you think about data. Generators make this shift natural. They're one of Python's most powerful features — and now you know how the machinery works under the hood.