Learn Python Series (#48) - Concurrency - Threading vs Multiprocessing

Repository
What will I learn
- You will learn the fundamental difference between threading and multiprocessing in Python;
- what the Global Interpreter Lock (GIL) is, why it exists, and why it matters for performance;
- when to use threads vs. processes vs. async (we covered async in episodes #40 and #41);
- how to safely share data between threads and between processes;
- common concurrency pitfalls — race conditions, deadlocks — and how to avoid them;
- the high-level `concurrent.futures` interface that unifies both models.
Requirements
- A working modern computer running macOS, Windows or Ubuntu;
- An installed Python 3(.11+) distribution, such as (for example) the Anaconda Distribution;
- The ambition to learn Python programming.
Difficulty
- Intermediate, advanced
Curriculum (of the Learn Python Series):
- Learn Python Series - Intro
- Learn Python Series (#2) - Handling Strings Part 1
- Learn Python Series (#3) - Handling Strings Part 2
- Learn Python Series (#4) - Round-Up #1
- Learn Python Series (#5) - Handling Lists Part 1
- Learn Python Series (#6) - Handling Lists Part 2
- Learn Python Series (#7) - Handling Dictionaries
- Learn Python Series (#8) - Handling Tuples
- Learn Python Series (#9) - Using Import
- Learn Python Series (#10) - Matplotlib Part 1
- Learn Python Series (#11) - NumPy Part 1
- Learn Python Series (#12) - Handling Files
- Learn Python Series (#13) - Mini Project - Developing a Web Crawler Part 1
- Learn Python Series (#14) - Mini Project - Developing a Web Crawler Part 2
- Learn Python Series (#15) - Handling JSON
- Learn Python Series (#16) - Mini Project - Developing a Web Crawler Part 3
- Learn Python Series (#17) - Roundup #2 - Combining and analyzing any-to-any multi-currency historical data
- Learn Python Series (#18) - PyMongo Part 1
- Learn Python Series (#19) - PyMongo Part 2
- Learn Python Series (#20) - PyMongo Part 3
- Learn Python Series (#21) - Handling Dates and Time Part 1
- Learn Python Series (#22) - Handling Dates and Time Part 2
- Learn Python Series (#23) - Handling Regular Expressions Part 1
- Learn Python Series (#24) - Handling Regular Expressions Part 2
- Learn Python Series (#25) - Handling Regular Expressions Part 3
- Learn Python Series (#26) - pipenv & Visual Studio Code
- Learn Python Series (#27) - Handling Strings Part 3 (F-Strings)
- Learn Python Series (#28) - Using Pickle and Shelve
- Learn Python Series (#29) - Handling CSV
- Learn Python Series (#30) - Data Science Part 1 - Pandas
- Learn Python Series (#31) - Data Science Part 2 - Pandas
- Learn Python Series (#32) - Data Science Part 3 - Pandas
- Learn Python Series (#33) - Data Science Part 4 - Pandas
- Learn Python Series (#34) - Working with APIs in 2026: What's Changed
- Learn Python Series (#35) - Working with APIs Part 2: Beyond GET Requests
- Learn Python Series (#36) - Type Hints and Modern Python
- Learn Python Series (#37) - Virtual Environments and Dependency Management
- Learn Python Series (#38) - Testing Your Code Part 1
- Learn Python Series (#39) - Testing Your Code Part 2
- Learn Python Series (#40) - Asynchronous Python Part 1
- Learn Python Series (#41) - Asynchronous Python Part 2
- Learn Python Series (#42) - Building CLI Applications
- Learn Python Series (#43) - Mini Project - Crypto Price Tracker
- Learn Python Series (#44) - Context Managers & Decorators Deep Dive
- Learn Python Series (#45) - Metaclasses & Class Design Patterns
- Learn Python Series (#46) - Descriptors & Properties
- Learn Python Series (#47) - Generators & Iterators Advanced
- Learn Python Series (#48) - Concurrency - Threading vs Multiprocessing (this post)
GitHub Account
Learn Python Series (#48) - Concurrency - Threading vs Multiprocessing
Your CPU has 8, 12, maybe 16 cores. Python uses one of them. One. And if you've ever wondered why your CPU-intensive Python script pegs a single core at 100% while the rest sit idle — welcome to the GIL, Python's most controversial design decision.
But here's the nuance that most "Python is slow" hot takes miss entirely: for I/O-bound work (network requests, database queries, file reads), Python's threading works beautifully. The problem is specifically CPU-bound parallelism. Knowing the difference — and picking the right concurrency tool — is the entire game.
In episodes #40 and #41, we covered asyncio — Python's single-threaded, cooperative concurrency model. This episode covers the other two concurrency models: threading (for I/O-bound concurrency with shared memory) and multiprocessing (for CPU-bound true parallelism). By the end, you'll know exactly when to reach for each one ;-)
The mental model: concurrency vs. parallelism
These two words get used interchangeably, but they mean different things:
Concurrency is about managing multiple tasks at once. Tasks may interleave (take turns) on a single core — like a chef preparing three dishes by switching between them.
Parallelism is about executing multiple tasks simultaneously. Tasks run at the same time on different cores — like three chefs each preparing one dish.
Threading in CPython provides concurrency (interleaving). Multiprocessing provides parallelism (simultaneous execution). The distinction matters enormously, because concurrent code may not run any faster (tasks still share a single core), while parallel code can achieve linear speedup on multi-core machines.
The Global Interpreter Lock (GIL): what it is and why it exists
CPython (the standard Python interpreter — the one you're almost certainly using) has a Global Interpreter Lock: a mutex that prevents multiple threads from executing Python bytecode at the same time.
Even on a 16-core machine, if you spawn 16 Python threads doing computation, they don't run in parallel. They take turns — only one thread holds the GIL and executes bytecode at any given moment. The others wait.
Why on earth would Python do this?
It's a pragmatic engineering tradeoff. CPython's memory management uses reference counting for garbage collection:
import sys
a = [] # refcount = 1
b = a # refcount = 2
print(sys.getrefcount(a)) # 3 (including the getrefcount argument itself)
del b # refcount drops to 2
Every assignment, every function call, every variable read modifies reference counts. Without the GIL, every single one of these operations would need its own fine-grained lock to be thread-safe. That would make all Python code slower (even single-threaded code), and would be a nightmare to implement correctly. The GIL is a single coarse lock that makes the entire interpreter thread-safe in one stroke.
The practical impact:
- I/O-bound tasks: Threading works great — threads release the GIL during I/O operations (network calls, file reads, `time.sleep()`). While one thread waits for a response, another thread can run.
- CPU-bound tasks: Threading provides zero speedup — the GIL prevents parallel execution of Python bytecode. You need separate processes (each with its own interpreter and its own GIL) for true parallelism.
A note on Python 3.13+: PEP 703 introduced an experimental "free-threaded" CPython build that removes the GIL entirely. As of early 2026, this is still experimental and not the default build. The concepts in this episode remain the standard approach for production Python.
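You can see the GIL's effect on CPU-bound code with a quick benchmark. The sketch below uses a pure-Python busy loop (the iteration count is arbitrary; absolute timings depend on your machine), splitting the same total work across two threads:

```python
import threading
import time

def count_down(n):
    """Pure-Python busy loop -- holds the GIL while computing."""
    while n > 0:
        n -= 1

N = 5_000_000

# Single-threaded baseline: do the work twice, back to back
start = time.perf_counter()
count_down(N)
count_down(N)
single = time.perf_counter() - start

# Two threads, half the total work each -- under the GIL they take turns,
# so expect no speedup (often it's slightly slower due to switching overhead)
start = time.perf_counter()
t1 = threading.Thread(target=count_down, args=(N,))
t2 = threading.Thread(target=count_down, args=(N,))
t1.start(); t2.start()
t1.join(); t2.join()
threaded = time.perf_counter() - start

print(f"Single-threaded: {single:.2f}s")
print(f"Two threads:     {threaded:.2f}s")
```

On a standard (GIL-enabled) CPython build, the two runs come out roughly equal, which is exactly the point: the threads interleave instead of running in parallel.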
Threading for I/O-bound work
When your program spends most of its time waiting — for network responses, for disk reads, for database queries — threading provides real speedup because threads release the GIL during these waits:
import threading
import time
import urllib.request
def fetch_url(url):
"""Fetch a URL and print response size."""
start = time.perf_counter()
with urllib.request.urlopen(url) as response:
data = response.read()
elapsed = time.perf_counter() - start
print(f" {url}: {len(data):,} bytes in {elapsed:.2f}s")
urls = [
'https://www.python.org',
'https://docs.python.org/3/',
'https://pypi.org',
'https://peps.python.org',
'https://wiki.python.org',
]
# Sequential: each request waits for the previous one
print("Sequential:")
start = time.perf_counter()
for url in urls:
fetch_url(url)
print(f"Total: {time.perf_counter() - start:.2f}s\n")
# Threaded: requests happen concurrently
print("Threaded:")
start = time.perf_counter()
threads = []
for url in urls:
t = threading.Thread(target=fetch_url, args=(url,))
t.start()
threads.append(t)
for t in threads:
t.join() # Wait for all threads to finish
print(f"Total: {time.perf_counter() - start:.2f}s")
Typical results on a reasonable connection:
- Sequential: ~2.5 seconds (each request waits for the previous one)
- Threaded: ~0.6 seconds (all requests happen concurrently)
That's a ~4x speedup with minimal code change. During each urlopen() call, the thread releases the GIL, allowing other threads to make their requests simultaneously. The threads aren't truly parallel (they share one Python interpreter), but they overlap their waiting time — which is what matters for I/O-bound work.
Thread safety and race conditions
Threads share memory. That's both their advantage (easy data sharing) and their curse (easy data corruption). Here's the classic race condition:
import threading
counter = 0
def increment():
global counter
for _ in range(100_000):
counter += 1 # This is NOT atomic!
threads = [threading.Thread(target=increment) for _ in range(10)]
for t in threads:
t.start()
for t in threads:
t.join()
print(f"Expected: 1,000,000")
print(f"Actual: {counter}") # Something less, e.g. 734,219
Why? counter += 1 looks atomic but compiles to multiple bytecode instructions:
- LOAD `counter` from global scope
- LOAD constant `1`
- ADD them together
- STORE the result back to `counter`
The GIL can release between any of these steps. So Thread A loads counter = 42, then Thread B loads counter = 42 (same value!), both add 1, and both store 43. Two increments, but the counter only went up by 1. This is called a lost update.
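You can inspect those instructions yourself with the standard `dis` module (the exact opcode names vary between CPython versions, but the load/add/store sequence is always visible):

```python
import dis

counter = 0

def increment():
    global counter
    counter += 1   # one statement, several bytecode instructions

# Disassemble the function to see the separate LOAD / ADD / STORE steps
dis.dis(increment)
```

Running this on CPython 3.11, for example, shows a `LOAD_GLOBAL (counter)`, a `LOAD_CONST (1)`, a `BINARY_OP` for the addition, and a `STORE_GLOBAL (counter)` — four distinct points where the interpreter can switch threads.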
Fixing it with locks
A threading.Lock ensures mutual exclusion:
import threading
counter = 0
lock = threading.Lock()
def increment():
global counter
for _ in range(100_000):
with lock: # Only one thread at a time
counter += 1
threads = [threading.Thread(target=increment) for _ in range(10)]
for t in threads:
t.start()
for t in threads:
t.join()
print(f"Result: {counter}") # Exactly 1,000,000
The with lock: context manager acquires the lock, executes the body, and releases it. Other threads trying to acquire the same lock will block until it's released.
But be careful — overusing locks negates the benefit of threading. If every operation acquires a lock, you've effectively serialized your code. The art is locking only the critical sections where shared state is modified.
Other synchronization primitives
Python's threading module provides several tools beyond basic locks:
import threading
# RLock: Re-entrant lock (same thread can acquire multiple times)
rlock = threading.RLock()
with rlock:
with rlock: # Wouldn't deadlock! RLock allows re-entry
pass
# Semaphore: Allow up to N concurrent accesses
semaphore = threading.Semaphore(3) # Max 3 threads at once
with semaphore:
pass # Only 3 threads can be here simultaneously
# Event: Signal between threads
event = threading.Event()
def waiter():
print("Waiting for signal...")
event.wait() # Blocks until event is set
print("Got the signal!")
def signaler():
import time
time.sleep(1)
event.set() # Wake up all waiters
threading.Thread(target=waiter).start()
threading.Thread(target=signaler).start()
# Condition: Wait for a specific condition
condition = threading.Condition()
data_ready = False
def producer():
global data_ready
with condition:
data_ready = True
condition.notify_all() # Wake up consumers
def consumer():
with condition:
condition.wait_for(lambda: data_ready)
print("Data is ready!")
Semaphore is particularly useful for rate-limiting — for example, limiting concurrent API requests to avoid hitting rate limits. We'll see this with ThreadPoolExecutor shortly.
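As a sketch of that rate-limiting idea (the task count, limit of 3, and the `time.sleep` stand-in for a network call are all arbitrary choices here):

```python
import threading
import time

max_concurrent = threading.Semaphore(3)  # at most 3 "requests" in flight

def limited_fetch(i):
    with max_concurrent:        # blocks if 3 threads are already inside
        print(f"Request {i} started")
        time.sleep(0.3)         # stand-in for a real network call
        print(f"Request {i} done")

threads = [threading.Thread(target=limited_fetch, args=(i,)) for i in range(9)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With 9 tasks and a limit of 3, the tasks run in roughly three waves: all nine threads start immediately, but only three at a time get past the semaphore.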
Multiprocessing for CPU-bound work
For computation that genuinely needs parallel execution, you need separate processes. Each process gets its own Python interpreter with its own GIL:
import multiprocessing
import time
import math
def compute_primes(limit):
"""Find primes up to limit using trial division."""
primes = []
for num in range(2, limit):
if all(num % p != 0 for p in primes if p * p <= num):
primes.append(num)
return len(primes)
ranges = [200_000, 200_000, 200_000, 200_000]
# Sequential
start = time.perf_counter()
results_seq = [compute_primes(r) for r in ranges]
seq_time = time.perf_counter() - start
print(f"Sequential: {seq_time:.2f}s — {results_seq}")
# Parallel with Pool
start = time.perf_counter()
with multiprocessing.Pool(processes=4) as pool:
results_par = pool.map(compute_primes, ranges)
par_time = time.perf_counter() - start
print(f"Parallel: {par_time:.2f}s — {results_par}")
print(f"Speedup: {seq_time / par_time:.1f}x")
On a 4-core machine, you'll see close to a 4x speedup. Each process has its own GIL, its own memory space, and runs on its own core. True parallelism.
Process communication
Processes don't share memory by default (unlike threads). This is actually a feature — isolated memory means no race conditions by design. But when you need to exchange data, Python provides several mechanisms:
Queues
multiprocessing.Queue is the most common IPC (inter-process communication) tool:
from multiprocessing import Process, Queue
import time
def worker(task_queue, result_queue, worker_id):
"""Process tasks from queue, put results in result queue."""
while True:
task = task_queue.get()
if task is None:
break # Poison pill — shut down
# Simulate expensive computation
result = sum(i**2 for i in range(task))
result_queue.put((worker_id, task, result))
# Create queues
tasks = Queue()
results = Queue()
# Start 4 workers
workers = []
for i in range(4):
p = Process(target=worker, args=(tasks, results, i))
p.start()
workers.append(p)
# Send tasks
for n in [100_000, 200_000, 150_000, 300_000, 250_000, 180_000]:
tasks.put(n)
# Send poison pills (one per worker)
for _ in workers:
tasks.put(None)
# Collect results
for _ in range(6):
worker_id, task, result = results.get()
print(f" Worker {worker_id}: computed sum of squares up to {task:,}")
# Clean up
for p in workers:
p.join()
The poison pill pattern (sending None to signal shutdown) is a clean way to terminate worker processes. Each worker pulls tasks from the queue, processes them, and pushes results to a result queue. This is essentially a home-built task pool.
Shared memory with Value and Array
When you genuinely need shared state between processes (use sparingly!):
from multiprocessing import Process, Value, Lock
def increment(shared_counter, lock, n):
for _ in range(n):
with lock:
shared_counter.value += 1
counter = Value('i', 0) # 'i' = signed int, initial value 0
lock = Lock()
processes = [
Process(target=increment, args=(counter, lock, 100_000))
for _ in range(4)
]
for p in processes:
p.start()
for p in processes:
p.join()
print(f"Counter: {counter.value}") # 400,000
Value('i', 0) creates a shared integer in memory mapped between processes. The 'i' is a ctypes type code — 'i' for int, 'd' for double, 'f' for float. You still need a lock to prevent race conditions, just like with threads.
concurrent.futures: the high-level interface
The concurrent.futures module provides ThreadPoolExecutor and ProcessPoolExecutor — two classes with the exact same API but different execution models. This makes switching between threads and processes trivial:
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
from concurrent.futures import as_completed
import time
import urllib.request
def fetch_size(url):
"""Return (url, size) tuple."""
with urllib.request.urlopen(url, timeout=10) as resp:
return url, len(resp.read())
def compute_factorial(n):
"""CPU-heavy: compute factorial iteratively."""
result = 1
for i in range(2, n + 1):
result *= i
return n, len(str(result)) # Return n and digit count
# I/O-bound: use ThreadPoolExecutor
urls = [
'https://www.python.org',
'https://docs.python.org/3/',
'https://pypi.org',
]
print("Fetching URLs (threaded):")
with ThreadPoolExecutor(max_workers=5) as executor:
futures = {executor.submit(fetch_size, url): url for url in urls}
for future in as_completed(futures):
url, size = future.result()
print(f" {url}: {size:,} bytes")
# CPU-bound: use ProcessPoolExecutor
numbers = [50_000, 60_000, 70_000, 80_000]
print("\nComputing factorials (parallel processes):")
with ProcessPoolExecutor(max_workers=4) as executor:
futures = {executor.submit(compute_factorial, n): n for n in numbers}
for future in as_completed(futures):
n, digits = future.result()
print(f" {n}! has {digits:,} digits")
The beauty of this API is as_completed() — it yields futures as they finish, regardless of submission order. First result back? You see it first. No waiting for slow tasks to unblock fast ones.
Error handling with futures
Futures capture exceptions cleanly:
from concurrent.futures import ThreadPoolExecutor
def risky_fetch(url):
if 'badurl' in url:
raise ConnectionError(f"Cannot connect to {url}")
return f"Success: {url}"
urls = ['https://python.org', 'https://badurl.invalid', 'https://pypi.org']
with ThreadPoolExecutor(max_workers=3) as executor:
futures = {executor.submit(risky_fetch, url): url for url in urls}
for future in futures:
try:
result = future.result(timeout=5)
print(f" OK: {result}")
except ConnectionError as e:
print(f" FAILED: {e}")
except TimeoutError:
print(f" TIMEOUT: {futures[future]}")
Exceptions raised inside a worker are stored in the Future object and re-raised when you call .result(). This is much cleaner than the low-level threading approach where exceptions in threads are silently swallowed.
Deadlocks: the silent killer
A deadlock occurs when two or more threads (or processes) each hold a resource the other needs, and neither will release theirs first:
import threading
import time
lock_a = threading.Lock()
lock_b = threading.Lock()
def thread_1():
with lock_a:
print("Thread 1: holding lock_a, waiting for lock_b...")
time.sleep(0.1) # Gives thread_2 time to acquire lock_b
with lock_b:
print("Thread 1: got both locks")
def thread_2():
with lock_b:
print("Thread 2: holding lock_b, waiting for lock_a...")
time.sleep(0.1) # Gives thread_1 time to acquire lock_a
with lock_a:
print("Thread 2: got both locks")
t1 = threading.Thread(target=thread_1)
t2 = threading.Thread(target=thread_2)
t1.start()
t2.start()
# Both threads hang forever — deadlock!
Thread 1 holds lock_a and waits for lock_b. Thread 2 holds lock_b and waits for lock_a. Neither can proceed. Your program hangs silently — no exception, no error message, just... nothing.
Prevention strategies:
- Always acquire locks in the same order. If both threads acquire `lock_a` first, then `lock_b`, no deadlock can occur.
- Use timeouts:
acquired = lock_b.acquire(timeout=2.0)
if not acquired:
    print("Couldn't get lock_b — backing off")
    # Release lock_a, retry later, or fail gracefully
- Avoid holding multiple locks when possible. Restructure your code so each critical section needs only one lock.
- Use higher-level abstractions. `Queue`, `concurrent.futures`, and `multiprocessing.Pool` handle synchronization internally — you don't manage locks yourself.
The decision matrix: which tool when?
After years of writing concurrent Python, here's my practical decision tree:
| Scenario | Tool | Why |
|---|---|---|
| Many HTTP requests | ThreadPoolExecutor or asyncio | I/O-bound, GIL not a factor |
| Database queries in parallel | ThreadPoolExecutor | I/O-bound, shared connection pool |
| Processing large images | ProcessPoolExecutor | CPU-bound, need true parallelism |
| Number crunching (no NumPy) | ProcessPoolExecutor | CPU-bound Python code |
| Number crunching (with NumPy) | NumPy/threading | NumPy releases GIL internally |
| 10,000+ concurrent connections | asyncio | Lowest overhead per connection |
| Simple background task | threading.Thread | Low overhead, easy to set up |
| Periodic background work | threading.Timer | Built-in scheduling |
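The `threading.Timer` row from the table deserves a quick illustration: a Timer runs a function once after a delay, and having the function reschedule itself gives simple periodic work. The interval and tick count below are arbitrary, and the `Event` is just a clean way to know when the countdown has finished:

```python
import threading

done = threading.Event()

def heartbeat(remaining):
    """Print a tick, then reschedule until the countdown hits zero."""
    print(f"tick ({remaining} left)")
    if remaining > 1:
        threading.Timer(0.1, heartbeat, args=(remaining - 1,)).start()
    else:
        done.set()   # signal the main thread that we're finished

# Fire the first tick after 0.1 seconds
threading.Timer(0.1, heartbeat, args=(3,)).start()
done.wait()          # block until the last tick has fired
```

For anything beyond toy scheduling, a dedicated library (or a worker thread with a loop and `Event.wait(timeout)`) is more robust than chained Timers, but for a simple delayed or periodic callback this is all you need.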
A few important nuances:
NumPy releases the GIL. If your CPU-bound work is NumPy operations, threading actually works — NumPy's C extensions release the GIL during computation. Same goes for many other C-extension libraries (Pandas, scikit-learn, etc.).
Process startup is expensive. Creating a process takes ~100ms and duplicates the entire interpreter. Don't create processes for tiny tasks. Use a Pool or ProcessPoolExecutor to reuse processes across many tasks.
Async is not faster for CPU work. We covered asyncio in episodes #40 and #41 — it's excellent for I/O concurrency, but it's single-threaded. Don't reach for asyncio when your bottleneck is computation.
Combining threading and multiprocessing
Sometimes you need both. A common pattern is multiprocessing for CPU-bound work with threading inside each process for I/O:
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
import urllib.request
import time
def fetch_and_process(url):
"""Fetch a page (I/O) then do computation (CPU)."""
# I/O-bound: fetch the page
with urllib.request.urlopen(url, timeout=10) as resp:
data = resp.read()
# CPU-bound: process the data (simulate with hash computation)
import hashlib
for _ in range(1000):
data = hashlib.sha256(data).digest()
return url, len(data)
urls = [
'https://www.python.org',
'https://docs.python.org/3/',
'https://pypi.org',
'https://peps.python.org',
] * 3 # 12 URLs total
# Use ProcessPoolExecutor for the combined I/O + CPU work
start = time.perf_counter()
with ProcessPoolExecutor(max_workers=4) as executor:
results = list(executor.map(fetch_and_process, urls))
elapsed = time.perf_counter() - start
print(f"Processed {len(results)} URLs in {elapsed:.2f}s")
for url, size in results[:4]:
print(f" {url}: {size} bytes (after hashing)")
Each process handles both the I/O and CPU work for its assigned URLs. Since processes have their own GIL, the CPU work runs in true parallel across cores.
The if __name__ == '__main__' guard
One gotcha that catches every Python beginner with multiprocessing:
from multiprocessing import Process
def worker():
print("Worker running")
# This MUST be inside the guard on Windows/macOS:
if __name__ == '__main__':
p = Process(target=worker)
p.start()
p.join()
On Windows and macOS (which use the "spawn" start method), new processes import the main module to set themselves up. Without the guard, this creates an infinite loop of process spawning. Always protect multiprocessing code with `if __name__ == '__main__'`. On Linux, where the default has historically been "fork" (Python 3.14 changed the default to "forkserver"), the code may work without the guard — but it's good practice everywhere.
Real-world example: parallel file processing
Let's put it all together with a practical example — processing a directory of text files in parallel:
from concurrent.futures import ProcessPoolExecutor, as_completed
from pathlib import Path
import re
from collections import Counter
def analyze_file(filepath):
"""Analyze a single text file: word count, line count, top words."""
text = Path(filepath).read_text(encoding='utf-8', errors='ignore')
words = re.findall(r'\b[a-z]+\b', text.lower())
return {
'file': filepath.name,
'lines': text.count('\n'),
'words': len(words),
'unique_words': len(set(words)),
'top_5': Counter(words).most_common(5),
}
def analyze_directory(directory, pattern='*.txt'):
"""Analyze all matching files in parallel."""
files = list(Path(directory).glob(pattern))
if not files:
print(f"No {pattern} files found in {directory}")
return
print(f"Analyzing {len(files)} files using 4 processes...\n")
results = []
with ProcessPoolExecutor(max_workers=4) as executor:
future_to_file = {
executor.submit(analyze_file, f): f for f in files
}
for future in as_completed(future_to_file):
filepath = future_to_file[future]
try:
result = future.result()
results.append(result)
print(f" Done: {result['file']} "
f"({result['words']:,} words, "
f"{result['unique_words']:,} unique)")
except Exception as e:
print(f" Error processing {filepath.name}: {e}")
# Summary
total_words = sum(r['words'] for r in results)
total_lines = sum(r['lines'] for r in results)
print(f"\nTotal: {total_words:,} words, "
f"{total_lines:,} lines across {len(results)} files")
if __name__ == '__main__':
analyze_directory('/path/to/text/files', '*.txt')
Each file is processed by a separate process — true parallel execution on multiple cores. Results stream back via as_completed(), so you see progress as files finish. Error handling is built in via the Future pattern. And the if __name__ == '__main__' guard ensures clean process spawning.
Okay, to summarize
In this episode, we explored concurrency and parallelism in Python:
- Concurrency (threading) manages multiple tasks by interleaving; parallelism (multiprocessing) executes them simultaneously
- The GIL prevents threads from running Python bytecode in parallel — it exists to protect CPython's reference counting
- Threading works great for I/O-bound tasks because I/O operations release the GIL
- Multiprocessing provides true parallelism via separate processes, each with its own GIL
- Race conditions occur when threads share mutable state without synchronization — use `Lock`, `RLock`, `Semaphore`, or `Event`
- Processes communicate via Queues, Pipes, or shared `Value`/`Array` objects
- `concurrent.futures` provides `ThreadPoolExecutor` and `ProcessPoolExecutor` with the same clean API
- `as_completed()` yields results as they finish — no waiting for slow tasks
- Deadlocks happen when locks are acquired in inconsistent order — always acquire in the same order, or use timeouts
- NumPy (and similar C-extensions) release the GIL, so threading works for NumPy-heavy computation
- Always use the `if __name__ == '__main__'` guard with multiprocessing
The golden rule: profile first, then choose your concurrency model based on whether the bottleneck is I/O (threads or async) or CPU (processes). Don't guess — measure ;-)