Learn Python Series (#56) - NumPy Beyond the Basics

Repository
What will I learn
- You will learn why memory layout (C-order vs Fortran-order) matters for performance;
- how views vs copies work in NumPy and when modifications propagate;
- advanced broadcasting rules and powerful patterns like distance matrices;
- vectorization patterns that replace Python loops entirely;
- structured arrays for heterogeneous data without Pandas overhead;
- how NumPy connects to the machine learning ecosystem (scikit-learn, PyTorch).
Requirements
- A working modern computer running macOS, Windows or Ubuntu;
- An installed Python 3(.11+) distribution;
- Familiarity with NumPy basics from episode #11;
- The ambition to learn Python programming.
Difficulty
- Intermediate, advanced
Curriculum (of the Learn Python Series):
- Learn Python Series - Intro
- Learn Python Series (#2) - Handling Strings Part 1
- Learn Python Series (#3) - Handling Strings Part 2
- Learn Python Series (#4) - Round-Up #1
- Learn Python Series (#5) - Handling Lists Part 1
- Learn Python Series (#6) - Handling Lists Part 2
- Learn Python Series (#7) - Handling Dictionaries
- Learn Python Series (#8) - Handling Tuples
- Learn Python Series (#9) - Using Import
- Learn Python Series (#10) - Matplotlib Part 1
- Learn Python Series (#11) - NumPy Part 1
- Learn Python Series (#12) - Handling Files
- Learn Python Series (#13) - Mini Project - Developing a Web Crawler Part 1
- Learn Python Series (#14) - Mini Project - Developing a Web Crawler Part 2
- Learn Python Series (#15) - Handling JSON
- Learn Python Series (#16) - Mini Project - Developing a Web Crawler Part 3
- Learn Python Series (#17) - Roundup #2 - Combining and analyzing any-to-any multi-currency historical data
- Learn Python Series (#18) - PyMongo Part 1
- Learn Python Series (#19) - PyMongo Part 2
- Learn Python Series (#20) - PyMongo Part 3
- Learn Python Series (#21) - Handling Dates and Time Part 1
- Learn Python Series (#22) - Handling Dates and Time Part 2
- Learn Python Series (#23) - Handling Regular Expressions Part 1
- Learn Python Series (#24) - Handling Regular Expressions Part 2
- Learn Python Series (#25) - Handling Regular Expressions Part 3
- Learn Python Series (#26) - pipenv & Visual Studio Code
- Learn Python Series (#27) - Handling Strings Part 3 (F-Strings)
- Learn Python Series (#28) - Using Pickle and Shelve
- Learn Python Series (#29) - Handling CSV
- Learn Python Series (#30) - Data Science Part 1 - Pandas
- Learn Python Series (#31) - Data Science Part 2 - Pandas
- Learn Python Series (#32) - Data Science Part 3 - Pandas
- Learn Python Series (#33) - Data Science Part 4 - Pandas
- Learn Python Series (#34) - Working with APIs in 2026: What's Changed
- Learn Python Series (#35) - Working with APIs Part 2: Beyond GET Requests
- Learn Python Series (#36) - Type Hints and Modern Python
- Learn Python Series (#37) - Virtual Environments and Dependency Management
- Learn Python Series (#38) - Testing Your Code Part 1
- Learn Python Series (#39) - Testing Your Code Part 2
- Learn Python Series (#40) - Asynchronous Python Part 1
- Learn Python Series (#41) - Asynchronous Python Part 2
- Learn Python Series (#42) - Building CLI Applications
- Learn Python Series (#43) - Mini Project - Crypto Price Tracker
- Learn Python Series (#44) - Context Managers & Decorators Deep Dive
- Learn Python Series (#45) - Metaclasses & Class Design Patterns
- Learn Python Series (#46) - Descriptors & Properties
- Learn Python Series (#47) - Generators & Iterators Advanced
- Learn Python Series (#48) - Concurrency - Threading vs Multiprocessing
- Learn Python Series (#49) - FastAPI Basics - Modern Web APIs
- Learn Python Series (#50) - FastAPI Advanced - Validation & Dependencies
- Learn Python Series (#51) - Database Integration - SQLAlchemy
- Learn Python Series (#52) - Authentication & Security
- Learn Python Series (#53) - Deployment & Production Best Practices
- Learn Python Series (#54) - Pandas in 2026: What's Changed
- Learn Python Series (#55) - Advanced Data Wrangling with Pandas
- Learn Python Series (#56) - NumPy Beyond the Basics (this post)
GitHub Account
Learn Python Series (#56) - NumPy Beyond the Basics
In episode #11 we covered NumPy fundamentals - creating arrays, basic operations, reshaping, and why NumPy is fast compared to Python lists. That was the "what." This episode is the "why" and "how" at a deeper level.
Understanding memory layout, broadcasting rules, and vectorization patterns is what separates using NumPy from understanding it. And understanding it is what you need before stepping into machine learning, where NumPy is the foundation everything else sits on. Every major ML library - scikit-learn, PyTorch, TensorFlow - operates on array-like structures that follow NumPy's conventions. The mental models you build here transfer directly.
Nota bene: This episode goes deeper than typical "NumPy tutorial" material. We're getting into how arrays work at the memory level, because that's where performance intuition comes from. If you know why certain operations are fast and others aren't, you stop guessing and start writing efficient code on the first try.
Memory Layout - Why C-Order vs Fortran-Order Matters
A 2D NumPy array looks like a grid on screen, but in memory it's a flat sequence of bytes. The question is: which dimension is contiguous?
import numpy as np
# C-order (row-major): rows are contiguous in memory
c_array = np.array([[1, 2, 3], [4, 5, 6]], order='C')
# Memory layout: [1, 2, 3, 4, 5, 6]
# Row 0 occupies elements 0-2, row 1 elements 3-5 - rows are contiguous
# Fortran-order (column-major): columns are contiguous
f_array = np.array([[1, 2, 3], [4, 5, 6]], order='F')
# Memory layout: [1, 4, 2, 5, 3, 6]
# Column 0 occupies elements 0-1, column 1 elements 2-3 - columns are contiguous
Why does this matter? Because of how CPU caches work. When you access one byte of memory, the CPU loads an entire cache line (typically 64 bytes - that's 8 float64 values) into L1 cache. If your next access is adjacent in memory, it's already in cache - a "cache hit," which is ~100x faster than fetching from main RAM.
big = np.random.rand(10000, 10000) # ~800 MB
# Fast: summing along axis=1 (along rows - contiguous in C-order)
row_sums = big.sum(axis=1) # Accesses memory sequentially
# Slower: summing along axis=0 (along columns - strided access)
col_sums = big.sum(axis=0) # Jumps 80,000 bytes between elements
The difference can be 2-5x for large arrays. NumPy defaults to C-order, so row-wise operations are naturally fast. If your algorithm is inherently column-oriented (common in statistics and linear algebra), you can create arrays in F-order - or just be aware that column operations will be slower and plan accordingly.
Checking and controlling layout:
print(big.flags)
# C_CONTIGUOUS : True
# F_CONTIGUOUS : False
# Strides tell you the byte-jump per dimension
print(big.strides)
# (80000, 8) - 80000 bytes to next row, 8 bytes to next column
The strides tuple is the key to understanding performance. (80000, 8) means moving to the next row jumps 80,000 bytes (10,000 elements × 8 bytes each), while moving to the next column jumps just 8 bytes. Operations along the small-stride dimension are fast; operations along the large-stride dimension generate cache misses.
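To see this concretely, here is a small sketch (array size chosen arbitrarily) comparing the strides of the same data in both orders:

```python
import numpy as np

# Same values, two layouts - only the strides (byte-jumps) differ.
c = np.zeros((1_000, 1_000), order='C')   # row-major
f = np.asfortranarray(c)                  # column-major copy of the same data

print(c.strides)   # (8000, 8): next row is 8000 bytes away, next column 8
print(f.strides)   # (8, 8000): the pattern is flipped

# The cache-friendly axis flips with the layout: in C-order, sum(axis=1)
# walks memory sequentially; in F-order, sum(axis=0) does.
row_sums_fast = c.sum(axis=1)
col_sums_fast = f.sum(axis=0)
```

Checking strides like this is usually faster than benchmarking: whichever axis has the 8-byte stride is the one your hot loop should run along.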
Views vs Copies - Know What You Have
NumPy aggressively avoids copying data. Many operations return views - arrays that share memory with the original:
a = np.array([1, 2, 3, 4, 5])
# These create VIEWS (shared memory):
b = a[1:4] # Slice
c = a.reshape(5, 1) # Reshape (same total size)
d = a.T # Transpose (also a view; note that on a 1D array this is a no-op)
# These create COPIES (independent memory):
e = a[np.array([0, 2, 4])] # Fancy indexing (arbitrary index array)
f = a[[0, 2, 4]] # Also fancy indexing
g = a.copy() # Explicit copy
Modifying a view modifies the original - and this is by design, not a bug:
b = a[1:4]
b[0] = 99
print(a) # [1, 99, 3, 4, 5] - a was modified!
This is the opposite of Pandas' copy-on-write behavior (which we covered in episode #54). NumPy gives you shared memory for performance - slicing a 1GB array creates a view in nanoseconds instead of copying gigabytes. But you must be aware of it.
How to check:
print(b.base is a) # True - b is a view of a
print(e.base is a) # False - e is a copy, independent
# Or check if two arrays share memory
print(np.shares_memory(a, b)) # True
print(np.shares_memory(a, e)) # False
The rule of thumb: basic indexing (slices, single integers, ...) returns views. Advanced indexing (arrays of indices, boolean arrays) returns copies. When in doubt, check np.shares_memory().
When transposing creates trouble:
big = np.random.rand(1000, 1000)
transposed = big.T # View - no data copied, just strides reversed
# But transposed is now F-contiguous, not C-contiguous
print(transposed.flags['C_CONTIGUOUS']) # False
# If you need C-contiguous for performance
contiguous = np.ascontiguousarray(transposed) # Now a copy, C-order
Advanced Broadcasting - The Full Rules
Episode #11 covered the basics: adding a scalar to an array, combining 1D and 2D arrays. The full broadcasting rules handle arbitrary dimensions:
The three rules (applied from the trailing dimensions backward):
- If arrays have different numbers of dimensions, pad the smaller shape with 1s on the left
- Dimensions of size 1 are stretched to match the other array's dimension
- If dimensions are neither equal nor 1 - it fails with a shape mismatch error
a = np.ones((3, 4)) # Shape: (3, 4)
b = np.array([1, 2, 3, 4]) # Shape: (4,) → padded to (1, 4) → stretched to (3, 4)
result = a + b # Works: (3, 4) + (4,) → (3, 4)
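As a worked example exercising all three rules at once (shapes chosen purely for illustration):

```python
import numpy as np

a = np.ones((2, 1, 3))   # shape (2, 1, 3)
b = np.ones((4, 1))      # shape (4, 1)

# Rule 1: pad b's shape with 1s on the left   → (1, 4, 1)
# Rule 2: stretch every size-1 dimension      → both become (2, 4, 3)
# Rule 3: no dimension pair is unequal-and-not-1, so it succeeds
result = a + b
print(result.shape)      # (2, 4, 3)
```

Aligning the shapes from the right and walking the rules by hand like this is exactly how to debug any broadcasting surprise.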
Outer Operations via Broadcasting
This is where broadcasting gets genuinely powerful. By adding dimensions with np.newaxis, you can compute outer products, distance matrices, and cross-tables without any loops:
x = np.array([1, 2, 3])[:, np.newaxis] # Shape: (3, 1)
y = np.array([10, 20, 30, 40]) # Shape: (4,)
# Multiplication table via broadcasting
table = x * y # Shape: (3, 1) × (4,) → (3, 4)
# [[ 10, 20, 30, 40],
# [ 20, 40, 60, 80],
# [ 30, 60, 90, 120]]
np.newaxis (which is just None) inserts a dimension of size 1 at the specified position. This is the universal tool for controlling which dimensions get broadcast against each other.
Distance Matrix - A Classic Pattern
Computing all pairwise distances between points is a common operation in data science (nearest neighbors, clustering, spatial analysis). Broadcasting makes it elegant:
# 1000 points in 3D space
points = np.random.rand(1000, 3)
# Pairwise distance matrix WITHOUT loops
# Step 1: compute pairwise differences
diff = points[:, np.newaxis, :] - points[np.newaxis, :, :]
# shapes: (1000, 1, 3) - (1, 1000, 3) → (1000, 1000, 3)
# Step 2: Euclidean distance
distances = np.sqrt((diff ** 2).sum(axis=2)) # Shape: (1000, 1000)
This computes all 1,000,000 pairwise distances in one vectorized operation. A Python loop doing the same thing would take several seconds; this takes tens of milliseconds. The memory cost is significant (a 1000×1000×3 intermediate array = ~24 MB), which is the tradeoff - broadcasting trades memory for speed.
For very large point sets where the intermediate array doesn't fit in memory, use scipy.spatial.distance.cdist instead - it computes the same result with a more memory-efficient implementation.
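If you want to stay in pure NumPy, a common memory-lean alternative is the quadratic expansion ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a·b, which only ever materializes (n, n) temporaries instead of (n, n, 3). A sketch (the pairwise_distances helper is ours, not a NumPy function):

```python
import numpy as np

def pairwise_distances(pts):
    """Pairwise Euclidean distances without the (n, n, d) intermediate."""
    sq = (pts ** 2).sum(axis=1)                       # shape (n,): squared norms
    d2 = sq[:, None] + sq[None, :] - 2 * pts @ pts.T  # shape (n, n)
    np.maximum(d2, 0, out=d2)                         # clamp tiny negative rounding
    return np.sqrt(d2)

points = np.random.rand(500, 3)
diff = points[:, None, :] - points[None, :, :]
reference = np.sqrt((diff ** 2).sum(axis=2))
print(np.allclose(pairwise_distances(points), reference, atol=1e-6))  # True
```

The expansion loses a little precision for nearly-coincident points (hence the clamp and the loose tolerance), which is one reason a dedicated routine like cdist is still preferable when available.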
When Broadcasting Fails
a = np.ones((3, 4))
b = np.ones((3, 5))
# a + b → ValueError: operands could not be broadcast together
# Trailing dimensions: 4 and 5 - neither is 1, neither equal → fails
The error message tells you the shapes. When you get a broadcasting error, print both shapes, align them from the right, and check which dimension doesn't match.
Vectorization Patterns - Replacing Loops
The core skill of NumPy programming: every time you write for element in array, ask "can I do this without a loop?"
Pattern 1: Boolean Masking
data = np.random.randn(1_000_000)
# Python loop: ~500ms
result_slow = [x for x in data if x > 0]
# NumPy boolean mask: ~5ms (100x faster)
result_fast = data[data > 0]
# Combining conditions (use & for AND, | for OR, ~ for NOT)
mask = (data > -1) & (data < 1) # Within 1 standard deviation
filtered = data[mask]
Boolean masks are also useful for conditional assignment:
# Set all negative values to zero
data[data < 0] = 0
# This is an in-place operation on the original array!
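For the clamping case specifically, np.clip does the same job without mutating the original array; a quick sketch:

```python
import numpy as np

data = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])

# Equivalent to data[data < 0] = 0, but returns a new array:
clipped = np.clip(data, 0, None)   # None means "no upper bound"
print(clipped)                     # [0.  0.  0.  1.5 3. ]

# np.clip can bound both ends at once:
bounded = np.clip(data, -1, 1)     # [-1.  -0.5  0.   1.   1. ]
```

Choose the in-place mask assignment when you want to avoid allocating a new array, and np.clip when you need to keep the original intact.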
Pattern 2: np.where for Conditional Selection
prices = np.array([100, -5, 200, -10, 300])
# Clip negative values to zero
clipped = np.where(prices > 0, prices, 0)
# [100, 0, 200, 0, 300]
# Conditional based on another array
signals = np.where(prices > 150, 'buy', 'hold')
# ['hold', 'hold', 'buy', 'hold', 'buy']
Pattern 3: np.select for Multi-Branch Logic
When you need if/elif/elif/else:
values = np.random.randn(1000) * 100
conditions = [
values > 100,
values > 0,
values > -50,
]
choices = ['strong_buy', 'buy', 'hold']
labels = np.select(conditions, choices, default='sell')
np.select evaluates conditions in order and returns the first match - exactly like if/elif/else. For two-branch logic, use np.where. For three or more branches, use np.select.
Pattern 4: np.searchsorted for Binning
# Define bucket boundaries
boundaries = np.array([0, 100, 500, 1000, 5000, 50000])
prices = np.array([50, 250, 750, 3000, 45000, 80])
bins = np.searchsorted(boundaries, prices)
# [1, 2, 3, 4, 5, 1] - which bucket each price belongs to
np.searchsorted uses binary search, making it O(log n) per element. For assigning millions of data points to hundreds of bins, this is orders of magnitude faster than a loop with if/elif chains.
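One detail worth knowing: the side parameter decides where a value that lands exactly on a boundary goes. A small illustration:

```python
import numpy as np

boundaries = np.array([0, 100, 500, 1000])

# A value equal to a boundary is placed differently depending on `side`:
print(np.searchsorted(boundaries, 100))                # 1 (side='left', default)
print(np.searchsorted(boundaries, 100, side='right'))  # 2

# side='left'  inserts before equal elements; side='right' inserts after.
# Pick the one matching whether your buckets are half-open on the left or right.
```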
Pattern 5: np.einsum for Complex Tensor Operations
Einstein summation notation handles matrix operations concisely:
A = np.random.rand(3, 4)
B = np.random.rand(4, 5)
# Matrix multiplication (equivalent to A @ B)
C = np.einsum('ij,jk->ik', A, B)
# Batch matrix multiplication - 100 pairs of 3×4 and 4×5 matrices
batch_A = np.random.rand(100, 3, 4)
batch_B = np.random.rand(100, 4, 5)
batch_C = np.einsum('bij,bjk->bik', batch_A, batch_B)
# Diagonal extraction from a batch of matrices
diags = np.einsum('bii->bi', np.random.rand(100, 4, 4))
# Trace (sum of diagonal) for each matrix in a batch
traces = np.einsum('bii->b', np.random.rand(100, 4, 4))
The notation reads as: "for indices i,j in the first operand and j,k in the second, contract over the shared index j and output indices i,k." Repeated indices on the left that don't appear on the right are summed over. It takes a bit to internalize, but once you do, einsum replaces entire pages of loop-based tensor code.
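A good habit while internalizing the notation is to cross-check an einsum expression against a plain-NumPy equivalent, for example:

```python
import numpy as np

batch = np.random.rand(100, 4, 4)

# Batch trace via einsum vs. the dedicated NumPy function:
traces_einsum = np.einsum('bii->b', batch)
traces_plain = np.trace(batch, axis1=1, axis2=2)
print(np.allclose(traces_einsum, traces_plain))  # True

# Matrix multiplication via einsum vs. the @ operator:
A, B = np.random.rand(3, 4), np.random.rand(4, 5)
print(np.allclose(np.einsum('ij,jk->ik', A, B), A @ B))  # True
```

Once an expression passes a check like this on small inputs, you can trust it on the full-size tensors.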
Structured Arrays - Tables Without Pandas
When you need heterogeneous columnar data (mixed types per column) but Pandas is too heavy:
dt = np.dtype([
('name', 'U20'), # Unicode string, max 20 chars
('price', 'f8'), # 64-bit float
('volume', 'i8'), # 64-bit integer
('active', '?'), # Boolean
])
trades = np.array([
('BTC/USDT', 50000.0, 1000, True),
('ETH/USDT', 3000.0, 5000, True),
('SOL/USDT', 100.0, 20000, False),
], dtype=dt)
# Access by field name - returns a view (no copy!)
print(trades['price']) # [50000. 3000. 100.]
print(trades['name']) # ['BTC/USDT' 'ETH/USDT' 'SOL/USDT']
# Filter with boolean masks
expensive = trades[trades['price'] > 1000]
# Sort by field
sorted_trades = np.sort(trades, order='volume')
Structured arrays are useful for:
- Reading binary data files with mixed types (sensor logs, network packet captures)
- Interfacing with C libraries via ctypes or cffi
- Lightweight tabular data where Pandas import time or memory overhead isn't justified
- Intermediate representations when processing data that will eventually become a DataFrame
They're not a Pandas replacement - no groupby, no merge, no method chaining. But when you need a typed, contiguous, zero-overhead table, structured arrays are the tool.
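As a concrete example of the binary-data use case, structured arrays round-trip to raw binary files with tofile/fromfile - the same fixed-width record mechanism you would use for sensor logs or C structs (the file path below is just a temp-file sketch):

```python
import numpy as np
import os
import tempfile

dt = np.dtype([('name', 'U20'), ('price', 'f8'), ('volume', 'i8')])
trades = np.array([('BTC/USDT', 50000.0, 1000),
                   ('ETH/USDT', 3000.0, 5000)], dtype=dt)

# Write the raw records to disk, then read them straight back with the
# same dtype - no parsing, no serialization format, just fixed-width bytes.
path = os.path.join(tempfile.mkdtemp(), 'trades.bin')
trades.tofile(path)
loaded = np.fromfile(path, dtype=dt)
print(loaded['name'])   # ['BTC/USDT' 'ETH/USDT']
```

The catch is that the file carries no dtype metadata - the reader must know the record layout. For self-describing files, np.save/np.load (the .npy format) store the dtype alongside the data.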
NumPy's Role in the ML Ecosystem
Understanding NumPy arrays is understanding the lingua franca of Python data science:
# scikit-learn expects NumPy arrays (or array-likes)
from sklearn.linear_model import LinearRegression
X = np.array([[1], [2], [3], [4], [5]]) # Features: shape (5, 1)
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1]) # Targets: shape (5,)
model = LinearRegression().fit(X, y)
print(f"Slope: {model.coef_[0]:.2f}, Intercept: {model.intercept_:.2f}")
# Slope: 1.99, Intercept: 0.05
# PyTorch tensors convert seamlessly to/from NumPy
import torch
numpy_array = np.array([1.0, 2.0, 3.0])
tensor = torch.from_numpy(numpy_array) # Shared memory!
tensor[0] = 99.0
print(numpy_array) # [99. 2. 3.] - shared memory, like NumPy views
The array protocol (__array_interface__ and the newer __dlpack__ interface) means any library that follows NumPy conventions can interoperate with zero-copy data sharing. This is why the entire Python data science ecosystem works together seamlessly - Pandas DataFrames, scikit-learn models, PyTorch tensors, TensorFlow operations all speak the same array language.
The mental model: NumPy arrays are the universal data format. Pandas DataFrames contain NumPy arrays (or Arrow arrays) per column. Scikit-learn expects features as 2D arrays and targets as 1D arrays. PyTorch converts to/from NumPy with shared memory. Understanding shapes, strides, broadcasting, and vectorized operations makes every library in the ecosystem more intuitive.
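To illustrate the protocol idea, any object that implements __array__ can be consumed by NumPy functions directly - which is also how many third-party containers plug into the ecosystem. A minimal sketch with a hypothetical Portfolio class:

```python
import numpy as np

class Portfolio:
    """Toy container: exposing __array__ lets NumPy (and libraries that
    call np.asarray on their inputs) treat it as an array."""
    def __init__(self, weights):
        self._weights = list(weights)

    def __array__(self, dtype=None, copy=None):
        # NumPy calls this to obtain an ndarray representation.
        return np.array(self._weights, dtype=dtype)

p = Portfolio([0.5, 0.25, 0.25])
print(np.asarray(p))   # [0.5  0.25 0.25]
print(np.sum(p))       # 1.0 - NumPy functions accept the object directly
```

This is the simplest of the interop hooks; __array_interface__ and __dlpack__ go further by allowing zero-copy sharing of the underlying buffer.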
Performance Tips Summary
| Situation | Do this | Not this |
|---|---|---|
| Element-wise operation | Vectorized ops (a + b) | Python loop |
| Conditional selection | Boolean mask (a[a > 0]) | List comprehension |
| Conditional assignment | np.where() / np.select() | Loop with if/else |
| Multi-branch logic | np.select() | Nested np.where() |
| Iterate over rows | Rethink - reshape and vectorize | for row in array |
| Need contiguous data | np.ascontiguousarray() | Ignore layout, wonder why it's slow |
| Complex tensor ops | np.einsum() | Nested loops |
| Binning/bucketing | np.searchsorted() | Loop with if/elif |
| Large pairwise computations | scipy.spatial.distance | Broadcasting with huge intermediates |
What to remember from this one
In this episode, we went deeper into NumPy than the basics:
- Memory layout (C-order vs F-order) directly impacts cache performance - row operations are fast by default because NumPy uses C-order, and strides tell you exactly how many bytes each dimension step costs
- Views share memory with the original array (slicing, reshaping, transposing); copies are independent (fancy indexing, explicit .copy()) - check with np.shares_memory()
- Broadcasting rules: pad shapes with 1s on the left, stretch dimensions of size 1, fail if dimensions are neither equal nor 1
- np.newaxis is the key to outer operations - it adds a dimension of size 1 for broadcasting control
- Vectorization patterns replace loops: boolean masks for filtering, np.where/np.select for conditionals, np.searchsorted for binning, np.einsum for tensor operations
- Structured arrays provide typed, contiguous tabular data without Pandas overhead - useful for binary data, C interop, and lightweight tables
- NumPy is the universal array format - Pandas, scikit-learn, PyTorch, and TensorFlow all interoperate through NumPy's array protocol
This is the foundation that everything in data science and machine learning builds upon. Shapes, broadcasting, vectorized operations, memory layout - these concepts reappear in every library, every framework, every model. With Pandas and NumPy solid, you're equipped for whatever comes next.
Thanks for your contribution to the STEMsocial community. Feel free to join us on discord to get to know the rest of us!
Please consider delegating to the @stemsocial account (85% of the curation rewards are returned).
Consider setting @stemsocial as a beneficiary of this post's rewards if you would like to support the community and contribute to its mission of promoting science and education on Hive.
Thank you STEMsocial!!
<3 <3