Introduction: Why NumPy Matters
If you're entering the world of data science, machine learning, or scientific computing with Python, NumPy is your essential starting point. NumPy (Numerical Python) provides the foundation upon which the entire Python data science ecosystem is built—from Pandas for data manipulation to scikit-learn for machine learning.
Key Insight: NumPy isn't just another library—it's the backbone of scientific Python. Understanding NumPy is critical because Pandas, SciPy, scikit-learn, TensorFlow, and PyTorch all depend on its efficient array operations and mathematical capabilities.
NumPy excels at handling large, multi-dimensional arrays and matrices, performing mathematical operations at speeds comparable to compiled languages like C and Fortran. This performance comes from its implementation in C and intelligent memory management, making Python viable for computationally intensive scientific work.
The Evolution of NumPy
Historical Context
NumPy's story begins in the mid-1990s when Python was emerging as a scientific computing platform:
- 1995: Jim Hugunin creates Numeric, the first array package for Python
- 2001: Numarray emerges to address Numeric's limitations with large arrays
- 2005: Travis Oliphant unifies Numeric and Numarray into NumPy, combining the best of both
- 2006-Present: NumPy becomes the de facto standard for numerical computing in Python
Historical Milestone
Why NumPy Succeeded
NumPy succeeded where predecessors struggled by solving three critical problems:
- Performance: C-based implementation with optimized algorithms
- Memory efficiency: Contiguous memory allocation and views instead of copies
- Unified API: Single, consistent interface replacing fragmented tools
This combination made NumPy 10-100x faster than pure Python for numerical operations while maintaining Python's ease of use.
Why It Matters Today
NumPy's importance has only grown with the data science revolution:
- Foundation for ML: Deep learning frameworks like TensorFlow and PyTorch use NumPy-compatible tensors
- Data pipelines: Pandas DataFrames are built on NumPy arrays underneath
- Scientific computing: SciPy extends NumPy for advanced scientific functions
- Industry standard: Over 1,000 packages depend on NumPy in the Python ecosystem
NumPy Fundamentals
Installation and Setup
Installing NumPy is straightforward with pip:
# Install NumPy
pip install numpy
# Verify installation
python -c "import numpy as np; print(f'NumPy version: {np.__version__}')"
The ndarray: NumPy's Core Data Structure
The ndarray (N-dimensional array) is NumPy's fundamental object. Unlike Python lists, ndarrays:
- Store elements of the same type (homogeneous)
- Use contiguous memory for fast access
- Support vectorized operations without explicit loops
- Enable broadcasting for efficient element-wise operations
import numpy as np
# Creating arrays from Python lists
arr_1d = np.array([1, 2, 3, 4, 5])
arr_2d = np.array([[1, 2, 3], [4, 5, 6]])
print("1D array:", arr_1d)
print("Shape:", arr_1d.shape) # (5,)
print("2D array:\n", arr_2d)
print("Shape:", arr_2d.shape) # (2, 3)
Pro Tip: The shape attribute is crucial—it tells you the dimensions of your array. A shape of (2, 3) means 2 rows and 3 columns. This becomes critical when debugging matrix operations.
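For instance, a shape mismatch surfaces immediately in element-wise operations, and printing .shape is usually the quickest diagnosis. A minimal sketch (the arrays here are just illustrative):
import numpy as np
arr_2d = np.array([[1, 2, 3], [4, 5, 6]])   # shape (2, 3)
bad_vec = np.array([1, 2])                  # shape (2,) does not match the 3 columns
try:
    arr_2d + bad_vec                        # incompatible shapes raise a ValueError
except ValueError as err:
    print("Shape error:", err)
print(arr_2d.shape, bad_vec.shape)          # checking shapes reveals the mismatch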
Working with Arrays
Array Creation Methods
NumPy provides numerous ways to create arrays beyond converting lists:
import numpy as np
# Zeros, ones, and empty arrays
zeros = np.zeros((3, 4)) # 3x4 array of zeros
ones = np.ones((2, 3)) # 2x3 array of ones
empty = np.empty((2, 2)) # Uninitialized 2x2 array
print("Zeros:\n", zeros)
print("Ones:\n", ones)
import numpy as np
# Ranges and sequences
range_arr = np.arange(0, 10, 2) # [0, 2, 4, 6, 8]
linspace = np.linspace(0, 1, 5) # 5 evenly spaced values: [0, 0.25, 0.5, 0.75, 1]
print("Range:", range_arr)
print("Linspace:", linspace)
import numpy as np
# Identity and diagonal matrices
identity = np.eye(3) # 3x3 identity matrix
diagonal = np.diag([1, 2, 3, 4]) # Diagonal matrix
print("Identity:\n", identity)
print("Diagonal:\n", diagonal)
Array Attributes
Understanding array properties is essential for effective NumPy usage:
import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6]])
print("Shape:", arr.shape) # (2, 3) - dimensions
print("Size:", arr.size) # 6 - total elements
print("Dtype:", arr.dtype) # int64 - data type
print("Ndim:", arr.ndim) # 2 - number of dimensions
print("Itemsize:", arr.itemsize) # 8 - bytes per element
Data Types Matter
Memory Efficiency with dtypes
Choosing the right data type can dramatically reduce memory usage:
import numpy as np
# Same data, different memory footprints
arr_int64 = np.array([1, 2, 3], dtype=np.int64)
arr_int8 = np.array([1, 2, 3], dtype=np.int8)
print(f"int64 uses: {arr_int64.nbytes} bytes") # 24 bytes
print(f"int8 uses: {arr_int8.nbytes} bytes") # 3 bytes
Result: Using int8 instead of int64 reduces memory by 8x when values fit in 8 bits. For large datasets, this saves gigabytes.
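To see the effect at scale, here is a small sketch with ten million simulated sensor readings (the data itself is made up for illustration):
import numpy as np
# Ten million small integers: they fit comfortably in int8
readings = np.random.randint(0, 100, size=10_000_000)
as_int64 = readings.astype(np.int64)
as_int8 = readings.astype(np.int8)
print(f"int64: {as_int64.nbytes / 1e6:.0f} MB")  # ~80 MB
print(f"int8:  {as_int8.nbytes / 1e6:.0f} MB")   # ~10 MB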
Practice Exercises
Arrays Section Exercises
Exercise 1 (Beginner): Create a 3x4 array with values from 1-12, then print its shape, size, and dtype. Modify the array to use float32 data type.
Exercise 2 (Beginner): Create three arrays: zeros(5), ones(5), and arange(5). Print their shapes and data types.
Exercise 3 (Intermediate): Create arrays using linspace (5 values from 0-1), eye (3x3), and diag([1,2,3]). Explain what each creates.
Exercise 4 (Intermediate): Create a 4x4 array using arange and reshape. Check its attributes: shape, size, dtype, ndim, itemsize. Convert dtype to float32 and verify memory reduction.
Challenge (Advanced): Create arrays with different dtypes (int8, int16, int32, float32) using arange() and astype(). Calculate memory usage (nbytes) for each. Verify which dtype saves the most memory.
Array Operations & Vectorization
Vectorization: The NumPy Advantage
Vectorization means operations apply to entire arrays without explicit Python loops. This is NumPy's superpower:
import numpy as np
import time
# Python list approach (slow)
python_list = list(range(1000000))
start = time.time()
result_list = [x * 2 for x in python_list]
list_time = time.time() - start
print(f"Python list time: {list_time*1000:.2f}ms")
import numpy as np
import time
# NumPy vectorized approach (fast)
numpy_arr = np.arange(1000000)
start = time.time()
result_arr = numpy_arr * 2
numpy_time = time.time() - start
print(f"NumPy time: {numpy_time*1000:.2f}ms")
print(f"Speedup: {list_time/numpy_time:.1f}x faster!")
Element-wise Operations
NumPy supports intuitive mathematical operations:
import numpy as np
a = np.array([1, 2, 3, 4])
b = np.array([10, 20, 30, 40])
# Arithmetic operations
print("Addition:", a + b) # [11, 22, 33, 44]
print("Multiplication:", a * b) # [10, 40, 90, 160]
print("Power:", a ** 2) # [1, 4, 9, 16]
# Universal functions (ufuncs)
print("Square root:", np.sqrt(a)) # [1.0, 1.414..., 1.732..., 2.0]
print("Exponential:", np.exp(a)) # [2.718..., 7.389..., ...]
print("Sine:", np.sin(a)) # [0.841..., 0.909..., ...]
Aggregation Functions
import numpy as np
data = np.array([[1, 2, 3], [4, 5, 6]])
print("Sum all:", data.sum()) # 21
print("Sum by column:", data.sum(axis=0)) # [5, 7, 9]
print("Sum by row:", data.sum(axis=1)) # [6, 15]
print("Mean:", data.mean()) # 3.5
print("Standard dev:", data.std()) # ~1.707
Axis Confusion Alert: axis=0 means "down the rows" (column-wise), axis=1 means "across columns" (row-wise). Think of it as which axis to collapse.
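One quick way to internalize this is to check which dimension disappears from the shape after aggregating. A short sketch reusing the array above:
import numpy as np
data = np.array([[1, 2, 3], [4, 5, 6]])   # shape (2, 3)
print(data.sum(axis=0).shape)  # (3,) - axis 0 (rows) was collapsed
print(data.sum(axis=1).shape)  # (2,) - axis 1 (columns) was collapsed
print(data.sum(axis=0))        # [5, 7, 9]
print(data.sum(axis=1))        # [6, 15]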
Indexing, Slicing & Boolean Masking
Basic Indexing and Slicing
import numpy as np
arr = np.array([10, 20, 30, 40, 50])
# Indexing (like Python lists)
print("First element:", arr[0]) # 10
print("Last element:", arr[-1]) # 50
# Slicing [start:stop:step]
print("Slice [1:4]:", arr[1:4]) # [20, 30, 40]
print("Every other:", arr[::2]) # [10, 30, 50]
print("Reversed:", arr[::-1]) # [50, 40, 30, 20, 10]
Multi-dimensional Indexing
import numpy as np
matrix = np.array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
# Indexing with [row, col]
print("Element [1, 2]:", matrix[1, 2]) # 6
print("Row 0:", matrix[0, :]) # [1, 2, 3]
print("Column 1:", matrix[:, 1]) # [2, 5, 8]
print("Submatrix:\n", matrix[0:2, 1:3]) # [[2, 3], [5, 6]]
Boolean Masking: Powerful Filtering
Boolean indexing lets you filter arrays based on conditions:
import numpy as np
scores = np.array([85, 92, 78, 95, 88, 73])
# Create boolean mask
high_scores_mask = scores > 85
print("Mask:", high_scores_mask) # [False, True, False, True, True, False]
# Apply mask to filter
print("High scores:", scores[high_scores_mask]) # [92, 95, 88]
# Combine conditions with & (and) | (or)
elite = scores[(scores > 85) & (scores < 95)]
print("Elite scores:", elite) # [92, 88]
Real-World Example
Data Cleaning with Boolean Masking
Boolean masking is essential for data cleaning:
import numpy as np
temperatures = np.array([22.5, 23.1, -999, 24.3, -999, 22.8])
# Remove sensor error readings (-999)
valid_temps = temperatures[temperatures != -999]
print("Valid readings:", valid_temps) # [22.5, 23.1, 24.3, 22.8]
# Replace outliers with mean
mean_temp = valid_temps.mean()
temperatures[temperatures == -999] = mean_temp
print("Cleaned:", temperatures)
Practice Exercises
Indexing & Filtering Exercises
Exercise 1 (Beginner): Create a 1D array [10, 20, 30, 40, 50]. Practice indexing: first element, last element, elements at indices 1 and 3.
Exercise 2 (Beginner): Create a 2D array [[1,2,3],[4,5,6],[7,8,9]]. Extract row 1, column 2, and the submatrix from rows 0-1, columns 1-2.
Exercise 3 (Intermediate): Create an array of scores [85, 92, 78, 95, 88, 73]. Use boolean masking to find all scores > 85. Use & operator to find scores between 80 and 90.
Exercise 4 (Intermediate): Create an array with sensor values including -999 (errors). Use boolean masking to filter out errors, calculate mean of valid data, and replace errors with the mean.
Challenge (Advanced): Create a 3x3 matrix. Use fancy indexing to select specific rows and columns. Create boolean masks for multiple conditions (e.g., values > 5 AND < 8). Verify mask combinations with & | ~ operators.
Broadcasting: NumPy's Superpower
Broadcasting is NumPy's ability to perform operations on arrays of different shapes without explicit loops or copying data. This makes code both faster and more readable.
Broadcasting Rules
NumPy compares shapes element-wise from right to left. Two dimensions are compatible when:
- They are equal, or
- One of them is 1
import numpy as np
# Example 1: Scalar broadcasting
arr = np.array([1, 2, 3, 4])
result = arr + 10 # Adds 10 to each element
print(result) # [11, 12, 13, 14]
import numpy as np
# Example 2: 1D to 2D broadcasting
matrix = np.array([[1, 2, 3],
[4, 5, 6]])
row_vec = np.array([10, 20, 30])
result = matrix + row_vec # Adds row_vec to each row
print(result)
# [[11, 22, 33],
# [14, 25, 36]]
Practical Broadcasting Examples
import numpy as np
# Normalize data (zero mean, unit variance)
data = np.array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
mean = data.mean(axis=0) # Column means: [4, 5, 6]
std = data.std(axis=0) # Column stds
normalized = (data - mean) / std
print("Normalized:\n", normalized)
Broadcasting Magic: Without broadcasting, you'd need nested loops. Broadcasting does this automatically in optimized C code, making it 100x faster while keeping code readable.
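The same rules cover column vectors and also tell you when an operation simply cannot broadcast. A small sketch, assuming the same 2x3 matrix as above:
import numpy as np
matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])                 # shape (2, 3)
col_vec = np.array([100, 200]).reshape(2, 1)   # shape (2, 1) broadcasts across columns
print(matrix + col_vec)
# [[101, 102, 103],
#  [204, 205, 206]]
try:
    matrix + np.array([1, 2])                  # (2, 3) vs (2,) satisfies neither rule
except ValueError as err:
    print("Broadcast error:", err)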
Practice Exercises
Operations & Broadcasting Exercises
Exercise 1 (Beginner): Create two arrays: a = [1,2,3,4] and b = [10,20,30,40]. Perform addition, subtraction, multiplication, and division. Print all results.
Exercise 2 (Beginner): Create a 2D array [[1,2],[3,4]]. Add 10 to every element. Multiply each element by 2. Print the result.
Exercise 3 (Intermediate): Given data = np.array([[10,20,30],[40,50,60]]), subtract the mean of each column from that column (normalization). Verify the column means are now ~0.
Exercise 4 (Intermediate): Create a matrix and row vector. Use broadcasting to add the row vector to each row. Then multiply each row by a column vector [1,2,3,...].
Challenge (Advanced): Implement z-score normalization: (x - mean) / std for data = np.arange(12).reshape(3,4) without using explicit loops. Verify mean is ~0 and std is ~1.
Array Reshaping & Manipulation
Reshaping arrays is essential for preparing data for different algorithms and transforming between data representations.
Reshaping Arrays
import numpy as np
# Create 1D array
arr = np.arange(12)
print("Original:", arr) # [0, 1, 2, ..., 11]
# Reshape to 2D
matrix = arr.reshape(3, 4)
print("3x4 matrix:\n", matrix)
# [[ 0 1 2 3]
# [ 4 5 6 7]
# [ 8 9 10 11]]
import numpy as np
arr = np.arange(12)
# Reshape to 3D
cube = arr.reshape(2, 3, 2)
print("2x3x2 cube:\n", cube)
# Automatic dimension with -1
auto = arr.reshape(4, -1) # NumPy calculates: (4, 3)
print("Auto reshape shape:", auto.shape)
print("Auto reshape:\n", auto)
Flattening Arrays
import numpy as np
matrix = np.array([[1, 2, 3], [4, 5, 6]])
# Flatten to 1D (returns copy)
flat = matrix.flatten()
print("Flatten:", flat) # [1, 2, 3, 4, 5, 6]
# Ravel (returns view when possible)
ravel = matrix.ravel()
print("Ravel:", ravel) # [1, 2, 3, 4, 5, 6]
# Ravel is faster but be careful with modifications!
ravel[0] = 999
print("Matrix changed:", matrix) # [[999, 2, 3], [4, 5, 6]]
Transpose and Swapping Axes
import numpy as np
# Simple transpose
A = np.array([[1, 2, 3], [4, 5, 6]])
print("Transpose:\n", A.T)
# [[1, 4]
# [2, 5]
# [3, 6]]
import numpy as np
# For 3D+ arrays, specify axes
cube = np.random.rand(2, 3, 4)
swapped = cube.transpose(2, 0, 1) # Swap axes: (2,3,4) -> (4,2,3)
print("Original shape:", cube.shape)
print("Swapped shape:", swapped.shape)
Adding and Removing Dimensions
import numpy as np
# Expand dimensions
arr = np.array([1, 2, 3])
print("Original shape:", arr.shape) # (3,)
# Add dimension at axis 0
expanded = np.expand_dims(arr, axis=0)
print("Expanded (axis=0):", expanded.shape) # (1, 3)
# Add dimension at axis 1
expanded = np.expand_dims(arr, axis=1)
print("Expanded (axis=1):", expanded.shape) # (3, 1)
import numpy as np
# Squeeze: remove single-dimensional entries
arr_with_singles = np.array([[[1, 2, 3]]])
print("Before squeeze:", arr_with_singles.shape) # (1, 1, 3)
squeezed = np.squeeze(arr_with_singles)
print("After squeeze:", squeezed.shape) # (3,)
import numpy as np
# Ensure minimum dimensions
x = 5 # scalar
print("atleast_1d:", np.atleast_1d(x).shape) # (1,)
print("atleast_2d:", np.atleast_2d(x).shape) # (1, 1)
print("atleast_3d:", np.atleast_3d(x).shape) # (1, 1, 1)
Tiling and Repeating
import numpy as np
# Tile: repeat entire array
arr = np.array([[1, 2], [3, 4]])
tiled = np.tile(arr, 2) # Repeat 2 times horizontally
print("Tiled 2x:\n", tiled)
# [[1, 2, 1, 2]
# [3, 4, 3, 4]]
import numpy as np
arr = np.array([[1, 2], [3, 4]])
tiled = np.tile(arr, (2, 3)) # Repeat 2x vertically, 3x horizontally
print("Tiled (2,3):\n", tiled)
import numpy as np
# Repeat: repeat each element
arr = np.array([1, 2, 3])
repeated = np.repeat(arr, 3)
print("Repeat each 3x:", repeated) # [1, 1, 1, 2, 2, 2, 3, 3, 3]
import numpy as np
# Repeat along axis
matrix = np.array([[1, 2], [3, 4]])
repeated = np.repeat(matrix, 2, axis=0)
print("Repeat rows:\n", repeated)
# [[1, 2]
# [1, 2]
# [3, 4]
# [3, 4]]
Creating Coordinate Grids
import numpy as np
# Meshgrid: create coordinate matrices for vectorized computations
x = np.array([1, 2, 3])
y = np.array([4, 5])
xx, yy = np.meshgrid(x, y)
print("X grid:\n", xx)
# [[1, 2, 3]
# [1, 2, 3]]
print("Y grid:\n", yy)
# [[4, 4, 4]
# [5, 5, 5]]
# Use for function evaluation
z = xx**2 + yy**2
print("f(x,y) = x² + y²:\n", z)
import numpy as np
# mgrid and ogrid for dense/sparse grids
x, y = np.mgrid[0:5, 0:5] # Dense grid
print("mgrid x shape:", x.shape) # (5, 5)
x, y = np.ogrid[0:5, 0:5] # Open (sparse) grid
print("ogrid x shape:", x.shape) # (5, 1) - memory efficient!
Rearranging Elements
import numpy as np
# Roll: circular shift
arr = np.array([1, 2, 3, 4, 5])
rolled = np.roll(arr, 2) # Shift right by 2
print("Rolled:", rolled) # [4, 5, 1, 2, 3]
import numpy as np
# Roll 2D arrays along axis
matrix = np.array([[1, 2, 3], [4, 5, 6]])
rolled = np.roll(matrix, 1, axis=1) # Shift columns
print("Rolled columns:\n", rolled)
# [[3, 1, 2]
# [6, 4, 5]]
import numpy as np
# Flip arrays
arr = np.array([1, 2, 3, 4, 5])
flipped = np.flip(arr)
print("Flipped:", flipped) # [5, 4, 3, 2, 1]
import numpy as np
matrix = np.array([[1, 2], [3, 4]])
print("Flip vertical:\n", np.flipud(matrix))
# [[3, 4]
# [1, 2]]
print("Flip horizontal:\n", np.fliplr(matrix))
# [[2, 1]
# [4, 3]]
import numpy as np
# Rotate 90 degrees
matrix = np.array([[1, 2], [3, 4]])
rotated = np.rot90(matrix)
print("Rotated 90°:\n", rotated)
# [[2, 4]
# [1, 3]]
Common Pattern
Reshaping for Machine Learning
ML algorithms often require specific input shapes. Here's how to prepare image data:
import numpy as np
# Convert 100 28x28 grayscale images to flat vectors
images = np.random.rand(100, 28, 28)
print("Original shape:", images.shape) # (100, 28, 28)
# Flatten each image: (100, 28, 28) -> (100, 784)
flat_images = images.reshape(100, -1)
print("Flattened shape:", flat_images.shape) # (100, 784)
# Add channel dimension for CNNs: (100, 28, 28) -> (100, 28, 28, 1)
cnn_images = images.reshape(100, 28, 28, 1)
print("CNN shape:", cnn_images.shape) # (100, 28, 28, 1)
Scientific Computing
Using Meshgrid for 2D Function Plotting
Meshgrid is essential for evaluating functions over 2D grids, commonly used in plotting and numerical methods:
import numpy as np
# Create 2D Gaussian function
x = np.linspace(-3, 3, 100)
y = np.linspace(-3, 3, 100)
xx, yy = np.meshgrid(x, y)
# Evaluate 2D Gaussian: f(x,y) = exp(-(x² + y²)/2)
z = np.exp(-(xx**2 + yy**2) / 2)
print("Grid shape:", xx.shape) # (100, 100)
print("Z values shape:", z.shape) # (100, 100)
print("Max value (at origin):", z.max()) # 1.0
# This z array can be directly plotted with matplotlib's contour or imshow
Combining Arrays: Stack & Concatenate
NumPy provides powerful functions to combine multiple arrays in various ways.
Concatenation
import numpy as np
a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6]])
# Concatenate along rows (axis 0)
vcat = np.concatenate([a, b], axis=0)
print("Vertical concat:\n", vcat)
# [[1, 2]
# [3, 4]
# [5, 6]]
# Concatenate along columns (axis 1)
c = np.array([[5], [6]])
hcat = np.concatenate([a, c], axis=1)
print("Horizontal concat:\n", hcat)
# [[1, 2, 5]
# [3, 4, 6]]
Stacking Functions
import numpy as np
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
# Vertical stack (row-wise)
vstack = np.vstack([a, b])
print("vstack:\n", vstack)
# [[1, 2, 3]
# [4, 5, 6]]
# Horizontal stack (column-wise)
hstack = np.hstack([a, b])
print("hstack:", hstack) # [1, 2, 3, 4, 5, 6]
# Depth stack (3D stacking)
dstack = np.dstack([a, b])
print("dstack shape:", dstack.shape) # (1, 3, 2)
Column and Row Stack
import numpy as np
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
# Column stack: 1D arrays -> 2D columns
col_stack = np.column_stack([a, b])
print("Column stack:\n", col_stack)
# [[1, 4]
# [2, 5]
# [3, 6]]
# Row stack (an alias for vstack; newer NumPy versions prefer np.vstack directly)
row_stack = np.row_stack([a, b])
print("Row stack:\n", row_stack)
Splitting Arrays
import numpy as np
# Split array into equal parts
arr = np.array([1, 2, 3, 4, 5, 6])
split_arrays = np.split(arr, 3) # Split into 3 equal parts
print("Split into 3:", [a for a in split_arrays])
# [array([1, 2]), array([3, 4]), array([5, 6])]
# Split at specific indices
split_arrays = np.split(arr, [2, 4]) # Split at indices 2 and 4
print("Split at indices:", [a for a in split_arrays])
# [array([1, 2]), array([3, 4]), array([5, 6])]
import numpy as np
# Horizontal and vertical splits
matrix = np.array([[1, 2, 3, 4],
[5, 6, 7, 8],
[9, 10, 11, 12]])
# Split horizontally (column-wise)
h_split = np.hsplit(matrix, 2)
print("Horizontal split:")
for arr in h_split:
    print(arr)
# [[ 1  2]
#  [ 5  6]
#  [ 9 10]]
# [[ 3  4]
#  [ 7  8]
#  [11 12]]
# Split vertically (row-wise)
v_split = np.vsplit(matrix, 3)
print("Vertical split:")
for arr in v_split:
    print(arr)   # [[1 2 3 4]], then [[5 6 7 8]], then [[9 10 11 12]]
Quick Reference:
- Stacking:
vstack (rows), hstack (columns), dstack (depth), column_stack (1D→columns)
- Splitting:
split (any axis), hsplit (columns), vsplit (rows), dsplit (depth)
Sorting, Searching & Set Operations
Sorting Arrays
import numpy as np
arr = np.array([3, 1, 4, 1, 5, 9, 2, 6])
# Sort array (returns new sorted array)
sorted_arr = np.sort(arr)
print("Sorted:", sorted_arr) # [1, 1, 2, 3, 4, 5, 6, 9]
# Argsort (returns indices that would sort the array)
indices = np.argsort(arr)
print("Argsort:", indices) # [1, 3, 6, 0, 2, 4, 7, 5]
print("Using indices:", arr[indices]) # Same as sorted_arr
import numpy as np
arr = np.array([3, 1, 4, 1, 5, 9, 2, 6])
# Sort in descending order
desc = np.sort(arr)[::-1]
print("Descending:", desc) # [9, 6, 5, 4, 3, 2, 1, 1]
Sorting 2D Arrays
import numpy as np
matrix = np.array([[3, 1, 4],
[1, 5, 9],
[2, 6, 5]])
# Sort along axis 0 (columns)
col_sorted = np.sort(matrix, axis=0)
print("Column sorted:\n", col_sorted)
# Sort along axis 1 (rows)
row_sorted = np.sort(matrix, axis=1)
print("Row sorted:\n", row_sorted)
Searching Arrays
import numpy as np
arr = np.array([1, 3, 5, 7, 9, 11])
# Binary search in sorted array
idx = np.searchsorted(arr, 6)
print("Insert position for 6:", idx) # 3 (between 5 and 7)
# Search for multiple values
indices = np.searchsorted(arr, [4, 8, 10])
print("Insert positions:", indices) # [2, 4, 5]
# Find where condition is true
greater_than_5 = np.where(arr > 5)
print("Indices > 5:", greater_than_5) # (array([3, 4, 5]),)
Set Operations
import numpy as np
# Unique values
arr_dup = np.array([1, 2, 2, 3, 3, 3, 4])
unique = np.unique(arr_dup)
print("Unique:", unique) # [1, 2, 3, 4]
# Unique with counts
unique, counts = np.unique(arr_dup, return_counts=True)
print("Counts:", dict(zip(unique, counts))) # {1: 1, 2: 2, 3: 3, 4: 1}
import numpy as np
a = np.array([1, 2, 3, 4, 5])
b = np.array([4, 5, 6, 7, 8])
# Set operations
intersection = np.intersect1d(a, b)
print("Intersection:", intersection) # [4, 5]
union = np.union1d(a, b)
print("Union:", union) # [1, 2, 3, 4, 5, 6, 7, 8]
difference = np.setdiff1d(a, b)
print("Difference (a-b):", difference) # [1, 2, 3]
Practical Example
Finding Top K Elements
import numpy as np
# Find top 3 scores and their indices
scores = np.array([85, 92, 78, 95, 88, 73, 99, 82])
# Use argpartition for efficient partial sorting
k = 3
top_k_indices = np.argpartition(scores, -k)[-k:]
top_k_scores = scores[top_k_indices]
# Sort the top K in descending order
sorted_idx = top_k_indices[np.argsort(top_k_scores)][::-1]
print("Top 3 scores:", scores[sorted_idx]) # [99, 95, 92]
print("Top 3 indices:", sorted_idx) # [6, 3, 1]
Statistical Functions
Descriptive Statistics
import numpy as np
data = np.array([85, 92, 78, 95, 88, 73, 99, 82])
# Basic statistics
print("Mean:", data.mean()) # 86.5
print("Median:", np.median(data)) # 86.5
print("Std Dev:", data.std()) # 7.94
print("Variance:", data.var()) # 63.0
print("Min:", data.min()) # 73
print("Max:", data.max()) # 99
print("Range:", data.ptp()) # 26 (peak-to-peak)
Percentiles and Quantiles
import numpy as np
data = np.array([85, 92, 78, 95, 88, 73, 99, 82])
# Percentiles
p25 = np.percentile(data, 25)
p50 = np.percentile(data, 50) # Same as median
p75 = np.percentile(data, 75)
print(f"25th: {p25}, 50th: {p50}, 75th: {p75}")
# Quantiles (same as percentiles, but expressed on a 0-1 scale)
q = np.quantile(data, [0.25, 0.5, 0.75])
print("Quantiles:", q)
Correlation and Covariance
import numpy as np
# Two variables
hours_studied = np.array([2, 3, 4, 5, 6, 7, 8])
test_scores = np.array([65, 70, 75, 80, 85, 90, 95])
# Correlation coefficient
correlation = np.corrcoef(hours_studied, test_scores)
print("Correlation matrix:\n", correlation)
print("Correlation:", correlation[0, 1]) # ~1.0 (perfect positive)
# Covariance
covariance = np.cov(hours_studied, test_scores)
print("Covariance matrix:\n", covariance)
Histograms
import numpy as np
# Create histogram data
data = np.random.randn(1000)
counts, bins = np.histogram(data, bins=10)
print("Counts:", counts)
print("Bin edges:", bins)
# Histogram with custom bins
counts, bins = np.histogram(data, bins=[-3, -2, -1, 0, 1, 2, 3])
print("Custom bins counts:", counts)
Statistical Functions Summary:
- mean(), median(), std(), var(): central tendency and spread
- min(), max(), ptp(): range statistics
- percentile(), quantile(): distribution percentiles
- corrcoef(), cov(): relationships between variables
- histogram(): frequency distributions
Saving & Loading Arrays
NumPy provides efficient binary formats for saving and loading arrays, preserving data types and shapes.
Single Array: .npy Format
import numpy as np
# Save single array
data = np.array([[1, 2, 3], [4, 5, 6]])
np.save('my_array.npy', data)
# Load array
loaded = np.load('my_array.npy')
print("Loaded:\n", loaded)
print("Same array:", np.array_equal(data, loaded)) # True
Multiple Arrays: .npz Format
import numpy as np
# Save multiple arrays
x = np.array([1, 2, 3])
y = np.array([4, 5, 6])
z = np.array([[7, 8], [9, 10]])
np.savez('multiple_arrays.npz', x=x, y=y, z=z)
# Load multiple arrays
loaded = np.load('multiple_arrays.npz')
print("Available arrays:", loaded.files) # ['x', 'y', 'z']
print("X:", loaded['x'])
print("Y:", loaded['y'])
print("Z:", loaded['z'])
Compressed Format: .npz
import numpy as np
# Save compressed (smaller file size)
large_array = np.random.rand(1000, 1000)
np.savez_compressed('compressed.npz', data=large_array)
# Load compressed (transparent decompression)
loaded = np.load('compressed.npz')
print("Shape:", loaded['data'].shape)
Text Files: CSV-like
import numpy as np
# Save to text file
data = np.array([[1, 2, 3], [4, 5, 6]])
np.savetxt('data.txt', data, delimiter=',', fmt='%d')
# Load from text file
loaded = np.loadtxt('data.txt', delimiter=',')
print("Loaded from text:\n", loaded)
Performance Comparison
Binary vs Text Format
import numpy as np
import time
# Create large array
large = np.random.rand(10000, 100)
# Binary save (.npy) - FAST
start = time.time()
np.save('binary.npy', large)
binary_time = time.time() - start
# Text save (.txt) - SLOW
start = time.time()
np.savetxt('text.txt', large)
text_time = time.time() - start
print(f"Binary save: {binary_time:.3f}s")
print(f"Text save: {text_time:.3f}s")
print(f"Binary is {text_time/binary_time:.1f}x faster")
Result: Binary format is typically 10-50x faster and produces much smaller files. Use .npy/.npz for NumPy-to-NumPy storage!
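For arrays too large to hold comfortably in memory, a .npy file can also be opened as a memory map so data is read lazily from disk. A small sketch, reusing the binary.npy file from the benchmark above:
import numpy as np
# Open the file as a memory map instead of loading everything into RAM
mm = np.load('binary.npy', mmap_mode='r')
print("Shape:", mm.shape)                 # metadata is available immediately
print("Slice mean:", mm[:5, :5].mean())   # only the accessed slice is read from disk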
Practice Exercises
Reshaping & Manipulation Exercises
Exercise 1 (Beginner): Create a 1D array with 24 elements. Reshape it to 2x3x4, then back to 6x4. Verify each shape transformation.
Exercise 2 (Beginner): Given a 3x4 array, flatten it to 1D using flatten() and reshape(). Are they the same? Why or why not?
Exercise 3 (Intermediate): Create a 4x4 array. Transpose it. Flip it horizontally (flip(axis=1)). Stack the original, transposed, and flipped versions along axis 0.
Exercise 4 (Intermediate): Create three 2D arrays of shape (3,3). Use hstack() and vstack() to combine them different ways. Compare results.
Challenge (Advanced): Create a 3D array (2,3,4). Reshape to 2D, apply some operations, then reshape back to 3D. Ensure data integrity throughout.
Linear Algebra Operations
NumPy's linalg module provides comprehensive linear algebra functionality essential for machine learning and scientific computing.
Matrix Operations
import numpy as np
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
# Matrix multiplication
C = A @ B # Python 3.5+ syntax
# or: C = np.dot(A, B)
print("A @ B:\n", C)
# Transpose
print("Transpose:\n", A.T)
# Determinant
det = np.linalg.det(A)
print("Determinant:", det) # -2.0
# Inverse (if exists)
A_inv = np.linalg.inv(A)
print("Inverse:\n", A_inv)
Eigenvalues and Eigenvectors
import numpy as np
matrix = np.array([[4, -2],
[1, 1]])
eigenvalues, eigenvectors = np.linalg.eig(matrix)
print("Eigenvalues:", eigenvalues)
print("Eigenvectors:\n", eigenvectors)
Machine Learning Application
Solving Linear Regression with Linear Algebra
Linear regression can be solved using the normal equation: θ = (X^T X)^(-1) X^T y
import numpy as np
# Generate synthetic data
X = np.array([[1, 1], [1, 2], [1, 3]]) # Features with bias term
y = np.array([2, 4, 6]) # Target values
# Normal equation: theta = (X^T X)^(-1) X^T y
theta = np.linalg.inv(X.T @ X) @ X.T @ y
print("Coefficients:", theta) # [0, 2] (intercept=0, slope=2)
This is closely related to how libraries like scikit-learn solve linear regression under the hood, although production implementations favor numerically stable least-squares solvers over an explicit matrix inverse.
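If you want that more stable route in NumPy itself, np.linalg.lstsq solves the same problem without forming an inverse. A sketch using the same data:
import numpy as np
X = np.array([[1, 1], [1, 2], [1, 3]])
y = np.array([2, 4, 6])
# Least-squares solve without an explicit inverse
theta, residuals, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)
print("Coefficients:", theta)  # ~[0, 2] (intercept=0, slope=2)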
Random Number Generation
NumPy's random module is crucial for simulations, data generation, and machine learning initialization.
Basic Random Generation
import numpy as np
# Set seed for reproducibility
np.random.seed(42)
# Random floats [0, 1)
uniform = np.random.rand(3, 3)
print("Uniform [0,1):\n", uniform)
# Random integers
integers = np.random.randint(0, 10, size=(2, 3))
print("Random integers:\n", integers)
# Normal distribution (mean=0, std=1)
normal = np.random.randn(1000)
print("Normal mean:", normal.mean()) # ~0
print("Normal std:", normal.std()) # ~1
Statistical Distributions
import numpy as np
# Different distributions
exponential = np.random.exponential(scale=2.0, size=1000)
poisson = np.random.poisson(lam=3, size=1000)
binomial = np.random.binomial(n=10, p=0.5, size=1000)
# Choice and shuffle
deck = np.arange(52)
np.random.shuffle(deck)
hand = np.random.choice(deck, size=5, replace=False)
Reproducibility: Always set np.random.seed() at the start of notebooks or scripts. This ensures random operations produce identical results across runs—critical for debugging and scientific reproducibility.
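Newer NumPy code typically uses the Generator API, which keeps the seed on a generator object instead of in global state. A short sketch of the equivalent calls:
import numpy as np
rng = np.random.default_rng(42)              # seeded generator, no global state
uniform = rng.random((3, 3))                 # floats in [0, 1)
integers = rng.integers(0, 10, size=(2, 3))  # random integers
normal = rng.normal(loc=0, scale=1, size=1000)
print(uniform.shape, integers.shape, round(normal.mean(), 2))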
Performance Optimization
Views vs Copies
Understanding when NumPy creates views (references) versus copies is critical for performance:
import numpy as np
original = np.array([1, 2, 3, 4, 5])
# Slicing creates a VIEW (shares memory)
view = original[1:4]
view[0] = 999
print("Original changed:", original) # [1, 999, 3, 4, 5]
# Fancy indexing creates a COPY
copy = original[[0, 2, 4]]
copy[0] = 777
print("Original unchanged:", original) # [1, 999, 3, 4, 5]
Memory Layout: C vs Fortran Order
import numpy as np
# C-contiguous (row-major, default)
c_array = np.array([[1, 2], [3, 4]], order='C')
# Fortran-contiguous (column-major)
f_array = np.array([[1, 2], [3, 4]], order='F')
# Check with flags
print("C-contiguous:", c_array.flags['C_CONTIGUOUS'])
print("F-contiguous:", f_array.flags['F_CONTIGUOUS'])
Performance Benchmark
Vectorization Speed Comparison
import numpy as np
import time
size = 1_000_000
a = np.random.rand(size)
b = np.random.rand(size)
# Python loop approach
start = time.time()
result = []
for i in range(size):
    result.append(a[i] + b[i])
python_time = time.time() - start
# NumPy vectorized approach
start = time.time()
result_np = a + b
numpy_time = time.time() - start
print(f"Python loop: {python_time:.4f}s")
print(f"NumPy vectorized: {numpy_time:.4f}s")
print(f"Speedup: {python_time/numpy_time:.1f}x")
Typical result: NumPy is 50-100x faster for element-wise operations!
Practical Applications
Image Processing
Images are just 3D arrays (height × width × channels):
import numpy as np
# Simulate a 100x100 RGB image
image = np.random.randint(0, 256, size=(100, 100, 3), dtype=np.uint8)
# Convert to grayscale (weighted average)
gray = np.dot(image[...,:3], [0.299, 0.587, 0.114])
# Flip horizontally
flipped = image[:, ::-1, :]
# Crop center
h, w = image.shape[:2]
cropped = image[h//4:3*h//4, w//4:3*w//4, :]
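A common pitfall with uint8 images is overflow when adjusting brightness: adding 50 directly can wrap values past 255 back around to small numbers. Clipping in a wider dtype avoids this. A sketch reusing the simulated image above:
import numpy as np
image = np.random.randint(0, 256, size=(100, 100, 3), dtype=np.uint8)
# Brighten by 50 without uint8 wrap-around
brighter = np.clip(image.astype(np.int16) + 50, 0, 255).astype(np.uint8)
print("Original max:", image.max())
print("Brightened max:", brighter.max())  # never exceeds 255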
Statistical Analysis
import numpy as np
# Generate sample data
data = np.random.normal(loc=100, scale=15, size=1000)
# Compute statistics
mean = data.mean()
median = np.median(data)
std = data.std()
percentiles = np.percentile(data, [25, 50, 75])
print(f"Mean: {mean:.2f}")
print(f"Median: {median:.2f}")
print(f"Std Dev: {std:.2f}")
print(f"IQR: {percentiles[2] - percentiles[0]:.2f}")
Time Series Analysis
import numpy as np
# Moving average
def moving_average(data, window):
return np.convolve(data, np.ones(window)/window, mode='valid')
prices = np.random.randn(100).cumsum() + 100
ma_5 = moving_average(prices, 5)
ma_20 = moving_average(prices, 20)
Next Steps in the Series
Congratulations! You now understand NumPy's core concepts and capabilities. You've mastered:
- ✅ Array creation and manipulation
- ✅ Vectorized operations and broadcasting
- ✅ Indexing, slicing, and boolean masking
- ✅ Linear algebra operations
- ✅ Performance optimization techniques
What's Next: In Part 2: Pandas for Data Analysis, you'll learn how Pandas builds on NumPy to provide powerful tools for working with real-world tabular data—DataFrames, data cleaning, transformation, and exploratory analysis.
NumPy is the foundation, but Pandas is where data science really comes alive. See you in the next article!
NumPy API Cheat Sheet
Quick reference for the most commonly used NumPy operations covered in this article.
Array creation:
np.array([1,2,3]) | Create from list
np.zeros((3,4)) | 3×4 array of zeros
np.ones((2,3)) | 2×3 array of ones
np.arange(0,10,2) | [0,2,4,6,8]
np.linspace(0,1,5) | 5 evenly spaced
np.eye(3) | 3×3 identity matrix
np.random.rand(3,4) | 3×4 random [0,1)
np.random.randn(3,4) | Normal distribution
Indexing & slicing:
arr[0] | First element
arr[-1] | Last element
arr[1:4] | Elements 1 to 3
arr[::2] | Every 2nd element
arr[1,2] | 2D: row 1, col 2
arr[:,0] | All rows, col 0
arr[0,:] | Row 0, all cols
arr[arr > 5] | Boolean indexing
Element-wise operations:
arr + 5 | Add scalar
arr1 + arr2 | Element-wise add
arr * 2 | Multiply by scalar
arr1 * arr2 | Element-wise multiply
arr ** 2 | Element-wise power
np.sqrt(arr) | Square root
np.exp(arr) | Exponential
np.log(arr) | Natural log
Aggregation & statistics:
arr.sum() | Sum all elements
arr.mean() | Mean (average)
arr.std() | Standard deviation
arr.min() | Minimum value
arr.max() | Maximum value
arr.sum(axis=0) | Sum by column
arr.sum(axis=1) | Sum by row
np.median(arr) | Median value
Reshaping & combining:
arr.reshape(3,4) | New shape (3×4)
arr.flatten() | To 1D array (copy)
arr.ravel() | To 1D (view)
arr.T | Transpose
np.vstack([a,b]) | Stack vertically
np.hstack([a,b]) | Stack horizontally
np.concatenate() | Join arrays
np.split(arr, 3) | Split into 3
Linear algebra:
A @ B | Matrix multiply
A.dot(B) | Dot product
np.linalg.inv(A) | Matrix inverse
np.linalg.det(A) | Determinant
np.linalg.eig(A) | Eigenvalues/vectors
np.linalg.solve(A,b) | Solve Ax=b
np.trace(A) | Trace (diagonal sum)
np.linalg.norm(v) | Vector norm
Pro Tips:
- Broadcasting: Arrays with different shapes can work together if dimensions are compatible
- Views vs Copies: Slicing creates views (modifies original); use .copy() for independence
- Axis parameter: axis=0 operates on columns (down), axis=1 on rows (across)
- Performance: Vectorized operations are 10-100× faster than Python loops
Practice Exercises
Linear Algebra Exercises
Exercise 1 (Beginner): Create two 2x2 matrices and perform matrix multiplication using @ operator. Compare with element-wise multiplication (*).
Exercise 2 (Beginner): Create a 3x3 matrix and calculate its determinant, trace (sum of diagonal), and rank. Verify the relationship between these properties.
Exercise 3 (Intermediate): Create a 3x3 matrix, compute its inverse, and multiply it by the original. The result should be close to identity matrix.
Exercise 4 (Intermediate): Create a 4x2 matrix A and 2x3 matrix B. Compute A @ B and verify the shape is 4x3. What happens with incompatible shapes?
Challenge (Advanced): Implement Least Squares regression: Given A (design matrix) and b (target), solve A @ x = b using linalg.lstsq(). Verify solution minimizes error.
Related Articles in This Series
Part 2: Pandas for Data Analysis
Master Pandas DataFrames, Series, data cleaning, transformation, groupby operations, and merge techniques for real-world data analysis.
Part 3: Data Visualization with Matplotlib & Seaborn
Create compelling visualizations with Python's most powerful plotting libraries. Learn line plots, bar charts, scatter plots, and statistical graphics.
Part 4: Machine Learning with Scikit-learn
Build predictive models with classification, regression, clustering algorithms, and complete ML pipelines using Scikit-learn.