Introduction: Why NumPy Matters
If you're entering the world of data science, machine learning, or scientific computing with Python, NumPy is your essential starting point. NumPy (Numerical Python) provides the foundation upon which the entire Python data science ecosystem is built—from Pandas for data manipulation to scikit-learn for machine learning.
Key Insight: NumPy isn't just another library—it's the backbone of scientific Python. Understanding NumPy is critical because Pandas, SciPy, scikit-learn, TensorFlow, and PyTorch all depend on its efficient array operations and mathematical capabilities.
NumPy excels at handling large, multi-dimensional arrays and matrices, performing mathematical operations at speeds comparable to compiled languages like C and Fortran. This performance comes from its implementation in C and intelligent memory management, making Python viable for computationally intensive scientific work.
The Evolution of NumPy
Historical Context
NumPy's story begins in the mid-1990s when Python was emerging as a scientific computing platform:
- 1995: Jim Hugunin creates Numeric, the first array package for Python
- 2001: Numarray emerges to address Numeric's limitations with large arrays
- 2005: Travis Oliphant unifies Numeric and Numarray into NumPy, combining the best of both
- 2006-Present: NumPy becomes the de facto standard for numerical computing in Python
Historical Milestone
Why NumPy Succeeded
NumPy succeeded where predecessors struggled by solving three critical problems:
- Performance: C-based implementation with optimized algorithms
- Memory efficiency: Contiguous memory allocation and views instead of copies
- Unified API: Single, consistent interface replacing fragmented tools
This combination made NumPy 10-100x faster than pure Python for numerical operations while maintaining Python's ease of use.
Why It Matters Today
NumPy's importance has only grown with the data science revolution:
- Foundation for ML: Deep learning frameworks like TensorFlow and PyTorch use NumPy-compatible tensors
- Data pipelines: Pandas DataFrames are built on NumPy arrays underneath
- Scientific computing: SciPy extends NumPy for advanced scientific functions
- Industry standard: Over 1,000 packages depend on NumPy in the Python ecosystem
NumPy Fundamentals
Installation and Setup
Installing NumPy is straightforward with pip:
# Install NumPy
pip install numpy
# Verify installation
python -c "import numpy as np; print(f'NumPy version: {np.__version__}')"
The ndarray: NumPy's Core Data Structure
The ndarray (N-dimensional array) is NumPy's fundamental object. Unlike Python lists, ndarrays:
- Store elements of the same type (homogeneous)
- Use contiguous memory for fast access
- Support vectorized operations without explicit loops
- Enable broadcasting for efficient element-wise operations
import numpy as np
# Creating arrays from Python lists
arr_1d = np.array([1, 2, 3, 4, 5])
arr_2d = np.array([[1, 2, 3], [4, 5, 6]])
print("1D array:", arr_1d)
print("Shape:", arr_1d.shape) # (5,)
print("2D array:\n", arr_2d)
print("Shape:", arr_2d.shape) # (2, 3)
Pro Tip: The shape attribute is crucial—it tells you the dimensions of your array. A shape of (2, 3) means 2 rows and 3 columns. This becomes critical when debugging matrix operations.
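For instance, a shape mismatch surfaces immediately in element-wise operations, and printing .shape is usually the quickest diagnosis. A minimal sketch (the arrays here are just illustrative):
import numpy as np
arr_2d = np.array([[1, 2, 3], [4, 5, 6]])   # shape (2, 3)
bad_vec = np.array([1, 2])                  # shape (2,) does not match the 3 columns
try:
    arr_2d + bad_vec                        # incompatible shapes raise a ValueError
except ValueError as err:
    print("Shape error:", err)
print(arr_2d.shape, bad_vec.shape)          # checking shapes reveals the mismatch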
Working with Arrays
Array Creation Methods
NumPy provides numerous ways to create arrays beyond converting lists:
import numpy as np
# Zeros, ones, and empty arrays
zeros = np.zeros((3, 4)) # 3x4 array of zeros
ones = np.ones((2, 3)) # 2x3 array of ones
empty = np.empty((2, 2)) # Uninitialized 2x2 array
print("Zeros:\n", zeros)
print("Ones:\n", ones)
import numpy as np
# Ranges and sequences
range_arr = np.arange(0, 10, 2) # [0, 2, 4, 6, 8]
linspace = np.linspace(0, 1, 5) # 5 evenly spaced values: [0, 0.25, 0.5, 0.75, 1]
print("Range:", range_arr)
print("Linspace:", linspace)
import numpy as np
# Identity and diagonal matrices
identity = np.eye(3) # 3x3 identity matrix
diagonal = np.diag([1, 2, 3, 4]) # Diagonal matrix
print("Identity:\n", identity)
print("Diagonal:\n", diagonal)
Array Attributes
Understanding array properties is essential for effective NumPy usage:
import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6]])
print("Shape:", arr.shape) # (2, 3) - dimensions
print("Size:", arr.size) # 6 - total elements
print("Dtype:", arr.dtype) # int64 - data type
print("Ndim:", arr.ndim) # 2 - number of dimensions
print("Itemsize:", arr.itemsize) # 8 - bytes per element
Data Types Matter
Memory Efficiency with dtypes
Choosing the right data type can dramatically reduce memory usage:
import numpy as np
# Same data, different memory footprints
arr_int64 = np.array([1, 2, 3], dtype=np.int64)
arr_int8 = np.array([1, 2, 3], dtype=np.int8)
print(f"int64 uses: {arr_int64.nbytes} bytes") # 24 bytes
print(f"int8 uses: {arr_int8.nbytes} bytes") # 3 bytes
Result: Using int8 instead of int64 reduces memory by 8x when values fit in 8 bits. For large datasets, this saves gigabytes.
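To see the effect at scale, here is a small sketch with ten million simulated sensor readings (the data itself is made up for illustration):
import numpy as np
# Ten million small integers: they fit comfortably in int8
readings = np.random.randint(0, 100, size=10_000_000)
as_int64 = readings.astype(np.int64)
as_int8 = readings.astype(np.int8)
print(f"int64: {as_int64.nbytes / 1e6:.0f} MB")  # ~80 MB
print(f"int8:  {as_int8.nbytes / 1e6:.0f} MB")   # ~10 MB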
Practice Exercises
Arrays Section Exercises
Exercise 1 (Beginner): Create a 3x4 array with values from 1-12, then print its shape, size, and dtype. Modify the array to use float32 data type.
Exercise 2 (Beginner): Create three arrays: zeros(5), ones(5), and arange(5). Print their shapes and data types.
Exercise 3 (Intermediate): Create arrays using linspace (5 values from 0-1), eye (3x3), and diag([1,2,3]). Explain what each creates.
Exercise 4 (Intermediate): Create a 4x4 array using arange and reshape. Check its attributes: shape, size, dtype, ndim, itemsize. Convert dtype to float32 and verify memory reduction.
Challenge (Advanced): Create arrays with different dtypes (int8, int16, int32, float32) using arange() and astype(). Calculate memory usage (nbytes) for each. Verify which dtype saves the most memory.
Array Operations & Vectorization
Vectorization: The NumPy Advantage
Vectorization means operations apply to entire arrays without explicit Python loops. This is NumPy's superpower:
import numpy as np
import time
# Python list approach (slow)
python_list = list(range(1000000))
start = time.time()
result_list = [x * 2 for x in python_list]
list_time = time.time() - start
print(f"Python list time: {list_time*1000:.2f}ms")
import numpy as np
import time
# NumPy vectorized approach (fast)
numpy_arr = np.arange(1000000)
start = time.time()
result_arr = numpy_arr * 2
numpy_time = time.time() - start
print(f"NumPy time: {numpy_time*1000:.2f}ms")
print(f"Speedup: {list_time/numpy_time:.1f}x faster!")
Element-wise Operations
NumPy supports intuitive mathematical operations:
import numpy as np
a = np.array([1, 2, 3, 4])
b = np.array([10, 20, 30, 40])
# Arithmetic operations
print("Addition:", a + b) # [11, 22, 33, 44]
print("Multiplication:", a * b) # [10, 40, 90, 160]
print("Power:", a ** 2) # [1, 4, 9, 16]
# Universal functions (ufuncs)
print("Square root:", np.sqrt(a)) # [1.0, 1.414..., 1.732..., 2.0]
print("Exponential:", np.exp(a)) # [2.718..., 7.389..., ...]
print("Sine:", np.sin(a)) # [0.841..., 0.909..., ...]
Aggregation Functions
import numpy as np
data = np.array([[1, 2, 3], [4, 5, 6]])
print("Sum all:", data.sum()) # 21
print("Sum by column:", data.sum(axis=0)) # [5, 7, 9]
print("Sum by row:", data.sum(axis=1)) # [6, 15]
print("Mean:", data.mean()) # 3.5
print("Standard dev:", data.std()) # ~1.707
Axis Confusion Alert: axis=0 means "down the rows" (column-wise), axis=1 means "across columns" (row-wise). Think of it as which axis to collapse.
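One quick way to internalize this is to check which dimension disappears from the shape after aggregating. A short sketch reusing the array above:
import numpy as np
data = np.array([[1, 2, 3], [4, 5, 6]])   # shape (2, 3)
print(data.sum(axis=0).shape)  # (3,) - axis 0 (rows) was collapsed
print(data.sum(axis=1).shape)  # (2,) - axis 1 (columns) was collapsed
print(data.sum(axis=0))        # [5, 7, 9]
print(data.sum(axis=1))        # [6, 15]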
Indexing, Slicing & Boolean Masking
Basic Indexing and Slicing
import numpy as np
arr = np.array([10, 20, 30, 40, 50])
# Indexing (like Python lists)
print("First element:", arr[0]) # 10
print("Last element:", arr[-1]) # 50
# Slicing [start:stop:step]
print("Slice [1:4]:", arr[1:4]) # [20, 30, 40]
print("Every other:", arr[::2]) # [10, 30, 50]
print("Reversed:", arr[::-1]) # [50, 40, 30, 20, 10]
Multi-dimensional Indexing
import numpy as np
matrix = np.array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
# Indexing with [row, col]
print("Element [1, 2]:", matrix[1, 2]) # 6
print("Row 0:", matrix[0, :]) # [1, 2, 3]
print("Column 1:", matrix[:, 1]) # [2, 5, 8]
print("Submatrix:\n", matrix[0:2, 1:3]) # [[2, 3], [5, 6]]
Boolean Masking: Powerful Filtering
Boolean indexing lets you filter arrays based on conditions:
import numpy as np
scores = np.array([85, 92, 78, 95, 88, 73])
# Create boolean mask
high_scores_mask = scores > 85
print("Mask:", high_scores_mask) # [False, True, False, True, True, False]
# Apply mask to filter
print("High scores:", scores[high_scores_mask]) # [92, 95, 88]
# Combine conditions with & (and) | (or)
elite = scores[(scores > 85) & (scores < 95)]
print("Elite scores:", elite) # [92, 88]
Real-World Example
Data Cleaning with Boolean Masking
Boolean masking is essential for data cleaning:
import numpy as np
temperatures = np.array([22.5, 23.1, -999, 24.3, -999, 22.8])
# Remove sensor error readings (-999)
valid_temps = temperatures[temperatures != -999]
print("Valid readings:", valid_temps) # [22.5, 23.1, 24.3, 22.8]
# Replace outliers with mean
mean_temp = valid_temps.mean()
temperatures[temperatures == -999] = mean_temp
print("Cleaned:", temperatures)
Practice Exercises
Indexing & Filtering Exercises
Exercise 1 (Beginner): Create a 1D array [10, 20, 30, 40, 50]. Practice indexing: first element, last element, elements at indices 1 and 3.
Exercise 2 (Beginner): Create a 2D array [[1,2,3],[4,5,6],[7,8,9]]. Extract row 1, column 2, and the submatrix from rows 0-1, columns 1-2.
Exercise 3 (Intermediate): Create an array of scores [85, 92, 78, 95, 88, 73]. Use boolean masking to find all scores > 85. Use & operator to find scores between 80 and 90.
Exercise 4 (Intermediate): Create an array with sensor values including -999 (errors). Use boolean masking to filter out errors, calculate mean of valid data, and replace errors with the mean.
Challenge (Advanced): Create a 3x3 matrix. Use fancy indexing to select specific rows and columns. Create boolean masks for multiple conditions (e.g., values > 5 AND < 8). Verify mask combinations with & | ~ operators.
Broadcasting: NumPy's Superpower
Broadcasting is NumPy's ability to perform operations on arrays of different shapes without explicit loops or copying data. This makes code both faster and more readable.
Broadcasting Rules
NumPy compares shapes element-wise from right to left. Two dimensions are compatible when:
- They are equal, or
- One of them is 1
import numpy as np
# Example 1: Scalar broadcasting
arr = np.array([1, 2, 3, 4])
result = arr + 10 # Adds 10 to each element
print(result) # [11, 12, 13, 14]
import numpy as np
# Example 2: 1D to 2D broadcasting
matrix = np.array([[1, 2, 3],
[4, 5, 6]])
row_vec = np.array([10, 20, 30])
result = matrix + row_vec # Adds row_vec to each row
print(result)
# [[11, 22, 33],
# [14, 25, 36]]
Practical Broadcasting Examples
import numpy as np
# Normalize data (zero mean, unit variance)
data = np.array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
mean = data.mean(axis=0) # Column means: [4, 5, 6]
std = data.std(axis=0) # Column stds
normalized = (data - mean) / std
print("Normalized:\n", normalized)
Broadcasting Magic: Without broadcasting, you'd need nested loops. Broadcasting does this automatically in optimized C code, making it 100x faster while keeping code readable.
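The same rules cover column vectors and also tell you when an operation simply cannot broadcast. A small sketch, assuming the same 2x3 matrix as above:
import numpy as np
matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])                 # shape (2, 3)
col_vec = np.array([100, 200]).reshape(2, 1)   # shape (2, 1) broadcasts across columns
print(matrix + col_vec)
# [[101, 102, 103],
#  [204, 205, 206]]
try:
    matrix + np.array([1, 2])                  # (2, 3) vs (2,) satisfies neither rule
except ValueError as err:
    print("Broadcast error:", err)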
Practice Exercises
Operations & Broadcasting Exercises
Exercise 1 (Beginner): Create two arrays: a = [1,2,3,4] and b = [10,20,30,40]. Perform addition, subtraction, multiplication, and division. Print all results.
Exercise 2 (Beginner): Create a 2D array [[1,2],[3,4]]. Add 10 to every element. Multiply each element by 2. Print the result.
Exercise 3 (Intermediate): Given data = np.array([[10,20,30],[40,50,60]]), subtract the mean of each column from that column (normalization). Verify the column means are now ~0.
Exercise 4 (Intermediate): Create a matrix and row vector. Use broadcasting to add the row vector to each row. Then multiply each row by a column vector [1,2,3,...].
Challenge (Advanced): Implement z-score normalization: (x - mean) / std for data = np.arange(12).reshape(3,4) without using explicit loops. Verify mean is ~0 and std is ~1.
Array Reshaping & Manipulation
Reshaping arrays is essential for preparing data for different algorithms and transforming between data representations.
Reshaping Arrays
import numpy as np
# Create 1D array
arr = np.arange(12)
print("Original:", arr) # [0, 1, 2, ..., 11]
# Reshape to 2D
matrix = arr.reshape(3, 4)
print("3x4 matrix:\n", matrix)
# [[ 0 1 2 3]
# [ 4 5 6 7]
# [ 8 9 10 11]]
import numpy as np
arr = np.arange(12)
# Reshape to 3D
cube = arr.reshape(2, 3, 2)
print("2x3x2 cube:\n", cube)
# Automatic dimension with -1
auto = arr.reshape(4, -1) # NumPy calculates: (4, 3)
print("Auto reshape shape:", auto.shape)
print("Auto reshape:\n", auto)
Flattening Arrays
import numpy as np
matrix = np.array([[1, 2, 3], [4, 5, 6]])
# Flatten to 1D (returns copy)
flat = matrix.flatten()
print("Flatten:", flat) # [1, 2, 3, 4, 5, 6]
# Ravel (returns view when possible)
ravel = matrix.ravel()
print("Ravel:", ravel) # [1, 2, 3, 4, 5, 6]
# Ravel is faster but be careful with modifications!
ravel[0] = 999
print("Matrix changed:", matrix) # [[999, 2, 3], [4, 5, 6]]
Transpose and Swapping Axes
import numpy as np
# Simple transpose
A = np.array([[1, 2, 3], [4, 5, 6]])
print("Transpose:\n", A.T)
# [[1, 4]
# [2, 5]
# [3, 6]]
import numpy as np
# For 3D+ arrays, specify axes
cube = np.random.rand(2, 3, 4)
swapped = cube.transpose(2, 0, 1) # Swap axes: (2,3,4) -> (4,2,3)
print("Original shape:", cube.shape)
print("Swapped shape:", swapped.shape)
Adding and Removing Dimensions
import numpy as np
# Expand dimensions
arr = np.array([1, 2, 3])
print("Original shape:", arr.shape) # (3,)
# Add dimension at axis 0
expanded = np.expand_dims(arr, axis=0)
print("Expanded (axis=0):", expanded.shape) # (1, 3)
# Add dimension at axis 1
expanded = np.expand_dims(arr, axis=1)
print("Expanded (axis=1):", expanded.shape) # (3, 1)
import numpy as np
# Squeeze: remove single-dimensional entries
arr_with_singles = np.array([[[1, 2, 3]]])
print("Before squeeze:", arr_with_singles.shape) # (1, 1, 3)
squeezed = np.squeeze(arr_with_singles)
print("After squeeze:", squeezed.shape) # (3,)
import numpy as np
# Ensure minimum dimensions
x = 5 # scalar
print("atleast_1d:", np.atleast_1d(x).shape) # (1,)
print("atleast_2d:", np.atleast_2d(x).shape) # (1, 1)
print("atleast_3d:", np.atleast_3d(x).shape) # (1, 1, 1)
Tiling and Repeating
import numpy as np
# Tile: repeat entire array
arr = np.array([[1, 2], [3, 4]])
tiled = np.tile(arr, 2) # Repeat 2 times horizontally
print("Tiled 2x:\n", tiled)
# [[1, 2, 1, 2]
# [3, 4, 3, 4]]
import numpy as np
arr = np.array([[1, 2], [3, 4]])
tiled = np.tile(arr, (2, 3)) # Repeat 2x vertically, 3x horizontally
print("Tiled (2,3):\n", tiled)
import numpy as np
# Repeat: repeat each element
arr = np.array([1, 2, 3])
repeated = np.repeat(arr, 3)
print("Repeat each 3x:", repeated) # [1, 1, 1, 2, 2, 2, 3, 3, 3]
import numpy as np
# Repeat along axis
matrix = np.array([[1, 2], [3, 4]])
repeated = np.repeat(matrix, 2, axis=0)
print("Repeat rows:\n", repeated)
# [[1, 2]
# [1, 2]
# [3, 4]
# [3, 4]]
Creating Coordinate Grids
import numpy as np
# Meshgrid: create coordinate matrices for vectorized computations
x = np.array([1, 2, 3])
y = np.array([4, 5])
xx, yy = np.meshgrid(x, y)
print("X grid:\n", xx)
# [[1, 2, 3]
# [1, 2, 3]]
print("Y grid:\n", yy)
# [[4, 4, 4]
# [5, 5, 5]]
# Use for function evaluation
z = xx**2 + yy**2
print("f(x,y) = x² + y²:\n", z)
import numpy as np
# mgrid and ogrid for dense/sparse grids
x, y = np.mgrid[0:5, 0:5] # Dense grid
print("mgrid x shape:", x.shape) # (5, 5)
x, y = np.ogrid[0:5, 0:5] # Open (sparse) grid
print("ogrid x shape:", x.shape) # (5, 1) - memory efficient!
Rearranging Elements
import numpy as np
# Roll: circular shift
arr = np.array([1, 2, 3, 4, 5])
rolled = np.roll(arr, 2) # Shift right by 2
print("Rolled:", rolled) # [4, 5, 1, 2, 3]
import numpy as np
# Roll 2D arrays along axis
matrix = np.array([[1, 2, 3], [4, 5, 6]])
rolled = np.roll(matrix, 1, axis=1) # Shift columns
print("Rolled columns:\n", rolled)
# [[3, 1, 2]
# [6, 4, 5]]
import numpy as np
# Flip arrays
arr = np.array([1, 2, 3, 4, 5])
flipped = np.flip(arr)
print("Flipped:", flipped) # [5, 4, 3, 2, 1]
import numpy as np
matrix = np.array([[1, 2], [3, 4]])
print("Flip vertical:\n", np.flipud(matrix))
# [[3, 4]
# [1, 2]]
print("Flip horizontal:\n", np.fliplr(matrix))
# [[2, 1]
# [4, 3]]
import numpy as np
# Rotate 90 degrees
matrix = np.array([[1, 2], [3, 4]])
rotated = np.rot90(matrix)
print("Rotated 90°:\n", rotated)
# [[2, 4]
# [1, 3]]
Common Pattern
Reshaping for Machine Learning
ML algorithms often require specific input shapes. Here's how to prepare image data:
import numpy as np
# Convert 100 28x28 grayscale images to flat vectors
images = np.random.rand(100, 28, 28)
print("Original shape:", images.shape) # (100, 28, 28)
# Flatten each image: (100, 28, 28) -> (100, 784)
flat_images = images.reshape(100, -1)
print("Flattened shape:", flat_images.shape) # (100, 784)
# Add channel dimension for CNNs: (100, 28, 28) -> (100, 28, 28, 1)
cnn_images = images.reshape(100, 28, 28, 1)
print("CNN shape:", cnn_images.shape) # (100, 28, 28, 1)
Scientific Computing
Using Meshgrid for 2D Function Plotting
Meshgrid is essential for evaluating functions over 2D grids, commonly used in plotting and numerical methods:
import numpy as np
# Create 2D Gaussian function
x = np.linspace(-3, 3, 100)
y = np.linspace(-3, 3, 100)
xx, yy = np.meshgrid(x, y)
# Evaluate 2D Gaussian: f(x,y) = exp(-(x² + y²)/2)
z = np.exp(-(xx**2 + yy**2) / 2)
print("Grid shape:", xx.shape) # (100, 100)
print("Z values shape:", z.shape) # (100, 100)
print("Max value (at origin):", z.max()) # 1.0
# This z array can be directly plotted with matplotlib's contour or imshow
Combining Arrays: Stack & Concatenate
NumPy provides powerful functions to combine multiple arrays in various ways.
Concatenation
import numpy as np
a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6]])
# Concatenate along rows (axis 0)
vcat = np.concatenate([a, b], axis=0)
print("Vertical concat:\n", vcat)
# [[1, 2]
# [3, 4]
# [5, 6]]
# Concatenate along columns (axis 1)
c = np.array([[5], [6]])
hcat = np.concatenate([a, c], axis=1)
print("Horizontal concat:\n", hcat)
# [[1, 2, 5]
# [3, 4, 6]]
Stacking Functions
import numpy as np
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
# Vertical stack (row-wise)
vstack = np.vstack([a, b])
print("vstack:\n", vstack)
# [[1, 2, 3]
# [4, 5, 6]]
# Horizontal stack (column-wise)
hstack = np.hstack([a, b])
print("hstack:", hstack) # [1, 2, 3, 4, 5, 6]
# Depth stack (3D stacking)
dstack = np.dstack([a, b])
print("dstack shape:", dstack.shape) # (1, 3, 2)
Column and Row Stack
import numpy as np
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
# Column stack: 1D arrays -> 2D columns
col_stack = np.column_stack([a, b])
print("Column stack:\n", col_stack)
# [[1, 4]
# [2, 5]
# [3, 6]]
# Row stack (an alias for vstack; newer NumPy versions prefer np.vstack directly)
row_stack = np.row_stack([a, b])
print("Row stack:\n", row_stack)
Splitting Arrays
import numpy as np
# Split array into equal parts
arr = np.array([1, 2, 3, 4, 5, 6])
split_arrays = np.split(arr, 3) # Split into 3 equal parts
print("Split into 3:", [a for a in split_arrays])
# [array([1, 2]), array([3, 4]), array([5, 6])]
# Split at specific indices
split_arrays = np.split(arr, [2, 4]) # Split at indices 2 and 4
print("Split at indices:", [a for a in split_arrays])
# [array([1, 2]), array([3, 4]), array([5, 6])]
import numpy as np
# Horizontal and vertical splits
matrix = np.array([[1, 2, 3, 4],
[5, 6, 7, 8],
[9, 10, 11, 12]])
# Split horizontally (column-wise)
h_split = np.hsplit(matrix, 2)
print("Horizontal split:")
for arr in h_split:
    print(arr)
# [[ 1  2]
#  [ 5  6]
#  [ 9 10]]
# [[ 3  4]
#  [ 7  8]
#  [11 12]]
# Split vertically (row-wise)
v_split = np.vsplit(matrix, 3)
print("Vertical split:")
for arr in v_split:
    print(arr)   # [[1 2 3 4]], then [[5 6 7 8]], then [[9 10 11 12]]
Quick Reference:
- Stacking:
vstack (rows), hstack (columns), dstack (depth), column_stack (1D→columns)
- Splitting:
split (any axis), hsplit (columns), vsplit (rows), dsplit (depth)
Sorting, Searching & Set Operations
Sorting Arrays
import numpy as np
arr = np.array([3, 1, 4, 1, 5, 9, 2, 6])
# Sort array (returns new sorted array)
sorted_arr = np.sort(arr)
print("Sorted:", sorted_arr) # [1, 1, 2, 3, 4, 5, 6, 9]
# Argsort (returns indices that would sort the array)
indices = np.argsort(arr)
print("Argsort:", indices) # [1, 3, 6, 0, 2, 4, 7, 5]
print("Using indices:", arr[indices]) # Same as sorted_arr
import numpy as np
arr = np.array([3, 1, 4, 1, 5, 9, 2, 6])
# Sort in descending order
desc = np.sort(arr)[::-1]
print("Descending:", desc) # [9, 6, 5, 4, 3, 2, 1, 1]
Sorting 2D Arrays
import numpy as np
matrix = np.array([[3, 1, 4],
[1, 5, 9],
[2, 6, 5]])
# Sort along axis 0 (columns)
col_sorted = np.sort(matrix, axis=0)
print("Column sorted:\n", col_sorted)
# Sort along axis 1 (rows)
row_sorted = np.sort(matrix, axis=1)
print("Row sorted:\n", row_sorted)
Searching Arrays
import numpy as np
arr = np.array([1, 3, 5, 7, 9, 11])
# Binary search in sorted array
idx = np.searchsorted(arr, 6)
print("Insert position for 6:", idx) # 3 (between 5 and 7)
# Search for multiple values
indices = np.searchsorted(arr, [4, 8, 10])
print("Insert positions:", indices) # [2, 4, 5]
# Find where condition is true
greater_than_5 = np.where(arr > 5)
print("Indices > 5:", greater_than_5) # (array([3, 4, 5]),)
Set Operations
import numpy as np
# Unique values
arr_dup = np.array([1, 2, 2, 3, 3, 3, 4])
unique = np.unique(arr_dup)
print("Unique:", unique) # [1, 2, 3, 4]
# Unique with counts
unique, counts = np.unique(arr_dup, return_counts=True)
print("Counts:", dict(zip(unique, counts))) # {1: 1, 2: 2, 3: 3, 4: 1}
import numpy as np
a = np.array([1, 2, 3, 4, 5])
b = np.array([4, 5, 6, 7, 8])
# Set operations
intersection = np.intersect1d(a, b)
print("Intersection:", intersection) # [4, 5]
union = np.union1d(a, b)
print("Union:", union) # [1, 2, 3, 4, 5, 6, 7, 8]
difference = np.setdiff1d(a, b)
print("Difference (a-b):", difference) # [1, 2, 3]
Practical Example
Finding Top K Elements
import numpy as np
# Find top 3 scores and their indices
scores = np.array([85, 92, 78, 95, 88, 73, 99, 82])
# Use argpartition for efficient partial sorting
k = 3
top_k_indices = np.argpartition(scores, -k)[-k:]
top_k_scores = scores[top_k_indices]
# Sort the top K in descending order
sorted_idx = top_k_indices[np.argsort(top_k_scores)][::-1]
print("Top 3 scores:", scores[sorted_idx]) # [99, 95, 92]
print("Top 3 indices:", sorted_idx) # [6, 3, 1]
Statistical Functions
Descriptive Statistics
import numpy as np
data = np.array([85, 92, 78, 95, 88, 73, 99, 82])
# Basic statistics
print("Mean:", data.mean()) # 86.5
print("Median:", np.median(data)) # 86.5
print("Std Dev:", data.std()) # 7.94
print("Variance:", data.var()) # 63.0
print("Min:", data.min()) # 73
print("Max:", data.max()) # 99
print("Range:", data.ptp()) # 26 (peak-to-peak)
Percentiles and Quantiles
import numpy as np
data = np.array([85, 92, 78, 95, 88, 73, 99, 82])
# Percentiles
p25 = np.percentile(data, 25)
p50 = np.percentile(data, 50) # Same as median
p75 = np.percentile(data, 75)
print(f"25th: {p25}, 50th: {p50}, 75th: {p75}")
# Quantiles (same as percentiles, but expressed on a 0-1 scale)
q = np.quantile(data, [0.25, 0.5, 0.75])
print("Quantiles:", q)
Correlation and Covariance
import numpy as np
# Two variables
hours_studied = np.array([2, 3, 4, 5, 6, 7, 8])
test_scores = np.array([65, 70, 75, 80, 85, 90, 95])
# Correlation coefficient
correlation = np.corrcoef(hours_studied, test_scores)
print("Correlation matrix:\n", correlation)
print("Correlation:", correlation[0, 1]) # ~1.0 (perfect positive)
# Covariance
covariance = np.cov(hours_studied, test_scores)
print("Covariance matrix:\n", covariance)
Histograms
import numpy as np
# Create histogram data
data = np.random.randn(1000)
counts, bins = np.histogram(data, bins=10)
print("Counts:", counts)
print("Bin edges:", bins)
# Histogram with custom bins
counts, bins = np.histogram(data, bins=[-3, -2, -1, 0, 1, 2, 3])
print("Custom bins counts:", counts)
Statistical Functions Summary:
- mean(), median(), std(), var(): central tendency and spread
- min(), max(), ptp(): range statistics
- percentile(), quantile(): distribution percentiles
- corrcoef(), cov(): relationships between variables
- histogram(): frequency distributions
Saving & Loading Arrays
NumPy provides efficient binary formats for saving and loading arrays, preserving data types and shapes.
Single Array: .npy Format
import numpy as np
# Save single array
data = np.array([[1, 2, 3], [4, 5, 6]])
np.save('my_array.npy', data)
# Load array
loaded = np.load('my_array.npy')
print("Loaded:\n", loaded)
print("Same array:", np.array_equal(data, loaded)) # True
Multiple Arrays: .npz Format
import numpy as np
# Save multiple arrays
x = np.array([1, 2, 3])
y = np.array([4, 5, 6])
z = np.array([[7, 8], [9, 10]])
np.savez('multiple_arrays.npz', x=x, y=y, z=z)
# Load multiple arrays
loaded = np.load('multiple_arrays.npz')
print("Available arrays:", loaded.files) # ['x', 'y', 'z']
print("X:", loaded['x'])
print("Y:", loaded['y'])
print("Z:", loaded['z'])
Compressed Format: .npz
import numpy as np
# Save compressed (smaller file size)
large_array = np.random.rand(1000, 1000)
np.savez_compressed('compressed.npz', data=large_array)
# Load compressed (transparent decompression)
loaded = np.load('compressed.npz')
print("Shape:", loaded['data'].shape)
Text Files: CSV-like
import numpy as np
# Save to text file
data = np.array([[1, 2, 3], [4, 5, 6]])
np.savetxt('data.txt', data, delimiter=',', fmt='%d')
# Load from text file
loaded = np.loadtxt('data.txt', delimiter=',')
print("Loaded from text:\n", loaded)
Performance Comparison
Binary vs Text Format
import numpy as np
import time
# Create large array
large = np.random.rand(10000, 100)
# Binary save (.npy) - FAST
start = time.time()
np.save('binary.npy', large)
binary_time = time.time() - start
# Text save (.txt) - SLOW
start = time.time()
np.savetxt('text.txt', large)
text_time = time.time() - start
print(f"Binary save: {binary_time:.3f}s")
print(f"Text save: {text_time:.3f}s")
print(f"Binary is {text_time/binary_time:.1f}x faster")
Result: Binary format is typically 10-50x faster and produces much smaller files. Use .npy/.npz for NumPy-to-NumPy storage!
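For arrays too large to hold comfortably in memory, a .npy file can also be opened as a memory map so data is read lazily from disk. A small sketch, reusing the binary.npy file from the benchmark above:
import numpy as np
# Open the file as a memory map instead of loading everything into RAM
mm = np.load('binary.npy', mmap_mode='r')
print("Shape:", mm.shape)                 # metadata is available immediately
print("Slice mean:", mm[:5, :5].mean())   # only the accessed slice is read from disk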
Practice Exercises
Reshaping & Manipulation Exercises
Exercise 1 (Beginner): Create a 1D array with 24 elements. Reshape it to 2x3x4, then back to 6x4. Verify each shape transformation.
Exercise 2 (Beginner): Given a 3x4 array, flatten it to 1D using flatten() and reshape(). Are they the same? Why or why not?
Exercise 3 (Intermediate): Create a 4x4 array. Transpose it. Flip it horizontally (flip(axis=1)). Stack the original, transposed, and flipped versions along axis 0.
Exercise 4 (Intermediate): Create three 2D arrays of shape (3,3). Use hstack() and vstack() to combine them different ways. Compare results.
Challenge (Advanced): Create a 3D array (2,3,4). Reshape to 2D, apply some operations, then reshape back to 3D. Ensure data integrity throughout.
Linear Algebra Operations
NumPy's linalg module provides comprehensive linear algebra functionality essential for machine learning and scientific computing.
Matrix Operations
import numpy as np
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
# Matrix multiplication
C = A @ B # Python 3.5+ syntax
# or: C = np.dot(A, B)
print("A @ B:\n", C)
# Transpose
print("Transpose:\n", A.T)
# Determinant
det = np.linalg.det(A)
print("Determinant:", det) # -2.0
# Inverse (if exists)
A_inv = np.linalg.inv(A)
print("Inverse:\n", A_inv)
Eigenvalues and Eigenvectors
import numpy as np
matrix = np.array([[4, -2],
[1, 1]])
eigenvalues, eigenvectors = np.linalg.eig(matrix)
print("Eigenvalues:", eigenvalues)
print("Eigenvectors:\n", eigenvectors)
Machine Learning Application
Solving Linear Regression with Linear Algebra
Linear regression can be solved using the normal equation: θ = (X^T X)^(-1) X^T y
import numpy as np
# Generate synthetic data
X = np.array([[1, 1], [1, 2], [1, 3]]) # Features with bias term
y = np.array([2, 4, 6]) # Target values
# Normal equation: theta = (X^T X)^(-1) X^T y
theta = np.linalg.inv(X.T @ X) @ X.T @ y
print("Coefficients:", theta) # [0, 2] (intercept=0, slope=2)
This is closely related to how libraries like scikit-learn solve linear regression under the hood, although production implementations favor numerically stable least-squares solvers over an explicit matrix inverse.
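If you want that more stable route in NumPy itself, np.linalg.lstsq solves the same problem without forming an inverse. A sketch using the same data:
import numpy as np
X = np.array([[1, 1], [1, 2], [1, 3]])
y = np.array([2, 4, 6])
# Least-squares solve without an explicit inverse
theta, residuals, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)
print("Coefficients:", theta)  # ~[0, 2] (intercept=0, slope=2)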
Random Number Generation
NumPy's random module is crucial for simulations, data generation, and machine learning initialization.
Basic Random Generation
import numpy as np
# Set seed for reproducibility
np.random.seed(42)
# Random floats [0, 1)
uniform = np.random.rand(3, 3)
print("Uniform [0,1):\n", uniform)
# Random integers
integers = np.random.randint(0, 10, size=(2, 3))
print("Random integers:\n", integers)
# Normal distribution (mean=0, std=1)
normal = np.random.randn(1000)
print("Normal mean:", normal.mean()) # ~0
print("Normal std:", normal.std()) # ~1
Statistical Distributions
import numpy as np
# Different distributions
exponential = np.random.exponential(scale=2.0, size=1000)
poisson = np.random.poisson(lam=3, size=1000)
binomial = np.random.binomial(n=10, p=0.5, size=1000)
# Choice and shuffle
deck = np.arange(52)
np.random.shuffle(deck)
hand = np.random.choice(deck, size=5, replace=False)
Reproducibility: Always set np.random.seed() at the start of notebooks or scripts. This ensures random operations produce identical results across runs—critical for debugging and scientific reproducibility.
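Newer NumPy code typically uses the Generator API, which keeps the seed on a generator object instead of in global state. A short sketch of the equivalent calls:
import numpy as np
rng = np.random.default_rng(42)              # seeded generator, no global state
uniform = rng.random((3, 3))                 # floats in [0, 1)
integers = rng.integers(0, 10, size=(2, 3))  # random integers
normal = rng.normal(loc=0, scale=1, size=1000)
print(uniform.shape, integers.shape, round(normal.mean(), 2))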
Performance Optimization
Views vs Copies
Understanding when NumPy creates views (references) versus copies is critical for performance:
import numpy as np
original = np.array([1, 2, 3, 4, 5])
# Slicing creates a VIEW (shares memory)
view = original[1:4]
view[0] = 999
print("Original changed:", original) # [1, 999, 3, 4, 5]
# Fancy indexing creates a COPY
copy = original[[0, 2, 4]]
copy[0] = 777
print("Original unchanged:", original) # [1, 999, 3, 4, 5]
Memory Layout: C vs Fortran Order
import numpy as np
# C-contiguous (row-major, default)
c_array = np.array([[1, 2], [3, 4]], order='C')
# Fortran-contiguous (column-major)
f_array = np.array([[1, 2], [3, 4]], order='F')
# Check with flags
print("C-contiguous:", c_array.flags['C_CONTIGUOUS'])
print("F-contiguous:", f_array.flags['F_CONTIGUOUS'])
Performance Benchmark
Vectorization Speed Comparison
import numpy as np
import time
size = 1_000_000
a = np.random.rand(size)
b = np.random.rand(size)
# Python loop approach
start = time.time()
result = []
for i in range(size):
    result.append(a[i] + b[i])
python_time = time.time() - start
# NumPy vectorized approach
start = time.time()
result_np = a + b
numpy_time = time.time() - start
print(f"Python loop: {python_time:.4f}s")
print(f"NumPy vectorized: {numpy_time:.4f}s")
print(f"Speedup: {python_time/numpy_time:.1f}x")
Typical result: NumPy is 50-100x faster for element-wise operations!
Practical Applications
Image Processing
Images are just 3D arrays (height × width × channels):
import numpy as np
# Simulate a 100x100 RGB image
image = np.random.randint(0, 256, size=(100, 100, 3), dtype=np.uint8)
# Convert to grayscale (weighted average)
gray = np.dot(image[...,:3], [0.299, 0.587, 0.114])
# Flip horizontally
flipped = image[:, ::-1, :]
# Crop center
h, w = image.shape[:2]
cropped = image[h//4:3*h//4, w//4:3*w//4, :]
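A common pitfall with uint8 images is overflow when adjusting brightness: adding 50 directly can wrap values past 255 back around to small numbers. Clipping in a wider dtype avoids this. A sketch reusing the simulated image above:
import numpy as np
image = np.random.randint(0, 256, size=(100, 100, 3), dtype=np.uint8)
# Brighten by 50 without uint8 wrap-around
brighter = np.clip(image.astype(np.int16) + 50, 0, 255).astype(np.uint8)
print("Original max:", image.max())
print("Brightened max:", brighter.max())  # never exceeds 255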
Statistical Analysis
import numpy as np
# Generate sample data
data = np.random.normal(loc=100, scale=15, size=1000)
# Compute statistics
mean = data.mean()
median = np.median(data)
std = data.std()
percentiles = np.percentile(data, [25, 50, 75])
print(f"Mean: {mean:.2f}")
print(f"Median: {median:.2f}")
print(f"Std Dev: {std:.2f}")
print(f"IQR: {percentiles[2] - percentiles[0]:.2f}")
Time Series Analysis
import numpy as np
# Moving average
def moving_average(data, window):
return np.convolve(data, np.ones(window)/window, mode='valid')
prices = np.random.randn(100).cumsum() + 100
ma_5 = moving_average(prices, 5)
ma_20 = moving_average(prices, 20)
Next Steps in the Series
Congratulations! You now understand NumPy's core concepts and capabilities. You've mastered:
- ✅ Array creation and manipulation
- ✅ Vectorized operations and broadcasting
- ✅ Indexing, slicing, and boolean masking
- ✅ Linear algebra operations
- ✅ Performance optimization techniques
What's Next: In Part 2: Pandas for Data Analysis, you'll learn how Pandas builds on NumPy to provide powerful tools for working with real-world tabular data—DataFrames, data cleaning, transformation, and exploratory analysis.
NumPy is the foundation, but Pandas is where data science really comes alive. See you in the next article!
NumPy API Cheat Sheet
Quick reference for the most commonly used NumPy operations covered in this article.
Array creation:
np.array([1,2,3]) | Create from list
np.zeros((3,4)) | 3×4 array of zeros
np.ones((2,3)) | 2×3 array of ones
np.arange(0,10,2) | [0,2,4,6,8]
np.linspace(0,1,5) | 5 evenly spaced
np.eye(3) | 3×3 identity matrix
np.random.rand(3,4) | 3×4 random [0,1)
np.random.randn(3,4) | Normal distribution
Indexing & slicing:
arr[0] | First element
arr[-1] | Last element
arr[1:4] | Elements 1 to 3
arr[::2] | Every 2nd element
arr[1,2] | 2D: row 1, col 2
arr[:,0] | All rows, col 0
arr[0,:] | Row 0, all cols
arr[arr > 5] | Boolean indexing
Element-wise operations:
arr + 5 | Add scalar
arr1 + arr2 | Element-wise add
arr * 2 | Multiply by scalar
arr1 * arr2 | Element-wise multiply
arr ** 2 | Element-wise power
np.sqrt(arr) | Square root
np.exp(arr) | Exponential
np.log(arr) | Natural log
Aggregation & statistics:
arr.sum() | Sum all elements
arr.mean() | Mean (average)
arr.std() | Standard deviation
arr.min() | Minimum value
arr.max() | Maximum value
arr.sum(axis=0) | Sum by column
arr.sum(axis=1) | Sum by row
np.median(arr) | Median value
Reshaping & combining:
arr.reshape(3,4) | New shape (3×4)
arr.flatten() | To 1D array (copy)
arr.ravel() | To 1D (view)
arr.T | Transpose
np.vstack([a,b]) | Stack vertically
np.hstack([a,b]) | Stack horizontally
np.concatenate() | Join arrays
np.split(arr, 3) | Split into 3
Linear algebra:
A @ B | Matrix multiply
A.dot(B) | Dot product
np.linalg.inv(A) | Matrix inverse
np.linalg.det(A) | Determinant
np.linalg.eig(A) | Eigenvalues/vectors
np.linalg.solve(A,b) | Solve Ax=b
np.trace(A) | Trace (diagonal sum)
np.linalg.norm(v) | Vector norm
Pro Tips:
- Broadcasting: Arrays with different shapes can work together if dimensions are compatible
- Views vs Copies: Slicing creates views (modifies original); use .copy() for independence
- Axis parameter: axis=0 operates on columns (down), axis=1 on rows (across)
- Performance: Vectorized operations are 10-100× faster than Python loops
Practice Exercises
Linear Algebra Exercises
Exercise 1 (Beginner): Create two 2x2 matrices and perform matrix multiplication using @ operator. Compare with element-wise multiplication (*).
Exercise 2 (Beginner): Create a 3x3 matrix and calculate its determinant, trace (sum of diagonal), and rank. Verify the relationship between these properties.
Exercise 3 (Intermediate): Create a 3x3 matrix, compute its inverse, and multiply it by the original. The result should be close to identity matrix.
Exercise 4 (Intermediate): Create a 4x2 matrix A and 2x3 matrix B. Compute A @ B and verify the shape is 4x3. What happens with incompatible shapes?
Challenge (Advanced): Implement Least Squares regression: Given A (design matrix) and b (target), solve A @ x = b using linalg.lstsq(). Verify solution minimizes error.
Related Articles in This Series
Part 2: Pandas for Data Analysis
Master Pandas DataFrames, Series, data cleaning, transformation, groupby operations, and merge techniques for real-world data analysis.
Part 3: Data Visualization with Matplotlib & Seaborn
Create compelling visualizations with Python's most powerful plotting libraries. Learn line plots, bar charts, scatter plots, and statistical graphics.
Part 4: Machine Learning with Scikit-learn
Build predictive models with classification, regression, clustering algorithms, and complete ML pipelines using Scikit-learn.