Introduction to Pandas
If NumPy is the foundation of Python's data science stack, Pandas is where data analysis truly begins. Built on top of NumPy, Pandas provides high-level data structures and tools specifically designed for working with real-world, tabular data—the kind you find in spreadsheets, databases, and CSV files.
Why Pandas Matters: Pandas is to data science what SQL is to databases—the essential tool for data manipulation. It's the first library you reach for when loading, cleaning, transforming, and analyzing data before machine learning or visualization.
This tutorial is part of a series:
- Python Setup & Notebooks: IDE setup, Jupyter, virtual environments
- NumPy Foundations: Arrays, broadcasting, linear algebra
- Pandas Data Analysis: DataFrames, cleaning, manipulation (you are here)
- Data Visualization: Matplotlib, Seaborn, Plotly
- Machine Learning with Scikit-learn: Classification, regression, clustering
- ML Mathematics & Statistics: Linear algebra, calculus, probability
- Artificial Neural Networks: Perceptrons, backpropagation, architectures
- Computer Vision Fundamentals: CNNs, image processing, object detection
- PyTorch Deep Learning: Tensors, autograd, model training
- TensorFlow & Keras: Sequential models, callbacks, deployment
- Transformers & Attention: Self-attention, BERT, GPT architecture
Key capabilities:
- DataFrame: 2D table with labeled rows and columns (like Excel, but programmable)
- Series: 1D labeled array (a single column or row)
- I/O tools: Read/write CSV, Excel, JSON, SQL, Parquet, and more
- Data cleaning: Handle missing values, duplicates, and transformations
- GroupBy: Split-apply-combine operations for aggregation
- Merge/Join: Combine datasets like SQL joins
- Time series: Specialized support for datetime data
Installation and Setup
# Install Pandas
pip install pandas
# Import convention
import pandas as pd
import numpy as np
print(f"pandas version: {pd.__version__}")
Series: 1D Labeled Arrays
A Series is a one-dimensional labeled array—like a column in a spreadsheet or a single feature in a dataset. It combines NumPy's array performance with labeled indices.
import pandas as pd
import numpy as np
# Creating Series from list
s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'], name='scores')
print(s)
# Output:
# a 10
# b 20
# c 30
# d 40
# Name: scores, dtype: int64
# Access by label and position
print("By label 'b':", s['b']) # 20
print("By position 2:", s.iloc[2]) # 30
# Vectorized operations
print("Add 5:", s + 5)
print("Multiply by 2:", s * 2)
Index Alignment
Pandas automatically aligns data by index labels—a powerful feature:
import pandas as pd
s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])
# Alignment by index (missing indices become NaN)
result = s1 + s2
print(result)
# a NaN
# b 12.0
# c 23.0
# d NaN
Critical Concept: Index alignment happens automatically. Operations match by labels, not positions. This prevents silent bugs when working with mismatched data but can introduce NaN values.
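When the NaN values introduced by alignment are unwanted, the arithmetic methods (add, sub, mul, div) accept a fill_value; a short sketch with the same two Series:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

# fill_value=0 substitutes 0 for a label missing on either side,
# so no NaN appears in the result
total = s1.add(s2, fill_value=0)
print(total)
# a     1.0
# b    12.0
# c    23.0
# d    30.0
```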
DataFrame: 2D Tabular Data
The DataFrame is Pandas' primary data structure—a 2D table with labeled rows and columns. Think of it as a spreadsheet or SQL table in Python.
import pandas as pd
# Create DataFrame from dictionary
df = pd.DataFrame({
'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
'age': [25, 30, 35, 28],
'city': ['NY', 'LA', 'NY', 'SF'],
'score': [88, 92, 85, 90]
})
print(df)
# name age city score
# 0 Alice 25 NY 88
# 1 Bob 30 LA 92
# 2 Charlie 35 NY 85
# 3 Diana 28 SF 90
# Data inspection
print(df.head(2)) # First 2 rows
print(df.info()) # Structure and types
print(df.describe()) # Statistical summary
print(df.shape) # (4, 4) - rows × columns
Column Operations
import pandas as pd
df = pd.DataFrame({
'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
'age': [25, 30, 35, 28],
'city': ['NY', 'LA', 'NY', 'SF'],
'score': [88, 92, 85, 90]
})
# Select single column (returns Series)
ages = df['age']
print(type(ages)) # <class 'pandas.core.series.Series'>
# Select multiple columns (returns DataFrame)
subset = df[['name', 'score']]
# Add new column
df['is_ny'] = df['city'] == 'NY'
print(df)
# name age city score is_ny
# 0 Alice 25 NY 88 True
# 1 Bob 30 LA 92 False
# 2 Charlie 35 NY 85 True
# 3 Diana 28 SF 90 False
DataFrame Anatomy
Understanding DataFrame Structure
- Index: Row labels (default: 0, 1, 2, ...)
- Columns: Column labels (from keys or specified)
- Values: Underlying NumPy array (access with .values)
- dtypes: Data type per column (int64, float64, object)
import pandas as pd
df = pd.DataFrame({
'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
'age': [25, 30, 35, 28],
'city': ['NY', 'LA', 'NY', 'SF'],
'score': [88, 92, 85, 90]
})
print("Index:", df.index)
print("Columns:", df.columns)
print("Dtypes:", df.dtypes)
print("Values type:", type(df.values)) # numpy.ndarray
Reading & Writing Data
Pandas excels at loading data from various sources—CSV, Excel, JSON, SQL databases, and more.
CSV Operations
import pandas as pd
df = pd.DataFrame({
'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
'age': [25, 30, 35, 28],
'city': ['NY', 'LA', 'NY', 'SF'],
'score': [88, 92, 85, 90]
})
# Write to CSV
df.to_csv('people.csv', index=False)
# Read from CSV
df_csv = pd.read_csv('people.csv')
print(df_csv.head())
# Advanced CSV reading
df_adv = pd.read_csv('data.csv',
sep=';', # Custom delimiter
skiprows=2, # Skip header rows
na_values=['NA', 'missing'], # Custom NaN values
parse_dates=['date'], # Parse date columns
dtype={'id': str}) # Force column types
Excel, JSON, and More
import pandas as pd
import sqlite3
df = pd.DataFrame({
'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
'age': [25, 30, 35, 28],
'city': ['NY', 'LA', 'NY', 'SF'],
'score': [88, 92, 85, 90]
})
# Excel (requires openpyxl or xlrd)
df.to_excel('people.xlsx', index=False)
df_xlsx = pd.read_excel('people.xlsx')
# JSON
df.to_json('people.json', orient='records')
df_json = pd.read_json('people.json')
# Parquet (efficient columnar format)
df.to_parquet('people.parquet')
df_parquet = pd.read_parquet('people.parquet')
# SQL database
conn = sqlite3.connect('database.db')
df.to_sql('people', conn, if_exists='replace')
df_sql = pd.read_sql('SELECT * FROM people WHERE age > 25', conn)
Performance Tip: For large datasets (>1GB), prefer Parquet over CSV. Because Parquet is a compressed, binary, columnar format, its files are typically several times smaller than the equivalent CSV and substantially faster to read and write.
Indexing, Selection & Filtering
Pandas provides two primary indexers: loc (label-based) and iloc (position-based).
loc vs iloc
import pandas as pd
df = pd.DataFrame({
'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
'age': [25, 30, 35, 28],
'city': ['NY', 'LA', 'NY', 'SF'],
'score': [88, 92, 85, 90]
})
# Set custom index
df_idx = df.set_index('name')
# loc: label-based selection
print(df_idx.loc['Alice']) # Row by label
print(df_idx.loc['Alice', 'age']) # Specific cell
print(df_idx.loc['Alice':'Charlie']) # Slice (inclusive!)
# iloc: position-based selection
print(df_idx.iloc[0]) # First row
print(df_idx.iloc[0, 1]) # Row 0, col 1
print(df_idx.iloc[0:2, 1:3]) # Slice (exclusive)
Boolean Filtering
Boolean masks are the most powerful way to filter data:
import pandas as pd
df = pd.DataFrame({
'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
'age': [25, 30, 35, 28],
'city': ['NY', 'LA', 'NY', 'SF'],
'score': [88, 92, 85, 90]
})
# Single condition
high_scores = df[df['score'] > 85]
print(high_scores)
# Multiple conditions (use & | ~)
elite = df[(df['score'] > 85) & (df['age'] < 30)]
# isin() for membership
ny_sf = df[df['city'].isin(['NY', 'SF'])]
# String methods
starts_with_a = df[df['name'].str.startswith('A')]
# Query syntax (cleaner for complex conditions)
result = df.query('age >= 30 and city == "LA"')
print(result)
Boolean Indexing with 2D Arrays
When working with NumPy arrays or numerical DataFrames, you can use boolean indexing on multiple dimensions—a powerful technique for filtering machine learning datasets by class or condition.
import pandas as pd
import numpy as np
# Create a dataset with features (X) and labels (y)
X = np.array([[1.2, 2.3, 3.1],
[4.5, 5.6, 6.2],
[7.8, 8.9, 9.3],
[2.1, 3.4, 4.5],
[5.6, 6.7, 7.8]])
y = np.array([0, 1, 1, 0, 1]) # Binary labels (class 0 or 1)
print("Full dataset:")
print("X shape:", X.shape) # (5, 3)
print("y shape:", y.shape) # (5,)
print("X:\n", X)
print("y:", y)
Filtering Rows by Class Label
The syntax X[y==0] creates a boolean mask and applies it to select rows:
import numpy as np
X = np.array([[1.2, 2.3, 3.1],
[4.5, 5.6, 6.2],
[7.8, 8.9, 9.3],
[2.1, 3.4, 4.5],
[5.6, 6.7, 7.8]])
y = np.array([0, 1, 1, 0, 1])
# Step 1: Create boolean mask
mask_class_0 = (y == 0)
print("Boolean mask for class 0:", mask_class_0)
# [True, False, False, True, False]
# Step 2: Apply mask to select rows
class_0_samples = X[mask_class_0]
print("Class 0 samples:\n", class_0_samples)
# [[1.2, 2.3, 3.1]
# [2.1, 3.4, 4.5]]
# Combined syntax (one-liner)
class_1_samples = X[y == 1]
print("Class 1 samples:\n", class_1_samples)
# [[4.5, 5.6, 6.2]
# [7.8, 8.9, 9.3]
# [5.6, 6.7, 7.8]]
Selecting Specific Columns from Filtered Rows
The syntax X[y==0, 0] selects column 0 (the first feature) from the rows where y == 0:
import numpy as np
X = np.array([[1.2, 2.3, 3.1],
[4.5, 5.6, 6.2],
[7.8, 8.9, 9.3],
[2.1, 3.4, 4.5],
[5.6, 6.7, 7.8]])
y = np.array([0, 1, 1, 0, 1])
# Get first feature (column 0) for class 0 samples
class_0_feature_0 = X[y == 0, 0]
print("Class 0, Feature 0:", class_0_feature_0)
# [1.2, 2.1]
# Get second feature (column 1) for class 1 samples
class_1_feature_1 = X[y == 1, 1]
print("Class 1, Feature 1:", class_1_feature_1)
# [5.6, 8.9, 6.7]
# Get multiple columns using slicing
class_0_first_two_features = X[y == 0, :2]
print("Class 0, First 2 features:\n", class_0_first_two_features)
# [[1.2, 2.3]
# [2.1, 3.4]]
Using with Pandas DataFrames
The same pattern works with Pandas DataFrames for feature engineering and analysis:
import pandas as pd
import numpy as np
# Create DataFrame with features and target
df = pd.DataFrame({
'feature_1': [1.2, 4.5, 7.8, 2.1, 5.6],
'feature_2': [2.3, 5.6, 8.9, 3.4, 6.7],
'feature_3': [3.1, 6.2, 9.3, 4.5, 7.8],
'class_label': [0, 1, 1, 0, 1]
})
print("Full DataFrame:")
print(df)
# Boolean indexing with DataFrames
class_0_df = df[df['class_label'] == 0]
print("\nClass 0 samples:")
print(class_0_df)
# Select specific columns for a class
class_1_features = df.loc[df['class_label'] == 1, ['feature_1', 'feature_2']]
print("\nClass 1, Features 1 & 2:")
print(class_1_features)
# Compute statistics per class
print("\nMean per class:")
print("Class 0 mean:", df[df['class_label'] == 0][['feature_1', 'feature_2', 'feature_3']].mean())
print("Class 1 mean:", df[df['class_label'] == 1][['feature_1', 'feature_2', 'feature_3']].mean())
Machine Learning Application: Boolean indexing by class label is essential for:
- Data Exploration: Analyze feature distributions per class
- Visualization: Plot different classes with different colors (covered in Visualization tutorial)
- Splitting Data: Separate train/test sets while preserving class balance
- Evaluation: Compute metrics (accuracy, precision, recall) per class
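As a sketch of the last point, per-class accuracy falls out of the same masking pattern; the labels below are made up for illustration:

```python
import numpy as np

# Hypothetical true and predicted labels from a binary classifier
y_true = np.array([0, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 0, 0, 1, 1])

# Restrict both arrays to one class with a boolean mask,
# then compare predictions against truth within that class
for cls in (0, 1):
    mask = (y_true == cls)
    acc = (y_pred[mask] == y_true[mask]).mean()
    print(f"Class {cls} accuracy: {acc:.2f}")
# Class 0 accuracy: 0.67
# Class 1 accuracy: 0.67
```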
Common Pitfall
Chained Assignment Warning
Avoid this common mistake:
import pandas as pd
df = pd.DataFrame({
'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
'age': [25, 30, 35, 28],
'score': [88, 92, 85, 90]
})
# BAD: Chained assignment (might not work!)
df[df['age'] > 25]['score'] = 100
# GOOD: Use .loc for safe assignment
df.loc[df['age'] > 25, 'score'] = 100
Chained assignment operates on what may be a temporary copy rather than the original DataFrame, so the write can be silently lost (pandas warns with SettingWithCopyWarning, and under copy-on-write, the default in pandas 3.0, chained assignment never modifies the original). Always use .loc for assignments.
Practice Exercises
Selection & Filtering Exercises
Exercise 1 (Beginner): Create a DataFrame with columns name, age, city, salary. Select all rows where age > 30. Select only name and salary columns for these rows.
Exercise 2 (Beginner): Create a DataFrame. Add a new column 'bonus' that is 10% of salary. Modify existing columns using .loc[]. Check the difference between copy() and view.
Exercise 3 (Intermediate): Create a DataFrame. Filter rows where (city == 'NY') AND (salary > 50000). Try boolean operators (&, |, ~). Explain what goes wrong with 'and' operator.
Exercise 4 (Intermediate): Create a DataFrame with various data types (int, float, string, bool). Use iloc for position-based selection. Extract specific rows and columns by integer position.
Challenge (Advanced): Create a DataFrame and use query() method for complex conditions. Compare performance and readability with .loc[] approach on large data.
Handling Missing Data
Real-world data always has missing values. Pandas provides comprehensive tools to detect, remove, and fill them.
import pandas as pd
import numpy as np
df_na = pd.DataFrame({
'A': [1, np.nan, 3, np.nan],
'B': ['x', 'y', None, 'z'],
'C': [10.0, 11.5, np.nan, 9.8]
})
# Detect missing values
print(df_na.isnull())
# A B C
# 0 False False False
# 1 True False False
# 2 False True True
# 3 True False False
print(df_na.isnull().sum()) # Count per column
# A 2
# B 1
# C 1
Removing Missing Data
import pandas as pd
import numpy as np
df_na = pd.DataFrame({
'A': [1, np.nan, 3, np.nan],
'B': ['x', 'y', None, 'z'],
'C': [10.0, 11.5, np.nan, 9.8]
})
# Drop rows with ANY missing value
df_clean = df_na.dropna()
# Drop rows where ALL values are missing
df_clean = df_na.dropna(how='all')
# Drop columns with missing values
df_clean = df_na.dropna(axis=1)
# Drop rows missing values in specific columns
df_clean = df_na.dropna(subset=['A', 'C'])
Filling Missing Data
import pandas as pd
import numpy as np
df_na = pd.DataFrame({
'A': [1, np.nan, 3, np.nan],
'B': ['x', 'y', None, 'z'],
'C': [10.0, 11.5, np.nan, 9.8]
})
# Fill with specific value
df_filled = df_na.fillna(0)
# Fill numeric columns with their means (numeric_only avoids errors on string columns)
df_filled = df_na.fillna(df_na.mean(numeric_only=True))
# Fill with different values per column
df_filled = df_na.fillna({
'A': df_na['A'].mean(),
'B': 'missing',
'C': df_na['C'].median()
})
# Forward fill (propagate last valid value)
df_filled = df_na.ffill()
# Backward fill
df_filled = df_na.bfill()
Best Practice: Never blindly drop or fill missing data. Understand why data is missing (random, systematic, measurement error) before choosing a strategy. Document your decisions.
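One quick diagnostic before choosing a strategy is to compare the missing rate across groups, since a large gap hints at systematic missingness. A small sketch with made-up data:

```python
import pandas as pd
import numpy as np

# Made-up data: is 'income' missing at random, or tied to age group?
df_check = pd.DataFrame({
    'age_group': ['young', 'young', 'old', 'old', 'old', 'young'],
    'income': [30, np.nan, 50, np.nan, np.nan, 28],
})

# Missing rate per group; a large gap suggests systematic missingness
missing_rate = df_check['income'].isnull().groupby(df_check['age_group']).mean()
print(missing_rate)
# old      0.666667
# young    0.333333
```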
Operations & Computations
Apply, Map, and Applymap
import pandas as pd
df = pd.DataFrame({
'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
'age': [25, 30, 35, 28],
'city': ['NY', 'LA', 'NY', 'SF'],
'score': [88, 92, 85, 90]
})
# apply() on columns
df['age_squared'] = df['age'].apply(lambda x: x ** 2)
# apply() on rows (axis=1)
df['total'] = df.apply(lambda row: row['age'] + row['score'], axis=1)
# map() for Series element-wise transformation
city_map = {'NY': 'New York', 'LA': 'Los Angeles', 'SF': 'San Francisco'}
df['city_full'] = df['city'].map(city_map)
# applymap() for entire DataFrame (deprecated, use .map() instead)
df_str = pd.DataFrame({'x': ['a', 'bb'], 'y': ['ccc', 'd']})
lengths = df_str.map(len) # Apply len to every cell
String Operations
The .str accessor provides vectorized string methods:
import pandas as pd
df_text = pd.DataFrame({
'name': [' alice ', 'Bob', 'CHARLIE'],
'email': ['a@x.com', 'b@y.org', 'c@z.net']
})
# String cleaning
df_text['name'] = df_text['name'].str.strip().str.title()
# String operations
df_text['name_len'] = df_text['name'].str.len()
df_text['has_org'] = df_text['email'].str.contains('.org', regex=False) # literal match ('.' is a regex wildcard)
df_text['domain'] = df_text['email'].str.extract(r'@([\w.]+)')
print(df_text)
Practice Exercises
String Operations Exercises
Exercise 1 (Beginner): Create a DataFrame with a string column. Use .str.lower(), .str.upper(), .str.title() to transform the text. Print results.
Exercise 2 (Beginner): Create a DataFrame with email addresses. Use .str.contains() to find emails from a specific domain. Use .str.extract() to extract the domain name.
Exercise 3 (Intermediate): Create a DataFrame with names like ' john doe ', 'jane SMITH'. Clean using .str.strip() and .str.title(). Split into first and last names using .str.split().
Exercise 4 (Intermediate): Create a DataFrame with text containing special characters. Use .str.replace() to replace patterns. Use .str.startswith() and .str.endswith() for filtering.
Challenge (Advanced): Create a DataFrame with mixed-case email addresses. Extract domain and user parts separately using .str.extract() with regex groups. Validate email format using .str.match().
Sorting
import pandas as pd
df = pd.DataFrame({
'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
'age': [25, 30, 35, 28],
'city': ['NY', 'LA', 'NY', 'SF'],
'score': [88, 92, 85, 90]
})
# Sort by values
df_sorted = df.sort_values('age')
df_sorted = df.sort_values(['city', 'score'], ascending=[True, False])
# Sort by index
df_sorted = df.sort_index()
GroupBy & Aggregations
GroupBy implements the split-apply-combine pattern—split data into groups, apply a function, combine results. This is essential for data analysis.
import pandas as pd
df = pd.DataFrame({
'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
'age': [25, 30, 35, 28],
'city': ['NY', 'LA', 'NY', 'SF'],
'score': [88, 92, 85, 90]
})
# Group by city and calculate statistics
grouped = df.groupby('city')['score'].agg(['count', 'mean', 'min', 'max'])
print(grouped)
# count mean min max
# city
# LA 1 92.0 92 92
# NY 2 86.5 85 88
# SF 1 90.0 90 90
# Multiple aggregations with named outputs
agg_df = df.groupby('city').agg(
age_mean=('age', 'mean'),
score_sum=('score', 'sum'),
count=('name', 'count')
)
print(agg_df)
Advanced GroupBy
import pandas as pd
df = pd.DataFrame({
'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
'age': [25, 30, 35, 28],
'city': ['NY', 'LA', 'NY', 'SF'],
'score': [88, 92, 85, 90]
})
df['is_ny'] = df['city'] == 'NY'
# Custom aggregation function
def age_range(x):
    return x.max() - x.min()
df.groupby('city')['age'].agg(age_range)
# Apply custom function to groups
def zscore(x):
    return (x - x.mean()) / x.std(ddof=0)

# Note: single-member groups (LA and SF here) have zero std, so their z-score is NaN
df['score_zscore'] = df.groupby('city')['score'].transform(zscore)
# Multiple grouping columns
df.groupby(['city', 'is_ny'])['score'].mean()
Real-World Example
Sales Analysis with GroupBy
import pandas as pd
import numpy as np
sales = pd.DataFrame({
'date': pd.date_range('2024-01-01', periods=100, freq='D'),
'region': np.random.choice(['North', 'South', 'East', 'West'], 100),
'product': np.random.choice(['A', 'B', 'C'], 100),
'revenue': np.random.randint(100, 1000, 100)
})
# Monthly revenue by region
sales['month'] = sales['date'].dt.to_period('M')
monthly = sales.groupby(['month', 'region'])['revenue'].sum().unstack()
print(monthly.head())
Pivot Tables & Crosstabs
Pivot tables reshape data from long to wide format, aggregating as needed—like Excel pivot tables.
import pandas as pd
df = pd.DataFrame({
'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
'age': [25, 30, 35, 28],
'city': ['NY', 'LA', 'NY', 'SF'],
'score': [88, 92, 85, 90]
})
df['is_ny'] = df['city'] == 'NY'
# Pivot table: reshape and aggregate
pivot = pd.pivot_table(
df,
values='score',
index='city',
columns='age',
aggfunc='mean',
fill_value=0
)
print(pivot)
# age 25 28 30 35
# city
# LA 0 0 92 0
# NY 88 0 0 85
# SF 0 90 0 0
# Crosstab: frequency table
ct = pd.crosstab(df['city'], df['is_ny'])
print(ct)
# is_ny False True
# city
# LA 1 0
# NY 0 2
# SF 1 0
Practice Exercises
GroupBy & Aggregations Exercises
Exercise 1 (Beginner): Create a sales DataFrame with columns: product, region, amount. Group by region and calculate total amount per region. Then group by product.
Exercise 2 (Beginner): Create a DataFrame with numeric columns. Group by a category column and apply multiple aggregations (mean, sum, min, max) simultaneously.
Exercise 3 (Intermediate): Create sales data with date and region. Group by region and month, then calculate total revenue. Use unstack() to create a pivot-like view.
Exercise 4 (Intermediate): Write a custom aggregation function (e.g., coefficient of variation) and apply it to grouped data. Compare with built-in agg functions.
Challenge (Advanced): Create complex grouped data. Use transform() to add group means as a new column. Use apply() for custom group-wise transformations. Explain difference from agg().
Merge, Join & Concatenate
Combining datasets is crucial for data analysis. Pandas provides SQL-like joins and concatenation.
Merge (SQL-style joins)
import pandas as pd
left = pd.DataFrame({'id': [1, 2, 3], 'city': ['NY', 'LA', 'SF']})
right = pd.DataFrame({'id': [1, 2, 4], 'score': [88, 92, 77]})
# Inner join (intersection)
inner = left.merge(right, on='id', how='inner')
print(inner)
# id city score
# 0 1 NY 88
# 1 2 LA 92
# Left join (all from left, matching from right)
left_join = left.merge(right, on='id', how='left')
print(left_join)
# id city score
# 0 1 NY 88.0
# 1 2 LA 92.0
# 2 3 SF NaN
# Outer join (all from both)
outer = left.merge(right, on='id', how='outer')
print(outer)
Join (index-based)
import pandas as pd
left = pd.DataFrame({'id': [1, 2, 3], 'city': ['NY', 'LA', 'SF']})
right = pd.DataFrame({'id': [1, 2, 4], 'score': [88, 92, 77]})
l = left.set_index('id')
r = right.set_index('id')
# Join on index
joined = l.join(r, how='left')
print(joined)
Concatenate
import pandas as pd
left = pd.DataFrame({'id': [1, 2, 3], 'city': ['NY', 'LA', 'SF']})
right = pd.DataFrame({'id': [1, 2, 4], 'score': [88, 92, 77]})
# Stack rows vertically
concat_rows = pd.concat([left, right], axis=0, ignore_index=True)
# Stack columns horizontally
concat_cols = pd.concat([left, right], axis=1)
SQL Analogy: merge(how='inner') = INNER JOIN, how='left' = LEFT JOIN, how='right' = RIGHT JOIN, how='outer' = FULL OUTER JOIN.
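When auditing a merge, it helps to know which side each row came from; merge() accepts indicator=True for exactly this. A short sketch reusing the same two frames:

```python
import pandas as pd

left = pd.DataFrame({'id': [1, 2, 3], 'city': ['NY', 'LA', 'SF']})
right = pd.DataFrame({'id': [1, 2, 4], 'score': [88, 92, 77]})

# indicator=True adds a '_merge' column: 'both', 'left_only', or 'right_only'
outer = left.merge(right, on='id', how='outer', indicator=True)
outer = outer.sort_values('id', ignore_index=True)  # fix row order for display

print(outer['_merge'].tolist())
# ['both', 'both', 'left_only', 'right_only']
```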
Time Series Operations
Pandas has powerful tools for working with time-indexed data:
import pandas as pd
import numpy as np
# Create time series
rng = pd.date_range('2024-01-01', periods=30, freq='D')
ts = pd.Series(np.random.randn(len(rng)).cumsum(), index=rng, name='value')
print(ts.head())
# 2024-01-01 -0.234
# 2024-01-02 0.123
# ...
# Resample to weekly mean
weekly = ts.resample('W').mean()
# Shift (lag/lead)
ts_shifted = ts.shift(1) # Lag by 1 day
# Rolling window
rolling_7d = ts.rolling(window=7).mean()
# Date accessors
df_ts = ts.to_frame()
df_ts['month'] = df_ts.index.month
df_ts['weekday'] = df_ts.index.day_name()
print(df_ts.head())
Best Practices & Summary
Key Takeaways
- Use vectorized operations over loops for performance
- Prefer loc/iloc for clear, safe indexing semantics
- Handle missing data explicitly with domain knowledge
- Use groupby for split-apply-combine aggregations
- Leverage method chaining for readable transformations
- Choose appropriate merge types (inner/left/right/outer)
- Use categorical dtype for repeated labels to save memory
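The categorical-dtype point is easy to verify in a few lines; the data here is random and purely illustrative:

```python
import pandas as pd
import numpy as np

# 100,000 rows drawn from only three distinct labels (toy data)
cities = pd.Series(np.random.choice(['NY', 'LA', 'SF'], size=100_000))

as_object = cities.memory_usage(deep=True)
as_category = cities.astype('category').memory_usage(deep=True)

# category stores each unique label once plus small integer codes,
# so it uses far less memory than repeated Python strings
print(f"object: {as_object:,} bytes, category: {as_category:,} bytes")
```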
Common Pitfalls
Avoid These Mistakes
- Chained assignment: Use .loc instead of df[cond][col] = val
- Ignoring dtypes: Check and convert types explicitly
- Not setting index: Use meaningful indices for time series and joins
- Copying unnecessarily: Many operations return views; use .copy() when needed
- Mixing label/position indexing: Stick to loc or iloc, don't mix
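The copy-versus-view pitfall is easiest to avoid by taking an explicit copy when you intend to work on a subset independently; a minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'age': [25, 30, 35], 'score': [88, 92, 85]})

# .copy() makes the intent explicit: the subset is independent,
# so assigning to it can never touch (or warn about) the original
subset = df[df['age'] > 25].copy()
subset['score'] = 0

print(df['score'].tolist())      # [88, 92, 85]  original unchanged
print(subset['score'].tolist())  # [0, 0]
```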
What's Next
You've mastered Pandas—data loading, cleaning, transformation, grouping, and merging. You're now ready to visualize your insights!
Next in the Series: Part 3: Data Visualization with Matplotlib & Seaborn will teach you to create compelling visualizations that communicate your data analysis results effectively.
Practice Exercises
Merge & Join Exercises
Exercise 1 (Beginner): Create two DataFrames with overlapping IDs. Perform inner, outer, left, and right joins. Compare row counts and identify missing data.
Exercise 2 (Beginner): Create customer and order DataFrames. Merge on customer_id. Then group by customer to calculate total purchases.
Exercise 3 (Intermediate): Create multiple DataFrames. Concatenate them vertically (vstack) and horizontally (hstack). Use ignore_index=True. Explain when to use which.
Exercise 4 (Intermediate): Create DataFrames with non-matching keys. Try merging on key using how='left'. Handle NaN values from unmatched rows using fillna().
Challenge (Advanced): Create 3+ DataFrames with different keys. Chain multiple merges or use reduce() to combine all. Verify final row count matches expected result.
Pandas API Cheat Sheet
Quick reference for essential Pandas operations for data manipulation and analysis.
pd.DataFrame(dict) | From dictionary
pd.read_csv('file.csv') | From CSV file
pd.read_excel('file.xlsx') | From Excel
pd.read_json('file.json') | From JSON
pd.read_sql(query, conn) | From SQL query
pd.Series([1,2,3]) | Create Series
df.head(n) | First n rows
df.tail(n) | Last n rows
df.info() | Column types & nulls
df.describe() | Statistical summary
df.shape | (rows, columns)
df.dtypes | Column data types
df['col'] | Select column
df[['a','b']] | Select columns
df.loc['idx'] | By label
df.iloc[0] | By position
df[df['age'] > 25] | Boolean filter
df.query('age > 25') | Query syntax
df['col'].isin(['NY','LA']) | Membership test
df.isnull() | Detect NaN values
df.isnull().sum() | Count nulls per col
df.dropna() | Drop rows with NaN
df.fillna(value) | Fill with value
df.ffill() | Forward fill
df.fillna(df.mean()) | Fill with mean
df.groupby('col') | Group by column
.mean() | Group mean
.sum() | Group sum
.count() | Group count
.agg(['min','max']) | Multiple aggs
.transform(fn) | Broadcast result
pd.pivot_table() | Pivot & aggregate
pd.merge(df1, df2) | Merge (inner)
how='left' | Left join
how='right' | Right join
how='outer' | Outer join
on='key' | Merge on column
df1.join(df2) | Join on index
pd.concat([df1,df2]) | Stack DataFrames
df['new'] = df['a'] + df['b'] | Add column
df.apply(fn) | Apply function
df['col'].map(dict) | Map values
df.sort_values('col') | Sort by column
df.rename(columns={}) | Rename columns
df.drop(['col'], axis=1) | Drop column
df.reset_index() | Reset index
df['col'].str.lower() | To lowercase
.str.upper() | To uppercase
.str.strip() | Remove whitespace
.str.contains('word') | Contains pattern
.str.startswith('A') | Starts with
.str.split(',') | Split string
.str.replace('old','new') | Replace text
Pro Tips:
- loc vs iloc: loc uses labels, iloc uses positions (integers)
- Chaining: Avoid chained assignment; use .loc[] for safe updates
- Method chaining: Chain methods for cleaner code: df.dropna().groupby('col').mean()
- Performance: Use vectorized operations instead of .apply() when possible
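To make the performance tip concrete, the two forms below compute the same result, but the vectorized version operates on the whole column in optimized code instead of calling a Python lambda per element:

```python
import pandas as pd

df = pd.DataFrame({'age': [25, 30, 35, 28]})

# Vectorized arithmetic operates on the whole column at once...
vectorized = df['age'] * 2
# ...while apply() invokes the lambda once per element in Python
applied = df['age'].apply(lambda x: x * 2)

print(vectorized.equals(applied))  # True
print(vectorized.tolist())         # [50, 60, 70, 56]
```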
Related Articles in This Series
Part 1: NumPy Foundations for Data Science
Master NumPy arrays, vectorization, broadcasting, and linear algebra operations—the foundation of Python data science.
Part 3: Data Visualization with Matplotlib & Seaborn
Create compelling visualizations with Python's most powerful plotting libraries. Learn line plots, bar charts, scatter plots, and statistical graphics.
Part 4: Machine Learning with Scikit-learn
Build predictive models with classification, regression, clustering algorithms, and complete ML pipelines using Scikit-learn.