
Python Data Science Series Part 2: Pandas for Data Analysis

December 27, 2025 Wasil Zafar 22 min read

Master Pandas, Python's primary tool for data manipulation and analysis. Learn DataFrames, Series, data cleaning, transformation, groupby operations, and everything needed for real-world data work.

Table of Contents

  1. Introduction & Core Structures
  2. Data Input/Output & Selection
  3. Data Transformation & Operations
  4. Advanced Data Manipulation

Introduction to Pandas

Prerequisites: Before running the code examples in this tutorial, make sure you have Python and Jupyter notebooks properly set up. If you haven't configured your development environment yet, check out our complete setup guide for VS Code, PyCharm, Jupyter, and Colab.

If NumPy is the foundation of Python's data science stack, Pandas is where data analysis truly begins. Built on top of NumPy, Pandas provides high-level data structures and tools specifically designed for working with real-world, tabular data—the kind you find in spreadsheets, databases, and CSV files.

Why Pandas Matters: Pandas is to data science what SQL is to databases—the essential tool for data manipulation. It's the first library you reach for when loading, cleaning, transforming, and analyzing data before machine learning or visualization.

Key capabilities:

  • DataFrame: 2D table with labeled rows and columns (like Excel, but programmable)
  • Series: 1D labeled array (a single column or row)
  • I/O tools: Read/write CSV, Excel, JSON, SQL, Parquet, and more
  • Data cleaning: Handle missing values, duplicates, and transformations
  • GroupBy: Split-apply-combine operations for aggregation
  • Merge/Join: Combine datasets like SQL joins
  • Time series: Specialized support for datetime data

Installation and Setup

# Install Pandas
pip install pandas

# Import convention
import pandas as pd
import numpy as np

print(f"pandas version: {pd.__version__}")

Series: 1D Labeled Arrays

A Series is a one-dimensional labeled array—like a column in a spreadsheet or a single feature in a dataset. It combines NumPy's array performance with labeled indices.

import pandas as pd
import numpy as np

# Creating Series from list
s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'], name='scores')
print(s)
# Output:
# a    10
# b    20
# c    30
# d    40
# Name: scores, dtype: int64

# Access by label and position
print("By label 'b':", s['b'])     # 20
print("By position 2:", s.iloc[2]) # 30

# Vectorized operations
print("Add 5:", s + 5)
print("Multiply by 2:", s * 2)

Index Alignment

Pandas automatically aligns data by index labels—a powerful feature:

import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

# Alignment by index (missing indices become NaN)
result = s1 + s2
print(result)
# a     NaN
# b    12.0
# c    23.0
# d     NaN
Critical Concept: Index alignment happens automatically. Operations match by labels, not positions. This prevents silent bugs when working with mismatched data but can introduce NaN values.
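When the NaN values from non-overlapping labels are unwanted, the arithmetic methods (`.add()`, `.sub()`, etc.) accept a `fill_value` that substitutes for missing entries before the operation:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

# fill_value=0 treats a label missing from one Series as 0
result = s1.add(s2, fill_value=0)
print(result)
# a     1.0
# b    12.0
# c    23.0
# d    30.0
```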

DataFrame: 2D Tabular Data

The DataFrame is Pandas' primary data structure—a 2D table with labeled rows and columns. Think of it as a spreadsheet or SQL table in Python.

import pandas as pd

# Create DataFrame from dictionary
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [25, 30, 35, 28],
    'city': ['NY', 'LA', 'NY', 'SF'],
    'score': [88, 92, 85, 90]
})

print(df)
#       name  age city  score
# 0    Alice   25   NY     88
# 1      Bob   30   LA     92
# 2  Charlie   35   NY     85
# 3    Diana   28   SF     90

# Data inspection
print(df.head(2))        # First 2 rows
print(df.info())         # Structure and types
print(df.describe())     # Statistical summary
print(df.shape)          # (4, 4) - rows × columns

Column Operations

import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [25, 30, 35, 28],
    'city': ['NY', 'LA', 'NY', 'SF'],
    'score': [88, 92, 85, 90]
})

# Select single column (returns Series)
ages = df['age']
print(type(ages))  # <class 'pandas.core.series.Series'>

# Select multiple columns (returns DataFrame)
subset = df[['name', 'score']]

# Add new column
df['is_ny'] = df['city'] == 'NY'
print(df)
#       name  age city  score  is_ny
# 0    Alice   25   NY     88   True
# 1      Bob   30   LA     92  False
# 2  Charlie   35   NY     85   True
# 3    Diana   28   SF     90  False

DataFrame Anatomy

Understanding DataFrame Structure

  • Index: Row labels (default: 0, 1, 2, ...)
  • Columns: Column labels (from keys or specified)
  • Values: Underlying NumPy array (access with .values)
  • dtypes: Data type per column (int64, float64, object)
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [25, 30, 35, 28],
    'city': ['NY', 'LA', 'NY', 'SF'],
    'score': [88, 92, 85, 90]
})

print("Index:", df.index)
print("Columns:", df.columns)
print("Dtypes:", df.dtypes)
print("Values type:", type(df.values))  # numpy.ndarray

Reading & Writing Data

Pandas excels at loading data from various sources—CSV, Excel, JSON, SQL databases, and more.

CSV Operations

import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [25, 30, 35, 28],
    'city': ['NY', 'LA', 'NY', 'SF'],
    'score': [88, 92, 85, 90]
})

# Write to CSV
df.to_csv('people.csv', index=False)

# Read from CSV
df_csv = pd.read_csv('people.csv')
print(df_csv.head())

# Advanced CSV reading
df_adv = pd.read_csv('data.csv',
                     sep=';',                    # Custom delimiter
                     skiprows=2,                 # Skip header rows
                     na_values=['NA', 'missing'], # Custom NaN values
                     parse_dates=['date'],       # Parse date columns
                     dtype={'id': str})          # Force column types
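For files too large to load at once, read_csv can also stream the data in chunks. A small sketch (io.StringIO stands in for a real file on disk):

```python
import io
import pandas as pd

# io.StringIO stands in for a large CSV file on disk
csv_data = io.StringIO("id,value\n1,10\n2,20\n3,30\n4,40\n5,50\n")

# chunksize returns an iterator of DataFrames instead of one big frame,
# so only one chunk is in memory at a time
total = 0
for chunk in pd.read_csv(csv_data, chunksize=2):
    total += chunk['value'].sum()

print(total)  # 150
```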

Excel, JSON, and More

import pandas as pd
import sqlite3

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [25, 30, 35, 28],
    'city': ['NY', 'LA', 'NY', 'SF'],
    'score': [88, 92, 85, 90]
})

# Excel (requires openpyxl or xlrd)
df.to_excel('people.xlsx', index=False)
df_xlsx = pd.read_excel('people.xlsx')

# JSON
df.to_json('people.json', orient='records')
df_json = pd.read_json('people.json')

# Parquet (efficient columnar format)
df.to_parquet('people.parquet')
df_parquet = pd.read_parquet('people.parquet')

# SQL database
conn = sqlite3.connect('database.db')
df.to_sql('people', conn, if_exists='replace')
df_sql = pd.read_sql('SELECT * FROM people WHERE age > 25', conn)
Performance Tip: For large datasets (>1GB), prefer Parquet over CSV. Parquet is a compressed, columnar binary format, so files are typically several times smaller and considerably faster to read and write than equivalent CSVs.

Indexing, Selection & Filtering

Pandas provides two primary indexers: loc (label-based) and iloc (position-based).

loc vs iloc

import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [25, 30, 35, 28],
    'city': ['NY', 'LA', 'NY', 'SF'],
    'score': [88, 92, 85, 90]
})

# Set custom index
df_idx = df.set_index('name')

# loc: label-based selection
print(df_idx.loc['Alice'])           # Row by label
print(df_idx.loc['Alice', 'age'])    # Specific cell
print(df_idx.loc['Alice':'Charlie']) # Slice (inclusive!)

# iloc: position-based selection
print(df_idx.iloc[0])                # First row
print(df_idx.iloc[0, 1])            # Row 0, col 1
print(df_idx.iloc[0:2, 1:3])        # Slice (exclusive)

Boolean Filtering

Boolean masks are the most powerful way to filter data:

import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [25, 30, 35, 28],
    'city': ['NY', 'LA', 'NY', 'SF'],
    'score': [88, 92, 85, 90]
})

# Single condition
high_scores = df[df['score'] > 85]
print(high_scores)

# Multiple conditions (use & | ~)
elite = df[(df['score'] > 85) & (df['age'] < 30)]

# isin() for membership
ny_sf = df[df['city'].isin(['NY', 'SF'])]

# String methods
starts_with_a = df[df['name'].str.startswith('A')]

# Query syntax (cleaner for complex conditions)
result = df.query('age >= 30 and city == "LA"')
print(result)

Boolean Indexing with 2D Arrays

When working with NumPy arrays or numerical DataFrames, you can use boolean indexing on multiple dimensions—a powerful technique for filtering machine learning datasets by class or condition.

import pandas as pd
import numpy as np

# Create a dataset with features (X) and labels (y)
X = np.array([[1.2, 2.3, 3.1],
              [4.5, 5.6, 6.2],
              [7.8, 8.9, 9.3],
              [2.1, 3.4, 4.5],
              [5.6, 6.7, 7.8]])

y = np.array([0, 1, 1, 0, 1])  # Binary labels (class 0 or 1)

print("Full dataset:")
print("X shape:", X.shape)  # (5, 3)
print("y shape:", y.shape)  # (5,)
print("X:\n", X)
print("y:", y)

Filtering Rows by Class Label

The syntax X[y==0] creates a boolean mask and applies it to select rows:

import numpy as np

X = np.array([[1.2, 2.3, 3.1],
              [4.5, 5.6, 6.2],
              [7.8, 8.9, 9.3],
              [2.1, 3.4, 4.5],
              [5.6, 6.7, 7.8]])

y = np.array([0, 1, 1, 0, 1])

# Step 1: Create boolean mask
mask_class_0 = (y == 0)
print("Boolean mask for class 0:", mask_class_0)
# [True, False, False, True, False]

# Step 2: Apply mask to select rows
class_0_samples = X[mask_class_0]
print("Class 0 samples:\n", class_0_samples)
# [[1.2, 2.3, 3.1]
#  [2.1, 3.4, 4.5]]

# Combined syntax (one-liner)
class_1_samples = X[y == 1]
print("Class 1 samples:\n", class_1_samples)
# [[4.5, 5.6, 6.2]
#  [7.8, 8.9, 9.3]
#  [5.6, 6.7, 7.8]]

Selecting Specific Columns from Filtered Rows

The syntax X[y==0, 0] selects rows where y==0 AND column 0 (first feature):

import numpy as np

X = np.array([[1.2, 2.3, 3.1],
              [4.5, 5.6, 6.2],
              [7.8, 8.9, 9.3],
              [2.1, 3.4, 4.5],
              [5.6, 6.7, 7.8]])

y = np.array([0, 1, 1, 0, 1])

# Get first feature (column 0) for class 0 samples
class_0_feature_0 = X[y == 0, 0]
print("Class 0, Feature 0:", class_0_feature_0)
# [1.2, 2.1]

# Get second feature (column 1) for class 1 samples
class_1_feature_1 = X[y == 1, 1]
print("Class 1, Feature 1:", class_1_feature_1)
# [5.6, 8.9, 6.7]

# Get multiple columns using slicing
class_0_first_two_features = X[y == 0, :2]
print("Class 0, First 2 features:\n", class_0_first_two_features)
# [[1.2, 2.3]
#  [2.1, 3.4]]

Using with Pandas DataFrames

The same pattern works with Pandas DataFrames for feature engineering and analysis:

import pandas as pd
import numpy as np

# Create DataFrame with features and target
df = pd.DataFrame({
    'feature_1': [1.2, 4.5, 7.8, 2.1, 5.6],
    'feature_2': [2.3, 5.6, 8.9, 3.4, 6.7],
    'feature_3': [3.1, 6.2, 9.3, 4.5, 7.8],
    'class_label': [0, 1, 1, 0, 1]
})

print("Full DataFrame:")
print(df)

# Boolean indexing with DataFrames
class_0_df = df[df['class_label'] == 0]
print("\nClass 0 samples:")
print(class_0_df)

# Select specific columns for a class
class_1_features = df.loc[df['class_label'] == 1, ['feature_1', 'feature_2']]
print("\nClass 1, Features 1 & 2:")
print(class_1_features)

# Compute statistics per class
print("\nMean per class:")
print("Class 0 mean:", df[df['class_label'] == 0][['feature_1', 'feature_2', 'feature_3']].mean())
print("Class 1 mean:", df[df['class_label'] == 1][['feature_1', 'feature_2', 'feature_3']].mean())
Machine Learning Application: Boolean indexing by class label is essential for:
  • Data Exploration: Analyze feature distributions per class
  • Visualization: Plot different classes with different colors (covered in Visualization tutorial)
  • Splitting Data: Separate train/test sets while preserving class balance
  • Evaluation: Compute metrics (accuracy, precision, recall) per class
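As a rough illustration of the "splitting while preserving class balance" point, the per-class masks can drive a simplified, unshuffled split. This is only a sketch: in practice you would use scikit-learn's train_test_split(X, y, stratify=y), which also shuffles.

```python
import numpy as np

# Hypothetical toy data for illustration
X = np.array([[1.2, 2.3], [4.5, 5.6], [7.8, 8.9], [2.1, 3.4], [5.6, 6.7]])
y = np.array([0, 1, 1, 0, 1])

train_idx = []
for label in np.unique(y):
    positions = np.where(y == label)[0]          # row positions of this class
    n_train = max(1, int(0.5 * len(positions)))  # keep ~half of each class
    train_idx.extend(positions[:n_train])        # unshuffled: takes the first rows

X_train, y_train = X[train_idx], y[train_idx]
print("Train labels:", y_train)  # one sample from each class here
```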

Common Pitfall

Chained Assignment Warning

Avoid this common mistake:

import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [25, 30, 35, 28],
    'score': [88, 92, 85, 90]
})

# BAD: Chained assignment (might not work!)
df[df['age'] > 25]['score'] = 100

# GOOD: Use .loc for safe assignment
df.loc[df['age'] > 25, 'score'] = 100

Chained assignment can produce a copy instead of a view, causing silent failures. Always use .loc for assignments.
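When you genuinely want a modified subset rather than an in-place update, take an explicit copy first:

```python
import pandas as pd

df = pd.DataFrame({'age': [25, 30, 35], 'score': [88, 92, 85]})

# An explicit .copy() gives an independent DataFrame that is safe to modify
subset = df[df['age'] > 25].copy()
subset['score'] = 100               # changes the copy only

print(subset['score'].tolist())  # [100, 100]
print(df['score'].tolist())      # [88, 92, 85]  (original untouched)
```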

Practice Exercises

Selection & Filtering Exercises

Exercise 1 (Beginner): Create a DataFrame with columns name, age, city, salary. Select all rows where age > 30. Select only name and salary columns for these rows.

Exercise 2 (Beginner): Create a DataFrame. Add a new column 'bonus' that is 10% of salary. Modify existing columns using .loc[]. Check the difference between copy() and view.

Exercise 3 (Intermediate): Create a DataFrame. Filter rows where (city == 'NY') AND (salary > 50000). Try boolean operators (&, |, ~). Explain what goes wrong with 'and' operator.

Exercise 4 (Intermediate): Create a DataFrame with various data types (int, float, string, bool). Use iloc for position-based selection. Extract specific rows and columns by integer position.

Challenge (Advanced): Create a DataFrame and use query() method for complex conditions. Compare performance and readability with .loc[] approach on large data.

Handling Missing Data

Real-world data always has missing values. Pandas provides comprehensive tools to detect, remove, and fill them.

import pandas as pd
import numpy as np

df_na = pd.DataFrame({
    'A': [1, np.nan, 3, np.nan],
    'B': ['x', 'y', None, 'z'],
    'C': [10.0, 11.5, np.nan, 9.8]
})

# Detect missing values
print(df_na.isnull())
#        A      B      C
# 0  False  False  False
# 1   True  False  False
# 2  False   True   True
# 3   True  False  False

print(df_na.isnull().sum())  # Count per column
# A    2
# B    1
# C    1

Removing Missing Data

import pandas as pd
import numpy as np

df_na = pd.DataFrame({
    'A': [1, np.nan, 3, np.nan],
    'B': ['x', 'y', None, 'z'],
    'C': [10.0, 11.5, np.nan, 9.8]
})

# Drop rows with ANY missing value
df_clean = df_na.dropna()

# Drop rows where ALL values are missing
df_clean = df_na.dropna(how='all')

# Drop columns with missing values
df_clean = df_na.dropna(axis=1)

# Drop rows missing values in specific columns
df_clean = df_na.dropna(subset=['A', 'C'])

Filling Missing Data

import pandas as pd
import numpy as np

df_na = pd.DataFrame({
    'A': [1, np.nan, 3, np.nan],
    'B': ['x', 'y', None, 'z'],
    'C': [10.0, 11.5, np.nan, 9.8]
})

# Fill with specific value
df_filled = df_na.fillna(0)

# Fill with column means (numeric columns only; 'B' is a string column)
df_filled = df_na.fillna(df_na.mean(numeric_only=True))

# Fill with different values per column
df_filled = df_na.fillna({
    'A': df_na['A'].mean(),
    'B': 'missing',
    'C': df_na['C'].median()
})

# Forward fill (propagate last valid value)
df_filled = df_na.ffill()  # fillna(method='ffill') is deprecated

# Backward fill
df_filled = df_na.bfill()
Best Practice: Never blindly drop or fill missing data. Understand why data is missing (random, systematic, measurement error) before choosing a strategy. Document your decisions.
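For ordered or time-indexed data, interpolation is often a better fill strategy than a constant, since it estimates each gap from its neighbors:

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, np.nan, 4.0, 5.0])

# Linear interpolation (the default) fills each gap on a straight line
# between the nearest valid values on either side
print(s.interpolate().tolist())  # [1.0, 2.0, 3.0, 4.0, 5.0]
```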

Operations & Computations

Apply, Map, and Applymap

import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [25, 30, 35, 28],
    'city': ['NY', 'LA', 'NY', 'SF'],
    'score': [88, 92, 85, 90]
})

# apply() on columns
df['age_squared'] = df['age'].apply(lambda x: x ** 2)

# apply() on rows (axis=1)
df['total'] = df.apply(lambda row: row['age'] + row['score'], axis=1)

# map() for Series element-wise transformation
city_map = {'NY': 'New York', 'LA': 'Los Angeles', 'SF': 'San Francisco'}
df['city_full'] = df['city'].map(city_map)

# applymap() for entire DataFrame (deprecated, use .map() instead)
df_str = pd.DataFrame({'x': ['a', 'bb'], 'y': ['ccc', 'd']})
lengths = df_str.map(len)  # Apply len to every cell

String Operations

The .str accessor provides vectorized string methods:

import pandas as pd

df_text = pd.DataFrame({
    'name': [' alice ', 'Bob', 'CHARLIE'],
    'email': ['a@x.com', 'b@y.org', 'c@z.net']
})

# String cleaning
df_text['name'] = df_text['name'].str.strip().str.title()

# String operations
df_text['name_len'] = df_text['name'].str.len()
df_text['has_org'] = df_text['email'].str.contains('.org', regex=False)  # literal match, not regex
df_text['domain'] = df_text['email'].str.extract(r'@([\w.]+)')

print(df_text)

Practice Exercises

String Operations Exercises

Exercise 1 (Beginner): Create a DataFrame with a string column. Use .str.lower(), .str.upper(), .str.title() to transform the text. Print results.

Exercise 2 (Beginner): Create a DataFrame with email addresses. Use .str.contains() to find emails from a specific domain. Use .str.extract() to extract the domain name.

Exercise 3 (Intermediate): Create a DataFrame with names like ' john doe ', 'jane SMITH'. Clean using .str.strip() and .str.title(). Split into first and last names using .str.split().

Exercise 4 (Intermediate): Create a DataFrame with text containing special characters. Use .str.replace() to replace patterns. Use .str.startswith() and .str.endswith() for filtering.

Challenge (Advanced): Create a DataFrame with mixed-case email addresses. Extract domain and user parts separately using .str.extract() with regex groups. Validate email format using .str.match().

Sorting

import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [25, 30, 35, 28],
    'city': ['NY', 'LA', 'NY', 'SF'],
    'score': [88, 92, 85, 90]
})

# Sort by values
df_sorted = df.sort_values('age')
df_sorted = df.sort_values(['city', 'score'], ascending=[True, False])

# Sort by index
df_sorted = df.sort_index()
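For "top n" questions, nlargest() and nsmallest() are more direct than a full sort followed by head():

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'score': [88, 92, 85, 90]
})

# Top 2 and bottom 1 rows by score, without sorting the whole frame
print(df.nlargest(2, 'score'))   # Bob (92), Diana (90)
print(df.nsmallest(1, 'score'))  # Charlie (85)
```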

GroupBy & Aggregations

GroupBy implements the split-apply-combine pattern—split data into groups, apply a function, combine results. This is essential for data analysis.

import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [25, 30, 35, 28],
    'city': ['NY', 'LA', 'NY', 'SF'],
    'score': [88, 92, 85, 90]
})

# Group by city and calculate statistics
grouped = df.groupby('city')['score'].agg(['count', 'mean', 'min', 'max'])
print(grouped)
#       count  mean  min  max
# city                      
# LA        1  92.0   92   92
# NY        2  86.5   85   88
# SF        1  90.0   90   90

# Multiple aggregations with named outputs
agg_df = df.groupby('city').agg(
    age_mean=('age', 'mean'),
    score_sum=('score', 'sum'),
    count=('name', 'count')
)
print(agg_df)

Advanced GroupBy

import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [25, 30, 35, 28],
    'city': ['NY', 'LA', 'NY', 'SF'],
    'score': [88, 92, 85, 90]
})
df['is_ny'] = df['city'] == 'NY'

# Custom aggregation function
def age_range(x):
    return x.max() - x.min()

df.groupby('city')['age'].agg(age_range)

# Apply custom function to groups
def zscore(x):
    return (x - x.mean()) / x.std(ddof=0)

df['score_zscore'] = df.groupby('city')['score'].transform(zscore)

# Multiple grouping columns
df.groupby(['city', 'is_ny'])['score'].mean()

Real-World Example

Sales Analysis with GroupBy

import pandas as pd
import numpy as np

sales = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=100, freq='D'),
    'region': np.random.choice(['North', 'South', 'East', 'West'], 100),
    'product': np.random.choice(['A', 'B', 'C'], 100),
    'revenue': np.random.randint(100, 1000, 100)
})

# Monthly revenue by region
sales['month'] = sales['date'].dt.to_period('M')
monthly = sales.groupby(['month', 'region'])['revenue'].sum().unstack()
print(monthly.head())

Pivot Tables & Crosstabs

Pivot tables reshape data from long to wide format, aggregating as needed—like Excel pivot tables.

import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [25, 30, 35, 28],
    'city': ['NY', 'LA', 'NY', 'SF'],
    'score': [88, 92, 85, 90]
})
df['is_ny'] = df['city'] == 'NY'

# Pivot table: reshape and aggregate
pivot = pd.pivot_table(
    df,
    values='score',
    index='city',
    columns='age',
    aggfunc='mean',
    fill_value=0
)
print(pivot)
#       age  25  28  30  35
# city                     
# LA          0   0  92   0
# NY         88   0   0  85
# SF          0  90   0   0

# Crosstab: frequency table
ct = pd.crosstab(df['city'], df['is_ny'])
print(ct)
# is_ny  False  True
# city              
# LA         1     0
# NY         0     2
# SF         1     0
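crosstab can also normalize counts into proportions, which is often more readable than raw frequencies:

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['NY', 'LA', 'NY', 'SF'],
    'passed': [True, True, False, True]
})

# normalize='index' converts each row of counts into row proportions
ct = pd.crosstab(df['city'], df['passed'], normalize='index')
print(ct)
# passed  False  True
# city
# LA        0.0   1.0
# NY        0.5   0.5
# SF        0.0   1.0
```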

Practice Exercises

GroupBy & Aggregations Exercises

Exercise 1 (Beginner): Create a sales DataFrame with columns: product, region, amount. Group by region and calculate total amount per region. Then group by product.

Exercise 2 (Beginner): Create a DataFrame with numeric columns. Group by a category column and apply multiple aggregations (mean, sum, min, max) simultaneously.

Exercise 3 (Intermediate): Create sales data with date and region. Group by region and month, then calculate total revenue. Use unstack() to create a pivot-like view.

Exercise 4 (Intermediate): Write a custom aggregation function (e.g., coefficient of variation) and apply it to grouped data. Compare with built-in agg functions.

Challenge (Advanced): Create complex grouped data. Use transform() to add group means as a new column. Use apply() for custom group-wise transformations. Explain difference from agg().

Merge, Join & Concatenate

Combining datasets is crucial for data analysis. Pandas provides SQL-like joins and concatenation.

Merge (SQL-style joins)

import pandas as pd

left = pd.DataFrame({'id': [1, 2, 3], 'city': ['NY', 'LA', 'SF']})
right = pd.DataFrame({'id': [1, 2, 4], 'score': [88, 92, 77]})

# Inner join (intersection)
inner = left.merge(right, on='id', how='inner')
print(inner)
#    id city  score
# 0   1   NY     88
# 1   2   LA     92

# Left join (all from left, matching from right)
left_join = left.merge(right, on='id', how='left')
print(left_join)
#    id city  score
# 0   1   NY   88.0
# 1   2   LA   92.0
# 2   3   SF    NaN

# Outer join (all from both)
outer = left.merge(right, on='id', how='outer')
print(outer)

Join (index-based)

import pandas as pd

left = pd.DataFrame({'id': [1, 2, 3], 'city': ['NY', 'LA', 'SF']})
right = pd.DataFrame({'id': [1, 2, 4], 'score': [88, 92, 77]})

l = left.set_index('id')
r = right.set_index('id')

# Join on index
joined = l.join(r, how='left')
print(joined)

Concatenate

import pandas as pd

left = pd.DataFrame({'id': [1, 2, 3], 'city': ['NY', 'LA', 'SF']})
right = pd.DataFrame({'id': [1, 2, 4], 'score': [88, 92, 77]})

# Stack rows vertically
concat_rows = pd.concat([left, right], axis=0, ignore_index=True)

# Stack columns horizontally
concat_cols = pd.concat([left, right], axis=1)
SQL Analogy: merge(how='inner') = INNER JOIN, how='left' = LEFT JOIN, how='right' = RIGHT JOIN, how='outer' = FULL OUTER JOIN.
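When debugging joins, indicator=True adds a _merge column recording where each row came from:

```python
import pandas as pd

left = pd.DataFrame({'id': [1, 2, 3], 'city': ['NY', 'LA', 'SF']})
right = pd.DataFrame({'id': [1, 2, 4], 'score': [88, 92, 77]})

# _merge is 'both', 'left_only', or 'right_only' per row
outer = left.merge(right, on='id', how='outer', indicator=True)
print(outer[['id', '_merge']])
# ids 1 and 2 -> both; id 3 -> left_only; id 4 -> right_only
```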

Time Series Operations

Pandas has powerful tools for working with time-indexed data:

import pandas as pd
import numpy as np

# Create time series
rng = pd.date_range('2024-01-01', periods=30, freq='D')
ts = pd.Series(np.random.randn(len(rng)).cumsum(), index=rng, name='value')

print(ts.head())
# 2024-01-01   -0.234
# 2024-01-02    0.123
# ...

# Resample to weekly mean
weekly = ts.resample('W').mean()

# Shift (lag/lead)
ts_shifted = ts.shift(1)  # Lag by 1 day

# Rolling window
rolling_7d = ts.rolling(window=7).mean()

# Date accessors
df_ts = ts.to_frame()
df_ts['month'] = df_ts.index.month
df_ts['weekday'] = df_ts.index.day_name()
print(df_ts.head())

Best Practices & Summary

Key Takeaways

  • Use vectorized operations over loops for performance
  • Prefer loc/iloc for clear, safe indexing semantics
  • Handle missing data explicitly with domain knowledge
  • Use groupby for split-apply-combine aggregations
  • Leverage method chaining for readable transformations
  • Choose appropriate merge types (inner/left/right/outer)
  • Use categorical dtype for repeated labels to save memory
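The categorical-dtype tip is easy to measure with memory_usage(deep=True):

```python
import pandas as pd

# 4,000 strings but only 3 unique labels
cities = pd.Series(['NY', 'LA', 'NY', 'SF'] * 1000)

as_object = cities.memory_usage(deep=True)
as_category = cities.astype('category').memory_usage(deep=True)

# category stores each unique label once plus small integer codes,
# so memory drops sharply when labels repeat
print(f"object: {as_object} bytes, category: {as_category} bytes")
```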

Common Pitfalls

Avoid These Mistakes
  1. Chained assignment: Use .loc instead of df[cond][col] = val
  2. Ignoring dtypes: Check and convert types explicitly
  3. Not setting index: Use meaningful indices for time series and joins
  4. Views vs copies: Slicing may return a view or a copy; call .copy() when you need an independent DataFrame to modify
  5. Mixing label/position indexing: Stick to loc or iloc, don't mix

What's Next

You've mastered Pandas—data loading, cleaning, transformation, grouping, and merging. You're now ready to visualize your insights!

Next in the Series: Part 3: Data Visualization with Matplotlib & Seaborn will teach you to create compelling visualizations that communicate your data analysis results effectively.

Practice Exercises

Merge & Join Exercises

Exercise 1 (Beginner): Create two DataFrames with overlapping IDs. Perform inner, outer, left, and right joins. Compare row counts and identify missing data.

Exercise 2 (Beginner): Create customer and order DataFrames. Merge on customer_id. Then group by customer to calculate total purchases.

Exercise 3 (Intermediate): Create multiple DataFrames. Concatenate them vertically (vstack) and horizontally (hstack). Use ignore_index=True. Explain when to use which.

Exercise 4 (Intermediate): Create DataFrames with non-matching keys. Try merging on key using how='left'. Handle NaN values from unmatched rows using fillna().

Challenge (Advanced): Create 3+ DataFrames with different keys. Chain multiple merges or use reduce() to combine all. Verify final row count matches expected result.

Pandas API Cheat Sheet

Quick reference for essential Pandas operations for data manipulation and analysis.

DataFrame Creation

  • pd.DataFrame(dict) - From dictionary
  • pd.read_csv('file.csv') - From CSV file
  • pd.read_excel('file.xlsx') - From Excel
  • pd.read_json('file.json') - From JSON
  • pd.read_sql(query, conn) - From SQL query
  • pd.Series([1,2,3]) - Create Series

Data Inspection

  • df.head(n) - First n rows
  • df.tail(n) - Last n rows
  • df.info() - Column types & nulls
  • df.describe() - Statistical summary
  • df.shape - (rows, columns)
  • df.dtypes - Column data types

Selection & Filtering

  • df['col'] - Select column
  • df[['a','b']] - Select columns
  • df.loc['idx'] - By label
  • df.iloc[0] - By position
  • df[df['age'] > 25] - Boolean filter
  • df.query('age > 25') - Query syntax
  • df['col'].isin(['NY','LA']) - Membership test

Missing Data

  • df.isnull() - Detect NaN values
  • df.isnull().sum() - Count nulls per column
  • df.dropna() - Drop rows with NaN
  • df.fillna(value) - Fill with value
  • df.ffill() - Forward fill
  • df.fillna(df.mean(numeric_only=True)) - Fill with mean

GroupBy & Aggregation

  • df.groupby('col') - Group by column
  • .mean() - Group mean
  • .sum() - Group sum
  • .count() - Group count
  • .agg(['min','max']) - Multiple aggs
  • .transform(fn) - Broadcast result
  • pd.pivot_table() - Pivot & aggregate

Merging & Joining

  • pd.merge(df1, df2) - Merge (inner)
  • how='left' - Left join
  • how='right' - Right join
  • how='outer' - Outer join
  • on='key' - Merge on column
  • df1.join(df2) - Join on index
  • pd.concat([df1,df2]) - Stack DataFrames

Transformation

  • df['new'] = df['a'] + df['b'] - Add column
  • df.apply(fn) - Apply function
  • df['col'].map(dict) - Map values
  • df.sort_values('col') - Sort by column
  • df.rename(columns={}) - Rename columns
  • df.drop(['col'], axis=1) - Drop column
  • df.reset_index() - Reset index

String Operations

  • df['col'].str.lower() - To lowercase
  • .str.upper() - To uppercase
  • .str.strip() - Remove whitespace
  • .str.contains('word') - Contains pattern
  • .str.startswith('A') - Starts with
  • .str.split(',') - Split string
  • .str.replace('old','new') - Replace text

Pro Tips:
  • loc vs iloc: loc uses labels, iloc uses positions (integers)
  • Chaining: Avoid chained assignment; use .loc[] for safe updates
  • Method chaining: Chain methods for cleaner code: df.dropna().groupby('col').mean()
  • Performance: Use vectorized operations instead of .apply() when possible
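Putting the method-chaining tip into practice, a typical cleanup-and-aggregate pipeline reads top to bottom:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'city': ['NY', 'LA', 'NY', 'SF', 'LA'],
    'score': [88, 92, np.nan, 90, 95]
})

# Drop missing rows, group, aggregate, and sort in one readable chain
result = (
    df.dropna(subset=['score'])
      .groupby('city')['score']
      .mean()
      .sort_values(ascending=False)
)
print(result)
# city
# LA    93.5
# SF    90.0
# NY    88.0
```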