
Python Data Science Series Part 2: Pandas for Data Analysis

December 27, 2025 Wasil Zafar 22 min read

Master Pandas, Python's primary tool for data manipulation and analysis. Learn DataFrames, Series, data cleaning, transformation, groupby operations, and everything needed for real-world data work.

Table of Contents

  1. Introduction & Core Structures
  2. Data Input/Output & Selection
  3. Data Transformation & Operations
  4. Advanced Data Manipulation

Introduction to Pandas

Prerequisites: Before running the code examples in this tutorial, make sure you have Python and Jupyter notebooks properly set up. If you haven't configured your development environment yet, check out our complete setup guide for VS Code, PyCharm, Jupyter, and Colab.

If NumPy is the foundation of Python's data science stack, Pandas is where data analysis truly begins. Built on top of NumPy, Pandas provides high-level data structures and tools specifically designed for working with real-world, tabular data—the kind you find in spreadsheets, databases, and CSV files.

Why Pandas Matters: Pandas is to data science what SQL is to databases—the essential tool for data manipulation. It's the first library you reach for when loading, cleaning, transforming, and analyzing data before machine learning or visualization.

Key capabilities:

  • DataFrame: 2D table with labeled rows and columns (like Excel, but programmable)
  • Series: 1D labeled array (a single column or row)
  • I/O tools: Read/write CSV, Excel, JSON, SQL, Parquet, and more
  • Data cleaning: Handle missing values, duplicates, and transformations
  • GroupBy: Split-apply-combine operations for aggregation
  • Merge/Join: Combine datasets like SQL joins
  • Time series: Specialized support for datetime data

Installation and Setup

# Install Pandas
pip install pandas

# Import convention
import pandas as pd
import numpy as np

print(f"pandas version: {pd.__version__}")

Series: 1D Labeled Arrays

A Series is a one-dimensional labeled array—like a column in a spreadsheet or a single feature in a dataset. It combines NumPy's array performance with labeled indices.

import pandas as pd
import numpy as np

# Creating Series from list
s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'], name='scores')
print(s)
# Output:
# a    10
# b    20
# c    30
# d    40
# Name: scores, dtype: int64

# Access by label and position
print("By label 'b':", s['b'])     # 20
print("By position 2:", s.iloc[2]) # 30

# Vectorized operations
print("Add 5:", s + 5)
print("Multiply by 2:", s * 2)

Index Alignment

Pandas automatically aligns data by index labels—a powerful feature:

import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

# Alignment by index (missing indices become NaN)
result = s1 + s2
print(result)
# a     NaN
# b    12.0
# c    23.0
# d     NaN
Critical Concept: Index alignment happens automatically. Operations match by labels, not positions. This prevents silent bugs when working with mismatched data but can introduce NaN values.
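When the NaN values from non-overlapping labels are unwanted, the arithmetic methods (`.add()`, `.sub()`, etc.) accept a `fill_value` that substitutes for missing entries before the operation:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

# fill_value=0 treats a label missing from one Series as 0
result = s1.add(s2, fill_value=0)
print(result)
# a     1.0
# b    12.0
# c    23.0
# d    30.0
```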

DataFrame: 2D Tabular Data

The DataFrame is Pandas' primary data structure—a 2D table with labeled rows and columns. Think of it as a spreadsheet or SQL table in Python.

import pandas as pd

# Create DataFrame from dictionary
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [25, 30, 35, 28],
    'city': ['NY', 'LA', 'NY', 'SF'],
    'score': [88, 92, 85, 90]
})

print(df)
#       name  age city  score
# 0    Alice   25   NY     88
# 1      Bob   30   LA     92
# 2  Charlie   35   NY     85
# 3    Diana   28   SF     90

# Data inspection
print(df.head(2))        # First 2 rows
print(df.info())         # Structure and types
print(df.describe())     # Statistical summary
print(df.shape)          # (4, 4) - rows × columns

Column Operations

import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [25, 30, 35, 28],
    'city': ['NY', 'LA', 'NY', 'SF'],
    'score': [88, 92, 85, 90]
})

# Select single column (returns Series)
ages = df['age']
print(type(ages))  # <class 'pandas.core.series.Series'>

# Select multiple columns (returns DataFrame)
subset = df[['name', 'score']]

# Add new column
df['is_ny'] = df['city'] == 'NY'
print(df)
#       name  age city  score  is_ny
# 0    Alice   25   NY     88   True
# 1      Bob   30   LA     92  False
# 2  Charlie   35   NY     85   True
# 3    Diana   28   SF     90  False

DataFrame Anatomy

Understanding DataFrame Structure

  • Index: Row labels (default: 0, 1, 2, ...)
  • Columns: Column labels (from keys or specified)
  • Values: Underlying NumPy array (access with .values)
  • dtypes: Data type per column (int64, float64, object)
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [25, 30, 35, 28],
    'city': ['NY', 'LA', 'NY', 'SF'],
    'score': [88, 92, 85, 90]
})

print("Index:", df.index)
print("Columns:", df.columns)
print("Dtypes:", df.dtypes)
print("Values type:", type(df.values))  # numpy.ndarray

Reading & Writing Data

Pandas excels at loading data from various sources—CSV, Excel, JSON, SQL databases, and more.

CSV Operations

import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [25, 30, 35, 28],
    'city': ['NY', 'LA', 'NY', 'SF'],
    'score': [88, 92, 85, 90]
})

# Write to CSV
df.to_csv('people.csv', index=False)

# Read from CSV
df_csv = pd.read_csv('people.csv')
print(df_csv.head())

# Advanced CSV reading
df_adv = pd.read_csv('data.csv',
                     sep=';',                    # Custom delimiter
                     skiprows=2,                 # Skip header rows
                     na_values=['NA', 'missing'], # Custom NaN values
                     parse_dates=['date'],       # Parse date columns
                     dtype={'id': str})          # Force column types
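For files too large to load at once, read_csv can also stream the data in chunks. A small sketch (io.StringIO stands in for a real file on disk):

```python
import io
import pandas as pd

# io.StringIO stands in for a large CSV file on disk
csv_data = io.StringIO("id,value\n1,10\n2,20\n3,30\n4,40\n5,50\n")

# chunksize returns an iterator of DataFrames instead of one big frame,
# so only one chunk is in memory at a time
total = 0
for chunk in pd.read_csv(csv_data, chunksize=2):
    total += chunk['value'].sum()

print(total)  # 150
```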

Excel, JSON, and More

import pandas as pd
import sqlite3

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [25, 30, 35, 28],
    'city': ['NY', 'LA', 'NY', 'SF'],
    'score': [88, 92, 85, 90]
})

# Excel (requires openpyxl or xlrd)
df.to_excel('people.xlsx', index=False)
df_xlsx = pd.read_excel('people.xlsx')

# JSON
df.to_json('people.json', orient='records')
df_json = pd.read_json('people.json')

# Parquet (efficient columnar format)
df.to_parquet('people.parquet')
df_parquet = pd.read_parquet('people.parquet')

# SQL database
conn = sqlite3.connect('database.db')
df.to_sql('people', conn, if_exists='replace')
df_sql = pd.read_sql('SELECT * FROM people WHERE age > 25', conn)
Performance Tip: For large datasets (>1GB), prefer Parquet over CSV. Parquet is a compressed, columnar binary format, so files are typically several times smaller and considerably faster to read and write than equivalent CSVs.

Indexing, Selection & Filtering

Pandas provides two primary indexers: loc (label-based) and iloc (position-based).

loc vs iloc

import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [25, 30, 35, 28],
    'city': ['NY', 'LA', 'NY', 'SF'],
    'score': [88, 92, 85, 90]
})

# Set custom index
df_idx = df.set_index('name')

# loc: label-based selection
print(df_idx.loc['Alice'])           # Row by label
print(df_idx.loc['Alice', 'age'])    # Specific cell
print(df_idx.loc['Alice':'Charlie']) # Slice (inclusive!)

# iloc: position-based selection
print(df_idx.iloc[0])                # First row
print(df_idx.iloc[0, 1])            # Row 0, col 1
print(df_idx.iloc[0:2, 1:3])        # Slice (exclusive)

Boolean Filtering

Boolean masks are the most powerful way to filter data:

import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [25, 30, 35, 28],
    'city': ['NY', 'LA', 'NY', 'SF'],
    'score': [88, 92, 85, 90]
})

# Single condition
high_scores = df[df['score'] > 85]
print(high_scores)

# Multiple conditions (use & | ~)
elite = df[(df['score'] > 85) & (df['age'] < 30)]

# isin() for membership
ny_sf = df[df['city'].isin(['NY', 'SF'])]

# String methods
starts_with_a = df[df['name'].str.startswith('A')]

# Query syntax (cleaner for complex conditions)
result = df.query('age >= 30 and city == "LA"')
print(result)

Boolean Indexing with 2D Arrays

When working with NumPy arrays or numerical DataFrames, you can use boolean indexing on multiple dimensions—a powerful technique for filtering machine learning datasets by class or condition.

import pandas as pd
import numpy as np

# Create a dataset with features (X) and labels (y)
X = np.array([[1.2, 2.3, 3.1],
              [4.5, 5.6, 6.2],
              [7.8, 8.9, 9.3],
              [2.1, 3.4, 4.5],
              [5.6, 6.7, 7.8]])

y = np.array([0, 1, 1, 0, 1])  # Binary labels (class 0 or 1)

print("Full dataset:")
print("X shape:", X.shape)  # (5, 3)
print("y shape:", y.shape)  # (5,)
print("X:\n", X)
print("y:", y)

Filtering Rows by Class Label

The syntax X[y==0] creates a boolean mask and applies it to select rows:

import numpy as np

X = np.array([[1.2, 2.3, 3.1],
              [4.5, 5.6, 6.2],
              [7.8, 8.9, 9.3],
              [2.1, 3.4, 4.5],
              [5.6, 6.7, 7.8]])

y = np.array([0, 1, 1, 0, 1])

# Step 1: Create boolean mask
mask_class_0 = (y == 0)
print("Boolean mask for class 0:", mask_class_0)
# [True, False, False, True, False]

# Step 2: Apply mask to select rows
class_0_samples = X[mask_class_0]
print("Class 0 samples:\n", class_0_samples)
# [[1.2, 2.3, 3.1]
#  [2.1, 3.4, 4.5]]

# Combined syntax (one-liner)
class_1_samples = X[y == 1]
print("Class 1 samples:\n", class_1_samples)
# [[4.5, 5.6, 6.2]
#  [7.8, 8.9, 9.3]
#  [5.6, 6.7, 7.8]]

Selecting Specific Columns from Filtered Rows

The syntax X[y==0, 0] selects rows where y==0 AND column 0 (first feature):

import numpy as np

X = np.array([[1.2, 2.3, 3.1],
              [4.5, 5.6, 6.2],
              [7.8, 8.9, 9.3],
              [2.1, 3.4, 4.5],
              [5.6, 6.7, 7.8]])

y = np.array([0, 1, 1, 0, 1])

# Get first feature (column 0) for class 0 samples
class_0_feature_0 = X[y == 0, 0]
print("Class 0, Feature 0:", class_0_feature_0)
# [1.2, 2.1]

# Get second feature (column 1) for class 1 samples
class_1_feature_1 = X[y == 1, 1]
print("Class 1, Feature 1:", class_1_feature_1)
# [5.6, 8.9, 6.7]

# Get multiple columns using slicing
class_0_first_two_features = X[y == 0, :2]
print("Class 0, First 2 features:\n", class_0_first_two_features)
# [[1.2, 2.3]
#  [2.1, 3.4]]

Using with Pandas DataFrames

The same pattern works with Pandas DataFrames for feature engineering and analysis:

import pandas as pd
import numpy as np

# Create DataFrame with features and target
df = pd.DataFrame({
    'feature_1': [1.2, 4.5, 7.8, 2.1, 5.6],
    'feature_2': [2.3, 5.6, 8.9, 3.4, 6.7],
    'feature_3': [3.1, 6.2, 9.3, 4.5, 7.8],
    'class_label': [0, 1, 1, 0, 1]
})

print("Full DataFrame:")
print(df)

# Boolean indexing with DataFrames
class_0_df = df[df['class_label'] == 0]
print("\nClass 0 samples:")
print(class_0_df)

# Select specific columns for a class
class_1_features = df.loc[df['class_label'] == 1, ['feature_1', 'feature_2']]
print("\nClass 1, Features 1 & 2:")
print(class_1_features)

# Compute statistics per class
print("\nMean per class:")
print("Class 0 mean:", df[df['class_label'] == 0][['feature_1', 'feature_2', 'feature_3']].mean())
print("Class 1 mean:", df[df['class_label'] == 1][['feature_1', 'feature_2', 'feature_3']].mean())
Machine Learning Application: Boolean indexing by class label is essential for:
  • Data Exploration: Analyze feature distributions per class
  • Visualization: Plot different classes with different colors (covered in Visualization tutorial)
  • Splitting Data: Separate train/test sets while preserving class balance
  • Evaluation: Compute metrics (accuracy, precision, recall) per class
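As a rough illustration of the "splitting while preserving class balance" point, the per-class masks can drive a simplified, unshuffled split. This is only a sketch: in practice you would use scikit-learn's train_test_split(X, y, stratify=y), which also shuffles.

```python
import numpy as np

# Hypothetical toy data for illustration
X = np.array([[1.2, 2.3], [4.5, 5.6], [7.8, 8.9], [2.1, 3.4], [5.6, 6.7]])
y = np.array([0, 1, 1, 0, 1])

train_idx = []
for label in np.unique(y):
    positions = np.where(y == label)[0]          # row positions of this class
    n_train = max(1, int(0.5 * len(positions)))  # keep ~half of each class
    train_idx.extend(positions[:n_train])        # unshuffled: takes the first rows

X_train, y_train = X[train_idx], y[train_idx]
print("Train labels:", y_train)  # one sample from each class here
```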

Common Pitfall

Chained Assignment Warning

Avoid this common mistake:

import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [25, 30, 35, 28],
    'score': [88, 92, 85, 90]
})

# BAD: Chained assignment (might not work!)
df[df['age'] > 25]['score'] = 100

# GOOD: Use .loc for safe assignment
df.loc[df['age'] > 25, 'score'] = 100

Chained assignment can produce a copy instead of a view, causing silent failures. Always use .loc for assignments.
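When you genuinely want a modified subset rather than an in-place update, take an explicit copy first:

```python
import pandas as pd

df = pd.DataFrame({'age': [25, 30, 35], 'score': [88, 92, 85]})

# An explicit .copy() gives an independent DataFrame that is safe to modify
subset = df[df['age'] > 25].copy()
subset['score'] = 100               # changes the copy only

print(subset['score'].tolist())  # [100, 100]
print(df['score'].tolist())      # [88, 92, 85]  (original untouched)
```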

Practice Exercises

Selection & Filtering Exercises

Exercise 1 (Beginner): Create a DataFrame with columns name, age, city, salary. Select all rows where age > 30. Select only name and salary columns for these rows.

Exercise 2 (Beginner): Create a DataFrame. Add a new column 'bonus' that is 10% of salary. Modify existing columns using .loc[]. Check the difference between copy() and view.

Exercise 3 (Intermediate): Create a DataFrame. Filter rows where (city == 'NY') AND (salary > 50000). Try boolean operators (&, |, ~). Explain what goes wrong with 'and' operator.

Exercise 4 (Intermediate): Create a DataFrame with various data types (int, float, string, bool). Use iloc for position-based selection. Extract specific rows and columns by integer position.

Challenge (Advanced): Create a DataFrame and use query() method for complex conditions. Compare performance and readability with .loc[] approach on large data.

Handling Missing Data

Real-world data always has missing values. Pandas provides comprehensive tools to detect, remove, and fill them.

import pandas as pd
import numpy as np

df_na = pd.DataFrame({
    'A': [1, np.nan, 3, np.nan],
    'B': ['x', 'y', None, 'z'],
    'C': [10.0, 11.5, np.nan, 9.8]
})

# Detect missing values
print(df_na.isnull())
#        A      B      C
# 0  False  False  False
# 1   True  False  False
# 2  False   True   True
# 3   True  False  False

print(df_na.isnull().sum())  # Count per column
# A    2
# B    1
# C    1

Removing Missing Data

import pandas as pd
import numpy as np

df_na = pd.DataFrame({
    'A': [1, np.nan, 3, np.nan],
    'B': ['x', 'y', None, 'z'],
    'C': [10.0, 11.5, np.nan, 9.8]
})

# Drop rows with ANY missing value
df_clean = df_na.dropna()

# Drop rows where ALL values are missing
df_clean = df_na.dropna(how='all')

# Drop columns with missing values
df_clean = df_na.dropna(axis=1)

# Drop rows missing values in specific columns
df_clean = df_na.dropna(subset=['A', 'C'])

Filling Missing Data

import pandas as pd
import numpy as np

df_na = pd.DataFrame({
    'A': [1, np.nan, 3, np.nan],
    'B': ['x', 'y', None, 'z'],
    'C': [10.0, 11.5, np.nan, 9.8]
})

# Fill with specific value
df_filled = df_na.fillna(0)

# Fill with column means (numeric columns only; 'B' is a string column)
df_filled = df_na.fillna(df_na.mean(numeric_only=True))

# Fill with different values per column
df_filled = df_na.fillna({
    'A': df_na['A'].mean(),
    'B': 'missing',
    'C': df_na['C'].median()
})

# Forward fill (propagate last valid value)
df_filled = df_na.ffill()  # fillna(method='ffill') is deprecated

# Backward fill
df_filled = df_na.bfill()
Best Practice: Never blindly drop or fill missing data. Understand why data is missing (random, systematic, measurement error) before choosing a strategy. Document your decisions.
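For ordered or time-indexed data, interpolation is often a better fill strategy than a constant, since it estimates each gap from its neighbors:

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, np.nan, 4.0, 5.0])

# Linear interpolation (the default) fills each gap on a straight line
# between the nearest valid values on either side
print(s.interpolate().tolist())  # [1.0, 2.0, 3.0, 4.0, 5.0]
```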

Operations & Computations

Apply, Map, and Applymap

import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [25, 30, 35, 28],
    'city': ['NY', 'LA', 'NY', 'SF'],
    'score': [88, 92, 85, 90]
})

# apply() on columns
df['age_squared'] = df['age'].apply(lambda x: x ** 2)

# apply() on rows (axis=1)
df['total'] = df.apply(lambda row: row['age'] + row['score'], axis=1)

# map() for Series element-wise transformation
city_map = {'NY': 'New York', 'LA': 'Los Angeles', 'SF': 'San Francisco'}
df['city_full'] = df['city'].map(city_map)

# applymap() for entire DataFrame (deprecated, use .map() instead)
df_str = pd.DataFrame({'x': ['a', 'bb'], 'y': ['ccc', 'd']})
lengths = df_str.map(len)  # Apply len to every cell

String Operations

The .str accessor provides vectorized string methods:

import pandas as pd

df_text = pd.DataFrame({
    'name': [' alice ', 'Bob', 'CHARLIE'],
    'email': ['a@x.com', 'b@y.org', 'c@z.net']
})

# String cleaning
df_text['name'] = df_text['name'].str.strip().str.title()

# String operations
df_text['name_len'] = df_text['name'].str.len()
df_text['has_org'] = df_text['email'].str.contains('.org', regex=False)  # literal match, not regex
df_text['domain'] = df_text['email'].str.extract(r'@([\w.]+)')

print(df_text)

Practice Exercises

String Operations Exercises

Exercise 1 (Beginner): Create a DataFrame with a string column. Use .str.lower(), .str.upper(), .str.title() to transform the text. Print results.

Exercise 2 (Beginner): Create a DataFrame with email addresses. Use .str.contains() to find emails from a specific domain. Use .str.extract() to extract the domain name.

Exercise 3 (Intermediate): Create a DataFrame with names like ' john doe ', 'jane SMITH'. Clean using .str.strip() and .str.title(). Split into first and last names using .str.split().

Exercise 4 (Intermediate): Create a DataFrame with text containing special characters. Use .str.replace() to replace patterns. Use .str.startswith() and .str.endswith() for filtering.

Challenge (Advanced): Create a DataFrame with mixed-case email addresses. Extract domain and user parts separately using .str.extract() with regex groups. Validate email format using .str.match().

Sorting

import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [25, 30, 35, 28],
    'city': ['NY', 'LA', 'NY', 'SF'],
    'score': [88, 92, 85, 90]
})

# Sort by values
df_sorted = df.sort_values('age')
df_sorted = df.sort_values(['city', 'score'], ascending=[True, False])

# Sort by index
df_sorted = df.sort_index()
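For "top n" questions, nlargest() and nsmallest() are more direct than a full sort followed by head():

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'score': [88, 92, 85, 90]
})

# Top 2 and bottom 1 rows by score, without sorting the whole frame
print(df.nlargest(2, 'score'))   # Bob (92), Diana (90)
print(df.nsmallest(1, 'score'))  # Charlie (85)
```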

GroupBy & Aggregations

GroupBy implements the split-apply-combine pattern—split data into groups, apply a function, combine results. This is essential for data analysis.

import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [25, 30, 35, 28],
    'city': ['NY', 'LA', 'NY', 'SF'],
    'score': [88, 92, 85, 90]
})

# Group by city and calculate statistics
grouped = df.groupby('city')['score'].agg(['count', 'mean', 'min', 'max'])
print(grouped)
#       count  mean  min  max
# city                      
# LA        1  92.0   92   92
# NY        2  86.5   85   88
# SF        1  90.0   90   90

# Multiple aggregations with named outputs
agg_df = df.groupby('city').agg(
    age_mean=('age', 'mean'),
    score_sum=('score', 'sum'),
    count=('name', 'count')
)
print(agg_df)

Advanced GroupBy

import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [25, 30, 35, 28],
    'city': ['NY', 'LA', 'NY', 'SF'],
    'score': [88, 92, 85, 90]
})
df['is_ny'] = df['city'] == 'NY'

# Custom aggregation function
def age_range(x):
    return x.max() - x.min()

df.groupby('city')['age'].agg(age_range)

# Apply custom function to groups
def zscore(x):
    return (x - x.mean()) / x.std(ddof=0)

df['score_zscore'] = df.groupby('city')['score'].transform(zscore)

# Multiple grouping columns
df.groupby(['city', 'is_ny'])['score'].mean()

Real-World Example

Sales Analysis with GroupBy

import pandas as pd
import numpy as np

sales = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=100, freq='D'),
    'region': np.random.choice(['North', 'South', 'East', 'West'], 100),
    'product': np.random.choice(['A', 'B', 'C'], 100),
    'revenue': np.random.randint(100, 1000, 100)
})

# Monthly revenue by region
sales['month'] = sales['date'].dt.to_period('M')
monthly = sales.groupby(['month', 'region'])['revenue'].sum().unstack()
print(monthly.head())

Pivot Tables & Crosstabs

Pivot tables reshape data from long to wide format, aggregating as needed—like Excel pivot tables.

import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [25, 30, 35, 28],
    'city': ['NY', 'LA', 'NY', 'SF'],
    'score': [88, 92, 85, 90]
})
df['is_ny'] = df['city'] == 'NY'

# Pivot table: reshape and aggregate
pivot = pd.pivot_table(
    df,
    values='score',
    index='city',
    columns='age',
    aggfunc='mean',
    fill_value=0
)
print(pivot)
#       age  25  28  30  35
# city                     
# LA          0   0  92   0
# NY         88   0   0  85
# SF          0  90   0   0

# Crosstab: frequency table
ct = pd.crosstab(df['city'], df['is_ny'])
print(ct)
# is_ny  False  True
# city              
# LA         1     0
# NY         0     2
# SF         1     0
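crosstab can also normalize counts into proportions, which is often more readable than raw frequencies:

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['NY', 'LA', 'NY', 'SF'],
    'passed': [True, True, False, True]
})

# normalize='index' converts each row of counts into row proportions
ct = pd.crosstab(df['city'], df['passed'], normalize='index')
print(ct)
# passed  False  True
# city
# LA        0.0   1.0
# NY        0.5   0.5
# SF        0.0   1.0
```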

Practice Exercises

GroupBy & Aggregations Exercises

Exercise 1 (Beginner): Create a sales DataFrame with columns: product, region, amount. Group by region and calculate total amount per region. Then group by product.

Exercise 2 (Beginner): Create a DataFrame with numeric columns. Group by a category column and apply multiple aggregations (mean, sum, min, max) simultaneously.

Exercise 3 (Intermediate): Create sales data with date and region. Group by region and month, then calculate total revenue. Use unstack() to create a pivot-like view.

Exercise 4 (Intermediate): Write a custom aggregation function (e.g., coefficient of variation) and apply it to grouped data. Compare with built-in agg functions.

Challenge (Advanced): Create complex grouped data. Use transform() to add group means as a new column. Use apply() for custom group-wise transformations. Explain difference from agg().

Merge, Join & Concatenate

Combining datasets is crucial for data analysis. Pandas provides SQL-like joins and concatenation.

Merge (SQL-style joins)

import pandas as pd

left = pd.DataFrame({'id': [1, 2, 3], 'city': ['NY', 'LA', 'SF']})
right = pd.DataFrame({'id': [1, 2, 4], 'score': [88, 92, 77]})

# Inner join (intersection)
inner = left.merge(right, on='id', how='inner')
print(inner)
#    id city  score
# 0   1   NY     88
# 1   2   LA     92

# Left join (all from left, matching from right)
left_join = left.merge(right, on='id', how='left')
print(left_join)
#    id city  score
# 0   1   NY   88.0
# 1   2   LA   92.0
# 2   3   SF    NaN

# Outer join (all from both)
outer = left.merge(right, on='id', how='outer')
print(outer)

Join (index-based)

import pandas as pd

left = pd.DataFrame({'id': [1, 2, 3], 'city': ['NY', 'LA', 'SF']})
right = pd.DataFrame({'id': [1, 2, 4], 'score': [88, 92, 77]})

l = left.set_index('id')
r = right.set_index('id')

# Join on index
joined = l.join(r, how='left')
print(joined)

Concatenate

import pandas as pd

left = pd.DataFrame({'id': [1, 2, 3], 'city': ['NY', 'LA', 'SF']})
right = pd.DataFrame({'id': [1, 2, 4], 'score': [88, 92, 77]})

# Stack rows vertically
concat_rows = pd.concat([left, right], axis=0, ignore_index=True)

# Stack columns horizontally
concat_cols = pd.concat([left, right], axis=1)
SQL Analogy: merge(how='inner') = INNER JOIN, how='left' = LEFT JOIN, how='right' = RIGHT JOIN, how='outer' = FULL OUTER JOIN.
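When debugging joins, indicator=True adds a _merge column recording where each row came from:

```python
import pandas as pd

left = pd.DataFrame({'id': [1, 2, 3], 'city': ['NY', 'LA', 'SF']})
right = pd.DataFrame({'id': [1, 2, 4], 'score': [88, 92, 77]})

# _merge is 'both', 'left_only', or 'right_only' per row
outer = left.merge(right, on='id', how='outer', indicator=True)
print(outer[['id', '_merge']])
# ids 1 and 2 -> both; id 3 -> left_only; id 4 -> right_only
```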

Time Series Operations

Pandas has powerful tools for working with time-indexed data:

import pandas as pd
import numpy as np

# Create time series
rng = pd.date_range('2024-01-01', periods=30, freq='D')
ts = pd.Series(np.random.randn(len(rng)).cumsum(), index=rng, name='value')

print(ts.head())
# 2024-01-01   -0.234
# 2024-01-02    0.123
# ...

# Resample to weekly mean
weekly = ts.resample('W').mean()

# Shift (lag/lead)
ts_shifted = ts.shift(1)  # Lag by 1 day

# Rolling window
rolling_7d = ts.rolling(window=7).mean()

# Date accessors
df_ts = ts.to_frame()
df_ts['month'] = df_ts.index.month
df_ts['weekday'] = df_ts.index.day_name()
print(df_ts.head())

Best Practices & Summary

Key Takeaways

  • Use vectorized operations over loops for performance
  • Prefer loc/iloc for clear, safe indexing semantics
  • Handle missing data explicitly with domain knowledge
  • Use groupby for split-apply-combine aggregations
  • Leverage method chaining for readable transformations
  • Choose appropriate merge types (inner/left/right/outer)
  • Use categorical dtype for repeated labels to save memory
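The categorical-dtype tip is easy to measure with memory_usage(deep=True):

```python
import pandas as pd

# 4,000 strings but only 3 unique labels
cities = pd.Series(['NY', 'LA', 'NY', 'SF'] * 1000)

as_object = cities.memory_usage(deep=True)
as_category = cities.astype('category').memory_usage(deep=True)

# category stores each unique label once plus small integer codes,
# so memory drops sharply when labels repeat
print(f"object: {as_object} bytes, category: {as_category} bytes")
```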

Common Pitfalls

Avoid These Mistakes
  1. Chained assignment: Use .loc instead of df[cond][col] = val
  2. Ignoring dtypes: Check and convert types explicitly
  3. Not setting index: Use meaningful indices for time series and joins
  4. Views vs copies: Slicing may return a view or a copy; call .copy() when you need an independent DataFrame to modify
  5. Mixing label/position indexing: Stick to loc or iloc, don't mix

What's Next

You've mastered Pandas—data loading, cleaning, transformation, grouping, and merging. You're now ready to visualize your insights!

Next in the Series: Part 3: Data Visualization with Matplotlib & Seaborn will teach you to create compelling visualizations that communicate your data analysis results effectively.

Practice Exercises

Merge & Join Exercises

Exercise 1 (Beginner): Create two DataFrames with overlapping IDs. Perform inner, outer, left, and right joins. Compare row counts and identify missing data.

Exercise 2 (Beginner): Create customer and order DataFrames. Merge on customer_id. Then group by customer to calculate total purchases.

Exercise 3 (Intermediate): Create multiple DataFrames. Concatenate them vertically (vstack) and horizontally (hstack). Use ignore_index=True. Explain when to use which.

Exercise 4 (Intermediate): Create DataFrames with non-matching keys. Try merging on key using how='left'. Handle NaN values from unmatched rows using fillna().

Challenge (Advanced): Create 3+ DataFrames with different keys. Chain multiple merges or use reduce() to combine all. Verify final row count matches expected result.

Pandas API Cheat Sheet

Quick reference for essential Pandas operations for data manipulation and analysis.

DataFrame Creation

  • pd.DataFrame(dict) - From dictionary
  • pd.read_csv('file.csv') - From CSV file
  • pd.read_excel('file.xlsx') - From Excel
  • pd.read_json('file.json') - From JSON
  • pd.read_sql(query, conn) - From SQL query
  • pd.Series([1,2,3]) - Create Series

Data Inspection

  • df.head(n) - First n rows
  • df.tail(n) - Last n rows
  • df.info() - Column types & nulls
  • df.describe() - Statistical summary
  • df.shape - (rows, columns)
  • df.dtypes - Column data types

Selection & Filtering

  • df['col'] - Select column
  • df[['a','b']] - Select columns
  • df.loc['idx'] - By label
  • df.iloc[0] - By position
  • df[df['age'] > 25] - Boolean filter
  • df.query('age > 25') - Query syntax
  • df['col'].isin(['NY','LA']) - Membership test

Missing Data

  • df.isnull() - Detect NaN values
  • df.isnull().sum() - Count nulls per column
  • df.dropna() - Drop rows with NaN
  • df.fillna(value) - Fill with value
  • df.ffill() - Forward fill
  • df.fillna(df.mean(numeric_only=True)) - Fill with mean

GroupBy & Aggregation

  • df.groupby('col') - Group by column
  • .mean() - Group mean
  • .sum() - Group sum
  • .count() - Group count
  • .agg(['min','max']) - Multiple aggs
  • .transform(fn) - Broadcast result
  • pd.pivot_table() - Pivot & aggregate

Merging & Joining

  • pd.merge(df1, df2) - Merge (inner)
  • how='left' - Left join
  • how='right' - Right join
  • how='outer' - Outer join
  • on='key' - Merge on column
  • df1.join(df2) - Join on index
  • pd.concat([df1,df2]) - Stack DataFrames

Transformation

  • df['new'] = df['a'] + df['b'] - Add column
  • df.apply(fn) - Apply function
  • df['col'].map(dict) - Map values
  • df.sort_values('col') - Sort by column
  • df.rename(columns={}) - Rename columns
  • df.drop(['col'], axis=1) - Drop column
  • df.reset_index() - Reset index

String Operations

  • df['col'].str.lower() - To lowercase
  • .str.upper() - To uppercase
  • .str.strip() - Remove whitespace
  • .str.contains('word') - Contains pattern
  • .str.startswith('A') - Starts with
  • .str.split(',') - Split string
  • .str.replace('old','new') - Replace text

Pro Tips:
  • loc vs iloc: loc uses labels, iloc uses positions (integers)
  • Chaining: Avoid chained assignment; use .loc[] for safe updates
  • Method chaining: Chain methods for cleaner code: df.dropna().groupby('col').mean()
  • Performance: Use vectorized operations instead of .apply() when possible
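Putting the method-chaining tip into practice, a typical cleanup-and-aggregate pipeline reads top to bottom:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'city': ['NY', 'LA', 'NY', 'SF', 'LA'],
    'score': [88, 92, np.nan, 90, 95]
})

# Drop missing rows, group, aggregate, and sort in one readable chain
result = (
    df.dropna(subset=['score'])
      .groupby('city')['score']
      .mean()
      .sort_values(ascending=False)
)
print(result)
# city
# LA    93.5
# SF    90.0
# NY    88.0
```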