Introduction to Scikit-learn
You've learned NumPy (arrays), Pandas (data manipulation), and visualization. Now it's time for machine learning—using data to make predictions and discover patterns.
Why Scikit-learn: Scikit-learn provides a simple, consistent API for hundreds of ML algorithms. Whether you're doing classification, regression, or clustering, the workflow is always: fit() to train, predict() to infer, score() to evaluate.
Course Roadmap
1. Python Setup & Notebooks: IDE setup, Jupyter, virtual environments
2. NumPy Foundations: Arrays, broadcasting, linear algebra
3. Pandas Data Analysis: DataFrames, cleaning, manipulation
4. Data Visualization: Matplotlib, Seaborn, Plotly
5. Machine Learning with Scikit-learn: Classification, regression, clustering (You Are Here)
6. ML Mathematics & Statistics: Linear algebra, calculus, probability
7. Artificial Neural Networks: Perceptrons, backpropagation, architectures
8. Computer Vision Fundamentals: CNNs, image processing, object detection
9. PyTorch Deep Learning: Tensors, autograd, model training
10. TensorFlow & Keras: Sequential models, callbacks, deployment
11. Transformers & Attention: Self-attention, BERT, GPT architecture
Key Features
- Consistent API: All models follow the same estimator interface
- Comprehensive: Classification, regression, clustering, dimensionality reduction
- Preprocessing tools: Scaling, encoding, feature selection
- Model evaluation: Cross-validation, metrics, confusion matrices
- Pipelines: Chain preprocessing and modeling steps
- Well-documented: Excellent examples and user guide
# Installation
pip install scikit-learn
# Import convention
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import pandas as pd
import numpy as np
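To see that consistent API in action, here is a minimal sketch of the fit/predict/score pattern on the Iris dataset (introduced properly later in this tutorial); the same three calls apply to nearly every Scikit-learn estimator:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# Load a small dataset and hold out 20% for testing
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# The universal estimator pattern
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train) # train
y_pred = model.predict(X_test) # infer
print(accuracy_score(y_test, y_pred)) # evaluate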
ML Terminology Explained
Before diving into code, let's clarify the key terms you'll see throughout this tutorial. Understanding these concepts will make everything that follows much clearer.
Data Terms
Features (X): The input variables used to make predictions. Think of them as the "questions" you ask.
- Also called: predictors, independent variables, attributes
- Example: For house prices → square footage, bedrooms, location
- Notation: Capital X (matrix of shape: samples × features)
Labels/Target (y): The output variable you want to predict. The "answer" you're looking for.
- Also called: response, dependent variable, outcome
- Example: For house prices → actual price in dollars
- Notation: Lowercase y (array of shape: samples)
Samples: Individual data points (rows in your dataset).
- Also called: observations, instances, examples
- Example: Each house in your dataset is one sample
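To tie these three terms together, here is a tiny hypothetical example (the house values below are made up for illustration):
import numpy as np
# Hypothetical dataset: 3 houses (samples) × 2 features each
X = np.array([[1400, 3],  # square footage, bedrooms
              [1900, 4],
              [850, 2]])
y = np.array([250000, 340000, 180000]) # one price label per sample
print(X.shape) # (3, 2) - samples × features
print(y.shape) # (3,)   - one label per sample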
Train/Test Terms
Training Set: Data used to teach the model. The model sees both features (X) and labels (y).
- Typical size: 70-80% of total data
- Variables: X_train, y_train
- Purpose: Learn patterns and relationships
Test Set: Data used to evaluate the model. Model has never seen these samples during training.
- Typical size: 20-30% of total data
- Variables: X_test, y_test
- Purpose: Measure real-world performance
Split: The process of dividing data into training and test sets.
- Function: train_test_split()
- Why: Prevents overfitting—ensures model works on new data
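A minimal sketch of a split on the Iris dataset (the 80/20 ratio and random_state=42 are common but arbitrary choices):
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
X, y = load_iris(return_X_y=True)
# Hold out 20% for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape) # (120, 4) (30, 4)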
Model Training Terms
fit(): Train the model on data. The model learns patterns from X_train and y_train.
- Usage: model.fit(X_train, y_train)
- What happens: Model adjusts internal parameters (weights) to minimize errors
- Analogy: Student studying for an exam (seeing questions + answers)
predict(): Make predictions on new data. Model outputs predictions based on what it learned.
- Usage: y_pred = model.predict(X_test)
- Input: Only features (X), no labels needed
- Analogy: Student taking the exam (answering new questions)
score(): Evaluate model performance. Compares predictions to actual labels.
- Usage: accuracy = model.score(X_test, y_test)
- Output: Performance metric (e.g., 0.95 = 95% accuracy)
- Analogy: Grading the exam (checking answers)
Preprocessing Terms
fit_transform(): Learn preprocessing parameters from data AND apply the transformation.
- Usage: X_train_scaled = scaler.fit_transform(X_train)
- Use on: Training data only
- What it does: Calculates mean/std (or min/max) from training data, then scales
- Example: StandardScaler learns mean=50, std=10, then applies scaling
transform(): Apply previously learned preprocessing to new data.
- Usage: X_test_scaled = scaler.transform(X_test)
- Use on: Test data (or any new data)
- What it does: Uses training mean/std to scale test data
- ⚠️ Critical: Never use fit_transform() on test data—causes data leakage!
Why separate fit and transform?
- Training: Model should only learn from training data
- Testing: Apply same transformation to test data (using training stats)
- Real-world: New data gets transformed using original training parameters
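A short sketch with made-up numbers illustrating why the test set keeps the training statistics:
from sklearn.preprocessing import StandardScaler
import numpy as np
X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[2.0], [10.0]])
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train) # learns mean/std from training data only
X_test_scaled = scaler.transform(X_test)       # reuses the training mean/std
print(X_train_scaled.mean()) # ~0.0 by construction
print(X_test_scaled.mean())  # NOT 0 - test data is scaled with training statistics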
Evaluation Terms
Accuracy: Percentage of correct predictions.
- Formula: (Correct predictions) / (Total predictions)
- Range: 0.0 to 1.0 (0% to 100%)
- Example: 0.95 = 95 out of 100 predictions were correct
Precision: Of all positive predictions, how many were actually correct?
- Formula: True Positives / (True Positives + False Positives)
- Use when: False positives are costly (e.g., spam detection)
Recall: Of all actual positives, how many did we catch?
- Formula: True Positives / (True Positives + False Negatives)
- Use when: Missing positives is costly (e.g., cancer detection)
F1 Score: Harmonic mean of precision and recall.
- Range: 0.0 to 1.0 (higher is better)
- Use when: Need balance between precision and recall
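A quick sketch computing all four metrics on a hypothetical set of binary predictions:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
# Hypothetical true labels and predictions for a binary problem
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
print(f"Accuracy: {accuracy_score(y_true, y_pred):.2f}")   # 6/8 correct = 0.75
print(f"Precision: {precision_score(y_true, y_pred):.2f}") # TP=3, FP=1 -> 0.75
print(f"Recall: {recall_score(y_true, y_pred):.2f}")       # TP=3, FN=1 -> 0.75
print(f"F1: {f1_score(y_true, y_pred):.2f}")               # 0.75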
⚠️ Common Mistakes
1. Data Leakage: Test data "leaking" into training.
- ❌ Wrong: scaler.fit_transform(X_test)
- ✅ Right: scaler.transform(X_test)
- Why: Test data should remain unseen during training
2. Training on Test Data: Using test set for training.
- ❌ Wrong: model.fit(X_test, y_test)
- ✅ Right: model.fit(X_train, y_train)
- Why: Inflates performance metrics artificially
3. Forgetting to Split: Evaluating on training data.
- ❌ Wrong: model.score(X_train, y_train)
- ✅ Right: model.score(X_test, y_test)
- Why: Model has memorized training data (overfitting)
4. Wrong Order: Transform before split.
- ❌ Wrong: Scale data → then split
- ✅ Right: Split data → then scale training → transform test
- Why: Scaler would learn from entire dataset (including test)
Quick Reference: Throughout this tutorial, you'll see these terms in action. Whenever you see X and y, remember: X = features (what you know), y = labels (what you want to predict). The pattern is always: fit() on training data, predict() on test data, score() to evaluate.
The ML Workflow
Every machine learning project follows these steps:
Standard ML Workflow
- Load data: Import from CSV, database, or API
- Explore: Visualize distributions, check for missing values
- Split: Separate into training and test sets
- Preprocess: Scale features, encode categoricals
- Choose model: Select algorithm based on problem type
- Train: Fit model on training data
- Evaluate: Test on held-out data
- Tune: Optimize hyperparameters
- Deploy: Save model for production use
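As a preview, here is a compact sketch covering steps 3 through 8 on the Iris dataset (exploration and deployment are covered in later sections):
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
# Load and split
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Preprocess + model in one pipeline
pipe = Pipeline([('scaler', StandardScaler()),
                 ('clf', LogisticRegression(max_iter=1000))])
# Tune a hyperparameter with cross-validation, then evaluate on held-out data
grid = GridSearchCV(pipe, {'clf__C': [0.1, 1.0, 10.0]}, cv=5)
grid.fit(X_train, y_train)
print(f"Best C: {grid.best_params_['clf__C']}")
print(f"Test accuracy: {grid.score(X_test, y_test):.3f}")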
What's Next: In the following sections, we'll learn each step of this workflow in detail—from preprocessing data to evaluating models. Then we'll put it all together with complete examples using Scikit-learn's built-in datasets.
Built-in Datasets (Quick Introduction)
Before diving into ML techniques, let's get familiar with Scikit-learn's built-in datasets. These are perfect for learning and experimentation—no need to download external files.
Available Datasets
- Classification: Iris (flowers), Wine (quality), Digits (handwritten), Breast Cancer (diagnosis)
- Regression: Diabetes (disease progression), California Housing (prices)
- Toy Datasets: Small, clean datasets perfect for quick experiments
- Real-world Data: Based on actual research and applications
Loading and Exploring Datasets
All dataset loading functions follow the same pattern and return a Bunch object (dictionary-like) with consistent structure:
Understanding X and y: In machine learning, we use X (uppercase) for features and y (lowercase) for labels. This convention comes from mathematics where X is a matrix (2D array) and y is a vector (1D array). Every code example follows this pattern.
# Import dataset loaders
from sklearn.datasets import load_iris, load_wine, load_digits
import pandas as pd # For DataFrame display
# Load the Iris dataset (most famous ML dataset)
iris = load_iris()
# Bunch object contains:
# - data: FEATURES (X) - the input measurements used for predictions
# - target: LABELS (y) - the output we want to predict
# - feature_names: descriptive names for each feature column
# - target_names: descriptive names for each class/category
# - DESCR: full description of dataset (origin, usage, references)
print("Keys in dataset:", iris.keys())
# Output (abridged): dict_keys(['data', 'target', 'feature_names', 'target_names', 'DESCR', ...])
# CRITICAL CONVENTION: Assign data to X (features) and target to y (labels)
X = iris.data # Features: measurements like sepal length, petal width
# Shape: (150 samples, 4 features) = 150 rows × 4 columns
y = iris.target # Labels: species class (0, 1, or 2)
# Shape: (150,) = one label per sample
print(f"Shape of features (X): {X.shape}") # (150, 4) - 2D array (matrix)
print(f"Shape of labels (y): {y.shape}") # (150,) - 1D array (vector)
print(f"Feature names: {iris.feature_names}")
# ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
print(f"Target names: {iris.target_names}")
# ['setosa', 'versicolor', 'virginica'] - the 3 species we're classifying
Think of it this way: Features (X) = "What do we know about each flower?" (measurements). Labels (y) = "What species is it?" (answer we're trying to predict). The model learns the relationship between X and y.
# Import for data exploration
import pandas as pd
import numpy as np
from sklearn.datasets import load_iris
# Load dataset
iris = load_iris()
# Convert to DataFrame for easy viewing
# Create DataFrame from feature matrix with feature names as columns
df = pd.DataFrame(iris.data, columns=iris.feature_names)
# Add target column with species names
df['species'] = iris.target_names[iris.target]
# Display first few rows
print(df.head())
# sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) species
# 0 5.1 3.5 1.4 0.2 setosa
# 1 4.9 3.0 1.4 0.2 setosa
# ...
# Statistical summary
print(df.describe())
# Shows count, mean, std, min, quartiles, max for each feature
# Check class distribution
print(df['species'].value_counts())
# setosa 50
# versicolor 50
# virginica 50
# Perfect balance—50 samples per class!
# Reading the full dataset description
from sklearn.datasets import load_iris
iris = load_iris()
# DESCR contains detailed information about the dataset
print(iris.DESCR)
# Includes:
# - Dataset characteristics (number of samples, features, classes)
# - Attribute information (feature descriptions)
# - Creator and source
# - References to relevant papers
# This is helpful for understanding what you're working with!
Key Takeaway: All Scikit-learn datasets use the same structure (data, target, feature_names, target_names, DESCR). Learn one, and you know them all! In later sections, we'll use these datasets to demonstrate complete ML workflows with classification, regression, and more.
Real-World Dataset Loaders
Scikit-learn provides several real-world datasets from published research and applications. These are perfect for learning ML techniques without downloading external data.
Classification Datasets
Iris Dataset (Multi-class Classification)
Classification
3 Classes
Description: Classic dataset containing measurements of 3 iris flower species. Most famous ML dataset, published by Ronald Fisher in 1936.
- Samples: 150 (50 per class)
- Features: 4 (sepal length/width, petal length/width in cm)
- Classes: 3 (Setosa, Versicolor, Virginica)
- Use Cases: Perfect for beginners, testing classification algorithms
from sklearn.datasets import load_iris
import pandas as pd
import numpy as np
# Load dataset
iris = load_iris()
X, y = iris.data, iris.target
print(f"Dataset shape: {X.shape}") # (150, 4)
print(f"Classes: {iris.target_names}") # ['setosa' 'versicolor' 'virginica']
print(f"Feature names: {iris.feature_names}")
# Class distribution
unique, counts = np.unique(y, return_counts=True)
print("\nClass distribution:")
for name, count in zip(iris.target_names, counts):
print(f" {name}: {count} samples")
# Convert to DataFrame for easy viewing
df = pd.DataFrame(X, columns=iris.feature_names)
df['species'] = iris.target_names[y]
print("\nFirst 5 samples:")
print(df.head())
Wine Dataset (Multi-class Classification)
Classification
3 Classes
Description: Chemical analysis of wines from Italy. Predict wine cultivar based on 13 chemical measurements.
- Samples: 178
- Features: 13 (alcohol, malic acid, ash, alkalinity, magnesium, phenols, etc.)
- Classes: 3 wine cultivars
- Use Cases: Feature scaling demos, multi-class classification
from sklearn.datasets import load_wine
import pandas as pd
# Load dataset
wine = load_wine()
X, y = wine.data, wine.target
print(f"Dataset shape: {X.shape}") # (178, 13)
print(f"\nFirst 3 feature names: {wine.feature_names[:3]}")
# ['alcohol', 'malic_acid', 'ash']
# Check feature ranges (important for scaling!)
df = pd.DataFrame(X, columns=wine.feature_names)
print("\nFeature statistics:")
print(df.describe()[['alcohol', 'malic_acid', 'proline']])
# Notice: features have very different scales!
# alcohol: 11-15, malic_acid: 0-6, proline: 278-1680
# This dataset needs scaling before use with distance-based algorithms
Digits Dataset (Image Classification)
Classification
10 Classes
Description: Handwritten digit images (0-9). Simplified version of MNIST with 8×8 grayscale images.
- Samples: 1,797
- Features: 64 (8×8 pixel intensities, values 0-16)
- Classes: 10 (digits 0-9)
- Use Cases: Image classification, neural networks, dimensionality reduction
from sklearn.datasets import load_digits
import matplotlib.pyplot as plt
import numpy as np
# Load dataset
digits = load_digits()
X, y = digits.data, digits.target
print(f"Dataset shape: {X.shape}") # (1797, 64)
print(f"Each sample: 8x8 = 64 pixel values")
print(f"Classes: {np.unique(y)}") # [0 1 2 3 4 5 6 7 8 9]
# Visualize first 10 digits
fig, axes = plt.subplots(2, 5, figsize=(12, 5))
for i, ax in enumerate(axes.flat):
# digits.images[i] holds the same data as X[i], already shaped 8×8
image = digits.images[i]
ax.imshow(image, cmap='gray_r') # reversed gray: ink (high values) rendered dark
ax.set_title(f"Label: {digits.target[i]}")
ax.axis('off')
plt.tight_layout()
plt.show()
# Pixel intensity range
print(f"\nPixel value range: {X.min():.0f} to {X.max():.0f}")
print("0 = white background, 16 = black digit")
Breast Cancer Dataset (Binary Classification)
Classification
2 Classes
Description: Features computed from breast mass images. Predict malignant vs benign tumors.
- Samples: 569
- Features: 30 (radius, texture, perimeter, area, smoothness, compactness, concavity, etc.)
- Classes: 2 (malignant=0, benign=1)
- Use Cases: Binary classification, medical diagnosis, feature importance analysis
from sklearn.datasets import load_breast_cancer
import pandas as pd
import numpy as np
# Load dataset
cancer = load_breast_cancer()
X, y = cancer.data, cancer.target
print(f"Dataset shape: {X.shape}") # (569, 30)
print(f"Classes: {cancer.target_names}") # ['malignant' 'benign']
# Class distribution
unique, counts = np.unique(y, return_counts=True)
for name, count in zip(cancer.target_names, counts):
print(f"{name}: {count} samples ({count/len(y)*100:.1f}%)")
# malignant: 212 (37.3%)
# benign: 357 (62.7%)
# Slightly imbalanced—benign tumors are more common
# Feature groups
print("\nFeature groups (first 5 of each):")
print("Mean features:", cancer.feature_names[:5])
print("SE features:", cancer.feature_names[10:15])
print("Worst features:", cancer.feature_names[20:25])
Regression Datasets
Diabetes Dataset (Regression)
Regression
Description: Predict disease progression one year after baseline. Classic medical regression dataset.
- Samples: 442
- Features: 10 (age, sex, BMI, blood pressure, 6 blood serum measurements)
- Target: Quantitative measure of disease progression
- Use Cases: Regression, feature selection, regularization demos
from sklearn.datasets import load_diabetes
import pandas as pd
import matplotlib.pyplot as plt
# Load dataset
diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target
print(f"Dataset shape: {X.shape}") # (442, 10)
print(f"Feature names: {diabetes.feature_names}")
# ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
print(f"\nTarget statistics:")
print(f" Min: {y.min():.1f}")
print(f" Max: {y.max():.1f}")
print(f" Mean: {y.mean():.1f}")
print(f" Std: {y.std():.1f}")
# Visualize target distribution
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.hist(y, bins=30, edgecolor='black', alpha=0.7)
plt.xlabel('Disease Progression')
plt.ylabel('Frequency')
plt.title('Target Distribution')
plt.subplot(1, 2, 2)
plt.scatter(X[:, 2], y, alpha=0.5) # BMI vs progression
plt.xlabel('BMI (normalized)')
plt.ylabel('Disease Progression')
plt.title('BMI vs Disease Progression')
plt.tight_layout()
plt.show()
California Housing Dataset (Regression)
Regression
Description: Predict median house prices in California districts. Based on 1990 census data.
- Samples: 20,640
- Features: 8 (median income, house age, avg rooms, avg bedrooms, population, avg occupancy, latitude, longitude)
- Target: Median house value (in $100,000s)
- Use Cases: Regression, spatial data, larger dataset for performance testing
from sklearn.datasets import fetch_california_housing
import pandas as pd
import numpy as np
# Load dataset (uses fetch_ because it downloads data)
housing = fetch_california_housing()
X, y = housing.data, housing.target
print(f"Dataset shape: {X.shape}") # (20640, 8)
print(f"Feature names: {housing.feature_names}")
# ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude']
print(f"\nTarget statistics (in $100,000s):")
print(f" Min: ${y.min()*100000:.0f}")
print(f" Max: ${y.max()*100000:.0f}")
print(f" Median: ${np.median(y)*100000:.0f}")
# Geographic distribution
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 6))
plt.scatter(X[:, 7], X[:, 6], c=y, cmap='viridis', alpha=0.3, s=10)
plt.colorbar(label='Median House Value ($100k)')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.title('California Housing Prices by Location')
plt.show()
# You can see the shape of California!
Dataset Loading Patterns: Most datasets use load_*() functions (data included with scikit-learn). Larger datasets like California Housing use fetch_*() (downloads data on first use and caches it).
Synthetic Dataset Generators
Scikit-learn provides powerful generators to create synthetic datasets with known properties. These are invaluable for testing algorithms, debugging models, and understanding how different patterns affect performance.
Classification Generators
make_classification()
Classification
Customizable
Purpose: Generate random n-class classification problems with configurable complexity.
Key Parameters:
n_samples: Number of samples to generate
n_features: Total number of features
n_informative: Features that are useful for prediction
n_redundant: Features that are linear combinations of informative features
n_classes: Number of classes (labels)
class_sep: Separation between classes (higher = easier)
random_state: Seed for reproducibility
from sklearn.datasets import make_classification
import matplotlib.pyplot as plt
import numpy as np
# Generate a simple 2-class problem
X, y = make_classification(
n_samples=1000, # 1000 data points
n_features=2, # 2 features (easy to visualize)
n_informative=2, # Both features are useful
n_redundant=0, # No redundant features
n_classes=2, # Binary classification
class_sep=1.5, # Moderate separation
random_state=42
)
print(f"X shape: {X.shape}") # (1000, 2)
print(f"y unique values: {np.unique(y)}") # [0 1]
print(f"Class 0: {np.sum(y==0)} samples")
print(f"Class 1: {np.sum(y==1)} samples")
# Visualize
plt.figure(figsize=(8, 6))
plt.scatter(X[y==0, 0], X[y==0, 1], label='Class 0', alpha=0.6, edgecolors='k')
plt.scatter(X[y==1, 0], X[y==1, 1], label='Class 1', alpha=0.6, edgecolors='k')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Generated Classification Data')
plt.legend()
plt.show()
from sklearn.datasets import make_classification
import numpy as np
# Generate a HARDER problem (more realistic)
X, y = make_classification(
n_samples=500,
n_features=20, # 20 total features
n_informative=15, # 15 are useful
n_redundant=3, # 3 are redundant (linear combos)
n_repeated=0, # No duplicated features
n_classes=3, # Multi-class problem
n_clusters_per_class=2, # Each class has 2 clusters
class_sep=0.8, # Classes overlap slightly (harder)
flip_y=0.05, # 5% label noise (realistic!)
random_state=42
)
print(f"Dataset shape: {X.shape}") # (500, 20)
print(f"Classes: {np.unique(y)}")
print("\nClass distribution:")
for cls in np.unique(y):
print(f" Class {cls}: {np.sum(y==cls)} samples")
# This dataset is perfect for testing:
# - Feature selection (which of 20 features matter?)
# - Handling label noise
# - Multi-class classification
make_blobs()
Clustering & Classification
Purpose: Generate isotropic Gaussian blobs (clusters). Perfect for clustering algorithm demos.
Key Parameters:
n_samples: Total samples (distributed across centers)
n_features: Number of features
centers: Number of centers/clusters or explicit center coordinates
cluster_std: Standard deviation of clusters (controls spread)
center_box: Bounding box for cluster centers
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
import numpy as np
# Generate 3 well-separated clusters
X, y = make_blobs(
n_samples=300, # 300 points total
n_features=2, # 2D for visualization
centers=3, # 3 cluster centers
cluster_std=0.5, # Tight clusters
random_state=42
)
print(f"X shape: {X.shape}") # (300, 2)
print(f"Cluster labels: {np.unique(y)}")
plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='viridis', edgecolors='k', alpha=0.7)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Generated Blobs (3 Clusters)')
plt.colorbar(label='Cluster')
plt.show()
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
import numpy as np
# Custom cluster centers (explicit positioning)
centers = np.array([
[0, 0], # Cluster 1 at origin
[5, 5], # Cluster 2 at (5, 5)
[0, 5] # Cluster 3 at (0, 5)
])
X, y = make_blobs(
n_samples=600,
centers=centers, # Use our custom centers
cluster_std=[0.4, 1.0, 0.7], # Different spread per cluster!
random_state=42
)
print(f"Generated {len(X)} samples")
print(f"Center 0 samples: {np.sum(y==0)}")
print(f"Center 1 samples: {np.sum(y==1)}")
print(f"Center 2 samples: {np.sum(y==2)}")
plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='viridis', alpha=0.6, edgecolors='k')
plt.scatter(centers[:, 0], centers[:, 1], marker='X', s=200, c='red', edgecolors='black', linewidths=2, label='Centers')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Blobs with Custom Centers and Variable Spread')
plt.legend()
plt.show()
make_moons() & make_circles()
Non-linear Classification
Purpose: Generate non-linearly separable datasets. Perfect for demonstrating kernel methods, neural networks, and testing linear vs non-linear classifiers.
Key Parameters:
n_samples: Number of samples to generate
noise: Standard deviation of Gaussian noise (0.0 = perfect, 0.1 = realistic)
random_state: Seed for reproducibility
factor (circles only): Scale factor between inner and outer circle
from sklearn.datasets import make_moons, make_circles
import matplotlib.pyplot as plt
# Generate moons
X_moons, y_moons = make_moons(n_samples=300, noise=0.1, random_state=42)
# Generate circles
X_circles, y_circles = make_circles(n_samples=300, noise=0.05, factor=0.5, random_state=42)
# Visualize both
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
axes[0].scatter(X_moons[:, 0], X_moons[:, 1], c=y_moons, cmap='viridis', edgecolors='k', alpha=0.7)
axes[0].set_title('Moons Dataset (Non-linear Boundary)')
axes[0].set_xlabel('Feature 1')
axes[0].set_ylabel('Feature 2')
axes[1].scatter(X_circles[:, 0], X_circles[:, 1], c=y_circles, cmap='viridis', edgecolors='k', alpha=0.7)
axes[1].set_title('Circles Dataset (Concentric Classes)')
axes[1].set_xlabel('Feature 1')
axes[1].set_ylabel('Feature 2')
plt.tight_layout()
plt.show()
print("These datasets CANNOT be separated by a straight line!")
print("Linear classifiers (like Logistic Regression) will fail.")
print("Non-linear models (SVM with RBF kernel, neural nets) will succeed.")
Regression Generators
make_regression()
Regression
Purpose: Generate random regression problems with known ground truth.
Key Parameters:
n_samples: Number of samples
n_features: Total features
n_informative: Useful features (others are noise)
noise: Standard deviation of Gaussian noise
bias: Constant term added to output
coef: If True, returns true coefficients
from sklearn.datasets import make_regression
import matplotlib.pyplot as plt
import numpy as np
# Simple 1D regression for visualization
X, y = make_regression(
n_samples=200,
n_features=1, # Single feature (easy to plot)
noise=20, # Add realistic noise
random_state=42
)
print(f"X shape: {X.shape}") # (200, 1)
print(f"y shape: {y.shape}") # (200,)
print(f"y range: [{y.min():.1f}, {y.max():.1f}]")
plt.figure(figsize=(8, 6))
plt.scatter(X, y, alpha=0.6, edgecolors='k')
plt.xlabel('Feature')
plt.ylabel('Target')
plt.title('Generated Regression Data (1 feature)')
plt.show()
from sklearn.datasets import make_regression
import numpy as np
# Multi-feature regression with known coefficients
X, y, coef = make_regression(
n_samples=500,
n_features=10, # 10 total features
n_informative=5, # Only 5 actually matter
noise=10,
shuffle=False, # Don't shuffle feature columns, so the informative ones stay first
coef=True, # Return true coefficients
random_state=42
)
print(f"Dataset shape: {X.shape}") # (500, 10)
print(f"\nTrue coefficients (first 5):")
print(coef[:5])
print("\nNon-informative features have coefficients ˜ 0:")
print(coef[5:])
print("\nUse this to test if your model identifies important features!")
Advanced Generators
Other Useful Generators
- make_multilabel_classification(): Multi-label problems (each sample has multiple labels)
- make_hastie_10_2(): Binary classification with 10 features (from Elements of Statistical Learning)
- make_gaussian_quantiles(): Gaussian distributions divided into quantiles
- make_swiss_roll(): 3D manifold for dimensionality reduction demos
- make_s_curve(): S-shaped 3D manifold
- make_low_rank_matrix(): Low-rank matrices for matrix factorization
from sklearn.datasets import make_multilabel_classification
import numpy as np
# Multi-label classification (e.g., image tags: "cat", "outdoor", "sunny")
X, y = make_multilabel_classification(
n_samples=200,
n_features=10,
n_classes=5, # 5 possible labels
n_labels=2, # Each sample has ~2 labels on average
random_state=42
)
print(f"X shape: {X.shape}") # (200, 10)
print(f"y shape: {y.shape}") # (200, 5) - binary matrix!
print(f"\nFirst sample features: {X[0]}")
print(f"First sample labels: {y[0]}")
# [1 0 1 0 0] means this sample has labels 0 and 2
from sklearn.datasets import make_swiss_roll
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import numpy as np
# Generate Swiss roll (3D manifold)
X, color = make_swiss_roll(n_samples=1500, noise=0.1, random_state=42)
print(f"X shape: {X.shape}") # (1500, 3)
print("This is a 3D dataset that lies on a 2D manifold!")
print("Perfect for testing dimensionality reduction (t-SNE, PCA, Isomap)")
# Visualize in 3D
fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=color, cmap='viridis', s=10)
ax.set_title('Swiss Roll 3D Manifold')
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_zlabel('Z')
plt.show()
When to Use Synthetic Data:
- Algorithm Testing: Verify your implementation works on known data
- Performance Comparison: Compare algorithms on controlled problems
- Debugging: Start simple (make_blobs) before tackling real data
- Education: Demonstrate concepts (e.g., non-linear boundaries with make_moons)
- Scaling Tests: Generate large datasets to test performance
Important: Always set random_state for reproducibility! This ensures you get the same dataset across runs, making debugging and comparison easier.
Data Preprocessing
Preprocessing transforms raw data into a format suitable for ML algorithms. This is critical—garbage in, garbage out.
Feature Scaling
Many algorithms (SVM, neural networks, k-NN) require features on similar scales:
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
# Load sample data
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# StandardScaler: mean=0, std=1
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test) # Use training stats!
print("Scaled training data (first 2 samples):")
print(X_scaled[:2])
print(f"Mean: {X_scaled.mean():.3f}, Std: {X_scaled.std():.3f}")
# MinMaxScaler: scale to [0, 1]
minmax = MinMaxScaler()
X_minmax = minmax.fit_transform(X_train)
print("\nMinMax scaled data (first 2 samples):")
print(X_minmax[:2])
print(f"Min: {X_minmax.min():.3f}, Max: {X_minmax.max():.3f}")
Encoding Categorical Variables
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import numpy as np
# LabelEncoder: convert strings to integers
le = LabelEncoder()
animal_names = ['cat', 'dog', 'cat', 'bird']
y_encoded = le.fit_transform(animal_names) # [1, 2, 1, 0] - classes sorted alphabetically: bird=0, cat=1, dog=2
print("LabelEncoder mapping:")
for i, label in enumerate(le.classes_):
print(f" {label}: {i}")
print(f"Encoded result: {y_encoded}")
# OneHotEncoder: create binary columns
ohe = OneHotEncoder(sparse_output=False) # sparse_output replaces sparse in newer scikit-learn
colors = np.array(['red', 'blue', 'green']).reshape(-1, 1)
encoded = ohe.fit_transform(colors)
print("\nOneHotEncoder result:")
print(encoded)
print(f"Feature names: {ohe.get_feature_names_out(['color'])}")
Critical: Always fit() on training data, then transform() on both train and test. Never fit() on test data—this causes data leakage!
Classification Models
Classification predicts categorical outcomes (spam/not spam, disease/healthy, customer churn).
Common Classifiers
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
# Load sample data
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Logistic Regression (linear boundary)
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train, y_train)
print(f"Logistic Regression accuracy: {log_reg.score(X_test, y_test):.3f}")
# Random Forest (ensemble of decision trees)
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
print(f"Random Forest accuracy: {rf.score(X_test, y_test):.3f}")
# Support Vector Machine (complex boundaries)
svm = SVC(kernel='rbf')
svm.fit(X_train, y_train)
print(f"SVM accuracy: {svm.score(X_test, y_test):.3f}")
# k-Nearest Neighbors (instance-based)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print(f"KNN accuracy: {knn.score(X_test, y_test):.3f}")
Choosing an Algorithm
- Logistic Regression: Fast, interpretable, works well for linearly separable data
- Random Forest: Handles non-linear relationships, robust to outliers, good default choice
- SVM: Powerful for complex boundaries, sensitive to feature scaling
- k-NN: Simple, no training phase, good for small datasets
Rule of thumb: Start with Logistic Regression (fast baseline), then try Random Forest if you need more complexity.
Regression Models
Regression predicts continuous values (house prices, temperatures, sales).
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
# Generate synthetic data
from sklearn.datasets import make_regression
X_reg, y_reg = make_regression(n_samples=200, n_features=3, noise=10, random_state=42)
X_train_r, X_test_r, y_train_r, y_test_r = train_test_split(X_reg, y_reg, test_size=0.2, random_state=42)
# Linear Regression
lin_reg = LinearRegression()
lin_reg.fit(X_train_r, y_train_r)
y_pred = lin_reg.predict(X_test_r)
print(f"Linear - RMSE: {mean_squared_error(y_test_r, y_pred):.2f}")
print(f"Linear - R²: {r2_score(y_test_r, y_pred):.3f}")
# Ridge Regression (with L2 regularization)
ridge = Ridge(alpha=1.0)
ridge.fit(X_train_r, y_train_r)
y_pred_ridge = ridge.predict(X_test_r)
print(f"Ridge - R²: {r2_score(y_test_r, y_pred_ridge):.3f}")
# Random Forest Regressor
rf_reg = RandomForestRegressor(n_estimators=100, random_state=42)
rf_reg.fit(X_train_r, y_train_r)
print(f"RF - R²: {r2_score(y_test_r, rf_reg.predict(X_test_r)):.3f}")
Clustering
Clustering finds groups in unlabeled data (customer segmentation, anomaly detection).
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
# Generate blob data
from sklearn.datasets import make_blobs
X_blob, y_true = make_blobs(n_samples=300, centers=4, cluster_std=1.0, random_state=42)
# K-Means clustering
kmeans = KMeans(n_clusters=4, random_state=42)
labels = kmeans.fit_predict(X_blob)
print(f"Cluster centers: {kmeans.cluster_centers_.shape}")
print(f"Silhouette score: {silhouette_score(X_blob, labels):.3f}")
# Visualize (assuming 2D data)
import matplotlib.pyplot as plt
plt.scatter(X_blob[:, 0], X_blob[:, 1], c=labels, cmap='viridis')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
marker='X', s=200, c='red', label='Centroids')
plt.legend()
plt.title('K-Means Clustering')
plt.show()
Silhouette Score: Ranges from -1 to 1. Values near 1 indicate well-separated clusters, values near 0 indicate overlapping clusters, and negative values suggest samples may be assigned to the wrong cluster.
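Because higher silhouette means better-separated clusters, the score can also guide the choice of k; a short sketch reusing X_blob, KMeans, and silhouette_score from the example above:
# Scan candidate cluster counts and compare silhouette scores
for k in range(2, 7):
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels_k = km.fit_predict(X_blob)
    print(f"k={k}: silhouette = {silhouette_score(X_blob, labels_k):.3f}")
# The true number of centers (4) should score at or near the top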
Model Evaluation
Classification Metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
# Load data
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# Metrics
print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")
print(f"Precision: {precision_score(y_test, y_pred, average='weighted'):.3f}")
print(f"Recall: {recall_score(y_test, y_pred, average='weighted'):.3f}")
print(f"F1 Score: {f1_score(y_test, y_pred, average='weighted'):.3f}")
# Confusion Matrix
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))
# Classification Report (comprehensive)
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
Cross-Validation
A single train/test split can be misleading—your model might get "lucky" or "unlucky" depending on which samples end up in the test set. Cross-validation solves this by testing your model on multiple different splits of the data.
What is Cross-Validation?
Cross-validation is a technique that evaluates model performance by:
- Splitting data into K folds (e.g., 5 equal parts)
- Training K times: Each time, use K-1 folds for training and 1 fold for testing
- Rotating the test fold: Every fold gets to be the test set exactly once
- Averaging results: Final score is the mean of all K test scores
This gives a more reliable estimate of how your model will perform on unseen data, using all your data for both training and testing (but never at the same time).
Why Cross-Validation Matters
- Reduces Variance: Single split might be lucky/unlucky—CV averages over multiple splits
- Uses All Data: Every sample is used for both training and testing (in different iterations)
- Detects Overfitting: Large gap between train and CV scores indicates overfitting (see the sketch after this list)
- Model Selection: Compare different models or hyperparameters fairly
- Small Dataset Friendly: Maximizes use of limited data (unlike holding out large test set)
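One way to surface that overfitting signal is cross_validate with return_train_score=True; a large train/CV gap is the warning sign (a sketch on the Iris data):
from sklearn.model_selection import cross_validate
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(n_estimators=100, random_state=42)
results = cross_validate(model, X, y, cv=5, return_train_score=True)
print(f"Train score: {results['train_score'].mean():.3f}") # often ~1.0 for forests
print(f"CV score: {results['test_score'].mean():.3f}")     # lower -> some overfitting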
K-Fold Cross-Validation Example
Let's see how 5-fold CV works step-by-step:
from sklearn.model_selection import cross_val_score, KFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
# Load data
iris = load_iris()
X, y = iris.data, iris.target
# Create model
model = RandomForestClassifier(n_estimators=100, random_state=42)
# 5-fold cross-validation
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print(f"CV scores: {scores}")
print(f"Mean: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")
# Custom K-Fold
kf = KFold(n_splits=10, shuffle=True, random_state=42)
fold_scores = []
for i, (train_idx, test_idx) in enumerate(kf.split(X)):
X_train_fold, X_test_fold = X[train_idx], X[test_idx]
y_train_fold, y_test_fold = y[train_idx], y[test_idx]
model.fit(X_train_fold, y_train_fold)
fold_score = model.score(X_test_fold, y_test_fold)
fold_scores.append(fold_score)
print(f"Fold {i+1} accuracy: {fold_score:.3f}")
print(f"Mean fold accuracy: {sum(fold_scores)/len(fold_scores):.3f}")
Understanding the CV Process
5-Fold Cross-Validation Breakdown
With 150 samples divided into 5 folds (30 samples each):
- Fold 1: Train on folds 2-5 (120 samples), test on fold 1 (30 samples) → Score 1
- Fold 2: Train on folds 1,3-5 (120 samples), test on fold 2 (30 samples) → Score 2
- Fold 3: Train on folds 1-2,4-5 (120 samples), test on fold 3 (30 samples) → Score 3
- Fold 4: Train on folds 1-3,5 (120 samples), test on fold 4 (30 samples) → Score 4
- Fold 5: Train on folds 1-4 (120 samples), test on fold 5 (30 samples) → Score 5
Final Score: Average of all 5 scores ± standard deviation (shows variability)
Key Insight: Every sample is used for testing exactly once, and for training 4 times!
Cross-Validation Strategies
from sklearn.model_selection import (
KFold, # Standard k-fold
StratifiedKFold, # Maintains class distribution in each fold
ShuffleSplit, # Random train/test splits
LeaveOneOut # Each sample is test set once (n_splits = n_samples)
)
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
import numpy as np
# Load data
iris = load_iris()
X, y = iris.data, iris.target
model = RandomForestClassifier(n_estimators=100, random_state=42)
# 1. Standard KFold - splits data sequentially
kf = KFold(n_splits=5, shuffle=False)
print("KFold (no shuffle):")
for i, (train_idx, test_idx) in enumerate(kf.split(X)):
print(f" Fold {i+1}: Train size={len(train_idx)}, Test size={len(test_idx)}")
# 2. Shuffled KFold - randomizes before splitting (recommended!)
kf_shuffle = KFold(n_splits=5, shuffle=True, random_state=42)
from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, X, y, cv=kf_shuffle, scoring='accuracy')
print(f"\nShuffled KFold: {scores.mean():.3f} ± {scores.std():.3f}")
# 3. StratifiedKFold - maintains class proportions (best for classification!)
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores_strat = cross_val_score(model, X, y, cv=skf, scoring='accuracy')
print(f"Stratified KFold: {scores_strat.mean():.3f} ± {scores_strat.std():.3f}")
# Check class distribution in folds
print("\nClass distribution in each fold:")
for i, (train_idx, test_idx) in enumerate(skf.split(X, y)):
test_classes = np.bincount(y[test_idx])
print(f" Fold {i+1} test set: {test_classes} (balanced!)")
# 4. LeaveOneOut - extreme CV, very slow but uses maximum data
from sklearn.model_selection import LeaveOneOut
loo = LeaveOneOut()
print(f"\nLeaveOneOut: {loo.get_n_splits(X)} splits (one per sample)")
# Too slow for large datasets, but useful for tiny datasets (< 100 samples)
Best Practice for Classification: Always use StratifiedKFold instead of regular KFold. It ensures each fold has the same class distribution as the original dataset, preventing biased folds (e.g., a fold with mostly one class).
Cross-Validation Parameters Explained
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
iris = load_iris()
X, y = iris.data, iris.target
model = RandomForestClassifier(n_estimators=100, random_state=42)
# cross_val_score(estimator, X, y, cv=5, scoring=None, n_jobs=None)
# Parameters:
# estimator: model - The model to evaluate
# X: array - Features
# y: array - Target labels
# cv: int or CV object - Number of folds or CV strategy (default: 5)
# scoring: str - Metric to use ('accuracy', 'f1', 'roc_auc', etc.)
# n_jobs: int - Number of parallel jobs (-1 = use all cores)
# Example: F1 score with 10-fold CV using all CPU cores
scores_f1 = cross_val_score(
model, X, y,
cv=10, # 10-fold cross-validation
scoring='f1_macro', # F1 score (macro-averaged for multi-class)
n_jobs=-1 # Use all available CPU cores
)
print(f"10-fold CV F1: {scores_f1.mean():.3f} ± {scores_f1.std():.3f}")
print(f"Individual fold scores: {scores_f1}")
Common Cross-Validation Mistakes
Avoid These Pitfalls
- ❌ Fitting preprocessing on entire dataset before CV: Causes data leakage!
# WRONG - scaling before CV leaks info from test folds into training
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X) # Uses info from ALL data
scores = cross_val_score(model, X_scaled, y, cv=5) # Training folds were scaled using test-fold statistics!
- ✅ Use pipelines to ensure preprocessing happens inside each fold:
# CORRECT - scaling happens separately for each fold
from sklearn.pipeline import Pipeline
pipeline = Pipeline([
('scaler', StandardScaler()),
('model', RandomForestClassifier())
])
scores = cross_val_score(pipeline, X, y, cv=5) # Scaling done per fold!
- ❌ Using KFold instead of StratifiedKFold for classification: Can create imbalanced folds
- ❌ Too many folds on small datasets: Each fold has too few test samples (high variance)
- ❌ Too few folds on large datasets: Wastes data and underestimates performance
- ❌ Not shuffling data before KFold: If data is ordered by class, folds will be biased
Metric Selection
- Accuracy: Good for balanced datasets
- Precision: Important when false positives are costly (spam detection)
- Recall: Critical when false negatives are costly (disease detection)
- F1 Score: Harmonic mean of precision and recall—good for imbalanced data
- ROC-AUC: Measures model's ability to distinguish classes (0.5 = random, 1.0 = perfect)
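ROC-AUC is the only metric in this list not demonstrated elsewhere in this tutorial; here is a short sketch on the binary Breast Cancer dataset (the score shown in the comment is approximate):
from sklearn.metrics import roc_auc_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
# ROC-AUC needs probability scores, not hard class predictions
y_proba = model.predict_proba(X_test)[:, 1] # probability of the positive class
print(f"ROC-AUC: {roc_auc_score(y_test, y_proba):.3f}") # typically ~0.99 on this easy dataset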
Pipelines & Automation
Pipelines chain preprocessing and modeling steps, preventing data leakage and simplifying code. They ensure transformations are applied consistently and in the correct order.
Why Pipelines?
- Prevent Data Leakage: Ensure test data never influences preprocessing (fit only on train)
- Reproducibility: Same transformations applied to train, validation, and test sets
- Cleaner Code: Replace dozens of lines with a single pipeline.fit()
- Easy Cross-Validation: Pass entire pipeline to cross_val_score()
- Hyperparameter Tuning: Tune preprocessing and model parameters together in GridSearchCV
Pipeline Constructor & Basic Usage
The Pipeline constructor takes a list of (name, transformer) tuples. All steps except the last must be transformers (have fit/transform methods). The last step can be a transformer or estimator.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
# Load sample data
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create pipeline: list of (name, object) tuples
# Names are arbitrary but should be descriptive
pipeline = Pipeline([
('scaler', StandardScaler()), # Step 1: Scale features
('classifier', RandomForestClassifier(n_estimators=100, random_state=42)) # Step 2: Classify
])
# Alternative syntax using make_pipeline (auto-generates names)
from sklearn.pipeline import make_pipeline
pipeline_auto = make_pipeline(
StandardScaler(), # Name: 'standardscaler'
RandomForestClassifier(n_estimators=100, random_state=42) # Name: 'randomforestclassifier'
)
# Fit entire pipeline: fits scaler on X_train, transforms X_train, then fits classifier
pipeline.fit(X_train, y_train)
# Predict: transforms X_test using fitted scaler, then predicts using fitted classifier
y_pred = pipeline.predict(X_test)
print(f"Pipeline accuracy: {pipeline.score(X_test, y_test):.3f}")
# Cross-validate entire pipeline (CORRECT way - no data leakage)
scores = cross_val_score(pipeline, X, y, cv=5)
print(f"CV mean: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")
Pipeline Methods & Attributes
Pipelines expose the same methods as the final estimator, plus additional pipeline-specific functionality:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
# Load data
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create multi-step pipeline
pipeline = Pipeline([
('scaler', StandardScaler()),
('pca', PCA(n_components=2)),
('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])
# Key Pipeline Methods:
# 1. fit(X, y) - Fit all transformers and final estimator
pipeline.fit(X_train, y_train)
# 2. predict(X) - Transform data through pipeline and predict
predictions = pipeline.predict(X_test)
# 3. predict_proba(X) - Get class probabilities (if final estimator supports it)
probabilities = pipeline.predict_proba(X_test)
print(f"Class probabilities shape: {probabilities.shape}") # (n_samples, n_classes)
# 4. score(X, y) - Transform X and score using final estimator
accuracy = pipeline.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.3f}")
# 5. fit_transform(X, y) - Fit pipeline and return transformed X (uses all steps)
X_transformed = Pipeline([
('scaler', StandardScaler()),
('pca', PCA(n_components=2))
]).fit_transform(X_train)
print(f"Transformed shape: {X_transformed.shape}") # (120, 2) - reduced to 2 components
# 6. Access individual steps using named_steps attribute
print(f"\nStep names: {pipeline.named_steps.keys()}")
print(f"Scaler mean: {pipeline.named_steps['scaler'].mean_}")
print(f"PCA explained variance: {pipeline.named_steps['pca'].explained_variance_ratio_}")
print(f"Classifier feature importances: {pipeline.named_steps['classifier'].feature_importances_}")
# 7. Access steps by index
print(f"\nFirst step: {pipeline.steps[0]}") # ('scaler', StandardScaler(...))
print(f"Second step name: {pipeline.steps[1][0]}") # 'pca'
print(f"Last step (estimator): {pipeline[-1]}") # RandomForestClassifier(...)
# 8. Get parameters of any step (useful for GridSearchCV)
print(f"\nAll pipeline parameters:")
print(pipeline.get_params().keys())
Modifying Pipeline Steps
You can set parameters of individual steps using the step_name__parameter syntax:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.datasets import load_iris
# Load data
iris = load_iris()
X, y = iris.data, iris.target
# Create pipeline
pipeline = Pipeline([
('scaler', StandardScaler()),
('pca', PCA(n_components=2)),
('svc', SVC(kernel='rbf'))
])
# Set parameters using double underscore notation
# Format: step_name__parameter_name
pipeline.set_params(
pca__n_components=3, # Change PCA components from 2 to 3
svc__C=10.0, # Set SVM regularization
svc__kernel='linear' # Change kernel from 'rbf' to 'linear'
)
print("Updated parameters:")
print(f"PCA components: {pipeline.named_steps['pca'].n_components}")
print(f"SVC kernel: {pipeline.named_steps['svc'].kernel}")
print(f"SVC C: {pipeline.named_steps['svc'].C}")
# This syntax is crucial for GridSearchCV
from sklearn.model_selection import GridSearchCV
param_grid = {
'pca__n_components': [2, 3, 4],
'svc__C': [0.1, 1, 10],
'svc__kernel': ['linear', 'rbf']
}
grid = GridSearchCV(pipeline, param_grid, cv=5)
grid.fit(X, y)
print(f"\nBest parameters: {grid.best_params_}")
print(f"Best CV score: {grid.best_score_:.3f}")
ColumnTransformer for Mixed Data Types
ColumnTransformer applies different preprocessing to different columns (e.g., scaling numeric, encoding categorical):
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
import numpy as np
# Create sample mixed data
data = pd.DataFrame({
'age': [25, 30, np.nan, 45, 50],
'income': [50000, 60000, 55000, np.nan, 80000],
'city': ['NY', 'LA', 'NY', 'SF', 'LA'],
'gender': ['M', 'F', 'F', 'M', 'F'],
'purchased': [0, 1, 0, 1, 1]
})
X = data.drop('purchased', axis=1)
y = data['purchased']
# Define column groups
numeric_features = ['age', 'income']
categorical_features = ['city', 'gender']
# Create preprocessing pipelines for each data type
numeric_transformer = Pipeline([
('imputer', SimpleImputer(strategy='median')), # Fill missing values
('scaler', StandardScaler()) # Scale features
])
categorical_transformer = Pipeline([
('imputer', SimpleImputer(strategy='most_frequent')), # Fill missing categories
('onehot', OneHotEncoder(handle_unknown='ignore')) # One-hot encode
])
# Combine transformers using ColumnTransformer
# transformers: list of (name, transformer, columns) tuples
preprocessor = ColumnTransformer(
transformers=[
('num', numeric_transformer, numeric_features),
('cat', categorical_transformer, categorical_features)
],
remainder='drop' # Options: 'drop' (default), 'passthrough', or a transformer
)
# Create full pipeline with preprocessor and model
full_pipeline = Pipeline([
('preprocessor', preprocessor),
('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])
# Fit and predict (single call handles everything!)
full_pipeline.fit(X, y)
predictions = full_pipeline.predict(X)
print(f"Predictions: {predictions}")
# Access transformed column names
# Note: OneHotEncoder creates multiple columns
preprocessor.fit(X)
print(f"\nTransformed feature names:")
print(preprocessor.get_feature_names_out())
ColumnTransformer with Remainder
Control what happens to columns not specified in transformers:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
import pandas as pd
import numpy as np
# Sample data with extra columns
X = pd.DataFrame({
'age': [25, 30, 35],
'income': [50000, 60000, 70000],
'city': ['NY', 'LA', 'SF'],
'id': [101, 102, 103], # Not used in transformers
'visits': [3, 7, 5] # Not used in transformers either
})
# Option 1: Drop unspecified columns (default)
ct_drop = ColumnTransformer([
('scale', StandardScaler(), ['age', 'income']),
('encode', OneHotEncoder(), ['city'])
], remainder='drop') # 'id' and 'visits' will be dropped
# Option 2: Pass through unspecified columns unchanged
ct_passthrough = ColumnTransformer([
('scale', StandardScaler(), ['age', 'income']),
('encode', OneHotEncoder(), ['city'])
], remainder='passthrough') # 'id' and 'visits' kept as-is
# Option 3: Apply a transformer to remaining columns
ct_custom = ColumnTransformer([
('scale', StandardScaler(), ['age', 'income']),
('encode', OneHotEncoder(), ['city'])
], remainder=StandardScaler()) # Scale the remaining numeric columns ('id', 'visits')
X_drop = ct_drop.fit_transform(X)
X_pass = ct_passthrough.fit_transform(X)
X_custom = ct_custom.fit_transform(X)
print(f"Drop: {X_drop.shape}") # (3, 5) - age, income, city_NY, city_LA, city_SF
print(f"Passthrough: {X_pass.shape}") # (3, 7) - adds id and timestamp
print(f"Custom: {X_custom.shape}") # (3, 7) - scales id and timestamp too
FeatureUnion: Parallel Feature Extraction
FeatureUnion runs multiple transformers in parallel and concatenates results (useful for combining different feature extraction methods):
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import numpy as np
# Load data
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create parallel feature transformations
feature_union = FeatureUnion([
('pca', PCA(n_components=2)), # Extract 2 principal components
('poly', PolynomialFeatures(degree=2, include_bias=False)) # Add polynomial features
])
# Combine with classifier in pipeline
pipeline = Pipeline([
('scaler', StandardScaler()), # Scale original features
('features', feature_union), # Create 2 PCA + polynomial features in parallel
('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])
pipeline.fit(X_train, y_train)
# Check combined feature count
# Original: 4 features
# PCA: 2 features
# Polynomial (degree=2 on 4 features): 14 features (4 + 6 interactions + 4 squares)
# Total: 2 + 14 = 16 features
X_transformed = pipeline.named_steps['features'].transform(
pipeline.named_steps['scaler'].transform(X_train[:1])
)
print(f"Original features: {X_train.shape[1]}") # 4
print(f"After FeatureUnion: {X_transformed.shape[1]}") # 16
print(f"Accuracy: {pipeline.score(X_test, y_test):.3f}")
Custom Transformers
Create custom transformers by implementing fit() and transform() methods:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import numpy as np
# Custom transformer: Log transform features
class LogTransformer(BaseEstimator, TransformerMixin):
"""Apply log(x + 1) transformation to avoid log(0)"""
def fit(self, X, y=None):
# No fitting needed for log transform
return self
def transform(self, X):
# Apply log(x + 1) element-wise
return np.log1p(X)
# Custom transformer: Feature selector based on variance
class VarianceThreshold(BaseEstimator, TransformerMixin):
"""Remove low-variance features"""
def __init__(self, threshold=0.1):
self.threshold = threshold
def fit(self, X, y=None):
# Calculate variance of each feature
self.variances_ = np.var(X, axis=0)
# Store mask of features to keep
self.mask_ = self.variances_ > self.threshold
return self
def transform(self, X):
# Keep only high-variance features
return X[:, self.mask_]
# Load data
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Use custom transformers in pipeline
pipeline = Pipeline([
('log', LogTransformer()), # Custom: Log transform
('var_filter', VarianceThreshold(threshold=0.5)), # Custom: Remove low-variance
('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])
pipeline.fit(X_train, y_train)
print(f"Original features: {X_train.shape[1]}") # 4
print(f"After variance filter: {np.sum(pipeline.named_steps['var_filter'].mask_)}") # e.g., 3
print(f"Accuracy: {pipeline.score(X_test, y_test):.3f}")
Real-World Pipeline Example
Complete Production Pipeline
A typical ML pipeline for mixed data with feature engineering, selection, and tuning:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
import pandas as pd
import numpy as np
# Sample dataset
np.random.seed(42)
data = pd.DataFrame({
'age': np.random.randint(20, 70, 100),
'income': np.random.randint(30000, 120000, 100),
'city': np.random.choice(['NY', 'LA', 'SF', 'CHI'], 100),
'education': np.random.choice(['HS', 'BS', 'MS', 'PhD'], 100),
'target': np.random.randint(0, 2, 100)
})
X = data.drop('target', axis=1)
y = data['target']
# Define preprocessing for different column types
numeric_features = ['age', 'income']
categorical_features = ['city', 'education']
numeric_transformer = Pipeline([
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler())
])
categorical_transformer = Pipeline([
('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])
# Combine preprocessors
preprocessor = ColumnTransformer([
('num', numeric_transformer, numeric_features),
('cat', categorical_transformer, categorical_features)
])
# Full pipeline with feature selection and model
full_pipeline = Pipeline([
('preprocessor', preprocessor),
('feature_selection', SelectKBest(f_classif, k=5)),
('classifier', RandomForestClassifier(random_state=42))
])
# Hyperparameter grid for tuning
param_grid = {
'feature_selection__k': [3, 5, 'all'],
'classifier__n_estimators': [50, 100, 200],
'classifier__max_depth': [5, 10, None]
}
# Grid search with cross-validation
grid_search = GridSearchCV(
full_pipeline,
param_grid,
cv=5,
scoring='accuracy',
n_jobs=-1
)
grid_search.fit(X, y)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.3f}")
print(f"\nPipeline steps:")
for name, step in grid_search.best_estimator_.named_steps.items():
print(f" {name}: {step}")
Pipeline Best Practices:
- Always fit on training data only: Never fit transformers on test data
- Use pipelines with cross-validation: Prevents data leakage across folds
- Name steps descriptively: Makes debugging and parameter access easier
- Combine with GridSearchCV: Tune preprocessing and model together
- Save entire pipeline: Use joblib.dump(pipeline, 'model.pkl') for deployment (see the sketch after this list)
- Custom transformers inherit BaseEstimator + TransformerMixin: Ensures compatibility
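To make the persistence point concrete, here is a minimal sketch. It assumes the grid_search object fitted in the example above; the file name full_pipeline.joblib is an arbitrary choice:
import joblib
# Save the winning pipeline (preprocessing + feature selection + model) as one artifact
joblib.dump(grid_search.best_estimator_, 'full_pipeline.joblib')
# Later, reload and predict on raw, unprocessed rows; the pipeline applies
# imputation, scaling, encoding, and selection internally
loaded = joblib.load('full_pipeline.joblib')
print(loaded.predict(X[:5]))  # Predictions for the first 5 raw samples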
Hyperparameter Tuning
Hyperparameters control model behavior (learning rate, tree depth, etc.). Tuning finds optimal values.
Grid Search
from sklearn.model_selection import GridSearchCV
# Define parameter grid
param_grid = {
'n_estimators': [50, 100, 200],
'max_depth': [None, 10, 20, 30],
'min_samples_split': [2, 5, 10]
}
# Grid search with cross-validation
grid = GridSearchCV(
RandomForestClassifier(random_state=42),
param_grid,
cv=5,
scoring='accuracy',
n_jobs=-1 # Use all CPU cores
)
grid.fit(X_train, y_train)
print(f"Best params: {grid.best_params_}")
print(f"Best CV score: {grid.best_score_:.3f}")
print(f"Test score: {grid.score(X_test, y_test):.3f}")
Random Search (Faster)
from sklearn.model_selection import RandomizedSearchCV
# Random search samples random combinations
param_dist = {
'n_estimators': [50, 100, 150, 200, 250, 300],
'max_depth': [None, 5, 10, 15, 20, 25, 30],
'min_samples_split': [2, 5, 10, 15]
}
random_search = RandomizedSearchCV(
RandomForestClassifier(random_state=42),
param_dist,
n_iter=20, # Try 20 random combinations
cv=5,
random_state=42,
n_jobs=-1
)
random_search.fit(X_train, y_train)
print(f"Best params: {random_search.best_params_}")
Grid vs Random: Grid Search is exhaustive but slow, and its cost multiplies with every parameter you add. Random Search tries a fixed number of sampled combinations and often finds comparably good parameters in a fraction of the time. For large search spaces, run Random Search first to locate a promising region, then fine-tune with a narrow Grid Search (sketched below).
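Here is a minimal sketch of that two-stage pattern, assuming the same X_train and y_train as above; the parameter ranges are illustrative:
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
# Stage 1: coarse random search over a wide space
coarse = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    {'n_estimators': [50, 100, 200, 300], 'max_depth': [None, 5, 10, 20, 30]},
    n_iter=10, cv=5, random_state=42, n_jobs=-1
)
coarse.fit(X_train, y_train)
best_n = coarse.best_params_['n_estimators']
# Stage 2: narrow grid search around the best coarse value
fine = GridSearchCV(
    RandomForestClassifier(random_state=42),
    {'n_estimators': [max(10, best_n - 50), best_n, best_n + 50]},
    cv=5, n_jobs=-1
)
fine.fit(X_train, y_train)
print(f"Refined params: {fine.best_params_}")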
Complete Workflow Examples with Datasets
Now that you've learned preprocessing, classification, regression, clustering, evaluation, pipelines, and hyperparameter tuning, let's see how everything fits together. This section demonstrates complete end-to-end ML workflows using Scikit-learn's built-in datasets.
What You'll See: Each example below walks through the entire process—from loading data and exploration, through preprocessing and model selection, to evaluation and visualization. These are realistic workflows you can adapt for your own projects.
Datasets Covered
- Classification: Iris, Wine, Digits, Breast Cancer—demonstrating Logistic Regression, Random Forest, SVM, and evaluation
- Regression: Diabetes, California Housing—demonstrating Linear Regression, Ridge, feature importance
- Complete Pipeline: Every example shows data splitting, preprocessing, training, evaluation, and visualization
1. Iris Dataset (Multi-class Classification)
About Iris: Classic dataset with 150 samples of iris flowers. Features include sepal length, sepal width, petal length, and petal width. Target: 3 species (setosa, versicolor, virginica). Perfect for learning classification.
# Import necessary libraries
import numpy as np # For numerical operations
import pandas as pd # For data manipulation
import matplotlib.pyplot as plt # For plotting
from sklearn.datasets import load_iris # Load built-in Iris dataset
from sklearn.model_selection import train_test_split, cross_val_score # For data splitting and validation
from sklearn.preprocessing import StandardScaler # For feature scaling
from sklearn.linear_model import LogisticRegression # Linear classifier
from sklearn.ensemble import RandomForestClassifier # Tree-based ensemble classifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix # Evaluation metrics
import seaborn as sns # For advanced visualization
# 1. LOAD DATA
# load_iris() returns a Bunch object (dict-like) containing:
# - data: feature matrix (150 samples x 4 features)
# - target: class labels (0, 1, 2 for setosa, versicolor, virginica)
# - feature_names: names of the 4 features
# - target_names: names of the 3 species
iris = load_iris()
X, y = iris.data, iris.target # X = features (150x4), y = labels (150,)
# Display dataset information
print(f"Dataset shape: {X.shape}") # Output: (150, 4) - 150 samples, 4 features
print(f"Feature names: {iris.feature_names}") # sepal length/width, petal length/width
print(f"Target names: {iris.target_names}") # setosa, versicolor, virginica
print(f"Sample distribution: {np.bincount(y)}") # Count samples per class - Output: [50 50 50] (balanced)
# Import libraries for data exploration
import pandas as pd # For DataFrame operations
import matplotlib.pyplot as plt # For visualization
import seaborn as sns # For enhanced plots
from sklearn.datasets import load_iris # Load dataset
# 2. EXPLORE DATA
iris = load_iris()
X, y = iris.data, iris.target
# Create DataFrame for easy exploration and analysis
# pd.DataFrame() converts NumPy array to tabular format with column names
df = pd.DataFrame(X, columns=iris.feature_names)
# Add species names by mapping numeric labels (0,1,2) to text labels
df['species'] = iris.target_names[y] # e.g., 0 -> 'setosa'
# Display first 5 rows to see data structure
print(df.head()) # Shows sample data with feature values and species
# Statistical summary: count, mean, std, min, 25%, 50%, 75%, max
print(df.describe()) # Helps identify feature ranges and distributions
# Count samples per species - should be 50 each (balanced dataset)
print(df['species'].value_counts())
# Visualize feature distributions to understand data patterns
plt.figure(figsize=(12, 4)) # Create figure 12 inches wide, 4 tall
for i in range(4): # Loop through 4 features
plt.subplot(1, 4, i+1) # Create 1 row, 4 columns of subplots
# Create overlapping histograms for each species
# X[y==0, i] gets feature i values for species 0, etc.
plt.hist([X[y==0, i], X[y==1, i], X[y==2, i]],
label=iris.target_names, alpha=0.7) # alpha=0.7 for transparency
plt.xlabel(iris.feature_names[i]) # Label x-axis with feature name
plt.ylabel('Frequency') # Count of samples in each bin
plt.legend() # Show which color represents which species
plt.tight_layout() # Adjust spacing to prevent overlap
plt.show() # Display the plot
# Import required libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split # For splitting data
from sklearn.preprocessing import StandardScaler # For feature normalization
from sklearn.linear_model import LogisticRegression # Linear classification model
from sklearn.ensemble import RandomForestClassifier # Ensemble tree model
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt
# 3. SPLIT DATA into training and testing sets
iris = load_iris()
X, y = iris.data, iris.target
# train_test_split() randomly divides data into train/test sets
# test_size=0.2: Use 20% for testing, 80% for training
# random_state=42: Set seed for reproducibility (same split every time)
# stratify=y: Maintain class proportions in both sets (33% of each species)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"Training set: {X_train.shape}") # (120, 4) - 80% of 150 samples
print(f"Test set: {X_test.shape}") # (30, 4) - 20% of 150 samples
# 4. PREPROCESS: Scale features to mean=0, std=1
# Scaling is crucial for scale-sensitive algorithms (e.g., Logistic Regression, SVM, k-NN)
scaler = StandardScaler() # Create scaler object
# fit_transform(): Learn mean/std from training data AND transform it
X_train_scaled = scaler.fit_transform(X_train)
# transform(): Apply same scaling (using training mean/std) to test data
# NEVER fit on test data - this would cause data leakage!
X_test_scaled = scaler.transform(X_test)
# 5. TRAIN MODELS on the training data
# Logistic Regression (linear decision boundaries)
# max_iter=200: Maximum optimization iterations
# random_state=42: For reproducibility in stochastic processes
log_reg = LogisticRegression(max_iter=200, random_state=42)
log_reg.fit(X_train_scaled, y_train) # Learn weights from scaled training data
# Random Forest (ensemble of decision trees)
# n_estimators=100: Build 100 decision trees and average their predictions
# Tree-based models are scale-invariant (don't need scaled features)
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train) # Train on original (unscaled) data
# 6. EVALUATE models on test data (unseen data)
# predict(): Generate predictions for test samples
y_pred_lr = log_reg.predict(X_test_scaled) # Use scaled test data
y_pred_rf = rf.predict(X_test) # Use original test data
# accuracy_score(): Fraction of correct predictions
print(f"\nLogistic Regression Accuracy: {accuracy_score(y_test, y_pred_lr):.3f}")
print(f"Random Forest Accuracy: {accuracy_score(y_test, y_pred_rf):.3f}")
# classification_report(): Precision, recall, F1-score for each class
# Provides detailed per-class performance metrics
print("\nLogistic Regression Classification Report:")
print(classification_report(y_test, y_pred_lr, target_names=iris.target_names))
import matplotlib.pyplot as plt
import seaborn as sns # Advanced visualization library built on matplotlib
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix # For error analysis
# 7. VISUALIZE RESULTS with a confusion matrix
iris = load_iris()
X, y = iris.data, iris.target
# Split with the same test size and seed (note: stratify is omitted here, so this split differs slightly from step 3)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train Random Forest model
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train) # Learn from training data
y_pred_rf = rf.predict(X_test) # Make predictions on test data
# confusion_matrix(): Create matrix showing actual vs predicted classes
# Rows = actual classes, Columns = predicted classes
# Diagonal elements = correct predictions, off-diagonal = errors
cm = confusion_matrix(y_test, y_pred_rf)
# Visualize confusion matrix as a heatmap
plt.figure(figsize=(8, 6)) # Set figure size
# sns.heatmap(): Display matrix with color-coded cells
# annot=True: Show numbers in each cell
# fmt='d': Format numbers as integers (not decimals)
# cmap='Blues': Use blue color scheme (darker = higher values)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
xticklabels=iris.target_names, # Label columns with species names
yticklabels=iris.target_names) # Label rows with species names
plt.xlabel('Predicted') # What the model predicted
plt.ylabel('Actual') # What the true class was
plt.title('Iris Classification Confusion Matrix') # Descriptive title
plt.show() # Display the plot
# How to read: If cell (setosa, versicolor) = 2, means 2 setosa samples
# were incorrectly classified as versicolor
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
import numpy as np
# 8. FEATURE IMPORTANCE - Which features are most useful for predictions?
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train Random Forest (tree-based models provide feature importance)
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
# feature_importances_: Array of importance scores (sum to 1.0)
# Higher score = feature contributes more to accurate predictions
# Based on how much each feature decreases impurity (Gini) across trees
importances = rf.feature_importances_
# np.argsort(): Get indices that would sort array in ascending order
# [::-1] reverses to get descending order (most important first)
indices = np.argsort(importances)[::-1]
# Create bar plot of feature importance
plt.figure(figsize=(10, 6))
# Plot bars in order of importance
plt.bar(range(X.shape[1]), importances[indices])
# Label x-axis with feature names in sorted order, rotated 45° for readability
plt.xticks(range(X.shape[1]), [iris.feature_names[i] for i in indices], rotation=45)
plt.xlabel('Feature') # X-axis label
plt.ylabel('Importance') # Y-axis label (0 to ~0.5 for Iris dataset)
plt.title('Feature Importance for Iris Classification')
plt.tight_layout() # Prevent label cutoff
plt.show()
# Print ranked list of features with importance scores
print("Feature ranking:")
for i in range(X.shape[1]):
print(f"{i+1}. {iris.feature_names[indices[i]]}: {importances[indices[i]]:.3f}")
# Typically petal width and petal length are most important for Iris
2. Wine Dataset (Multi-class Classification)
About Wine: Chemical analysis of 178 wine samples from Italy. 13 features (alcohol, acidity, phenols, etc.). Target: 3 wine types. Great for classification with multiple continuous features.
# Import necessary libraries
from sklearn.datasets import load_wine # Wine quality dataset
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler # For feature scaling
from sklearn.svm import SVC # Support Vector Machine classifier
from sklearn.metrics import accuracy_score, classification_report
import pandas as pd # For data manipulation
import numpy as np # For numerical operations
# 1. LOAD & EXPLORE
# load_wine() returns chemical analysis of 178 wine samples
# Features include alcohol content, acidity, phenols, color intensity, etc.
wine = load_wine()
X, y = wine.data, wine.target # X = 13 chemical features, y = wine class (0, 1, 2)
print(f"Dataset shape: {X.shape}") # (178, 13) - 178 samples, 13 features
print(f"Features: {len(wine.feature_names)}") # 13 chemical properties
print(f"Classes: {wine.target_names}") # class_0, class_1, class_2 (wine cultivars)
print(f"Class distribution: {np.bincount(y)}") # Samples per class - may be imbalanced
# Create DataFrame for easier exploration
df_wine = pd.DataFrame(X, columns=wine.feature_names)
df_wine['wine_type'] = y # Add target column
print("\nFirst few rows:") # Preview data structure
print(df_wine.head())
print("\nStatistics:") # Mean, std, min, max for each feature
print(df_wine.describe()) # Note: features have different scales (0.74 to 1680)
# Import required libraries
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler # For feature normalization
from sklearn.svm import SVC # Support Vector Machine
from sklearn.ensemble import GradientBoostingClassifier # Boosting ensemble
from sklearn.metrics import accuracy_score, classification_report
import numpy as np
# 2. SPLIT & PREPROCESS data
wine = load_wine()
X, y = wine.data, wine.target
# train_test_split(): Randomly divide data
# test_size=0.25: Use 25% for testing (higher than standard 20% due to small dataset)
# random_state=42: Reproducible split
# stratify=y: Maintain class proportions in train/test sets
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.25, random_state=42, stratify=y
)
# Scale features to mean=0, std=1 (critical for SVM performance)
# Wine features have vastly different scales (alcohol ~10-15, proline ~200-1600)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train) # Learn scaling from training data
X_test_scaled = scaler.transform(X_test) # Apply same scaling to test data
# 3. TRAIN MULTIPLE MODELS for comparison
# SVM with RBF (Radial Basis Function) kernel
# kernel='rbf': Non-linear decision boundaries
# C=10: Inverse regularization strength; higher C = weaker regularization = tighter fit to training data
# gamma='scale': Kernel coefficient = 1 / (n_features * X.var())
# See: https://scikit-learn.org/stable/modules/svm.html
svm = SVC(kernel='rbf', C=10, gamma='scale', random_state=42)
svm.fit(X_train_scaled, y_train) # Train on scaled features
# Gradient Boosting: Sequential ensemble of decision trees
# n_estimators=100: Build 100 trees (each learns from previous tree's errors)
# See: https://scikit-learn.org/stable/modules/ensemble.html#gradient-boosting
# Tree models don't need scaled features (splits depend only on the ordering of values)
gb = GradientBoostingClassifier(n_estimators=100, random_state=42)
gb.fit(X_train, y_train) # Train on original unscaled data
# 4. EVALUATE both models on test set
y_pred_svm = svm.predict(X_test_scaled) # SVM needs scaled features
y_pred_gb = gb.predict(X_test) # GB uses original features
# Compare accuracy scores
print(f"SVM Accuracy: {accuracy_score(y_test, y_pred_svm):.3f}")
print(f"Gradient Boosting Accuracy: {accuracy_score(y_test, y_pred_gb):.3f}")
# Detailed per-class metrics (precision, recall, F1-score)
print("\nGradient Boosting Report:")
print(classification_report(y_test, y_pred_gb, target_names=wine.target_names))
# Import libraries for cross-validation workflow
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score # For k-fold cross-validation
from sklearn.preprocessing import StandardScaler # Feature scaling
from sklearn.svm import SVC # Classifier
from sklearn.pipeline import Pipeline # Chain preprocessing + model together
import numpy as np
# 5. CROSS-VALIDATION with Pipeline
# Pipeline ensures scaling is done correctly within each CV fold
# This prevents data leakage (test data influencing training)
wine = load_wine()
X, y = wine.data, wine.target
# Create Pipeline: preprocessing step + model step
# Pipeline chains operations: data flows scaler → SVM
# See: https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
pipeline = Pipeline([
('scaler', StandardScaler()), # Step 1: Scale features
('svm', SVC(kernel='rbf', C=10, random_state=42)) # Step 2: Train SVM
])
# cross_val_score(): Perform k-fold cross-validation
# cv=5: Split data into 5 folds
# - Train on 4 folds, test on 1 fold
# - Repeat 5 times (each fold used as test once)
# - Returns 5 accuracy scores
# scoring='accuracy': Metric to evaluate (could be 'f1', 'precision', etc.)
# Pipeline ensures each fold is scaled independently (no data leakage)
# See: https://scikit-learn.org/stable/modules/cross_validation.html
scores = cross_val_score(pipeline, X, y, cv=5, scoring='accuracy')
# Display results
print(f"Cross-validation scores: {scores}") # 5 individual fold scores
# Mean ± 2*std gives 95% confidence interval estimate
print(f"Mean accuracy: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")
# Example output: "Mean accuracy: 0.978 (+/- 0.034)"
3. Digits Dataset (Image Classification)
About Digits: 1,797 images of handwritten digits (0-9), each 8x8 pixels (64 features). Perfect for learning image classification and dimensionality reduction techniques.
# Import libraries for visualization and data loading
import matplotlib.pyplot as plt # For plotting
from sklearn.datasets import load_digits # Handwritten digits dataset
import numpy as np # For numerical operations
# 1. LOAD & VISUALIZE handwritten digits
# load_digits() returns 1,797 images of digits 0-9
# Each image is 8x8 pixels, flattened to 64-dimensional vector
digits = load_digits()
X, y = digits.data, digits.target # X = 64 features (pixel intensities), y = digit label (0-9)
print(f"Dataset shape: {X.shape}") # (1797, 64) - 1797 images, 64 pixels each
print(f"Image shape: {digits.images.shape}") # (1797, 8, 8) - original 2D format
print(f"Classes: 0-9 (10 classes)") # 10 possible digit labels
print(f"Samples per class: {np.bincount(y)}") # Distribution (~180 samples per digit)
# Visualize sample digits to understand the data
# Create 2 rows × 5 columns = 10 subplots
fig, axes = plt.subplots(2, 5, figsize=(12, 5))
for i, ax in enumerate(axes.flat): # axes.flat iterates over all subplots
# imshow(): Display 2D array as image
# cmap='gray': Use grayscale colormap (0=black, 16=white)
ax.imshow(digits.images[i], cmap='gray')
ax.set_title(f"Label: {digits.target[i]}") # Show true digit label
ax.axis('off') # Hide axis ticks and labels
plt.tight_layout() # Adjust spacing between subplots
plt.show() # Display the plot
# Import libraries for neural network training
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler # For feature normalization
from sklearn.neural_network import MLPClassifier # Multi-Layer Perceptron (neural network)
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt
# 2. SPLIT & TRAIN with neural network
digits = load_digits()
X, y = digits.data, digits.target # X = 64 pixel intensities, y = digit label (0-9)
# Split data: 80% training, 20% testing
# stratify=y: Ensure balanced digit distribution in both sets
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# Scale features to improve neural network convergence
# Neural networks learn faster when features are standardized
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train) # Fit and transform training data
X_test_scaled = scaler.transform(X_test) # Transform test data (using training stats)
# MLPClassifier: Multi-Layer Perceptron (feedforward neural network)
# hidden_layer_sizes=(100, 50): Architecture with 2 hidden layers
# - Layer 1: 100 neurons (fully connected to 64 input pixels)
# - Layer 2: 50 neurons (fully connected to Layer 1)
# - Output: 10 neurons (one per digit class)
# max_iter=500: Maximum training epochs (iterations through dataset)
# See: https://scikit-learn.org/stable/modules/neural_networks_supervised.html
mlp = MLPClassifier(hidden_layer_sizes=(100, 50), max_iter=500, random_state=42)
mlp.fit(X_train_scaled, y_train) # Train network using backpropagation
# Predict digit labels for test images
y_pred = mlp.predict(X_test_scaled)
# Evaluate performance
print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}") # Overall correctness
print("\nClassification Report:") # Per-digit precision, recall, F1-score
print(classification_report(y_test, y_pred)) # Shows performance for each digit 0-9
# Import visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns # For advanced heatmaps
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import confusion_matrix # For error pattern analysis
# 3. CONFUSION MATRIX VISUALIZATION - See where model makes mistakes
digits = load_digits()
X, y = digits.data, digits.target
# Split with the same size and seed (stratify omitted, so the split differs slightly from step 2)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train neural network
mlp = MLPClassifier(hidden_layer_sizes=(100, 50), max_iter=500, random_state=42)
mlp.fit(X_train_scaled, y_train)
y_pred = mlp.predict(X_test_scaled) # Get predictions
# confusion_matrix(): 10x10 matrix showing actual vs predicted digits
# Rows = true digit, Columns = predicted digit
# Diagonal = correct predictions, off-diagonal = errors
cm = confusion_matrix(y_test, y_pred)
# Visualize as heatmap
plt.figure(figsize=(10, 8)) # Large figure for 10x10 matrix
# annot=True: Show count in each cell
# fmt='d': Display as integers
# cmap='Blues': Blue color scheme (darker = more samples)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted Digit') # What model predicted
plt.ylabel('True Digit') # Actual digit in test set
plt.title('Digit Classification Confusion Matrix')
plt.show()
# How to read: If cell (8, 3) = 5, means 5 images of digit 8 were
# incorrectly classified as digit 3 (common mistake due to similar shapes)
# Import visualization libraries
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
import numpy as np
# 4. VISUALIZE PREDICTIONS - See model's predictions on actual images
digits = load_digits()
X, y = digits.data, digits.target
# Split with the same size and seed as above (stratify omitted)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train neural network
mlp = MLPClassifier(hidden_layer_sizes=(100, 50), max_iter=500, random_state=42)
mlp.fit(X_train_scaled, y_train)
y_pred = mlp.predict(X_test_scaled) # Predict all test samples
# Show first 18 test images (3 rows × 6 columns) with predictions
fig, axes = plt.subplots(3, 6, figsize=(15, 8))
for i, ax in enumerate(axes.flat): # Loop through 18 subplots
# X_test[i] is 64-element array; reshape to 8x8 for display
ax.imshow(X_test[i].reshape(8, 8), cmap='gray')
# Show true label vs predicted label
ax.set_title(f"True: {y_test[i]}\nPred: {y_pred[i]}")
ax.axis('off') # Hide axis ticks
# Highlight incorrect predictions in red for easy spotting
if y_test[i] != y_pred[i]:
# Make incorrect predictions stand out visually
ax.set_title(f"True: {y_test[i]}\nPred: {y_pred[i]}",
color='red', fontweight='bold')
plt.tight_layout() # Prevent title overlap
plt.show()
# This visualization helps identify which digits the model confuses
# e.g., 8 vs 3, 5 vs 3, 1 vs 7 are common errors
4. Breast Cancer Dataset (Binary Classification)
About Breast Cancer: 569 samples with 30 features computed from breast mass images. Binary classification: malignant (0) or benign (1). Real medical data—demonstrates importance of precision/recall.
# Import libraries for medical dataset analysis
from sklearn.datasets import load_breast_cancer # Real medical diagnostic data
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
import pandas as pd # For data manipulation
import numpy as np # For numerical operations
# 1. LOAD & EXPLORE breast cancer diagnostic data
# This dataset contains features computed from digitized images of breast mass
# Binary classification: malignant (cancerous) vs benign (non-cancerous)
cancer = load_breast_cancer()
X, y = cancer.data, cancer.target # X = 30 features, y = diagnosis (0=malignant, 1=benign)
print(f"Dataset shape: {X.shape}") # (569, 30) - 569 samples, 30 features
print(f"Features: {cancer.feature_names[:5]}... (30 total)") # radius, texture, perimeter, area, smoothness, etc.
print(f"Classes: {cancer.target_names}") # ['malignant' 'benign']
print(f"Class distribution: {np.bincount(y)}") # Count of each class
print(f"Malignant (0): {(y==0).sum()}, Benign (1): {(y==1).sum()}") # Show imbalance if any
# Create DataFrame for statistical analysis
# Features include mean, std error, and worst values for 10 measurements
df_cancer = pd.DataFrame(X, columns=cancer.feature_names)
print("\nFeature statistics:") # Mean, std, min, max for all features
print(df_cancer.describe()) # Note: Features have very different scales (0.1 to 3000)
# Import libraries for medical classification
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression # Linear classifier
from sklearn.ensemble import RandomForestClassifier # Tree ensemble
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, classification_report
import numpy as np
# 2. SPLIT, SCALE & TRAIN multiple models
cancer = load_breast_cancer()
X, y = cancer.data, cancer.target
# Split: 80% train, 20% test, maintaining class balance
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# Scale features (critical for logistic regression)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train Logistic Regression
# max_iter=10000: High iteration limit (default 100 may not converge)
# See: https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
lr = LogisticRegression(max_iter=10000, random_state=42)
lr.fit(X_train_scaled, y_train) # Train on scaled data
# Train Random Forest (comparison model)
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train) # Trees don't need scaling
# 3. EVALUATE with MULTIPLE METRICS (critical for medical applications)
# Medical data requires careful evaluation beyond just accuracy
y_pred_lr = lr.predict(X_test_scaled) # Binary predictions (0 or 1)
y_pred_rf = rf.predict(X_test)
y_proba_lr = lr.predict_proba(X_test_scaled)[:, 1] # Probability of benign (class 1)
print("Logistic Regression:")
print(f" Accuracy: {accuracy_score(y_test, y_pred_lr):.3f}") # Overall correctness
print(f" Precision: {precision_score(y_test, y_pred_lr):.3f}") # Of predicted benign, how many are actually benign?
print(f" Recall: {recall_score(y_test, y_pred_lr):.3f}") # Of actual benign, how many did we catch?
print(f" F1 Score: {f1_score(y_test, y_pred_lr):.3f}") # Harmonic mean of precision & recall
print(f" ROC-AUC: {roc_auc_score(y_test, y_proba_lr):.3f}") # Area under ROC curve (0.5-1.0)
print("\nRandom Forest:")
print(f" Accuracy: {accuracy_score(y_test, y_pred_rf):.3f}")
print(f" Precision: {precision_score(y_test, y_pred_rf):.3f}")
print(f" Recall: {recall_score(y_test, y_pred_rf):.3f}") # High recall = fewer missed cancers
print(f" F1 Score: {f1_score(y_test, y_pred_rf):.3f}")
# Note: the positive class here is benign (1), so the scores above describe
# benign detection; pass pos_label=0 to score the malignant class directly.
# For medical diagnosis, high recall on the MALIGNANT class is usually the
# priority (don't miss cancers), while high precision avoids false alarms
# (unnecessary biopsies).
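One way to act on that trade-off without retraining is to move the decision threshold. Below is a minimal sketch reusing the same data and model; the 0.7 cut-off is an illustrative choice, not a clinical recommendation:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
cancer = load_breast_cancer()
X, y = cancer.data, cancer.target  # 0 = malignant, 1 = benign
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
lr = LogisticRegression(max_iter=10000, random_state=42)
lr.fit(X_train_scaled, y_train)
# predict() uses a 0.5 probability cut-off by default.
# To miss fewer cancers, call a sample benign only when the model is quite
# sure: require P(benign) > 0.7, otherwise predict malignant (0).
proba_benign = lr.predict_proba(X_test_scaled)[:, 1]
y_pred_strict = (proba_benign > 0.7).astype(int)
# Recall of the malignant class (pos_label=0) = fraction of cancers caught
print(f"Malignant recall @0.5: {recall_score(y_test, lr.predict(X_test_scaled), pos_label=0):.3f}")
print(f"Malignant recall @0.7: {recall_score(y_test, y_pred_strict, pos_label=0):.3f}")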
# Import libraries for ROC curve analysis
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score # For ROC analysis
# 4. ROC CURVE - Visualize classifier performance across thresholds
# ROC = Receiver Operating Characteristic
# Shows trade-off between True Positive Rate and False Positive Rate
cancer = load_breast_cancer()
X, y = cancer.data, cancer.target
# Split with the same size and seed (stratify omitted, so the split differs slightly from step 2)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train logistic regression
lr = LogisticRegression(max_iter=10000, random_state=42)
lr.fit(X_train_scaled, y_train)
# Get probability predictions (not binary 0/1)
y_proba = lr.predict_proba(X_test_scaled)[:, 1] # Probability of benign (class 1)
# roc_curve(): Calculate TPR and FPR at different classification thresholds
# fpr: False Positive Rate (X-axis) - How many malignant samples were wrongly called benign?
# tpr: True Positive Rate (Y-axis) - How many benign correctly identified?
# thresholds: Classification thresholds (0.0 to 1.0)
# See: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html
fpr, tpr, thresholds = roc_curve(y_test, y_proba)
# roc_auc_score(): Area Under ROC Curve
# 0.5 = random classifier, 1.0 = perfect classifier
auc = roc_auc_score(y_test, y_proba)
# Plot ROC curve
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label=f'ROC Curve (AUC = {auc:.3f})', linewidth=2)
plt.plot([0, 1], [0, 1], 'k--', label='Random Classifier') # Diagonal line (AUC=0.5)
plt.xlabel('False Positive Rate') # Fraction of malignant cases wrongly called benign
plt.ylabel('True Positive Rate') # Fraction of benign cases correctly identified
plt.title('ROC Curve - Breast Cancer Classification')
plt.legend()
plt.grid(alpha=0.3)
plt.show()
# Ideal curve: Hugs top-left corner (high TPR, low FPR)
# Higher AUC = better overall classifier performance
5. Diabetes Dataset (Regression)
About Diabetes: 442 samples with 10 baseline features (age, BMI, blood pressure, etc.). Target: quantitative measure of disease progression one year after baseline. Great for learning regression.
# Import libraries for regression analysis
from sklearn.datasets import load_diabetes # Medical progression prediction dataset
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge, Lasso # Linear models with regularization
from sklearn.ensemble import RandomForestRegressor # Tree ensemble for regression
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
import pandas as pd # For data analysis
import numpy as np
# 1. LOAD & EXPLORE diabetes progression dataset
# This dataset contains baseline patient data and disease progression after 1 year
# Target is quantitative measure of disease progression (continuous value, not classification)
diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target # X = 10 features (age, BMI, BP, etc.), y = progression score
print(f"Dataset shape: {X.shape}") # (442, 10) - 442 patients, 10 baseline measurements
print(f"Features: {diabetes.feature_names}") # age, sex, bmi, bp, s1-s6 (blood serum measurements)
print(f"Target statistics: min={y.min():.1f}, max={y.max():.1f}, mean={y.mean():.1f}")
# Target values range from 25 to 346 (higher = worse progression)
# Create DataFrame for correlation analysis
df_diabetes = pd.DataFrame(X, columns=diabetes.feature_names)
df_diabetes['progression'] = y # Add target column
# Identify which features correlate most with disease progression
print("\nCorrelation with target:") # Positive = feature increases with disease progression
print(df_diabetes.corr()['progression'].sort_values(ascending=False))
# Typically: bmi (body mass index), s5 (serum triglycerides) are top predictors
# Import regression models and metrics
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge, Lasso # Linear models
from sklearn.ensemble import RandomForestRegressor # Non-linear ensemble
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
import numpy as np
# 2. SPLIT & TRAIN MULTIPLE REGRESSION MODELS
diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target
# Split data (no stratification needed for regression)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Linear Regression - no regularization (baseline model)
# See: https://scikit-learn.org/stable/modules/linear_model.html#ordinary-least-squares
lr = LinearRegression()
lr.fit(X_train, y_train) # Learns weights for each feature
# Ridge Regression - L2 regularization (penalizes large coefficients)
# alpha=1.0: Regularization strength (higher = more penalty = simpler model)
# Good when features are correlated (reduces overfitting)
# See: https://scikit-learn.org/stable/modules/linear_model.html#ridge-regression
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)
# Lasso Regression - L1 regularization (can zero out features)
# alpha=0.5: Regularization strength
# Performs automatic feature selection (sets some coefficients to exactly 0)
# See: https://scikit-learn.org/stable/modules/linear_model.html#lasso
lasso = Lasso(alpha=0.5)
lasso.fit(X_train, y_train)
# Random Forest Regressor - ensemble of decision trees
# n_estimators=100: Build 100 trees and average predictions
# Captures non-linear relationships between features and target
rf_reg = RandomForestRegressor(n_estimators=100, random_state=42)
rf_reg.fit(X_train, y_train)
# 3. EVALUATE with regression metrics
models = {
'Linear Regression': lr,
'Ridge': ridge,
'Lasso': lasso,
'Random Forest': rf_reg
}
for name, model in models.items():
y_pred = model.predict(X_test)
    # RMSE: Root Mean Squared Error (same units as target, penalizes large errors)
    # np.sqrt(MSE) works across scikit-learn versions; newer releases deprecate
    # the squared=False argument in favor of root_mean_squared_error
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
# R² Score: Coefficient of determination (0-1, higher = better fit)
# 1.0 = perfect predictions, 0 = model as good as mean baseline
r2 = r2_score(y_test, y_pred)
# MAE: Mean Absolute Error (average absolute difference, robust to outliers)
mae = mean_absolute_error(y_test, y_pred)
print(f"\n{name}:")
print(f" RMSE: {rmse:.2f}") # Lower is better
print(f" R² Score: {r2:.3f}") # Higher is better (max 1.0)
print(f" MAE: {mae:.2f}") # Lower is better
# Import visualization library
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
# 4. VISUALIZE PREDICTIONS - Scatter plot of actual vs predicted values
diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target
# Reproduce same train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train Random Forest (typically best performer)
rf_reg = RandomForestRegressor(n_estimators=100, random_state=42)
rf_reg.fit(X_train, y_train)
y_pred = rf_reg.predict(X_test) # Predict disease progression for test patients
# Create scatter plot
plt.figure(figsize=(10, 6))
# Each point = one test patient
# X-axis = actual progression, Y-axis = model's prediction
plt.scatter(y_test, y_pred, alpha=0.6) # alpha=0.6 for transparency (see overlapping points)
# Plot ideal prediction line (y=x)
# Perfect predictions would fall exactly on this red dashed line
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
plt.xlabel('Actual Disease Progression') # True progression after 1 year
plt.ylabel('Predicted Disease Progression') # Model's prediction
plt.title('Diabetes Progression: Actual vs Predicted') # Title
plt.grid(alpha=0.3) # Light grid for easier reading
plt.show()
# Points close to red line = good predictions
# Points far from line = model errors (over/under-estimation)
# Import libraries for feature importance analysis
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
import numpy as np
# 5. FEATURE IMPORTANCE - Which features best predict disease progression?
diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target
# Reproduce same split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train Random Forest
rf_reg = RandomForestRegressor(n_estimators=100, random_state=42)
rf_reg.fit(X_train, y_train)
# Extract feature importances (how much each feature contributes to predictions)
importances = rf_reg.feature_importances_ # Sum to 1.0
# Sort features by importance (descending order)
indices = np.argsort(importances)[::-1]
# Create bar chart
plt.figure(figsize=(10, 6))
plt.bar(range(X.shape[1]), importances[indices]) # Bars sorted by importance
# Label x-axis with feature names in sorted order
plt.xticks(range(X.shape[1]), [diabetes.feature_names[i] for i in indices], rotation=45)
plt.xlabel('Feature') # Baseline patient measurements
plt.ylabel('Importance') # 0 to ~0.3 for Diabetes dataset
plt.title('Feature Importance for Diabetes Progression Prediction')
plt.tight_layout() # Prevent label cutoff
plt.show()
# Print ranked list with importance scores
print("Feature ranking:")
for i in range(X.shape[1]):
print(f"{i+1}. {diabetes.feature_names[indices[i]]}: {importances[indices[i]]:.3f}")
# Typically: bmi (body mass index) and s5 (serum triglycerides) are most important
# This tells us which patient measurements to prioritize in clinical settings
6. California Housing Dataset (Regression)
About California Housing: 20,640 samples from California census data. 8 features (median income, house age, rooms, location, etc.). Target: median house value. Larger dataset ideal for testing model scalability.
# Import libraries for large-scale regression
from sklearn.datasets import fetch_california_housing # NOTE: fetch_, not load_
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score
import pandas as pd # For data analysis
import numpy as np
# 1. LOAD & EXPLORE California Housing dataset
# fetch_california_housing() downloads dataset from internet (first time only)
# This is a larger dataset (20,640 samples) - good for testing model scalability
# Based on 1990 California census data
housing = fetch_california_housing()
X, y = housing.data, housing.target # X = 8 features, y = median house value
print(f"Dataset shape: {X.shape}") # (20640, 8) - 20,640 California districts
print(f"Features: {housing.feature_names}")
# MedInc: median income, HouseAge, AveRooms, AveBedrms, Population, AveOccup, Latitude, Longitude
print(f"Target (median house value in $100k): min={y.min():.2f}, max={y.max():.2f}, mean={y.mean():.2f}")
# Values in $100,000s - e.g., 2.5 = $250,000 median house value
# Create DataFrame for correlation analysis
df_housing = pd.DataFrame(X, columns=housing.feature_names)
df_housing['MedHouseVal'] = y # Add target column
print("\nFirst few rows:") # Preview data structure
print(df_housing.head())
print("\nCorrelation with target:") # Which features correlate with house prices?
print(df_housing.corr()['MedHouseVal'].sort_values(ascending=False))
# Typically: MedInc (median income) is strongest predictor of house value
# Import regression models and evaluation metrics
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression # Simple linear model
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor # Powerful ensembles
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np
# 2. SPLIT, SCALE & TRAIN multiple models
housing = fetch_california_housing()
X, y = housing.data, housing.target
# Split: 80% train (16,512 samples), 20% test (4,128 samples)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
print(f"Training samples: {X_train.shape[0]}") # 16,512 districts
print(f"Test samples: {X_test.shape[0]}") # 4,128 districts
# Scale features (important for linear models, not trees)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Linear Regression - fast baseline model
lr = LinearRegression()
lr.fit(X_train_scaled, y_train) # Train on scaled data
# Gradient Boosting - powerful for tabular data
# n_estimators=100: Build 100 sequential trees
# learning_rate=0.1: Shrinks each tree's contribution (smaller = more conservative, needs more trees)
# max_depth=5: Maximum tree depth (prevents overfitting)
# See: https://scikit-learn.org/stable/modules/ensemble.html#gradient-boosting
gb = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=5, random_state=42)
gb.fit(X_train, y_train) # Trees don't need scaling
# Random Forest - ensemble of independent trees
# max_depth=20: Allow deeper trees than Gradient Boosting
rf = RandomForestRegressor(n_estimators=100, max_depth=20, random_state=42)
rf.fit(X_train, y_train)
# 3. EVALUATE all models on test set
models = {
'Linear Regression': (lr, X_test_scaled), # Needs scaled data
'Gradient Boosting': (gb, X_test), # Original data
'Random Forest': (rf, X_test) # Original data
}
for name, (model, X_test_data) in models.items():
y_pred = model.predict(X_test_data)
    # RMSE in $100k units (multiply by 100,000 for dollars);
    # np.sqrt(MSE) avoids the deprecated squared=False argument
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
# R²: proportion of variance explained (0-1, higher = better)
r2 = r2_score(y_test, y_pred)
print(f"\n{name}:")
print(f" RMSE: {rmse:.3f} ($100k)") # e.g., 0.5 = ±$50,000 error
print(f" R² Score: {r2:.3f}") # e.g., 0.8 = explains 80% of variance
# Import visualization library
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
# 4. VISUALIZE PREDICTIONS with dual plots
housing = fetch_california_housing()
X, y = housing.data, housing.target
# Reproduce same split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train Gradient Boosting (typically best model for this dataset)
gb = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=5, random_state=42)
gb.fit(X_train, y_train)
y_pred = gb.predict(X_test) # Predict house values for 4,128 test districts
# Create figure with 2 side-by-side subplots
fig, axes = plt.subplots(1, 2, figsize=(15, 5)) # 1 row, 2 columns
# LEFT PLOT: Scatter plot of Actual vs Predicted
axes[0].scatter(y_test, y_pred, alpha=0.3) # alpha=0.3 for transparency (many points)
# Plot y=x line (perfect predictions)
axes[0].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
axes[0].set_xlabel('Actual House Value ($100k)') # True median house value
axes[0].set_ylabel('Predicted House Value ($100k)') # Model's prediction
axes[0].set_title('California Housing: Actual vs Predicted')
axes[0].grid(alpha=0.3)
# Points near red line = accurate predictions
# Points above line = overestimation, below = underestimation
# RIGHT PLOT: Residual plot (errors vs predictions)
# residuals = actual - predicted (positive = underestimated, negative = overestimated)
residuals = y_test - y_pred
axes[1].scatter(y_pred, residuals, alpha=0.3)
axes[1].axhline(y=0, color='r', linestyle='--', lw=2) # Zero error line
axes[1].set_xlabel('Predicted House Value ($100k)')
axes[1].set_ylabel('Residuals') # Error in predictions
axes[1].set_title('Residual Plot') # Check for patterns in errors
axes[1].grid(alpha=0.3)
# Random scatter around y=0 = good (no systematic bias)
# Pattern (e.g., curve) = model missing relationships
plt.tight_layout() # Prevent subplot overlap
plt.show()
# Import libraries for feature importance analysis
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
import numpy as np
# 5. FEATURE IMPORTANCE - Which factors most influence house prices?
housing = fetch_california_housing()
X, y = housing.data, housing.target
# Reproduce same split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train Gradient Boosting
gb = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=5, random_state=42)
gb.fit(X_train, y_train)
# Extract feature importances from trained model
importances = gb.feature_importances_ # Sum to 1.0
# Sort features by importance (descending)
indices = np.argsort(importances)[::-1]
# Create bar chart of feature importance
plt.figure(figsize=(10, 6))
plt.bar(range(X.shape[1]), importances[indices]) # 8 features
# Label x-axis with feature names in sorted order
plt.xticks(range(X.shape[1]), [housing.feature_names[i] for i in indices], rotation=45)
plt.xlabel('Feature') # Census and geographic features
plt.ylabel('Importance') # 0 to ~0.5 for California Housing
plt.title('Feature Importance for California Housing Price Prediction')
plt.tight_layout() # Prevent x-label cutoff
plt.show()
# Print ranked list with importance scores
print("Feature ranking:")
for i in range(X.shape[1]):
print(f"{i+1}. {housing.feature_names[indices[i]]}: {importances[indices[i]]:.3f}")
# Typically: MedInc (median income) is by far the most important predictor
# Latitude and Longitude also matter (location, location, location!)
# This tells us income and location are key drivers of California house prices
Datasets Summary: You've now seen complete workflows for all major Scikit-learn datasets—from loading and exploring to training, evaluation, and visualization. These patterns apply to any ML project. Use these datasets to experiment with new algorithms and techniques!
Best Practices & Summary
Key Takeaways
- ✓ Always split data: Training set to train, test set to evaluate (never train on test!)
- ✓ Scale features: Especially for distance-based models (SVM, k-NN, neural nets)
- ✓ Use pipelines: Prevent data leakage and simplify workflows
- ✓ Cross-validate: A single train/test split can be misleading
- ✓ Choose metrics wisely: Accuracy isn't always appropriate (use F1 for imbalanced data)
- ✓ Start simple: Baseline with Logistic Regression before trying complex models
- ✓ Set random_state: For reproducibility in experiments
- ✓ Save models: Use joblib to persist trained models
Common Pitfalls
Avoid These Mistakes
- Data leakage: Fitting preprocessors on test data (see the sketch after this list)
- Not scaling: SVM and neural nets need scaled features
- Using accuracy for imbalanced data: 99% accuracy means nothing if 99% of data is one class
- Overfitting: Model performs great on training data, poorly on test data
- Not setting random_state: Results change every run
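The leakage pitfall is easiest to see with a scaler. A minimal sketch of the wrong and right patterns, using Iris for illustration:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
# WRONG: scaler.fit_transform(X) before splitting lets test statistics leak into training
# RIGHT: split first, fit the scaler on training data only, reuse it for test data
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)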
Model Persistence
import joblib
# Save model
joblib.dump(model, 'model.joblib')
# Load model
loaded_model = joblib.load('model.joblib')
predictions = loaded_model.predict(X_new)
Series Completion
Congratulations! You've completed the Python Data Science Series. You now have the full toolkit:
- NumPy: Efficient numerical computation
- Pandas: Data manipulation and analysis
- Matplotlib/Seaborn: Compelling visualizations
- Scikit-learn: Machine learning models and pipelines
You're ready to tackle real-world data science projects—from exploratory analysis to predictive modeling!
Scikit-learn API Cheat Sheet
Quick reference for machine learning workflows with Scikit-learn.
Data Splitting
train_test_split(X, y) | Split data
test_size=0.2 | 20% test set
random_state=42 | Reproducibility
Preprocessing & Encoding
StandardScaler() | Standardize features
MinMaxScaler() | Scale to 0-1
LabelEncoder() | Encode labels
OneHotEncoder() | One-hot encode
Estimator API
model.fit(X_train, y_train) | Train model
model.predict(X_test) | Make predictions
model.score(X, y) | Model accuracy
model.predict_proba(X) | Class probabilities
Common Models
LinearRegression() | Linear regression
LogisticRegression() | Classification
RandomForestClassifier() | Random forest
Metrics
accuracy_score(y, pred) | Accuracy
precision_score(y, pred) | Precision
recall_score(y, pred) | Recall
f1_score(y, pred) | F1 score
confusion_matrix(y, pred) | Confusion matrix
mean_squared_error(y, pred) | MSE
r2_score(y, pred) | R² score
Scalers & Feature Engineering
scaler.fit(X_train) | Fit scaler
scaler.transform(X) | Transform data
scaler.fit_transform(X) | Fit & transform
SimpleImputer() | Fill missing values
PolynomialFeatures() | Polynomial features
normalize(X) | Normalize
binarize(X) | Binarize
Model Selection & Tuning
cross_val_score(model, X, y) | Cross-validation
cv=5 | 5-fold CV
GridSearchCV(model, params) | Grid search
RandomizedSearchCV() | Random search
learning_curve() | Learning curve
validation_curve() | Validation curve
Pipelines
Pipeline(steps) | Create pipeline
make_pipeline() | Quick pipeline
pipe.fit(X, y) | Fit pipeline
pipe.predict(X) | Predict
ColumnTransformer() | Column-wise transforms
FeatureUnion() | Combine feature sets
Ensembles
RandomForestClassifier() | Random forest
GradientBoostingClassifier() | Gradient boosting
AdaBoostClassifier() | AdaBoost
VotingClassifier() | Voting ensemble
BaggingClassifier() | Bagging
StackingClassifier() | Stacking
Dimensionality Reduction & Feature Selection
PCA(n_components=2) | PCA
pca.fit_transform(X) | Project onto components
pca.explained_variance_ | Variance explained
TruncatedSVD() | Truncated SVD
TSNE() | t-SNE
SelectKBest() | Feature selection
Pro Tips:
- Train-test split: Always split before preprocessing to avoid data leakage
- Cross-validation: Use CV for robust model evaluation (5-10 folds typical)
- Pipelines: Chain preprocessing + model to prevent leakage and simplify workflow
- Scaling: Required for distance-based algorithms (KNN, SVM, neural networks)
- Class imbalance: Use class_weight='balanced' or SMOTE for imbalanced data (see the sketch below)
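For the class-imbalance tip, a minimal sketch: class_weight is a standard scikit-learn parameter, while SMOTE lives in the separate imbalanced-learn package.
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
# class_weight='balanced' reweights classes inversely to their frequencies,
# so minority-class mistakes cost more during training
lr = LogisticRegression(class_weight='balanced', max_iter=1000)
rf = RandomForestClassifier(class_weight='balanced', random_state=42)
# SMOTE (synthetic minority oversampling) requires: pip install imbalanced-learn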
Related Articles in This Series
Part 1: NumPy Foundations for Data Science
Master NumPy arrays, vectorization, broadcasting, and linear algebra operations—the foundation of Python data science.
Read Article
Part 2: Pandas for Data Analysis
Master Pandas DataFrames, Series, data cleaning, transformation, groupby operations, and merge techniques for real-world data analysis.
Read Article
Part 3: Data Visualization with Matplotlib & Seaborn
Create compelling visualizations with Python's most powerful plotting libraries. Learn line plots, bar charts, scatter plots, and statistical graphics.
Read Article