
Session_01

Jakub Polec 2025-06-13 07:25:59 +02:00
parent 16a25d8ee5
commit 6befd2d50c
18 changed files with 99,852 additions and 1 deletion


@@ -1 +1,98 @@
# Build Your Own Crypto Trading Bot Course Repository
Welcome to the private repository for the **"Build Your Own Crypto Trading Bot Hands-On Course with Alex"** by QuantJourney.
This repository contains materials, templates, and code samples used during the 6 live sessions held in June 2025.
> ⚠️ This repository is for registered participants only.
---
## Content Overview
**Session 1: Foundations & Data Structures**
- Set up Python, IDE, and required libraries
- Pandas basics for financial time series
- Understanding OHLCV format
- Create your first crypto DataFrame with sample data
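
A minimal sketch of the kind of OHLCV DataFrame Session 1 ends with (hand-written sample values, purely illustrative):

```python
import pandas as pd

# Hand-written OHLCV sample: one row per hourly candle
ohlcv = pd.DataFrame(
    {
        "open":   [67100.0, 67250.5, 67180.0],
        "high":   [67300.0, 67400.0, 67220.0],
        "low":    [67050.0, 67150.0, 66900.0],
        "close":  [67250.5, 67180.0, 66950.0],
        "volume": [152.3, 98.7, 121.4],
    },
    index=pd.date_range("2024-06-01", periods=3, freq="h"),
)
print(ohlcv)
```
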
**Session 2: Data Acquisition & Exchange Connectivity**
- WebSocket basics for real-time crypto feeds (Binance focus)
- Fail-safe reconnection logic and error handling (see the sketch below)
- Logging basics for live systems
- Build tools: order flow scanner, liquidation monitor, funding rate tracker
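
A minimal sketch of the fail-safe reconnect pattern, assuming the `websockets` package and Binance's public trade stream (the URL, backoff values, and printed fields are illustrative):

```python
import asyncio
import json
import websockets  # pip install websockets

STREAM_URL = "wss://stream.binance.com:9443/ws/btcusdt@trade"

async def listen_forever() -> None:
    backoff = 1
    while True:  # reconnect loop: one dropped socket must not kill the feed
        try:
            async with websockets.connect(STREAM_URL, ping_interval=20) as ws:
                backoff = 1  # reset backoff after a successful connect
                async for raw in ws:
                    trade = json.loads(raw)
                    print(trade.get("p"), trade.get("q"))  # price, quantity
        except (websockets.ConnectionClosed, OSError) as exc:
            print(f"connection lost ({exc}); retrying in {backoff}s")
            await asyncio.sleep(backoff)
            backoff = min(backoff * 2, 60)  # exponential backoff, capped at 60s

# asyncio.run(listen_forever())
```
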
**Session 3: Data Processing & Technical Analysis**
- API access using CCXT
- Handle rate limits and API error scenarios
- Reconnect & retry mechanisms
- Use pandas-ta to compute SMA, EMA, RSI
- Create your own indicator pipeline
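
A minimal sketch of that pipeline: fetch recent candles over the CCXT REST API, then compute SMA, EMA, and RSI with pandas-ta (symbol, timeframe, and lookback lengths are illustrative):

```python
import ccxt             # pip install ccxt
import pandas as pd
import pandas_ta as ta  # pip install pandas-ta

exchange = ccxt.binance()
candles = exchange.fetch_ohlcv("BTC/USDT", timeframe="1h", limit=200)
df = pd.DataFrame(candles, columns=["timestamp", "open", "high", "low", "close", "volume"])
df["timestamp"] = pd.to_datetime(df["timestamp"], unit="ms")

# Indicator pipeline: each call returns a Series aligned to df's index
df["sma_20"] = ta.sma(df["close"], length=20)
df["ema_50"] = ta.ema(df["close"], length=50)
df["rsi_14"] = ta.rsi(df["close"], length=14)

print(df.tail())
```
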
**Session 4: Strategy Development & Backtesting**
- Overview of strategy types (trend, mean reversion)
- Backtesting with `backtesting.py` (see the sketch below)
- Compute Sharpe ratio, drawdown, profit factor
- Add position sizing, SL/TP, and walk-forward logic
- Adjust for fees, slippage, and latency
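
A minimal sketch of a `backtesting.py` run, using the library's bundled GOOG sample data and SMA helper only to keep the snippet self-contained (lookbacks, cash, and commission are illustrative):

```python
from backtesting import Backtest, Strategy
from backtesting.lib import crossover
from backtesting.test import GOOG, SMA  # bundled sample data and moving-average helper

class SmaCross(Strategy):
    fast, slow = 10, 30  # illustrative lookbacks

    def init(self):
        self.sma_fast = self.I(SMA, self.data.Close, self.fast)
        self.sma_slow = self.I(SMA, self.data.Close, self.slow)

    def next(self):
        if crossover(self.sma_fast, self.sma_slow):
            self.buy()
        elif crossover(self.sma_slow, self.sma_fast):
            self.position.close()

bt = Backtest(GOOG, SmaCross, cash=10_000, commission=0.002)  # commission approximates fees
stats = bt.run()
print(stats[["Sharpe Ratio", "Max. Drawdown [%]", "Profit Factor"]])
```
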
**Session 5: Bot Architecture & Implementation**
- Bot system design: event-driven vs loop-based (see the skeleton below)
- Core components: order manager, position tracker, error handler
- Risk constraints: daily limits, max size
- Logging & monitoring structure
- Write the engine core for your bot
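
A rough, loop-based skeleton of that engine shape; every name below is a placeholder rather than the course's final design:

```python
import time

class PaperOrderManager:
    """Placeholder order manager: records orders instead of sending them."""

    def submit(self, side: str, size: float) -> None:
        print(f"[order] {side} {size}")

def run_loop(fetch_price, generate_signal, orders: PaperOrderManager,
             max_position: float = 0.01, interval_s: float = 5.0) -> None:
    position = 0.0
    while True:                                  # loop-based core: poll, decide, act, sleep
        price = fetch_price()
        signal = generate_signal(price)          # e.g. "buy", "sell", or None
        if signal == "buy" and position < max_position:
            orders.submit("buy", max_position - position)
            position = max_position              # risk constraint: never exceed max size
        elif signal == "sell" and position > 0:
            orders.submit("sell", position)
            position = 0.0
        time.sleep(interval_s)
```
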
**Session 6: Live Trading & Deployment**
- API keys and secure credential handling
- Deployment targets: local, VPS, cloud (e.g., Hetzner)
- Running 24/7: restart logic, alerting
- Final bot launch + testing in production
- Send alerts via Telegram or email
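
A minimal sketch of a Telegram alert through the Bot API's `sendMessage` endpoint (the environment variable names are assumptions; keep credentials out of the code):

```python
import os
import requests  # pip install requests

def send_telegram_alert(text: str) -> None:
    token = os.environ["TELEGRAM_BOT_TOKEN"]   # assumed env var names, set on the host
    chat_id = os.environ["TELEGRAM_CHAT_ID"]
    resp = requests.post(
        f"https://api.telegram.org/bot{token}/sendMessage",
        json={"chat_id": chat_id, "text": text},
        timeout=10,
    )
    resp.raise_for_status()

# send_telegram_alert("bot restarted")
```
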
---
## 🤖 AI-Enhanced Trading
Bonus section:
- Use ChatGPT/Claude for strategy suggestions
- Integrate AI-based filters or signal generation
- Let LLMs help you refactor and extend your logic
---
## 📁 Repository Structure
```text
/Session_01/ # Foundations & DataFrame Handling
/Session_02/ # WebSockets & Real-Time Feed Tools
/Session_03/ # Indicators & Analysis
/Session_04/ # Backtesting + Strategy Logic
/Session_05/ # Trading Bot Core Engine
/Session_06/ # Live Deployment and Monitoring
/templates/ # Starter and final bot code
/utils/ # Helper scripts for logging, reconnection, etc.
README.md # You are here
```
---
## 🛠 Requirements
- Python 3.10+
- Install dependencies session by session from each session folder, or all at once via the provided top-level `requirements.txt`
---
## 📫 Support
You can reach Alex directly at [alex@quantjourney.pro](mailto:alex@quantjourney.pro) for post-course support (1 week included).
---
## ⚠️ Disclaimer
This project is for **educational use only**. No financial advice. Always trade with caution and use proper risk management.
---
Happy coding and trade smart.
QuantJourney Team

Session_01/.DS_Store (BIN, vendored, new file; binary file not shown)

Session_01/Data/BTCUSD-1h-data.csv (new executable file, 83,955 lines added; file diff suppressed because it is too large)


@@ -0,0 +1,391 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Session 1 - DataFrames - Lesson 1: Creating DataFrames\n",
"\n",
"## Learning Objectives\n",
"- Understand different methods to create pandas DataFrames\n",
"- Learn to create DataFrames from dictionaries, lists, and NumPy arrays\n",
"- Practice with various data types and structures\n",
"\n",
"## Prerequisites\n",
"- Basic Python knowledge\n",
"- Understanding of lists and dictionaries"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Pandas version: 2.2.3\n",
"NumPy version: 2.2.6\n"
]
}
],
"source": [
"# Import required libraries\n",
"import pandas as pd\n",
"import numpy as np\n",
"from datetime import datetime, timedelta\n",
"\n",
"print(f\"Pandas version: {pd.__version__}\")\n",
"print(f\"NumPy version: {np.__version__}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Method 1: Creating DataFrame from Dictionary\n",
"\n",
"This is the most common and intuitive way to create a DataFrame."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Student DataFrame:\n",
" Name Age Grade Score\n",
"0 Alice 23 A 95\n",
"1 Bob 25 B 87\n",
"2 Charlie 22 A 92\n",
"3 Diana 24 C 78\n",
"4 Eve 23 B 89\n",
"\n",
"Shape: (5, 4)\n",
"Data types:\n",
"Name object\n",
"Age int64\n",
"Grade object\n",
"Score int64\n",
"dtype: object\n"
]
}
],
"source": [
"# Creating DataFrame from dictionary\n",
"student_data = {\n",
" 'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],\n",
" 'Age': [23, 25, 22, 24, 23],\n",
" 'Grade': ['A', 'B', 'A', 'C', 'B'],\n",
" 'Score': [95, 87, 92, 78, 89]\n",
"}\n",
"\n",
"df_students = pd.DataFrame(student_data)\n",
"print(\"Student DataFrame:\")\n",
"print(df_students)\n",
"print(f\"\\nShape: {df_students.shape}\")\n",
"print(f\"Data types:\\n{df_students.dtypes}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Method 2: Creating DataFrame from Lists\n",
"\n",
"You can create DataFrames from separate lists by combining them in a dictionary."
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Cities DataFrame:\n",
" City Population_Million Country\n",
"0 New York 8.4 USA\n",
"1 London 8.9 UK\n",
"2 Tokyo 13.9 Japan\n",
"3 Paris 2.1 France\n",
"4 Sydney 5.3 Australia\n",
"\n",
"Index: [0, 1, 2, 3, 4]\n",
"Columns: ['City', 'Population_Million', 'Country']\n"
]
}
],
"source": [
"# Creating DataFrame from separate lists\n",
"cities = ['New York', 'London', 'Tokyo', 'Paris', 'Sydney']\n",
"populations = [8.4, 8.9, 13.9, 2.1, 5.3]\n",
"countries = ['USA', 'UK', 'Japan', 'France', 'Australia']\n",
"\n",
"df_cities = pd.DataFrame({\n",
" 'City': cities,\n",
" 'Population_Million': populations,\n",
" 'Country': countries\n",
"})\n",
"\n",
"print(\"Cities DataFrame:\")\n",
"print(df_cities)\n",
"print(f\"\\nIndex: {df_cities.index.tolist()}\")\n",
"print(f\"Columns: {df_cities.columns.tolist()}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Method 3: Creating DataFrame from NumPy Array\n",
"\n",
"This method is useful when working with numerical data or when you need random data for testing."
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Random DataFrame:\n",
" Column_A Column_B Column_C\n",
"Row1 52 93 15\n",
"Row2 72 61 21\n",
"Row3 83 87 75\n",
"Row4 75 88 24\n",
"Row5 3 22 53\n",
"\n",
"Summary statistics:\n",
" Column_A Column_B Column_C\n",
"count 5.000000 5.000000 5.000000\n",
"mean 57.000000 70.200000 37.600000\n",
"std 32.272279 29.693434 25.530374\n",
"min 3.000000 22.000000 15.000000\n",
"25% 52.000000 61.000000 21.000000\n",
"50% 72.000000 87.000000 24.000000\n",
"75% 75.000000 88.000000 53.000000\n",
"max 83.000000 93.000000 75.000000\n"
]
}
],
"source": [
"# Creating DataFrame from NumPy array\n",
"np.random.seed(42) # For reproducible results\n",
"random_data = np.random.randint(1, 100, size=(5, 3))\n",
"\n",
"df_random = pd.DataFrame(random_data, \n",
" columns=['Column_A', 'Column_B', 'Column_C'],\n",
" index=['Row1', 'Row2', 'Row3', 'Row4', 'Row5'])\n",
"\n",
"print(\"Random DataFrame:\")\n",
"print(df_random)\n",
"print(f\"\\nSummary statistics:\")\n",
"print(df_random.describe())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Method 4: Creating DataFrame with Custom Index\n",
"\n",
"You can specify custom row labels (index) when creating DataFrames."
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Products DataFrame with Custom Index:\n",
" Product Price Stock\n",
"PROD001 Laptop 1200 15\n",
"PROD002 Phone 800 50\n",
"PROD003 Tablet 600 30\n",
"PROD004 Monitor 300 20\n",
"\n",
"Accessing by index label 'PROD002':\n",
"Product Phone\n",
"Price 800\n",
"Stock 50\n",
"Name: PROD002, dtype: object\n"
]
}
],
"source": [
"# Creating DataFrame with custom index\n",
"product_data = {\n",
" 'Product': ['Laptop', 'Phone', 'Tablet', 'Monitor'],\n",
" 'Price': [1200, 800, 600, 300],\n",
" 'Stock': [15, 50, 30, 20]\n",
"}\n",
"\n",
"# Custom index using product codes\n",
"custom_index = ['PROD001', 'PROD002', 'PROD003', 'PROD004']\n",
"df_products = pd.DataFrame(product_data, index=custom_index)\n",
"\n",
"print(\"Products DataFrame with Custom Index:\")\n",
"print(df_products)\n",
"print(f\"\\nAccessing by index label 'PROD002':\")\n",
"print(df_products.loc['PROD002'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Method 5: Creating Empty DataFrame and Adding Data\n",
"\n",
"Sometimes you need to start with an empty DataFrame and add data incrementally."
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Empty DataFrame:\n",
"Empty DataFrame\n",
"Columns: [Date, Temperature, Humidity, Pressure]\n",
"Index: []\n",
"Shape: (0, 4)\n",
"\n",
"DataFrame after adding data:\n",
" Date Temperature Humidity Pressure\n",
"0 2024-01-01 22.5 65 1013.2\n",
"1 2024-01-02 24.1 68 1015.1\n",
"2 2024-01-03 21.8 72 1012.8\n"
]
}
],
"source": [
"# Creating empty DataFrame with specified columns\n",
"columns = ['Date', 'Temperature', 'Humidity', 'Pressure']\n",
"df_weather = pd.DataFrame(columns=columns)\n",
"\n",
"print(\"Empty DataFrame:\")\n",
"print(df_weather)\n",
"print(f\"Shape: {df_weather.shape}\")\n",
"\n",
"# Adding data row by row (not recommended for large datasets)\n",
"weather_data = [\n",
" ['2024-01-01', 22.5, 65, 1013.2],\n",
" ['2024-01-02', 24.1, 68, 1015.1],\n",
" ['2024-01-03', 21.8, 72, 1012.8]\n",
"]\n",
"\n",
"for row in weather_data:\n",
" df_weather.loc[len(df_weather)] = row\n",
"\n",
"print(\"\\nDataFrame after adding data:\")\n",
"print(df_weather)"
]
},
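{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of the usually preferred alternative to the row-by-row loop above: collect the rows in a plain Python list first, then build the DataFrame in a single call (the `weather_rows` and `df_weather_bulk` names are illustrative; `columns` is reused from the cell above)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: collect rows in a list, then build the DataFrame once\n",
"weather_rows = [\n",
"    {'Date': '2024-01-01', 'Temperature': 22.5, 'Humidity': 65, 'Pressure': 1013.2},\n",
"    {'Date': '2024-01-02', 'Temperature': 24.1, 'Humidity': 68, 'Pressure': 1015.1},\n",
"    {'Date': '2024-01-03', 'Temperature': 21.8, 'Humidity': 72, 'Pressure': 1012.8}\n",
"]\n",
"\n",
"df_weather_bulk = pd.DataFrame(weather_rows, columns=columns)\n",
"print(df_weather_bulk)"
]
},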
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Practice Exercises\n",
"\n",
"Try these exercises to reinforce your learning:"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [],
"source": [
"# Exercise 1: Create a DataFrame from dictionary with employee information\n",
"# Include: Employee ID, Name, Department, Salary, Years of Experience\n",
"\n",
"# Your code here:\n",
"employee_data = {\n",
" # Add your data here\n",
"}\n",
"\n",
"# df_employees = pd.DataFrame(employee_data)\n",
"# print(df_employees)"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [],
"source": [
"# Exercise 2: Create a DataFrame using NumPy with 6 rows and 4 columns\n",
"# Use column names: 'A', 'B', 'C', 'D'\n",
"# Use row indices: 'R1', 'R2', 'R3', 'R4', 'R5', 'R6'\n",
"\n",
"# Your code here:\n"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [],
"source": [
"# Exercise 3: Create a DataFrame with mixed data types\n",
"# Include at least one string, integer, float, and boolean column\n",
"\n",
"# Your code here:\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Key Takeaways\n",
"\n",
"1. **Dictionary method** is most intuitive for creating DataFrames\n",
"2. **NumPy arrays** are useful for numerical data and testing\n",
"3. **Custom indices** provide meaningful row labels\n",
"4. **Empty DataFrames** can be useful but avoid adding rows one by one for large datasets\n",
"5. Always check the **shape** and **data types** of your DataFrame after creation\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.13.3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}


@@ -0,0 +1,523 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Session 1 - DataFrames - Lesson 2: Basic Operations\n",
"\n",
"## Learning Objectives\n",
"- Learn essential methods to explore DataFrame structure\n",
"- Understand how to get basic information about your data\n",
"- Master data inspection techniques\n",
"- Practice with summary statistics\n",
"\n",
"## Prerequisites\n",
"- Completed Lesson 1: Creating DataFrames\n",
"- Basic understanding of pandas DataFrames"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [],
"source": [
"# Import required libraries\n",
"import pandas as pd\n",
"import numpy as np\n",
"from datetime import datetime, timedelta\n",
"\n",
"# Set display options for better output\n",
"pd.set_option('display.max_columns', None)\n",
"pd.set_option('display.width', None)\n",
"pd.set_option('display.max_colwidth', 50)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Creating Sample Dataset\n",
"\n",
"Let's create a comprehensive sales dataset to practice basic operations."
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Sales Dataset Created!\n",
"Dataset shape: (20, 6)\n"
]
}
],
"source": [
"# Create a comprehensive sales dataset\n",
"np.random.seed(42)\n",
"\n",
"sales_data = {\n",
" 'Date': pd.date_range('2024-01-01', periods=20, freq='D'),\n",
" 'Product': ['Laptop', 'Phone', 'Tablet', 'Laptop', 'Phone'] * 4,\n",
" 'Sales': [1200, 800, 600, 1100, 850, 1300, 750, 650, 1250, 900,\n",
" 1150, 820, 700, 1180, 880, 1220, 780, 620, 1300, 850],\n",
" 'Region': ['North', 'South', 'East', 'West', 'North'] * 4,\n",
" 'Salesperson': ['John', 'Sarah', 'Mike', 'Lisa', 'Tom'] * 4,\n",
" 'Commission_Rate': [0.10, 0.12, 0.08, 0.11, 0.09] * 4\n",
"}\n",
"\n",
"df_sales = pd.DataFrame(sales_data)\n",
"print(\"Sales Dataset Created!\")\n",
"print(f\"Dataset shape: {df_sales.shape}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Viewing Data\n",
"\n",
"These methods help you quickly inspect your data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# View first few rows\n",
"print(\"First 5 rows (default):\")\n",
"print(df_sales.head())\n",
"\n",
"print(\"\\nFirst 3 rows:\")\n",
"print(df_sales.head(3))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# View last few rows\n",
"print(\"Last 5 rows (default):\")\n",
"print(df_sales.tail())\n",
"\n",
"print(\"\\nLast 3 rows:\")\n",
"print(df_sales.tail(3))"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Random sample of 5 rows:\n",
" Date Product Sales Region Salesperson Commission_Rate\n",
"0 2024-01-01 Laptop 1200 North John 0.10\n",
"17 2024-01-18 Tablet 620 East Mike 0.08\n",
"15 2024-01-16 Laptop 1220 North John 0.10\n",
"1 2024-01-02 Phone 800 South Sarah 0.12\n",
"8 2024-01-09 Laptop 1250 West Lisa 0.11\n",
"\n",
"Random sample with different random state:\n",
" Date Product Sales Region Salesperson Commission_Rate\n",
"7 2024-01-08 Tablet 650 East Mike 0.08\n",
"10 2024-01-11 Laptop 1150 North John 0.10\n",
"5 2024-01-06 Laptop 1300 North John 0.10\n"
]
}
],
"source": [
"# Sample random rows\n",
"print(\"Random sample of 5 rows:\")\n",
"print(df_sales.sample(5))\n",
"\n",
"print(\"\\nRandom sample with different random state:\")\n",
"print(df_sales.sample(3, random_state=10))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. DataFrame Information\n",
"\n",
"Get detailed information about your DataFrame structure."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Comprehensive information about the DataFrame\n",
"print(\"DataFrame Info:\")\n",
"df_sales.info()\n",
"\n",
"print(\"\\nMemory usage:\")\n",
"df_sales.info(memory_usage='deep')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Basic properties\n",
"print(f\"Shape (rows, columns): {df_sales.shape}\")\n",
"print(f\"Number of rows: {len(df_sales)}\")\n",
"print(f\"Number of columns: {len(df_sales.columns)}\")\n",
"print(f\"Total elements: {df_sales.size}\")\n",
"print(f\"Dimensions: {df_sales.ndim}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Column and index information\n",
"print(\"Column names:\")\n",
"print(df_sales.columns.tolist())\n",
"\n",
"print(\"\\nData types:\")\n",
"print(df_sales.dtypes)\n",
"\n",
"print(\"\\nIndex information:\")\n",
"print(f\"Index: {df_sales.index}\")\n",
"print(f\"Index type: {type(df_sales.index)}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Summary Statistics\n",
"\n",
"Understand your data through statistical summaries."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Summary statistics for numerical columns\n",
"print(\"Summary statistics:\")\n",
"print(df_sales.describe())\n",
"\n",
"print(\"\\nRounded to 2 decimal places:\")\n",
"print(df_sales.describe().round(2))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Summary statistics for all columns (including non-numeric)\n",
"print(\"Summary for all columns:\")\n",
"print(df_sales.describe(include='all'))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Individual statistics\n",
"print(\"Individual Statistical Measures:\")\n",
"print(f\"Mean sales: {df_sales['Sales'].mean():.2f}\")\n",
"print(f\"Median sales: {df_sales['Sales'].median():.2f}\")\n",
"print(f\"Standard deviation: {df_sales['Sales'].std():.2f}\")\n",
"print(f\"Minimum sales: {df_sales['Sales'].min()}\")\n",
"print(f\"Maximum sales: {df_sales['Sales'].max()}\")\n",
"print(f\"Sales range: {df_sales['Sales'].max() - df_sales['Sales'].min()}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Quantiles and percentiles\n",
"print(\"Quantiles for Sales:\")\n",
"print(f\"25th percentile (Q1): {df_sales['Sales'].quantile(0.25)}\")\n",
"print(f\"50th percentile (Q2/Median): {df_sales['Sales'].quantile(0.50)}\")\n",
"print(f\"75th percentile (Q3): {df_sales['Sales'].quantile(0.75)}\")\n",
"print(f\"90th percentile: {df_sales['Sales'].quantile(0.90)}\")\n",
"\n",
"print(\"\\nCustom quantiles:\")\n",
"quantiles = df_sales['Sales'].quantile([0.1, 0.3, 0.7, 0.9])\n",
"print(quantiles)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Counting and Unique Values\n",
"\n",
"Understand the distribution of categorical data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Count unique values in each column\n",
"print(\"Number of unique values per column:\")\n",
"print(df_sales.nunique())\n",
"\n",
"print(\"\\nUnique values in 'Product' column:\")\n",
"print(df_sales['Product'].unique())\n",
"\n",
"print(\"\\nValue counts for 'Product':\")\n",
"print(df_sales['Product'].value_counts())"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Product distribution (counts and percentages):\n",
" Count Percentage\n",
"Product \n",
"Laptop 8 40.0\n",
"Phone 8 40.0\n",
"Tablet 4 20.0\n"
]
}
],
"source": [
"# Value counts with percentages\n",
"print(\"Product distribution (counts and percentages):\")\n",
"product_counts = df_sales['Product'].value_counts()\n",
"product_percentages = df_sales['Product'].value_counts(normalize=True) * 100\n",
"\n",
"distribution = pd.DataFrame({\n",
" 'Count': product_counts,\n",
" 'Percentage': product_percentages.round(1)\n",
"})\n",
"print(distribution)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Cross-tabulation\n",
"print(\"Cross-tabulation of Product vs Region:\")\n",
"crosstab = pd.crosstab(df_sales['Product'], df_sales['Region'])\n",
"print(crosstab)\n",
"\n",
"print(\"\\nWith percentages:\")\n",
"crosstab_pct = pd.crosstab(df_sales['Product'], df_sales['Region'], normalize='all') * 100\n",
"print(crosstab_pct.round(1))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. Data Quality Checks\n",
"\n",
"Essential checks for data quality and integrity."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Check for missing values\n",
"print(\"Missing values per column:\")\n",
"print(df_sales.isnull().sum())\n",
"\n",
"print(\"\\nPercentage of missing values:\")\n",
"missing_percentages = (df_sales.isnull().sum() / len(df_sales)) * 100\n",
"print(missing_percentages.round(2))\n",
"\n",
"print(\"\\nAny missing values in dataset?\", df_sales.isnull().any().any())"
]
},
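{
"cell_type": "markdown",
"metadata": {},
"source": [
"This dataset has no missing values, so the checks above come back clean. A minimal sketch of what changes once a NaN appears, using an illustrative copy (`df_missing_demo` is not part of the course dataset):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: introduce a single missing value in a copy and re-run the checks\n",
"df_missing_demo = df_sales.copy()\n",
"df_missing_demo['Sales'] = df_missing_demo['Sales'].astype('float64')  # int columns cannot hold NaN\n",
"df_missing_demo.loc[0, 'Sales'] = np.nan\n",
"\n",
"print(\"Missing values per column:\")\n",
"print(df_missing_demo.isnull().sum())\n",
"\n",
"# Statistics skip NaN by default (skipna=True), so results shift silently\n",
"print(f\"\\nMean Sales with one NaN: {df_missing_demo['Sales'].mean():.2f}\")\n",
"print(f\"Mean Sales in original: {df_sales['Sales'].mean():.2f}\")"
]
},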
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Check for duplicates\n",
"print(f\"Number of duplicate rows: {df_sales.duplicated().sum()}\")\n",
"print(f\"Any duplicate rows? {df_sales.duplicated().any()}\")\n",
"\n",
"# Check for duplicates based on specific columns\n",
"print(f\"\\nDuplicate combinations of Date and Salesperson: {df_sales.duplicated(['Date', 'Salesperson']).sum()}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6. Quick Data Exploration\n",
"\n",
"Rapid exploration techniques to understand your data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Quick exploration function\n",
"def quick_explore(df, column_name):\n",
" \"\"\"Quick exploration of a specific column\"\"\"\n",
" print(f\"=== Quick Exploration: {column_name} ===\")\n",
" col = df[column_name]\n",
" \n",
" print(f\"Data type: {col.dtype}\")\n",
" print(f\"Non-null values: {col.count()}/{len(col)}\")\n",
" print(f\"Unique values: {col.nunique()}\")\n",
" \n",
" if col.dtype in ['int64', 'float64']:\n",
" print(f\"Min: {col.min()}, Max: {col.max()}\")\n",
" print(f\"Mean: {col.mean():.2f}, Median: {col.median():.2f}\")\n",
" else:\n",
" print(f\"Most common: {col.mode().iloc[0] if not col.mode().empty else 'N/A'}\")\n",
" print(f\"Sample values: {col.unique()[:5].tolist()}\")\n",
" print()\n",
"\n",
"# Explore different columns\n",
"for col in ['Sales', 'Product', 'Region']:\n",
" quick_explore(df_sales, col)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Practice Exercises\n",
"\n",
"Test your understanding with these exercises:"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [],
"source": [
"# Exercise 1: Create a larger dataset and explore it\n",
"# Create a dataset with 100 rows and at least 5 columns\n",
"# Include different data types (numeric, categorical, datetime)\n",
"\n",
"# Your code here:\n"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [],
"source": [
"# Exercise 2: Write a function that provides a complete data profile\n",
"# Include: shape, data types, missing values, unique values, and basic stats\n",
"\n",
"def data_profile(df):\n",
" \"\"\"Provide a comprehensive data profile\"\"\"\n",
" # Your code here:\n",
" pass\n",
"\n",
"# Test your function\n",
"# data_profile(df_sales)"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [],
"source": [
"# Exercise 3: Find interesting insights from the sales data\n",
"# Questions to answer:\n",
"# 1. Which product has the highest average sales?\n",
"# 2. Which region has the most consistent sales (lowest standard deviation)?\n",
"# 3. What's the total commission earned by each salesperson?\n",
"\n",
"# Your code here:\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Key Takeaways\n",
"\n",
"1. **`.head()` and `.tail()`** are essential for quick data inspection\n",
"2. **`.info()`** provides comprehensive DataFrame structure information\n",
"3. **`.describe()`** gives statistical summaries for numerical columns\n",
"4. **`.nunique()` and `.value_counts()`** help understand categorical data\n",
"5. **Always check for missing values** and duplicates in your data\n",
"6. **Statistical measures** (mean, median, std) provide insights into data distribution\n",
"7. **Cross-tabulation** helps understand relationships between categorical variables\n",
"\n",
"## Common Gotchas\n",
"\n",
"- `.describe()` only includes numeric columns by default (use `include='all'` for all columns)\n",
"- Missing values can affect statistical calculations\n",
"- Large datasets might need memory-efficient exploration techniques\n",
"- Always verify data types are correct for your analysis"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.13.3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}


@@ -0,0 +1,593 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Session 1 - DataFrames - Lesson 3: Selecting and Filtering Data\n",
"\n",
"## Learning Objectives\n",
"- Master column and row selection techniques\n",
"- Learn boolean indexing for data filtering\n",
"- Understand the difference between `.loc[]` and `.iloc[]`\n",
"- Practice complex filtering conditions\n",
"- Handle edge cases in data selection\n",
"\n",
"## Prerequisites\n",
"- Completed Lessons 1-2\n",
"- Understanding of Python boolean operations"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Import required libraries\n",
"import pandas as pd\n",
"import numpy as np\n",
"from datetime import datetime, timedelta\n",
"\n",
"# Create sample dataset\n",
"np.random.seed(42)\n",
"sales_data = {\n",
" 'Date': pd.date_range('2024-01-01', periods=20, freq='D'),\n",
" 'Product': ['Laptop', 'Phone', 'Tablet', 'Laptop', 'Phone'] * 4,\n",
" 'Sales': [1200, 800, 600, 1100, 850, 1300, 750, 650, 1250, 900,\n",
" 1150, 820, 700, 1180, 880, 1220, 780, 620, 1300, 850],\n",
" 'Region': ['North', 'South', 'East', 'West', 'North'] * 4,\n",
" 'Salesperson': ['John', 'Sarah', 'Mike', 'Lisa', 'Tom'] * 4,\n",
" 'Commission_Rate': [0.10, 0.12, 0.08, 0.11, 0.09] * 4\n",
"}\n",
"\n",
"df_sales = pd.DataFrame(sales_data)\n",
"print(\"Dataset loaded:\")\n",
"print(df_sales.head())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Selecting Columns\n",
"\n",
"Different ways to select columns from a DataFrame."
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Single column (Product) - Returns Series:\n",
"Type: <class 'pandas.core.series.Series'>\n",
"0 Laptop\n",
"1 Phone\n",
"2 Tablet\n",
"3 Laptop\n",
"4 Phone\n",
"Name: Product, dtype: object\n",
"\n",
"Single column with dot notation:\n",
"0 Laptop\n",
"1 Phone\n",
"2 Tablet\n",
"3 Laptop\n",
"4 Phone\n",
"Name: Product, dtype: object\n",
"\n",
"Single column as DataFrame (note the double brackets):\n",
"Type: <class 'pandas.core.frame.DataFrame'>\n",
" Product\n",
"0 Laptop\n",
"1 Phone\n",
"2 Tablet\n",
"3 Laptop\n",
"4 Phone\n"
]
}
],
"source": [
"# Single column selection (returns Series)\n",
"print(\"Single column (Product) - Returns Series:\")\n",
"product_series = df_sales['Product']\n",
"print(f\"Type: {type(product_series)}\")\n",
"print(product_series.head())\n",
"\n",
"print(\"\\nSingle column with dot notation:\")\n",
"print(df_sales.Product.head())\n",
"\n",
"print(\"\\nSingle column as DataFrame (note the double brackets):\")\n",
"product_df = df_sales[['Product']]\n",
"print(f\"Type: {type(product_df)}\")\n",
"print(product_df.head())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Multiple column selection\n",
"print(\"Multiple columns:\")\n",
"selected_cols = df_sales[['Product', 'Sales', 'Region']]\n",
"print(selected_cols.head())\n",
"\n",
"print(\"\\nUsing a list variable:\")\n",
"columns_to_select = ['Date', 'Salesperson', 'Sales']\n",
"selected_df = df_sales[columns_to_select]\n",
"print(selected_df.head())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Column selection with conditions\n",
"print(\"Selecting columns by data type:\")\n",
"numeric_cols = df_sales.select_dtypes(include=[np.number])\n",
"print(\"Numeric columns:\")\n",
"print(numeric_cols.head())\n",
"\n",
"print(\"\\nSelecting columns by name pattern:\")\n",
"# Columns containing 'S'\n",
"s_columns = [col for col in df_sales.columns if 'S' in col]\n",
"print(f\"Columns with 'S': {s_columns}\")\n",
"print(df_sales[s_columns].head())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Selecting Rows\n",
"\n",
"Different methods to select specific rows."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Row selection by index position\n",
"print(\"First row (index 0):\")\n",
"print(df_sales.iloc[0])\n",
"\n",
"print(\"\\nRows 2 to 4 (positions 1, 2, 3):\")\n",
"print(df_sales.iloc[1:4])\n",
"\n",
"print(\"\\nLast 3 rows:\")\n",
"print(df_sales.iloc[-3:])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Row selection by label/index\n",
"print(\"Using .loc with index labels:\")\n",
"print(df_sales.loc[0:2]) # Note: includes endpoint with .loc\n",
"\n",
"print(\"\\nSpecific rows by index:\")\n",
"specific_rows = df_sales.loc[[0, 5, 10, 15]]\n",
"print(specific_rows[['Product', 'Sales', 'Region']])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Random sampling\n",
"print(\"Random sample of 5 rows:\")\n",
"random_sample = df_sales.sample(n=5, random_state=42)\n",
"print(random_sample[['Product', 'Sales', 'Salesperson']])\n",
"\n",
"print(\"\\nRandom 25% of the data:\")\n",
"percentage_sample = df_sales.sample(frac=0.25, random_state=42)\n",
"print(f\"Sample size: {len(percentage_sample)} rows\")\n",
"print(percentage_sample[['Product', 'Sales']].head())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Boolean Indexing and Filtering\n",
"\n",
"Filter data based on conditions using boolean indexing."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Simple boolean conditions\n",
"print(\"Sales greater than 1000:\")\n",
"high_sales = df_sales[df_sales['Sales'] > 1000]\n",
"print(high_sales[['Product', 'Sales', 'Region']])\n",
"\n",
"print(\"\\nSpecific product filter:\")\n",
"laptops_only = df_sales[df_sales['Product'] == 'Laptop']\n",
"print(laptops_only[['Date', 'Sales', 'Salesperson']])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Multiple conditions with AND (&)\n",
"print(\"Laptops with sales > 1100:\")\n",
"laptop_high_sales = df_sales[(df_sales['Product'] == 'Laptop') & (df_sales['Sales'] > 1100)]\n",
"print(laptop_high_sales[['Date', 'Product', 'Sales', 'Region']])\n",
"\n",
"print(\"\\nNorth region with commission rate >= 0.10:\")\n",
"north_high_commission = df_sales[(df_sales['Region'] == 'North') & (df_sales['Commission_Rate'] >= 0.10)]\n",
"print(north_high_commission[['Product', 'Sales', 'Commission_Rate', 'Salesperson']])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Multiple conditions with OR (|)\n",
"print(\"Laptops OR high sales (>1200):\")\n",
"laptop_or_high = df_sales[(df_sales['Product'] == 'Laptop') | (df_sales['Sales'] > 1200)]\n",
"print(laptop_or_high[['Product', 'Sales', 'Region']])\n",
"\n",
"print(\"\\nNorth OR South regions:\")\n",
"north_or_south = df_sales[(df_sales['Region'] == 'North') | (df_sales['Region'] == 'South')]\n",
"print(north_or_south[['Product', 'Sales', 'Region']].head())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Using .isin() for multiple values\n",
"print(\"Products: Laptop or Phone\")\n",
"laptop_phone = df_sales[df_sales['Product'].isin(['Laptop', 'Phone'])]\n",
"print(laptop_phone[['Product', 'Sales', 'Region']].head())\n",
"\n",
"print(\"\\nSpecific salespersons:\")\n",
"selected_salespeople = df_sales[df_sales['Salesperson'].isin(['John', 'Sarah'])]\n",
"print(selected_salespeople[['Salesperson', 'Product', 'Sales']].head())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# NOT conditions using ~\n",
"print(\"NOT Tablets:\")\n",
"not_tablets = df_sales[~(df_sales['Product'] == 'Tablet')]\n",
"print(not_tablets['Product'].value_counts())\n",
"\n",
"print(\"\\nNOT in North region:\")\n",
"not_north = df_sales[~df_sales['Region'].isin(['North'])]\n",
"print(not_north['Region'].value_counts())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Advanced Selection with .loc and .iloc\n",
"\n",
"Powerful selection methods for precise data access."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# .loc for label-based selection\n",
"print(\".loc examples - Label-based selection:\")\n",
"\n",
"# Select specific rows and columns\n",
"print(\"Rows 0-2, specific columns:\")\n",
"result = df_sales.loc[0:2, ['Product', 'Sales', 'Region']]\n",
"print(result)\n",
"\n",
"print(\"\\nAll rows, specific columns:\")\n",
"result = df_sales.loc[:, ['Product', 'Sales']]\n",
"print(result.head())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# .iloc for position-based selection\n",
"print(\".iloc examples - Position-based selection:\")\n",
"\n",
"# Select by position\n",
"print(\"First 3 rows, first 3 columns:\")\n",
"result = df_sales.iloc[0:3, 0:3]\n",
"print(result)\n",
"\n",
"print(\"\\nEvery other row, specific columns:\")\n",
"result = df_sales.iloc[::2, [1, 2, 3]] # Every 2nd row, columns 1,2,3\n",
"print(result.head())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Combining boolean indexing with .loc\n",
"print(\"Boolean indexing with .loc:\")\n",
"\n",
"# High sales, specific columns\n",
"high_sales_subset = df_sales.loc[df_sales['Sales'] > 1000, ['Product', 'Sales', 'Salesperson']]\n",
"print(high_sales_subset)\n",
"\n",
"print(\"\\nComplex condition with .loc:\")\n",
"complex_filter = (df_sales['Product'] == 'Laptop') & (df_sales['Region'] == 'North')\n",
"result = df_sales.loc[complex_filter, ['Date', 'Sales', 'Commission_Rate']]\n",
"print(result)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. String-based Filtering\n",
"\n",
"Filter data based on string patterns and conditions."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# String methods for filtering\n",
"print(\"Salesperson names starting with 'J':\")\n",
"j_names = df_sales[df_sales['Salesperson'].str.startswith('J')]\n",
"print(j_names[['Salesperson', 'Product', 'Sales']].head())\n",
"\n",
"print(\"\\nRegions containing 'th':\")\n",
"th_regions = df_sales[df_sales['Region'].str.contains('th')]\n",
"print(th_regions[['Region', 'Product', 'Sales']].head())\n",
"\n",
"print(\"\\nProducts with exactly 5 characters:\")\n",
"five_char_products = df_sales[df_sales['Product'].str.len() == 5]\n",
"print(five_char_products['Product'].unique())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6. Date-based Filtering\n",
"\n",
"Filter data based on date conditions."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Date filtering\n",
"print(\"Data from first week of January 2024:\")\n",
"first_week = df_sales[df_sales['Date'] <= '2024-01-07']\n",
"print(first_week[['Date', 'Product', 'Sales']])\n",
"\n",
"print(\"\\nData from specific date range:\")\n",
"date_range = df_sales[(df_sales['Date'] >= '2024-01-10') & (df_sales['Date'] <= '2024-01-15')]\n",
"print(date_range[['Date', 'Product', 'Sales', 'Region']])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Using date components\n",
"print(\"Data from weekends (Saturday=5, Sunday=6):\")\n",
"weekends = df_sales[df_sales['Date'].dt.dayofweek >= 5]\n",
"print(weekends[['Date', 'Product', 'Sales']])\n",
"\n",
"print(\"\\nData from specific days of week:\")\n",
"mondays = df_sales[df_sales['Date'].dt.day_name() == 'Monday']\n",
"print(f\"Monday sales: {len(mondays)} records\")\n",
"if len(mondays) > 0:\n",
" print(mondays[['Date', 'Product', 'Sales']].head())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 7. Query Method\n",
"\n",
"Alternative syntax for filtering using the `.query()` method."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Using .query() method for cleaner syntax\n",
"print(\"Using .query() method:\")\n",
"\n",
"# Simple condition\n",
"high_sales_query = df_sales.query('Sales > 1000')\n",
"print(f\"High sales records: {len(high_sales_query)}\")\n",
"print(high_sales_query[['Product', 'Sales', 'Region']].head())\n",
"\n",
"print(\"\\nMultiple conditions:\")\n",
"complex_query = df_sales.query('Product == \"Laptop\" and Region == \"North\"')\n",
"print(complex_query[['Date', 'Sales', 'Commission_Rate']])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Query with variables\n",
"min_sales = 900\n",
"target_region = 'East'\n",
"\n",
"print(\"Query with variables:\")\n",
"var_query = df_sales.query('Sales >= @min_sales and Region == @target_region')\n",
"print(var_query[['Product', 'Sales', 'Region']])\n",
"\n",
"print(\"\\nQuery with list (isin equivalent):\")\n",
"products = ['Laptop', 'Phone']\n",
"list_query = df_sales.query('Product in @products')\n",
"print(f\"Records for {products}: {len(list_query)}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Practice Exercises\n",
"\n",
"Test your filtering and selection skills:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Exercise 1: Complex Filtering\n",
"# Find all sales where:\n",
"# - Product is either 'Laptop' or 'Phone'\n",
"# - Sales are above the median\n",
"# - Commission rate is at least 0.10\n",
"# Show only Date, Product, Sales, and Salesperson columns\n",
"\n",
"# Your code here:\n",
"median_sales = df_sales['Sales'].median()\n",
"print(f\"Median sales: {median_sales}\")\n",
"\n",
"# complex_filter = ?\n",
"# result = ?\n",
"# print(result)"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [],
"source": [
"# Exercise 2: Date-based Analysis\n",
"# Find sales data for the second week of January 2024\n",
"# Calculate the average sales for that week\n",
"# Show which products were sold and by whom\n",
"\n",
"# Your code here:\n"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [],
"source": [
"# Exercise 3: Performance Analysis\n",
"# Create a function that finds top performers:\n",
"# - Takes a DataFrame and a percentile (e.g., 0.8 for top 20%)\n",
"# - Returns salespeople whose average sales are in the top percentile\n",
"# - Show their average sales and total number of sales\n",
"\n",
"def find_top_performers(df, percentile=0.8):\n",
" \"\"\"Find top performing salespeople\"\"\"\n",
" # Your code here:\n",
" pass\n",
"\n",
"# Test your function\n",
"# top_performers = find_top_performers(df_sales, 0.8)\n",
"# print(top_performers)"
]
},
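{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before the takeaways, a minimal sketch of two of the pitfalls listed below: chained indexing versus a single `.loc[]` call, and why `&`/`|` with parentheses are needed instead of `and`/`or` (variable names here are illustrative):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Chained indexing: two separate indexing steps; assignments through it may hit a copy\n",
"chained = df_sales[df_sales['Region'] == 'North']['Sales']\n",
"\n",
"# Single .loc[] call: one step, safe for both reading and writing\n",
"loc_based = df_sales.loc[df_sales['Region'] == 'North', 'Sales']\n",
"\n",
"print(f\"Same values either way: {chained.equals(loc_based)}\")\n",
"\n",
"# Combining conditions requires & / | and parentheses; `and` / `or` raise an error here\n",
"combined = df_sales.loc[(df_sales['Region'] == 'North') & (df_sales['Sales'] > 1000), 'Sales']\n",
"print(f\"North rows with Sales > 1000: {len(combined)}\")"
]
},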
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Key Takeaways\n",
"\n",
"1. **Column Selection**: Use `[]` for single/multiple columns, understand Series vs DataFrame return types\n",
"2. **Row Selection**: `.iloc[]` for position-based, `.loc[]` for label-based selection\n",
"3. **Boolean Indexing**: Use `&` (AND), `|` (OR), `~` (NOT) for combining conditions\n",
"4. **Parentheses Matter**: Always wrap individual conditions in parentheses when combining\n",
"5. **`.isin()` Method**: Efficient way to filter for multiple values\n",
"6. **String Methods**: Use `.str` accessor for string-based filtering\n",
"7. **Date Filtering**: Leverage `.dt` accessor for date-based conditions\n",
"8. **`.query()` Method**: Alternative syntax for complex filtering\n",
"\n",
"## Common Mistakes to Avoid\n",
"\n",
"- Using `and/or` instead of `&/|` in boolean conditions\n",
"- Forgetting parentheses around conditions\n",
"- Confusing `.loc[]` and `.iloc[]` usage\n",
"- Not handling empty results from filtering\n",
"- Using chained indexing instead of `.loc[]`\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.13.3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}

File diff suppressed because it is too large.


@ -0,0 +1,733 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Session 1 - DataFrames - Lesson 5: Adding and Modifying Columns\n",
"\n",
"## Learning Objectives\n",
"- Learn different methods to add new columns to DataFrames\n",
"- Master conditional column creation using various techniques\n",
"- Understand how to modify existing columns\n",
"- Practice with calculated fields and derived columns\n",
"- Explore data type conversions and transformations\n",
"\n",
"## Prerequisites\n",
"- Completed Lessons 1-4\n",
"- Understanding of basic Python operations and functions"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Import required libraries\n",
"import pandas as pd\n",
"import numpy as np\n",
"from datetime import datetime, timedelta\n",
"\n",
"# Create sample dataset\n",
"np.random.seed(42)\n",
"n_records = 150\n",
"\n",
"sales_data = {\n",
" 'Date': pd.date_range('2024-01-01', periods=n_records, freq='D'),\n",
" 'Product': np.random.choice(['Laptop', 'Phone', 'Tablet', 'Monitor'], n_records),\n",
" 'Sales': np.random.normal(1000, 200, n_records).astype(int),\n",
" 'Quantity': np.random.randint(1, 8, n_records),\n",
" 'Region': np.random.choice(['North', 'South', 'East', 'West'], n_records),\n",
" 'Salesperson': np.random.choice(['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'], n_records),\n",
" 'Customer_Type': np.random.choice(['New', 'Returning', 'VIP'], n_records, p=[0.3, 0.6, 0.1])\n",
"}\n",
"\n",
"df_sales = pd.DataFrame(sales_data)\n",
"df_sales['Sales'] = np.abs(df_sales['Sales']) # Ensure positive values\n",
"\n",
"print(\"Original dataset:\")\n",
"print(f\"Shape: {df_sales.shape}\")\n",
"print(\"\\nFirst few rows:\")\n",
"print(df_sales.head())\n",
"print(\"\\nData types:\")\n",
"print(df_sales.dtypes)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Basic Column Addition\n",
"\n",
"Simple methods to add new columns."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Method 1: Direct assignment\n",
"df_modified = df_sales.copy()\n",
"\n",
"# Add simple calculated columns\n",
"df_modified['Revenue'] = df_modified['Sales'] * df_modified['Quantity']\n",
"df_modified['Commission_10%'] = df_modified['Sales'] * 0.10\n",
"df_modified['Sales_per_Unit'] = df_modified['Sales'] / df_modified['Quantity']\n",
"\n",
"print(\"New calculated columns:\")\n",
"print(df_modified[['Sales', 'Quantity', 'Revenue', 'Commission_10%', 'Sales_per_Unit']].head())\n",
"\n",
"# Add constant value column\n",
"df_modified['Year'] = 2024\n",
"df_modified['Currency'] = 'USD'\n",
"df_modified['Department'] = 'Sales'\n",
"\n",
"print(\"\\nConstant value columns added:\")\n",
"print(df_modified[['Year', 'Currency', 'Department']].head())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Method 2: Using assign() method (more functional approach)\n",
"df_assigned = df_sales.assign(\n",
" Revenue=lambda x: x['Sales'] * x['Quantity'],\n",
" Commission_Rate=0.08,\n",
" Commission_Amount=lambda x: x['Sales'] * 0.08,\n",
" Sales_Squared=lambda x: x['Sales'] ** 2,\n",
" Is_High_Volume=lambda x: x['Quantity'] > 5\n",
")\n",
"\n",
"print(\"Using assign() method:\")\n",
"print(df_assigned[['Sales', 'Quantity', 'Revenue', 'Commission_Amount', 'Is_High_Volume']].head())\n",
"\n",
"print(f\"\\nOriginal shape: {df_sales.shape}\")\n",
"print(f\"Modified shape: {df_assigned.shape}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Method 3: Using insert() for specific positioning\n",
"df_insert = df_sales.copy()\n",
"\n",
"# Insert column at specific position (after 'Sales')\n",
"sales_index = df_insert.columns.get_loc('Sales')\n",
"df_insert.insert(sales_index + 1, 'Sales_Tax', df_insert['Sales'] * 0.08)\n",
"df_insert.insert(sales_index + 2, 'Total_with_Tax', df_insert['Sales'] + df_insert['Sales_Tax'])\n",
"\n",
"print(\"Using insert() for positioned columns:\")\n",
"print(df_insert[['Product', 'Sales', 'Sales_Tax', 'Total_with_Tax', 'Quantity']].head())\n",
"print(f\"\\nColumn order: {list(df_insert.columns)}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Conditional Column Creation\n",
"\n",
"Create columns based on conditions and business logic."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Method 1: Using np.where() for simple conditions\n",
"df_conditional = df_sales.copy()\n",
"\n",
"# Simple binary conditions\n",
"df_conditional['High_Sales'] = np.where(df_conditional['Sales'] > 1000, 'Yes', 'No')\n",
"df_conditional['Weekend'] = np.where(df_conditional['Date'].dt.dayofweek >= 5, 'Weekend', 'Weekday')\n",
"df_conditional['Bulk_Order'] = np.where(df_conditional['Quantity'] >= 5, 'Bulk', 'Regular')\n",
"\n",
"print(\"Simple conditional columns:\")\n",
"print(df_conditional[['Sales', 'High_Sales', 'Date', 'Weekend', 'Quantity', 'Bulk_Order']].head())\n",
"\n",
"# Nested conditions\n",
"df_conditional['Sales_Category'] = np.where(df_conditional['Sales'] > 1200, 'High',\n",
" np.where(df_conditional['Sales'] > 800, 'Medium', 'Low'))\n",
"\n",
"print(\"\\nNested conditions:\")\n",
"print(df_conditional[['Sales', 'Sales_Category']].head(10))\n",
"print(\"\\nCategory distribution:\")\n",
"print(df_conditional['Sales_Category'].value_counts())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Method 2: Using pd.cut() for binning numerical data\n",
"df_conditional['Sales_Tier'] = pd.cut(df_conditional['Sales'], \n",
" bins=[0, 500, 800, 1200, float('inf')],\n",
" labels=['Entry', 'Standard', 'Premium', 'Luxury'])\n",
"\n",
"print(\"Using pd.cut() for binning:\")\n",
"print(df_conditional[['Sales', 'Sales_Tier']].head(10))\n",
"print(\"\\nTier distribution:\")\n",
"print(df_conditional['Sales_Tier'].value_counts())\n",
"\n",
"# Using pd.qcut() for quantile-based binning\n",
"df_conditional['Sales_Quintile'] = pd.qcut(df_conditional['Sales'], \n",
" q=5, \n",
" labels=['Bottom 20%', 'Low 20%', 'Mid 20%', 'High 20%', 'Top 20%'])\n",
"\n",
"print(\"\\nUsing pd.qcut() for quantile binning:\")\n",
"print(df_conditional['Sales_Quintile'].value_counts())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Method 3: Using pandas.select() for multiple conditions\n",
"# Define conditions and choices\n",
"conditions = [\n",
" (df_conditional['Sales'] >= 1200) & (df_conditional['Quantity'] >= 5),\n",
" (df_conditional['Sales'] >= 1000) & (df_conditional['Customer_Type'] == 'VIP'),\n",
" (df_conditional['Sales'] >= 800) & (df_conditional['Region'] == 'North'),\n",
" df_conditional['Customer_Type'] == 'New'\n",
"]\n",
"\n",
"choices = ['Premium Deal', 'VIP Sale', 'North Preferred', 'New Customer']\n",
"default = 'Standard'\n",
"\n",
"df_conditional['Deal_Type'] = np.select(conditions, choices, default=default)\n",
"\n",
"print(\"Using np.select() for complex conditions:\")\n",
"print(df_conditional[['Sales', 'Quantity', 'Customer_Type', 'Region', 'Deal_Type']].head(10))\n",
"print(\"\\nDeal type distribution:\")\n",
"print(df_conditional['Deal_Type'].value_counts())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Using Apply and Lambda Functions\n",
"\n",
"Create complex calculated columns using custom functions."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Simple lambda functions\n",
"df_apply = df_sales.copy()\n",
"\n",
"# Single column transformations\n",
"df_apply['Sales_Log'] = df_apply['Sales'].apply(lambda x: np.log(x))\n",
"df_apply['Product_Length'] = df_apply['Product'].apply(lambda x: len(x))\n",
"df_apply['Days_Since_Start'] = df_apply['Date'].apply(lambda x: (x - df_apply['Date'].min()).days)\n",
"\n",
"print(\"Simple lambda transformations:\")\n",
"print(df_apply[['Sales', 'Sales_Log', 'Product', 'Product_Length', 'Days_Since_Start']].head())\n",
"\n",
"# Multiple column operations using lambda\n",
"df_apply['Efficiency_Score'] = df_apply.apply(\n",
" lambda row: (row['Sales'] * row['Quantity']) / (row['Days_Since_Start'] + 1), \n",
" axis=1\n",
")\n",
"\n",
"print(\"\\nMultiple column lambda:\")\n",
"print(df_apply[['Sales', 'Quantity', 'Days_Since_Start', 'Efficiency_Score']].head())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Custom functions for complex business logic\n",
"def calculate_commission(row):\n",
" \"\"\"Calculate commission based on complex business rules\"\"\"\n",
" base_rate = 0.05\n",
" \n",
" # VIP customers get higher commission\n",
" if row['Customer_Type'] == 'VIP':\n",
" base_rate += 0.02\n",
" \n",
" # High quantity orders get bonus\n",
" if row['Quantity'] >= 5:\n",
" base_rate += 0.01\n",
" \n",
" # Regional multipliers\n",
" region_multipliers = {'North': 1.2, 'South': 1.0, 'East': 1.1, 'West': 0.9}\n",
" multiplier = region_multipliers.get(row['Region'], 1.0)\n",
" \n",
" return row['Sales'] * base_rate * multiplier\n",
"\n",
"def performance_rating(row):\n",
" \"\"\"Calculate performance rating based on multiple factors\"\"\"\n",
" score = 0\n",
" \n",
" # Sales performance (40% weight)\n",
" if row['Sales'] > 1200:\n",
" score += 40\n",
" elif row['Sales'] > 800:\n",
" score += 30\n",
" else:\n",
" score += 20\n",
" \n",
" # Quantity performance (30% weight)\n",
" if row['Quantity'] >= 6:\n",
" score += 30\n",
" elif row['Quantity'] >= 4:\n",
" score += 20\n",
" else:\n",
" score += 10\n",
" \n",
" # Customer type bonus (30% weight)\n",
" customer_bonus = {'VIP': 30, 'Returning': 20, 'New': 15}\n",
" score += customer_bonus.get(row['Customer_Type'], 0)\n",
" \n",
" # Convert to letter grade\n",
" if score >= 85:\n",
" return 'A'\n",
" elif score >= 70:\n",
" return 'B'\n",
" elif score >= 55:\n",
" return 'C'\n",
" else:\n",
" return 'D'\n",
"\n",
"# Apply custom functions\n",
"df_apply['Commission'] = df_apply.apply(calculate_commission, axis=1)\n",
"df_apply['Performance_Rating'] = df_apply.apply(performance_rating, axis=1)\n",
"\n",
"print(\"Custom function results:\")\n",
"print(df_apply[['Sales', 'Quantity', 'Customer_Type', 'Region', 'Commission', 'Performance_Rating']].head())\n",
"\n",
"print(\"\\nPerformance rating distribution:\")\n",
"print(df_apply['Performance_Rating'].value_counts().sort_index())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Date and Time Derived Columns\n",
"\n",
"Extract useful information from datetime columns."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Extract date components\n",
"df_dates = df_sales.copy()\n",
"\n",
"# Basic date components\n",
"df_dates['Year'] = df_dates['Date'].dt.year\n",
"df_dates['Month'] = df_dates['Date'].dt.month\n",
"df_dates['Day'] = df_dates['Date'].dt.day\n",
"df_dates['DayOfWeek'] = df_dates['Date'].dt.dayofweek # 0=Monday, 6=Sunday\n",
"df_dates['DayName'] = df_dates['Date'].dt.day_name()\n",
"df_dates['MonthName'] = df_dates['Date'].dt.month_name()\n",
"\n",
"print(\"Basic date components:\")\n",
"print(df_dates[['Date', 'Year', 'Month', 'Day', 'DayOfWeek', 'DayName', 'MonthName']].head())\n",
"\n",
"# Business-relevant date features\n",
"df_dates['Quarter'] = df_dates['Date'].dt.quarter\n",
"df_dates['Week'] = df_dates['Date'].dt.isocalendar().week\n",
"df_dates['DayOfYear'] = df_dates['Date'].dt.dayofyear\n",
"df_dates['IsWeekend'] = df_dates['Date'].dt.dayofweek >= 5\n",
"df_dates['IsMonthStart'] = df_dates['Date'].dt.is_month_start\n",
"df_dates['IsMonthEnd'] = df_dates['Date'].dt.is_month_end\n",
"\n",
"print(\"\\nBusiness date features:\")\n",
"print(df_dates[['Date', 'Quarter', 'Week', 'IsWeekend', 'IsMonthStart', 'IsMonthEnd']].head(10))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Time-based calculations\n",
"start_date = df_dates['Date'].min()\n",
"df_dates['Days_Since_Start'] = (df_dates['Date'] - start_date).dt.days\n",
"df_dates['Weeks_Since_Start'] = df_dates['Days_Since_Start'] // 7\n",
"\n",
"# Create season column\n",
"def get_season(month):\n",
" if month in [12, 1, 2]:\n",
" return 'Winter'\n",
" elif month in [3, 4, 5]:\n",
" return 'Spring'\n",
" elif month in [6, 7, 8]:\n",
" return 'Summer'\n",
" else:\n",
" return 'Fall'\n",
"\n",
"df_dates['Season'] = df_dates['Month'].apply(get_season)\n",
"\n",
"# Business day calculations\n",
"df_dates['IsBusinessDay'] = df_dates['Date'].dt.dayofweek < 5\n",
"df_dates['BusinessDaysSinceStart'] = df_dates.apply(\n",
" lambda row: np.busday_count(start_date.date(), row['Date'].date()), axis=1\n",
")\n",
"\n",
"print(\"Time-based calculations:\")\n",
"print(df_dates[['Date', 'Days_Since_Start', 'Weeks_Since_Start', 'Season', \n",
" 'IsBusinessDay', 'BusinessDaysSinceStart']].head())\n",
"\n",
"print(\"\\nSeason distribution:\")\n",
"print(df_dates['Season'].value_counts())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. Text and String Manipulations\n",
"\n",
"Create columns based on string operations and text processing."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# String manipulations\n",
"df_text = df_sales.copy()\n",
"\n",
"# Basic string operations\n",
"df_text['Product_Upper'] = df_text['Product'].str.upper()\n",
"df_text['Product_Lower'] = df_text['Product'].str.lower()\n",
"df_text['Product_Length'] = df_text['Product'].str.len()\n",
"df_text['Product_First_Char'] = df_text['Product'].str[0]\n",
"df_text['Product_Last_Three'] = df_text['Product'].str[-3:]\n",
"\n",
"print(\"Basic string operations:\")\n",
"print(df_text[['Product', 'Product_Upper', 'Product_Lower', 'Product_Length', \n",
" 'Product_First_Char', 'Product_Last_Three']].head())\n",
"\n",
"# Text categorization\n",
"df_text['Product_Category'] = df_text['Product'].apply(lambda x: \n",
" 'Computer' if x in ['Laptop', 'Monitor'] else\n",
" 'Mobile' if x in ['Phone', 'Tablet'] else\n",
" 'Other'\n",
")\n",
"\n",
"# Check for patterns\n",
"df_text['Has_Letter_A'] = df_text['Product'].str.contains('a', case=False)\n",
"df_text['Starts_With_L'] = df_text['Product'].str.startswith('L')\n",
"df_text['Ends_With_E'] = df_text['Product'].str.endswith('e')\n",
"\n",
"print(\"\\nText patterns and categorization:\")\n",
"print(df_text[['Product', 'Product_Category', 'Has_Letter_A', 'Starts_With_L', 'Ends_With_E']].head(10))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Create formatted text columns\n",
"df_text['Sales_Formatted'] = df_text['Sales'].apply(lambda x: f\"${x:,.2f}\")\n",
"df_text['Transaction_ID'] = df_text.apply(\n",
" lambda row: f\"{row['Region'][:1]}{row['Product'][:3].upper()}{row.name:04d}\", axis=1\n",
")\n",
"\n",
"# Create summary descriptions\n",
"df_text['Transaction_Summary'] = df_text.apply(\n",
" lambda row: f\"{row['Salesperson']} sold {row['Quantity']} {row['Product']}(s) \"\n",
" f\"for {row['Sales_Formatted']} in {row['Region']} region\", \n",
" axis=1\n",
")\n",
"\n",
"print(\"Formatted text columns:\")\n",
"print(df_text[['Sales_Formatted', 'Transaction_ID']].head())\n",
"print(\"\\nTransaction summaries:\")\n",
"for i, summary in enumerate(df_text['Transaction_Summary'].head(3)):\n",
" print(f\"{i+1}. {summary}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6. Working with Categorical Data\n",
"\n",
"Optimize memory usage and enable category-specific operations."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Convert to categorical data types\n",
"df_categorical = df_sales.copy()\n",
"\n",
"# Check memory usage before\n",
"print(\"Memory usage before categorical conversion:\")\n",
"print(df_categorical.memory_usage(deep=True))\n",
"\n",
"# Convert string columns to categorical\n",
"categorical_columns = ['Product', 'Region', 'Salesperson', 'Customer_Type']\n",
"for col in categorical_columns:\n",
" df_categorical[col] = df_categorical[col].astype('category')\n",
"\n",
"print(\"\\nMemory usage after categorical conversion:\")\n",
"print(df_categorical.memory_usage(deep=True))\n",
"\n",
"print(\"\\nData types after conversion:\")\n",
"print(df_categorical.dtypes)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Working with ordered categories\n",
"# Create ordered categorical for sales performance\n",
"performance_categories = ['Poor', 'Fair', 'Good', 'Excellent']\n",
"df_categorical['Performance_Level'] = pd.cut(\n",
" df_categorical['Sales'],\n",
" bins=[0, 700, 900, 1200, float('inf')],\n",
" labels=performance_categories,\n",
" ordered=True\n",
")\n",
"\n",
"print(\"Ordered categorical data:\")\n",
"print(df_categorical['Performance_Level'].head(10))\n",
"print(\"\\nCategory info:\")\n",
"print(df_categorical['Performance_Level'].cat.categories)\n",
"print(f\"Is ordered: {df_categorical['Performance_Level'].cat.ordered}\")\n",
"\n",
"# Categorical operations\n",
"print(\"\\nPerformance level distribution:\")\n",
"print(df_categorical['Performance_Level'].value_counts().sort_index())\n",
"\n",
"# Add new category\n",
"df_categorical['Performance_Level'] = df_categorical['Performance_Level'].cat.add_categories(['Outstanding'])\n",
"print(f\"\\nCategories after adding 'Outstanding': {df_categorical['Performance_Level'].cat.categories}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 7. Mathematical and Statistical Transformations\n",
"\n",
"Create columns using mathematical functions and statistical transformations."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Mathematical transformations\n",
"df_math = df_sales.copy()\n",
"\n",
"# Common mathematical transformations\n",
"df_math['Sales_Log'] = np.log(df_math['Sales'])\n",
"df_math['Sales_Sqrt'] = np.sqrt(df_math['Sales'])\n",
"df_math['Sales_Squared'] = df_math['Sales'] ** 2\n",
"df_math['Sales_Reciprocal'] = 1 / df_math['Sales']\n",
"\n",
"print(\"Mathematical transformations:\")\n",
"print(df_math[['Sales', 'Sales_Log', 'Sales_Sqrt', 'Sales_Squared', 'Sales_Reciprocal']].head())\n",
"\n",
"# Statistical standardization\n",
"df_math['Sales_Z_Score'] = (df_math['Sales'] - df_math['Sales'].mean()) / df_math['Sales'].std()\n",
"df_math['Sales_Min_Max_Scaled'] = (df_math['Sales'] - df_math['Sales'].min()) / (df_math['Sales'].max() - df_math['Sales'].min())\n",
"\n",
"# Rolling statistics\n",
"df_math = df_math.sort_values('Date')\n",
"df_math['Sales_Rolling_7_Mean'] = df_math['Sales'].rolling(window=7, min_periods=1).mean()\n",
"df_math['Sales_Rolling_7_Std'] = df_math['Sales'].rolling(window=7, min_periods=1).std()\n",
"df_math['Sales_Cumulative_Sum'] = df_math['Sales'].cumsum()\n",
"\n",
"print(\"\\nStatistical transformations:\")\n",
"print(df_math[['Sales', 'Sales_Z_Score', 'Sales_Min_Max_Scaled', \n",
" 'Sales_Rolling_7_Mean', 'Sales_Cumulative_Sum']].head(10))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Rank and percentile columns\n",
"df_math['Sales_Rank'] = df_math['Sales'].rank(ascending=False)\n",
"df_math['Sales_Percentile'] = df_math['Sales'].rank(pct=True) * 100\n",
"df_math['Sales_Rank_by_Region'] = df_math.groupby('Region')['Sales'].rank(ascending=False)\n",
"\n",
"# Binning and discretization\n",
"df_math['Sales_Decile'] = pd.qcut(df_math['Sales'], q=10, labels=range(1, 11))\n",
"df_math['Sales_Tertile'] = pd.qcut(df_math['Sales'], q=3, labels=['Low', 'Medium', 'High'])\n",
"\n",
"print(\"Ranking and binning:\")\n",
"print(df_math[['Sales', 'Sales_Rank', 'Sales_Percentile', 'Sales_Rank_by_Region', \n",
" 'Sales_Decile', 'Sales_Tertile']].head(10))\n",
"\n",
"print(\"\\nDecile distribution:\")\n",
"print(df_math['Sales_Decile'].value_counts().sort_index())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Practice Exercises\n",
"\n",
"Apply your column creation and modification skills:"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"# Exercise 1: Customer Segmentation\n",
"# Create a comprehensive customer segmentation system:\n",
"# - Combine purchase behavior, frequency, and value\n",
"# - Create RFM-like scores (Recency, Frequency, Monetary)\n",
"# - Assign customer segments (e.g., Champion, Loyal, At Risk, etc.)\n",
"\n",
"def create_customer_segmentation(df):\n",
" \"\"\"Create customer segmentation based on purchase patterns\"\"\"\n",
" # Your implementation here\n",
" pass\n",
"\n",
"# segmented_df = create_customer_segmentation(df_sales)\n",
"# print(segmented_df[['Customer_Type', 'Sales', 'Frequency_Score', 'Monetary_Score', 'Segment']].head())"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
"# Exercise 2: Performance Metrics Dashboard\n",
"# Create a comprehensive set of KPI columns:\n",
"# - Sales efficiency metrics\n",
"# - Trend indicators (growth rates, momentum)\n",
"# - Comparative metrics (vs. average, vs. target)\n",
"# - Alert flags for unusual patterns\n",
"\n",
"# Your code here:\n"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
"# Exercise 3: Feature Engineering for ML\n",
"# Create features that could be useful for machine learning:\n",
"# - Interaction features (product of two variables)\n",
"# - Polynomial features\n",
"# - Time-based features (seasonality, trends)\n",
"# - Lag features (previous period values)\n",
"\n",
"# Your code here:\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Key Takeaways\n",
"\n",
"1. **Column Assignment**: Use direct assignment (`df['col'] = value`) for simple cases\n",
"2. **Assign Method**: Use `.assign()` for functional programming style and method chaining\n",
"3. **Conditional Logic**: Combine `np.where()`, `pd.cut()`, `pd.qcut()`, and `np.select()` for complex conditions\n",
"4. **Apply Functions**: Use `.apply()` with lambda or custom functions for complex transformations\n",
"5. **Date Features**: Extract meaningful components from datetime columns\n",
"6. **String Operations**: Leverage `.str` accessor for text manipulations\n",
"7. **Categorical Data**: Convert to categories for memory efficiency and special operations\n",
"8. **Mathematical Transformations**: Apply statistical and mathematical functions for data preprocessing\n",
"\n",
"## Performance Tips\n",
"\n",
"1. **Vectorized Operations**: Prefer pandas/numpy operations over loops\n",
"2. **Categorical Types**: Use categorical data for repeated string values\n",
"3. **Memory Management**: Monitor memory usage when creating many new columns\n",
"4. **Method Chaining**: Use `.assign()` for readable method chains\n",
"5. **Avoid apply() When Possible**: Use vectorized operations instead of `.apply()` for better performance\n",
"\n",
"## Common Patterns\n",
"\n",
"```python\n",
"# Simple calculation\n",
"df['new_col'] = df['col1'] * df['col2']\n",
"\n",
"# Conditional column\n",
"df['category'] = np.where(df['value'] > threshold, 'High', 'Low')\n",
"\n",
"# Apply custom function\n",
"df['result'] = df.apply(custom_function, axis=1)\n",
"\n",
"# Date features\n",
"df['month'] = df['date'].dt.month\n",
"\n",
"# String operations\n",
"df['upper'] = df['text'].str.upper()\n",
"```"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.13.3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}


@ -0,0 +1,916 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Session 1 - DataFrames - Lesson 6: Handling Missing Data\n",
"\n",
"## Learning Objectives\n",
"- Understand different types of missing data and their implications\n",
"- Master techniques for detecting and analyzing missing values\n",
"- Learn various strategies for handling missing data\n",
"- Practice imputation methods and their trade-offs\n",
"- Develop best practices for missing data management\n",
"\n",
"## Prerequisites\n",
"- Completed Lessons 1-5\n",
"- Understanding of basic statistical concepts\n",
"- Familiarity with data quality principles"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Import required libraries\n",
"import pandas as pd\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"from datetime import datetime, timedelta\n",
"import warnings\n",
"warnings.filterwarnings('ignore')\n",
"\n",
"# Set display options\n",
"pd.set_option('display.max_columns', None)\n",
"plt.style.use('seaborn-v0_8')\n",
"\n",
"print(\"Libraries loaded successfully!\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Creating Dataset with Missing Values\n",
"\n",
"Let's create a realistic dataset with different patterns of missing data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Create comprehensive dataset with various missing data patterns\n",
"np.random.seed(42)\n",
"n_records = 500\n",
"\n",
"# Base data\n",
"data = {\n",
" 'customer_id': range(1, n_records + 1),\n",
" 'age': np.random.normal(35, 12, n_records).astype(int),\n",
" 'income': np.random.normal(50000, 15000, n_records),\n",
" 'education_years': np.random.normal(14, 3, n_records),\n",
" 'purchase_amount': np.random.normal(200, 50, n_records),\n",
" 'satisfaction_score': np.random.randint(1, 6, n_records),\n",
" 'region': np.random.choice(['North', 'South', 'East', 'West'], n_records),\n",
" 'product_category': np.random.choice(['Electronics', 'Clothing', 'Books', 'Home'], n_records),\n",
" 'signup_date': pd.date_range('2023-01-01', periods=n_records, freq='D'),\n",
" 'last_purchase_date': pd.date_range('2023-01-01', periods=n_records, freq='D') + pd.Timedelta(days=30)\n",
"}\n",
"\n",
"df_complete = pd.DataFrame(data)\n",
"\n",
"# Ensure positive values where appropriate\n",
"df_complete['age'] = np.abs(df_complete['age'])\n",
"df_complete['income'] = np.abs(df_complete['income'])\n",
"df_complete['education_years'] = np.clip(df_complete['education_years'], 6, 20)\n",
"df_complete['purchase_amount'] = np.abs(df_complete['purchase_amount'])\n",
"\n",
"print(\"Complete dataset created:\")\n",
"print(f\"Shape: {df_complete.shape}\")\n",
"print(\"\\nFirst few rows:\")\n",
"print(df_complete.head())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Introduce different patterns of missing data\n",
"df_missing = df_complete.copy()\n",
"\n",
"# 1. Missing Completely at Random (MCAR) - income data\n",
"# Randomly missing 15% of income values\n",
"mcar_indices = np.random.choice(df_missing.index, size=int(0.15 * len(df_missing)), replace=False)\n",
"df_missing.loc[mcar_indices, 'income'] = np.nan\n",
"\n",
"# 2. Missing at Random (MAR) - education years missing based on age\n",
"# Older people less likely to report education\n",
"older_customers = df_missing['age'] > 60\n",
"older_indices = df_missing[older_customers].index\n",
"education_missing = np.random.choice(older_indices, size=int(0.4 * len(older_indices)), replace=False)\n",
"df_missing.loc[education_missing, 'education_years'] = np.nan\n",
"\n",
"# 3. Missing Not at Random (MNAR) - satisfaction scores\n",
"# Unsatisfied customers less likely to provide ratings\n",
"low_satisfaction = df_missing['satisfaction_score'] <= 2\n",
"low_sat_indices = df_missing[low_satisfaction].index\n",
"satisfaction_missing = np.random.choice(low_sat_indices, size=int(0.6 * len(low_sat_indices)), replace=False)\n",
"df_missing.loc[satisfaction_missing, 'satisfaction_score'] = np.nan\n",
"\n",
"# 4. Systematic missing - last purchase date for new customers\n",
"# New customers (signed up recently) haven't made purchases yet\n",
"recent_signups = df_missing['signup_date'] > '2023-11-01'\n",
"df_missing.loc[recent_signups, 'last_purchase_date'] = pd.NaT\n",
"\n",
"# 5. Random missing in other columns\n",
"# Purchase amount - 10% missing\n",
"purchase_missing = np.random.choice(df_missing.index, size=int(0.10 * len(df_missing)), replace=False)\n",
"df_missing.loc[purchase_missing, 'purchase_amount'] = np.nan\n",
"\n",
"print(\"Missing data patterns introduced:\")\n",
"print(f\"Dataset shape: {df_missing.shape}\")\n",
"print(\"\\nMissing value counts:\")\n",
"missing_summary = df_missing.isnull().sum()\n",
"missing_summary = missing_summary[missing_summary > 0]\n",
"print(missing_summary)\n",
"\n",
"print(\"\\nMissing value percentages:\")\n",
"missing_pct = (df_missing.isnull().sum() / len(df_missing) * 100).round(2)\n",
"missing_pct = missing_pct[missing_pct > 0]\n",
"print(missing_pct)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Detecting and Analyzing Missing Data\n",
"\n",
"Comprehensive techniques for understanding missing data patterns."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def analyze_missing_data(df):\n",
" \"\"\"Comprehensive missing data analysis\"\"\"\n",
" print(\"=== MISSING DATA ANALYSIS ===\")\n",
" \n",
" # Basic missing data statistics\n",
" total_cells = df.size\n",
" total_missing = df.isnull().sum().sum()\n",
" print(f\"Total cells: {total_cells:,}\")\n",
" print(f\"Missing cells: {total_missing:,} ({total_missing/total_cells*100:.2f}%)\")\n",
" \n",
" # Missing data by column\n",
" missing_by_column = pd.DataFrame({\n",
" 'Missing_Count': df.isnull().sum(),\n",
" 'Missing_Percentage': (df.isnull().sum() / len(df)) * 100,\n",
" 'Data_Type': df.dtypes\n",
" })\n",
" missing_by_column = missing_by_column[missing_by_column['Missing_Count'] > 0]\n",
" missing_by_column = missing_by_column.sort_values('Missing_Percentage', ascending=False)\n",
" \n",
" print(\"\\n--- Missing Data by Column ---\")\n",
" print(missing_by_column.round(2))\n",
" \n",
" # Missing data patterns\n",
" print(\"\\n--- Missing Data Patterns ---\")\n",
" missing_patterns = df.isnull().value_counts().head(10)\n",
" print(\"Top 10 missing patterns (True = Missing):\")\n",
" for pattern, count in missing_patterns.items():\n",
" percentage = (count / len(df)) * 100\n",
" print(f\"{count:4d} rows ({percentage:5.1f}%): {dict(zip(df.columns, pattern))}\")\n",
" \n",
" return missing_by_column\n",
"\n",
"# Analyze missing data\n",
"missing_analysis = analyze_missing_data(df_missing)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Visualize missing data patterns\n",
"def visualize_missing_data(df):\n",
" \"\"\"Create visualizations for missing data patterns\"\"\"\n",
" fig, axes = plt.subplots(2, 2, figsize=(15, 10))\n",
" \n",
" # 1. Missing data heatmap\n",
" missing_mask = df.isnull()\n",
" sns.heatmap(missing_mask.iloc[:100], \n",
" yticklabels=False, \n",
" cbar=True, \n",
" cmap='viridis',\n",
" ax=axes[0, 0])\n",
" axes[0, 0].set_title('Missing Data Heatmap (First 100 rows)')\n",
" \n",
" # 2. Missing data by column\n",
" missing_counts = df.isnull().sum()\n",
" missing_counts = missing_counts[missing_counts > 0]\n",
" missing_counts.plot(kind='bar', ax=axes[0, 1], color='skyblue')\n",
" axes[0, 1].set_title('Missing Values by Column')\n",
" axes[0, 1].set_ylabel('Count')\n",
" axes[0, 1].tick_params(axis='x', rotation=45)\n",
" \n",
" # 3. Missing data correlation\n",
" missing_corr = df.isnull().corr()\n",
" sns.heatmap(missing_corr, annot=True, cmap='coolwarm', center=0, ax=axes[1, 0])\n",
" axes[1, 0].set_title('Missing Data Correlation')\n",
" \n",
" # 4. Missing data by row\n",
" missing_per_row = df.isnull().sum(axis=1)\n",
" missing_per_row.hist(bins=range(len(df.columns) + 2), ax=axes[1, 1], alpha=0.7, color='orange')\n",
" axes[1, 1].set_title('Distribution of Missing Values per Row')\n",
" axes[1, 1].set_xlabel('Number of Missing Values')\n",
" axes[1, 1].set_ylabel('Number of Rows')\n",
" \n",
" plt.tight_layout()\n",
" plt.show()\n",
"\n",
"# Visualize missing patterns\n",
"visualize_missing_data(df_missing)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Analyze missing data relationships\n",
"def analyze_missing_relationships(df):\n",
" \"\"\"Analyze relationships between missing data and other variables\"\"\"\n",
" print(\"=== MISSING DATA RELATIONSHIPS ===\")\n",
" \n",
" # Example: Relationship between age and missing education\n",
" if 'age' in df.columns and 'education_years' in df.columns:\n",
" print(\"\\n--- Age vs Missing Education ---\")\n",
" education_missing = df['education_years'].isnull()\n",
" age_stats = df.groupby(education_missing)['age'].agg(['mean', 'median', 'std']).round(2)\n",
" age_stats.index = ['Education Present', 'Education Missing']\n",
" print(age_stats)\n",
" \n",
" # Example: Missing satisfaction by purchase amount\n",
" if 'satisfaction_score' in df.columns and 'purchase_amount' in df.columns:\n",
" print(\"\\n--- Purchase Amount vs Missing Satisfaction ---\")\n",
" satisfaction_missing = df['satisfaction_score'].isnull()\n",
" purchase_stats = df.groupby(satisfaction_missing)['purchase_amount'].agg(['mean', 'median', 'count']).round(2)\n",
" purchase_stats.index = ['Satisfaction Present', 'Satisfaction Missing']\n",
" print(purchase_stats)\n",
" \n",
" # Missing data by categorical variables\n",
" if 'region' in df.columns:\n",
" print(\"\\n--- Missing Data by Region ---\")\n",
" region_missing = df.groupby('region').apply(lambda x: x.isnull().sum())\n",
" print(region_missing[region_missing.sum(axis=1) > 0])\n",
"\n",
"# Analyze relationships\n",
"analyze_missing_relationships(df_missing)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Basic Missing Data Handling\n",
"\n",
"Fundamental techniques for dealing with missing values."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Method 1: Dropping missing values\n",
"print(\"=== DROPPING MISSING VALUES ===\")\n",
"\n",
"# Drop rows with any missing values\n",
"df_drop_any = df_missing.dropna()\n",
"print(f\"Original shape: {df_missing.shape}\")\n",
"print(f\"After dropping any missing: {df_drop_any.shape}\")\n",
"print(f\"Rows removed: {len(df_missing) - len(df_drop_any)} ({(len(df_missing) - len(df_drop_any))/len(df_missing)*100:.1f}%)\")\n",
"\n",
"# Drop rows with missing values in specific columns\n",
"critical_columns = ['customer_id', 'age', 'region']\n",
"df_drop_critical = df_missing.dropna(subset=critical_columns)\n",
"print(f\"\\nAfter dropping rows missing critical columns: {df_drop_critical.shape}\")\n",
"\n",
"# Drop rows with more than X missing values\n",
"df_drop_thresh = df_missing.dropna(thresh=len(df_missing.columns) - 2) # Allow max 2 missing\n",
"print(f\"After dropping rows with >2 missing values: {df_drop_thresh.shape}\")\n",
"\n",
"# Drop columns with too many missing values\n",
"missing_threshold = 0.5 # 50%\n",
"cols_to_keep = df_missing.columns[df_missing.isnull().mean() < missing_threshold]\n",
"df_drop_cols = df_missing[cols_to_keep]\n",
"print(f\"\\nAfter dropping columns with >{missing_threshold*100}% missing: {df_drop_cols.shape}\")\n",
"print(f\"Columns dropped: {set(df_missing.columns) - set(cols_to_keep)}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Method 2: Basic imputation with fillna()\n",
"print(\"=== BASIC IMPUTATION ===\")\n",
"\n",
"df_basic_impute = df_missing.copy()\n",
"\n",
"# Fill with specific values\n",
"df_basic_impute['satisfaction_score'] = df_basic_impute['satisfaction_score'].fillna(3) # Neutral score\n",
"print(\"Filled satisfaction_score with 3 (neutral)\")\n",
"\n",
"# Fill with statistical measures\n",
"df_basic_impute['income'] = df_basic_impute['income'].fillna(df_basic_impute['income'].median())\n",
"df_basic_impute['education_years'] = df_basic_impute['education_years'].fillna(df_basic_impute['education_years'].mean())\n",
"df_basic_impute['purchase_amount'] = df_basic_impute['purchase_amount'].fillna(df_basic_impute['purchase_amount'].mean())\n",
"print(\"Filled numerical columns with mean/median\")\n",
"\n",
"# Forward fill and backward fill for dates\n",
"df_basic_impute['last_purchase_date'] = df_basic_impute['last_purchase_date'].fillna(method='bfill')\n",
"print(\"Filled dates with backward fill\")\n",
"\n",
"print(f\"\\nMissing values after basic imputation:\")\n",
"print(df_basic_impute.isnull().sum().sum())\n",
"\n",
"# Show before/after comparison\n",
"print(\"\\nComparison (first 10 rows):\")\n",
"comparison_cols = ['income', 'education_years', 'purchase_amount', 'satisfaction_score']\n",
"for col in comparison_cols:\n",
" before_missing = df_missing[col].isnull().sum()\n",
" after_missing = df_basic_impute[col].isnull().sum()\n",
" print(f\"{col}: {before_missing} → {after_missing} missing values\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Advanced Imputation Techniques\n",
"\n",
"Sophisticated methods for handling missing data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Group-based imputation\n",
"def group_based_imputation(df):\n",
" \"\"\"Impute missing values based on group statistics\"\"\"\n",
" df_group_impute = df.copy()\n",
" \n",
" print(\"=== GROUP-BASED IMPUTATION ===\")\n",
" \n",
" # Impute income based on region and education level\n",
" # First, create education level categories\n",
" df_group_impute['education_level'] = pd.cut(\n",
" df_group_impute['education_years'].fillna(df_group_impute['education_years'].median()),\n",
" bins=[0, 12, 16, 20],\n",
" labels=['High School', 'Bachelor', 'Advanced']\n",
" )\n",
" \n",
" # Calculate group-based statistics\n",
" income_by_group = df_group_impute.groupby(['region', 'education_level'])['income'].median()\n",
" \n",
" # Fill missing income values\n",
" def fill_income(row):\n",
" if pd.isna(row['income']):\n",
" try:\n",
" return income_by_group.loc[(row['region'], row['education_level'])]\n",
" except KeyError:\n",
" return df_group_impute['income'].median()\n",
" return row['income']\n",
" \n",
" df_group_impute['income'] = df_group_impute.apply(fill_income, axis=1)\n",
" \n",
" print(\"Income imputed based on region and education level\")\n",
" print(\"Group-based median income:\")\n",
" print(income_by_group.round(0))\n",
" \n",
" return df_group_impute\n",
"\n",
"# Apply group-based imputation\n",
"df_group_imputed = group_based_imputation(df_missing)"
]
},
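{
"cell_type": "markdown",
"metadata": {},
"source": [
"Beyond group statistics, two widely used model-based approaches are KNN imputation and MICE-style iterative imputation. The cell below is a minimal sketch using scikit-learn's `KNNImputer` and `IterativeImputer` (scikit-learn is assumed to be available; it is not imported elsewhere in this lesson, and the frame names `df_knn_imputed` / `df_iterative_imputed` are illustrative). Only the numeric columns are imputed, since both estimators expect numeric input; the resulting frames can also be passed to `compare_imputation_methods` below."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: model-based imputation with scikit-learn (assumed installed)\n",
"from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (needed to expose IterativeImputer)\n",
"from sklearn.impute import KNNImputer, IterativeImputer\n",
"\n",
"# Both imputers expect purely numeric input, so restrict to the numeric columns\n",
"numeric_cols = ['age', 'income', 'education_years', 'purchase_amount', 'satisfaction_score']\n",
"\n",
"# KNN imputation: fill each missing value from the 5 most similar rows\n",
"df_knn_imputed = df_missing.copy()\n",
"df_knn_imputed[numeric_cols] = KNNImputer(n_neighbors=5).fit_transform(df_missing[numeric_cols])\n",
"\n",
"# Iterative (MICE-style) imputation: model each column from the others, repeating up to max_iter rounds\n",
"df_iterative_imputed = df_missing.copy()\n",
"df_iterative_imputed[numeric_cols] = IterativeImputer(random_state=42, max_iter=10).fit_transform(df_missing[numeric_cols])\n",
"\n",
"print(\"Missing numeric values after KNN imputation:\", df_knn_imputed[numeric_cols].isnull().sum().sum())\n",
"print(\"Missing numeric values after iterative imputation:\", df_iterative_imputed[numeric_cols].isnull().sum().sum())"
]
},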
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Comparison of Imputation Methods\n",
"\n",
"Compare different imputation approaches and their impact."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def compare_imputation_methods(original_complete, original_missing, *imputed_dfs, methods_names):\n",
" \"\"\"Compare different imputation methods\"\"\"\n",
" print(\"=== IMPUTATION METHODS COMPARISON ===\")\n",
" \n",
" # Focus on a specific column for comparison\n",
" column = 'income'\n",
" \n",
" if column not in original_complete.columns:\n",
" print(f\"Column {column} not found\")\n",
" return\n",
" \n",
" # Get original values that were made missing\n",
" missing_mask = original_missing[column].isnull()\n",
" true_values = original_complete.loc[missing_mask, column]\n",
" \n",
" print(f\"Comparing imputation for '{column}' column\")\n",
" print(f\"Number of missing values: {len(true_values)}\")\n",
" \n",
" # Calculate errors for each method\n",
" results = {}\n",
" \n",
" for df_imputed, method_name in zip(imputed_dfs, methods_names):\n",
" if column in df_imputed.columns:\n",
" imputed_values = df_imputed.loc[missing_mask, column]\n",
" \n",
" # Calculate metrics\n",
" mae = np.mean(np.abs(true_values - imputed_values))\n",
" rmse = np.sqrt(np.mean((true_values - imputed_values) ** 2))\n",
" bias = np.mean(imputed_values - true_values)\n",
" \n",
" results[method_name] = {\n",
" 'MAE': mae,\n",
" 'RMSE': rmse,\n",
" 'Bias': bias,\n",
" 'Mean_Imputed': np.mean(imputed_values),\n",
" 'Std_Imputed': np.std(imputed_values)\n",
" }\n",
" \n",
" # True statistics\n",
" print(f\"\\nTrue statistics for missing values:\")\n",
" print(f\"Mean: {np.mean(true_values):.2f}\")\n",
" print(f\"Std: {np.std(true_values):.2f}\")\n",
" \n",
" # Results comparison\n",
" results_df = pd.DataFrame(results).T\n",
" print(f\"\\nImputation comparison results:\")\n",
" print(results_df.round(2))\n",
" \n",
" # Visualize comparison\n",
" fig, axes = plt.subplots(2, 2, figsize=(15, 10))\n",
" \n",
" # Distribution comparison\n",
" axes[0, 0].hist(true_values, alpha=0.7, label='True Values', bins=20)\n",
" for df_imputed, method_name in zip(imputed_dfs, methods_names):\n",
" if column in df_imputed.columns:\n",
" imputed_values = df_imputed.loc[missing_mask, column]\n",
" axes[0, 0].hist(imputed_values, alpha=0.7, label=f'{method_name}', bins=20)\n",
" axes[0, 0].set_title('Distribution Comparison')\n",
" axes[0, 0].legend()\n",
" \n",
" # Error metrics\n",
" metrics = ['MAE', 'RMSE']\n",
" for i, metric in enumerate(metrics):\n",
" values = [results[method][metric] for method in results.keys()]\n",
" axes[0, 1].bar(range(len(values)), values, alpha=0.7)\n",
" axes[0, 1].set_xticks(range(len(results)))\n",
" axes[0, 1].set_xticklabels(list(results.keys()), rotation=45)\n",
" axes[0, 1].set_title(f'{metric} Comparison')\n",
" break # Show only MAE for now\n",
" \n",
" # Scatter plot: True vs Imputed\n",
" for i, (df_imputed, method_name) in enumerate(zip(imputed_dfs[:2], methods_names[:2])):\n",
" if column in df_imputed.columns:\n",
" imputed_values = df_imputed.loc[missing_mask, column]\n",
" ax = axes[1, i]\n",
" ax.scatter(true_values, imputed_values, alpha=0.6)\n",
" ax.plot([true_values.min(), true_values.max()], \n",
" [true_values.min(), true_values.max()], 'r--', label='Perfect Prediction')\n",
" ax.set_xlabel('True Values')\n",
" ax.set_ylabel('Imputed Values')\n",
" ax.set_title(f'{method_name}: True vs Imputed')\n",
" ax.legend()\n",
" \n",
" plt.tight_layout()\n",
" plt.show()\n",
" \n",
" return results_df\n",
"\n",
"# Compare methods\n",
"comparison_results = compare_imputation_methods(\n",
" df_complete, \n",
" df_missing,\n",
" df_basic_impute,\n",
" methods_names=['Basic Fill', 'KNN', 'Iterative']\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. Domain-Specific Imputation Strategies\n",
"\n",
"Business logic-driven approaches to missing data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def business_logic_imputation(df):\n",
" \"\"\"Apply business logic for missing value imputation\"\"\"\n",
" print(\"=== BUSINESS LOGIC IMPUTATION ===\")\n",
" \n",
" df_business = df.copy()\n",
" \n",
" # 1. Income imputation based on age and education\n",
" def estimate_income(row):\n",
" if pd.notna(row['income']):\n",
" return row['income']\n",
" \n",
" # Base income estimation\n",
" base_income = 30000\n",
" \n",
" # Age factor (experience premium)\n",
" if pd.notna(row['age']):\n",
" if row['age'] > 40:\n",
" base_income *= 1.5\n",
" elif row['age'] > 30:\n",
" base_income *= 1.2\n",
" \n",
" # Education factor\n",
" if pd.notna(row['education_years']):\n",
" if row['education_years'] > 16: # Graduate degree\n",
" base_income *= 1.8\n",
" elif row['education_years'] > 12: # Bachelor's\n",
" base_income *= 1.4\n",
" \n",
" # Regional adjustment\n",
" regional_multipliers = {\n",
" 'North': 1.2, # Higher cost of living\n",
" 'South': 0.9,\n",
" 'East': 1.1,\n",
" 'West': 1.0\n",
" }\n",
" base_income *= regional_multipliers.get(row['region'], 1.0)\n",
" \n",
" return base_income\n",
" \n",
" # Apply income estimation\n",
" df_business['income'] = df_business.apply(estimate_income, axis=1)\n",
" \n",
" # 2. Satisfaction score based on purchase behavior\n",
" def estimate_satisfaction(row):\n",
" if pd.notna(row['satisfaction_score']):\n",
" return row['satisfaction_score']\n",
" \n",
" # Base satisfaction\n",
" base_satisfaction = 3 # Neutral\n",
" \n",
" # Purchase amount influence\n",
" if pd.notna(row['purchase_amount']):\n",
" if row['purchase_amount'] > 250: # High value purchase\n",
" base_satisfaction = 4\n",
" elif row['purchase_amount'] < 100: # Low value might indicate dissatisfaction\n",
" base_satisfaction = 2\n",
" \n",
" return base_satisfaction\n",
" \n",
" # Apply satisfaction estimation\n",
" df_business['satisfaction_score'] = df_business.apply(estimate_satisfaction, axis=1)\n",
" \n",
" # 3. Education years based on income and age\n",
" def estimate_education(row):\n",
" if pd.notna(row['education_years']):\n",
" return row['education_years']\n",
" \n",
" # Base education\n",
" base_education = 12 # High school\n",
" \n",
" # Income-based estimation\n",
" if pd.notna(row['income']):\n",
" if row['income'] > 70000:\n",
" base_education = 18 # Graduate level\n",
" elif row['income'] > 45000:\n",
" base_education = 16 # Bachelor's\n",
" elif row['income'] > 35000:\n",
" base_education = 14 # Some college\n",
" \n",
" # Age adjustment (older people might have different education patterns)\n",
" if pd.notna(row['age']) and row['age'] > 55:\n",
" base_education = max(12, base_education - 2) # Lower average for older generation\n",
" \n",
" return base_education\n",
" \n",
" # Apply education estimation\n",
" df_business['education_years'] = df_business.apply(estimate_education, axis=1)\n",
" \n",
" print(\"Business logic imputation completed\")\n",
" print(f\"Missing values remaining: {df_business.isnull().sum().sum()}\")\n",
" \n",
" return df_business\n",
"\n",
"# Apply business logic imputation\n",
"df_business_imputed = business_logic_imputation(df_missing)\n",
"\n",
"print(\"\\nBusiness logic imputation summary:\")\n",
"for col in ['income', 'satisfaction_score', 'education_years']:\n",
" before = df_missing[col].isnull().sum()\n",
" after = df_business_imputed[col].isnull().sum()\n",
" print(f\"{col}: {before} → {after} missing values\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6. Missing Data Flags and Indicators\n",
"\n",
"Track which values were imputed for transparency and analysis."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def create_missing_indicators(df_original, df_imputed):\n",
" \"\"\"Create indicator variables for missing data\"\"\"\n",
" print(\"=== CREATING MISSING DATA INDICATORS ===\")\n",
" \n",
" df_with_indicators = df_imputed.copy()\n",
" \n",
" # Create indicator columns for each column that had missing data\n",
" columns_with_missing = df_original.columns[df_original.isnull().any()].tolist()\n",
" \n",
" for col in columns_with_missing:\n",
" indicator_col = f'{col}_was_missing'\n",
" df_with_indicators[indicator_col] = df_original[col].isnull().astype(int)\n",
" \n",
" print(f\"Created {len(columns_with_missing)} missing data indicators\")\n",
" print(f\"Indicator columns: {[f'{col}_was_missing' for col in columns_with_missing]}\")\n",
" \n",
" # Summary of missing patterns\n",
" indicator_cols = [f'{col}_was_missing' for col in columns_with_missing]\n",
" missing_patterns = df_with_indicators[indicator_cols].sum()\n",
" \n",
" print(\"\\nMissing data summary by column:\")\n",
" for col, count in missing_patterns.items():\n",
" original_col = col.replace('_was_missing', '')\n",
" percentage = (count / len(df_with_indicators)) * 100\n",
" print(f\"{original_col}: {count} values imputed ({percentage:.1f}%)\")\n",
" \n",
" # Create composite missing indicator\n",
" df_with_indicators['total_missing_count'] = df_with_indicators[indicator_cols].sum(axis=1)\n",
" df_with_indicators['has_any_missing'] = (df_with_indicators['total_missing_count'] > 0).astype(int)\n",
" \n",
" return df_with_indicators, indicator_cols\n",
"\n",
"# Create missing indicators\n",
"df_with_indicators, indicator_columns = create_missing_indicators(df_missing, df_business_imputed)\n",
"\n",
"print(\"\\nDataset with missing indicators:\")\n",
"sample_cols = ['income', 'income_was_missing', 'education_years', 'education_years_was_missing', \n",
" 'satisfaction_score', 'satisfaction_score_was_missing', 'total_missing_count']\n",
"available_cols = [col for col in sample_cols if col in df_with_indicators.columns]\n",
"print(df_with_indicators[available_cols].head(10))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 7. Validation and Quality Assessment\n",
"\n",
"Validate the quality of imputation results."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def validate_imputation_quality(df_original, df_missing, df_imputed):\n",
" \"\"\"Validate the quality of imputation\"\"\"\n",
" print(\"=== IMPUTATION QUALITY VALIDATION ===\")\n",
" \n",
" validation_results = {}\n",
" \n",
" # Check each column that had missing data\n",
" for col in df_missing.columns:\n",
" if df_missing[col].isnull().any() and col in df_imputed.columns:\n",
" print(f\"\\n--- Validating {col} ---\")\n",
" \n",
" # Get missing mask\n",
" missing_mask = df_missing[col].isnull()\n",
" \n",
" # Original statistics (complete data)\n",
" original_stats = df_original[col].describe()\n",
" \n",
" # Imputed statistics (only imputed values)\n",
" if missing_mask.any():\n",
" imputed_values = df_imputed.loc[missing_mask, col]\n",
" \n",
" if pd.api.types.is_numeric_dtype(df_original[col]):\n",
" imputed_stats = imputed_values.describe()\n",
" \n",
" # Statistical tests\n",
" mean_diff = abs(original_stats['mean'] - imputed_stats['mean'])\n",
" std_diff = abs(original_stats['std'] - imputed_stats['std'])\n",
" \n",
" validation_results[col] = {\n",
" 'original_mean': original_stats['mean'],\n",
" 'imputed_mean': imputed_stats['mean'],\n",
" 'mean_difference': mean_diff,\n",
" 'original_std': original_stats['std'],\n",
" 'imputed_std': imputed_stats['std'],\n",
" 'std_difference': std_diff,\n",
" 'values_imputed': len(imputed_values)\n",
" }\n",
" \n",
" print(f\"Original mean: {original_stats['mean']:.2f}, Imputed mean: {imputed_stats['mean']:.2f}\")\n",
" print(f\"Mean difference: {mean_diff:.2f} ({mean_diff/original_stats['mean']*100:.1f}%)\")\n",
" print(f\"Original std: {original_stats['std']:.2f}, Imputed std: {imputed_stats['std']:.2f}\")\n",
" \n",
" else:\n",
" # Categorical data\n",
" original_dist = df_original[col].value_counts(normalize=True)\n",
" imputed_dist = imputed_values.value_counts(normalize=True)\n",
" print(f\"Original distribution: {original_dist.to_dict()}\")\n",
" print(f\"Imputed distribution: {imputed_dist.to_dict()}\")\n",
" \n",
" # Overall validation summary\n",
" if validation_results:\n",
" validation_df = pd.DataFrame(validation_results).T\n",
" print(\"\\n=== VALIDATION SUMMARY ===\")\n",
" print(validation_df.round(3))\n",
" \n",
" # Flag potential issues\n",
" print(\"\\n--- Potential Issues ---\")\n",
" for col, stats in validation_results.items():\n",
" mean_change = abs(stats['mean_difference'] / stats['original_mean']) * 100\n",
" if mean_change > 10: # More than 10% change in mean\n",
" print(f\"⚠️ {col}: Large mean change ({mean_change:.1f}%)\")\n",
" \n",
" std_change = abs(stats['std_difference'] / stats['original_std']) * 100\n",
" if std_change > 20: # More than 20% change in std\n",
" print(f\"⚠️ {col}: Large variance change ({std_change:.1f}%)\")\n",
" \n",
" return validation_results\n",
"\n",
"# Validate imputation quality\n",
"validation_results = validate_imputation_quality(df_complete, df_missing, df_business_imputed)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Practice Exercises\n",
"\n",
"Apply missing data handling techniques to challenging scenarios:"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {},
"outputs": [],
"source": [
"# Exercise 1: Multi-step imputation strategy\n",
"# Create a sophisticated imputation pipeline that:\n",
"# 1. Handles different types of missing data appropriately\n",
"# 2. Uses multiple imputation methods in sequence\n",
"# 3. Validates results at each step\n",
"# 4. Creates comprehensive documentation\n",
"\n",
"def comprehensive_imputation_pipeline(df):\n",
" \"\"\"Comprehensive missing data handling pipeline\"\"\"\n",
" # Your implementation here\n",
" pass\n",
"\n",
"# result_df = comprehensive_imputation_pipeline(df_missing)\n",
"# print(\"Comprehensive pipeline results:\")\n",
"# print(result_df.isnull().sum())"
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {},
"outputs": [],
"source": [
"# Exercise 2: Missing data pattern analysis\n",
"# Analyze if missing data follows specific patterns:\n",
"# - Time-based patterns\n",
"# - User behavior patterns\n",
"# - System/technical patterns\n",
"# Create insights and recommendations\n",
"\n",
"# Your code here:\n"
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {},
"outputs": [],
"source": [
"# Exercise 3: Impact assessment\n",
"# Assess how different missing data handling approaches\n",
"# affect downstream analysis:\n",
"# - Statistical analysis results\n",
"# - Machine learning model performance\n",
"# - Business insights and decisions\n",
"\n",
"# Your code here:\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Key Takeaways\n",
"\n",
"1. **Understanding Missing Data Types**:\n",
" - **MCAR**: Missing Completely at Random\n",
" - **MAR**: Missing at Random (depends on observed data)\n",
" - **MNAR**: Missing Not at Random (depends on unobserved data)\n",
"\n",
"2. **Detection and Analysis**:\n",
" - Always analyze missing patterns before imputation\n",
" - Use visualizations to understand missing data structure\n",
" - Look for relationships between missing values and other variables\n",
"\n",
"3. **Handling Strategies**:\n",
" - **Deletion**: Simple but can lose valuable information\n",
" - **Simple Imputation**: Fast but may not preserve relationships\n",
" - **Advanced Methods**: KNN, MICE preserve more complex relationships\n",
" - **Business Logic**: Domain knowledge often provides best results\n",
"\n",
"4. **Best Practices**:\n",
" - Create missing data indicators for transparency\n",
" - Validate imputation quality against original data when possible\n",
" - Consider the impact on downstream analysis\n",
" - Document all imputation decisions and methods\n",
"\n",
"## Method Selection Guide\n",
"\n",
"| Scenario | Recommended Method | Rationale |\n",
"|----------|-------------------|----------|\n",
"| < 5% missing, MCAR | Simple imputation | Low impact, efficiency |\n",
"| 5-20% missing, MAR | KNN or Group-based | Preserve relationships |\n",
"| > 20% missing, complex patterns | MICE or Multiple imputation | Handle complex dependencies |\n",
"| Business-critical decisions | Domain knowledge + validation | Accuracy and explainability |\n",
"| Machine learning features | Advanced methods + indicators | Preserve predictive power |\n",
"\n",
"## Common Pitfalls to Avoid\n",
"\n",
"1. **Data Leakage**: Don't use future information to impute past values\n",
"2. **Ignoring Patterns**: Missing data often has meaningful patterns\n",
"3. **Over-imputation**: Sometimes missing data is informative itself\n",
"4. **One-size-fits-all**: Different columns may need different strategies\n",
"5. **No Validation**: Always check if imputation preserved data characteristics"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.13.3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}


@ -0,0 +1,937 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Session 1 - DataFrames - Lesson 7: Merging and Joining DataFrames\n",
"\n",
"## Learning Objectives\n",
"- Master different types of joins (inner, outer, left, right)\n",
"- Understand when to use merge vs join vs concat\n",
"- Handle duplicate keys and join conflicts\n",
"- Learn advanced merging techniques and best practices\n",
"- Practice with real-world data integration scenarios\n",
"\n",
"## Prerequisites\n",
"- Completed Lessons 1-6\n",
"- Understanding of relational database concepts (helpful)\n",
"- Basic knowledge of SQL joins (helpful but not required)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Import required libraries\n",
"import pandas as pd\n",
"import numpy as np\n",
"from datetime import datetime, timedelta\n",
"import matplotlib.pyplot as plt\n",
"import warnings\n",
"warnings.filterwarnings('ignore')\n",
"\n",
"# Set display options\n",
"pd.set_option('display.max_columns', None)\n",
"pd.set_option('display.max_rows', 50)\n",
"\n",
"print(\"Libraries loaded successfully!\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Creating Sample Datasets\n",
"\n",
"Let's create realistic datasets that represent common business scenarios."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Create sample datasets for merging examples\n",
"np.random.seed(42)\n",
"\n",
"# Customer dataset\n",
"customers_data = {\n",
" 'customer_id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],\n",
" 'customer_name': ['Alice Johnson', 'Bob Smith', 'Charlie Brown', 'Diana Prince', 'Eve Wilson',\n",
" 'Frank Miller', 'Grace Lee', 'Henry Davis', 'Ivy Chen', 'Jack Robinson'],\n",
" 'email': ['alice@email.com', 'bob@email.com', 'charlie@email.com', 'diana@email.com', 'eve@email.com',\n",
" 'frank@email.com', 'grace@email.com', 'henry@email.com', 'ivy@email.com', 'jack@email.com'],\n",
" 'age': [28, 35, 42, 31, 29, 45, 38, 33, 27, 41],\n",
" 'city': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix',\n",
" 'Philadelphia', 'San Antonio', 'San Diego', 'Dallas', 'San Jose'],\n",
" 'signup_date': pd.date_range('2023-01-01', periods=10, freq='M')\n",
"}\n",
"\n",
"df_customers = pd.DataFrame(customers_data)\n",
"\n",
"# Orders dataset\n",
"orders_data = {\n",
" 'order_id': [101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112],\n",
" 'customer_id': [1, 2, 1, 3, 4, 2, 5, 1, 6, 11, 3, 2], # Note: customer_id 11 doesn't exist in customers\n",
" 'order_date': pd.date_range('2023-06-01', periods=12, freq='W'),\n",
" 'product': ['Laptop', 'Phone', 'Tablet', 'Laptop', 'Monitor', 'Phone', \n",
" 'Headphones', 'Mouse', 'Keyboard', 'Laptop', 'Tablet', 'Monitor'],\n",
" 'quantity': [1, 2, 1, 1, 1, 1, 3, 2, 1, 1, 2, 1],\n",
" 'amount': [1200, 800, 400, 1200, 300, 800, 150, 50, 75, 1200, 800, 300]\n",
"}\n",
"\n",
"df_orders = pd.DataFrame(orders_data)\n",
"\n",
"# Product information dataset\n",
"products_data = {\n",
" 'product': ['Laptop', 'Phone', 'Tablet', 'Monitor', 'Headphones', 'Mouse', 'Keyboard', 'Webcam'],\n",
" 'category': ['Electronics', 'Electronics', 'Electronics', 'Electronics', \n",
" 'Audio', 'Accessories', 'Accessories', 'Electronics'],\n",
" 'price': [1200, 800, 400, 300, 150, 50, 75, 100],\n",
" 'supplier': ['TechCorp', 'MobileCorp', 'TechCorp', 'DisplayCorp', \n",
" 'AudioCorp', 'AccessoryCorp', 'AccessoryCorp', 'TechCorp']\n",
"}\n",
"\n",
"df_products = pd.DataFrame(products_data)\n",
"\n",
"# Customer segments dataset\n",
"segments_data = {\n",
" 'customer_id': [1, 2, 3, 4, 5, 6, 7, 8, 12, 13], # Some customers not in main customer table\n",
" 'segment': ['Premium', 'Standard', 'Premium', 'Standard', 'Basic', \n",
" 'Premium', 'Standard', 'Basic', 'Premium', 'Standard'],\n",
" 'loyalty_points': [1500, 800, 1200, 600, 200, 1800, 750, 300, 2000, 900]\n",
"}\n",
"\n",
"df_segments = pd.DataFrame(segments_data)\n",
"\n",
"print(\"Sample datasets created:\")\n",
"print(f\"Customers: {df_customers.shape}\")\n",
"print(f\"Orders: {df_orders.shape}\")\n",
"print(f\"Products: {df_products.shape}\")\n",
"print(f\"Segments: {df_segments.shape}\")\n",
"\n",
"print(\"\\nCustomers dataset:\")\n",
"print(df_customers.head())\n",
"\n",
"print(\"\\nOrders dataset:\")\n",
"print(df_orders.head())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Basic Merge Operations\n",
"\n",
"Understanding the fundamental merge operations and join types."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Inner Join - only matching records\n",
"print(\"=== INNER JOIN ===\")\n",
"inner_join = pd.merge(df_customers, df_orders, on='customer_id', how='inner')\n",
"print(f\"Result shape: {inner_join.shape}\")\n",
"print(\"Sample results:\")\n",
"print(inner_join[['customer_name', 'order_id', 'product', 'amount']].head())\n",
"\n",
"print(f\"\\nUnique customers in result: {inner_join['customer_id'].nunique()}\")\n",
"print(f\"Total orders: {len(inner_join)}\")\n",
"\n",
"# Check which customers have orders\n",
"customers_with_orders = inner_join['customer_id'].unique()\n",
"print(f\"Customers with orders: {sorted(customers_with_orders)}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Left Join - all records from left table\n",
"print(\"=== LEFT JOIN ===\")\n",
"left_join = pd.merge(df_customers, df_orders, on='customer_id', how='left')\n",
"print(f\"Result shape: {left_join.shape}\")\n",
"print(\"Sample results:\")\n",
"print(left_join[['customer_name', 'order_id', 'product', 'amount']].head(10))\n",
"\n",
"# Check customers without orders\n",
"customers_without_orders = left_join[left_join['order_id'].isnull()]['customer_name'].tolist()\n",
"print(f\"\\nCustomers without orders: {customers_without_orders}\")\n",
"\n",
"# Summary statistics\n",
"print(f\"\\nTotal records: {len(left_join)}\")\n",
"print(f\"Records with orders: {left_join['order_id'].notna().sum()}\")\n",
"print(f\"Records without orders: {left_join['order_id'].isnull().sum()}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Right Join - all records from right table\n",
"print(\"=== RIGHT JOIN ===\")\n",
"right_join = pd.merge(df_customers, df_orders, on='customer_id', how='right')\n",
"print(f\"Result shape: {right_join.shape}\")\n",
"print(\"Sample results:\")\n",
"print(right_join[['customer_name', 'order_id', 'product', 'amount']].head())\n",
"\n",
"# Check orders without customer information\n",
"orders_without_customers = right_join[right_join['customer_name'].isnull()]\n",
"print(f\"\\nOrders without customer info: {len(orders_without_customers)}\")\n",
"if len(orders_without_customers) > 0:\n",
" print(orders_without_customers[['customer_id', 'order_id', 'product', 'amount']])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Outer Join - all records from both tables\n",
"print(\"=== OUTER JOIN ===\")\n",
"outer_join = pd.merge(df_customers, df_orders, on='customer_id', how='outer')\n",
"print(f\"Result shape: {outer_join.shape}\")\n",
"\n",
"# Analyze the result\n",
"print(\"\\nData quality analysis:\")\n",
"print(f\"Records with complete customer info: {outer_join['customer_name'].notna().sum()}\")\n",
"print(f\"Records with complete order info: {outer_join['order_id'].notna().sum()}\")\n",
"print(f\"Records with both customer and order info: {(outer_join['customer_name'].notna() & outer_join['order_id'].notna()).sum()}\")\n",
"\n",
"# Show different categories of records\n",
"print(\"\\nCustomers without orders:\")\n",
"customers_only = outer_join[(outer_join['customer_name'].notna()) & (outer_join['order_id'].isnull())]\n",
"print(customers_only[['customer_name', 'city']].drop_duplicates())\n",
"\n",
"print(\"\\nOrders without customer data:\")\n",
"orders_only = outer_join[(outer_join['customer_name'].isnull()) & (outer_join['order_id'].notna())]\n",
"print(orders_only[['customer_id', 'order_id', 'product', 'amount']])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Multiple Table Joins\n",
"\n",
"Combining data from multiple sources in sequence."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Three-way join: Customers + Orders + Products\n",
"print(\"=== THREE-WAY JOIN ===\")\n",
"\n",
"# Step 1: Join customers and orders\n",
"customer_orders = pd.merge(df_customers, df_orders, on='customer_id', how='inner')\n",
"print(f\"After joining customers and orders: {customer_orders.shape}\")\n",
"\n",
"# Step 2: Join with products\n",
"complete_data = pd.merge(customer_orders, df_products, on='product', how='left')\n",
"print(f\"After joining with products: {complete_data.shape}\")\n",
"\n",
"# Display comprehensive view\n",
"print(\"\\nComplete order information:\")\n",
"display_cols = ['customer_name', 'order_id', 'product', 'category', 'quantity', 'amount', 'price', 'supplier']\n",
"print(complete_data[display_cols].head())\n",
"\n",
"# Verify data consistency\n",
"print(\"\\nData consistency check:\")\n",
"# Check if order amount matches product price * quantity\n",
"complete_data['calculated_amount'] = complete_data['price'] * complete_data['quantity']\n",
"amount_matches = (complete_data['amount'] == complete_data['calculated_amount']).all()\n",
"print(f\"Order amounts match calculated amounts: {amount_matches}\")\n",
"\n",
"if not amount_matches:\n",
" mismatched = complete_data[complete_data['amount'] != complete_data['calculated_amount']]\n",
" print(f\"\\nMismatched records: {len(mismatched)}\")\n",
" print(mismatched[['order_id', 'product', 'amount', 'calculated_amount']])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Add customer segment information\n",
"print(\"=== ADDING CUSTOMER SEGMENTS ===\")\n",
"\n",
"# Join with segments (left join to keep all customers)\n",
"customers_with_segments = pd.merge(df_customers, df_segments, on='customer_id', how='left')\n",
"print(f\"Customers with segments shape: {customers_with_segments.shape}\")\n",
"\n",
"# Check which customers don't have segment information\n",
"missing_segments = customers_with_segments[customers_with_segments['segment'].isnull()]\n",
"print(f\"\\nCustomers without segment info: {len(missing_segments)}\")\n",
"if len(missing_segments) > 0:\n",
" print(missing_segments[['customer_name', 'city']])\n",
"\n",
"# Create comprehensive customer profile\n",
"full_customer_profile = pd.merge(complete_data, df_segments, on='customer_id', how='left')\n",
"print(f\"\\nFull customer profile shape: {full_customer_profile.shape}\")\n",
"\n",
"# Analyze by segment\n",
"segment_analysis = full_customer_profile.groupby('segment').agg({\n",
" 'amount': ['sum', 'mean', 'count'],\n",
" 'customer_id': 'nunique'\n",
"}).round(2)\n",
"segment_analysis.columns = ['Total_Revenue', 'Avg_Order_Value', 'Total_Orders', 'Unique_Customers']\n",
"print(\"\\nRevenue by customer segment:\")\n",
"print(segment_analysis)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Advanced Merge Techniques\n",
"\n",
"Handling complex merging scenarios and edge cases."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Merge with different column names\n",
"print(\"=== MERGE WITH DIFFERENT COLUMN NAMES ===\")\n",
"\n",
"# Create a dataset with different column name\n",
"customer_demographics = pd.DataFrame({\n",
" 'cust_id': [1, 2, 3, 4, 5],\n",
" 'income_range': ['50-75k', '75-100k', '50-75k', '100k+', '25-50k'],\n",
" 'education': ['Bachelor', 'Master', 'PhD', 'Master', 'Bachelor'],\n",
" 'occupation': ['Engineer', 'Manager', 'Professor', 'Director', 'Analyst']\n",
"})\n",
"\n",
"# Merge using left_on and right_on parameters\n",
"customers_with_demographics = pd.merge(\n",
" df_customers, \n",
" customer_demographics, \n",
" left_on='customer_id', \n",
" right_on='cust_id', \n",
" how='left'\n",
")\n",
"\n",
"print(\"Merge with different column names:\")\n",
"print(customers_with_demographics[['customer_name', 'customer_id', 'cust_id', 'income_range', 'education']].head())\n",
"\n",
"# Clean up duplicate columns\n",
"customers_with_demographics = customers_with_demographics.drop('cust_id', axis=1)\n",
"print(f\"\\nAfter cleanup: {customers_with_demographics.shape}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Merge on multiple columns\n",
"print(\"=== MERGE ON MULTIPLE COLUMNS ===\")\n",
"\n",
"# Create time-based pricing data\n",
"pricing_data = pd.DataFrame({\n",
" 'product': ['Laptop', 'Laptop', 'Phone', 'Phone', 'Tablet', 'Tablet'],\n",
" 'date': pd.to_datetime(['2023-06-01', '2023-08-01', '2023-06-01', '2023-08-01', '2023-06-01', '2023-08-01']),\n",
" 'price': [1200, 1100, 800, 750, 400, 380],\n",
" 'promotion': [False, True, False, True, False, True]\n",
"})\n",
"\n",
"# Add year-month to orders for matching\n",
"df_orders_with_period = df_orders.copy()\n",
"df_orders_with_period['order_month'] = df_orders_with_period['order_date'].dt.to_period('M').dt.start_time\n",
"\n",
"# Create matching periods in pricing data\n",
"pricing_data['period'] = pricing_data['date'].dt.to_period('M').dt.start_time\n",
"\n",
"# Merge on product and time period\n",
"orders_with_pricing = pd.merge(\n",
" df_orders_with_period,\n",
" pricing_data,\n",
" left_on=['product', 'order_month'],\n",
" right_on=['product', 'period'],\n",
" how='left'\n",
")\n",
"\n",
"print(\"Orders with time-based pricing:\")\n",
"print(orders_with_pricing[['order_id', 'product', 'order_date', 'amount', 'price', 'promotion']].head())\n",
"\n",
"# Check for pricing discrepancies\n",
"pricing_discrepancies = orders_with_pricing[\n",
" (orders_with_pricing['amount'] != orders_with_pricing['price'] * orders_with_pricing['quantity']) &\n",
" orders_with_pricing['price'].notna()\n",
"]\n",
"print(f\"\\nOrders with pricing discrepancies: {len(pricing_discrepancies)}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Handling duplicate keys in merge\n",
"print(\"=== HANDLING DUPLICATE KEYS ===\")\n",
"\n",
"# Create data with duplicate keys\n",
"customer_contacts = pd.DataFrame({\n",
" 'customer_id': [1, 1, 2, 2, 3],\n",
" 'contact_type': ['email', 'phone', 'email', 'phone', 'email'],\n",
" 'contact_value': ['alice@email.com', '555-0101', 'bob@email.com', '555-0102', 'charlie@email.com'],\n",
" 'is_primary': [True, False, True, True, True]\n",
"})\n",
"\n",
"print(\"Customer contacts with duplicates:\")\n",
"print(customer_contacts)\n",
"\n",
"# Merge will create cartesian product for duplicate keys\n",
"customers_with_contacts = pd.merge(df_customers, customer_contacts, on='customer_id', how='inner')\n",
"print(f\"\\nResult of merge with duplicates: {customers_with_contacts.shape}\")\n",
"print(customers_with_contacts[['customer_name', 'contact_type', 'contact_value', 'is_primary']].head())\n",
"\n",
"# Strategy 1: Filter before merge\n",
"primary_contacts = customer_contacts[customer_contacts['is_primary'] == True]\n",
"customers_primary_contacts = pd.merge(df_customers, primary_contacts, on='customer_id', how='left')\n",
"print(f\"\\nAfter filtering to primary contacts: {customers_primary_contacts.shape}\")\n",
"\n",
"# Strategy 2: Pivot contacts to columns\n",
"contacts_pivoted = customer_contacts.pivot_table(\n",
" index='customer_id',\n",
" columns='contact_type',\n",
" values='contact_value',\n",
" aggfunc='first'\n",
").reset_index()\n",
"print(\"\\nPivoted contacts:\")\n",
"print(contacts_pivoted)\n",
"\n",
"customers_with_pivoted_contacts = pd.merge(df_customers, contacts_pivoted, on='customer_id', how='left')\n",
"print(f\"\\nAfter merging pivoted contacts: {customers_with_pivoted_contacts.shape}\")"
]
},
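{
"cell_type": "markdown",
"metadata": {},
"source": [
"Two built-in `pd.merge` options complement the manual checks above: `indicator=True` adds a `_merge` column showing where each row came from, and `validate=` raises an error when the join produces an unexpected key relationship (for example duplicate keys in a supposedly one-to-one merge). The cell below is a short sketch of both on the datasets already defined in this lesson."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Merge diagnostics with indicator= and validate=\n",
"print(\"=== MERGE DIAGNOSTICS: indicator AND validate ===\")\n",
"\n",
"# indicator=True labels each row as 'both', 'left_only', or 'right_only'\n",
"audit = pd.merge(df_customers, df_orders, on='customer_id', how='outer', indicator=True)\n",
"print(audit['_merge'].value_counts())\n",
"\n",
"# validate raises MergeError if the key relationship is not what we expect.\n",
"# customer_contacts has duplicate customer_id values, so a one-to-one merge should fail.\n",
"try:\n",
"    pd.merge(df_customers, customer_contacts, on='customer_id', validate='one_to_one')\n",
"except pd.errors.MergeError as e:\n",
"    print(f\"\\nvalidate='one_to_one' raised MergeError: {e}\")\n",
"\n",
"# A one-to-many expectation matches customers -> orders and passes silently\n",
"checked = pd.merge(df_customers, df_orders, on='customer_id', how='left', validate='one_to_many')\n",
"print(f\"\\nvalidate='one_to_many' passed, result shape: {checked.shape}\")"
]
},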
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Index-based Joins\n",
"\n",
"Using DataFrame indices for joining operations."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Set up DataFrames with indices\n",
"print(\"=== INDEX-BASED JOINS ===\")\n",
"\n",
"# Set customer_id as index\n",
"customers_indexed = df_customers.set_index('customer_id')\n",
"segments_indexed = df_segments.set_index('customer_id')\n",
"\n",
"print(\"Customers with index:\")\n",
"print(customers_indexed.head())\n",
"\n",
"# Join using indices\n",
"joined_by_index = customers_indexed.join(segments_indexed, how='left')\n",
"print(f\"\\nJoined by index shape: {joined_by_index.shape}\")\n",
"print(joined_by_index[['customer_name', 'city', 'segment', 'loyalty_points']].head())\n",
"\n",
"# Compare with merge\n",
"merged_equivalent = pd.merge(df_customers, df_segments, on='customer_id', how='left')\n",
"print(f\"\\nEquivalent merge shape: {merged_equivalent.shape}\")\n",
"\n",
"# Verify they're the same (after sorting)\n",
"joined_sorted = joined_by_index.reset_index().sort_values('customer_id')\n",
"merged_sorted = merged_equivalent.sort_values('customer_id')\n",
"are_equal = joined_sorted.equals(merged_sorted)\n",
"print(f\"Results are identical: {are_equal}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Multi-index joins\n",
"print(\"=== MULTI-INDEX JOINS ===\")\n",
"\n",
"# Create a dataset with multiple index levels\n",
"sales_by_region_product = pd.DataFrame({\n",
" 'region': ['North', 'North', 'South', 'South', 'East', 'East'],\n",
" 'product': ['Laptop', 'Phone', 'Laptop', 'Phone', 'Laptop', 'Phone'],\n",
" 'sales_target': [10, 15, 8, 12, 12, 18],\n",
" 'commission_rate': [0.05, 0.04, 0.06, 0.05, 0.05, 0.04]\n",
"})\n",
"\n",
"# Set multi-index\n",
"sales_targets = sales_by_region_product.set_index(['region', 'product'])\n",
"print(\"Sales targets with multi-index:\")\n",
"print(sales_targets)\n",
"\n",
"# Create customer orders with region mapping\n",
"customer_regions = {\n",
" 1: 'North', 2: 'South', 3: 'East', 4: 'North', 5: 'South', 6: 'East'\n",
"}\n",
"\n",
"orders_with_region = df_orders.copy()\n",
"orders_with_region['region'] = orders_with_region['customer_id'].map(customer_regions)\n",
"orders_with_region = orders_with_region.dropna(subset=['region'])\n",
"\n",
"# Merge on multiple columns to match multi-index\n",
"orders_with_targets = pd.merge(\n",
" orders_with_region,\n",
" sales_targets.reset_index(),\n",
" on=['region', 'product'],\n",
" how='left'\n",
")\n",
"\n",
"print(\"\\nOrders with sales targets:\")\n",
"print(orders_with_targets[['order_id', 'region', 'product', 'amount', 'sales_target', 'commission_rate']].head())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. Concatenation Operations\n",
"\n",
"Combining DataFrames vertically and horizontally."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Vertical concatenation (stacking DataFrames)\n",
"print(\"=== VERTICAL CONCATENATION ===\")\n",
"\n",
"# Create additional customer data (new batch)\n",
"new_customers = pd.DataFrame({\n",
" 'customer_id': [11, 12, 13, 14, 15],\n",
" 'customer_name': ['Kate Wilson', 'Liam Brown', 'Mia Garcia', 'Noah Jones', 'Olivia Miller'],\n",
" 'email': ['kate@email.com', 'liam@email.com', 'mia@email.com', 'noah@email.com', 'olivia@email.com'],\n",
" 'age': [26, 39, 31, 44, 28],\n",
" 'city': ['Austin', 'Seattle', 'Denver', 'Boston', 'Miami'],\n",
" 'signup_date': pd.date_range('2024-01-01', periods=5, freq='M')\n",
"})\n",
"\n",
"# Concatenate vertically\n",
"all_customers = pd.concat([df_customers, new_customers], ignore_index=True)\n",
"print(f\"Original customers: {len(df_customers)}\")\n",
"print(f\"New customers: {len(new_customers)}\")\n",
"print(f\"Combined customers: {len(all_customers)}\")\n",
"\n",
"print(\"\\nCombined customer data:\")\n",
"print(all_customers.tail())\n",
"\n",
"# Concatenation with different columns\n",
"customers_with_extra_info = pd.DataFrame({\n",
" 'customer_id': [16, 17],\n",
" 'customer_name': ['Paul Davis', 'Quinn Taylor'],\n",
" 'email': ['paul@email.com', 'quinn@email.com'],\n",
" 'age': [35, 29],\n",
" 'city': ['Portland', 'Nashville'],\n",
" 'signup_date': pd.date_range('2024-06-01', periods=2, freq='M'),\n",
" 'referral_source': ['Google', 'Facebook'] # Extra column\n",
"})\n",
"\n",
"# Concat with different columns (creates NaN for missing columns)\n",
"all_customers_extended = pd.concat([all_customers, customers_with_extra_info], ignore_index=True, sort=False)\n",
"print(f\"\\nAfter adding customers with extra info: {all_customers_extended.shape}\")\n",
"print(\"Missing values in referral_source:\")\n",
"print(all_customers_extended['referral_source'].isnull().sum())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Horizontal concatenation\n",
"print(\"=== HORIZONTAL CONCATENATION ===\")\n",
"\n",
"# Split customer data into parts\n",
"customer_basic_info = df_customers[['customer_id', 'customer_name', 'email']]\n",
"customer_demographics = df_customers[['customer_id', 'age', 'city', 'signup_date']]\n",
"\n",
"print(\"Customer basic info:\")\n",
"print(customer_basic_info.head())\n",
"\n",
"print(\"\\nCustomer demographics:\")\n",
"print(customer_demographics.head())\n",
"\n",
"# Concatenate horizontally (by index)\n",
"customers_recombined = pd.concat([customer_basic_info, customer_demographics.drop('customer_id', axis=1)], axis=1)\n",
"print(f\"\\nRecombined shape: {customers_recombined.shape}\")\n",
"print(customers_recombined.head())\n",
"\n",
"# Verify it matches original\n",
"columns_match = set(customers_recombined.columns) == set(df_customers.columns)\n",
"print(f\"\\nColumns match original: {columns_match}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Concat with keys (creating hierarchical columns)\n",
"print(\"=== CONCAT WITH KEYS ===\")\n",
"\n",
"# Create quarterly sales data\n",
"q1_sales = pd.DataFrame({\n",
" 'product': ['Laptop', 'Phone', 'Tablet'],\n",
" 'units_sold': [50, 75, 30],\n",
" 'revenue': [60000, 60000, 12000]\n",
"})\n",
"\n",
"q2_sales = pd.DataFrame({\n",
" 'product': ['Laptop', 'Phone', 'Tablet'],\n",
" 'units_sold': [45, 80, 35],\n",
" 'revenue': [54000, 64000, 14000]\n",
"})\n",
"\n",
"# Concatenate with keys\n",
"quarterly_sales = pd.concat([q1_sales, q2_sales], keys=['Q1', 'Q2'])\n",
"print(\"Quarterly sales with hierarchical index:\")\n",
"print(quarterly_sales)\n",
"\n",
"# Access specific quarter\n",
"print(\"\\nQ1 sales only:\")\n",
"print(quarterly_sales.loc['Q1'])\n",
"\n",
"# Create summary comparison\n",
"quarterly_comparison = pd.concat([q1_sales.set_index('product'), q2_sales.set_index('product')], \n",
" keys=['Q1', 'Q2'], axis=1)\n",
"print(\"\\nQuarterly comparison (side by side):\")\n",
"print(quarterly_comparison)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6. Performance and Best Practices\n",
"\n",
"Optimizing merge operations and avoiding common pitfalls."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Performance comparison: merge vs join\n",
"import time\n",
"\n",
"print(\"=== PERFORMANCE COMPARISON ===\")\n",
"\n",
"# Create larger datasets for performance testing\n",
"np.random.seed(42)\n",
"large_customers = pd.DataFrame({\n",
" 'customer_id': range(1, 10001),\n",
" 'customer_name': [f'Customer_{i}' for i in range(1, 10001)],\n",
" 'city': np.random.choice(['New York', 'Los Angeles', 'Chicago'], 10000)\n",
"})\n",
"\n",
"large_orders = pd.DataFrame({\n",
" 'order_id': range(1, 50001),\n",
" 'customer_id': np.random.randint(1, 10001, 50000),\n",
" 'amount': np.random.normal(100, 30, 50000)\n",
"})\n",
"\n",
"print(f\"Large customers: {large_customers.shape}\")\n",
"print(f\"Large orders: {large_orders.shape}\")\n",
"\n",
"# Test merge performance\n",
"start_time = time.time()\n",
"merged_result = pd.merge(large_customers, large_orders, on='customer_id', how='inner')\n",
"merge_time = time.time() - start_time\n",
"\n",
"# Test join performance\n",
"customers_indexed = large_customers.set_index('customer_id')\n",
"orders_indexed = large_orders.set_index('customer_id')\n",
"\n",
"start_time = time.time()\n",
"joined_result = customers_indexed.join(orders_indexed, how='inner')\n",
"join_time = time.time() - start_time\n",
"\n",
"print(f\"\\nMerge time: {merge_time:.4f} seconds\")\n",
"print(f\"Join time: {join_time:.4f} seconds\")\n",
"print(f\"Join is {merge_time/join_time:.2f}x faster\")\n",
"\n",
"print(f\"\\nResults shape - Merge: {merged_result.shape}, Join: {joined_result.shape}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Best practices and common pitfalls\n",
"print(\"=== BEST PRACTICES ===\")\n",
"\n",
"def analyze_merge_keys(df1, df2, key_col):\n",
" \"\"\"Analyze merge keys before joining\"\"\"\n",
" print(f\"\\n--- Analyzing merge on '{key_col}' ---\")\n",
" \n",
" # Check for duplicates\n",
" df1_dups = df1[key_col].duplicated().sum()\n",
" df2_dups = df2[key_col].duplicated().sum()\n",
" \n",
" print(f\"Duplicates in left table: {df1_dups}\")\n",
" print(f\"Duplicates in right table: {df2_dups}\")\n",
" \n",
" # Check for missing values\n",
" df1_missing = df1[key_col].isnull().sum()\n",
" df2_missing = df2[key_col].isnull().sum()\n",
" \n",
" print(f\"Missing values in left table: {df1_missing}\")\n",
" print(f\"Missing values in right table: {df2_missing}\")\n",
" \n",
" # Check overlap\n",
" left_keys = set(df1[key_col].dropna())\n",
" right_keys = set(df2[key_col].dropna())\n",
" \n",
" overlap = left_keys & right_keys\n",
" left_only = left_keys - right_keys\n",
" right_only = right_keys - left_keys\n",
" \n",
" print(f\"Keys in both tables: {len(overlap)}\")\n",
" print(f\"Keys only in left: {len(left_only)}\")\n",
" print(f\"Keys only in right: {len(right_only)}\")\n",
" \n",
" # Predict result sizes\n",
" if df1_dups == 0 and df2_dups == 0:\n",
" inner_size = len(overlap)\n",
" left_size = len(df1)\n",
" right_size = len(df2)\n",
" outer_size = len(left_keys | right_keys)\n",
" else:\n",
" print(\"Warning: Duplicates present, result size may be larger than expected\")\n",
" inner_size = \"Cannot predict (duplicates present)\"\n",
" left_size = \"Cannot predict (duplicates present)\"\n",
" right_size = \"Cannot predict (duplicates present)\"\n",
" outer_size = \"Cannot predict (duplicates present)\"\n",
" \n",
" print(f\"\\nPredicted result sizes:\")\n",
" print(f\"Inner join: {inner_size}\")\n",
" print(f\"Left join: {left_size}\")\n",
" print(f\"Right join: {right_size}\")\n",
" print(f\"Outer join: {outer_size}\")\n",
"\n",
"# Analyze our sample data\n",
"analyze_merge_keys(df_customers, df_orders, 'customer_id')\n",
"analyze_merge_keys(df_customers, df_segments, 'customer_id')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Data validation after merge\n",
"def validate_merge_result(df, expected_rows=None, key_col=None):\n",
" \"\"\"Validate merge results\"\"\"\n",
" print(\"\\n=== MERGE VALIDATION ===\")\n",
" \n",
" print(f\"Result shape: {df.shape}\")\n",
" \n",
" if expected_rows:\n",
" print(f\"Expected rows: {expected_rows}\")\n",
" if len(df) != expected_rows:\n",
" print(\"⚠️ Row count doesn't match expectation!\")\n",
" \n",
" # Check for unexpected duplicates\n",
" if key_col and key_col in df.columns:\n",
" duplicates = df[key_col].duplicated().sum()\n",
" if duplicates > 0:\n",
" print(f\"⚠️ Found {duplicates} duplicate keys after merge\")\n",
" \n",
" # Check for missing values in key columns\n",
" missing_summary = df.isnull().sum()\n",
" critical_missing = missing_summary[missing_summary > 0]\n",
" \n",
" if len(critical_missing) > 0:\n",
" print(\"Missing values after merge:\")\n",
" print(critical_missing)\n",
" \n",
" # Data type consistency\n",
" print(f\"\\nData types:\")\n",
" print(df.dtypes)\n",
" \n",
" return df\n",
"\n",
"# Example validation\n",
"sample_merge = pd.merge(df_customers, df_orders, on='customer_id', how='inner')\n",
"validated_result = validate_merge_result(sample_merge, key_col='customer_id')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Practice Exercises\n",
"\n",
"Apply merging and joining techniques to real-world scenarios:"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
"# Exercise 1: Customer Lifetime Value Analysis\n",
"# Create a comprehensive customer analysis by joining:\n",
"# - Customer demographics\n",
"# - Order history\n",
"# - Product information\n",
"# - Customer segments\n",
"# Calculate CLV metrics for each customer\n",
"\n",
"def calculate_customer_lifetime_value(customers, orders, products, segments):\n",
" \"\"\"Calculate comprehensive customer lifetime value metrics\"\"\"\n",
" # Your implementation here\n",
" pass\n",
"\n",
"# clv_analysis = calculate_customer_lifetime_value(df_customers, df_orders, df_products, df_segments)\n",
"# print(\"Customer Lifetime Value Analysis:\")\n",
"# print(clv_analysis.head())"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [],
"source": [
"# Exercise 2: Data Quality Assessment\n",
"# Create a function that analyzes data quality issues when merging multiple datasets:\n",
"# - Identify orphaned records\n",
"# - Find data inconsistencies\n",
"# - Suggest data cleaning steps\n",
"# - Provide merge recommendations\n",
"\n",
"# Your code here:\n"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [],
"source": [
"# Exercise 3: Time-series Join Challenge\n",
"# Create a complex time-based join scenario:\n",
"# - Join orders with time-varying product prices\n",
"# - Handle seasonal promotions\n",
"# - Calculate accurate historical revenue\n",
"# - Account for price changes over time\n",
"\n",
"# Your code here:\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Key Takeaways\n",
"\n",
"1. **Join Types**:\n",
" - **Inner**: Only matching records from both tables\n",
" - **Left**: All records from left table + matching from right\n",
" - **Right**: All records from right table + matching from left\n",
" - **Outer**: All records from both tables\n",
"\n",
"2. **Method Selection**:\n",
" - **`pd.merge()`**: Most flexible, works with any columns\n",
" - **`.join()`**: Faster for index-based joins\n",
" - **`pd.concat()`**: For stacking DataFrames vertically/horizontally\n",
"\n",
"3. **Best Practices**:\n",
" - Always analyze merge keys before joining\n",
" - Check for duplicates and missing values\n",
" - Validate results after merging\n",
" - Use appropriate join types for your use case\n",
" - Consider performance implications for large datasets\n",
"\n",
"4. **Common Pitfalls**:\n",
" - Cartesian products from duplicate keys\n",
" - Unexpected result sizes\n",
" - Data type inconsistencies\n",
" - Missing value propagation\n",
"\n",
"## Join Type Selection Guide\n",
"\n",
"| Use Case | Recommended Join | Rationale |\n",
"|----------|-----------------|----------|\n",
"| Customer orders analysis | Inner | Only customers with orders |\n",
"| Customer segmentation | Left | Keep all customers, add segment info |\n",
"| Order validation | Right | Keep all orders, check customer validity |\n",
"| Data completeness analysis | Outer | See all records and identify gaps |\n",
"| Performance-critical operations | Index-based join | Faster execution |\n",
"\n",
"## Performance Tips\n",
"\n",
"1. **Index Usage**: Set indexes for frequently joined columns\n",
"2. **Data Types**: Ensure consistent data types before joining\n",
"3. **Memory Management**: Consider chunking for very large datasets\n",
"4. **Join Order**: Start with smallest datasets\n",
"5. **Validation**: Always validate merge results"
]
}
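,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal, illustrative sketch of the pitfalls and best practices above (not part of the course exercises), the next cell reuses this lesson's `df_customers`, `df_orders`, and `customer_contacts`: the `validate` argument of `pd.merge` raises a `MergeError` when duplicate keys would create a cartesian product, and `indicator=True` adds a `_merge` column showing which table each row came from. It assumes `customer_id` is unique in `df_customers`, as elsewhere in this lesson."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Guard against accidental cartesian products and inspect unmatched keys\n",
"print(\"=== MERGE GUARDS: validate AND indicator ===\")\n",
"\n",
"# validate='one_to_many' asserts customer_id is unique on the left side;\n",
"# pandas raises pd.errors.MergeError if that assumption is violated\n",
"safe_merge = pd.merge(df_customers, df_orders, on='customer_id', how='left', validate='one_to_many')\n",
"print(f\"Validated one-to-many merge shape: {safe_merge.shape}\")\n",
"\n",
"# customer_contacts holds duplicate customer_id values, so a one_to_one check fails loudly\n",
"try:\n",
"    pd.merge(df_customers, customer_contacts, on='customer_id', validate='one_to_one')\n",
"except pd.errors.MergeError as e:\n",
"    print(f\"MergeError caught as expected: {e}\")\n",
"\n",
"# indicator=True adds a _merge column ('both', 'left_only', 'right_only') to expose unmatched keys\n",
"flagged = pd.merge(df_customers, df_orders, on='customer_id', how='outer', indicator=True)\n",
"print(\"\\nRow origin counts from the indicator column:\")\n",
"print(flagged['_merge'].value_counts())"
]
}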
],
"metadata": {
"kernelspec": {
"display_name": "venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.13.3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}

File diff suppressed because it is too large Load diff

File diff suppressed because it is too large Load diff

File diff suppressed because it is too large Load diff

File diff suppressed because it is too large Load diff

File diff suppressed because one or more lines are too long

View file

@ -0,0 +1,815 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Session 1 - DataFrames - Lesson 13: Advanced Data Cleaning\n",
"\n",
"## Learning Objectives\n",
"- Master advanced techniques for data cleaning and validation\n",
"- Learn to detect and handle various types of data quality issues\n",
"- Understand data standardization and normalization techniques\n",
"- Practice with real-world messy data scenarios\n",
"- Develop automated data cleaning pipelines\n",
"\n",
"## Prerequisites\n",
"- Completed previous lessons on DataFrames\n",
"- Understanding of basic data cleaning concepts\n",
"- Familiarity with regular expressions (helpful but not required)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Import required libraries\n",
"import pandas as pd\n",
"import numpy as np\n",
"import re\n",
"from datetime import datetime, timedelta\n",
"import warnings\n",
"warnings.filterwarnings('ignore')\n",
"\n",
"# Display settings\n",
"pd.set_option('display.max_columns', None)\n",
"pd.set_option('display.max_rows', 100)\n",
"\n",
"print(\"Libraries loaded successfully!\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Creating Messy Sample Data\n",
"\n",
"Let's create a realistic messy dataset to practice advanced cleaning techniques."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"\n",
"# Create intentionally messy data that mimics real-world issues\n",
"np.random.seed(42)\n",
"\n",
"# Base data\n",
"n_records = 200\n",
"messy_data = {\n",
" 'customer_id': [f'CUST{i:04d}' if i % 10 != 0 else f'cust{i:04d}' for i in range(1, n_records + 1)],\n",
" 'customer_name': [\n",
" 'John Smith', 'jane doe', 'MARY JOHNSON', 'bob wilson', 'Sarah Davis',\n",
" 'Mike Brown', 'lisa garcia', 'DAVID MILLER', 'Amy Wilson', 'Tom Anderson'\n",
" ] * 20,\n",
" 'email': [\n",
" 'john.smith@email.com', 'JANE.DOE@EMAIL.COM', 'mary@company.org',\n",
" 'bob..wilson@test.com', 'sarah@invalid-email', 'mike@email.com',\n",
" 'lisa.garcia@email.com', 'david@company.org', 'amy@email.com', 'tom@test.com'\n",
" ] * 20,\n",
" 'phone': [\n",
" '(555) 123-4567', '555.987.6543', '5551234567', '555-987-6543',\n",
" '(555)123-4567', '+1-555-123-4567', '555 123 4567', '5559876543',\n",
" '(555) 987 6543', '555-123-4567'\n",
" ] * 20,\n",
" 'address': [\n",
" '123 Main St, Anytown, NY 12345', '456 Oak Ave, Boston, MA 02101',\n",
" '789 Pine Rd, Los Angeles, CA 90210', '321 Elm St, Chicago, IL 60601',\n",
" '654 Maple Dr, Houston, TX 77001', '987 Cedar Ln, Phoenix, AZ 85001',\n",
" '147 Birch Way, Philadelphia, PA 19101', '258 Ash Ct, San Antonio, TX 78201',\n",
" '369 Walnut St, San Diego, CA 92101', '741 Cherry Ave, Dallas, TX 75201'\n",
" ] * 20,\n",
" 'purchase_amount': np.random.normal(100, 30, n_records).round(2),\n",
" 'purchase_date': [\n",
" '2024-01-15', '01/16/2024', '2024-1-17', '16-01-2024', '2024/01/18',\n",
" 'January 19, 2024', '2024-01-20', '01-21-24', '2024.01.22', '23/01/2024'\n",
" ] * 20,\n",
" 'category': [\n",
" 'Electronics', 'electronics', 'ELECTRONICS', 'Books', 'books',\n",
" 'Clothing', 'clothing', 'CLOTHING', 'Home & Garden', 'home&garden'\n",
" ] * 20,\n",
" 'satisfaction_score': np.random.choice([1, 2, 3, 4, 5, 99, -1, None], n_records, p=[0.05, 0.1, 0.15, 0.35, 0.3, 0.02, 0.02, 0.01])\n",
"}\n",
"\n",
"# Convert to DataFrame first\n",
"df_messy = pd.DataFrame(messy_data)\n",
"\n",
"# Introduce missing values and anomalies using proper indexing\n",
"df_messy.loc[df_messy.index[::25], 'customer_name'] = None # Some missing names\n",
"df_messy.loc[df_messy.index[::30], 'email'] = None # Some missing emails\n",
"df_messy.loc[df_messy.index[::35], 'purchase_amount'] = np.nan # Some missing amounts\n",
"df_messy.loc[df_messy.index[::40], 'purchase_amount'] = -999 # Invalid negative values\n",
"\n",
"# Add some duplicate records\n",
"duplicate_indices = [0, 1, 2, 3, 4]\n",
"duplicate_rows = df_messy.iloc[duplicate_indices].copy()\n",
"df_messy = pd.concat([df_messy, duplicate_rows], ignore_index=True)\n",
"\n",
"print(\"Messy dataset created:\")\n",
"print(f\"Shape: {df_messy.shape}\")\n",
"print(\"\\nFirst few rows:\")\n",
"print(df_messy.head(10))\n",
"print(\"\\nData types:\")\n",
"print(df_messy.dtypes)\n",
"print(\"\\nSample of data quality issues:\")\n",
"print(\"\\n1. Missing values:\")\n",
"print(df_messy.isnull().sum())\n",
"print(\"\\n2. Inconsistent formatting examples:\")\n",
"print(\"Customer IDs:\", df_messy['customer_id'].head(15).tolist())\n",
"print(\"Customer names:\", df_messy['customer_name'].dropna().head(5).tolist())\n",
"print(\"Categories:\", df_messy['category'].unique()[:5])\n",
"print(\"\\n3. Invalid satisfaction scores:\")\n",
"print(\"Unique satisfaction scores:\", sorted(df_messy['satisfaction_score'].dropna().unique()))\n",
"print(\"\\n4. Invalid purchase amounts:\")\n",
"print(\"Negative amounts:\", df_messy[df_messy['purchase_amount'] < 0]['purchase_amount'].count())\n",
"print(\"\\n5. Date format inconsistencies:\")\n",
"print(\"Sample dates:\", df_messy['purchase_date'].head(10).tolist())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Data Quality Assessment\n",
"\n",
"First, let's assess the quality of our messy data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def assess_data_quality(df):\n",
" \"\"\"Comprehensive data quality assessment\"\"\"\n",
" print(\"=== DATA QUALITY ASSESSMENT ===\")\n",
" print(f\"Dataset shape: {df.shape}\")\n",
" print(f\"Total cells: {df.size}\")\n",
" \n",
" # Missing values analysis\n",
" print(\"\\n--- Missing Values ---\")\n",
" missing_stats = pd.DataFrame({\n",
" 'Missing_Count': df.isnull().sum(),\n",
" 'Missing_Percentage': (df.isnull().sum() / len(df)) * 100\n",
" })\n",
" missing_stats = missing_stats[missing_stats['Missing_Count'] > 0]\n",
" print(missing_stats.round(2))\n",
" \n",
" # Duplicate analysis\n",
" print(\"\\n--- Duplicates ---\")\n",
" total_duplicates = df.duplicated().sum()\n",
" print(f\"Complete duplicate rows: {total_duplicates}\")\n",
" \n",
" # Column-specific analysis\n",
" print(\"\\n--- Column Analysis ---\")\n",
" for col in df.columns:\n",
" unique_count = df[col].nunique()\n",
" unique_percentage = (unique_count / len(df)) * 100\n",
" print(f\"{col}: {unique_count} unique values ({unique_percentage:.1f}%)\")\n",
" \n",
" # Data type issues\n",
" print(\"\\n--- Data Types ---\")\n",
" print(df.dtypes)\n",
" \n",
" return missing_stats, total_duplicates\n",
"\n",
"# Assess the messy data\n",
"missing_stats, duplicate_count = assess_data_quality(df_messy)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Identify specific data quality issues\n",
"def identify_issues(df):\n",
" \"\"\"Identify specific data quality issues\"\"\"\n",
" issues = []\n",
" \n",
" # Check for inconsistent formatting\n",
" print(\"=== SPECIFIC ISSUES IDENTIFIED ===\")\n",
" \n",
" # Customer ID formatting\n",
" id_patterns = df['customer_id'].str.extract(r'(CUST|cust)(\\d+)').fillna('')\n",
" inconsistent_ids = (id_patterns[0] == 'cust').sum()\n",
" print(f\"Inconsistent customer ID format: {inconsistent_ids} records\")\n",
" \n",
" # Email validation\n",
" email_pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$'\n",
" invalid_emails = ~df['email'].str.match(email_pattern, na=False)\n",
" print(f\"Invalid email formats: {invalid_emails.sum()} records\")\n",
" \n",
" # Negative purchase amounts\n",
" negative_amounts = (df['purchase_amount'] < 0).sum()\n",
" print(f\"Negative purchase amounts: {negative_amounts} records\")\n",
" \n",
" # Invalid satisfaction scores\n",
" invalid_scores = ((df['satisfaction_score'] < 1) | (df['satisfaction_score'] > 5)) & df['satisfaction_score'].notna()\n",
" print(f\"Invalid satisfaction scores: {invalid_scores.sum()} records\")\n",
" \n",
" # Category inconsistencies\n",
" category_variations = df['category'].value_counts()\n",
" print(f\"\\nCategory variations: {len(category_variations)} different values\")\n",
" print(category_variations)\n",
" \n",
" return issues\n",
"\n",
"issues = identify_issues(df_messy)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Text Data Standardization\n",
"\n",
"Clean and standardize text fields."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Text cleaning functions\n",
"def clean_text_data(df):\n",
" \"\"\"Comprehensive text data cleaning\"\"\"\n",
" df_clean = df.copy()\n",
" \n",
" # Standardize customer names\n",
" print(\"Cleaning customer names...\")\n",
" df_clean['customer_name_clean'] = df_clean['customer_name'].str.strip() # Remove whitespace\n",
" df_clean['customer_name_clean'] = df_clean['customer_name_clean'].str.title() # Title case\n",
" df_clean['customer_name_clean'] = df_clean['customer_name_clean'].str.replace(r'\\s+', ' ', regex=True) # Multiple spaces\n",
" \n",
" # Standardize customer IDs\n",
" print(\"Standardizing customer IDs...\")\n",
" df_clean['customer_id_clean'] = df_clean['customer_id'].str.upper() # All uppercase\n",
" df_clean['customer_id_clean'] = df_clean['customer_id_clean'].str.replace('CUST', 'CUST') # Ensure consistent prefix\n",
" \n",
" # Clean email addresses\n",
" print(\"Cleaning email addresses...\")\n",
" df_clean['email_clean'] = df_clean['email'].str.lower() # Lowercase\n",
" df_clean['email_clean'] = df_clean['email_clean'].str.strip() # Remove whitespace\n",
" df_clean['email_clean'] = df_clean['email_clean'].str.replace(r'\\.{2,}', '.', regex=True) # Multiple dots\n",
" \n",
" # Standardize categories\n",
" print(\"Standardizing categories...\")\n",
" category_mapping = {\n",
" 'electronics': 'Electronics',\n",
" 'ELECTRONICS': 'Electronics',\n",
" 'books': 'Books',\n",
" 'clothing': 'Clothing',\n",
" 'CLOTHING': 'Clothing',\n",
" 'home&garden': 'Home & Garden',\n",
" 'Home & Garden': 'Home & Garden'\n",
" }\n",
" df_clean['category_clean'] = df_clean['category'].map(category_mapping).fillna(df_clean['category'])\n",
" \n",
" return df_clean\n",
"\n",
"# Apply text cleaning\n",
"df_text_clean = clean_text_data(df_messy)\n",
"\n",
"print(\"\\nText cleaning comparison:\")\n",
"comparison_cols = ['customer_name', 'customer_name_clean', 'customer_id', 'customer_id_clean', \n",
" 'email', 'email_clean', 'category', 'category_clean']\n",
"print(df_text_clean[comparison_cols].head(10))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Advanced text cleaning with regex\n",
"def advanced_text_cleaning(df):\n",
" \"\"\"Advanced text cleaning using regular expressions\"\"\"\n",
" df_advanced = df.copy()\n",
" \n",
" # Extract and standardize address components\n",
" print(\"Processing addresses...\")\n",
" # Basic address pattern: number street, city, state zipcode\n",
" address_pattern = r'(\\d+)\\s+([^,]+),\\s*([^,]+),\\s*([A-Z]{2})\\s+(\\d{5})'\n",
" address_parts = df_advanced['address'].str.extract(address_pattern)\n",
" address_parts.columns = ['street_number', 'street_name', 'city', 'state', 'zipcode']\n",
" \n",
" # Clean street names\n",
" address_parts['street_name'] = address_parts['street_name'].str.title()\n",
" address_parts['city'] = address_parts['city'].str.title()\n",
" \n",
" # Combine cleaned parts\n",
" df_advanced['address_clean'] = (\n",
" address_parts['street_number'] + ' ' + address_parts['street_name'] + ', ' +\n",
" address_parts['city'] + ', ' + address_parts['state'] + ' ' + address_parts['zipcode']\n",
" )\n",
" \n",
" # Add individual address components\n",
" for col in address_parts.columns:\n",
" df_advanced[col] = address_parts[col]\n",
" \n",
" return df_advanced\n",
"\n",
"# Apply advanced cleaning\n",
"df_advanced_clean = advanced_text_cleaning(df_text_clean)\n",
"\n",
"print(\"Address cleaning results:\")\n",
"print(df_advanced_clean[['address', 'address_clean', 'city', 'state', 'zipcode']].head())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Phone Number Standardization\n",
"\n",
"Clean and standardize phone numbers using regex patterns."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def standardize_phone_numbers(df):\n",
" \"\"\"Standardize phone numbers to consistent format\"\"\"\n",
" df_phone = df.copy()\n",
" \n",
" def clean_phone(phone):\n",
" \"\"\"Clean individual phone number\"\"\"\n",
" if pd.isna(phone):\n",
" return None\n",
" \n",
" # Remove all non-digit characters\n",
" digits_only = re.sub(r'\\D', '', str(phone))\n",
" \n",
" # Handle different formats\n",
" if len(digits_only) == 10:\n",
" # Format as (XXX) XXX-XXXX\n",
" return f\"({digits_only[:3]}) {digits_only[3:6]}-{digits_only[6:]}\"\n",
" elif len(digits_only) == 11 and digits_only.startswith('1'):\n",
" # Remove country code and format\n",
" phone_part = digits_only[1:]\n",
" return f\"({phone_part[:3]}) {phone_part[3:6]}-{phone_part[6:]}\"\n",
" else:\n",
" # Invalid phone number\n",
" return 'INVALID'\n",
" \n",
" # Apply phone cleaning\n",
" df_phone['phone_clean'] = df_phone['phone'].apply(clean_phone)\n",
" \n",
" # Extract area code\n",
" df_phone['area_code'] = df_phone['phone_clean'].str.extract(r'\\((\\d{3})\\)')\n",
" \n",
" # Flag invalid phone numbers\n",
" df_phone['phone_is_valid'] = df_phone['phone_clean'] != 'INVALID'\n",
" \n",
" return df_phone\n",
"\n",
"# Apply phone standardization\n",
"df_phone_clean = standardize_phone_numbers(df_advanced_clean)\n",
"\n",
"print(\"Phone number standardization:\")\n",
"print(df_phone_clean[['phone', 'phone_clean', 'area_code', 'phone_is_valid']].head(15))\n",
"\n",
"print(\"\\nPhone validation summary:\")\n",
"print(df_phone_clean['phone_is_valid'].value_counts())\n",
"\n",
"print(\"\\nArea code distribution:\")\n",
"print(df_phone_clean['area_code'].value_counts().head())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Date Standardization\n",
"\n",
"Parse and standardize dates from various formats."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def standardize_dates(df):\n",
" \"\"\"Parse and standardize dates from multiple formats\"\"\"\n",
" df_dates = df.copy()\n",
" \n",
" def parse_date(date_str):\n",
" \"\"\"Try to parse date from various formats\"\"\"\n",
" if pd.isna(date_str):\n",
" return None\n",
" \n",
" date_str = str(date_str).strip()\n",
" \n",
" # Common date formats to try\n",
" formats = [\n",
" '%Y-%m-%d', # 2024-01-15\n",
" '%m/%d/%Y', # 01/16/2024\n",
" '%Y-%m-%d', # 2024-1-17 (handled by first format)\n",
" '%d-%m-%Y', # 16-01-2024\n",
" '%Y/%m/%d', # 2024/01/18\n",
" '%B %d, %Y', # January 19, 2024\n",
" '%m-%d-%y', # 01-21-24\n",
" '%Y.%m.%d', # 2024.01.22\n",
" '%d/%m/%Y' # 23/01/2024\n",
" ]\n",
" \n",
" for fmt in formats:\n",
" try:\n",
" return pd.to_datetime(date_str, format=fmt)\n",
" except ValueError:\n",
" continue\n",
" \n",
" # If all else fails, try pandas' flexible parser\n",
" try:\n",
" return pd.to_datetime(date_str, infer_datetime_format=True)\n",
" except:\n",
" return None\n",
" \n",
" # Apply date parsing\n",
" print(\"Parsing dates...\")\n",
" df_dates['purchase_date_clean'] = df_dates['purchase_date'].apply(parse_date)\n",
" \n",
" # Flag unparseable dates\n",
" df_dates['date_is_valid'] = df_dates['purchase_date_clean'].notna()\n",
" \n",
" # Extract date components for valid dates\n",
" df_dates['purchase_year'] = df_dates['purchase_date_clean'].dt.year\n",
" df_dates['purchase_month'] = df_dates['purchase_date_clean'].dt.month\n",
" df_dates['purchase_day'] = df_dates['purchase_date_clean'].dt.day\n",
" df_dates['purchase_day_of_week'] = df_dates['purchase_date_clean'].dt.day_name()\n",
" \n",
" return df_dates\n",
"\n",
"# Apply date standardization\n",
"df_date_clean = standardize_dates(df_phone_clean)\n",
"\n",
"print(\"Date standardization results:\")\n",
"print(df_date_clean[['purchase_date', 'purchase_date_clean', 'date_is_valid', \n",
" 'purchase_year', 'purchase_month', 'purchase_day_of_week']].head(15))\n",
"\n",
"print(\"\\nDate parsing summary:\")\n",
"print(df_date_clean['date_is_valid'].value_counts())\n",
"\n",
"invalid_dates = df_date_clean[~df_date_clean['date_is_valid']]['purchase_date'].unique()\n",
"if len(invalid_dates) > 0:\n",
" print(f\"\\nInvalid date formats found: {invalid_dates}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. Numerical Data Cleaning\n",
"\n",
"Handle outliers, invalid values, and missing numerical data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def clean_numerical_data(df):\n",
" \"\"\"Clean and validate numerical data\"\"\"\n",
" df_numeric = df.copy()\n",
" \n",
" # Clean purchase amounts\n",
" print(\"Cleaning purchase amounts...\")\n",
" \n",
" # Flag invalid values\n",
" df_numeric['amount_is_valid'] = (\n",
" df_numeric['purchase_amount'].notna() & \n",
" (df_numeric['purchase_amount'] >= 0) & \n",
" (df_numeric['purchase_amount'] <= 10000) # Reasonable upper limit\n",
" )\n",
" \n",
" # Replace invalid values with NaN\n",
" df_numeric['purchase_amount_clean'] = df_numeric['purchase_amount'].where(\n",
" df_numeric['amount_is_valid'], np.nan\n",
" )\n",
" \n",
" # Detect outliers using IQR method\n",
" Q1 = df_numeric['purchase_amount_clean'].quantile(0.25)\n",
" Q3 = df_numeric['purchase_amount_clean'].quantile(0.75)\n",
" IQR = Q3 - Q1\n",
" lower_bound = Q1 - 1.5 * IQR\n",
" upper_bound = Q3 + 1.5 * IQR\n",
" \n",
" df_numeric['amount_is_outlier'] = (\n",
" (df_numeric['purchase_amount_clean'] < lower_bound) |\n",
" (df_numeric['purchase_amount_clean'] > upper_bound)\n",
" )\n",
" \n",
" # Clean satisfaction scores\n",
" print(\"Cleaning satisfaction scores...\")\n",
" \n",
" # Valid satisfaction scores are 1-5\n",
" df_numeric['satisfaction_is_valid'] = (\n",
" df_numeric['satisfaction_score'].notna() &\n",
" (df_numeric['satisfaction_score'].between(1, 5))\n",
" )\n",
" \n",
" df_numeric['satisfaction_score_clean'] = df_numeric['satisfaction_score'].where(\n",
" df_numeric['satisfaction_is_valid'], np.nan\n",
" )\n",
" \n",
" return df_numeric\n",
"\n",
"# Apply numerical cleaning\n",
"df_numeric_clean = clean_numerical_data(df_date_clean)\n",
"\n",
"print(\"Numerical data cleaning results:\")\n",
"print(df_numeric_clean[['purchase_amount', 'purchase_amount_clean', 'amount_is_valid', \n",
" 'amount_is_outlier', 'satisfaction_score', 'satisfaction_score_clean', \n",
" 'satisfaction_is_valid']].head(15))\n",
"\n",
"print(\"\\nNumerical data quality summary:\")\n",
"print(f\"Valid purchase amounts: {df_numeric_clean['amount_is_valid'].sum()}/{len(df_numeric_clean)}\")\n",
"print(f\"Outlier amounts: {df_numeric_clean['amount_is_outlier'].sum()}\")\n",
"print(f\"Valid satisfaction scores: {df_numeric_clean['satisfaction_is_valid'].sum()}/{len(df_numeric_clean)}\")\n",
"\n",
"# Show statistics for cleaned data\n",
"print(\"\\nCleaned amount statistics:\")\n",
"print(df_numeric_clean['purchase_amount_clean'].describe())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6. Duplicate Detection and Handling\n",
"\n",
"Identify and handle duplicate records intelligently."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def handle_duplicates(df):\n",
" \"\"\"Comprehensive duplicate detection and handling\"\"\"\n",
" df_dedup = df.copy()\n",
" \n",
" print(\"=== DUPLICATE ANALYSIS ===\")\n",
" \n",
" # 1. Exact duplicates\n",
" exact_duplicates = df_dedup.duplicated()\n",
" print(f\"Exact duplicate rows: {exact_duplicates.sum()}\")\n",
" \n",
" # 2. Duplicates based on key columns (likely same customer)\n",
" key_cols = ['customer_name_clean', 'email_clean']\n",
" key_duplicates = df_dedup.duplicated(subset=key_cols, keep=False)\n",
" print(f\"Duplicate customers (by name/email): {key_duplicates.sum()}\")\n",
" \n",
" # 3. Near duplicates (similar but not exact)\n",
" # For demonstration, we'll check phone numbers\n",
" phone_duplicates = df_dedup.duplicated(subset=['phone_clean'], keep=False)\n",
" print(f\"Duplicate phone numbers: {phone_duplicates.sum()}\")\n",
" \n",
" # Show duplicate examples\n",
" if key_duplicates.any():\n",
" print(\"\\nExample duplicate customers:\")\n",
" duplicate_customers = df_dedup[key_duplicates].sort_values(key_cols)\n",
" print(duplicate_customers[key_cols + ['customer_id_clean', 'purchase_amount_clean']].head(10))\n",
" \n",
" # Remove exact duplicates\n",
" print(f\"\\nRemoving {exact_duplicates.sum()} exact duplicates...\")\n",
" df_no_exact_dups = df_dedup[~exact_duplicates]\n",
" \n",
" # For customer duplicates, keep the one with the highest purchase amount\n",
" print(\"Handling customer duplicates (keeping highest purchase)...\")\n",
" df_final = df_no_exact_dups.sort_values('purchase_amount_clean', ascending=False).drop_duplicates(\n",
" subset=key_cols, keep='first'\n",
" )\n",
" \n",
" print(f\"Final dataset size after deduplication: {len(df_final)} (was {len(df)})\")\n",
" \n",
" return df_final\n",
"\n",
"# Apply duplicate handling\n",
"df_deduplicated = handle_duplicates(df_numeric_clean)\n",
"\n",
"print(f\"\\nRows removed: {len(df_numeric_clean) - len(df_deduplicated)}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 7. Data Validation and Quality Scores\n",
"\n",
"Create comprehensive data quality metrics."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def calculate_quality_scores(df):\n",
" \"\"\"Calculate comprehensive data quality scores\"\"\"\n",
" df_quality = df.copy()\n",
" \n",
" # Define quality checks\n",
" quality_checks = {\n",
" 'has_customer_name': df_quality['customer_name_clean'].notna(),\n",
" 'has_valid_email': df_quality['email_clean'].notna() & \n",
" df_quality['email_clean'].str.match(r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$', na=False),\n",
" 'has_valid_phone': df_quality['phone_is_valid'] == True,\n",
" 'has_valid_date': df_quality['date_is_valid'] == True,\n",
" 'has_valid_amount': df_quality['amount_is_valid'] == True,\n",
" 'has_valid_satisfaction': df_quality['satisfaction_is_valid'] == True,\n",
" 'amount_not_outlier': df_quality['amount_is_outlier'] == False,\n",
" 'has_complete_address': df_quality['city'].notna() & df_quality['state'].notna() & df_quality['zipcode'].notna()\n",
" }\n",
" \n",
" # Add individual quality flags\n",
" for check_name, check_result in quality_checks.items():\n",
" df_quality[f'quality_{check_name}'] = check_result.astype(int)\n",
" \n",
" # Calculate overall quality score (percentage of passed checks)\n",
" quality_cols = [col for col in df_quality.columns if col.startswith('quality_')]\n",
" df_quality['data_quality_score'] = df_quality[quality_cols].mean(axis=1) * 100\n",
" \n",
" # Categorize quality levels\n",
" def quality_category(score):\n",
" if score >= 90:\n",
" return 'Excellent'\n",
" elif score >= 75:\n",
" return 'Good'\n",
" elif score >= 50:\n",
" return 'Fair'\n",
" else:\n",
" return 'Poor'\n",
" \n",
" df_quality['quality_category'] = df_quality['data_quality_score'].apply(quality_category)\n",
" \n",
" return df_quality, quality_checks\n",
"\n",
"# Calculate quality scores\n",
"df_with_quality, quality_checks = calculate_quality_scores(df_deduplicated)\n",
"\n",
"print(\"Data quality analysis:\")\n",
"print(df_with_quality[['customer_name_clean', 'data_quality_score', 'quality_category']].head(10))\n",
"\n",
"print(\"\\nQuality category distribution:\")\n",
"print(df_with_quality['quality_category'].value_counts())\n",
"\n",
"print(\"\\nAverage quality scores by check:\")\n",
"quality_summary = {}\n",
"for check_name in quality_checks.keys():\n",
" col_name = f'quality_{check_name}'\n",
" quality_summary[check_name] = df_with_quality[col_name].mean() * 100\n",
"\n",
"quality_df = pd.DataFrame(list(quality_summary.items()), columns=['Quality_Check', 'Pass_Rate_%'])\n",
"quality_df = quality_df.sort_values('Pass_Rate_%', ascending=False)\n",
"print(quality_df.round(1))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Practice Exercises\n",
"\n",
"Apply advanced data cleaning techniques to challenging scenarios:"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [],
"source": [
"# Exercise 1: Create a custom validation function\n",
"# Build a function that validates business rules:\n",
"# - Email domains should be from approved list\n",
"# - Purchase amounts should be within reasonable ranges by category\n",
"# - Dates should be within business operating period\n",
"# - Customer IDs should follow specific format patterns\n",
"\n",
"def validate_business_rules(df):\n",
" \"\"\"Validate business-specific rules\"\"\"\n",
" # Your implementation here\n",
" pass\n",
"\n",
"# validation_results = validate_business_rules(df_final_clean)\n",
"# print(validation_results)"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [],
"source": [
"# Exercise 2: Advanced duplicate detection\n",
"# Implement fuzzy matching for near-duplicate detection:\n",
"# - Similar names (edit distance)\n",
"# - Similar addresses\n",
"# - Similar email patterns\n",
"\n",
"# Your code here:\n"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [],
"source": [
"# Exercise 3: Data cleaning metrics dashboard\n",
"# Create a comprehensive data quality dashboard that shows:\n",
"# - Data quality trends over time\n",
"# - Field-by-field quality scores\n",
"# - Impact of cleaning steps\n",
"# - Recommendations for further improvement\n",
"\n",
"# Your code here:\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Key Takeaways\n",
"\n",
"1. **Assessment First**: Always assess data quality before cleaning\n",
"2. **Systematic Approach**: Use a structured pipeline for consistent results\n",
"3. **Preserve Original Data**: Keep original values while creating cleaned versions\n",
"4. **Document Everything**: Log all cleaning steps and decisions\n",
"5. **Validation**: Implement business rule validation\n",
"6. **Quality Metrics**: Measure and track data quality improvements\n",
"7. **Reusable Pipeline**: Create automated, configurable cleaning processes\n",
"8. **Context Matters**: Consider domain-specific requirements\n",
"\n",
"## Common Data Issues and Solutions\n",
"\n",
"| Issue | Detection Method | Solution |\n",
"|-------|-----------------|----------|\n",
"| Inconsistent Format | Pattern analysis | Standardization rules |\n",
"| Missing Values | `.isnull()` | Imputation or flagging |\n",
"| Duplicates | `.duplicated()` | Deduplication logic |\n",
"| Outliers | Statistical methods | Capping or flagging |\n",
"| Invalid Values | Business rules | Validation and correction |\n",
"| Inconsistent Naming | String analysis | Normalization |\n",
"| Date Issues | Parsing attempts | Multiple format handling |\n",
"| Text Issues | Regex patterns | Cleaning and standardization |\n",
"\n",
"## Best Practices\n",
"\n",
"1. **Start with Exploration**: Understand your data before cleaning\n",
"2. **Preserve Traceability**: Keep original and cleaned versions\n",
"3. **Validate Assumptions**: Test cleaning rules on sample data\n",
"4. **Measure Impact**: Quantify improvements from cleaning\n",
"5. **Automate When Possible**: Build reusable cleaning pipelines\n",
"6. **Handle Edge Cases**: Plan for unusual but valid data\n",
"7. **Business Context**: Include domain experts in rule definition\n",
"8. **Iterative Process**: Refine cleaning rules based on results\n"
]
}
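,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch of the \"reusable pipeline\" recommendation above, the next cell simply chains the cleaning functions defined earlier in this lesson and assumes they (and `df_messy`) are still in memory. A production pipeline would add configuration, logging, and error handling on top of this."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Minimal reusable cleaning pipeline: chain the functions defined earlier in this lesson\n",
"def run_cleaning_pipeline(raw_df):\n",
"    \"\"\"Apply the lesson's cleaning steps in order and return the cleaned frame with quality scores\"\"\"\n",
"    steps = [\n",
"        clean_text_data,            # names, IDs, emails, categories\n",
"        advanced_text_cleaning,     # address parsing\n",
"        standardize_phone_numbers,  # phone formats + validity flag\n",
"        standardize_dates,          # multi-format date parsing\n",
"        clean_numerical_data,       # amounts, outliers, satisfaction scores\n",
"        handle_duplicates,          # exact + key-based deduplication\n",
"    ]\n",
"    df = raw_df.copy()\n",
"    for step in steps:\n",
"        df = step(df)\n",
"    df, _ = calculate_quality_scores(df)  # returns (DataFrame, quality_checks dict)\n",
"    return df\n",
"\n",
"pipeline_result = run_cleaning_pipeline(df_messy)\n",
"print(f\"\\nPipeline output shape: {pipeline_result.shape}\")\n",
"print(pipeline_result[['customer_id_clean', 'data_quality_score', 'quality_category']].head())"
]
}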
],
"metadata": {
"kernelspec": {
"display_name": "venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.13.3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}

1301
Session_01/ohlcv_analysis.ipynb Executable file

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long