Session_01

This commit is contained in:
parent 16a25d8ee5
commit 6befd2d50c

18 changed files with 99852 additions and 1 deletion

README.md (99 lines changed)

@@ -1 +1,98 @@
-# crypto_bot_training

# Build Your Own Crypto Trading Bot – Course Repository

Welcome to the private repository for the **"Build Your Own Crypto Trading Bot – Hands-On Course with Alex"** by QuantJourney.

This repository contains the materials, templates, and code samples used during the six live sessions held in June 2025.

> ⚠️ This repository is for registered participants only.

---

## Content Overview

**Session 1: Foundations & Data Structures**

- Set up Python, an IDE, and the required libraries
- Pandas basics for financial time series
- Understanding the OHLCV format
- Create your first crypto DataFrame with sample data (see the sketch below)
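For orientation, here is a minimal sketch of the kind of OHLCV DataFrame built in Session 1. The column names and synthetic prices are illustrative only, not taken from the course code:

```python
import numpy as np
import pandas as pd

# Hourly BTC/USD candles with synthetic prices (illustrative values only)
idx = pd.date_range("2025-06-01", periods=5, freq="h")
rng = np.random.default_rng(42)
close = 70_000 + rng.normal(0, 50, size=5).cumsum()

df = pd.DataFrame({
    "open": close + rng.normal(0, 10, size=5),
    "high": close + 25,   # naive bounds, good enough for a demo frame
    "low": close - 25,
    "close": close,
    "volume": rng.integers(100, 1_000, size=5),
}, index=idx)

print(df)
```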
**Session 2: Data Acquisition & Exchange Connectivity**

- WebSocket basics for real-time crypto feeds (Binance focus)
- Fail-safe reconnection logic and error handling (sketched below)
- Logging basics for live systems
- Build tools: order flow scanner, liquidation monitor, funding rate tracker
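A minimal sketch of the fail-safe reconnection idea from Session 2, assuming the `websockets` package and Binance's public trade stream (both are assumptions; the course code may use a different client or endpoint):

```python
import asyncio
import logging

import websockets  # assumed client library: pip install websockets

logging.basicConfig(level=logging.INFO)
URL = "wss://stream.binance.com:9443/ws/btcusdt@trade"  # public Binance trade stream

async def consume_forever(url: str) -> None:
    """Reconnect with capped exponential backoff whenever the socket drops."""
    backoff = 1
    while True:
        try:
            async with websockets.connect(url) as ws:
                backoff = 1  # reset after a successful connect
                async for message in ws:
                    logging.info("tick: %.80s", message)
        except Exception as exc:
            logging.warning("stream dropped (%s); retrying in %ss", exc, backoff)
            await asyncio.sleep(backoff)
            backoff = min(backoff * 2, 60)

# asyncio.run(consume_forever(URL))
```

Capping the backoff keeps a flapping connection from hammering the exchange while still recovering quickly from brief outages.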
**Session 3: Data Processing & Technical Analysis**

- API access using CCXT
- Handle rate limits and API error scenarios
- Reconnect & retry mechanisms
- Use pandas-ta to compute SMA, EMA, and RSI (example below)
- Create your own indicator pipeline
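A small sketch of the Session 3 indicator step using pandas-ta. The CSV path exists in this commit, but the lowercase `close` column name is an assumption about its layout:

```python
import pandas as pd
import pandas_ta as ta  # pip install pandas-ta

# Assumes the file has a 'close' column; adjust to the real header
df = pd.read_csv("Session_01/Data/BTCUSD-1h-data.csv")

df["sma_20"] = ta.sma(df["close"], length=20)
df["ema_20"] = ta.ema(df["close"], length=20)
df["rsi_14"] = ta.rsi(df["close"], length=14)

print(df[["close", "sma_20", "ema_20", "rsi_14"]].tail())
```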
**Session 4: Strategy Development & Backtesting**

- Overview of strategy types (trend, mean reversion)
- Backtesting with `backtesting.py` (see the sketch below)
- Compute Sharpe ratio, drawdown, profit factor
- Add position sizing, SL/TP, and walk-forward logic
- Adjust for fees, slippage, and latency
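To make the Session 4 bullet concrete, here is the classic moving-average-cross quick start for `backtesting.py`, run on the library's bundled sample data rather than the course's crypto data:

```python
from backtesting import Backtest, Strategy
from backtesting.lib import crossover
from backtesting.test import SMA, GOOG  # sample indicator and data shipped with the library

class SmaCross(Strategy):
    n1, n2 = 10, 20  # fast/slow window lengths

    def init(self):
        self.ma1 = self.I(SMA, self.data.Close, self.n1)
        self.ma2 = self.I(SMA, self.data.Close, self.n2)

    def next(self):
        if crossover(self.ma1, self.ma2):
            self.buy()
        elif crossover(self.ma2, self.ma1):
            self.position.close()

bt = Backtest(GOOG, SmaCross, cash=10_000, commission=0.002)
print(bt.run())  # the stats include Sharpe ratio, max drawdown, and profit factor
```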
**Session 5: Bot Architecture & Implementation**

- Bot system design: event-driven vs. loop-based
- Core components: order manager, position tracker, error handler
- Risk constraints: daily limits, max position size
- Logging & monitoring structure
- Write the engine core for your bot (skeleton below)
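One way to picture the Session 5 "engine core" is a single event queue that the feed, strategy, and order manager all talk through. A minimal, self-contained sketch (component and event names are illustrative, not the course's actual design):

```python
import queue
from dataclasses import dataclass

@dataclass
class Event:
    kind: str     # e.g. "tick", "signal", "order"
    payload: dict

class Engine:
    """Tiny event-driven core: components communicate via one queue."""

    def __init__(self) -> None:
        self.events: "queue.Queue[Event]" = queue.Queue()
        self.handlers: dict[str, list] = {}

    def on(self, kind: str, handler) -> None:
        self.handlers.setdefault(kind, []).append(handler)

    def emit(self, event: Event) -> None:
        self.events.put(event)

    def run_once(self) -> None:
        event = self.events.get()
        for handler in self.handlers.get(event.kind, []):
            handler(event)

engine = Engine()
engine.on("tick", lambda e: print("got tick:", e.payload))
engine.emit(Event("tick", {"price": 70123.5}))
engine.run_once()
```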
**Session 6: Live Trading & Deployment**

- API keys and secure credential handling (see below)
- Deployment targets: local, VPS, cloud (e.g., Hetzner)
- Running 24/7: restart logic, alerting
- Final bot launch and testing in production
- Send alerts via Telegram or email
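For the credential-handling bullet, the usual baseline is to read keys from environment variables instead of hard-coding them. A sketch using only the standard library (the variable names are placeholders, not the course's actual configuration):

```python
import os

# Export these in your shell or a git-ignored .env file, e.g.
#   export BINANCE_API_KEY=...
#   export BINANCE_API_SECRET=...
api_key = os.environ.get("BINANCE_API_KEY")
api_secret = os.environ.get("BINANCE_API_SECRET")

# Fail fast rather than starting a bot with missing credentials
if not (api_key and api_secret):
    raise RuntimeError("exchange credentials are not configured")
```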
---

## 🤖 AI-Enhanced Trading

Bonus section:

- Use ChatGPT/Claude for strategy suggestions
- Integrate AI-based filters or signal generation
- Let LLMs help you refactor and extend your logic

---
## 📁 Repository Structure

```text
/Session_01/   # Foundations & DataFrame Handling
/Session_02/   # WebSockets & Real-Time Feed Tools
/Session_03/   # Indicators & Analysis
/Session_04/   # Backtesting + Strategy Logic
/Session_05/   # Trading Bot Core Engine
/Session_06/   # Live Deployment and Monitoring
/templates/    # Starter and final bot code
/utils/        # Helper scripts for logging, reconnection, etc.
README.md      # You are here
```
---

## 🛠 Requirements

- Python 3.10+
- Install dependencies per session (each session folder has its own list) or via the provided top-level `requirements.txt`

---

## 📫 Support

You can reach Alex directly at [alex@quantjourney.pro](mailto:alex@quantjourney.pro) for post-course support (one week included).
---

## ⚠️ Disclaimer

This project is for **educational use only**. Nothing here is financial advice. Always trade with caution and use proper risk management.

---

Happy coding – and trade smart.

QuantJourney Team
Session_01/.DS_Store (binary, vendored, new file)
Binary file not shown.

Session_01/Data/BTCUSD-1h-data.csv (83955 lines, new executable file)
File diff suppressed because it is too large.
Session_01/PandasDataFrame-exmples/01_creating_dataframes.ipynb (391 lines, new executable file)

@@ -0,0 +1,391 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Session 1 - DataFrames - Lesson 1: Creating DataFrames\n",
    "\n",
    "## Learning Objectives\n",
    "- Understand different methods to create pandas DataFrames\n",
    "- Learn to create DataFrames from dictionaries, lists, and NumPy arrays\n",
    "- Practice with various data types and structures\n",
    "\n",
    "## Prerequisites\n",
    "- Basic Python knowledge\n",
    "- Understanding of lists and dictionaries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Pandas version: 2.2.3\n",
      "NumPy version: 2.2.6\n"
     ]
    }
   ],
   "source": [
    "# Import required libraries\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "from datetime import datetime, timedelta\n",
    "\n",
    "print(f\"Pandas version: {pd.__version__}\")\n",
    "print(f\"NumPy version: {np.__version__}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Method 1: Creating DataFrame from Dictionary\n",
    "\n",
    "This is the most common and intuitive way to create a DataFrame."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Student DataFrame:\n",
      "      Name  Age Grade  Score\n",
      "0    Alice   23     A     95\n",
      "1      Bob   25     B     87\n",
      "2  Charlie   22     A     92\n",
      "3    Diana   24     C     78\n",
      "4      Eve   23     B     89\n",
      "\n",
      "Shape: (5, 4)\n",
      "Data types:\n",
      "Name     object\n",
      "Age       int64\n",
      "Grade    object\n",
      "Score     int64\n",
      "dtype: object\n"
     ]
    }
   ],
   "source": [
    "# Creating DataFrame from dictionary\n",
    "student_data = {\n",
    "    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],\n",
    "    'Age': [23, 25, 22, 24, 23],\n",
    "    'Grade': ['A', 'B', 'A', 'C', 'B'],\n",
    "    'Score': [95, 87, 92, 78, 89]\n",
    "}\n",
    "\n",
    "df_students = pd.DataFrame(student_data)\n",
    "print(\"Student DataFrame:\")\n",
    "print(df_students)\n",
    "print(f\"\\nShape: {df_students.shape}\")\n",
    "print(f\"Data types:\\n{df_students.dtypes}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Method 2: Creating DataFrame from Lists\n",
    "\n",
    "You can create DataFrames from separate lists by combining them in a dictionary."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Cities DataFrame:\n",
      "       City  Population_Million    Country\n",
      "0  New York                 8.4        USA\n",
      "1    London                 8.9         UK\n",
      "2     Tokyo                13.9      Japan\n",
      "3     Paris                 2.1     France\n",
      "4    Sydney                 5.3  Australia\n",
      "\n",
      "Index: [0, 1, 2, 3, 4]\n",
      "Columns: ['City', 'Population_Million', 'Country']\n"
     ]
    }
   ],
   "source": [
    "# Creating DataFrame from separate lists\n",
    "cities = ['New York', 'London', 'Tokyo', 'Paris', 'Sydney']\n",
    "populations = [8.4, 8.9, 13.9, 2.1, 5.3]\n",
    "countries = ['USA', 'UK', 'Japan', 'France', 'Australia']\n",
    "\n",
    "df_cities = pd.DataFrame({\n",
    "    'City': cities,\n",
    "    'Population_Million': populations,\n",
    "    'Country': countries\n",
    "})\n",
    "\n",
    "print(\"Cities DataFrame:\")\n",
    "print(df_cities)\n",
    "print(f\"\\nIndex: {df_cities.index.tolist()}\")\n",
    "print(f\"Columns: {df_cities.columns.tolist()}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Method 3: Creating DataFrame from NumPy Array\n",
    "\n",
    "This method is useful when working with numerical data or when you need random data for testing."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Random DataFrame:\n",
      "      Column_A  Column_B  Column_C\n",
      "Row1        52        93        15\n",
      "Row2        72        61        21\n",
      "Row3        83        87        75\n",
      "Row4        75        88        24\n",
      "Row5         3        22        53\n",
      "\n",
      "Summary statistics:\n",
      "        Column_A   Column_B   Column_C\n",
      "count   5.000000   5.000000   5.000000\n",
      "mean   57.000000  70.200000  37.600000\n",
      "std    32.272279  29.693434  25.530374\n",
      "min     3.000000  22.000000  15.000000\n",
      "25%    52.000000  61.000000  21.000000\n",
      "50%    72.000000  87.000000  24.000000\n",
      "75%    75.000000  88.000000  53.000000\n",
      "max    83.000000  93.000000  75.000000\n"
     ]
    }
   ],
   "source": [
    "# Creating DataFrame from NumPy array\n",
    "np.random.seed(42)  # For reproducible results\n",
    "random_data = np.random.randint(1, 100, size=(5, 3))\n",
    "\n",
    "df_random = pd.DataFrame(random_data,\n",
    "                         columns=['Column_A', 'Column_B', 'Column_C'],\n",
    "                         index=['Row1', 'Row2', 'Row3', 'Row4', 'Row5'])\n",
    "\n",
    "print(\"Random DataFrame:\")\n",
    "print(df_random)\n",
    "print(f\"\\nSummary statistics:\")\n",
    "print(df_random.describe())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Method 4: Creating DataFrame with Custom Index\n",
    "\n",
    "You can specify custom row labels (index) when creating DataFrames."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Products DataFrame with Custom Index:\n",
      "         Product  Price  Stock\n",
      "PROD001   Laptop   1200     15\n",
      "PROD002    Phone    800     50\n",
      "PROD003   Tablet    600     30\n",
      "PROD004  Monitor    300     20\n",
      "\n",
      "Accessing by index label 'PROD002':\n",
      "Product    Phone\n",
      "Price        800\n",
      "Stock         50\n",
      "Name: PROD002, dtype: object\n"
     ]
    }
   ],
   "source": [
    "# Creating DataFrame with custom index\n",
    "product_data = {\n",
    "    'Product': ['Laptop', 'Phone', 'Tablet', 'Monitor'],\n",
    "    'Price': [1200, 800, 600, 300],\n",
    "    'Stock': [15, 50, 30, 20]\n",
    "}\n",
    "\n",
    "# Custom index using product codes\n",
    "custom_index = ['PROD001', 'PROD002', 'PROD003', 'PROD004']\n",
    "df_products = pd.DataFrame(product_data, index=custom_index)\n",
    "\n",
    "print(\"Products DataFrame with Custom Index:\")\n",
    "print(df_products)\n",
    "print(f\"\\nAccessing by index label 'PROD002':\")\n",
    "print(df_products.loc['PROD002'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Method 5: Creating Empty DataFrame and Adding Data\n",
    "\n",
    "Sometimes you need to start with an empty DataFrame and add data incrementally."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Empty DataFrame:\n",
      "Empty DataFrame\n",
      "Columns: [Date, Temperature, Humidity, Pressure]\n",
      "Index: []\n",
      "Shape: (0, 4)\n",
      "\n",
      "DataFrame after adding data:\n",
      "         Date Temperature Humidity Pressure\n",
      "0  2024-01-01        22.5       65   1013.2\n",
      "1  2024-01-02        24.1       68   1015.1\n",
      "2  2024-01-03        21.8       72   1012.8\n"
     ]
    }
   ],
   "source": [
    "# Creating empty DataFrame with specified columns\n",
    "columns = ['Date', 'Temperature', 'Humidity', 'Pressure']\n",
    "df_weather = pd.DataFrame(columns=columns)\n",
    "\n",
    "print(\"Empty DataFrame:\")\n",
    "print(df_weather)\n",
    "print(f\"Shape: {df_weather.shape}\")\n",
    "\n",
    "# Adding data row by row (not recommended for large datasets)\n",
    "weather_data = [\n",
    "    ['2024-01-01', 22.5, 65, 1013.2],\n",
    "    ['2024-01-02', 24.1, 68, 1015.1],\n",
    "    ['2024-01-03', 21.8, 72, 1012.8]\n",
    "]\n",
    "\n",
    "for row in weather_data:\n",
    "    df_weather.loc[len(df_weather)] = row\n",
    "\n",
    "print(\"\\nDataFrame after adding data:\")\n",
    "print(df_weather)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Practice Exercises\n",
    "\n",
    "Try these exercises to reinforce your learning:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Exercise 1: Create a DataFrame from dictionary with employee information\n",
    "# Include: Employee ID, Name, Department, Salary, Years of Experience\n",
    "\n",
    "# Your code here:\n",
    "employee_data = {\n",
    "    # Add your data here\n",
    "}\n",
    "\n",
    "# df_employees = pd.DataFrame(employee_data)\n",
    "# print(df_employees)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Exercise 2: Create a DataFrame using NumPy with 6 rows and 4 columns\n",
    "# Use column names: 'A', 'B', 'C', 'D'\n",
    "# Use row indices: 'R1', 'R2', 'R3', 'R4', 'R5', 'R6'\n",
    "\n",
    "# Your code here:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Exercise 3: Create a DataFrame with mixed data types\n",
    "# Include at least one string, integer, float, and boolean column\n",
    "\n",
    "# Your code here:\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Key Takeaways\n",
    "\n",
    "1. **Dictionary method** is most intuitive for creating DataFrames\n",
    "2. **NumPy arrays** are useful for numerical data and testing\n",
    "3. **Custom indices** provide meaningful row labels\n",
    "4. **Empty DataFrames** can be useful, but avoid adding rows one by one for large datasets\n",
    "5. Always check the **shape** and **data types** of your DataFrame after creation\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.13.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
Session_01/PandasDataFrame-exmples/02_basic_operations.ipynb (523 lines, new executable file)

@@ -0,0 +1,523 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Session 1 - DataFrames - Lesson 2: Basic Operations\n",
    "\n",
    "## Learning Objectives\n",
    "- Learn essential methods to explore DataFrame structure\n",
    "- Understand how to get basic information about your data\n",
    "- Master data inspection techniques\n",
    "- Practice with summary statistics\n",
    "\n",
    "## Prerequisites\n",
    "- Completed Lesson 1: Creating DataFrames\n",
    "- Basic understanding of pandas DataFrames"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Import required libraries\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "from datetime import datetime, timedelta\n",
    "\n",
    "# Set display options for better output\n",
    "pd.set_option('display.max_columns', None)\n",
    "pd.set_option('display.width', None)\n",
    "pd.set_option('display.max_colwidth', 50)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Creating Sample Dataset\n",
    "\n",
    "Let's create a comprehensive sales dataset to practice basic operations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Sales Dataset Created!\n",
      "Dataset shape: (20, 6)\n"
     ]
    }
   ],
   "source": [
    "# Create a comprehensive sales dataset\n",
    "np.random.seed(42)\n",
    "\n",
    "sales_data = {\n",
    "    'Date': pd.date_range('2024-01-01', periods=20, freq='D'),\n",
    "    'Product': ['Laptop', 'Phone', 'Tablet', 'Laptop', 'Phone'] * 4,\n",
    "    'Sales': [1200, 800, 600, 1100, 850, 1300, 750, 650, 1250, 900,\n",
    "              1150, 820, 700, 1180, 880, 1220, 780, 620, 1300, 850],\n",
    "    'Region': ['North', 'South', 'East', 'West', 'North'] * 4,\n",
    "    'Salesperson': ['John', 'Sarah', 'Mike', 'Lisa', 'Tom'] * 4,\n",
    "    'Commission_Rate': [0.10, 0.12, 0.08, 0.11, 0.09] * 4\n",
    "}\n",
    "\n",
    "df_sales = pd.DataFrame(sales_data)\n",
    "print(\"Sales Dataset Created!\")\n",
    "print(f\"Dataset shape: {df_sales.shape}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Viewing Data\n",
    "\n",
    "These methods help you quickly inspect your data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# View first few rows\n",
    "print(\"First 5 rows (default):\")\n",
    "print(df_sales.head())\n",
    "\n",
    "print(\"\\nFirst 3 rows:\")\n",
    "print(df_sales.head(3))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# View last few rows\n",
    "print(\"Last 5 rows (default):\")\n",
    "print(df_sales.tail())\n",
    "\n",
    "print(\"\\nLast 3 rows:\")\n",
    "print(df_sales.tail(3))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Random sample of 5 rows:\n",
      "          Date Product  Sales Region Salesperson  Commission_Rate\n",
      " 0  2024-01-01  Laptop   1200  North        John             0.10\n",
      "17  2024-01-18  Tablet    620   East        Mike             0.08\n",
      "15  2024-01-16  Laptop   1220  North        John             0.10\n",
      " 1  2024-01-02   Phone    800  South       Sarah             0.12\n",
      " 8  2024-01-09  Laptop   1250   West        Lisa             0.11\n",
      "\n",
      "Random sample with different random state:\n",
      "          Date Product  Sales Region Salesperson  Commission_Rate\n",
      " 7  2024-01-08  Tablet    650   East        Mike             0.08\n",
      "10  2024-01-11  Laptop   1150  North        John             0.10\n",
      " 5  2024-01-06  Laptop   1300  North        John             0.10\n"
     ]
    }
   ],
   "source": [
    "# Sample random rows\n",
    "print(\"Random sample of 5 rows:\")\n",
    "print(df_sales.sample(5))\n",
    "\n",
    "print(\"\\nRandom sample with different random state:\")\n",
    "print(df_sales.sample(3, random_state=10))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. DataFrame Information\n",
    "\n",
    "Get detailed information about your DataFrame structure."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Comprehensive information about the DataFrame\n",
    "print(\"DataFrame Info:\")\n",
    "df_sales.info()\n",
    "\n",
    "print(\"\\nMemory usage:\")\n",
    "df_sales.info(memory_usage='deep')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Basic properties\n",
    "print(f\"Shape (rows, columns): {df_sales.shape}\")\n",
    "print(f\"Number of rows: {len(df_sales)}\")\n",
    "print(f\"Number of columns: {len(df_sales.columns)}\")\n",
    "print(f\"Total elements: {df_sales.size}\")\n",
    "print(f\"Dimensions: {df_sales.ndim}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Column and index information\n",
    "print(\"Column names:\")\n",
    "print(df_sales.columns.tolist())\n",
    "\n",
    "print(\"\\nData types:\")\n",
    "print(df_sales.dtypes)\n",
    "\n",
    "print(\"\\nIndex information:\")\n",
    "print(f\"Index: {df_sales.index}\")\n",
    "print(f\"Index type: {type(df_sales.index)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Summary Statistics\n",
    "\n",
    "Understand your data through statistical summaries."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Summary statistics for numerical columns\n",
    "print(\"Summary statistics:\")\n",
    "print(df_sales.describe())\n",
    "\n",
    "print(\"\\nRounded to 2 decimal places:\")\n",
    "print(df_sales.describe().round(2))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Summary statistics for all columns (including non-numeric)\n",
    "print(\"Summary for all columns:\")\n",
    "print(df_sales.describe(include='all'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Individual statistics\n",
    "print(\"Individual Statistical Measures:\")\n",
    "print(f\"Mean sales: {df_sales['Sales'].mean():.2f}\")\n",
    "print(f\"Median sales: {df_sales['Sales'].median():.2f}\")\n",
    "print(f\"Standard deviation: {df_sales['Sales'].std():.2f}\")\n",
    "print(f\"Minimum sales: {df_sales['Sales'].min()}\")\n",
    "print(f\"Maximum sales: {df_sales['Sales'].max()}\")\n",
    "print(f\"Sales range: {df_sales['Sales'].max() - df_sales['Sales'].min()}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Quantiles and percentiles\n",
    "print(\"Quantiles for Sales:\")\n",
    "print(f\"25th percentile (Q1): {df_sales['Sales'].quantile(0.25)}\")\n",
    "print(f\"50th percentile (Q2/Median): {df_sales['Sales'].quantile(0.50)}\")\n",
    "print(f\"75th percentile (Q3): {df_sales['Sales'].quantile(0.75)}\")\n",
    "print(f\"90th percentile: {df_sales['Sales'].quantile(0.90)}\")\n",
    "\n",
    "print(\"\\nCustom quantiles:\")\n",
    "quantiles = df_sales['Sales'].quantile([0.1, 0.3, 0.7, 0.9])\n",
    "print(quantiles)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Counting and Unique Values\n",
    "\n",
    "Understand the distribution of categorical data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Count unique values in each column\n",
    "print(\"Number of unique values per column:\")\n",
    "print(df_sales.nunique())\n",
    "\n",
    "print(\"\\nUnique values in 'Product' column:\")\n",
    "print(df_sales['Product'].unique())\n",
    "\n",
    "print(\"\\nValue counts for 'Product':\")\n",
    "print(df_sales['Product'].value_counts())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Product distribution (counts and percentages):\n",
      "         Count  Percentage\n",
      "Product                   \n",
      "Laptop       8        40.0\n",
      "Phone        8        40.0\n",
      "Tablet       4        20.0\n"
     ]
    }
   ],
   "source": [
    "# Value counts with percentages\n",
    "print(\"Product distribution (counts and percentages):\")\n",
    "product_counts = df_sales['Product'].value_counts()\n",
    "product_percentages = df_sales['Product'].value_counts(normalize=True) * 100\n",
    "\n",
    "distribution = pd.DataFrame({\n",
    "    'Count': product_counts,\n",
    "    'Percentage': product_percentages.round(1)\n",
    "})\n",
    "print(distribution)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Cross-tabulation\n",
    "print(\"Cross-tabulation of Product vs Region:\")\n",
    "crosstab = pd.crosstab(df_sales['Product'], df_sales['Region'])\n",
    "print(crosstab)\n",
    "\n",
    "print(\"\\nWith percentages:\")\n",
    "crosstab_pct = pd.crosstab(df_sales['Product'], df_sales['Region'], normalize='all') * 100\n",
    "print(crosstab_pct.round(1))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. Data Quality Checks\n",
    "\n",
    "Essential checks for data quality and integrity."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Check for missing values\n",
    "print(\"Missing values per column:\")\n",
    "print(df_sales.isnull().sum())\n",
    "\n",
    "print(\"\\nPercentage of missing values:\")\n",
    "missing_percentages = (df_sales.isnull().sum() / len(df_sales)) * 100\n",
    "print(missing_percentages.round(2))\n",
    "\n",
    "print(\"\\nAny missing values in dataset?\", df_sales.isnull().any().any())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Check for duplicates\n",
    "print(f\"Number of duplicate rows: {df_sales.duplicated().sum()}\")\n",
    "print(f\"Any duplicate rows? {df_sales.duplicated().any()}\")\n",
    "\n",
    "# Check for duplicates based on specific columns\n",
    "print(f\"\\nDuplicate combinations of Date and Salesperson: {df_sales.duplicated(['Date', 'Salesperson']).sum()}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6. Quick Data Exploration\n",
    "\n",
    "Rapid exploration techniques to understand your data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Quick exploration function\n",
    "def quick_explore(df, column_name):\n",
    "    \"\"\"Quick exploration of a specific column\"\"\"\n",
    "    print(f\"=== Quick Exploration: {column_name} ===\")\n",
    "    col = df[column_name]\n",
    "\n",
    "    print(f\"Data type: {col.dtype}\")\n",
    "    print(f\"Non-null values: {col.count()}/{len(col)}\")\n",
    "    print(f\"Unique values: {col.nunique()}\")\n",
    "\n",
    "    if col.dtype in ['int64', 'float64']:\n",
    "        print(f\"Min: {col.min()}, Max: {col.max()}\")\n",
    "        print(f\"Mean: {col.mean():.2f}, Median: {col.median():.2f}\")\n",
    "    else:\n",
    "        print(f\"Most common: {col.mode().iloc[0] if not col.mode().empty else 'N/A'}\")\n",
    "        print(f\"Sample values: {col.unique()[:5].tolist()}\")\n",
    "    print()\n",
    "\n",
    "# Explore different columns\n",
    "for col in ['Sales', 'Product', 'Region']:\n",
    "    quick_explore(df_sales, col)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Practice Exercises\n",
    "\n",
    "Test your understanding with these exercises:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Exercise 1: Create a larger dataset and explore it\n",
    "# Create a dataset with 100 rows and at least 5 columns\n",
    "# Include different data types (numeric, categorical, datetime)\n",
    "\n",
    "# Your code here:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Exercise 2: Write a function that provides a complete data profile\n",
    "# Include: shape, data types, missing values, unique values, and basic stats\n",
    "\n",
    "def data_profile(df):\n",
    "    \"\"\"Provide a comprehensive data profile\"\"\"\n",
    "    # Your code here:\n",
    "    pass\n",
    "\n",
    "# Test your function\n",
    "# data_profile(df_sales)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Exercise 3: Find interesting insights from the sales data\n",
    "# Questions to answer:\n",
    "# 1. Which product has the highest average sales?\n",
    "# 2. Which region has the most consistent sales (lowest standard deviation)?\n",
    "# 3. What's the total commission earned by each salesperson?\n",
    "\n",
    "# Your code here:\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Key Takeaways\n",
    "\n",
    "1. **`.head()` and `.tail()`** are essential for quick data inspection\n",
    "2. **`.info()`** provides comprehensive DataFrame structure information\n",
    "3. **`.describe()`** gives statistical summaries for numerical columns\n",
    "4. **`.nunique()` and `.value_counts()`** help understand categorical data\n",
    "5. **Always check for missing values** and duplicates in your data\n",
    "6. **Statistical measures** (mean, median, std) provide insights into data distribution\n",
    "7. **Cross-tabulation** helps understand relationships between categorical variables\n",
    "\n",
    "## Common Gotchas\n",
    "\n",
    "- `.describe()` only includes numeric columns by default (use `include='all'` for all columns)\n",
    "- Missing values can affect statistical calculations\n",
    "- Large datasets might need memory-efficient exploration techniques\n",
    "- Always verify data types are correct for your analysis"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.13.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
Session_01/PandasDataFrame-exmples/03_selecting_filtering.ipynb (593 lines, new executable file)

@@ -0,0 +1,593 @@
{
|
||||||
|
"cells": [
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"# Session 1 - DataFrames - Lesson 3: Selecting and Filtering Data\n",
|
||||||
|
"\n",
|
||||||
|
"## Learning Objectives\n",
|
||||||
|
"- Master column and row selection techniques\n",
|
||||||
|
"- Learn boolean indexing for data filtering\n",
|
||||||
|
"- Understand the difference between `.loc[]` and `.iloc[]`\n",
|
||||||
|
"- Practice complex filtering conditions\n",
|
||||||
|
"- Handle edge cases in data selection\n",
|
||||||
|
"\n",
|
||||||
|
"## Prerequisites\n",
|
||||||
|
"- Completed Lessons 1-2\n",
|
||||||
|
"- Understanding of Python boolean operations"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"# Import required libraries\n",
|
||||||
|
"import pandas as pd\n",
|
||||||
|
"import numpy as np\n",
|
||||||
|
"from datetime import datetime, timedelta\n",
|
||||||
|
"\n",
|
||||||
|
"# Create sample dataset\n",
|
||||||
|
"np.random.seed(42)\n",
|
||||||
|
"sales_data = {\n",
|
||||||
|
" 'Date': pd.date_range('2024-01-01', periods=20, freq='D'),\n",
|
||||||
|
" 'Product': ['Laptop', 'Phone', 'Tablet', 'Laptop', 'Phone'] * 4,\n",
|
||||||
|
" 'Sales': [1200, 800, 600, 1100, 850, 1300, 750, 650, 1250, 900,\n",
|
||||||
|
" 1150, 820, 700, 1180, 880, 1220, 780, 620, 1300, 850],\n",
|
||||||
|
" 'Region': ['North', 'South', 'East', 'West', 'North'] * 4,\n",
|
||||||
|
" 'Salesperson': ['John', 'Sarah', 'Mike', 'Lisa', 'Tom'] * 4,\n",
|
||||||
|
" 'Commission_Rate': [0.10, 0.12, 0.08, 0.11, 0.09] * 4\n",
|
||||||
|
"}\n",
|
||||||
|
"\n",
|
||||||
|
"df_sales = pd.DataFrame(sales_data)\n",
|
||||||
|
"print(\"Dataset loaded:\")\n",
|
||||||
|
"print(df_sales.head())"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"## 1. Selecting Columns\n",
|
||||||
|
"\n",
|
||||||
|
"Different ways to select columns from a DataFrame."
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": 24,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [
|
||||||
|
{
|
||||||
|
"name": "stdout",
|
||||||
|
"output_type": "stream",
|
||||||
|
"text": [
|
||||||
|
"Single column (Product) - Returns Series:\n",
|
||||||
|
"Type: <class 'pandas.core.series.Series'>\n",
|
||||||
|
"0 Laptop\n",
|
||||||
|
"1 Phone\n",
|
||||||
|
"2 Tablet\n",
|
||||||
|
"3 Laptop\n",
|
||||||
|
"4 Phone\n",
|
||||||
|
"Name: Product, dtype: object\n",
|
||||||
|
"\n",
|
||||||
|
"Single column with dot notation:\n",
|
||||||
|
"0 Laptop\n",
|
||||||
|
"1 Phone\n",
|
||||||
|
"2 Tablet\n",
|
||||||
|
"3 Laptop\n",
|
||||||
|
"4 Phone\n",
|
||||||
|
"Name: Product, dtype: object\n",
|
||||||
|
"\n",
|
||||||
|
"Single column as DataFrame (note the double brackets):\n",
|
||||||
|
"Type: <class 'pandas.core.frame.DataFrame'>\n",
|
||||||
|
" Product\n",
|
||||||
|
"0 Laptop\n",
|
||||||
|
"1 Phone\n",
|
||||||
|
"2 Tablet\n",
|
||||||
|
"3 Laptop\n",
|
||||||
|
"4 Phone\n"
|
||||||
|
]
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"source": [
|
||||||
|
"# Single column selection (returns Series)\n",
|
||||||
|
"print(\"Single column (Product) - Returns Series:\")\n",
|
||||||
|
"product_series = df_sales['Product']\n",
|
||||||
|
"print(f\"Type: {type(product_series)}\")\n",
|
||||||
|
"print(product_series.head())\n",
|
||||||
|
"\n",
|
||||||
|
"print(\"\\nSingle column with dot notation:\")\n",
|
||||||
|
"print(df_sales.Product.head())\n",
|
||||||
|
"\n",
|
||||||
|
"print(\"\\nSingle column as DataFrame (note the double brackets):\")\n",
|
||||||
|
"product_df = df_sales[['Product']]\n",
|
||||||
|
"print(f\"Type: {type(product_df)}\")\n",
|
||||||
|
"print(product_df.head())"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"# Multiple column selection\n",
|
||||||
|
"print(\"Multiple columns:\")\n",
|
||||||
|
"selected_cols = df_sales[['Product', 'Sales', 'Region']]\n",
|
||||||
|
"print(selected_cols.head())\n",
|
||||||
|
"\n",
|
||||||
|
"print(\"\\nUsing a list variable:\")\n",
|
||||||
|
"columns_to_select = ['Date', 'Salesperson', 'Sales']\n",
|
||||||
|
"selected_df = df_sales[columns_to_select]\n",
|
||||||
|
"print(selected_df.head())"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"# Column selection with conditions\n",
|
||||||
|
"print(\"Selecting columns by data type:\")\n",
|
||||||
|
"numeric_cols = df_sales.select_dtypes(include=[np.number])\n",
|
||||||
|
"print(\"Numeric columns:\")\n",
|
||||||
|
"print(numeric_cols.head())\n",
|
||||||
|
"\n",
|
||||||
|
"print(\"\\nSelecting columns by name pattern:\")\n",
|
||||||
|
"# Columns containing 'S'\n",
|
||||||
|
"s_columns = [col for col in df_sales.columns if 'S' in col]\n",
|
||||||
|
"print(f\"Columns with 'S': {s_columns}\")\n",
|
||||||
|
"print(df_sales[s_columns].head())"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"## 2. Selecting Rows\n",
|
||||||
|
"\n",
|
||||||
|
"Different methods to select specific rows."
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"# Row selection by index position\n",
|
||||||
|
"print(\"First row (index 0):\")\n",
|
||||||
|
"print(df_sales.iloc[0])\n",
|
||||||
|
"\n",
|
||||||
|
"print(\"\\nRows 2 to 4 (positions 1, 2, 3):\")\n",
|
||||||
|
"print(df_sales.iloc[1:4])\n",
|
||||||
|
"\n",
|
||||||
|
"print(\"\\nLast 3 rows:\")\n",
|
||||||
|
"print(df_sales.iloc[-3:])"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"# Row selection by label/index\n",
|
||||||
|
"print(\"Using .loc with index labels:\")\n",
|
||||||
|
"print(df_sales.loc[0:2]) # Note: includes endpoint with .loc\n",
|
||||||
|
"\n",
|
||||||
|
"print(\"\\nSpecific rows by index:\")\n",
|
||||||
|
"specific_rows = df_sales.loc[[0, 5, 10, 15]]\n",
|
||||||
|
"print(specific_rows[['Product', 'Sales', 'Region']])"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"# Random sampling\n",
|
||||||
|
"print(\"Random sample of 5 rows:\")\n",
|
||||||
|
"random_sample = df_sales.sample(n=5, random_state=42)\n",
|
||||||
|
"print(random_sample[['Product', 'Sales', 'Salesperson']])\n",
|
||||||
|
"\n",
|
||||||
|
"print(\"\\nRandom 25% of the data:\")\n",
|
||||||
|
"percentage_sample = df_sales.sample(frac=0.25, random_state=42)\n",
|
||||||
|
"print(f\"Sample size: {len(percentage_sample)} rows\")\n",
|
||||||
|
"print(percentage_sample[['Product', 'Sales']].head())"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"## 3. Boolean Indexing and Filtering\n",
|
||||||
|
"\n",
|
||||||
|
"Filter data based on conditions using boolean indexing."
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"# Simple boolean conditions\n",
|
||||||
|
"print(\"Sales greater than 1000:\")\n",
|
||||||
|
"high_sales = df_sales[df_sales['Sales'] > 1000]\n",
|
||||||
|
"print(high_sales[['Product', 'Sales', 'Region']])\n",
|
||||||
|
"\n",
|
||||||
|
"print(\"\\nSpecific product filter:\")\n",
|
||||||
|
"laptops_only = df_sales[df_sales['Product'] == 'Laptop']\n",
|
||||||
|
"print(laptops_only[['Date', 'Sales', 'Salesperson']])"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"# Multiple conditions with AND (&)\n",
|
||||||
|
"print(\"Laptops with sales > 1100:\")\n",
|
||||||
|
"laptop_high_sales = df_sales[(df_sales['Product'] == 'Laptop') & (df_sales['Sales'] > 1100)]\n",
|
||||||
|
"print(laptop_high_sales[['Date', 'Product', 'Sales', 'Region']])\n",
|
||||||
|
"\n",
|
||||||
|
"print(\"\\nNorth region with commission rate >= 0.10:\")\n",
|
||||||
|
"north_high_commission = df_sales[(df_sales['Region'] == 'North') & (df_sales['Commission_Rate'] >= 0.10)]\n",
|
||||||
|
"print(north_high_commission[['Product', 'Sales', 'Commission_Rate', 'Salesperson']])"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"# Multiple conditions with OR (|)\n",
|
||||||
|
"print(\"Laptops OR high sales (>1200):\")\n",
|
||||||
|
"laptop_or_high = df_sales[(df_sales['Product'] == 'Laptop') | (df_sales['Sales'] > 1200)]\n",
|
||||||
|
"print(laptop_or_high[['Product', 'Sales', 'Region']])\n",
|
||||||
|
"\n",
|
||||||
|
"print(\"\\nNorth OR South regions:\")\n",
|
||||||
|
"north_or_south = df_sales[(df_sales['Region'] == 'North') | (df_sales['Region'] == 'South')]\n",
|
||||||
|
"print(north_or_south[['Product', 'Sales', 'Region']].head())"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"# Using .isin() for multiple values\n",
|
||||||
|
"print(\"Products: Laptop or Phone\")\n",
|
||||||
|
"laptop_phone = df_sales[df_sales['Product'].isin(['Laptop', 'Phone'])]\n",
|
||||||
|
"print(laptop_phone[['Product', 'Sales', 'Region']].head())\n",
|
||||||
|
"\n",
|
||||||
|
"print(\"\\nSpecific salespersons:\")\n",
|
||||||
|
"selected_salespeople = df_sales[df_sales['Salesperson'].isin(['John', 'Sarah'])]\n",
|
||||||
|
"print(selected_salespeople[['Salesperson', 'Product', 'Sales']].head())"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"# NOT conditions using ~\n",
|
||||||
|
"print(\"NOT Tablets:\")\n",
|
||||||
|
"not_tablets = df_sales[~(df_sales['Product'] == 'Tablet')]\n",
|
||||||
|
"print(not_tablets['Product'].value_counts())\n",
|
||||||
|
"\n",
|
||||||
|
"print(\"\\nNOT in North region:\")\n",
|
||||||
|
"not_north = df_sales[~df_sales['Region'].isin(['North'])]\n",
|
||||||
|
"print(not_north['Region'].value_counts())"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"## 4. Advanced Selection with .loc and .iloc\n",
|
||||||
|
"\n",
|
||||||
|
"Powerful selection methods for precise data access."
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"# .loc for label-based selection\n",
|
||||||
|
"print(\".loc examples - Label-based selection:\")\n",
|
||||||
|
"\n",
|
||||||
|
"# Select specific rows and columns\n",
|
||||||
|
"print(\"Rows 0-2, specific columns:\")\n",
|
||||||
|
"result = df_sales.loc[0:2, ['Product', 'Sales', 'Region']]\n",
|
||||||
|
"print(result)\n",
|
||||||
|
"\n",
|
||||||
|
"print(\"\\nAll rows, specific columns:\")\n",
|
||||||
|
"result = df_sales.loc[:, ['Product', 'Sales']]\n",
|
||||||
|
"print(result.head())"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"# .iloc for position-based selection\n",
|
||||||
|
"print(\".iloc examples - Position-based selection:\")\n",
|
||||||
|
"\n",
|
||||||
|
"# Select by position\n",
|
||||||
|
"print(\"First 3 rows, first 3 columns:\")\n",
|
||||||
|
"result = df_sales.iloc[0:3, 0:3]\n",
|
||||||
|
"print(result)\n",
|
||||||
|
"\n",
|
||||||
|
"print(\"\\nEvery other row, specific columns:\")\n",
|
||||||
|
"result = df_sales.iloc[::2, [1, 2, 3]] # Every 2nd row, columns 1,2,3\n",
|
||||||
|
"print(result.head())"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"# Combining boolean indexing with .loc\n",
|
||||||
|
"print(\"Boolean indexing with .loc:\")\n",
|
||||||
|
"\n",
|
||||||
|
"# High sales, specific columns\n",
|
||||||
|
"high_sales_subset = df_sales.loc[df_sales['Sales'] > 1000, ['Product', 'Sales', 'Salesperson']]\n",
|
||||||
|
"print(high_sales_subset)\n",
|
||||||
|
"\n",
|
||||||
|
"print(\"\\nComplex condition with .loc:\")\n",
|
||||||
|
"complex_filter = (df_sales['Product'] == 'Laptop') & (df_sales['Region'] == 'North')\n",
|
||||||
|
"result = df_sales.loc[complex_filter, ['Date', 'Sales', 'Commission_Rate']]\n",
|
||||||
|
"print(result)"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"## 5. String-based Filtering\n",
|
||||||
|
"\n",
|
||||||
|
"Filter data based on string patterns and conditions."
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"# String methods for filtering\n",
|
||||||
|
"print(\"Salesperson names starting with 'J':\")\n",
|
||||||
|
"j_names = df_sales[df_sales['Salesperson'].str.startswith('J')]\n",
|
||||||
|
"print(j_names[['Salesperson', 'Product', 'Sales']].head())\n",
|
||||||
|
"\n",
|
||||||
|
"print(\"\\nRegions containing 'th':\")\n",
|
||||||
|
"th_regions = df_sales[df_sales['Region'].str.contains('th')]\n",
|
||||||
|
"print(th_regions[['Region', 'Product', 'Sales']].head())\n",
|
||||||
|
"\n",
|
||||||
|
"print(\"\\nProducts with exactly 5 characters:\")\n",
|
||||||
|
"five_char_products = df_sales[df_sales['Product'].str.len() == 5]\n",
|
||||||
|
"print(five_char_products['Product'].unique())"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"## 6. Date-based Filtering\n",
|
||||||
|
"\n",
|
||||||
|
"Filter data based on date conditions."
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"# Date filtering\n",
|
||||||
|
"print(\"Data from first week of January 2024:\")\n",
|
||||||
|
"first_week = df_sales[df_sales['Date'] <= '2024-01-07']\n",
|
||||||
|
"print(first_week[['Date', 'Product', 'Sales']])\n",
|
||||||
|
"\n",
|
||||||
|
"print(\"\\nData from specific date range:\")\n",
|
||||||
|
"date_range = df_sales[(df_sales['Date'] >= '2024-01-10') & (df_sales['Date'] <= '2024-01-15')]\n",
|
||||||
|
"print(date_range[['Date', 'Product', 'Sales', 'Region']])"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"# Using date components\n",
|
||||||
|
"print(\"Data from weekends (Saturday=5, Sunday=6):\")\n",
|
||||||
|
"weekends = df_sales[df_sales['Date'].dt.dayofweek >= 5]\n",
|
||||||
|
"print(weekends[['Date', 'Product', 'Sales']])\n",
|
||||||
|
"\n",
|
||||||
|
"print(\"\\nData from specific days of week:\")\n",
|
||||||
|
"mondays = df_sales[df_sales['Date'].dt.day_name() == 'Monday']\n",
|
||||||
|
"print(f\"Monday sales: {len(mondays)} records\")\n",
|
||||||
|
"if len(mondays) > 0:\n",
|
||||||
|
" print(mondays[['Date', 'Product', 'Sales']].head())"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"## 7. Query Method\n",
|
||||||
|
"\n",
|
||||||
|
"Alternative syntax for filtering using the `.query()` method."
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"# Using .query() method for cleaner syntax\n",
|
||||||
|
"print(\"Using .query() method:\")\n",
|
||||||
|
"\n",
|
||||||
|
"# Simple condition\n",
|
||||||
|
"high_sales_query = df_sales.query('Sales > 1000')\n",
|
||||||
|
"print(f\"High sales records: {len(high_sales_query)}\")\n",
|
||||||
|
"print(high_sales_query[['Product', 'Sales', 'Region']].head())\n",
|
||||||
|
"\n",
|
||||||
|
"print(\"\\nMultiple conditions:\")\n",
|
||||||
|
"complex_query = df_sales.query('Product == \"Laptop\" and Region == \"North\"')\n",
|
||||||
|
"print(complex_query[['Date', 'Sales', 'Commission_Rate']])"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"# Query with variables\n",
|
||||||
|
"min_sales = 900\n",
|
||||||
|
"target_region = 'East'\n",
|
||||||
|
"\n",
|
||||||
|
"print(\"Query with variables:\")\n",
|
||||||
|
"var_query = df_sales.query('Sales >= @min_sales and Region == @target_region')\n",
|
||||||
|
"print(var_query[['Product', 'Sales', 'Region']])\n",
|
||||||
|
"\n",
|
||||||
|
"print(\"\\nQuery with list (isin equivalent):\")\n",
|
||||||
|
"products = ['Laptop', 'Phone']\n",
|
||||||
|
"list_query = df_sales.query('Product in @products')\n",
|
||||||
|
"print(f\"Records for {products}: {len(list_query)}\")"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"## Practice Exercises\n",
|
||||||
|
"\n",
|
||||||
|
"Test your filtering and selection skills:"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"# Exercise 1: Complex Filtering\n",
|
||||||
|
"# Find all sales where:\n",
|
||||||
|
"# - Product is either 'Laptop' or 'Phone'\n",
|
||||||
|
"# - Sales are above the median\n",
|
||||||
|
"# - Commission rate is at least 0.10\n",
|
||||||
|
"# Show only Date, Product, Sales, and Salesperson columns\n",
|
||||||
|
"\n",
|
||||||
|
"# Your code here:\n",
|
||||||
|
"median_sales = df_sales['Sales'].median()\n",
|
||||||
|
"print(f\"Median sales: {median_sales}\")\n",
|
||||||
|
"\n",
|
||||||
|
"# complex_filter = ?\n",
|
||||||
|
"# result = ?\n",
|
||||||
|
"# print(result)"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": 22,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"# Exercise 2: Date-based Analysis\n",
|
||||||
|
"# Find sales data for the second week of January 2024\n",
|
||||||
|
"# Calculate the average sales for that week\n",
|
||||||
|
"# Show which products were sold and by whom\n",
|
||||||
|
"\n",
|
||||||
|
"# Your code here:\n"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": 23,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"# Exercise 3: Performance Analysis\n",
|
||||||
|
"# Create a function that finds top performers:\n",
|
||||||
|
"# - Takes a DataFrame and a percentile (e.g., 0.8 for top 20%)\n",
|
||||||
|
"# - Returns salespeople whose average sales are in the top percentile\n",
|
||||||
|
"# - Show their average sales and total number of sales\n",
|
||||||
|
"\n",
|
||||||
|
"def find_top_performers(df, percentile=0.8):\n",
|
||||||
|
" \"\"\"Find top performing salespeople\"\"\"\n",
|
||||||
|
" # Your code here:\n",
|
||||||
|
" pass\n",
|
||||||
|
"\n",
|
||||||
|
"# Test your function\n",
|
||||||
|
"# top_performers = find_top_performers(df_sales, 0.8)\n",
|
||||||
|
"# print(top_performers)"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"## Key Takeaways\n",
|
||||||
|
"\n",
|
||||||
|
"1. **Column Selection**: Use `[]` for single/multiple columns, understand Series vs DataFrame return types\n",
|
||||||
|
"2. **Row Selection**: `.iloc[]` for position-based, `.loc[]` for label-based selection\n",
|
||||||
|
"3. **Boolean Indexing**: Use `&` (AND), `|` (OR), `~` (NOT) for combining conditions\n",
|
||||||
|
"4. **Parentheses Matter**: Always wrap individual conditions in parentheses when combining\n",
|
||||||
|
"5. **`.isin()` Method**: Efficient way to filter for multiple values\n",
|
||||||
|
"6. **String Methods**: Use `.str` accessor for string-based filtering\n",
|
||||||
|
"7. **Date Filtering**: Leverage `.dt` accessor for date-based conditions\n",
|
||||||
|
"8. **`.query()` Method**: Alternative syntax for complex filtering\n",
|
||||||
|
"\n",
|
||||||
|
"## Common Mistakes to Avoid\n",
|
||||||
|
"\n",
|
||||||
|
"- Using `and/or` instead of `&/|` in boolean conditions\n",
|
||||||
|
"- Forgetting parentheses around conditions\n",
|
||||||
|
"- Confusing `.loc[]` and `.iloc[]` usage\n",
|
||||||
|
"- Not handling empty results from filtering\n",
|
||||||
|
"- Using chained indexing instead of `.loc[]`\n"
|
||||||
|
]
|
||||||
|
}
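,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch (reusing `df_sales` from above) showing why `and` fails on a Series, and why assignments should go through a single `.loc[]` call rather than chained indexing."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Mistake 1: 'and' asks for a single truth value, which a Series cannot provide\n",
"try:\n",
"    bad = df_sales[df_sales['Sales'] > 1000 and df_sales['Region'] == 'North']\n",
"except ValueError as e:\n",
"    print(f\"'and' fails: {e}\")\n",
"\n",
"# Correct: element-wise '&' with each condition in parentheses\n",
"good = df_sales[(df_sales['Sales'] > 1000) & (df_sales['Region'] == 'North')]\n",
"print(f\"Correct boolean indexing: {len(good)} rows\")\n",
"\n",
"# Mistake 2: chained indexing may write to a temporary copy;\n",
"# a single .loc[] call selects and assigns in one step\n",
"df_demo = df_sales.copy()\n",
"df_demo.loc[df_demo['Sales'] > 1000, 'Sales'] = 1000\n",
"print(f\"Capped max sales: {df_demo['Sales'].max()}\")"
]
}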
],
"metadata": {
"kernelspec": {
"display_name": "venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.13.3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
1137 Session_01/PandasDataFrame-exmples/04_grouping_aggregation.ipynb (Executable file)
File diff suppressed because it is too large
733 Session_01/PandasDataFrame-exmples/05_adding_modifying_columns.ipynb (Executable file)
@@ -0,0 +1,733 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Session 1 - DataFrames - Lesson 5: Adding and Modifying Columns\n",
"\n",
"## Learning Objectives\n",
"- Learn different methods to add new columns to DataFrames\n",
"- Master conditional column creation using various techniques\n",
"- Understand how to modify existing columns\n",
"- Practice with calculated fields and derived columns\n",
"- Explore data type conversions and transformations\n",
"\n",
"## Prerequisites\n",
"- Completed Lessons 1-4\n",
"- Understanding of basic Python operations and functions"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Import required libraries\n",
"import pandas as pd\n",
"import numpy as np\n",
"from datetime import datetime, timedelta\n",
"\n",
"# Create sample dataset\n",
"np.random.seed(42)\n",
"n_records = 150\n",
"\n",
"sales_data = {\n",
"    'Date': pd.date_range('2024-01-01', periods=n_records, freq='D'),\n",
"    'Product': np.random.choice(['Laptop', 'Phone', 'Tablet', 'Monitor'], n_records),\n",
"    'Sales': np.random.normal(1000, 200, n_records).astype(int),\n",
"    'Quantity': np.random.randint(1, 8, n_records),\n",
"    'Region': np.random.choice(['North', 'South', 'East', 'West'], n_records),\n",
"    'Salesperson': np.random.choice(['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'], n_records),\n",
"    'Customer_Type': np.random.choice(['New', 'Returning', 'VIP'], n_records, p=[0.3, 0.6, 0.1])\n",
"}\n",
"\n",
"df_sales = pd.DataFrame(sales_data)\n",
"df_sales['Sales'] = np.abs(df_sales['Sales']) # Ensure positive values\n",
"\n",
"print(\"Original dataset:\")\n",
"print(f\"Shape: {df_sales.shape}\")\n",
"print(\"\\nFirst few rows:\")\n",
"print(df_sales.head())\n",
"print(\"\\nData types:\")\n",
"print(df_sales.dtypes)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Basic Column Addition\n",
"\n",
"Simple methods to add new columns."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Method 1: Direct assignment\n",
"df_modified = df_sales.copy()\n",
"\n",
"# Add simple calculated columns\n",
"df_modified['Revenue'] = df_modified['Sales'] * df_modified['Quantity']\n",
"df_modified['Commission_10%'] = df_modified['Sales'] * 0.10\n",
"df_modified['Sales_per_Unit'] = df_modified['Sales'] / df_modified['Quantity']\n",
"\n",
"print(\"New calculated columns:\")\n",
"print(df_modified[['Sales', 'Quantity', 'Revenue', 'Commission_10%', 'Sales_per_Unit']].head())\n",
"\n",
"# Add constant value columns\n",
"df_modified['Year'] = 2024\n",
"df_modified['Currency'] = 'USD'\n",
"df_modified['Department'] = 'Sales'\n",
"\n",
"print(\"\\nConstant value columns added:\")\n",
"print(df_modified[['Year', 'Currency', 'Department']].head())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Method 2: Using assign() method (more functional approach)\n",
"df_assigned = df_sales.assign(\n",
"    Revenue=lambda x: x['Sales'] * x['Quantity'],\n",
"    Commission_Rate=0.08,\n",
"    Commission_Amount=lambda x: x['Sales'] * 0.08,\n",
"    Sales_Squared=lambda x: x['Sales'] ** 2,\n",
"    Is_High_Volume=lambda x: x['Quantity'] > 5\n",
")\n",
"\n",
"print(\"Using assign() method:\")\n",
"print(df_assigned[['Sales', 'Quantity', 'Revenue', 'Commission_Amount', 'Is_High_Volume']].head())\n",
"\n",
"print(f\"\\nOriginal shape: {df_sales.shape}\")\n",
"print(f\"Modified shape: {df_assigned.shape}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Method 3: Using insert() for specific positioning\n",
"df_insert = df_sales.copy()\n",
"\n",
"# Insert column at specific position (after 'Sales')\n",
"sales_index = df_insert.columns.get_loc('Sales')\n",
"df_insert.insert(sales_index + 1, 'Sales_Tax', df_insert['Sales'] * 0.08)\n",
"df_insert.insert(sales_index + 2, 'Total_with_Tax', df_insert['Sales'] + df_insert['Sales_Tax'])\n",
"\n",
"print(\"Using insert() for positioned columns:\")\n",
"print(df_insert[['Product', 'Sales', 'Sales_Tax', 'Total_with_Tax', 'Quantity']].head())\n",
"print(f\"\\nColumn order: {list(df_insert.columns)}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Conditional Column Creation\n",
"\n",
"Create columns based on conditions and business logic."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Method 1: Using np.where() for simple conditions\n",
"df_conditional = df_sales.copy()\n",
"\n",
"# Simple binary conditions\n",
"df_conditional['High_Sales'] = np.where(df_conditional['Sales'] > 1000, 'Yes', 'No')\n",
"df_conditional['Weekend'] = np.where(df_conditional['Date'].dt.dayofweek >= 5, 'Weekend', 'Weekday')\n",
"df_conditional['Bulk_Order'] = np.where(df_conditional['Quantity'] >= 5, 'Bulk', 'Regular')\n",
"\n",
"print(\"Simple conditional columns:\")\n",
"print(df_conditional[['Sales', 'High_Sales', 'Date', 'Weekend', 'Quantity', 'Bulk_Order']].head())\n",
"\n",
"# Nested conditions\n",
"df_conditional['Sales_Category'] = np.where(df_conditional['Sales'] > 1200, 'High',\n",
"                                            np.where(df_conditional['Sales'] > 800, 'Medium', 'Low'))\n",
"\n",
"print(\"\\nNested conditions:\")\n",
"print(df_conditional[['Sales', 'Sales_Category']].head(10))\n",
"print(\"\\nCategory distribution:\")\n",
"print(df_conditional['Sales_Category'].value_counts())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Method 2: Using pd.cut() for binning numerical data\n",
"df_conditional['Sales_Tier'] = pd.cut(df_conditional['Sales'], \n",
"                                      bins=[0, 500, 800, 1200, float('inf')],\n",
"                                      labels=['Entry', 'Standard', 'Premium', 'Luxury'])\n",
"\n",
"print(\"Using pd.cut() for binning:\")\n",
"print(df_conditional[['Sales', 'Sales_Tier']].head(10))\n",
"print(\"\\nTier distribution:\")\n",
"print(df_conditional['Sales_Tier'].value_counts())\n",
"\n",
"# Using pd.qcut() for quantile-based binning\n",
"df_conditional['Sales_Quintile'] = pd.qcut(df_conditional['Sales'], \n",
"                                           q=5, \n",
"                                           labels=['Bottom 20%', 'Low 20%', 'Mid 20%', 'High 20%', 'Top 20%'])\n",
"\n",
"print(\"\\nUsing pd.qcut() for quantile binning:\")\n",
"print(df_conditional['Sales_Quintile'].value_counts())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Method 3: Using np.select() for multiple conditions\n",
"# Define conditions and choices\n",
"conditions = [\n",
"    (df_conditional['Sales'] >= 1200) & (df_conditional['Quantity'] >= 5),\n",
"    (df_conditional['Sales'] >= 1000) & (df_conditional['Customer_Type'] == 'VIP'),\n",
"    (df_conditional['Sales'] >= 800) & (df_conditional['Region'] == 'North'),\n",
"    df_conditional['Customer_Type'] == 'New'\n",
"]\n",
"\n",
"choices = ['Premium Deal', 'VIP Sale', 'North Preferred', 'New Customer']\n",
"default = 'Standard'\n",
"\n",
"df_conditional['Deal_Type'] = np.select(conditions, choices, default=default)\n",
"\n",
"print(\"Using np.select() for complex conditions:\")\n",
"print(df_conditional[['Sales', 'Quantity', 'Customer_Type', 'Region', 'Deal_Type']].head(10))\n",
"print(\"\\nDeal type distribution:\")\n",
"print(df_conditional['Deal_Type'].value_counts())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Using Apply and Lambda Functions\n",
"\n",
"Create complex calculated columns using custom functions."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Simple lambda functions\n",
"df_apply = df_sales.copy()\n",
"\n",
"# Single column transformations\n",
"df_apply['Sales_Log'] = df_apply['Sales'].apply(lambda x: np.log(x))\n",
"df_apply['Product_Length'] = df_apply['Product'].apply(lambda x: len(x))\n",
"df_apply['Days_Since_Start'] = df_apply['Date'].apply(lambda x: (x - df_apply['Date'].min()).days)\n",
"\n",
"print(\"Simple lambda transformations:\")\n",
"print(df_apply[['Sales', 'Sales_Log', 'Product', 'Product_Length', 'Days_Since_Start']].head())\n",
"\n",
"# Multiple column operations using lambda\n",
"df_apply['Efficiency_Score'] = df_apply.apply(\n",
"    lambda row: (row['Sales'] * row['Quantity']) / (row['Days_Since_Start'] + 1), \n",
"    axis=1\n",
")\n",
"\n",
"print(\"\\nMultiple column lambda:\")\n",
"print(df_apply[['Sales', 'Quantity', 'Days_Since_Start', 'Efficiency_Score']].head())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Custom functions for complex business logic\n",
"def calculate_commission(row):\n",
"    \"\"\"Calculate commission based on complex business rules\"\"\"\n",
"    base_rate = 0.05\n",
"    \n",
"    # VIP customers get higher commission\n",
"    if row['Customer_Type'] == 'VIP':\n",
"        base_rate += 0.02\n",
"    \n",
"    # High quantity orders get bonus\n",
"    if row['Quantity'] >= 5:\n",
"        base_rate += 0.01\n",
"    \n",
"    # Regional multipliers\n",
"    region_multipliers = {'North': 1.2, 'South': 1.0, 'East': 1.1, 'West': 0.9}\n",
"    multiplier = region_multipliers.get(row['Region'], 1.0)\n",
"    \n",
"    return row['Sales'] * base_rate * multiplier\n",
"\n",
"def performance_rating(row):\n",
"    \"\"\"Calculate performance rating based on multiple factors\"\"\"\n",
"    score = 0\n",
"    \n",
"    # Sales performance (40% weight)\n",
"    if row['Sales'] > 1200:\n",
"        score += 40\n",
"    elif row['Sales'] > 800:\n",
"        score += 30\n",
"    else:\n",
"        score += 20\n",
"    \n",
"    # Quantity performance (30% weight)\n",
"    if row['Quantity'] >= 6:\n",
"        score += 30\n",
"    elif row['Quantity'] >= 4:\n",
"        score += 20\n",
"    else:\n",
"        score += 10\n",
"    \n",
"    # Customer type bonus (30% weight)\n",
"    customer_bonus = {'VIP': 30, 'Returning': 20, 'New': 15}\n",
"    score += customer_bonus.get(row['Customer_Type'], 0)\n",
"    \n",
"    # Convert to letter grade\n",
"    if score >= 85:\n",
"        return 'A'\n",
"    elif score >= 70:\n",
"        return 'B'\n",
"    elif score >= 55:\n",
"        return 'C'\n",
"    else:\n",
"        return 'D'\n",
"\n",
"# Apply custom functions\n",
"df_apply['Commission'] = df_apply.apply(calculate_commission, axis=1)\n",
"df_apply['Performance_Rating'] = df_apply.apply(performance_rating, axis=1)\n",
"\n",
"print(\"Custom function results:\")\n",
"print(df_apply[['Sales', 'Quantity', 'Customer_Type', 'Region', 'Commission', 'Performance_Rating']].head())\n",
"\n",
"print(\"\\nPerformance rating distribution:\")\n",
"print(df_apply['Performance_Rating'].value_counts().sort_index())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Date and Time Derived Columns\n",
"\n",
"Extract useful information from datetime columns."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Extract date components\n",
"df_dates = df_sales.copy()\n",
"\n",
"# Basic date components\n",
"df_dates['Year'] = df_dates['Date'].dt.year\n",
"df_dates['Month'] = df_dates['Date'].dt.month\n",
"df_dates['Day'] = df_dates['Date'].dt.day\n",
"df_dates['DayOfWeek'] = df_dates['Date'].dt.dayofweek # 0=Monday, 6=Sunday\n",
"df_dates['DayName'] = df_dates['Date'].dt.day_name()\n",
"df_dates['MonthName'] = df_dates['Date'].dt.month_name()\n",
"\n",
"print(\"Basic date components:\")\n",
"print(df_dates[['Date', 'Year', 'Month', 'Day', 'DayOfWeek', 'DayName', 'MonthName']].head())\n",
"\n",
"# Business-relevant date features\n",
"df_dates['Quarter'] = df_dates['Date'].dt.quarter\n",
"df_dates['Week'] = df_dates['Date'].dt.isocalendar().week\n",
"df_dates['DayOfYear'] = df_dates['Date'].dt.dayofyear\n",
"df_dates['IsWeekend'] = df_dates['Date'].dt.dayofweek >= 5\n",
"df_dates['IsMonthStart'] = df_dates['Date'].dt.is_month_start\n",
"df_dates['IsMonthEnd'] = df_dates['Date'].dt.is_month_end\n",
"\n",
"print(\"\\nBusiness date features:\")\n",
"print(df_dates[['Date', 'Quarter', 'Week', 'IsWeekend', 'IsMonthStart', 'IsMonthEnd']].head(10))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Time-based calculations\n",
"start_date = df_dates['Date'].min()\n",
"df_dates['Days_Since_Start'] = (df_dates['Date'] - start_date).dt.days\n",
"df_dates['Weeks_Since_Start'] = df_dates['Days_Since_Start'] // 7\n",
"\n",
"# Create season column\n",
"def get_season(month):\n",
"    if month in [12, 1, 2]:\n",
"        return 'Winter'\n",
"    elif month in [3, 4, 5]:\n",
"        return 'Spring'\n",
"    elif month in [6, 7, 8]:\n",
"        return 'Summer'\n",
"    else:\n",
"        return 'Fall'\n",
"\n",
"df_dates['Season'] = df_dates['Month'].apply(get_season)\n",
"\n",
"# Business day calculations\n",
"df_dates['IsBusinessDay'] = df_dates['Date'].dt.dayofweek < 5\n",
"df_dates['BusinessDaysSinceStart'] = df_dates.apply(\n",
"    lambda row: np.busday_count(start_date.date(), row['Date'].date()), axis=1\n",
")\n",
"\n",
"print(\"Time-based calculations:\")\n",
"print(df_dates[['Date', 'Days_Since_Start', 'Weeks_Since_Start', 'Season', \n",
"                'IsBusinessDay', 'BusinessDaysSinceStart']].head())\n",
"\n",
"print(\"\\nSeason distribution:\")\n",
"print(df_dates['Season'].value_counts())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. Text and String Manipulations\n",
"\n",
"Create columns based on string operations and text processing."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# String manipulations\n",
"df_text = df_sales.copy()\n",
"\n",
"# Basic string operations\n",
"df_text['Product_Upper'] = df_text['Product'].str.upper()\n",
"df_text['Product_Lower'] = df_text['Product'].str.lower()\n",
"df_text['Product_Length'] = df_text['Product'].str.len()\n",
"df_text['Product_First_Char'] = df_text['Product'].str[0]\n",
"df_text['Product_Last_Three'] = df_text['Product'].str[-3:]\n",
"\n",
"print(\"Basic string operations:\")\n",
"print(df_text[['Product', 'Product_Upper', 'Product_Lower', 'Product_Length', \n",
"               'Product_First_Char', 'Product_Last_Three']].head())\n",
"\n",
"# Text categorization\n",
"df_text['Product_Category'] = df_text['Product'].apply(lambda x: \n",
"    'Computer' if x in ['Laptop', 'Monitor'] else\n",
"    'Mobile' if x in ['Phone', 'Tablet'] else\n",
"    'Other'\n",
")\n",
"\n",
"# Check for patterns\n",
"df_text['Has_Letter_A'] = df_text['Product'].str.contains('a', case=False)\n",
"df_text['Starts_With_L'] = df_text['Product'].str.startswith('L')\n",
"df_text['Ends_With_E'] = df_text['Product'].str.endswith('e')\n",
"\n",
"print(\"\\nText patterns and categorization:\")\n",
"print(df_text[['Product', 'Product_Category', 'Has_Letter_A', 'Starts_With_L', 'Ends_With_E']].head(10))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Create formatted text columns\n",
"df_text['Sales_Formatted'] = df_text['Sales'].apply(lambda x: f\"${x:,.2f}\")\n",
"df_text['Transaction_ID'] = df_text.apply(\n",
"    lambda row: f\"{row['Region'][:1]}{row['Product'][:3].upper()}{row.name:04d}\", axis=1\n",
")\n",
"\n",
"# Create summary descriptions\n",
"df_text['Transaction_Summary'] = df_text.apply(\n",
"    lambda row: f\"{row['Salesperson']} sold {row['Quantity']} {row['Product']}(s) \"\n",
"                f\"for {row['Sales_Formatted']} in {row['Region']} region\", \n",
"    axis=1\n",
")\n",
"\n",
"print(\"Formatted text columns:\")\n",
"print(df_text[['Sales_Formatted', 'Transaction_ID']].head())\n",
"print(\"\\nTransaction summaries:\")\n",
"for i, summary in enumerate(df_text['Transaction_Summary'].head(3)):\n",
"    print(f\"{i+1}. {summary}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6. Working with Categorical Data\n",
"\n",
"Optimize memory usage and enable category-specific operations."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Convert to categorical data types\n",
"df_categorical = df_sales.copy()\n",
"\n",
"# Check memory usage before\n",
"print(\"Memory usage before categorical conversion:\")\n",
"print(df_categorical.memory_usage(deep=True))\n",
"\n",
"# Convert string columns to categorical\n",
"categorical_columns = ['Product', 'Region', 'Salesperson', 'Customer_Type']\n",
"for col in categorical_columns:\n",
"    df_categorical[col] = df_categorical[col].astype('category')\n",
"\n",
"print(\"\\nMemory usage after categorical conversion:\")\n",
"print(df_categorical.memory_usage(deep=True))\n",
"\n",
"print(\"\\nData types after conversion:\")\n",
"print(df_categorical.dtypes)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Working with ordered categories\n",
"# Create ordered categorical for sales performance\n",
"performance_categories = ['Poor', 'Fair', 'Good', 'Excellent']\n",
"df_categorical['Performance_Level'] = pd.cut(\n",
"    df_categorical['Sales'],\n",
"    bins=[0, 700, 900, 1200, float('inf')],\n",
"    labels=performance_categories,\n",
"    ordered=True\n",
")\n",
"\n",
"print(\"Ordered categorical data:\")\n",
"print(df_categorical['Performance_Level'].head(10))\n",
"print(\"\\nCategory info:\")\n",
"print(df_categorical['Performance_Level'].cat.categories)\n",
"print(f\"Is ordered: {df_categorical['Performance_Level'].cat.ordered}\")\n",
"\n",
"# Categorical operations\n",
"print(\"\\nPerformance level distribution:\")\n",
"print(df_categorical['Performance_Level'].value_counts().sort_index())\n",
"\n",
"# Add new category\n",
"df_categorical['Performance_Level'] = df_categorical['Performance_Level'].cat.add_categories(['Outstanding'])\n",
"print(f\"\\nCategories after adding 'Outstanding': {df_categorical['Performance_Level'].cat.categories}\")"
]
},
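{
"cell_type": "markdown",
"metadata": {},
"source": [
"Why set `ordered=True`? Ordered categoricals support comparison operators, so you can filter by level directly. A minimal sketch, reusing `df_categorical` from above:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Ordered categories can be compared against a category label\n",
"good_or_better = df_categorical[df_categorical['Performance_Level'] >= 'Good']\n",
"print(f\"Rows rated Good or better: {len(good_or_better)}\")\n",
"print(good_or_better['Performance_Level'].value_counts().sort_index())"
]
},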
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 7. Mathematical and Statistical Transformations\n",
"\n",
"Create columns using mathematical functions and statistical transformations."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Mathematical transformations\n",
"df_math = df_sales.copy()\n",
"\n",
"# Common mathematical transformations\n",
"df_math['Sales_Log'] = np.log(df_math['Sales'])\n",
"df_math['Sales_Sqrt'] = np.sqrt(df_math['Sales'])\n",
"df_math['Sales_Squared'] = df_math['Sales'] ** 2\n",
"df_math['Sales_Reciprocal'] = 1 / df_math['Sales']\n",
"\n",
"print(\"Mathematical transformations:\")\n",
"print(df_math[['Sales', 'Sales_Log', 'Sales_Sqrt', 'Sales_Squared', 'Sales_Reciprocal']].head())\n",
"\n",
"# Statistical standardization\n",
"df_math['Sales_Z_Score'] = (df_math['Sales'] - df_math['Sales'].mean()) / df_math['Sales'].std()\n",
"df_math['Sales_Min_Max_Scaled'] = (df_math['Sales'] - df_math['Sales'].min()) / (df_math['Sales'].max() - df_math['Sales'].min())\n",
"\n",
"# Rolling statistics\n",
"df_math = df_math.sort_values('Date')\n",
"df_math['Sales_Rolling_7_Mean'] = df_math['Sales'].rolling(window=7, min_periods=1).mean()\n",
"df_math['Sales_Rolling_7_Std'] = df_math['Sales'].rolling(window=7, min_periods=1).std()\n",
"df_math['Sales_Cumulative_Sum'] = df_math['Sales'].cumsum()\n",
"\n",
"print(\"\\nStatistical transformations:\")\n",
"print(df_math[['Sales', 'Sales_Z_Score', 'Sales_Min_Max_Scaled', \n",
"               'Sales_Rolling_7_Mean', 'Sales_Cumulative_Sum']].head(10))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Rank and percentile columns\n",
"df_math['Sales_Rank'] = df_math['Sales'].rank(ascending=False)\n",
"df_math['Sales_Percentile'] = df_math['Sales'].rank(pct=True) * 100\n",
"df_math['Sales_Rank_by_Region'] = df_math.groupby('Region')['Sales'].rank(ascending=False)\n",
"\n",
"# Binning and discretization\n",
"df_math['Sales_Decile'] = pd.qcut(df_math['Sales'], q=10, labels=range(1, 11))\n",
"df_math['Sales_Tertile'] = pd.qcut(df_math['Sales'], q=3, labels=['Low', 'Medium', 'High'])\n",
"\n",
"print(\"Ranking and binning:\")\n",
"print(df_math[['Sales', 'Sales_Rank', 'Sales_Percentile', 'Sales_Rank_by_Region', \n",
"               'Sales_Decile', 'Sales_Tertile']].head(10))\n",
"\n",
"print(\"\\nDecile distribution:\")\n",
"print(df_math['Sales_Decile'].value_counts().sort_index())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Practice Exercises\n",
"\n",
"Apply your column creation and modification skills:"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"# Exercise 1: Customer Segmentation\n",
"# Create a comprehensive customer segmentation system:\n",
"# - Combine purchase behavior, frequency, and value\n",
"# - Create RFM-like scores (Recency, Frequency, Monetary)\n",
"# - Assign customer segments (e.g., Champion, Loyal, At Risk, etc.)\n",
"\n",
"def create_customer_segmentation(df):\n",
"    \"\"\"Create customer segmentation based on purchase patterns\"\"\"\n",
"    # Your implementation here\n",
"    pass\n",
"\n",
"# segmented_df = create_customer_segmentation(df_sales)\n",
"# print(segmented_df[['Customer_Type', 'Sales', 'Frequency_Score', 'Monetary_Score', 'Segment']].head())"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
"# Exercise 2: Performance Metrics Dashboard\n",
"# Create a comprehensive set of KPI columns:\n",
"# - Sales efficiency metrics\n",
"# - Trend indicators (growth rates, momentum)\n",
"# - Comparative metrics (vs. average, vs. target)\n",
"# - Alert flags for unusual patterns\n",
"\n",
"# Your code here:\n"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
"# Exercise 3: Feature Engineering for ML\n",
"# Create features that could be useful for machine learning:\n",
"# - Interaction features (product of two variables)\n",
"# - Polynomial features\n",
"# - Time-based features (seasonality, trends)\n",
"# - Lag features (previous period values)\n",
"\n",
"# Your code here:\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Key Takeaways\n",
"\n",
"1. **Column Assignment**: Use direct assignment (`df['col'] = value`) for simple cases\n",
"2. **Assign Method**: Use `.assign()` for functional programming style and method chaining\n",
"3. **Conditional Logic**: Combine `np.where()`, `pd.cut()`, `pd.qcut()`, and `np.select()` for complex conditions\n",
"4. **Apply Functions**: Use `.apply()` with lambda or custom functions for complex transformations\n",
"5. **Date Features**: Extract meaningful components from datetime columns\n",
"6. **String Operations**: Leverage `.str` accessor for text manipulations\n",
"7. **Categorical Data**: Convert to categories for memory efficiency and special operations\n",
"8. **Mathematical Transformations**: Apply statistical and mathematical functions for data preprocessing\n",
"\n",
"## Performance Tips\n",
"\n",
"1. **Vectorized Operations**: Prefer pandas/numpy operations over loops\n",
"2. **Categorical Types**: Use categorical data for repeated string values\n",
"3. **Memory Management**: Monitor memory usage when creating many new columns\n",
"4. **Method Chaining**: Use `.assign()` for readable method chains\n",
"5. **Avoid apply() When Possible**: Use vectorized operations instead of `.apply()` for better performance (see the timing sketch in the cell after this one)\n",
"\n",
"## Common Patterns\n",
"\n",
"```python\n",
"# Simple calculation\n",
"df['new_col'] = df['col1'] * df['col2']\n",
"\n",
"# Conditional column\n",
"df['category'] = np.where(df['value'] > threshold, 'High', 'Low')\n",
"\n",
"# Apply custom function\n",
"df['result'] = df.apply(custom_function, axis=1)\n",
"\n",
"# Date features\n",
"df['month'] = df['date'].dt.month\n",
"\n",
"# String operations\n",
"df['upper'] = df['text'].str.upper()\n",
"```"
]
}
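,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A rough illustration of tip 5: the sketch below times the same calculation written row-wise with `.apply()` and as a vectorized expression. Exact numbers vary by machine, but the vectorized form is typically orders of magnitude faster."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import time\n",
"\n",
"df_perf = df_sales.copy()\n",
"\n",
"# Row-wise apply: calls a Python function once per row\n",
"start = time.perf_counter()\n",
"apply_result = df_perf.apply(lambda row: row['Sales'] * 0.08, axis=1)\n",
"apply_time = time.perf_counter() - start\n",
"\n",
"# Vectorized: one operation on the whole column\n",
"start = time.perf_counter()\n",
"vector_result = df_perf['Sales'] * 0.08\n",
"vector_time = time.perf_counter() - start\n",
"\n",
"print(f\"apply(): {apply_time:.6f}s, vectorized: {vector_time:.6f}s\")\n",
"print(f\"Same values: {np.allclose(apply_result, vector_result)}\")"
]
}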
],
"metadata": {
"kernelspec": {
"display_name": "venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.13.3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
916 Session_01/PandasDataFrame-exmples/06_handling_missing_data.ipynb (Executable file)
@@ -0,0 +1,916 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Session 1 - DataFrames - Lesson 6: Handling Missing Data\n",
"\n",
"## Learning Objectives\n",
"- Understand different types of missing data and their implications\n",
"- Master techniques for detecting and analyzing missing values\n",
"- Learn various strategies for handling missing data\n",
"- Practice imputation methods and their trade-offs\n",
"- Develop best practices for missing data management\n",
"\n",
"## Prerequisites\n",
"- Completed Lessons 1-5\n",
"- Understanding of basic statistical concepts\n",
"- Familiarity with data quality principles"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Import required libraries\n",
"import pandas as pd\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"from datetime import datetime, timedelta\n",
"import warnings\n",
"warnings.filterwarnings('ignore')\n",
"\n",
"# Set display options\n",
"pd.set_option('display.max_columns', None)\n",
"plt.style.use('seaborn-v0_8')\n",
"\n",
"print(\"Libraries loaded successfully!\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Creating Dataset with Missing Values\n",
"\n",
"Let's create a realistic dataset with different patterns of missing data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Create comprehensive dataset with various missing data patterns\n",
"np.random.seed(42)\n",
"n_records = 500\n",
"\n",
"# Base data\n",
"data = {\n",
"    'customer_id': range(1, n_records + 1),\n",
"    'age': np.random.normal(35, 12, n_records).astype(int),\n",
"    'income': np.random.normal(50000, 15000, n_records),\n",
"    'education_years': np.random.normal(14, 3, n_records),\n",
"    'purchase_amount': np.random.normal(200, 50, n_records),\n",
"    'satisfaction_score': np.random.randint(1, 6, n_records),\n",
"    'region': np.random.choice(['North', 'South', 'East', 'West'], n_records),\n",
"    'product_category': np.random.choice(['Electronics', 'Clothing', 'Books', 'Home'], n_records),\n",
"    'signup_date': pd.date_range('2023-01-01', periods=n_records, freq='D'),\n",
"    'last_purchase_date': pd.date_range('2023-01-01', periods=n_records, freq='D') + pd.Timedelta(days=30)\n",
"}\n",
"\n",
"df_complete = pd.DataFrame(data)\n",
"\n",
"# Ensure positive values where appropriate\n",
"df_complete['age'] = np.abs(df_complete['age'])\n",
"df_complete['income'] = np.abs(df_complete['income'])\n",
"df_complete['education_years'] = np.clip(df_complete['education_years'], 6, 20)\n",
"df_complete['purchase_amount'] = np.abs(df_complete['purchase_amount'])\n",
"\n",
"print(\"Complete dataset created:\")\n",
"print(f\"Shape: {df_complete.shape}\")\n",
"print(\"\\nFirst few rows:\")\n",
"print(df_complete.head())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Introduce different patterns of missing data\n",
"df_missing = df_complete.copy()\n",
"\n",
"# 1. Missing Completely at Random (MCAR) - income data\n",
"# Randomly missing 15% of income values\n",
"mcar_indices = np.random.choice(df_missing.index, size=int(0.15 * len(df_missing)), replace=False)\n",
"df_missing.loc[mcar_indices, 'income'] = np.nan\n",
"\n",
"# 2. Missing at Random (MAR) - education years missing based on age\n",
"# Older people less likely to report education\n",
"older_customers = df_missing['age'] > 60\n",
"older_indices = df_missing[older_customers].index\n",
"education_missing = np.random.choice(older_indices, size=int(0.4 * len(older_indices)), replace=False)\n",
"df_missing.loc[education_missing, 'education_years'] = np.nan\n",
"\n",
"# 3. Missing Not at Random (MNAR) - satisfaction scores\n",
"# Unsatisfied customers less likely to provide ratings\n",
"low_satisfaction = df_missing['satisfaction_score'] <= 2\n",
"low_sat_indices = df_missing[low_satisfaction].index\n",
"satisfaction_missing = np.random.choice(low_sat_indices, size=int(0.6 * len(low_sat_indices)), replace=False)\n",
"df_missing.loc[satisfaction_missing, 'satisfaction_score'] = np.nan\n",
"\n",
"# 4. Systematic missing - last purchase date for new customers\n",
"# New customers (signed up recently) haven't made purchases yet\n",
"recent_signups = df_missing['signup_date'] > '2023-11-01'\n",
"df_missing.loc[recent_signups, 'last_purchase_date'] = pd.NaT\n",
"\n",
"# 5. Random missing in other columns\n",
"# Purchase amount - 10% missing\n",
"purchase_missing = np.random.choice(df_missing.index, size=int(0.10 * len(df_missing)), replace=False)\n",
"df_missing.loc[purchase_missing, 'purchase_amount'] = np.nan\n",
"\n",
"print(\"Missing data patterns introduced:\")\n",
"print(f\"Dataset shape: {df_missing.shape}\")\n",
"print(\"\\nMissing value counts:\")\n",
"missing_summary = df_missing.isnull().sum()\n",
"missing_summary = missing_summary[missing_summary > 0]\n",
"print(missing_summary)\n",
"\n",
"print(\"\\nMissing value percentages:\")\n",
"missing_pct = (df_missing.isnull().sum() / len(df_missing) * 100).round(2)\n",
"missing_pct = missing_pct[missing_pct > 0]\n",
"print(missing_pct)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Detecting and Analyzing Missing Data\n",
"\n",
"Comprehensive techniques for understanding missing data patterns."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def analyze_missing_data(df):\n",
"    \"\"\"Comprehensive missing data analysis\"\"\"\n",
"    print(\"=== MISSING DATA ANALYSIS ===\")\n",
"    \n",
"    # Basic missing data statistics\n",
"    total_cells = df.size\n",
"    total_missing = df.isnull().sum().sum()\n",
"    print(f\"Total cells: {total_cells:,}\")\n",
"    print(f\"Missing cells: {total_missing:,} ({total_missing/total_cells*100:.2f}%)\")\n",
"    \n",
"    # Missing data by column\n",
"    missing_by_column = pd.DataFrame({\n",
"        'Missing_Count': df.isnull().sum(),\n",
"        'Missing_Percentage': (df.isnull().sum() / len(df)) * 100,\n",
"        'Data_Type': df.dtypes\n",
"    })\n",
"    missing_by_column = missing_by_column[missing_by_column['Missing_Count'] > 0]\n",
"    missing_by_column = missing_by_column.sort_values('Missing_Percentage', ascending=False)\n",
"    \n",
"    print(\"\\n--- Missing Data by Column ---\")\n",
"    print(missing_by_column.round(2))\n",
"    \n",
"    # Missing data patterns\n",
"    print(\"\\n--- Missing Data Patterns ---\")\n",
"    missing_patterns = df.isnull().value_counts().head(10)\n",
"    print(\"Top 10 missing patterns (True = Missing):\")\n",
"    for pattern, count in missing_patterns.items():\n",
"        percentage = (count / len(df)) * 100\n",
"        print(f\"{count:4d} rows ({percentage:5.1f}%): {dict(zip(df.columns, pattern))}\")\n",
"    \n",
"    return missing_by_column\n",
"\n",
"# Analyze missing data\n",
"missing_analysis = analyze_missing_data(df_missing)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Visualize missing data patterns\n",
"def visualize_missing_data(df):\n",
"    \"\"\"Create visualizations for missing data patterns\"\"\"\n",
"    fig, axes = plt.subplots(2, 2, figsize=(15, 10))\n",
"    \n",
"    # 1. Missing data heatmap\n",
"    missing_mask = df.isnull()\n",
"    sns.heatmap(missing_mask.iloc[:100], \n",
"                yticklabels=False, \n",
"                cbar=True, \n",
"                cmap='viridis',\n",
"                ax=axes[0, 0])\n",
"    axes[0, 0].set_title('Missing Data Heatmap (First 100 rows)')\n",
"    \n",
"    # 2. Missing data by column\n",
"    missing_counts = df.isnull().sum()\n",
"    missing_counts = missing_counts[missing_counts > 0]\n",
"    missing_counts.plot(kind='bar', ax=axes[0, 1], color='skyblue')\n",
"    axes[0, 1].set_title('Missing Values by Column')\n",
"    axes[0, 1].set_ylabel('Count')\n",
"    axes[0, 1].tick_params(axis='x', rotation=45)\n",
"    \n",
"    # 3. Missing data correlation\n",
"    missing_corr = df.isnull().corr()\n",
"    sns.heatmap(missing_corr, annot=True, cmap='coolwarm', center=0, ax=axes[1, 0])\n",
"    axes[1, 0].set_title('Missing Data Correlation')\n",
"    \n",
"    # 4. Missing data by row\n",
"    missing_per_row = df.isnull().sum(axis=1)\n",
"    missing_per_row.hist(bins=range(len(df.columns) + 2), ax=axes[1, 1], alpha=0.7, color='orange')\n",
"    axes[1, 1].set_title('Distribution of Missing Values per Row')\n",
"    axes[1, 1].set_xlabel('Number of Missing Values')\n",
"    axes[1, 1].set_ylabel('Number of Rows')\n",
"    \n",
"    plt.tight_layout()\n",
"    plt.show()\n",
"\n",
"# Visualize missing patterns\n",
"visualize_missing_data(df_missing)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Analyze missing data relationships\n",
"def analyze_missing_relationships(df):\n",
"    \"\"\"Analyze relationships between missing data and other variables\"\"\"\n",
"    print(\"=== MISSING DATA RELATIONSHIPS ===\")\n",
"    \n",
"    # Example: Relationship between age and missing education\n",
"    if 'age' in df.columns and 'education_years' in df.columns:\n",
"        print(\"\\n--- Age vs Missing Education ---\")\n",
"        education_missing = df['education_years'].isnull()\n",
"        age_stats = df.groupby(education_missing)['age'].agg(['mean', 'median', 'std']).round(2)\n",
"        age_stats.index = ['Education Present', 'Education Missing']\n",
"        print(age_stats)\n",
"    \n",
"    # Example: Missing satisfaction by purchase amount\n",
"    if 'satisfaction_score' in df.columns and 'purchase_amount' in df.columns:\n",
"        print(\"\\n--- Purchase Amount vs Missing Satisfaction ---\")\n",
"        satisfaction_missing = df['satisfaction_score'].isnull()\n",
"        purchase_stats = df.groupby(satisfaction_missing)['purchase_amount'].agg(['mean', 'median', 'count']).round(2)\n",
"        purchase_stats.index = ['Satisfaction Present', 'Satisfaction Missing']\n",
"        print(purchase_stats)\n",
"    \n",
"    # Missing data by categorical variables\n",
"    if 'region' in df.columns:\n",
"        print(\"\\n--- Missing Data by Region ---\")\n",
"        region_missing = df.groupby('region').apply(lambda x: x.isnull().sum())\n",
"        print(region_missing[region_missing.sum(axis=1) > 0])\n",
"\n",
"# Analyze relationships\n",
"analyze_missing_relationships(df_missing)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Basic Missing Data Handling\n",
"\n",
"Fundamental techniques for dealing with missing values."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Method 1: Dropping missing values\n",
"print(\"=== DROPPING MISSING VALUES ===\")\n",
"\n",
"# Drop rows with any missing values\n",
"df_drop_any = df_missing.dropna()\n",
"print(f\"Original shape: {df_missing.shape}\")\n",
"print(f\"After dropping any missing: {df_drop_any.shape}\")\n",
"print(f\"Rows removed: {len(df_missing) - len(df_drop_any)} ({(len(df_missing) - len(df_drop_any))/len(df_missing)*100:.1f}%)\")\n",
"\n",
"# Drop rows with missing values in specific columns\n",
"critical_columns = ['customer_id', 'age', 'region']\n",
"df_drop_critical = df_missing.dropna(subset=critical_columns)\n",
"print(f\"\\nAfter dropping rows missing critical columns: {df_drop_critical.shape}\")\n",
"\n",
"# Drop rows with more than X missing values\n",
"df_drop_thresh = df_missing.dropna(thresh=len(df_missing.columns) - 2) # Allow max 2 missing\n",
"print(f\"After dropping rows with >2 missing values: {df_drop_thresh.shape}\")\n",
"\n",
"# Drop columns with too many missing values\n",
"missing_threshold = 0.5 # 50%\n",
"cols_to_keep = df_missing.columns[df_missing.isnull().mean() < missing_threshold]\n",
"df_drop_cols = df_missing[cols_to_keep]\n",
"print(f\"\\nAfter dropping columns with >{missing_threshold*100}% missing: {df_drop_cols.shape}\")\n",
"print(f\"Columns dropped: {set(df_missing.columns) - set(cols_to_keep)}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Method 2: Basic imputation with fillna()\n",
"print(\"=== BASIC IMPUTATION ===\")\n",
"\n",
"df_basic_impute = df_missing.copy()\n",
"\n",
"# Fill with specific values\n",
"df_basic_impute['satisfaction_score'] = df_basic_impute['satisfaction_score'].fillna(3) # Neutral score\n",
"print(\"Filled satisfaction_score with 3 (neutral)\")\n",
"\n",
"# Fill with statistical measures\n",
"df_basic_impute['income'] = df_basic_impute['income'].fillna(df_basic_impute['income'].median())\n",
"df_basic_impute['education_years'] = df_basic_impute['education_years'].fillna(df_basic_impute['education_years'].mean())\n",
"df_basic_impute['purchase_amount'] = df_basic_impute['purchase_amount'].fillna(df_basic_impute['purchase_amount'].mean())\n",
"print(\"Filled numerical columns with mean/median\")\n",
"\n",
"# Backward fill for dates (.bfill(); fillna(method=...) is deprecated)\n",
"df_basic_impute['last_purchase_date'] = df_basic_impute['last_purchase_date'].bfill()\n",
"print(\"Filled dates with backward fill\")\n",
"\n",
"print(\"\\nMissing values after basic imputation:\")\n",
"print(df_basic_impute.isnull().sum().sum())\n",
"\n",
"# Show before/after comparison\n",
"print(\"\\nBefore/after missing counts per column:\")\n",
"comparison_cols = ['income', 'education_years', 'purchase_amount', 'satisfaction_score']\n",
"for col in comparison_cols:\n",
"    before_missing = df_missing[col].isnull().sum()\n",
"    after_missing = df_basic_impute[col].isnull().sum()\n",
"    print(f\"{col}: {before_missing} → {after_missing} missing values\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Advanced Imputation Techniques\n",
"\n",
"Sophisticated methods for handling missing data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Group-based imputation\n",
"def group_based_imputation(df):\n",
"    \"\"\"Impute missing values based on group statistics\"\"\"\n",
"    df_group_impute = df.copy()\n",
"    \n",
"    print(\"=== GROUP-BASED IMPUTATION ===\")\n",
"    \n",
"    # Impute income based on region and education level\n",
"    # First, create education level categories\n",
"    df_group_impute['education_level'] = pd.cut(\n",
"        df_group_impute['education_years'].fillna(df_group_impute['education_years'].median()),\n",
"        bins=[0, 12, 16, 20],\n",
"        labels=['High School', 'Bachelor', 'Advanced']\n",
"    )\n",
"    \n",
"    # Calculate group-based statistics\n",
"    income_by_group = df_group_impute.groupby(['region', 'education_level'])['income'].median()\n",
"    \n",
"    # Fill missing income values\n",
"    def fill_income(row):\n",
"        if pd.isna(row['income']):\n",
"            try:\n",
"                return income_by_group.loc[(row['region'], row['education_level'])]\n",
"            except KeyError:\n",
"                return df_group_impute['income'].median()\n",
"        return row['income']\n",
"    \n",
"    df_group_impute['income'] = df_group_impute.apply(fill_income, axis=1)\n",
"    \n",
"    print(\"Income imputed based on region and education level\")\n",
"    print(\"Group-based median income:\")\n",
"    print(income_by_group.round(0))\n",
"    \n",
"    return df_group_impute\n",
"\n",
"# Apply group-based imputation\n",
"df_group_imputed = group_based_imputation(df_missing)"
]
},
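{
"cell_type": "markdown",
"metadata": {},
"source": [
"The comparison in the next section names KNN and iterative imputation, which were not built above. A minimal sketch using scikit-learn (assuming `scikit-learn` is installed; `df_knn_imputed` and `df_iter_imputed` are illustrative names) would look like this, and the resulting frames could be passed to `compare_imputation_methods()` alongside matching entries in `methods_names`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: model-based imputation with scikit-learn (assumes it is installed)\n",
"from sklearn.experimental import enable_iterative_imputer  # noqa: F401, enables IterativeImputer\n",
"from sklearn.impute import KNNImputer, IterativeImputer\n",
"\n",
"numeric_cols = ['age', 'income', 'education_years', 'purchase_amount', 'satisfaction_score']\n",
"\n",
"# KNN: fill each missing value from the 5 most similar rows\n",
"df_knn_imputed = df_missing.copy()\n",
"df_knn_imputed[numeric_cols] = KNNImputer(n_neighbors=5).fit_transform(df_missing[numeric_cols])\n",
"\n",
"# Iterative: model each column with missing values on the other columns\n",
"df_iter_imputed = df_missing.copy()\n",
"df_iter_imputed[numeric_cols] = IterativeImputer(random_state=42).fit_transform(df_missing[numeric_cols])\n",
"\n",
"print(f\"Missing after KNN: {df_knn_imputed[numeric_cols].isnull().sum().sum()}\")\n",
"print(f\"Missing after Iterative: {df_iter_imputed[numeric_cols].isnull().sum().sum()}\")"
]
},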
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"## 4. Comparison of Imputation Methods\n",
|
||||||
|
"\n",
|
||||||
|
"Compare different imputation approaches and their impact."
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"def compare_imputation_methods(original_complete, original_missing, *imputed_dfs, methods_names):\n",
|
||||||
|
" \"\"\"Compare different imputation methods\"\"\"\n",
|
||||||
|
" print(\"=== IMPUTATION METHODS COMPARISON ===\")\n",
|
||||||
|
" \n",
|
||||||
|
" # Focus on a specific column for comparison\n",
|
||||||
|
" column = 'income'\n",
|
||||||
|
" \n",
|
||||||
|
" if column not in original_complete.columns:\n",
|
||||||
|
" print(f\"Column {column} not found\")\n",
|
||||||
|
" return\n",
|
||||||
|
" \n",
|
||||||
|
" # Get original values that were made missing\n",
|
||||||
|
" missing_mask = original_missing[column].isnull()\n",
|
||||||
|
" true_values = original_complete.loc[missing_mask, column]\n",
|
||||||
|
" \n",
|
||||||
|
" print(f\"Comparing imputation for '{column}' column\")\n",
|
||||||
|
" print(f\"Number of missing values: {len(true_values)}\")\n",
|
||||||
|
" \n",
|
||||||
|
" # Calculate errors for each method\n",
|
||||||
|
" results = {}\n",
|
||||||
|
" \n",
|
||||||
|
" for df_imputed, method_name in zip(imputed_dfs, methods_names):\n",
|
||||||
|
" if column in df_imputed.columns:\n",
|
||||||
|
" imputed_values = df_imputed.loc[missing_mask, column]\n",
|
||||||
|
" \n",
|
||||||
|
" # Calculate metrics\n",
|
||||||
|
" mae = np.mean(np.abs(true_values - imputed_values))\n",
|
||||||
|
" rmse = np.sqrt(np.mean((true_values - imputed_values) ** 2))\n",
|
||||||
|
" bias = np.mean(imputed_values - true_values)\n",
|
||||||
|
" \n",
|
||||||
|
" results[method_name] = {\n",
|
||||||
|
" 'MAE': mae,\n",
|
||||||
|
" 'RMSE': rmse,\n",
|
||||||
|
" 'Bias': bias,\n",
|
||||||
|
" 'Mean_Imputed': np.mean(imputed_values),\n",
|
||||||
|
" 'Std_Imputed': np.std(imputed_values)\n",
|
||||||
|
" }\n",
|
||||||
|
" \n",
|
||||||
|
" # True statistics\n",
|
||||||
|
" print(f\"\\nTrue statistics for missing values:\")\n",
|
||||||
|
" print(f\"Mean: {np.mean(true_values):.2f}\")\n",
|
||||||
|
" print(f\"Std: {np.std(true_values):.2f}\")\n",
|
||||||
|
" \n",
|
||||||
|
" # Results comparison\n",
|
||||||
|
" results_df = pd.DataFrame(results).T\n",
|
||||||
|
" print(f\"\\nImputation comparison results:\")\n",
|
||||||
|
" print(results_df.round(2))\n",
|
||||||
|
" \n",
|
||||||
|
" # Visualize comparison\n",
|
||||||
|
" fig, axes = plt.subplots(2, 2, figsize=(15, 10))\n",
|
||||||
|
" \n",
|
||||||
|
" # Distribution comparison\n",
|
||||||
|
" axes[0, 0].hist(true_values, alpha=0.7, label='True Values', bins=20)\n",
|
||||||
|
" for df_imputed, method_name in zip(imputed_dfs, methods_names):\n",
|
||||||
|
" if column in df_imputed.columns:\n",
|
||||||
|
" imputed_values = df_imputed.loc[missing_mask, column]\n",
|
||||||
|
" axes[0, 0].hist(imputed_values, alpha=0.7, label=f'{method_name}', bins=20)\n",
|
||||||
|
" axes[0, 0].set_title('Distribution Comparison')\n",
|
||||||
|
" axes[0, 0].legend()\n",
|
||||||
|
" \n",
|
||||||
|
" # Error metrics\n",
|
||||||
|
" metrics = ['MAE', 'RMSE']\n",
|
||||||
|
" for i, metric in enumerate(metrics):\n",
|
||||||
|
" values = [results[method][metric] for method in results.keys()]\n",
|
||||||
|
" axes[0, 1].bar(range(len(values)), values, alpha=0.7)\n",
|
||||||
|
" axes[0, 1].set_xticks(range(len(results)))\n",
|
||||||
|
" axes[0, 1].set_xticklabels(list(results.keys()), rotation=45)\n",
|
||||||
|
" axes[0, 1].set_title(f'{metric} Comparison')\n",
|
||||||
|
" break # Show only MAE for now\n",
|
||||||
|
" \n",
|
||||||
|
" # Scatter plot: True vs Imputed\n",
|
||||||
|
" for i, (df_imputed, method_name) in enumerate(zip(imputed_dfs[:2], methods_names[:2])):\n",
|
||||||
|
" if column in df_imputed.columns:\n",
|
||||||
|
" imputed_values = df_imputed.loc[missing_mask, column]\n",
|
||||||
|
" ax = axes[1, i]\n",
|
||||||
|
" ax.scatter(true_values, imputed_values, alpha=0.6)\n",
|
||||||
|
" ax.plot([true_values.min(), true_values.max()], \n",
|
||||||
|
" [true_values.min(), true_values.max()], 'r--', label='Perfect Prediction')\n",
|
||||||
|
" ax.set_xlabel('True Values')\n",
|
||||||
|
" ax.set_ylabel('Imputed Values')\n",
|
||||||
|
" ax.set_title(f'{method_name}: True vs Imputed')\n",
|
||||||
|
" ax.legend()\n",
|
||||||
|
" \n",
|
||||||
|
" plt.tight_layout()\n",
|
||||||
|
" plt.show()\n",
|
||||||
|
" \n",
|
||||||
|
" return results_df\n",
|
||||||
|
"\n",
|
||||||
|
"# Compare methods\n",
|
||||||
|
"comparison_results = compare_imputation_methods(\n",
|
||||||
|
" df_complete, \n",
|
||||||
|
" df_missing,\n",
|
||||||
|
" df_basic_impute,\n",
|
||||||
|
" methods_names=['Basic Fill', 'KNN', 'Iterative']\n",
|
||||||
|
")"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"## 5. Domain-Specific Imputation Strategies\n",
|
||||||
|
"\n",
|
||||||
|
"Business logic-driven approaches to missing data."
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"def business_logic_imputation(df):\n",
|
||||||
|
" \"\"\"Apply business logic for missing value imputation\"\"\"\n",
|
||||||
|
" print(\"=== BUSINESS LOGIC IMPUTATION ===\")\n",
|
||||||
|
" \n",
|
||||||
|
" df_business = df.copy()\n",
|
||||||
|
" \n",
|
||||||
|
" # 1. Income imputation based on age and education\n",
|
||||||
|
" def estimate_income(row):\n",
|
||||||
|
" if pd.notna(row['income']):\n",
|
||||||
|
" return row['income']\n",
|
||||||
|
" \n",
|
||||||
|
" # Base income estimation\n",
|
||||||
|
" base_income = 30000\n",
|
||||||
|
" \n",
|
||||||
|
" # Age factor (experience premium)\n",
|
||||||
|
" if pd.notna(row['age']):\n",
|
||||||
|
" if row['age'] > 40:\n",
|
||||||
|
" base_income *= 1.5\n",
|
||||||
|
" elif row['age'] > 30:\n",
|
||||||
|
" base_income *= 1.2\n",
|
||||||
|
" \n",
|
||||||
|
" # Education factor\n",
|
||||||
|
" if pd.notna(row['education_years']):\n",
|
||||||
|
" if row['education_years'] > 16: # Graduate degree\n",
|
||||||
|
" base_income *= 1.8\n",
|
||||||
|
" elif row['education_years'] > 12: # Bachelor's\n",
|
||||||
|
" base_income *= 1.4\n",
|
||||||
|
" \n",
|
||||||
|
" # Regional adjustment\n",
|
||||||
|
" regional_multipliers = {\n",
|
||||||
|
" 'North': 1.2, # Higher cost of living\n",
|
||||||
|
" 'South': 0.9,\n",
|
||||||
|
" 'East': 1.1,\n",
|
||||||
|
" 'West': 1.0\n",
|
||||||
|
" }\n",
|
||||||
|
" base_income *= regional_multipliers.get(row['region'], 1.0)\n",
|
||||||
|
" \n",
|
||||||
|
" return base_income\n",
|
||||||
|
" \n",
|
||||||
|
" # Apply income estimation\n",
|
||||||
|
" df_business['income'] = df_business.apply(estimate_income, axis=1)\n",
|
||||||
|
" \n",
|
||||||
|
" # 2. Satisfaction score based on purchase behavior\n",
|
||||||
|
" def estimate_satisfaction(row):\n",
|
||||||
|
" if pd.notna(row['satisfaction_score']):\n",
|
||||||
|
" return row['satisfaction_score']\n",
|
||||||
|
" \n",
|
||||||
|
" # Base satisfaction\n",
|
||||||
|
" base_satisfaction = 3 # Neutral\n",
|
||||||
|
" \n",
|
||||||
|
" # Purchase amount influence\n",
|
||||||
|
" if pd.notna(row['purchase_amount']):\n",
|
||||||
|
" if row['purchase_amount'] > 250: # High value purchase\n",
|
||||||
|
" base_satisfaction = 4\n",
|
||||||
|
" elif row['purchase_amount'] < 100: # Low value might indicate dissatisfaction\n",
|
||||||
|
" base_satisfaction = 2\n",
|
||||||
|
" \n",
|
||||||
|
" return base_satisfaction\n",
|
||||||
|
" \n",
|
||||||
|
" # Apply satisfaction estimation\n",
|
||||||
|
" df_business['satisfaction_score'] = df_business.apply(estimate_satisfaction, axis=1)\n",
|
||||||
|
" \n",
|
||||||
|
" # 3. Education years based on income and age\n",
|
||||||
|
" def estimate_education(row):\n",
|
||||||
|
" if pd.notna(row['education_years']):\n",
|
||||||
|
" return row['education_years']\n",
|
||||||
|
" \n",
|
||||||
|
" # Base education\n",
|
||||||
|
" base_education = 12 # High school\n",
|
||||||
|
" \n",
|
||||||
|
" # Income-based estimation\n",
|
||||||
|
" if pd.notna(row['income']):\n",
|
||||||
|
" if row['income'] > 70000:\n",
|
||||||
|
" base_education = 18 # Graduate level\n",
|
||||||
|
" elif row['income'] > 45000:\n",
|
||||||
|
" base_education = 16 # Bachelor's\n",
|
||||||
|
" elif row['income'] > 35000:\n",
|
||||||
|
" base_education = 14 # Some college\n",
|
||||||
|
" \n",
|
||||||
|
" # Age adjustment (older people might have different education patterns)\n",
|
||||||
|
" if pd.notna(row['age']) and row['age'] > 55:\n",
|
||||||
|
" base_education = max(12, base_education - 2) # Lower average for older generation\n",
|
||||||
|
" \n",
|
||||||
|
" return base_education\n",
|
||||||
|
" \n",
|
||||||
|
" # Apply education estimation\n",
|
||||||
|
" df_business['education_years'] = df_business.apply(estimate_education, axis=1)\n",
|
||||||
|
" \n",
|
||||||
|
" print(\"Business logic imputation completed\")\n",
|
||||||
|
" print(f\"Missing values remaining: {df_business.isnull().sum().sum()}\")\n",
|
||||||
|
" \n",
|
||||||
|
" return df_business\n",
|
||||||
|
"\n",
|
||||||
|
"# Apply business logic imputation\n",
|
||||||
|
"df_business_imputed = business_logic_imputation(df_missing)\n",
|
||||||
|
"\n",
|
||||||
|
"print(\"\\nBusiness logic imputation summary:\")\n",
|
||||||
|
"for col in ['income', 'satisfaction_score', 'education_years']:\n",
|
||||||
|
" before = df_missing[col].isnull().sum()\n",
|
||||||
|
" after = df_business_imputed[col].isnull().sum()\n",
|
||||||
|
" print(f\"{col}: {before} → {after} missing values\")"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"## 6. Missing Data Flags and Indicators\n",
|
||||||
|
"\n",
|
||||||
|
"Track which values were imputed for transparency and analysis."
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"def create_missing_indicators(df_original, df_imputed):\n",
|
||||||
|
" \"\"\"Create indicator variables for missing data\"\"\"\n",
|
||||||
|
" print(\"=== CREATING MISSING DATA INDICATORS ===\")\n",
|
||||||
|
" \n",
|
||||||
|
" df_with_indicators = df_imputed.copy()\n",
|
||||||
|
" \n",
|
||||||
|
" # Create indicator columns for each column that had missing data\n",
|
||||||
|
" columns_with_missing = df_original.columns[df_original.isnull().any()].tolist()\n",
|
||||||
|
" \n",
|
||||||
|
" for col in columns_with_missing:\n",
|
||||||
|
" indicator_col = f'{col}_was_missing'\n",
|
||||||
|
" df_with_indicators[indicator_col] = df_original[col].isnull().astype(int)\n",
|
||||||
|
" \n",
|
||||||
|
" print(f\"Created {len(columns_with_missing)} missing data indicators\")\n",
|
||||||
|
" print(f\"Indicator columns: {[f'{col}_was_missing' for col in columns_with_missing]}\")\n",
|
||||||
|
" \n",
|
||||||
|
" # Summary of missing patterns\n",
|
||||||
|
" indicator_cols = [f'{col}_was_missing' for col in columns_with_missing]\n",
|
||||||
|
" missing_patterns = df_with_indicators[indicator_cols].sum()\n",
|
||||||
|
" \n",
|
||||||
|
" print(\"\\nMissing data summary by column:\")\n",
|
||||||
|
" for col, count in missing_patterns.items():\n",
|
||||||
|
" original_col = col.replace('_was_missing', '')\n",
|
||||||
|
" percentage = (count / len(df_with_indicators)) * 100\n",
|
||||||
|
" print(f\"{original_col}: {count} values imputed ({percentage:.1f}%)\")\n",
|
||||||
|
" \n",
|
||||||
|
" # Create composite missing indicator\n",
|
||||||
|
" df_with_indicators['total_missing_count'] = df_with_indicators[indicator_cols].sum(axis=1)\n",
|
||||||
|
" df_with_indicators['has_any_missing'] = (df_with_indicators['total_missing_count'] > 0).astype(int)\n",
|
||||||
|
" \n",
|
||||||
|
" return df_with_indicators, indicator_cols\n",
|
||||||
|
"\n",
|
||||||
|
"# Create missing indicators\n",
|
||||||
|
"df_with_indicators, indicator_columns = create_missing_indicators(df_missing, df_business_imputed)\n",
|
||||||
|
"\n",
|
||||||
|
"print(\"\\nDataset with missing indicators:\")\n",
|
||||||
|
"sample_cols = ['income', 'income_was_missing', 'education_years', 'education_years_was_missing', \n",
|
||||||
|
" 'satisfaction_score', 'satisfaction_score_was_missing', 'total_missing_count']\n",
|
||||||
|
"available_cols = [col for col in sample_cols if col in df_with_indicators.columns]\n",
|
||||||
|
"print(df_with_indicators[available_cols].head(10))"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"## 7. Validation and Quality Assessment\n",
|
||||||
|
"\n",
|
||||||
|
"Validate the quality of imputation results."
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"def validate_imputation_quality(df_original, df_missing, df_imputed):\n",
|
||||||
|
" \"\"\"Validate the quality of imputation\"\"\"\n",
|
||||||
|
" print(\"=== IMPUTATION QUALITY VALIDATION ===\")\n",
|
||||||
|
" \n",
|
||||||
|
" validation_results = {}\n",
|
||||||
|
" \n",
|
||||||
|
" # Check each column that had missing data\n",
|
||||||
|
" for col in df_missing.columns:\n",
|
||||||
|
" if df_missing[col].isnull().any() and col in df_imputed.columns:\n",
|
||||||
|
" print(f\"\\n--- Validating {col} ---\")\n",
|
||||||
|
" \n",
|
||||||
|
" # Get missing mask\n",
|
||||||
|
" missing_mask = df_missing[col].isnull()\n",
|
||||||
|
" \n",
|
||||||
|
" # Original statistics (complete data)\n",
|
||||||
|
" original_stats = df_original[col].describe()\n",
|
||||||
|
" \n",
|
||||||
|
" # Imputed statistics (only imputed values)\n",
|
||||||
|
" if missing_mask.any():\n",
|
||||||
|
" imputed_values = df_imputed.loc[missing_mask, col]\n",
|
||||||
|
" \n",
|
||||||
|
" if pd.api.types.is_numeric_dtype(df_original[col]):\n",
|
||||||
|
" imputed_stats = imputed_values.describe()\n",
|
||||||
|
" \n",
|
||||||
|
" # Statistical tests\n",
|
||||||
|
" mean_diff = abs(original_stats['mean'] - imputed_stats['mean'])\n",
|
||||||
|
" std_diff = abs(original_stats['std'] - imputed_stats['std'])\n",
|
||||||
|
" \n",
|
||||||
|
" validation_results[col] = {\n",
|
||||||
|
" 'original_mean': original_stats['mean'],\n",
|
||||||
|
" 'imputed_mean': imputed_stats['mean'],\n",
|
||||||
|
" 'mean_difference': mean_diff,\n",
|
||||||
|
" 'original_std': original_stats['std'],\n",
|
||||||
|
" 'imputed_std': imputed_stats['std'],\n",
|
||||||
|
" 'std_difference': std_diff,\n",
|
||||||
|
" 'values_imputed': len(imputed_values)\n",
|
||||||
|
" }\n",
|
||||||
|
" \n",
|
||||||
|
" print(f\"Original mean: {original_stats['mean']:.2f}, Imputed mean: {imputed_stats['mean']:.2f}\")\n",
|
||||||
|
" print(f\"Mean difference: {mean_diff:.2f} ({mean_diff/original_stats['mean']*100:.1f}%)\")\n",
|
||||||
|
" print(f\"Original std: {original_stats['std']:.2f}, Imputed std: {imputed_stats['std']:.2f}\")\n",
|
||||||
|
" \n",
|
||||||
|
" else:\n",
|
||||||
|
" # Categorical data\n",
|
||||||
|
" original_dist = df_original[col].value_counts(normalize=True)\n",
|
||||||
|
" imputed_dist = imputed_values.value_counts(normalize=True)\n",
|
||||||
|
" print(f\"Original distribution: {original_dist.to_dict()}\")\n",
|
||||||
|
" print(f\"Imputed distribution: {imputed_dist.to_dict()}\")\n",
|
||||||
|
" \n",
|
||||||
|
" # Overall validation summary\n",
|
||||||
|
" if validation_results:\n",
|
||||||
|
" validation_df = pd.DataFrame(validation_results).T\n",
|
||||||
|
" print(\"\\n=== VALIDATION SUMMARY ===\")\n",
|
||||||
|
" print(validation_df.round(3))\n",
|
||||||
|
" \n",
|
||||||
|
" # Flag potential issues\n",
|
||||||
|
" print(\"\\n--- Potential Issues ---\")\n",
|
||||||
|
" for col, stats in validation_results.items():\n",
|
||||||
|
" mean_change = abs(stats['mean_difference'] / stats['original_mean']) * 100\n",
|
||||||
|
" if mean_change > 10: # More than 10% change in mean\n",
|
||||||
|
" print(f\"⚠️ {col}: Large mean change ({mean_change:.1f}%)\")\n",
|
||||||
|
" \n",
|
||||||
|
" std_change = abs(stats['std_difference'] / stats['original_std']) * 100\n",
|
||||||
|
" if std_change > 20: # More than 20% change in std\n",
|
||||||
|
" print(f\"⚠️ {col}: Large variance change ({std_change:.1f}%)\")\n",
|
||||||
|
" \n",
|
||||||
|
" return validation_results\n",
|
||||||
|
"\n",
|
||||||
|
"# Validate imputation quality\n",
|
||||||
|
"validation_results = validate_imputation_quality(df_complete, df_missing, df_business_imputed)"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"## Practice Exercises\n",
|
||||||
|
"\n",
|
||||||
|
"Apply missing data handling techniques to challenging scenarios:"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": 53,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"# Exercise 1: Multi-step imputation strategy\n",
|
||||||
|
"# Create a sophisticated imputation pipeline that:\n",
|
||||||
|
"# 1. Handles different types of missing data appropriately\n",
|
||||||
|
"# 2. Uses multiple imputation methods in sequence\n",
|
||||||
|
"# 3. Validates results at each step\n",
|
||||||
|
"# 4. Creates comprehensive documentation\n",
|
||||||
|
"\n",
|
||||||
|
"def comprehensive_imputation_pipeline(df):\n",
|
||||||
|
" \"\"\"Comprehensive missing data handling pipeline\"\"\"\n",
|
||||||
|
" # Your implementation here\n",
|
||||||
|
" pass\n",
|
||||||
|
"\n",
|
||||||
|
"# result_df = comprehensive_imputation_pipeline(df_missing)\n",
|
||||||
|
"# print(\"Comprehensive pipeline results:\")\n",
|
||||||
|
"# print(result_df.isnull().sum())"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": 54,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"# Exercise 2: Missing data pattern analysis\n",
|
||||||
|
"# Analyze if missing data follows specific patterns:\n",
|
||||||
|
"# - Time-based patterns\n",
|
||||||
|
"# - User behavior patterns\n",
|
||||||
|
"# - System/technical patterns\n",
|
||||||
|
"# Create insights and recommendations\n",
|
||||||
|
"\n",
|
||||||
|
"# Your code here:\n"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": 55,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"# Exercise 3: Impact assessment\n",
|
||||||
|
"# Assess how different missing data handling approaches\n",
|
||||||
|
"# affect downstream analysis:\n",
|
||||||
|
"# - Statistical analysis results\n",
|
||||||
|
"# - Machine learning model performance\n",
|
||||||
|
"# - Business insights and decisions\n",
|
||||||
|
"\n",
|
||||||
|
"# Your code here:\n"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"## Key Takeaways\n",
|
||||||
|
"\n",
|
||||||
|
"1. **Understanding Missing Data Types**:\n",
|
||||||
|
" - **MCAR**: Missing Completely at Random\n",
|
||||||
|
" - **MAR**: Missing at Random (depends on observed data)\n",
|
||||||
|
" - **MNAR**: Missing Not at Random (depends on unobserved data)\n",
|
||||||
|
"\n",
|
||||||
|
"2. **Detection and Analysis**:\n",
|
||||||
|
" - Always analyze missing patterns before imputation\n",
|
||||||
|
" - Use visualizations to understand missing data structure\n",
|
||||||
|
" - Look for relationships between missing values and other variables\n",
|
||||||
|
"\n",
|
||||||
|
"3. **Handling Strategies**:\n",
|
||||||
|
" - **Deletion**: Simple but can lose valuable information\n",
|
||||||
|
" - **Simple Imputation**: Fast but may not preserve relationships\n",
|
||||||
|
" - **Advanced Methods**: KNN, MICE preserve more complex relationships\n",
|
||||||
|
" - **Business Logic**: Domain knowledge often provides best results\n",
|
||||||
|
"\n",
|
||||||
|
"4. **Best Practices**:\n",
|
||||||
|
" - Create missing data indicators for transparency\n",
|
||||||
|
" - Validate imputation quality against original data when possible\n",
|
||||||
|
" - Consider the impact on downstream analysis\n",
|
||||||
|
" - Document all imputation decisions and methods\n",
|
||||||
|
"\n",
|
||||||
|
"## Method Selection Guide\n",
|
||||||
|
"\n",
|
||||||
|
"| Scenario | Recommended Method | Rationale |\n",
|
||||||
|
"|----------|-------------------|----------|\n",
|
||||||
|
"| < 5% missing, MCAR | Simple imputation | Low impact, efficiency |\n",
|
||||||
|
"| 5-20% missing, MAR | KNN or Group-based | Preserve relationships |\n",
|
||||||
|
"| > 20% missing, complex patterns | MICE or Multiple imputation | Handle complex dependencies |\n",
|
||||||
|
"| Business-critical decisions | Domain knowledge + validation | Accuracy and explainability |\n",
|
||||||
|
"| Machine learning features | Advanced methods + indicators | Preserve predictive power |\n",
|
||||||
|
"\n",
|
||||||
|
"## Common Pitfalls to Avoid\n",
|
||||||
|
"\n",
|
||||||
|
"1. **Data Leakage**: Don't use future information to impute past values\n",
|
||||||
|
"2. **Ignoring Patterns**: Missing data often has meaningful patterns\n",
|
||||||
|
"3. **Over-imputation**: Sometimes missing data is informative itself\n",
|
||||||
|
"4. **One-size-fits-all**: Different columns may need different strategies\n",
|
||||||
|
"5. **No Validation**: Always check if imputation preserved data characteristics"
|
||||||
|
]
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"metadata": {
|
||||||
|
"kernelspec": {
|
||||||
|
"display_name": "venv",
|
||||||
|
"language": "python",
|
||||||
|
"name": "python3"
|
||||||
|
},
|
||||||
|
"language_info": {
|
||||||
|
"codemirror_mode": {
|
||||||
|
"name": "ipython",
|
||||||
|
"version": 3
|
||||||
|
},
|
||||||
|
"file_extension": ".py",
|
||||||
|
"mimetype": "text/x-python",
|
||||||
|
"name": "python",
|
||||||
|
"nbconvert_exporter": "python",
|
||||||
|
"pygments_lexer": "ipython3",
|
||||||
|
"version": "3.13.3"
|
||||||
|
}
|
||||||
|
},
|
||||||
|
"nbformat": 4,
|
||||||
|
"nbformat_minor": 4
|
||||||
|
}
|
937
Session_01/PandasDataFrame-exmples/07_merging_joining.ipynb
Executable file
|
@ -0,0 +1,937 @@
|
||||||
|
{
|
||||||
|
"cells": [
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"# Session 1 - DataFrames - Lesson 7: Merging and Joining DataFrames\n",
|
||||||
|
"\n",
|
||||||
|
"## Learning Objectives\n",
|
||||||
|
"- Master different types of joins (inner, outer, left, right)\n",
|
||||||
|
"- Understand when to use merge vs join vs concat\n",
|
||||||
|
"- Handle duplicate keys and join conflicts\n",
|
||||||
|
"- Learn advanced merging techniques and best practices\n",
|
||||||
|
"- Practice with real-world data integration scenarios\n",
|
||||||
|
"\n",
|
||||||
|
"## Prerequisites\n",
|
||||||
|
"- Completed Lessons 1-6\n",
|
||||||
|
"- Understanding of relational database concepts (helpful)\n",
|
||||||
|
"- Basic knowledge of SQL joins (helpful but not required)"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"# Import required libraries\n",
|
||||||
|
"import pandas as pd\n",
|
||||||
|
"import numpy as np\n",
|
||||||
|
"from datetime import datetime, timedelta\n",
|
||||||
|
"import matplotlib.pyplot as plt\n",
|
||||||
|
"import warnings\n",
|
||||||
|
"warnings.filterwarnings('ignore')\n",
|
||||||
|
"\n",
|
||||||
|
"# Set display options\n",
|
||||||
|
"pd.set_option('display.max_columns', None)\n",
|
||||||
|
"pd.set_option('display.max_rows', 50)\n",
|
||||||
|
"\n",
|
||||||
|
"print(\"Libraries loaded successfully!\")"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"## Creating Sample Datasets\n",
|
||||||
|
"\n",
|
||||||
|
"Let's create realistic datasets that represent common business scenarios."
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"# Create sample datasets for merging examples\n",
|
||||||
|
"np.random.seed(42)\n",
|
||||||
|
"\n",
|
||||||
|
"# Customer dataset\n",
|
||||||
|
"customers_data = {\n",
|
||||||
|
" 'customer_id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],\n",
|
||||||
|
" 'customer_name': ['Alice Johnson', 'Bob Smith', 'Charlie Brown', 'Diana Prince', 'Eve Wilson',\n",
|
||||||
|
" 'Frank Miller', 'Grace Lee', 'Henry Davis', 'Ivy Chen', 'Jack Robinson'],\n",
|
||||||
|
" 'email': ['alice@email.com', 'bob@email.com', 'charlie@email.com', 'diana@email.com', 'eve@email.com',\n",
|
||||||
|
" 'frank@email.com', 'grace@email.com', 'henry@email.com', 'ivy@email.com', 'jack@email.com'],\n",
|
||||||
|
" 'age': [28, 35, 42, 31, 29, 45, 38, 33, 27, 41],\n",
|
||||||
|
" 'city': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix',\n",
|
||||||
|
" 'Philadelphia', 'San Antonio', 'San Diego', 'Dallas', 'San Jose'],\n",
|
||||||
|
" 'signup_date': pd.date_range('2023-01-01', periods=10, freq='M')\n",
|
||||||
|
"}\n",
|
||||||
|
"\n",
|
||||||
|
"df_customers = pd.DataFrame(customers_data)\n",
|
||||||
|
"\n",
|
||||||
|
"# Orders dataset\n",
|
||||||
|
"orders_data = {\n",
|
||||||
|
" 'order_id': [101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112],\n",
|
||||||
|
" 'customer_id': [1, 2, 1, 3, 4, 2, 5, 1, 6, 11, 3, 2], # Note: customer_id 11 doesn't exist in customers\n",
|
||||||
|
" 'order_date': pd.date_range('2023-06-01', periods=12, freq='W'),\n",
|
||||||
|
" 'product': ['Laptop', 'Phone', 'Tablet', 'Laptop', 'Monitor', 'Phone', \n",
|
||||||
|
" 'Headphones', 'Mouse', 'Keyboard', 'Laptop', 'Tablet', 'Monitor'],\n",
|
||||||
|
" 'quantity': [1, 2, 1, 1, 1, 1, 3, 2, 1, 1, 2, 1],\n",
|
||||||
|
" 'amount': [1200, 800, 400, 1200, 300, 800, 150, 50, 75, 1200, 800, 300]\n",
|
||||||
|
"}\n",
|
||||||
|
"\n",
|
||||||
|
"df_orders = pd.DataFrame(orders_data)\n",
|
||||||
|
"\n",
|
||||||
|
"# Product information dataset\n",
|
||||||
|
"products_data = {\n",
|
||||||
|
" 'product': ['Laptop', 'Phone', 'Tablet', 'Monitor', 'Headphones', 'Mouse', 'Keyboard', 'Webcam'],\n",
|
||||||
|
" 'category': ['Electronics', 'Electronics', 'Electronics', 'Electronics', \n",
|
||||||
|
" 'Audio', 'Accessories', 'Accessories', 'Electronics'],\n",
|
||||||
|
" 'price': [1200, 800, 400, 300, 150, 50, 75, 100],\n",
|
||||||
|
" 'supplier': ['TechCorp', 'MobileCorp', 'TechCorp', 'DisplayCorp', \n",
|
||||||
|
" 'AudioCorp', 'AccessoryCorp', 'AccessoryCorp', 'TechCorp']\n",
|
||||||
|
"}\n",
|
||||||
|
"\n",
|
||||||
|
"df_products = pd.DataFrame(products_data)\n",
|
||||||
|
"\n",
|
||||||
|
"# Customer segments dataset\n",
|
||||||
|
"segments_data = {\n",
|
||||||
|
" 'customer_id': [1, 2, 3, 4, 5, 6, 7, 8, 12, 13], # Some customers not in main customer table\n",
|
||||||
|
" 'segment': ['Premium', 'Standard', 'Premium', 'Standard', 'Basic', \n",
|
||||||
|
" 'Premium', 'Standard', 'Basic', 'Premium', 'Standard'],\n",
|
||||||
|
" 'loyalty_points': [1500, 800, 1200, 600, 200, 1800, 750, 300, 2000, 900]\n",
|
||||||
|
"}\n",
|
||||||
|
"\n",
|
||||||
|
"df_segments = pd.DataFrame(segments_data)\n",
|
||||||
|
"\n",
|
||||||
|
"print(\"Sample datasets created:\")\n",
|
||||||
|
"print(f\"Customers: {df_customers.shape}\")\n",
|
||||||
|
"print(f\"Orders: {df_orders.shape}\")\n",
|
||||||
|
"print(f\"Products: {df_products.shape}\")\n",
|
||||||
|
"print(f\"Segments: {df_segments.shape}\")\n",
|
||||||
|
"\n",
|
||||||
|
"print(\"\\nCustomers dataset:\")\n",
|
||||||
|
"print(df_customers.head())\n",
|
||||||
|
"\n",
|
||||||
|
"print(\"\\nOrders dataset:\")\n",
|
||||||
|
"print(df_orders.head())"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"## 1. Basic Merge Operations\n",
|
||||||
|
"\n",
|
||||||
|
"Understanding the fundamental merge operations and join types."
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"# Inner Join - only matching records\n",
|
||||||
|
"print(\"=== INNER JOIN ===\")\n",
|
||||||
|
"inner_join = pd.merge(df_customers, df_orders, on='customer_id', how='inner')\n",
|
||||||
|
"print(f\"Result shape: {inner_join.shape}\")\n",
|
||||||
|
"print(\"Sample results:\")\n",
|
||||||
|
"print(inner_join[['customer_name', 'order_id', 'product', 'amount']].head())\n",
|
||||||
|
"\n",
|
||||||
|
"print(f\"\\nUnique customers in result: {inner_join['customer_id'].nunique()}\")\n",
|
||||||
|
"print(f\"Total orders: {len(inner_join)}\")\n",
|
||||||
|
"\n",
|
||||||
|
"# Check which customers have orders\n",
|
||||||
|
"customers_with_orders = inner_join['customer_id'].unique()\n",
|
||||||
|
"print(f\"Customers with orders: {sorted(customers_with_orders)}\")"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"# Left Join - all records from left table\n",
|
||||||
|
"print(\"=== LEFT JOIN ===\")\n",
|
||||||
|
"left_join = pd.merge(df_customers, df_orders, on='customer_id', how='left')\n",
|
||||||
|
"print(f\"Result shape: {left_join.shape}\")\n",
|
||||||
|
"print(\"Sample results:\")\n",
|
||||||
|
"print(left_join[['customer_name', 'order_id', 'product', 'amount']].head(10))\n",
|
||||||
|
"\n",
|
||||||
|
"# Check customers without orders\n",
|
||||||
|
"customers_without_orders = left_join[left_join['order_id'].isnull()]['customer_name'].tolist()\n",
|
||||||
|
"print(f\"\\nCustomers without orders: {customers_without_orders}\")\n",
|
||||||
|
"\n",
|
||||||
|
"# Summary statistics\n",
|
||||||
|
"print(f\"\\nTotal records: {len(left_join)}\")\n",
|
||||||
|
"print(f\"Records with orders: {left_join['order_id'].notna().sum()}\")\n",
|
||||||
|
"print(f\"Records without orders: {left_join['order_id'].isnull().sum()}\")"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"# Right Join - all records from right table\n",
|
||||||
|
"print(\"=== RIGHT JOIN ===\")\n",
|
||||||
|
"right_join = pd.merge(df_customers, df_orders, on='customer_id', how='right')\n",
|
||||||
|
"print(f\"Result shape: {right_join.shape}\")\n",
|
||||||
|
"print(\"Sample results:\")\n",
|
||||||
|
"print(right_join[['customer_name', 'order_id', 'product', 'amount']].head())\n",
|
||||||
|
"\n",
|
||||||
|
"# Check orders without customer information\n",
|
||||||
|
"orders_without_customers = right_join[right_join['customer_name'].isnull()]\n",
|
||||||
|
"print(f\"\\nOrders without customer info: {len(orders_without_customers)}\")\n",
|
||||||
|
"if len(orders_without_customers) > 0:\n",
|
||||||
|
" print(orders_without_customers[['customer_id', 'order_id', 'product', 'amount']])"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"# Outer Join - all records from both tables\n",
|
||||||
|
"print(\"=== OUTER JOIN ===\")\n",
|
||||||
|
"outer_join = pd.merge(df_customers, df_orders, on='customer_id', how='outer')\n",
|
||||||
|
"print(f\"Result shape: {outer_join.shape}\")\n",
|
||||||
|
"\n",
|
||||||
|
"# Analyze the result\n",
|
||||||
|
"print(\"\\nData quality analysis:\")\n",
|
||||||
|
"print(f\"Records with complete customer info: {outer_join['customer_name'].notna().sum()}\")\n",
|
||||||
|
"print(f\"Records with complete order info: {outer_join['order_id'].notna().sum()}\")\n",
|
||||||
|
"print(f\"Records with both customer and order info: {(outer_join['customer_name'].notna() & outer_join['order_id'].notna()).sum()}\")\n",
|
||||||
|
"\n",
|
||||||
|
"# Show different categories of records\n",
|
||||||
|
"print(\"\\nCustomers without orders:\")\n",
|
||||||
|
"customers_only = outer_join[(outer_join['customer_name'].notna()) & (outer_join['order_id'].isnull())]\n",
|
||||||
|
"print(customers_only[['customer_name', 'city']].drop_duplicates())\n",
|
||||||
|
"\n",
|
||||||
|
"print(\"\\nOrders without customer data:\")\n",
|
||||||
|
"orders_only = outer_join[(outer_join['customer_name'].isnull()) & (outer_join['order_id'].notna())]\n",
|
||||||
|
"print(orders_only[['customer_id', 'order_id', 'product', 'amount']])"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"## 2. Multiple Table Joins\n",
|
||||||
|
"\n",
|
||||||
|
"Combining data from multiple sources in sequence."
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"# Three-way join: Customers + Orders + Products\n",
|
||||||
|
"print(\"=== THREE-WAY JOIN ===\")\n",
|
||||||
|
"\n",
|
||||||
|
"# Step 1: Join customers and orders\n",
|
||||||
|
"customer_orders = pd.merge(df_customers, df_orders, on='customer_id', how='inner')\n",
|
||||||
|
"print(f\"After joining customers and orders: {customer_orders.shape}\")\n",
|
||||||
|
"\n",
|
||||||
|
"# Step 2: Join with products\n",
|
||||||
|
"complete_data = pd.merge(customer_orders, df_products, on='product', how='left')\n",
|
||||||
|
"print(f\"After joining with products: {complete_data.shape}\")\n",
|
||||||
|
"\n",
|
||||||
|
"# Display comprehensive view\n",
|
||||||
|
"print(\"\\nComplete order information:\")\n",
|
||||||
|
"display_cols = ['customer_name', 'order_id', 'product', 'category', 'quantity', 'amount', 'price', 'supplier']\n",
|
||||||
|
"print(complete_data[display_cols].head())\n",
|
||||||
|
"\n",
|
||||||
|
"# Verify data consistency\n",
|
||||||
|
"print(\"\\nData consistency check:\")\n",
|
||||||
|
"# Check if order amount matches product price * quantity\n",
|
||||||
|
"complete_data['calculated_amount'] = complete_data['price'] * complete_data['quantity']\n",
|
||||||
|
"amount_matches = (complete_data['amount'] == complete_data['calculated_amount']).all()\n",
|
||||||
|
"print(f\"Order amounts match calculated amounts: {amount_matches}\")\n",
|
||||||
|
"\n",
|
||||||
|
"if not amount_matches:\n",
|
||||||
|
" mismatched = complete_data[complete_data['amount'] != complete_data['calculated_amount']]\n",
|
||||||
|
" print(f\"\\nMismatched records: {len(mismatched)}\")\n",
|
||||||
|
" print(mismatched[['order_id', 'product', 'amount', 'calculated_amount']])"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"# Add customer segment information\n",
|
||||||
|
"print(\"=== ADDING CUSTOMER SEGMENTS ===\")\n",
|
||||||
|
"\n",
|
||||||
|
"# Join with segments (left join to keep all customers)\n",
|
||||||
|
"customers_with_segments = pd.merge(df_customers, df_segments, on='customer_id', how='left')\n",
|
||||||
|
"print(f\"Customers with segments shape: {customers_with_segments.shape}\")\n",
|
||||||
|
"\n",
|
||||||
|
"# Check which customers don't have segment information\n",
|
||||||
|
"missing_segments = customers_with_segments[customers_with_segments['segment'].isnull()]\n",
|
||||||
|
"print(f\"\\nCustomers without segment info: {len(missing_segments)}\")\n",
|
||||||
|
"if len(missing_segments) > 0:\n",
|
||||||
|
" print(missing_segments[['customer_name', 'city']])\n",
|
||||||
|
"\n",
|
||||||
|
"# Create comprehensive customer profile\n",
|
||||||
|
"full_customer_profile = pd.merge(complete_data, df_segments, on='customer_id', how='left')\n",
|
||||||
|
"print(f\"\\nFull customer profile shape: {full_customer_profile.shape}\")\n",
|
||||||
|
"\n",
|
||||||
|
"# Analyze by segment\n",
|
||||||
|
"segment_analysis = full_customer_profile.groupby('segment').agg({\n",
|
||||||
|
" 'amount': ['sum', 'mean', 'count'],\n",
|
||||||
|
" 'customer_id': 'nunique'\n",
|
||||||
|
"}).round(2)\n",
|
||||||
|
"segment_analysis.columns = ['Total_Revenue', 'Avg_Order_Value', 'Total_Orders', 'Unique_Customers']\n",
|
||||||
|
"print(\"\\nRevenue by customer segment:\")\n",
|
||||||
|
"print(segment_analysis)"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"## 3. Advanced Merge Techniques\n",
|
||||||
|
"\n",
|
||||||
|
"Handling complex merging scenarios and edge cases."
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"# Merge with different column names\n",
|
||||||
|
"print(\"=== MERGE WITH DIFFERENT COLUMN NAMES ===\")\n",
|
||||||
|
"\n",
|
||||||
|
"# Create a dataset with different column name\n",
|
||||||
|
"customer_demographics = pd.DataFrame({\n",
|
||||||
|
" 'cust_id': [1, 2, 3, 4, 5],\n",
|
||||||
|
" 'income_range': ['50-75k', '75-100k', '50-75k', '100k+', '25-50k'],\n",
|
||||||
|
" 'education': ['Bachelor', 'Master', 'PhD', 'Master', 'Bachelor'],\n",
|
||||||
|
" 'occupation': ['Engineer', 'Manager', 'Professor', 'Director', 'Analyst']\n",
|
||||||
|
"})\n",
|
||||||
|
"\n",
|
||||||
|
"# Merge using left_on and right_on parameters\n",
|
||||||
|
"customers_with_demographics = pd.merge(\n",
|
||||||
|
" df_customers, \n",
|
||||||
|
" customer_demographics, \n",
|
||||||
|
" left_on='customer_id', \n",
|
||||||
|
" right_on='cust_id', \n",
|
||||||
|
" how='left'\n",
|
||||||
|
")\n",
|
||||||
|
"\n",
|
||||||
|
"print(\"Merge with different column names:\")\n",
|
||||||
|
"print(customers_with_demographics[['customer_name', 'customer_id', 'cust_id', 'income_range', 'education']].head())\n",
|
||||||
|
"\n",
|
||||||
|
"# Clean up duplicate columns\n",
|
||||||
|
"customers_with_demographics = customers_with_demographics.drop('cust_id', axis=1)\n",
|
||||||
|
"print(f\"\\nAfter cleanup: {customers_with_demographics.shape}\")"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"# Merge on multiple columns\n",
|
||||||
|
"print(\"=== MERGE ON MULTIPLE COLUMNS ===\")\n",
|
||||||
|
"\n",
|
||||||
|
"# Create time-based pricing data\n",
|
||||||
|
"pricing_data = pd.DataFrame({\n",
|
||||||
|
" 'product': ['Laptop', 'Laptop', 'Phone', 'Phone', 'Tablet', 'Tablet'],\n",
|
||||||
|
" 'date': pd.to_datetime(['2023-06-01', '2023-08-01', '2023-06-01', '2023-08-01', '2023-06-01', '2023-08-01']),\n",
|
||||||
|
" 'price': [1200, 1100, 800, 750, 400, 380],\n",
|
||||||
|
" 'promotion': [False, True, False, True, False, True]\n",
|
||||||
|
"})\n",
|
||||||
|
"\n",
|
||||||
|
"# Add year-month to orders for matching\n",
|
||||||
|
"df_orders_with_period = df_orders.copy()\n",
|
||||||
|
"df_orders_with_period['order_month'] = df_orders_with_period['order_date'].dt.to_period('M').dt.start_time\n",
|
||||||
|
"\n",
|
||||||
|
"# Create matching periods in pricing data\n",
|
||||||
|
"pricing_data['period'] = pricing_data['date'].dt.to_period('M').dt.start_time\n",
|
||||||
|
"\n",
|
||||||
|
"# Merge on product and time period\n",
|
||||||
|
"orders_with_pricing = pd.merge(\n",
|
||||||
|
" df_orders_with_period,\n",
|
||||||
|
" pricing_data,\n",
|
||||||
|
" left_on=['product', 'order_month'],\n",
|
||||||
|
" right_on=['product', 'period'],\n",
|
||||||
|
" how='left'\n",
|
||||||
|
")\n",
|
||||||
|
"\n",
|
||||||
|
"print(\"Orders with time-based pricing:\")\n",
|
||||||
|
"print(orders_with_pricing[['order_id', 'product', 'order_date', 'amount', 'price', 'promotion']].head())\n",
|
||||||
|
"\n",
|
||||||
|
"# Check for pricing discrepancies\n",
|
||||||
|
"pricing_discrepancies = orders_with_pricing[\n",
|
||||||
|
" (orders_with_pricing['amount'] != orders_with_pricing['price'] * orders_with_pricing['quantity']) &\n",
|
||||||
|
" orders_with_pricing['price'].notna()\n",
|
||||||
|
"]\n",
|
||||||
|
"print(f\"\\nOrders with pricing discrepancies: {len(pricing_discrepancies)}\")"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"# Handling duplicate keys in merge\n",
|
||||||
|
"print(\"=== HANDLING DUPLICATE KEYS ===\")\n",
|
||||||
|
"\n",
|
||||||
|
"# Create data with duplicate keys\n",
|
||||||
|
"customer_contacts = pd.DataFrame({\n",
|
||||||
|
" 'customer_id': [1, 1, 2, 2, 3],\n",
|
||||||
|
" 'contact_type': ['email', 'phone', 'email', 'phone', 'email'],\n",
|
||||||
|
" 'contact_value': ['alice@email.com', '555-0101', 'bob@email.com', '555-0102', 'charlie@email.com'],\n",
|
||||||
|
" 'is_primary': [True, False, True, True, True]\n",
|
||||||
|
"})\n",
|
||||||
|
"\n",
|
||||||
|
"print(\"Customer contacts with duplicates:\")\n",
|
||||||
|
"print(customer_contacts)\n",
|
||||||
|
"\n",
|
||||||
|
"# Merge will create cartesian product for duplicate keys\n",
|
||||||
|
"customers_with_contacts = pd.merge(df_customers, customer_contacts, on='customer_id', how='inner')\n",
|
||||||
|
"print(f\"\\nResult of merge with duplicates: {customers_with_contacts.shape}\")\n",
|
||||||
|
"print(customers_with_contacts[['customer_name', 'contact_type', 'contact_value', 'is_primary']].head())\n",
|
||||||
|
"\n",
|
||||||
|
"# Strategy 1: Filter before merge\n",
|
||||||
|
"primary_contacts = customer_contacts[customer_contacts['is_primary'] == True]\n",
|
||||||
|
"customers_primary_contacts = pd.merge(df_customers, primary_contacts, on='customer_id', how='left')\n",
|
||||||
|
"print(f\"\\nAfter filtering to primary contacts: {customers_primary_contacts.shape}\")\n",
|
||||||
|
"\n",
|
||||||
|
"# Strategy 2: Pivot contacts to columns\n",
|
||||||
|
"contacts_pivoted = customer_contacts.pivot_table(\n",
|
||||||
|
" index='customer_id',\n",
|
||||||
|
" columns='contact_type',\n",
|
||||||
|
" values='contact_value',\n",
|
||||||
|
" aggfunc='first'\n",
|
||||||
|
").reset_index()\n",
|
||||||
|
"print(\"\\nPivoted contacts:\")\n",
|
||||||
|
"print(contacts_pivoted)\n",
|
||||||
|
"\n",
|
||||||
|
"customers_with_pivoted_contacts = pd.merge(df_customers, contacts_pivoted, on='customer_id', how='left')\n",
|
||||||
|
"print(f\"\\nAfter merging pivoted contacts: {customers_with_pivoted_contacts.shape}\")"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"## 4. Index-based Joins\n",
|
||||||
|
"\n",
|
||||||
|
"Using DataFrame indices for joining operations."
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"# Set up DataFrames with indices\n",
|
||||||
|
"print(\"=== INDEX-BASED JOINS ===\")\n",
|
||||||
|
"\n",
|
||||||
|
"# Set customer_id as index\n",
|
||||||
|
"customers_indexed = df_customers.set_index('customer_id')\n",
|
||||||
|
"segments_indexed = df_segments.set_index('customer_id')\n",
|
||||||
|
"\n",
|
||||||
|
"print(\"Customers with index:\")\n",
|
||||||
|
"print(customers_indexed.head())\n",
|
||||||
|
"\n",
|
||||||
|
"# Join using indices\n",
|
||||||
|
"joined_by_index = customers_indexed.join(segments_indexed, how='left')\n",
|
||||||
|
"print(f\"\\nJoined by index shape: {joined_by_index.shape}\")\n",
|
||||||
|
"print(joined_by_index[['customer_name', 'city', 'segment', 'loyalty_points']].head())\n",
|
||||||
|
"\n",
|
||||||
|
"# Compare with merge\n",
|
||||||
|
"merged_equivalent = pd.merge(df_customers, df_segments, on='customer_id', how='left')\n",
|
||||||
|
"print(f\"\\nEquivalent merge shape: {merged_equivalent.shape}\")\n",
|
||||||
|
"\n",
|
||||||
|
"# Verify they're the same (after sorting)\n",
|
||||||
|
"joined_sorted = joined_by_index.reset_index().sort_values('customer_id')\n",
|
||||||
|
"merged_sorted = merged_equivalent.sort_values('customer_id')\n",
|
||||||
|
"are_equal = joined_sorted.equals(merged_sorted)\n",
|
||||||
|
"print(f\"Results are identical: {are_equal}\")"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"# Multi-index joins\n",
|
||||||
|
"print(\"=== MULTI-INDEX JOINS ===\")\n",
|
||||||
|
"\n",
|
||||||
|
"# Create a dataset with multiple index levels\n",
|
||||||
|
"sales_by_region_product = pd.DataFrame({\n",
|
||||||
|
" 'region': ['North', 'North', 'South', 'South', 'East', 'East'],\n",
|
||||||
|
" 'product': ['Laptop', 'Phone', 'Laptop', 'Phone', 'Laptop', 'Phone'],\n",
|
||||||
|
" 'sales_target': [10, 15, 8, 12, 12, 18],\n",
|
||||||
|
" 'commission_rate': [0.05, 0.04, 0.06, 0.05, 0.05, 0.04]\n",
|
||||||
|
"})\n",
|
||||||
|
"\n",
|
||||||
|
"# Set multi-index\n",
|
||||||
|
"sales_targets = sales_by_region_product.set_index(['region', 'product'])\n",
|
||||||
|
"print(\"Sales targets with multi-index:\")\n",
|
||||||
|
"print(sales_targets)\n",
|
||||||
|
"\n",
|
||||||
|
"# Create customer orders with region mapping\n",
|
||||||
|
"customer_regions = {\n",
|
||||||
|
" 1: 'North', 2: 'South', 3: 'East', 4: 'North', 5: 'South', 6: 'East'\n",
|
||||||
|
"}\n",
|
||||||
|
"\n",
|
||||||
|
"orders_with_region = df_orders.copy()\n",
|
||||||
|
"orders_with_region['region'] = orders_with_region['customer_id'].map(customer_regions)\n",
|
||||||
|
"orders_with_region = orders_with_region.dropna(subset=['region'])\n",
|
||||||
|
"\n",
|
||||||
|
"# Merge on multiple columns to match multi-index\n",
|
||||||
|
"orders_with_targets = pd.merge(\n",
|
||||||
|
" orders_with_region,\n",
|
||||||
|
" sales_targets.reset_index(),\n",
|
||||||
|
" on=['region', 'product'],\n",
|
||||||
|
" how='left'\n",
|
||||||
|
")\n",
|
||||||
|
"\n",
|
||||||
|
"print(\"\\nOrders with sales targets:\")\n",
|
||||||
|
"print(orders_with_targets[['order_id', 'region', 'product', 'amount', 'sales_target', 'commission_rate']].head())"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"## 5. Concatenation Operations\n",
|
||||||
|
"\n",
|
||||||
|
"Combining DataFrames vertically and horizontally."
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"# Vertical concatenation (stacking DataFrames)\n",
|
||||||
|
"print(\"=== VERTICAL CONCATENATION ===\")\n",
|
||||||
|
"\n",
|
||||||
|
"# Create additional customer data (new batch)\n",
|
||||||
|
"new_customers = pd.DataFrame({\n",
|
||||||
|
" 'customer_id': [11, 12, 13, 14, 15],\n",
|
||||||
|
" 'customer_name': ['Kate Wilson', 'Liam Brown', 'Mia Garcia', 'Noah Jones', 'Olivia Miller'],\n",
|
||||||
|
" 'email': ['kate@email.com', 'liam@email.com', 'mia@email.com', 'noah@email.com', 'olivia@email.com'],\n",
|
||||||
|
" 'age': [26, 39, 31, 44, 28],\n",
|
||||||
|
" 'city': ['Austin', 'Seattle', 'Denver', 'Boston', 'Miami'],\n",
|
||||||
|
" 'signup_date': pd.date_range('2024-01-01', periods=5, freq='M')\n",
|
||||||
|
"})\n",
|
||||||
|
"\n",
|
||||||
|
"# Concatenate vertically\n",
|
||||||
|
"all_customers = pd.concat([df_customers, new_customers], ignore_index=True)\n",
|
||||||
|
"print(f\"Original customers: {len(df_customers)}\")\n",
|
||||||
|
"print(f\"New customers: {len(new_customers)}\")\n",
|
||||||
|
"print(f\"Combined customers: {len(all_customers)}\")\n",
|
||||||
|
"\n",
|
||||||
|
"print(\"\\nCombined customer data:\")\n",
|
||||||
|
"print(all_customers.tail())\n",
|
||||||
|
"\n",
|
||||||
|
"# Concatenation with different columns\n",
|
||||||
|
"customers_with_extra_info = pd.DataFrame({\n",
|
||||||
|
" 'customer_id': [16, 17],\n",
|
||||||
|
" 'customer_name': ['Paul Davis', 'Quinn Taylor'],\n",
|
||||||
|
" 'email': ['paul@email.com', 'quinn@email.com'],\n",
|
||||||
|
" 'age': [35, 29],\n",
|
||||||
|
" 'city': ['Portland', 'Nashville'],\n",
|
||||||
|
" 'signup_date': pd.date_range('2024-06-01', periods=2, freq='M'),\n",
|
||||||
|
" 'referral_source': ['Google', 'Facebook'] # Extra column\n",
|
||||||
|
"})\n",
|
||||||
|
"\n",
|
||||||
|
"# Concat with different columns (creates NaN for missing columns)\n",
|
||||||
|
"all_customers_extended = pd.concat([all_customers, customers_with_extra_info], ignore_index=True, sort=False)\n",
|
||||||
|
"print(f\"\\nAfter adding customers with extra info: {all_customers_extended.shape}\")\n",
|
||||||
|
"print(\"Missing values in referral_source:\")\n",
|
||||||
|
"print(all_customers_extended['referral_source'].isnull().sum())"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"# Horizontal concatenation\n",
|
||||||
|
"print(\"=== HORIZONTAL CONCATENATION ===\")\n",
|
||||||
|
"\n",
|
||||||
|
"# Split customer data into parts\n",
|
||||||
|
"customer_basic_info = df_customers[['customer_id', 'customer_name', 'email']]\n",
|
||||||
|
"customer_demographics = df_customers[['customer_id', 'age', 'city', 'signup_date']]\n",
|
||||||
|
"\n",
|
||||||
|
"print(\"Customer basic info:\")\n",
|
||||||
|
"print(customer_basic_info.head())\n",
|
||||||
|
"\n",
|
||||||
|
"print(\"\\nCustomer demographics:\")\n",
|
||||||
|
"print(customer_demographics.head())\n",
|
||||||
|
"\n",
|
||||||
|
"# Concatenate horizontally (by index)\n",
|
||||||
|
"customers_recombined = pd.concat([customer_basic_info, customer_demographics.drop('customer_id', axis=1)], axis=1)\n",
|
||||||
|
"print(f\"\\nRecombined shape: {customers_recombined.shape}\")\n",
|
||||||
|
"print(customers_recombined.head())\n",
|
||||||
|
"\n",
|
||||||
|
"# Verify it matches original\n",
|
||||||
|
"columns_match = set(customers_recombined.columns) == set(df_customers.columns)\n",
|
||||||
|
"print(f\"\\nColumns match original: {columns_match}\")"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"# Concat with keys (creating hierarchical columns)\n",
|
||||||
|
"print(\"=== CONCAT WITH KEYS ===\")\n",
|
||||||
|
"\n",
|
||||||
|
"# Create quarterly sales data\n",
|
||||||
|
"q1_sales = pd.DataFrame({\n",
|
||||||
|
" 'product': ['Laptop', 'Phone', 'Tablet'],\n",
|
||||||
|
" 'units_sold': [50, 75, 30],\n",
|
||||||
|
" 'revenue': [60000, 60000, 12000]\n",
|
||||||
|
"})\n",
|
||||||
|
"\n",
|
||||||
|
"q2_sales = pd.DataFrame({\n",
|
||||||
|
" 'product': ['Laptop', 'Phone', 'Tablet'],\n",
|
||||||
|
" 'units_sold': [45, 80, 35],\n",
|
||||||
|
" 'revenue': [54000, 64000, 14000]\n",
|
||||||
|
"})\n",
|
||||||
|
"\n",
|
||||||
|
"# Concatenate with keys\n",
|
||||||
|
"quarterly_sales = pd.concat([q1_sales, q2_sales], keys=['Q1', 'Q2'])\n",
|
||||||
|
"print(\"Quarterly sales with hierarchical index:\")\n",
|
||||||
|
"print(quarterly_sales)\n",
|
||||||
|
"\n",
|
||||||
|
"# Access specific quarter\n",
|
||||||
|
"print(\"\\nQ1 sales only:\")\n",
|
||||||
|
"print(quarterly_sales.loc['Q1'])\n",
|
||||||
|
"\n",
|
||||||
|
"# Create summary comparison\n",
|
||||||
|
"quarterly_comparison = pd.concat([q1_sales.set_index('product'), q2_sales.set_index('product')], \n",
|
||||||
|
" keys=['Q1', 'Q2'], axis=1)\n",
|
||||||
|
"print(\"\\nQuarterly comparison (side by side):\")\n",
|
||||||
|
"print(quarterly_comparison)"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"## 6. Performance and Best Practices\n",
|
||||||
|
"\n",
|
||||||
|
"Optimizing merge operations and avoiding common pitfalls."
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"# Performance comparison: merge vs join\n",
|
||||||
|
"import time\n",
|
||||||
|
"\n",
|
||||||
|
"print(\"=== PERFORMANCE COMPARISON ===\")\n",
|
||||||
|
"\n",
|
||||||
|
"# Create larger datasets for performance testing\n",
|
||||||
|
"np.random.seed(42)\n",
|
||||||
|
"large_customers = pd.DataFrame({\n",
|
||||||
|
" 'customer_id': range(1, 10001),\n",
|
||||||
|
" 'customer_name': [f'Customer_{i}' for i in range(1, 10001)],\n",
|
||||||
|
" 'city': np.random.choice(['New York', 'Los Angeles', 'Chicago'], 10000)\n",
|
||||||
|
"})\n",
|
||||||
|
"\n",
|
||||||
|
"large_orders = pd.DataFrame({\n",
|
||||||
|
" 'order_id': range(1, 50001),\n",
|
||||||
|
" 'customer_id': np.random.randint(1, 10001, 50000),\n",
|
||||||
|
" 'amount': np.random.normal(100, 30, 50000)\n",
|
||||||
|
"})\n",
|
||||||
|
"\n",
|
||||||
|
"print(f\"Large customers: {large_customers.shape}\")\n",
|
||||||
|
"print(f\"Large orders: {large_orders.shape}\")\n",
|
||||||
|
"\n",
|
||||||
|
"# Test merge performance\n",
|
||||||
|
"start_time = time.time()\n",
|
||||||
|
"merged_result = pd.merge(large_customers, large_orders, on='customer_id', how='inner')\n",
|
||||||
|
"merge_time = time.time() - start_time\n",
|
||||||
|
"\n",
|
||||||
|
"# Test join performance\n",
|
||||||
|
"customers_indexed = large_customers.set_index('customer_id')\n",
|
||||||
|
"orders_indexed = large_orders.set_index('customer_id')\n",
|
||||||
|
"\n",
|
||||||
|
"start_time = time.time()\n",
|
||||||
|
"joined_result = customers_indexed.join(orders_indexed, how='inner')\n",
|
||||||
|
"join_time = time.time() - start_time\n",
|
||||||
|
"\n",
|
||||||
|
"print(f\"\\nMerge time: {merge_time:.4f} seconds\")\n",
|
||||||
|
"print(f\"Join time: {join_time:.4f} seconds\")\n",
|
||||||
|
"print(f\"Join is {merge_time/join_time:.2f}x faster\")\n",
|
||||||
|
"\n",
|
||||||
|
"print(f\"\\nResults shape - Merge: {merged_result.shape}, Join: {joined_result.shape}\")"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"# Best practices and common pitfalls\n",
|
||||||
|
"print(\"=== BEST PRACTICES ===\")\n",
|
||||||
|
"\n",
|
||||||
|
"def analyze_merge_keys(df1, df2, key_col):\n",
|
||||||
|
" \"\"\"Analyze merge keys before joining\"\"\"\n",
|
||||||
|
" print(f\"\\n--- Analyzing merge on '{key_col}' ---\")\n",
|
||||||
|
" \n",
|
||||||
|
" # Check for duplicates\n",
|
||||||
|
" df1_dups = df1[key_col].duplicated().sum()\n",
|
||||||
|
" df2_dups = df2[key_col].duplicated().sum()\n",
|
||||||
|
" \n",
|
||||||
|
" print(f\"Duplicates in left table: {df1_dups}\")\n",
|
||||||
|
" print(f\"Duplicates in right table: {df2_dups}\")\n",
|
||||||
|
" \n",
|
||||||
|
" # Check for missing values\n",
|
||||||
|
" df1_missing = df1[key_col].isnull().sum()\n",
|
||||||
|
" df2_missing = df2[key_col].isnull().sum()\n",
|
||||||
|
" \n",
|
||||||
|
" print(f\"Missing values in left table: {df1_missing}\")\n",
|
||||||
|
" print(f\"Missing values in right table: {df2_missing}\")\n",
|
||||||
|
" \n",
|
||||||
|
" # Check overlap\n",
|
||||||
|
" left_keys = set(df1[key_col].dropna())\n",
|
||||||
|
" right_keys = set(df2[key_col].dropna())\n",
|
||||||
|
" \n",
|
||||||
|
" overlap = left_keys & right_keys\n",
|
||||||
|
" left_only = left_keys - right_keys\n",
|
||||||
|
" right_only = right_keys - left_keys\n",
|
||||||
|
" \n",
|
||||||
|
" print(f\"Keys in both tables: {len(overlap)}\")\n",
|
||||||
|
" print(f\"Keys only in left: {len(left_only)}\")\n",
|
||||||
|
" print(f\"Keys only in right: {len(right_only)}\")\n",
|
||||||
|
" \n",
|
||||||
|
" # Predict result sizes\n",
|
||||||
|
" if df1_dups == 0 and df2_dups == 0:\n",
|
||||||
|
" inner_size = len(overlap)\n",
|
||||||
|
" left_size = len(df1)\n",
|
||||||
|
" right_size = len(df2)\n",
|
||||||
|
" outer_size = len(left_keys | right_keys)\n",
|
||||||
|
" else:\n",
|
||||||
|
" print(\"Warning: Duplicates present, result size may be larger than expected\")\n",
|
||||||
|
" inner_size = \"Cannot predict (duplicates present)\"\n",
|
||||||
|
" left_size = \"Cannot predict (duplicates present)\"\n",
|
||||||
|
" right_size = \"Cannot predict (duplicates present)\"\n",
|
||||||
|
" outer_size = \"Cannot predict (duplicates present)\"\n",
|
||||||
|
" \n",
|
||||||
|
" print(f\"\\nPredicted result sizes:\")\n",
|
||||||
|
" print(f\"Inner join: {inner_size}\")\n",
|
||||||
|
" print(f\"Left join: {left_size}\")\n",
|
||||||
|
" print(f\"Right join: {right_size}\")\n",
|
||||||
|
" print(f\"Outer join: {outer_size}\")\n",
|
||||||
|
"\n",
|
||||||
|
"# Analyze our sample data\n",
|
||||||
|
"analyze_merge_keys(df_customers, df_orders, 'customer_id')\n",
|
||||||
|
"analyze_merge_keys(df_customers, df_segments, 'customer_id')"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"# Data validation after merge\n",
|
||||||
|
"def validate_merge_result(df, expected_rows=None, key_col=None):\n",
|
||||||
|
" \"\"\"Validate merge results\"\"\"\n",
|
||||||
|
" print(\"\\n=== MERGE VALIDATION ===\")\n",
|
||||||
|
" \n",
|
||||||
|
" print(f\"Result shape: {df.shape}\")\n",
|
||||||
|
" \n",
|
||||||
|
" if expected_rows:\n",
"        print(f\"Expected rows: {expected_rows}\")\n",
"        if len(df) != expected_rows:\n",
"            print(\"⚠️ Row count doesn't match expectation!\")\n",
"    \n",
"    # Check for unexpected duplicates\n",
"    if key_col and key_col in df.columns:\n",
"        duplicates = df[key_col].duplicated().sum()\n",
"        if duplicates > 0:\n",
"            print(f\"⚠️ Found {duplicates} duplicate keys after merge\")\n",
"    \n",
"    # Check for missing values in key columns\n",
"    missing_summary = df.isnull().sum()\n",
"    critical_missing = missing_summary[missing_summary > 0]\n",
"    \n",
"    if len(critical_missing) > 0:\n",
"        print(\"Missing values after merge:\")\n",
"        print(critical_missing)\n",
"    \n",
"    # Data type consistency\n",
"    print(f\"\\nData types:\")\n",
"    print(df.dtypes)\n",
"    \n",
"    return df\n",
"\n",
"# Example validation\n",
"sample_merge = pd.merge(df_customers, df_orders, on='customer_id', how='inner')\n",
"validated_result = validate_merge_result(sample_merge, key_col='customer_id')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Practice Exercises\n",
"\n",
"Apply merging and joining techniques to real-world scenarios:"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
"# Exercise 1: Customer Lifetime Value Analysis\n",
"# Create a comprehensive customer analysis by joining:\n",
"# - Customer demographics\n",
"# - Order history\n",
"# - Product information\n",
"# - Customer segments\n",
"# Calculate CLV metrics for each customer\n",
"\n",
"def calculate_customer_lifetime_value(customers, orders, products, segments):\n",
"    \"\"\"Calculate comprehensive customer lifetime value metrics\"\"\"\n",
"    # Your implementation here\n",
"    pass\n",
"\n",
"# clv_analysis = calculate_customer_lifetime_value(df_customers, df_orders, df_products, df_segments)\n",
"# print(\"Customer Lifetime Value Analysis:\")\n",
"# print(clv_analysis.head())"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [],
"source": [
"# Exercise 2: Data Quality Assessment\n",
"# Create a function that analyzes data quality issues when merging multiple datasets:\n",
"# - Identify orphaned records\n",
"# - Find data inconsistencies\n",
"# - Suggest data cleaning steps\n",
"# - Provide merge recommendations\n",
"\n",
"# Your code here:\n"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [],
"source": [
"# Exercise 3: Time-series Join Challenge\n",
"# Create a complex time-based join scenario:\n",
"# - Join orders with time-varying product prices\n",
"# - Handle seasonal promotions\n",
"# - Calculate accurate historical revenue\n",
"# - Account for price changes over time\n",
"\n",
"# Your code here:\n"
]
},
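{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch for the time-varying price part of Exercise 3, using hypothetical tiny\n",
"# frames (not the course data). pd.merge_asof matches each order to the most\n",
"# recent price effective at or before the order timestamp -- the standard tool\n",
"# for 'price as of this date' joins.\n",
"prices = pd.DataFrame({\n",
"    'effective_from': pd.to_datetime(['2024-01-01', '2024-02-01', '2024-03-01']),\n",
"    'product': ['A', 'A', 'A'],\n",
"    'unit_price': [10.0, 12.0, 11.0]\n",
"})\n",
"orders = pd.DataFrame({\n",
"    'order_ts': pd.to_datetime(['2024-01-15', '2024-02-20', '2024-03-02']),\n",
"    'product': ['A', 'A', 'A'],\n",
"    'qty': [3, 1, 2]\n",
"})\n",
"\n",
"# Both frames must be sorted on their time keys before merge_asof\n",
"priced = pd.merge_asof(\n",
"    orders.sort_values('order_ts'),\n",
"    prices.sort_values('effective_from'),\n",
"    left_on='order_ts', right_on='effective_from',\n",
"    by='product', direction='backward'\n",
")\n",
"priced['revenue'] = priced['qty'] * priced['unit_price']\n",
"print(priced)"
]
},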
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Key Takeaways\n",
"\n",
"1. **Join Types**:\n",
"   - **Inner**: Only matching records from both tables\n",
"   - **Left**: All records from left table + matching from right\n",
"   - **Right**: All records from right table + matching from left\n",
"   - **Outer**: All records from both tables\n",
"\n",
"2. **Method Selection**:\n",
"   - **`pd.merge()`**: Most flexible, works with any columns\n",
"   - **`.join()`**: Faster for index-based joins\n",
"   - **`pd.concat()`**: For stacking DataFrames vertically/horizontally\n",
"\n",
"3. **Best Practices**:\n",
"   - Always analyze merge keys before joining\n",
"   - Check for duplicates and missing values\n",
"   - Validate results after merging\n",
"   - Use appropriate join types for your use case\n",
"   - Consider performance implications for large datasets\n",
"\n",
"4. **Common Pitfalls**:\n",
"   - Cartesian products from duplicate keys\n",
"   - Unexpected result sizes\n",
"   - Data type inconsistencies\n",
"   - Missing value propagation\n",
"\n",
"## Join Type Selection Guide\n",
"\n",
"| Use Case | Recommended Join | Rationale |\n",
"|----------|-----------------|----------|\n",
"| Customer orders analysis | Inner | Only customers with orders |\n",
"| Customer segmentation | Left | Keep all customers, add segment info |\n",
"| Order validation | Right | Keep all orders, check customer validity |\n",
"| Data completeness analysis | Outer | See all records and identify gaps |\n",
"| Performance-critical operations | Index-based join | Faster execution |\n",
"\n",
"## Performance Tips\n",
"\n",
"1. **Index Usage**: Set indexes for frequently joined columns\n",
"2. **Data Types**: Ensure consistent data types before joining\n",
"3. **Memory Management**: Consider chunking for very large datasets\n",
"4. **Join Order**: Start with smallest datasets\n",
"5. **Validation**: Always validate merge results"
]
}
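,
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Minimal sketch of the index-usage tip above, assuming the df_customers and\n",
"# df_segments frames from earlier cells. With the join key as the index,\n",
"# .join() aligns on indexes, which is typically faster than column-based\n",
"# merges on large frames -- worth profiling on your own data.\n",
"left = df_customers.set_index('customer_id')\n",
"right = df_segments.set_index('customer_id')\n",
"joined = left.join(right, how='left')  # index-on-index join\n",
"print(joined.head())"
]
}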
],
"metadata": {
"kernelspec": {
"display_name": "venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.13.3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
1408
Session_01/PandasDataFrame-exmples/08_sorting_ranking.ipynb
Executable file
File diff suppressed because it is too large
1978
Session_01/PandasDataFrame-exmples/09_pivot_tables.ipynb
Executable file
File diff suppressed because it is too large
1149
Session_01/PandasDataFrame-exmples/10_time_series_analysis.ipynb
Executable file
File diff suppressed because it is too large
1059
Session_01/PandasDataFrame-exmples/11_string_operation.ipynb
Executable file
File diff suppressed because it is too large
1126
Session_01/PandasDataFrame-exmples/12_data_visualization.ipynb
Executable file
File diff suppressed because one or more lines are too long
815
Session_01/PandasDataFrame-exmples/13_advanced_data_cleaning.ipynb
Executable file
@@ -0,0 +1,815 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Session 1 - DataFrames - Lesson 13: Advanced Data Cleaning\n",
"\n",
"## Learning Objectives\n",
"- Master advanced techniques for data cleaning and validation\n",
"- Learn to detect and handle various types of data quality issues\n",
"- Understand data standardization and normalization techniques\n",
"- Practice with real-world messy data scenarios\n",
"- Develop automated data cleaning pipelines\n",
"\n",
"## Prerequisites\n",
"- Completed previous lessons on DataFrames\n",
"- Understanding of basic data cleaning concepts\n",
"- Familiarity with regular expressions (helpful but not required)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Import required libraries\n",
"import pandas as pd\n",
"import numpy as np\n",
"import re\n",
"from datetime import datetime, timedelta\n",
"import warnings\n",
"warnings.filterwarnings('ignore')\n",
"\n",
"# Display settings\n",
"pd.set_option('display.max_columns', None)\n",
"pd.set_option('display.max_rows', 100)\n",
"\n",
"print(\"Libraries loaded successfully!\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Creating Messy Sample Data\n",
"\n",
"Let's create a realistic messy dataset to practice advanced cleaning techniques."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"\n",
"# Create intentionally messy data that mimics real-world issues\n",
"np.random.seed(42)\n",
"\n",
"# Base data\n",
"n_records = 200\n",
"messy_data = {\n",
"    'customer_id': [f'CUST{i:04d}' if i % 10 != 0 else f'cust{i:04d}' for i in range(1, n_records + 1)],\n",
"    'customer_name': [\n",
"        'John Smith', 'jane doe', 'MARY JOHNSON', 'bob wilson', 'Sarah Davis',\n",
"        'Mike Brown', 'lisa garcia', 'DAVID MILLER', 'Amy Wilson', 'Tom Anderson'\n",
"    ] * 20,\n",
"    'email': [\n",
"        'john.smith@email.com', 'JANE.DOE@EMAIL.COM', 'mary@company.org',\n",
"        'bob..wilson@test.com', 'sarah@invalid-email', 'mike@email.com',\n",
"        'lisa.garcia@email.com', 'david@company.org', 'amy@email.com', 'tom@test.com'\n",
"    ] * 20,\n",
"    'phone': [\n",
"        '(555) 123-4567', '555.987.6543', '5551234567', '555-987-6543',\n",
"        '(555)123-4567', '+1-555-123-4567', '555 123 4567', '5559876543',\n",
"        '(555) 987 6543', '555-123-4567'\n",
"    ] * 20,\n",
"    'address': [\n",
"        '123 Main St, Anytown, NY 12345', '456 Oak Ave, Boston, MA 02101',\n",
"        '789 Pine Rd, Los Angeles, CA 90210', '321 Elm St, Chicago, IL 60601',\n",
"        '654 Maple Dr, Houston, TX 77001', '987 Cedar Ln, Phoenix, AZ 85001',\n",
"        '147 Birch Way, Philadelphia, PA 19101', '258 Ash Ct, San Antonio, TX 78201',\n",
"        '369 Walnut St, San Diego, CA 92101', '741 Cherry Ave, Dallas, TX 75201'\n",
"    ] * 20,\n",
"    'purchase_amount': np.random.normal(100, 30, n_records).round(2),\n",
"    'purchase_date': [\n",
"        '2024-01-15', '01/16/2024', '2024-1-17', '16-01-2024', '2024/01/18',\n",
"        'January 19, 2024', '2024-01-20', '01-21-24', '2024.01.22', '23/01/2024'\n",
"    ] * 20,\n",
"    'category': [\n",
"        'Electronics', 'electronics', 'ELECTRONICS', 'Books', 'books',\n",
"        'Clothing', 'clothing', 'CLOTHING', 'Home & Garden', 'home&garden'\n",
"    ] * 20,\n",
"    'satisfaction_score': np.random.choice([1, 2, 3, 4, 5, 99, -1, None], n_records, p=[0.05, 0.1, 0.15, 0.35, 0.3, 0.02, 0.02, 0.01])\n",
"}\n",
"\n",
"# Convert to DataFrame first\n",
"df_messy = pd.DataFrame(messy_data)\n",
"\n",
"# Introduce missing values and anomalies using proper indexing\n",
"df_messy.loc[df_messy.index[::25], 'customer_name'] = None  # Some missing names\n",
"df_messy.loc[df_messy.index[::30], 'email'] = None  # Some missing emails\n",
"df_messy.loc[df_messy.index[::35], 'purchase_amount'] = np.nan  # Some missing amounts\n",
"df_messy.loc[df_messy.index[::40], 'purchase_amount'] = -999  # Invalid negative values\n",
"\n",
"# Add some duplicate records\n",
"duplicate_indices = [0, 1, 2, 3, 4]\n",
"duplicate_rows = df_messy.iloc[duplicate_indices].copy()\n",
"df_messy = pd.concat([df_messy, duplicate_rows], ignore_index=True)\n",
"\n",
"print(\"Messy dataset created:\")\n",
"print(f\"Shape: {df_messy.shape}\")\n",
"print(\"\\nFirst few rows:\")\n",
"print(df_messy.head(10))\n",
"print(\"\\nData types:\")\n",
"print(df_messy.dtypes)\n",
"print(\"\\nSample of data quality issues:\")\n",
"print(\"\\n1. Missing values:\")\n",
"print(df_messy.isnull().sum())\n",
"print(\"\\n2. Inconsistent formatting examples:\")\n",
"print(\"Customer IDs:\", df_messy['customer_id'].head(15).tolist())\n",
"print(\"Customer names:\", df_messy['customer_name'].dropna().head(5).tolist())\n",
"print(\"Categories:\", df_messy['category'].unique()[:5])\n",
"print(\"\\n3. Invalid satisfaction scores:\")\n",
"print(\"Unique satisfaction scores:\", sorted(df_messy['satisfaction_score'].dropna().unique()))\n",
"print(\"\\n4. Invalid purchase amounts:\")\n",
"print(\"Negative amounts:\", df_messy[df_messy['purchase_amount'] < 0]['purchase_amount'].count())\n",
"print(\"\\n5. Date format inconsistencies:\")\n",
"print(\"Sample dates:\", df_messy['purchase_date'].head(10).tolist())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Data Quality Assessment\n",
"\n",
"First, let's assess the quality of our messy data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def assess_data_quality(df):\n",
"    \"\"\"Comprehensive data quality assessment\"\"\"\n",
"    print(\"=== DATA QUALITY ASSESSMENT ===\")\n",
"    print(f\"Dataset shape: {df.shape}\")\n",
"    print(f\"Total cells: {df.size}\")\n",
"    \n",
"    # Missing values analysis\n",
"    print(\"\\n--- Missing Values ---\")\n",
"    missing_stats = pd.DataFrame({\n",
"        'Missing_Count': df.isnull().sum(),\n",
"        'Missing_Percentage': (df.isnull().sum() / len(df)) * 100\n",
"    })\n",
"    missing_stats = missing_stats[missing_stats['Missing_Count'] > 0]\n",
"    print(missing_stats.round(2))\n",
"    \n",
"    # Duplicate analysis\n",
"    print(\"\\n--- Duplicates ---\")\n",
"    total_duplicates = df.duplicated().sum()\n",
"    print(f\"Complete duplicate rows: {total_duplicates}\")\n",
"    \n",
"    # Column-specific analysis\n",
"    print(\"\\n--- Column Analysis ---\")\n",
"    for col in df.columns:\n",
"        unique_count = df[col].nunique()\n",
"        unique_percentage = (unique_count / len(df)) * 100\n",
"        print(f\"{col}: {unique_count} unique values ({unique_percentage:.1f}%)\")\n",
"    \n",
"    # Data type issues\n",
"    print(\"\\n--- Data Types ---\")\n",
"    print(df.dtypes)\n",
"    \n",
"    return missing_stats, total_duplicates\n",
"\n",
"# Assess the messy data\n",
"missing_stats, duplicate_count = assess_data_quality(df_messy)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Identify specific data quality issues\n",
"def identify_issues(df):\n",
"    \"\"\"Identify specific data quality issues\"\"\"\n",
"    issues = []\n",
"    \n",
"    # Check for inconsistent formatting\n",
"    print(\"=== SPECIFIC ISSUES IDENTIFIED ===\")\n",
"    \n",
"    # Customer ID formatting\n",
"    id_patterns = df['customer_id'].str.extract(r'(CUST|cust)(\\d+)').fillna('')\n",
"    inconsistent_ids = (id_patterns[0] == 'cust').sum()\n",
"    print(f\"Inconsistent customer ID format: {inconsistent_ids} records\")\n",
"    \n",
"    # Email validation\n",
"    email_pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$'\n",
"    invalid_emails = ~df['email'].str.match(email_pattern, na=False)\n",
"    print(f\"Invalid email formats: {invalid_emails.sum()} records\")\n",
"    \n",
"    # Negative purchase amounts\n",
"    negative_amounts = (df['purchase_amount'] < 0).sum()\n",
"    print(f\"Negative purchase amounts: {negative_amounts} records\")\n",
"    \n",
"    # Invalid satisfaction scores\n",
"    invalid_scores = ((df['satisfaction_score'] < 1) | (df['satisfaction_score'] > 5)) & df['satisfaction_score'].notna()\n",
"    print(f\"Invalid satisfaction scores: {invalid_scores.sum()} records\")\n",
"    \n",
"    # Category inconsistencies\n",
"    category_variations = df['category'].value_counts()\n",
"    print(f\"\\nCategory variations: {len(category_variations)} different values\")\n",
"    print(category_variations)\n",
"    \n",
"    return issues\n",
"\n",
"issues = identify_issues(df_messy)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Text Data Standardization\n",
"\n",
"Clean and standardize text fields."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Text cleaning functions\n",
"def clean_text_data(df):\n",
"    \"\"\"Comprehensive text data cleaning\"\"\"\n",
"    df_clean = df.copy()\n",
"    \n",
"    # Standardize customer names\n",
"    print(\"Cleaning customer names...\")\n",
"    df_clean['customer_name_clean'] = df_clean['customer_name'].str.strip()  # Remove whitespace\n",
"    df_clean['customer_name_clean'] = df_clean['customer_name_clean'].str.title()  # Title case\n",
"    df_clean['customer_name_clean'] = df_clean['customer_name_clean'].str.replace(r'\\s+', ' ', regex=True)  # Multiple spaces\n",
"    \n",
"    # Standardize customer IDs\n",
"    print(\"Standardizing customer IDs...\")\n",
"    df_clean['customer_id_clean'] = df_clean['customer_id'].str.upper()  # All uppercase\n",
" df_clean['customer_id_clean'] = df_clean['customer_id_clean'].str.replace('CUST', 'CUST') # Ensure consistent prefix\n",
"    \n",
"    # Clean email addresses\n",
"    print(\"Cleaning email addresses...\")\n",
"    df_clean['email_clean'] = df_clean['email'].str.lower()  # Lowercase\n",
"    df_clean['email_clean'] = df_clean['email_clean'].str.strip()  # Remove whitespace\n",
"    df_clean['email_clean'] = df_clean['email_clean'].str.replace(r'\\.{2,}', '.', regex=True)  # Multiple dots\n",
"    \n",
"    # Standardize categories\n",
"    print(\"Standardizing categories...\")\n",
"    category_mapping = {\n",
"        'electronics': 'Electronics',\n",
"        'ELECTRONICS': 'Electronics',\n",
"        'books': 'Books',\n",
"        'clothing': 'Clothing',\n",
"        'CLOTHING': 'Clothing',\n",
"        'home&garden': 'Home & Garden',\n",
"        'Home & Garden': 'Home & Garden'\n",
"    }\n",
"    df_clean['category_clean'] = df_clean['category'].map(category_mapping).fillna(df_clean['category'])\n",
"    \n",
"    return df_clean\n",
"\n",
"# Apply text cleaning\n",
"df_text_clean = clean_text_data(df_messy)\n",
"\n",
"print(\"\\nText cleaning comparison:\")\n",
"comparison_cols = ['customer_name', 'customer_name_clean', 'customer_id', 'customer_id_clean', \n",
"                   'email', 'email_clean', 'category', 'category_clean']\n",
"print(df_text_clean[comparison_cols].head(10))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Advanced text cleaning with regex\n",
"def advanced_text_cleaning(df):\n",
"    \"\"\"Advanced text cleaning using regular expressions\"\"\"\n",
"    df_advanced = df.copy()\n",
"    \n",
"    # Extract and standardize address components\n",
"    print(\"Processing addresses...\")\n",
"    # Basic address pattern: number street, city, state zipcode\n",
"    address_pattern = r'(\\d+)\\s+([^,]+),\\s*([^,]+),\\s*([A-Z]{2})\\s+(\\d{5})'\n",
"    address_parts = df_advanced['address'].str.extract(address_pattern)\n",
"    address_parts.columns = ['street_number', 'street_name', 'city', 'state', 'zipcode']\n",
"    \n",
"    # Clean street names\n",
"    address_parts['street_name'] = address_parts['street_name'].str.title()\n",
"    address_parts['city'] = address_parts['city'].str.title()\n",
"    \n",
"    # Combine cleaned parts\n",
"    df_advanced['address_clean'] = (\n",
"        address_parts['street_number'] + ' ' + address_parts['street_name'] + ', ' +\n",
"        address_parts['city'] + ', ' + address_parts['state'] + ' ' + address_parts['zipcode']\n",
"    )\n",
"    \n",
"    # Add individual address components\n",
"    for col in address_parts.columns:\n",
"        df_advanced[col] = address_parts[col]\n",
"    \n",
"    return df_advanced\n",
"\n",
"# Apply advanced cleaning\n",
"df_advanced_clean = advanced_text_cleaning(df_text_clean)\n",
"\n",
"print(\"Address cleaning results:\")\n",
"print(df_advanced_clean[['address', 'address_clean', 'city', 'state', 'zipcode']].head())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Phone Number Standardization\n",
"\n",
"Clean and standardize phone numbers using regex patterns."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def standardize_phone_numbers(df):\n",
"    \"\"\"Standardize phone numbers to consistent format\"\"\"\n",
"    df_phone = df.copy()\n",
"    \n",
"    def clean_phone(phone):\n",
"        \"\"\"Clean individual phone number\"\"\"\n",
"        if pd.isna(phone):\n",
"            return None\n",
"        \n",
"        # Remove all non-digit characters\n",
"        digits_only = re.sub(r'\\D', '', str(phone))\n",
"        \n",
"        # Handle different formats\n",
"        if len(digits_only) == 10:\n",
"            # Format as (XXX) XXX-XXXX\n",
"            return f\"({digits_only[:3]}) {digits_only[3:6]}-{digits_only[6:]}\"\n",
"        elif len(digits_only) == 11 and digits_only.startswith('1'):\n",
"            # Remove country code and format\n",
"            phone_part = digits_only[1:]\n",
"            return f\"({phone_part[:3]}) {phone_part[3:6]}-{phone_part[6:]}\"\n",
"        else:\n",
"            # Invalid phone number\n",
"            return 'INVALID'\n",
"    \n",
"    # Apply phone cleaning\n",
"    df_phone['phone_clean'] = df_phone['phone'].apply(clean_phone)\n",
"    \n",
"    # Extract area code\n",
"    df_phone['area_code'] = df_phone['phone_clean'].str.extract(r'\\((\\d{3})\\)')\n",
"    \n",
"    # Flag invalid phone numbers (missing phones should not count as valid)\n",
" df_phone['phone_is_valid'] = df_phone['phone_clean'] != 'INVALID'\n",
"    \n",
"    return df_phone\n",
"\n",
"# Apply phone standardization\n",
"df_phone_clean = standardize_phone_numbers(df_advanced_clean)\n",
"\n",
"print(\"Phone number standardization:\")\n",
"print(df_phone_clean[['phone', 'phone_clean', 'area_code', 'phone_is_valid']].head(15))\n",
"\n",
"print(\"\\nPhone validation summary:\")\n",
"print(df_phone_clean['phone_is_valid'].value_counts())\n",
"\n",
"print(\"\\nArea code distribution:\")\n",
"print(df_phone_clean['area_code'].value_counts().head())"
]
},
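{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Vectorized alternative (sketch) to the row-wise apply above: pandas string\n",
"# methods perform the same normalization without a Python-level loop, which\n",
"# matters on large frames. Same assumptions: 10-digit US numbers, optional\n",
"# leading country code 1.\n",
"digits = df_phone_clean['phone'].str.replace(r'\\D', '', regex=True)\n",
"digits = digits.str.replace(r'^1(?=\\d{10}$)', '', regex=True)  # drop leading country code\n",
"formatted = digits.str.replace(r'^(\\d{3})(\\d{3})(\\d{4})$', r'(\\1) \\2-\\3', regex=True)\n",
"\n",
"# Anything not in (XXX) XXX-XXXX form after this failed normalization\n",
"is_valid = formatted.str.match(r'^\\(\\d{3}\\) \\d{3}-\\d{4}$', na=False)\n",
"print(formatted.where(is_valid, 'INVALID').head(10))"
]
},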
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Date Standardization\n",
"\n",
"Parse and standardize dates from various formats."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def standardize_dates(df):\n",
"    \"\"\"Parse and standardize dates from multiple formats\"\"\"\n",
"    df_dates = df.copy()\n",
"    \n",
"    def parse_date(date_str):\n",
"        \"\"\"Try to parse date from various formats\"\"\"\n",
"        if pd.isna(date_str):\n",
"            return None\n",
"        \n",
"        date_str = str(date_str).strip()\n",
"        \n",
"        # Common date formats to try\n",
"        formats = [\n",
"            '%Y-%m-%d',   # 2024-01-15\n",
"            '%m/%d/%Y',   # 01/16/2024\n",
"            '%Y-%m-%d',   # 2024-1-17 (handled by first format)\n",
"            '%d-%m-%Y',   # 16-01-2024\n",
"            '%Y/%m/%d',   # 2024/01/18\n",
"            '%B %d, %Y',  # January 19, 2024\n",
"            '%m-%d-%y',   # 01-21-24\n",
"            '%Y.%m.%d',   # 2024.01.22\n",
"            '%d/%m/%Y'    # 23/01/2024\n",
"        ]\n",
"        \n",
"        for fmt in formats:\n",
"            try:\n",
"                return pd.to_datetime(date_str, format=fmt)\n",
"            except ValueError:\n",
"                continue\n",
"        \n",
"        # If all else fails, try pandas' flexible parser\n",
"        try:\n",
" return pd.to_datetime(date_str, infer_datetime_format=True)\n",
|
||||||
|
" except:\n",
"            return None\n",
"    \n",
"    # Apply date parsing\n",
"    print(\"Parsing dates...\")\n",
"    df_dates['purchase_date_clean'] = df_dates['purchase_date'].apply(parse_date)\n",
"    \n",
"    # Flag unparseable dates\n",
"    df_dates['date_is_valid'] = df_dates['purchase_date_clean'].notna()\n",
"    \n",
"    # Extract date components for valid dates\n",
"    df_dates['purchase_year'] = df_dates['purchase_date_clean'].dt.year\n",
"    df_dates['purchase_month'] = df_dates['purchase_date_clean'].dt.month\n",
"    df_dates['purchase_day'] = df_dates['purchase_date_clean'].dt.day\n",
"    df_dates['purchase_day_of_week'] = df_dates['purchase_date_clean'].dt.day_name()\n",
"    \n",
"    return df_dates\n",
"\n",
"# Apply date standardization\n",
"df_date_clean = standardize_dates(df_phone_clean)\n",
"\n",
"print(\"Date standardization results:\")\n",
"print(df_date_clean[['purchase_date', 'purchase_date_clean', 'date_is_valid', \n",
"                     'purchase_year', 'purchase_month', 'purchase_day_of_week']].head(15))\n",
"\n",
"print(\"\\nDate parsing summary:\")\n",
"print(df_date_clean['date_is_valid'].value_counts())\n",
"\n",
"invalid_dates = df_date_clean[~df_date_clean['date_is_valid']]['purchase_date'].unique()\n",
"if len(invalid_dates) > 0:\n",
"    print(f\"\\nInvalid date formats found: {invalid_dates}\")"
]
},
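{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Vectorized variant (sketch): one pd.to_datetime pass with errors='coerce'\n",
"# turns unparseable strings into NaT instead of raising, so the whole column\n",
"# converts in a single call; per-format loops like parse_date above are then\n",
"# only needed for the stragglers. Caveat: without an explicit format pandas\n",
"# has to guess, so ambiguous day/month strings may parse differently than\n",
"# intended. (On pandas >= 2.0, format='mixed' asks for per-element inference.)\n",
"coerced = pd.to_datetime(df_date_clean['purchase_date'], errors='coerce')\n",
"print(f\"Parsed in one pass: {coerced.notna().sum()} / {len(coerced)}\")\n",
"print(\"Still unparsed:\", df_date_clean.loc[coerced.isna(), 'purchase_date'].unique()[:5])"
]
},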
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. Numerical Data Cleaning\n",
"\n",
"Handle outliers, invalid values, and missing numerical data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def clean_numerical_data(df):\n",
"    \"\"\"Clean and validate numerical data\"\"\"\n",
"    df_numeric = df.copy()\n",
"    \n",
"    # Clean purchase amounts\n",
"    print(\"Cleaning purchase amounts...\")\n",
"    \n",
"    # Flag invalid values\n",
"    df_numeric['amount_is_valid'] = (\n",
"        df_numeric['purchase_amount'].notna() & \n",
"        (df_numeric['purchase_amount'] >= 0) & \n",
"        (df_numeric['purchase_amount'] <= 10000)  # Reasonable upper limit\n",
"    )\n",
"    \n",
"    # Replace invalid values with NaN\n",
"    df_numeric['purchase_amount_clean'] = df_numeric['purchase_amount'].where(\n",
"        df_numeric['amount_is_valid'], np.nan\n",
"    )\n",
"    \n",
"    # Detect outliers using IQR method\n",
"    Q1 = df_numeric['purchase_amount_clean'].quantile(0.25)\n",
"    Q3 = df_numeric['purchase_amount_clean'].quantile(0.75)\n",
"    IQR = Q3 - Q1\n",
"    lower_bound = Q1 - 1.5 * IQR\n",
"    upper_bound = Q3 + 1.5 * IQR\n",
"    \n",
"    df_numeric['amount_is_outlier'] = (\n",
"        (df_numeric['purchase_amount_clean'] < lower_bound) |\n",
"        (df_numeric['purchase_amount_clean'] > upper_bound)\n",
"    )\n",
"    \n",
"    # Clean satisfaction scores\n",
"    print(\"Cleaning satisfaction scores...\")\n",
"    \n",
"    # Valid satisfaction scores are 1-5\n",
"    df_numeric['satisfaction_is_valid'] = (\n",
"        df_numeric['satisfaction_score'].notna() &\n",
"        (df_numeric['satisfaction_score'].between(1, 5))\n",
"    )\n",
"    \n",
"    df_numeric['satisfaction_score_clean'] = df_numeric['satisfaction_score'].where(\n",
"        df_numeric['satisfaction_is_valid'], np.nan\n",
"    )\n",
"    \n",
"    return df_numeric\n",
"\n",
"# Apply numerical cleaning\n",
"df_numeric_clean = clean_numerical_data(df_date_clean)\n",
"\n",
"print(\"Numerical data cleaning results:\")\n",
"print(df_numeric_clean[['purchase_amount', 'purchase_amount_clean', 'amount_is_valid', \n",
"                        'amount_is_outlier', 'satisfaction_score', 'satisfaction_score_clean', \n",
"                        'satisfaction_is_valid']].head(15))\n",
"\n",
"print(\"\\nNumerical data quality summary:\")\n",
"print(f\"Valid purchase amounts: {df_numeric_clean['amount_is_valid'].sum()}/{len(df_numeric_clean)}\")\n",
"print(f\"Outlier amounts: {df_numeric_clean['amount_is_outlier'].sum()}\")\n",
"print(f\"Valid satisfaction scores: {df_numeric_clean['satisfaction_is_valid'].sum()}/{len(df_numeric_clean)}\")\n",
"\n",
"# Show statistics for cleaned data\n",
"print(\"\\nCleaned amount statistics:\")\n",
"print(df_numeric_clean['purchase_amount_clean'].describe())"
]
},
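{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Capping (winsorizing) as an alternative to merely flagging outliers -- a\n",
"# sketch using Series.clip with the same IQR bounds recomputed here. Whether\n",
"# to cap, drop, or keep outliers is a modeling decision, not a universal rule.\n",
"amt = df_numeric_clean['purchase_amount_clean']\n",
"q1, q3 = amt.quantile(0.25), amt.quantile(0.75)\n",
"iqr = q3 - q1\n",
"capped = amt.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)\n",
"\n",
"print(pd.DataFrame({'before': amt.describe(), 'after': capped.describe()}))"
]
},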
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6. Duplicate Detection and Handling\n",
"\n",
"Identify and handle duplicate records intelligently."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def handle_duplicates(df):\n",
"    \"\"\"Comprehensive duplicate detection and handling\"\"\"\n",
"    df_dedup = df.copy()\n",
"    \n",
"    print(\"=== DUPLICATE ANALYSIS ===\")\n",
"    \n",
"    # 1. Exact duplicates\n",
"    exact_duplicates = df_dedup.duplicated()\n",
"    print(f\"Exact duplicate rows: {exact_duplicates.sum()}\")\n",
"    \n",
"    # 2. Duplicates based on key columns (likely same customer)\n",
"    key_cols = ['customer_name_clean', 'email_clean']\n",
"    key_duplicates = df_dedup.duplicated(subset=key_cols, keep=False)\n",
"    print(f\"Duplicate customers (by name/email): {key_duplicates.sum()}\")\n",
"    \n",
"    # 3. Near duplicates (similar but not exact)\n",
"    # For demonstration, we'll check phone numbers\n",
"    phone_duplicates = df_dedup.duplicated(subset=['phone_clean'], keep=False)\n",
"    print(f\"Duplicate phone numbers: {phone_duplicates.sum()}\")\n",
"    \n",
"    # Show duplicate examples\n",
"    if key_duplicates.any():\n",
"        print(\"\\nExample duplicate customers:\")\n",
"        duplicate_customers = df_dedup[key_duplicates].sort_values(key_cols)\n",
"        print(duplicate_customers[key_cols + ['customer_id_clean', 'purchase_amount_clean']].head(10))\n",
"    \n",
"    # Remove exact duplicates\n",
"    print(f\"\\nRemoving {exact_duplicates.sum()} exact duplicates...\")\n",
"    df_no_exact_dups = df_dedup[~exact_duplicates]\n",
"    \n",
"    # For customer duplicates, keep the one with the highest purchase amount\n",
"    print(\"Handling customer duplicates (keeping highest purchase)...\")\n",
"    df_final = df_no_exact_dups.sort_values('purchase_amount_clean', ascending=False).drop_duplicates(\n",
"        subset=key_cols, keep='first'\n",
"    )\n",
"    \n",
"    print(f\"Final dataset size after deduplication: {len(df_final)} (was {len(df)})\")\n",
"    \n",
"    return df_final\n",
"\n",
"# Apply duplicate handling\n",
"df_deduplicated = handle_duplicates(df_numeric_clean)\n",
"\n",
"print(f\"\\nRows removed: {len(df_numeric_clean) - len(df_deduplicated)}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 7. Data Validation and Quality Scores\n",
"\n",
"Create comprehensive data quality metrics."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def calculate_quality_scores(df):\n",
"    \"\"\"Calculate comprehensive data quality scores\"\"\"\n",
"    df_quality = df.copy()\n",
"    \n",
"    # Define quality checks\n",
"    quality_checks = {\n",
"        'has_customer_name': df_quality['customer_name_clean'].notna(),\n",
"        'has_valid_email': df_quality['email_clean'].notna() & \n",
"                           df_quality['email_clean'].str.match(r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$', na=False),\n",
" 'has_valid_phone': df_quality['phone_is_valid'] == True,\n",
|
||||||
|
" 'has_valid_date': df_quality['date_is_valid'] == True,\n",
|
||||||
|
" 'has_valid_amount': df_quality['amount_is_valid'] == True,\n",
|
||||||
|
" 'has_valid_satisfaction': df_quality['satisfaction_is_valid'] == True,\n",
|
||||||
|
" 'amount_not_outlier': df_quality['amount_is_outlier'] == False,\n",
"        'has_complete_address': df_quality['city'].notna() & df_quality['state'].notna() & df_quality['zipcode'].notna()\n",
"    }\n",
"    \n",
"    # Add individual quality flags\n",
"    for check_name, check_result in quality_checks.items():\n",
"        df_quality[f'quality_{check_name}'] = check_result.astype(int)\n",
"    \n",
"    # Calculate overall quality score (percentage of passed checks)\n",
"    quality_cols = [col for col in df_quality.columns if col.startswith('quality_')]\n",
"    df_quality['data_quality_score'] = df_quality[quality_cols].mean(axis=1) * 100\n",
"    \n",
"    # Categorize quality levels\n",
"    def quality_category(score):\n",
"        if score >= 90:\n",
"            return 'Excellent'\n",
"        elif score >= 75:\n",
"            return 'Good'\n",
"        elif score >= 50:\n",
"            return 'Fair'\n",
"        else:\n",
"            return 'Poor'\n",
"    \n",
"    df_quality['quality_category'] = df_quality['data_quality_score'].apply(quality_category)\n",
"    \n",
"    return df_quality, quality_checks\n",
"\n",
"# Calculate quality scores\n",
"df_with_quality, quality_checks = calculate_quality_scores(df_deduplicated)\n",
"\n",
"print(\"Data quality analysis:\")\n",
"print(df_with_quality[['customer_name_clean', 'data_quality_score', 'quality_category']].head(10))\n",
"\n",
"print(\"\\nQuality category distribution:\")\n",
"print(df_with_quality['quality_category'].value_counts())\n",
"\n",
"print(\"\\nAverage quality scores by check:\")\n",
"quality_summary = {}\n",
"for check_name in quality_checks.keys():\n",
"    col_name = f'quality_{check_name}'\n",
"    quality_summary[check_name] = df_with_quality[col_name].mean() * 100\n",
"\n",
"quality_df = pd.DataFrame(list(quality_summary.items()), columns=['Quality_Check', 'Pass_Rate_%'])\n",
"quality_df = quality_df.sort_values('Pass_Rate_%', ascending=False)\n",
"print(quality_df.round(1))"
]
},
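{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# The takeaways below stress reusable pipelines; DataFrame.pipe chains the\n",
"# cleaning functions defined in this lesson into a single pass (sketch --\n",
"# assumes each step keeps the columns the next one expects, as above).\n",
"df_pipeline = (\n",
"    df_messy\n",
"    .pipe(clean_text_data)\n",
"    .pipe(advanced_text_cleaning)\n",
"    .pipe(standardize_phone_numbers)\n",
"    .pipe(standardize_dates)\n",
"    .pipe(clean_numerical_data)\n",
"    .pipe(handle_duplicates)\n",
")\n",
"print(f\"Pipeline output shape: {df_pipeline.shape}\")"
]
},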
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Practice Exercises\n",
"\n",
"Apply advanced data cleaning techniques to challenging scenarios:"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [],
"source": [
"# Exercise 1: Create a custom validation function\n",
"# Build a function that validates business rules:\n",
"# - Email domains should be from approved list\n",
"# - Purchase amounts should be within reasonable ranges by category\n",
"# - Dates should be within business operating period\n",
"# - Customer IDs should follow specific format patterns\n",
"\n",
"def validate_business_rules(df):\n",
"    \"\"\"Validate business-specific rules\"\"\"\n",
"    # Your implementation here\n",
"    pass\n",
"\n",
"# validation_results = validate_business_rules(df_final_clean)\n",
"# print(validation_results)"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [],
"source": [
"# Exercise 2: Advanced duplicate detection\n",
"# Implement fuzzy matching for near-duplicate detection:\n",
"# - Similar names (edit distance)\n",
"# - Similar addresses\n",
"# - Similar email patterns\n",
"\n",
"# Your code here:\n"
]
},
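{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Starter sketch for the fuzzy-matching exercise using only the standard\n",
"# library: difflib.SequenceMatcher.ratio() scores similarity in [0, 1]. The\n",
"# 0.85 threshold and the name pairs are illustrative assumptions; dedicated\n",
"# libraries (e.g. rapidfuzz) scale better on real data.\n",
"from difflib import SequenceMatcher\n",
"\n",
"def name_similarity(a, b):\n",
"    \"\"\"Case-insensitive similarity in [0, 1] between two strings\"\"\"\n",
"    return SequenceMatcher(None, str(a).lower(), str(b).lower()).ratio()\n",
"\n",
"pairs = [('John Smith', 'Jon Smith'), ('Mary Johnson', 'Maria Johnson'), ('Tom Anderson', 'Sarah Davis')]\n",
"for a, b in pairs:\n",
"    score = name_similarity(a, b)\n",
"    flag = 'possible duplicate' if score >= 0.85 else 'distinct'\n",
"    print(f\"{a!r} vs {b!r}: {score:.2f} -> {flag}\")"
]
},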
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [],
"source": [
"# Exercise 3: Data cleaning metrics dashboard\n",
"# Create a comprehensive data quality dashboard that shows:\n",
"# - Data quality trends over time\n",
"# - Field-by-field quality scores\n",
"# - Impact of cleaning steps\n",
"# - Recommendations for further improvement\n",
"\n",
"# Your code here:\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Key Takeaways\n",
"\n",
"1. **Assessment First**: Always assess data quality before cleaning\n",
"2. **Systematic Approach**: Use a structured pipeline for consistent results\n",
"3. **Preserve Original Data**: Keep original values while creating cleaned versions\n",
"4. **Document Everything**: Log all cleaning steps and decisions\n",
"5. **Validation**: Implement business rule validation\n",
"6. **Quality Metrics**: Measure and track data quality improvements\n",
"7. **Reusable Pipeline**: Create automated, configurable cleaning processes\n",
"8. **Context Matters**: Consider domain-specific requirements\n",
"\n",
"## Common Data Issues and Solutions\n",
"\n",
"| Issue | Detection Method | Solution |\n",
"|-------|-----------------|----------|\n",
"| Inconsistent Format | Pattern analysis | Standardization rules |\n",
"| Missing Values | `.isnull()` | Imputation or flagging |\n",
"| Duplicates | `.duplicated()` | Deduplication logic |\n",
"| Outliers | Statistical methods | Capping or flagging |\n",
"| Invalid Values | Business rules | Validation and correction |\n",
"| Inconsistent Naming | String analysis | Normalization |\n",
"| Date Issues | Parsing attempts | Multiple format handling |\n",
"| Text Issues | Regex patterns | Cleaning and standardization |\n",
"\n",
"## Best Practices\n",
"\n",
"1. **Start with Exploration**: Understand your data before cleaning\n",
"2. **Preserve Traceability**: Keep original and cleaned versions\n",
"3. **Validate Assumptions**: Test cleaning rules on sample data\n",
"4. **Measure Impact**: Quantify improvements from cleaning\n",
"5. **Automate When Possible**: Build reusable cleaning pipelines\n",
"6. **Handle Edge Cases**: Plan for unusual but valid data\n",
"7. **Business Context**: Include domain experts in rule definition\n",
"8. **Iterative Process**: Refine cleaning rules based on results\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.13.3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
1301
Session_01/ohlcv_analysis.ipynb
Executable file
File diff suppressed because one or more lines are too long
1733
Session_01/ohlcv_analysis_advanced.ipynb
Executable file
File diff suppressed because one or more lines are too long