
Session_01

Jakub Polec 2025-06-13 07:25:59 +02:00
parent 16a25d8ee5
commit 6befd2d50c
18 changed files with 99,852 additions and 1 deletion


@@ -1 +1,98 @@
# Build Your Own Crypto Trading Bot Course Repository
Welcome to the private repository for the **"Build Your Own Crypto Trading Bot Hands-On Course with Alex"** by QuantJourney.
This repository contains materials, templates, and code samples used during the 6 live sessions held in June 2025.
> ⚠️ This repository is for registered participants only.
---
## Content Overview
**Session 1: Foundations & Data Structures**
- Set up Python, IDE, and required libraries
- Pandas basics for financial time series
- Understanding OHLCV format
- Create your first crypto DataFrame with sample data
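
A minimal sketch of the kind of OHLCV DataFrame Session 1 ends with (hand-written sample values, purely illustrative):

```python
import pandas as pd

# Hand-written OHLCV sample: one row per hourly candle
ohlcv = pd.DataFrame(
    {
        "open":   [67100.0, 67250.5, 67180.0],
        "high":   [67300.0, 67400.0, 67220.0],
        "low":    [67050.0, 67150.0, 66900.0],
        "close":  [67250.5, 67180.0, 66950.0],
        "volume": [152.3, 98.7, 121.4],
    },
    index=pd.date_range("2024-06-01", periods=3, freq="h"),
)
print(ohlcv)
```
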
**Session 2: Data Acquisition & Exchange Connectivity**
- WebSocket basics for real-time crypto feeds (Binance focus)
- Fail-safe reconnection logic and error handling (see the sketch below)
- Logging basics for live systems
- Build tools: order flow scanner, liquidation monitor, funding rate tracker
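
A minimal sketch of the fail-safe reconnect pattern, assuming the `websockets` package and Binance's public trade stream (the URL, backoff values, and printed fields are illustrative):

```python
import asyncio
import json
import websockets  # pip install websockets

STREAM_URL = "wss://stream.binance.com:9443/ws/btcusdt@trade"

async def listen_forever() -> None:
    backoff = 1
    while True:  # reconnect loop: one dropped socket must not kill the feed
        try:
            async with websockets.connect(STREAM_URL, ping_interval=20) as ws:
                backoff = 1  # reset backoff after a successful connect
                async for raw in ws:
                    trade = json.loads(raw)
                    print(trade.get("p"), trade.get("q"))  # price, quantity
        except (websockets.ConnectionClosed, OSError) as exc:
            print(f"connection lost ({exc}); retrying in {backoff}s")
            await asyncio.sleep(backoff)
            backoff = min(backoff * 2, 60)  # exponential backoff, capped at 60s

# asyncio.run(listen_forever())
```
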
**Session 3: Data Processing & Technical Analysis**
- API access using CCXT
- Handle rate limits and API error scenarios
- Reconnect & retry mechanisms
- Use pandas-ta to compute SMA, EMA, RSI
- Create your own indicator pipeline
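
A minimal sketch of that pipeline: fetch recent candles over the CCXT REST API, then compute SMA, EMA, and RSI with pandas-ta (symbol, timeframe, and lookback lengths are illustrative):

```python
import ccxt             # pip install ccxt
import pandas as pd
import pandas_ta as ta  # pip install pandas-ta

exchange = ccxt.binance()
candles = exchange.fetch_ohlcv("BTC/USDT", timeframe="1h", limit=200)
df = pd.DataFrame(candles, columns=["timestamp", "open", "high", "low", "close", "volume"])
df["timestamp"] = pd.to_datetime(df["timestamp"], unit="ms")

# Indicator pipeline: each call returns a Series aligned to df's index
df["sma_20"] = ta.sma(df["close"], length=20)
df["ema_50"] = ta.ema(df["close"], length=50)
df["rsi_14"] = ta.rsi(df["close"], length=14)

print(df.tail())
```
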
**Session 4: Strategy Development & Backtesting**
- Overview of strategy types (trend, mean reversion)
- Backtesting with `backtesting.py` (see the sketch below)
- Compute Sharpe ratio, drawdown, profit factor
- Add position sizing, SL/TP, and walk-forward logic
- Adjust for fees, slippage, and latency
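
A minimal sketch of a `backtesting.py` run, using the library's bundled GOOG sample data and SMA helper only to keep the snippet self-contained (lookbacks, cash, and commission are illustrative):

```python
from backtesting import Backtest, Strategy
from backtesting.lib import crossover
from backtesting.test import GOOG, SMA  # bundled sample data and moving-average helper

class SmaCross(Strategy):
    fast, slow = 10, 30  # illustrative lookbacks

    def init(self):
        self.sma_fast = self.I(SMA, self.data.Close, self.fast)
        self.sma_slow = self.I(SMA, self.data.Close, self.slow)

    def next(self):
        if crossover(self.sma_fast, self.sma_slow):
            self.buy()
        elif crossover(self.sma_slow, self.sma_fast):
            self.position.close()

bt = Backtest(GOOG, SmaCross, cash=10_000, commission=0.002)  # commission approximates fees
stats = bt.run()
print(stats[["Sharpe Ratio", "Max. Drawdown [%]", "Profit Factor"]])
```
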
**Session 5: Bot Architecture & Implementation**
- Bot system design: event-driven vs loop-based (see the skeleton below)
- Core components: order manager, position tracker, error handler
- Risk constraints: daily limits, max size
- Logging & monitoring structure
- Write the engine core for your bot
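
A rough, loop-based skeleton of that engine shape; every name below is a placeholder rather than the course's final design:

```python
import time

class PaperOrderManager:
    """Placeholder order manager: records orders instead of sending them."""

    def submit(self, side: str, size: float) -> None:
        print(f"[order] {side} {size}")

def run_loop(fetch_price, generate_signal, orders: PaperOrderManager,
             max_position: float = 0.01, interval_s: float = 5.0) -> None:
    position = 0.0
    while True:                                  # loop-based core: poll, decide, act, sleep
        price = fetch_price()
        signal = generate_signal(price)          # e.g. "buy", "sell", or None
        if signal == "buy" and position < max_position:
            orders.submit("buy", max_position - position)
            position = max_position              # risk constraint: never exceed max size
        elif signal == "sell" and position > 0:
            orders.submit("sell", position)
            position = 0.0
        time.sleep(interval_s)
```
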
**Session 6: Live Trading & Deployment**
- API keys and secure credential handling
- Deployment targets: local, VPS, cloud (e.g., Hetzner)
- Running 24/7: restart logic, alerting
- Final bot launch + testing in production
- Send alerts via Telegram or email
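
A minimal sketch of a Telegram alert through the Bot API's `sendMessage` endpoint (the environment variable names are assumptions; keep credentials out of the code):

```python
import os
import requests  # pip install requests

def send_telegram_alert(text: str) -> None:
    token = os.environ["TELEGRAM_BOT_TOKEN"]   # assumed env var names, set on the host
    chat_id = os.environ["TELEGRAM_CHAT_ID"]
    resp = requests.post(
        f"https://api.telegram.org/bot{token}/sendMessage",
        json={"chat_id": chat_id, "text": text},
        timeout=10,
    )
    resp.raise_for_status()

# send_telegram_alert("bot restarted")
```
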
---
## 🤖 AI-Enhanced Trading
Bonus section:
- Use ChatGPT/Claude for strategy suggestions
- Integrate AI-based filters or signal generation
- Let LLMs help you refactor and extend your logic
---
## 📁 Repository Structure
```text
/Session_01/ # Foundations & DataFrame Handling
/Session_02/ # WebSockets & Real-Time Feed Tools
/Session_03/ # Indicators & Analysis
/Session_04/ # Backtesting + Strategy Logic
/Session_05/ # Trading Bot Core Engine
/Session_06/ # Live Deployment and Monitoring
/templates/ # Starter and final bot code
/utils/ # Helper scripts for logging, reconnection, etc.
README.md # You are here
```
---
## 🛠 Requirements
- Python 3.10+
- Install dependencies session by session from each session folder, or all at once via the provided top-level `requirements.txt`
---
## 📫 Support
You can reach Alex directly at [alex@quantjourney.pro](mailto:alex@quantjourney.pro) for post-course support (1 week included).
---
## ⚠️ Disclaimer
This project is for **educational use only**. No financial advice. Always trade with caution and use proper risk management.
---
Happy coding and trade smart.
QuantJourney Team

Session_01/.DS_Store (BIN, vendored, new file; binary file not shown)

Session_01/Data/BTCUSD-1h-data.csv (new executable file, 83,955 lines added; file diff suppressed because it is too large)


@@ -0,0 +1,391 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Session 1 - DataFrames - Lesson 1: Creating DataFrames\n",
"\n",
"## Learning Objectives\n",
"- Understand different methods to create pandas DataFrames\n",
"- Learn to create DataFrames from dictionaries, lists, and NumPy arrays\n",
"- Practice with various data types and structures\n",
"\n",
"## Prerequisites\n",
"- Basic Python knowledge\n",
"- Understanding of lists and dictionaries"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Pandas version: 2.2.3\n",
"NumPy version: 2.2.6\n"
]
}
],
"source": [
"# Import required libraries\n",
"import pandas as pd\n",
"import numpy as np\n",
"from datetime import datetime, timedelta\n",
"\n",
"print(f\"Pandas version: {pd.__version__}\")\n",
"print(f\"NumPy version: {np.__version__}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Method 1: Creating DataFrame from Dictionary\n",
"\n",
"This is the most common and intuitive way to create a DataFrame."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Student DataFrame:\n",
" Name Age Grade Score\n",
"0 Alice 23 A 95\n",
"1 Bob 25 B 87\n",
"2 Charlie 22 A 92\n",
"3 Diana 24 C 78\n",
"4 Eve 23 B 89\n",
"\n",
"Shape: (5, 4)\n",
"Data types:\n",
"Name object\n",
"Age int64\n",
"Grade object\n",
"Score int64\n",
"dtype: object\n"
]
}
],
"source": [
"# Creating DataFrame from dictionary\n",
"student_data = {\n",
" 'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],\n",
" 'Age': [23, 25, 22, 24, 23],\n",
" 'Grade': ['A', 'B', 'A', 'C', 'B'],\n",
" 'Score': [95, 87, 92, 78, 89]\n",
"}\n",
"\n",
"df_students = pd.DataFrame(student_data)\n",
"print(\"Student DataFrame:\")\n",
"print(df_students)\n",
"print(f\"\\nShape: {df_students.shape}\")\n",
"print(f\"Data types:\\n{df_students.dtypes}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Method 2: Creating DataFrame from Lists\n",
"\n",
"You can create DataFrames from separate lists by combining them in a dictionary."
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Cities DataFrame:\n",
" City Population_Million Country\n",
"0 New York 8.4 USA\n",
"1 London 8.9 UK\n",
"2 Tokyo 13.9 Japan\n",
"3 Paris 2.1 France\n",
"4 Sydney 5.3 Australia\n",
"\n",
"Index: [0, 1, 2, 3, 4]\n",
"Columns: ['City', 'Population_Million', 'Country']\n"
]
}
],
"source": [
"# Creating DataFrame from separate lists\n",
"cities = ['New York', 'London', 'Tokyo', 'Paris', 'Sydney']\n",
"populations = [8.4, 8.9, 13.9, 2.1, 5.3]\n",
"countries = ['USA', 'UK', 'Japan', 'France', 'Australia']\n",
"\n",
"df_cities = pd.DataFrame({\n",
" 'City': cities,\n",
" 'Population_Million': populations,\n",
" 'Country': countries\n",
"})\n",
"\n",
"print(\"Cities DataFrame:\")\n",
"print(df_cities)\n",
"print(f\"\\nIndex: {df_cities.index.tolist()}\")\n",
"print(f\"Columns: {df_cities.columns.tolist()}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Method 3: Creating DataFrame from NumPy Array\n",
"\n",
"This method is useful when working with numerical data or when you need random data for testing."
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Random DataFrame:\n",
" Column_A Column_B Column_C\n",
"Row1 52 93 15\n",
"Row2 72 61 21\n",
"Row3 83 87 75\n",
"Row4 75 88 24\n",
"Row5 3 22 53\n",
"\n",
"Summary statistics:\n",
" Column_A Column_B Column_C\n",
"count 5.000000 5.000000 5.000000\n",
"mean 57.000000 70.200000 37.600000\n",
"std 32.272279 29.693434 25.530374\n",
"min 3.000000 22.000000 15.000000\n",
"25% 52.000000 61.000000 21.000000\n",
"50% 72.000000 87.000000 24.000000\n",
"75% 75.000000 88.000000 53.000000\n",
"max 83.000000 93.000000 75.000000\n"
]
}
],
"source": [
"# Creating DataFrame from NumPy array\n",
"np.random.seed(42) # For reproducible results\n",
"random_data = np.random.randint(1, 100, size=(5, 3))\n",
"\n",
"df_random = pd.DataFrame(random_data, \n",
" columns=['Column_A', 'Column_B', 'Column_C'],\n",
" index=['Row1', 'Row2', 'Row3', 'Row4', 'Row5'])\n",
"\n",
"print(\"Random DataFrame:\")\n",
"print(df_random)\n",
"print(f\"\\nSummary statistics:\")\n",
"print(df_random.describe())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Method 4: Creating DataFrame with Custom Index\n",
"\n",
"You can specify custom row labels (index) when creating DataFrames."
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Products DataFrame with Custom Index:\n",
" Product Price Stock\n",
"PROD001 Laptop 1200 15\n",
"PROD002 Phone 800 50\n",
"PROD003 Tablet 600 30\n",
"PROD004 Monitor 300 20\n",
"\n",
"Accessing by index label 'PROD002':\n",
"Product Phone\n",
"Price 800\n",
"Stock 50\n",
"Name: PROD002, dtype: object\n"
]
}
],
"source": [
"# Creating DataFrame with custom index\n",
"product_data = {\n",
" 'Product': ['Laptop', 'Phone', 'Tablet', 'Monitor'],\n",
" 'Price': [1200, 800, 600, 300],\n",
" 'Stock': [15, 50, 30, 20]\n",
"}\n",
"\n",
"# Custom index using product codes\n",
"custom_index = ['PROD001', 'PROD002', 'PROD003', 'PROD004']\n",
"df_products = pd.DataFrame(product_data, index=custom_index)\n",
"\n",
"print(\"Products DataFrame with Custom Index:\")\n",
"print(df_products)\n",
"print(f\"\\nAccessing by index label 'PROD002':\")\n",
"print(df_products.loc['PROD002'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Method 5: Creating Empty DataFrame and Adding Data\n",
"\n",
"Sometimes you need to start with an empty DataFrame and add data incrementally."
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Empty DataFrame:\n",
"Empty DataFrame\n",
"Columns: [Date, Temperature, Humidity, Pressure]\n",
"Index: []\n",
"Shape: (0, 4)\n",
"\n",
"DataFrame after adding data:\n",
" Date Temperature Humidity Pressure\n",
"0 2024-01-01 22.5 65 1013.2\n",
"1 2024-01-02 24.1 68 1015.1\n",
"2 2024-01-03 21.8 72 1012.8\n"
]
}
],
"source": [
"# Creating empty DataFrame with specified columns\n",
"columns = ['Date', 'Temperature', 'Humidity', 'Pressure']\n",
"df_weather = pd.DataFrame(columns=columns)\n",
"\n",
"print(\"Empty DataFrame:\")\n",
"print(df_weather)\n",
"print(f\"Shape: {df_weather.shape}\")\n",
"\n",
"# Adding data row by row (not recommended for large datasets)\n",
"weather_data = [\n",
" ['2024-01-01', 22.5, 65, 1013.2],\n",
" ['2024-01-02', 24.1, 68, 1015.1],\n",
" ['2024-01-03', 21.8, 72, 1012.8]\n",
"]\n",
"\n",
"for row in weather_data:\n",
" df_weather.loc[len(df_weather)] = row\n",
"\n",
"print(\"\\nDataFrame after adding data:\")\n",
"print(df_weather)"
]
},
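{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of the usually preferred alternative to the row-by-row loop above: collect the rows in a plain Python list first, then build the DataFrame in a single call (the `weather_rows` and `df_weather_bulk` names are illustrative; `columns` is reused from the cell above)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: collect rows in a list, then build the DataFrame once\n",
"weather_rows = [\n",
"    {'Date': '2024-01-01', 'Temperature': 22.5, 'Humidity': 65, 'Pressure': 1013.2},\n",
"    {'Date': '2024-01-02', 'Temperature': 24.1, 'Humidity': 68, 'Pressure': 1015.1},\n",
"    {'Date': '2024-01-03', 'Temperature': 21.8, 'Humidity': 72, 'Pressure': 1012.8}\n",
"]\n",
"\n",
"df_weather_bulk = pd.DataFrame(weather_rows, columns=columns)\n",
"print(df_weather_bulk)"
]
},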
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Practice Exercises\n",
"\n",
"Try these exercises to reinforce your learning:"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [],
"source": [
"# Exercise 1: Create a DataFrame from dictionary with employee information\n",
"# Include: Employee ID, Name, Department, Salary, Years of Experience\n",
"\n",
"# Your code here:\n",
"employee_data = {\n",
" # Add your data here\n",
"}\n",
"\n",
"# df_employees = pd.DataFrame(employee_data)\n",
"# print(df_employees)"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [],
"source": [
"# Exercise 2: Create a DataFrame using NumPy with 6 rows and 4 columns\n",
"# Use column names: 'A', 'B', 'C', 'D'\n",
"# Use row indices: 'R1', 'R2', 'R3', 'R4', 'R5', 'R6'\n",
"\n",
"# Your code here:\n"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [],
"source": [
"# Exercise 3: Create a DataFrame with mixed data types\n",
"# Include at least one string, integer, float, and boolean column\n",
"\n",
"# Your code here:\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Key Takeaways\n",
"\n",
"1. **Dictionary method** is most intuitive for creating DataFrames\n",
"2. **NumPy arrays** are useful for numerical data and testing\n",
"3. **Custom indices** provide meaningful row labels\n",
"4. **Empty DataFrames** can be useful but avoid adding rows one by one for large datasets\n",
"5. Always check the **shape** and **data types** of your DataFrame after creation\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.13.3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}


@@ -0,0 +1,523 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Session 1 - DataFrames - Lesson 2: Basic Operations\n",
"\n",
"## Learning Objectives\n",
"- Learn essential methods to explore DataFrame structure\n",
"- Understand how to get basic information about your data\n",
"- Master data inspection techniques\n",
"- Practice with summary statistics\n",
"\n",
"## Prerequisites\n",
"- Completed Lesson 1: Creating DataFrames\n",
"- Basic understanding of pandas DataFrames"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [],
"source": [
"# Import required libraries\n",
"import pandas as pd\n",
"import numpy as np\n",
"from datetime import datetime, timedelta\n",
"\n",
"# Set display options for better output\n",
"pd.set_option('display.max_columns', None)\n",
"pd.set_option('display.width', None)\n",
"pd.set_option('display.max_colwidth', 50)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Creating Sample Dataset\n",
"\n",
"Let's create a comprehensive sales dataset to practice basic operations."
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Sales Dataset Created!\n",
"Dataset shape: (20, 6)\n"
]
}
],
"source": [
"# Create a comprehensive sales dataset\n",
"np.random.seed(42)\n",
"\n",
"sales_data = {\n",
" 'Date': pd.date_range('2024-01-01', periods=20, freq='D'),\n",
" 'Product': ['Laptop', 'Phone', 'Tablet', 'Laptop', 'Phone'] * 4,\n",
" 'Sales': [1200, 800, 600, 1100, 850, 1300, 750, 650, 1250, 900,\n",
" 1150, 820, 700, 1180, 880, 1220, 780, 620, 1300, 850],\n",
" 'Region': ['North', 'South', 'East', 'West', 'North'] * 4,\n",
" 'Salesperson': ['John', 'Sarah', 'Mike', 'Lisa', 'Tom'] * 4,\n",
" 'Commission_Rate': [0.10, 0.12, 0.08, 0.11, 0.09] * 4\n",
"}\n",
"\n",
"df_sales = pd.DataFrame(sales_data)\n",
"print(\"Sales Dataset Created!\")\n",
"print(f\"Dataset shape: {df_sales.shape}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Viewing Data\n",
"\n",
"These methods help you quickly inspect your data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# View first few rows\n",
"print(\"First 5 rows (default):\")\n",
"print(df_sales.head())\n",
"\n",
"print(\"\\nFirst 3 rows:\")\n",
"print(df_sales.head(3))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# View last few rows\n",
"print(\"Last 5 rows (default):\")\n",
"print(df_sales.tail())\n",
"\n",
"print(\"\\nLast 3 rows:\")\n",
"print(df_sales.tail(3))"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Random sample of 5 rows:\n",
" Date Product Sales Region Salesperson Commission_Rate\n",
"0 2024-01-01 Laptop 1200 North John 0.10\n",
"17 2024-01-18 Tablet 620 East Mike 0.08\n",
"15 2024-01-16 Laptop 1220 North John 0.10\n",
"1 2024-01-02 Phone 800 South Sarah 0.12\n",
"8 2024-01-09 Laptop 1250 West Lisa 0.11\n",
"\n",
"Random sample with different random state:\n",
" Date Product Sales Region Salesperson Commission_Rate\n",
"7 2024-01-08 Tablet 650 East Mike 0.08\n",
"10 2024-01-11 Laptop 1150 North John 0.10\n",
"5 2024-01-06 Laptop 1300 North John 0.10\n"
]
}
],
"source": [
"# Sample random rows\n",
"print(\"Random sample of 5 rows:\")\n",
"print(df_sales.sample(5))\n",
"\n",
"print(\"\\nRandom sample with different random state:\")\n",
"print(df_sales.sample(3, random_state=10))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. DataFrame Information\n",
"\n",
"Get detailed information about your DataFrame structure."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Comprehensive information about the DataFrame\n",
"print(\"DataFrame Info:\")\n",
"df_sales.info()\n",
"\n",
"print(\"\\nMemory usage:\")\n",
"df_sales.info(memory_usage='deep')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Basic properties\n",
"print(f\"Shape (rows, columns): {df_sales.shape}\")\n",
"print(f\"Number of rows: {len(df_sales)}\")\n",
"print(f\"Number of columns: {len(df_sales.columns)}\")\n",
"print(f\"Total elements: {df_sales.size}\")\n",
"print(f\"Dimensions: {df_sales.ndim}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Column and index information\n",
"print(\"Column names:\")\n",
"print(df_sales.columns.tolist())\n",
"\n",
"print(\"\\nData types:\")\n",
"print(df_sales.dtypes)\n",
"\n",
"print(\"\\nIndex information:\")\n",
"print(f\"Index: {df_sales.index}\")\n",
"print(f\"Index type: {type(df_sales.index)}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Summary Statistics\n",
"\n",
"Understand your data through statistical summaries."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Summary statistics for numerical columns\n",
"print(\"Summary statistics:\")\n",
"print(df_sales.describe())\n",
"\n",
"print(\"\\nRounded to 2 decimal places:\")\n",
"print(df_sales.describe().round(2))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Summary statistics for all columns (including non-numeric)\n",
"print(\"Summary for all columns:\")\n",
"print(df_sales.describe(include='all'))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Individual statistics\n",
"print(\"Individual Statistical Measures:\")\n",
"print(f\"Mean sales: {df_sales['Sales'].mean():.2f}\")\n",
"print(f\"Median sales: {df_sales['Sales'].median():.2f}\")\n",
"print(f\"Standard deviation: {df_sales['Sales'].std():.2f}\")\n",
"print(f\"Minimum sales: {df_sales['Sales'].min()}\")\n",
"print(f\"Maximum sales: {df_sales['Sales'].max()}\")\n",
"print(f\"Sales range: {df_sales['Sales'].max() - df_sales['Sales'].min()}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Quantiles and percentiles\n",
"print(\"Quantiles for Sales:\")\n",
"print(f\"25th percentile (Q1): {df_sales['Sales'].quantile(0.25)}\")\n",
"print(f\"50th percentile (Q2/Median): {df_sales['Sales'].quantile(0.50)}\")\n",
"print(f\"75th percentile (Q3): {df_sales['Sales'].quantile(0.75)}\")\n",
"print(f\"90th percentile: {df_sales['Sales'].quantile(0.90)}\")\n",
"\n",
"print(\"\\nCustom quantiles:\")\n",
"quantiles = df_sales['Sales'].quantile([0.1, 0.3, 0.7, 0.9])\n",
"print(quantiles)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Counting and Unique Values\n",
"\n",
"Understand the distribution of categorical data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Count unique values in each column\n",
"print(\"Number of unique values per column:\")\n",
"print(df_sales.nunique())\n",
"\n",
"print(\"\\nUnique values in 'Product' column:\")\n",
"print(df_sales['Product'].unique())\n",
"\n",
"print(\"\\nValue counts for 'Product':\")\n",
"print(df_sales['Product'].value_counts())"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Product distribution (counts and percentages):\n",
" Count Percentage\n",
"Product \n",
"Laptop 8 40.0\n",
"Phone 8 40.0\n",
"Tablet 4 20.0\n"
]
}
],
"source": [
"# Value counts with percentages\n",
"print(\"Product distribution (counts and percentages):\")\n",
"product_counts = df_sales['Product'].value_counts()\n",
"product_percentages = df_sales['Product'].value_counts(normalize=True) * 100\n",
"\n",
"distribution = pd.DataFrame({\n",
" 'Count': product_counts,\n",
" 'Percentage': product_percentages.round(1)\n",
"})\n",
"print(distribution)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Cross-tabulation\n",
"print(\"Cross-tabulation of Product vs Region:\")\n",
"crosstab = pd.crosstab(df_sales['Product'], df_sales['Region'])\n",
"print(crosstab)\n",
"\n",
"print(\"\\nWith percentages:\")\n",
"crosstab_pct = pd.crosstab(df_sales['Product'], df_sales['Region'], normalize='all') * 100\n",
"print(crosstab_pct.round(1))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. Data Quality Checks\n",
"\n",
"Essential checks for data quality and integrity."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Check for missing values\n",
"print(\"Missing values per column:\")\n",
"print(df_sales.isnull().sum())\n",
"\n",
"print(\"\\nPercentage of missing values:\")\n",
"missing_percentages = (df_sales.isnull().sum() / len(df_sales)) * 100\n",
"print(missing_percentages.round(2))\n",
"\n",
"print(\"\\nAny missing values in dataset?\", df_sales.isnull().any().any())"
]
},
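{
"cell_type": "markdown",
"metadata": {},
"source": [
"This dataset has no missing values, so the checks above come back clean. A minimal sketch of what changes once a NaN appears, using an illustrative copy (`df_missing_demo` is not part of the course dataset):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: introduce a single missing value in a copy and re-run the checks\n",
"df_missing_demo = df_sales.copy()\n",
"df_missing_demo['Sales'] = df_missing_demo['Sales'].astype('float64')  # int columns cannot hold NaN\n",
"df_missing_demo.loc[0, 'Sales'] = np.nan\n",
"\n",
"print(\"Missing values per column:\")\n",
"print(df_missing_demo.isnull().sum())\n",
"\n",
"# Statistics skip NaN by default (skipna=True), so results shift silently\n",
"print(f\"\\nMean Sales with one NaN: {df_missing_demo['Sales'].mean():.2f}\")\n",
"print(f\"Mean Sales in original: {df_sales['Sales'].mean():.2f}\")"
]
},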
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Check for duplicates\n",
"print(f\"Number of duplicate rows: {df_sales.duplicated().sum()}\")\n",
"print(f\"Any duplicate rows? {df_sales.duplicated().any()}\")\n",
"\n",
"# Check for duplicates based on specific columns\n",
"print(f\"\\nDuplicate combinations of Date and Salesperson: {df_sales.duplicated(['Date', 'Salesperson']).sum()}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6. Quick Data Exploration\n",
"\n",
"Rapid exploration techniques to understand your data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Quick exploration function\n",
"def quick_explore(df, column_name):\n",
" \"\"\"Quick exploration of a specific column\"\"\"\n",
" print(f\"=== Quick Exploration: {column_name} ===\")\n",
" col = df[column_name]\n",
" \n",
" print(f\"Data type: {col.dtype}\")\n",
" print(f\"Non-null values: {col.count()}/{len(col)}\")\n",
" print(f\"Unique values: {col.nunique()}\")\n",
" \n",
" if col.dtype in ['int64', 'float64']:\n",
" print(f\"Min: {col.min()}, Max: {col.max()}\")\n",
" print(f\"Mean: {col.mean():.2f}, Median: {col.median():.2f}\")\n",
" else:\n",
" print(f\"Most common: {col.mode().iloc[0] if not col.mode().empty else 'N/A'}\")\n",
" print(f\"Sample values: {col.unique()[:5].tolist()}\")\n",
" print()\n",
"\n",
"# Explore different columns\n",
"for col in ['Sales', 'Product', 'Region']:\n",
" quick_explore(df_sales, col)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Practice Exercises\n",
"\n",
"Test your understanding with these exercises:"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [],
"source": [
"# Exercise 1: Create a larger dataset and explore it\n",
"# Create a dataset with 100 rows and at least 5 columns\n",
"# Include different data types (numeric, categorical, datetime)\n",
"\n",
"# Your code here:\n"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [],
"source": [
"# Exercise 2: Write a function that provides a complete data profile\n",
"# Include: shape, data types, missing values, unique values, and basic stats\n",
"\n",
"def data_profile(df):\n",
" \"\"\"Provide a comprehensive data profile\"\"\"\n",
" # Your code here:\n",
" pass\n",
"\n",
"# Test your function\n",
"# data_profile(df_sales)"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [],
"source": [
"# Exercise 3: Find interesting insights from the sales data\n",
"# Questions to answer:\n",
"# 1. Which product has the highest average sales?\n",
"# 2. Which region has the most consistent sales (lowest standard deviation)?\n",
"# 3. What's the total commission earned by each salesperson?\n",
"\n",
"# Your code here:\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Key Takeaways\n",
"\n",
"1. **`.head()` and `.tail()`** are essential for quick data inspection\n",
"2. **`.info()`** provides comprehensive DataFrame structure information\n",
"3. **`.describe()`** gives statistical summaries for numerical columns\n",
"4. **`.nunique()` and `.value_counts()`** help understand categorical data\n",
"5. **Always check for missing values** and duplicates in your data\n",
"6. **Statistical measures** (mean, median, std) provide insights into data distribution\n",
"7. **Cross-tabulation** helps understand relationships between categorical variables\n",
"\n",
"## Common Gotchas\n",
"\n",
"- `.describe()` only includes numeric columns by default (use `include='all'` for all columns)\n",
"- Missing values can affect statistical calculations\n",
"- Large datasets might need memory-efficient exploration techniques\n",
"- Always verify data types are correct for your analysis"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.13.3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}


@@ -0,0 +1,593 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Session 1 - DataFrames - Lesson 3: Selecting and Filtering Data\n",
"\n",
"## Learning Objectives\n",
"- Master column and row selection techniques\n",
"- Learn boolean indexing for data filtering\n",
"- Understand the difference between `.loc[]` and `.iloc[]`\n",
"- Practice complex filtering conditions\n",
"- Handle edge cases in data selection\n",
"\n",
"## Prerequisites\n",
"- Completed Lessons 1-2\n",
"- Understanding of Python boolean operations"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Import required libraries\n",
"import pandas as pd\n",
"import numpy as np\n",
"from datetime import datetime, timedelta\n",
"\n",
"# Create sample dataset\n",
"np.random.seed(42)\n",
"sales_data = {\n",
" 'Date': pd.date_range('2024-01-01', periods=20, freq='D'),\n",
" 'Product': ['Laptop', 'Phone', 'Tablet', 'Laptop', 'Phone'] * 4,\n",
" 'Sales': [1200, 800, 600, 1100, 850, 1300, 750, 650, 1250, 900,\n",
" 1150, 820, 700, 1180, 880, 1220, 780, 620, 1300, 850],\n",
" 'Region': ['North', 'South', 'East', 'West', 'North'] * 4,\n",
" 'Salesperson': ['John', 'Sarah', 'Mike', 'Lisa', 'Tom'] * 4,\n",
" 'Commission_Rate': [0.10, 0.12, 0.08, 0.11, 0.09] * 4\n",
"}\n",
"\n",
"df_sales = pd.DataFrame(sales_data)\n",
"print(\"Dataset loaded:\")\n",
"print(df_sales.head())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Selecting Columns\n",
"\n",
"Different ways to select columns from a DataFrame."
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Single column (Product) - Returns Series:\n",
"Type: <class 'pandas.core.series.Series'>\n",
"0 Laptop\n",
"1 Phone\n",
"2 Tablet\n",
"3 Laptop\n",
"4 Phone\n",
"Name: Product, dtype: object\n",
"\n",
"Single column with dot notation:\n",
"0 Laptop\n",
"1 Phone\n",
"2 Tablet\n",
"3 Laptop\n",
"4 Phone\n",
"Name: Product, dtype: object\n",
"\n",
"Single column as DataFrame (note the double brackets):\n",
"Type: <class 'pandas.core.frame.DataFrame'>\n",
" Product\n",
"0 Laptop\n",
"1 Phone\n",
"2 Tablet\n",
"3 Laptop\n",
"4 Phone\n"
]
}
],
"source": [
"# Single column selection (returns Series)\n",
"print(\"Single column (Product) - Returns Series:\")\n",
"product_series = df_sales['Product']\n",
"print(f\"Type: {type(product_series)}\")\n",
"print(product_series.head())\n",
"\n",
"print(\"\\nSingle column with dot notation:\")\n",
"print(df_sales.Product.head())\n",
"\n",
"print(\"\\nSingle column as DataFrame (note the double brackets):\")\n",
"product_df = df_sales[['Product']]\n",
"print(f\"Type: {type(product_df)}\")\n",
"print(product_df.head())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Multiple column selection\n",
"print(\"Multiple columns:\")\n",
"selected_cols = df_sales[['Product', 'Sales', 'Region']]\n",
"print(selected_cols.head())\n",
"\n",
"print(\"\\nUsing a list variable:\")\n",
"columns_to_select = ['Date', 'Salesperson', 'Sales']\n",
"selected_df = df_sales[columns_to_select]\n",
"print(selected_df.head())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Column selection with conditions\n",
"print(\"Selecting columns by data type:\")\n",
"numeric_cols = df_sales.select_dtypes(include=[np.number])\n",
"print(\"Numeric columns:\")\n",
"print(numeric_cols.head())\n",
"\n",
"print(\"\\nSelecting columns by name pattern:\")\n",
"# Columns containing 'S'\n",
"s_columns = [col for col in df_sales.columns if 'S' in col]\n",
"print(f\"Columns with 'S': {s_columns}\")\n",
"print(df_sales[s_columns].head())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Selecting Rows\n",
"\n",
"Different methods to select specific rows."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Row selection by index position\n",
"print(\"First row (index 0):\")\n",
"print(df_sales.iloc[0])\n",
"\n",
"print(\"\\nRows 2 to 4 (positions 1, 2, 3):\")\n",
"print(df_sales.iloc[1:4])\n",
"\n",
"print(\"\\nLast 3 rows:\")\n",
"print(df_sales.iloc[-3:])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Row selection by label/index\n",
"print(\"Using .loc with index labels:\")\n",
"print(df_sales.loc[0:2]) # Note: includes endpoint with .loc\n",
"\n",
"print(\"\\nSpecific rows by index:\")\n",
"specific_rows = df_sales.loc[[0, 5, 10, 15]]\n",
"print(specific_rows[['Product', 'Sales', 'Region']])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Random sampling\n",
"print(\"Random sample of 5 rows:\")\n",
"random_sample = df_sales.sample(n=5, random_state=42)\n",
"print(random_sample[['Product', 'Sales', 'Salesperson']])\n",
"\n",
"print(\"\\nRandom 25% of the data:\")\n",
"percentage_sample = df_sales.sample(frac=0.25, random_state=42)\n",
"print(f\"Sample size: {len(percentage_sample)} rows\")\n",
"print(percentage_sample[['Product', 'Sales']].head())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Boolean Indexing and Filtering\n",
"\n",
"Filter data based on conditions using boolean indexing."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Simple boolean conditions\n",
"print(\"Sales greater than 1000:\")\n",
"high_sales = df_sales[df_sales['Sales'] > 1000]\n",
"print(high_sales[['Product', 'Sales', 'Region']])\n",
"\n",
"print(\"\\nSpecific product filter:\")\n",
"laptops_only = df_sales[df_sales['Product'] == 'Laptop']\n",
"print(laptops_only[['Date', 'Sales', 'Salesperson']])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Multiple conditions with AND (&)\n",
"print(\"Laptops with sales > 1100:\")\n",
"laptop_high_sales = df_sales[(df_sales['Product'] == 'Laptop') & (df_sales['Sales'] > 1100)]\n",
"print(laptop_high_sales[['Date', 'Product', 'Sales', 'Region']])\n",
"\n",
"print(\"\\nNorth region with commission rate >= 0.10:\")\n",
"north_high_commission = df_sales[(df_sales['Region'] == 'North') & (df_sales['Commission_Rate'] >= 0.10)]\n",
"print(north_high_commission[['Product', 'Sales', 'Commission_Rate', 'Salesperson']])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Multiple conditions with OR (|)\n",
"print(\"Laptops OR high sales (>1200):\")\n",
"laptop_or_high = df_sales[(df_sales['Product'] == 'Laptop') | (df_sales['Sales'] > 1200)]\n",
"print(laptop_or_high[['Product', 'Sales', 'Region']])\n",
"\n",
"print(\"\\nNorth OR South regions:\")\n",
"north_or_south = df_sales[(df_sales['Region'] == 'North') | (df_sales['Region'] == 'South')]\n",
"print(north_or_south[['Product', 'Sales', 'Region']].head())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Using .isin() for multiple values\n",
"print(\"Products: Laptop or Phone\")\n",
"laptop_phone = df_sales[df_sales['Product'].isin(['Laptop', 'Phone'])]\n",
"print(laptop_phone[['Product', 'Sales', 'Region']].head())\n",
"\n",
"print(\"\\nSpecific salespersons:\")\n",
"selected_salespeople = df_sales[df_sales['Salesperson'].isin(['John', 'Sarah'])]\n",
"print(selected_salespeople[['Salesperson', 'Product', 'Sales']].head())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# NOT conditions using ~\n",
"print(\"NOT Tablets:\")\n",
"not_tablets = df_sales[~(df_sales['Product'] == 'Tablet')]\n",
"print(not_tablets['Product'].value_counts())\n",
"\n",
"print(\"\\nNOT in North region:\")\n",
"not_north = df_sales[~df_sales['Region'].isin(['North'])]\n",
"print(not_north['Region'].value_counts())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Advanced Selection with .loc and .iloc\n",
"\n",
"Powerful selection methods for precise data access."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# .loc for label-based selection\n",
"print(\".loc examples - Label-based selection:\")\n",
"\n",
"# Select specific rows and columns\n",
"print(\"Rows 0-2, specific columns:\")\n",
"result = df_sales.loc[0:2, ['Product', 'Sales', 'Region']]\n",
"print(result)\n",
"\n",
"print(\"\\nAll rows, specific columns:\")\n",
"result = df_sales.loc[:, ['Product', 'Sales']]\n",
"print(result.head())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# .iloc for position-based selection\n",
"print(\".iloc examples - Position-based selection:\")\n",
"\n",
"# Select by position\n",
"print(\"First 3 rows, first 3 columns:\")\n",
"result = df_sales.iloc[0:3, 0:3]\n",
"print(result)\n",
"\n",
"print(\"\\nEvery other row, specific columns:\")\n",
"result = df_sales.iloc[::2, [1, 2, 3]] # Every 2nd row, columns 1,2,3\n",
"print(result.head())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Combining boolean indexing with .loc\n",
"print(\"Boolean indexing with .loc:\")\n",
"\n",
"# High sales, specific columns\n",
"high_sales_subset = df_sales.loc[df_sales['Sales'] > 1000, ['Product', 'Sales', 'Salesperson']]\n",
"print(high_sales_subset)\n",
"\n",
"print(\"\\nComplex condition with .loc:\")\n",
"complex_filter = (df_sales['Product'] == 'Laptop') & (df_sales['Region'] == 'North')\n",
"result = df_sales.loc[complex_filter, ['Date', 'Sales', 'Commission_Rate']]\n",
"print(result)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. String-based Filtering\n",
"\n",
"Filter data based on string patterns and conditions."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# String methods for filtering\n",
"print(\"Salesperson names starting with 'J':\")\n",
"j_names = df_sales[df_sales['Salesperson'].str.startswith('J')]\n",
"print(j_names[['Salesperson', 'Product', 'Sales']].head())\n",
"\n",
"print(\"\\nRegions containing 'th':\")\n",
"th_regions = df_sales[df_sales['Region'].str.contains('th')]\n",
"print(th_regions[['Region', 'Product', 'Sales']].head())\n",
"\n",
"print(\"\\nProducts with exactly 5 characters:\")\n",
"five_char_products = df_sales[df_sales['Product'].str.len() == 5]\n",
"print(five_char_products['Product'].unique())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6. Date-based Filtering\n",
"\n",
"Filter data based on date conditions."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Date filtering\n",
"print(\"Data from first week of January 2024:\")\n",
"first_week = df_sales[df_sales['Date'] <= '2024-01-07']\n",
"print(first_week[['Date', 'Product', 'Sales']])\n",
"\n",
"print(\"\\nData from specific date range:\")\n",
"date_range = df_sales[(df_sales['Date'] >= '2024-01-10') & (df_sales['Date'] <= '2024-01-15')]\n",
"print(date_range[['Date', 'Product', 'Sales', 'Region']])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Using date components\n",
"print(\"Data from weekends (Saturday=5, Sunday=6):\")\n",
"weekends = df_sales[df_sales['Date'].dt.dayofweek >= 5]\n",
"print(weekends[['Date', 'Product', 'Sales']])\n",
"\n",
"print(\"\\nData from specific days of week:\")\n",
"mondays = df_sales[df_sales['Date'].dt.day_name() == 'Monday']\n",
"print(f\"Monday sales: {len(mondays)} records\")\n",
"if len(mondays) > 0:\n",
" print(mondays[['Date', 'Product', 'Sales']].head())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 7. Query Method\n",
"\n",
"Alternative syntax for filtering using the `.query()` method."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Using .query() method for cleaner syntax\n",
"print(\"Using .query() method:\")\n",
"\n",
"# Simple condition\n",
"high_sales_query = df_sales.query('Sales > 1000')\n",
"print(f\"High sales records: {len(high_sales_query)}\")\n",
"print(high_sales_query[['Product', 'Sales', 'Region']].head())\n",
"\n",
"print(\"\\nMultiple conditions:\")\n",
"complex_query = df_sales.query('Product == \"Laptop\" and Region == \"North\"')\n",
"print(complex_query[['Date', 'Sales', 'Commission_Rate']])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Query with variables\n",
"min_sales = 900\n",
"target_region = 'East'\n",
"\n",
"print(\"Query with variables:\")\n",
"var_query = df_sales.query('Sales >= @min_sales and Region == @target_region')\n",
"print(var_query[['Product', 'Sales', 'Region']])\n",
"\n",
"print(\"\\nQuery with list (isin equivalent):\")\n",
"products = ['Laptop', 'Phone']\n",
"list_query = df_sales.query('Product in @products')\n",
"print(f\"Records for {products}: {len(list_query)}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Practice Exercises\n",
"\n",
"Test your filtering and selection skills:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Exercise 1: Complex Filtering\n",
"# Find all sales where:\n",
"# - Product is either 'Laptop' or 'Phone'\n",
"# - Sales are above the median\n",
"# - Commission rate is at least 0.10\n",
"# Show only Date, Product, Sales, and Salesperson columns\n",
"\n",
"# Your code here:\n",
"median_sales = df_sales['Sales'].median()\n",
"print(f\"Median sales: {median_sales}\")\n",
"\n",
"# complex_filter = ?\n",
"# result = ?\n",
"# print(result)"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [],
"source": [
"# Exercise 2: Date-based Analysis\n",
"# Find sales data for the second week of January 2024\n",
"# Calculate the average sales for that week\n",
"# Show which products were sold and by whom\n",
"\n",
"# Your code here:\n"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [],
"source": [
"# Exercise 3: Performance Analysis\n",
"# Create a function that finds top performers:\n",
"# - Takes a DataFrame and a percentile (e.g., 0.8 for top 20%)\n",
"# - Returns salespeople whose average sales are in the top percentile\n",
"# - Show their average sales and total number of sales\n",
"\n",
"def find_top_performers(df, percentile=0.8):\n",
" \"\"\"Find top performing salespeople\"\"\"\n",
" # Your code here:\n",
" pass\n",
"\n",
"# Test your function\n",
"# top_performers = find_top_performers(df_sales, 0.8)\n",
"# print(top_performers)"
]
},
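{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before the takeaways, a minimal sketch of two of the pitfalls listed below: chained indexing versus a single `.loc[]` call, and why `&`/`|` with parentheses are needed instead of `and`/`or` (variable names here are illustrative):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Chained indexing: two separate indexing steps; assignments through it may hit a copy\n",
"chained = df_sales[df_sales['Region'] == 'North']['Sales']\n",
"\n",
"# Single .loc[] call: one step, safe for both reading and writing\n",
"loc_based = df_sales.loc[df_sales['Region'] == 'North', 'Sales']\n",
"\n",
"print(f\"Same values either way: {chained.equals(loc_based)}\")\n",
"\n",
"# Combining conditions requires & / | and parentheses; `and` / `or` raise an error here\n",
"combined = df_sales.loc[(df_sales['Region'] == 'North') & (df_sales['Sales'] > 1000), 'Sales']\n",
"print(f\"North rows with Sales > 1000: {len(combined)}\")"
]
},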
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Key Takeaways\n",
"\n",
"1. **Column Selection**: Use `[]` for single/multiple columns, understand Series vs DataFrame return types\n",
"2. **Row Selection**: `.iloc[]` for position-based, `.loc[]` for label-based selection\n",
"3. **Boolean Indexing**: Use `&` (AND), `|` (OR), `~` (NOT) for combining conditions\n",
"4. **Parentheses Matter**: Always wrap individual conditions in parentheses when combining\n",
"5. **`.isin()` Method**: Efficient way to filter for multiple values\n",
"6. **String Methods**: Use `.str` accessor for string-based filtering\n",
"7. **Date Filtering**: Leverage `.dt` accessor for date-based conditions\n",
"8. **`.query()` Method**: Alternative syntax for complex filtering\n",
"\n",
"## Common Mistakes to Avoid\n",
"\n",
"- Using `and/or` instead of `&/|` in boolean conditions\n",
"- Forgetting parentheses around conditions\n",
"- Confusing `.loc[]` and `.iloc[]` usage\n",
"- Not handling empty results from filtering\n",
"- Using chained indexing instead of `.loc[]`\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.13.3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}

File diff suppressed because it is too large.


@ -0,0 +1,733 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Session 1 - DataFrames - Lesson 5: Adding and Modifying Columns\n",
"\n",
"## Learning Objectives\n",
"- Learn different methods to add new columns to DataFrames\n",
"- Master conditional column creation using various techniques\n",
"- Understand how to modify existing columns\n",
"- Practice with calculated fields and derived columns\n",
"- Explore data type conversions and transformations\n",
"\n",
"## Prerequisites\n",
"- Completed Lessons 1-4\n",
"- Understanding of basic Python operations and functions"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Import required libraries\n",
"import pandas as pd\n",
"import numpy as np\n",
"from datetime import datetime, timedelta\n",
"\n",
"# Create sample dataset\n",
"np.random.seed(42)\n",
"n_records = 150\n",
"\n",
"sales_data = {\n",
" 'Date': pd.date_range('2024-01-01', periods=n_records, freq='D'),\n",
" 'Product': np.random.choice(['Laptop', 'Phone', 'Tablet', 'Monitor'], n_records),\n",
" 'Sales': np.random.normal(1000, 200, n_records).astype(int),\n",
" 'Quantity': np.random.randint(1, 8, n_records),\n",
" 'Region': np.random.choice(['North', 'South', 'East', 'West'], n_records),\n",
" 'Salesperson': np.random.choice(['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'], n_records),\n",
" 'Customer_Type': np.random.choice(['New', 'Returning', 'VIP'], n_records, p=[0.3, 0.6, 0.1])\n",
"}\n",
"\n",
"df_sales = pd.DataFrame(sales_data)\n",
"df_sales['Sales'] = np.abs(df_sales['Sales']) # Ensure positive values\n",
"\n",
"print(\"Original dataset:\")\n",
"print(f\"Shape: {df_sales.shape}\")\n",
"print(\"\\nFirst few rows:\")\n",
"print(df_sales.head())\n",
"print(\"\\nData types:\")\n",
"print(df_sales.dtypes)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Basic Column Addition\n",
"\n",
"Simple methods to add new columns."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Method 1: Direct assignment\n",
"df_modified = df_sales.copy()\n",
"\n",
"# Add simple calculated columns\n",
"df_modified['Revenue'] = df_modified['Sales'] * df_modified['Quantity']\n",
"df_modified['Commission_10%'] = df_modified['Sales'] * 0.10\n",
"df_modified['Sales_per_Unit'] = df_modified['Sales'] / df_modified['Quantity']\n",
"\n",
"print(\"New calculated columns:\")\n",
"print(df_modified[['Sales', 'Quantity', 'Revenue', 'Commission_10%', 'Sales_per_Unit']].head())\n",
"\n",
"# Add constant value column\n",
"df_modified['Year'] = 2024\n",
"df_modified['Currency'] = 'USD'\n",
"df_modified['Department'] = 'Sales'\n",
"\n",
"print(\"\\nConstant value columns added:\")\n",
"print(df_modified[['Year', 'Currency', 'Department']].head())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Method 2: Using assign() method (more functional approach)\n",
"df_assigned = df_sales.assign(\n",
" Revenue=lambda x: x['Sales'] * x['Quantity'],\n",
" Commission_Rate=0.08,\n",
" Commission_Amount=lambda x: x['Sales'] * 0.08,\n",
" Sales_Squared=lambda x: x['Sales'] ** 2,\n",
" Is_High_Volume=lambda x: x['Quantity'] > 5\n",
")\n",
"\n",
"print(\"Using assign() method:\")\n",
"print(df_assigned[['Sales', 'Quantity', 'Revenue', 'Commission_Amount', 'Is_High_Volume']].head())\n",
"\n",
"print(f\"\\nOriginal shape: {df_sales.shape}\")\n",
"print(f\"Modified shape: {df_assigned.shape}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Method 3: Using insert() for specific positioning\n",
"df_insert = df_sales.copy()\n",
"\n",
"# Insert column at specific position (after 'Sales')\n",
"sales_index = df_insert.columns.get_loc('Sales')\n",
"df_insert.insert(sales_index + 1, 'Sales_Tax', df_insert['Sales'] * 0.08)\n",
"df_insert.insert(sales_index + 2, 'Total_with_Tax', df_insert['Sales'] + df_insert['Sales_Tax'])\n",
"\n",
"print(\"Using insert() for positioned columns:\")\n",
"print(df_insert[['Product', 'Sales', 'Sales_Tax', 'Total_with_Tax', 'Quantity']].head())\n",
"print(f\"\\nColumn order: {list(df_insert.columns)}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Conditional Column Creation\n",
"\n",
"Create columns based on conditions and business logic."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Method 1: Using np.where() for simple conditions\n",
"df_conditional = df_sales.copy()\n",
"\n",
"# Simple binary conditions\n",
"df_conditional['High_Sales'] = np.where(df_conditional['Sales'] > 1000, 'Yes', 'No')\n",
"df_conditional['Weekend'] = np.where(df_conditional['Date'].dt.dayofweek >= 5, 'Weekend', 'Weekday')\n",
"df_conditional['Bulk_Order'] = np.where(df_conditional['Quantity'] >= 5, 'Bulk', 'Regular')\n",
"\n",
"print(\"Simple conditional columns:\")\n",
"print(df_conditional[['Sales', 'High_Sales', 'Date', 'Weekend', 'Quantity', 'Bulk_Order']].head())\n",
"\n",
"# Nested conditions\n",
"df_conditional['Sales_Category'] = np.where(df_conditional['Sales'] > 1200, 'High',\n",
" np.where(df_conditional['Sales'] > 800, 'Medium', 'Low'))\n",
"\n",
"print(\"\\nNested conditions:\")\n",
"print(df_conditional[['Sales', 'Sales_Category']].head(10))\n",
"print(\"\\nCategory distribution:\")\n",
"print(df_conditional['Sales_Category'].value_counts())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Method 2: Using pd.cut() for binning numerical data\n",
"df_conditional['Sales_Tier'] = pd.cut(df_conditional['Sales'], \n",
" bins=[0, 500, 800, 1200, float('inf')],\n",
" labels=['Entry', 'Standard', 'Premium', 'Luxury'])\n",
"\n",
"print(\"Using pd.cut() for binning:\")\n",
"print(df_conditional[['Sales', 'Sales_Tier']].head(10))\n",
"print(\"\\nTier distribution:\")\n",
"print(df_conditional['Sales_Tier'].value_counts())\n",
"\n",
"# Using pd.qcut() for quantile-based binning\n",
"df_conditional['Sales_Quintile'] = pd.qcut(df_conditional['Sales'], \n",
" q=5, \n",
" labels=['Bottom 20%', 'Low 20%', 'Mid 20%', 'High 20%', 'Top 20%'])\n",
"\n",
"print(\"\\nUsing pd.qcut() for quantile binning:\")\n",
"print(df_conditional['Sales_Quintile'].value_counts())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Method 3: Using pandas.select() for multiple conditions\n",
"# Define conditions and choices\n",
"conditions = [\n",
" (df_conditional['Sales'] >= 1200) & (df_conditional['Quantity'] >= 5),\n",
" (df_conditional['Sales'] >= 1000) & (df_conditional['Customer_Type'] == 'VIP'),\n",
" (df_conditional['Sales'] >= 800) & (df_conditional['Region'] == 'North'),\n",
" df_conditional['Customer_Type'] == 'New'\n",
"]\n",
"\n",
"choices = ['Premium Deal', 'VIP Sale', 'North Preferred', 'New Customer']\n",
"default = 'Standard'\n",
"\n",
"df_conditional['Deal_Type'] = np.select(conditions, choices, default=default)\n",
"\n",
"print(\"Using np.select() for complex conditions:\")\n",
"print(df_conditional[['Sales', 'Quantity', 'Customer_Type', 'Region', 'Deal_Type']].head(10))\n",
"print(\"\\nDeal type distribution:\")\n",
"print(df_conditional['Deal_Type'].value_counts())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Using Apply and Lambda Functions\n",
"\n",
"Create complex calculated columns using custom functions."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Simple lambda functions\n",
"df_apply = df_sales.copy()\n",
"\n",
"# Single column transformations\n",
"df_apply['Sales_Log'] = df_apply['Sales'].apply(lambda x: np.log(x))\n",
"df_apply['Product_Length'] = df_apply['Product'].apply(lambda x: len(x))\n",
"df_apply['Days_Since_Start'] = df_apply['Date'].apply(lambda x: (x - df_apply['Date'].min()).days)\n",
"\n",
"print(\"Simple lambda transformations:\")\n",
"print(df_apply[['Sales', 'Sales_Log', 'Product', 'Product_Length', 'Days_Since_Start']].head())\n",
"\n",
"# Multiple column operations using lambda\n",
"df_apply['Efficiency_Score'] = df_apply.apply(\n",
" lambda row: (row['Sales'] * row['Quantity']) / (row['Days_Since_Start'] + 1), \n",
" axis=1\n",
")\n",
"\n",
"print(\"\\nMultiple column lambda:\")\n",
"print(df_apply[['Sales', 'Quantity', 'Days_Since_Start', 'Efficiency_Score']].head())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Custom functions for complex business logic\n",
"def calculate_commission(row):\n",
" \"\"\"Calculate commission based on complex business rules\"\"\"\n",
" base_rate = 0.05\n",
" \n",
" # VIP customers get higher commission\n",
" if row['Customer_Type'] == 'VIP':\n",
" base_rate += 0.02\n",
" \n",
" # High quantity orders get bonus\n",
" if row['Quantity'] >= 5:\n",
" base_rate += 0.01\n",
" \n",
" # Regional multipliers\n",
" region_multipliers = {'North': 1.2, 'South': 1.0, 'East': 1.1, 'West': 0.9}\n",
" multiplier = region_multipliers.get(row['Region'], 1.0)\n",
" \n",
" return row['Sales'] * base_rate * multiplier\n",
"\n",
"def performance_rating(row):\n",
" \"\"\"Calculate performance rating based on multiple factors\"\"\"\n",
" score = 0\n",
" \n",
" # Sales performance (40% weight)\n",
" if row['Sales'] > 1200:\n",
" score += 40\n",
" elif row['Sales'] > 800:\n",
" score += 30\n",
" else:\n",
" score += 20\n",
" \n",
" # Quantity performance (30% weight)\n",
" if row['Quantity'] >= 6:\n",
" score += 30\n",
" elif row['Quantity'] >= 4:\n",
" score += 20\n",
" else:\n",
" score += 10\n",
" \n",
" # Customer type bonus (30% weight)\n",
" customer_bonus = {'VIP': 30, 'Returning': 20, 'New': 15}\n",
" score += customer_bonus.get(row['Customer_Type'], 0)\n",
" \n",
" # Convert to letter grade\n",
" if score >= 85:\n",
" return 'A'\n",
" elif score >= 70:\n",
" return 'B'\n",
" elif score >= 55:\n",
" return 'C'\n",
" else:\n",
" return 'D'\n",
"\n",
"# Apply custom functions\n",
"df_apply['Commission'] = df_apply.apply(calculate_commission, axis=1)\n",
"df_apply['Performance_Rating'] = df_apply.apply(performance_rating, axis=1)\n",
"\n",
"print(\"Custom function results:\")\n",
"print(df_apply[['Sales', 'Quantity', 'Customer_Type', 'Region', 'Commission', 'Performance_Rating']].head())\n",
"\n",
"print(\"\\nPerformance rating distribution:\")\n",
"print(df_apply['Performance_Rating'].value_counts().sort_index())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Date and Time Derived Columns\n",
"\n",
"Extract useful information from datetime columns."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Extract date components\n",
"df_dates = df_sales.copy()\n",
"\n",
"# Basic date components\n",
"df_dates['Year'] = df_dates['Date'].dt.year\n",
"df_dates['Month'] = df_dates['Date'].dt.month\n",
"df_dates['Day'] = df_dates['Date'].dt.day\n",
"df_dates['DayOfWeek'] = df_dates['Date'].dt.dayofweek # 0=Monday, 6=Sunday\n",
"df_dates['DayName'] = df_dates['Date'].dt.day_name()\n",
"df_dates['MonthName'] = df_dates['Date'].dt.month_name()\n",
"\n",
"print(\"Basic date components:\")\n",
"print(df_dates[['Date', 'Year', 'Month', 'Day', 'DayOfWeek', 'DayName', 'MonthName']].head())\n",
"\n",
"# Business-relevant date features\n",
"df_dates['Quarter'] = df_dates['Date'].dt.quarter\n",
"df_dates['Week'] = df_dates['Date'].dt.isocalendar().week\n",
"df_dates['DayOfYear'] = df_dates['Date'].dt.dayofyear\n",
"df_dates['IsWeekend'] = df_dates['Date'].dt.dayofweek >= 5\n",
"df_dates['IsMonthStart'] = df_dates['Date'].dt.is_month_start\n",
"df_dates['IsMonthEnd'] = df_dates['Date'].dt.is_month_end\n",
"\n",
"print(\"\\nBusiness date features:\")\n",
"print(df_dates[['Date', 'Quarter', 'Week', 'IsWeekend', 'IsMonthStart', 'IsMonthEnd']].head(10))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Time-based calculations\n",
"start_date = df_dates['Date'].min()\n",
"df_dates['Days_Since_Start'] = (df_dates['Date'] - start_date).dt.days\n",
"df_dates['Weeks_Since_Start'] = df_dates['Days_Since_Start'] // 7\n",
"\n",
"# Create season column\n",
"def get_season(month):\n",
" if month in [12, 1, 2]:\n",
" return 'Winter'\n",
" elif month in [3, 4, 5]:\n",
" return 'Spring'\n",
" elif month in [6, 7, 8]:\n",
" return 'Summer'\n",
" else:\n",
" return 'Fall'\n",
"\n",
"df_dates['Season'] = df_dates['Month'].apply(get_season)\n",
"\n",
"# Business day calculations\n",
"df_dates['IsBusinessDay'] = df_dates['Date'].dt.dayofweek < 5\n",
"df_dates['BusinessDaysSinceStart'] = df_dates.apply(\n",
" lambda row: np.busday_count(start_date.date(), row['Date'].date()), axis=1\n",
")\n",
"\n",
"print(\"Time-based calculations:\")\n",
"print(df_dates[['Date', 'Days_Since_Start', 'Weeks_Since_Start', 'Season', \n",
" 'IsBusinessDay', 'BusinessDaysSinceStart']].head())\n",
"\n",
"print(\"\\nSeason distribution:\")\n",
"print(df_dates['Season'].value_counts())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. Text and String Manipulations\n",
"\n",
"Create columns based on string operations and text processing."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# String manipulations\n",
"df_text = df_sales.copy()\n",
"\n",
"# Basic string operations\n",
"df_text['Product_Upper'] = df_text['Product'].str.upper()\n",
"df_text['Product_Lower'] = df_text['Product'].str.lower()\n",
"df_text['Product_Length'] = df_text['Product'].str.len()\n",
"df_text['Product_First_Char'] = df_text['Product'].str[0]\n",
"df_text['Product_Last_Three'] = df_text['Product'].str[-3:]\n",
"\n",
"print(\"Basic string operations:\")\n",
"print(df_text[['Product', 'Product_Upper', 'Product_Lower', 'Product_Length', \n",
" 'Product_First_Char', 'Product_Last_Three']].head())\n",
"\n",
"# Text categorization\n",
"df_text['Product_Category'] = df_text['Product'].apply(lambda x: \n",
" 'Computer' if x in ['Laptop', 'Monitor'] else\n",
" 'Mobile' if x in ['Phone', 'Tablet'] else\n",
" 'Other'\n",
")\n",
"\n",
"# Check for patterns\n",
"df_text['Has_Letter_A'] = df_text['Product'].str.contains('a', case=False)\n",
"df_text['Starts_With_L'] = df_text['Product'].str.startswith('L')\n",
"df_text['Ends_With_E'] = df_text['Product'].str.endswith('e')\n",
"\n",
"print(\"\\nText patterns and categorization:\")\n",
"print(df_text[['Product', 'Product_Category', 'Has_Letter_A', 'Starts_With_L', 'Ends_With_E']].head(10))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Create formatted text columns\n",
"df_text['Sales_Formatted'] = df_text['Sales'].apply(lambda x: f\"${x:,.2f}\")\n",
"df_text['Transaction_ID'] = df_text.apply(\n",
" lambda row: f\"{row['Region'][:1]}{row['Product'][:3].upper()}{row.name:04d}\", axis=1\n",
")\n",
"\n",
"# Create summary descriptions\n",
"df_text['Transaction_Summary'] = df_text.apply(\n",
" lambda row: f\"{row['Salesperson']} sold {row['Quantity']} {row['Product']}(s) \"\n",
" f\"for {row['Sales_Formatted']} in {row['Region']} region\", \n",
" axis=1\n",
")\n",
"\n",
"print(\"Formatted text columns:\")\n",
"print(df_text[['Sales_Formatted', 'Transaction_ID']].head())\n",
"print(\"\\nTransaction summaries:\")\n",
"for i, summary in enumerate(df_text['Transaction_Summary'].head(3)):\n",
" print(f\"{i+1}. {summary}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6. Working with Categorical Data\n",
"\n",
"Optimize memory usage and enable category-specific operations."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Convert to categorical data types\n",
"df_categorical = df_sales.copy()\n",
"\n",
"# Check memory usage before\n",
"print(\"Memory usage before categorical conversion:\")\n",
"print(df_categorical.memory_usage(deep=True))\n",
"\n",
"# Convert string columns to categorical\n",
"categorical_columns = ['Product', 'Region', 'Salesperson', 'Customer_Type']\n",
"for col in categorical_columns:\n",
" df_categorical[col] = df_categorical[col].astype('category')\n",
"\n",
"print(\"\\nMemory usage after categorical conversion:\")\n",
"print(df_categorical.memory_usage(deep=True))\n",
"\n",
"print(\"\\nData types after conversion:\")\n",
"print(df_categorical.dtypes)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Working with ordered categories\n",
"# Create ordered categorical for sales performance\n",
"performance_categories = ['Poor', 'Fair', 'Good', 'Excellent']\n",
"df_categorical['Performance_Level'] = pd.cut(\n",
" df_categorical['Sales'],\n",
" bins=[0, 700, 900, 1200, float('inf')],\n",
" labels=performance_categories,\n",
" ordered=True\n",
")\n",
"\n",
"print(\"Ordered categorical data:\")\n",
"print(df_categorical['Performance_Level'].head(10))\n",
"print(\"\\nCategory info:\")\n",
"print(df_categorical['Performance_Level'].cat.categories)\n",
"print(f\"Is ordered: {df_categorical['Performance_Level'].cat.ordered}\")\n",
"\n",
"# Categorical operations\n",
"print(\"\\nPerformance level distribution:\")\n",
"print(df_categorical['Performance_Level'].value_counts().sort_index())\n",
"\n",
"# Add new category\n",
"df_categorical['Performance_Level'] = df_categorical['Performance_Level'].cat.add_categories(['Outstanding'])\n",
"print(f\"\\nCategories after adding 'Outstanding': {df_categorical['Performance_Level'].cat.categories}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 7. Mathematical and Statistical Transformations\n",
"\n",
"Create columns using mathematical functions and statistical transformations."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Mathematical transformations\n",
"df_math = df_sales.copy()\n",
"\n",
"# Common mathematical transformations\n",
"df_math['Sales_Log'] = np.log(df_math['Sales'])\n",
"df_math['Sales_Sqrt'] = np.sqrt(df_math['Sales'])\n",
"df_math['Sales_Squared'] = df_math['Sales'] ** 2\n",
"df_math['Sales_Reciprocal'] = 1 / df_math['Sales']\n",
"\n",
"print(\"Mathematical transformations:\")\n",
"print(df_math[['Sales', 'Sales_Log', 'Sales_Sqrt', 'Sales_Squared', 'Sales_Reciprocal']].head())\n",
"\n",
"# Statistical standardization\n",
"df_math['Sales_Z_Score'] = (df_math['Sales'] - df_math['Sales'].mean()) / df_math['Sales'].std()\n",
"df_math['Sales_Min_Max_Scaled'] = (df_math['Sales'] - df_math['Sales'].min()) / (df_math['Sales'].max() - df_math['Sales'].min())\n",
"\n",
"# Rolling statistics\n",
"df_math = df_math.sort_values('Date')\n",
"df_math['Sales_Rolling_7_Mean'] = df_math['Sales'].rolling(window=7, min_periods=1).mean()\n",
"df_math['Sales_Rolling_7_Std'] = df_math['Sales'].rolling(window=7, min_periods=1).std()\n",
"df_math['Sales_Cumulative_Sum'] = df_math['Sales'].cumsum()\n",
"\n",
"print(\"\\nStatistical transformations:\")\n",
"print(df_math[['Sales', 'Sales_Z_Score', 'Sales_Min_Max_Scaled', \n",
" 'Sales_Rolling_7_Mean', 'Sales_Cumulative_Sum']].head(10))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Rank and percentile columns\n",
"df_math['Sales_Rank'] = df_math['Sales'].rank(ascending=False)\n",
"df_math['Sales_Percentile'] = df_math['Sales'].rank(pct=True) * 100\n",
"df_math['Sales_Rank_by_Region'] = df_math.groupby('Region')['Sales'].rank(ascending=False)\n",
"\n",
"# Binning and discretization\n",
"df_math['Sales_Decile'] = pd.qcut(df_math['Sales'], q=10, labels=range(1, 11))\n",
"df_math['Sales_Tertile'] = pd.qcut(df_math['Sales'], q=3, labels=['Low', 'Medium', 'High'])\n",
"\n",
"print(\"Ranking and binning:\")\n",
"print(df_math[['Sales', 'Sales_Rank', 'Sales_Percentile', 'Sales_Rank_by_Region', \n",
" 'Sales_Decile', 'Sales_Tertile']].head(10))\n",
"\n",
"print(\"\\nDecile distribution:\")\n",
"print(df_math['Sales_Decile'].value_counts().sort_index())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Practice Exercises\n",
"\n",
"Apply your column creation and modification skills:"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"# Exercise 1: Customer Segmentation\n",
"# Create a comprehensive customer segmentation system:\n",
"# - Combine purchase behavior, frequency, and value\n",
"# - Create RFM-like scores (Recency, Frequency, Monetary)\n",
"# - Assign customer segments (e.g., Champion, Loyal, At Risk, etc.)\n",
"\n",
"def create_customer_segmentation(df):\n",
" \"\"\"Create customer segmentation based on purchase patterns\"\"\"\n",
" # Your implementation here\n",
" pass\n",
"\n",
"# segmented_df = create_customer_segmentation(df_sales)\n",
"# print(segmented_df[['Customer_Type', 'Sales', 'Frequency_Score', 'Monetary_Score', 'Segment']].head())"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
"# Exercise 2: Performance Metrics Dashboard\n",
"# Create a comprehensive set of KPI columns:\n",
"# - Sales efficiency metrics\n",
"# - Trend indicators (growth rates, momentum)\n",
"# - Comparative metrics (vs. average, vs. target)\n",
"# - Alert flags for unusual patterns\n",
"\n",
"# Your code here:\n"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
"# Exercise 3: Feature Engineering for ML\n",
"# Create features that could be useful for machine learning:\n",
"# - Interaction features (product of two variables)\n",
"# - Polynomial features\n",
"# - Time-based features (seasonality, trends)\n",
"# - Lag features (previous period values)\n",
"\n",
"# Your code here:\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Key Takeaways\n",
"\n",
"1. **Column Assignment**: Use direct assignment (`df['col'] = value`) for simple cases\n",
"2. **Assign Method**: Use `.assign()` for functional programming style and method chaining\n",
"3. **Conditional Logic**: Combine `np.where()`, `pd.cut()`, `pd.qcut()`, and `np.select()` for complex conditions\n",
"4. **Apply Functions**: Use `.apply()` with lambda or custom functions for complex transformations\n",
"5. **Date Features**: Extract meaningful components from datetime columns\n",
"6. **String Operations**: Leverage `.str` accessor for text manipulations\n",
"7. **Categorical Data**: Convert to categories for memory efficiency and special operations\n",
"8. **Mathematical Transformations**: Apply statistical and mathematical functions for data preprocessing\n",
"\n",
"## Performance Tips\n",
"\n",
"1. **Vectorized Operations**: Prefer pandas/numpy operations over loops\n",
"2. **Categorical Types**: Use categorical data for repeated string values\n",
"3. **Memory Management**: Monitor memory usage when creating many new columns\n",
"4. **Method Chaining**: Use `.assign()` for readable method chains\n",
"5. **Avoid apply() When Possible**: Use vectorized operations instead of `.apply()` for better performance\n",
"\n",
"## Common Patterns\n",
"\n",
"```python\n",
"# Simple calculation\n",
"df['new_col'] = df['col1'] * df['col2']\n",
"\n",
"# Conditional column\n",
"df['category'] = np.where(df['value'] > threshold, 'High', 'Low')\n",
"\n",
"# Apply custom function\n",
"df['result'] = df.apply(custom_function, axis=1)\n",
"\n",
"# Date features\n",
"df['month'] = df['date'].dt.month\n",
"\n",
"# String operations\n",
"df['upper'] = df['text'].str.upper()\n",
"```"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.13.3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}


@ -0,0 +1,916 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Session 1 - DataFrames - Lesson 6: Handling Missing Data\n",
"\n",
"## Learning Objectives\n",
"- Understand different types of missing data and their implications\n",
"- Master techniques for detecting and analyzing missing values\n",
"- Learn various strategies for handling missing data\n",
"- Practice imputation methods and their trade-offs\n",
"- Develop best practices for missing data management\n",
"\n",
"## Prerequisites\n",
"- Completed Lessons 1-5\n",
"- Understanding of basic statistical concepts\n",
"- Familiarity with data quality principles"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Import required libraries\n",
"import pandas as pd\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"from datetime import datetime, timedelta\n",
"import warnings\n",
"warnings.filterwarnings('ignore')\n",
"\n",
"# Set display options\n",
"pd.set_option('display.max_columns', None)\n",
"plt.style.use('seaborn-v0_8')\n",
"\n",
"print(\"Libraries loaded successfully!\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Creating Dataset with Missing Values\n",
"\n",
"Let's create a realistic dataset with different patterns of missing data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Create comprehensive dataset with various missing data patterns\n",
"np.random.seed(42)\n",
"n_records = 500\n",
"\n",
"# Base data\n",
"data = {\n",
" 'customer_id': range(1, n_records + 1),\n",
" 'age': np.random.normal(35, 12, n_records).astype(int),\n",
" 'income': np.random.normal(50000, 15000, n_records),\n",
" 'education_years': np.random.normal(14, 3, n_records),\n",
" 'purchase_amount': np.random.normal(200, 50, n_records),\n",
" 'satisfaction_score': np.random.randint(1, 6, n_records),\n",
" 'region': np.random.choice(['North', 'South', 'East', 'West'], n_records),\n",
" 'product_category': np.random.choice(['Electronics', 'Clothing', 'Books', 'Home'], n_records),\n",
" 'signup_date': pd.date_range('2023-01-01', periods=n_records, freq='D'),\n",
" 'last_purchase_date': pd.date_range('2023-01-01', periods=n_records, freq='D') + pd.Timedelta(days=30)\n",
"}\n",
"\n",
"df_complete = pd.DataFrame(data)\n",
"\n",
"# Ensure positive values where appropriate\n",
"df_complete['age'] = np.abs(df_complete['age'])\n",
"df_complete['income'] = np.abs(df_complete['income'])\n",
"df_complete['education_years'] = np.clip(df_complete['education_years'], 6, 20)\n",
"df_complete['purchase_amount'] = np.abs(df_complete['purchase_amount'])\n",
"\n",
"print(\"Complete dataset created:\")\n",
"print(f\"Shape: {df_complete.shape}\")\n",
"print(\"\\nFirst few rows:\")\n",
"print(df_complete.head())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Introduce different patterns of missing data\n",
"df_missing = df_complete.copy()\n",
"\n",
"# 1. Missing Completely at Random (MCAR) - income data\n",
"# Randomly missing 15% of income values\n",
"mcar_indices = np.random.choice(df_missing.index, size=int(0.15 * len(df_missing)), replace=False)\n",
"df_missing.loc[mcar_indices, 'income'] = np.nan\n",
"\n",
"# 2. Missing at Random (MAR) - education years missing based on age\n",
"# Older people less likely to report education\n",
"older_customers = df_missing['age'] > 60\n",
"older_indices = df_missing[older_customers].index\n",
"education_missing = np.random.choice(older_indices, size=int(0.4 * len(older_indices)), replace=False)\n",
"df_missing.loc[education_missing, 'education_years'] = np.nan\n",
"\n",
"# 3. Missing Not at Random (MNAR) - satisfaction scores\n",
"# Unsatisfied customers less likely to provide ratings\n",
"low_satisfaction = df_missing['satisfaction_score'] <= 2\n",
"low_sat_indices = df_missing[low_satisfaction].index\n",
"satisfaction_missing = np.random.choice(low_sat_indices, size=int(0.6 * len(low_sat_indices)), replace=False)\n",
"df_missing.loc[satisfaction_missing, 'satisfaction_score'] = np.nan\n",
"\n",
"# 4. Systematic missing - last purchase date for new customers\n",
"# New customers (signed up recently) haven't made purchases yet\n",
"recent_signups = df_missing['signup_date'] > '2023-11-01'\n",
"df_missing.loc[recent_signups, 'last_purchase_date'] = pd.NaT\n",
"\n",
"# 5. Random missing in other columns\n",
"# Purchase amount - 10% missing\n",
"purchase_missing = np.random.choice(df_missing.index, size=int(0.10 * len(df_missing)), replace=False)\n",
"df_missing.loc[purchase_missing, 'purchase_amount'] = np.nan\n",
"\n",
"print(\"Missing data patterns introduced:\")\n",
"print(f\"Dataset shape: {df_missing.shape}\")\n",
"print(\"\\nMissing value counts:\")\n",
"missing_summary = df_missing.isnull().sum()\n",
"missing_summary = missing_summary[missing_summary > 0]\n",
"print(missing_summary)\n",
"\n",
"print(\"\\nMissing value percentages:\")\n",
"missing_pct = (df_missing.isnull().sum() / len(df_missing) * 100).round(2)\n",
"missing_pct = missing_pct[missing_pct > 0]\n",
"print(missing_pct)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Detecting and Analyzing Missing Data\n",
"\n",
"Comprehensive techniques for understanding missing data patterns."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def analyze_missing_data(df):\n",
" \"\"\"Comprehensive missing data analysis\"\"\"\n",
" print(\"=== MISSING DATA ANALYSIS ===\")\n",
" \n",
" # Basic missing data statistics\n",
" total_cells = df.size\n",
" total_missing = df.isnull().sum().sum()\n",
" print(f\"Total cells: {total_cells:,}\")\n",
" print(f\"Missing cells: {total_missing:,} ({total_missing/total_cells*100:.2f}%)\")\n",
" \n",
" # Missing data by column\n",
" missing_by_column = pd.DataFrame({\n",
" 'Missing_Count': df.isnull().sum(),\n",
" 'Missing_Percentage': (df.isnull().sum() / len(df)) * 100,\n",
" 'Data_Type': df.dtypes\n",
" })\n",
" missing_by_column = missing_by_column[missing_by_column['Missing_Count'] > 0]\n",
" missing_by_column = missing_by_column.sort_values('Missing_Percentage', ascending=False)\n",
" \n",
" print(\"\\n--- Missing Data by Column ---\")\n",
" print(missing_by_column.round(2))\n",
" \n",
" # Missing data patterns\n",
" print(\"\\n--- Missing Data Patterns ---\")\n",
" missing_patterns = df.isnull().value_counts().head(10)\n",
" print(\"Top 10 missing patterns (True = Missing):\")\n",
" for pattern, count in missing_patterns.items():\n",
" percentage = (count / len(df)) * 100\n",
" print(f\"{count:4d} rows ({percentage:5.1f}%): {dict(zip(df.columns, pattern))}\")\n",
" \n",
" return missing_by_column\n",
"\n",
"# Analyze missing data\n",
"missing_analysis = analyze_missing_data(df_missing)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Visualize missing data patterns\n",
"def visualize_missing_data(df):\n",
" \"\"\"Create visualizations for missing data patterns\"\"\"\n",
" fig, axes = plt.subplots(2, 2, figsize=(15, 10))\n",
" \n",
" # 1. Missing data heatmap\n",
" missing_mask = df.isnull()\n",
" sns.heatmap(missing_mask.iloc[:100], \n",
" yticklabels=False, \n",
" cbar=True, \n",
" cmap='viridis',\n",
" ax=axes[0, 0])\n",
" axes[0, 0].set_title('Missing Data Heatmap (First 100 rows)')\n",
" \n",
" # 2. Missing data by column\n",
" missing_counts = df.isnull().sum()\n",
" missing_counts = missing_counts[missing_counts > 0]\n",
" missing_counts.plot(kind='bar', ax=axes[0, 1], color='skyblue')\n",
" axes[0, 1].set_title('Missing Values by Column')\n",
" axes[0, 1].set_ylabel('Count')\n",
" axes[0, 1].tick_params(axis='x', rotation=45)\n",
" \n",
" # 3. Missing data correlation\n",
" missing_corr = df.isnull().corr()\n",
" sns.heatmap(missing_corr, annot=True, cmap='coolwarm', center=0, ax=axes[1, 0])\n",
" axes[1, 0].set_title('Missing Data Correlation')\n",
" \n",
" # 4. Missing data by row\n",
" missing_per_row = df.isnull().sum(axis=1)\n",
" missing_per_row.hist(bins=range(len(df.columns) + 2), ax=axes[1, 1], alpha=0.7, color='orange')\n",
" axes[1, 1].set_title('Distribution of Missing Values per Row')\n",
" axes[1, 1].set_xlabel('Number of Missing Values')\n",
" axes[1, 1].set_ylabel('Number of Rows')\n",
" \n",
" plt.tight_layout()\n",
" plt.show()\n",
"\n",
"# Visualize missing patterns\n",
"visualize_missing_data(df_missing)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Analyze missing data relationships\n",
"def analyze_missing_relationships(df):\n",
" \"\"\"Analyze relationships between missing data and other variables\"\"\"\n",
" print(\"=== MISSING DATA RELATIONSHIPS ===\")\n",
" \n",
" # Example: Relationship between age and missing education\n",
" if 'age' in df.columns and 'education_years' in df.columns:\n",
" print(\"\\n--- Age vs Missing Education ---\")\n",
" education_missing = df['education_years'].isnull()\n",
" age_stats = df.groupby(education_missing)['age'].agg(['mean', 'median', 'std']).round(2)\n",
" age_stats.index = ['Education Present', 'Education Missing']\n",
" print(age_stats)\n",
" \n",
" # Example: Missing satisfaction by purchase amount\n",
" if 'satisfaction_score' in df.columns and 'purchase_amount' in df.columns:\n",
" print(\"\\n--- Purchase Amount vs Missing Satisfaction ---\")\n",
" satisfaction_missing = df['satisfaction_score'].isnull()\n",
" purchase_stats = df.groupby(satisfaction_missing)['purchase_amount'].agg(['mean', 'median', 'count']).round(2)\n",
" purchase_stats.index = ['Satisfaction Present', 'Satisfaction Missing']\n",
" print(purchase_stats)\n",
" \n",
" # Missing data by categorical variables\n",
" if 'region' in df.columns:\n",
" print(\"\\n--- Missing Data by Region ---\")\n",
" region_missing = df.groupby('region').apply(lambda x: x.isnull().sum())\n",
" print(region_missing[region_missing.sum(axis=1) > 0])\n",
"\n",
"# Analyze relationships\n",
"analyze_missing_relationships(df_missing)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Basic Missing Data Handling\n",
"\n",
"Fundamental techniques for dealing with missing values."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Method 1: Dropping missing values\n",
"print(\"=== DROPPING MISSING VALUES ===\")\n",
"\n",
"# Drop rows with any missing values\n",
"df_drop_any = df_missing.dropna()\n",
"print(f\"Original shape: {df_missing.shape}\")\n",
"print(f\"After dropping any missing: {df_drop_any.shape}\")\n",
"print(f\"Rows removed: {len(df_missing) - len(df_drop_any)} ({(len(df_missing) - len(df_drop_any))/len(df_missing)*100:.1f}%)\")\n",
"\n",
"# Drop rows with missing values in specific columns\n",
"critical_columns = ['customer_id', 'age', 'region']\n",
"df_drop_critical = df_missing.dropna(subset=critical_columns)\n",
"print(f\"\\nAfter dropping rows missing critical columns: {df_drop_critical.shape}\")\n",
"\n",
"# Drop rows with more than X missing values\n",
"df_drop_thresh = df_missing.dropna(thresh=len(df_missing.columns) - 2) # Allow max 2 missing\n",
"print(f\"After dropping rows with >2 missing values: {df_drop_thresh.shape}\")\n",
"\n",
"# Drop columns with too many missing values\n",
"missing_threshold = 0.5 # 50%\n",
"cols_to_keep = df_missing.columns[df_missing.isnull().mean() < missing_threshold]\n",
"df_drop_cols = df_missing[cols_to_keep]\n",
"print(f\"\\nAfter dropping columns with >{missing_threshold*100}% missing: {df_drop_cols.shape}\")\n",
"print(f\"Columns dropped: {set(df_missing.columns) - set(cols_to_keep)}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Method 2: Basic imputation with fillna()\n",
"print(\"=== BASIC IMPUTATION ===\")\n",
"\n",
"df_basic_impute = df_missing.copy()\n",
"\n",
"# Fill with specific values\n",
"df_basic_impute['satisfaction_score'] = df_basic_impute['satisfaction_score'].fillna(3) # Neutral score\n",
"print(\"Filled satisfaction_score with 3 (neutral)\")\n",
"\n",
"# Fill with statistical measures\n",
"df_basic_impute['income'] = df_basic_impute['income'].fillna(df_basic_impute['income'].median())\n",
"df_basic_impute['education_years'] = df_basic_impute['education_years'].fillna(df_basic_impute['education_years'].mean())\n",
"df_basic_impute['purchase_amount'] = df_basic_impute['purchase_amount'].fillna(df_basic_impute['purchase_amount'].mean())\n",
"print(\"Filled numerical columns with mean/median\")\n",
"\n",
"# Forward fill and backward fill for dates\n",
"df_basic_impute['last_purchase_date'] = df_basic_impute['last_purchase_date'].fillna(method='bfill')\n",
"print(\"Filled dates with backward fill\")\n",
"\n",
"print(f\"\\nMissing values after basic imputation:\")\n",
"print(df_basic_impute.isnull().sum().sum())\n",
"\n",
"# Show before/after comparison\n",
"print(\"\\nComparison (first 10 rows):\")\n",
"comparison_cols = ['income', 'education_years', 'purchase_amount', 'satisfaction_score']\n",
"for col in comparison_cols:\n",
" before_missing = df_missing[col].isnull().sum()\n",
" after_missing = df_basic_impute[col].isnull().sum()\n",
" print(f\"{col}: {before_missing} → {after_missing} missing values\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Advanced Imputation Techniques\n",
"\n",
"Sophisticated methods for handling missing data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Group-based imputation\n",
"def group_based_imputation(df):\n",
" \"\"\"Impute missing values based on group statistics\"\"\"\n",
" df_group_impute = df.copy()\n",
" \n",
" print(\"=== GROUP-BASED IMPUTATION ===\")\n",
" \n",
" # Impute income based on region and education level\n",
" # First, create education level categories\n",
" df_group_impute['education_level'] = pd.cut(\n",
" df_group_impute['education_years'].fillna(df_group_impute['education_years'].median()),\n",
" bins=[0, 12, 16, 20],\n",
" labels=['High School', 'Bachelor', 'Advanced']\n",
" )\n",
" \n",
" # Calculate group-based statistics\n",
" income_by_group = df_group_impute.groupby(['region', 'education_level'])['income'].median()\n",
" \n",
" # Fill missing income values\n",
" def fill_income(row):\n",
" if pd.isna(row['income']):\n",
" try:\n",
" return income_by_group.loc[(row['region'], row['education_level'])]\n",
" except KeyError:\n",
" return df_group_impute['income'].median()\n",
" return row['income']\n",
" \n",
" df_group_impute['income'] = df_group_impute.apply(fill_income, axis=1)\n",
" \n",
" print(\"Income imputed based on region and education level\")\n",
" print(\"Group-based median income:\")\n",
" print(income_by_group.round(0))\n",
" \n",
" return df_group_impute\n",
"\n",
"# Apply group-based imputation\n",
"df_group_imputed = group_based_imputation(df_missing)"
]
},
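{
"cell_type": "markdown",
"metadata": {},
"source": [
"Beyond group statistics, two widely used model-based approaches are KNN imputation and MICE-style iterative imputation. The cell below is a minimal sketch using scikit-learn's `KNNImputer` and `IterativeImputer` (scikit-learn is assumed to be available; it is not imported elsewhere in this lesson, and the frame names `df_knn_imputed` / `df_iterative_imputed` are illustrative). Only the numeric columns are imputed, since both estimators expect numeric input; the resulting frames can also be passed to `compare_imputation_methods` below."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: model-based imputation with scikit-learn (assumed installed)\n",
"from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (needed to expose IterativeImputer)\n",
"from sklearn.impute import KNNImputer, IterativeImputer\n",
"\n",
"# Both imputers expect purely numeric input, so restrict to the numeric columns\n",
"numeric_cols = ['age', 'income', 'education_years', 'purchase_amount', 'satisfaction_score']\n",
"\n",
"# KNN imputation: fill each missing value from the 5 most similar rows\n",
"df_knn_imputed = df_missing.copy()\n",
"df_knn_imputed[numeric_cols] = KNNImputer(n_neighbors=5).fit_transform(df_missing[numeric_cols])\n",
"\n",
"# Iterative (MICE-style) imputation: model each column from the others, repeating up to max_iter rounds\n",
"df_iterative_imputed = df_missing.copy()\n",
"df_iterative_imputed[numeric_cols] = IterativeImputer(random_state=42, max_iter=10).fit_transform(df_missing[numeric_cols])\n",
"\n",
"print(\"Missing numeric values after KNN imputation:\", df_knn_imputed[numeric_cols].isnull().sum().sum())\n",
"print(\"Missing numeric values after iterative imputation:\", df_iterative_imputed[numeric_cols].isnull().sum().sum())"
]
},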
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Comparison of Imputation Methods\n",
"\n",
"Compare different imputation approaches and their impact."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def compare_imputation_methods(original_complete, original_missing, *imputed_dfs, methods_names):\n",
" \"\"\"Compare different imputation methods\"\"\"\n",
" print(\"=== IMPUTATION METHODS COMPARISON ===\")\n",
" \n",
" # Focus on a specific column for comparison\n",
" column = 'income'\n",
" \n",
" if column not in original_complete.columns:\n",
" print(f\"Column {column} not found\")\n",
" return\n",
" \n",
" # Get original values that were made missing\n",
" missing_mask = original_missing[column].isnull()\n",
" true_values = original_complete.loc[missing_mask, column]\n",
" \n",
" print(f\"Comparing imputation for '{column}' column\")\n",
" print(f\"Number of missing values: {len(true_values)}\")\n",
" \n",
" # Calculate errors for each method\n",
" results = {}\n",
" \n",
" for df_imputed, method_name in zip(imputed_dfs, methods_names):\n",
" if column in df_imputed.columns:\n",
" imputed_values = df_imputed.loc[missing_mask, column]\n",
" \n",
" # Calculate metrics\n",
" mae = np.mean(np.abs(true_values - imputed_values))\n",
" rmse = np.sqrt(np.mean((true_values - imputed_values) ** 2))\n",
" bias = np.mean(imputed_values - true_values)\n",
" \n",
" results[method_name] = {\n",
" 'MAE': mae,\n",
" 'RMSE': rmse,\n",
" 'Bias': bias,\n",
" 'Mean_Imputed': np.mean(imputed_values),\n",
" 'Std_Imputed': np.std(imputed_values)\n",
" }\n",
" \n",
" # True statistics\n",
" print(f\"\\nTrue statistics for missing values:\")\n",
" print(f\"Mean: {np.mean(true_values):.2f}\")\n",
" print(f\"Std: {np.std(true_values):.2f}\")\n",
" \n",
" # Results comparison\n",
" results_df = pd.DataFrame(results).T\n",
" print(f\"\\nImputation comparison results:\")\n",
" print(results_df.round(2))\n",
" \n",
" # Visualize comparison\n",
" fig, axes = plt.subplots(2, 2, figsize=(15, 10))\n",
" \n",
" # Distribution comparison\n",
" axes[0, 0].hist(true_values, alpha=0.7, label='True Values', bins=20)\n",
" for df_imputed, method_name in zip(imputed_dfs, methods_names):\n",
" if column in df_imputed.columns:\n",
" imputed_values = df_imputed.loc[missing_mask, column]\n",
" axes[0, 0].hist(imputed_values, alpha=0.7, label=f'{method_name}', bins=20)\n",
" axes[0, 0].set_title('Distribution Comparison')\n",
" axes[0, 0].legend()\n",
" \n",
" # Error metrics\n",
" metrics = ['MAE', 'RMSE']\n",
" for i, metric in enumerate(metrics):\n",
" values = [results[method][metric] for method in results.keys()]\n",
" axes[0, 1].bar(range(len(values)), values, alpha=0.7)\n",
" axes[0, 1].set_xticks(range(len(results)))\n",
" axes[0, 1].set_xticklabels(list(results.keys()), rotation=45)\n",
" axes[0, 1].set_title(f'{metric} Comparison')\n",
" break # Show only MAE for now\n",
" \n",
" # Scatter plot: True vs Imputed\n",
" for i, (df_imputed, method_name) in enumerate(zip(imputed_dfs[:2], methods_names[:2])):\n",
" if column in df_imputed.columns:\n",
" imputed_values = df_imputed.loc[missing_mask, column]\n",
" ax = axes[1, i]\n",
" ax.scatter(true_values, imputed_values, alpha=0.6)\n",
" ax.plot([true_values.min(), true_values.max()], \n",
" [true_values.min(), true_values.max()], 'r--', label='Perfect Prediction')\n",
" ax.set_xlabel('True Values')\n",
" ax.set_ylabel('Imputed Values')\n",
" ax.set_title(f'{method_name}: True vs Imputed')\n",
" ax.legend()\n",
" \n",
" plt.tight_layout()\n",
" plt.show()\n",
" \n",
" return results_df\n",
"\n",
"# Compare methods\n",
"comparison_results = compare_imputation_methods(\n",
" df_complete, \n",
" df_missing,\n",
" df_basic_impute,\n",
" methods_names=['Basic Fill', 'KNN', 'Iterative']\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. Domain-Specific Imputation Strategies\n",
"\n",
"Business logic-driven approaches to missing data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def business_logic_imputation(df):\n",
" \"\"\"Apply business logic for missing value imputation\"\"\"\n",
" print(\"=== BUSINESS LOGIC IMPUTATION ===\")\n",
" \n",
" df_business = df.copy()\n",
" \n",
" # 1. Income imputation based on age and education\n",
" def estimate_income(row):\n",
" if pd.notna(row['income']):\n",
" return row['income']\n",
" \n",
" # Base income estimation\n",
" base_income = 30000\n",
" \n",
" # Age factor (experience premium)\n",
" if pd.notna(row['age']):\n",
" if row['age'] > 40:\n",
" base_income *= 1.5\n",
" elif row['age'] > 30:\n",
" base_income *= 1.2\n",
" \n",
" # Education factor\n",
" if pd.notna(row['education_years']):\n",
" if row['education_years'] > 16: # Graduate degree\n",
" base_income *= 1.8\n",
" elif row['education_years'] > 12: # Bachelor's\n",
" base_income *= 1.4\n",
" \n",
" # Regional adjustment\n",
" regional_multipliers = {\n",
" 'North': 1.2, # Higher cost of living\n",
" 'South': 0.9,\n",
" 'East': 1.1,\n",
" 'West': 1.0\n",
" }\n",
" base_income *= regional_multipliers.get(row['region'], 1.0)\n",
" \n",
" return base_income\n",
" \n",
" # Apply income estimation\n",
" df_business['income'] = df_business.apply(estimate_income, axis=1)\n",
" \n",
" # 2. Satisfaction score based on purchase behavior\n",
" def estimate_satisfaction(row):\n",
" if pd.notna(row['satisfaction_score']):\n",
" return row['satisfaction_score']\n",
" \n",
" # Base satisfaction\n",
" base_satisfaction = 3 # Neutral\n",
" \n",
" # Purchase amount influence\n",
" if pd.notna(row['purchase_amount']):\n",
" if row['purchase_amount'] > 250: # High value purchase\n",
" base_satisfaction = 4\n",
" elif row['purchase_amount'] < 100: # Low value might indicate dissatisfaction\n",
" base_satisfaction = 2\n",
" \n",
" return base_satisfaction\n",
" \n",
" # Apply satisfaction estimation\n",
" df_business['satisfaction_score'] = df_business.apply(estimate_satisfaction, axis=1)\n",
" \n",
" # 3. Education years based on income and age\n",
" def estimate_education(row):\n",
" if pd.notna(row['education_years']):\n",
" return row['education_years']\n",
" \n",
" # Base education\n",
" base_education = 12 # High school\n",
" \n",
" # Income-based estimation\n",
" if pd.notna(row['income']):\n",
" if row['income'] > 70000:\n",
" base_education = 18 # Graduate level\n",
" elif row['income'] > 45000:\n",
" base_education = 16 # Bachelor's\n",
" elif row['income'] > 35000:\n",
" base_education = 14 # Some college\n",
" \n",
" # Age adjustment (older people might have different education patterns)\n",
" if pd.notna(row['age']) and row['age'] > 55:\n",
" base_education = max(12, base_education - 2) # Lower average for older generation\n",
" \n",
" return base_education\n",
" \n",
" # Apply education estimation\n",
" df_business['education_years'] = df_business.apply(estimate_education, axis=1)\n",
" \n",
" print(\"Business logic imputation completed\")\n",
" print(f\"Missing values remaining: {df_business.isnull().sum().sum()}\")\n",
" \n",
" return df_business\n",
"\n",
"# Apply business logic imputation\n",
"df_business_imputed = business_logic_imputation(df_missing)\n",
"\n",
"print(\"\\nBusiness logic imputation summary:\")\n",
"for col in ['income', 'satisfaction_score', 'education_years']:\n",
" before = df_missing[col].isnull().sum()\n",
" after = df_business_imputed[col].isnull().sum()\n",
" print(f\"{col}: {before} → {after} missing values\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6. Missing Data Flags and Indicators\n",
"\n",
"Track which values were imputed for transparency and analysis."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def create_missing_indicators(df_original, df_imputed):\n",
" \"\"\"Create indicator variables for missing data\"\"\"\n",
" print(\"=== CREATING MISSING DATA INDICATORS ===\")\n",
" \n",
" df_with_indicators = df_imputed.copy()\n",
" \n",
" # Create indicator columns for each column that had missing data\n",
" columns_with_missing = df_original.columns[df_original.isnull().any()].tolist()\n",
" \n",
" for col in columns_with_missing:\n",
" indicator_col = f'{col}_was_missing'\n",
" df_with_indicators[indicator_col] = df_original[col].isnull().astype(int)\n",
" \n",
" print(f\"Created {len(columns_with_missing)} missing data indicators\")\n",
" print(f\"Indicator columns: {[f'{col}_was_missing' for col in columns_with_missing]}\")\n",
" \n",
" # Summary of missing patterns\n",
" indicator_cols = [f'{col}_was_missing' for col in columns_with_missing]\n",
" missing_patterns = df_with_indicators[indicator_cols].sum()\n",
" \n",
" print(\"\\nMissing data summary by column:\")\n",
" for col, count in missing_patterns.items():\n",
" original_col = col.replace('_was_missing', '')\n",
" percentage = (count / len(df_with_indicators)) * 100\n",
" print(f\"{original_col}: {count} values imputed ({percentage:.1f}%)\")\n",
" \n",
" # Create composite missing indicator\n",
" df_with_indicators['total_missing_count'] = df_with_indicators[indicator_cols].sum(axis=1)\n",
" df_with_indicators['has_any_missing'] = (df_with_indicators['total_missing_count'] > 0).astype(int)\n",
" \n",
" return df_with_indicators, indicator_cols\n",
"\n",
"# Create missing indicators\n",
"df_with_indicators, indicator_columns = create_missing_indicators(df_missing, df_business_imputed)\n",
"\n",
"print(\"\\nDataset with missing indicators:\")\n",
"sample_cols = ['income', 'income_was_missing', 'education_years', 'education_years_was_missing', \n",
" 'satisfaction_score', 'satisfaction_score_was_missing', 'total_missing_count']\n",
"available_cols = [col for col in sample_cols if col in df_with_indicators.columns]\n",
"print(df_with_indicators[available_cols].head(10))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 7. Validation and Quality Assessment\n",
"\n",
"Validate the quality of imputation results."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def validate_imputation_quality(df_original, df_missing, df_imputed):\n",
" \"\"\"Validate the quality of imputation\"\"\"\n",
" print(\"=== IMPUTATION QUALITY VALIDATION ===\")\n",
" \n",
" validation_results = {}\n",
" \n",
" # Check each column that had missing data\n",
" for col in df_missing.columns:\n",
" if df_missing[col].isnull().any() and col in df_imputed.columns:\n",
" print(f\"\\n--- Validating {col} ---\")\n",
" \n",
" # Get missing mask\n",
" missing_mask = df_missing[col].isnull()\n",
" \n",
" # Original statistics (complete data)\n",
" original_stats = df_original[col].describe()\n",
" \n",
" # Imputed statistics (only imputed values)\n",
" if missing_mask.any():\n",
" imputed_values = df_imputed.loc[missing_mask, col]\n",
" \n",
" if pd.api.types.is_numeric_dtype(df_original[col]):\n",
" imputed_stats = imputed_values.describe()\n",
" \n",
" # Statistical tests\n",
" mean_diff = abs(original_stats['mean'] - imputed_stats['mean'])\n",
" std_diff = abs(original_stats['std'] - imputed_stats['std'])\n",
" \n",
" validation_results[col] = {\n",
" 'original_mean': original_stats['mean'],\n",
" 'imputed_mean': imputed_stats['mean'],\n",
" 'mean_difference': mean_diff,\n",
" 'original_std': original_stats['std'],\n",
" 'imputed_std': imputed_stats['std'],\n",
" 'std_difference': std_diff,\n",
" 'values_imputed': len(imputed_values)\n",
" }\n",
" \n",
" print(f\"Original mean: {original_stats['mean']:.2f}, Imputed mean: {imputed_stats['mean']:.2f}\")\n",
" print(f\"Mean difference: {mean_diff:.2f} ({mean_diff/original_stats['mean']*100:.1f}%)\")\n",
" print(f\"Original std: {original_stats['std']:.2f}, Imputed std: {imputed_stats['std']:.2f}\")\n",
" \n",
" else:\n",
" # Categorical data\n",
" original_dist = df_original[col].value_counts(normalize=True)\n",
" imputed_dist = imputed_values.value_counts(normalize=True)\n",
" print(f\"Original distribution: {original_dist.to_dict()}\")\n",
" print(f\"Imputed distribution: {imputed_dist.to_dict()}\")\n",
" \n",
" # Overall validation summary\n",
" if validation_results:\n",
" validation_df = pd.DataFrame(validation_results).T\n",
" print(\"\\n=== VALIDATION SUMMARY ===\")\n",
" print(validation_df.round(3))\n",
" \n",
" # Flag potential issues\n",
" print(\"\\n--- Potential Issues ---\")\n",
" for col, stats in validation_results.items():\n",
" mean_change = abs(stats['mean_difference'] / stats['original_mean']) * 100\n",
" if mean_change > 10: # More than 10% change in mean\n",
" print(f\"⚠️ {col}: Large mean change ({mean_change:.1f}%)\")\n",
" \n",
" std_change = abs(stats['std_difference'] / stats['original_std']) * 100\n",
" if std_change > 20: # More than 20% change in std\n",
" print(f\"⚠️ {col}: Large variance change ({std_change:.1f}%)\")\n",
" \n",
" return validation_results\n",
"\n",
"# Validate imputation quality\n",
"validation_results = validate_imputation_quality(df_complete, df_missing, df_business_imputed)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Practice Exercises\n",
"\n",
"Apply missing data handling techniques to challenging scenarios:"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {},
"outputs": [],
"source": [
"# Exercise 1: Multi-step imputation strategy\n",
"# Create a sophisticated imputation pipeline that:\n",
"# 1. Handles different types of missing data appropriately\n",
"# 2. Uses multiple imputation methods in sequence\n",
"# 3. Validates results at each step\n",
"# 4. Creates comprehensive documentation\n",
"\n",
"def comprehensive_imputation_pipeline(df):\n",
" \"\"\"Comprehensive missing data handling pipeline\"\"\"\n",
" # Your implementation here\n",
" pass\n",
"\n",
"# result_df = comprehensive_imputation_pipeline(df_missing)\n",
"# print(\"Comprehensive pipeline results:\")\n",
"# print(result_df.isnull().sum())"
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {},
"outputs": [],
"source": [
"# Exercise 2: Missing data pattern analysis\n",
"# Analyze if missing data follows specific patterns:\n",
"# - Time-based patterns\n",
"# - User behavior patterns\n",
"# - System/technical patterns\n",
"# Create insights and recommendations\n",
"\n",
"# Your code here:\n"
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {},
"outputs": [],
"source": [
"# Exercise 3: Impact assessment\n",
"# Assess how different missing data handling approaches\n",
"# affect downstream analysis:\n",
"# - Statistical analysis results\n",
"# - Machine learning model performance\n",
"# - Business insights and decisions\n",
"\n",
"# Your code here:\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Key Takeaways\n",
"\n",
"1. **Understanding Missing Data Types**:\n",
" - **MCAR**: Missing Completely at Random\n",
" - **MAR**: Missing at Random (depends on observed data)\n",
" - **MNAR**: Missing Not at Random (depends on unobserved data)\n",
"\n",
"2. **Detection and Analysis**:\n",
" - Always analyze missing patterns before imputation\n",
" - Use visualizations to understand missing data structure\n",
" - Look for relationships between missing values and other variables\n",
"\n",
"3. **Handling Strategies**:\n",
" - **Deletion**: Simple but can lose valuable information\n",
" - **Simple Imputation**: Fast but may not preserve relationships\n",
" - **Advanced Methods**: KNN, MICE preserve more complex relationships\n",
" - **Business Logic**: Domain knowledge often provides best results\n",
"\n",
"4. **Best Practices**:\n",
" - Create missing data indicators for transparency\n",
" - Validate imputation quality against original data when possible\n",
" - Consider the impact on downstream analysis\n",
" - Document all imputation decisions and methods\n",
"\n",
"## Method Selection Guide\n",
"\n",
"| Scenario | Recommended Method | Rationale |\n",
"|----------|-------------------|----------|\n",
"| < 5% missing, MCAR | Simple imputation | Low impact, efficiency |\n",
"| 5-20% missing, MAR | KNN or Group-based | Preserve relationships |\n",
"| > 20% missing, complex patterns | MICE or Multiple imputation | Handle complex dependencies |\n",
"| Business-critical decisions | Domain knowledge + validation | Accuracy and explainability |\n",
"| Machine learning features | Advanced methods + indicators | Preserve predictive power |\n",
"\n",
"## Common Pitfalls to Avoid\n",
"\n",
"1. **Data Leakage**: Don't use future information to impute past values\n",
"2. **Ignoring Patterns**: Missing data often has meaningful patterns\n",
"3. **Over-imputation**: Sometimes missing data is informative itself\n",
"4. **One-size-fits-all**: Different columns may need different strategies\n",
"5. **No Validation**: Always check if imputation preserved data characteristics"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.13.3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}


@ -0,0 +1,937 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Session 1 - DataFrames - Lesson 7: Merging and Joining DataFrames\n",
"\n",
"## Learning Objectives\n",
"- Master different types of joins (inner, outer, left, right)\n",
"- Understand when to use merge vs join vs concat\n",
"- Handle duplicate keys and join conflicts\n",
"- Learn advanced merging techniques and best practices\n",
"- Practice with real-world data integration scenarios\n",
"\n",
"## Prerequisites\n",
"- Completed Lessons 1-6\n",
"- Understanding of relational database concepts (helpful)\n",
"- Basic knowledge of SQL joins (helpful but not required)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Import required libraries\n",
"import pandas as pd\n",
"import numpy as np\n",
"from datetime import datetime, timedelta\n",
"import matplotlib.pyplot as plt\n",
"import warnings\n",
"warnings.filterwarnings('ignore')\n",
"\n",
"# Set display options\n",
"pd.set_option('display.max_columns', None)\n",
"pd.set_option('display.max_rows', 50)\n",
"\n",
"print(\"Libraries loaded successfully!\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Creating Sample Datasets\n",
"\n",
"Let's create realistic datasets that represent common business scenarios."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Create sample datasets for merging examples\n",
"np.random.seed(42)\n",
"\n",
"# Customer dataset\n",
"customers_data = {\n",
" 'customer_id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],\n",
" 'customer_name': ['Alice Johnson', 'Bob Smith', 'Charlie Brown', 'Diana Prince', 'Eve Wilson',\n",
" 'Frank Miller', 'Grace Lee', 'Henry Davis', 'Ivy Chen', 'Jack Robinson'],\n",
" 'email': ['alice@email.com', 'bob@email.com', 'charlie@email.com', 'diana@email.com', 'eve@email.com',\n",
" 'frank@email.com', 'grace@email.com', 'henry@email.com', 'ivy@email.com', 'jack@email.com'],\n",
" 'age': [28, 35, 42, 31, 29, 45, 38, 33, 27, 41],\n",
" 'city': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix',\n",
" 'Philadelphia', 'San Antonio', 'San Diego', 'Dallas', 'San Jose'],\n",
" 'signup_date': pd.date_range('2023-01-01', periods=10, freq='M')\n",
"}\n",
"\n",
"df_customers = pd.DataFrame(customers_data)\n",
"\n",
"# Orders dataset\n",
"orders_data = {\n",
" 'order_id': [101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112],\n",
" 'customer_id': [1, 2, 1, 3, 4, 2, 5, 1, 6, 11, 3, 2], # Note: customer_id 11 doesn't exist in customers\n",
" 'order_date': pd.date_range('2023-06-01', periods=12, freq='W'),\n",
" 'product': ['Laptop', 'Phone', 'Tablet', 'Laptop', 'Monitor', 'Phone', \n",
" 'Headphones', 'Mouse', 'Keyboard', 'Laptop', 'Tablet', 'Monitor'],\n",
" 'quantity': [1, 2, 1, 1, 1, 1, 3, 2, 1, 1, 2, 1],\n",
" 'amount': [1200, 800, 400, 1200, 300, 800, 150, 50, 75, 1200, 800, 300]\n",
"}\n",
"\n",
"df_orders = pd.DataFrame(orders_data)\n",
"\n",
"# Product information dataset\n",
"products_data = {\n",
" 'product': ['Laptop', 'Phone', 'Tablet', 'Monitor', 'Headphones', 'Mouse', 'Keyboard', 'Webcam'],\n",
" 'category': ['Electronics', 'Electronics', 'Electronics', 'Electronics', \n",
" 'Audio', 'Accessories', 'Accessories', 'Electronics'],\n",
" 'price': [1200, 800, 400, 300, 150, 50, 75, 100],\n",
" 'supplier': ['TechCorp', 'MobileCorp', 'TechCorp', 'DisplayCorp', \n",
" 'AudioCorp', 'AccessoryCorp', 'AccessoryCorp', 'TechCorp']\n",
"}\n",
"\n",
"df_products = pd.DataFrame(products_data)\n",
"\n",
"# Customer segments dataset\n",
"segments_data = {\n",
" 'customer_id': [1, 2, 3, 4, 5, 6, 7, 8, 12, 13], # Some customers not in main customer table\n",
" 'segment': ['Premium', 'Standard', 'Premium', 'Standard', 'Basic', \n",
" 'Premium', 'Standard', 'Basic', 'Premium', 'Standard'],\n",
" 'loyalty_points': [1500, 800, 1200, 600, 200, 1800, 750, 300, 2000, 900]\n",
"}\n",
"\n",
"df_segments = pd.DataFrame(segments_data)\n",
"\n",
"print(\"Sample datasets created:\")\n",
"print(f\"Customers: {df_customers.shape}\")\n",
"print(f\"Orders: {df_orders.shape}\")\n",
"print(f\"Products: {df_products.shape}\")\n",
"print(f\"Segments: {df_segments.shape}\")\n",
"\n",
"print(\"\\nCustomers dataset:\")\n",
"print(df_customers.head())\n",
"\n",
"print(\"\\nOrders dataset:\")\n",
"print(df_orders.head())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Basic Merge Operations\n",
"\n",
"Understanding the fundamental merge operations and join types."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Inner Join - only matching records\n",
"print(\"=== INNER JOIN ===\")\n",
"inner_join = pd.merge(df_customers, df_orders, on='customer_id', how='inner')\n",
"print(f\"Result shape: {inner_join.shape}\")\n",
"print(\"Sample results:\")\n",
"print(inner_join[['customer_name', 'order_id', 'product', 'amount']].head())\n",
"\n",
"print(f\"\\nUnique customers in result: {inner_join['customer_id'].nunique()}\")\n",
"print(f\"Total orders: {len(inner_join)}\")\n",
"\n",
"# Check which customers have orders\n",
"customers_with_orders = inner_join['customer_id'].unique()\n",
"print(f\"Customers with orders: {sorted(customers_with_orders)}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Left Join - all records from left table\n",
"print(\"=== LEFT JOIN ===\")\n",
"left_join = pd.merge(df_customers, df_orders, on='customer_id', how='left')\n",
"print(f\"Result shape: {left_join.shape}\")\n",
"print(\"Sample results:\")\n",
"print(left_join[['customer_name', 'order_id', 'product', 'amount']].head(10))\n",
"\n",
"# Check customers without orders\n",
"customers_without_orders = left_join[left_join['order_id'].isnull()]['customer_name'].tolist()\n",
"print(f\"\\nCustomers without orders: {customers_without_orders}\")\n",
"\n",
"# Summary statistics\n",
"print(f\"\\nTotal records: {len(left_join)}\")\n",
"print(f\"Records with orders: {left_join['order_id'].notna().sum()}\")\n",
"print(f\"Records without orders: {left_join['order_id'].isnull().sum()}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Right Join - all records from right table\n",
"print(\"=== RIGHT JOIN ===\")\n",
"right_join = pd.merge(df_customers, df_orders, on='customer_id', how='right')\n",
"print(f\"Result shape: {right_join.shape}\")\n",
"print(\"Sample results:\")\n",
"print(right_join[['customer_name', 'order_id', 'product', 'amount']].head())\n",
"\n",
"# Check orders without customer information\n",
"orders_without_customers = right_join[right_join['customer_name'].isnull()]\n",
"print(f\"\\nOrders without customer info: {len(orders_without_customers)}\")\n",
"if len(orders_without_customers) > 0:\n",
" print(orders_without_customers[['customer_id', 'order_id', 'product', 'amount']])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Outer Join - all records from both tables\n",
"print(\"=== OUTER JOIN ===\")\n",
"outer_join = pd.merge(df_customers, df_orders, on='customer_id', how='outer')\n",
"print(f\"Result shape: {outer_join.shape}\")\n",
"\n",
"# Analyze the result\n",
"print(\"\\nData quality analysis:\")\n",
"print(f\"Records with complete customer info: {outer_join['customer_name'].notna().sum()}\")\n",
"print(f\"Records with complete order info: {outer_join['order_id'].notna().sum()}\")\n",
"print(f\"Records with both customer and order info: {(outer_join['customer_name'].notna() & outer_join['order_id'].notna()).sum()}\")\n",
"\n",
"# Show different categories of records\n",
"print(\"\\nCustomers without orders:\")\n",
"customers_only = outer_join[(outer_join['customer_name'].notna()) & (outer_join['order_id'].isnull())]\n",
"print(customers_only[['customer_name', 'city']].drop_duplicates())\n",
"\n",
"print(\"\\nOrders without customer data:\")\n",
"orders_only = outer_join[(outer_join['customer_name'].isnull()) & (outer_join['order_id'].notna())]\n",
"print(orders_only[['customer_id', 'order_id', 'product', 'amount']])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Multiple Table Joins\n",
"\n",
"Combining data from multiple sources in sequence."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Three-way join: Customers + Orders + Products\n",
"print(\"=== THREE-WAY JOIN ===\")\n",
"\n",
"# Step 1: Join customers and orders\n",
"customer_orders = pd.merge(df_customers, df_orders, on='customer_id', how='inner')\n",
"print(f\"After joining customers and orders: {customer_orders.shape}\")\n",
"\n",
"# Step 2: Join with products\n",
"complete_data = pd.merge(customer_orders, df_products, on='product', how='left')\n",
"print(f\"After joining with products: {complete_data.shape}\")\n",
"\n",
"# Display comprehensive view\n",
"print(\"\\nComplete order information:\")\n",
"display_cols = ['customer_name', 'order_id', 'product', 'category', 'quantity', 'amount', 'price', 'supplier']\n",
"print(complete_data[display_cols].head())\n",
"\n",
"# Verify data consistency\n",
"print(\"\\nData consistency check:\")\n",
"# Check if order amount matches product price * quantity\n",
"complete_data['calculated_amount'] = complete_data['price'] * complete_data['quantity']\n",
"amount_matches = (complete_data['amount'] == complete_data['calculated_amount']).all()\n",
"print(f\"Order amounts match calculated amounts: {amount_matches}\")\n",
"\n",
"if not amount_matches:\n",
" mismatched = complete_data[complete_data['amount'] != complete_data['calculated_amount']]\n",
" print(f\"\\nMismatched records: {len(mismatched)}\")\n",
" print(mismatched[['order_id', 'product', 'amount', 'calculated_amount']])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Add customer segment information\n",
"print(\"=== ADDING CUSTOMER SEGMENTS ===\")\n",
"\n",
"# Join with segments (left join to keep all customers)\n",
"customers_with_segments = pd.merge(df_customers, df_segments, on='customer_id', how='left')\n",
"print(f\"Customers with segments shape: {customers_with_segments.shape}\")\n",
"\n",
"# Check which customers don't have segment information\n",
"missing_segments = customers_with_segments[customers_with_segments['segment'].isnull()]\n",
"print(f\"\\nCustomers without segment info: {len(missing_segments)}\")\n",
"if len(missing_segments) > 0:\n",
" print(missing_segments[['customer_name', 'city']])\n",
"\n",
"# Create comprehensive customer profile\n",
"full_customer_profile = pd.merge(complete_data, df_segments, on='customer_id', how='left')\n",
"print(f\"\\nFull customer profile shape: {full_customer_profile.shape}\")\n",
"\n",
"# Analyze by segment\n",
"segment_analysis = full_customer_profile.groupby('segment').agg({\n",
" 'amount': ['sum', 'mean', 'count'],\n",
" 'customer_id': 'nunique'\n",
"}).round(2)\n",
"segment_analysis.columns = ['Total_Revenue', 'Avg_Order_Value', 'Total_Orders', 'Unique_Customers']\n",
"print(\"\\nRevenue by customer segment:\")\n",
"print(segment_analysis)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Advanced Merge Techniques\n",
"\n",
"Handling complex merging scenarios and edge cases."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Merge with different column names\n",
"print(\"=== MERGE WITH DIFFERENT COLUMN NAMES ===\")\n",
"\n",
"# Create a dataset with different column name\n",
"customer_demographics = pd.DataFrame({\n",
" 'cust_id': [1, 2, 3, 4, 5],\n",
" 'income_range': ['50-75k', '75-100k', '50-75k', '100k+', '25-50k'],\n",
" 'education': ['Bachelor', 'Master', 'PhD', 'Master', 'Bachelor'],\n",
" 'occupation': ['Engineer', 'Manager', 'Professor', 'Director', 'Analyst']\n",
"})\n",
"\n",
"# Merge using left_on and right_on parameters\n",
"customers_with_demographics = pd.merge(\n",
" df_customers, \n",
" customer_demographics, \n",
" left_on='customer_id', \n",
" right_on='cust_id', \n",
" how='left'\n",
")\n",
"\n",
"print(\"Merge with different column names:\")\n",
"print(customers_with_demographics[['customer_name', 'customer_id', 'cust_id', 'income_range', 'education']].head())\n",
"\n",
"# Clean up duplicate columns\n",
"customers_with_demographics = customers_with_demographics.drop('cust_id', axis=1)\n",
"print(f\"\\nAfter cleanup: {customers_with_demographics.shape}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Merge on multiple columns\n",
"print(\"=== MERGE ON MULTIPLE COLUMNS ===\")\n",
"\n",
"# Create time-based pricing data\n",
"pricing_data = pd.DataFrame({\n",
" 'product': ['Laptop', 'Laptop', 'Phone', 'Phone', 'Tablet', 'Tablet'],\n",
" 'date': pd.to_datetime(['2023-06-01', '2023-08-01', '2023-06-01', '2023-08-01', '2023-06-01', '2023-08-01']),\n",
" 'price': [1200, 1100, 800, 750, 400, 380],\n",
" 'promotion': [False, True, False, True, False, True]\n",
"})\n",
"\n",
"# Add year-month to orders for matching\n",
"df_orders_with_period = df_orders.copy()\n",
"df_orders_with_period['order_month'] = df_orders_with_period['order_date'].dt.to_period('M').dt.start_time\n",
"\n",
"# Create matching periods in pricing data\n",
"pricing_data['period'] = pricing_data['date'].dt.to_period('M').dt.start_time\n",
"\n",
"# Merge on product and time period\n",
"orders_with_pricing = pd.merge(\n",
" df_orders_with_period,\n",
" pricing_data,\n",
" left_on=['product', 'order_month'],\n",
" right_on=['product', 'period'],\n",
" how='left'\n",
")\n",
"\n",
"print(\"Orders with time-based pricing:\")\n",
"print(orders_with_pricing[['order_id', 'product', 'order_date', 'amount', 'price', 'promotion']].head())\n",
"\n",
"# Check for pricing discrepancies\n",
"pricing_discrepancies = orders_with_pricing[\n",
" (orders_with_pricing['amount'] != orders_with_pricing['price'] * orders_with_pricing['quantity']) &\n",
" orders_with_pricing['price'].notna()\n",
"]\n",
"print(f\"\\nOrders with pricing discrepancies: {len(pricing_discrepancies)}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Handling duplicate keys in merge\n",
"print(\"=== HANDLING DUPLICATE KEYS ===\")\n",
"\n",
"# Create data with duplicate keys\n",
"customer_contacts = pd.DataFrame({\n",
" 'customer_id': [1, 1, 2, 2, 3],\n",
" 'contact_type': ['email', 'phone', 'email', 'phone', 'email'],\n",
" 'contact_value': ['alice@email.com', '555-0101', 'bob@email.com', '555-0102', 'charlie@email.com'],\n",
" 'is_primary': [True, False, True, True, True]\n",
"})\n",
"\n",
"print(\"Customer contacts with duplicates:\")\n",
"print(customer_contacts)\n",
"\n",
"# Merge will create cartesian product for duplicate keys\n",
"customers_with_contacts = pd.merge(df_customers, customer_contacts, on='customer_id', how='inner')\n",
"print(f\"\\nResult of merge with duplicates: {customers_with_contacts.shape}\")\n",
"print(customers_with_contacts[['customer_name', 'contact_type', 'contact_value', 'is_primary']].head())\n",
"\n",
"# Strategy 1: Filter before merge\n",
"primary_contacts = customer_contacts[customer_contacts['is_primary'] == True]\n",
"customers_primary_contacts = pd.merge(df_customers, primary_contacts, on='customer_id', how='left')\n",
"print(f\"\\nAfter filtering to primary contacts: {customers_primary_contacts.shape}\")\n",
"\n",
"# Strategy 2: Pivot contacts to columns\n",
"contacts_pivoted = customer_contacts.pivot_table(\n",
" index='customer_id',\n",
" columns='contact_type',\n",
" values='contact_value',\n",
" aggfunc='first'\n",
").reset_index()\n",
"print(\"\\nPivoted contacts:\")\n",
"print(contacts_pivoted)\n",
"\n",
"customers_with_pivoted_contacts = pd.merge(df_customers, contacts_pivoted, on='customer_id', how='left')\n",
"print(f\"\\nAfter merging pivoted contacts: {customers_with_pivoted_contacts.shape}\")"
]
},
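{
"cell_type": "markdown",
"metadata": {},
"source": [
"Two built-in `pd.merge` options complement the manual checks above: `indicator=True` adds a `_merge` column showing where each row came from, and `validate=` raises an error when the join produces an unexpected key relationship (for example duplicate keys in a supposedly one-to-one merge). The cell below is a short sketch of both on the datasets already defined in this lesson."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Merge diagnostics with indicator= and validate=\n",
"print(\"=== MERGE DIAGNOSTICS: indicator AND validate ===\")\n",
"\n",
"# indicator=True labels each row as 'both', 'left_only', or 'right_only'\n",
"audit = pd.merge(df_customers, df_orders, on='customer_id', how='outer', indicator=True)\n",
"print(audit['_merge'].value_counts())\n",
"\n",
"# validate raises MergeError if the key relationship is not what we expect.\n",
"# customer_contacts has duplicate customer_id values, so a one-to-one merge should fail.\n",
"try:\n",
"    pd.merge(df_customers, customer_contacts, on='customer_id', validate='one_to_one')\n",
"except pd.errors.MergeError as e:\n",
"    print(f\"\\nvalidate='one_to_one' raised MergeError: {e}\")\n",
"\n",
"# A one-to-many expectation matches customers -> orders and passes silently\n",
"checked = pd.merge(df_customers, df_orders, on='customer_id', how='left', validate='one_to_many')\n",
"print(f\"\\nvalidate='one_to_many' passed, result shape: {checked.shape}\")"
]
},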
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Index-based Joins\n",
"\n",
"Using DataFrame indices for joining operations."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Set up DataFrames with indices\n",
"print(\"=== INDEX-BASED JOINS ===\")\n",
"\n",
"# Set customer_id as index\n",
"customers_indexed = df_customers.set_index('customer_id')\n",
"segments_indexed = df_segments.set_index('customer_id')\n",
"\n",
"print(\"Customers with index:\")\n",
"print(customers_indexed.head())\n",
"\n",
"# Join using indices\n",
"joined_by_index = customers_indexed.join(segments_indexed, how='left')\n",
"print(f\"\\nJoined by index shape: {joined_by_index.shape}\")\n",
"print(joined_by_index[['customer_name', 'city', 'segment', 'loyalty_points']].head())\n",
"\n",
"# Compare with merge\n",
"merged_equivalent = pd.merge(df_customers, df_segments, on='customer_id', how='left')\n",
"print(f\"\\nEquivalent merge shape: {merged_equivalent.shape}\")\n",
"\n",
"# Verify they're the same (after sorting)\n",
"joined_sorted = joined_by_index.reset_index().sort_values('customer_id')\n",
"merged_sorted = merged_equivalent.sort_values('customer_id')\n",
"are_equal = joined_sorted.equals(merged_sorted)\n",
"print(f\"Results are identical: {are_equal}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Multi-index joins\n",
"print(\"=== MULTI-INDEX JOINS ===\")\n",
"\n",
"# Create a dataset with multiple index levels\n",
"sales_by_region_product = pd.DataFrame({\n",
" 'region': ['North', 'North', 'South', 'South', 'East', 'East'],\n",
" 'product': ['Laptop', 'Phone', 'Laptop', 'Phone', 'Laptop', 'Phone'],\n",
" 'sales_target': [10, 15, 8, 12, 12, 18],\n",
" 'commission_rate': [0.05, 0.04, 0.06, 0.05, 0.05, 0.04]\n",
"})\n",
"\n",
"# Set multi-index\n",
"sales_targets = sales_by_region_product.set_index(['region', 'product'])\n",
"print(\"Sales targets with multi-index:\")\n",
"print(sales_targets)\n",
"\n",
"# Create customer orders with region mapping\n",
"customer_regions = {\n",
" 1: 'North', 2: 'South', 3: 'East', 4: 'North', 5: 'South', 6: 'East'\n",
"}\n",
"\n",
"orders_with_region = df_orders.copy()\n",
"orders_with_region['region'] = orders_with_region['customer_id'].map(customer_regions)\n",
"orders_with_region = orders_with_region.dropna(subset=['region'])\n",
"\n",
"# Merge on multiple columns to match multi-index\n",
"orders_with_targets = pd.merge(\n",
" orders_with_region,\n",
" sales_targets.reset_index(),\n",
" on=['region', 'product'],\n",
" how='left'\n",
")\n",
"\n",
"print(\"\\nOrders with sales targets:\")\n",
"print(orders_with_targets[['order_id', 'region', 'product', 'amount', 'sales_target', 'commission_rate']].head())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. Concatenation Operations\n",
"\n",
"Combining DataFrames vertically and horizontally."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Vertical concatenation (stacking DataFrames)\n",
"print(\"=== VERTICAL CONCATENATION ===\")\n",
"\n",
"# Create additional customer data (new batch)\n",
"new_customers = pd.DataFrame({\n",
" 'customer_id': [11, 12, 13, 14, 15],\n",
" 'customer_name': ['Kate Wilson', 'Liam Brown', 'Mia Garcia', 'Noah Jones', 'Olivia Miller'],\n",
" 'email': ['kate@email.com', 'liam@email.com', 'mia@email.com', 'noah@email.com', 'olivia@email.com'],\n",
" 'age': [26, 39, 31, 44, 28],\n",
" 'city': ['Austin', 'Seattle', 'Denver', 'Boston', 'Miami'],\n",
" 'signup_date': pd.date_range('2024-01-01', periods=5, freq='M')\n",
"})\n",
"\n",
"# Concatenate vertically\n",
"all_customers = pd.concat([df_customers, new_customers], ignore_index=True)\n",
"print(f\"Original customers: {len(df_customers)}\")\n",
"print(f\"New customers: {len(new_customers)}\")\n",
"print(f\"Combined customers: {len(all_customers)}\")\n",
"\n",
"print(\"\\nCombined customer data:\")\n",
"print(all_customers.tail())\n",
"\n",
"# Concatenation with different columns\n",
"customers_with_extra_info = pd.DataFrame({\n",
" 'customer_id': [16, 17],\n",
" 'customer_name': ['Paul Davis', 'Quinn Taylor'],\n",
" 'email': ['paul@email.com', 'quinn@email.com'],\n",
" 'age': [35, 29],\n",
" 'city': ['Portland', 'Nashville'],\n",
" 'signup_date': pd.date_range('2024-06-01', periods=2, freq='M'),\n",
" 'referral_source': ['Google', 'Facebook'] # Extra column\n",
"})\n",
"\n",
"# Concat with different columns (creates NaN for missing columns)\n",
"all_customers_extended = pd.concat([all_customers, customers_with_extra_info], ignore_index=True, sort=False)\n",
"print(f\"\\nAfter adding customers with extra info: {all_customers_extended.shape}\")\n",
"print(\"Missing values in referral_source:\")\n",
"print(all_customers_extended['referral_source'].isnull().sum())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Horizontal concatenation\n",
"print(\"=== HORIZONTAL CONCATENATION ===\")\n",
"\n",
"# Split customer data into parts\n",
"customer_basic_info = df_customers[['customer_id', 'customer_name', 'email']]\n",
"customer_demographics = df_customers[['customer_id', 'age', 'city', 'signup_date']]\n",
"\n",
"print(\"Customer basic info:\")\n",
"print(customer_basic_info.head())\n",
"\n",
"print(\"\\nCustomer demographics:\")\n",
"print(customer_demographics.head())\n",
"\n",
"# Concatenate horizontally (by index)\n",
"customers_recombined = pd.concat([customer_basic_info, customer_demographics.drop('customer_id', axis=1)], axis=1)\n",
"print(f\"\\nRecombined shape: {customers_recombined.shape}\")\n",
"print(customers_recombined.head())\n",
"\n",
"# Verify it matches original\n",
"columns_match = set(customers_recombined.columns) == set(df_customers.columns)\n",
"print(f\"\\nColumns match original: {columns_match}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Concat with keys (creating hierarchical columns)\n",
"print(\"=== CONCAT WITH KEYS ===\")\n",
"\n",
"# Create quarterly sales data\n",
"q1_sales = pd.DataFrame({\n",
" 'product': ['Laptop', 'Phone', 'Tablet'],\n",
" 'units_sold': [50, 75, 30],\n",
" 'revenue': [60000, 60000, 12000]\n",
"})\n",
"\n",
"q2_sales = pd.DataFrame({\n",
" 'product': ['Laptop', 'Phone', 'Tablet'],\n",
" 'units_sold': [45, 80, 35],\n",
" 'revenue': [54000, 64000, 14000]\n",
"})\n",
"\n",
"# Concatenate with keys\n",
"quarterly_sales = pd.concat([q1_sales, q2_sales], keys=['Q1', 'Q2'])\n",
"print(\"Quarterly sales with hierarchical index:\")\n",
"print(quarterly_sales)\n",
"\n",
"# Access specific quarter\n",
"print(\"\\nQ1 sales only:\")\n",
"print(quarterly_sales.loc['Q1'])\n",
"\n",
"# Create summary comparison\n",
"quarterly_comparison = pd.concat([q1_sales.set_index('product'), q2_sales.set_index('product')], \n",
" keys=['Q1', 'Q2'], axis=1)\n",
"print(\"\\nQuarterly comparison (side by side):\")\n",
"print(quarterly_comparison)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6. Performance and Best Practices\n",
"\n",
"Optimizing merge operations and avoiding common pitfalls."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Performance comparison: merge vs join\n",
"import time\n",
"\n",
"print(\"=== PERFORMANCE COMPARISON ===\")\n",
"\n",
"# Create larger datasets for performance testing\n",
"np.random.seed(42)\n",
"large_customers = pd.DataFrame({\n",
" 'customer_id': range(1, 10001),\n",
" 'customer_name': [f'Customer_{i}' for i in range(1, 10001)],\n",
" 'city': np.random.choice(['New York', 'Los Angeles', 'Chicago'], 10000)\n",
"})\n",
"\n",
"large_orders = pd.DataFrame({\n",
" 'order_id': range(1, 50001),\n",
" 'customer_id': np.random.randint(1, 10001, 50000),\n",
" 'amount': np.random.normal(100, 30, 50000)\n",
"})\n",
"\n",
"print(f\"Large customers: {large_customers.shape}\")\n",
"print(f\"Large orders: {large_orders.shape}\")\n",
"\n",
"# Test merge performance\n",
"start_time = time.time()\n",
"merged_result = pd.merge(large_customers, large_orders, on='customer_id', how='inner')\n",
"merge_time = time.time() - start_time\n",
"\n",
"# Test join performance\n",
"customers_indexed = large_customers.set_index('customer_id')\n",
"orders_indexed = large_orders.set_index('customer_id')\n",
"\n",
"start_time = time.time()\n",
"joined_result = customers_indexed.join(orders_indexed, how='inner')\n",
"join_time = time.time() - start_time\n",
"\n",
"print(f\"\\nMerge time: {merge_time:.4f} seconds\")\n",
"print(f\"Join time: {join_time:.4f} seconds\")\n",
"print(f\"Join is {merge_time/join_time:.2f}x faster\")\n",
"\n",
"print(f\"\\nResults shape - Merge: {merged_result.shape}, Join: {joined_result.shape}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Best practices and common pitfalls\n",
"print(\"=== BEST PRACTICES ===\")\n",
"\n",
"def analyze_merge_keys(df1, df2, key_col):\n",
" \"\"\"Analyze merge keys before joining\"\"\"\n",
" print(f\"\\n--- Analyzing merge on '{key_col}' ---\")\n",
" \n",
" # Check for duplicates\n",
" df1_dups = df1[key_col].duplicated().sum()\n",
" df2_dups = df2[key_col].duplicated().sum()\n",
" \n",
" print(f\"Duplicates in left table: {df1_dups}\")\n",
" print(f\"Duplicates in right table: {df2_dups}\")\n",
" \n",
" # Check for missing values\n",
" df1_missing = df1[key_col].isnull().sum()\n",
" df2_missing = df2[key_col].isnull().sum()\n",
" \n",
" print(f\"Missing values in left table: {df1_missing}\")\n",
" print(f\"Missing values in right table: {df2_missing}\")\n",
" \n",
" # Check overlap\n",
" left_keys = set(df1[key_col].dropna())\n",
" right_keys = set(df2[key_col].dropna())\n",
" \n",
" overlap = left_keys & right_keys\n",
" left_only = left_keys - right_keys\n",
" right_only = right_keys - left_keys\n",
" \n",
" print(f\"Keys in both tables: {len(overlap)}\")\n",
" print(f\"Keys only in left: {len(left_only)}\")\n",
" print(f\"Keys only in right: {len(right_only)}\")\n",
" \n",
" # Predict result sizes\n",
" if df1_dups == 0 and df2_dups == 0:\n",
" inner_size = len(overlap)\n",
" left_size = len(df1)\n",
" right_size = len(df2)\n",
" outer_size = len(left_keys | right_keys)\n",
" else:\n",
" print(\"Warning: Duplicates present, result size may be larger than expected\")\n",
" inner_size = \"Cannot predict (duplicates present)\"\n",
" left_size = \"Cannot predict (duplicates present)\"\n",
" right_size = \"Cannot predict (duplicates present)\"\n",
" outer_size = \"Cannot predict (duplicates present)\"\n",
" \n",
" print(f\"\\nPredicted result sizes:\")\n",
" print(f\"Inner join: {inner_size}\")\n",
" print(f\"Left join: {left_size}\")\n",
" print(f\"Right join: {right_size}\")\n",
" print(f\"Outer join: {outer_size}\")\n",
"\n",
"# Analyze our sample data\n",
"analyze_merge_keys(df_customers, df_orders, 'customer_id')\n",
"analyze_merge_keys(df_customers, df_segments, 'customer_id')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Data validation after merge\n",
"def validate_merge_result(df, expected_rows=None, key_col=None):\n",
" \"\"\"Validate merge results\"\"\"\n",
" print(\"\\n=== MERGE VALIDATION ===\")\n",
" \n",
" print(f\"Result shape: {df.shape}\")\n",
" \n",
" if expected_rows:\n",
" print(f\"Expected rows: {expected_rows}\")\n",
" if len(df) != expected_rows:\n",
" print(\"⚠️ Row count doesn't match expectation!\")\n",
" \n",
" # Check for unexpected duplicates\n",
" if key_col and key_col in df.columns:\n",
" duplicates = df[key_col].duplicated().sum()\n",
" if duplicates > 0:\n",
" print(f\"⚠️ Found {duplicates} duplicate keys after merge\")\n",
" \n",
" # Check for missing values in key columns\n",
" missing_summary = df.isnull().sum()\n",
" critical_missing = missing_summary[missing_summary > 0]\n",
" \n",
" if len(critical_missing) > 0:\n",
" print(\"Missing values after merge:\")\n",
" print(critical_missing)\n",
" \n",
" # Data type consistency\n",
" print(f\"\\nData types:\")\n",
" print(df.dtypes)\n",
" \n",
" return df\n",
"\n",
"# Example validation\n",
"sample_merge = pd.merge(df_customers, df_orders, on='customer_id', how='inner')\n",
"validated_result = validate_merge_result(sample_merge, key_col='customer_id')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Practice Exercises\n",
"\n",
"Apply merging and joining techniques to real-world scenarios:"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
"# Exercise 1: Customer Lifetime Value Analysis\n",
"# Create a comprehensive customer analysis by joining:\n",
"# - Customer demographics\n",
"# - Order history\n",
"# - Product information\n",
"# - Customer segments\n",
"# Calculate CLV metrics for each customer\n",
"\n",
"def calculate_customer_lifetime_value(customers, orders, products, segments):\n",
" \"\"\"Calculate comprehensive customer lifetime value metrics\"\"\"\n",
" # Your implementation here\n",
" pass\n",
"\n",
"# clv_analysis = calculate_customer_lifetime_value(df_customers, df_orders, df_products, df_segments)\n",
"# print(\"Customer Lifetime Value Analysis:\")\n",
"# print(clv_analysis.head())"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [],
"source": [
"# Exercise 2: Data Quality Assessment\n",
"# Create a function that analyzes data quality issues when merging multiple datasets:\n",
"# - Identify orphaned records\n",
"# - Find data inconsistencies\n",
"# - Suggest data cleaning steps\n",
"# - Provide merge recommendations\n",
"\n",
"# Your code here:\n"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [],
"source": [
"# Exercise 3: Time-series Join Challenge\n",
"# Create a complex time-based join scenario:\n",
"# - Join orders with time-varying product prices\n",
"# - Handle seasonal promotions\n",
"# - Calculate accurate historical revenue\n",
"# - Account for price changes over time\n",
"\n",
"# Your code here:\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Key Takeaways\n",
"\n",
"1. **Join Types**:\n",
" - **Inner**: Only matching records from both tables\n",
" - **Left**: All records from left table + matching from right\n",
" - **Right**: All records from right table + matching from left\n",
" - **Outer**: All records from both tables\n",
"\n",
"2. **Method Selection**:\n",
" - **`pd.merge()`**: Most flexible, works with any columns\n",
" - **`.join()`**: Faster for index-based joins\n",
" - **`pd.concat()`**: For stacking DataFrames vertically/horizontally\n",
"\n",
"3. **Best Practices**:\n",
" - Always analyze merge keys before joining\n",
" - Check for duplicates and missing values\n",
" - Validate results after merging\n",
" - Use appropriate join types for your use case\n",
" - Consider performance implications for large datasets\n",
"\n",
"4. **Common Pitfalls**:\n",
" - Cartesian products from duplicate keys\n",
" - Unexpected result sizes\n",
" - Data type inconsistencies\n",
" - Missing value propagation\n",
"\n",
"## Join Type Selection Guide\n",
"\n",
"| Use Case | Recommended Join | Rationale |\n",
"|----------|-----------------|----------|\n",
"| Customer orders analysis | Inner | Only customers with orders |\n",
"| Customer segmentation | Left | Keep all customers, add segment info |\n",
"| Order validation | Right | Keep all orders, check customer validity |\n",
"| Data completeness analysis | Outer | See all records and identify gaps |\n",
"| Performance-critical operations | Index-based join | Faster execution |\n",
"\n",
"## Performance Tips\n",
"\n",
"1. **Index Usage**: Set indexes for frequently joined columns\n",
"2. **Data Types**: Ensure consistent data types before joining\n",
"3. **Memory Management**: Consider chunking for very large datasets\n",
"4. **Join Order**: Start with smallest datasets\n",
"5. **Validation**: Always validate merge results"
]
}
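,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal, illustrative sketch of the pitfalls and best practices above (not part of the course exercises), the next cell reuses this lesson's `df_customers`, `df_orders`, and `customer_contacts`: the `validate` argument of `pd.merge` raises a `MergeError` when duplicate keys would create a cartesian product, and `indicator=True` adds a `_merge` column showing which table each row came from. It assumes `customer_id` is unique in `df_customers`, as elsewhere in this lesson."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Guard against accidental cartesian products and inspect unmatched keys\n",
"print(\"=== MERGE GUARDS: validate AND indicator ===\")\n",
"\n",
"# validate='one_to_many' asserts customer_id is unique on the left side;\n",
"# pandas raises pd.errors.MergeError if that assumption is violated\n",
"safe_merge = pd.merge(df_customers, df_orders, on='customer_id', how='left', validate='one_to_many')\n",
"print(f\"Validated one-to-many merge shape: {safe_merge.shape}\")\n",
"\n",
"# customer_contacts holds duplicate customer_id values, so a one_to_one check fails loudly\n",
"try:\n",
"    pd.merge(df_customers, customer_contacts, on='customer_id', validate='one_to_one')\n",
"except pd.errors.MergeError as e:\n",
"    print(f\"MergeError caught as expected: {e}\")\n",
"\n",
"# indicator=True adds a _merge column ('both', 'left_only', 'right_only') to expose unmatched keys\n",
"flagged = pd.merge(df_customers, df_orders, on='customer_id', how='outer', indicator=True)\n",
"print(\"\\nRow origin counts from the indicator column:\")\n",
"print(flagged['_merge'].value_counts())"
]
}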
],
"metadata": {
"kernelspec": {
"display_name": "venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.13.3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}

File diff suppressed because it is too large Load diff

File diff suppressed because it is too large Load diff

File diff suppressed because it is too large Load diff

File diff suppressed because it is too large Load diff

File diff suppressed because one or more lines are too long

View file

@ -0,0 +1,815 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Session 1 - DataFrames - Lesson 13: Advanced Data Cleaning\n",
"\n",
"## Learning Objectives\n",
"- Master advanced techniques for data cleaning and validation\n",
"- Learn to detect and handle various types of data quality issues\n",
"- Understand data standardization and normalization techniques\n",
"- Practice with real-world messy data scenarios\n",
"- Develop automated data cleaning pipelines\n",
"\n",
"## Prerequisites\n",
"- Completed previous lessons on DataFrames\n",
"- Understanding of basic data cleaning concepts\n",
"- Familiarity with regular expressions (helpful but not required)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Import required libraries\n",
"import pandas as pd\n",
"import numpy as np\n",
"import re\n",
"from datetime import datetime, timedelta\n",
"import warnings\n",
"warnings.filterwarnings('ignore')\n",
"\n",
"# Display settings\n",
"pd.set_option('display.max_columns', None)\n",
"pd.set_option('display.max_rows', 100)\n",
"\n",
"print(\"Libraries loaded successfully!\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Creating Messy Sample Data\n",
"\n",
"Let's create a realistic messy dataset to practice advanced cleaning techniques."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"\n",
"# Create intentionally messy data that mimics real-world issues\n",
"np.random.seed(42)\n",
"\n",
"# Base data\n",
"n_records = 200\n",
"messy_data = {\n",
" 'customer_id': [f'CUST{i:04d}' if i % 10 != 0 else f'cust{i:04d}' for i in range(1, n_records + 1)],\n",
" 'customer_name': [\n",
" 'John Smith', 'jane doe', 'MARY JOHNSON', 'bob wilson', 'Sarah Davis',\n",
" 'Mike Brown', 'lisa garcia', 'DAVID MILLER', 'Amy Wilson', 'Tom Anderson'\n",
" ] * 20,\n",
" 'email': [\n",
" 'john.smith@email.com', 'JANE.DOE@EMAIL.COM', 'mary@company.org',\n",
" 'bob..wilson@test.com', 'sarah@invalid-email', 'mike@email.com',\n",
" 'lisa.garcia@email.com', 'david@company.org', 'amy@email.com', 'tom@test.com'\n",
" ] * 20,\n",
" 'phone': [\n",
" '(555) 123-4567', '555.987.6543', '5551234567', '555-987-6543',\n",
" '(555)123-4567', '+1-555-123-4567', '555 123 4567', '5559876543',\n",
" '(555) 987 6543', '555-123-4567'\n",
" ] * 20,\n",
" 'address': [\n",
" '123 Main St, Anytown, NY 12345', '456 Oak Ave, Boston, MA 02101',\n",
" '789 Pine Rd, Los Angeles, CA 90210', '321 Elm St, Chicago, IL 60601',\n",
" '654 Maple Dr, Houston, TX 77001', '987 Cedar Ln, Phoenix, AZ 85001',\n",
" '147 Birch Way, Philadelphia, PA 19101', '258 Ash Ct, San Antonio, TX 78201',\n",
" '369 Walnut St, San Diego, CA 92101', '741 Cherry Ave, Dallas, TX 75201'\n",
" ] * 20,\n",
" 'purchase_amount': np.random.normal(100, 30, n_records).round(2),\n",
" 'purchase_date': [\n",
" '2024-01-15', '01/16/2024', '2024-1-17', '16-01-2024', '2024/01/18',\n",
" 'January 19, 2024', '2024-01-20', '01-21-24', '2024.01.22', '23/01/2024'\n",
" ] * 20,\n",
" 'category': [\n",
" 'Electronics', 'electronics', 'ELECTRONICS', 'Books', 'books',\n",
" 'Clothing', 'clothing', 'CLOTHING', 'Home & Garden', 'home&garden'\n",
" ] * 20,\n",
" 'satisfaction_score': np.random.choice([1, 2, 3, 4, 5, 99, -1, None], n_records, p=[0.05, 0.1, 0.15, 0.35, 0.3, 0.02, 0.02, 0.01])\n",
"}\n",
"\n",
"# Convert to DataFrame first\n",
"df_messy = pd.DataFrame(messy_data)\n",
"\n",
"# Introduce missing values and anomalies using proper indexing\n",
"df_messy.loc[df_messy.index[::25], 'customer_name'] = None # Some missing names\n",
"df_messy.loc[df_messy.index[::30], 'email'] = None # Some missing emails\n",
"df_messy.loc[df_messy.index[::35], 'purchase_amount'] = np.nan # Some missing amounts\n",
"df_messy.loc[df_messy.index[::40], 'purchase_amount'] = -999 # Invalid negative values\n",
"\n",
"# Add some duplicate records\n",
"duplicate_indices = [0, 1, 2, 3, 4]\n",
"duplicate_rows = df_messy.iloc[duplicate_indices].copy()\n",
"df_messy = pd.concat([df_messy, duplicate_rows], ignore_index=True)\n",
"\n",
"print(\"Messy dataset created:\")\n",
"print(f\"Shape: {df_messy.shape}\")\n",
"print(\"\\nFirst few rows:\")\n",
"print(df_messy.head(10))\n",
"print(\"\\nData types:\")\n",
"print(df_messy.dtypes)\n",
"print(\"\\nSample of data quality issues:\")\n",
"print(\"\\n1. Missing values:\")\n",
"print(df_messy.isnull().sum())\n",
"print(\"\\n2. Inconsistent formatting examples:\")\n",
"print(\"Customer IDs:\", df_messy['customer_id'].head(15).tolist())\n",
"print(\"Customer names:\", df_messy['customer_name'].dropna().head(5).tolist())\n",
"print(\"Categories:\", df_messy['category'].unique()[:5])\n",
"print(\"\\n3. Invalid satisfaction scores:\")\n",
"print(\"Unique satisfaction scores:\", sorted(df_messy['satisfaction_score'].dropna().unique()))\n",
"print(\"\\n4. Invalid purchase amounts:\")\n",
"print(\"Negative amounts:\", df_messy[df_messy['purchase_amount'] < 0]['purchase_amount'].count())\n",
"print(\"\\n5. Date format inconsistencies:\")\n",
"print(\"Sample dates:\", df_messy['purchase_date'].head(10).tolist())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Data Quality Assessment\n",
"\n",
"First, let's assess the quality of our messy data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def assess_data_quality(df):\n",
" \"\"\"Comprehensive data quality assessment\"\"\"\n",
" print(\"=== DATA QUALITY ASSESSMENT ===\")\n",
" print(f\"Dataset shape: {df.shape}\")\n",
" print(f\"Total cells: {df.size}\")\n",
" \n",
" # Missing values analysis\n",
" print(\"\\n--- Missing Values ---\")\n",
" missing_stats = pd.DataFrame({\n",
" 'Missing_Count': df.isnull().sum(),\n",
" 'Missing_Percentage': (df.isnull().sum() / len(df)) * 100\n",
" })\n",
" missing_stats = missing_stats[missing_stats['Missing_Count'] > 0]\n",
" print(missing_stats.round(2))\n",
" \n",
" # Duplicate analysis\n",
" print(\"\\n--- Duplicates ---\")\n",
" total_duplicates = df.duplicated().sum()\n",
" print(f\"Complete duplicate rows: {total_duplicates}\")\n",
" \n",
" # Column-specific analysis\n",
" print(\"\\n--- Column Analysis ---\")\n",
" for col in df.columns:\n",
" unique_count = df[col].nunique()\n",
" unique_percentage = (unique_count / len(df)) * 100\n",
" print(f\"{col}: {unique_count} unique values ({unique_percentage:.1f}%)\")\n",
" \n",
" # Data type issues\n",
" print(\"\\n--- Data Types ---\")\n",
" print(df.dtypes)\n",
" \n",
" return missing_stats, total_duplicates\n",
"\n",
"# Assess the messy data\n",
"missing_stats, duplicate_count = assess_data_quality(df_messy)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Identify specific data quality issues\n",
"def identify_issues(df):\n",
" \"\"\"Identify specific data quality issues\"\"\"\n",
" issues = []\n",
" \n",
" # Check for inconsistent formatting\n",
" print(\"=== SPECIFIC ISSUES IDENTIFIED ===\")\n",
" \n",
" # Customer ID formatting\n",
" id_patterns = df['customer_id'].str.extract(r'(CUST|cust)(\\d+)').fillna('')\n",
" inconsistent_ids = (id_patterns[0] == 'cust').sum()\n",
" print(f\"Inconsistent customer ID format: {inconsistent_ids} records\")\n",
" \n",
" # Email validation\n",
" email_pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$'\n",
" invalid_emails = ~df['email'].str.match(email_pattern, na=False)\n",
" print(f\"Invalid email formats: {invalid_emails.sum()} records\")\n",
" \n",
" # Negative purchase amounts\n",
" negative_amounts = (df['purchase_amount'] < 0).sum()\n",
" print(f\"Negative purchase amounts: {negative_amounts} records\")\n",
" \n",
" # Invalid satisfaction scores\n",
" invalid_scores = ((df['satisfaction_score'] < 1) | (df['satisfaction_score'] > 5)) & df['satisfaction_score'].notna()\n",
" print(f\"Invalid satisfaction scores: {invalid_scores.sum()} records\")\n",
" \n",
" # Category inconsistencies\n",
" category_variations = df['category'].value_counts()\n",
" print(f\"\\nCategory variations: {len(category_variations)} different values\")\n",
" print(category_variations)\n",
" \n",
" return issues\n",
"\n",
"issues = identify_issues(df_messy)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Text Data Standardization\n",
"\n",
"Clean and standardize text fields."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Text cleaning functions\n",
"def clean_text_data(df):\n",
" \"\"\"Comprehensive text data cleaning\"\"\"\n",
" df_clean = df.copy()\n",
" \n",
" # Standardize customer names\n",
" print(\"Cleaning customer names...\")\n",
" df_clean['customer_name_clean'] = df_clean['customer_name'].str.strip() # Remove whitespace\n",
" df_clean['customer_name_clean'] = df_clean['customer_name_clean'].str.title() # Title case\n",
" df_clean['customer_name_clean'] = df_clean['customer_name_clean'].str.replace(r'\\s+', ' ', regex=True) # Multiple spaces\n",
" \n",
" # Standardize customer IDs\n",
" print(\"Standardizing customer IDs...\")\n",
" df_clean['customer_id_clean'] = df_clean['customer_id'].str.upper() # All uppercase\n",
" df_clean['customer_id_clean'] = df_clean['customer_id_clean'].str.replace('CUST', 'CUST') # Ensure consistent prefix\n",
" \n",
" # Clean email addresses\n",
" print(\"Cleaning email addresses...\")\n",
" df_clean['email_clean'] = df_clean['email'].str.lower() # Lowercase\n",
" df_clean['email_clean'] = df_clean['email_clean'].str.strip() # Remove whitespace\n",
" df_clean['email_clean'] = df_clean['email_clean'].str.replace(r'\\.{2,}', '.', regex=True) # Multiple dots\n",
" \n",
" # Standardize categories\n",
" print(\"Standardizing categories...\")\n",
" category_mapping = {\n",
" 'electronics': 'Electronics',\n",
" 'ELECTRONICS': 'Electronics',\n",
" 'books': 'Books',\n",
" 'clothing': 'Clothing',\n",
" 'CLOTHING': 'Clothing',\n",
" 'home&garden': 'Home & Garden',\n",
" 'Home & Garden': 'Home & Garden'\n",
" }\n",
" df_clean['category_clean'] = df_clean['category'].map(category_mapping).fillna(df_clean['category'])\n",
" \n",
" return df_clean\n",
"\n",
"# Apply text cleaning\n",
"df_text_clean = clean_text_data(df_messy)\n",
"\n",
"print(\"\\nText cleaning comparison:\")\n",
"comparison_cols = ['customer_name', 'customer_name_clean', 'customer_id', 'customer_id_clean', \n",
" 'email', 'email_clean', 'category', 'category_clean']\n",
"print(df_text_clean[comparison_cols].head(10))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Advanced text cleaning with regex\n",
"def advanced_text_cleaning(df):\n",
" \"\"\"Advanced text cleaning using regular expressions\"\"\"\n",
" df_advanced = df.copy()\n",
" \n",
" # Extract and standardize address components\n",
" print(\"Processing addresses...\")\n",
" # Basic address pattern: number street, city, state zipcode\n",
" address_pattern = r'(\\d+)\\s+([^,]+),\\s*([^,]+),\\s*([A-Z]{2})\\s+(\\d{5})'\n",
" address_parts = df_advanced['address'].str.extract(address_pattern)\n",
" address_parts.columns = ['street_number', 'street_name', 'city', 'state', 'zipcode']\n",
" \n",
" # Clean street names\n",
" address_parts['street_name'] = address_parts['street_name'].str.title()\n",
" address_parts['city'] = address_parts['city'].str.title()\n",
" \n",
" # Combine cleaned parts\n",
" df_advanced['address_clean'] = (\n",
" address_parts['street_number'] + ' ' + address_parts['street_name'] + ', ' +\n",
" address_parts['city'] + ', ' + address_parts['state'] + ' ' + address_parts['zipcode']\n",
" )\n",
" \n",
" # Add individual address components\n",
" for col in address_parts.columns:\n",
" df_advanced[col] = address_parts[col]\n",
" \n",
" return df_advanced\n",
"\n",
"# Apply advanced cleaning\n",
"df_advanced_clean = advanced_text_cleaning(df_text_clean)\n",
"\n",
"print(\"Address cleaning results:\")\n",
"print(df_advanced_clean[['address', 'address_clean', 'city', 'state', 'zipcode']].head())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Phone Number Standardization\n",
"\n",
"Clean and standardize phone numbers using regex patterns."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def standardize_phone_numbers(df):\n",
" \"\"\"Standardize phone numbers to consistent format\"\"\"\n",
" df_phone = df.copy()\n",
" \n",
" def clean_phone(phone):\n",
" \"\"\"Clean individual phone number\"\"\"\n",
" if pd.isna(phone):\n",
" return None\n",
" \n",
" # Remove all non-digit characters\n",
" digits_only = re.sub(r'\\D', '', str(phone))\n",
" \n",
" # Handle different formats\n",
" if len(digits_only) == 10:\n",
" # Format as (XXX) XXX-XXXX\n",
" return f\"({digits_only[:3]}) {digits_only[3:6]}-{digits_only[6:]}\"\n",
" elif len(digits_only) == 11 and digits_only.startswith('1'):\n",
" # Remove country code and format\n",
" phone_part = digits_only[1:]\n",
" return f\"({phone_part[:3]}) {phone_part[3:6]}-{phone_part[6:]}\"\n",
" else:\n",
" # Invalid phone number\n",
" return 'INVALID'\n",
" \n",
" # Apply phone cleaning\n",
" df_phone['phone_clean'] = df_phone['phone'].apply(clean_phone)\n",
" \n",
" # Extract area code\n",
" df_phone['area_code'] = df_phone['phone_clean'].str.extract(r'\\((\\d{3})\\)')\n",
" \n",
" # Flag invalid phone numbers\n",
" df_phone['phone_is_valid'] = df_phone['phone_clean'] != 'INVALID'\n",
" \n",
" return df_phone\n",
"\n",
"# Apply phone standardization\n",
"df_phone_clean = standardize_phone_numbers(df_advanced_clean)\n",
"\n",
"print(\"Phone number standardization:\")\n",
"print(df_phone_clean[['phone', 'phone_clean', 'area_code', 'phone_is_valid']].head(15))\n",
"\n",
"print(\"\\nPhone validation summary:\")\n",
"print(df_phone_clean['phone_is_valid'].value_counts())\n",
"\n",
"print(\"\\nArea code distribution:\")\n",
"print(df_phone_clean['area_code'].value_counts().head())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Date Standardization\n",
"\n",
"Parse and standardize dates from various formats."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def standardize_dates(df):\n",
" \"\"\"Parse and standardize dates from multiple formats\"\"\"\n",
" df_dates = df.copy()\n",
" \n",
" def parse_date(date_str):\n",
" \"\"\"Try to parse date from various formats\"\"\"\n",
" if pd.isna(date_str):\n",
" return None\n",
" \n",
" date_str = str(date_str).strip()\n",
" \n",
" # Common date formats to try\n",
" formats = [\n",
" '%Y-%m-%d', # 2024-01-15\n",
" '%m/%d/%Y', # 01/16/2024\n",
" '%Y-%m-%d', # 2024-1-17 (handled by first format)\n",
" '%d-%m-%Y', # 16-01-2024\n",
" '%Y/%m/%d', # 2024/01/18\n",
" '%B %d, %Y', # January 19, 2024\n",
" '%m-%d-%y', # 01-21-24\n",
" '%Y.%m.%d', # 2024.01.22\n",
" '%d/%m/%Y' # 23/01/2024\n",
" ]\n",
" \n",
" for fmt in formats:\n",
" try:\n",
" return pd.to_datetime(date_str, format=fmt)\n",
" except ValueError:\n",
" continue\n",
" \n",
" # If all else fails, try pandas' flexible parser\n",
" try:\n",
" return pd.to_datetime(date_str, infer_datetime_format=True)\n",
" except:\n",
" return None\n",
" \n",
" # Apply date parsing\n",
" print(\"Parsing dates...\")\n",
" df_dates['purchase_date_clean'] = df_dates['purchase_date'].apply(parse_date)\n",
" \n",
" # Flag unparseable dates\n",
" df_dates['date_is_valid'] = df_dates['purchase_date_clean'].notna()\n",
" \n",
" # Extract date components for valid dates\n",
" df_dates['purchase_year'] = df_dates['purchase_date_clean'].dt.year\n",
" df_dates['purchase_month'] = df_dates['purchase_date_clean'].dt.month\n",
" df_dates['purchase_day'] = df_dates['purchase_date_clean'].dt.day\n",
" df_dates['purchase_day_of_week'] = df_dates['purchase_date_clean'].dt.day_name()\n",
" \n",
" return df_dates\n",
"\n",
"# Apply date standardization\n",
"df_date_clean = standardize_dates(df_phone_clean)\n",
"\n",
"print(\"Date standardization results:\")\n",
"print(df_date_clean[['purchase_date', 'purchase_date_clean', 'date_is_valid', \n",
" 'purchase_year', 'purchase_month', 'purchase_day_of_week']].head(15))\n",
"\n",
"print(\"\\nDate parsing summary:\")\n",
"print(df_date_clean['date_is_valid'].value_counts())\n",
"\n",
"invalid_dates = df_date_clean[~df_date_clean['date_is_valid']]['purchase_date'].unique()\n",
"if len(invalid_dates) > 0:\n",
" print(f\"\\nInvalid date formats found: {invalid_dates}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. Numerical Data Cleaning\n",
"\n",
"Handle outliers, invalid values, and missing numerical data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def clean_numerical_data(df):\n",
" \"\"\"Clean and validate numerical data\"\"\"\n",
" df_numeric = df.copy()\n",
" \n",
" # Clean purchase amounts\n",
" print(\"Cleaning purchase amounts...\")\n",
" \n",
" # Flag invalid values\n",
" df_numeric['amount_is_valid'] = (\n",
" df_numeric['purchase_amount'].notna() & \n",
" (df_numeric['purchase_amount'] >= 0) & \n",
" (df_numeric['purchase_amount'] <= 10000) # Reasonable upper limit\n",
" )\n",
" \n",
" # Replace invalid values with NaN\n",
" df_numeric['purchase_amount_clean'] = df_numeric['purchase_amount'].where(\n",
" df_numeric['amount_is_valid'], np.nan\n",
" )\n",
" \n",
" # Detect outliers using IQR method\n",
" Q1 = df_numeric['purchase_amount_clean'].quantile(0.25)\n",
" Q3 = df_numeric['purchase_amount_clean'].quantile(0.75)\n",
" IQR = Q3 - Q1\n",
" lower_bound = Q1 - 1.5 * IQR\n",
" upper_bound = Q3 + 1.5 * IQR\n",
" \n",
" df_numeric['amount_is_outlier'] = (\n",
" (df_numeric['purchase_amount_clean'] < lower_bound) |\n",
" (df_numeric['purchase_amount_clean'] > upper_bound)\n",
" )\n",
" \n",
" # Clean satisfaction scores\n",
" print(\"Cleaning satisfaction scores...\")\n",
" \n",
" # Valid satisfaction scores are 1-5\n",
" df_numeric['satisfaction_is_valid'] = (\n",
" df_numeric['satisfaction_score'].notna() &\n",
" (df_numeric['satisfaction_score'].between(1, 5))\n",
" )\n",
" \n",
" df_numeric['satisfaction_score_clean'] = df_numeric['satisfaction_score'].where(\n",
" df_numeric['satisfaction_is_valid'], np.nan\n",
" )\n",
" \n",
" return df_numeric\n",
"\n",
"# Apply numerical cleaning\n",
"df_numeric_clean = clean_numerical_data(df_date_clean)\n",
"\n",
"print(\"Numerical data cleaning results:\")\n",
"print(df_numeric_clean[['purchase_amount', 'purchase_amount_clean', 'amount_is_valid', \n",
" 'amount_is_outlier', 'satisfaction_score', 'satisfaction_score_clean', \n",
" 'satisfaction_is_valid']].head(15))\n",
"\n",
"print(\"\\nNumerical data quality summary:\")\n",
"print(f\"Valid purchase amounts: {df_numeric_clean['amount_is_valid'].sum()}/{len(df_numeric_clean)}\")\n",
"print(f\"Outlier amounts: {df_numeric_clean['amount_is_outlier'].sum()}\")\n",
"print(f\"Valid satisfaction scores: {df_numeric_clean['satisfaction_is_valid'].sum()}/{len(df_numeric_clean)}\")\n",
"\n",
"# Show statistics for cleaned data\n",
"print(\"\\nCleaned amount statistics:\")\n",
"print(df_numeric_clean['purchase_amount_clean'].describe())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6. Duplicate Detection and Handling\n",
"\n",
"Identify and handle duplicate records intelligently."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def handle_duplicates(df):\n",
" \"\"\"Comprehensive duplicate detection and handling\"\"\"\n",
" df_dedup = df.copy()\n",
" \n",
" print(\"=== DUPLICATE ANALYSIS ===\")\n",
" \n",
" # 1. Exact duplicates\n",
" exact_duplicates = df_dedup.duplicated()\n",
" print(f\"Exact duplicate rows: {exact_duplicates.sum()}\")\n",
" \n",
" # 2. Duplicates based on key columns (likely same customer)\n",
" key_cols = ['customer_name_clean', 'email_clean']\n",
" key_duplicates = df_dedup.duplicated(subset=key_cols, keep=False)\n",
" print(f\"Duplicate customers (by name/email): {key_duplicates.sum()}\")\n",
" \n",
" # 3. Near duplicates (similar but not exact)\n",
" # For demonstration, we'll check phone numbers\n",
" phone_duplicates = df_dedup.duplicated(subset=['phone_clean'], keep=False)\n",
" print(f\"Duplicate phone numbers: {phone_duplicates.sum()}\")\n",
" \n",
" # Show duplicate examples\n",
" if key_duplicates.any():\n",
" print(\"\\nExample duplicate customers:\")\n",
" duplicate_customers = df_dedup[key_duplicates].sort_values(key_cols)\n",
" print(duplicate_customers[key_cols + ['customer_id_clean', 'purchase_amount_clean']].head(10))\n",
" \n",
" # Remove exact duplicates\n",
" print(f\"\\nRemoving {exact_duplicates.sum()} exact duplicates...\")\n",
" df_no_exact_dups = df_dedup[~exact_duplicates]\n",
" \n",
" # For customer duplicates, keep the one with the highest purchase amount\n",
" print(\"Handling customer duplicates (keeping highest purchase)...\")\n",
" df_final = df_no_exact_dups.sort_values('purchase_amount_clean', ascending=False).drop_duplicates(\n",
" subset=key_cols, keep='first'\n",
" )\n",
" \n",
" print(f\"Final dataset size after deduplication: {len(df_final)} (was {len(df)})\")\n",
" \n",
" return df_final\n",
"\n",
"# Apply duplicate handling\n",
"df_deduplicated = handle_duplicates(df_numeric_clean)\n",
"\n",
"print(f\"\\nRows removed: {len(df_numeric_clean) - len(df_deduplicated)}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 7. Data Validation and Quality Scores\n",
"\n",
"Create comprehensive data quality metrics."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def calculate_quality_scores(df):\n",
" \"\"\"Calculate comprehensive data quality scores\"\"\"\n",
" df_quality = df.copy()\n",
" \n",
" # Define quality checks\n",
" quality_checks = {\n",
" 'has_customer_name': df_quality['customer_name_clean'].notna(),\n",
" 'has_valid_email': df_quality['email_clean'].notna() & \n",
" df_quality['email_clean'].str.match(r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$', na=False),\n",
" 'has_valid_phone': df_quality['phone_is_valid'] == True,\n",
" 'has_valid_date': df_quality['date_is_valid'] == True,\n",
" 'has_valid_amount': df_quality['amount_is_valid'] == True,\n",
" 'has_valid_satisfaction': df_quality['satisfaction_is_valid'] == True,\n",
" 'amount_not_outlier': df_quality['amount_is_outlier'] == False,\n",
" 'has_complete_address': df_quality['city'].notna() & df_quality['state'].notna() & df_quality['zipcode'].notna()\n",
" }\n",
" \n",
" # Add individual quality flags\n",
" for check_name, check_result in quality_checks.items():\n",
" df_quality[f'quality_{check_name}'] = check_result.astype(int)\n",
" \n",
" # Calculate overall quality score (percentage of passed checks)\n",
" quality_cols = [col for col in df_quality.columns if col.startswith('quality_')]\n",
" df_quality['data_quality_score'] = df_quality[quality_cols].mean(axis=1) * 100\n",
" \n",
" # Categorize quality levels\n",
" def quality_category(score):\n",
" if score >= 90:\n",
" return 'Excellent'\n",
" elif score >= 75:\n",
" return 'Good'\n",
" elif score >= 50:\n",
" return 'Fair'\n",
" else:\n",
" return 'Poor'\n",
" \n",
" df_quality['quality_category'] = df_quality['data_quality_score'].apply(quality_category)\n",
" \n",
" return df_quality, quality_checks\n",
"\n",
"# Calculate quality scores\n",
"df_with_quality, quality_checks = calculate_quality_scores(df_deduplicated)\n",
"\n",
"print(\"Data quality analysis:\")\n",
"print(df_with_quality[['customer_name_clean', 'data_quality_score', 'quality_category']].head(10))\n",
"\n",
"print(\"\\nQuality category distribution:\")\n",
"print(df_with_quality['quality_category'].value_counts())\n",
"\n",
"print(\"\\nAverage quality scores by check:\")\n",
"quality_summary = {}\n",
"for check_name in quality_checks.keys():\n",
" col_name = f'quality_{check_name}'\n",
" quality_summary[check_name] = df_with_quality[col_name].mean() * 100\n",
"\n",
"quality_df = pd.DataFrame(list(quality_summary.items()), columns=['Quality_Check', 'Pass_Rate_%'])\n",
"quality_df = quality_df.sort_values('Pass_Rate_%', ascending=False)\n",
"print(quality_df.round(1))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Practice Exercises\n",
"\n",
"Apply advanced data cleaning techniques to challenging scenarios:"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [],
"source": [
"# Exercise 1: Create a custom validation function\n",
"# Build a function that validates business rules:\n",
"# - Email domains should be from approved list\n",
"# - Purchase amounts should be within reasonable ranges by category\n",
"# - Dates should be within business operating period\n",
"# - Customer IDs should follow specific format patterns\n",
"\n",
"def validate_business_rules(df):\n",
" \"\"\"Validate business-specific rules\"\"\"\n",
" # Your implementation here\n",
" pass\n",
"\n",
"# validation_results = validate_business_rules(df_final_clean)\n",
"# print(validation_results)"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [],
"source": [
"# Exercise 2: Advanced duplicate detection\n",
"# Implement fuzzy matching for near-duplicate detection:\n",
"# - Similar names (edit distance)\n",
"# - Similar addresses\n",
"# - Similar email patterns\n",
"\n",
"# Your code here:\n"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [],
"source": [
"# Exercise 3: Data cleaning metrics dashboard\n",
"# Create a comprehensive data quality dashboard that shows:\n",
"# - Data quality trends over time\n",
"# - Field-by-field quality scores\n",
"# - Impact of cleaning steps\n",
"# - Recommendations for further improvement\n",
"\n",
"# Your code here:\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Key Takeaways\n",
"\n",
"1. **Assessment First**: Always assess data quality before cleaning\n",
"2. **Systematic Approach**: Use a structured pipeline for consistent results\n",
"3. **Preserve Original Data**: Keep original values while creating cleaned versions\n",
"4. **Document Everything**: Log all cleaning steps and decisions\n",
"5. **Validation**: Implement business rule validation\n",
"6. **Quality Metrics**: Measure and track data quality improvements\n",
"7. **Reusable Pipeline**: Create automated, configurable cleaning processes\n",
"8. **Context Matters**: Consider domain-specific requirements\n",
"\n",
"## Common Data Issues and Solutions\n",
"\n",
"| Issue | Detection Method | Solution |\n",
"|-------|-----------------|----------|\n",
"| Inconsistent Format | Pattern analysis | Standardization rules |\n",
"| Missing Values | `.isnull()` | Imputation or flagging |\n",
"| Duplicates | `.duplicated()` | Deduplication logic |\n",
"| Outliers | Statistical methods | Capping or flagging |\n",
"| Invalid Values | Business rules | Validation and correction |\n",
"| Inconsistent Naming | String analysis | Normalization |\n",
"| Date Issues | Parsing attempts | Multiple format handling |\n",
"| Text Issues | Regex patterns | Cleaning and standardization |\n",
"\n",
"## Best Practices\n",
"\n",
"1. **Start with Exploration**: Understand your data before cleaning\n",
"2. **Preserve Traceability**: Keep original and cleaned versions\n",
"3. **Validate Assumptions**: Test cleaning rules on sample data\n",
"4. **Measure Impact**: Quantify improvements from cleaning\n",
"5. **Automate When Possible**: Build reusable cleaning pipelines\n",
"6. **Handle Edge Cases**: Plan for unusual but valid data\n",
"7. **Business Context**: Include domain experts in rule definition\n",
"8. **Iterative Process**: Refine cleaning rules based on results\n"
]
}
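,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch of the \"reusable pipeline\" recommendation above, the next cell simply chains the cleaning functions defined earlier in this lesson and assumes they (and `df_messy`) are still in memory. A production pipeline would add configuration, logging, and error handling on top of this."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Minimal reusable cleaning pipeline: chain the functions defined earlier in this lesson\n",
"def run_cleaning_pipeline(raw_df):\n",
"    \"\"\"Apply the lesson's cleaning steps in order and return the cleaned frame with quality scores\"\"\"\n",
"    steps = [\n",
"        clean_text_data,            # names, IDs, emails, categories\n",
"        advanced_text_cleaning,     # address parsing\n",
"        standardize_phone_numbers,  # phone formats + validity flag\n",
"        standardize_dates,          # multi-format date parsing\n",
"        clean_numerical_data,       # amounts, outliers, satisfaction scores\n",
"        handle_duplicates,          # exact + key-based deduplication\n",
"    ]\n",
"    df = raw_df.copy()\n",
"    for step in steps:\n",
"        df = step(df)\n",
"    df, _ = calculate_quality_scores(df)  # returns (DataFrame, quality_checks dict)\n",
"    return df\n",
"\n",
"pipeline_result = run_cleaning_pipeline(df_messy)\n",
"print(f\"\\nPipeline output shape: {pipeline_result.shape}\")\n",
"print(pipeline_result[['customer_id_clean', 'data_quality_score', 'quality_category']].head())"
]
}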
],
"metadata": {
"kernelspec": {
"display_name": "venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.13.3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}

1301
Session_01/ohlcv_analysis.ipynb Executable file

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long