{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Session 1 - DataFrames - Lesson 6: Handling Missing Data\n",
"\n",
"## Learning Objectives\n",
"- Understand different types of missing data and their implications\n",
"- Master techniques for detecting and analyzing missing values\n",
"- Learn various strategies for handling missing data\n",
"- Practice imputation methods and their trade-offs\n",
"- Develop best practices for missing data management\n",
"\n",
"## Prerequisites\n",
"- Completed Lessons 1-5\n",
"- Understanding of basic statistical concepts\n",
"- Familiarity with data quality principles"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Import required libraries\n",
"import pandas as pd\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"from datetime import datetime, timedelta\n",
"import warnings\n",
"warnings.filterwarnings('ignore')\n",
"\n",
"# Set display options\n",
"pd.set_option('display.max_columns', None)\n",
"plt.style.use('seaborn-v0_8')\n",
"\n",
"print(\"Libraries loaded successfully!\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Creating Dataset with Missing Values\n",
"\n",
"Let's create a realistic dataset with different patterns of missing data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Create comprehensive dataset with various missing data patterns\n",
"np.random.seed(42)\n",
"n_records = 500\n",
"\n",
"# Base data\n",
"data = {\n",
"    'customer_id': range(1, n_records + 1),\n",
"    'age': np.random.normal(35, 12, n_records).astype(int),\n",
"    'income': np.random.normal(50000, 15000, n_records),\n",
"    'education_years': np.random.normal(14, 3, n_records),\n",
"    'purchase_amount': np.random.normal(200, 50, n_records),\n",
"    'satisfaction_score': np.random.randint(1, 6, n_records),\n",
"    'region': np.random.choice(['North', 'South', 'East', 'West'], n_records),\n",
"    'product_category': np.random.choice(['Electronics', 'Clothing', 'Books', 'Home'], n_records),\n",
"    'signup_date': pd.date_range('2023-01-01', periods=n_records, freq='D'),\n",
"    'last_purchase_date': pd.date_range('2023-01-01', periods=n_records, freq='D') + pd.Timedelta(days=30)\n",
"}\n",
"\n",
"df_complete = pd.DataFrame(data)\n",
"\n",
"# Ensure positive values where appropriate\n",
"df_complete['age'] = np.abs(df_complete['age'])\n",
"df_complete['income'] = np.abs(df_complete['income'])\n",
"df_complete['education_years'] = np.clip(df_complete['education_years'], 6, 20)\n",
"df_complete['purchase_amount'] = np.abs(df_complete['purchase_amount'])\n",
"\n",
"print(\"Complete dataset created:\")\n",
"print(f\"Shape: {df_complete.shape}\")\n",
"print(\"\\nFirst few rows:\")\n",
"print(df_complete.head())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Introduce different patterns of missing data\n",
"df_missing = df_complete.copy()\n",
"\n",
"# 1. Missing Completely at Random (MCAR) - income data\n",
"# Randomly missing 15% of income values\n",
"mcar_indices = np.random.choice(df_missing.index, size=int(0.15 * len(df_missing)), replace=False)\n",
"df_missing.loc[mcar_indices, 'income'] = np.nan\n",
"\n",
"# 2. Missing at Random (MAR) - education years missing based on age\n",
"# Older people less likely to report education\n",
"older_customers = df_missing['age'] > 60\n",
"older_indices = df_missing[older_customers].index\n",
"education_missing = np.random.choice(older_indices, size=int(0.4 * len(older_indices)), replace=False)\n",
"df_missing.loc[education_missing, 'education_years'] = np.nan\n",
"\n",
"# 3. Missing Not at Random (MNAR) - satisfaction scores\n",
"# Unsatisfied customers less likely to provide ratings\n",
"low_satisfaction = df_missing['satisfaction_score'] <= 2\n",
"low_sat_indices = df_missing[low_satisfaction].index\n",
"satisfaction_missing = np.random.choice(low_sat_indices, size=int(0.6 * len(low_sat_indices)), replace=False)\n",
"df_missing.loc[satisfaction_missing, 'satisfaction_score'] = np.nan\n",
"\n",
"# 4. Systematic missing - last purchase date for new customers\n",
"# New customers (signed up recently) haven't made purchases yet\n",
"recent_signups = df_missing['signup_date'] > '2023-11-01'\n",
"df_missing.loc[recent_signups, 'last_purchase_date'] = pd.NaT\n",
"\n",
"# 5. Random missing in other columns\n",
"# Purchase amount - 10% missing\n",
"purchase_missing = np.random.choice(df_missing.index, size=int(0.10 * len(df_missing)), replace=False)\n",
"df_missing.loc[purchase_missing, 'purchase_amount'] = np.nan\n",
"\n",
"print(\"Missing data patterns introduced:\")\n",
"print(f\"Dataset shape: {df_missing.shape}\")\n",
"print(\"\\nMissing value counts:\")\n",
"missing_summary = df_missing.isnull().sum()\n",
"missing_summary = missing_summary[missing_summary > 0]\n",
"print(missing_summary)\n",
"\n",
"print(\"\\nMissing value percentages:\")\n",
"missing_pct = (df_missing.isnull().sum() / len(df_missing) * 100).round(2)\n",
"missing_pct = missing_pct[missing_pct > 0]\n",
"print(missing_pct)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Detecting and Analyzing Missing Data\n",
"\n",
"Comprehensive techniques for understanding missing data patterns."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def analyze_missing_data(df):\n",
"    \"\"\"Comprehensive missing data analysis\"\"\"\n",
"    print(\"=== MISSING DATA ANALYSIS ===\")\n",
"    \n",
"    # Basic missing data statistics\n",
"    total_cells = df.size\n",
"    total_missing = df.isnull().sum().sum()\n",
"    print(f\"Total cells: {total_cells:,}\")\n",
"    print(f\"Missing cells: {total_missing:,} ({total_missing/total_cells*100:.2f}%)\")\n",
"    \n",
"    # Missing data by column\n",
"    missing_by_column = pd.DataFrame({\n",
"        'Missing_Count': df.isnull().sum(),\n",
"        'Missing_Percentage': (df.isnull().sum() / len(df)) * 100,\n",
"        'Data_Type': df.dtypes\n",
"    })\n",
"    missing_by_column = missing_by_column[missing_by_column['Missing_Count'] > 0]\n",
"    missing_by_column = missing_by_column.sort_values('Missing_Percentage', ascending=False)\n",
"    \n",
"    print(\"\\n--- Missing Data by Column ---\")\n",
"    print(missing_by_column.round(2))\n",
"    \n",
"    # Missing data patterns\n",
"    print(\"\\n--- Missing Data Patterns ---\")\n",
"    missing_patterns = df.isnull().value_counts().head(10)\n",
"    print(\"Top 10 missing patterns (True = Missing):\")\n",
"    for pattern, count in missing_patterns.items():\n",
"        percentage = (count / len(df)) * 100\n",
"        print(f\"{count:4d} rows ({percentage:5.1f}%): {dict(zip(df.columns, pattern))}\")\n",
"    \n",
"    return missing_by_column\n",
"\n",
"# Analyze missing data\n",
"missing_analysis = analyze_missing_data(df_missing)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Visualize missing data patterns\n",
"def visualize_missing_data(df):\n",
"    \"\"\"Create visualizations for missing data patterns\"\"\"\n",
"    fig, axes = plt.subplots(2, 2, figsize=(15, 10))\n",
"    \n",
"    # 1. Missing data heatmap\n",
"    missing_mask = df.isnull()\n",
"    sns.heatmap(missing_mask.iloc[:100],\n",
"                yticklabels=False,\n",
"                cbar=True,\n",
"                cmap='viridis',\n",
"                ax=axes[0, 0])\n",
"    axes[0, 0].set_title('Missing Data Heatmap (First 100 rows)')\n",
"    \n",
"    # 2. Missing data by column\n",
"    missing_counts = df.isnull().sum()\n",
"    missing_counts = missing_counts[missing_counts > 0]\n",
"    missing_counts.plot(kind='bar', ax=axes[0, 1], color='skyblue')\n",
"    axes[0, 1].set_title('Missing Values by Column')\n",
"    axes[0, 1].set_ylabel('Count')\n",
"    axes[0, 1].tick_params(axis='x', rotation=45)\n",
"    \n",
"    # 3. Missing data correlation\n",
"    missing_corr = df.isnull().corr()\n",
"    sns.heatmap(missing_corr, annot=True, cmap='coolwarm', center=0, ax=axes[1, 0])\n",
"    axes[1, 0].set_title('Missing Data Correlation')\n",
"    \n",
"    # 4. Missing data by row\n",
"    missing_per_row = df.isnull().sum(axis=1)\n",
"    missing_per_row.hist(bins=range(len(df.columns) + 2), ax=axes[1, 1], alpha=0.7, color='orange')\n",
"    axes[1, 1].set_title('Distribution of Missing Values per Row')\n",
"    axes[1, 1].set_xlabel('Number of Missing Values')\n",
"    axes[1, 1].set_ylabel('Number of Rows')\n",
"    \n",
"    plt.tight_layout()\n",
"    plt.show()\n",
"\n",
"# Visualize missing patterns\n",
"visualize_missing_data(df_missing)"
]
},
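{
"cell_type": "markdown",
"metadata": {},
"source": [
"The diagnostics above are also available off the shelf in the third-party `missingno` package. The cell below is a minimal sketch; it assumes `missingno` is installed (`pip install missingno`), which the lesson does not otherwise require."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: ready-made missing-data plots via missingno (assumed installed)\n",
"import missingno as msno\n",
"\n",
"# Nullity matrix - similar to the manual heatmap above\n",
"msno.matrix(df_missing)\n",
"plt.show()\n",
"\n",
"# Nullity correlation between columns - similar to the correlation heatmap above\n",
"msno.heatmap(df_missing)\n",
"plt.show()"
]
},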
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Analyze missing data relationships\n",
"def analyze_missing_relationships(df):\n",
"    \"\"\"Analyze relationships between missing data and other variables\"\"\"\n",
"    print(\"=== MISSING DATA RELATIONSHIPS ===\")\n",
"    \n",
"    # Example: Relationship between age and missing education\n",
"    if 'age' in df.columns and 'education_years' in df.columns:\n",
"        print(\"\\n--- Age vs Missing Education ---\")\n",
"        education_missing = df['education_years'].isnull()\n",
"        age_stats = df.groupby(education_missing)['age'].agg(['mean', 'median', 'std']).round(2)\n",
"        age_stats.index = ['Education Present', 'Education Missing']\n",
"        print(age_stats)\n",
"    \n",
"    # Example: Missing satisfaction by purchase amount\n",
"    if 'satisfaction_score' in df.columns and 'purchase_amount' in df.columns:\n",
"        print(\"\\n--- Purchase Amount vs Missing Satisfaction ---\")\n",
"        satisfaction_missing = df['satisfaction_score'].isnull()\n",
"        purchase_stats = df.groupby(satisfaction_missing)['purchase_amount'].agg(['mean', 'median', 'count']).round(2)\n",
"        purchase_stats.index = ['Satisfaction Present', 'Satisfaction Missing']\n",
"        print(purchase_stats)\n",
"    \n",
"    # Missing data by categorical variables\n",
"    if 'region' in df.columns:\n",
"        print(\"\\n--- Missing Data by Region ---\")\n",
"        region_missing = df.groupby('region').apply(lambda x: x.isnull().sum())\n",
"        print(region_missing[region_missing.sum(axis=1) > 0])\n",
"\n",
"# Analyze relationships\n",
"analyze_missing_relationships(df_missing)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Basic Missing Data Handling\n",
"\n",
"Fundamental techniques for dealing with missing values."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Method 1: Dropping missing values\n",
"print(\"=== DROPPING MISSING VALUES ===\")\n",
"\n",
"# Drop rows with any missing values\n",
"df_drop_any = df_missing.dropna()\n",
"print(f\"Original shape: {df_missing.shape}\")\n",
"print(f\"After dropping any missing: {df_drop_any.shape}\")\n",
"print(f\"Rows removed: {len(df_missing) - len(df_drop_any)} ({(len(df_missing) - len(df_drop_any))/len(df_missing)*100:.1f}%)\")\n",
"\n",
"# Drop rows with missing values in specific columns\n",
"critical_columns = ['customer_id', 'age', 'region']\n",
"df_drop_critical = df_missing.dropna(subset=critical_columns)\n",
"print(f\"\\nAfter dropping rows missing critical columns: {df_drop_critical.shape}\")\n",
"\n",
"# Drop rows with more than X missing values\n",
"df_drop_thresh = df_missing.dropna(thresh=len(df_missing.columns) - 2) # Allow max 2 missing\n",
"print(f\"After dropping rows with >2 missing values: {df_drop_thresh.shape}\")\n",
"\n",
"# Drop columns with too many missing values\n",
"missing_threshold = 0.5 # 50%\n",
"cols_to_keep = df_missing.columns[df_missing.isnull().mean() < missing_threshold]\n",
"df_drop_cols = df_missing[cols_to_keep]\n",
"print(f\"\\nAfter dropping columns with >{missing_threshold*100}% missing: {df_drop_cols.shape}\")\n",
"print(f\"Columns dropped: {set(df_missing.columns) - set(cols_to_keep)}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Method 2: Basic imputation with fillna()\n",
"print(\"=== BASIC IMPUTATION ===\")\n",
"\n",
"df_basic_impute = df_missing.copy()\n",
"\n",
"# Fill with specific values\n",
"df_basic_impute['satisfaction_score'] = df_basic_impute['satisfaction_score'].fillna(3) # Neutral score\n",
"print(\"Filled satisfaction_score with 3 (neutral)\")\n",
"\n",
"# Fill with statistical measures\n",
"df_basic_impute['income'] = df_basic_impute['income'].fillna(df_basic_impute['income'].median())\n",
"df_basic_impute['education_years'] = df_basic_impute['education_years'].fillna(df_basic_impute['education_years'].mean())\n",
"df_basic_impute['purchase_amount'] = df_basic_impute['purchase_amount'].fillna(df_basic_impute['purchase_amount'].mean())\n",
"print(\"Filled numerical columns with mean/median\")\n",
"\n",
"# Backward fill for dates (.bfill(); fillna(method=...) is deprecated, use .ffill() to fill forward)\n",
"df_basic_impute['last_purchase_date'] = df_basic_impute['last_purchase_date'].bfill()\n",
"print(\"Filled dates with backward fill\")\n",
"\n",
"print(\"\\nMissing values after basic imputation:\")\n",
"print(df_basic_impute.isnull().sum().sum())\n",
"\n",
"# Show before/after comparison\n",
"print(\"\\nBefore/after missing counts:\")\n",
"comparison_cols = ['income', 'education_years', 'purchase_amount', 'satisfaction_score']\n",
"for col in comparison_cols:\n",
"    before_missing = df_missing[col].isnull().sum()\n",
"    after_missing = df_basic_impute[col].isnull().sum()\n",
"    print(f\"{col}: {before_missing} → {after_missing} missing values\")"
]
},
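{
"cell_type": "markdown",
"metadata": {},
"source": [
"For ordered or time-indexed data, interpolation is a gentler alternative to a constant fill: each gap is estimated from its neighbours rather than from a global statistic. The cell below is a small illustrative sketch (not part of the fills above); `df_interp` is a throwaway name, and sorting by `signup_date` simply gives the rows a meaningful order."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: interpolation for ordered data (reuses df_missing from above)\n",
"df_interp = df_missing.sort_values('signup_date').copy()\n",
"\n",
"# Linear interpolation estimates each gap from its neighbours in signup order\n",
"df_interp['income'] = df_interp['income'].interpolate(method='linear')\n",
"\n",
"# limit_direction='both' also fills gaps at the start and end of the series\n",
"df_interp['purchase_amount'] = df_interp['purchase_amount'].interpolate(method='linear', limit_direction='both')\n",
"\n",
"print(\"Missing values after interpolation:\")\n",
"print(df_interp[['income', 'purchase_amount']].isnull().sum())"
]
},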
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Advanced Imputation Techniques\n",
"\n",
"Sophisticated methods for handling missing data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Group-based imputation\n",
"def group_based_imputation(df):\n",
"    \"\"\"Impute missing values based on group statistics\"\"\"\n",
"    df_group_impute = df.copy()\n",
"    \n",
"    print(\"=== GROUP-BASED IMPUTATION ===\")\n",
"    \n",
"    # Impute income based on region and education level\n",
"    # First, create education level categories\n",
"    df_group_impute['education_level'] = pd.cut(\n",
"        df_group_impute['education_years'].fillna(df_group_impute['education_years'].median()),\n",
"        bins=[0, 12, 16, 20],\n",
"        labels=['High School', 'Bachelor', 'Advanced']\n",
"    )\n",
"    \n",
"    # Calculate group-based statistics\n",
"    income_by_group = df_group_impute.groupby(['region', 'education_level'])['income'].median()\n",
"    \n",
"    # Fill missing income values\n",
"    def fill_income(row):\n",
"        if pd.isna(row['income']):\n",
"            try:\n",
"                return income_by_group.loc[(row['region'], row['education_level'])]\n",
"            except KeyError:\n",
"                return df_group_impute['income'].median()\n",
"        return row['income']\n",
"    \n",
"    df_group_impute['income'] = df_group_impute.apply(fill_income, axis=1)\n",
"    \n",
"    print(\"Income imputed based on region and education level\")\n",
"    print(\"Group-based median income:\")\n",
"    print(income_by_group.round(0))\n",
"    \n",
"    return df_group_impute\n",
"\n",
"# Apply group-based imputation\n",
"df_group_imputed = group_based_imputation(df_missing)"
]
},
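{
"cell_type": "markdown",
"metadata": {},
"source": [
"The takeaways below also mention KNN and MICE-style imputation. The lesson does not implement them, so here is a minimal sketch using scikit-learn (assumed installed, not imported elsewhere in this notebook); `df_knn_imputed` and `df_iter_imputed` are illustrative names."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: model-based imputation with scikit-learn (assumed installed)\n",
"from sklearn.experimental import enable_iterative_imputer  # noqa: F401\n",
"from sklearn.impute import KNNImputer, IterativeImputer\n",
"\n",
"numeric_cols = ['age', 'income', 'education_years', 'purchase_amount', 'satisfaction_score']\n",
"\n",
"# KNN: each missing value is averaged from the k most similar rows\n",
"df_knn_imputed = df_missing.copy()\n",
"df_knn_imputed[numeric_cols] = KNNImputer(n_neighbors=5).fit_transform(df_missing[numeric_cols])\n",
"\n",
"# Iterative (MICE-style): each column is regressed on the others over several rounds\n",
"df_iter_imputed = df_missing.copy()\n",
"df_iter_imputed[numeric_cols] = IterativeImputer(random_state=42).fit_transform(df_missing[numeric_cols])\n",
"\n",
"print(f\"Missing after KNN: {df_knn_imputed[numeric_cols].isnull().sum().sum()}\")\n",
"print(f\"Missing after iterative: {df_iter_imputed[numeric_cols].isnull().sum().sum()}\")"
]
},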
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Comparison of Imputation Methods\n",
"\n",
"Compare different imputation approaches and their impact."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def compare_imputation_methods(original_complete, original_missing, *imputed_dfs, methods_names):\n",
"    \"\"\"Compare different imputation methods\"\"\"\n",
"    print(\"=== IMPUTATION METHODS COMPARISON ===\")\n",
"    \n",
"    # Focus on a specific column for comparison\n",
"    column = 'income'\n",
"    \n",
"    if column not in original_complete.columns:\n",
"        print(f\"Column {column} not found\")\n",
"        return\n",
"    \n",
"    # Get original values that were made missing\n",
"    missing_mask = original_missing[column].isnull()\n",
"    true_values = original_complete.loc[missing_mask, column]\n",
"    \n",
"    print(f\"Comparing imputation for '{column}' column\")\n",
"    print(f\"Number of missing values: {len(true_values)}\")\n",
"    \n",
"    # Calculate errors for each method\n",
"    results = {}\n",
"    \n",
"    for df_imputed, method_name in zip(imputed_dfs, methods_names):\n",
"        if column in df_imputed.columns:\n",
"            imputed_values = df_imputed.loc[missing_mask, column]\n",
"            \n",
"            # Calculate metrics\n",
"            mae = np.mean(np.abs(true_values - imputed_values))\n",
"            rmse = np.sqrt(np.mean((true_values - imputed_values) ** 2))\n",
"            bias = np.mean(imputed_values - true_values)\n",
"            \n",
"            results[method_name] = {\n",
"                'MAE': mae,\n",
"                'RMSE': rmse,\n",
"                'Bias': bias,\n",
"                'Mean_Imputed': np.mean(imputed_values),\n",
"                'Std_Imputed': np.std(imputed_values)\n",
"            }\n",
"    \n",
"    # True statistics\n",
"    print(\"\\nTrue statistics for missing values:\")\n",
"    print(f\"Mean: {np.mean(true_values):.2f}\")\n",
"    print(f\"Std: {np.std(true_values):.2f}\")\n",
"    \n",
"    # Results comparison\n",
"    results_df = pd.DataFrame(results).T\n",
"    print(\"\\nImputation comparison results:\")\n",
"    print(results_df.round(2))\n",
"    \n",
"    # Visualize comparison\n",
"    fig, axes = plt.subplots(2, 2, figsize=(15, 10))\n",
"    \n",
"    # Distribution comparison\n",
"    axes[0, 0].hist(true_values, alpha=0.7, label='True Values', bins=20)\n",
"    for df_imputed, method_name in zip(imputed_dfs, methods_names):\n",
"        if column in df_imputed.columns:\n",
"            imputed_values = df_imputed.loc[missing_mask, column]\n",
"            axes[0, 0].hist(imputed_values, alpha=0.7, label=method_name, bins=20)\n",
"    axes[0, 0].set_title('Distribution Comparison')\n",
"    axes[0, 0].legend()\n",
"    \n",
"    # Error metrics (MAE)\n",
"    mae_values = [results[method]['MAE'] for method in results]\n",
"    axes[0, 1].bar(range(len(mae_values)), mae_values, alpha=0.7)\n",
"    axes[0, 1].set_xticks(range(len(results)))\n",
"    axes[0, 1].set_xticklabels(list(results.keys()), rotation=45)\n",
"    axes[0, 1].set_title('MAE Comparison')\n",
"    \n",
"    # Scatter plot: True vs Imputed\n",
"    for i, (df_imputed, method_name) in enumerate(zip(imputed_dfs[:2], methods_names[:2])):\n",
"        if column in df_imputed.columns:\n",
"            imputed_values = df_imputed.loc[missing_mask, column]\n",
"            ax = axes[1, i]\n",
"            ax.scatter(true_values, imputed_values, alpha=0.6)\n",
"            ax.plot([true_values.min(), true_values.max()],\n",
"                    [true_values.min(), true_values.max()], 'r--', label='Perfect Prediction')\n",
"            ax.set_xlabel('True Values')\n",
"            ax.set_ylabel('Imputed Values')\n",
"            ax.set_title(f'{method_name}: True vs Imputed')\n",
"            ax.legend()\n",
"    \n",
"    plt.tight_layout()\n",
"    plt.show()\n",
"    \n",
"    return results_df\n",
"\n",
"# Compare the two imputation approaches built above\n",
"comparison_results = compare_imputation_methods(\n",
"    df_complete,\n",
"    df_missing,\n",
"    df_basic_impute,\n",
"    df_group_imputed,\n",
"    methods_names=['Basic Fill', 'Group-based']\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. Domain-Specific Imputation Strategies\n",
"\n",
"Business logic-driven approaches to missing data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def business_logic_imputation(df):\n",
"    \"\"\"Apply business logic for missing value imputation\"\"\"\n",
"    print(\"=== BUSINESS LOGIC IMPUTATION ===\")\n",
"    \n",
"    df_business = df.copy()\n",
"    \n",
"    # 1. Income imputation based on age and education\n",
"    def estimate_income(row):\n",
"        if pd.notna(row['income']):\n",
"            return row['income']\n",
"        \n",
"        # Base income estimation\n",
"        base_income = 30000\n",
"        \n",
"        # Age factor (experience premium)\n",
"        if pd.notna(row['age']):\n",
"            if row['age'] > 40:\n",
"                base_income *= 1.5\n",
"            elif row['age'] > 30:\n",
"                base_income *= 1.2\n",
"        \n",
"        # Education factor\n",
"        if pd.notna(row['education_years']):\n",
"            if row['education_years'] > 16: # Graduate degree\n",
"                base_income *= 1.8\n",
"            elif row['education_years'] > 12: # Bachelor's\n",
"                base_income *= 1.4\n",
"        \n",
"        # Regional adjustment\n",
"        regional_multipliers = {\n",
"            'North': 1.2, # Higher cost of living\n",
"            'South': 0.9,\n",
"            'East': 1.1,\n",
"            'West': 1.0\n",
"        }\n",
"        base_income *= regional_multipliers.get(row['region'], 1.0)\n",
"        \n",
"        return base_income\n",
"    \n",
"    # Apply income estimation\n",
"    df_business['income'] = df_business.apply(estimate_income, axis=1)\n",
"    \n",
"    # 2. Satisfaction score based on purchase behavior\n",
"    def estimate_satisfaction(row):\n",
"        if pd.notna(row['satisfaction_score']):\n",
"            return row['satisfaction_score']\n",
"        \n",
"        # Base satisfaction\n",
"        base_satisfaction = 3 # Neutral\n",
"        \n",
"        # Purchase amount influence\n",
"        if pd.notna(row['purchase_amount']):\n",
"            if row['purchase_amount'] > 250: # High value purchase\n",
"                base_satisfaction = 4\n",
"            elif row['purchase_amount'] < 100: # Low value might indicate dissatisfaction\n",
"                base_satisfaction = 2\n",
"        \n",
"        return base_satisfaction\n",
"    \n",
"    # Apply satisfaction estimation\n",
"    df_business['satisfaction_score'] = df_business.apply(estimate_satisfaction, axis=1)\n",
"    \n",
"    # 3. Education years based on income and age\n",
"    def estimate_education(row):\n",
"        if pd.notna(row['education_years']):\n",
"            return row['education_years']\n",
"        \n",
"        # Base education\n",
"        base_education = 12 # High school\n",
"        \n",
"        # Income-based estimation\n",
"        if pd.notna(row['income']):\n",
"            if row['income'] > 70000:\n",
"                base_education = 18 # Graduate level\n",
"            elif row['income'] > 45000:\n",
"                base_education = 16 # Bachelor's\n",
"            elif row['income'] > 35000:\n",
"                base_education = 14 # Some college\n",
"        \n",
"        # Age adjustment (older people might have different education patterns)\n",
"        if pd.notna(row['age']) and row['age'] > 55:\n",
"            base_education = max(12, base_education - 2) # Lower average for older generation\n",
"        \n",
"        return base_education\n",
"    \n",
"    # Apply education estimation\n",
"    df_business['education_years'] = df_business.apply(estimate_education, axis=1)\n",
"    \n",
"    print(\"Business logic imputation completed\")\n",
"    print(f\"Missing values remaining: {df_business.isnull().sum().sum()}\")\n",
"    \n",
"    return df_business\n",
"\n",
"# Apply business logic imputation\n",
"df_business_imputed = business_logic_imputation(df_missing)\n",
"\n",
"print(\"\\nBusiness logic imputation summary:\")\n",
"for col in ['income', 'satisfaction_score', 'education_years']:\n",
"    before = df_missing[col].isnull().sum()\n",
"    after = df_business_imputed[col].isnull().sum()\n",
"    print(f\"{col}: {before} → {after} missing values\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6. Missing Data Flags and Indicators\n",
"\n",
"Track which values were imputed for transparency and analysis."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def create_missing_indicators(df_original, df_imputed):\n",
"    \"\"\"Create indicator variables for missing data\"\"\"\n",
"    print(\"=== CREATING MISSING DATA INDICATORS ===\")\n",
"    \n",
"    df_with_indicators = df_imputed.copy()\n",
"    \n",
"    # Create indicator columns for each column that had missing data\n",
"    columns_with_missing = df_original.columns[df_original.isnull().any()].tolist()\n",
"    \n",
"    for col in columns_with_missing:\n",
"        indicator_col = f'{col}_was_missing'\n",
"        df_with_indicators[indicator_col] = df_original[col].isnull().astype(int)\n",
"    \n",
"    print(f\"Created {len(columns_with_missing)} missing data indicators\")\n",
"    print(f\"Indicator columns: {[f'{col}_was_missing' for col in columns_with_missing]}\")\n",
"    \n",
"    # Summary of missing patterns\n",
"    indicator_cols = [f'{col}_was_missing' for col in columns_with_missing]\n",
"    missing_patterns = df_with_indicators[indicator_cols].sum()\n",
"    \n",
"    print(\"\\nMissing data summary by column:\")\n",
"    for col, count in missing_patterns.items():\n",
"        original_col = col.replace('_was_missing', '')\n",
"        percentage = (count / len(df_with_indicators)) * 100\n",
"        print(f\"{original_col}: {count} values imputed ({percentage:.1f}%)\")\n",
"    \n",
"    # Create composite missing indicator\n",
"    df_with_indicators['total_missing_count'] = df_with_indicators[indicator_cols].sum(axis=1)\n",
"    df_with_indicators['has_any_missing'] = (df_with_indicators['total_missing_count'] > 0).astype(int)\n",
"    \n",
"    return df_with_indicators, indicator_cols\n",
"\n",
"# Create missing indicators\n",
"df_with_indicators, indicator_columns = create_missing_indicators(df_missing, df_business_imputed)\n",
"\n",
"print(\"\\nDataset with missing indicators:\")\n",
"sample_cols = ['income', 'income_was_missing', 'education_years', 'education_years_was_missing',\n",
"               'satisfaction_score', 'satisfaction_score_was_missing', 'total_missing_count']\n",
"available_cols = [col for col in sample_cols if col in df_with_indicators.columns]\n",
"print(df_with_indicators[available_cols].head(10))"
]
},
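{
"cell_type": "markdown",
"metadata": {},
"source": [
"One practical use of these indicators: test whether missingness is related to an observed variable, which is evidence against MCAR. The cell below is a sketch using a Welch t-test from SciPy (assumed installed) on the age/education relationship that was deliberately built into this dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: probing MCAR vs MAR with an indicator column (assumes scipy is installed)\n",
"from scipy import stats as scipy_stats\n",
"\n",
"# If education were MCAR, age should look the same whether or not education is missing\n",
"flag = df_with_indicators['education_years_was_missing'] == 1\n",
"age_missing = df_with_indicators.loc[flag, 'age']\n",
"age_present = df_with_indicators.loc[~flag, 'age']\n",
"\n",
"t_stat, p_value = scipy_stats.ttest_ind(age_missing, age_present, equal_var=False)\n",
"print(f\"Welch t-test on age by missing-education flag: t={t_stat:.2f}, p={p_value:.4f}\")\n",
"print(\"A small p-value suggests missingness depends on age (evidence against MCAR)\")"
]
},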
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 7. Validation and Quality Assessment\n",
"\n",
"Validate the quality of imputation results."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def validate_imputation_quality(df_original, df_missing, df_imputed):\n",
"    \"\"\"Validate the quality of imputation\"\"\"\n",
"    print(\"=== IMPUTATION QUALITY VALIDATION ===\")\n",
"    \n",
"    validation_results = {}\n",
"    \n",
"    # Check each column that had missing data\n",
"    for col in df_missing.columns:\n",
"        if df_missing[col].isnull().any() and col in df_imputed.columns:\n",
"            print(f\"\\n--- Validating {col} ---\")\n",
"            \n",
"            # Get missing mask\n",
"            missing_mask = df_missing[col].isnull()\n",
"            \n",
"            # Original statistics (complete data)\n",
"            original_stats = df_original[col].describe()\n",
"            \n",
"            # Imputed statistics (only imputed values)\n",
"            if missing_mask.any():\n",
"                imputed_values = df_imputed.loc[missing_mask, col]\n",
"                \n",
"                if pd.api.types.is_numeric_dtype(df_original[col]):\n",
"                    imputed_stats = imputed_values.describe()\n",
"                    \n",
"                    # Statistical tests\n",
"                    mean_diff = abs(original_stats['mean'] - imputed_stats['mean'])\n",
"                    std_diff = abs(original_stats['std'] - imputed_stats['std'])\n",
"                    \n",
"                    validation_results[col] = {\n",
"                        'original_mean': original_stats['mean'],\n",
"                        'imputed_mean': imputed_stats['mean'],\n",
"                        'mean_difference': mean_diff,\n",
"                        'original_std': original_stats['std'],\n",
"                        'imputed_std': imputed_stats['std'],\n",
"                        'std_difference': std_diff,\n",
"                        'values_imputed': len(imputed_values)\n",
"                    }\n",
"                    \n",
"                    print(f\"Original mean: {original_stats['mean']:.2f}, Imputed mean: {imputed_stats['mean']:.2f}\")\n",
"                    print(f\"Mean difference: {mean_diff:.2f} ({mean_diff/original_stats['mean']*100:.1f}%)\")\n",
"                    print(f\"Original std: {original_stats['std']:.2f}, Imputed std: {imputed_stats['std']:.2f}\")\n",
"                    \n",
"                else:\n",
"                    # Categorical data\n",
"                    original_dist = df_original[col].value_counts(normalize=True)\n",
"                    imputed_dist = imputed_values.value_counts(normalize=True)\n",
"                    print(f\"Original distribution: {original_dist.to_dict()}\")\n",
"                    print(f\"Imputed distribution: {imputed_dist.to_dict()}\")\n",
"    \n",
"    # Overall validation summary\n",
"    if validation_results:\n",
"        validation_df = pd.DataFrame(validation_results).T\n",
"        print(\"\\n=== VALIDATION SUMMARY ===\")\n",
"        print(validation_df.round(3))\n",
"        \n",
"        # Flag potential issues\n",
"        print(\"\\n--- Potential Issues ---\")\n",
"        for col, stats in validation_results.items():\n",
"            mean_change = abs(stats['mean_difference'] / stats['original_mean']) * 100\n",
"            if mean_change > 10: # More than 10% change in mean\n",
"                print(f\"⚠️ {col}: Large mean change ({mean_change:.1f}%)\")\n",
"            \n",
"            std_change = abs(stats['std_difference'] / stats['original_std']) * 100\n",
"            if std_change > 20: # More than 20% change in std\n",
"                print(f\"⚠️ {col}: Large variance change ({std_change:.1f}%)\")\n",
"    \n",
"    return validation_results\n",
"\n",
"# Validate imputation quality\n",
"validation_results = validate_imputation_quality(df_complete, df_missing, df_business_imputed)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Practice Exercises\n",
"\n",
"Apply missing data handling techniques to challenging scenarios:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Exercise 1: Multi-step imputation strategy\n",
"# Create a sophisticated imputation pipeline that:\n",
"# 1. Handles different types of missing data appropriately\n",
"# 2. Uses multiple imputation methods in sequence\n",
"# 3. Validates results at each step\n",
"# 4. Creates comprehensive documentation\n",
"\n",
"def comprehensive_imputation_pipeline(df):\n",
"    \"\"\"Comprehensive missing data handling pipeline\"\"\"\n",
"    # Your implementation here\n",
"    pass\n",
"\n",
"# result_df = comprehensive_imputation_pipeline(df_missing)\n",
"# print(\"Comprehensive pipeline results:\")\n",
"# print(result_df.isnull().sum())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Exercise 2: Missing data pattern analysis\n",
"# Analyze whether missing data follows specific patterns:\n",
"# - Time-based patterns\n",
"# - User behavior patterns\n",
"# - System/technical patterns\n",
"# Create insights and recommendations\n",
"\n",
"# Your code here:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Exercise 3: Impact assessment\n",
"# Assess how different missing data handling approaches\n",
"# affect downstream analysis:\n",
"# - Statistical analysis results\n",
"# - Machine learning model performance\n",
"# - Business insights and decisions\n",
"\n",
"# Your code here:\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Key Takeaways\n",
"\n",
"1. **Understanding Missing Data Types**:\n",
"   - **MCAR**: Missing Completely at Random\n",
"   - **MAR**: Missing at Random (depends on observed data)\n",
"   - **MNAR**: Missing Not at Random (depends on unobserved data)\n",
"\n",
"2. **Detection and Analysis**:\n",
"   - Always analyze missing patterns before imputation\n",
"   - Use visualizations to understand missing data structure\n",
"   - Look for relationships between missing values and other variables\n",
"\n",
"3. **Handling Strategies**:\n",
"   - **Deletion**: Simple but can lose valuable information\n",
"   - **Simple Imputation**: Fast but may not preserve relationships\n",
"   - **Advanced Methods**: KNN, MICE preserve more complex relationships\n",
"   - **Business Logic**: Domain knowledge often provides best results\n",
"\n",
"4. **Best Practices**:\n",
"   - Create missing data indicators for transparency\n",
"   - Validate imputation quality against original data when possible\n",
"   - Consider the impact on downstream analysis\n",
"   - Document all imputation decisions and methods\n",
"\n",
"## Method Selection Guide\n",
"\n",
"| Scenario | Recommended Method | Rationale |\n",
"|----------|-------------------|----------|\n",
"| < 5% missing, MCAR | Simple imputation | Low impact, efficiency |\n",
"| 5-20% missing, MAR | KNN or Group-based | Preserve relationships |\n",
"| > 20% missing, complex patterns | MICE or Multiple imputation | Handle complex dependencies |\n",
"| Business-critical decisions | Domain knowledge + validation | Accuracy and explainability |\n",
"| Machine learning features | Advanced methods + indicators | Preserve predictive power |\n",
"\n",
"## Common Pitfalls to Avoid\n",
"\n",
"1. **Data Leakage**: Don't use future information to impute past values\n",
"2. **Ignoring Patterns**: Missing data often has meaningful patterns\n",
"3. **Over-imputation**: Sometimes missing data is informative itself\n",
"4. **One-size-fits-all**: Different columns may need different strategies\n",
"5. **No Validation**: Always check if imputation preserved data characteristics"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.13.3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}