{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Session 1 - DataFrames - Lesson 6: Handling Missing Data\n", "\n", "## Learning Objectives\n", "- Understand different types of missing data and their implications\n", "- Master techniques for detecting and analyzing missing values\n", "- Learn various strategies for handling missing data\n", "- Practice imputation methods and their trade-offs\n", "- Develop best practices for missing data management\n", "\n", "## Prerequisites\n", "- Completed Lessons 1-5\n", "- Understanding of basic statistical concepts\n", "- Familiarity with data quality principles" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Import required libraries\n", "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "from datetime import datetime, timedelta\n", "import warnings\n", "warnings.filterwarnings('ignore')\n", "\n", "# Set display options\n", "pd.set_option('display.max_columns', None)\n", "plt.style.use('seaborn-v0_8')\n", "\n", "print(\"Libraries loaded successfully!\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creating Dataset with Missing Values\n", "\n", "Let's create a realistic dataset with different patterns of missing data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create comprehensive dataset with various missing data patterns\n", "np.random.seed(42)\n", "n_records = 500\n", "\n", "# Base data\n", "data = {\n", " 'customer_id': range(1, n_records + 1),\n", " 'age': np.random.normal(35, 12, n_records).astype(int),\n", " 'income': np.random.normal(50000, 15000, n_records),\n", " 'education_years': np.random.normal(14, 3, n_records),\n", " 'purchase_amount': np.random.normal(200, 50, n_records),\n", " 'satisfaction_score': np.random.randint(1, 6, n_records),\n", " 'region': np.random.choice(['North', 'South', 'East', 'West'], n_records),\n", " 'product_category': np.random.choice(['Electronics', 'Clothing', 'Books', 'Home'], n_records),\n", " 'signup_date': pd.date_range('2023-01-01', periods=n_records, freq='D'),\n", " 'last_purchase_date': pd.date_range('2023-01-01', periods=n_records, freq='D') + pd.Timedelta(days=30)\n", "}\n", "\n", "df_complete = pd.DataFrame(data)\n", "\n", "# Ensure positive values where appropriate\n", "df_complete['age'] = np.abs(df_complete['age'])\n", "df_complete['income'] = np.abs(df_complete['income'])\n", "df_complete['education_years'] = np.clip(df_complete['education_years'], 6, 20)\n", "df_complete['purchase_amount'] = np.abs(df_complete['purchase_amount'])\n", "\n", "print(\"Complete dataset created:\")\n", "print(f\"Shape: {df_complete.shape}\")\n", "print(\"\\nFirst few rows:\")\n", "print(df_complete.head())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Introduce different patterns of missing data\n", "df_missing = df_complete.copy()\n", "\n", "# 1. Missing Completely at Random (MCAR) - income data\n", "# Randomly missing 15% of income values\n", "mcar_indices = np.random.choice(df_missing.index, size=int(0.15 * len(df_missing)), replace=False)\n", "df_missing.loc[mcar_indices, 'income'] = np.nan\n", "\n", "# 2. 
Missing at Random (MAR) - education years missing based on age\n", "# Older people less likely to report education\n", "older_customers = df_missing['age'] > 60\n", "older_indices = df_missing[older_customers].index\n", "education_missing = np.random.choice(older_indices, size=int(0.4 * len(older_indices)), replace=False)\n", "df_missing.loc[education_missing, 'education_years'] = np.nan\n", "\n", "# 3. Missing Not at Random (MNAR) - satisfaction scores\n", "# Unsatisfied customers less likely to provide ratings\n", "low_satisfaction = df_missing['satisfaction_score'] <= 2\n", "low_sat_indices = df_missing[low_satisfaction].index\n", "satisfaction_missing = np.random.choice(low_sat_indices, size=int(0.6 * len(low_sat_indices)), replace=False)\n", "df_missing.loc[satisfaction_missing, 'satisfaction_score'] = np.nan\n", "\n", "# 4. Systematic missing - last purchase date for new customers\n", "# New customers (signed up recently) haven't made purchases yet\n", "recent_signups = df_missing['signup_date'] > '2023-11-01'\n", "df_missing.loc[recent_signups, 'last_purchase_date'] = pd.NaT\n", "\n", "# 5. Random missing in other columns\n", "# Purchase amount - 10% missing\n", "purchase_missing = np.random.choice(df_missing.index, size=int(0.10 * len(df_missing)), replace=False)\n", "df_missing.loc[purchase_missing, 'purchase_amount'] = np.nan\n", "\n", "print(\"Missing data patterns introduced:\")\n", "print(f\"Dataset shape: {df_missing.shape}\")\n", "print(\"\\nMissing value counts:\")\n", "missing_summary = df_missing.isnull().sum()\n", "missing_summary = missing_summary[missing_summary > 0]\n", "print(missing_summary)\n", "\n", "print(\"\\nMissing value percentages:\")\n", "missing_pct = (df_missing.isnull().sum() / len(df_missing) * 100).round(2)\n", "missing_pct = missing_pct[missing_pct > 0]\n", "print(missing_pct)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Detecting and Analyzing Missing Data\n", "\n", "Comprehensive techniques for understanding missing data patterns." 
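, "\n", "\n", "Before the full analysis below, here is a minimal sketch of the detection primitives this section builds on (`isna`/`isnull`, `notna`, and their aggregations). It assumes only the `df_missing` frame created above." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Minimal sketch: the core missing-value detection primitives\n", "cell_mask = df_missing.isna()              # boolean DataFrame, True where a value is missing\n", "income_mask = df_missing['income'].isna()  # boolean Series for a single column\n", "\n", "print(\"Missing count per column:\")\n", "print(cell_mask.sum())\n", "\n", "print(f\"\\nRows missing income: {income_mask.sum()}\")\n", "print(f\"Fully complete rows: {df_missing.notna().all(axis=1).sum()}\")"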
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def analyze_missing_data(df):\n", " \"\"\"Comprehensive missing data analysis\"\"\"\n", " print(\"=== MISSING DATA ANALYSIS ===\")\n", " \n", " # Basic missing data statistics\n", " total_cells = df.size\n", " total_missing = df.isnull().sum().sum()\n", " print(f\"Total cells: {total_cells:,}\")\n", " print(f\"Missing cells: {total_missing:,} ({total_missing/total_cells*100:.2f}%)\")\n", " \n", " # Missing data by column\n", " missing_by_column = pd.DataFrame({\n", " 'Missing_Count': df.isnull().sum(),\n", " 'Missing_Percentage': (df.isnull().sum() / len(df)) * 100,\n", " 'Data_Type': df.dtypes\n", " })\n", " missing_by_column = missing_by_column[missing_by_column['Missing_Count'] > 0]\n", " missing_by_column = missing_by_column.sort_values('Missing_Percentage', ascending=False)\n", " \n", " print(\"\\n--- Missing Data by Column ---\")\n", " print(missing_by_column.round(2))\n", " \n", " # Missing data patterns\n", " print(\"\\n--- Missing Data Patterns ---\")\n", " missing_patterns = df.isnull().value_counts().head(10)\n", " print(\"Top 10 missing patterns (True = Missing):\")\n", " for pattern, count in missing_patterns.items():\n", " percentage = (count / len(df)) * 100\n", " print(f\"{count:4d} rows ({percentage:5.1f}%): {dict(zip(df.columns, pattern))}\")\n", " \n", " return missing_by_column\n", "\n", "# Analyze missing data\n", "missing_analysis = analyze_missing_data(df_missing)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Visualize missing data patterns\n", "def visualize_missing_data(df):\n", " \"\"\"Create visualizations for missing data patterns\"\"\"\n", " fig, axes = plt.subplots(2, 2, figsize=(15, 10))\n", " \n", " # 1. Missing data heatmap\n", " missing_mask = df.isnull()\n", " sns.heatmap(missing_mask.iloc[:100], \n", " yticklabels=False, \n", " cbar=True, \n", " cmap='viridis',\n", " ax=axes[0, 0])\n", " axes[0, 0].set_title('Missing Data Heatmap (First 100 rows)')\n", " \n", " # 2. Missing data by column\n", " missing_counts = df.isnull().sum()\n", " missing_counts = missing_counts[missing_counts > 0]\n", " missing_counts.plot(kind='bar', ax=axes[0, 1], color='skyblue')\n", " axes[0, 1].set_title('Missing Values by Column')\n", " axes[0, 1].set_ylabel('Count')\n", " axes[0, 1].tick_params(axis='x', rotation=45)\n", " \n", " # 3. Missing data correlation\n", " missing_corr = df.isnull().corr()\n", " sns.heatmap(missing_corr, annot=True, cmap='coolwarm', center=0, ax=axes[1, 0])\n", " axes[1, 0].set_title('Missing Data Correlation')\n", " \n", " # 4. 
Missing data by row\n", " missing_per_row = df.isnull().sum(axis=1)\n", " missing_per_row.hist(bins=range(len(df.columns) + 2), ax=axes[1, 1], alpha=0.7, color='orange')\n", " axes[1, 1].set_title('Distribution of Missing Values per Row')\n", " axes[1, 1].set_xlabel('Number of Missing Values')\n", " axes[1, 1].set_ylabel('Number of Rows')\n", " \n", " plt.tight_layout()\n", " plt.show()\n", "\n", "# Visualize missing patterns\n", "visualize_missing_data(df_missing)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Analyze missing data relationships\n", "def analyze_missing_relationships(df):\n", " \"\"\"Analyze relationships between missing data and other variables\"\"\"\n", " print(\"=== MISSING DATA RELATIONSHIPS ===\")\n", " \n", " # Example: Relationship between age and missing education\n", " if 'age' in df.columns and 'education_years' in df.columns:\n", " print(\"\\n--- Age vs Missing Education ---\")\n", " education_missing = df['education_years'].isnull()\n", " age_stats = df.groupby(education_missing)['age'].agg(['mean', 'median', 'std']).round(2)\n", " age_stats.index = ['Education Present', 'Education Missing']\n", " print(age_stats)\n", " \n", " # Example: Missing satisfaction by purchase amount\n", " if 'satisfaction_score' in df.columns and 'purchase_amount' in df.columns:\n", " print(\"\\n--- Purchase Amount vs Missing Satisfaction ---\")\n", " satisfaction_missing = df['satisfaction_score'].isnull()\n", " purchase_stats = df.groupby(satisfaction_missing)['purchase_amount'].agg(['mean', 'median', 'count']).round(2)\n", " purchase_stats.index = ['Satisfaction Present', 'Satisfaction Missing']\n", " print(purchase_stats)\n", " \n", " # Missing data by categorical variables\n", " if 'region' in df.columns:\n", " print(\"\\n--- Missing Data by Region ---\")\n", " region_missing = df.groupby('region').apply(lambda x: x.isnull().sum())\n", " print(region_missing[region_missing.sum(axis=1) > 0])\n", "\n", "# Analyze relationships\n", "analyze_missing_relationships(df_missing)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Basic Missing Data Handling\n", "\n", "Fundamental techniques for dealing with missing values." 
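, "\n", "\n", "The workhorses here are `dropna()` and `fillna()`, demonstrated on `df_missing` in the next cells. For ordered numeric data, `interpolate()` is a third basic tool; the short hedged sketch below sorts by `signup_date` purely for illustration." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Hedged sketch: interpolation for ordered numeric data\n", "# interpolate() fills each gap from its ordered neighbours rather than from a\n", "# global statistic; leading gaps stay NaN with linear interpolation\n", "ordered = df_missing.sort_values('signup_date').copy()\n", "ordered['income_interp'] = ordered['income'].interpolate(method='linear')\n", "\n", "print(f\"Missing income before: {ordered['income'].isna().sum()}\")\n", "print(f\"Missing income after: {ordered['income_interp'].isna().sum()}\")"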
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Method 1: Dropping missing values\n", "print(\"=== DROPPING MISSING VALUES ===\")\n", "\n", "# Drop rows with any missing values\n", "df_drop_any = df_missing.dropna()\n", "print(f\"Original shape: {df_missing.shape}\")\n", "print(f\"After dropping any missing: {df_drop_any.shape}\")\n", "print(f\"Rows removed: {len(df_missing) - len(df_drop_any)} ({(len(df_missing) - len(df_drop_any))/len(df_missing)*100:.1f}%)\")\n", "\n", "# Drop rows with missing values in specific columns\n", "critical_columns = ['customer_id', 'age', 'region']\n", "df_drop_critical = df_missing.dropna(subset=critical_columns)\n", "print(f\"\\nAfter dropping rows missing critical columns: {df_drop_critical.shape}\")\n", "\n", "# Drop rows with more than 2 missing values\n", "# (thresh = minimum number of non-null values a row must have to be kept)\n", "df_drop_thresh = df_missing.dropna(thresh=len(df_missing.columns) - 2)\n", "print(f\"After dropping rows with >2 missing values: {df_drop_thresh.shape}\")\n", "\n", "# Drop columns with too many missing values\n", "missing_threshold = 0.5  # 50%\n", "cols_to_keep = df_missing.columns[df_missing.isnull().mean() < missing_threshold]\n", "df_drop_cols = df_missing[cols_to_keep]\n", "print(f\"\\nAfter dropping columns with >={missing_threshold:.0%} missing: {df_drop_cols.shape}\")\n", "print(f\"Columns dropped: {set(df_missing.columns) - set(cols_to_keep)}\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Method 2: Basic imputation with fillna()\n", "print(\"=== BASIC IMPUTATION ===\")\n", "\n", "df_basic_impute = df_missing.copy()\n", "\n", "# Fill with specific values\n", "df_basic_impute['satisfaction_score'] = df_basic_impute['satisfaction_score'].fillna(3)  # Neutral score\n", "print(\"Filled satisfaction_score with 3 (neutral)\")\n", "\n", "# Fill with statistical measures\n", "df_basic_impute['income'] = df_basic_impute['income'].fillna(df_basic_impute['income'].median())\n", "df_basic_impute['education_years'] = df_basic_impute['education_years'].fillna(df_basic_impute['education_years'].mean())\n", "df_basic_impute['purchase_amount'] = df_basic_impute['purchase_amount'].fillna(df_basic_impute['purchase_amount'].mean())\n", "print(\"Filled numerical columns with mean/median\")\n", "\n", "# Backward fill for dates (bfill()/ffill() replace the deprecated fillna(method=...))\n", "# Note: this invents purchase dates for customers who genuinely have none -\n", "# acceptable here purely for demonstration\n", "df_basic_impute['last_purchase_date'] = df_basic_impute['last_purchase_date'].bfill()\n", "print(\"Filled dates with backward fill\")\n", "\n", "print(\"\\nMissing values after basic imputation:\")\n", "print(df_basic_impute.isnull().sum().sum())\n", "\n", "# Show before/after comparison\n", "print(\"\\nBefore/after missing-value counts:\")\n", "comparison_cols = ['income', 'education_years', 'purchase_amount', 'satisfaction_score']\n", "for col in comparison_cols:\n", "    before_missing = df_missing[col].isnull().sum()\n", "    after_missing = df_basic_impute[col].isnull().sum()\n", "    print(f\"{col}: {before_missing} → {after_missing} missing values\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Advanced Imputation Techniques\n", "\n", "Sophisticated methods for handling missing data."
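, "\n", "\n", "Besides the group-based approach below, two widely used model-based options are k-nearest-neighbours imputation and iterative (MICE-style) imputation, both referenced in the takeaways and in the comparison that follows. The hedged sketch below uses scikit-learn's `KNNImputer` and `IterativeImputer` on the numeric columns only, assuming scikit-learn is installed." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Hedged sketch: model-based imputation (assumes scikit-learn is installed)\n", "from sklearn.experimental import enable_iterative_imputer  # noqa: F401 - enables IterativeImputer\n", "from sklearn.impute import KNNImputer, IterativeImputer\n", "\n", "numeric_cols = ['age', 'income', 'education_years', 'purchase_amount']\n", "X = df_missing[numeric_cols]\n", "\n", "# KNN: each missing value is filled from the k most similar rows\n", "knn_imputed = pd.DataFrame(KNNImputer(n_neighbors=5).fit_transform(X),\n", "                           columns=numeric_cols, index=X.index)\n", "\n", "# Iterative (MICE-style): each column is modelled from the others in rounds\n", "iter_imputed = pd.DataFrame(IterativeImputer(random_state=42).fit_transform(X),\n", "                            columns=numeric_cols, index=X.index)\n", "\n", "print(f\"Missing after KNN: {knn_imputed.isna().sum().sum()}\")\n", "print(f\"Missing after IterativeImputer: {iter_imputed.isna().sum().sum()}\")"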
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Group-based imputation\n", "def group_based_imputation(df):\n", " \"\"\"Impute missing values based on group statistics\"\"\"\n", " df_group_impute = df.copy()\n", " \n", " print(\"=== GROUP-BASED IMPUTATION ===\")\n", " \n", " # Impute income based on region and education level\n", " # First, create education level categories\n", " df_group_impute['education_level'] = pd.cut(\n", " df_group_impute['education_years'].fillna(df_group_impute['education_years'].median()),\n", " bins=[0, 12, 16, 20],\n", " labels=['High School', 'Bachelor', 'Advanced']\n", " )\n", " \n", " # Calculate group-based statistics\n", " income_by_group = df_group_impute.groupby(['region', 'education_level'])['income'].median()\n", " \n", " # Fill missing income values\n", " def fill_income(row):\n", " if pd.isna(row['income']):\n", " try:\n", " return income_by_group.loc[(row['region'], row['education_level'])]\n", " except KeyError:\n", " return df_group_impute['income'].median()\n", " return row['income']\n", " \n", " df_group_impute['income'] = df_group_impute.apply(fill_income, axis=1)\n", " \n", " print(\"Income imputed based on region and education level\")\n", " print(\"Group-based median income:\")\n", " print(income_by_group.round(0))\n", " \n", " return df_group_impute\n", "\n", "# Apply group-based imputation\n", "df_group_imputed = group_based_imputation(df_missing)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Comparison of Imputation Methods\n", "\n", "Compare different imputation approaches and their impact." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def compare_imputation_methods(original_complete, original_missing, *imputed_dfs, methods_names):\n", " \"\"\"Compare different imputation methods\"\"\"\n", " print(\"=== IMPUTATION METHODS COMPARISON ===\")\n", " \n", " # Focus on a specific column for comparison\n", " column = 'income'\n", " \n", " if column not in original_complete.columns:\n", " print(f\"Column {column} not found\")\n", " return\n", " \n", " # Get original values that were made missing\n", " missing_mask = original_missing[column].isnull()\n", " true_values = original_complete.loc[missing_mask, column]\n", " \n", " print(f\"Comparing imputation for '{column}' column\")\n", " print(f\"Number of missing values: {len(true_values)}\")\n", " \n", " # Calculate errors for each method\n", " results = {}\n", " \n", " for df_imputed, method_name in zip(imputed_dfs, methods_names):\n", " if column in df_imputed.columns:\n", " imputed_values = df_imputed.loc[missing_mask, column]\n", " \n", " # Calculate metrics\n", " mae = np.mean(np.abs(true_values - imputed_values))\n", " rmse = np.sqrt(np.mean((true_values - imputed_values) ** 2))\n", " bias = np.mean(imputed_values - true_values)\n", " \n", " results[method_name] = {\n", " 'MAE': mae,\n", " 'RMSE': rmse,\n", " 'Bias': bias,\n", " 'Mean_Imputed': np.mean(imputed_values),\n", " 'Std_Imputed': np.std(imputed_values)\n", " }\n", " \n", " # True statistics\n", " print(f\"\\nTrue statistics for missing values:\")\n", " print(f\"Mean: {np.mean(true_values):.2f}\")\n", " print(f\"Std: {np.std(true_values):.2f}\")\n", " \n", " # Results comparison\n", " results_df = pd.DataFrame(results).T\n", " print(f\"\\nImputation comparison results:\")\n", " print(results_df.round(2))\n", " \n", " # Visualize comparison\n", " fig, axes = plt.subplots(2, 2, figsize=(15, 10))\n", " \n", " # 
Distribution comparison\n", "    axes[0, 0].hist(true_values, alpha=0.7, label='True Values', bins=20)\n", "    for df_imputed, method_name in zip(imputed_dfs, methods_names):\n", "        if column in df_imputed.columns:\n", "            imputed_values = df_imputed.loc[missing_mask, column]\n", "            axes[0, 0].hist(imputed_values, alpha=0.7, label=f'{method_name}', bins=20)\n", "    axes[0, 0].set_title('Distribution Comparison')\n", "    axes[0, 0].legend()\n", "    \n", "    # Error metrics (MAE per method)\n", "    mae_values = [results[method]['MAE'] for method in results]\n", "    axes[0, 1].bar(range(len(mae_values)), mae_values, alpha=0.7, color='skyblue')\n", "    axes[0, 1].set_xticks(range(len(results)))\n", "    axes[0, 1].set_xticklabels(list(results.keys()), rotation=45)\n", "    axes[0, 1].set_title('MAE Comparison')\n", "    \n", "    # Scatter plots: true vs imputed (first two methods)\n", "    for i, (df_imputed, method_name) in enumerate(zip(imputed_dfs[:2], methods_names[:2])):\n", "        if column in df_imputed.columns:\n", "            imputed_values = df_imputed.loc[missing_mask, column]\n", "            ax = axes[1, i]\n", "            ax.scatter(true_values, imputed_values, alpha=0.6)\n", "            ax.plot([true_values.min(), true_values.max()],\n", "                    [true_values.min(), true_values.max()], 'r--', label='Perfect Prediction')\n", "            ax.set_xlabel('True Values')\n", "            ax.set_ylabel('Imputed Values')\n", "            ax.set_title(f'{method_name}: True vs Imputed')\n", "            ax.legend()\n", "    \n", "    plt.tight_layout()\n", "    plt.show()\n", "    \n", "    return results_df\n", "\n", "# Compare the methods built so far; one name per imputed frame\n", "# (the knn_imputed / iter_imputed frames from the section 3 sketch\n", "# could be appended here in the same way)\n", "comparison_results = compare_imputation_methods(\n", "    df_complete,\n", "    df_missing,\n", "    df_basic_impute,\n", "    df_group_imputed,\n", "    methods_names=['Basic Fill', 'Group-Based']\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. Domain-Specific Imputation Strategies\n", "\n", "Business logic-driven approaches to missing data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def business_logic_imputation(df):\n", "    \"\"\"Apply business logic for missing value imputation\"\"\"\n", "    print(\"=== BUSINESS LOGIC IMPUTATION ===\")\n", "    \n", "    df_business = df.copy()\n", "    \n", "    # 1. Income imputation based on age and education\n", "    def estimate_income(row):\n", "        if pd.notna(row['income']):\n", "            return row['income']\n", "        \n", "        # Base income estimation\n", "        base_income = 30000\n", "        \n", "        # Age factor (experience premium)\n", "        if pd.notna(row['age']):\n", "            if row['age'] > 40:\n", "                base_income *= 1.5\n", "            elif row['age'] > 30:\n", "                base_income *= 1.2\n", "        \n", "        # Education factor\n", "        if pd.notna(row['education_years']):\n", "            if row['education_years'] > 16:  # Graduate degree\n", "                base_income *= 1.8\n", "            elif row['education_years'] > 12:  # Bachelor's\n", "                base_income *= 1.4\n", "        \n", "        # Regional adjustment\n", "        regional_multipliers = {\n", "            'North': 1.2,  # Higher cost of living\n", "            'South': 0.9,\n", "            'East': 1.1,\n", "            'West': 1.0\n", "        }\n", "        base_income *= regional_multipliers.get(row['region'], 1.0)\n", "        \n", "        return base_income\n", "    \n", "    # Apply income estimation\n", "    df_business['income'] = df_business.apply(estimate_income, axis=1)\n", "    \n", "    # 2. 
Satisfaction score based on purchase behavior\n", " def estimate_satisfaction(row):\n", " if pd.notna(row['satisfaction_score']):\n", " return row['satisfaction_score']\n", " \n", " # Base satisfaction\n", " base_satisfaction = 3 # Neutral\n", " \n", " # Purchase amount influence\n", " if pd.notna(row['purchase_amount']):\n", " if row['purchase_amount'] > 250: # High value purchase\n", " base_satisfaction = 4\n", " elif row['purchase_amount'] < 100: # Low value might indicate dissatisfaction\n", " base_satisfaction = 2\n", " \n", " return base_satisfaction\n", " \n", " # Apply satisfaction estimation\n", " df_business['satisfaction_score'] = df_business.apply(estimate_satisfaction, axis=1)\n", " \n", " # 3. Education years based on income and age\n", " def estimate_education(row):\n", " if pd.notna(row['education_years']):\n", " return row['education_years']\n", " \n", " # Base education\n", " base_education = 12 # High school\n", " \n", " # Income-based estimation\n", " if pd.notna(row['income']):\n", " if row['income'] > 70000:\n", " base_education = 18 # Graduate level\n", " elif row['income'] > 45000:\n", " base_education = 16 # Bachelor's\n", " elif row['income'] > 35000:\n", " base_education = 14 # Some college\n", " \n", " # Age adjustment (older people might have different education patterns)\n", " if pd.notna(row['age']) and row['age'] > 55:\n", " base_education = max(12, base_education - 2) # Lower average for older generation\n", " \n", " return base_education\n", " \n", " # Apply education estimation\n", " df_business['education_years'] = df_business.apply(estimate_education, axis=1)\n", " \n", " print(\"Business logic imputation completed\")\n", " print(f\"Missing values remaining: {df_business.isnull().sum().sum()}\")\n", " \n", " return df_business\n", "\n", "# Apply business logic imputation\n", "df_business_imputed = business_logic_imputation(df_missing)\n", "\n", "print(\"\\nBusiness logic imputation summary:\")\n", "for col in ['income', 'satisfaction_score', 'education_years']:\n", " before = df_missing[col].isnull().sum()\n", " after = df_business_imputed[col].isnull().sum()\n", " print(f\"{col}: {before} → {after} missing values\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 6. Missing Data Flags and Indicators\n", "\n", "Track which values were imputed for transparency and analysis." 
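, "\n", "\n", "The function below builds the indicator columns by hand. If scikit-learn is available, `MissingIndicator` (or `SimpleImputer(..., add_indicator=True)`) packages the same idea as a reusable transformer; a hedged sketch:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Hedged sketch: missing-data flags via scikit-learn (assumes sklearn is installed)\n", "from sklearn.impute import MissingIndicator, SimpleImputer\n", "\n", "numeric_cols = ['income', 'education_years', 'purchase_amount', 'satisfaction_score']\n", "\n", "# Standalone transformer: one boolean column per feature that has missing values\n", "flags = MissingIndicator(features='missing-only').fit_transform(df_missing[numeric_cols])\n", "print(f\"Indicator matrix shape: {flags.shape}\")\n", "\n", "# Or impute and flag in a single step\n", "imputed_plus_flags = SimpleImputer(strategy='median', add_indicator=True).fit_transform(df_missing[numeric_cols])\n", "print(f\"Imputed values + flags shape: {imputed_plus_flags.shape}\")"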
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def create_missing_indicators(df_original, df_imputed):\n", " \"\"\"Create indicator variables for missing data\"\"\"\n", " print(\"=== CREATING MISSING DATA INDICATORS ===\")\n", " \n", " df_with_indicators = df_imputed.copy()\n", " \n", " # Create indicator columns for each column that had missing data\n", " columns_with_missing = df_original.columns[df_original.isnull().any()].tolist()\n", " \n", " for col in columns_with_missing:\n", " indicator_col = f'{col}_was_missing'\n", " df_with_indicators[indicator_col] = df_original[col].isnull().astype(int)\n", " \n", " print(f\"Created {len(columns_with_missing)} missing data indicators\")\n", " print(f\"Indicator columns: {[f'{col}_was_missing' for col in columns_with_missing]}\")\n", " \n", " # Summary of missing patterns\n", " indicator_cols = [f'{col}_was_missing' for col in columns_with_missing]\n", " missing_patterns = df_with_indicators[indicator_cols].sum()\n", " \n", " print(\"\\nMissing data summary by column:\")\n", " for col, count in missing_patterns.items():\n", " original_col = col.replace('_was_missing', '')\n", " percentage = (count / len(df_with_indicators)) * 100\n", " print(f\"{original_col}: {count} values imputed ({percentage:.1f}%)\")\n", " \n", " # Create composite missing indicator\n", " df_with_indicators['total_missing_count'] = df_with_indicators[indicator_cols].sum(axis=1)\n", " df_with_indicators['has_any_missing'] = (df_with_indicators['total_missing_count'] > 0).astype(int)\n", " \n", " return df_with_indicators, indicator_cols\n", "\n", "# Create missing indicators\n", "df_with_indicators, indicator_columns = create_missing_indicators(df_missing, df_business_imputed)\n", "\n", "print(\"\\nDataset with missing indicators:\")\n", "sample_cols = ['income', 'income_was_missing', 'education_years', 'education_years_was_missing', \n", " 'satisfaction_score', 'satisfaction_score_was_missing', 'total_missing_count']\n", "available_cols = [col for col in sample_cols if col in df_with_indicators.columns]\n", "print(df_with_indicators[available_cols].head(10))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 7. Validation and Quality Assessment\n", "\n", "Validate the quality of imputation results." 
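, "\n", "\n", "Because this lesson kept the complete data (`df_complete`), validation can go beyond mean/std checks: a two-sample Kolmogorov-Smirnov test compares the whole distribution of imputed values against the true ones. A hedged sketch using `scipy.stats.ks_2samp`, assuming SciPy is installed:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Hedged sketch: distribution-level validation with a two-sample KS test\n", "# (assumes SciPy is installed; ground truth is only available in this exercise)\n", "from scipy import stats\n", "\n", "col = 'income'\n", "mask = df_missing[col].isna()                      # which values were made missing\n", "true_vals = df_complete.loc[mask, col]             # the true values\n", "imputed_vals = df_business_imputed.loc[mask, col]  # what business-logic imputation produced\n", "\n", "stat, p_value = stats.ks_2samp(true_vals, imputed_vals)\n", "print(f\"KS statistic: {stat:.3f}, p-value: {p_value:.4f}\")\n", "print(\"A small p-value suggests the imputed values are distributed differently from the truth.\")"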
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def validate_imputation_quality(df_original, df_missing, df_imputed):\n", "    \"\"\"Validate the quality of imputation\"\"\"\n", "    print(\"=== IMPUTATION QUALITY VALIDATION ===\")\n", "    \n", "    validation_results = {}\n", "    \n", "    # Check each column that had missing data\n", "    for col in df_missing.columns:\n", "        if df_missing[col].isnull().any() and col in df_imputed.columns:\n", "            print(f\"\\n--- Validating {col} ---\")\n", "            \n", "            # Get missing mask\n", "            missing_mask = df_missing[col].isnull()\n", "            \n", "            # Original statistics (complete data)\n", "            original_stats = df_original[col].describe()\n", "            \n", "            # Imputed statistics (only imputed values)\n", "            imputed_values = df_imputed.loc[missing_mask, col]\n", "            \n", "            if imputed_values.isnull().all():\n", "                print(\"Column was not imputed (values still missing) - skipping\")\n", "                continue\n", "            \n", "            if pd.api.types.is_numeric_dtype(df_original[col]):\n", "                imputed_stats = imputed_values.describe()\n", "                \n", "                # Compare moments of original vs imputed values\n", "                mean_diff = abs(original_stats['mean'] - imputed_stats['mean'])\n", "                std_diff = abs(original_stats['std'] - imputed_stats['std'])\n", "                \n", "                validation_results[col] = {\n", "                    'original_mean': original_stats['mean'],\n", "                    'imputed_mean': imputed_stats['mean'],\n", "                    'mean_difference': mean_diff,\n", "                    'original_std': original_stats['std'],\n", "                    'imputed_std': imputed_stats['std'],\n", "                    'std_difference': std_diff,\n", "                    'values_imputed': len(imputed_values)\n", "                }\n", "                \n", "                print(f\"Original mean: {original_stats['mean']:.2f}, Imputed mean: {imputed_stats['mean']:.2f}\")\n", "                print(f\"Mean difference: {mean_diff:.2f} ({mean_diff/original_stats['mean']*100:.1f}%)\")\n", "                print(f\"Original std: {original_stats['std']:.2f}, Imputed std: {imputed_stats['std']:.2f}\")\n", "            \n", "            else:\n", "                # Non-numeric data (categories, dates): compare top values only,\n", "                # rather than dumping the full distributions\n", "                original_dist = df_original[col].value_counts(normalize=True).head()\n", "                imputed_dist = imputed_values.value_counts(normalize=True).head()\n", "                print(f\"Original top values: {original_dist.round(3).to_dict()}\")\n", "                print(f\"Imputed top values: {imputed_dist.round(3).to_dict()}\")\n", "    \n", "    # Overall validation summary\n", "    if validation_results:\n", "        validation_df = pd.DataFrame(validation_results).T\n", "        print(\"\\n=== VALIDATION SUMMARY ===\")\n", "        print(validation_df.round(3))\n", "        \n", "        # Flag potential issues\n", "        print(\"\\n--- Potential Issues ---\")\n", "        for col, stats in validation_results.items():\n", "            mean_change = abs(stats['mean_difference'] / stats['original_mean']) * 100\n", "            if mean_change > 10:  # More than 10% change in mean\n", "                print(f\"⚠️ {col}: Large mean change ({mean_change:.1f}%)\")\n", "            \n", "            std_change = abs(stats['std_difference'] / stats['original_std']) * 100\n", "            if std_change > 20:  # More than 20% change in std\n", "                print(f\"⚠️ {col}: Large variance change ({std_change:.1f}%)\")\n", "    \n", "    return validation_results\n", "\n", "# Validate imputation quality\n", "validation_results = validate_imputation_quality(df_complete, df_missing, df_business_imputed)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Practice Exercises\n", "\n", "Apply missing data handling techniques to challenging scenarios:" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [], "source": [ "# Exercise 1: Multi-step imputation strategy\n", "# Create a sophisticated imputation pipeline that:\n", "# 1. Handles different types of missing data appropriately\n", "# 2. Uses multiple imputation methods in sequence\n", "# 3. 
Validates results at each step\n", "# 4. Creates comprehensive documentation\n", "\n", "def comprehensive_imputation_pipeline(df):\n", " \"\"\"Comprehensive missing data handling pipeline\"\"\"\n", " # Your implementation here\n", " pass\n", "\n", "# result_df = comprehensive_imputation_pipeline(df_missing)\n", "# print(\"Comprehensive pipeline results:\")\n", "# print(result_df.isnull().sum())" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [], "source": [ "# Exercise 2: Missing data pattern analysis\n", "# Analyze if missing data follows specific patterns:\n", "# - Time-based patterns\n", "# - User behavior patterns\n", "# - System/technical patterns\n", "# Create insights and recommendations\n", "\n", "# Your code here:\n" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [], "source": [ "# Exercise 3: Impact assessment\n", "# Assess how different missing data handling approaches\n", "# affect downstream analysis:\n", "# - Statistical analysis results\n", "# - Machine learning model performance\n", "# - Business insights and decisions\n", "\n", "# Your code here:\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Key Takeaways\n", "\n", "1. **Understanding Missing Data Types**:\n", " - **MCAR**: Missing Completely at Random\n", " - **MAR**: Missing at Random (depends on observed data)\n", " - **MNAR**: Missing Not at Random (depends on unobserved data)\n", "\n", "2. **Detection and Analysis**:\n", " - Always analyze missing patterns before imputation\n", " - Use visualizations to understand missing data structure\n", " - Look for relationships between missing values and other variables\n", "\n", "3. **Handling Strategies**:\n", " - **Deletion**: Simple but can lose valuable information\n", " - **Simple Imputation**: Fast but may not preserve relationships\n", " - **Advanced Methods**: KNN, MICE preserve more complex relationships\n", " - **Business Logic**: Domain knowledge often provides best results\n", "\n", "4. **Best Practices**:\n", " - Create missing data indicators for transparency\n", " - Validate imputation quality against original data when possible\n", " - Consider the impact on downstream analysis\n", " - Document all imputation decisions and methods\n", "\n", "## Method Selection Guide\n", "\n", "| Scenario | Recommended Method | Rationale |\n", "|----------|-------------------|----------|\n", "| < 5% missing, MCAR | Simple imputation | Low impact, efficiency |\n", "| 5-20% missing, MAR | KNN or Group-based | Preserve relationships |\n", "| > 20% missing, complex patterns | MICE or Multiple imputation | Handle complex dependencies |\n", "| Business-critical decisions | Domain knowledge + validation | Accuracy and explainability |\n", "| Machine learning features | Advanced methods + indicators | Preserve predictive power |\n", "\n", "## Common Pitfalls to Avoid\n", "\n", "1. **Data Leakage**: Don't use future information to impute past values\n", "2. **Ignoring Patterns**: Missing data often has meaningful patterns\n", "3. **Over-imputation**: Sometimes missing data is informative itself\n", "4. **One-size-fits-all**: Different columns may need different strategies\n", "5. 
**No Validation**: Always check if imputation preserved data characteristics" ] } ], "metadata": { "kernelspec": { "display_name": "venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.3" } }, "nbformat": 4, "nbformat_minor": 4 }