{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Session 1 - DataFrames - Lesson 6: Handling Missing Data\n",
"\n",
"## Learning Objectives\n",
"- Understand different types of missing data and their implications\n",
"- Master techniques for detecting and analyzing missing values\n",
"- Learn various strategies for handling missing data\n",
"- Practice imputation methods and their trade-offs\n",
"- Develop best practices for missing data management\n",
"\n",
"## Prerequisites\n",
"- Completed Lessons 1-5\n",
"- Understanding of basic statistical concepts\n",
"- Familiarity with data quality principles"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Import required libraries\n",
"import pandas as pd\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"from datetime import datetime, timedelta\n",
"import warnings\n",
"warnings.filterwarnings('ignore')\n",
"\n",
"# Set display options\n",
"pd.set_option('display.max_columns', None)\n",
"plt.style.use('seaborn-v0_8')\n",
"\n",
"print(\"Libraries loaded successfully!\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Creating Dataset with Missing Values\n",
"\n",
"Let's create a realistic dataset with different patterns of missing data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Create comprehensive dataset with various missing data patterns\n",
"np.random.seed(42)\n",
"n_records = 500\n",
"\n",
"# Base data\n",
"data = {\n",
"    'customer_id': range(1, n_records + 1),\n",
"    'age': np.random.normal(35, 12, n_records).astype(int),\n",
"    'income': np.random.normal(50000, 15000, n_records),\n",
"    'education_years': np.random.normal(14, 3, n_records),\n",
"    'purchase_amount': np.random.normal(200, 50, n_records),\n",
"    'satisfaction_score': np.random.randint(1, 6, n_records),\n",
"    'region': np.random.choice(['North', 'South', 'East', 'West'], n_records),\n",
"    'product_category': np.random.choice(['Electronics', 'Clothing', 'Books', 'Home'], n_records),\n",
"    'signup_date': pd.date_range('2023-01-01', periods=n_records, freq='D'),\n",
"    'last_purchase_date': pd.date_range('2023-01-01', periods=n_records, freq='D') + pd.Timedelta(days=30)\n",
"}\n",
"\n",
"df_complete = pd.DataFrame(data)\n",
"\n",
"# Ensure positive values where appropriate\n",
"df_complete['age'] = np.abs(df_complete['age'])\n",
"df_complete['income'] = np.abs(df_complete['income'])\n",
"df_complete['education_years'] = np.clip(df_complete['education_years'], 6, 20)\n",
"df_complete['purchase_amount'] = np.abs(df_complete['purchase_amount'])\n",
"\n",
"print(\"Complete dataset created:\")\n",
"print(f\"Shape: {df_complete.shape}\")\n",
"print(\"\\nFirst few rows:\")\n",
"print(df_complete.head())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Introduce different patterns of missing data\n",
"df_missing = df_complete.copy()\n",
"\n",
"# 1. Missing Completely at Random (MCAR) - income data\n",
"# Randomly missing 15% of income values\n",
"mcar_indices = np.random.choice(df_missing.index, size=int(0.15 * len(df_missing)), replace=False)\n",
"df_missing.loc[mcar_indices, 'income'] = np.nan\n",
"\n",
"# 2. Missing at Random (MAR) - education years missing based on age\n",
"# Older people less likely to report education\n",
"older_customers = df_missing['age'] > 60\n",
"older_indices = df_missing[older_customers].index\n",
"education_missing = np.random.choice(older_indices, size=int(0.4 * len(older_indices)), replace=False)\n",
"df_missing.loc[education_missing, 'education_years'] = np.nan\n",
"\n",
"# 3. Missing Not at Random (MNAR) - satisfaction scores\n",
"# Unsatisfied customers less likely to provide ratings\n",
"low_satisfaction = df_missing['satisfaction_score'] <= 2\n",
"low_sat_indices = df_missing[low_satisfaction].index\n",
"satisfaction_missing = np.random.choice(low_sat_indices, size=int(0.6 * len(low_sat_indices)), replace=False)\n",
"df_missing.loc[satisfaction_missing, 'satisfaction_score'] = np.nan\n",
"\n",
"# 4. Systematic missing - last purchase date for new customers\n",
"# New customers (signed up recently) haven't made purchases yet\n",
"recent_signups = df_missing['signup_date'] > '2023-11-01'\n",
"df_missing.loc[recent_signups, 'last_purchase_date'] = pd.NaT\n",
"\n",
"# 5. Random missing in other columns\n",
"# Purchase amount - 10% missing\n",
"purchase_missing = np.random.choice(df_missing.index, size=int(0.10 * len(df_missing)), replace=False)\n",
"df_missing.loc[purchase_missing, 'purchase_amount'] = np.nan\n",
"\n",
"print(\"Missing data patterns introduced:\")\n",
"print(f\"Dataset shape: {df_missing.shape}\")\n",
"print(\"\\nMissing value counts:\")\n",
"missing_summary = df_missing.isnull().sum()\n",
"missing_summary = missing_summary[missing_summary > 0]\n",
"print(missing_summary)\n",
"\n",
"print(\"\\nMissing value percentages:\")\n",
"missing_pct = (df_missing.isnull().sum() / len(df_missing) * 100).round(2)\n",
"missing_pct = missing_pct[missing_pct > 0]\n",
"print(missing_pct)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Detecting and Analyzing Missing Data\n",
"\n",
"Comprehensive techniques for understanding missing data patterns."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def analyze_missing_data(df):\n",
"    \"\"\"Comprehensive missing data analysis\"\"\"\n",
"    print(\"=== MISSING DATA ANALYSIS ===\")\n",
"    \n",
"    # Basic missing data statistics\n",
"    total_cells = df.size\n",
"    total_missing = df.isnull().sum().sum()\n",
"    print(f\"Total cells: {total_cells:,}\")\n",
"    print(f\"Missing cells: {total_missing:,} ({total_missing/total_cells*100:.2f}%)\")\n",
"    \n",
"    # Missing data by column\n",
"    missing_by_column = pd.DataFrame({\n",
"        'Missing_Count': df.isnull().sum(),\n",
"        'Missing_Percentage': (df.isnull().sum() / len(df)) * 100,\n",
"        'Data_Type': df.dtypes\n",
"    })\n",
"    missing_by_column = missing_by_column[missing_by_column['Missing_Count'] > 0]\n",
"    missing_by_column = missing_by_column.sort_values('Missing_Percentage', ascending=False)\n",
"    \n",
"    print(\"\\n--- Missing Data by Column ---\")\n",
"    print(missing_by_column.round(2))\n",
"    \n",
"    # Missing data patterns\n",
"    print(\"\\n--- Missing Data Patterns ---\")\n",
"    missing_patterns = df.isnull().value_counts().head(10)\n",
"    print(\"Top 10 missing patterns (True = Missing):\")\n",
"    for pattern, count in missing_patterns.items():\n",
"        percentage = (count / len(df)) * 100\n",
"        print(f\"{count:4d} rows ({percentage:5.1f}%): {dict(zip(df.columns, pattern))}\")\n",
"    \n",
"    return missing_by_column\n",
"\n",
"# Analyze missing data\n",
"missing_analysis = analyze_missing_data(df_missing)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Visualize missing data patterns\n",
"def visualize_missing_data(df):\n",
"    \"\"\"Create visualizations for missing data patterns\"\"\"\n",
"    fig, axes = plt.subplots(2, 2, figsize=(15, 10))\n",
"    \n",
"    # 1. Missing data heatmap\n",
"    missing_mask = df.isnull()\n",
"    sns.heatmap(missing_mask.iloc[:100],\n",
"                yticklabels=False,\n",
"                cbar=True,\n",
"                cmap='viridis',\n",
"                ax=axes[0, 0])\n",
"    axes[0, 0].set_title('Missing Data Heatmap (First 100 rows)')\n",
"    \n",
"    # 2. Missing data by column\n",
"    missing_counts = df.isnull().sum()\n",
"    missing_counts = missing_counts[missing_counts > 0]\n",
"    missing_counts.plot(kind='bar', ax=axes[0, 1], color='skyblue')\n",
"    axes[0, 1].set_title('Missing Values by Column')\n",
"    axes[0, 1].set_ylabel('Count')\n",
"    axes[0, 1].tick_params(axis='x', rotation=45)\n",
"    \n",
"    # 3. Missing data correlation\n",
"    missing_corr = df.isnull().corr()\n",
"    sns.heatmap(missing_corr, annot=True, cmap='coolwarm', center=0, ax=axes[1, 0])\n",
"    axes[1, 0].set_title('Missing Data Correlation')\n",
"    \n",
"    # 4. Missing data by row\n",
"    missing_per_row = df.isnull().sum(axis=1)\n",
"    missing_per_row.hist(bins=range(len(df.columns) + 2), ax=axes[1, 1], alpha=0.7, color='orange')\n",
"    axes[1, 1].set_title('Distribution of Missing Values per Row')\n",
"    axes[1, 1].set_xlabel('Number of Missing Values')\n",
"    axes[1, 1].set_ylabel('Number of Rows')\n",
"    \n",
"    plt.tight_layout()\n",
"    plt.show()\n",
"\n",
"# Visualize missing patterns\n",
"visualize_missing_data(df_missing)"
]
},
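{
"cell_type": "markdown",
"metadata": {},
"source": [
"The diagnostics above are also available off the shelf in the third-party `missingno` package. The cell below is a minimal sketch; it assumes `missingno` is installed (`pip install missingno`), which the lesson does not otherwise require."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: ready-made missing-data plots via missingno (assumed installed)\n",
"import missingno as msno\n",
"\n",
"# Nullity matrix - similar to the manual heatmap above\n",
"msno.matrix(df_missing)\n",
"plt.show()\n",
"\n",
"# Nullity correlation between columns - similar to the correlation heatmap above\n",
"msno.heatmap(df_missing)\n",
"plt.show()"
]
},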
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Analyze missing data relationships\n",
"def analyze_missing_relationships(df):\n",
"    \"\"\"Analyze relationships between missing data and other variables\"\"\"\n",
"    print(\"=== MISSING DATA RELATIONSHIPS ===\")\n",
"    \n",
"    # Example: Relationship between age and missing education\n",
"    if 'age' in df.columns and 'education_years' in df.columns:\n",
"        print(\"\\n--- Age vs Missing Education ---\")\n",
"        education_missing = df['education_years'].isnull()\n",
"        age_stats = df.groupby(education_missing)['age'].agg(['mean', 'median', 'std']).round(2)\n",
"        age_stats.index = ['Education Present', 'Education Missing']\n",
"        print(age_stats)\n",
"    \n",
"    # Example: Missing satisfaction by purchase amount\n",
"    if 'satisfaction_score' in df.columns and 'purchase_amount' in df.columns:\n",
"        print(\"\\n--- Purchase Amount vs Missing Satisfaction ---\")\n",
"        satisfaction_missing = df['satisfaction_score'].isnull()\n",
"        purchase_stats = df.groupby(satisfaction_missing)['purchase_amount'].agg(['mean', 'median', 'count']).round(2)\n",
"        purchase_stats.index = ['Satisfaction Present', 'Satisfaction Missing']\n",
"        print(purchase_stats)\n",
"    \n",
"    # Missing data by categorical variables\n",
"    if 'region' in df.columns:\n",
"        print(\"\\n--- Missing Data by Region ---\")\n",
"        region_missing = df.groupby('region').apply(lambda x: x.isnull().sum())\n",
"        print(region_missing[region_missing.sum(axis=1) > 0])\n",
"\n",
"# Analyze relationships\n",
"analyze_missing_relationships(df_missing)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Basic Missing Data Handling\n",
"\n",
"Fundamental techniques for dealing with missing values."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Method 1: Dropping missing values\n",
"print(\"=== DROPPING MISSING VALUES ===\")\n",
"\n",
"# Drop rows with any missing values\n",
"df_drop_any = df_missing.dropna()\n",
"print(f\"Original shape: {df_missing.shape}\")\n",
"print(f\"After dropping any missing: {df_drop_any.shape}\")\n",
"print(f\"Rows removed: {len(df_missing) - len(df_drop_any)} ({(len(df_missing) - len(df_drop_any))/len(df_missing)*100:.1f}%)\")\n",
"\n",
"# Drop rows with missing values in specific columns\n",
"critical_columns = ['customer_id', 'age', 'region']\n",
"df_drop_critical = df_missing.dropna(subset=critical_columns)\n",
"print(f\"\\nAfter dropping rows missing critical columns: {df_drop_critical.shape}\")\n",
"\n",
"# Drop rows with more than X missing values\n",
"df_drop_thresh = df_missing.dropna(thresh=len(df_missing.columns) - 2) # Allow max 2 missing\n",
"print(f\"After dropping rows with >2 missing values: {df_drop_thresh.shape}\")\n",
"\n",
"# Drop columns with too many missing values\n",
"missing_threshold = 0.5 # 50%\n",
"cols_to_keep = df_missing.columns[df_missing.isnull().mean() < missing_threshold]\n",
"df_drop_cols = df_missing[cols_to_keep]\n",
"print(f\"\\nAfter dropping columns with >{missing_threshold*100}% missing: {df_drop_cols.shape}\")\n",
"print(f\"Columns dropped: {set(df_missing.columns) - set(cols_to_keep)}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Method 2: Basic imputation with fillna()\n",
"print(\"=== BASIC IMPUTATION ===\")\n",
"\n",
"df_basic_impute = df_missing.copy()\n",
"\n",
"# Fill with specific values\n",
"df_basic_impute['satisfaction_score'] = df_basic_impute['satisfaction_score'].fillna(3) # Neutral score\n",
"print(\"Filled satisfaction_score with 3 (neutral)\")\n",
"\n",
"# Fill with statistical measures\n",
"df_basic_impute['income'] = df_basic_impute['income'].fillna(df_basic_impute['income'].median())\n",
"df_basic_impute['education_years'] = df_basic_impute['education_years'].fillna(df_basic_impute['education_years'].mean())\n",
"df_basic_impute['purchase_amount'] = df_basic_impute['purchase_amount'].fillna(df_basic_impute['purchase_amount'].mean())\n",
"print(\"Filled numerical columns with mean/median\")\n",
"\n",
"# Backward fill for dates (.bfill(); fillna(method=...) is deprecated, use .ffill() to fill forward)\n",
"df_basic_impute['last_purchase_date'] = df_basic_impute['last_purchase_date'].bfill()\n",
"print(\"Filled dates with backward fill\")\n",
"\n",
"print(\"\\nMissing values after basic imputation:\")\n",
"print(df_basic_impute.isnull().sum().sum())\n",
"\n",
"# Show before/after comparison\n",
"print(\"\\nBefore/after missing counts:\")\n",
"comparison_cols = ['income', 'education_years', 'purchase_amount', 'satisfaction_score']\n",
"for col in comparison_cols:\n",
"    before_missing = df_missing[col].isnull().sum()\n",
"    after_missing = df_basic_impute[col].isnull().sum()\n",
"    print(f\"{col}: {before_missing} → {after_missing} missing values\")"
]
},
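{
"cell_type": "markdown",
"metadata": {},
"source": [
"For ordered or time-indexed data, interpolation is a gentler alternative to a constant fill: each gap is estimated from its neighbours rather than from a global statistic. The cell below is a small illustrative sketch (not part of the fills above); `df_interp` is a throwaway name, and sorting by `signup_date` simply gives the rows a meaningful order."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: interpolation for ordered data (reuses df_missing from above)\n",
"df_interp = df_missing.sort_values('signup_date').copy()\n",
"\n",
"# Linear interpolation estimates each gap from its neighbours in signup order\n",
"df_interp['income'] = df_interp['income'].interpolate(method='linear')\n",
"\n",
"# limit_direction='both' also fills gaps at the start and end of the series\n",
"df_interp['purchase_amount'] = df_interp['purchase_amount'].interpolate(method='linear', limit_direction='both')\n",
"\n",
"print(\"Missing values after interpolation:\")\n",
"print(df_interp[['income', 'purchase_amount']].isnull().sum())"
]
},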
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Advanced Imputation Techniques\n",
"\n",
"Sophisticated methods for handling missing data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Group-based imputation\n",
"def group_based_imputation(df):\n",
"    \"\"\"Impute missing values based on group statistics\"\"\"\n",
"    df_group_impute = df.copy()\n",
"    \n",
"    print(\"=== GROUP-BASED IMPUTATION ===\")\n",
"    \n",
"    # Impute income based on region and education level\n",
"    # First, create education level categories\n",
"    df_group_impute['education_level'] = pd.cut(\n",
"        df_group_impute['education_years'].fillna(df_group_impute['education_years'].median()),\n",
"        bins=[0, 12, 16, 20],\n",
"        labels=['High School', 'Bachelor', 'Advanced']\n",
"    )\n",
"    \n",
"    # Calculate group-based statistics\n",
"    income_by_group = df_group_impute.groupby(['region', 'education_level'])['income'].median()\n",
"    \n",
"    # Fill missing income values\n",
"    def fill_income(row):\n",
"        if pd.isna(row['income']):\n",
"            try:\n",
"                return income_by_group.loc[(row['region'], row['education_level'])]\n",
"            except KeyError:\n",
"                return df_group_impute['income'].median()\n",
"        return row['income']\n",
"    \n",
"    df_group_impute['income'] = df_group_impute.apply(fill_income, axis=1)\n",
"    \n",
"    print(\"Income imputed based on region and education level\")\n",
"    print(\"Group-based median income:\")\n",
"    print(income_by_group.round(0))\n",
"    \n",
"    return df_group_impute\n",
"\n",
"# Apply group-based imputation\n",
"df_group_imputed = group_based_imputation(df_missing)"
]
},
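{
"cell_type": "markdown",
"metadata": {},
"source": [
"The takeaways below also mention KNN and MICE-style imputation. The lesson does not implement them, so here is a minimal sketch using scikit-learn (assumed installed, not imported elsewhere in this notebook); `df_knn_imputed` and `df_iter_imputed` are illustrative names."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: model-based imputation with scikit-learn (assumed installed)\n",
"from sklearn.experimental import enable_iterative_imputer  # noqa: F401\n",
"from sklearn.impute import KNNImputer, IterativeImputer\n",
"\n",
"numeric_cols = ['age', 'income', 'education_years', 'purchase_amount', 'satisfaction_score']\n",
"\n",
"# KNN: each missing value is averaged from the k most similar rows\n",
"df_knn_imputed = df_missing.copy()\n",
"df_knn_imputed[numeric_cols] = KNNImputer(n_neighbors=5).fit_transform(df_missing[numeric_cols])\n",
"\n",
"# Iterative (MICE-style): each column is regressed on the others over several rounds\n",
"df_iter_imputed = df_missing.copy()\n",
"df_iter_imputed[numeric_cols] = IterativeImputer(random_state=42).fit_transform(df_missing[numeric_cols])\n",
"\n",
"print(f\"Missing after KNN: {df_knn_imputed[numeric_cols].isnull().sum().sum()}\")\n",
"print(f\"Missing after iterative: {df_iter_imputed[numeric_cols].isnull().sum().sum()}\")"
]
},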
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Comparison of Imputation Methods\n",
"\n",
"Compare different imputation approaches and their impact."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def compare_imputation_methods(original_complete, original_missing, *imputed_dfs, methods_names):\n",
"    \"\"\"Compare different imputation methods\"\"\"\n",
"    print(\"=== IMPUTATION METHODS COMPARISON ===\")\n",
"    \n",
"    # Focus on a specific column for comparison\n",
"    column = 'income'\n",
"    \n",
"    if column not in original_complete.columns:\n",
"        print(f\"Column {column} not found\")\n",
"        return\n",
"    \n",
"    # Get original values that were made missing\n",
"    missing_mask = original_missing[column].isnull()\n",
"    true_values = original_complete.loc[missing_mask, column]\n",
"    \n",
"    print(f\"Comparing imputation for '{column}' column\")\n",
"    print(f\"Number of missing values: {len(true_values)}\")\n",
"    \n",
"    # Calculate errors for each method\n",
"    results = {}\n",
"    \n",
"    for df_imputed, method_name in zip(imputed_dfs, methods_names):\n",
"        if column in df_imputed.columns:\n",
"            imputed_values = df_imputed.loc[missing_mask, column]\n",
"            \n",
"            # Calculate metrics\n",
"            mae = np.mean(np.abs(true_values - imputed_values))\n",
"            rmse = np.sqrt(np.mean((true_values - imputed_values) ** 2))\n",
"            bias = np.mean(imputed_values - true_values)\n",
"            \n",
"            results[method_name] = {\n",
"                'MAE': mae,\n",
"                'RMSE': rmse,\n",
"                'Bias': bias,\n",
"                'Mean_Imputed': np.mean(imputed_values),\n",
"                'Std_Imputed': np.std(imputed_values)\n",
"            }\n",
"    \n",
"    # True statistics\n",
"    print(\"\\nTrue statistics for missing values:\")\n",
"    print(f\"Mean: {np.mean(true_values):.2f}\")\n",
"    print(f\"Std: {np.std(true_values):.2f}\")\n",
"    \n",
"    # Results comparison\n",
"    results_df = pd.DataFrame(results).T\n",
"    print(\"\\nImputation comparison results:\")\n",
"    print(results_df.round(2))\n",
"    \n",
"    # Visualize comparison\n",
"    fig, axes = plt.subplots(2, 2, figsize=(15, 10))\n",
"    \n",
"    # Distribution comparison\n",
"    axes[0, 0].hist(true_values, alpha=0.7, label='True Values', bins=20)\n",
"    for df_imputed, method_name in zip(imputed_dfs, methods_names):\n",
"        if column in df_imputed.columns:\n",
"            imputed_values = df_imputed.loc[missing_mask, column]\n",
"            axes[0, 0].hist(imputed_values, alpha=0.7, label=method_name, bins=20)\n",
"    axes[0, 0].set_title('Distribution Comparison')\n",
"    axes[0, 0].legend()\n",
"    \n",
"    # Error metrics (MAE)\n",
"    mae_values = [results[method]['MAE'] for method in results]\n",
"    axes[0, 1].bar(range(len(mae_values)), mae_values, alpha=0.7)\n",
"    axes[0, 1].set_xticks(range(len(results)))\n",
"    axes[0, 1].set_xticklabels(list(results.keys()), rotation=45)\n",
"    axes[0, 1].set_title('MAE Comparison')\n",
"    \n",
"    # Scatter plot: True vs Imputed\n",
"    for i, (df_imputed, method_name) in enumerate(zip(imputed_dfs[:2], methods_names[:2])):\n",
"        if column in df_imputed.columns:\n",
"            imputed_values = df_imputed.loc[missing_mask, column]\n",
"            ax = axes[1, i]\n",
"            ax.scatter(true_values, imputed_values, alpha=0.6)\n",
"            ax.plot([true_values.min(), true_values.max()],\n",
"                    [true_values.min(), true_values.max()], 'r--', label='Perfect Prediction')\n",
"            ax.set_xlabel('True Values')\n",
"            ax.set_ylabel('Imputed Values')\n",
"            ax.set_title(f'{method_name}: True vs Imputed')\n",
"            ax.legend()\n",
"    \n",
"    plt.tight_layout()\n",
"    plt.show()\n",
"    \n",
"    return results_df\n",
"\n",
"# Compare the two imputation approaches built above\n",
"comparison_results = compare_imputation_methods(\n",
"    df_complete,\n",
"    df_missing,\n",
"    df_basic_impute,\n",
"    df_group_imputed,\n",
"    methods_names=['Basic Fill', 'Group-based']\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. Domain-Specific Imputation Strategies\n",
"\n",
"Business logic-driven approaches to missing data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def business_logic_imputation(df):\n",
"    \"\"\"Apply business logic for missing value imputation\"\"\"\n",
"    print(\"=== BUSINESS LOGIC IMPUTATION ===\")\n",
"    \n",
"    df_business = df.copy()\n",
"    \n",
"    # 1. Income imputation based on age and education\n",
"    def estimate_income(row):\n",
"        if pd.notna(row['income']):\n",
"            return row['income']\n",
"        \n",
"        # Base income estimation\n",
"        base_income = 30000\n",
"        \n",
"        # Age factor (experience premium)\n",
"        if pd.notna(row['age']):\n",
"            if row['age'] > 40:\n",
"                base_income *= 1.5\n",
"            elif row['age'] > 30:\n",
"                base_income *= 1.2\n",
"        \n",
"        # Education factor\n",
"        if pd.notna(row['education_years']):\n",
"            if row['education_years'] > 16: # Graduate degree\n",
"                base_income *= 1.8\n",
"            elif row['education_years'] > 12: # Bachelor's\n",
"                base_income *= 1.4\n",
"        \n",
"        # Regional adjustment\n",
"        regional_multipliers = {\n",
"            'North': 1.2, # Higher cost of living\n",
"            'South': 0.9,\n",
"            'East': 1.1,\n",
"            'West': 1.0\n",
"        }\n",
"        base_income *= regional_multipliers.get(row['region'], 1.0)\n",
"        \n",
"        return base_income\n",
"    \n",
"    # Apply income estimation\n",
"    df_business['income'] = df_business.apply(estimate_income, axis=1)\n",
"    \n",
"    # 2. Satisfaction score based on purchase behavior\n",
"    def estimate_satisfaction(row):\n",
"        if pd.notna(row['satisfaction_score']):\n",
"            return row['satisfaction_score']\n",
"        \n",
"        # Base satisfaction\n",
"        base_satisfaction = 3 # Neutral\n",
"        \n",
"        # Purchase amount influence\n",
"        if pd.notna(row['purchase_amount']):\n",
"            if row['purchase_amount'] > 250: # High value purchase\n",
"                base_satisfaction = 4\n",
"            elif row['purchase_amount'] < 100: # Low value might indicate dissatisfaction\n",
"                base_satisfaction = 2\n",
"        \n",
"        return base_satisfaction\n",
"    \n",
"    # Apply satisfaction estimation\n",
"    df_business['satisfaction_score'] = df_business.apply(estimate_satisfaction, axis=1)\n",
"    \n",
"    # 3. Education years based on income and age\n",
"    def estimate_education(row):\n",
"        if pd.notna(row['education_years']):\n",
"            return row['education_years']\n",
"        \n",
"        # Base education\n",
"        base_education = 12 # High school\n",
"        \n",
"        # Income-based estimation\n",
"        if pd.notna(row['income']):\n",
"            if row['income'] > 70000:\n",
"                base_education = 18 # Graduate level\n",
"            elif row['income'] > 45000:\n",
"                base_education = 16 # Bachelor's\n",
"            elif row['income'] > 35000:\n",
"                base_education = 14 # Some college\n",
"        \n",
"        # Age adjustment (older people might have different education patterns)\n",
"        if pd.notna(row['age']) and row['age'] > 55:\n",
"            base_education = max(12, base_education - 2) # Lower average for older generation\n",
"        \n",
"        return base_education\n",
"    \n",
"    # Apply education estimation\n",
"    df_business['education_years'] = df_business.apply(estimate_education, axis=1)\n",
"    \n",
"    print(\"Business logic imputation completed\")\n",
"    print(f\"Missing values remaining: {df_business.isnull().sum().sum()}\")\n",
"    \n",
"    return df_business\n",
"\n",
"# Apply business logic imputation\n",
"df_business_imputed = business_logic_imputation(df_missing)\n",
"\n",
"print(\"\\nBusiness logic imputation summary:\")\n",
"for col in ['income', 'satisfaction_score', 'education_years']:\n",
"    before = df_missing[col].isnull().sum()\n",
"    after = df_business_imputed[col].isnull().sum()\n",
"    print(f\"{col}: {before} → {after} missing values\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6. Missing Data Flags and Indicators\n",
"\n",
"Track which values were imputed for transparency and analysis."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def create_missing_indicators(df_original, df_imputed):\n",
"    \"\"\"Create indicator variables for missing data\"\"\"\n",
"    print(\"=== CREATING MISSING DATA INDICATORS ===\")\n",
"    \n",
"    df_with_indicators = df_imputed.copy()\n",
"    \n",
"    # Create indicator columns for each column that had missing data\n",
"    columns_with_missing = df_original.columns[df_original.isnull().any()].tolist()\n",
"    \n",
"    for col in columns_with_missing:\n",
"        indicator_col = f'{col}_was_missing'\n",
"        df_with_indicators[indicator_col] = df_original[col].isnull().astype(int)\n",
"    \n",
"    print(f\"Created {len(columns_with_missing)} missing data indicators\")\n",
"    print(f\"Indicator columns: {[f'{col}_was_missing' for col in columns_with_missing]}\")\n",
"    \n",
"    # Summary of missing patterns\n",
"    indicator_cols = [f'{col}_was_missing' for col in columns_with_missing]\n",
"    missing_patterns = df_with_indicators[indicator_cols].sum()\n",
"    \n",
"    print(\"\\nMissing data summary by column:\")\n",
"    for col, count in missing_patterns.items():\n",
"        original_col = col.replace('_was_missing', '')\n",
"        percentage = (count / len(df_with_indicators)) * 100\n",
"        print(f\"{original_col}: {count} values imputed ({percentage:.1f}%)\")\n",
"    \n",
"    # Create composite missing indicator\n",
"    df_with_indicators['total_missing_count'] = df_with_indicators[indicator_cols].sum(axis=1)\n",
"    df_with_indicators['has_any_missing'] = (df_with_indicators['total_missing_count'] > 0).astype(int)\n",
"    \n",
"    return df_with_indicators, indicator_cols\n",
"\n",
"# Create missing indicators\n",
"df_with_indicators, indicator_columns = create_missing_indicators(df_missing, df_business_imputed)\n",
"\n",
"print(\"\\nDataset with missing indicators:\")\n",
"sample_cols = ['income', 'income_was_missing', 'education_years', 'education_years_was_missing',\n",
"               'satisfaction_score', 'satisfaction_score_was_missing', 'total_missing_count']\n",
"available_cols = [col for col in sample_cols if col in df_with_indicators.columns]\n",
"print(df_with_indicators[available_cols].head(10))"
]
},
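{
"cell_type": "markdown",
"metadata": {},
"source": [
"One practical use of these indicators: test whether missingness is related to an observed variable, which is evidence against MCAR. The cell below is a sketch using a Welch t-test from SciPy (assumed installed) on the age/education relationship that was deliberately built into this dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: probing MCAR vs MAR with an indicator column (assumes scipy is installed)\n",
"from scipy import stats as scipy_stats\n",
"\n",
"# If education were MCAR, age should look the same whether or not education is missing\n",
"flag = df_with_indicators['education_years_was_missing'] == 1\n",
"age_missing = df_with_indicators.loc[flag, 'age']\n",
"age_present = df_with_indicators.loc[~flag, 'age']\n",
"\n",
"t_stat, p_value = scipy_stats.ttest_ind(age_missing, age_present, equal_var=False)\n",
"print(f\"Welch t-test on age by missing-education flag: t={t_stat:.2f}, p={p_value:.4f}\")\n",
"print(\"A small p-value suggests missingness depends on age (evidence against MCAR)\")"
]
},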
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 7. Validation and Quality Assessment\n",
"\n",
"Validate the quality of imputation results."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def validate_imputation_quality(df_original, df_missing, df_imputed):\n",
"    \"\"\"Validate the quality of imputation\"\"\"\n",
"    print(\"=== IMPUTATION QUALITY VALIDATION ===\")\n",
"    \n",
"    validation_results = {}\n",
"    \n",
"    # Check each column that had missing data\n",
"    for col in df_missing.columns:\n",
"        if df_missing[col].isnull().any() and col in df_imputed.columns:\n",
"            print(f\"\\n--- Validating {col} ---\")\n",
"            \n",
"            # Get missing mask\n",
"            missing_mask = df_missing[col].isnull()\n",
"            \n",
"            # Original statistics (complete data)\n",
"            original_stats = df_original[col].describe()\n",
"            \n",
"            # Imputed statistics (only imputed values)\n",
"            if missing_mask.any():\n",
"                imputed_values = df_imputed.loc[missing_mask, col]\n",
"                \n",
"                if pd.api.types.is_numeric_dtype(df_original[col]):\n",
"                    imputed_stats = imputed_values.describe()\n",
"                    \n",
"                    # Statistical tests\n",
"                    mean_diff = abs(original_stats['mean'] - imputed_stats['mean'])\n",
"                    std_diff = abs(original_stats['std'] - imputed_stats['std'])\n",
"                    \n",
"                    validation_results[col] = {\n",
"                        'original_mean': original_stats['mean'],\n",
"                        'imputed_mean': imputed_stats['mean'],\n",
"                        'mean_difference': mean_diff,\n",
"                        'original_std': original_stats['std'],\n",
"                        'imputed_std': imputed_stats['std'],\n",
"                        'std_difference': std_diff,\n",
"                        'values_imputed': len(imputed_values)\n",
"                    }\n",
"                    \n",
"                    print(f\"Original mean: {original_stats['mean']:.2f}, Imputed mean: {imputed_stats['mean']:.2f}\")\n",
"                    print(f\"Mean difference: {mean_diff:.2f} ({mean_diff/original_stats['mean']*100:.1f}%)\")\n",
"                    print(f\"Original std: {original_stats['std']:.2f}, Imputed std: {imputed_stats['std']:.2f}\")\n",
"                    \n",
"                else:\n",
"                    # Categorical data\n",
"                    original_dist = df_original[col].value_counts(normalize=True)\n",
"                    imputed_dist = imputed_values.value_counts(normalize=True)\n",
"                    print(f\"Original distribution: {original_dist.to_dict()}\")\n",
"                    print(f\"Imputed distribution: {imputed_dist.to_dict()}\")\n",
"    \n",
"    # Overall validation summary\n",
"    if validation_results:\n",
"        validation_df = pd.DataFrame(validation_results).T\n",
"        print(\"\\n=== VALIDATION SUMMARY ===\")\n",
"        print(validation_df.round(3))\n",
"        \n",
"        # Flag potential issues\n",
"        print(\"\\n--- Potential Issues ---\")\n",
"        for col, stats in validation_results.items():\n",
"            mean_change = abs(stats['mean_difference'] / stats['original_mean']) * 100\n",
"            if mean_change > 10: # More than 10% change in mean\n",
"                print(f\"⚠️ {col}: Large mean change ({mean_change:.1f}%)\")\n",
"            \n",
"            std_change = abs(stats['std_difference'] / stats['original_std']) * 100\n",
"            if std_change > 20: # More than 20% change in std\n",
"                print(f\"⚠️ {col}: Large variance change ({std_change:.1f}%)\")\n",
"    \n",
"    return validation_results\n",
"\n",
"# Validate imputation quality\n",
"validation_results = validate_imputation_quality(df_complete, df_missing, df_business_imputed)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Practice Exercises\n",
"\n",
"Apply missing data handling techniques to challenging scenarios:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Exercise 1: Multi-step imputation strategy\n",
"# Create a sophisticated imputation pipeline that:\n",
"# 1. Handles different types of missing data appropriately\n",
"# 2. Uses multiple imputation methods in sequence\n",
"# 3. Validates results at each step\n",
"# 4. Creates comprehensive documentation\n",
"\n",
"def comprehensive_imputation_pipeline(df):\n",
"    \"\"\"Comprehensive missing data handling pipeline\"\"\"\n",
"    # Your implementation here\n",
"    pass\n",
"\n",
"# result_df = comprehensive_imputation_pipeline(df_missing)\n",
"# print(\"Comprehensive pipeline results:\")\n",
"# print(result_df.isnull().sum())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Exercise 2: Missing data pattern analysis\n",
"# Analyze whether missing data follows specific patterns:\n",
"# - Time-based patterns\n",
"# - User behavior patterns\n",
"# - System/technical patterns\n",
"# Create insights and recommendations\n",
"\n",
"# Your code here:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Exercise 3: Impact assessment\n",
"# Assess how different missing data handling approaches\n",
"# affect downstream analysis:\n",
"# - Statistical analysis results\n",
"# - Machine learning model performance\n",
"# - Business insights and decisions\n",
"\n",
"# Your code here:\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Key Takeaways\n",
"\n",
"1. **Understanding Missing Data Types**:\n",
"   - **MCAR**: Missing Completely at Random\n",
"   - **MAR**: Missing at Random (depends on observed data)\n",
"   - **MNAR**: Missing Not at Random (depends on unobserved data)\n",
"\n",
"2. **Detection and Analysis**:\n",
"   - Always analyze missing patterns before imputation\n",
"   - Use visualizations to understand missing data structure\n",
"   - Look for relationships between missing values and other variables\n",
"\n",
"3. **Handling Strategies**:\n",
"   - **Deletion**: Simple but can lose valuable information\n",
"   - **Simple Imputation**: Fast but may not preserve relationships\n",
"   - **Advanced Methods**: KNN, MICE preserve more complex relationships\n",
"   - **Business Logic**: Domain knowledge often provides best results\n",
"\n",
"4. **Best Practices**:\n",
"   - Create missing data indicators for transparency\n",
"   - Validate imputation quality against original data when possible\n",
"   - Consider the impact on downstream analysis\n",
"   - Document all imputation decisions and methods\n",
"\n",
"## Method Selection Guide\n",
"\n",
"| Scenario | Recommended Method | Rationale |\n",
"|----------|-------------------|----------|\n",
"| < 5% missing, MCAR | Simple imputation | Low impact, efficiency |\n",
"| 5-20% missing, MAR | KNN or Group-based | Preserve relationships |\n",
"| > 20% missing, complex patterns | MICE or Multiple imputation | Handle complex dependencies |\n",
"| Business-critical decisions | Domain knowledge + validation | Accuracy and explainability |\n",
"| Machine learning features | Advanced methods + indicators | Preserve predictive power |\n",
"\n",
"## Common Pitfalls to Avoid\n",
"\n",
"1. **Data Leakage**: Don't use future information to impute past values\n",
"2. **Ignoring Patterns**: Missing data often has meaningful patterns\n",
"3. **Over-imputation**: Sometimes missing data is informative itself\n",
"4. **One-size-fits-all**: Different columns may need different strategies\n",
"5. **No Validation**: Always check if imputation preserved data characteristics"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.13.3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}