# Session 1 - DataFrames - Lesson 6: Handling Missing Data

## Learning Objectives
- Understand different types of missing data and their implications
- Master techniques for detecting and analyzing missing values
- Learn various strategies for handling missing data
- Practice imputation methods and their trade-offs
- Develop best practices for missing data management

## Prerequisites
- Completed Lessons 1-5
- Understanding of basic statistical concepts
- Familiarity with data quality principles

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta
import warnings
warnings.filterwarnings('ignore')

# Set display options
pd.set_option('display.max_columns', None)
plt.style.use('seaborn-v0_8')

print("Libraries loaded successfully!")

## Creating Dataset with Missing Values

Let's create a realistic dataset with different patterns of missing data.

In [None]:
# Create comprehensive dataset with various missing data patterns
np.random.seed(42)
n_records = 500

# Base data
data = {
    'customer_id': range(1, n_records + 1),
    'age': np.random.normal(35, 12, n_records).astype(int),
    'income': np.random.normal(50000, 15000, n_records),
    'education_years': np.random.normal(14, 3, n_records),
    'purchase_amount': np.random.normal(200, 50, n_records),
    'satisfaction_score': np.random.randint(1, 6, n_records),
    'region': np.random.choice(['North', 'South', 'East', 'West'], n_records),
    'product_category': np.random.choice(['Electronics', 'Clothing', 'Books', 'Home'], n_records),
    'signup_date': pd.date_range('2023-01-01', periods=n_records, freq='D'),
    'last_purchase_date': pd.date_range('2023-01-01', periods=n_records, freq='D') + pd.Timedelta(days=30)
}

df_complete = pd.DataFrame(data)

# Ensure positive values where appropriate
df_complete['age'] = np.abs(df_complete['age'])
df_complete['income'] = np.abs(df_complete['income'])
df_complete['education_years'] = np.clip(df_complete['education_years'], 6, 20)
df_complete['purchase_amount'] = np.abs(df_complete['purchase_amount'])

print("Complete dataset created:")
print(f"Shape: {df_complete.shape}")
print("\nFirst few rows:")
print(df_complete.head())

In [None]:
# Introduce different patterns of missing data
df_missing = df_complete.copy()

# 1. Missing Completely at Random (MCAR) - income data
# Randomly missing 15% of income values
mcar_indices = np.random.choice(df_missing.index, size=int(0.15 * len(df_missing)), replace=False)
df_missing.loc[mcar_indices, 'income'] = np.nan

# 2. Missing at Random (MAR) - education years missing based on age
# Older people less likely to report education
older_customers = df_missing['age'] > 60
older_indices = df_missing[older_customers].index
education_missing = np.random.choice(older_indices, size=int(0.4 * len(older_indices)), replace=False)
df_missing.loc[education_missing, 'education_years'] = np.nan

# 3. Missing Not at Random (MNAR) - satisfaction scores
# Unsatisfied customers less likely to provide ratings
low_satisfaction = df_missing['satisfaction_score'] <= 2
low_sat_indices = df_missing[low_satisfaction].index
satisfaction_missing = np.random.choice(low_sat_indices, size=int(0.6 * len(low_sat_indices)), replace=False)
df_missing.loc[satisfaction_missing, 'satisfaction_score'] = np.nan

# 4. Systematic missing - last purchase date for new customers
# New customers (signed up recently) haven't made purchases yet
recent_signups = df_missing['signup_date'] > '2023-11-01'
df_missing.loc[recent_signups, 'last_purchase_date'] = pd.NaT

# 5. Random missing in other columns
# Purchase amount - 10% missing
purchase_missing = np.random.choice(df_missing.index, size=int(0.10 * len(df_missing)), replace=False)
df_missing.loc[purchase_missing, 'purchase_amount'] = np.nan

print("Missing data patterns introduced:")
print(f"Dataset shape: {df_missing.shape}")
print("\nMissing value counts:")
missing_summary = df_missing.isnull().sum()
missing_summary = missing_summary[missing_summary > 0]
print(missing_summary)

print("\nMissing value percentages:")
missing_pct = (df_missing.isnull().sum() / len(df_missing) * 100).round(2)
missing_pct = missing_pct[missing_pct > 0]
print(missing_pct)
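
The mechanism behind the gaps matters because it determines whether the observed values are still representative. The sketch below is a toy simulation, separate from the lesson's dataset (the names `scores`, `mcar`, and `mnar` are illustrative). It removes the same number of values under MCAR and under MNAR and compares the observed means.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
scores = pd.Series(rng.integers(1, 6, size=1000), dtype="float")

# Remove the same number of values in both scenarios
n_drop = int(0.6 * (scores <= 2).sum())

# MCAR: the removed values are chosen uniformly at random
mcar = scores.copy()
mcar.loc[rng.choice(scores.index, size=n_drop, replace=False)] = np.nan

# MNAR: the removed values are chosen only among low scores (<= 2)
mnar = scores.copy()
low_idx = scores.index[scores <= 2]
mnar.loc[rng.choice(low_idx, size=n_drop, replace=False)] = np.nan

print(f"True mean:          {scores.mean():.3f}")
print(f"Observed mean MCAR: {mcar.mean():.3f}")
print(f"Observed mean MNAR: {mnar.mean():.3f}")
```

Under MNAR the observed mean is pushed upward because only low scores disappear, and no statistic computed from the remaining values alone can reveal that.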

## 1. Detecting and Analyzing Missing Data

Comprehensive techniques for understanding missing data patterns.

In [None]:
def analyze_missing_data(df):
    """Comprehensive missing data analysis"""
    print("=== MISSING DATA ANALYSIS ===")

    # Basic missing data statistics
    total_cells = df.size
    total_missing = df.isnull().sum().sum()
    print(f"Total cells: {total_cells:,}")
    print(f"Missing cells: {total_missing:,} ({total_missing/total_cells*100:.2f}%)")

    # Missing data by column
    missing_by_column = pd.DataFrame({
        'Missing_Count': df.isnull().sum(),
        'Missing_Percentage': (df.isnull().sum() / len(df)) * 100,
        'Data_Type': df.dtypes
    })
    missing_by_column = missing_by_column[missing_by_column['Missing_Count'] > 0]
    missing_by_column = missing_by_column.sort_values('Missing_Percentage', ascending=False)

    print("\n--- Missing Data by Column ---")
    print(missing_by_column.round(2))

    # Missing data patterns
    print("\n--- Missing Data Patterns ---")
    missing_patterns = df.isnull().value_counts().head(10)
    print("Top 10 missing patterns (True = Missing):")
    for pattern, count in missing_patterns.items():
        percentage = (count / len(df)) * 100
        print(f"{count:4d} rows ({percentage:5.1f}%): {dict(zip(df.columns, pattern))}")

    return missing_by_column

# Analyze missing data
missing_analysis = analyze_missing_data(df_missing)

In [None]:
# Visualize missing data patterns
def visualize_missing_data(df):
    """Create visualizations for missing data patterns"""
    fig, axes = plt.subplots(2, 2, figsize=(15, 10))

    # 1. Missing data heatmap
    missing_mask = df.isnull()
    sns.heatmap(missing_mask.iloc[:100],
                yticklabels=False,
                cbar=True,
                cmap='viridis',
                ax=axes[0, 0])
    axes[0, 0].set_title('Missing Data Heatmap (First 100 rows)')

    # 2. Missing data by column
    missing_counts = df.isnull().sum()
    missing_counts = missing_counts[missing_counts > 0]
    missing_counts.plot(kind='bar', ax=axes[0, 1], color='skyblue')
    axes[0, 1].set_title('Missing Values by Column')
    axes[0, 1].set_ylabel('Count')
    axes[0, 1].tick_params(axis='x', rotation=45)

    # 3. Missing data correlation
    missing_corr = df.isnull().corr()
    sns.heatmap(missing_corr, annot=True, cmap='coolwarm', center=0, ax=axes[1, 0])
    axes[1, 0].set_title('Missing Data Correlation')

    # 4. Missing data by row
    missing_per_row = df.isnull().sum(axis=1)
    missing_per_row.hist(bins=range(len(df.columns) + 2), ax=axes[1, 1], alpha=0.7, color='orange')
    axes[1, 1].set_title('Distribution of Missing Values per Row')
    axes[1, 1].set_xlabel('Number of Missing Values')
    axes[1, 1].set_ylabel('Number of Rows')

    plt.tight_layout()
    plt.show()

# Visualize missing patterns
visualize_missing_data(df_missing)

In [None]:
# Analyze missing data relationships
def analyze_missing_relationships(df):
    """Analyze relationships between missing data and other variables"""
    print("=== MISSING DATA RELATIONSHIPS ===")

    # Example: Relationship between age and missing education
    if 'age' in df.columns and 'education_years' in df.columns:
        print("\n--- Age vs Missing Education ---")
        education_missing = df['education_years'].isnull()
        age_stats = df.groupby(education_missing)['age'].agg(['mean', 'median', 'std']).round(2)
        # Rename by label so the output stays correct even if only one group exists
        age_stats = age_stats.rename(index={False: 'Education Present', True: 'Education Missing'})
        print(age_stats)

    # Example: Missing satisfaction by purchase amount
    if 'satisfaction_score' in df.columns and 'purchase_amount' in df.columns:
        print("\n--- Purchase Amount vs Missing Satisfaction ---")
        satisfaction_missing = df['satisfaction_score'].isnull()
        purchase_stats = df.groupby(satisfaction_missing)['purchase_amount'].agg(['mean', 'median', 'count']).round(2)
        purchase_stats = purchase_stats.rename(index={False: 'Satisfaction Present', True: 'Satisfaction Missing'})
        print(purchase_stats)

    # Missing data by categorical variables
    if 'region' in df.columns:
        print("\n--- Missing Data by Region ---")
        # Count missing values per column within each region
        region_missing = df.isnull().groupby(df['region']).sum()
        print(region_missing[region_missing.sum(axis=1) > 0])

# Analyze relationships
analyze_missing_relationships(df_missing)
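
A quick numeric check for whether missingness depends on an observed variable is to correlate a 0/1 missingness indicator with that variable. The following is a minimal sketch on a toy frame (`toy`, `is_missing`, and `corr_with_age` are illustrative names, and the 50% missing rate is an assumption chosen for the demo). A clearly non-zero correlation suggests the data are not MCAR, although it cannot distinguish MAR from MNAR.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "age": rng.integers(20, 80, size=400),
    "education_years": rng.normal(14, 3, size=400),
})

# Make education missing for half of the people older than 60 (a MAR mechanism)
older = toy.index[toy["age"] > 60]
drop_idx = rng.choice(older, size=len(older) // 2, replace=False)
toy.loc[drop_idx, "education_years"] = np.nan

# 1 where education is missing, 0 otherwise
is_missing = toy["education_years"].isna().astype(int)

# Correlation between the missingness indicator and an observed variable
corr_with_age = is_missing.corr(toy["age"])
print(f"Correlation between missing education and age: {corr_with_age:.2f}")
```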

## 2. Basic Missing Data Handling

Fundamental techniques for dealing with missing values.

In [None]:
# Method 1: Dropping missing values
print("=== DROPPING MISSING VALUES ===")

# Drop rows with any missing values
df_drop_any = df_missing.dropna()
print(f"Original shape: {df_missing.shape}")
print(f"After dropping any missing: {df_drop_any.shape}")
print(f"Rows removed: {len(df_missing) - len(df_drop_any)} ({(len(df_missing) - len(df_drop_any))/len(df_missing)*100:.1f}%)")

# Drop rows with missing values in specific columns
critical_columns = ['customer_id', 'age', 'region']
df_drop_critical = df_missing.dropna(subset=critical_columns)
print(f"\nAfter dropping rows missing critical columns: {df_drop_critical.shape}")

# Drop rows with more than X missing values
df_drop_thresh = df_missing.dropna(thresh=len(df_missing.columns) - 2) # Allow max 2 missing
print(f"After dropping rows with >2 missing values: {df_drop_thresh.shape}")

# Drop columns with too many missing values
missing_threshold = 0.5 # 50%
cols_to_keep = df_missing.columns[df_missing.isnull().mean() < missing_threshold]
df_drop_cols = df_missing[cols_to_keep]
print(f"\nAfter dropping columns with >= {missing_threshold:.0%} missing: {df_drop_cols.shape}")
print(f"Columns dropped: {set(df_missing.columns) - set(cols_to_keep)}")
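
The `thresh` argument is easy to misread: it is the minimum number of non-missing values a row must have to be kept, not the number of missing values allowed. A small sketch on a hypothetical three-column frame (`toy` is an illustrative name) makes the difference from a plain `dropna()` visible.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": [1.0, 2.0, np.nan, np.nan],
    "c": [1.0, 2.0, np.nan, 4.0],
})

# Default: keep only rows with no missing values at all
kept_any = toy.dropna()

# thresh=2: keep rows that have at least 2 observed (non-missing) values
kept_thresh = toy.dropna(thresh=2)

print(kept_any.index.tolist())     # [0]
print(kept_thresh.index.tolist())  # [0, 1, 3]
```

With three columns, `thresh=2` is the same as "allow at most one missing value per row", which is why the lesson computes `thresh` as the column count minus the number of gaps it tolerates.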

In [None]:
# Method 2: Basic imputation with fillna()
print("=== BASIC IMPUTATION ===")

df_basic_impute = df_missing.copy()

# Fill with specific values
df_basic_impute['satisfaction_score'] = df_basic_impute['satisfaction_score'].fillna(3) # Neutral score
print("Filled satisfaction_score with 3 (neutral)")

# Fill with statistical measures
df_basic_impute['income'] = df_basic_impute['income'].fillna(df_basic_impute['income'].median())
df_basic_impute['education_years'] = df_basic_impute['education_years'].fillna(df_basic_impute['education_years'].mean())
df_basic_impute['purchase_amount'] = df_basic_impute['purchase_amount'].fillna(df_basic_impute['purchase_amount'].mean())
print("Filled numerical columns with mean/median")

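# Forward fill for dates: the missing values all sit at the end of the frame
# (recent signups), so a backward fill would find no later value to copy.
# .ffill() replaces the deprecated fillna(method=...) form.
df_basic_impute['last_purchase_date'] = df_basic_impute['last_purchase_date'].ffill()
print("Filled dates with forward fill")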

print(f"\nMissing values after basic imputation:")
print(df_basic_impute.isnull().sum().sum())

# Show before/after comparison of missing-value counts
print("\nMissing values before → after imputation:")
comparison_cols = ['income', 'education_years', 'purchase_amount', 'satisfaction_score']
for col in comparison_cols:
    before_missing = df_missing[col].isnull().sum()
    after_missing = df_basic_impute[col].isnull().sum()
    print(f"{col}: {before_missing} → {after_missing} missing values")
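
One trade-off of filling with a single statistic is that the mean is preserved while the spread shrinks, because every imputed value sits exactly at the center. A minimal sketch with made-up numbers (`values` and `filled` are illustrative names):

```python
import numpy as np
import pandas as pd

values = pd.Series([10.0, 20.0, 30.0, 40.0, np.nan, np.nan])

# Both gaps are filled with the observed mean (25.0)
filled = values.fillna(values.mean())

print(f"Mean before: {values.mean():.2f}, after: {filled.mean():.2f}")
print(f"Std before:  {values.std():.2f}, after: {filled.std():.2f}")
```

The mean stays at 25.00, while the standard deviation drops from about 12.91 to 10.00. The more values are imputed this way, the more the variance is understated.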

## 3. Advanced Imputation Techniques

Sophisticated methods for handling missing data.

In [None]:
# Group-based imputation
def group_based_imputation(df):
    """Impute missing values based on group statistics"""
    df_group_impute = df.copy()

    print("=== GROUP-BASED IMPUTATION ===")

    # Impute income based on region and education level
    # First, create education level categories
    df_group_impute['education_level'] = pd.cut(
        df_group_impute['education_years'].fillna(df_group_impute['education_years'].median()),
        bins=[0, 12, 16, 20],
        labels=['High School', 'Bachelor', 'Advanced']
    )

    # Calculate group-based statistics (observed=True skips empty category combinations)
    income_by_group = df_group_impute.groupby(['region', 'education_level'], observed=True)['income'].median()
    overall_median = df_group_impute['income'].median()

    # Fill missing income values
    def fill_income(row):
        if pd.isna(row['income']):
            try:
                group_value = income_by_group.loc[(row['region'], row['education_level'])]
            except KeyError:
                group_value = np.nan
            # Fall back to the overall median when the group is absent or has no observed income
            return overall_median if pd.isna(group_value) else group_value
        return row['income']

    df_group_impute['income'] = df_group_impute.apply(fill_income, axis=1)

    print("Income imputed based on region and education level")
    print("Group-based median income:")
    print(income_by_group.round(0))

    return df_group_impute

# Apply group-based imputation
df_group_imputed = group_based_imputation(df_missing)
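
The row-wise `apply` above is easy to follow but slow on large frames. As an alternative sketch, not the lesson's own method, the same idea can be written with `groupby(...).transform`, which broadcasts each group's statistic back to its rows. The toy frame and the `income_filled` column name are illustrative.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "region": ["North", "North", "North", "South", "South", "West"],
    "income": [40000.0, 60000.0, np.nan, 30000.0, np.nan, np.nan],
})

# Median income within each region, aligned to the original rows
group_median = toy.groupby("region")["income"].transform("median")

# Fill from the group first, then fall back to the overall median
# for groups with no observed income at all (here: West)
toy["income_filled"] = (
    toy["income"].fillna(group_median).fillna(toy["income"].median())
)
print(toy)
```

The North gap becomes 50000, the South gap 30000, and the West row receives the overall median of 40000 because its group has nothing to learn from.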

## 4. Comparison of Imputation Methods

Compare different imputation approaches and their impact.
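
The comparison in this section scores each method on the values that were deliberately removed, using three error measures. The sketch below works them through on three made-up value pairs so the numbers can be checked by hand (the arrays are illustrative, not taken from the dataset).

```python
import numpy as np

true_values = np.array([100.0, 200.0, 300.0])
imputed_values = np.array([110.0, 190.0, 330.0])

errors = imputed_values - true_values   # [10, -10, 30]

mae = np.mean(np.abs(errors))           # average size of the error
rmse = np.sqrt(np.mean(errors ** 2))    # penalizes large errors more heavily
bias = np.mean(errors)                  # > 0 means values are over-estimated on average

print(f"MAE:  {mae:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"Bias: {bias:.2f}")
```

MAE and RMSE measure accuracy regardless of direction, while bias shows whether a method systematically over- or under-estimates. A method can have near-zero bias and still a large MAE.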

In [None]:
def compare_imputation_methods(original_complete, original_missing, *imputed_dfs, methods_names):
    """Compare different imputation methods"""
    print("=== IMPUTATION METHODS COMPARISON ===")

    # Focus on a specific column for comparison
    column = 'income'

    if column not in original_complete.columns:
        print(f"Column {column} not found")
        return

    # Get original values that were made missing
    missing_mask = original_missing[column].isnull()
    true_values = original_complete.loc[missing_mask, column]

    print(f"Comparing imputation for '{column}' column")
    print(f"Number of missing values: {len(true_values)}")

    # Calculate errors for each method
    results = {}

    for df_imputed, method_name in zip(imputed_dfs, methods_names):
        if column in df_imputed.columns:
            imputed_values = df_imputed.loc[missing_mask, column]

            # Calculate metrics
            mae = np.mean(np.abs(true_values - imputed_values))
            rmse = np.sqrt(np.mean((true_values - imputed_values) ** 2))
            bias = np.mean(imputed_values - true_values)

            results[method_name] = {
                'MAE': mae,
                'RMSE': rmse,
                'Bias': bias,
                'Mean_Imputed': np.mean(imputed_values),
                'Std_Imputed': np.std(imputed_values)
            }

    # True statistics
    print(f"\nTrue statistics for missing values:")
    print(f"Mean: {np.mean(true_values):.2f}")
    print(f"Std: {np.std(true_values):.2f}")

    # Results comparison
    results_df = pd.DataFrame(results).T
    print(f"\nImputation comparison results:")
    print(results_df.round(2))

    # Visualize comparison
    fig, axes = plt.subplots(2, 2, figsize=(15, 10))

    # Distribution comparison
    axes[0, 0].hist(true_values, alpha=0.7, label='True Values', bins=20)
    for df_imputed, method_name in zip(imputed_dfs, methods_names):
        if column in df_imputed.columns:
            imputed_values = df_imputed.loc[missing_mask, column]
            axes[0, 0].hist(imputed_values, alpha=0.7, label=f'{method_name}', bins=20)
    axes[0, 0].set_title('Distribution Comparison')
    axes[0, 0].legend()

    # Error metrics
    metrics = ['MAE', 'RMSE']
    for i, metric in enumerate(metrics):
        values = [results[method][metric] for method in results.keys()]
        axes[0, 1].bar(range(len(values)), values, alpha=0.7)
        axes[0, 1].set_xticks(range(len(results)))
        axes[0, 1].set_xticklabels(list(results.keys()), rotation=45)
        axes[0, 1].set_title(f'{metric} Comparison')
        break  # Show only MAE for now

    # Scatter plot: True vs Imputed
    for i, (df_imputed, method_name) in enumerate(zip(imputed_dfs[:2], methods_names[:2])):
        if column in df_imputed.columns:
            imputed_values = df_imputed.loc[missing_mask, column]
            ax = axes[1, i]
            ax.scatter(true_values, imputed_values, alpha=0.6)
            ax.plot([true_values.min(), true_values.max()],
                    [true_values.min(), true_values.max()], 'r--', label='Perfect Prediction')
            ax.set_xlabel('True Values')
            ax.set_ylabel('Imputed Values')
            ax.set_title(f'{method_name}: True vs Imputed')
            ax.legend()

    plt.tight_layout()
    plt.show()

    return results_df

# Compare the methods built so far; each name must match one imputed DataFrame
comparison_results = compare_imputation_methods(
    df_complete,
    df_missing,
    df_basic_impute,
    df_group_imputed,
    methods_names=['Basic Fill', 'Group-Based']
)

## 5. Domain-Specific Imputation Strategies

Business logic-driven approaches to missing data.

In [None]:
def business_logic_imputation(df):
    """Apply business logic for missing value imputation"""
    print("=== BUSINESS LOGIC IMPUTATION ===")

    df_business = df.copy()

    # 1. Income imputation based on age and education
    def estimate_income(row):
        if pd.notna(row['income']):
            return row['income']

        # Base income estimation
        base_income = 30000

        # Age factor (experience premium)
        if pd.notna(row['age']):
            if row['age'] > 40:
                base_income *= 1.5
            elif row['age'] > 30:
                base_income *= 1.2

        # Education factor
        if pd.notna(row['education_years']):
            if row['education_years'] > 16:  # Graduate degree
                base_income *= 1.8
            elif row['education_years'] > 12:  # Bachelor's
                base_income *= 1.4

        # Regional adjustment
        regional_multipliers = {
            'North': 1.2,  # Higher cost of living
            'South': 0.9,
            'East': 1.1,
            'West': 1.0
        }
        base_income *= regional_multipliers.get(row['region'], 1.0)

        return base_income

    # Apply income estimation
    df_business['income'] = df_business.apply(estimate_income, axis=1)

    # 2. Satisfaction score based on purchase behavior
    def estimate_satisfaction(row):
        if pd.notna(row['satisfaction_score']):
            return row['satisfaction_score']

        # Base satisfaction
        base_satisfaction = 3  # Neutral

        # Purchase amount influence
        if pd.notna(row['purchase_amount']):
            if row['purchase_amount'] > 250:  # High value purchase
                base_satisfaction = 4
            elif row['purchase_amount'] < 100:  # Low value might indicate dissatisfaction
                base_satisfaction = 2

        return base_satisfaction

    # Apply satisfaction estimation
    df_business['satisfaction_score'] = df_business.apply(estimate_satisfaction, axis=1)

    # 3. Education years based on income and age
    def estimate_education(row):
        if pd.notna(row['education_years']):
            return row['education_years']

        # Base education
        base_education = 12  # High school

        # Income-based estimation
        if pd.notna(row['income']):
            if row['income'] > 70000:
                base_education = 18  # Graduate level
            elif row['income'] > 45000:
                base_education = 16  # Bachelor's
            elif row['income'] > 35000:
                base_education = 14  # Some college

        # Age adjustment (older people might have different education patterns)
        if pd.notna(row['age']) and row['age'] > 55:
            base_education = max(12, base_education - 2)  # Lower average for older generation

        return base_education

    # Apply education estimation
    df_business['education_years'] = df_business.apply(estimate_education, axis=1)

    print("Business logic imputation completed")
    print(f"Missing values remaining: {df_business.isnull().sum().sum()}")

    return df_business

# Apply business logic imputation
df_business_imputed = business_logic_imputation(df_missing)

print("\nBusiness logic imputation summary:")
for col in ['income', 'satisfaction_score', 'education_years']:
    before = df_missing[col].isnull().sum()
    after = df_business_imputed[col].isnull().sum()
    print(f"{col}: {before} → {after} missing values")

## 6. Missing Data Flags and Indicators

Track which values were imputed for transparency and analysis.

In [None]:
def create_missing_indicators(df_original, df_imputed):
    """Create indicator variables for missing data"""
    print("=== CREATING MISSING DATA INDICATORS ===")

    df_with_indicators = df_imputed.copy()

    # Create indicator columns for each column that had missing data
    columns_with_missing = df_original.columns[df_original.isnull().any()].tolist()

    for col in columns_with_missing:
        indicator_col = f'{col}_was_missing'
        df_with_indicators[indicator_col] = df_original[col].isnull().astype(int)

    print(f"Created {len(columns_with_missing)} missing data indicators")
    print(f"Indicator columns: {[f'{col}_was_missing' for col in columns_with_missing]}")

    # Summary of missing patterns
    indicator_cols = [f'{col}_was_missing' for col in columns_with_missing]
    missing_patterns = df_with_indicators[indicator_cols].sum()

    print("\nMissing data summary by column:")
    for col, count in missing_patterns.items():
        original_col = col.replace('_was_missing', '')
        percentage = (count / len(df_with_indicators)) * 100
        print(f"{original_col}: {count} values imputed ({percentage:.1f}%)")

    # Create composite missing indicator
    df_with_indicators['total_missing_count'] = df_with_indicators[indicator_cols].sum(axis=1)
    df_with_indicators['has_any_missing'] = (df_with_indicators['total_missing_count'] > 0).astype(int)

    return df_with_indicators, indicator_cols

# Create missing indicators
df_with_indicators, indicator_columns = create_missing_indicators(df_missing, df_business_imputed)

print("\nDataset with missing indicators:")
sample_cols = ['income', 'income_was_missing', 'education_years', 'education_years_was_missing', 
 'satisfaction_score', 'satisfaction_score_was_missing', 'total_missing_count']
available_cols = [col for col in sample_cols if col in df_with_indicators.columns]
print(df_with_indicators[available_cols].head(10))

## 7. Validation and Quality Assessment

Validate the quality of imputation results.

In [None]:
def validate_imputation_quality(df_original, df_missing, df_imputed):
    """Validate the quality of imputation"""
    print("=== IMPUTATION QUALITY VALIDATION ===")

    validation_results = {}

    # Check each column that had missing data
    for col in df_missing.columns:
        if df_missing[col].isnull().any() and col in df_imputed.columns:
            print(f"\n--- Validating {col} ---")

            # Get missing mask
            missing_mask = df_missing[col].isnull()

            # Original statistics (complete data)
            original_stats = df_original[col].describe()

            # Imputed statistics (only imputed values)
            if missing_mask.any():
                imputed_values = df_imputed.loc[missing_mask, col]

                if pd.api.types.is_numeric_dtype(df_original[col]):
                    imputed_stats = imputed_values.describe()

                    # Statistical tests
                    mean_diff = abs(original_stats['mean'] - imputed_stats['mean'])
                    std_diff = abs(original_stats['std'] - imputed_stats['std'])

                    validation_results[col] = {
                        'original_mean': original_stats['mean'],
                        'imputed_mean': imputed_stats['mean'],
                        'mean_difference': mean_diff,
                        'original_std': original_stats['std'],
                        'imputed_std': imputed_stats['std'],
                        'std_difference': std_diff,
                        'values_imputed': len(imputed_values)
                    }

                    print(f"Original mean: {original_stats['mean']:.2f}, Imputed mean: {imputed_stats['mean']:.2f}")
                    print(f"Mean difference: {mean_diff:.2f} ({mean_diff/original_stats['mean']*100:.1f}%)")
                    print(f"Original std: {original_stats['std']:.2f}, Imputed std: {imputed_stats['std']:.2f}")

                else:
                    # Categorical data
                    original_dist = df_original[col].value_counts(normalize=True)
                    imputed_dist = imputed_values.value_counts(normalize=True)
                    print(f"Original distribution: {original_dist.to_dict()}")
                    print(f"Imputed distribution: {imputed_dist.to_dict()}")

    # Overall validation summary
    if validation_results:
        validation_df = pd.DataFrame(validation_results).T
        print("\n=== VALIDATION SUMMARY ===")
        print(validation_df.round(3))

        # Flag potential issues
        print("\n--- Potential Issues ---")
        for col, stats in validation_results.items():
            mean_change = abs(stats['mean_difference'] / stats['original_mean']) * 100
            if mean_change > 10:  # More than 10% change in mean
                print(f"⚠️ {col}: Large mean change ({mean_change:.1f}%)")

            std_change = abs(stats['std_difference'] / stats['original_std']) * 100
            if std_change > 20:  # More than 20% change in std
                print(f"⚠️ {col}: Large variance change ({std_change:.1f}%)")

    return validation_results

# Validate imputation quality
validation_results = validate_imputation_quality(df_complete, df_missing, df_business_imputed)

## Practice Exercises

Apply missing data handling techniques to challenging scenarios:

In [53]:
# Exercise 1: Multi-step imputation strategy
# Create a sophisticated imputation pipeline that:
# 1. Handles different types of missing data appropriately
# 2. Uses multiple imputation methods in sequence
# 3. Validates results at each step
# 4. Creates comprehensive documentation

def comprehensive_imputation_pipeline(df):
    """Comprehensive missing data handling pipeline"""
    # Your implementation here
    pass

# result_df = comprehensive_imputation_pipeline(df_missing)
# print("Comprehensive pipeline results:")
# print(result_df.isnull().sum())

In [54]:
# Exercise 2: Missing data pattern analysis
# Analyze if missing data follows specific patterns:
# - Time-based patterns
# - User behavior patterns
# - System/technical patterns
# Create insights and recommendations

# Your code here:


In [55]:
# Exercise 3: Impact assessment
# Assess how different missing data handling approaches
# affect downstream analysis:
# - Statistical analysis results
# - Machine learning model performance
# - Business insights and decisions

# Your code here:


## Key Takeaways

1. **Understanding Missing Data Types**:
 - **MCAR**: Missing Completely at Random
 - **MAR**: Missing at Random (depends on observed data)
 - **MNAR**: Missing Not at Random (depends on unobserved data)

2. **Detection and Analysis**:
 - Always analyze missing patterns before imputation
 - Use visualizations to understand missing data structure
 - Look for relationships between missing values and other variables

3. **Handling Strategies**:
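 - **Deletion**: Simple, but discards information and can bias results unless data are MCAR
 - **Simple Imputation**: Fast, but shrinks variance and may not preserve relationships between variables
 - **Advanced Methods**: KNN and MICE preserve more of the relationships between variables
 - **Business Logic**: Domain knowledge gives explainable values, but hand-written rules should be validated against real data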

4. **Best Practices**:
 - Create missing data indicators for transparency
 - Validate imputation quality against original data when possible
 - Consider the impact on downstream analysis
 - Document all imputation decisions and methods

## Method Selection Guide

| Scenario | Recommended Method | Rationale |
|----------|-------------------|----------|
| < 5% missing, MCAR | Simple imputation | Low impact, efficiency |
| 5-20% missing, MAR | KNN or Group-based | Preserve relationships |
| > 20% missing, complex patterns | MICE or Multiple imputation | Handle complex dependencies |
| Business-critical decisions | Domain knowledge + validation | Accuracy and explainability |
| Machine learning features | Advanced methods + indicators | Preserve predictive power |

## Common Pitfalls to Avoid

1. **Data Leakage**: Don't use future information to impute past values
2. **Ignoring Patterns**: Missing data often has meaningful patterns
3. **Over-imputation**: Sometimes missing data is informative itself
4. **One-size-fits-all**: Different columns may need different strategies
5. **No Validation**: Always check if imputation preserved data characteristics
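
One common form of the leakage described in the first pitfall appears when imputation statistics are computed on the full dataset before a train/test split. The sketch below shows the safer order under simple assumptions: a positional split at row 150 and a median fill. The names `data`, `train`, `test`, and `train_median` are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
data = pd.DataFrame({"income": rng.normal(50000, 15000, size=200)})
data.loc[rng.choice(data.index, size=30, replace=False), "income"] = np.nan

# Split first: the first 150 rows are for training, the rest are held out
train = data.iloc[:150].copy()
test = data.iloc[150:].copy()

# Learn the fill value from the training rows only
train_median = train["income"].median()

# Apply the same training-derived value to both splits
train["income"] = train["income"].fillna(train_median)
test["income"] = test["income"].fillna(train_median)

print(f"Fill value learned from training rows: {train_median:.2f}")
print(f"Remaining gaps: train={train['income'].isna().sum()}, test={test['income'].isna().sum()}")
```

Computing the median on `data` instead of `train` would let held-out rows influence the values the model is trained on, which tends to make evaluation results look better than they are.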