{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Session 1 - DataFrames - Lesson 13: Advanced Data Cleaning\n",
"\n",
"## Learning Objectives\n",
"- Master advanced techniques for data cleaning and validation\n",
"- Learn to detect and handle various types of data quality issues\n",
"- Understand data standardization and normalization techniques\n",
"- Practice with real-world messy data scenarios\n",
"- Develop automated data cleaning pipelines\n",
"\n",
"## Prerequisites\n",
"- Completed previous lessons on DataFrames\n",
"- Understanding of basic data cleaning concepts\n",
"- Familiarity with regular expressions (helpful but not required)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Import required libraries\n",
"import pandas as pd\n",
"import numpy as np\n",
"import re\n",
"from datetime import datetime, timedelta\n",
"import warnings\n",
"warnings.filterwarnings('ignore')\n",
"\n",
"# Display settings\n",
"pd.set_option('display.max_columns', None)\n",
"pd.set_option('display.max_rows', 100)\n",
"\n",
"print(\"Libraries loaded successfully!\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Creating Messy Sample Data\n",
"\n",
"Let's create a realistic messy dataset to practice advanced cleaning techniques."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"\n",
"# Create intentionally messy data that mimics real-world issues\n",
"np.random.seed(42)\n",
"\n",
"# Base data\n",
"n_records = 200\n",
"messy_data = {\n",
"    'customer_id': [f'CUST{i:04d}' if i % 10 != 0 else f'cust{i:04d}' for i in range(1, n_records + 1)],\n",
"    'customer_name': [\n",
"        'John Smith', 'jane doe', 'MARY JOHNSON', 'bob wilson', 'Sarah Davis',\n",
"        'Mike Brown', 'lisa garcia', 'DAVID MILLER', 'Amy Wilson', 'Tom Anderson'\n",
"    ] * 20,\n",
"    'email': [\n",
"        'john.smith@email.com', 'JANE.DOE@EMAIL.COM', 'mary@company.org',\n",
"        'bob..wilson@test.com', 'sarah@invalid-email', 'mike@email.com',\n",
"        'lisa.garcia@email.com', 'david@company.org', 'amy@email.com', 'tom@test.com'\n",
"    ] * 20,\n",
"    'phone': [\n",
"        '(555) 123-4567', '555.987.6543', '5551234567', '555-987-6543',\n",
"        '(555)123-4567', '+1-555-123-4567', '555 123 4567', '5559876543',\n",
"        '(555) 987 6543', '555-123-4567'\n",
"    ] * 20,\n",
"    'address': [\n",
"        '123 Main St, Anytown, NY 12345', '456 Oak Ave, Boston, MA 02101',\n",
"        '789 Pine Rd, Los Angeles, CA 90210', '321 Elm St, Chicago, IL 60601',\n",
"        '654 Maple Dr, Houston, TX 77001', '987 Cedar Ln, Phoenix, AZ 85001',\n",
"        '147 Birch Way, Philadelphia, PA 19101', '258 Ash Ct, San Antonio, TX 78201',\n",
"        '369 Walnut St, San Diego, CA 92101', '741 Cherry Ave, Dallas, TX 75201'\n",
"    ] * 20,\n",
"    'purchase_amount': np.random.normal(100, 30, n_records).round(2),\n",
"    'purchase_date': [\n",
"        '2024-01-15', '01/16/2024', '2024-1-17', '16-01-2024', '2024/01/18',\n",
"        'January 19, 2024', '2024-01-20', '01-21-24', '2024.01.22', '23/01/2024'\n",
"    ] * 20,\n",
"    'category': [\n",
"        'Electronics', 'electronics', 'ELECTRONICS', 'Books', 'books',\n",
"        'Clothing', 'clothing', 'CLOTHING', 'Home & Garden', 'home&garden'\n",
"    ] * 20,\n",
"    'satisfaction_score': np.random.choice([1, 2, 3, 4, 5, 99, -1, None], n_records, p=[0.05, 0.1, 0.15, 0.35, 0.3, 0.02, 0.02, 0.01])\n",
"}\n",
"\n",
"# Convert to DataFrame first\n",
"df_messy = pd.DataFrame(messy_data)\n",
"\n",
"# Introduce missing values and anomalies using proper indexing\n",
"df_messy.loc[df_messy.index[::25], 'customer_name'] = None  # Some missing names\n",
"df_messy.loc[df_messy.index[::30], 'email'] = None  # Some missing emails\n",
"df_messy.loc[df_messy.index[::35], 'purchase_amount'] = np.nan  # Some missing amounts\n",
"df_messy.loc[df_messy.index[::40], 'purchase_amount'] = -999  # Invalid negative values\n",
"\n",
"# Add some duplicate records\n",
"duplicate_indices = [0, 1, 2, 3, 4]\n",
"duplicate_rows = df_messy.iloc[duplicate_indices].copy()\n",
"df_messy = pd.concat([df_messy, duplicate_rows], ignore_index=True)\n",
"\n",
"print(\"Messy dataset created:\")\n",
"print(f\"Shape: {df_messy.shape}\")\n",
"print(\"\\nFirst few rows:\")\n",
"print(df_messy.head(10))\n",
"print(\"\\nData types:\")\n",
"print(df_messy.dtypes)\n",
"print(\"\\nSample of data quality issues:\")\n",
"print(\"\\n1. Missing values:\")\n",
"print(df_messy.isnull().sum())\n",
"print(\"\\n2. Inconsistent formatting examples:\")\n",
"print(\"Customer IDs:\", df_messy['customer_id'].head(15).tolist())\n",
"print(\"Customer names:\", df_messy['customer_name'].dropna().head(5).tolist())\n",
"print(\"Categories:\", df_messy['category'].unique()[:5])\n",
"print(\"\\n3. Invalid satisfaction scores:\")\n",
"print(\"Unique satisfaction scores:\", sorted(df_messy['satisfaction_score'].dropna().unique()))\n",
"print(\"\\n4. Invalid purchase amounts:\")\n",
"print(\"Negative amounts:\", df_messy[df_messy['purchase_amount'] < 0]['purchase_amount'].count())\n",
"print(\"\\n5. Date format inconsistencies:\")\n",
"print(\"Sample dates:\", df_messy['purchase_date'].head(10).tolist())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Data Quality Assessment\n",
"\n",
"First, let's assess the quality of our messy data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def assess_data_quality(df):\n",
"    \"\"\"Comprehensive data quality assessment\"\"\"\n",
"    print(\"=== DATA QUALITY ASSESSMENT ===\")\n",
"    print(f\"Dataset shape: {df.shape}\")\n",
"    print(f\"Total cells: {df.size}\")\n",
"    \n",
"    # Missing values analysis\n",
"    print(\"\\n--- Missing Values ---\")\n",
"    missing_stats = pd.DataFrame({\n",
"        'Missing_Count': df.isnull().sum(),\n",
"        'Missing_Percentage': (df.isnull().sum() / len(df)) * 100\n",
"    })\n",
"    missing_stats = missing_stats[missing_stats['Missing_Count'] > 0]\n",
"    print(missing_stats.round(2))\n",
"    \n",
"    # Duplicate analysis\n",
"    print(\"\\n--- Duplicates ---\")\n",
"    total_duplicates = df.duplicated().sum()\n",
"    print(f\"Complete duplicate rows: {total_duplicates}\")\n",
"    \n",
"    # Column-specific analysis\n",
"    print(\"\\n--- Column Analysis ---\")\n",
"    for col in df.columns:\n",
"        unique_count = df[col].nunique()\n",
"        unique_percentage = (unique_count / len(df)) * 100\n",
"        print(f\"{col}: {unique_count} unique values ({unique_percentage:.1f}%)\")\n",
"    \n",
"    # Data type issues\n",
"    print(\"\\n--- Data Types ---\")\n",
"    print(df.dtypes)\n",
"    \n",
"    return missing_stats, total_duplicates\n",
"\n",
"# Assess the messy data\n",
"missing_stats, duplicate_count = assess_data_quality(df_messy)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Identify specific data quality issues\n",
"def identify_issues(df):\n",
"    \"\"\"Identify specific data quality issues\"\"\"\n",
"    issues = []\n",
"    \n",
"    # Check for inconsistent formatting\n",
"    print(\"=== SPECIFIC ISSUES IDENTIFIED ===\")\n",
"    \n",
"    # Customer ID formatting\n",
"    id_patterns = df['customer_id'].str.extract(r'(CUST|cust)(\\d+)').fillna('')\n",
"    inconsistent_ids = (id_patterns[0] == 'cust').sum()\n",
"    print(f\"Inconsistent customer ID format: {inconsistent_ids} records\")\n",
"    \n",
"    # Email validation\n",
"    email_pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$'\n",
"    invalid_emails = ~df['email'].str.match(email_pattern, na=False)\n",
"    print(f\"Invalid email formats: {invalid_emails.sum()} records\")\n",
"    \n",
"    # Negative purchase amounts\n",
"    negative_amounts = (df['purchase_amount'] < 0).sum()\n",
"    print(f\"Negative purchase amounts: {negative_amounts} records\")\n",
"    \n",
"    # Invalid satisfaction scores\n",
"    invalid_scores = ((df['satisfaction_score'] < 1) | (df['satisfaction_score'] > 5)) & df['satisfaction_score'].notna()\n",
"    print(f\"Invalid satisfaction scores: {invalid_scores.sum()} records\")\n",
"    \n",
"    # Category inconsistencies\n",
"    category_variations = df['category'].value_counts()\n",
"    print(f\"\\nCategory variations: {len(category_variations)} different values\")\n",
"    print(category_variations)\n",
"    \n",
"    return issues\n",
"\n",
"issues = identify_issues(df_messy)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Text Data Standardization\n",
"\n",
"Clean and standardize text fields."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Text cleaning functions\n",
"def clean_text_data(df):\n",
"    \"\"\"Comprehensive text data cleaning\"\"\"\n",
"    df_clean = df.copy()\n",
"    \n",
"    # Standardize customer names\n",
"    print(\"Cleaning customer names...\")\n",
"    df_clean['customer_name_clean'] = df_clean['customer_name'].str.strip()  # Remove surrounding whitespace\n",
"    df_clean['customer_name_clean'] = df_clean['customer_name_clean'].str.title()  # Title case\n",
"    df_clean['customer_name_clean'] = df_clean['customer_name_clean'].str.replace(r'\\s+', ' ', regex=True)  # Collapse multiple spaces\n",
"    \n",
"    # Standardize customer IDs\n",
"    print(\"Standardizing customer IDs...\")\n",
"    df_clean['customer_id_clean'] = df_clean['customer_id'].str.upper()  # Uppercasing gives every ID the consistent 'CUST' prefix\n",
"    \n",
"    # Clean email addresses\n",
"    print(\"Cleaning email addresses...\")\n",
"    df_clean['email_clean'] = df_clean['email'].str.lower()  # Lowercase\n",
"    df_clean['email_clean'] = df_clean['email_clean'].str.strip()  # Remove whitespace\n",
"    df_clean['email_clean'] = df_clean['email_clean'].str.replace(r'\\.{2,}', '.', regex=True)  # Collapse repeated dots\n",
"    \n",
"    # Standardize categories\n",
"    print(\"Standardizing categories...\")\n",
"    category_mapping = {\n",
"        'electronics': 'Electronics',\n",
"        'ELECTRONICS': 'Electronics',\n",
"        'books': 'Books',\n",
"        'clothing': 'Clothing',\n",
"        'CLOTHING': 'Clothing',\n",
"        'home&garden': 'Home & Garden',\n",
"        'Home & Garden': 'Home & Garden'\n",
"    }\n",
"    df_clean['category_clean'] = df_clean['category'].map(category_mapping).fillna(df_clean['category'])\n",
"    \n",
"    return df_clean\n",
"\n",
"# Apply text cleaning\n",
"df_text_clean = clean_text_data(df_messy)\n",
"\n",
"print(\"\\nText cleaning comparison:\")\n",
"comparison_cols = ['customer_name', 'customer_name_clean', 'customer_id', 'customer_id_clean',\n",
"                   'email', 'email_clean', 'category', 'category_clean']\n",
"print(df_text_clean[comparison_cols].head(10))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Advanced text cleaning with regex\n",
"def advanced_text_cleaning(df):\n",
"    \"\"\"Advanced text cleaning using regular expressions\"\"\"\n",
"    df_advanced = df.copy()\n",
"    \n",
"    # Extract and standardize address components\n",
"    print(\"Processing addresses...\")\n",
"    # Basic address pattern: number street, city, state zipcode\n",
"    address_pattern = r'(\\d+)\\s+([^,]+),\\s*([^,]+),\\s*([A-Z]{2})\\s+(\\d{5})'\n",
"    address_parts = df_advanced['address'].str.extract(address_pattern)\n",
"    address_parts.columns = ['street_number', 'street_name', 'city', 'state', 'zipcode']\n",
"    \n",
"    # Title-case street names and cities\n",
"    address_parts['street_name'] = address_parts['street_name'].str.title()\n",
"    address_parts['city'] = address_parts['city'].str.title()\n",
"    \n",
"    # Combine cleaned parts\n",
"    df_advanced['address_clean'] = (\n",
"        address_parts['street_number'] + ' ' + address_parts['street_name'] + ', ' +\n",
"        address_parts['city'] + ', ' + address_parts['state'] + ' ' + address_parts['zipcode']\n",
"    )\n",
"    \n",
"    # Add individual address components\n",
"    for col in address_parts.columns:\n",
"        df_advanced[col] = address_parts[col]\n",
"    \n",
"    return df_advanced\n",
"\n",
"# Apply advanced cleaning\n",
"df_advanced_clean = advanced_text_cleaning(df_text_clean)\n",
"\n",
"print(\"Address cleaning results:\")\n",
"print(df_advanced_clean[['address', 'address_clean', 'city', 'state', 'zipcode']].head())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Phone Number Standardization\n",
"\n",
"Clean and standardize phone numbers using regex patterns."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def standardize_phone_numbers(df):\n",
"    \"\"\"Standardize phone numbers to consistent format\"\"\"\n",
"    df_phone = df.copy()\n",
"    \n",
"    def clean_phone(phone):\n",
"        \"\"\"Clean individual phone number\"\"\"\n",
"        if pd.isna(phone):\n",
"            return None\n",
"        \n",
"        # Remove all non-digit characters\n",
"        digits_only = re.sub(r'\\D', '', str(phone))\n",
"        \n",
"        # Handle different formats\n",
"        if len(digits_only) == 10:\n",
"            # Format as (XXX) XXX-XXXX\n",
"            return f\"({digits_only[:3]}) {digits_only[3:6]}-{digits_only[6:]}\"\n",
"        elif len(digits_only) == 11 and digits_only.startswith('1'):\n",
"            # Remove country code and format\n",
"            phone_part = digits_only[1:]\n",
"            return f\"({phone_part[:3]}) {phone_part[3:6]}-{phone_part[6:]}\"\n",
"        else:\n",
"            # Invalid phone number\n",
"            return 'INVALID'\n",
"    \n",
"    # Apply phone cleaning\n",
"    df_phone['phone_clean'] = df_phone['phone'].apply(clean_phone)\n",
"    \n",
"    # Extract area code (expand=False returns a Series for the single group)\n",
"    df_phone['area_code'] = df_phone['phone_clean'].str.extract(r'\\((\\d{3})\\)', expand=False)\n",
"    \n",
"    # Flag invalid phone numbers\n",
"    df_phone['phone_is_valid'] = df_phone['phone_clean'] != 'INVALID'\n",
"    \n",
"    return df_phone\n",
"\n",
"# Apply phone standardization\n",
"df_phone_clean = standardize_phone_numbers(df_advanced_clean)\n",
"\n",
"print(\"Phone number standardization:\")\n",
"print(df_phone_clean[['phone', 'phone_clean', 'area_code', 'phone_is_valid']].head(15))\n",
"\n",
"print(\"\\nPhone validation summary:\")\n",
"print(df_phone_clean['phone_is_valid'].value_counts())\n",
"\n",
"print(\"\\nArea code distribution:\")\n",
"print(df_phone_clean['area_code'].value_counts().head())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Date Standardization\n",
"\n",
"Parse and standardize dates from various formats."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def standardize_dates(df):\n",
"    \"\"\"Parse and standardize dates from multiple formats\"\"\"\n",
"    df_dates = df.copy()\n",
"    \n",
"    def parse_date(date_str):\n",
"        \"\"\"Try to parse date from various formats\"\"\"\n",
"        if pd.isna(date_str):\n",
"            return None\n",
"        \n",
"        date_str = str(date_str).strip()\n",
"        \n",
"        # Common date formats to try (order matters: the first match wins)\n",
"        formats = [\n",
"            '%Y-%m-%d',   # 2024-01-15 (also matches 2024-1-17)\n",
"            '%m/%d/%Y',   # 01/16/2024\n",
"            '%d-%m-%Y',   # 16-01-2024\n",
"            '%Y/%m/%d',   # 2024/01/18\n",
"            '%B %d, %Y',  # January 19, 2024\n",
"            '%m-%d-%y',   # 01-21-24\n",
"            '%Y.%m.%d',   # 2024.01.22\n",
"            '%d/%m/%Y'    # 23/01/2024\n",
"        ]\n",
"        \n",
"        for fmt in formats:\n",
"            try:\n",
"                return pd.to_datetime(date_str, format=fmt)\n",
"            except ValueError:\n",
"                continue\n",
"        \n",
"        # If all else fails, try pandas' flexible parser\n",
"        try:\n",
"            return pd.to_datetime(date_str)\n",
"        except (ValueError, TypeError):\n",
"            return None\n",
"    \n",
"    # Apply date parsing\n",
"    print(\"Parsing dates...\")\n",
"    df_dates['purchase_date_clean'] = df_dates['purchase_date'].apply(parse_date)\n",
"    \n",
"    # Flag unparseable dates\n",
"    df_dates['date_is_valid'] = df_dates['purchase_date_clean'].notna()\n",
"    \n",
"    # Extract date components for valid dates\n",
"    df_dates['purchase_year'] = df_dates['purchase_date_clean'].dt.year\n",
"    df_dates['purchase_month'] = df_dates['purchase_date_clean'].dt.month\n",
"    df_dates['purchase_day'] = df_dates['purchase_date_clean'].dt.day\n",
"    df_dates['purchase_day_of_week'] = df_dates['purchase_date_clean'].dt.day_name()\n",
"    \n",
"    return df_dates\n",
"\n",
"# Apply date standardization\n",
"df_date_clean = standardize_dates(df_phone_clean)\n",
"\n",
"print(\"Date standardization results:\")\n",
"print(df_date_clean[['purchase_date', 'purchase_date_clean', 'date_is_valid',\n",
"                     'purchase_year', 'purchase_month', 'purchase_day_of_week']].head(15))\n",
"\n",
"print(\"\\nDate parsing summary:\")\n",
"print(df_date_clean['date_is_valid'].value_counts())\n",
"\n",
"invalid_dates = df_date_clean[~df_date_clean['date_is_valid']]['purchase_date'].unique()\n",
"if len(invalid_dates) > 0:\n",
"    print(f\"\\nInvalid date formats found: {invalid_dates}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. Numerical Data Cleaning\n",
"\n",
"Handle outliers, invalid values, and missing numerical data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def clean_numerical_data(df):\n",
"    \"\"\"Clean and validate numerical data\"\"\"\n",
"    df_numeric = df.copy()\n",
"    \n",
"    # Clean purchase amounts\n",
"    print(\"Cleaning purchase amounts...\")\n",
"    \n",
"    # Flag invalid values\n",
"    df_numeric['amount_is_valid'] = (\n",
"        df_numeric['purchase_amount'].notna() &\n",
"        (df_numeric['purchase_amount'] >= 0) &\n",
"        (df_numeric['purchase_amount'] <= 10000)  # Reasonable upper limit\n",
"    )\n",
"    \n",
"    # Replace invalid values with NaN\n",
"    df_numeric['purchase_amount_clean'] = df_numeric['purchase_amount'].where(\n",
"        df_numeric['amount_is_valid'], np.nan\n",
"    )\n",
"    \n",
"    # Detect outliers using IQR method\n",
"    Q1 = df_numeric['purchase_amount_clean'].quantile(0.25)\n",
"    Q3 = df_numeric['purchase_amount_clean'].quantile(0.75)\n",
"    IQR = Q3 - Q1\n",
"    lower_bound = Q1 - 1.5 * IQR\n",
"    upper_bound = Q3 + 1.5 * IQR\n",
"    \n",
"    df_numeric['amount_is_outlier'] = (\n",
"        (df_numeric['purchase_amount_clean'] < lower_bound) |\n",
"        (df_numeric['purchase_amount_clean'] > upper_bound)\n",
"    )\n",
"    \n",
"    # Clean satisfaction scores\n",
"    print(\"Cleaning satisfaction scores...\")\n",
"    \n",
"    # Valid satisfaction scores are 1-5\n",
"    df_numeric['satisfaction_is_valid'] = (\n",
"        df_numeric['satisfaction_score'].notna() &\n",
"        (df_numeric['satisfaction_score'].between(1, 5))\n",
"    )\n",
"    \n",
"    df_numeric['satisfaction_score_clean'] = df_numeric['satisfaction_score'].where(\n",
"        df_numeric['satisfaction_is_valid'], np.nan\n",
"    )\n",
"    \n",
"    return df_numeric\n",
"\n",
"# Apply numerical cleaning\n",
"df_numeric_clean = clean_numerical_data(df_date_clean)\n",
"\n",
"print(\"Numerical data cleaning results:\")\n",
"print(df_numeric_clean[['purchase_amount', 'purchase_amount_clean', 'amount_is_valid',\n",
"                        'amount_is_outlier', 'satisfaction_score', 'satisfaction_score_clean',\n",
"                        'satisfaction_is_valid']].head(15))\n",
"\n",
"print(\"\\nNumerical data quality summary:\")\n",
"print(f\"Valid purchase amounts: {df_numeric_clean['amount_is_valid'].sum()}/{len(df_numeric_clean)}\")\n",
"print(f\"Outlier amounts: {df_numeric_clean['amount_is_outlier'].sum()}\")\n",
"print(f\"Valid satisfaction scores: {df_numeric_clean['satisfaction_is_valid'].sum()}/{len(df_numeric_clean)}\")\n",
"\n",
"# Show statistics for cleaned data\n",
"print(\"\\nCleaned amount statistics:\")\n",
"print(df_numeric_clean['purchase_amount_clean'].describe())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6. Duplicate Detection and Handling\n",
"\n",
"Identify and handle duplicate records intelligently."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def handle_duplicates(df):\n",
"    \"\"\"Comprehensive duplicate detection and handling\"\"\"\n",
"    df_dedup = df.copy()\n",
"    \n",
"    print(\"=== DUPLICATE ANALYSIS ===\")\n",
"    \n",
"    # 1. Exact duplicates\n",
"    exact_duplicates = df_dedup.duplicated()\n",
"    print(f\"Exact duplicate rows: {exact_duplicates.sum()}\")\n",
"    \n",
"    # 2. Duplicates based on key columns (likely same customer)\n",
"    key_cols = ['customer_name_clean', 'email_clean']\n",
"    key_duplicates = df_dedup.duplicated(subset=key_cols, keep=False)\n",
"    print(f\"Duplicate customers (by name/email): {key_duplicates.sum()}\")\n",
"    \n",
"    # 3. Near duplicates (similar but not exact)\n",
"    # For demonstration, we'll check phone numbers\n",
"    phone_duplicates = df_dedup.duplicated(subset=['phone_clean'], keep=False)\n",
"    print(f\"Duplicate phone numbers: {phone_duplicates.sum()}\")\n",
"    \n",
"    # Show duplicate examples\n",
"    if key_duplicates.any():\n",
"        print(\"\\nExample duplicate customers:\")\n",
"        duplicate_customers = df_dedup[key_duplicates].sort_values(key_cols)\n",
"        print(duplicate_customers[key_cols + ['customer_id_clean', 'purchase_amount_clean']].head(10))\n",
"    \n",
"    # Remove exact duplicates\n",
"    print(f\"\\nRemoving {exact_duplicates.sum()} exact duplicates...\")\n",
"    df_no_exact_dups = df_dedup[~exact_duplicates]\n",
"    \n",
"    # For customer duplicates, keep the one with the highest purchase amount\n",
"    print(\"Handling customer duplicates (keeping highest purchase)...\")\n",
"    df_final = df_no_exact_dups.sort_values('purchase_amount_clean', ascending=False).drop_duplicates(\n",
"        subset=key_cols, keep='first'\n",
"    )\n",
"    \n",
"    print(f\"Final dataset size after deduplication: {len(df_final)} (was {len(df)})\")\n",
"    \n",
"    return df_final\n",
"\n",
"# Apply duplicate handling\n",
"df_deduplicated = handle_duplicates(df_numeric_clean)\n",
"\n",
"print(f\"\\nRows removed: {len(df_numeric_clean) - len(df_deduplicated)}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 7. Data Validation and Quality Scores\n",
"\n",
"Create comprehensive data quality metrics."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def calculate_quality_scores(df):\n",
"    \"\"\"Calculate comprehensive data quality scores\"\"\"\n",
"    df_quality = df.copy()\n",
"    \n",
"    # Define quality checks\n",
"    quality_checks = {\n",
"        'has_customer_name': df_quality['customer_name_clean'].notna(),\n",
"        'has_valid_email': df_quality['email_clean'].notna() &\n",
"                           df_quality['email_clean'].str.match(r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$', na=False),\n",
"        'has_valid_phone': df_quality['phone_is_valid'],\n",
"        'has_valid_date': df_quality['date_is_valid'],\n",
"        'has_valid_amount': df_quality['amount_is_valid'],\n",
"        'has_valid_satisfaction': df_quality['satisfaction_is_valid'],\n",
"        'amount_not_outlier': ~df_quality['amount_is_outlier'],\n",
"        'has_complete_address': df_quality['city'].notna() & df_quality['state'].notna() & df_quality['zipcode'].notna()\n",
"    }\n",
"    \n",
"    # Add individual quality flags\n",
"    for check_name, check_result in quality_checks.items():\n",
"        df_quality[f'quality_{check_name}'] = check_result.astype(int)\n",
"    \n",
"    # Calculate overall quality score (percentage of passed checks)\n",
"    quality_cols = [col for col in df_quality.columns if col.startswith('quality_')]\n",
"    df_quality['data_quality_score'] = df_quality[quality_cols].mean(axis=1) * 100\n",
"    \n",
"    # Categorize quality levels\n",
"    def quality_category(score):\n",
"        if score >= 90:\n",
"            return 'Excellent'\n",
"        elif score >= 75:\n",
"            return 'Good'\n",
"        elif score >= 50:\n",
"            return 'Fair'\n",
"        else:\n",
"            return 'Poor'\n",
"    \n",
"    df_quality['quality_category'] = df_quality['data_quality_score'].apply(quality_category)\n",
"    \n",
"    return df_quality, quality_checks\n",
"\n",
"# Calculate quality scores\n",
"df_with_quality, quality_checks = calculate_quality_scores(df_deduplicated)\n",
"\n",
"print(\"Data quality analysis:\")\n",
"print(df_with_quality[['customer_name_clean', 'data_quality_score', 'quality_category']].head(10))\n",
"\n",
"print(\"\\nQuality category distribution:\")\n",
"print(df_with_quality['quality_category'].value_counts())\n",
"\n",
"print(\"\\nAverage quality scores by check:\")\n",
"quality_summary = {}\n",
"for check_name in quality_checks.keys():\n",
"    col_name = f'quality_{check_name}'\n",
"    quality_summary[check_name] = df_with_quality[col_name].mean() * 100\n",
"\n",
"quality_df = pd.DataFrame(list(quality_summary.items()), columns=['Quality_Check', 'Pass_Rate_%'])\n",
"quality_df = quality_df.sort_values('Pass_Rate_%', ascending=False)\n",
"print(quality_df.round(1))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Practice Exercises\n",
"\n",
"Apply advanced data cleaning techniques to challenging scenarios:"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [],
"source": [
"# Exercise 1: Create a custom validation function\n",
"# Build a function that validates business rules:\n",
"# - Email domains should be from approved list\n",
"# - Purchase amounts should be within reasonable ranges by category\n",
"# - Dates should be within business operating period\n",
"# - Customer IDs should follow specific format patterns\n",
"\n",
"def validate_business_rules(df):\n",
"    \"\"\"Validate business-specific rules\"\"\"\n",
"    # Your implementation here\n",
"    pass\n",
"\n",
"# validation_results = validate_business_rules(df_with_quality)\n",
"# print(validation_results)"
]
},
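{
"cell_type": "markdown",
"metadata": {},
"source": [
"One possible starting point for Exercise 1 is sketched below. The approved domains, per-category amount ranges, and operating period are illustrative assumptions, not values given in the lesson; substitute rules from your own business context."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: business-rule validation with assumed rules; adapt to your own business\n",
"APPROVED_DOMAINS = {'email.com', 'company.org', 'test.com'}  # assumption\n",
"AMOUNT_RANGES = {\n",
"    'Electronics': (10, 5000), 'Books': (1, 500),\n",
"    'Clothing': (5, 1000), 'Home & Garden': (5, 2000)\n",
"}  # assumption\n",
"OPERATING_PERIOD = ('2024-01-01', '2024-12-31')  # assumption\n",
"\n",
"def validate_business_rules_sketch(df):\n",
"    \"\"\"Return one boolean column per business rule.\"\"\"\n",
"    checks = pd.DataFrame(index=df.index)\n",
"    # Email domain must be on the approved list\n",
"    domain = df['email_clean'].str.split('@').str[-1]\n",
"    checks['approved_domain'] = domain.isin(APPROVED_DOMAINS)\n",
"    # Purchase amount must fall within its category's range\n",
"    low = df['category_clean'].map({k: v[0] for k, v in AMOUNT_RANGES.items()})\n",
"    high = df['category_clean'].map({k: v[1] for k, v in AMOUNT_RANGES.items()})\n",
"    checks['amount_in_range'] = df['purchase_amount_clean'].between(low, high)\n",
"    # Purchase date must fall inside the operating period\n",
"    start, end = pd.Timestamp(OPERATING_PERIOD[0]), pd.Timestamp(OPERATING_PERIOD[1])\n",
"    checks['date_in_period'] = df['purchase_date_clean'].between(start, end)\n",
"    # Customer ID must match the CUST#### pattern\n",
"    checks['id_format_ok'] = df['customer_id_clean'].str.match(r'^CUST\\d{4}$', na=False)\n",
"    return checks\n",
"\n",
"rule_checks = validate_business_rules_sketch(df_with_quality)\n",
"print(rule_checks.mean().round(3))  # pass rate per rule"
]
},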
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [],
"source": [
"# Exercise 2: Advanced duplicate detection\n",
"# Implement fuzzy matching for near-duplicate detection:\n",
"# - Similar names (edit distance)\n",
"# - Similar addresses\n",
"# - Similar email patterns\n",
"\n",
"# Your code here:\n"
]
},
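{
"cell_type": "markdown",
"metadata": {},
"source": [
"One way to begin Exercise 2, using `difflib.SequenceMatcher` from the standard library as an edit-distance-style similarity measure. The similarity threshold is an assumption to tune, and the pairwise loop is only practical for small name lists:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: near-duplicate names via string similarity (difflib, standard library).\n",
"# The threshold is an assumption; O(n^2) pairwise comparison suits small lists only.\n",
"from difflib import SequenceMatcher\n",
"\n",
"def similar_name_pairs(names, threshold=0.85):\n",
"    \"\"\"Return pairs of distinct names whose similarity ratio meets the threshold.\"\"\"\n",
"    names = sorted(set(n for n in names if pd.notna(n)))\n",
"    pairs = []\n",
"    for i, a in enumerate(names):\n",
"        for b in names[i + 1:]:\n",
"            ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()\n",
"            if ratio >= threshold:\n",
"                pairs.append((a, b, round(ratio, 3)))\n",
"    return pairs\n",
"\n",
"print(similar_name_pairs(df_with_quality['customer_name_clean'], threshold=0.8)[:10])"
]
},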
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [],
"source": [
"# Exercise 3: Data cleaning metrics dashboard\n",
"# Create a comprehensive data quality dashboard that shows:\n",
"# - Data quality trends over time\n",
"# - Field-by-field quality scores\n",
"# - Impact of cleaning steps\n",
"# - Recommendations for further improvement\n",
"\n",
"# Your code here:\n"
]
},
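{
"cell_type": "markdown",
"metadata": {},
"source": [
"A compact starting point for Exercise 3: a text-only dashboard built from the `quality_*` flags computed earlier. A fuller version might plot these (for example with matplotlib) and track scores across cleaning runs over time:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: a text-based quality dashboard from the existing quality_* flags\n",
"quality_cols = [c for c in df_with_quality.columns if c.startswith('quality_')]\n",
"\n",
"print('=== DATA QUALITY DASHBOARD ===')\n",
"print(f\"Records: {len(df_with_quality)}\")\n",
"print(f\"Mean quality score: {df_with_quality['data_quality_score'].mean():.1f}%\\n\")\n",
"\n",
"# Field-by-field pass rates, worst first, with a crude text bar\n",
"pass_rates = (df_with_quality[quality_cols].mean() * 100).sort_values()\n",
"for check, rate in pass_rates.items():\n",
"    bar = '#' * int(rate // 5)\n",
"    print(f\"{check:40s} {rate:5.1f}% {bar}\")\n",
"\n",
"# Simple recommendation: flag the weakest checks for further cleaning\n",
"print('\\nLowest pass rates (candidates for further cleaning):')\n",
"print(pass_rates.head(3).round(1))"
]
},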
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Key Takeaways\n",
"\n",
"1. **Assessment First**: Always assess data quality before cleaning\n",
"2. **Systematic Approach**: Use a structured pipeline for consistent results\n",
"3. **Preserve Original Data**: Keep original values while creating cleaned versions\n",
"4. **Document Everything**: Log all cleaning steps and decisions\n",
"5. **Validation**: Implement business rule validation\n",
"6. **Quality Metrics**: Measure and track data quality improvements\n",
"7. **Reusable Pipeline**: Create automated, configurable cleaning processes (see the sketch after Best Practices)\n",
"8. **Context Matters**: Consider domain-specific requirements\n",
"\n",
"## Common Data Issues and Solutions\n",
"\n",
"| Issue | Detection Method | Solution |\n",
"|-------|-----------------|----------|\n",
"| Inconsistent Format | Pattern analysis | Standardization rules |\n",
"| Missing Values | `.isnull()` | Imputation or flagging |\n",
"| Duplicates | `.duplicated()` | Deduplication logic |\n",
"| Outliers | Statistical methods | Capping or flagging |\n",
"| Invalid Values | Business rules | Validation and correction |\n",
"| Inconsistent Naming | String analysis | Normalization |\n",
"| Date Issues | Parsing attempts | Multiple format handling |\n",
"| Text Issues | Regex patterns | Cleaning and standardization |\n",
"\n",
"## Best Practices\n",
"\n",
"1. **Start with Exploration**: Understand your data before cleaning\n",
"2. **Preserve Traceability**: Keep original and cleaned versions\n",
"3. **Validate Assumptions**: Test cleaning rules on sample data\n",
"4. **Measure Impact**: Quantify improvements from cleaning\n",
"5. **Automate When Possible**: Build reusable cleaning pipelines\n",
"6. **Handle Edge Cases**: Plan for unusual but valid data\n",
"7. **Business Context**: Include domain experts in rule definition\n",
"8. **Iterative Process**: Refine cleaning rules based on results\n"
]
}
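,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Putting It Together: A Reusable Pipeline\n",
"\n",
"As a closing illustration of takeaway 7, here is a minimal sketch that chains the cleaning functions defined in this lesson into one reusable pipeline. The step ordering mirrors the lesson; it is one reasonable arrangement, not the only one."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: chain the cleaning steps defined above into one reusable pipeline\n",
"CLEANING_STEPS = [\n",
"    clean_text_data,\n",
"    advanced_text_cleaning,\n",
"    standardize_phone_numbers,\n",
"    standardize_dates,\n",
"    clean_numerical_data,\n",
"    handle_duplicates,\n",
"]\n",
"\n",
"def run_cleaning_pipeline(df, steps=CLEANING_STEPS):\n",
"    \"\"\"Apply each cleaning step in order, logging the shape after each one.\"\"\"\n",
"    result = df.copy()\n",
"    for step in steps:\n",
"        result = step(result)\n",
"        print(f\"After {step.__name__}: {result.shape}\")\n",
"    return result\n",
"\n",
"df_pipeline_clean = run_cleaning_pipeline(df_messy)"
]
}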
],
"metadata": {
"kernelspec": {
"display_name": "venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.13.3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}