{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Session 1 - DataFrames - Lesson 13: Advanced Data Cleaning\n", "\n", "## Learning Objectives\n", "- Master advanced techniques for data cleaning and validation\n", "- Learn to detect and handle various types of data quality issues\n", "- Understand data standardization and normalization techniques\n", "- Practice with real-world messy data scenarios\n", "- Develop automated data cleaning pipelines\n", "\n", "## Prerequisites\n", "- Completed previous lessons on DataFrames\n", "- Understanding of basic data cleaning concepts\n", "- Familiarity with regular expressions (helpful but not required)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Import required libraries\n", "import pandas as pd\n", "import numpy as np\n", "import re\n", "from datetime import datetime, timedelta\n", "import warnings\n", "warnings.filterwarnings('ignore')\n", "\n", "# Display settings\n", "pd.set_option('display.max_columns', None)\n", "pd.set_option('display.max_rows', 100)\n", "\n", "print(\"Libraries loaded successfully!\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creating Messy Sample Data\n", "\n", "Let's create a realistic messy dataset to practice advanced cleaning techniques." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "\n", "# Create intentionally messy data that mimics real-world issues\n", "np.random.seed(42)\n", "\n", "# Base data\n", "n_records = 200\n", "messy_data = {\n", " 'customer_id': [f'CUST{i:04d}' if i % 10 != 0 else f'cust{i:04d}' for i in range(1, n_records + 1)],\n", " 'customer_name': [\n", " 'John Smith', 'jane doe', 'MARY JOHNSON', 'bob wilson', 'Sarah Davis',\n", " 'Mike Brown', 'lisa garcia', 'DAVID MILLER', 'Amy Wilson', 'Tom Anderson'\n", " ] * 20,\n", " 'email': [\n", " 'john.smith@email.com', 'JANE.DOE@EMAIL.COM', 'mary@company.org',\n", " 'bob..wilson@test.com', 'sarah@invalid-email', 'mike@email.com',\n", " 'lisa.garcia@email.com', 'david@company.org', 'amy@email.com', 'tom@test.com'\n", " ] * 20,\n", " 'phone': [\n", " '(555) 123-4567', '555.987.6543', '5551234567', '555-987-6543',\n", " '(555)123-4567', '+1-555-123-4567', '555 123 4567', '5559876543',\n", " '(555) 987 6543', '555-123-4567'\n", " ] * 20,\n", " 'address': [\n", " '123 Main St, Anytown, NY 12345', '456 Oak Ave, Boston, MA 02101',\n", " '789 Pine Rd, Los Angeles, CA 90210', '321 Elm St, Chicago, IL 60601',\n", " '654 Maple Dr, Houston, TX 77001', '987 Cedar Ln, Phoenix, AZ 85001',\n", " '147 Birch Way, Philadelphia, PA 19101', '258 Ash Ct, San Antonio, TX 78201',\n", " '369 Walnut St, San Diego, CA 92101', '741 Cherry Ave, Dallas, TX 75201'\n", " ] * 20,\n", " 'purchase_amount': np.random.normal(100, 30, n_records).round(2),\n", " 'purchase_date': [\n", " '2024-01-15', '01/16/2024', '2024-1-17', '16-01-2024', '2024/01/18',\n", " 'January 19, 2024', '2024-01-20', '01-21-24', '2024.01.22', '23/01/2024'\n", " ] * 20,\n", " 'category': [\n", " 'Electronics', 'electronics', 'ELECTRONICS', 'Books', 'books',\n", " 'Clothing', 'clothing', 'CLOTHING', 'Home & Garden', 'home&garden'\n", " ] * 20,\n", " 'satisfaction_score': np.random.choice([1, 2, 3, 4, 5, 99, -1, None], n_records, p=[0.05, 0.1, 0.15, 0.35, 0.3, 0.02, 0.02, 0.01])\n", "}\n", "\n", "# Convert to DataFrame first\n", "df_messy = pd.DataFrame(messy_data)\n", "\n", "# Introduce missing values and anomalies using 
proper indexing\n", "df_messy.loc[df_messy.index[::25], 'customer_name'] = None # Some missing names\n", "df_messy.loc[df_messy.index[::30], 'email'] = None # Some missing emails\n", "df_messy.loc[df_messy.index[::35], 'purchase_amount'] = np.nan # Some missing amounts\n", "df_messy.loc[df_messy.index[::40], 'purchase_amount'] = -999 # Invalid negative values\n", "\n", "# Add some duplicate records\n", "duplicate_indices = [0, 1, 2, 3, 4]\n", "duplicate_rows = df_messy.iloc[duplicate_indices].copy()\n", "df_messy = pd.concat([df_messy, duplicate_rows], ignore_index=True)\n", "\n", "print(\"Messy dataset created:\")\n", "print(f\"Shape: {df_messy.shape}\")\n", "print(\"\\nFirst few rows:\")\n", "print(df_messy.head(10))\n", "print(\"\\nData types:\")\n", "print(df_messy.dtypes)\n", "print(\"\\nSample of data quality issues:\")\n", "print(\"\\n1. Missing values:\")\n", "print(df_messy.isnull().sum())\n", "print(\"\\n2. Inconsistent formatting examples:\")\n", "print(\"Customer IDs:\", df_messy['customer_id'].head(15).tolist())\n", "print(\"Customer names:\", df_messy['customer_name'].dropna().head(5).tolist())\n", "print(\"Categories:\", df_messy['category'].unique()[:5])\n", "print(\"\\n3. Invalid satisfaction scores:\")\n", "print(\"Unique satisfaction scores:\", sorted(df_messy['satisfaction_score'].dropna().unique()))\n", "print(\"\\n4. Invalid purchase amounts:\")\n", "print(\"Negative amounts:\", df_messy[df_messy['purchase_amount'] < 0]['purchase_amount'].count())\n", "print(\"\\n5. Date format inconsistencies:\")\n", "print(\"Sample dates:\", df_messy['purchase_date'].head(10).tolist())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Data Quality Assessment\n", "\n", "First, let's assess the quality of our messy data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def assess_data_quality(df):\n", " \"\"\"Comprehensive data quality assessment\"\"\"\n", " print(\"=== DATA QUALITY ASSESSMENT ===\")\n", " print(f\"Dataset shape: {df.shape}\")\n", " print(f\"Total cells: {df.size}\")\n", " \n", " # Missing values analysis\n", " print(\"\\n--- Missing Values ---\")\n", " missing_stats = pd.DataFrame({\n", " 'Missing_Count': df.isnull().sum(),\n", " 'Missing_Percentage': (df.isnull().sum() / len(df)) * 100\n", " })\n", " missing_stats = missing_stats[missing_stats['Missing_Count'] > 0]\n", " print(missing_stats.round(2))\n", " \n", " # Duplicate analysis\n", " print(\"\\n--- Duplicates ---\")\n", " total_duplicates = df.duplicated().sum()\n", " print(f\"Complete duplicate rows: {total_duplicates}\")\n", " \n", " # Column-specific analysis\n", " print(\"\\n--- Column Analysis ---\")\n", " for col in df.columns:\n", " unique_count = df[col].nunique()\n", " unique_percentage = (unique_count / len(df)) * 100\n", " print(f\"{col}: {unique_count} unique values ({unique_percentage:.1f}%)\")\n", " \n", " # Data type issues\n", " print(\"\\n--- Data Types ---\")\n", " print(df.dtypes)\n", " \n", " return missing_stats, total_duplicates\n", "\n", "# Assess the messy data\n", "missing_stats, duplicate_count = assess_data_quality(df_messy)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Identify specific data quality issues\n", "def identify_issues(df):\n", " \"\"\"Identify specific data quality issues\"\"\"\n", " issues = []\n", " \n", " # Check for inconsistent formatting\n", " print(\"=== SPECIFIC ISSUES IDENTIFIED ===\")\n", " \n", " # Customer ID formatting\n", " id_patterns = 
df['customer_id'].str.extract(r'(CUST|cust)(\\d+)').fillna('')\n", "    inconsistent_ids = (id_patterns[0] == 'cust').sum()\n", "    print(f\"Inconsistent customer ID format: {inconsistent_ids} records\")\n", "    \n", "    # Email validation\n", "    email_pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$'\n", "    invalid_emails = ~df['email'].str.match(email_pattern, na=False)\n", "    print(f\"Invalid email formats: {invalid_emails.sum()} records\")\n", "    \n", "    # Negative purchase amounts\n", "    negative_amounts = (df['purchase_amount'] < 0).sum()\n", "    print(f\"Negative purchase amounts: {negative_amounts} records\")\n", "    \n", "    # Invalid satisfaction scores\n", "    invalid_scores = ((df['satisfaction_score'] < 1) | (df['satisfaction_score'] > 5)) & df['satisfaction_score'].notna()\n", "    print(f\"Invalid satisfaction scores: {invalid_scores.sum()} records\")\n", "    \n", "    # Category inconsistencies\n", "    category_variations = df['category'].value_counts()\n", "    print(f\"\\nCategory variations: {len(category_variations)} different values\")\n", "    print(category_variations)\n", "    \n", "    return issues\n", "\n", "issues = identify_issues(df_messy)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Text Data Standardization\n", "\n", "Clean and standardize text fields." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Text cleaning functions\n", "def clean_text_data(df):\n", "    \"\"\"Comprehensive text data cleaning\"\"\"\n", "    df_clean = df.copy()\n", "    \n", "    # Standardize customer names\n", "    print(\"Cleaning customer names...\")\n", "    df_clean['customer_name_clean'] = df_clean['customer_name'].str.strip()  # Remove surrounding whitespace\n", "    df_clean['customer_name_clean'] = df_clean['customer_name_clean'].str.title()  # Title case\n", "    df_clean['customer_name_clean'] = df_clean['customer_name_clean'].str.replace(r'\\s+', ' ', regex=True)  # Collapse repeated spaces\n", "    \n", "    # Standardize customer IDs\n", "    print(\"Standardizing customer IDs...\")\n", "    df_clean['customer_id_clean'] = df_clean['customer_id'].str.upper()  # Uppercasing gives every ID the same CUST prefix\n", "    \n", "    # Clean email addresses\n", "    print(\"Cleaning email addresses...\")\n", "    df_clean['email_clean'] = df_clean['email'].str.lower()  # Lowercase\n", "    df_clean['email_clean'] = df_clean['email_clean'].str.strip()  # Remove whitespace\n", "    df_clean['email_clean'] = df_clean['email_clean'].str.replace(r'\\.{2,}', '.', regex=True)  # Collapse repeated dots\n", "    \n", "    # Standardize categories\n", "    print(\"Standardizing categories...\")\n", "    category_mapping = {\n", "        'electronics': 'Electronics',\n", "        'ELECTRONICS': 'Electronics',\n", "        'books': 'Books',\n", "        'clothing': 'Clothing',\n", "        'CLOTHING': 'Clothing',\n", "        'home&garden': 'Home & Garden',\n", "        'Home & Garden': 'Home & Garden'\n", "    }\n", "    df_clean['category_clean'] = df_clean['category'].map(category_mapping).fillna(df_clean['category'])\n", "    \n", "    return df_clean\n", "\n", "# Apply text cleaning\n", "df_text_clean = clean_text_data(df_messy)\n", "\n", "print(\"\\nText cleaning comparison:\")\n", "comparison_cols = ['customer_name', 'customer_name_clean', 'customer_id', 'customer_id_clean',\n", "                   'email', 'email_clean', 'category', 'category_clean']\n", "print(df_text_clean[comparison_cols].head(10))"
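] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Regular expressions do the heavy lifting in the next cell. As a small warm-up, here is `str.extract` with capture groups on two address strings taken from the sample data; the cell that follows applies the same idea to the full `address` column." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Minimal sketch: str.extract turns regex capture groups into DataFrame columns.\n", "import pandas as pd\n", "\n", "sample = pd.Series(['123 Main St, Anytown, NY 12345', '456 Oak Ave, Boston, MA 02101'])\n", "\n", "# One capture group per address component: number, street, city, state, zipcode\n", "pattern = r'(\\d+)\\s+([^,]+),\\s*([^,]+),\\s*([A-Z]{2})\\s+(\\d{5})'\n", "print(sample.str.extract(pattern))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Advanced text cleaning with 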
regex\n", "def advanced_text_cleaning(df):\n", " \"\"\"Advanced text cleaning using regular expressions\"\"\"\n", " df_advanced = df.copy()\n", " \n", " # Extract and standardize address components\n", " print(\"Processing addresses...\")\n", " # Basic address pattern: number street, city, state zipcode\n", " address_pattern = r'(\\d+)\\s+([^,]+),\\s*([^,]+),\\s*([A-Z]{2})\\s+(\\d{5})'\n", " address_parts = df_advanced['address'].str.extract(address_pattern)\n", " address_parts.columns = ['street_number', 'street_name', 'city', 'state', 'zipcode']\n", " \n", " # Clean street names\n", " address_parts['street_name'] = address_parts['street_name'].str.title()\n", " address_parts['city'] = address_parts['city'].str.title()\n", " \n", " # Combine cleaned parts\n", " df_advanced['address_clean'] = (\n", " address_parts['street_number'] + ' ' + address_parts['street_name'] + ', ' +\n", " address_parts['city'] + ', ' + address_parts['state'] + ' ' + address_parts['zipcode']\n", " )\n", " \n", " # Add individual address components\n", " for col in address_parts.columns:\n", " df_advanced[col] = address_parts[col]\n", " \n", " return df_advanced\n", "\n", "# Apply advanced cleaning\n", "df_advanced_clean = advanced_text_cleaning(df_text_clean)\n", "\n", "print(\"Address cleaning results:\")\n", "print(df_advanced_clean[['address', 'address_clean', 'city', 'state', 'zipcode']].head())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Phone Number Standardization\n", "\n", "Clean and standardize phone numbers using regex patterns." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def standardize_phone_numbers(df):\n", " \"\"\"Standardize phone numbers to consistent format\"\"\"\n", " df_phone = df.copy()\n", " \n", " def clean_phone(phone):\n", " \"\"\"Clean individual phone number\"\"\"\n", " if pd.isna(phone):\n", " return None\n", " \n", " # Remove all non-digit characters\n", " digits_only = re.sub(r'\\D', '', str(phone))\n", " \n", " # Handle different formats\n", " if len(digits_only) == 10:\n", " # Format as (XXX) XXX-XXXX\n", " return f\"({digits_only[:3]}) {digits_only[3:6]}-{digits_only[6:]}\"\n", " elif len(digits_only) == 11 and digits_only.startswith('1'):\n", " # Remove country code and format\n", " phone_part = digits_only[1:]\n", " return f\"({phone_part[:3]}) {phone_part[3:6]}-{phone_part[6:]}\"\n", " else:\n", " # Invalid phone number\n", " return 'INVALID'\n", " \n", " # Apply phone cleaning\n", " df_phone['phone_clean'] = df_phone['phone'].apply(clean_phone)\n", " \n", " # Extract area code\n", " df_phone['area_code'] = df_phone['phone_clean'].str.extract(r'\\((\\d{3})\\)')\n", " \n", " # Flag invalid phone numbers\n", " df_phone['phone_is_valid'] = df_phone['phone_clean'] != 'INVALID'\n", " \n", " return df_phone\n", "\n", "# Apply phone standardization\n", "df_phone_clean = standardize_phone_numbers(df_advanced_clean)\n", "\n", "print(\"Phone number standardization:\")\n", "print(df_phone_clean[['phone', 'phone_clean', 'area_code', 'phone_is_valid']].head(15))\n", "\n", "print(\"\\nPhone validation summary:\")\n", "print(df_phone_clean['phone_is_valid'].value_counts())\n", "\n", "print(\"\\nArea code distribution:\")\n", "print(df_phone_clean['area_code'].value_counts().head())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Date Standardization\n", "\n", "Parse and standardize dates from various formats." 
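] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before writing a custom parser, it is worth seeing how a plain `pd.to_datetime` call copes with these mixed formats. The next cell is a minimal sketch over a handful of the date strings from this dataset; exactly which ones parse depends on your pandas version, which is why `standardize_dates` below tries explicit formats one at a time." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Minimal sketch: feed a few of this dataset's date formats to the default parser.\n", "import pandas as pd\n", "\n", "mixed_dates = pd.Series([\n", "    '2024-01-15', '01/16/2024', '16-01-2024',\n", "    'January 19, 2024', '2024.01.22', '23/01/2024'\n", "])\n", "\n", "# errors='coerce' turns anything the parser cannot handle into NaT instead of raising.\n", "# Depending on the pandas version, several of these may come back as NaT.\n", "print(pd.to_datetime(mixed_dates, errors='coerce'))"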
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def standardize_dates(df):\n", "    \"\"\"Parse and standardize dates from multiple formats\"\"\"\n", "    df_dates = df.copy()\n", "    \n", "    def parse_date(date_str):\n", "        \"\"\"Try to parse a date from several known formats\"\"\"\n", "        if pd.isna(date_str):\n", "            return None\n", "        \n", "        date_str = str(date_str).strip()\n", "        \n", "        # Common date formats to try\n", "        formats = [\n", "            '%Y-%m-%d',   # 2024-01-15 (also handles 2024-1-17)\n", "            '%m/%d/%Y',   # 01/16/2024\n", "            '%d-%m-%Y',   # 16-01-2024\n", "            '%Y/%m/%d',   # 2024/01/18\n", "            '%B %d, %Y',  # January 19, 2024\n", "            '%m-%d-%y',   # 01-21-24\n", "            '%Y.%m.%d',   # 2024.01.22\n", "            '%d/%m/%Y'    # 23/01/2024\n", "        ]\n", "        \n", "        for fmt in formats:\n", "            try:\n", "                return pd.to_datetime(date_str, format=fmt)\n", "            except ValueError:\n", "                continue\n", "        \n", "        # If all else fails, fall back to pandas' flexible parser;\n", "        # errors='coerce' returns NaT for anything it cannot parse\n", "        return pd.to_datetime(date_str, errors='coerce')\n", "    \n", "    # Apply date parsing\n", "    print(\"Parsing dates...\")\n", "    df_dates['purchase_date_clean'] = df_dates['purchase_date'].apply(parse_date)\n", "    \n", "    # Flag unparseable dates\n", "    df_dates['date_is_valid'] = df_dates['purchase_date_clean'].notna()\n", "    \n", "    # Extract date components for valid dates\n", "    df_dates['purchase_year'] = df_dates['purchase_date_clean'].dt.year\n", "    df_dates['purchase_month'] = df_dates['purchase_date_clean'].dt.month\n", "    df_dates['purchase_day'] = df_dates['purchase_date_clean'].dt.day\n", "    df_dates['purchase_day_of_week'] = df_dates['purchase_date_clean'].dt.day_name()\n", "    \n", "    return df_dates\n", "\n", "# Apply date standardization\n", "df_date_clean = standardize_dates(df_phone_clean)\n", "\n", "print(\"Date standardization results:\")\n", "print(df_date_clean[['purchase_date', 'purchase_date_clean', 'date_is_valid',\n", "                     'purchase_year', 'purchase_month', 'purchase_day_of_week']].head(15))\n", "\n", "print(\"\\nDate parsing summary:\")\n", "print(df_date_clean['date_is_valid'].value_counts())\n", "\n", "invalid_dates = df_date_clean[~df_date_clean['date_is_valid']]['purchase_date'].unique()\n", "if len(invalid_dates) > 0:\n", "    print(f\"\\nInvalid date formats found: {invalid_dates}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. Numerical Data Cleaning\n", "\n", "Handle outliers, invalid values, and missing numerical data."
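] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a warm-up, the next cell shows the interquartile-range (IQR) rule in isolation on a tiny hand-made series (the numbers are illustrative, not from the dataset). `clean_numerical_data` below applies the same bounds to `purchase_amount` after masking invalid values." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Minimal sketch of the IQR outlier rule on a small, hand-made series.\n", "import pandas as pd\n", "\n", "amounts = pd.Series([80.0, 95.0, 100.0, 110.0, 120.0, 500.0])\n", "\n", "q1, q3 = amounts.quantile(0.25), amounts.quantile(0.75)\n", "iqr = q3 - q1\n", "lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr\n", "\n", "print(f'Bounds: [{lower:.2f}, {upper:.2f}]')\n", "print('Outliers:')\n", "print(amounts[(amounts < lower) | (amounts > upper)])"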
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def clean_numerical_data(df):\n", " \"\"\"Clean and validate numerical data\"\"\"\n", " df_numeric = df.copy()\n", " \n", " # Clean purchase amounts\n", " print(\"Cleaning purchase amounts...\")\n", " \n", " # Flag invalid values\n", " df_numeric['amount_is_valid'] = (\n", " df_numeric['purchase_amount'].notna() & \n", " (df_numeric['purchase_amount'] >= 0) & \n", " (df_numeric['purchase_amount'] <= 10000) # Reasonable upper limit\n", " )\n", " \n", " # Replace invalid values with NaN\n", " df_numeric['purchase_amount_clean'] = df_numeric['purchase_amount'].where(\n", " df_numeric['amount_is_valid'], np.nan\n", " )\n", " \n", " # Detect outliers using IQR method\n", " Q1 = df_numeric['purchase_amount_clean'].quantile(0.25)\n", " Q3 = df_numeric['purchase_amount_clean'].quantile(0.75)\n", " IQR = Q3 - Q1\n", " lower_bound = Q1 - 1.5 * IQR\n", " upper_bound = Q3 + 1.5 * IQR\n", " \n", " df_numeric['amount_is_outlier'] = (\n", " (df_numeric['purchase_amount_clean'] < lower_bound) |\n", " (df_numeric['purchase_amount_clean'] > upper_bound)\n", " )\n", " \n", " # Clean satisfaction scores\n", " print(\"Cleaning satisfaction scores...\")\n", " \n", " # Valid satisfaction scores are 1-5\n", " df_numeric['satisfaction_is_valid'] = (\n", " df_numeric['satisfaction_score'].notna() &\n", " (df_numeric['satisfaction_score'].between(1, 5))\n", " )\n", " \n", " df_numeric['satisfaction_score_clean'] = df_numeric['satisfaction_score'].where(\n", " df_numeric['satisfaction_is_valid'], np.nan\n", " )\n", " \n", " return df_numeric\n", "\n", "# Apply numerical cleaning\n", "df_numeric_clean = clean_numerical_data(df_date_clean)\n", "\n", "print(\"Numerical data cleaning results:\")\n", "print(df_numeric_clean[['purchase_amount', 'purchase_amount_clean', 'amount_is_valid', \n", " 'amount_is_outlier', 'satisfaction_score', 'satisfaction_score_clean', \n", " 'satisfaction_is_valid']].head(15))\n", "\n", "print(\"\\nNumerical data quality summary:\")\n", "print(f\"Valid purchase amounts: {df_numeric_clean['amount_is_valid'].sum()}/{len(df_numeric_clean)}\")\n", "print(f\"Outlier amounts: {df_numeric_clean['amount_is_outlier'].sum()}\")\n", "print(f\"Valid satisfaction scores: {df_numeric_clean['satisfaction_is_valid'].sum()}/{len(df_numeric_clean)}\")\n", "\n", "# Show statistics for cleaned data\n", "print(\"\\nCleaned amount statistics:\")\n", "print(df_numeric_clean['purchase_amount_clean'].describe())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 6. Duplicate Detection and Handling\n", "\n", "Identify and handle duplicate records intelligently." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def handle_duplicates(df):\n", " \"\"\"Comprehensive duplicate detection and handling\"\"\"\n", " df_dedup = df.copy()\n", " \n", " print(\"=== DUPLICATE ANALYSIS ===\")\n", " \n", " # 1. Exact duplicates\n", " exact_duplicates = df_dedup.duplicated()\n", " print(f\"Exact duplicate rows: {exact_duplicates.sum()}\")\n", " \n", " # 2. Duplicates based on key columns (likely same customer)\n", " key_cols = ['customer_name_clean', 'email_clean']\n", " key_duplicates = df_dedup.duplicated(subset=key_cols, keep=False)\n", " print(f\"Duplicate customers (by name/email): {key_duplicates.sum()}\")\n", " \n", " # 3. 
Near duplicates (similar but not exact)\n", " # For demonstration, we'll check phone numbers\n", " phone_duplicates = df_dedup.duplicated(subset=['phone_clean'], keep=False)\n", " print(f\"Duplicate phone numbers: {phone_duplicates.sum()}\")\n", " \n", " # Show duplicate examples\n", " if key_duplicates.any():\n", " print(\"\\nExample duplicate customers:\")\n", " duplicate_customers = df_dedup[key_duplicates].sort_values(key_cols)\n", " print(duplicate_customers[key_cols + ['customer_id_clean', 'purchase_amount_clean']].head(10))\n", " \n", " # Remove exact duplicates\n", " print(f\"\\nRemoving {exact_duplicates.sum()} exact duplicates...\")\n", " df_no_exact_dups = df_dedup[~exact_duplicates]\n", " \n", " # For customer duplicates, keep the one with the highest purchase amount\n", " print(\"Handling customer duplicates (keeping highest purchase)...\")\n", " df_final = df_no_exact_dups.sort_values('purchase_amount_clean', ascending=False).drop_duplicates(\n", " subset=key_cols, keep='first'\n", " )\n", " \n", " print(f\"Final dataset size after deduplication: {len(df_final)} (was {len(df)})\")\n", " \n", " return df_final\n", "\n", "# Apply duplicate handling\n", "df_deduplicated = handle_duplicates(df_numeric_clean)\n", "\n", "print(f\"\\nRows removed: {len(df_numeric_clean) - len(df_deduplicated)}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 7. Data Validation and Quality Scores\n", "\n", "Create comprehensive data quality metrics." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def calculate_quality_scores(df):\n", " \"\"\"Calculate comprehensive data quality scores\"\"\"\n", " df_quality = df.copy()\n", " \n", " # Define quality checks\n", " quality_checks = {\n", " 'has_customer_name': df_quality['customer_name_clean'].notna(),\n", " 'has_valid_email': df_quality['email_clean'].notna() & \n", " df_quality['email_clean'].str.match(r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$', na=False),\n", " 'has_valid_phone': df_quality['phone_is_valid'] == True,\n", " 'has_valid_date': df_quality['date_is_valid'] == True,\n", " 'has_valid_amount': df_quality['amount_is_valid'] == True,\n", " 'has_valid_satisfaction': df_quality['satisfaction_is_valid'] == True,\n", " 'amount_not_outlier': df_quality['amount_is_outlier'] == False,\n", " 'has_complete_address': df_quality['city'].notna() & df_quality['state'].notna() & df_quality['zipcode'].notna()\n", " }\n", " \n", " # Add individual quality flags\n", " for check_name, check_result in quality_checks.items():\n", " df_quality[f'quality_{check_name}'] = check_result.astype(int)\n", " \n", " # Calculate overall quality score (percentage of passed checks)\n", " quality_cols = [col for col in df_quality.columns if col.startswith('quality_')]\n", " df_quality['data_quality_score'] = df_quality[quality_cols].mean(axis=1) * 100\n", " \n", " # Categorize quality levels\n", " def quality_category(score):\n", " if score >= 90:\n", " return 'Excellent'\n", " elif score >= 75:\n", " return 'Good'\n", " elif score >= 50:\n", " return 'Fair'\n", " else:\n", " return 'Poor'\n", " \n", " df_quality['quality_category'] = df_quality['data_quality_score'].apply(quality_category)\n", " \n", " return df_quality, quality_checks\n", "\n", "# Calculate quality scores\n", "df_with_quality, quality_checks = calculate_quality_scores(df_deduplicated)\n", "\n", "print(\"Data quality analysis:\")\n", "print(df_with_quality[['customer_name_clean', 'data_quality_score', 
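'quality_category']].head(10))\n", "\n", "print(\"\\nQuality category distribution:\")\n", "print(df_with_quality['quality_category'].value_counts())\n", "\n", "print(\"\\nAverage quality scores by check:\")\n", "quality_summary = {}\n", "for check_name in quality_checks.keys():\n", "    col_name = f'quality_{check_name}'\n", "    quality_summary[check_name] = df_with_quality[col_name].mean() * 100\n", "\n", "quality_df = pd.DataFrame(list(quality_summary.items()), columns=['Quality_Check', 'Pass_Rate_%'])\n", "quality_df = quality_df.sort_values('Pass_Rate_%', ascending=False)\n", "print(quality_df.round(1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 8. Putting It Together: An Automated Cleaning Pipeline\n", "\n", "The learning objectives mention automated cleaning pipelines. The cell below is a minimal sketch that chains the functions defined in this lesson; each step takes a DataFrame and returns a DataFrame, so they compose into one reusable call. The function name `run_cleaning_pipeline` is illustrative, not part of any library." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Minimal sketch: chain the lesson's cleaning functions into one pipeline.\n", "# Each step takes and returns a DataFrame, so the list below is easy to reorder or extend.\n", "def run_cleaning_pipeline(raw_df):\n", "    \"\"\"Run every cleaning step defined in this lesson, in order\"\"\"\n", "    steps = [\n", "        clean_text_data,\n", "        advanced_text_cleaning,\n", "        standardize_phone_numbers,\n", "        standardize_dates,\n", "        clean_numerical_data,\n", "        handle_duplicates,\n", "    ]\n", "    df = raw_df\n", "    for step in steps:\n", "        df = step(df)\n", "    return df\n", "\n", "# Uncomment to re-run the whole pipeline on the raw messy data:\n", "# df_pipeline_clean = run_cleaning_pipeline(df_messy)\n", "# print(df_pipeline_clean.shape)" ] }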
, { "cell_type": "markdown", "metadata": {}, "source": [ "## Practice Exercises\n", "\n", "Apply advanced data cleaning techniques to challenging scenarios:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Exercise 1: Create a custom validation function\n", "# Build a function that validates business rules:\n", "# - Email domains should come from an approved list\n", "# - Purchase amounts should fall within a reasonable range for each category\n", "# - Dates should fall within the business's operating period\n", "# - Customer IDs should follow the expected format pattern\n", "\n", "def validate_business_rules(df):\n", "    \"\"\"Validate business-specific rules\"\"\"\n", "    # Your implementation here\n", "    pass\n", "\n", "# validation_results = validate_business_rules(df_with_quality)\n", "# print(validation_results)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Exercise 2: Advanced duplicate detection\n", "# Implement fuzzy matching for near-duplicate detection:\n", "# - Similar names (edit distance)\n", "# - Similar addresses\n", "# - Similar email patterns\n", "\n", "# Your code here:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Exercise 3: Data cleaning metrics dashboard\n", "# Create a comprehensive data quality dashboard that shows:\n", "# - Data quality trends over time\n", "# - Field-by-field quality scores\n", "# - Impact of each cleaning step\n", "# - Recommendations for further improvement\n", "\n", "# Your code here:\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Key Takeaways\n", "\n", "1. **Assessment First**: Always assess data quality before cleaning\n", "2. **Systematic Approach**: Use a structured pipeline for consistent results\n", "3. **Preserve Original Data**: Keep original values while creating cleaned versions\n", "4. **Document Everything**: Log all cleaning steps and decisions\n", "5. **Validation**: Implement business rule validation\n", "6. **Quality Metrics**: Measure and track data quality improvements\n", "7. **Reusable Pipeline**: Create automated, configurable cleaning processes\n", "8. 
**Context Matters**: Consider domain-specific requirements\n", "\n", "## Common Data Issues and Solutions\n", "\n", "| Issue | Detection Method | Solution |\n", "|-------|-----------------|----------|\n", "| Inconsistent Format | Pattern analysis | Standardization rules |\n", "| Missing Values | `.isnull()` | Imputation or flagging |\n", "| Duplicates | `.duplicated()` | Deduplication logic |\n", "| Outliers | Statistical methods | Capping or flagging |\n", "| Invalid Values | Business rules | Validation and correction |\n", "| Inconsistent Naming | String analysis | Normalization |\n", "| Date Issues | Parsing attempts | Multiple format handling |\n", "| Text Issues | Regex patterns | Cleaning and standardization |\n", "\n", "## Best Practices\n", "\n", "1. **Start with Exploration**: Understand your data before cleaning\n", "2. **Preserve Traceability**: Keep original and cleaned versions\n", "3. **Validate Assumptions**: Test cleaning rules on sample data\n", "4. **Measure Impact**: Quantify improvements from cleaning\n", "5. **Automate When Possible**: Build reusable cleaning pipelines\n", "6. **Handle Edge Cases**: Plan for unusual but valid data\n", "7. **Business Context**: Include domain experts in rule definition\n", "8. **Iterative Process**: Refine cleaning rules based on results\n" ] } ], "metadata": { "kernelspec": { "display_name": "venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.3" } }, "nbformat": 4, "nbformat_minor": 4 }