# Session 1 - DataFrames - Lesson 13: Advanced Data Cleaning

## Learning Objectives
- Master advanced techniques for data cleaning and validation
- Learn to detect and handle various types of data quality issues
- Understand data standardization and normalization techniques
- Practice with real-world messy data scenarios
- Develop automated data cleaning pipelines

## Prerequisites
- Completed previous lessons on DataFrames
- Understanding of basic data cleaning concepts
- Familiarity with regular expressions (helpful but not required)

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import re
from datetime import datetime, timedelta
import warnings
warnings.filterwarnings('ignore')

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

print("Libraries loaded successfully!")

## Creating Messy Sample Data

Let's create a realistic messy dataset to practice advanced cleaning techniques.

In [None]:
import pandas as pd
import numpy as np

# Create intentionally messy data that mimics real-world issues
np.random.seed(42)

# Base data
n_records = 200
messy_data = {
    'customer_id': [f'CUST{i:04d}' if i % 10 != 0 else f'cust{i:04d}' for i in range(1, n_records + 1)],
    'customer_name': [
        'John Smith', 'jane doe', 'MARY JOHNSON', 'bob wilson', 'Sarah Davis',
        'Mike Brown', 'lisa garcia', 'DAVID MILLER', 'Amy Wilson', 'Tom Anderson'
    ] * 20,
    'email': [
        'john.smith@email.com', 'JANE.DOE@EMAIL.COM', 'mary@company.org',
        'bob..wilson@test.com', 'sarah@invalid-email', 'mike@email.com',
        'lisa.garcia@email.com', 'david@company.org', 'amy@email.com', 'tom@test.com'
    ] * 20,
    'phone': [
        '(555) 123-4567', '555.987.6543', '5551234567', '555-987-6543',
        '(555)123-4567', '+1-555-123-4567', '555 123 4567', '5559876543',
        '(555) 987 6543', '555-123-4567'
    ] * 20,
    'address': [
        '123 Main St, Anytown, NY 12345', '456 Oak Ave, Boston, MA 02101',
        '789 Pine Rd, Los Angeles, CA 90210', '321 Elm St, Chicago, IL 60601',
        '654 Maple Dr, Houston, TX 77001', '987 Cedar Ln, Phoenix, AZ 85001',
        '147 Birch Way, Philadelphia, PA 19101', '258 Ash Ct, San Antonio, TX 78201',
        '369 Walnut St, San Diego, CA 92101', '741 Cherry Ave, Dallas, TX 75201'
    ] * 20,
    'purchase_amount': np.random.normal(100, 30, n_records).round(2),
    'purchase_date': [
        '2024-01-15', '01/16/2024', '2024-1-17', '16-01-2024', '2024/01/18',
        'January 19, 2024', '2024-01-20', '01-21-24', '2024.01.22', '23/01/2024'
    ] * 20,
    'category': [
        'Electronics', 'electronics', 'ELECTRONICS', 'Books', 'books',
        'Clothing', 'clothing', 'CLOTHING', 'Home & Garden', 'home&garden'
    ] * 20,
    'satisfaction_score': np.random.choice([1, 2, 3, 4, 5, 99, -1, None], n_records, p=[0.05, 0.1, 0.15, 0.35, 0.3, 0.02, 0.02, 0.01])
}

# Convert to DataFrame first
df_messy = pd.DataFrame(messy_data)

# Introduce missing values and anomalies using proper indexing
df_messy.loc[df_messy.index[::25], 'customer_name'] = None  # Some missing names
df_messy.loc[df_messy.index[::30], 'email'] = None  # Some missing emails
df_messy.loc[df_messy.index[::35], 'purchase_amount'] = np.nan  # Some missing amounts
df_messy.loc[df_messy.index[::40], 'purchase_amount'] = -999  # Invalid negative values

# Add some duplicate records
duplicate_indices = [0, 1, 2, 3, 4]
duplicate_rows = df_messy.iloc[duplicate_indices].copy()
df_messy = pd.concat([df_messy, duplicate_rows], ignore_index=True)

print("Messy dataset created:")
print(f"Shape: {df_messy.shape}")
print("\nFirst few rows:")
print(df_messy.head(10))
print("\nData types:")
print(df_messy.dtypes)
print("\nSample of data quality issues:")
print("\n1. Missing values:")
print(df_messy.isnull().sum())
print("\n2. Inconsistent formatting examples:")
print("Customer IDs:", df_messy['customer_id'].head(15).tolist())
print("Customer names:", df_messy['customer_name'].dropna().head(5).tolist())
print("Categories:", df_messy['category'].unique()[:5])
print("\n3. Invalid satisfaction scores:")
print("Unique satisfaction scores:", sorted(df_messy['satisfaction_score'].dropna().unique()))
print("\n4. Invalid purchase amounts:")
print("Negative amounts:", df_messy[df_messy['purchase_amount'] < 0]['purchase_amount'].count())
print("\n5. Date format inconsistencies:")
print("Sample dates:", df_messy['purchase_date'].head(10).tolist())

## 1. Data Quality Assessment

First, let's assess the quality of our messy data.

In [None]:
def assess_data_quality(df):
    """Comprehensive data quality assessment"""
    print("=== DATA QUALITY ASSESSMENT ===")
    print(f"Dataset shape: {df.shape}")
    print(f"Total cells: {df.size}")
    
    # Missing values analysis
    print("\n--- Missing Values ---")
    missing_stats = pd.DataFrame({
        'Missing_Count': df.isnull().sum(),
        'Missing_Percentage': (df.isnull().sum() / len(df)) * 100
    })
    missing_stats = missing_stats[missing_stats['Missing_Count'] > 0]
    print(missing_stats.round(2))
    
    # Duplicate analysis
    print("\n--- Duplicates ---")
    total_duplicates = df.duplicated().sum()
    print(f"Complete duplicate rows: {total_duplicates}")
    
    # Column-specific analysis
    print("\n--- Column Analysis ---")
    for col in df.columns:
        unique_count = df[col].nunique()
        unique_percentage = (unique_count / len(df)) * 100
        print(f"{col}: {unique_count} unique values ({unique_percentage:.1f}%)")
    
    # Data type issues
    print("\n--- Data Types ---")
    print(df.dtypes)
    
    return missing_stats, total_duplicates

# Assess the messy data
missing_stats, duplicate_count = assess_data_quality(df_messy)
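
The assessment above prints its findings, which is convenient for reading but awkward to compare across cleaning steps. As a minimal sketch, the same per-column statistics can be returned as a DataFrame. The name `profile_columns` and the small `sample` frame are illustrative assumptions, not part of the lesson's dataset.

```python
import numpy as np
import pandas as pd

def profile_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, missing count/percentage, unique count."""
    return pd.DataFrame({
        'dtype': df.dtypes.astype(str),
        'missing_count': df.isnull().sum(),
        'missing_pct': (df.isnull().mean() * 100).round(2),
        'unique_count': df.nunique(),  # nunique() ignores missing values
    })

# Hypothetical sample data for illustration only
sample = pd.DataFrame({
    'name': ['Ann', None, 'Bob', 'Ann'],
    'amount': [10.0, 20.0, np.nan, 10.0],
})
profile = profile_columns(sample)
print(profile)
```

Because the result is a DataFrame, a profile taken before cleaning can be compared with one taken afterwards.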

In [None]:
# Identify specific data quality issues
def identify_issues(df):
    """Identify specific data quality issues and return their counts"""
    issues = {}
    
    # Check for inconsistent formatting
    print("=== SPECIFIC ISSUES IDENTIFIED ===")
    
    # Customer ID formatting
    id_patterns = df['customer_id'].str.extract(r'(CUST|cust)(\d+)').fillna('')
    issues['inconsistent_ids'] = int((id_patterns[0] == 'cust').sum())
    print(f"Inconsistent customer ID format: {issues['inconsistent_ids']} records")
    
    # Email validation (missing emails are already reported under missing values,
    # so only non-missing emails that fail the pattern are counted here)
    email_pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
    invalid_emails = df['email'].notna() & ~df['email'].str.match(email_pattern, na=False)
    issues['invalid_emails'] = int(invalid_emails.sum())
    print(f"Invalid email formats: {issues['invalid_emails']} records")
    
    # Negative purchase amounts
    issues['negative_amounts'] = int((df['purchase_amount'] < 0).sum())
    print(f"Negative purchase amounts: {issues['negative_amounts']} records")
    
    # Invalid satisfaction scores (the column is object dtype because it mixes ints and None)
    scores = pd.to_numeric(df['satisfaction_score'], errors='coerce')
    invalid_scores = scores.notna() & ~scores.between(1, 5)
    issues['invalid_scores'] = int(invalid_scores.sum())
    print(f"Invalid satisfaction scores: {issues['invalid_scores']} records")
    
    # Category inconsistencies
    category_variations = df['category'].value_counts()
    issues['category_variations'] = len(category_variations)
    print(f"\nCategory variations: {len(category_variations)} different values")
    print(category_variations)
    
    return issues

issues = identify_issues(df_messy)
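
The email pattern used above accepts consecutive dots in the local part, so an address such as `bob..wilson@test.com` passes the check. The sketch below separates missing, malformed, and acceptable emails. The function name `classify_emails` and the three labels are assumptions made for illustration, and the double-dot rule is one possible stricter rule rather than a full email validator.

```python
import pandas as pd

# Same character classes as the lesson's pattern; fullmatch() anchors it implicitly
EMAIL_PATTERN = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'

def classify_emails(emails: pd.Series) -> pd.Series:
    """Label each email as 'missing', 'malformed', or 'ok'."""
    is_missing = emails.isna()
    matches = emails.str.fullmatch(EMAIL_PATTERN, na=False)
    has_double_dot = emails.str.contains('..', regex=False, na=False)

    status = pd.Series('malformed', index=emails.index)
    status[matches & ~has_double_dot] = 'ok'
    status[is_missing] = 'missing'
    return status

sample_emails = pd.Series([
    'john.smith@email.com',
    'bob..wilson@test.com',   # passes the regex but contains consecutive dots
    'sarah@invalid-email',    # no top-level domain
    None,
])
email_status = classify_emails(sample_emails)
print(email_status.tolist())
```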

## 2. Text Data Standardization

Clean and standardize text fields.

In [None]:
# Text cleaning functions
def clean_text_data(df):
    """Comprehensive text data cleaning"""
    df_clean = df.copy()
    
    # Standardize customer names
    print("Cleaning customer names...")
    df_clean['customer_name_clean'] = df_clean['customer_name'].str.strip()  # Remove whitespace
    df_clean['customer_name_clean'] = df_clean['customer_name_clean'].str.title()  # Title case
    df_clean['customer_name_clean'] = df_clean['customer_name_clean'].str.replace(r'\s+', ' ', regex=True)  # Multiple spaces
    
    # Standardize customer IDs
    print("Standardizing customer IDs...")
    df_clean['customer_id_clean'] = df_clean['customer_id'].str.strip().str.upper()  # 'cust0010' -> 'CUST0010'
    
    # Clean email addresses
    print("Cleaning email addresses...")
    df_clean['email_clean'] = df_clean['email'].str.lower()  # Lowercase
    df_clean['email_clean'] = df_clean['email_clean'].str.strip()  # Remove whitespace
    df_clean['email_clean'] = df_clean['email_clean'].str.replace(r'\.{2,}', '.', regex=True)  # Multiple dots
    
    # Standardize categories
    print("Standardizing categories...")
    category_mapping = {
        'electronics': 'Electronics',
        'ELECTRONICS': 'Electronics',
        'books': 'Books',
        'clothing': 'Clothing',
        'CLOTHING': 'Clothing',
        'home&garden': 'Home & Garden',
        'Home & Garden': 'Home & Garden'
    }
    df_clean['category_clean'] = df_clean['category'].map(category_mapping).fillna(df_clean['category'])
    
    return df_clean

# Apply text cleaning
df_text_clean = clean_text_data(df_messy)

print("\nText cleaning comparison:")
comparison_cols = ['customer_name', 'customer_name_clean', 'customer_id', 'customer_id_clean', 
                  'email', 'email_clean', 'category', 'category_clean']
print(df_text_clean[comparison_cols].head(10))
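
The category mapping above lists every observed spelling by hand, so a new variant such as `' books '` would slip through. One alternative, sketched here under the assumption that categories differ only in case, spacing, and punctuation, is to reduce each value to a canonical key before mapping. The names `CANONICAL_CATEGORIES` and `normalize_category` are hypothetical.

```python
import pandas as pd

# Keys are lowercase with everything except letters and digits removed
CANONICAL_CATEGORIES = {
    'electronics': 'Electronics',
    'books': 'Books',
    'clothing': 'Clothing',
    'homegarden': 'Home & Garden',
}

def normalize_category(categories: pd.Series) -> pd.Series:
    """Map category spellings to canonical labels; unknown values pass through."""
    keys = categories.str.lower().str.replace(r'[^a-z0-9]', '', regex=True)
    return keys.map(CANONICAL_CATEGORIES).fillna(categories)

raw_categories = pd.Series(['ELECTRONICS', ' books ', 'home&garden', 'Home & Garden', 'Toys'])
normalized = normalize_category(raw_categories)
print(normalized.tolist())
```

Unknown categories (here `'Toys'`) are left unchanged so they can be reviewed rather than silently dropped.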

In [None]:
# Advanced text cleaning with regex
def advanced_text_cleaning(df):
    """Advanced text cleaning using regular expressions"""
    df_advanced = df.copy()
    
    # Extract and standardize address components
    print("Processing addresses...")
    # Basic address pattern: number street, city, state zipcode
    address_pattern = r'(\d+)\s+([^,]+),\s*([^,]+),\s*([A-Z]{2})\s+(\d{5})'
    address_parts = df_advanced['address'].str.extract(address_pattern)
    address_parts.columns = ['street_number', 'street_name', 'city', 'state', 'zipcode']
    
    # Clean street names
    address_parts['street_name'] = address_parts['street_name'].str.title()
    address_parts['city'] = address_parts['city'].str.title()
    
    # Combine cleaned parts
    df_advanced['address_clean'] = (
        address_parts['street_number'] + ' ' + address_parts['street_name'] + ', ' +
        address_parts['city'] + ', ' + address_parts['state'] + ' ' + address_parts['zipcode']
    )
    
    # Add individual address components
    for col in address_parts.columns:
        df_advanced[col] = address_parts[col]
    
    return df_advanced

# Apply advanced cleaning
df_advanced_clean = advanced_text_cleaning(df_text_clean)

print("Address cleaning results:")
print(df_advanced_clean[['address', 'address_clean', 'city', 'state', 'zipcode']].head())
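
When an address does not match the pattern, `str.extract` returns missing values for every component, so unparsed rows are easy to lose track of. The sketch below uses named groups, which become column names directly, and adds an explicit flag. The anchored pattern, the `address_parsed` column, and the sample addresses are assumptions for illustration.

```python
import pandas as pd

# Named groups give str.extract its column names
ADDRESS_PATTERN = (
    r'^(?P<street_number>\d+)\s+(?P<street_name>[^,]+),\s*'
    r'(?P<city>[^,]+),\s*(?P<state>[A-Z]{2})\s+(?P<zipcode>\d{5})$'
)

addresses = pd.Series([
    '123 Main St, Anytown, NY 12345',
    '456 oak ave, boston, ma 02101',   # lowercase state code: no match
    'PO Box 9, Springfield',           # no street number, state, or zip: no match
])

parts = addresses.str.extract(ADDRESS_PATTERN)
parts['address_parsed'] = parts['zipcode'].notna()
print(parts)
```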

## 3. Phone Number Standardization

Clean and standardize phone numbers using regex patterns.

In [None]:
def standardize_phone_numbers(df):
    """Standardize phone numbers to consistent format"""
    df_phone = df.copy()
    
    def clean_phone(phone):
        """Clean individual phone number"""
        if pd.isna(phone):
            return None
        
        # Remove all non-digit characters
        digits_only = re.sub(r'\D', '', str(phone))
        
        # Handle different formats
        if len(digits_only) == 10:
            # Format as (XXX) XXX-XXXX
            return f"({digits_only[:3]}) {digits_only[3:6]}-{digits_only[6:]}"
        elif len(digits_only) == 11 and digits_only.startswith('1'):
            # Remove country code and format
            phone_part = digits_only[1:]
            return f"({phone_part[:3]}) {phone_part[3:6]}-{phone_part[6:]}"
        else:
            # Invalid phone number
            return 'INVALID'
    
    # Apply phone cleaning
    df_phone['phone_clean'] = df_phone['phone'].apply(clean_phone)
    
    # Extract area code
    df_phone['area_code'] = df_phone['phone_clean'].str.extract(r'\((\d{3})\)')
    
    # Flag invalid phone numbers (a missing phone is not a valid phone)
    df_phone['phone_is_valid'] = df_phone['phone_clean'].notna() & (df_phone['phone_clean'] != 'INVALID')
    
    return df_phone

# Apply phone standardization
df_phone_clean = standardize_phone_numbers(df_advanced_clean)

print("Phone number standardization:")
print(df_phone_clean[['phone', 'phone_clean', 'area_code', 'phone_is_valid']].head(15))

print("\nPhone validation summary:")
print(df_phone_clean['phone_is_valid'].value_counts())

print("\nArea code distribution:")
print(df_phone_clean['area_code'].value_counts().head())
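
The row-by-row `apply` above is easy to read. On larger data the same rules can be expressed with vectorized string methods, as in this sketch. It assumes US-style 10-digit numbers with an optional leading `1`, and it stores invalid numbers as missing values instead of the `'INVALID'` marker. The function name `normalize_phones` is hypothetical.

```python
import pandas as pd

def normalize_phones(phones: pd.Series) -> pd.DataFrame:
    """Vectorized phone cleanup: keep digits, drop a leading 1, format as (XXX) XXX-XXXX."""
    digits = phones.str.replace(r'\D', '', regex=True)
    # Drop a leading country code from 11-digit numbers
    digits = digits.str.replace(r'^1(\d{10})$', r'\1', regex=True)
    is_valid = digits.str.fullmatch(r'\d{10}', na=False)
    formatted = digits.str.replace(r'^(\d{3})(\d{3})(\d{4})$', r'(\1) \2-\3', regex=True)
    return pd.DataFrame({
        'phone_clean': formatted.where(is_valid),  # invalid or missing -> NaN
        'phone_is_valid': is_valid,
    })

sample_phones = pd.Series(['555.987.6543', '+1-555-123-4567', '12345', None])
phone_result = normalize_phones(sample_phones)
print(phone_result)
```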

## 4. Date Standardization

Parse and standardize dates from various formats.

In [None]:
def standardize_dates(df):
    """Parse and standardize dates from multiple formats"""
    df_dates = df.copy()
    
    def parse_date(date_str):
        """Try to parse date from various formats"""
        if pd.isna(date_str):
            return None
        
        date_str = str(date_str).strip()
        
        # Date formats to try, in order. Order matters for ambiguous strings:
        # '03/04/2024' matches both '%m/%d/%Y' and '%d/%m/%Y', and the first
        # match wins, so month-first is assumed here.
        formats = [
            '%Y-%m-%d',      # 2024-01-15
            '%m/%d/%Y',      # 01/16/2024
            '%d-%m-%Y',      # 16-01-2024
            '%Y/%m/%d',      # 2024/01/18
            '%B %d, %Y',     # January 19, 2024
            '%m-%d-%y',      # 01-21-24
            '%Y.%m.%d',      # 2024.01.22
            '%d/%m/%Y'       # 23/01/2024
        ]
        
        for fmt in formats:
            try:
                return pd.to_datetime(date_str, format=fmt)
            except ValueError:
                continue
        
        # If all else fails, try pandas' flexible parser
        # (covers unpadded forms such as 2024-1-17 if no explicit format matched)
        try:
            return pd.to_datetime(date_str)
        except (ValueError, TypeError):
            return pd.NaT
    
    # Apply date parsing
    print("Parsing dates...")
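    # Wrap in pd.to_datetime so the column is datetime64 even if some rows failed to parse
    df_dates['purchase_date_clean'] = pd.to_datetime(df_dates['purchase_date'].apply(parse_date))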
    
    # Flag unparseable dates
    df_dates['date_is_valid'] = df_dates['purchase_date_clean'].notna()
    
    # Extract date components for valid dates
    df_dates['purchase_year'] = df_dates['purchase_date_clean'].dt.year
    df_dates['purchase_month'] = df_dates['purchase_date_clean'].dt.month
    df_dates['purchase_day'] = df_dates['purchase_date_clean'].dt.day
    df_dates['purchase_day_of_week'] = df_dates['purchase_date_clean'].dt.day_name()
    
    return df_dates

# Apply date standardization
df_date_clean = standardize_dates(df_phone_clean)

print("Date standardization results:")
print(df_date_clean[['purchase_date', 'purchase_date_clean', 'date_is_valid', 
                    'purchase_year', 'purchase_month', 'purchase_day_of_week']].head(15))

print("\nDate parsing summary:")
print(df_date_clean['date_is_valid'].value_counts())

invalid_dates = df_date_clean[~df_date_clean['date_is_valid']]['purchase_date'].unique()
if len(invalid_dates) > 0:
    print(f"\nInvalid date formats found: {invalid_dates}")
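
Because `parse_date` returns the first format that succeeds, the order of the format list decides how ambiguous strings are read. The sketch below shows this with the standard library's `datetime.strptime`. The helper `parse_with_formats` and the two format lists are illustrative assumptions.

```python
from datetime import datetime

def parse_with_formats(text, formats):
    """Return the first successful parse, or None if no format matches."""
    for fmt in formats:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    return None

month_first = ['%m/%d/%Y', '%d/%m/%Y']
day_first = ['%d/%m/%Y', '%m/%d/%Y']

ambiguous = '03/04/2024'
as_month_first = parse_with_formats(ambiguous, month_first)  # read as March 4
as_day_first = parse_with_formats(ambiguous, day_first)      # read as April 3

# A day value above 12 removes the ambiguity: both orders agree
unambiguous = parse_with_formats('23/01/2024', month_first)

print(as_month_first.date(), as_day_first.date(), unambiguous.date())
```

When a source system is known to use one convention, it is safer to list only that convention's formats than to rely on fallthrough.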

## 5. Numerical Data Cleaning

Handle outliers, invalid values, and missing numerical data.

In [None]:
def clean_numerical_data(df):
    """Clean and validate numerical data"""
    df_numeric = df.copy()
    
    # Clean purchase amounts
    print("Cleaning purchase amounts...")
    
    # Flag invalid values
    df_numeric['amount_is_valid'] = (
        df_numeric['purchase_amount'].notna() & 
        (df_numeric['purchase_amount'] >= 0) & 
        (df_numeric['purchase_amount'] <= 10000)  # Reasonable upper limit
    )
    
    # Replace invalid values with NaN
    df_numeric['purchase_amount_clean'] = df_numeric['purchase_amount'].where(
        df_numeric['amount_is_valid'], np.nan
    )
    
    # Detect outliers using IQR method
    Q1 = df_numeric['purchase_amount_clean'].quantile(0.25)
    Q3 = df_numeric['purchase_amount_clean'].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    
    df_numeric['amount_is_outlier'] = (
        (df_numeric['purchase_amount_clean'] < lower_bound) |
        (df_numeric['purchase_amount_clean'] > upper_bound)
    )
    
    # Clean satisfaction scores
    print("Cleaning satisfaction scores...")
    
    # Valid satisfaction scores are 1-5
    df_numeric['satisfaction_is_valid'] = (
        df_numeric['satisfaction_score'].notna() &
        (df_numeric['satisfaction_score'].between(1, 5))
    )
    
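    # The raw column is object dtype (ints mixed with None), so convert to numeric first
    df_numeric['satisfaction_score_clean'] = pd.to_numeric(
        df_numeric['satisfaction_score'], errors='coerce'
    ).where(df_numeric['satisfaction_is_valid'])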
    
    return df_numeric

# Apply numerical cleaning
df_numeric_clean = clean_numerical_data(df_date_clean)

print("Numerical data cleaning results:")
print(df_numeric_clean[['purchase_amount', 'purchase_amount_clean', 'amount_is_valid', 
                       'amount_is_outlier', 'satisfaction_score', 'satisfaction_score_clean', 
                       'satisfaction_is_valid']].head(15))

print("\nNumerical data quality summary:")
print(f"Valid purchase amounts: {df_numeric_clean['amount_is_valid'].sum()}/{len(df_numeric_clean)}")
print(f"Outlier amounts: {df_numeric_clean['amount_is_outlier'].sum()}")
print(f"Valid satisfaction scores: {df_numeric_clean['satisfaction_is_valid'].sum()}/{len(df_numeric_clean)}")

# Show statistics for cleaned data
print("\nCleaned amount statistics:")
print(df_numeric_clean['purchase_amount_clean'].describe())
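
To make the IQR rule concrete, here is a small worked example with made-up amounts. It also shows capping with `clip` as an alternative to flagging. The values are illustrative and are not taken from the lesson's dataset.

```python
import pandas as pd

amounts = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 100.0])

q1 = amounts.quantile(0.25)   # 11.25
q3 = amounts.quantile(0.75)   # 12.75
iqr = q3 - q1                 # 1.5
lower = q1 - 1.5 * iqr        # 9.0
upper = q3 + 1.5 * iqr        # 15.0

# Option 1: flag values outside the fences
is_outlier = (amounts < lower) | (amounts > upper)

# Option 2: cap values at the fences instead of discarding them
capped = amounts.clip(lower=lower, upper=upper)

print(f"Fences: [{lower}, {upper}]")
print(pd.DataFrame({'amount': amounts, 'is_outlier': is_outlier, 'capped': capped}))
```

Flagging keeps the original value for later review, while capping changes it, so the choice depends on how the cleaned column will be used.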

## 6. Duplicate Detection and Handling

Identify and handle duplicate records intelligently.

In [None]:
def handle_duplicates(df):
    """Comprehensive duplicate detection and handling"""
    df_dedup = df.copy()
    
    print("=== DUPLICATE ANALYSIS ===")
    
    # 1. Exact duplicates
    exact_duplicates = df_dedup.duplicated()
    print(f"Exact duplicate rows: {exact_duplicates.sum()}")
    
    # 2. Duplicates based on key columns (likely same customer)
    key_cols = ['customer_name_clean', 'email_clean']
    key_duplicates = df_dedup.duplicated(subset=key_cols, keep=False)
    print(f"Duplicate customers (by name/email): {key_duplicates.sum()}")
    
    # 3. Shared phone numbers (exact match on the cleaned value; true near-duplicate
    #    detection needs fuzzy matching, which is the topic of Exercise 2)
    phone_duplicates = df_dedup.duplicated(subset=['phone_clean'], keep=False)
    print(f"Duplicate phone numbers: {phone_duplicates.sum()}")
    
    # Show duplicate examples
    if key_duplicates.any():
        print("\nExample duplicate customers:")
        duplicate_customers = df_dedup[key_duplicates].sort_values(key_cols)
        print(duplicate_customers[key_cols + ['customer_id_clean', 'purchase_amount_clean']].head(10))
    
    # Remove exact duplicates
    print(f"\nRemoving {exact_duplicates.sum()} exact duplicates...")
    df_no_exact_dups = df_dedup[~exact_duplicates]
    
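    # For customer duplicates, keep the row with the highest purchase amount.
    # Note: this leaves one row per customer and discards repeat purchases, so
    # use it only when a customer-level table is the goal.
    print("Handling customer duplicates (keeping highest purchase)...")
    df_final = df_no_exact_dups.sort_values('purchase_amount_clean', ascending=False).drop_duplicates(
        subset=key_cols, keep='first'
    ).sort_index()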
    
    print(f"Final dataset size after deduplication: {len(df_final)} (was {len(df)})")
    
    return df_final

# Apply duplicate handling
df_deduplicated = handle_duplicates(df_numeric_clean)

print(f"\nRows removed: {len(df_numeric_clean) - len(df_deduplicated)}")
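
The choice of `subset` decides what counts as a duplicate. Deduplicating on customer identity alone also removes legitimate repeat purchases, whereas adding the transaction columns removes only repeated records. The sketch below contrasts the two on a small, made-up `orders` frame.

```python
import pandas as pd

# Hypothetical orders: customer a@x.com has one repeated record and one genuine second purchase
orders = pd.DataFrame({
    'email': ['a@x.com', 'a@x.com', 'a@x.com', 'b@x.com'],
    'purchase_date': ['2024-01-15', '2024-01-15', '2024-02-01', '2024-01-20'],
    'purchase_amount': [50.0, 50.0, 75.0, 20.0],
})

# One row per customer: the February purchase is lost
by_customer = orders.drop_duplicates(subset=['email'])

# One row per transaction: only the repeated January record is removed
by_transaction = orders.drop_duplicates(subset=['email', 'purchase_date', 'purchase_amount'])

print(f"Original: {len(orders)}, per customer: {len(by_customer)}, per transaction: {len(by_transaction)}")
```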

## 7. Data Validation and Quality Scores

Create comprehensive data quality metrics.

In [None]:
def calculate_quality_scores(df):
    """Calculate comprehensive data quality scores"""
    df_quality = df.copy()
    
    # Define quality checks
    quality_checks = {
        'has_customer_name': df_quality['customer_name_clean'].notna(),
        'has_valid_email': df_quality['email_clean'].str.match(
            r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$', na=False
        ),
        'has_valid_phone': df_quality['phone_is_valid'],
        'has_valid_date': df_quality['date_is_valid'],
        'has_valid_amount': df_quality['amount_is_valid'],
        'has_valid_satisfaction': df_quality['satisfaction_is_valid'],
        'amount_not_outlier': ~df_quality['amount_is_outlier'],
        'has_complete_address': df_quality['city'].notna() & df_quality['state'].notna() & df_quality['zipcode'].notna()
    }
    
    # Add individual quality flags
    for check_name, check_result in quality_checks.items():
        df_quality[f'quality_{check_name}'] = check_result.astype(int)
    
    # Calculate overall quality score (percentage of passed checks)
    # Build the list from the checks themselves so unrelated 'quality_*' columns are not averaged
    quality_cols = [f'quality_{check_name}' for check_name in quality_checks]
    df_quality['data_quality_score'] = df_quality[quality_cols].mean(axis=1) * 100
    
    # Categorize quality levels
    def quality_category(score):
        if score >= 90:
            return 'Excellent'
        elif score >= 75:
            return 'Good'
        elif score >= 50:
            return 'Fair'
        else:
            return 'Poor'
    
    df_quality['quality_category'] = df_quality['data_quality_score'].apply(quality_category)
    
    return df_quality, quality_checks

# Calculate quality scores
df_with_quality, quality_checks = calculate_quality_scores(df_deduplicated)

print("Data quality analysis:")
print(df_with_quality[['customer_name_clean', 'data_quality_score', 'quality_category']].head(10))

print("\nQuality category distribution:")
print(df_with_quality['quality_category'].value_counts())

print("\nAverage quality scores by check:")
quality_summary = {}
for check_name in quality_checks.keys():
    col_name = f'quality_{check_name}'
    quality_summary[check_name] = df_with_quality[col_name].mean() * 100

quality_df = pd.DataFrame(list(quality_summary.items()), columns=['Quality_Check', 'Pass_Rate_%'])
quality_df = quality_df.sort_values('Pass_Rate_%', ascending=False)
print(quality_df.round(1))
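
With eight equally weighted checks, each check is worth 12.5 points, so a record that fails a single check scores 87.5 and falls just below the 90-point "Excellent" threshold. The sketch below reproduces the same thresholds with `pd.cut` as a vectorized alternative to the `quality_category` function. The sample scores are illustrative.

```python
import numpy as np
import pandas as pd

# Possible scores with eight checks: multiples of 12.5
scores = pd.Series([100.0, 87.5, 75.0, 62.5, 50.0, 37.5])

# right=False makes each bin include its lower edge, matching the >= comparisons
categories = pd.cut(
    scores,
    bins=[-np.inf, 50, 75, 90, np.inf],
    labels=['Poor', 'Fair', 'Good', 'Excellent'],
    right=False,
)

print(pd.DataFrame({'score': scores, 'category': categories}))
```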

## Practice Exercises

Apply advanced data cleaning techniques to challenging scenarios:

In [None]:
# Exercise 1: Create a custom validation function
# Build a function that validates business rules:
# - Email domains should be from approved list
# - Purchase amounts should be within reasonable ranges by category
# - Dates should be within business operating period
# - Customer IDs should follow specific format patterns

def validate_business_rules(df):
    """Validate business-specific rules"""
    # Your implementation here
    pass

# validation_results = validate_business_rules(df_with_quality)
# print(validation_results)

In [None]:
# Exercise 2: Advanced duplicate detection
# Implement fuzzy matching for near-duplicate detection:
# - Similar names (edit distance)
# - Similar addresses
# - Similar email patterns

# Your code here:


In [None]:
# Exercise 3: Data cleaning metrics dashboard
# Create a comprehensive data quality dashboard that shows:
# - Data quality trends over time
# - Field-by-field quality scores
# - Impact of cleaning steps
# - Recommendations for further improvement

# Your code here:


## Key Takeaways

1. **Assessment First**: Always assess data quality before cleaning
2. **Systematic Approach**: Use a structured pipeline for consistent results
3. **Preserve Original Data**: Keep original values while creating cleaned versions
4. **Document Everything**: Log all cleaning steps and decisions
5. **Validation**: Implement business rule validation
6. **Quality Metrics**: Measure and track data quality improvements
7. **Reusable Pipeline**: Create automated, configurable cleaning processes
8. **Context Matters**: Consider domain-specific requirements
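
One way to turn individual cleaning functions into a reusable pipeline is `DataFrame.pipe`, which passes the frame from one step to the next. The sketch below is a simplified illustration with three hypothetical steps and a made-up `raw` frame. It is not the full set of cleaning functions from this lesson.

```python
import pandas as pd

def standardize_names(df):
    """Trim, collapse repeated spaces, and title-case customer names."""
    out = df.copy()
    out['customer_name_clean'] = (
        out['customer_name'].str.strip()
        .str.replace(r'\s+', ' ', regex=True)
        .str.title()
    )
    return out

def standardize_emails(df):
    """Trim and lowercase email addresses."""
    out = df.copy()
    out['email_clean'] = out['email'].str.strip().str.lower()
    return out

def drop_exact_duplicates(df):
    """Remove rows that are identical in every column."""
    return df.drop_duplicates().reset_index(drop=True)

raw = pd.DataFrame({
    'customer_name': ['  jane   doe ', 'JOHN SMITH', 'JOHN SMITH'],
    'email': ['JANE@EMAIL.COM ', 'john@email.com', 'john@email.com'],
})

cleaned = (
    raw.pipe(standardize_names)
       .pipe(standardize_emails)
       .pipe(drop_exact_duplicates)
)
print(cleaned)
```

Each step copies its input and returns a new frame, so `raw` stays untouched and steps can be reordered or tested in isolation.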

## Common Data Issues and Solutions

| Issue | Detection Method | Solution |
|-------|-----------------|----------|
| Inconsistent Format | Pattern analysis | Standardization rules |
| Missing Values | `.isnull()` | Imputation or flagging |
| Duplicates | `.duplicated()` | Deduplication logic |
| Outliers | Statistical methods | Capping or flagging |
| Invalid Values | Business rules | Validation and correction |
| Inconsistent Naming | String analysis | Normalization |
| Date Issues | Parsing attempts | Multiple format handling |
| Text Issues | Regex patterns | Cleaning and standardization |
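
The table lists imputation or flagging for missing values. The two can be combined, as in the sketch below, which fills missing amounts with the median of the same category and records which rows were filled. The column names `amount_was_imputed` and `purchase_amount_filled` are assumptions, and the median is only one possible imputation choice.

```python
import numpy as np
import pandas as pd

purchases = pd.DataFrame({
    'category': ['Books', 'Books', 'Books', 'Clothing', 'Clothing'],
    'purchase_amount': [10.0, np.nan, 30.0, 50.0, np.nan],
})

# Record which values are about to be filled, so the imputation stays traceable
purchases['amount_was_imputed'] = purchases['purchase_amount'].isna()

# Median per category, aligned to the original rows
category_median = purchases.groupby('category')['purchase_amount'].transform('median')
purchases['purchase_amount_filled'] = purchases['purchase_amount'].fillna(category_median)

print(purchases)
```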

## Best Practices

1. **Start with Exploration**: Understand your data before cleaning
2. **Preserve Traceability**: Keep original and cleaned versions
3. **Validate Assumptions**: Test cleaning rules on sample data
4. **Measure Impact**: Quantify improvements from cleaning
5. **Automate When Possible**: Build reusable cleaning pipelines
6. **Handle Edge Cases**: Plan for unusual but valid data
7. **Business Context**: Include domain experts in rule definition
8. **Iterative Process**: Refine cleaning rules based on results
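
Measuring impact and documenting steps can be partly automated. The sketch below wraps cleaning steps in a decorator that records row counts and missing-value counts before and after each step. The names `logged_step` and `cleaning_log`, and the two example steps, are hypothetical.

```python
import functools
import pandas as pd

cleaning_log = []  # one dict per executed step

def logged_step(func):
    """Record rows and missing cells before and after a cleaning step."""
    @functools.wraps(func)
    def wrapper(df, *args, **kwargs):
        result = func(df, *args, **kwargs)
        cleaning_log.append({
            'step': func.__name__,
            'rows_in': len(df),
            'rows_out': len(result),
            'missing_in': int(df.isnull().sum().sum()),
            'missing_out': int(result.isnull().sum().sum()),
        })
        return result
    return wrapper

@logged_step
def drop_missing_amounts(df):
    return df.dropna(subset=['purchase_amount'])

@logged_step
def drop_negative_amounts(df):
    return df[df['purchase_amount'] >= 0]

data = pd.DataFrame({'purchase_amount': [10.0, None, -999.0, 25.0]})
final = drop_negative_amounts(drop_missing_amounts(data))

print(pd.DataFrame(cleaning_log))
```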
