# Session 1 - DataFrames - Lesson 11: String Operations and Text Processing

## Learning Objectives
- Master pandas string methods for text processing
- Learn regular expressions for pattern matching and extraction
- Understand text cleaning and standardization techniques
- Practice with real-world text data scenarios
- Apply string operations to business data analysis

## Prerequisites
- Completed Lessons 1-10
- Basic understanding of regular expressions (helpful but not required)
- Familiarity with text data challenges

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import re
import string
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 20)
pd.set_option('display.max_colwidth', 50)

print("Libraries loaded successfully!")

## Creating Text-Rich Dataset

Let's create a comprehensive dataset with various text processing challenges.

In [None]:
# Create realistic text-rich dataset
np.random.seed(42)

# Sample data with intentional text issues
text_data = {
    'customer_id': range(1, 201),
    'customer_name': [
        'John Smith', 'jane doe', 'MARY JOHNSON', 'Bob Wilson Jr.', 'Dr. Sarah Davis',
        'Mike O\'Connor', 'Lisa Garcia-Martinez', 'David Miller III', 'Amy Chen', 'Tom Anderson',
        'Kate Wilson', 'james brown', 'DIANA PRINCE', 'Frank Miller Sr.', 'Prof. Grace Lee',
        'Henry Davis', 'Ivy Chen-Wang', 'Jack Robinson', 'Olivia Taylor', 'Ryan Clark'
    ] * 10,
    'email': [
        'john.smith@email.com', 'JANE.DOE@EMAIL.COM', 'mary@company.org',
        'bob.wilson@test.co.uk', 'sarah.davis@university.edu', 'mike@work.net',
        'lisa.garcia@startup.io', 'david@consulting.biz', 'amy.chen@tech.com', 'tom@sales.org',
        'kate.wilson@design.com', 'james@marketing.net', 'diana@fashion.com',
        'frank@legal.org', 'grace.lee@research.edu', 'henry@finance.com',
        'ivy@engineering.tech', 'jack@operations.biz', 'olivia@hr.org', 'ryan@analytics.io'
    ] * 10,
    'phone': [
        '(555) 123-4567', '555.987.6543', '5551234567', '+1-555-987-6543',
        '(555)123-4567', '555 123 4567', '1-555-987-6543', '555-123-4567',
        '(555) 987 6543', '+15559876543', '555.123.4567', '(555)987-6543',
        '555 987 6543', '1 555 123 4567', '+1 555 987 6543', '5559876543',
        '(555)-123-4567', '555_987_6543', '555/123/4567', '555-987-6543'
    ] * 10,
    'address': [
        '123 Main St, Anytown, NY 12345', '456 Oak Ave, Boston, MA 02101',
        '789 pine road, los angeles, CA 90210', '321 ELM STREET, Chicago, IL 60601',
        '654 Maple Dr., Houston, TX 77001', '987 Cedar Lane, Phoenix, AZ 85001',
        '147 birch way, Philadelphia, PA 19101', '258 ASH CT, San Antonio, TX 78201',
        '369 Walnut St., San Diego, CA 92101', '741 Cherry Ave, Dallas, TX 75201',
        '852 Spruce Blvd, Austin, TX 73301', '963 Fir Street, Seattle, WA 98101',
        '159 redwood dr, Portland, OR 97201', '357 WILLOW LN, Denver, CO 80201',
        '468 Poplar St., Miami, FL 33101', '579 Hickory Ave, Atlanta, GA 30301',
        '680 magnolia way, Nashville, TN 37201', '791 DOGWOOD CT, Charlotte, NC 28201',
        '802 Palm St., Orlando, FL 32801', '913 Cypress Ave, Tampa, FL 33601'
    ] * 10,
    'product_reviews': [
        'Great product! Highly recommend!!!', 'okay product, nothing special',
        'TERRIBLE! DO NOT BUY!', 'Amazing quality, fast shipping :)', 'Good value for money.',
        'Poor quality, broke after 1 week :(', 'Excellent customer service!',
        'average product... could be better', 'LOVE IT! 5 stars!!', 'Not worth the price.',
        'Perfect! Exactly what I needed.', 'disappointing quality',
        'OUTSTANDING PRODUCT!!!', 'mediocre at best', 'Fantastic! Will buy again.',
        'cheap quality, looks fake', 'Superb craftsmanship!',
        'waste of money', 'Incredible value! Recommended!', 'poor design'
    ] * 10,
    'job_title': [
        'Software Engineer', 'data scientist', 'MARKETING MANAGER', 'Sales Rep',
        'Product Manager', 'business analyst', 'UX DESIGNER', 'DevOps Engineer',
        'Content Writer', 'project manager', 'FINANCIAL ANALYST', 'HR Specialist',
        'Operations Manager', 'qa engineer', 'RESEARCH SCIENTIST', 'Account Executive',
        'Digital Marketer', 'software developer', 'DATA ENGINEER', 'Consultant'
    ] * 10,
    'company': [
        'TechCorp Inc.', 'data solutions llc', 'INNOVATIVE SYSTEMS', 'Global Enterprises',
        'StartupXYZ', 'consulting group ltd', 'FUTURE TECH CO', 'Analytics Pro',
        'Design Studio', 'enterprise solutions', 'MARKETING MASTERS', 'Software Solutions',
        'Digital Agency', 'research institute', 'FINANCE FIRM', 'Operations Co.',
        'Creative Agency', 'tech startup', 'DATA CORP', 'Professional Services'
    ] * 10
}

df_text = pd.DataFrame(text_data)

print("Text-rich dataset created:")
print(f"Shape: {df_text.shape}")
print("\nFirst few rows:")
print(df_text.head())
print("\nData types:")
print(df_text.dtypes)
print("\nSample of text issues to address:")
print("- Inconsistent capitalization")
print("- Various phone number formats")
print("- Mixed address formatting")
print("- Inconsistent email capitalization")
print("- Varied punctuation and emoticons in reviews")

## 1. Basic String Operations

Fundamental string methods and transformations.
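
As a quick orientation before the full dataset, the sketch below shows how `.str` methods chain and how a missing value passes through them. The three-element Series is a hypothetical sample, not part of the lesson's data.

```python
import pandas as pd

# Hypothetical sample with stray whitespace, mixed case, and a missing value
names = pd.Series(["  john smith ", "JANE DOE", None])

stripped = names.str.strip()      # trim leading/trailing whitespace
titled = stripped.str.title()     # capitalize the first letter of each word
lengths = stripped.str.len()      # character count per element

print(titled.tolist())
print(lengths.tolist())
```

The missing entry stays missing at every step instead of raising an error, which is why the later cells can call `.str` methods without pre-filtering.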

In [None]:
# Basic string transformations
print("=== BASIC STRING TRANSFORMATIONS ===")

# Case transformations
df_basic = df_text.copy()

# Convert to different cases
df_basic['customer_name_upper'] = df_basic['customer_name'].str.upper()
df_basic['customer_name_lower'] = df_basic['customer_name'].str.lower()
df_basic['customer_name_title'] = df_basic['customer_name'].str.title()
df_basic['customer_name_capitalize'] = df_basic['customer_name'].str.capitalize()

print("Case transformations:")
case_cols = ['customer_name', 'customer_name_upper', 'customer_name_lower', 
            'customer_name_title', 'customer_name_capitalize']
print(df_basic[case_cols].head())

# String length and basic properties
df_basic['name_length'] = df_basic['customer_name'].str.len()
df_basic['email_length'] = df_basic['email'].str.len()
df_basic['review_length'] = df_basic['product_reviews'].str.len()

print("\nString lengths:")
print(df_basic[['customer_name', 'name_length', 'email', 'email_length']].head())

print("\nLength statistics:")
length_stats = df_basic[['name_length', 'email_length', 'review_length']].describe()
print(length_stats)

# Check for empty/null strings
print("\nEmpty string checks:")
for col in ['customer_name', 'email', 'phone']:
    empty_count = (df_basic[col].str.strip() == '').sum()
    null_count = df_basic[col].isnull().sum()
    print(f"{col}: {empty_count} empty strings, {null_count} null values")

In [None]:
# String slicing and indexing
print("=== STRING SLICING AND INDEXING ===")

# Extract parts of strings
df_basic['first_char'] = df_basic['customer_name'].str[0]
df_basic['last_char'] = df_basic['customer_name'].str[-1]
df_basic['first_three'] = df_basic['customer_name'].str[:3]
df_basic['last_three'] = df_basic['customer_name'].str[-3:]
df_basic['middle_chars'] = df_basic['customer_name'].str[2:5]

print("String slicing examples:")
slice_cols = ['customer_name', 'first_char', 'last_char', 'first_three', 'last_three', 'middle_chars']
print(df_basic[slice_cols].head(10))

# Extract email domains
df_basic['email_domain'] = df_basic['email'].str.split('@').str[1]
df_basic['email_username'] = df_basic['email'].str.split('@').str[0]

print("\nEmail parsing:")
print(df_basic[['email', 'email_username', 'email_domain']].head(10))

# Domain analysis
print("\nEmail domain distribution:")
domain_counts = df_basic['email_domain'].value_counts()
print(domain_counts.head(10))

In [None]:
# String concatenation and joining
print("=== STRING CONCATENATION AND JOINING ===")

# Simple concatenation
df_basic['name_email'] = df_basic['customer_name'] + ' - ' + df_basic['email']
df_basic['initials'] = df_basic['customer_name'].str[0] + '.' + df_basic['customer_name'].str.split().str[1].str[0] + '.'

print("String concatenation:")
print(df_basic[['customer_name', 'email', 'name_email', 'initials']].head())

# Using str.cat() for more complex joining
df_basic['full_contact'] = df_basic['customer_name'].str.cat(
    [df_basic['email'], df_basic['phone']], 
    sep=' | '
)

print("\nComplex concatenation:")
print(df_basic[['full_contact']].head())

# Conditional concatenation
df_basic['display_name'] = df_basic.apply(
    lambda row: f"{row['customer_name']} ({row['job_title']})" 
    if pd.notna(row['job_title']) else row['customer_name'], 
    axis=1
)

print("\nConditional concatenation:")
print(df_basic[['customer_name', 'job_title', 'display_name']].head())

## 2. Pattern Matching and String Contains

Finding patterns and filtering based on string content.
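
One detail worth seeing in isolation: `.str.contains()` interprets its pattern as a regular expression unless told otherwise, so an unescaped dot matches any character. The sketch below uses a small hypothetical Series to compare a loose regex, an escaped and anchored regex, and a literal match.

```python
import pandas as pd

# Hypothetical sample; 'dotcom.org' is there to expose the unescaped dot
emails = pd.Series(["amy@tech.com", "grace@research.edu", "bob@dotcom.org", None])

# Regex (default): '.' matches any character, so 'tcom' counts as a hit
loose = emails.str.contains(".com", na=False)

# Escaped dot plus end-of-string anchor: only true '.com' endings match
strict = emails.str.contains(r"\.com$", na=False)

# Literal substring search, no regex involved
literal = emails.str.contains(".com", regex=False, na=False)

print(loose.tolist())
print(strict.tolist())
print(literal.tolist())
```

Passing `na=False` turns missing values into `False`, so the result can be used directly as a boolean mask.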

In [None]:
# Basic pattern matching
print("=== BASIC PATTERN MATCHING ===")

# Check if strings contain specific patterns
df_patterns = df_text.copy()

# Contains operations
df_patterns['has_dr_title'] = df_patterns['customer_name'].str.contains(r'Dr\.|Prof\.', case=False, na=False)
df_patterns['has_jr_sr'] = df_patterns['customer_name'].str.contains(r'Jr\.|Sr\.', case=False, na=False)
df_patterns['has_hyphen'] = df_patterns['customer_name'].str.contains('-', na=False)
df_patterns['has_apostrophe'] = df_patterns['customer_name'].str.contains("'", na=False)

print("Pattern matching results:")
pattern_summary = df_patterns[['has_dr_title', 'has_jr_sr', 'has_hyphen', 'has_apostrophe']].sum()
print(pattern_summary)

print("\nExamples of names with titles:")
title_names = df_patterns[df_patterns['has_dr_title']]['customer_name'].unique()
print(title_names)

# Email domain patterns (raw strings; '$' anchors the match to the end of the address)
df_patterns['edu_email'] = df_patterns['email'].str.contains(r'\.edu$', case=False, na=False)
df_patterns['com_email'] = df_patterns['email'].str.contains(r'\.com$', case=False, na=False)
df_patterns['org_email'] = df_patterns['email'].str.contains(r'\.org$', case=False, na=False)

print("\nEmail domain patterns:")
domain_pattern_summary = df_patterns[['edu_email', 'com_email', 'org_email']].sum()
print(domain_pattern_summary)

# Review sentiment patterns
df_patterns['positive_review'] = df_patterns['product_reviews'].str.contains(
    'great|excellent|amazing|fantastic|love|perfect|outstanding|superb|incredible', 
    case=False, na=False
)
df_patterns['negative_review'] = df_patterns['product_reviews'].str.contains(
    'terrible|poor|disappointing|waste|cheap|broke|fake|mediocre', 
    case=False, na=False
)

print("\nReview sentiment patterns:")
sentiment_summary = df_patterns[['positive_review', 'negative_review']].sum()
print(sentiment_summary)

print("\nSample positive reviews:")
positive_reviews = df_patterns[df_patterns['positive_review']]['product_reviews'].head(5)
for review in positive_reviews:
    print(f"- {review}")

In [None]:
# Advanced pattern matching with startswith/endswith
print("=== STARTSWITH/ENDSWITH PATTERNS ===")

# Check beginnings and endings
df_patterns['starts_with_vowel'] = df_patterns['customer_name'].str.lower().str.startswith(('a', 'e', 'i', 'o', 'u'))
df_patterns['ends_with_son'] = df_patterns['customer_name'].str.lower().str.endswith('son')
df_patterns['job_starts_data'] = df_patterns['job_title'].str.lower().str.startswith('data')
df_patterns['company_ends_inc'] = df_patterns['company'].str.lower().str.endswith(('inc', 'inc.', 'llc', 'ltd', 'co', 'co.'))

print("Start/End pattern results:")
start_end_summary = df_patterns[['starts_with_vowel', 'ends_with_son', 'job_starts_data', 'company_ends_inc']].sum()
print(start_end_summary)

print("\nNames starting with vowels:")
vowel_names = df_patterns[df_patterns['starts_with_vowel']]['customer_name'].unique()[:10]
print(vowel_names)

print("\nData-related job titles:")
data_jobs = df_patterns[df_patterns['job_starts_data']]['job_title'].unique()
print(data_jobs)

# Phone number format detection
df_patterns['phone_parentheses'] = df_patterns['phone'].str.contains(r'\(\d{3}\)', na=False)
df_patterns['phone_dashes'] = df_patterns['phone'].str.contains(r'\d{3}-\d{3}-\d{4}', na=False)
df_patterns['phone_dots'] = df_patterns['phone'].str.contains(r'\d{3}\.\d{3}\.\d{4}', na=False)
df_patterns['phone_spaces'] = df_patterns['phone'].str.contains(r'\d{3}\s\d{3}\s\d{4}', na=False)

print("\nPhone number format patterns:")
phone_format_summary = df_patterns[['phone_parentheses', 'phone_dashes', 'phone_dots', 'phone_spaces']].sum()
print(phone_format_summary)

print("\nSample phone formats:")
for format_type in ['phone_parentheses', 'phone_dashes', 'phone_dots', 'phone_spaces']:
    sample = df_patterns[df_patterns[format_type]]['phone'].iloc[0] if df_patterns[format_type].any() else 'None'
    print(f"{format_type}: {sample}")

## 3. Regular Expressions

Advanced pattern matching using regular expressions.
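
A minimal sketch of `.str.extract()` with named groups, using a few hypothetical addresses. Named groups become column names, and tying the state code to the ZIP code that follows it avoids matching other two-letter uppercase tokens such as `CT`.

```python
import pandas as pd

# Hypothetical sample; the second address has 'CT' before the real state code
addresses = pd.Series([
    "123 Main St, Anytown, NY 12345",
    "258 ASH CT, San Antonio, TX 78201",
    "no address on file",
])

# (?P<name>...) names each capture group; the state must be followed by a 5-digit ZIP
pattern = r"\b(?P<state>[A-Z]{2})\s+(?P<zip>\d{5})\b"
parts = addresses.str.extract(pattern)

print(parts)
```

Rows that do not match come back as missing values in every column, which makes unmatched records easy to count with `parts['state'].isna().sum()`.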

In [None]:
# Regular expression basics
print("=== REGULAR EXPRESSION BASICS ===")

df_regex = df_text.copy()

# Extract patterns using regex
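# Extract phone numbers (various formats); the optional '+1' / '1' prefix keeps a
# leading country code from being read as part of the area code
phone_pattern = r'(?:\+?1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}'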
df_regex['extracted_phone'] = df_regex['phone'].str.extract(f'({phone_pattern})')

print("Phone number extraction:")
print(df_regex[['phone', 'extracted_phone']].head(10))

# Extract ZIP codes from addresses
zip_pattern = r'\b\d{5}\b'
df_regex['zip_code'] = df_regex['address'].str.extract(f'({zip_pattern})')

print("\nZIP code extraction:")
print(df_regex[['address', 'zip_code']].head(10))

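# Extract state abbreviations
# Anchor the two-letter code to the ZIP that follows it; a bare r'\b[A-Z]{2}\b'
# would match street suffixes such as 'CT' or 'LN' first
state_pattern = r'\b([A-Z]{2})\s+\d{5}\b'
df_regex['state'] = df_regex['address'].str.extract(state_pattern, expand=False)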

print("\nState extraction:")
print(df_regex[['address', 'state']].head(10))

# Count digits in strings
df_regex['digit_count'] = df_regex['phone'].str.count(r'\d')
df_regex['letter_count'] = df_regex['customer_name'].str.count(r'[a-zA-Z]')

print("\nCharacter counting:")
print(df_regex[['phone', 'digit_count', 'customer_name', 'letter_count']].head())

In [None]:
# Advanced regex patterns
print("=== ADVANCED REGEX PATTERNS ===")

# Extract all email components
email_pattern = r'([a-zA-Z0-9._%+-]+)@([a-zA-Z0-9.-]+)\.([a-zA-Z]{2,})'
email_parts = df_regex['email'].str.extract(email_pattern)
email_parts.columns = ['username', 'domain', 'tld']

df_regex = pd.concat([df_regex, email_parts], axis=1)

print("Email component extraction:")
print(df_regex[['email', 'username', 'domain', 'tld']].head(10))

# Extract multiple phone number parts
# Optional '+1' / '1' country code, then area code, exchange, and line number
phone_parts_pattern = r'(?:\+?1[-.\s]?)?\(?(\d{3})\)?[-.\s]?(\d{3})[-.\s]?(\d{4})'
phone_parts = df_regex['phone'].str.extract(phone_parts_pattern)
phone_parts.columns = ['area_code', 'exchange', 'number']

df_regex = pd.concat([df_regex, phone_parts], axis=1)

print("\nPhone number component extraction:")
print(df_regex[['phone', 'area_code', 'exchange', 'number']].head(10))

# Extract address components
address_pattern = r'(\d+)\s+(.+?)\s*,\s*(.+?)\s*,\s*([A-Z]{2})\s+(\d{5})'
address_parts = df_regex['address'].str.extract(address_pattern)
address_parts.columns = ['street_number', 'street_name', 'city', 'state_extracted', 'zip_extracted']

print("\nAddress component extraction (first 5):")
print(address_parts.head())

# Find all matches (not just first)
# Find all capitalized words in names
df_regex['capitalized_words'] = df_regex['customer_name'].str.findall(r'\b[A-Z][a-z]+\b')
df_regex['num_capitalized'] = df_regex['capitalized_words'].str.len()

print("\nCapitalized words in names:")
print(df_regex[['customer_name', 'capitalized_words', 'num_capitalized']].head(10))

In [None]:
# Regex replacement and cleaning
print("=== REGEX REPLACEMENT AND CLEANING ===")

# Clean phone numbers to standard format
def clean_phone(phone_str):
    """Clean phone number to standard format"""
    if pd.isna(phone_str):
        return None
    # Remove all non-digits
    digits = re.sub(r'\D', '', phone_str)
    # Handle different lengths
    if len(digits) == 10:
        return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"
    elif len(digits) == 11 and digits.startswith('1'):
        return f"({digits[1:4]}) {digits[4:7]}-{digits[7:]}"
    else:
        return 'Invalid'

df_regex['phone_cleaned'] = df_regex['phone'].apply(clean_phone)

print("Phone number cleaning:")
phone_cleaning_sample = df_regex[['phone', 'phone_cleaned']].head(15)
print(phone_cleaning_sample)

# Remove punctuation from reviews
df_regex['review_no_punct'] = df_regex['product_reviews'].str.replace(r'[^\w\s]', ' ', regex=True)
df_regex['review_clean'] = df_regex['review_no_punct'].str.replace(r'\s+', ' ', regex=True).str.strip()

print("\nReview cleaning:")
review_cleaning_sample = df_regex[['product_reviews', 'review_clean']].head(5)
for idx, row in review_cleaning_sample.iterrows():
    print(f"Original: {row['product_reviews']}")
    print(f"Cleaned:  {row['review_clean']}")
    print()

# Standardize company names
df_regex['company_clean'] = (
    df_regex['company']
    .str.title()  # Title-case first so the suffix fixes below are not undone
    # r'\bxxx\b\.?' consumes an existing trailing period, avoiding 'Inc..'
    .str.replace(r'\binc\b\.?', 'Inc.', case=False, regex=True)
    .str.replace(r'\bllc\b\.?', 'LLC', case=False, regex=True)
    .str.replace(r'\bltd\b\.?', 'Ltd.', case=False, regex=True)
    .str.replace(r'\bco\b\.?', 'Co.', case=False, regex=True)
)

print("Company name standardization:")
company_sample = df_regex[['company', 'company_clean']].head(10)
print(company_sample)

## 4. Text Cleaning and Standardization

Comprehensive text cleaning workflows.
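
The general shape of a cleaning chain is: trim, collapse whitespace, normalize case, then repair what case normalization broke. The sketch below applies that order to a few hypothetical job titles; the `ACRONYMS` mapping is an illustrative assumption, not a complete list.

```python
import pandas as pd

# Hypothetical sample with extra whitespace and mixed case
titles = pd.Series(["  qa   engineer", "UX DESIGNER ", "hr specialist"])

# Assumed acronym list: title-cased form -> desired form
ACRONYMS = {"Qa": "QA", "Ux": "UX", "Hr": "HR"}

cleaned = (
    titles
    .str.strip()                              # trim the ends
    .str.replace(r"\s+", " ", regex=True)     # collapse runs of whitespace
    .str.title()                              # normalize case
    # title() produces 'Qa', 'Ux', 'Hr'; a callable replacement restores them
    .str.replace(r"\b(Qa|Ux|Hr)\b", lambda m: ACRONYMS[m.group(1)], regex=True)
)

print(cleaned.tolist())
```

Running the case-sensitive repairs after `.str.title()` matters: applying them first would have no lasting effect, because title-casing would lowercase the acronyms again.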

In [None]:
# Comprehensive text cleaning pipeline
print("=== COMPREHENSIVE TEXT CLEANING ===")

def clean_text_comprehensive(df):
    """Comprehensive text cleaning pipeline"""
    df_clean = df.copy()
    
    # 1. Clean customer names
    df_clean['customer_name_clean'] = (
        df_clean['customer_name']
        .str.strip()                           # Remove leading/trailing whitespace
        .str.replace(r'\s+', ' ', regex=True)  # Replace multiple spaces with single space
        .str.title()                           # Title case
        # title() already yields 'Dr.', 'Prof.', 'Jr.' and 'Sr.', but it turns
        # 'III' into 'Iii', so restore Roman-numeral suffixes
        .str.replace(r'\b(Ii|Iii|Iv)\b', lambda m: m.group(1).upper(), regex=True)
    )
    
    # 2. Clean and standardize emails
    df_clean['email_clean'] = (
        df_clean['email']
        .str.strip()
        .str.lower()                           # Lowercase for emails
        .str.replace(r'\s+', '', regex=True)   # Remove any spaces
    )
    
    # 3. Standardize phone numbers
    df_clean['phone_clean'] = df_clean['phone'].apply(clean_phone)
    
    # 4. Clean addresses
    df_clean['address_clean'] = (
        df_clean['address']
        .str.strip()
        .str.title()                           # Title case
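        # Standardize street abbreviations; r'\bSt\b\.?' consumes an existing
        # period so 'St.' does not become 'St..'
        .str.replace(r'\bSt\b\.?', 'St.', regex=True)
        .str.replace(r'\bAve\b\.?', 'Ave.', regex=True)
        .str.replace(r'\bRd\b\.?', 'Rd.', regex=True)
        .str.replace(r'\bDr\b\.?', 'Dr.', regex=True)
        .str.replace(r'\bLn\b\.?', 'Ln.', regex=True)
        .str.replace(r'\bCt\b\.?', 'Ct.', regex=True)
        .str.replace(r'\bBlvd\b\.?', 'Blvd.', regex=True)
        # title() lowercases state codes ('NY' -> 'Ny'); restore them before the ZIP
        .str.replace(r'\b([A-Za-z]{2})(\s+\d{5})\b', lambda m: m.group(1).upper() + m.group(2), regex=True)
        .str.replace(r'\s+', ' ', regex=True)  # Multiple spaces to single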
    )
    
    # 5. Clean job titles
    df_clean['job_title_clean'] = (
        df_clean['job_title']
        .str.strip()
        .str.title()
        .str.replace(r'\bQa\b', 'QA', regex=True)        # Specific corrections
        .str.replace(r'\bUx\b', 'UX', regex=True)
        .str.replace(r'\bHr\b', 'HR', regex=True)
    )
    
    # 6. Clean company names
    df_clean['company_clean'] = (
        df_clean['company']
        .str.strip()
        .str.title()
        # r'\bxxx\b\.?' consumes an existing trailing period, avoiding 'Inc..'
        .str.replace(r'\binc\b\.?', 'Inc.', case=False, regex=True)
        .str.replace(r'\bllc\b\.?', 'LLC', case=False, regex=True)
        .str.replace(r'\bltd\b\.?', 'Ltd.', case=False, regex=True)
        .str.replace(r'\bco\b\.?', 'Co.', case=False, regex=True)
    )
    
    return df_clean

# Apply comprehensive cleaning
df_comprehensive = clean_text_comprehensive(df_text)

print("Comprehensive cleaning results:")
# Show before/after comparison
comparison_cols = [
    ('customer_name', 'customer_name_clean'),
    ('email', 'email_clean'),
    ('phone', 'phone_clean'),
    ('job_title', 'job_title_clean'),
    ('company', 'company_clean')
]

for original, cleaned in comparison_cols:
    print(f"\n{original.upper()} CLEANING:")
    sample = df_comprehensive[[original, cleaned]].head(5)
    for idx, row in sample.iterrows():
        print(f"  Before: {row[original]}")
        print(f"  After:  {row[cleaned]}")
        print()

In [None]:
# Text standardization and validation
print("=== TEXT STANDARDIZATION AND VALIDATION ===")

def validate_cleaned_data(df):
    """Validate cleaned data quality"""
    validation_results = {}
    
    # Email validation
    email_pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
    valid_emails = df['email_clean'].str.match(email_pattern, na=False)
    validation_results['valid_emails'] = {
        'total': len(df),
        'valid': valid_emails.sum(),
        'invalid': (~valid_emails).sum(),
        'percentage_valid': (valid_emails.sum() / len(df)) * 100
    }
    
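    # Phone validation (clean_phone returns None for missing input, which is not valid either)
    valid_phones = df['phone_clean'].notna() & (df['phone_clean'] != 'Invalid')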
    validation_results['valid_phones'] = {
        'total': len(df),
        'valid': valid_phones.sum(),
        'invalid': (~valid_phones).sum(),
        'percentage_valid': (valid_phones.sum() / len(df)) * 100
    }
    
    # Name validation (no numbers, reasonable length)
    valid_names = (
        df['customer_name_clean'].str.len().between(2, 50) &
        ~df['customer_name_clean'].str.contains(r'\d', na=False)
    )
    validation_results['valid_names'] = {
        'total': len(df),
        'valid': valid_names.sum(),
        'invalid': (~valid_names).sum(),
        'percentage_valid': (valid_names.sum() / len(df)) * 100
    }
    
    return validation_results

# Validate cleaned data
validation_results = validate_cleaned_data(df_comprehensive)

print("Data validation results:")
for field, results in validation_results.items():
    print(f"\n{field.upper()}:")
    print(f"  Total records: {results['total']}")
    print(f"  Valid: {results['valid']} ({results['percentage_valid']:.1f}%)")
    print(f"  Invalid: {results['invalid']}")

# Show some invalid examples
print("\nExamples of invalid data:")
invalid_emails = df_comprehensive[~df_comprehensive['email_clean'].str.match(r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$', na=False)]
if len(invalid_emails) > 0:
    print(f"Invalid emails: {invalid_emails['email_clean'].head(3).tolist()}")

invalid_phones = df_comprehensive[df_comprehensive['phone_clean'] == 'Invalid']
if len(invalid_phones) > 0:
    print(f"Invalid phones: {invalid_phones['phone'].head(3).tolist()}")

# Generate data quality summary
overall_quality = np.mean([results['percentage_valid'] for results in validation_results.values()])
print(f"\nOverall data quality score: {overall_quality:.1f}%")

## 5. Text Analysis and Insights

Extracting business insights from text data.
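
Most of the insights in this section follow one recipe: derive a categorical feature from the text, then aggregate it. A minimal sketch with hypothetical email addresses, computing the share of each top-level domain:

```python
import pandas as pd

# Hypothetical sample with inconsistent case
emails = pd.Series(["a@x.com", "B@Y.ORG", "c@z.com", "d@u.edu", "e@v.com"])

# Normalize case, then keep everything after the last dot
tld = emails.str.lower().str.rsplit(".", n=1).str[-1]

# normalize=True returns proportions instead of raw counts
share = tld.value_counts(normalize=True)

print(share)
```

Normalizing case before splitting keeps `ORG` and `org` from being counted as separate categories.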

In [None]:
# Text analysis for business insights
print("=== TEXT ANALYSIS FOR BUSINESS INSIGHTS ===")

def analyze_text_patterns(df):
    """Analyze text patterns for business insights"""
    analysis = {}
    
    # 1. Name analysis
    analysis['name_insights'] = {
        'avg_name_length': df['customer_name_clean'].str.len().mean(),
        'names_with_titles': df['customer_name_clean'].str.contains(r'Dr\.|Prof\.|Mr\.|Ms\.|Mrs\.').sum(),
        # case=False because title-casing may have turned 'III' into 'Iii'
        'names_with_suffixes': df['customer_name_clean'].str.contains(r'\b(?:Jr|Sr)\.|\b(?:II|III)\b', case=False).sum(),
        'hyphenated_names': df['customer_name_clean'].str.contains('-').sum(),
        'most_common_first_names': df['customer_name_clean'].str.split().str[0].value_counts().head(5)
    }
    
    # 2. Email domain analysis
    domains = df['email_clean'].str.split('@').str[1]
    analysis['email_insights'] = {
        'total_unique_domains': domains.nunique(),
        'top_domains': domains.value_counts().head(10),
        'edu_domains': domains.str.endswith('.edu').sum(),
        'com_domains': domains.str.endswith('.com').sum(),
        'org_domains': domains.str.endswith('.org').sum()
    }
    
    # 3. Geographic analysis from addresses
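    # Title-casing may have lowercased the state code ('Ny'), so match either case and upper-case it
    states = df['address_clean'].str.extract(r'\b([A-Za-z]{2})\s+\d{5}')[0].str.upper()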
    analysis['geographic_insights'] = {
        'unique_states': states.nunique(),
        'top_states': states.value_counts().head(10),
        'coastal_states': states.isin(['CA', 'NY', 'FL', 'WA', 'OR']).sum()
    }
    
    # 4. Job title analysis
    analysis['job_insights'] = {
        'unique_job_titles': df['job_title_clean'].nunique(),
        'top_job_titles': df['job_title_clean'].value_counts().head(10),
        'tech_jobs': df['job_title_clean'].str.contains('Engineer|Developer|Data|Software', case=False).sum(),
        'management_jobs': df['job_title_clean'].str.contains('Manager|Director|VP|President', case=False).sum()
    }
    
    # 5. Company analysis
    analysis['company_insights'] = {
        'unique_companies': df['company_clean'].nunique(),
        'top_companies': df['company_clean'].value_counts().head(10),
        'inc_companies': df['company_clean'].str.contains(r'Inc\.').sum(),
        'llc_companies': df['company_clean'].str.contains('LLC').sum(),
        'startups': df['company_clean'].str.contains('Startup|startup', case=False).sum()
    }
    
    return analysis

# Perform text analysis
text_analysis = analyze_text_patterns(df_comprehensive)

print("TEXT ANALYSIS RESULTS:")

print("\n1. NAME INSIGHTS:")
name_insights = text_analysis['name_insights']
print(f"   Average name length: {name_insights['avg_name_length']:.1f} characters")
print(f"   Names with titles: {name_insights['names_with_titles']}")
print(f"   Names with suffixes: {name_insights['names_with_suffixes']}")
print(f"   Hyphenated names: {name_insights['hyphenated_names']}")
print("   Most common first names:")
for name, count in name_insights['most_common_first_names'].items():
    print(f"     {name}: {count}")

print("\n2. EMAIL INSIGHTS:")
email_insights = text_analysis['email_insights']
print(f"   Unique domains: {email_insights['total_unique_domains']}")
print(f"   .edu domains: {email_insights['edu_domains']}")
print(f"   .com domains: {email_insights['com_domains']}")
print(f"   .org domains: {email_insights['org_domains']}")
print("   Top domains:")
for domain, count in email_insights['top_domains'].head(5).items():
    print(f"     {domain}: {count}")

print("\n3. GEOGRAPHIC INSIGHTS:")
geo_insights = text_analysis['geographic_insights']
print(f"   Unique states: {geo_insights['unique_states']}")
print(f"   Coastal states: {geo_insights['coastal_states']}")
print("   Top states:")
for state, count in geo_insights['top_states'].head(5).items():
    print(f"     {state}: {count}")

print("\n4. JOB INSIGHTS:")
job_insights = text_analysis['job_insights']
print(f"   Unique job titles: {job_insights['unique_job_titles']}")
print(f"   Tech jobs: {job_insights['tech_jobs']}")
print(f"   Management jobs: {job_insights['management_jobs']}")

print("\n5. COMPANY INSIGHTS:")
company_insights = text_analysis['company_insights']
print(f"   Unique companies: {company_insights['unique_companies']}")
print(f"   Inc. companies: {company_insights['inc_companies']}")
print(f"   LLC companies: {company_insights['llc_companies']}")
print(f"   Startups: {company_insights['startups']}")

In [None]:
# Sentiment analysis of product reviews
print("=== SENTIMENT ANALYSIS OF REVIEWS ===")

def analyze_review_sentiment(df):
    """Analyze sentiment in product reviews"""
    # Define sentiment word lists
    positive_words = [
        'great', 'excellent', 'amazing', 'fantastic', 'love', 'perfect', 
        'outstanding', 'superb', 'incredible', 'wonderful', 'awesome', 
        'brilliant', 'impressive', 'remarkable', 'exceptional'
    ]
    
    negative_words = [
        'terrible', 'poor', 'disappointing', 'waste', 'cheap', 'broke', 
        'fake', 'mediocre', 'awful', 'horrible', 'useless', 'worst', 
        'defective', 'junk', 'garbage'
    ]
    
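    # Create patterns; \b word boundaries keep 'love' from matching inside 'glove'
    positive_pattern = r'\b(?:' + '|'.join(positive_words) + r')\b'
    negative_pattern = r'\b(?:' + '|'.join(negative_words) + r')\b'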
    
    # Count sentiment words
    df['positive_word_count'] = df['product_reviews'].str.lower().str.count(positive_pattern)
    df['negative_word_count'] = df['product_reviews'].str.lower().str.count(negative_pattern)
    
    # Calculate sentiment score
    df['sentiment_score'] = df['positive_word_count'] - df['negative_word_count']
    
    # Categorize sentiment
    def categorize_sentiment(score):
        if score > 0:
            return 'Positive'
        elif score < 0:
            return 'Negative'
        else:
            return 'Neutral'
    
    df['sentiment_category'] = df['sentiment_score'].apply(categorize_sentiment)
    
    # Additional features
    df['has_exclamation'] = df['product_reviews'].str.contains('!').astype(int)
    df['has_caps'] = df['product_reviews'].str.contains(r'[A-Z]{3,}').astype(int)
    df['review_word_count'] = df['product_reviews'].str.split().str.len()
    
    return df

# Analyze sentiment
df_sentiment = analyze_review_sentiment(df_comprehensive.copy())

print("Sentiment analysis results:")
sentiment_summary = df_sentiment['sentiment_category'].value_counts()
print(sentiment_summary)
print(f"\nSentiment distribution:")
for category, count in sentiment_summary.items():
    percentage = (count / len(df_sentiment)) * 100
    print(f"  {category}: {count} ({percentage:.1f}%)")

print("\nSentiment score statistics:")
print(df_sentiment['sentiment_score'].describe())

print("\nSample reviews by sentiment:")
for sentiment in ['Positive', 'Negative', 'Neutral']:
    sample_reviews = df_sentiment[df_sentiment['sentiment_category'] == sentiment]['product_reviews'].head(2)
    print(f"\n{sentiment} reviews:")
    for review in sample_reviews:
        print(f"  - {review}")

# Correlation analysis
print("\nCorrelation between text features:")
text_features = ['positive_word_count', 'negative_word_count', 'sentiment_score', 
                'has_exclamation', 'has_caps', 'review_word_count']
correlation_matrix = df_sentiment[text_features].corr()
print(correlation_matrix.round(3))

## Practice Exercises

Apply string operations to complex text processing scenarios:

In [15]:
# Exercise 1: Advanced Text Cleaning Pipeline
# Create a comprehensive text cleaning and validation system:
# - Handle international characters and encoding issues
# - Implement fuzzy matching for duplicate detection
# - Create data quality scoring system
# - Generate cleaning reports with statistics

def advanced_text_cleaning_pipeline(df):
    """Advanced text cleaning with international support and validation"""
    # Your implementation here
    pass

# cleaned_df = advanced_text_cleaning_pipeline(df_text)
# print("Advanced text cleaning completed")

In [16]:
# Exercise 2: Text Mining and Information Extraction
# Extract structured information from unstructured text:
# - Extract entities (names, organizations, locations)
# - Parse complex address formats
# - Identify and extract contact information
# - Create knowledge graphs from text relationships

# Your code here:


In [17]:
# Exercise 3: Business Intelligence from Text
# Create business insights from text analysis:
# - Customer segmentation based on text patterns
# - Market analysis from company and job data
# - Geographic market penetration analysis
# - Competitive intelligence from text data

# Your code here:


## Key Takeaways

1. **String Accessor (`.str`)**:
   - Essential for all pandas string operations
   - Works with Series containing strings
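   - Propagates missing values (`NaN`) instead of raising; pass `na=False` to boolean methods such as `.str.contains()` before using the result as a filter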

2. **Basic Operations**:
   - **Case**: `.upper()`, `.lower()`, `.title()`, `.capitalize()`
   - **Length**: `.len()`
   - **Slicing**: `.str[start:end]`
   - **Splitting**: `.str.split()`

3. **Pattern Matching**:
   - **Contains**: `.str.contains()` for pattern detection
   - **Startswith/Endswith**: `.str.startswith()`, `.str.endswith()`
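   - **Regular Expressions**: `.str.contains()` and `.str.extract()` treat the pattern as a regex by default; `.str.replace()` needs `regex=True` (its default is literal matching in pandas 2.0 and later)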

4. **Text Cleaning**:
   - **Replace**: `.str.replace()` for substitution
   - **Strip**: `.str.strip()` for whitespace removal
   - **Extract**: `.str.extract()` for regex pattern extraction
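
The difference between literal and regex replacement is easy to overlook, so here is a small sketch on a hypothetical two-element Series. Passing `regex` explicitly in both calls keeps the behavior independent of the pandas version's default.

```python
import pandas as pd

s = pd.Series(["a.b", "axb"])

# Literal: only the exact two characters 'a.' are replaced
literal = s.str.replace("a.", "X", regex=False)

# Regex: '.' matches any character, so 'ax' is replaced as well
as_regex = s.str.replace("a.", "X", regex=True)

print(literal.tolist())
print(as_regex.tolist())
```

When a pattern contains characters such as `.`, `(`, or `+`, stating `regex=True` or `regex=False` explicitly documents the intent and avoids surprises.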

## String Operations Quick Reference

```python
# Basic transformations
df['col'].str.upper()                    # Uppercase
df['col'].str.lower()                    # Lowercase
df['col'].str.title()                    # Title Case
df['col'].str.len()                      # String length
df['col'].str.strip()                    # Remove whitespace

# Pattern matching
df['col'].str.contains('pattern')        # Check if contains
df['col'].str.startswith('prefix')       # Check if starts with
df['col'].str.endswith('suffix')         # Check if ends with

# Extraction and replacement
df['col'].str.extract(r'(\d+)')         # Extract pattern
df['col'].str.replace('old', 'new')      # Replace text
df['col'].str.split('delimiter')         # Split string

# Advanced regex
df['col'].str.findall(r'\b\w+\b')       # Find all matches
df['col'].str.count(r'\d')              # Count pattern occurrences
```
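
One option the reference above does not show is `expand=True`, which makes `.str.split()` return a DataFrame with one column per piece. A short sketch with hypothetical addresses; the column names are chosen for illustration.

```python
import pandas as pd

emails = pd.Series(["john.smith@email.com", "mary@company.org"])

# n=1 splits at the first '@' only; expand=True returns a DataFrame
parts = emails.str.split("@", n=1, expand=True)
parts.columns = ["username", "domain"]   # illustrative column names

print(parts)
```

This splits each string once rather than calling `.str.split('@')` separately for each piece.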

## Common Text Cleaning Patterns

| Task | Pattern | Example |
|------|---------|----------|
| Remove punctuation | `r'[^\w\s]'` | `str.replace(r'[^\w\s]', '', regex=True)` |
| Extract digits | `r'\d+'` | `str.extract(r'(\d+)')` |
| Clean phone numbers | `r'\D'` | `str.replace(r'\D', '', regex=True)` |
| Extract email parts | `r'([^@]+)@(.+)'` | `str.extract(r'([^@]+)@(.+)')` |
| Standardize whitespace | `r'\s+'` | `str.replace(r'\s+', ' ', regex=True)` |
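
To see the patterns from the table in action, the sketch below applies four of them to two hypothetical strings.

```python
import pandas as pd

# Hypothetical messy strings
raw = pd.Series(["Call  (555) 123-4567 now!!", "order #42,   shipped"])

# Remove punctuation, then collapse the whitespace left behind
no_punct = raw.str.replace(r"[^\w\s]", "", regex=True)
single_spaced = no_punct.str.replace(r"\s+", " ", regex=True).str.strip()

# Keep digits only (phone-number style cleaning)
digits_only = raw.str.replace(r"\D", "", regex=True)

# Extract the first run of digits
first_number = raw.str.extract(r"(\d+)")[0]

print(single_spaced.tolist())
print(digits_only.tolist())
print(first_number.tolist())
```

Note that `\w` also matches the underscore, so underscores survive the punctuation pattern; add `_` to the character class if they should be removed too.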

## Best Practices

1. **Data Validation**: Always validate cleaned data
2. **Preserve Originals**: Keep original columns during cleaning
3. **Handle Edge Cases**: Plan for missing values and unusual formats
4. **Performance**: Use vectorized operations instead of apply() when possible
5. **Documentation**: Document cleaning rules and business logic
6. **Testing**: Test regex patterns thoroughly with edge cases
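
As an illustration of the vectorization advice, here is one possible way to standardize US phone numbers with `.str` methods only, instead of a row-wise `apply()`. It is a sketch under the assumption that valid numbers have 10 digits, optionally preceded by a country code of 1; the sample values are hypothetical.

```python
import pandas as pd

# Hypothetical sample, including an invalid and a missing entry
phones = pd.Series(["(555) 123-4567", "+1-555-987-6543", "555_987_6543", "12345", None])

# Strip everything that is not a digit
digits = phones.str.replace(r"\D", "", regex=True)

# Drop a leading US country code from 11-digit numbers
digits = digits.str.replace(r"^1(\d{10})$", r"\1", regex=True)

# Exactly 10 digits counts as valid; missing values count as invalid
is_valid = digits.str.fullmatch(r"\d{10}", na=False)

# Reformat with backreferences, then blank out anything invalid
formatted = digits.str.replace(r"^(\d{3})(\d{3})(\d{4})$", r"(\1) \2-\3", regex=True)
cleaned = formatted.where(is_valid)

print(cleaned.tolist())
```

Invalid and missing numbers both end up as missing values here, which keeps the column free of sentinel strings; whether that is preferable to an explicit `'Invalid'` label depends on how downstream code reports data quality.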

## Business Applications

- **Customer Data Cleaning**: Standardize names, addresses, contacts
- **Market Research**: Analyze company names and domains
- **Sentiment Analysis**: Process customer reviews and feedback
- **Data Integration**: Clean and match data from multiple sources
- **Compliance**: Standardize data for regulatory requirements