{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Session 1 - DataFrames - Lesson 11: String Operations and Text Processing\n", "\n", "## Learning Objectives\n", "- Master pandas string methods for text processing\n", "- Learn regular expressions for pattern matching and extraction\n", "- Understand text cleaning and standardization techniques\n", "- Practice with real-world text data scenarios\n", "- Apply string operations to business data analysis\n", "\n", "## Prerequisites\n", "- Completed Lessons 1-10\n", "- Basic understanding of regular expressions (helpful but not required)\n", "- Familiarity with text data challenges" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Import required libraries\n", "import pandas as pd\n", "import numpy as np\n", "import re\n", "import string\n", "from datetime import datetime\n", "import warnings\n", "warnings.filterwarnings('ignore')\n", "\n", "# Set display options\n", "pd.set_option('display.max_columns', None)\n", "pd.set_option('display.max_rows', 20)\n", "pd.set_option('display.max_colwidth', 50)\n", "\n", "print(\"Libraries loaded successfully!\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creating Text-Rich Dataset\n", "\n", "Let's create a comprehensive dataset with various text processing challenges." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create realistic text-rich dataset\n", "np.random.seed(42)\n", "\n", "# Sample data with intentional text issues\n", "text_data = {\n", " 'customer_id': range(1, 201),\n", " 'customer_name': [\n", " 'John Smith', 'jane doe', 'MARY JOHNSON', 'Bob Wilson Jr.', 'Dr. Sarah Davis',\n", " 'Mike O\\'Connor', 'Lisa Garcia-Martinez', 'David Miller III', 'Amy Chen', 'Tom Anderson',\n", " 'Kate Wilson', 'james brown', 'DIANA PRINCE', 'Frank Miller Sr.', 'Prof. 
Grace Lee',\n", " 'Henry Davis', 'Ivy Chen-Wang', 'Jack Robinson', 'Olivia Taylor', 'Ryan Clark'\n", " ] * 10,\n", " 'email': [\n", " 'john.smith@email.com', 'JANE.DOE@EMAIL.COM', 'mary@company.org',\n", " 'bob.wilson@test.co.uk', 'sarah.davis@university.edu', 'mike@work.net',\n", " 'lisa.garcia@startup.io', 'david@consulting.biz', 'amy.chen@tech.com', 'tom@sales.org',\n", " 'kate.wilson@design.com', 'james@marketing.net', 'diana@fashion.com',\n", " 'frank@legal.org', 'grace.lee@research.edu', 'henry@finance.com',\n", " 'ivy@engineering.tech', 'jack@operations.biz', 'olivia@hr.org', 'ryan@analytics.io'\n", " ] * 10,\n", " 'phone': [\n", " '(555) 123-4567', '555.987.6543', '5551234567', '+1-555-987-6543',\n", " '(555)123-4567', '555 123 4567', '1-555-987-6543', '555-123-4567',\n", " '(555) 987 6543', '+15559876543', '555.123.4567', '(555)987-6543',\n", " '555 987 6543', '1 555 123 4567', '+1 555 987 6543', '5559876543',\n", " '(555)-123-4567', '555_987_6543', '555/123/4567', '555-987-6543'\n", " ] * 10,\n", " 'address': [\n", " '123 Main St, Anytown, NY 12345', '456 Oak Ave, Boston, MA 02101',\n", " '789 pine road, los angeles, CA 90210', '321 ELM STREET, Chicago, IL 60601',\n", " '654 Maple Dr., Houston, TX 77001', '987 Cedar Lane, Phoenix, AZ 85001',\n", " '147 birch way, Philadelphia, PA 19101', '258 ASH CT, San Antonio, TX 78201',\n", " '369 Walnut St., San Diego, CA 92101', '741 Cherry Ave, Dallas, TX 75201',\n", " '852 Spruce Blvd, Austin, TX 73301', '963 Fir Street, Seattle, WA 98101',\n", " '159 redwood dr, Portland, OR 97201', '357 WILLOW LN, Denver, CO 80201',\n", " '468 Poplar St., Miami, FL 33101', '579 Hickory Ave, Atlanta, GA 30301',\n", " '680 magnolia way, Nashville, TN 37201', '791 DOGWOOD CT, Charlotte, NC 28201',\n", " '802 Palm St., Orlando, FL 32801', '913 Cypress Ave, Tampa, FL 33601'\n", " ] * 10,\n", " 'product_reviews': [\n", " 'Great product! Highly recommend!!!', 'okay product, nothing special',\n", " 'TERRIBLE! DO NOT BUY!', 'Amazing quality, fast shipping :)', 'Good value for money.',\n", " 'Poor quality, broke after 1 week :(', 'Excellent customer service!',\n", " 'average product... could be better', 'LOVE IT! 5 stars!!', 'Not worth the price.',\n", " 'Perfect! Exactly what I needed.', 'disappointing quality',\n", " 'OUTSTANDING PRODUCT!!!', 'mediocre at best', 'Fantastic! Will buy again.',\n", " 'cheap quality, looks fake', 'Superb craftsmanship!',\n", " 'waste of money', 'Incredible value! 
Recommended!', 'poor design'\n", " ] * 10,\n", " 'job_title': [\n", " 'Software Engineer', 'data scientist', 'MARKETING MANAGER', 'Sales Rep',\n", " 'Product Manager', 'business analyst', 'UX DESIGNER', 'DevOps Engineer',\n", " 'Content Writer', 'project manager', 'FINANCIAL ANALYST', 'HR Specialist',\n", " 'Operations Manager', 'qa engineer', 'RESEARCH SCIENTIST', 'Account Executive',\n", " 'Digital Marketer', 'software developer', 'DATA ENGINEER', 'Consultant'\n", " ] * 10,\n", " 'company': [\n", " 'TechCorp Inc.', 'data solutions llc', 'INNOVATIVE SYSTEMS', 'Global Enterprises',\n", " 'StartupXYZ', 'consulting group ltd', 'FUTURE TECH CO', 'Analytics Pro',\n", " 'Design Studio', 'enterprise solutions', 'MARKETING MASTERS', 'Software Solutions',\n", " 'Digital Agency', 'research institute', 'FINANCE FIRM', 'Operations Co.',\n", " 'Creative Agency', 'tech startup', 'DATA CORP', 'Professional Services'\n", " ] * 10\n", "}\n", "\n", "df_text = pd.DataFrame(text_data)\n", "\n", "print(\"Text-rich dataset created:\")\n", "print(f\"Shape: {df_text.shape}\")\n", "print(\"\\nFirst few rows:\")\n", "print(df_text.head())\n", "print(\"\\nData types:\")\n", "print(df_text.dtypes)\n", "print(\"\\nSample of text issues to address:\")\n", "print(\"- Inconsistent capitalization\")\n", "print(\"- Various phone number formats\")\n", "print(\"- Mixed address formatting\")\n", "print(\"- Inconsistent email domains\")\n", "print(\"- Varied punctuation and emoticons in reviews\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Basic String Operations\n", "\n", "Fundamental string methods and transformations." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Basic string transformations\n", "print(\"=== BASIC STRING TRANSFORMATIONS ===\")\n", "\n", "# Case transformations\n", "df_basic = df_text.copy()\n", "\n", "# Convert to different cases\n", "df_basic['customer_name_upper'] = df_basic['customer_name'].str.upper()\n", "df_basic['customer_name_lower'] = df_basic['customer_name'].str.lower()\n", "df_basic['customer_name_title'] = df_basic['customer_name'].str.title()\n", "df_basic['customer_name_capitalize'] = df_basic['customer_name'].str.capitalize()\n", "\n", "print(\"Case transformations:\")\n", "case_cols = ['customer_name', 'customer_name_upper', 'customer_name_lower', \n", " 'customer_name_title', 'customer_name_capitalize']\n", "print(df_basic[case_cols].head())\n", "\n", "# String length and basic properties\n", "df_basic['name_length'] = df_basic['customer_name'].str.len()\n", "df_basic['email_length'] = df_basic['email'].str.len()\n", "df_basic['review_length'] = df_basic['product_reviews'].str.len()\n", "\n", "print(\"\\nString lengths:\")\n", "print(df_basic[['customer_name', 'name_length', 'email', 'email_length']].head())\n", "\n", "print(\"\\nLength statistics:\")\n", "length_stats = df_basic[['name_length', 'email_length', 'review_length']].describe()\n", "print(length_stats)\n", "\n", "# Check for empty/null strings\n", "print(\"\\nEmpty string checks:\")\n", "for col in ['customer_name', 'email', 'phone']:\n", " empty_count = (df_basic[col].str.strip() == '').sum()\n", " null_count = df_basic[col].isnull().sum()\n", " print(f\"{col}: {empty_count} empty strings, {null_count} null values\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# String slicing and indexing\n", "print(\"=== STRING SLICING AND INDEXING ===\")\n", "\n", "# Extract parts of strings\n", 
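"# Note: .str[i] and .str[a:b] index each element like a plain Python string;\n", "# an out-of-range index yields NaN, an out-of-range slice an empty string,\n", "# and missing values propagate as NaN.\n",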
"df_basic['first_char'] = df_basic['customer_name'].str[0]\n", "df_basic['last_char'] = df_basic['customer_name'].str[-1]\n", "df_basic['first_three'] = df_basic['customer_name'].str[:3]\n", "df_basic['last_three'] = df_basic['customer_name'].str[-3:]\n", "df_basic['middle_chars'] = df_basic['customer_name'].str[2:5]\n", "\n", "print(\"String slicing examples:\")\n", "slice_cols = ['customer_name', 'first_char', 'last_char', 'first_three', 'last_three', 'middle_chars']\n", "print(df_basic[slice_cols].head(10))\n", "\n", "# Extract email domains\n", "df_basic['email_domain'] = df_basic['email'].str.split('@').str[1]\n", "df_basic['email_username'] = df_basic['email'].str.split('@').str[0]\n", "\n", "print(\"\\nEmail parsing:\")\n", "print(df_basic[['email', 'email_username', 'email_domain']].head(10))\n", "\n", "# Domain analysis\n", "print(\"\\nEmail domain distribution:\")\n", "domain_counts = df_basic['email_domain'].value_counts()\n", "print(domain_counts.head(10))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# String concatenation and joining\n", "print(\"=== STRING CONCATENATION AND JOINING ===\")\n", "\n", "# Simple concatenation\n", "df_basic['name_email'] = df_basic['customer_name'] + ' - ' + df_basic['email']\n", "df_basic['initials'] = df_basic['customer_name'].str[0] + '.' + df_basic['customer_name'].str.split().str[1].str[0] + '.'\n", "\n", "print(\"String concatenation:\")\n", "print(df_basic[['customer_name', 'email', 'name_email', 'initials']].head())\n", "\n", "# Using str.cat() for more complex joining\n", "df_basic['full_contact'] = df_basic['customer_name'].str.cat(\n", " [df_basic['email'], df_basic['phone']], \n", " sep=' | '\n", ")\n", "\n", "print(\"\\nComplex concatenation:\")\n", "print(df_basic[['full_contact']].head())\n", "\n", "# Conditional concatenation\n", "df_basic['display_name'] = df_basic.apply(\n", " lambda row: f\"{row['customer_name']} ({row['job_title']})\" \n", " if pd.notna(row['job_title']) else row['customer_name'], \n", " axis=1\n", ")\n", "\n", "print(\"\\nConditional concatenation:\")\n", "print(df_basic[['customer_name', 'job_title', 'display_name']].head())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Pattern Matching and String Contains\n", "\n", "Finding patterns and filtering based on string content." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Basic pattern matching\n", "print(\"=== BASIC PATTERN MATCHING ===\")\n", "\n", "# Check if strings contain specific patterns\n", "df_patterns = df_text.copy()\n", "\n", "# Contains operations\n", "df_patterns['has_dr_title'] = df_patterns['customer_name'].str.contains('Dr\\.|Prof\\.', case=False, na=False)\n", "df_patterns['has_jr_sr'] = df_patterns['customer_name'].str.contains('Jr\\.|Sr\\.', case=False, na=False)\n", "df_patterns['has_hyphen'] = df_patterns['customer_name'].str.contains('-', na=False)\n", "df_patterns['has_apostrophe'] = df_patterns['customer_name'].str.contains(\"'\", na=False)\n", "\n", "print(\"Pattern matching results:\")\n", "pattern_summary = df_patterns[['has_dr_title', 'has_jr_sr', 'has_hyphen', 'has_apostrophe']].sum()\n", "print(pattern_summary)\n", "\n", "print(\"\\nExamples of names with titles:\")\n", "title_names = df_patterns[df_patterns['has_dr_title']]['customer_name'].unique()\n", "print(title_names)\n", "\n", "# Email domain patterns\n", "df_patterns['edu_email'] = df_patterns['email'].str.contains('\\.edu', case=False, na=False)\n", "df_patterns['com_email'] = df_patterns['email'].str.contains('\\.com', case=False, na=False)\n", "df_patterns['org_email'] = df_patterns['email'].str.contains('\\.org', case=False, na=False)\n", "\n", "print(\"\\nEmail domain patterns:\")\n", "domain_pattern_summary = df_patterns[['edu_email', 'com_email', 'org_email']].sum()\n", "print(domain_pattern_summary)\n", "\n", "# Review sentiment patterns\n", "df_patterns['positive_review'] = df_patterns['product_reviews'].str.contains(\n", " 'great|excellent|amazing|fantastic|love|perfect|outstanding|superb|incredible', \n", " case=False, na=False\n", ")\n", "df_patterns['negative_review'] = df_patterns['product_reviews'].str.contains(\n", " 'terrible|poor|disappointing|waste|cheap|broke|fake|mediocre', \n", " case=False, na=False\n", ")\n", "\n", "print(\"\\nReview sentiment patterns:\")\n", "sentiment_summary = df_patterns[['positive_review', 'negative_review']].sum()\n", "print(sentiment_summary)\n", "\n", "print(\"\\nSample positive reviews:\")\n", "positive_reviews = df_patterns[df_patterns['positive_review']]['product_reviews'].head(5)\n", "for review in positive_reviews:\n", " print(f\"- {review}\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Advanced pattern matching with startswith/endswith\n", "print(\"=== STARTSWITH/ENDSWITH PATTERNS ===\")\n", "\n", "# Check beginnings and endings\n", "df_patterns['starts_with_vowel'] = df_patterns['customer_name'].str.lower().str.startswith(('a', 'e', 'i', 'o', 'u'))\n", "df_patterns['ends_with_son'] = df_patterns['customer_name'].str.lower().str.endswith('son')\n", "df_patterns['job_starts_data'] = df_patterns['job_title'].str.lower().str.startswith('data')\n", "df_patterns['company_ends_inc'] = df_patterns['company'].str.lower().str.endswith(('inc', 'inc.', 'llc', 'ltd', 'co', 'co.'))\n", "\n", "print(\"Start/End pattern results:\")\n", "start_end_summary = df_patterns[['starts_with_vowel', 'ends_with_son', 'job_starts_data', 'company_ends_inc']].sum()\n", "print(start_end_summary)\n", "\n", "print(\"\\nNames starting with vowels:\")\n", "vowel_names = df_patterns[df_patterns['starts_with_vowel']]['customer_name'].unique()[:10]\n", "print(vowel_names)\n", "\n", "print(\"\\nData-related job titles:\")\n", "data_jobs = 
"print(\"\\nData-related job titles:\")\n", "data_jobs = df_patterns[df_patterns['job_starts_data']]['job_title'].unique()\n", "print(data_jobs)\n", "\n", "# Phone number format detection\n", "df_patterns['phone_parentheses'] = df_patterns['phone'].str.contains(r'\\(\\d{3}\\)', na=False)\n", "df_patterns['phone_dashes'] = df_patterns['phone'].str.contains(r'\\d{3}-\\d{3}-\\d{4}', na=False)\n", "df_patterns['phone_dots'] = df_patterns['phone'].str.contains(r'\\d{3}\\.\\d{3}\\.\\d{4}', na=False)\n", "df_patterns['phone_spaces'] = df_patterns['phone'].str.contains(r'\\d{3}\\s\\d{3}\\s\\d{4}', na=False)\n", "\n", "print(\"\\nPhone number format patterns:\")\n", "phone_format_summary = df_patterns[['phone_parentheses', 'phone_dashes', 'phone_dots', 'phone_spaces']].sum()\n", "print(phone_format_summary)\n", "\n", "print(\"\\nSample phone formats:\")\n", "for format_type in ['phone_parentheses', 'phone_dashes', 'phone_dots', 'phone_spaces']:\n", " sample = df_patterns[df_patterns[format_type]]['phone'].iloc[0] if df_patterns[format_type].any() else 'None'\n", " print(f\"{format_type}: {sample}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Regular Expressions\n", "\n", "Advanced pattern matching using regular expressions." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Regular expression basics\n", "print(\"=== REGULAR EXPRESSION BASICS ===\")\n", "\n", "df_regex = df_text.copy()\n", "\n", "# Extract patterns using regex\n", "# Extract phone numbers (various formats)\n", "phone_pattern = r'\\(?\\d{3}\\)?[-.\\s]?\\d{3}[-.\\s]?\\d{4}'\n", "df_regex['extracted_phone'] = df_regex['phone'].str.extract(f'({phone_pattern})')\n", "\n", "print(\"Phone number extraction:\")\n", "print(df_regex[['phone', 'extracted_phone']].head(10))\n", "\n", "# Extract ZIP codes from addresses\n", "zip_pattern = r'\\b\\d{5}\\b'\n", "df_regex['zip_code'] = df_regex['address'].str.extract(f'({zip_pattern})')\n", "\n", "print(\"\\nZIP code extraction:\")\n", "print(df_regex[['address', 'zip_code']].head(10))\n", "\n", "# Extract state abbreviations\n", "state_pattern = r'\\b[A-Z]{2}\\b'\n", "df_regex['state'] = df_regex['address'].str.extract(f'({state_pattern})')\n", "\n", "print(\"\\nState extraction:\")\n", "print(df_regex[['address', 'state']].head(10))\n", "\n", "# Count digits in strings\n", "df_regex['digit_count'] = df_regex['phone'].str.count(r'\\d')\n", "df_regex['letter_count'] = df_regex['customer_name'].str.count(r'[a-zA-Z]')\n", "\n", "print(\"\\nCharacter counting:\")\n", "print(df_regex[['phone', 'digit_count', 'customer_name', 'letter_count']].head())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Advanced regex patterns\n", "print(\"=== ADVANCED REGEX PATTERNS ===\")\n", "\n", "# Extract all email components\n", "email_pattern = r'([a-zA-Z0-9._%+-]+)@([a-zA-Z0-9.-]+)\\.([a-zA-Z]{2,})'\n", "email_parts = df_regex['email'].str.extract(email_pattern)\n", "email_parts.columns = ['username', 'domain', 'tld']\n", "\n", "df_regex = pd.concat([df_regex, email_parts], axis=1)\n", "\n", "print(\"Email component extraction:\")\n", "print(df_regex[['email', 'username', 'domain', 'tld']].head(10))\n", "\n", "# Extract multiple phone number parts (\\(? makes the opening parenthesis optional)\n", "phone_parts_pattern = r'\\(?(\\d{3})\\)?[-.\\s]?(\\d{3})[-.\\s]?(\\d{4})'\n", "phone_parts = df_regex['phone'].str.extract(phone_parts_pattern)\n", "phone_parts.columns = ['area_code', 'exchange', 'number']\n", "\n", "df_regex = pd.concat([df_regex, phone_parts], axis=1)\n", "\n",
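"# Note: .str.extract() with several capture groups returns a DataFrame with one\n", "# column per group; rows that do not match come back as NaN in every column.\n",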
number component extraction:\")\n", "print(df_regex[['phone', 'area_code', 'exchange', 'number']].head(10))\n", "\n", "# Extract address components\n", "address_pattern = r'(\\d+)\\s+(.+?)\\s*,\\s*(.+?)\\s*,\\s*([A-Z]{2})\\s+(\\d{5})'\n", "address_parts = df_regex['address'].str.extract(address_pattern)\n", "address_parts.columns = ['street_number', 'street_name', 'city', 'state_extracted', 'zip_extracted']\n", "\n", "print(\"\\nAddress component extraction (first 5):\")\n", "print(address_parts.head())\n", "\n", "# Find all matches (not just first)\n", "# Find all capitalized words in names\n", "df_regex['capitalized_words'] = df_regex['customer_name'].str.findall(r'\\b[A-Z][a-z]+\\b')\n", "df_regex['num_capitalized'] = df_regex['capitalized_words'].str.len()\n", "\n", "print(\"\\nCapitalized words in names:\")\n", "print(df_regex[['customer_name', 'capitalized_words', 'num_capitalized']].head(10))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Regex replacement and cleaning\n", "print(\"=== REGEX REPLACEMENT AND CLEANING ===\")\n", "\n", "# Clean phone numbers to standard format\n", "def clean_phone(phone_str):\n", " \"\"\"Clean phone number to standard format\"\"\"\n", " if pd.isna(phone_str):\n", " return None\n", " # Remove all non-digits\n", " digits = re.sub(r'\\D', '', phone_str)\n", " # Handle different lengths\n", " if len(digits) == 10:\n", " return f\"({digits[:3]}) {digits[3:6]}-{digits[6:]}\"\n", " elif len(digits) == 11 and digits.startswith('1'):\n", " return f\"({digits[1:4]}) {digits[4:7]}-{digits[7:]}\"\n", " else:\n", " return 'Invalid'\n", "\n", "df_regex['phone_cleaned'] = df_regex['phone'].apply(clean_phone)\n", "\n", "print(\"Phone number cleaning:\")\n", "phone_cleaning_sample = df_regex[['phone', 'phone_cleaned']].head(15)\n", "print(phone_cleaning_sample)\n", "\n", "# Remove punctuation from reviews\n", "df_regex['review_no_punct'] = df_regex['product_reviews'].str.replace(r'[^\\w\\s]', ' ', regex=True)\n", "df_regex['review_clean'] = df_regex['review_no_punct'].str.replace(r'\\s+', ' ', regex=True).str.strip()\n", "\n", "print(\"\\nReview cleaning:\")\n", "review_cleaning_sample = df_regex[['product_reviews', 'review_clean']].head(5)\n", "for idx, row in review_cleaning_sample.iterrows():\n", " print(f\"Original: {row['product_reviews']}\")\n", " print(f\"Cleaned: {row['review_clean']}\")\n", " print()\n", "\n", "# Standardize company names\n", "df_regex['company_clean'] = (\n", " df_regex['company']\n", " .str.replace(r'\\binc\\.?\\b', 'Inc.', case=False, regex=True)\n", " .str.replace(r'\\bllc\\b', 'LLC', case=False, regex=True)\n", " .str.replace(r'\\bltd\\.?\\b', 'Ltd.', case=False, regex=True)\n", " .str.replace(r'\\bco\\.?\\b', 'Co.', case=False, regex=True)\n", " .str.title()\n", ")\n", "\n", "print(\"Company name standardization:\")\n", "company_sample = df_regex[['company', 'company_clean']].head(10)\n", "print(company_sample)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Text Cleaning and Standardization\n", "\n", "Comprehensive text cleaning workflows." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Comprehensive text cleaning pipeline\n", "print(\"=== COMPREHENSIVE TEXT CLEANING ===\")\n", "\n", "def clean_text_comprehensive(df):\n", " \"\"\"Comprehensive text cleaning pipeline\"\"\"\n", " df_clean = df.copy()\n", " \n", " # 1. 
" # 1. Clean customer names\n", " df_clean['customer_name_clean'] = (\n", " df_clean['customer_name']\n", " .str.strip() # Remove leading/trailing whitespace\n", " .str.replace(r'\\s+', ' ', regex=True) # Collapse multiple spaces into one\n", " .str.title() # Title case\n", " .str.replace(r'\\bIii\\b', 'III', regex=True) # Restore suffixes that title() lowercases\n", " .str.replace(r'\\bIi\\b', 'II', regex=True)\n", " )\n", " \n", " # 2. Clean and standardize emails\n", " df_clean['email_clean'] = (\n", " df_clean['email']\n", " .str.strip()\n", " .str.lower() # Lowercase for emails\n", " .str.replace(r'\\s+', '', regex=True) # Remove any spaces\n", " )\n", " \n", " # 3. Standardize phone numbers\n", " df_clean['phone_clean'] = df_clean['phone'].apply(clean_phone)\n", " \n", " # 4. Clean addresses\n", " df_clean['address_clean'] = (\n", " df_clean['address']\n", " .str.strip()\n", " .str.title() # Title case\n", " .str.replace(r'\\bSt\\.?\\b', 'St.', regex=True) # Standardize street abbreviations\n", " .str.replace(r'\\bAve\\.?\\b', 'Ave.', regex=True)\n", " .str.replace(r'\\bRd\\.?\\b', 'Rd.', regex=True)\n", " .str.replace(r'\\bDr\\.?\\b', 'Dr.', regex=True)\n", " .str.replace(r'\\bLn\\.?\\b', 'Ln.', regex=True)\n", " .str.replace(r'\\bCt\\.?\\b', 'Ct.', regex=True)\n", " .str.replace(r'\\bBlvd\\.?\\b', 'Blvd.', regex=True)\n", " .str.replace(r'\\s+', ' ', regex=True) # Multiple spaces to single\n", " .str.replace(r'\\b([A-Za-z]{2})(\\s+\\d{5})$',\n", " lambda m: m.group(1).upper() + m.group(2), regex=True) # Restore 'Ny' -> 'NY' after title()\n", " )\n", " \n", " # 5. Clean job titles\n", " df_clean['job_title_clean'] = (\n", " df_clean['job_title']\n", " .str.strip()\n", " .str.title()\n", " .str.replace(r'\\bQa\\b', 'QA', regex=True) # Specific corrections\n", " .str.replace(r'\\bUx\\b', 'UX', regex=True)\n", " .str.replace(r'\\bHr\\b', 'HR', regex=True)\n", " )\n", " \n", " # 6. 
Clean company names\n", " df_clean['company_clean'] = (\n", " df_clean['company']\n", " .str.strip()\n", " .str.title()\n", " .str.replace(r'\\binc\\.?\\b', 'Inc.', case=False, regex=True)\n", " .str.replace(r'\\bllc\\b', 'LLC', case=False, regex=True)\n", " .str.replace(r'\\bltd\\.?\\b', 'Ltd.', case=False, regex=True)\n", " .str.replace(r'\\bco\\.?\\b', 'Co.', case=False, regex=True)\n", " )\n", " \n", " return df_clean\n", "\n", "# Apply comprehensive cleaning\n", "df_comprehensive = clean_text_comprehensive(df_text)\n", "\n", "print(\"Comprehensive cleaning results:\")\n", "# Show before/after comparison\n", "comparison_cols = [\n", " ('customer_name', 'customer_name_clean'),\n", " ('email', 'email_clean'),\n", " ('phone', 'phone_clean'),\n", " ('job_title', 'job_title_clean'),\n", " ('company', 'company_clean')\n", "]\n", "\n", "for original, cleaned in comparison_cols:\n", " print(f\"\\n{original.upper()} CLEANING:\")\n", " sample = df_comprehensive[[original, cleaned]].head(5)\n", " for idx, row in sample.iterrows():\n", " print(f\" Before: {row[original]}\")\n", " print(f\" After: {row[cleaned]}\")\n", " print()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Text standardization and validation\n", "print(\"=== TEXT STANDARDIZATION AND VALIDATION ===\")\n", "\n", "def validate_cleaned_data(df):\n", " \"\"\"Validate cleaned data quality\"\"\"\n", " validation_results = {}\n", " \n", " # Email validation\n", " email_pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$'\n", " valid_emails = df['email_clean'].str.match(email_pattern, na=False)\n", " validation_results['valid_emails'] = {\n", " 'total': len(df),\n", " 'valid': valid_emails.sum(),\n", " 'invalid': (~valid_emails).sum(),\n", " 'percentage_valid': (valid_emails.sum() / len(df)) * 100\n", " }\n", " \n", " # Phone validation\n", " valid_phones = df['phone_clean'] != 'Invalid'\n", " validation_results['valid_phones'] = {\n", " 'total': len(df),\n", " 'valid': valid_phones.sum(),\n", " 'invalid': (~valid_phones).sum(),\n", " 'percentage_valid': (valid_phones.sum() / len(df)) * 100\n", " }\n", " \n", " # Name validation (no numbers, reasonable length)\n", " valid_names = (\n", " df['customer_name_clean'].str.len().between(2, 50) &\n", " ~df['customer_name_clean'].str.contains(r'\\d', na=False)\n", " )\n", " validation_results['valid_names'] = {\n", " 'total': len(df),\n", " 'valid': valid_names.sum(),\n", " 'invalid': (~valid_names).sum(),\n", " 'percentage_valid': (valid_names.sum() / len(df)) * 100\n", " }\n", " \n", " return validation_results\n", "\n", "# Validate cleaned data\n", "validation_results = validate_cleaned_data(df_comprehensive)\n", "\n", "print(\"Data validation results:\")\n", "for field, results in validation_results.items():\n", " print(f\"\\n{field.upper()}:\")\n", " print(f\" Total records: {results['total']}\")\n", " print(f\" Valid: {results['valid']} ({results['percentage_valid']:.1f}%)\")\n", " print(f\" Invalid: {results['invalid']}\")\n", "\n", "# Show some invalid examples\n", "print(\"\\nExamples of invalid data:\")\n", "invalid_emails = df_comprehensive[~df_comprehensive['email_clean'].str.match(r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$', na=False)]\n", "if len(invalid_emails) > 0:\n", " print(f\"Invalid emails: {invalid_emails['email_clean'].head(3).tolist()}\")\n", "\n", "invalid_phones = df_comprehensive[df_comprehensive['phone_clean'] == 'Invalid']\n", "if len(invalid_phones) > 0:\n", " print(f\"Invalid phones: 
{invalid_phones['phone'].head(3).tolist()}\")\n", "\n", "# Generate data quality summary\n", "overall_quality = np.mean([results['percentage_valid'] for results in validation_results.values()])\n", "print(f\"\\nOverall data quality score: {overall_quality:.1f}%\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. Text Analysis and Insights\n", "\n", "Extracting business insights from text data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Text analysis for business insights\n", "print(\"=== TEXT ANALYSIS FOR BUSINESS INSIGHTS ===\")\n", "\n", "def analyze_text_patterns(df):\n", " \"\"\"Analyze text patterns for business insights\"\"\"\n", " analysis = {}\n", " \n", " # 1. Name analysis\n", " analysis['name_insights'] = {\n", " 'avg_name_length': df['customer_name_clean'].str.len().mean(),\n", " 'names_with_titles': df['customer_name_clean'].str.contains(r'Dr\\.|Prof\\.|Mr\\.|Ms\\.|Mrs\\.').sum(),\n", " 'names_with_suffixes': df['customer_name_clean'].str.contains(r'Jr\\.|Sr\\.|III|II').sum(),\n", " 'hyphenated_names': df['customer_name_clean'].str.contains('-').sum(),\n", " 'most_common_first_names': df['customer_name_clean'].str.split().str[0].value_counts().head(5)\n", " }\n", " \n", " # 2. Email domain analysis\n", " domains = df['email_clean'].str.split('@').str[1]\n", " analysis['email_insights'] = {\n", " 'total_unique_domains': domains.nunique(),\n", " 'top_domains': domains.value_counts().head(10),\n", " 'edu_domains': domains.str.endswith('.edu').sum(),\n", " 'com_domains': domains.str.endswith('.com').sum(),\n", " 'org_domains': domains.str.endswith('.org').sum()\n", " }\n", " \n", " # 3. Geographic analysis from addresses (match case-insensitively and re-uppercase,\n", " # so the extraction works whether or not the state survived title-casing)\n", " states = df['address_clean'].str.extract(r'\\b([A-Za-z]{2})\\s+\\d{5}')[0].str.upper()\n", " analysis['geographic_insights'] = {\n", " 'unique_states': states.nunique(),\n", " 'top_states': states.value_counts().head(10),\n", " 'coastal_states': states.isin(['CA', 'NY', 'FL', 'WA', 'OR']).sum()\n", " }\n", " \n", " # 4. Job title analysis\n", " analysis['job_insights'] = {\n", " 'unique_job_titles': df['job_title_clean'].nunique(),\n", " 'top_job_titles': df['job_title_clean'].value_counts().head(10),\n", " 'tech_jobs': df['job_title_clean'].str.contains('Engineer|Developer|Data|Software', case=False).sum(),\n", " 'management_jobs': df['job_title_clean'].str.contains('Manager|Director|VP|President', case=False).sum()\n", " }\n", " \n", " # 5. Company analysis\n", " analysis['company_insights'] = {\n", " 'unique_companies': df['company_clean'].nunique(),\n", " 'top_companies': df['company_clean'].value_counts().head(10),\n", " 'inc_companies': df['company_clean'].str.contains(r'Inc\\.').sum(),\n", " 'llc_companies': df['company_clean'].str.contains('LLC').sum(),\n", " 'startups': df['company_clean'].str.contains('startup', case=False).sum()\n", " }\n", " \n", " return analysis\n", "\n", "# Perform text analysis\n", "text_analysis = analyze_text_patterns(df_comprehensive)\n", "\n", "print(\"TEXT ANALYSIS RESULTS:\")\n", "\n", "print(\"\\n1. 
NAME INSIGHTS:\")\n", "name_insights = text_analysis['name_insights']\n", "print(f\" Average name length: {name_insights['avg_name_length']:.1f} characters\")\n", "print(f\" Names with titles: {name_insights['names_with_titles']}\")\n", "print(f\" Names with suffixes: {name_insights['names_with_suffixes']}\")\n", "print(f\" Hyphenated names: {name_insights['hyphenated_names']}\")\n", "print(\" Most common first names:\")\n", "for name, count in name_insights['most_common_first_names'].items():\n", " print(f\" {name}: {count}\")\n", "\n", "print(\"\\n2. EMAIL INSIGHTS:\")\n", "email_insights = text_analysis['email_insights']\n", "print(f\" Unique domains: {email_insights['total_unique_domains']}\")\n", "print(f\" .edu domains: {email_insights['edu_domains']}\")\n", "print(f\" .com domains: {email_insights['com_domains']}\")\n", "print(f\" .org domains: {email_insights['org_domains']}\")\n", "print(\" Top domains:\")\n", "for domain, count in email_insights['top_domains'].head(5).items():\n", " print(f\" {domain}: {count}\")\n", "\n", "print(\"\\n3. GEOGRAPHIC INSIGHTS:\")\n", "geo_insights = text_analysis['geographic_insights']\n", "print(f\" Unique states: {geo_insights['unique_states']}\")\n", "print(f\" Coastal states: {geo_insights['coastal_states']}\")\n", "print(\" Top states:\")\n", "for state, count in geo_insights['top_states'].head(5).items():\n", " print(f\" {state}: {count}\")\n", "\n", "print(\"\\n4. JOB INSIGHTS:\")\n", "job_insights = text_analysis['job_insights']\n", "print(f\" Unique job titles: {job_insights['unique_job_titles']}\")\n", "print(f\" Tech jobs: {job_insights['tech_jobs']}\")\n", "print(f\" Management jobs: {job_insights['management_jobs']}\")\n", "\n", "print(\"\\n5. COMPANY INSIGHTS:\")\n", "company_insights = text_analysis['company_insights']\n", "print(f\" Unique companies: {company_insights['unique_companies']}\")\n", "print(f\" Inc. 
companies: {company_insights['inc_companies']}\")\n", "print(f\" LLC companies: {company_insights['llc_companies']}\")\n", "print(f\" Startups: {company_insights['startups']}\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Sentiment analysis of product reviews\n", "print(\"=== SENTIMENT ANALYSIS OF REVIEWS ===\")\n", "\n", "def analyze_review_sentiment(df):\n", " \"\"\"Analyze sentiment in product reviews\"\"\"\n", " # Define sentiment word lists\n", " positive_words = [\n", " 'great', 'excellent', 'amazing', 'fantastic', 'love', 'perfect', \n", " 'outstanding', 'superb', 'incredible', 'wonderful', 'awesome', \n", " 'brilliant', 'impressive', 'remarkable', 'exceptional'\n", " ]\n", " \n", " negative_words = [\n", " 'terrible', 'poor', 'disappointing', 'waste', 'cheap', 'broke', \n", " 'fake', 'mediocre', 'awful', 'horrible', 'useless', 'worst', \n", " 'defective', 'junk', 'garbage'\n", " ]\n", " \n", " # Create patterns\n", " positive_pattern = '|'.join(positive_words)\n", " negative_pattern = '|'.join(negative_words)\n", " \n", " # Count sentiment words\n", " df['positive_word_count'] = df['product_reviews'].str.lower().str.count(positive_pattern)\n", " df['negative_word_count'] = df['product_reviews'].str.lower().str.count(negative_pattern)\n", " \n", " # Calculate sentiment score\n", " df['sentiment_score'] = df['positive_word_count'] - df['negative_word_count']\n", " \n", " # Categorize sentiment\n", " def categorize_sentiment(score):\n", " if score > 0:\n", " return 'Positive'\n", " elif score < 0:\n", " return 'Negative'\n", " else:\n", " return 'Neutral'\n", " \n", " df['sentiment_category'] = df['sentiment_score'].apply(categorize_sentiment)\n", " \n", " # Additional features\n", " df['has_exclamation'] = df['product_reviews'].str.contains('!').astype(int)\n", " df['has_caps'] = df['product_reviews'].str.contains(r'[A-Z]{3,}').astype(int)\n", " df['review_word_count'] = df['product_reviews'].str.split().str.len()\n", " \n", " return df\n", "\n", "# Analyze sentiment\n", "df_sentiment = analyze_review_sentiment(df_comprehensive.copy())\n", "\n", "print(\"Sentiment analysis results:\")\n", "sentiment_summary = df_sentiment['sentiment_category'].value_counts()\n", "print(sentiment_summary)\n", "print(f\"\\nSentiment distribution:\")\n", "for category, count in sentiment_summary.items():\n", " percentage = (count / len(df_sentiment)) * 100\n", " print(f\" {category}: {count} ({percentage:.1f}%)\")\n", "\n", "print(\"\\nSentiment score statistics:\")\n", "print(df_sentiment['sentiment_score'].describe())\n", "\n", "print(\"\\nSample reviews by sentiment:\")\n", "for sentiment in ['Positive', 'Negative', 'Neutral']:\n", " sample_reviews = df_sentiment[df_sentiment['sentiment_category'] == sentiment]['product_reviews'].head(2)\n", " print(f\"\\n{sentiment} reviews:\")\n", " for review in sample_reviews:\n", " print(f\" - {review}\")\n", "\n", "# Correlation analysis\n", "print(\"\\nCorrelation between text features:\")\n", "text_features = ['positive_word_count', 'negative_word_count', 'sentiment_score', \n", " 'has_exclamation', 'has_caps', 'review_word_count']\n", "correlation_matrix = df_sentiment[text_features].corr()\n", "print(correlation_matrix.round(3))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Practice Exercises\n", "\n", "Apply string operations to complex text processing scenarios:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "# Exercise 1: Advanced 
Text Cleaning Pipeline\n", "# Create a comprehensive text cleaning and validation system:\n", "# - Handle international characters and encoding issues\n", "# - Implement fuzzy matching for duplicate detection\n", "# - Create data quality scoring system\n", "# - Generate cleaning reports with statistics\n", "\n", "def advanced_text_cleaning_pipeline(df):\n", " \"\"\"Advanced text cleaning with international support and validation\"\"\"\n", " # Your implementation here\n", " pass\n", "\n", "# cleaned_df = advanced_text_cleaning_pipeline(df_text)\n", "# print(\"Advanced text cleaning completed\")" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "# Exercise 2: Text Mining and Information Extraction\n", "# Extract structured information from unstructured text:\n", "# - Extract entities (names, organizations, locations)\n", "# - Parse complex address formats\n", "# - Identify and extract contact information\n", "# - Create knowledge graphs from text relationships\n", "\n", "# Your code here:\n" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "# Exercise 3: Business Intelligence from Text\n", "# Create business insights from text analysis:\n", "# - Customer segmentation based on text patterns\n", "# - Market analysis from company and job data\n", "# - Geographic market penetration analysis\n", "# - Competitive intelligence from text data\n", "\n", "# Your code here:\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Key Takeaways\n", "\n", "1. **String Accessor (`.str`)**:\n", " - Essential for all pandas string operations\n", " - Works with Series containing strings\n", " - Handles NaN values gracefully\n", "\n", "2. **Basic Operations**:\n", " - **Case**: `.upper()`, `.lower()`, `.title()`, `.capitalize()`\n", " - **Length**: `.len()`\n", " - **Slicing**: `.str[start:end]`\n", " - **Splitting**: `.str.split()`\n", "\n", "3. **Pattern Matching**:\n", " - **Contains**: `.str.contains()` for pattern detection\n", " - **Startswith/Endswith**: `.str.startswith()`, `.str.endswith()`\n", " - **Regular Expressions**: Use `regex=True` parameter\n", "\n", "4. 
**Text Cleaning**:\n", " - **Replace**: `.str.replace()` for substitution\n", " - **Strip**: `.str.strip()` for whitespace removal\n", " - **Extract**: `.str.extract()` for regex pattern extraction\n", "\n", "## String Operations Quick Reference\n", "\n", "```python\n", "# Basic transformations\n", "df['col'].str.upper() # Uppercase\n", "df['col'].str.lower() # Lowercase\n", "df['col'].str.title() # Title Case\n", "df['col'].str.len() # String length\n", "df['col'].str.strip() # Remove whitespace\n", "\n", "# Pattern matching\n", "df['col'].str.contains('pattern') # Check if contains\n", "df['col'].str.startswith('prefix') # Check if starts with\n", "df['col'].str.endswith('suffix') # Check if ends with\n", "\n", "# Extraction and replacement\n", "df['col'].str.extract(r'(\\d+)') # Extract pattern\n", "df['col'].str.replace('old', 'new') # Replace text\n", "df['col'].str.split('delimiter') # Split string\n", "\n", "# Advanced regex\n", "df['col'].str.findall(r'\\b\\w+\\b') # Find all matches\n", "df['col'].str.count(r'\\d') # Count pattern occurrences\n", "```\n", "\n", "## Common Text Cleaning Patterns\n", "\n", "| Task | Pattern | Example |\n", "|------|---------|----------|\n", "| Remove punctuation | `r'[^\\w\\s]'` | `str.replace(r'[^\\w\\s]', '', regex=True)` |\n", "| Extract digits | `r'\\d+'` | `str.extract(r'(\\d+)')` |\n", "| Clean phone numbers | `r'\\D'` | `str.replace(r'\\D', '', regex=True)` |\n", "| Extract email parts | `r'([^@]+)@(.+)'` | `str.extract(r'([^@]+)@(.+)')` |\n", "| Standardize whitespace | `r'\\s+'` | `str.replace(r'\\s+', ' ', regex=True)` |\n", "\n", "## Best Practices\n", "\n", "1. **Data Validation**: Always validate cleaned data\n", "2. **Preserve Originals**: Keep original columns during cleaning\n", "3. **Handle Edge Cases**: Plan for missing values and unusual formats\n", "4. **Performance**: Use vectorized operations instead of apply() when possible\n", "5. **Documentation**: Document cleaning rules and business logic\n", "6. **Testing**: Test regex patterns thoroughly with edge cases\n", "\n", "## Business Applications\n", "\n", "- **Customer Data Cleaning**: Standardize names, addresses, contacts\n", "- **Market Research**: Analyze company names and domains\n", "- **Sentiment Analysis**: Process customer reviews and feedback\n", "- **Data Integration**: Clean and match data from multiple sources\n", "- **Compliance**: Standardize data for regulatory requirements" ] } ], "metadata": { "kernelspec": { "display_name": "venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.3" } }, "nbformat": 4, "nbformat_minor": 4 }