{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Session 1 - DataFrames - Lesson 11: String Operations and Text Processing\n",
"\n",
"## Learning Objectives\n",
"- Master pandas string methods for text processing\n",
"- Learn regular expressions for pattern matching and extraction\n",
"- Understand text cleaning and standardization techniques\n",
"- Practice with real-world text data scenarios\n",
"- Apply string operations to business data analysis\n",
"\n",
"## Prerequisites\n",
"- Completed Lessons 1-10\n",
"- Basic understanding of regular expressions (helpful but not required)\n",
"- Familiarity with text data challenges"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Import required libraries\n",
"import pandas as pd\n",
"import numpy as np\n",
"import re\n",
"import string\n",
"from datetime import datetime\n",
"import warnings\n",
"warnings.filterwarnings('ignore')\n",
"\n",
"# Set display options\n",
"pd.set_option('display.max_columns', None)\n",
"pd.set_option('display.max_rows', 20)\n",
"pd.set_option('display.max_colwidth', 50)\n",
"\n",
"print(\"Libraries loaded successfully!\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Creating Text-Rich Dataset\n",
"\n",
"Let's create a comprehensive dataset with various text processing challenges."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Create realistic text-rich dataset\n",
"np.random.seed(42)\n",
"\n",
"# Sample data with intentional text issues\n",
"text_data = {\n",
"    'customer_id': range(1, 201),\n",
"    'customer_name': [\n",
"        'John Smith', 'jane doe', 'MARY JOHNSON', 'Bob Wilson Jr.', 'Dr. Sarah Davis',\n",
"        'Mike O\\'Connor', 'Lisa Garcia-Martinez', 'David Miller III', 'Amy Chen', 'Tom Anderson',\n",
"        'Kate Wilson', 'james brown', 'DIANA PRINCE', 'Frank Miller Sr.', 'Prof. Grace Lee',\n",
"        'Henry Davis', 'Ivy Chen-Wang', 'Jack Robinson', 'Olivia Taylor', 'Ryan Clark'\n",
"    ] * 10,\n",
"    'email': [\n",
"        'john.smith@email.com', 'JANE.DOE@EMAIL.COM', 'mary@company.org',\n",
"        'bob.wilson@test.co.uk', 'sarah.davis@university.edu', 'mike@work.net',\n",
"        'lisa.garcia@startup.io', 'david@consulting.biz', 'amy.chen@tech.com', 'tom@sales.org',\n",
"        'kate.wilson@design.com', 'james@marketing.net', 'diana@fashion.com',\n",
"        'frank@legal.org', 'grace.lee@research.edu', 'henry@finance.com',\n",
"        'ivy@engineering.tech', 'jack@operations.biz', 'olivia@hr.org', 'ryan@analytics.io'\n",
"    ] * 10,\n",
"    'phone': [\n",
"        '(555) 123-4567', '555.987.6543', '5551234567', '+1-555-987-6543',\n",
"        '(555)123-4567', '555 123 4567', '1-555-987-6543', '555-123-4567',\n",
"        '(555) 987 6543', '+15559876543', '555.123.4567', '(555)987-6543',\n",
"        '555 987 6543', '1 555 123 4567', '+1 555 987 6543', '5559876543',\n",
"        '(555)-123-4567', '555_987_6543', '555/123/4567', '555-987-6543'\n",
"    ] * 10,\n",
"    'address': [\n",
"        '123 Main St, Anytown, NY 12345', '456 Oak Ave, Boston, MA 02101',\n",
"        '789 pine road, los angeles, CA 90210', '321 ELM STREET, Chicago, IL 60601',\n",
"        '654 Maple Dr., Houston, TX 77001', '987 Cedar Lane, Phoenix, AZ 85001',\n",
"        '147 birch way, Philadelphia, PA 19101', '258 ASH CT, San Antonio, TX 78201',\n",
"        '369 Walnut St., San Diego, CA 92101', '741 Cherry Ave, Dallas, TX 75201',\n",
"        '852 Spruce Blvd, Austin, TX 73301', '963 Fir Street, Seattle, WA 98101',\n",
"        '159 redwood dr, Portland, OR 97201', '357 WILLOW LN, Denver, CO 80201',\n",
"        '468 Poplar St., Miami, FL 33101', '579 Hickory Ave, Atlanta, GA 30301',\n",
"        '680 magnolia way, Nashville, TN 37201', '791 DOGWOOD CT, Charlotte, NC 28201',\n",
"        '802 Palm St., Orlando, FL 32801', '913 Cypress Ave, Tampa, FL 33601'\n",
"    ] * 10,\n",
"    'product_reviews': [\n",
"        'Great product! Highly recommend!!!', 'okay product, nothing special',\n",
"        'TERRIBLE! DO NOT BUY!', 'Amazing quality, fast shipping :)', 'Good value for money.',\n",
"        'Poor quality, broke after 1 week :(', 'Excellent customer service!',\n",
"        'average product... could be better', 'LOVE IT! 5 stars!!', 'Not worth the price.',\n",
"        'Perfect! Exactly what I needed.', 'disappointing quality',\n",
"        'OUTSTANDING PRODUCT!!!', 'mediocre at best', 'Fantastic! Will buy again.',\n",
"        'cheap quality, looks fake', 'Superb craftsmanship!',\n",
"        'waste of money', 'Incredible value! Recommended!', 'poor design'\n",
"    ] * 10,\n",
"    'job_title': [\n",
"        'Software Engineer', 'data scientist', 'MARKETING MANAGER', 'Sales Rep',\n",
"        'Product Manager', 'business analyst', 'UX DESIGNER', 'DevOps Engineer',\n",
"        'Content Writer', 'project manager', 'FINANCIAL ANALYST', 'HR Specialist',\n",
"        'Operations Manager', 'qa engineer', 'RESEARCH SCIENTIST', 'Account Executive',\n",
"        'Digital Marketer', 'software developer', 'DATA ENGINEER', 'Consultant'\n",
"    ] * 10,\n",
"    'company': [\n",
"        'TechCorp Inc.', 'data solutions llc', 'INNOVATIVE SYSTEMS', 'Global Enterprises',\n",
"        'StartupXYZ', 'consulting group ltd', 'FUTURE TECH CO', 'Analytics Pro',\n",
"        'Design Studio', 'enterprise solutions', 'MARKETING MASTERS', 'Software Solutions',\n",
"        'Digital Agency', 'research institute', 'FINANCE FIRM', 'Operations Co.',\n",
"        'Creative Agency', 'tech startup', 'DATA CORP', 'Professional Services'\n",
"    ] * 10\n",
"}\n",
"\n",
"df_text = pd.DataFrame(text_data)\n",
"\n",
"print(\"Text-rich dataset created:\")\n",
"print(f\"Shape: {df_text.shape}\")\n",
"print(\"\\nFirst few rows:\")\n",
"print(df_text.head())\n",
"print(\"\\nData types:\")\n",
"print(df_text.dtypes)\n",
"print(\"\\nSample of text issues to address:\")\n",
"print(\"- Inconsistent capitalization\")\n",
"print(\"- Various phone number formats\")\n",
"print(\"- Mixed address formatting\")\n",
"print(\"- Inconsistent email casing\")\n",
"print(\"- Varied punctuation and emoticons in reviews\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Basic String Operations\n",
"\n",
"Fundamental string methods and transformations."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Basic string transformations\n",
"print(\"=== BASIC STRING TRANSFORMATIONS ===\")\n",
"\n",
"# Case transformations\n",
"df_basic = df_text.copy()\n",
"\n",
"# Convert to different cases\n",
"df_basic['customer_name_upper'] = df_basic['customer_name'].str.upper()\n",
"df_basic['customer_name_lower'] = df_basic['customer_name'].str.lower()\n",
"df_basic['customer_name_title'] = df_basic['customer_name'].str.title()\n",
"df_basic['customer_name_capitalize'] = df_basic['customer_name'].str.capitalize()\n",
"\n",
"print(\"Case transformations:\")\n",
"case_cols = ['customer_name', 'customer_name_upper', 'customer_name_lower',\n",
"             'customer_name_title', 'customer_name_capitalize']\n",
"print(df_basic[case_cols].head())\n",
"\n",
"# String length and basic properties\n",
"df_basic['name_length'] = df_basic['customer_name'].str.len()\n",
"df_basic['email_length'] = df_basic['email'].str.len()\n",
"df_basic['review_length'] = df_basic['product_reviews'].str.len()\n",
"\n",
"print(\"\\nString lengths:\")\n",
"print(df_basic[['customer_name', 'name_length', 'email', 'email_length']].head())\n",
"\n",
"print(\"\\nLength statistics:\")\n",
"length_stats = df_basic[['name_length', 'email_length', 'review_length']].describe()\n",
"print(length_stats)\n",
"\n",
"# Check for empty/null strings\n",
"print(\"\\nEmpty string checks:\")\n",
"for col in ['customer_name', 'email', 'phone']:\n",
"    empty_count = (df_basic[col].str.strip() == '').sum()\n",
"    null_count = df_basic[col].isnull().sum()\n",
"    print(f\"{col}: {empty_count} empty strings, {null_count} null values\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# String slicing and indexing\n",
"print(\"=== STRING SLICING AND INDEXING ===\")\n",
"\n",
"# Extract parts of strings\n",
"df_basic['first_char'] = df_basic['customer_name'].str[0]\n",
"df_basic['last_char'] = df_basic['customer_name'].str[-1]\n",
"df_basic['first_three'] = df_basic['customer_name'].str[:3]\n",
"df_basic['last_three'] = df_basic['customer_name'].str[-3:]\n",
"df_basic['middle_chars'] = df_basic['customer_name'].str[2:5]\n",
"\n",
"print(\"String slicing examples:\")\n",
"slice_cols = ['customer_name', 'first_char', 'last_char', 'first_three', 'last_three', 'middle_chars']\n",
"print(df_basic[slice_cols].head(10))\n",
"\n",
"# Extract email domains\n",
"df_basic['email_domain'] = df_basic['email'].str.split('@').str[1]\n",
"df_basic['email_username'] = df_basic['email'].str.split('@').str[0]\n",
"\n",
"print(\"\\nEmail parsing:\")\n",
"print(df_basic[['email', 'email_username', 'email_domain']].head(10))\n",
"\n",
"# Domain analysis\n",
"print(\"\\nEmail domain distribution:\")\n",
"domain_counts = df_basic['email_domain'].value_counts()\n",
"print(domain_counts.head(10))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# String concatenation and joining\n",
"print(\"=== STRING CONCATENATION AND JOINING ===\")\n",
"\n",
"# Simple concatenation\n",
"df_basic['name_email'] = df_basic['customer_name'] + ' - ' + df_basic['email']\n",
"# Naive initials: assumes a plain 'First Last' pattern (titles and suffixes will skew it)\n",
"df_basic['initials'] = df_basic['customer_name'].str[0] + '.' + df_basic['customer_name'].str.split().str[1].str[0] + '.'\n",
"\n",
"print(\"String concatenation:\")\n",
"print(df_basic[['customer_name', 'email', 'name_email', 'initials']].head())\n",
"\n",
"# Using str.cat() for more complex joining\n",
"df_basic['full_contact'] = df_basic['customer_name'].str.cat(\n",
"    [df_basic['email'], df_basic['phone']],\n",
"    sep=' | '\n",
")\n",
"\n",
"print(\"\\nComplex concatenation:\")\n",
"print(df_basic[['full_contact']].head())\n",
"\n",
"# Conditional concatenation\n",
"df_basic['display_name'] = df_basic.apply(\n",
"    lambda row: f\"{row['customer_name']} ({row['job_title']})\"\n",
"    if pd.notna(row['job_title']) else row['customer_name'],\n",
"    axis=1\n",
")\n",
"\n",
"print(\"\\nConditional concatenation:\")\n",
"print(df_basic[['customer_name', 'job_title', 'display_name']].head())"
]
},
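{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A small aside on missing values (a minimal sketch on a throwaway Series,\n",
"# not part of the main dataset): '+' concatenation propagates NaN, while\n",
"# str.cat() can substitute a placeholder via na_rep.\n",
"s_demo = pd.Series(['Ann', None, 'Bea'])\n",
"print(s_demo + ' - x')  # the missing entry stays NaN\n",
"print(s_demo.str.cat(['a', 'b', 'c'], sep=' - ', na_rep='<missing>'))\n"
]
},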
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Pattern Matching and String Contains\n",
"\n",
"Finding patterns and filtering based on string content."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Basic pattern matching\n",
"print(\"=== BASIC PATTERN MATCHING ===\")\n",
"\n",
"# Check if strings contain specific patterns\n",
"df_patterns = df_text.copy()\n",
"\n",
"# Contains operations (raw strings keep the escaped dots intact)\n",
"df_patterns['has_dr_title'] = df_patterns['customer_name'].str.contains(r'Dr\\.|Prof\\.', case=False, na=False)\n",
"df_patterns['has_jr_sr'] = df_patterns['customer_name'].str.contains(r'Jr\\.|Sr\\.', case=False, na=False)\n",
"df_patterns['has_hyphen'] = df_patterns['customer_name'].str.contains('-', na=False)\n",
"df_patterns['has_apostrophe'] = df_patterns['customer_name'].str.contains(\"'\", na=False)\n",
"\n",
"print(\"Pattern matching results:\")\n",
"pattern_summary = df_patterns[['has_dr_title', 'has_jr_sr', 'has_hyphen', 'has_apostrophe']].sum()\n",
"print(pattern_summary)\n",
"\n",
"print(\"\\nExamples of names with titles:\")\n",
"title_names = df_patterns[df_patterns['has_dr_title']]['customer_name'].unique()\n",
"print(title_names)\n",
"\n",
"# Email domain patterns\n",
"df_patterns['edu_email'] = df_patterns['email'].str.contains(r'\\.edu', case=False, na=False)\n",
"df_patterns['com_email'] = df_patterns['email'].str.contains(r'\\.com', case=False, na=False)\n",
"df_patterns['org_email'] = df_patterns['email'].str.contains(r'\\.org', case=False, na=False)\n",
"\n",
"print(\"\\nEmail domain patterns:\")\n",
"domain_pattern_summary = df_patterns[['edu_email', 'com_email', 'org_email']].sum()\n",
"print(domain_pattern_summary)\n",
"\n",
"# Review sentiment patterns\n",
"df_patterns['positive_review'] = df_patterns['product_reviews'].str.contains(\n",
"    'great|excellent|amazing|fantastic|love|perfect|outstanding|superb|incredible',\n",
"    case=False, na=False\n",
")\n",
"df_patterns['negative_review'] = df_patterns['product_reviews'].str.contains(\n",
"    'terrible|poor|disappointing|waste|cheap|broke|fake|mediocre',\n",
"    case=False, na=False\n",
")\n",
"\n",
"print(\"\\nReview sentiment patterns:\")\n",
"sentiment_summary = df_patterns[['positive_review', 'negative_review']].sum()\n",
"print(sentiment_summary)\n",
"\n",
"print(\"\\nSample positive reviews:\")\n",
"positive_reviews = df_patterns[df_patterns['positive_review']]['product_reviews'].head(5)\n",
"for review in positive_reviews:\n",
"    print(f\"- {review}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Advanced pattern matching with startswith/endswith\n",
"print(\"=== STARTSWITH/ENDSWITH PATTERNS ===\")\n",
"\n",
"# Check beginnings and endings (both methods accept a tuple of literals)\n",
"df_patterns['starts_with_vowel'] = df_patterns['customer_name'].str.lower().str.startswith(('a', 'e', 'i', 'o', 'u'))\n",
"df_patterns['ends_with_son'] = df_patterns['customer_name'].str.lower().str.endswith('son')\n",
"df_patterns['job_starts_data'] = df_patterns['job_title'].str.lower().str.startswith('data')\n",
"df_patterns['company_ends_inc'] = df_patterns['company'].str.lower().str.endswith(('inc', 'inc.', 'llc', 'ltd', 'co', 'co.'))\n",
"\n",
"print(\"Start/End pattern results:\")\n",
"start_end_summary = df_patterns[['starts_with_vowel', 'ends_with_son', 'job_starts_data', 'company_ends_inc']].sum()\n",
"print(start_end_summary)\n",
"\n",
"print(\"\\nNames starting with vowels:\")\n",
"vowel_names = df_patterns[df_patterns['starts_with_vowel']]['customer_name'].unique()[:10]\n",
"print(vowel_names)\n",
"\n",
"print(\"\\nData-related job titles:\")\n",
"data_jobs = df_patterns[df_patterns['job_starts_data']]['job_title'].unique()\n",
"print(data_jobs)\n",
"\n",
"# Phone number format detection\n",
"df_patterns['phone_parentheses'] = df_patterns['phone'].str.contains(r'\\(\\d{3}\\)', na=False)\n",
"df_patterns['phone_dashes'] = df_patterns['phone'].str.contains(r'\\d{3}-\\d{3}-\\d{4}', na=False)\n",
"df_patterns['phone_dots'] = df_patterns['phone'].str.contains(r'\\d{3}\\.\\d{3}\\.\\d{4}', na=False)\n",
"df_patterns['phone_spaces'] = df_patterns['phone'].str.contains(r'\\d{3}\\s\\d{3}\\s\\d{4}', na=False)\n",
"\n",
"print(\"\\nPhone number format patterns:\")\n",
"phone_format_summary = df_patterns[['phone_parentheses', 'phone_dashes', 'phone_dots', 'phone_spaces']].sum()\n",
"print(phone_format_summary)\n",
"\n",
"print(\"\\nSample phone formats:\")\n",
"for format_type in ['phone_parentheses', 'phone_dashes', 'phone_dots', 'phone_spaces']:\n",
"    sample = df_patterns[df_patterns[format_type]]['phone'].iloc[0] if df_patterns[format_type].any() else 'None'\n",
"    print(f\"{format_type}: {sample}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Regular Expressions\n",
"\n",
"Advanced pattern matching using regular expressions."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Regular expression basics\n",
"print(\"=== REGULAR EXPRESSION BASICS ===\")\n",
"\n",
"df_regex = df_text.copy()\n",
"\n",
"# Extract patterns using regex\n",
"# Extract phone numbers (various formats; unusual separators like '_' or '/' won't match and yield NaN)\n",
"phone_pattern = r'\\(?\\d{3}\\)?[-.\\s]?\\d{3}[-.\\s]?\\d{4}'\n",
"df_regex['extracted_phone'] = df_regex['phone'].str.extract(f'({phone_pattern})')\n",
"\n",
"print(\"Phone number extraction:\")\n",
"print(df_regex[['phone', 'extracted_phone']].head(10))\n",
"\n",
"# Extract ZIP codes from addresses\n",
"zip_pattern = r'\\b\\d{5}\\b'\n",
"df_regex['zip_code'] = df_regex['address'].str.extract(f'({zip_pattern})')\n",
"\n",
"print(\"\\nZIP code extraction:\")\n",
"print(df_regex[['address', 'zip_code']].head(10))\n",
"\n",
"# Extract state abbreviations\n",
"state_pattern = r'\\b[A-Z]{2}\\b'\n",
"df_regex['state'] = df_regex['address'].str.extract(f'({state_pattern})')\n",
"\n",
"print(\"\\nState extraction:\")\n",
"print(df_regex[['address', 'state']].head(10))\n",
"\n",
"# Count digits in strings\n",
"df_regex['digit_count'] = df_regex['phone'].str.count(r'\\d')\n",
"df_regex['letter_count'] = df_regex['customer_name'].str.count(r'[a-zA-Z]')\n",
"\n",
"print(\"\\nCharacter counting:\")\n",
"print(df_regex[['phone', 'digit_count', 'customer_name', 'letter_count']].head())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Advanced regex patterns\n",
"print(\"=== ADVANCED REGEX PATTERNS ===\")\n",
"\n",
"# Extract all email components\n",
"email_pattern = r'([a-zA-Z0-9._%+-]+)@([a-zA-Z0-9.-]+)\\.([a-zA-Z]{2,})'\n",
"email_parts = df_regex['email'].str.extract(email_pattern)\n",
"email_parts.columns = ['username', 'domain', 'tld']\n",
"\n",
"df_regex = pd.concat([df_regex, email_parts], axis=1)\n",
"\n",
"print(\"Email component extraction:\")\n",
"print(df_regex[['email', 'username', 'domain', 'tld']].head(10))\n",
"\n",
"# Extract multiple phone number parts\n",
"phone_parts_pattern = r'\\(?(\\d{3})\\)?[-.,\\s]?(\\d{3})[-.,\\s]?(\\d{4})'\n",
"phone_parts = df_regex['phone'].str.extract(phone_parts_pattern)\n",
"phone_parts.columns = ['area_code', 'exchange', 'number']\n",
"\n",
"df_regex = pd.concat([df_regex, phone_parts], axis=1)\n",
"\n",
"print(\"\\nPhone number component extraction:\")\n",
"print(df_regex[['phone', 'area_code', 'exchange', 'number']].head(10))\n",
"\n",
"# Extract address components\n",
"address_pattern = r'(\\d+)\\s+(.+?)\\s*,\\s*(.+?)\\s*,\\s*([A-Z]{2})\\s+(\\d{5})'\n",
"address_parts = df_regex['address'].str.extract(address_pattern)\n",
"address_parts.columns = ['street_number', 'street_name', 'city', 'state_extracted', 'zip_extracted']\n",
"\n",
"print(\"\\nAddress component extraction (first 5):\")\n",
"print(address_parts.head())\n",
"\n",
"# Find all matches (not just the first)\n",
"# Find all capitalized words in names\n",
"df_regex['capitalized_words'] = df_regex['customer_name'].str.findall(r'\\b[A-Z][a-z]+\\b')\n",
"df_regex['num_capitalized'] = df_regex['capitalized_words'].str.len()\n",
"\n",
"print(\"\\nCapitalized words in names:\")\n",
"print(df_regex[['customer_name', 'capitalized_words', 'num_capitalized']].head(10))"
]
},
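{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Complement to findall(): str.extractall() returns every match, one row per\n",
"# match in a MultiIndexed DataFrame, with capture groups as columns.\n",
"# A minimal sketch reusing the existing address column.\n",
"all_numbers = df_regex['address'].str.extractall(r'(\\d+)')\n",
"all_numbers.columns = ['number']\n",
"print(all_numbers.head(10))\n",
"print(f\"Total numeric tokens found: {len(all_numbers)}\")\n"
]
},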
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Regex replacement and cleaning\n",
"print(\"=== REGEX REPLACEMENT AND CLEANING ===\")\n",
"\n",
"# Clean phone numbers to standard format\n",
"def clean_phone(phone_str):\n",
"    \"\"\"Clean phone number to standard format\"\"\"\n",
"    if pd.isna(phone_str):\n",
"        return None\n",
"    # Remove all non-digits\n",
"    digits = re.sub(r'\\D', '', phone_str)\n",
"    # Handle different lengths\n",
"    if len(digits) == 10:\n",
"        return f\"({digits[:3]}) {digits[3:6]}-{digits[6:]}\"\n",
"    elif len(digits) == 11 and digits.startswith('1'):\n",
"        return f\"({digits[1:4]}) {digits[4:7]}-{digits[7:]}\"\n",
"    else:\n",
"        return 'Invalid'\n",
"\n",
"df_regex['phone_cleaned'] = df_regex['phone'].apply(clean_phone)\n",
"\n",
"print(\"Phone number cleaning:\")\n",
"phone_cleaning_sample = df_regex[['phone', 'phone_cleaned']].head(15)\n",
"print(phone_cleaning_sample)\n",
"\n",
"# Remove punctuation from reviews\n",
"df_regex['review_no_punct'] = df_regex['product_reviews'].str.replace(r'[^\\w\\s]', ' ', regex=True)\n",
"df_regex['review_clean'] = df_regex['review_no_punct'].str.replace(r'\\s+', ' ', regex=True).str.strip()\n",
"\n",
"print(\"\\nReview cleaning:\")\n",
"review_cleaning_sample = df_regex[['product_reviews', 'review_clean']].head(5)\n",
"for idx, row in review_cleaning_sample.iterrows():\n",
"    print(f\"Original: {row['product_reviews']}\")\n",
"    print(f\"Cleaned: {row['review_clean']}\")\n",
"    print()\n",
"\n",
"# Standardize company names: title-case first, then fix the legal suffixes.\n",
"# Keeping the optional period outside \\b avoids doubled periods ('Inc..')\n",
"df_regex['company_clean'] = (\n",
"    df_regex['company']\n",
"    .str.title()\n",
"    .str.replace(r'\\binc\\b\\.?', 'Inc.', case=False, regex=True)\n",
"    .str.replace(r'\\bllc\\b', 'LLC', case=False, regex=True)\n",
"    .str.replace(r'\\bltd\\b\\.?', 'Ltd.', case=False, regex=True)\n",
"    .str.replace(r'\\bco\\b\\.?', 'Co.', case=False, regex=True)\n",
")\n",
"\n",
"print(\"Company name standardization:\")\n",
"company_sample = df_regex[['company', 'company_clean']].head(10)\n",
"print(company_sample)"
]
},
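{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Quick spot-check of clean_phone() on edge cases from the raw data.\n",
"# Separators such as '_' and '/' are stripped by \\D, so those inputs still\n",
"# normalize; a too-short digit string falls through to 'Invalid'.\n",
"for raw in ['555_987_6543', '555/123/4567', '+1 555 987 6543', '12345']:\n",
"    print(f\"{raw!r:>20} -> {clean_phone(raw)}\")\n",
"print(clean_phone(None))  # missing input returns None\n"
]
},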
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Text Cleaning and Standardization\n",
"\n",
"Comprehensive text cleaning workflows."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Comprehensive text cleaning pipeline\n",
"print(\"=== COMPREHENSIVE TEXT CLEANING ===\")\n",
"\n",
"def clean_text_comprehensive(df):\n",
"    \"\"\"Comprehensive text cleaning pipeline\"\"\"\n",
"    df_clean = df.copy()\n",
"\n",
"    # 1. Clean customer names\n",
"    df_clean['customer_name_clean'] = (\n",
"        df_clean['customer_name']\n",
"        .str.strip()  # Remove leading/trailing whitespace\n",
"        .str.replace(r'\\s+', ' ', regex=True)  # Collapse repeated spaces\n",
"        .str.title()  # Title case (Dr., Jr. etc. survive unchanged)\n",
"        .str.replace(r'\\b(Ii+)\\b', lambda m: m.group(1).upper(), regex=True)  # Restore Roman-numeral suffixes that title() lowercases\n",
"    )\n",
"\n",
"    # 2. Clean and standardize emails\n",
"    df_clean['email_clean'] = (\n",
"        df_clean['email']\n",
"        .str.strip()\n",
"        .str.lower()  # Lowercase for emails\n",
"        .str.replace(r'\\s+', '', regex=True)  # Remove any spaces\n",
"    )\n",
"\n",
"    # 3. Standardize phone numbers\n",
"    df_clean['phone_clean'] = df_clean['phone'].apply(clean_phone)\n",
"\n",
"    # 4. Clean addresses\n",
"    df_clean['address_clean'] = (\n",
"        df_clean['address']\n",
"        .str.strip()\n",
"        .str.title()  # Title case\n",
"        .str.replace(r'\\b([A-Za-z]{2})(\\s+\\d{5})', lambda m: m.group(1).upper() + m.group(2), regex=True)  # Re-uppercase state codes that title() turned into e.g. 'Ny'\n",
"        .str.replace(r'\\bSt\\b\\.?', 'St.', regex=True)  # Standardize street abbreviations (period outside \\b to avoid doubling)\n",
"        .str.replace(r'\\bAve\\b\\.?', 'Ave.', regex=True)\n",
"        .str.replace(r'\\bRd\\b\\.?', 'Rd.', regex=True)\n",
"        .str.replace(r'\\bDr\\b\\.?', 'Dr.', regex=True)\n",
"        .str.replace(r'\\bLn\\b\\.?', 'Ln.', regex=True)\n",
"        .str.replace(r'\\bCt\\b\\.?', 'Ct.', regex=True)\n",
"        .str.replace(r'\\bBlvd\\b\\.?', 'Blvd.', regex=True)\n",
"        .str.replace(r'\\s+', ' ', regex=True)  # Multiple spaces to single\n",
"    )\n",
"\n",
"    # 5. Clean job titles\n",
"    df_clean['job_title_clean'] = (\n",
"        df_clean['job_title']\n",
"        .str.strip()\n",
"        .str.title()\n",
"        .str.replace(r'\\bQa\\b', 'QA', regex=True)  # Acronyms that title() breaks\n",
"        .str.replace(r'\\bUx\\b', 'UX', regex=True)\n",
"        .str.replace(r'\\bHr\\b', 'HR', regex=True)\n",
"    )\n",
"\n",
"    # 6. Clean company names\n",
"    df_clean['company_clean'] = (\n",
"        df_clean['company']\n",
"        .str.strip()\n",
"        .str.title()\n",
"        .str.replace(r'\\binc\\b\\.?', 'Inc.', case=False, regex=True)\n",
"        .str.replace(r'\\bllc\\b', 'LLC', case=False, regex=True)\n",
"        .str.replace(r'\\bltd\\b\\.?', 'Ltd.', case=False, regex=True)\n",
"        .str.replace(r'\\bco\\b\\.?', 'Co.', case=False, regex=True)\n",
"    )\n",
"\n",
"    return df_clean\n",
"\n",
"# Apply comprehensive cleaning\n",
"df_comprehensive = clean_text_comprehensive(df_text)\n",
"\n",
"print(\"Comprehensive cleaning results:\")\n",
"# Show before/after comparison\n",
"comparison_cols = [\n",
"    ('customer_name', 'customer_name_clean'),\n",
"    ('email', 'email_clean'),\n",
"    ('phone', 'phone_clean'),\n",
"    ('job_title', 'job_title_clean'),\n",
"    ('company', 'company_clean')\n",
"]\n",
"\n",
"for original, cleaned in comparison_cols:\n",
"    print(f\"\\n{original.upper()} CLEANING:\")\n",
"    sample = df_comprehensive[[original, cleaned]].head(5)\n",
"    for idx, row in sample.iterrows():\n",
"        print(f\"  Before: {row[original]}\")\n",
"        print(f\"  After:  {row[cleaned]}\")\n",
"        print()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Text standardization and validation\n",
"print(\"=== TEXT STANDARDIZATION AND VALIDATION ===\")\n",
"\n",
"def validate_cleaned_data(df):\n",
"    \"\"\"Validate cleaned data quality\"\"\"\n",
"    validation_results = {}\n",
"\n",
"    # Email validation\n",
"    email_pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$'\n",
"    valid_emails = df['email_clean'].str.match(email_pattern, na=False)\n",
"    validation_results['valid_emails'] = {\n",
"        'total': len(df),\n",
"        'valid': valid_emails.sum(),\n",
"        'invalid': (~valid_emails).sum(),\n",
"        'percentage_valid': (valid_emails.sum() / len(df)) * 100\n",
"    }\n",
"\n",
"    # Phone validation\n",
"    valid_phones = df['phone_clean'] != 'Invalid'\n",
"    validation_results['valid_phones'] = {\n",
"        'total': len(df),\n",
"        'valid': valid_phones.sum(),\n",
"        'invalid': (~valid_phones).sum(),\n",
"        'percentage_valid': (valid_phones.sum() / len(df)) * 100\n",
"    }\n",
"\n",
"    # Name validation (no digits, reasonable length)\n",
"    valid_names = (\n",
"        df['customer_name_clean'].str.len().between(2, 50) &\n",
"        ~df['customer_name_clean'].str.contains(r'\\d', na=False)\n",
"    )\n",
"    validation_results['valid_names'] = {\n",
"        'total': len(df),\n",
"        'valid': valid_names.sum(),\n",
"        'invalid': (~valid_names).sum(),\n",
"        'percentage_valid': (valid_names.sum() / len(df)) * 100\n",
"    }\n",
"\n",
"    return validation_results\n",
"\n",
"# Validate cleaned data\n",
"validation_results = validate_cleaned_data(df_comprehensive)\n",
"\n",
"print(\"Data validation results:\")\n",
"for field, results in validation_results.items():\n",
"    print(f\"\\n{field.upper()}:\")\n",
"    print(f\"  Total records: {results['total']}\")\n",
"    print(f\"  Valid: {results['valid']} ({results['percentage_valid']:.1f}%)\")\n",
"    print(f\"  Invalid: {results['invalid']}\")\n",
"\n",
"# Show some invalid examples\n",
"print(\"\\nExamples of invalid data:\")\n",
"invalid_emails = df_comprehensive[~df_comprehensive['email_clean'].str.match(r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$', na=False)]\n",
"if len(invalid_emails) > 0:\n",
"    print(f\"Invalid emails: {invalid_emails['email_clean'].head(3).tolist()}\")\n",
"\n",
"invalid_phones = df_comprehensive[df_comprehensive['phone_clean'] == 'Invalid']\n",
"if len(invalid_phones) > 0:\n",
"    print(f\"Invalid phones: {invalid_phones['phone'].head(3).tolist()}\")\n",
"\n",
"# Generate data quality summary\n",
"overall_quality = np.mean([results['percentage_valid'] for results in validation_results.values()])\n",
"print(f\"\\nOverall data quality score: {overall_quality:.1f}%\")"
]
},
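{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Note on validation semantics: str.match() anchors only at the start of the\n",
"# string, so the email pattern above needs its trailing '$' to reject junk at\n",
"# the end; str.fullmatch() anchors both ends for you. A tiny sketch:\n",
"demo = pd.Series(['a@b.com', 'a@b.com extra junk'])\n",
"pattern_core = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}'\n",
"print(demo.str.match(pattern_core))      # True, True  (start-anchored only)\n",
"print(demo.str.fullmatch(pattern_core))  # True, False (whole string must match)\n"
]
},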
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. Text Analysis and Insights\n",
"\n",
"Extracting business insights from text data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Text analysis for business insights\n",
"print(\"=== TEXT ANALYSIS FOR BUSINESS INSIGHTS ===\")\n",
"\n",
"def analyze_text_patterns(df):\n",
"    \"\"\"Analyze text patterns for business insights\"\"\"\n",
"    analysis = {}\n",
"\n",
"    # 1. Name analysis\n",
"    analysis['name_insights'] = {\n",
"        'avg_name_length': df['customer_name_clean'].str.len().mean(),\n",
"        'names_with_titles': df['customer_name_clean'].str.contains(r'Dr\\.|Prof\\.|Mr\\.|Ms\\.|Mrs\\.').sum(),\n",
"        'names_with_suffixes': df['customer_name_clean'].str.contains(r'Jr\\.|Sr\\.|III|II').sum(),\n",
"        'hyphenated_names': df['customer_name_clean'].str.contains('-').sum(),\n",
"        'most_common_first_names': df['customer_name_clean'].str.split().str[0].value_counts().head(5)\n",
"    }\n",
"\n",
"    # 2. Email domain analysis\n",
"    domains = df['email_clean'].str.split('@').str[1]\n",
"    analysis['email_insights'] = {\n",
"        'total_unique_domains': domains.nunique(),\n",
"        'top_domains': domains.value_counts().head(10),\n",
"        'edu_domains': domains.str.endswith('.edu').sum(),\n",
"        'com_domains': domains.str.endswith('.com').sum(),\n",
"        'org_domains': domains.str.endswith('.org').sum()\n",
"    }\n",
"\n",
"    # 3. Geographic analysis from addresses\n",
"    states = df['address_clean'].str.extract(r'\\b([A-Z]{2})\\s+\\d{5}')[0]\n",
"    analysis['geographic_insights'] = {\n",
"        'unique_states': states.nunique(),\n",
"        'top_states': states.value_counts().head(10),\n",
"        'coastal_states': states.isin(['CA', 'NY', 'FL', 'WA', 'OR']).sum()\n",
"    }\n",
"\n",
"    # 4. Job title analysis\n",
"    analysis['job_insights'] = {\n",
"        'unique_job_titles': df['job_title_clean'].nunique(),\n",
"        'top_job_titles': df['job_title_clean'].value_counts().head(10),\n",
"        'tech_jobs': df['job_title_clean'].str.contains('Engineer|Developer|Data|Software', case=False).sum(),\n",
"        'management_jobs': df['job_title_clean'].str.contains('Manager|Director|VP|President', case=False).sum()\n",
"    }\n",
"\n",
"    # 5. Company analysis\n",
"    analysis['company_insights'] = {\n",
"        'unique_companies': df['company_clean'].nunique(),\n",
"        'top_companies': df['company_clean'].value_counts().head(10),\n",
"        'inc_companies': df['company_clean'].str.contains(r'Inc\\.').sum(),\n",
"        'llc_companies': df['company_clean'].str.contains('LLC').sum(),\n",
"        'startups': df['company_clean'].str.contains('startup', case=False).sum()\n",
"    }\n",
"\n",
"    return analysis\n",
"\n",
"# Perform text analysis\n",
"text_analysis = analyze_text_patterns(df_comprehensive)\n",
"\n",
"print(\"TEXT ANALYSIS RESULTS:\")\n",
"\n",
"print(\"\\n1. NAME INSIGHTS:\")\n",
"name_insights = text_analysis['name_insights']\n",
"print(f\"  Average name length: {name_insights['avg_name_length']:.1f} characters\")\n",
"print(f\"  Names with titles: {name_insights['names_with_titles']}\")\n",
"print(f\"  Names with suffixes: {name_insights['names_with_suffixes']}\")\n",
"print(f\"  Hyphenated names: {name_insights['hyphenated_names']}\")\n",
"print(\"  Most common first names:\")\n",
"for name, count in name_insights['most_common_first_names'].items():\n",
"    print(f\"    {name}: {count}\")\n",
"\n",
"print(\"\\n2. EMAIL INSIGHTS:\")\n",
"email_insights = text_analysis['email_insights']\n",
"print(f\"  Unique domains: {email_insights['total_unique_domains']}\")\n",
"print(f\"  .edu domains: {email_insights['edu_domains']}\")\n",
"print(f\"  .com domains: {email_insights['com_domains']}\")\n",
"print(f\"  .org domains: {email_insights['org_domains']}\")\n",
"print(\"  Top domains:\")\n",
"for domain, count in email_insights['top_domains'].head(5).items():\n",
"    print(f\"    {domain}: {count}\")\n",
"\n",
"print(\"\\n3. GEOGRAPHIC INSIGHTS:\")\n",
"geo_insights = text_analysis['geographic_insights']\n",
"print(f\"  Unique states: {geo_insights['unique_states']}\")\n",
"print(f\"  Coastal states: {geo_insights['coastal_states']}\")\n",
"print(\"  Top states:\")\n",
"for state, count in geo_insights['top_states'].head(5).items():\n",
"    print(f\"    {state}: {count}\")\n",
"\n",
"print(\"\\n4. JOB INSIGHTS:\")\n",
"job_insights = text_analysis['job_insights']\n",
"print(f\"  Unique job titles: {job_insights['unique_job_titles']}\")\n",
"print(f\"  Tech jobs: {job_insights['tech_jobs']}\")\n",
"print(f\"  Management jobs: {job_insights['management_jobs']}\")\n",
"\n",
"print(\"\\n5. COMPANY INSIGHTS:\")\n",
"company_insights = text_analysis['company_insights']\n",
"print(f\"  Unique companies: {company_insights['unique_companies']}\")\n",
"print(f\"  Inc. companies: {company_insights['inc_companies']}\")\n",
"print(f\"  LLC companies: {company_insights['llc_companies']}\")\n",
"print(f\"  Startups: {company_insights['startups']}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sentiment analysis of product reviews\n",
"print(\"=== SENTIMENT ANALYSIS OF REVIEWS ===\")\n",
"\n",
"def analyze_review_sentiment(df):\n",
"    \"\"\"Analyze sentiment in product reviews\"\"\"\n",
"    # Define sentiment word lists\n",
"    positive_words = [\n",
"        'great', 'excellent', 'amazing', 'fantastic', 'love', 'perfect',\n",
"        'outstanding', 'superb', 'incredible', 'wonderful', 'awesome',\n",
"        'brilliant', 'impressive', 'remarkable', 'exceptional'\n",
"    ]\n",
"\n",
"    negative_words = [\n",
"        'terrible', 'poor', 'disappointing', 'waste', 'cheap', 'broke',\n",
"        'fake', 'mediocre', 'awful', 'horrible', 'useless', 'worst',\n",
"        'defective', 'junk', 'garbage'\n",
"    ]\n",
"\n",
"    # Create patterns (word boundaries so e.g. 'love' does not match inside 'gloves')\n",
"    positive_pattern = r'\\b(?:' + '|'.join(positive_words) + r')\\b'\n",
"    negative_pattern = r'\\b(?:' + '|'.join(negative_words) + r')\\b'\n",
"\n",
"    # Count sentiment words\n",
"    df['positive_word_count'] = df['product_reviews'].str.lower().str.count(positive_pattern)\n",
"    df['negative_word_count'] = df['product_reviews'].str.lower().str.count(negative_pattern)\n",
"\n",
"    # Calculate sentiment score\n",
"    df['sentiment_score'] = df['positive_word_count'] - df['negative_word_count']\n",
"\n",
"    # Categorize sentiment\n",
"    def categorize_sentiment(score):\n",
"        if score > 0:\n",
"            return 'Positive'\n",
"        elif score < 0:\n",
"            return 'Negative'\n",
"        else:\n",
"            return 'Neutral'\n",
"\n",
"    df['sentiment_category'] = df['sentiment_score'].apply(categorize_sentiment)\n",
"\n",
"    # Additional features\n",
"    df['has_exclamation'] = df['product_reviews'].str.contains('!').astype(int)\n",
"    df['has_caps'] = df['product_reviews'].str.contains(r'[A-Z]{3,}').astype(int)\n",
"    df['review_word_count'] = df['product_reviews'].str.split().str.len()\n",
"\n",
"    return df\n",
"\n",
"# Analyze sentiment\n",
"df_sentiment = analyze_review_sentiment(df_comprehensive.copy())\n",
"\n",
"print(\"Sentiment analysis results:\")\n",
"sentiment_summary = df_sentiment['sentiment_category'].value_counts()\n",
"print(sentiment_summary)\n",
"print(\"\\nSentiment distribution:\")\n",
"for category, count in sentiment_summary.items():\n",
"    percentage = (count / len(df_sentiment)) * 100\n",
"    print(f\"  {category}: {count} ({percentage:.1f}%)\")\n",
"\n",
"print(\"\\nSentiment score statistics:\")\n",
"print(df_sentiment['sentiment_score'].describe())\n",
"\n",
"print(\"\\nSample reviews by sentiment:\")\n",
"for sentiment in ['Positive', 'Negative', 'Neutral']:\n",
"    sample_reviews = df_sentiment[df_sentiment['sentiment_category'] == sentiment]['product_reviews'].head(2)\n",
"    print(f\"\\n{sentiment} reviews:\")\n",
"    for review in sample_reviews:\n",
"        print(f\"  - {review}\")\n",
"\n",
"# Correlation analysis\n",
"print(\"\\nCorrelation between text features:\")\n",
"text_features = ['positive_word_count', 'negative_word_count', 'sentiment_score',\n",
"                 'has_exclamation', 'has_caps', 'review_word_count']\n",
"correlation_matrix = df_sentiment[text_features].corr()\n",
"print(correlation_matrix.round(3))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Practice Exercises\n",
"\n",
"Apply string operations to complex text processing scenarios:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Exercise 1: Advanced Text Cleaning Pipeline\n",
"# Create a comprehensive text cleaning and validation system:\n",
"# - Handle international characters and encoding issues\n",
"# - Implement fuzzy matching for duplicate detection (a starter sketch follows this cell)\n",
"# - Create a data quality scoring system\n",
"# - Generate cleaning reports with statistics\n",
"\n",
"def advanced_text_cleaning_pipeline(df):\n",
"    \"\"\"Advanced text cleaning with international support and validation\"\"\"\n",
"    # Your implementation here\n",
"    pass\n",
"\n",
"# cleaned_df = advanced_text_cleaning_pipeline(df_text)\n",
"# print(\"Advanced text cleaning completed\")"
]
},
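{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Starter sketch for the fuzzy-matching bullet in Exercise 1, using only the\n",
"# standard library (difflib). Real pipelines often use dedicated packages,\n",
"# but the core idea - a similarity ratio over normalized strings - is the same.\n",
"from difflib import SequenceMatcher\n",
"\n",
"def name_similarity(a, b):\n",
"    \"\"\"Similarity ratio in [0, 1] between two case-folded, stripped names.\"\"\"\n",
"    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()\n",
"\n",
"print(name_similarity('John Smith', 'john smith'))    # 1.0 after normalization\n",
"print(name_similarity('John Smith', 'Jon Smyth'))     # high, but below 1.0\n",
"print(name_similarity('John Smith', 'Mary Johnson'))  # low\n"
]
},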
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Exercise 2: Text Mining and Information Extraction\n",
"# Extract structured information from unstructured text:\n",
"# - Extract entities (names, organizations, locations)\n",
"# - Parse complex address formats\n",
"# - Identify and extract contact information\n",
"# - Create knowledge graphs from text relationships\n",
"\n",
"# Your code here:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Exercise 3: Business Intelligence from Text\n",
"# Create business insights from text analysis:\n",
"# - Customer segmentation based on text patterns\n",
"# - Market analysis from company and job data\n",
"# - Geographic market penetration analysis\n",
"# - Competitive intelligence from text data\n",
"\n",
"# Your code here:\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Key Takeaways\n",
"\n",
"1. **String Accessor (`.str`)**:\n",
"   - Essential for all pandas string operations\n",
"   - Works with Series containing strings\n",
"   - Handles NaN values gracefully\n",
"\n",
"2. **Basic Operations**:\n",
"   - **Case**: `.upper()`, `.lower()`, `.title()`, `.capitalize()`\n",
"   - **Length**: `.len()`\n",
"   - **Slicing**: `.str[start:end]`\n",
"   - **Splitting**: `.str.split()`\n",
"\n",
"3. **Pattern Matching**:\n",
"   - **Contains**: `.str.contains()` for pattern detection\n",
"   - **Startswith/Endswith**: `.str.startswith()`, `.str.endswith()`\n",
"   - **Regular Expressions**: Use `regex=True` parameter\n",
"\n",
"4. **Text Cleaning**:\n",
"   - **Replace**: `.str.replace()` for substitution\n",
"   - **Strip**: `.str.strip()` for whitespace removal\n",
"   - **Extract**: `.str.extract()` for regex pattern extraction\n",
"\n",
"## String Operations Quick Reference\n",
"\n",
"```python\n",
"# Basic transformations\n",
"df['col'].str.upper()   # Uppercase\n",
"df['col'].str.lower()   # Lowercase\n",
"df['col'].str.title()   # Title Case\n",
"df['col'].str.len()     # String length\n",
"df['col'].str.strip()   # Remove whitespace\n",
"\n",
"# Pattern matching\n",
"df['col'].str.contains('pattern')   # Check if contains\n",
"df['col'].str.startswith('prefix')  # Check if starts with\n",
"df['col'].str.endswith('suffix')    # Check if ends with\n",
"\n",
"# Extraction and replacement\n",
"df['col'].str.extract(r'(\\d+)')       # Extract pattern\n",
"df['col'].str.replace('old', 'new')   # Replace text\n",
"df['col'].str.split('delimiter')      # Split string\n",
"\n",
"# Advanced regex\n",
"df['col'].str.findall(r'\\b\\w+\\b')   # Find all matches\n",
"df['col'].str.count(r'\\d')          # Count pattern occurrences\n",
"```\n",
"\n",
"## Common Text Cleaning Patterns\n",
"\n",
"| Task | Pattern | Example |\n",
"|------|---------|---------|\n",
"| Remove punctuation | `r'[^\\w\\s]'` | `str.replace(r'[^\\w\\s]', '', regex=True)` |\n",
"| Extract digits | `r'\\d+'` | `str.extract(r'(\\d+)')` |\n",
"| Clean phone numbers | `r'\\D'` | `str.replace(r'\\D', '', regex=True)` |\n",
"| Extract email parts | `r'([^@]+)@(.+)'` | `str.extract(r'([^@]+)@(.+)')` |\n",
"| Standardize whitespace | `r'\\s+'` | `str.replace(r'\\s+', ' ', regex=True)` |\n",
"\n",
"## Best Practices\n",
"\n",
"1. **Data Validation**: Always validate cleaned data\n",
"2. **Preserve Originals**: Keep original columns during cleaning\n",
"3. **Handle Edge Cases**: Plan for missing values and unusual formats\n",
"4. **Performance**: Use vectorized operations instead of apply() when possible (see the sketch below)\n",
"5. **Documentation**: Document cleaning rules and business logic\n",
"6. **Testing**: Test regex patterns thoroughly with edge cases\n",
"\n",
"## Business Applications\n",
"\n",
"- **Customer Data Cleaning**: Standardize names, addresses, contacts\n",
"- **Market Research**: Analyze company names and domains\n",
"- **Sentiment Analysis**: Process customer reviews and feedback\n",
"- **Data Integration**: Clean and match data from multiple sources\n",
"- **Compliance**: Standardize data for regulatory requirements"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.13.3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}