{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Session 1 - DataFrames - Lesson 7: Merging and Joining DataFrames\n",
"\n",
"## Learning Objectives\n",
"- Master different types of joins (inner, outer, left, right)\n",
"- Understand when to use merge vs join vs concat\n",
"- Handle duplicate keys and join conflicts\n",
"- Learn advanced merging techniques and best practices\n",
"- Practice with real-world data integration scenarios\n",
"\n",
"## Prerequisites\n",
"- Completed Lessons 1-6\n",
"- Understanding of relational database concepts (helpful)\n",
"- Basic knowledge of SQL joins (helpful but not required)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Import required libraries\n",
"import pandas as pd\n",
"import numpy as np\n",
"from datetime import datetime, timedelta\n",
"import matplotlib.pyplot as plt\n",
"import warnings\n",
"warnings.filterwarnings('ignore')\n",
"\n",
"# Set display options\n",
"pd.set_option('display.max_columns', None)\n",
"pd.set_option('display.max_rows', 50)\n",
"\n",
"print(\"Libraries loaded successfully!\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Creating Sample Datasets\n",
"\n",
"Let's create realistic datasets that represent common business scenarios."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Create sample datasets for merging examples\n",
"np.random.seed(42)\n",
"\n",
"# Customer dataset\n",
"customers_data = {\n",
"    'customer_id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],\n",
"    'customer_name': ['Alice Johnson', 'Bob Smith', 'Charlie Brown', 'Diana Prince', 'Eve Wilson',\n",
"                      'Frank Miller', 'Grace Lee', 'Henry Davis', 'Ivy Chen', 'Jack Robinson'],\n",
"    'email': ['alice@email.com', 'bob@email.com', 'charlie@email.com', 'diana@email.com', 'eve@email.com',\n",
"              'frank@email.com', 'grace@email.com', 'henry@email.com', 'ivy@email.com', 'jack@email.com'],\n",
"    'age': [28, 35, 42, 31, 29, 45, 38, 33, 27, 41],\n",
"    'city': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix',\n",
"             'Philadelphia', 'San Antonio', 'San Diego', 'Dallas', 'San Jose'],\n",
"    'signup_date': pd.date_range('2023-01-01', periods=10, freq='M')\n",
"}\n",
"\n",
"df_customers = pd.DataFrame(customers_data)\n",
"\n",
"# Orders dataset\n",
"orders_data = {\n",
"    'order_id': [101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112],\n",
"    'customer_id': [1, 2, 1, 3, 4, 2, 5, 1, 6, 11, 3, 2], # Note: customer_id 11 doesn't exist in customers\n",
"    'order_date': pd.date_range('2023-06-01', periods=12, freq='W'),\n",
"    'product': ['Laptop', 'Phone', 'Tablet', 'Laptop', 'Monitor', 'Phone',\n",
"                'Headphones', 'Mouse', 'Keyboard', 'Laptop', 'Tablet', 'Monitor'],\n",
"    'quantity': [1, 2, 1, 1, 1, 1, 3, 2, 1, 1, 2, 1],\n",
"    'amount': [1200, 800, 400, 1200, 300, 800, 150, 50, 75, 1200, 800, 300]\n",
"}\n",
"\n",
"df_orders = pd.DataFrame(orders_data)\n",
"\n",
"# Product information dataset\n",
"products_data = {\n",
"    'product': ['Laptop', 'Phone', 'Tablet', 'Monitor', 'Headphones', 'Mouse', 'Keyboard', 'Webcam'],\n",
"    'category': ['Electronics', 'Electronics', 'Electronics', 'Electronics',\n",
"                 'Audio', 'Accessories', 'Accessories', 'Electronics'],\n",
"    'price': [1200, 800, 400, 300, 150, 50, 75, 100],\n",
"    'supplier': ['TechCorp', 'MobileCorp', 'TechCorp', 'DisplayCorp',\n",
"                 'AudioCorp', 'AccessoryCorp', 'AccessoryCorp', 'TechCorp']\n",
"}\n",
"\n",
"df_products = pd.DataFrame(products_data)\n",
"\n",
"# Customer segments dataset\n",
"segments_data = {\n",
"    'customer_id': [1, 2, 3, 4, 5, 6, 7, 8, 12, 13], # Some customers not in main customer table\n",
"    'segment': ['Premium', 'Standard', 'Premium', 'Standard', 'Basic',\n",
"                'Premium', 'Standard', 'Basic', 'Premium', 'Standard'],\n",
"    'loyalty_points': [1500, 800, 1200, 600, 200, 1800, 750, 300, 2000, 900]\n",
"}\n",
"\n",
"df_segments = pd.DataFrame(segments_data)\n",
"\n",
"print(\"Sample datasets created:\")\n",
"print(f\"Customers: {df_customers.shape}\")\n",
"print(f\"Orders: {df_orders.shape}\")\n",
"print(f\"Products: {df_products.shape}\")\n",
"print(f\"Segments: {df_segments.shape}\")\n",
"\n",
"print(\"\\nCustomers dataset:\")\n",
"print(df_customers.head())\n",
"\n",
"print(\"\\nOrders dataset:\")\n",
"print(df_orders.head())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Basic Merge Operations\n",
"\n",
"Understanding the fundamental merge operations and join types."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Inner Join - only matching records\n",
"print(\"=== INNER JOIN ===\")\n",
"inner_join = pd.merge(df_customers, df_orders, on='customer_id', how='inner')\n",
"print(f\"Result shape: {inner_join.shape}\")\n",
"print(\"Sample results:\")\n",
"print(inner_join[['customer_name', 'order_id', 'product', 'amount']].head())\n",
"\n",
"print(f\"\\nUnique customers in result: {inner_join['customer_id'].nunique()}\")\n",
"print(f\"Total orders: {len(inner_join)}\")\n",
"\n",
"# Check which customers have orders\n",
"customers_with_orders = inner_join['customer_id'].unique()\n",
"print(f\"Customers with orders: {sorted(customers_with_orders)}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Left Join - all records from left table\n",
"print(\"=== LEFT JOIN ===\")\n",
"left_join = pd.merge(df_customers, df_orders, on='customer_id', how='left')\n",
"print(f\"Result shape: {left_join.shape}\")\n",
"print(\"Sample results:\")\n",
"print(left_join[['customer_name', 'order_id', 'product', 'amount']].head(10))\n",
"\n",
"# Check customers without orders\n",
"customers_without_orders = left_join[left_join['order_id'].isnull()]['customer_name'].tolist()\n",
"print(f\"\\nCustomers without orders: {customers_without_orders}\")\n",
"\n",
"# Summary statistics\n",
"print(f\"\\nTotal records: {len(left_join)}\")\n",
"print(f\"Records with orders: {left_join['order_id'].notna().sum()}\")\n",
"print(f\"Records without orders: {left_join['order_id'].isnull().sum()}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Right Join - all records from right table\n",
"print(\"=== RIGHT JOIN ===\")\n",
"right_join = pd.merge(df_customers, df_orders, on='customer_id', how='right')\n",
"print(f\"Result shape: {right_join.shape}\")\n",
"print(\"Sample results:\")\n",
"print(right_join[['customer_name', 'order_id', 'product', 'amount']].head())\n",
"\n",
"# Check orders without customer information\n",
"orders_without_customers = right_join[right_join['customer_name'].isnull()]\n",
"print(f\"\\nOrders without customer info: {len(orders_without_customers)}\")\n",
"if len(orders_without_customers) > 0:\n",
"    print(orders_without_customers[['customer_id', 'order_id', 'product', 'amount']])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Outer Join - all records from both tables\n",
"print(\"=== OUTER JOIN ===\")\n",
"outer_join = pd.merge(df_customers, df_orders, on='customer_id', how='outer')\n",
"print(f\"Result shape: {outer_join.shape}\")\n",
"\n",
"# Analyze the result\n",
"print(\"\\nData quality analysis:\")\n",
"print(f\"Records with complete customer info: {outer_join['customer_name'].notna().sum()}\")\n",
"print(f\"Records with complete order info: {outer_join['order_id'].notna().sum()}\")\n",
"print(f\"Records with both customer and order info: {(outer_join['customer_name'].notna() & outer_join['order_id'].notna()).sum()}\")\n",
"\n",
"# Show different categories of records\n",
"print(\"\\nCustomers without orders:\")\n",
"customers_only = outer_join[(outer_join['customer_name'].notna()) & (outer_join['order_id'].isnull())]\n",
"print(customers_only[['customer_name', 'city']].drop_duplicates())\n",
"\n",
"print(\"\\nOrders without customer data:\")\n",
"orders_only = outer_join[(outer_join['customer_name'].isnull()) & (outer_join['order_id'].notna())]\n",
"print(orders_only[['customer_id', 'order_id', 'product', 'amount']])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Multiple Table Joins\n",
"\n",
"Combining data from multiple sources in sequence."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Three-way join: Customers + Orders + Products\n",
"print(\"=== THREE-WAY JOIN ===\")\n",
"\n",
"# Step 1: Join customers and orders\n",
"customer_orders = pd.merge(df_customers, df_orders, on='customer_id', how='inner')\n",
"print(f\"After joining customers and orders: {customer_orders.shape}\")\n",
"\n",
"# Step 2: Join with products\n",
"complete_data = pd.merge(customer_orders, df_products, on='product', how='left')\n",
"print(f\"After joining with products: {complete_data.shape}\")\n",
"\n",
"# Display comprehensive view\n",
"print(\"\\nComplete order information:\")\n",
"display_cols = ['customer_name', 'order_id', 'product', 'category', 'quantity', 'amount', 'price', 'supplier']\n",
"print(complete_data[display_cols].head())\n",
"\n",
"# Verify data consistency\n",
"print(\"\\nData consistency check:\")\n",
"# Check if order amount matches product price * quantity\n",
"complete_data['calculated_amount'] = complete_data['price'] * complete_data['quantity']\n",
"amount_matches = (complete_data['amount'] == complete_data['calculated_amount']).all()\n",
"print(f\"Order amounts match calculated amounts: {amount_matches}\")\n",
"\n",
"if not amount_matches:\n",
"    mismatched = complete_data[complete_data['amount'] != complete_data['calculated_amount']]\n",
"    print(f\"\\nMismatched records: {len(mismatched)}\")\n",
"    print(mismatched[['order_id', 'product', 'amount', 'calculated_amount']])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Add customer segment information\n",
"print(\"=== ADDING CUSTOMER SEGMENTS ===\")\n",
"\n",
"# Join with segments (left join to keep all customers)\n",
"customers_with_segments = pd.merge(df_customers, df_segments, on='customer_id', how='left')\n",
"print(f\"Customers with segments shape: {customers_with_segments.shape}\")\n",
"\n",
"# Check which customers don't have segment information\n",
"missing_segments = customers_with_segments[customers_with_segments['segment'].isnull()]\n",
"print(f\"\\nCustomers without segment info: {len(missing_segments)}\")\n",
"if len(missing_segments) > 0:\n",
"    print(missing_segments[['customer_name', 'city']])\n",
"\n",
"# Create comprehensive customer profile\n",
"full_customer_profile = pd.merge(complete_data, df_segments, on='customer_id', how='left')\n",
"print(f\"\\nFull customer profile shape: {full_customer_profile.shape}\")\n",
"\n",
"# Analyze by segment\n",
"segment_analysis = full_customer_profile.groupby('segment').agg({\n",
"    'amount': ['sum', 'mean', 'count'],\n",
"    'customer_id': 'nunique'\n",
"}).round(2)\n",
"segment_analysis.columns = ['Total_Revenue', 'Avg_Order_Value', 'Total_Orders', 'Unique_Customers']\n",
"print(\"\\nRevenue by customer segment:\")\n",
"print(segment_analysis)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Advanced Merge Techniques\n",
"\n",
"Handling complex merging scenarios and edge cases."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Merge with different column names\n",
"print(\"=== MERGE WITH DIFFERENT COLUMN NAMES ===\")\n",
"\n",
"# Create a dataset with a different column name\n",
"customer_demographics = pd.DataFrame({\n",
"    'cust_id': [1, 2, 3, 4, 5],\n",
"    'income_range': ['50-75k', '75-100k', '50-75k', '100k+', '25-50k'],\n",
"    'education': ['Bachelor', 'Master', 'PhD', 'Master', 'Bachelor'],\n",
"    'occupation': ['Engineer', 'Manager', 'Professor', 'Director', 'Analyst']\n",
"})\n",
"\n",
"# Merge using left_on and right_on parameters\n",
"customers_with_demographics = pd.merge(\n",
"    df_customers,\n",
"    customer_demographics,\n",
"    left_on='customer_id',\n",
"    right_on='cust_id',\n",
"    how='left'\n",
")\n",
"\n",
"print(\"Merge with different column names:\")\n",
"print(customers_with_demographics[['customer_name', 'customer_id', 'cust_id', 'income_range', 'education']].head())\n",
"\n",
"# Clean up duplicate columns\n",
"customers_with_demographics = customers_with_demographics.drop('cust_id', axis=1)\n",
"print(f\"\\nAfter cleanup: {customers_with_demographics.shape}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Merge on multiple columns\n",
"print(\"=== MERGE ON MULTIPLE COLUMNS ===\")\n",
"\n",
"# Create time-based pricing data\n",
"pricing_data = pd.DataFrame({\n",
"    'product': ['Laptop', 'Laptop', 'Phone', 'Phone', 'Tablet', 'Tablet'],\n",
"    'date': pd.to_datetime(['2023-06-01', '2023-08-01', '2023-06-01', '2023-08-01', '2023-06-01', '2023-08-01']),\n",
"    'price': [1200, 1100, 800, 750, 400, 380],\n",
"    'promotion': [False, True, False, True, False, True]\n",
"})\n",
"\n",
"# Add year-month to orders for matching\n",
"df_orders_with_period = df_orders.copy()\n",
"df_orders_with_period['order_month'] = df_orders_with_period['order_date'].dt.to_period('M').dt.start_time\n",
"\n",
"# Create matching periods in pricing data\n",
"pricing_data['period'] = pricing_data['date'].dt.to_period('M').dt.start_time\n",
"\n",
"# Merge on product and time period\n",
"orders_with_pricing = pd.merge(\n",
"    df_orders_with_period,\n",
"    pricing_data,\n",
"    left_on=['product', 'order_month'],\n",
"    right_on=['product', 'period'],\n",
"    how='left'\n",
")\n",
"\n",
"print(\"Orders with time-based pricing:\")\n",
"print(orders_with_pricing[['order_id', 'product', 'order_date', 'amount', 'price', 'promotion']].head())\n",
"\n",
"# Check for pricing discrepancies\n",
"pricing_discrepancies = orders_with_pricing[\n",
"    (orders_with_pricing['amount'] != orders_with_pricing['price'] * orders_with_pricing['quantity']) &\n",
"    orders_with_pricing['price'].notna()\n",
"]\n",
"print(f\"\\nOrders with pricing discrepancies: {len(pricing_discrepancies)}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Handling duplicate keys in merge\n",
"print(\"=== HANDLING DUPLICATE KEYS ===\")\n",
"\n",
"# Create data with duplicate keys\n",
"customer_contacts = pd.DataFrame({\n",
"    'customer_id': [1, 1, 2, 2, 3],\n",
"    'contact_type': ['email', 'phone', 'email', 'phone', 'email'],\n",
"    'contact_value': ['alice@email.com', '555-0101', 'bob@email.com', '555-0102', 'charlie@email.com'],\n",
"    'is_primary': [True, False, True, True, True]\n",
"})\n",
"\n",
"print(\"Customer contacts with duplicates:\")\n",
"print(customer_contacts)\n",
"\n",
"# Merge will create a Cartesian product for duplicate keys\n",
"customers_with_contacts = pd.merge(df_customers, customer_contacts, on='customer_id', how='inner')\n",
"print(f\"\\nResult of merge with duplicates: {customers_with_contacts.shape}\")\n",
"print(customers_with_contacts[['customer_name', 'contact_type', 'contact_value', 'is_primary']].head())\n",
"\n",
"# Strategy 1: Filter before merge\n",
"primary_contacts = customer_contacts[customer_contacts['is_primary'] == True]\n",
"customers_primary_contacts = pd.merge(df_customers, primary_contacts, on='customer_id', how='left')\n",
"print(f\"\\nAfter filtering to primary contacts: {customers_primary_contacts.shape}\")\n",
"\n",
"# Strategy 2: Pivot contacts to columns\n",
"contacts_pivoted = customer_contacts.pivot_table(\n",
"    index='customer_id',\n",
"    columns='contact_type',\n",
"    values='contact_value',\n",
"    aggfunc='first'\n",
").reset_index()\n",
"print(\"\\nPivoted contacts:\")\n",
"print(contacts_pivoted)\n",
"\n",
"customers_with_pivoted_contacts = pd.merge(df_customers, contacts_pivoted, on='customer_id', how='left')\n",
"print(f\"\\nAfter merging pivoted contacts: {customers_with_pivoted_contacts.shape}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Index-based Joins\n",
"\n",
"Using DataFrame indices for joining operations."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Set up DataFrames with indices\n",
"print(\"=== INDEX-BASED JOINS ===\")\n",
"\n",
"# Set customer_id as index\n",
"customers_indexed = df_customers.set_index('customer_id')\n",
"segments_indexed = df_segments.set_index('customer_id')\n",
"\n",
"print(\"Customers with index:\")\n",
"print(customers_indexed.head())\n",
"\n",
"# Join using indices\n",
"joined_by_index = customers_indexed.join(segments_indexed, how='left')\n",
"print(f\"\\nJoined by index shape: {joined_by_index.shape}\")\n",
"print(joined_by_index[['customer_name', 'city', 'segment', 'loyalty_points']].head())\n",
"\n",
"# Compare with merge\n",
"merged_equivalent = pd.merge(df_customers, df_segments, on='customer_id', how='left')\n",
"print(f\"\\nEquivalent merge shape: {merged_equivalent.shape}\")\n",
"\n",
"# Verify they're the same (after sorting)\n",
"joined_sorted = joined_by_index.reset_index().sort_values('customer_id')\n",
"merged_sorted = merged_equivalent.sort_values('customer_id')\n",
"are_equal = joined_sorted.equals(merged_sorted)\n",
"print(f\"Results are identical: {are_equal}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Multi-index joins\n",
"print(\"=== MULTI-INDEX JOINS ===\")\n",
"\n",
"# Create a dataset with multiple index levels\n",
"sales_by_region_product = pd.DataFrame({\n",
"    'region': ['North', 'North', 'South', 'South', 'East', 'East'],\n",
"    'product': ['Laptop', 'Phone', 'Laptop', 'Phone', 'Laptop', 'Phone'],\n",
"    'sales_target': [10, 15, 8, 12, 12, 18],\n",
"    'commission_rate': [0.05, 0.04, 0.06, 0.05, 0.05, 0.04]\n",
"})\n",
"\n",
"# Set multi-index\n",
"sales_targets = sales_by_region_product.set_index(['region', 'product'])\n",
"print(\"Sales targets with multi-index:\")\n",
"print(sales_targets)\n",
"\n",
"# Create customer orders with region mapping\n",
"customer_regions = {\n",
"    1: 'North', 2: 'South', 3: 'East', 4: 'North', 5: 'South', 6: 'East'\n",
"}\n",
"\n",
"orders_with_region = df_orders.copy()\n",
"orders_with_region['region'] = orders_with_region['customer_id'].map(customer_regions)\n",
"orders_with_region = orders_with_region.dropna(subset=['region'])\n",
"\n",
"# Merge on multiple columns to match multi-index\n",
"orders_with_targets = pd.merge(\n",
"    orders_with_region,\n",
"    sales_targets.reset_index(),\n",
"    on=['region', 'product'],\n",
"    how='left'\n",
")\n",
"\n",
"print(\"\\nOrders with sales targets:\")\n",
"print(orders_with_targets[['order_id', 'region', 'product', 'amount', 'sales_target', 'commission_rate']].head())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. Concatenation Operations\n",
"\n",
"Combining DataFrames vertically and horizontally."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Vertical concatenation (stacking DataFrames)\n",
"print(\"=== VERTICAL CONCATENATION ===\")\n",
"\n",
"# Create additional customer data (new batch)\n",
"new_customers = pd.DataFrame({\n",
"    'customer_id': [11, 12, 13, 14, 15],\n",
"    'customer_name': ['Kate Wilson', 'Liam Brown', 'Mia Garcia', 'Noah Jones', 'Olivia Miller'],\n",
"    'email': ['kate@email.com', 'liam@email.com', 'mia@email.com', 'noah@email.com', 'olivia@email.com'],\n",
"    'age': [26, 39, 31, 44, 28],\n",
"    'city': ['Austin', 'Seattle', 'Denver', 'Boston', 'Miami'],\n",
"    'signup_date': pd.date_range('2024-01-01', periods=5, freq='M')\n",
"})\n",
"\n",
"# Concatenate vertically\n",
"all_customers = pd.concat([df_customers, new_customers], ignore_index=True)\n",
"print(f\"Original customers: {len(df_customers)}\")\n",
"print(f\"New customers: {len(new_customers)}\")\n",
"print(f\"Combined customers: {len(all_customers)}\")\n",
"\n",
"print(\"\\nCombined customer data:\")\n",
"print(all_customers.tail())\n",
"\n",
"# Concatenation with different columns\n",
"customers_with_extra_info = pd.DataFrame({\n",
"    'customer_id': [16, 17],\n",
"    'customer_name': ['Paul Davis', 'Quinn Taylor'],\n",
"    'email': ['paul@email.com', 'quinn@email.com'],\n",
"    'age': [35, 29],\n",
"    'city': ['Portland', 'Nashville'],\n",
"    'signup_date': pd.date_range('2024-06-01', periods=2, freq='M'),\n",
"    'referral_source': ['Google', 'Facebook'] # Extra column\n",
"})\n",
"\n",
"# Concat with different columns (creates NaN for missing columns)\n",
"all_customers_extended = pd.concat([all_customers, customers_with_extra_info], ignore_index=True, sort=False)\n",
"print(f\"\\nAfter adding customers with extra info: {all_customers_extended.shape}\")\n",
"print(\"Missing values in referral_source:\")\n",
"print(all_customers_extended['referral_source'].isnull().sum())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Horizontal concatenation\n",
"print(\"=== HORIZONTAL CONCATENATION ===\")\n",
"\n",
"# Split customer data into parts\n",
"customer_basic_info = df_customers[['customer_id', 'customer_name', 'email']]\n",
"customer_demographics = df_customers[['customer_id', 'age', 'city', 'signup_date']]\n",
"\n",
"print(\"Customer basic info:\")\n",
"print(customer_basic_info.head())\n",
"\n",
"print(\"\\nCustomer demographics:\")\n",
"print(customer_demographics.head())\n",
"\n",
"# Concatenate horizontally (by index)\n",
"customers_recombined = pd.concat([customer_basic_info, customer_demographics.drop('customer_id', axis=1)], axis=1)\n",
"print(f\"\\nRecombined shape: {customers_recombined.shape}\")\n",
"print(customers_recombined.head())\n",
"\n",
"# Verify it matches original\n",
"columns_match = set(customers_recombined.columns) == set(df_customers.columns)\n",
"print(f\"\\nColumns match original: {columns_match}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Concat with keys (creating a hierarchical index)\n",
"print(\"=== CONCAT WITH KEYS ===\")\n",
"\n",
"# Create quarterly sales data\n",
"q1_sales = pd.DataFrame({\n",
"    'product': ['Laptop', 'Phone', 'Tablet'],\n",
"    'units_sold': [50, 75, 30],\n",
"    'revenue': [60000, 60000, 12000]\n",
"})\n",
"\n",
"q2_sales = pd.DataFrame({\n",
"    'product': ['Laptop', 'Phone', 'Tablet'],\n",
"    'units_sold': [45, 80, 35],\n",
"    'revenue': [54000, 64000, 14000]\n",
"})\n",
"\n",
"# Concatenate with keys\n",
"quarterly_sales = pd.concat([q1_sales, q2_sales], keys=['Q1', 'Q2'])\n",
"print(\"Quarterly sales with hierarchical index:\")\n",
"print(quarterly_sales)\n",
"\n",
"# Access specific quarter\n",
"print(\"\\nQ1 sales only:\")\n",
"print(quarterly_sales.loc['Q1'])\n",
"\n",
"# Create summary comparison (keys become hierarchical columns along axis=1)\n",
"quarterly_comparison = pd.concat([q1_sales.set_index('product'), q2_sales.set_index('product')],\n",
"                                 keys=['Q1', 'Q2'], axis=1)\n",
"print(\"\\nQuarterly comparison (side by side):\")\n",
"print(quarterly_comparison)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6. Performance and Best Practices\n",
"\n",
"Optimizing merge operations and avoiding common pitfalls."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Performance comparison: merge vs join\n",
"import time\n",
"\n",
"print(\"=== PERFORMANCE COMPARISON ===\")\n",
"\n",
"# Create larger datasets for performance testing\n",
"np.random.seed(42)\n",
"large_customers = pd.DataFrame({\n",
"    'customer_id': range(1, 10001),\n",
"    'customer_name': [f'Customer_{i}' for i in range(1, 10001)],\n",
"    'city': np.random.choice(['New York', 'Los Angeles', 'Chicago'], 10000)\n",
"})\n",
"\n",
"large_orders = pd.DataFrame({\n",
"    'order_id': range(1, 50001),\n",
"    'customer_id': np.random.randint(1, 10001, 50000),\n",
"    'amount': np.random.normal(100, 30, 50000)\n",
"})\n",
"\n",
"print(f\"Large customers: {large_customers.shape}\")\n",
"print(f\"Large orders: {large_orders.shape}\")\n",
"\n",
"# Test merge performance\n",
"start_time = time.time()\n",
"merged_result = pd.merge(large_customers, large_orders, on='customer_id', how='inner')\n",
"merge_time = time.time() - start_time\n",
"\n",
"# Test join performance\n",
"customers_indexed = large_customers.set_index('customer_id')\n",
"orders_indexed = large_orders.set_index('customer_id')\n",
"\n",
"# Note: the one-time cost of set_index is not included in the join timing\n",
"start_time = time.time()\n",
"joined_result = customers_indexed.join(orders_indexed, how='inner')\n",
"join_time = time.time() - start_time\n",
"\n",
"print(f\"\\nMerge time: {merge_time:.4f} seconds\")\n",
"print(f\"Join time: {join_time:.4f} seconds\")\n",
"print(f\"Merge/join time ratio: {merge_time/join_time:.2f} (values > 1 mean the index-based join was faster)\")\n",
"\n",
"print(f\"\\nResults shape - Merge: {merged_result.shape}, Join: {joined_result.shape}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Best practices and common pitfalls\n",
"print(\"=== BEST PRACTICES ===\")\n",
"\n",
"def analyze_merge_keys(df1, df2, key_col):\n",
"    \"\"\"Analyze merge keys before joining\"\"\"\n",
"    print(f\"\\n--- Analyzing merge on '{key_col}' ---\")\n",
"\n",
"    # Check for duplicates\n",
"    df1_dups = df1[key_col].duplicated().sum()\n",
"    df2_dups = df2[key_col].duplicated().sum()\n",
"\n",
"    print(f\"Duplicates in left table: {df1_dups}\")\n",
"    print(f\"Duplicates in right table: {df2_dups}\")\n",
"\n",
"    # Check for missing values\n",
"    df1_missing = df1[key_col].isnull().sum()\n",
"    df2_missing = df2[key_col].isnull().sum()\n",
"\n",
"    print(f\"Missing values in left table: {df1_missing}\")\n",
"    print(f\"Missing values in right table: {df2_missing}\")\n",
"\n",
"    # Check overlap\n",
"    left_keys = set(df1[key_col].dropna())\n",
"    right_keys = set(df2[key_col].dropna())\n",
"\n",
"    overlap = left_keys & right_keys\n",
"    left_only = left_keys - right_keys\n",
"    right_only = right_keys - left_keys\n",
"\n",
"    print(f\"Keys in both tables: {len(overlap)}\")\n",
"    print(f\"Keys only in left: {len(left_only)}\")\n",
"    print(f\"Keys only in right: {len(right_only)}\")\n",
"\n",
"    # Predict result sizes (exact only when keys are unique in both tables)\n",
"    if df1_dups == 0 and df2_dups == 0:\n",
"        inner_size = len(overlap)\n",
"        left_size = len(df1)\n",
"        right_size = len(df2)\n",
"        outer_size = len(left_keys | right_keys)\n",
"    else:\n",
"        print(\"Warning: Duplicates present, result size may be larger than expected\")\n",
"        inner_size = \"Cannot predict (duplicates present)\"\n",
"        left_size = \"Cannot predict (duplicates present)\"\n",
"        right_size = \"Cannot predict (duplicates present)\"\n",
"        outer_size = \"Cannot predict (duplicates present)\"\n",
"\n",
"    print(\"\\nPredicted result sizes:\")\n",
"    print(f\"Inner join: {inner_size}\")\n",
"    print(f\"Left join: {left_size}\")\n",
"    print(f\"Right join: {right_size}\")\n",
"    print(f\"Outer join: {outer_size}\")\n",
"\n",
"# Analyze our sample data\n",
"analyze_merge_keys(df_customers, df_orders, 'customer_id')\n",
"analyze_merge_keys(df_customers, df_segments, 'customer_id')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Data validation after merge\n",
"def validate_merge_result(df, expected_rows=None, key_col=None):\n",
"    \"\"\"Validate merge results\"\"\"\n",
"    print(\"\\n=== MERGE VALIDATION ===\")\n",
"\n",
"    print(f\"Result shape: {df.shape}\")\n",
"\n",
"    if expected_rows:\n",
"        print(f\"Expected rows: {expected_rows}\")\n",
"        if len(df) != expected_rows:\n",
"            print(\"⚠️ Row count doesn't match expectation!\")\n",
"\n",
"    # Check for unexpected duplicates\n",
"    if key_col and key_col in df.columns:\n",
"        duplicates = df[key_col].duplicated().sum()\n",
"        if duplicates > 0:\n",
"            print(f\"⚠️ Found {duplicates} duplicate keys after merge\")\n",
"\n",
"    # Check for missing values across all columns\n",
"    missing_summary = df.isnull().sum()\n",
"    critical_missing = missing_summary[missing_summary > 0]\n",
"\n",
"    if len(critical_missing) > 0:\n",
"        print(\"Missing values after merge:\")\n",
"        print(critical_missing)\n",
"\n",
"    # Data type consistency\n",
"    print(\"\\nData types:\")\n",
"    print(df.dtypes)\n",
"\n",
"    return df\n",
"\n",
"# Example validation\n",
"sample_merge = pd.merge(df_customers, df_orders, on='customer_id', how='inner')\n",
"validated_result = validate_merge_result(sample_merge, key_col='customer_id')"
]
|
|
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Practice Exercises\n",
"\n",
"Apply merging and joining techniques to real-world scenarios:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Exercise 1: Customer Lifetime Value Analysis\n",
"# Create a comprehensive customer analysis by joining:\n",
"# - Customer demographics\n",
"# - Order history\n",
"# - Product information\n",
"# - Customer segments\n",
"# Calculate CLV metrics for each customer\n",
"\n",
"def calculate_customer_lifetime_value(customers, orders, products, segments):\n",
"    \"\"\"Calculate comprehensive customer lifetime value metrics\"\"\"\n",
"    # Your implementation here\n",
"    pass\n",
"\n",
"# clv_analysis = calculate_customer_lifetime_value(df_customers, df_orders, df_products, df_segments)\n",
"# print(\"Customer Lifetime Value Analysis:\")\n",
"# print(clv_analysis.head())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Exercise 2: Data Quality Assessment\n",
"# Create a function that analyzes data quality issues when merging multiple datasets:\n",
"# - Identify orphaned records\n",
"# - Find data inconsistencies\n",
"# - Suggest data cleaning steps\n",
"# - Provide merge recommendations\n",
"\n",
"# Your code here:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Exercise 3: Time-series Join Challenge\n",
"# Create a complex time-based join scenario:\n",
"# - Join orders with time-varying product prices\n",
"# - Handle seasonal promotions\n",
"# - Calculate accurate historical revenue\n",
"# - Account for price changes over time\n",
"# Hint: pd.merge_asof() performs ordered, as-of joins on time columns\n",
"\n",
"# Your code here:\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Key Takeaways\n",
"\n",
"1. **Join Types**:\n",
"   - **Inner**: Only matching records from both tables\n",
"   - **Left**: All records from left table + matching from right\n",
"   - **Right**: All records from right table + matching from left\n",
"   - **Outer**: All records from both tables\n",
"\n",
"2. **Method Selection**:\n",
"   - **`pd.merge()`**: Most flexible, works with any columns\n",
"   - **`.join()`**: Faster for index-based joins\n",
"   - **`pd.concat()`**: For stacking DataFrames vertically/horizontally\n",
"\n",
"3. **Best Practices**:\n",
"   - Always analyze merge keys before joining\n",
"   - Check for duplicates and missing values\n",
"   - Validate results after merging\n",
"   - Use appropriate join types for your use case\n",
"   - Consider performance implications for large datasets\n",
"\n",
"4. **Common Pitfalls**:\n",
"   - Cartesian products from duplicate keys\n",
"   - Unexpected result sizes\n",
"   - Data type inconsistencies\n",
"   - Missing value propagation\n",
"\n",
"## Join Type Selection Guide\n",
"\n",
"| Use Case | Recommended Join | Rationale |\n",
"|----------|-----------------|-----------|\n",
"| Customer orders analysis | Inner | Only customers with orders |\n",
"| Customer segmentation | Left | Keep all customers, add segment info |\n",
"| Order validation | Right | Keep all orders, check customer validity |\n",
"| Data completeness analysis | Outer | See all records and identify gaps |\n",
"| Performance-critical operations | Index-based join | Faster execution |\n",
"\n",
"## Performance Tips\n",
"\n",
"1. **Index Usage**: Set indexes on frequently joined columns\n",
"2. **Data Types**: Ensure consistent data types before joining\n",
"3. **Memory Management**: Consider chunking for very large datasets\n",
"4. **Join Order**: Merge the smallest datasets first to keep intermediate results small\n",
"5. **Validation**: Always validate merge results"
]
}
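,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The four join types above can be compared directly by merging the same pair of tiny DataFrames with each `how=` value and watching the row counts change (the frames here are throwaway examples, not the lesson data):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Compare join types on overlapping keys: left has {1, 2, 3}, right has {2, 3, 4}\n",
"left = pd.DataFrame({'key': [1, 2, 3], 'left_val': ['a', 'b', 'c']})\n",
"right = pd.DataFrame({'key': [2, 3, 4], 'right_val': ['x', 'y', 'z']})\n",
"\n",
"for how in ['inner', 'left', 'right', 'outer']:\n",
"    merged = pd.merge(left, right, on='key', how=how)\n",
"    print(f\"{how:>5}: {len(merged)} rows, keys {sorted(merged['key'])}\")"
]
}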
],
"metadata": {
"kernelspec": {
"display_name": "venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.13.3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}