{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Session 1 - DataFrames - Lesson 7: Merging and Joining DataFrames\n", "\n", "## Learning Objectives\n", "- Master different types of joins (inner, outer, left, right)\n", "- Understand when to use merge vs join vs concat\n", "- Handle duplicate keys and join conflicts\n", "- Learn advanced merging techniques and best practices\n", "- Practice with real-world data integration scenarios\n", "\n", "## Prerequisites\n", "- Completed Lessons 1-6\n", "- Understanding of relational database concepts (helpful)\n", "- Basic knowledge of SQL joins (helpful but not required)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Import required libraries\n", "import pandas as pd\n", "import numpy as np\n", "from datetime import datetime, timedelta\n", "import matplotlib.pyplot as plt\n", "import warnings\n", "warnings.filterwarnings('ignore')\n", "\n", "# Set display options\n", "pd.set_option('display.max_columns', None)\n", "pd.set_option('display.max_rows', 50)\n", "\n", "print(\"Libraries loaded successfully!\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creating Sample Datasets\n", "\n", "Let's create realistic datasets that represent common business scenarios." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create sample datasets for merging examples\n", "np.random.seed(42)\n", "\n", "# Customer dataset\n", "customers_data = {\n", " 'customer_id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],\n", " 'customer_name': ['Alice Johnson', 'Bob Smith', 'Charlie Brown', 'Diana Prince', 'Eve Wilson',\n", " 'Frank Miller', 'Grace Lee', 'Henry Davis', 'Ivy Chen', 'Jack Robinson'],\n", " 'email': ['alice@email.com', 'bob@email.com', 'charlie@email.com', 'diana@email.com', 'eve@email.com',\n", " 'frank@email.com', 'grace@email.com', 'henry@email.com', 'ivy@email.com', 'jack@email.com'],\n", " 'age': [28, 35, 42, 31, 29, 45, 38, 33, 27, 41],\n", " 'city': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix',\n", " 'Philadelphia', 'San Antonio', 'San Diego', 'Dallas', 'San Jose'],\n", " 'signup_date': pd.date_range('2023-01-01', periods=10, freq='M')\n", "}\n", "\n", "df_customers = pd.DataFrame(customers_data)\n", "\n", "# Orders dataset\n", "orders_data = {\n", " 'order_id': [101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112],\n", " 'customer_id': [1, 2, 1, 3, 4, 2, 5, 1, 6, 11, 3, 2], # Note: customer_id 11 doesn't exist in customers\n", " 'order_date': pd.date_range('2023-06-01', periods=12, freq='W'),\n", " 'product': ['Laptop', 'Phone', 'Tablet', 'Laptop', 'Monitor', 'Phone', \n", " 'Headphones', 'Mouse', 'Keyboard', 'Laptop', 'Tablet', 'Monitor'],\n", " 'quantity': [1, 2, 1, 1, 1, 1, 3, 2, 1, 1, 2, 1],\n", " 'amount': [1200, 800, 400, 1200, 300, 800, 150, 50, 75, 1200, 800, 300]\n", "}\n", "\n", "df_orders = pd.DataFrame(orders_data)\n", "\n", "# Product information dataset\n", "products_data = {\n", " 'product': ['Laptop', 'Phone', 'Tablet', 'Monitor', 'Headphones', 'Mouse', 'Keyboard', 'Webcam'],\n", " 'category': ['Electronics', 'Electronics', 'Electronics', 'Electronics', \n", " 'Audio', 'Accessories', 'Accessories', 'Electronics'],\n", " 'price': [1200, 800, 400, 300, 150, 50, 75, 100],\n", " 'supplier': ['TechCorp', 'MobileCorp', 'TechCorp', 'DisplayCorp', \n", " 'AudioCorp', 'AccessoryCorp', 'AccessoryCorp', 'TechCorp']\n", "}\n", "\n", "df_products = pd.DataFrame(products_data)\n", "\n", "# Customer 
segments dataset\n", "segments_data = {\n", " 'customer_id': [1, 2, 3, 4, 5, 6, 7, 8, 12, 13], # Some customers not in main customer table\n", " 'segment': ['Premium', 'Standard', 'Premium', 'Standard', 'Basic', \n", " 'Premium', 'Standard', 'Basic', 'Premium', 'Standard'],\n", " 'loyalty_points': [1500, 800, 1200, 600, 200, 1800, 750, 300, 2000, 900]\n", "}\n", "\n", "df_segments = pd.DataFrame(segments_data)\n", "\n", "print(\"Sample datasets created:\")\n", "print(f\"Customers: {df_customers.shape}\")\n", "print(f\"Orders: {df_orders.shape}\")\n", "print(f\"Products: {df_products.shape}\")\n", "print(f\"Segments: {df_segments.shape}\")\n", "\n", "print(\"\\nCustomers dataset:\")\n", "print(df_customers.head())\n", "\n", "print(\"\\nOrders dataset:\")\n", "print(df_orders.head())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Basic Merge Operations\n", "\n", "Understanding the fundamental merge operations and join types." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Inner Join - only matching records\n", "print(\"=== INNER JOIN ===\")\n", "inner_join = pd.merge(df_customers, df_orders, on='customer_id', how='inner')\n", "print(f\"Result shape: {inner_join.shape}\")\n", "print(\"Sample results:\")\n", "print(inner_join[['customer_name', 'order_id', 'product', 'amount']].head())\n", "\n", "print(f\"\\nUnique customers in result: {inner_join['customer_id'].nunique()}\")\n", "print(f\"Total orders: {len(inner_join)}\")\n", "\n", "# Check which customers have orders\n", "customers_with_orders = inner_join['customer_id'].unique()\n", "print(f\"Customers with orders: {sorted(customers_with_orders)}\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Left Join - all records from left table\n", "print(\"=== LEFT JOIN ===\")\n", "left_join = pd.merge(df_customers, df_orders, on='customer_id', how='left')\n", "print(f\"Result shape: {left_join.shape}\")\n", "print(\"Sample results:\")\n", "print(left_join[['customer_name', 'order_id', 'product', 'amount']].head(10))\n", "\n", "# Check customers without orders\n", "customers_without_orders = left_join[left_join['order_id'].isnull()]['customer_name'].tolist()\n", "print(f\"\\nCustomers without orders: {customers_without_orders}\")\n", "\n", "# Summary statistics\n", "print(f\"\\nTotal records: {len(left_join)}\")\n", "print(f\"Records with orders: {left_join['order_id'].notna().sum()}\")\n", "print(f\"Records without orders: {left_join['order_id'].isnull().sum()}\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Right Join - all records from right table\n", "print(\"=== RIGHT JOIN ===\")\n", "right_join = pd.merge(df_customers, df_orders, on='customer_id', how='right')\n", "print(f\"Result shape: {right_join.shape}\")\n", "print(\"Sample results:\")\n", "print(right_join[['customer_name', 'order_id', 'product', 'amount']].head())\n", "\n", "# Check orders without customer information\n", "orders_without_customers = right_join[right_join['customer_name'].isnull()]\n", "print(f\"\\nOrders without customer info: {len(orders_without_customers)}\")\n", "if len(orders_without_customers) > 0:\n", " print(orders_without_customers[['customer_id', 'order_id', 'product', 'amount']])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Outer Join - all records from both tables\n", "print(\"=== OUTER JOIN ===\")\n", "outer_join = 
pd.merge(df_customers, df_orders, on='customer_id', how='outer')\n", "print(f\"Result shape: {outer_join.shape}\")\n", "\n", "# Analyze the result\n", "print(\"\\nData quality analysis:\")\n", "print(f\"Records with complete customer info: {outer_join['customer_name'].notna().sum()}\")\n", "print(f\"Records with complete order info: {outer_join['order_id'].notna().sum()}\")\n", "print(f\"Records with both customer and order info: {(outer_join['customer_name'].notna() & outer_join['order_id'].notna()).sum()}\")\n", "\n", "# Show different categories of records\n", "print(\"\\nCustomers without orders:\")\n", "customers_only = outer_join[(outer_join['customer_name'].notna()) & (outer_join['order_id'].isnull())]\n", "print(customers_only[['customer_name', 'city']].drop_duplicates())\n", "\n", "print(\"\\nOrders without customer data:\")\n", "orders_only = outer_join[(outer_join['customer_name'].isnull()) & (outer_join['order_id'].notna())]\n", "print(orders_only[['customer_id', 'order_id', 'product', 'amount']])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Multiple Table Joins\n", "\n", "Combining data from multiple sources in sequence." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Three-way join: Customers + Orders + Products\n", "print(\"=== THREE-WAY JOIN ===\")\n", "\n", "# Step 1: Join customers and orders\n", "customer_orders = pd.merge(df_customers, df_orders, on='customer_id', how='inner')\n", "print(f\"After joining customers and orders: {customer_orders.shape}\")\n", "\n", "# Step 2: Join with products\n", "complete_data = pd.merge(customer_orders, df_products, on='product', how='left')\n", "print(f\"After joining with products: {complete_data.shape}\")\n", "\n", "# Display comprehensive view\n", "print(\"\\nComplete order information:\")\n", "display_cols = ['customer_name', 'order_id', 'product', 'category', 'quantity', 'amount', 'price', 'supplier']\n", "print(complete_data[display_cols].head())\n", "\n", "# Verify data consistency\n", "print(\"\\nData consistency check:\")\n", "# Check if order amount matches product price * quantity\n", "complete_data['calculated_amount'] = complete_data['price'] * complete_data['quantity']\n", "amount_matches = (complete_data['amount'] == complete_data['calculated_amount']).all()\n", "print(f\"Order amounts match calculated amounts: {amount_matches}\")\n", "\n", "if not amount_matches:\n", " mismatched = complete_data[complete_data['amount'] != complete_data['calculated_amount']]\n", " print(f\"\\nMismatched records: {len(mismatched)}\")\n", " print(mismatched[['order_id', 'product', 'amount', 'calculated_amount']])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Add customer segment information\n", "print(\"=== ADDING CUSTOMER SEGMENTS ===\")\n", "\n", "# Join with segments (left join to keep all customers)\n", "customers_with_segments = pd.merge(df_customers, df_segments, on='customer_id', how='left')\n", "print(f\"Customers with segments shape: {customers_with_segments.shape}\")\n", "\n", "# Check which customers don't have segment information\n", "missing_segments = customers_with_segments[customers_with_segments['segment'].isnull()]\n", "print(f\"\\nCustomers without segment info: {len(missing_segments)}\")\n", "if len(missing_segments) > 0:\n", " print(missing_segments[['customer_name', 'city']])\n", "\n", "# Create comprehensive customer profile\n", "full_customer_profile = pd.merge(complete_data, df_segments, 
on='customer_id', how='left')\n", "print(f\"\\nFull customer profile shape: {full_customer_profile.shape}\")\n", "\n", "# Analyze by segment\n", "segment_analysis = full_customer_profile.groupby('segment').agg({\n", " 'amount': ['sum', 'mean', 'count'],\n", " 'customer_id': 'nunique'\n", "}).round(2)\n", "segment_analysis.columns = ['Total_Revenue', 'Avg_Order_Value', 'Total_Orders', 'Unique_Customers']\n", "print(\"\\nRevenue by customer segment:\")\n", "print(segment_analysis)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Advanced Merge Techniques\n", "\n", "Handling complex merging scenarios and edge cases." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Merge with different column names\n", "print(\"=== MERGE WITH DIFFERENT COLUMN NAMES ===\")\n", "\n", "# Create a dataset with different column name\n", "customer_demographics = pd.DataFrame({\n", " 'cust_id': [1, 2, 3, 4, 5],\n", " 'income_range': ['50-75k', '75-100k', '50-75k', '100k+', '25-50k'],\n", " 'education': ['Bachelor', 'Master', 'PhD', 'Master', 'Bachelor'],\n", " 'occupation': ['Engineer', 'Manager', 'Professor', 'Director', 'Analyst']\n", "})\n", "\n", "# Merge using left_on and right_on parameters\n", "customers_with_demographics = pd.merge(\n", " df_customers, \n", " customer_demographics, \n", " left_on='customer_id', \n", " right_on='cust_id', \n", " how='left'\n", ")\n", "\n", "print(\"Merge with different column names:\")\n", "print(customers_with_demographics[['customer_name', 'customer_id', 'cust_id', 'income_range', 'education']].head())\n", "\n", "# Clean up duplicate columns\n", "customers_with_demographics = customers_with_demographics.drop('cust_id', axis=1)\n", "print(f\"\\nAfter cleanup: {customers_with_demographics.shape}\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Merge on multiple columns\n", "print(\"=== MERGE ON MULTIPLE COLUMNS ===\")\n", "\n", "# Create time-based pricing data\n", "pricing_data = pd.DataFrame({\n", " 'product': ['Laptop', 'Laptop', 'Phone', 'Phone', 'Tablet', 'Tablet'],\n", " 'date': pd.to_datetime(['2023-06-01', '2023-08-01', '2023-06-01', '2023-08-01', '2023-06-01', '2023-08-01']),\n", " 'price': [1200, 1100, 800, 750, 400, 380],\n", " 'promotion': [False, True, False, True, False, True]\n", "})\n", "\n", "# Add year-month to orders for matching\n", "df_orders_with_period = df_orders.copy()\n", "df_orders_with_period['order_month'] = df_orders_with_period['order_date'].dt.to_period('M').dt.start_time\n", "\n", "# Create matching periods in pricing data\n", "pricing_data['period'] = pricing_data['date'].dt.to_period('M').dt.start_time\n", "\n", "# Merge on product and time period\n", "orders_with_pricing = pd.merge(\n", " df_orders_with_period,\n", " pricing_data,\n", " left_on=['product', 'order_month'],\n", " right_on=['product', 'period'],\n", " how='left'\n", ")\n", "\n", "print(\"Orders with time-based pricing:\")\n", "print(orders_with_pricing[['order_id', 'product', 'order_date', 'amount', 'price', 'promotion']].head())\n", "\n", "# Check for pricing discrepancies\n", "pricing_discrepancies = orders_with_pricing[\n", " (orders_with_pricing['amount'] != orders_with_pricing['price'] * orders_with_pricing['quantity']) &\n", " orders_with_pricing['price'].notna()\n", "]\n", "print(f\"\\nOrders with pricing discrepancies: {len(pricing_discrepancies)}\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# 
Handling duplicate keys in merge\n", "print(\"=== HANDLING DUPLICATE KEYS ===\")\n", "\n", "# Create data with duplicate keys\n", "customer_contacts = pd.DataFrame({\n", " 'customer_id': [1, 1, 2, 2, 3],\n", " 'contact_type': ['email', 'phone', 'email', 'phone', 'email'],\n", " 'contact_value': ['alice@email.com', '555-0101', 'bob@email.com', '555-0102', 'charlie@email.com'],\n", " 'is_primary': [True, False, True, True, True]\n", "})\n", "\n", "print(\"Customer contacts with duplicates:\")\n", "print(customer_contacts)\n", "\n", "# With duplicate keys, merge pairs every matching row (a cartesian product within each key group)\n", "customers_with_contacts = pd.merge(df_customers, customer_contacts, on='customer_id', how='inner')\n", "print(f\"\\nResult of merge with duplicates: {customers_with_contacts.shape}\")\n", "print(customers_with_contacts[['customer_name', 'contact_type', 'contact_value', 'is_primary']].head())\n", "\n", "# Strategy 1: Filter before merge, then drop any leftover duplicate keys\n", "# (customer 2 has two contacts flagged primary, so filtering alone still leaves a duplicate)\n", "primary_contacts = customer_contacts[customer_contacts['is_primary']].drop_duplicates(subset='customer_id')\n", "customers_primary_contacts = pd.merge(df_customers, primary_contacts, on='customer_id', how='left')\n", "print(f\"\\nAfter filtering to primary contacts: {customers_primary_contacts.shape}\")\n", "\n", "# Strategy 2: Pivot contacts to columns\n", "contacts_pivoted = customer_contacts.pivot_table(\n", " index='customer_id',\n", " columns='contact_type',\n", " values='contact_value',\n", " aggfunc='first'\n", ").reset_index()\n", "print(\"\\nPivoted contacts:\")\n", "print(contacts_pivoted)\n", "\n", "customers_with_pivoted_contacts = pd.merge(df_customers, contacts_pivoted, on='customer_id', how='left')\n", "print(f\"\\nAfter merging pivoted contacts: {customers_with_pivoted_contacts.shape}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Index-based Joins\n", "\n", "Using DataFrame indices for joining operations." 
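] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before the `.join()` examples below, here is a short sketch added for illustration (it reuses `df_customers`, `df_orders`, and `df_segments` from above). It shows two standard `pd.merge` options that complement index-based joins: `left_index=`/`right_index=` to merge on indices directly, and `validate=` to fail fast on the duplicate-key surprises from the previous section." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Sketch: merging on indices and validating key relationships\n", "ci = df_customers.set_index('customer_id')\n", "si = df_segments.set_index('customer_id')\n", "\n", "# merge can use the indices directly, mirroring DataFrame.join\n", "via_merge = pd.merge(ci, si, left_index=True, right_index=True, how='left')\n", "print(f\"Index-based merge shape: {via_merge.shape}\")\n", "\n", "# validate= raises pandas.errors.MergeError when the key relationship is violated\n", "pd.merge(df_customers, df_segments, on='customer_id', how='left', validate='one_to_one') # passes: keys unique on both sides\n", "try:\n", " pd.merge(df_customers, df_orders, on='customer_id', validate='one_to_one')\n", "except pd.errors.MergeError as err:\n", " print(f\"Caught expected error: {err}\")"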
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Set up DataFrames with indices\n", "print(\"=== INDEX-BASED JOINS ===\")\n", "\n", "# Set customer_id as index\n", "customers_indexed = df_customers.set_index('customer_id')\n", "segments_indexed = df_segments.set_index('customer_id')\n", "\n", "print(\"Customers with index:\")\n", "print(customers_indexed.head())\n", "\n", "# Join using indices\n", "joined_by_index = customers_indexed.join(segments_indexed, how='left')\n", "print(f\"\\nJoined by index shape: {joined_by_index.shape}\")\n", "print(joined_by_index[['customer_name', 'city', 'segment', 'loyalty_points']].head())\n", "\n", "# Compare with merge\n", "merged_equivalent = pd.merge(df_customers, df_segments, on='customer_id', how='left')\n", "print(f\"\\nEquivalent merge shape: {merged_equivalent.shape}\")\n", "\n", "# Verify they're the same (after sorting)\n", "joined_sorted = joined_by_index.reset_index().sort_values('customer_id')\n", "merged_sorted = merged_equivalent.sort_values('customer_id')\n", "are_equal = joined_sorted.equals(merged_sorted)\n", "print(f\"Results are identical: {are_equal}\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Multi-index joins\n", "print(\"=== MULTI-INDEX JOINS ===\")\n", "\n", "# Create a dataset with multiple index levels\n", "sales_by_region_product = pd.DataFrame({\n", " 'region': ['North', 'North', 'South', 'South', 'East', 'East'],\n", " 'product': ['Laptop', 'Phone', 'Laptop', 'Phone', 'Laptop', 'Phone'],\n", " 'sales_target': [10, 15, 8, 12, 12, 18],\n", " 'commission_rate': [0.05, 0.04, 0.06, 0.05, 0.05, 0.04]\n", "})\n", "\n", "# Set multi-index\n", "sales_targets = sales_by_region_product.set_index(['region', 'product'])\n", "print(\"Sales targets with multi-index:\")\n", "print(sales_targets)\n", "\n", "# Create customer orders with region mapping\n", "customer_regions = {\n", " 1: 'North', 2: 'South', 3: 'East', 4: 'North', 5: 'South', 6: 'East'\n", "}\n", "\n", "orders_with_region = df_orders.copy()\n", "orders_with_region['region'] = orders_with_region['customer_id'].map(customer_regions)\n", "orders_with_region = orders_with_region.dropna(subset=['region'])\n", "\n", "# Merge on multiple columns to match multi-index\n", "orders_with_targets = pd.merge(\n", " orders_with_region,\n", " sales_targets.reset_index(),\n", " on=['region', 'product'],\n", " how='left'\n", ")\n", "\n", "print(\"\\nOrders with sales targets:\")\n", "print(orders_with_targets[['order_id', 'region', 'product', 'amount', 'sales_target', 'commission_rate']].head())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. Concatenation Operations\n", "\n", "Combining DataFrames vertically and horizontally." 
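] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a quick orientation before the lesson examples, here is a small sketch (added for illustration, using toy frames rather than the lesson data). The standard `pd.concat` options `join=` and `verify_integrity=` control how mismatched indexes are aligned and whether silent index duplication becomes an explicit error." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Sketch: concat alignment and integrity options\n", "a = pd.DataFrame({'x': [1, 2]}, index=[0, 1])\n", "b = pd.DataFrame({'y': [3, 4]}, index=[1, 2])\n", "\n", "print(pd.concat([a, b], axis=1)) # outer (default): union of indexes, NaN where missing\n", "print(pd.concat([a, b], axis=1, join='inner')) # inner: only the shared index label 1 survives\n", "\n", "# verify_integrity=True raises ValueError if the result axis has duplicates\n", "try:\n", " pd.concat([a, a], verify_integrity=True)\n", "except ValueError as err:\n", " print(f\"Duplicate index detected: {err}\")"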
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Vertical concatenation (stacking DataFrames)\n", "print(\"=== VERTICAL CONCATENATION ===\")\n", "\n", "# Create additional customer data (new batch)\n", "new_customers = pd.DataFrame({\n", " 'customer_id': [11, 12, 13, 14, 15],\n", " 'customer_name': ['Kate Wilson', 'Liam Brown', 'Mia Garcia', 'Noah Jones', 'Olivia Miller'],\n", " 'email': ['kate@email.com', 'liam@email.com', 'mia@email.com', 'noah@email.com', 'olivia@email.com'],\n", " 'age': [26, 39, 31, 44, 28],\n", " 'city': ['Austin', 'Seattle', 'Denver', 'Boston', 'Miami'],\n", " 'signup_date': pd.date_range('2024-01-01', periods=5, freq='ME') # 'ME' = month-end ('M' is deprecated)\n", "})\n", "\n", "# Concatenate vertically\n", "all_customers = pd.concat([df_customers, new_customers], ignore_index=True)\n", "print(f\"Original customers: {len(df_customers)}\")\n", "print(f\"New customers: {len(new_customers)}\")\n", "print(f\"Combined customers: {len(all_customers)}\")\n", "\n", "print(\"\\nCombined customer data:\")\n", "print(all_customers.tail())\n", "\n", "# Concatenation with different columns\n", "customers_with_extra_info = pd.DataFrame({\n", " 'customer_id': [16, 17],\n", " 'customer_name': ['Paul Davis', 'Quinn Taylor'],\n", " 'email': ['paul@email.com', 'quinn@email.com'],\n", " 'age': [35, 29],\n", " 'city': ['Portland', 'Nashville'],\n", " 'signup_date': pd.date_range('2024-06-01', periods=2, freq='ME'),\n", " 'referral_source': ['Google', 'Facebook'] # Extra column\n", "})\n", "\n", "# Concat with different columns (creates NaN for missing columns)\n", "all_customers_extended = pd.concat([all_customers, customers_with_extra_info], ignore_index=True, sort=False)\n", "print(f\"\\nAfter adding customers with extra info: {all_customers_extended.shape}\")\n", "print(\"Missing values in referral_source:\")\n", "print(all_customers_extended['referral_source'].isnull().sum())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Horizontal concatenation\n", "print(\"=== HORIZONTAL CONCATENATION ===\")\n", "\n", "# Split customer data into parts\n", "customer_basic_info = df_customers[['customer_id', 'customer_name', 'email']]\n", "customer_demographics = df_customers[['customer_id', 'age', 'city', 'signup_date']]\n", "\n", "print(\"Customer basic info:\")\n", "print(customer_basic_info.head())\n", "\n", "print(\"\\nCustomer demographics:\")\n", "print(customer_demographics.head())\n", "\n", "# Concatenate horizontally (by index)\n", "customers_recombined = pd.concat([customer_basic_info, customer_demographics.drop('customer_id', axis=1)], axis=1)\n", "print(f\"\\nRecombined shape: {customers_recombined.shape}\")\n", "print(customers_recombined.head())\n", "\n", "# Verify it matches original\n", "columns_match = set(customers_recombined.columns) == set(df_customers.columns)\n", "print(f\"\\nColumns match original: {columns_match}\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Concat with keys (creating a hierarchical index, or hierarchical columns with axis=1)\n", "print(\"=== CONCAT WITH KEYS ===\")\n", "\n", "# Create quarterly sales data\n", "q1_sales = pd.DataFrame({\n", " 'product': ['Laptop', 'Phone', 'Tablet'],\n", " 'units_sold': [50, 75, 30],\n", " 'revenue': [60000, 60000, 12000]\n", "})\n", "\n", "q2_sales = pd.DataFrame({\n", " 'product': ['Laptop', 'Phone', 'Tablet'],\n", " 'units_sold': [45, 80, 35],\n", " 'revenue': [54000, 64000, 14000]\n", "})\n", "\n", "# Concatenate with keys\n", 
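"# keys= labels each input frame; the labels become the outer level of a MultiIndex,\n", "# so each row stays traceable to its source (Q1 vs Q2) after stacking\n",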
"quarterly_sales = pd.concat([q1_sales, q2_sales], keys=['Q1', 'Q2'])\n", "print(\"Quarterly sales with hierarchical index:\")\n", "print(quarterly_sales)\n", "\n", "# Access specific quarter\n", "print(\"\\nQ1 sales only:\")\n", "print(quarterly_sales.loc['Q1'])\n", "\n", "# Create summary comparison\n", "quarterly_comparison = pd.concat([q1_sales.set_index('product'), q2_sales.set_index('product')], \n", " keys=['Q1', 'Q2'], axis=1)\n", "print(\"\\nQuarterly comparison (side by side):\")\n", "print(quarterly_comparison)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 6. Performance and Best Practices\n", "\n", "Optimizing merge operations and avoiding common pitfalls." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Performance comparison: merge vs join\n", "import time\n", "\n", "print(\"=== PERFORMANCE COMPARISON ===\")\n", "\n", "# Create larger datasets for performance testing\n", "np.random.seed(42)\n", "large_customers = pd.DataFrame({\n", " 'customer_id': range(1, 10001),\n", " 'customer_name': [f'Customer_{i}' for i in range(1, 10001)],\n", " 'city': np.random.choice(['New York', 'Los Angeles', 'Chicago'], 10000)\n", "})\n", "\n", "large_orders = pd.DataFrame({\n", " 'order_id': range(1, 50001),\n", " 'customer_id': np.random.randint(1, 10001, 50000),\n", " 'amount': np.random.normal(100, 30, 50000)\n", "})\n", "\n", "print(f\"Large customers: {large_customers.shape}\")\n", "print(f\"Large orders: {large_orders.shape}\")\n", "\n", "# Test merge performance\n", "start_time = time.time()\n", "merged_result = pd.merge(large_customers, large_orders, on='customer_id', how='inner')\n", "merge_time = time.time() - start_time\n", "\n", "# Test join performance\n", "customers_indexed = large_customers.set_index('customer_id')\n", "orders_indexed = large_orders.set_index('customer_id')\n", "\n", "start_time = time.time()\n", "joined_result = customers_indexed.join(orders_indexed, how='inner')\n", "join_time = time.time() - start_time\n", "\n", "print(f\"\\nMerge time: {merge_time:.4f} seconds\")\n", "print(f\"Join time: {join_time:.4f} seconds\")\n", "print(f\"Join is {merge_time/join_time:.2f}x faster\")\n", "\n", "print(f\"\\nResults shape - Merge: {merged_result.shape}, Join: {joined_result.shape}\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Best practices and common pitfalls\n", "print(\"=== BEST PRACTICES ===\")\n", "\n", "def analyze_merge_keys(df1, df2, key_col):\n", " \"\"\"Analyze merge keys before joining\"\"\"\n", " print(f\"\\n--- Analyzing merge on '{key_col}' ---\")\n", " \n", " # Check for duplicates\n", " df1_dups = df1[key_col].duplicated().sum()\n", " df2_dups = df2[key_col].duplicated().sum()\n", " \n", " print(f\"Duplicates in left table: {df1_dups}\")\n", " print(f\"Duplicates in right table: {df2_dups}\")\n", " \n", " # Check for missing values\n", " df1_missing = df1[key_col].isnull().sum()\n", " df2_missing = df2[key_col].isnull().sum()\n", " \n", " print(f\"Missing values in left table: {df1_missing}\")\n", " print(f\"Missing values in right table: {df2_missing}\")\n", " \n", " # Check overlap\n", " left_keys = set(df1[key_col].dropna())\n", " right_keys = set(df2[key_col].dropna())\n", " \n", " overlap = left_keys & right_keys\n", " left_only = left_keys - right_keys\n", " right_only = right_keys - left_keys\n", " \n", " print(f\"Keys in both tables: {len(overlap)}\")\n", " print(f\"Keys only in left: {len(left_only)}\")\n", " print(f\"Keys 
only in right: {len(right_only)}\")\n", " \n", " # Predict result sizes\n", " if df1_dups == 0 and df2_dups == 0:\n", " inner_size = len(overlap)\n", " left_size = len(df1)\n", " right_size = len(df2)\n", " outer_size = len(left_keys | right_keys)\n", " else:\n", " print(\"Warning: Duplicates present, result size may be larger than expected\")\n", " inner_size = \"Cannot predict (duplicates present)\"\n", " left_size = \"Cannot predict (duplicates present)\"\n", " right_size = \"Cannot predict (duplicates present)\"\n", " outer_size = \"Cannot predict (duplicates present)\"\n", " \n", " print(f\"\\nPredicted result sizes:\")\n", " print(f\"Inner join: {inner_size}\")\n", " print(f\"Left join: {left_size}\")\n", " print(f\"Right join: {right_size}\")\n", " print(f\"Outer join: {outer_size}\")\n", "\n", "# Analyze our sample data\n", "analyze_merge_keys(df_customers, df_orders, 'customer_id')\n", "analyze_merge_keys(df_customers, df_segments, 'customer_id')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Data validation after merge\n", "def validate_merge_result(df, expected_rows=None, key_col=None):\n", " \"\"\"Validate merge results\"\"\"\n", " print(\"\\n=== MERGE VALIDATION ===\")\n", " \n", " print(f\"Result shape: {df.shape}\")\n", " \n", " if expected_rows:\n", " print(f\"Expected rows: {expected_rows}\")\n", " if len(df) != expected_rows:\n", " print(\"⚠️ Row count doesn't match expectation!\")\n", " \n", " # Check for unexpected duplicates\n", " if key_col and key_col in df.columns:\n", " duplicates = df[key_col].duplicated().sum()\n", " if duplicates > 0:\n", " print(f\"⚠️ Found {duplicates} duplicate keys after merge\")\n", " \n", " # Check for missing values in key columns\n", " missing_summary = df.isnull().sum()\n", " critical_missing = missing_summary[missing_summary > 0]\n", " \n", " if len(critical_missing) > 0:\n", " print(\"Missing values after merge:\")\n", " print(critical_missing)\n", " \n", " # Data type consistency\n", " print(f\"\\nData types:\")\n", " print(df.dtypes)\n", " \n", " return df\n", "\n", "# Example validation\n", "sample_merge = pd.merge(df_customers, df_orders, on='customer_id', how='inner')\n", "validated_result = validate_merge_result(sample_merge, key_col='customer_id')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Practice Exercises\n", "\n", "Apply merging and joining techniques to real-world scenarios:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "# Exercise 1: Customer Lifetime Value Analysis\n", "# Create a comprehensive customer analysis by joining:\n", "# - Customer demographics\n", "# - Order history\n", "# - Product information\n", "# - Customer segments\n", "# Calculate CLV metrics for each customer\n", "\n", "def calculate_customer_lifetime_value(customers, orders, products, segments):\n", " \"\"\"Calculate comprehensive customer lifetime value metrics\"\"\"\n", " # Your implementation here\n", " pass\n", "\n", "# clv_analysis = calculate_customer_lifetime_value(df_customers, df_orders, df_products, df_segments)\n", "# print(\"Customer Lifetime Value Analysis:\")\n", "# print(clv_analysis.head())" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "# Exercise 2: Data Quality Assessment\n", "# Create a function that analyzes data quality issues when merging multiple datasets:\n", "# - Identify orphaned records\n", "# - Find data inconsistencies\n", "# - Suggest data cleaning 
steps\n", "# - Provide merge recommendations\n", "\n", "# Your code here:\n" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "# Exercise 3: Time-series Join Challenge\n", "# Create a complex time-based join scenario:\n", "# - Join orders with time-varying product prices\n", "# - Handle seasonal promotions\n", "# - Calculate accurate historical revenue\n", "# - Account for price changes over time\n", "\n", "# Your code here:\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Key Takeaways\n", "\n", "1. **Join Types**:\n", " - **Inner**: Only matching records from both tables\n", " - **Left**: All records from left table + matching from right\n", " - **Right**: All records from right table + matching from left\n", " - **Outer**: All records from both tables\n", "\n", "2. **Method Selection**:\n", " - **`pd.merge()`**: Most flexible, works with any columns\n", " - **`.join()`**: Faster for index-based joins\n", " - **`pd.concat()`**: For stacking DataFrames vertically/horizontally\n", "\n", "3. **Best Practices**:\n", " - Always analyze merge keys before joining\n", " - Check for duplicates and missing values\n", " - Validate results after merging\n", " - Use appropriate join types for your use case\n", " - Consider performance implications for large datasets\n", "\n", "4. **Common Pitfalls**:\n", " - Cartesian products from duplicate keys\n", " - Unexpected result sizes\n", " - Data type inconsistencies\n", " - Missing value propagation\n", "\n", "## Join Type Selection Guide\n", "\n", "| Use Case | Recommended Join | Rationale |\n", "|----------|-----------------|----------|\n", "| Customer orders analysis | Inner | Only customers with orders |\n", "| Customer segmentation | Left | Keep all customers, add segment info |\n", "| Order validation | Right | Keep all orders, check customer validity |\n", "| Data completeness analysis | Outer | See all records and identify gaps |\n", "| Performance-critical operations | Index-based join | Faster execution |\n", "\n", "## Performance Tips\n", "\n", "1. **Index Usage**: Set indexes for frequently joined columns\n", "2. **Data Types**: Ensure consistent data types before joining\n", "3. **Memory Management**: Consider chunking for very large datasets\n", "4. **Join Order**: Start with smallest datasets\n", "5. **Validation**: Always validate merge results" ] } ], "metadata": { "kernelspec": { "display_name": "venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.3" } }, "nbformat": 4, "nbformat_minor": 4 }