{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Session 1 - DataFrames - Lesson 7: Merging and Joining DataFrames\n",
"\n",
"## Learning Objectives\n",
"- Master different types of joins (inner, outer, left, right)\n",
"- Understand when to use merge vs join vs concat\n",
"- Handle duplicate keys and join conflicts\n",
"- Learn advanced merging techniques and best practices\n",
"- Practice with real-world data integration scenarios\n",
"\n",
"## Prerequisites\n",
"- Completed Lessons 1-6\n",
"- Understanding of relational database concepts (helpful)\n",
"- Basic knowledge of SQL joins (helpful but not required)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Import required libraries\n",
"import pandas as pd\n",
"import numpy as np\n",
"from datetime import datetime, timedelta\n",
"import matplotlib.pyplot as plt\n",
"import warnings\n",
"warnings.filterwarnings('ignore')\n",
"\n",
"# Set display options\n",
"pd.set_option('display.max_columns', None)\n",
"pd.set_option('display.max_rows', 50)\n",
"\n",
"print(\"Libraries loaded successfully!\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Creating Sample Datasets\n",
"\n",
"Let's create realistic datasets that represent common business scenarios."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Create sample datasets for merging examples\n",
"np.random.seed(42)\n",
"\n",
"# Customer dataset\n",
"customers_data = {\n",
"    'customer_id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],\n",
"    'customer_name': ['Alice Johnson', 'Bob Smith', 'Charlie Brown', 'Diana Prince', 'Eve Wilson',\n",
"                      'Frank Miller', 'Grace Lee', 'Henry Davis', 'Ivy Chen', 'Jack Robinson'],\n",
"    'email': ['alice@email.com', 'bob@email.com', 'charlie@email.com', 'diana@email.com', 'eve@email.com',\n",
"              'frank@email.com', 'grace@email.com', 'henry@email.com', 'ivy@email.com', 'jack@email.com'],\n",
"    'age': [28, 35, 42, 31, 29, 45, 38, 33, 27, 41],\n",
"    'city': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix',\n",
"             'Philadelphia', 'San Antonio', 'San Diego', 'Dallas', 'San Jose'],\n",
"    'signup_date': pd.date_range('2023-01-01', periods=10, freq='M')\n",
"}\n",
"\n",
"df_customers = pd.DataFrame(customers_data)\n",
"\n",
"# Orders dataset\n",
"orders_data = {\n",
"    'order_id': [101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112],\n",
"    'customer_id': [1, 2, 1, 3, 4, 2, 5, 1, 6, 11, 3, 2], # Note: customer_id 11 doesn't exist in customers\n",
"    'order_date': pd.date_range('2023-06-01', periods=12, freq='W'),\n",
"    'product': ['Laptop', 'Phone', 'Tablet', 'Laptop', 'Monitor', 'Phone',\n",
"                'Headphones', 'Mouse', 'Keyboard', 'Laptop', 'Tablet', 'Monitor'],\n",
"    'quantity': [1, 2, 1, 1, 1, 1, 3, 2, 1, 1, 2, 1],\n",
"    'amount': [1200, 800, 400, 1200, 300, 800, 150, 50, 75, 1200, 800, 300]\n",
"}\n",
"\n",
"df_orders = pd.DataFrame(orders_data)\n",
"\n",
"# Product information dataset\n",
"products_data = {\n",
"    'product': ['Laptop', 'Phone', 'Tablet', 'Monitor', 'Headphones', 'Mouse', 'Keyboard', 'Webcam'],\n",
"    'category': ['Electronics', 'Electronics', 'Electronics', 'Electronics',\n",
"                 'Audio', 'Accessories', 'Accessories', 'Electronics'],\n",
"    'price': [1200, 800, 400, 300, 150, 50, 75, 100],\n",
"    'supplier': ['TechCorp', 'MobileCorp', 'TechCorp', 'DisplayCorp',\n",
"                 'AudioCorp', 'AccessoryCorp', 'AccessoryCorp', 'TechCorp']\n",
"}\n",
"\n",
"df_products = pd.DataFrame(products_data)\n",
"\n",
"# Customer segments dataset\n",
"segments_data = {\n",
"    'customer_id': [1, 2, 3, 4, 5, 6, 7, 8, 12, 13], # Some customers not in main customer table\n",
"    'segment': ['Premium', 'Standard', 'Premium', 'Standard', 'Basic',\n",
"                'Premium', 'Standard', 'Basic', 'Premium', 'Standard'],\n",
"    'loyalty_points': [1500, 800, 1200, 600, 200, 1800, 750, 300, 2000, 900]\n",
"}\n",
"\n",
"df_segments = pd.DataFrame(segments_data)\n",
"\n",
"print(\"Sample datasets created:\")\n",
"print(f\"Customers: {df_customers.shape}\")\n",
"print(f\"Orders: {df_orders.shape}\")\n",
"print(f\"Products: {df_products.shape}\")\n",
"print(f\"Segments: {df_segments.shape}\")\n",
"\n",
"print(\"\\nCustomers dataset:\")\n",
"print(df_customers.head())\n",
"\n",
"print(\"\\nOrders dataset:\")\n",
"print(df_orders.head())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Basic Merge Operations\n",
"\n",
"Understanding the fundamental merge operations and join types."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Inner Join - only matching records\n",
"print(\"=== INNER JOIN ===\")\n",
"inner_join = pd.merge(df_customers, df_orders, on='customer_id', how='inner')\n",
"print(f\"Result shape: {inner_join.shape}\")\n",
"print(\"Sample results:\")\n",
"print(inner_join[['customer_name', 'order_id', 'product', 'amount']].head())\n",
"\n",
"print(f\"\\nUnique customers in result: {inner_join['customer_id'].nunique()}\")\n",
"print(f\"Total orders: {len(inner_join)}\")\n",
"\n",
"# Check which customers have orders\n",
"customers_with_orders = inner_join['customer_id'].unique()\n",
"print(f\"Customers with orders: {sorted(customers_with_orders)}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Left Join - all records from left table\n",
"print(\"=== LEFT JOIN ===\")\n",
"left_join = pd.merge(df_customers, df_orders, on='customer_id', how='left')\n",
"print(f\"Result shape: {left_join.shape}\")\n",
"print(\"Sample results:\")\n",
"print(left_join[['customer_name', 'order_id', 'product', 'amount']].head(10))\n",
"\n",
"# Check customers without orders\n",
"customers_without_orders = left_join[left_join['order_id'].isnull()]['customer_name'].tolist()\n",
"print(f\"\\nCustomers without orders: {customers_without_orders}\")\n",
"\n",
"# Summary statistics\n",
"print(f\"\\nTotal records: {len(left_join)}\")\n",
"print(f\"Records with orders: {left_join['order_id'].notna().sum()}\")\n",
"print(f\"Records without orders: {left_join['order_id'].isnull().sum()}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Right Join - all records from right table\n",
"print(\"=== RIGHT JOIN ===\")\n",
"right_join = pd.merge(df_customers, df_orders, on='customer_id', how='right')\n",
"print(f\"Result shape: {right_join.shape}\")\n",
"print(\"Sample results:\")\n",
"print(right_join[['customer_name', 'order_id', 'product', 'amount']].head())\n",
"\n",
"# Check orders without customer information\n",
"orders_without_customers = right_join[right_join['customer_name'].isnull()]\n",
"print(f\"\\nOrders without customer info: {len(orders_without_customers)}\")\n",
"if len(orders_without_customers) > 0:\n",
"    print(orders_without_customers[['customer_id', 'order_id', 'product', 'amount']])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Outer Join - all records from both tables\n",
"print(\"=== OUTER JOIN ===\")\n",
"outer_join = pd.merge(df_customers, df_orders, on='customer_id', how='outer')\n",
"print(f\"Result shape: {outer_join.shape}\")\n",
"\n",
"# Analyze the result\n",
"print(\"\\nData quality analysis:\")\n",
"print(f\"Records with complete customer info: {outer_join['customer_name'].notna().sum()}\")\n",
"print(f\"Records with complete order info: {outer_join['order_id'].notna().sum()}\")\n",
"print(f\"Records with both customer and order info: {(outer_join['customer_name'].notna() & outer_join['order_id'].notna()).sum()}\")\n",
"\n",
"# Show different categories of records\n",
"print(\"\\nCustomers without orders:\")\n",
"customers_only = outer_join[(outer_join['customer_name'].notna()) & (outer_join['order_id'].isnull())]\n",
"print(customers_only[['customer_name', 'city']].drop_duplicates())\n",
"\n",
"print(\"\\nOrders without customer data:\")\n",
"orders_only = outer_join[(outer_join['customer_name'].isnull()) & (outer_join['order_id'].notna())]\n",
"print(orders_only[['customer_id', 'order_id', 'product', 'amount']])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Multiple Table Joins\n",
"\n",
"Combining data from multiple sources in sequence."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Three-way join: Customers + Orders + Products\n",
"print(\"=== THREE-WAY JOIN ===\")\n",
"\n",
"# Step 1: Join customers and orders\n",
"customer_orders = pd.merge(df_customers, df_orders, on='customer_id', how='inner')\n",
"print(f\"After joining customers and orders: {customer_orders.shape}\")\n",
"\n",
"# Step 2: Join with products\n",
"complete_data = pd.merge(customer_orders, df_products, on='product', how='left')\n",
"print(f\"After joining with products: {complete_data.shape}\")\n",
"\n",
"# Display comprehensive view\n",
"print(\"\\nComplete order information:\")\n",
"display_cols = ['customer_name', 'order_id', 'product', 'category', 'quantity', 'amount', 'price', 'supplier']\n",
"print(complete_data[display_cols].head())\n",
"\n",
"# Verify data consistency\n",
"print(\"\\nData consistency check:\")\n",
"# Check if order amount matches product price * quantity\n",
"complete_data['calculated_amount'] = complete_data['price'] * complete_data['quantity']\n",
"amount_matches = (complete_data['amount'] == complete_data['calculated_amount']).all()\n",
"print(f\"Order amounts match calculated amounts: {amount_matches}\")\n",
"\n",
"if not amount_matches:\n",
"    mismatched = complete_data[complete_data['amount'] != complete_data['calculated_amount']]\n",
"    print(f\"\\nMismatched records: {len(mismatched)}\")\n",
"    print(mismatched[['order_id', 'product', 'amount', 'calculated_amount']])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Add customer segment information\n",
"print(\"=== ADDING CUSTOMER SEGMENTS ===\")\n",
"\n",
"# Join with segments (left join to keep all customers)\n",
"customers_with_segments = pd.merge(df_customers, df_segments, on='customer_id', how='left')\n",
"print(f\"Customers with segments shape: {customers_with_segments.shape}\")\n",
"\n",
"# Check which customers don't have segment information\n",
"missing_segments = customers_with_segments[customers_with_segments['segment'].isnull()]\n",
"print(f\"\\nCustomers without segment info: {len(missing_segments)}\")\n",
"if len(missing_segments) > 0:\n",
"    print(missing_segments[['customer_name', 'city']])\n",
"\n",
"# Create comprehensive customer profile\n",
"full_customer_profile = pd.merge(complete_data, df_segments, on='customer_id', how='left')\n",
"print(f\"\\nFull customer profile shape: {full_customer_profile.shape}\")\n",
"\n",
"# Analyze by segment\n",
"segment_analysis = full_customer_profile.groupby('segment').agg({\n",
"    'amount': ['sum', 'mean', 'count'],\n",
"    'customer_id': 'nunique'\n",
"}).round(2)\n",
"segment_analysis.columns = ['Total_Revenue', 'Avg_Order_Value', 'Total_Orders', 'Unique_Customers']\n",
"print(\"\\nRevenue by customer segment:\")\n",
"print(segment_analysis)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Advanced Merge Techniques\n",
"\n",
"Handling complex merging scenarios and edge cases."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Merge with different column names\n",
"print(\"=== MERGE WITH DIFFERENT COLUMN NAMES ===\")\n",
"\n",
"# Create a dataset with a different column name\n",
"customer_demographics = pd.DataFrame({\n",
"    'cust_id': [1, 2, 3, 4, 5],\n",
"    'income_range': ['50-75k', '75-100k', '50-75k', '100k+', '25-50k'],\n",
"    'education': ['Bachelor', 'Master', 'PhD', 'Master', 'Bachelor'],\n",
"    'occupation': ['Engineer', 'Manager', 'Professor', 'Director', 'Analyst']\n",
"})\n",
"\n",
"# Merge using left_on and right_on parameters\n",
"customers_with_demographics = pd.merge(\n",
"    df_customers,\n",
"    customer_demographics,\n",
"    left_on='customer_id',\n",
"    right_on='cust_id',\n",
"    how='left'\n",
")\n",
"\n",
"print(\"Merge with different column names:\")\n",
"print(customers_with_demographics[['customer_name', 'customer_id', 'cust_id', 'income_range', 'education']].head())\n",
"\n",
"# Clean up duplicate columns\n",
"customers_with_demographics = customers_with_demographics.drop('cust_id', axis=1)\n",
"print(f\"\\nAfter cleanup: {customers_with_demographics.shape}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Merge on multiple columns\n",
"print(\"=== MERGE ON MULTIPLE COLUMNS ===\")\n",
"\n",
"# Create time-based pricing data\n",
"pricing_data = pd.DataFrame({\n",
"    'product': ['Laptop', 'Laptop', 'Phone', 'Phone', 'Tablet', 'Tablet'],\n",
"    'date': pd.to_datetime(['2023-06-01', '2023-08-01', '2023-06-01', '2023-08-01', '2023-06-01', '2023-08-01']),\n",
"    'price': [1200, 1100, 800, 750, 400, 380],\n",
"    'promotion': [False, True, False, True, False, True]\n",
"})\n",
"\n",
"# Add year-month to orders for matching\n",
"df_orders_with_period = df_orders.copy()\n",
"df_orders_with_period['order_month'] = df_orders_with_period['order_date'].dt.to_period('M').dt.start_time\n",
"\n",
"# Create matching periods in pricing data\n",
"pricing_data['period'] = pricing_data['date'].dt.to_period('M').dt.start_time\n",
"\n",
"# Merge on product and time period\n",
"orders_with_pricing = pd.merge(\n",
"    df_orders_with_period,\n",
"    pricing_data,\n",
"    left_on=['product', 'order_month'],\n",
"    right_on=['product', 'period'],\n",
"    how='left'\n",
")\n",
"\n",
"print(\"Orders with time-based pricing:\")\n",
"print(orders_with_pricing[['order_id', 'product', 'order_date', 'amount', 'price', 'promotion']].head())\n",
"\n",
"# Check for pricing discrepancies\n",
"pricing_discrepancies = orders_with_pricing[\n",
"    (orders_with_pricing['amount'] != orders_with_pricing['price'] * orders_with_pricing['quantity']) &\n",
"    orders_with_pricing['price'].notna()\n",
"]\n",
"print(f\"\\nOrders with pricing discrepancies: {len(pricing_discrepancies)}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Handling duplicate keys in merge\n",
"print(\"=== HANDLING DUPLICATE KEYS ===\")\n",
"\n",
"# Create data with duplicate keys\n",
"customer_contacts = pd.DataFrame({\n",
"    'customer_id': [1, 1, 2, 2, 3],\n",
"    'contact_type': ['email', 'phone', 'email', 'phone', 'email'],\n",
"    'contact_value': ['alice@email.com', '555-0101', 'bob@email.com', '555-0102', 'charlie@email.com'],\n",
"    'is_primary': [True, False, True, True, True]\n",
"})\n",
"\n",
"print(\"Customer contacts with duplicates:\")\n",
"print(customer_contacts)\n",
"\n",
"# Merge will create a Cartesian product for duplicate keys\n",
"customers_with_contacts = pd.merge(df_customers, customer_contacts, on='customer_id', how='inner')\n",
"print(f\"\\nResult of merge with duplicates: {customers_with_contacts.shape}\")\n",
"print(customers_with_contacts[['customer_name', 'contact_type', 'contact_value', 'is_primary']].head())\n",
"\n",
"# Strategy 1: Filter before merge\n",
"primary_contacts = customer_contacts[customer_contacts['is_primary'] == True]\n",
"customers_primary_contacts = pd.merge(df_customers, primary_contacts, on='customer_id', how='left')\n",
"print(f\"\\nAfter filtering to primary contacts: {customers_primary_contacts.shape}\")\n",
"\n",
"# Strategy 2: Pivot contacts to columns\n",
"contacts_pivoted = customer_contacts.pivot_table(\n",
"    index='customer_id',\n",
"    columns='contact_type',\n",
"    values='contact_value',\n",
"    aggfunc='first'\n",
").reset_index()\n",
"print(\"\\nPivoted contacts:\")\n",
"print(contacts_pivoted)\n",
"\n",
"customers_with_pivoted_contacts = pd.merge(df_customers, contacts_pivoted, on='customer_id', how='left')\n",
"print(f\"\\nAfter merging pivoted contacts: {customers_with_pivoted_contacts.shape}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Index-based Joins\n",
"\n",
"Using DataFrame indices for joining operations."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Set up DataFrames with indices\n",
"print(\"=== INDEX-BASED JOINS ===\")\n",
"\n",
"# Set customer_id as index\n",
"customers_indexed = df_customers.set_index('customer_id')\n",
"segments_indexed = df_segments.set_index('customer_id')\n",
"\n",
"print(\"Customers with index:\")\n",
"print(customers_indexed.head())\n",
"\n",
"# Join using indices\n",
"joined_by_index = customers_indexed.join(segments_indexed, how='left')\n",
"print(f\"\\nJoined by index shape: {joined_by_index.shape}\")\n",
"print(joined_by_index[['customer_name', 'city', 'segment', 'loyalty_points']].head())\n",
"\n",
"# Compare with merge\n",
"merged_equivalent = pd.merge(df_customers, df_segments, on='customer_id', how='left')\n",
"print(f\"\\nEquivalent merge shape: {merged_equivalent.shape}\")\n",
"\n",
"# Verify they're the same (after sorting)\n",
"joined_sorted = joined_by_index.reset_index().sort_values('customer_id')\n",
"merged_sorted = merged_equivalent.sort_values('customer_id')\n",
"are_equal = joined_sorted.equals(merged_sorted)\n",
"print(f\"Results are identical: {are_equal}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Multi-index joins\n",
"print(\"=== MULTI-INDEX JOINS ===\")\n",
"\n",
"# Create a dataset with multiple index levels\n",
"sales_by_region_product = pd.DataFrame({\n",
"    'region': ['North', 'North', 'South', 'South', 'East', 'East'],\n",
"    'product': ['Laptop', 'Phone', 'Laptop', 'Phone', 'Laptop', 'Phone'],\n",
"    'sales_target': [10, 15, 8, 12, 12, 18],\n",
"    'commission_rate': [0.05, 0.04, 0.06, 0.05, 0.05, 0.04]\n",
"})\n",
"\n",
"# Set multi-index\n",
"sales_targets = sales_by_region_product.set_index(['region', 'product'])\n",
"print(\"Sales targets with multi-index:\")\n",
"print(sales_targets)\n",
"\n",
"# Create customer orders with region mapping\n",
"customer_regions = {\n",
"    1: 'North', 2: 'South', 3: 'East', 4: 'North', 5: 'South', 6: 'East'\n",
"}\n",
"\n",
"orders_with_region = df_orders.copy()\n",
"orders_with_region['region'] = orders_with_region['customer_id'].map(customer_regions)\n",
"orders_with_region = orders_with_region.dropna(subset=['region'])\n",
"\n",
"# Merge on multiple columns to match multi-index\n",
"orders_with_targets = pd.merge(\n",
"    orders_with_region,\n",
"    sales_targets.reset_index(),\n",
"    on=['region', 'product'],\n",
"    how='left'\n",
")\n",
"\n",
"print(\"\\nOrders with sales targets:\")\n",
"print(orders_with_targets[['order_id', 'region', 'product', 'amount', 'sales_target', 'commission_rate']].head())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. Concatenation Operations\n",
"\n",
"Combining DataFrames vertically and horizontally."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Vertical concatenation (stacking DataFrames)\n",
"print(\"=== VERTICAL CONCATENATION ===\")\n",
"\n",
"# Create additional customer data (new batch)\n",
"new_customers = pd.DataFrame({\n",
"    'customer_id': [11, 12, 13, 14, 15],\n",
"    'customer_name': ['Kate Wilson', 'Liam Brown', 'Mia Garcia', 'Noah Jones', 'Olivia Miller'],\n",
"    'email': ['kate@email.com', 'liam@email.com', 'mia@email.com', 'noah@email.com', 'olivia@email.com'],\n",
"    'age': [26, 39, 31, 44, 28],\n",
"    'city': ['Austin', 'Seattle', 'Denver', 'Boston', 'Miami'],\n",
"    'signup_date': pd.date_range('2024-01-01', periods=5, freq='M')\n",
"})\n",
"\n",
"# Concatenate vertically\n",
"all_customers = pd.concat([df_customers, new_customers], ignore_index=True)\n",
"print(f\"Original customers: {len(df_customers)}\")\n",
"print(f\"New customers: {len(new_customers)}\")\n",
"print(f\"Combined customers: {len(all_customers)}\")\n",
"\n",
"print(\"\\nCombined customer data:\")\n",
"print(all_customers.tail())\n",
"\n",
"# Concatenation with different columns\n",
"customers_with_extra_info = pd.DataFrame({\n",
"    'customer_id': [16, 17],\n",
"    'customer_name': ['Paul Davis', 'Quinn Taylor'],\n",
"    'email': ['paul@email.com', 'quinn@email.com'],\n",
"    'age': [35, 29],\n",
"    'city': ['Portland', 'Nashville'],\n",
"    'signup_date': pd.date_range('2024-06-01', periods=2, freq='M'),\n",
"    'referral_source': ['Google', 'Facebook'] # Extra column\n",
"})\n",
"\n",
"# Concat with different columns (creates NaN for missing columns)\n",
"all_customers_extended = pd.concat([all_customers, customers_with_extra_info], ignore_index=True, sort=False)\n",
"print(f\"\\nAfter adding customers with extra info: {all_customers_extended.shape}\")\n",
"print(\"Missing values in referral_source:\")\n",
"print(all_customers_extended['referral_source'].isnull().sum())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Horizontal concatenation\n",
"print(\"=== HORIZONTAL CONCATENATION ===\")\n",
"\n",
"# Split customer data into parts\n",
"customer_basic_info = df_customers[['customer_id', 'customer_name', 'email']]\n",
"customer_demographics = df_customers[['customer_id', 'age', 'city', 'signup_date']]\n",
"\n",
"print(\"Customer basic info:\")\n",
"print(customer_basic_info.head())\n",
"\n",
"print(\"\\nCustomer demographics:\")\n",
"print(customer_demographics.head())\n",
"\n",
"# Concatenate horizontally (by index)\n",
"customers_recombined = pd.concat([customer_basic_info, customer_demographics.drop('customer_id', axis=1)], axis=1)\n",
"print(f\"\\nRecombined shape: {customers_recombined.shape}\")\n",
"print(customers_recombined.head())\n",
"\n",
"# Verify it matches original\n",
"columns_match = set(customers_recombined.columns) == set(df_customers.columns)\n",
"print(f\"\\nColumns match original: {columns_match}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Concat with keys (creating a hierarchical index)\n",
"print(\"=== CONCAT WITH KEYS ===\")\n",
"\n",
"# Create quarterly sales data\n",
"q1_sales = pd.DataFrame({\n",
"    'product': ['Laptop', 'Phone', 'Tablet'],\n",
"    'units_sold': [50, 75, 30],\n",
"    'revenue': [60000, 60000, 12000]\n",
"})\n",
"\n",
"q2_sales = pd.DataFrame({\n",
"    'product': ['Laptop', 'Phone', 'Tablet'],\n",
"    'units_sold': [45, 80, 35],\n",
"    'revenue': [54000, 64000, 14000]\n",
"})\n",
"\n",
"# Concatenate with keys\n",
"quarterly_sales = pd.concat([q1_sales, q2_sales], keys=['Q1', 'Q2'])\n",
"print(\"Quarterly sales with hierarchical index:\")\n",
"print(quarterly_sales)\n",
"\n",
"# Access specific quarter\n",
"print(\"\\nQ1 sales only:\")\n",
"print(quarterly_sales.loc['Q1'])\n",
"\n",
"# Create summary comparison (keys become hierarchical columns along axis=1)\n",
"quarterly_comparison = pd.concat([q1_sales.set_index('product'), q2_sales.set_index('product')],\n",
"                                 keys=['Q1', 'Q2'], axis=1)\n",
"print(\"\\nQuarterly comparison (side by side):\")\n",
"print(quarterly_comparison)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6. Performance and Best Practices\n",
"\n",
"Optimizing merge operations and avoiding common pitfalls."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Performance comparison: merge vs join\n",
"import time\n",
"\n",
"print(\"=== PERFORMANCE COMPARISON ===\")\n",
"\n",
"# Create larger datasets for performance testing\n",
"np.random.seed(42)\n",
"large_customers = pd.DataFrame({\n",
"    'customer_id': range(1, 10001),\n",
"    'customer_name': [f'Customer_{i}' for i in range(1, 10001)],\n",
"    'city': np.random.choice(['New York', 'Los Angeles', 'Chicago'], 10000)\n",
"})\n",
"\n",
"large_orders = pd.DataFrame({\n",
"    'order_id': range(1, 50001),\n",
"    'customer_id': np.random.randint(1, 10001, 50000),\n",
"    'amount': np.random.normal(100, 30, 50000)\n",
"})\n",
"\n",
"print(f\"Large customers: {large_customers.shape}\")\n",
"print(f\"Large orders: {large_orders.shape}\")\n",
"\n",
"# Test merge performance\n",
"start_time = time.time()\n",
"merged_result = pd.merge(large_customers, large_orders, on='customer_id', how='inner')\n",
"merge_time = time.time() - start_time\n",
"\n",
"# Test join performance\n",
"customers_indexed = large_customers.set_index('customer_id')\n",
"orders_indexed = large_orders.set_index('customer_id')\n",
"\n",
"# Note: the one-time cost of set_index is not included in the join timing\n",
"start_time = time.time()\n",
"joined_result = customers_indexed.join(orders_indexed, how='inner')\n",
"join_time = time.time() - start_time\n",
"\n",
"print(f\"\\nMerge time: {merge_time:.4f} seconds\")\n",
"print(f\"Join time: {join_time:.4f} seconds\")\n",
"print(f\"Merge/join time ratio: {merge_time/join_time:.2f} (values > 1 mean the index-based join was faster)\")\n",
"\n",
"print(f\"\\nResults shape - Merge: {merged_result.shape}, Join: {joined_result.shape}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Best practices and common pitfalls\n",
"print(\"=== BEST PRACTICES ===\")\n",
"\n",
"def analyze_merge_keys(df1, df2, key_col):\n",
"    \"\"\"Analyze merge keys before joining\"\"\"\n",
"    print(f\"\\n--- Analyzing merge on '{key_col}' ---\")\n",
"\n",
"    # Check for duplicates\n",
"    df1_dups = df1[key_col].duplicated().sum()\n",
"    df2_dups = df2[key_col].duplicated().sum()\n",
"\n",
"    print(f\"Duplicates in left table: {df1_dups}\")\n",
"    print(f\"Duplicates in right table: {df2_dups}\")\n",
"\n",
"    # Check for missing values\n",
"    df1_missing = df1[key_col].isnull().sum()\n",
"    df2_missing = df2[key_col].isnull().sum()\n",
"\n",
"    print(f\"Missing values in left table: {df1_missing}\")\n",
"    print(f\"Missing values in right table: {df2_missing}\")\n",
"\n",
"    # Check overlap\n",
"    left_keys = set(df1[key_col].dropna())\n",
"    right_keys = set(df2[key_col].dropna())\n",
"\n",
"    overlap = left_keys & right_keys\n",
"    left_only = left_keys - right_keys\n",
"    right_only = right_keys - left_keys\n",
"\n",
"    print(f\"Keys in both tables: {len(overlap)}\")\n",
"    print(f\"Keys only in left: {len(left_only)}\")\n",
"    print(f\"Keys only in right: {len(right_only)}\")\n",
"\n",
"    # Predict result sizes (exact only when keys are unique in both tables)\n",
"    if df1_dups == 0 and df2_dups == 0:\n",
"        inner_size = len(overlap)\n",
"        left_size = len(df1)\n",
"        right_size = len(df2)\n",
"        outer_size = len(left_keys | right_keys)\n",
"    else:\n",
"        print(\"Warning: Duplicates present, result size may be larger than expected\")\n",
"        inner_size = \"Cannot predict (duplicates present)\"\n",
"        left_size = \"Cannot predict (duplicates present)\"\n",
"        right_size = \"Cannot predict (duplicates present)\"\n",
"        outer_size = \"Cannot predict (duplicates present)\"\n",
"\n",
"    print(\"\\nPredicted result sizes:\")\n",
"    print(f\"Inner join: {inner_size}\")\n",
"    print(f\"Left join: {left_size}\")\n",
"    print(f\"Right join: {right_size}\")\n",
"    print(f\"Outer join: {outer_size}\")\n",
"\n",
"# Analyze our sample data\n",
"analyze_merge_keys(df_customers, df_orders, 'customer_id')\n",
"analyze_merge_keys(df_customers, df_segments, 'customer_id')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Data validation after merge\n",
"def validate_merge_result(df, expected_rows=None, key_col=None):\n",
"    \"\"\"Validate merge results\"\"\"\n",
"    print(\"\\n=== MERGE VALIDATION ===\")\n",
"\n",
"    print(f\"Result shape: {df.shape}\")\n",
"\n",
"    if expected_rows:\n",
"        print(f\"Expected rows: {expected_rows}\")\n",
"        if len(df) != expected_rows:\n",
"            print(\"⚠️ Row count doesn't match expectation!\")\n",
"\n",
"    # Check for unexpected duplicates\n",
"    if key_col and key_col in df.columns:\n",
"        duplicates = df[key_col].duplicated().sum()\n",
"        if duplicates > 0:\n",
"            print(f\"⚠️ Found {duplicates} duplicate keys after merge\")\n",
"\n",
"    # Check for missing values across all columns\n",
"    missing_summary = df.isnull().sum()\n",
"    critical_missing = missing_summary[missing_summary > 0]\n",
"\n",
"    if len(critical_missing) > 0:\n",
"        print(\"Missing values after merge:\")\n",
"        print(critical_missing)\n",
"\n",
"    # Data type consistency\n",
"    print(\"\\nData types:\")\n",
"    print(df.dtypes)\n",
"\n",
"    return df\n",
"\n",
"# Example validation\n",
"sample_merge = pd.merge(df_customers, df_orders, on='customer_id', how='inner')\n",
"validated_result = validate_merge_result(sample_merge, key_col='customer_id')"
]
|
|
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Practice Exercises\n",
"\n",
"Apply merging and joining techniques to real-world scenarios:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Exercise 1: Customer Lifetime Value Analysis\n",
"# Create a comprehensive customer analysis by joining:\n",
"# - Customer demographics\n",
"# - Order history\n",
"# - Product information\n",
"# - Customer segments\n",
"# Calculate CLV metrics for each customer\n",
"\n",
"def calculate_customer_lifetime_value(customers, orders, products, segments):\n",
"    \"\"\"Calculate comprehensive customer lifetime value metrics\"\"\"\n",
"    # Your implementation here\n",
"    pass\n",
"\n",
"# clv_analysis = calculate_customer_lifetime_value(df_customers, df_orders, df_products, df_segments)\n",
"# print(\"Customer Lifetime Value Analysis:\")\n",
"# print(clv_analysis.head())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Exercise 2: Data Quality Assessment\n",
"# Create a function that analyzes data quality issues when merging multiple datasets:\n",
"# - Identify orphaned records\n",
"# - Find data inconsistencies\n",
"# - Suggest data cleaning steps\n",
"# - Provide merge recommendations\n",
"\n",
"# Your code here:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Exercise 3: Time-series Join Challenge\n",
"# Create a complex time-based join scenario:\n",
"# - Join orders with time-varying product prices\n",
"# - Handle seasonal promotions\n",
"# - Calculate accurate historical revenue\n",
"# - Account for price changes over time\n",
"# Hint: pd.merge_asof() performs ordered, as-of joins on time columns\n",
"\n",
"# Your code here:\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Key Takeaways\n",
"\n",
"1. **Join Types**:\n",
"   - **Inner**: Only matching records from both tables\n",
"   - **Left**: All records from left table + matching from right\n",
"   - **Right**: All records from right table + matching from left\n",
"   - **Outer**: All records from both tables\n",
"\n",
"2. **Method Selection**:\n",
"   - **`pd.merge()`**: Most flexible, works with any columns\n",
"   - **`.join()`**: Faster for index-based joins\n",
"   - **`pd.concat()`**: For stacking DataFrames vertically/horizontally\n",
"\n",
"3. **Best Practices**:\n",
"   - Always analyze merge keys before joining\n",
"   - Check for duplicates and missing values\n",
"   - Validate results after merging\n",
"   - Use appropriate join types for your use case\n",
"   - Consider performance implications for large datasets\n",
"\n",
"4. **Common Pitfalls**:\n",
"   - Cartesian products from duplicate keys\n",
"   - Unexpected result sizes\n",
"   - Data type inconsistencies\n",
"   - Missing value propagation\n",
"\n",
"## Join Type Selection Guide\n",
"\n",
"| Use Case | Recommended Join | Rationale |\n",
"|----------|-----------------|-----------|\n",
"| Customer orders analysis | Inner | Only customers with orders |\n",
"| Customer segmentation | Left | Keep all customers, add segment info |\n",
"| Order validation | Right | Keep all orders, check customer validity |\n",
"| Data completeness analysis | Outer | See all records and identify gaps |\n",
"| Performance-critical operations | Index-based join | Faster execution |\n",
"\n",
"## Performance Tips\n",
"\n",
"1. **Index Usage**: Set indexes on frequently joined columns\n",
"2. **Data Types**: Ensure consistent data types before joining\n",
"3. **Memory Management**: Consider chunking for very large datasets\n",
"4. **Join Order**: Merge the smallest datasets first to keep intermediate results small\n",
"5. **Validation**: Always validate merge results"
]
}
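,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The four join types above can be compared directly by merging the same pair of tiny DataFrames with each `how=` value and watching the row counts change (the frames here are throwaway examples, not the lesson data):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Compare join types on overlapping keys: left has {1, 2, 3}, right has {2, 3, 4}\n",
"left = pd.DataFrame({'key': [1, 2, 3], 'left_val': ['a', 'b', 'c']})\n",
"right = pd.DataFrame({'key': [2, 3, 4], 'right_val': ['x', 'y', 'z']})\n",
"\n",
"for how in ['inner', 'left', 'right', 'outer']:\n",
"    merged = pd.merge(left, right, on='key', how=how)\n",
"    print(f\"{how:>5}: {len(merged)} rows, keys {sorted(merged['key'])}\")"
]
}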
],
"metadata": {
"kernelspec": {
"display_name": "venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.13.3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}