# Session 1 - DataFrames - Lesson 7: Merging and Joining DataFrames

## Learning Objectives
- Master different types of joins (inner, outer, left, right)
- Understand when to use merge vs join vs concat
- Handle duplicate keys and join conflicts
- Learn advanced merging techniques and best practices
- Practice with real-world data integration scenarios

## Prerequisites
- Completed Lessons 1-6
- Understanding of relational database concepts (helpful)
- Basic knowledge of SQL joins (helpful but not required)

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 50)

print("Libraries loaded successfully!")

## Creating Sample Datasets

Let's create realistic datasets that represent common business scenarios.

In [None]:
# Create sample datasets for merging examples
np.random.seed(42)

# Customer dataset
customers_data = {
 'customer_id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
 'customer_name': ['Alice Johnson', 'Bob Smith', 'Charlie Brown', 'Diana Prince', 'Eve Wilson',
 'Frank Miller', 'Grace Lee', 'Henry Davis', 'Ivy Chen', 'Jack Robinson'],
 'email': ['alice@email.com', 'bob@email.com', 'charlie@email.com', 'diana@email.com', 'eve@email.com',
 'frank@email.com', 'grace@email.com', 'henry@email.com', 'ivy@email.com', 'jack@email.com'],
 'age': [28, 35, 42, 31, 29, 45, 38, 33, 27, 41],
 'city': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix',
 'Philadelphia', 'San Antonio', 'San Diego', 'Dallas', 'San Jose'],
 'signup_date': pd.date_range('2023-01-01', periods=10, freq='M')
}

df_customers = pd.DataFrame(customers_data)

# Orders dataset
orders_data = {
 'order_id': [101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112],
 'customer_id': [1, 2, 1, 3, 4, 2, 5, 1, 6, 11, 3, 2], # Note: customer_id 11 doesn't exist in customers
 'order_date': pd.date_range('2023-06-01', periods=12, freq='W'),
 'product': ['Laptop', 'Phone', 'Tablet', 'Laptop', 'Monitor', 'Phone', 
 'Headphones', 'Mouse', 'Keyboard', 'Laptop', 'Tablet', 'Monitor'],
 'quantity': [1, 2, 1, 1, 1, 1, 3, 2, 1, 1, 2, 1],
 'amount': [1200, 800, 400, 1200, 300, 800, 150, 50, 75, 1200, 800, 300]
}

df_orders = pd.DataFrame(orders_data)

# Product information dataset
products_data = {
 'product': ['Laptop', 'Phone', 'Tablet', 'Monitor', 'Headphones', 'Mouse', 'Keyboard', 'Webcam'],
 'category': ['Electronics', 'Electronics', 'Electronics', 'Electronics', 
 'Audio', 'Accessories', 'Accessories', 'Electronics'],
 'price': [1200, 800, 400, 300, 150, 50, 75, 100],
 'supplier': ['TechCorp', 'MobileCorp', 'TechCorp', 'DisplayCorp', 
 'AudioCorp', 'AccessoryCorp', 'AccessoryCorp', 'TechCorp']
}

df_products = pd.DataFrame(products_data)

# Customer segments dataset
segments_data = {
 'customer_id': [1, 2, 3, 4, 5, 6, 7, 8, 12, 13], # Some customers not in main customer table
 'segment': ['Premium', 'Standard', 'Premium', 'Standard', 'Basic', 
 'Premium', 'Standard', 'Basic', 'Premium', 'Standard'],
 'loyalty_points': [1500, 800, 1200, 600, 200, 1800, 750, 300, 2000, 900]
}

df_segments = pd.DataFrame(segments_data)

print("Sample datasets created:")
print(f"Customers: {df_customers.shape}")
print(f"Orders: {df_orders.shape}")
print(f"Products: {df_products.shape}")
print(f"Segments: {df_segments.shape}")

print("\nCustomers dataset:")
print(df_customers.head())

print("\nOrders dataset:")
print(df_orders.head())

## 1. Basic Merge Operations

Understanding the fundamental merge operations and join types.
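
Before comparing the four join types, it can help to see which side each row of a result came from. The sketch below uses the `indicator=True` option of `pd.merge`, which adds a `_merge` column with the values `left_only`, `right_only`, or `both`. The small `customers` and `orders` frames are made up for illustration and are not the lesson datasets.

```python
import pandas as pd

# Tiny illustrative tables (hypothetical data, not the lesson datasets)
customers = pd.DataFrame({"customer_id": [1, 2, 3],
                          "name": ["Alice", "Bob", "Charlie"]})
orders = pd.DataFrame({"order_id": [101, 102, 103],
                       "customer_id": [1, 1, 4]})

# An outer join keeps every key; indicator=True labels the origin of each row
audit = pd.merge(customers, orders, on="customer_id", how="outer", indicator=True)
counts = audit["_merge"].value_counts()

print(audit)
print(counts)
```

Filtering on `_merge == 'left_only'` or `_merge == 'right_only'` gives the unmatched rows directly, without checking for nulls in a particular column.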

In [None]:
# Inner Join - only matching records
print("=== INNER JOIN ===")
inner_join = pd.merge(df_customers, df_orders, on='customer_id', how='inner')
print(f"Result shape: {inner_join.shape}")
print("Sample results:")
print(inner_join[['customer_name', 'order_id', 'product', 'amount']].head())

print(f"\nUnique customers in result: {inner_join['customer_id'].nunique()}")
print(f"Total orders: {len(inner_join)}")

# Check which customers have orders
customers_with_orders = inner_join['customer_id'].unique()
print(f"Customers with orders: {sorted(customers_with_orders)}")

In [None]:
# Left Join - all records from left table
print("=== LEFT JOIN ===")
left_join = pd.merge(df_customers, df_orders, on='customer_id', how='left')
print(f"Result shape: {left_join.shape}")
print("Sample results:")
print(left_join[['customer_name', 'order_id', 'product', 'amount']].head(10))

# Check customers without orders
customers_without_orders = left_join[left_join['order_id'].isnull()]['customer_name'].tolist()
print(f"\nCustomers without orders: {customers_without_orders}")

# Summary statistics
print(f"\nTotal records: {len(left_join)}")
print(f"Records with orders: {left_join['order_id'].notna().sum()}")
print(f"Records without orders: {left_join['order_id'].isnull().sum()}")

In [None]:
# Right Join - all records from right table
print("=== RIGHT JOIN ===")
right_join = pd.merge(df_customers, df_orders, on='customer_id', how='right')
print(f"Result shape: {right_join.shape}")
print("Sample results:")
print(right_join[['customer_name', 'order_id', 'product', 'amount']].head())

# Check orders without customer information
orders_without_customers = right_join[right_join['customer_name'].isnull()]
print(f"\nOrders without customer info: {len(orders_without_customers)}")
if len(orders_without_customers) > 0:
 print(orders_without_customers[['customer_id', 'order_id', 'product', 'amount']])

In [None]:
# Outer Join - all records from both tables
print("=== OUTER JOIN ===")
outer_join = pd.merge(df_customers, df_orders, on='customer_id', how='outer')
print(f"Result shape: {outer_join.shape}")

# Analyze the result
print("\nData quality analysis:")
print(f"Records with complete customer info: {outer_join['customer_name'].notna().sum()}")
print(f"Records with complete order info: {outer_join['order_id'].notna().sum()}")
print(f"Records with both customer and order info: {(outer_join['customer_name'].notna() & outer_join['order_id'].notna()).sum()}")

# Show different categories of records
print("\nCustomers without orders:")
customers_only = outer_join[(outer_join['customer_name'].notna()) & (outer_join['order_id'].isnull())]
print(customers_only[['customer_name', 'city']].drop_duplicates())

print("\nOrders without customer data:")
orders_only = outer_join[(outer_join['customer_name'].isnull()) & (outer_join['order_id'].notna())]
print(orders_only[['customer_id', 'order_id', 'product', 'amount']])

## 2. Multiple Table Joins

Combining data from multiple sources in sequence.
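
When several merges are chained, one unexpected duplicate in a lookup table can silently multiply rows in every later step. A minimal sketch of one safeguard, assuming small made-up `orders` and `products` tables: the `validate` argument of `pd.merge` checks the expected key relationship and raises `pandas.errors.MergeError` if it does not hold.

```python
import pandas as pd
from pandas.errors import MergeError

# Hypothetical tables: many orders can point at one product row
orders = pd.DataFrame({"order_id": [1, 2, 3],
                       "product": ["Laptop", "Phone", "Laptop"]})
products = pd.DataFrame({"product": ["Laptop", "Phone"],
                         "price": [1200, 800]})

# Passes: each product appears once on the right side
enriched = pd.merge(orders, products, on="product", how="left",
                    validate="many_to_one")

# Simulate a lookup table that accidentally contains a repeated product
products_dup = pd.concat([products, products.iloc[[0]]], ignore_index=True)
try:
    pd.merge(orders, products_dup, on="product", how="left",
             validate="many_to_one")
    duplicate_detected = False
except MergeError:
    duplicate_detected = True

print(enriched)
print(f"Duplicate product key detected: {duplicate_detected}")
```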

In [None]:
# Three-way join: Customers + Orders + Products
print("=== THREE-WAY JOIN ===")

# Step 1: Join customers and orders
customer_orders = pd.merge(df_customers, df_orders, on='customer_id', how='inner')
print(f"After joining customers and orders: {customer_orders.shape}")

# Step 2: Join with products
complete_data = pd.merge(customer_orders, df_products, on='product', how='left')
print(f"After joining with products: {complete_data.shape}")

# Display comprehensive view
print("\nComplete order information:")
display_cols = ['customer_name', 'order_id', 'product', 'category', 'quantity', 'amount', 'price', 'supplier']
print(complete_data[display_cols].head())

# Verify data consistency
print("\nData consistency check:")
# Check if order amount matches product price * quantity
complete_data['calculated_amount'] = complete_data['price'] * complete_data['quantity']
amount_matches = (complete_data['amount'] == complete_data['calculated_amount']).all()
print(f"Order amounts match calculated amounts: {amount_matches}")

if not amount_matches:
 mismatched = complete_data[complete_data['amount'] != complete_data['calculated_amount']]
 print(f"\nMismatched records: {len(mismatched)}")
 print(mismatched[['order_id', 'product', 'amount', 'calculated_amount']])

In [None]:
# Add customer segment information
print("=== ADDING CUSTOMER SEGMENTS ===")

# Join with segments (left join to keep all customers)
customers_with_segments = pd.merge(df_customers, df_segments, on='customer_id', how='left')
print(f"Customers with segments shape: {customers_with_segments.shape}")

# Check which customers don't have segment information
missing_segments = customers_with_segments[customers_with_segments['segment'].isnull()]
print(f"\nCustomers without segment info: {len(missing_segments)}")
if len(missing_segments) > 0:
 print(missing_segments[['customer_name', 'city']])

# Create comprehensive customer profile
full_customer_profile = pd.merge(complete_data, df_segments, on='customer_id', how='left')
print(f"\nFull customer profile shape: {full_customer_profile.shape}")

# Analyze by segment
segment_analysis = full_customer_profile.groupby('segment').agg({
 'amount': ['sum', 'mean', 'count'],
 'customer_id': 'nunique'
}).round(2)
segment_analysis.columns = ['Total_Revenue', 'Avg_Order_Value', 'Total_Orders', 'Unique_Customers']
print("\nRevenue by customer segment:")
print(segment_analysis)

## 3. Advanced Merge Techniques

Handling complex merging scenarios and edge cases.
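
One edge case worth knowing about is a time-based lookup where keys rarely match exactly, such as finding the price in effect on an order date. The sketch below uses `pd.merge_asof`, which matches each left row to the most recent right row at or before it. The data and the `effective_date` column name are assumptions made for this illustration.

```python
import pandas as pd

# Hypothetical orders and a price history with effective dates
orders = pd.DataFrame({
    "order_date": pd.to_datetime(["2023-06-15", "2023-07-20", "2023-08-10"]),
    "product": ["Laptop", "Laptop", "Phone"],
})
prices = pd.DataFrame({
    "effective_date": pd.to_datetime(["2023-06-01", "2023-08-01",
                                      "2023-06-01", "2023-08-01"]),
    "product": ["Laptop", "Laptop", "Phone", "Phone"],
    "price": [1200, 1100, 800, 750],
})

# merge_asof requires both frames to be sorted by their time key
priced = pd.merge_asof(
    orders.sort_values("order_date"),
    prices.sort_values("effective_date"),
    left_on="order_date",
    right_on="effective_date",
    by="product",            # match within the same product only
    direction="backward",    # take the latest price at or before the order date
)

print(priced)
```

Unlike an exact merge on a month column, this still finds a price for the July order, because the June price remains in effect until the August change.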

In [None]:
# Merge with different column names
print("=== MERGE WITH DIFFERENT COLUMN NAMES ===")

# Create a dataset with different column name
customer_demographics = pd.DataFrame({
 'cust_id': [1, 2, 3, 4, 5],
 'income_range': ['50-75k', '75-100k', '50-75k', '100k+', '25-50k'],
 'education': ['Bachelor', 'Master', 'PhD', 'Master', 'Bachelor'],
 'occupation': ['Engineer', 'Manager', 'Professor', 'Director', 'Analyst']
})

# Merge using left_on and right_on parameters
customers_with_demographics = pd.merge(
 df_customers, 
 customer_demographics, 
 left_on='customer_id', 
 right_on='cust_id', 
 how='left'
)

print("Merge with different column names:")
print(customers_with_demographics[['customer_name', 'customer_id', 'cust_id', 'income_range', 'education']].head())

# Clean up duplicate columns
customers_with_demographics = customers_with_demographics.drop('cust_id', axis=1)
print(f"\nAfter cleanup: {customers_with_demographics.shape}")

In [None]:
# Merge on multiple columns
print("=== MERGE ON MULTIPLE COLUMNS ===")

# Create time-based pricing data
pricing_data = pd.DataFrame({
 'product': ['Laptop', 'Laptop', 'Phone', 'Phone', 'Tablet', 'Tablet'],
 'date': pd.to_datetime(['2023-06-01', '2023-08-01', '2023-06-01', '2023-08-01', '2023-06-01', '2023-08-01']),
 'price': [1200, 1100, 800, 750, 400, 380],
 'promotion': [False, True, False, True, False, True]
})

# Add year-month to orders for matching
df_orders_with_period = df_orders.copy()
df_orders_with_period['order_month'] = df_orders_with_period['order_date'].dt.to_period('M').dt.start_time

# Create matching periods in pricing data
pricing_data['period'] = pricing_data['date'].dt.to_period('M').dt.start_time

# Merge on product and time period
orders_with_pricing = pd.merge(
 df_orders_with_period,
 pricing_data,
 left_on=['product', 'order_month'],
 right_on=['product', 'period'],
 how='left'
)

print("Orders with time-based pricing:")
print(orders_with_pricing[['order_id', 'product', 'order_date', 'amount', 'price', 'promotion']].head())

# Check for pricing discrepancies
pricing_discrepancies = orders_with_pricing[
 (orders_with_pricing['amount'] != orders_with_pricing['price'] * orders_with_pricing['quantity']) &
 orders_with_pricing['price'].notna()
]
print(f"\nOrders with pricing discrepancies: {len(pricing_discrepancies)}")

In [None]:
# Handling duplicate keys in merge
print("=== HANDLING DUPLICATE KEYS ===")

# Create data with duplicate keys
customer_contacts = pd.DataFrame({
 'customer_id': [1, 1, 2, 2, 3],
 'contact_type': ['email', 'phone', 'email', 'phone', 'email'],
 'contact_value': ['alice@email.com', '555-0101', 'bob@email.com', '555-0102', 'charlie@email.com'],
 'is_primary': [True, False, True, True, True]
})

print("Customer contacts with duplicates:")
print(customer_contacts)

# A one-to-many merge repeats each customer row once per matching contact
# (with duplicates on both sides, each key produces a cartesian product)
customers_with_contacts = pd.merge(df_customers, customer_contacts, on='customer_id', how='inner')
print(f"\nResult of merge with duplicates: {customers_with_contacts.shape}")
print(customers_with_contacts[['customer_name', 'contact_type', 'contact_value', 'is_primary']].head())

# Strategy 1: Filter before merge
# Customer 2 has two contacts flagged as primary, so keep only the first one per customer
primary_contacts = customer_contacts[customer_contacts['is_primary']].drop_duplicates(subset='customer_id')
customers_primary_contacts = pd.merge(df_customers, primary_contacts, on='customer_id', how='left')
print(f"\nAfter filtering to primary contacts: {customers_primary_contacts.shape}")

# Strategy 2: Pivot contacts to columns
contacts_pivoted = customer_contacts.pivot_table(
 index='customer_id',
 columns='contact_type',
 values='contact_value',
 aggfunc='first'
).reset_index()
print("\nPivoted contacts:")
print(contacts_pivoted)

customers_with_pivoted_contacts = pd.merge(df_customers, contacts_pivoted, on='customer_id', how='left')
print(f"\nAfter merging pivoted contacts: {customers_with_pivoted_contacts.shape}")

## 4. Index-based Joins

Using DataFrame indices for joining operations.
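
Index-based joins do not require both frames to be indexed by the key. As a small sketch with made-up data, `DataFrame.join` accepts an `on` argument naming a column in the calling frame, which is then matched against the index of the other frame. The equivalent `merge` call is shown for comparison.

```python
import pandas as pd

# Hypothetical data: orders keep customer_id as a column,
# segments use customer_id as their index
orders = pd.DataFrame({"order_id": [101, 102, 103],
                       "customer_id": [1, 2, 9]})
segments = pd.DataFrame({"customer_id": [1, 2],
                         "segment": ["Premium", "Standard"]}).set_index("customer_id")

# Column on the left matched against the index on the right
joined = orders.join(segments, on="customer_id", how="left")

# The same operation spelled out with merge
merged = orders.merge(segments, left_on="customer_id", right_index=True, how="left")

print(joined)
print(f"Same result as merge: {joined.equals(merged)}")
```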

In [None]:
# Set up DataFrames with indices
print("=== INDEX-BASED JOINS ===")

# Set customer_id as index
customers_indexed = df_customers.set_index('customer_id')
segments_indexed = df_segments.set_index('customer_id')

print("Customers with index:")
print(customers_indexed.head())

# Join using indices
joined_by_index = customers_indexed.join(segments_indexed, how='left')
print(f"\nJoined by index shape: {joined_by_index.shape}")
print(joined_by_index[['customer_name', 'city', 'segment', 'loyalty_points']].head())

# Compare with merge
merged_equivalent = pd.merge(df_customers, df_segments, on='customer_id', how='left')
print(f"\nEquivalent merge shape: {merged_equivalent.shape}")

# Verify they're the same (after sorting)
joined_sorted = joined_by_index.reset_index().sort_values('customer_id')
merged_sorted = merged_equivalent.sort_values('customer_id')
are_equal = joined_sorted.equals(merged_sorted)
print(f"Results are identical: {are_equal}")

In [None]:
# Multi-index joins
print("=== MULTI-INDEX JOINS ===")

# Create a dataset with multiple index levels
sales_by_region_product = pd.DataFrame({
 'region': ['North', 'North', 'South', 'South', 'East', 'East'],
 'product': ['Laptop', 'Phone', 'Laptop', 'Phone', 'Laptop', 'Phone'],
 'sales_target': [10, 15, 8, 12, 12, 18],
 'commission_rate': [0.05, 0.04, 0.06, 0.05, 0.05, 0.04]
})

# Set multi-index
sales_targets = sales_by_region_product.set_index(['region', 'product'])
print("Sales targets with multi-index:")
print(sales_targets)

# Create customer orders with region mapping
customer_regions = {
 1: 'North', 2: 'South', 3: 'East', 4: 'North', 5: 'South', 6: 'East'
}

orders_with_region = df_orders.copy()
orders_with_region['region'] = orders_with_region['customer_id'].map(customer_regions)
orders_with_region = orders_with_region.dropna(subset=['region'])

# Merge on multiple columns to match multi-index
orders_with_targets = pd.merge(
 orders_with_region,
 sales_targets.reset_index(),
 on=['region', 'product'],
 how='left'
)

print("\nOrders with sales targets:")
print(orders_with_targets[['order_id', 'region', 'product', 'amount', 'sales_target', 'commission_rate']].head())

## 5. Concatenation Operations

Combining DataFrames vertically and horizontally.
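
Two options of `pd.concat` are easy to overlook when stacking batches. `join='inner'` keeps only the columns shared by all inputs. `verify_integrity=True` raises a `ValueError` if the combined index contains duplicate labels. The sketch below uses two made-up batches to show both.

```python
import pandas as pd

# Hypothetical batches; the second one carries an extra column
batch_a = pd.DataFrame({"customer_id": [1, 2], "city": ["Austin", "Boston"]})
batch_b = pd.DataFrame({"customer_id": [3], "city": ["Denver"],
                        "referral_source": ["Google"]})

# Default join='outer': union of columns, missing values filled with NaN
union_cols = pd.concat([batch_a, batch_b], ignore_index=True)

# join='inner': only the columns present in every input
shared_cols = pd.concat([batch_a, batch_b], join="inner", ignore_index=True)

# Without ignore_index both batches contain index label 0,
# which verify_integrity reports as an error
try:
    pd.concat([batch_a, batch_b], verify_integrity=True)
    overlap_detected = False
except ValueError:
    overlap_detected = True

print(union_cols)
print(shared_cols)
print(f"Overlapping index detected: {overlap_detected}")
```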

In [None]:
# Vertical concatenation (stacking DataFrames)
print("=== VERTICAL CONCATENATION ===")

# Create additional customer data (new batch)
new_customers = pd.DataFrame({
 'customer_id': [11, 12, 13, 14, 15],
 'customer_name': ['Kate Wilson', 'Liam Brown', 'Mia Garcia', 'Noah Jones', 'Olivia Miller'],
 'email': ['kate@email.com', 'liam@email.com', 'mia@email.com', 'noah@email.com', 'olivia@email.com'],
 'age': [26, 39, 31, 44, 28],
 'city': ['Austin', 'Seattle', 'Denver', 'Boston', 'Miami'],
 'signup_date': pd.date_range('2024-01-01', periods=5, freq='M')
})

# Concatenate vertically
all_customers = pd.concat([df_customers, new_customers], ignore_index=True)
print(f"Original customers: {len(df_customers)}")
print(f"New customers: {len(new_customers)}")
print(f"Combined customers: {len(all_customers)}")

print("\nCombined customer data:")
print(all_customers.tail())

# Concatenation with different columns
customers_with_extra_info = pd.DataFrame({
 'customer_id': [16, 17],
 'customer_name': ['Paul Davis', 'Quinn Taylor'],
 'email': ['paul@email.com', 'quinn@email.com'],
 'age': [35, 29],
 'city': ['Portland', 'Nashville'],
 'signup_date': pd.date_range('2024-06-01', periods=2, freq='M'),
 'referral_source': ['Google', 'Facebook'] # Extra column
})

# Concat with different columns (creates NaN for missing columns)
all_customers_extended = pd.concat([all_customers, customers_with_extra_info], ignore_index=True, sort=False)
print(f"\nAfter adding customers with extra info: {all_customers_extended.shape}")
print("Missing values in referral_source:")
print(all_customers_extended['referral_source'].isnull().sum())

In [None]:
# Horizontal concatenation
print("=== HORIZONTAL CONCATENATION ===")

# Split customer data into parts
customer_basic_info = df_customers[['customer_id', 'customer_name', 'email']]
customer_demographics = df_customers[['customer_id', 'age', 'city', 'signup_date']]

print("Customer basic info:")
print(customer_basic_info.head())

print("\nCustomer demographics:")
print(customer_demographics.head())

# Concatenate horizontally (by index)
customers_recombined = pd.concat([customer_basic_info, customer_demographics.drop('customer_id', axis=1)], axis=1)
print(f"\nRecombined shape: {customers_recombined.shape}")
print(customers_recombined.head())

# Verify it matches original
columns_match = set(customers_recombined.columns) == set(df_customers.columns)
print(f"\nColumns match original: {columns_match}")

In [None]:
# Concat with keys (creating a hierarchical index on rows or columns)
print("=== CONCAT WITH KEYS ===")

# Create quarterly sales data
q1_sales = pd.DataFrame({
 'product': ['Laptop', 'Phone', 'Tablet'],
 'units_sold': [50, 75, 30],
 'revenue': [60000, 60000, 12000]
})

q2_sales = pd.DataFrame({
 'product': ['Laptop', 'Phone', 'Tablet'],
 'units_sold': [45, 80, 35],
 'revenue': [54000, 64000, 14000]
})

# Concatenate with keys
quarterly_sales = pd.concat([q1_sales, q2_sales], keys=['Q1', 'Q2'])
print("Quarterly sales with hierarchical index:")
print(quarterly_sales)

# Access specific quarter
print("\nQ1 sales only:")
print(quarterly_sales.loc['Q1'])

# Create summary comparison
quarterly_comparison = pd.concat([q1_sales.set_index('product'), q2_sales.set_index('product')], 
 keys=['Q1', 'Q2'], axis=1)
print("\nQuarterly comparison (side by side):")
print(quarterly_comparison)

## 6. Performance and Best Practices

Optimizing merge operations and avoiding common pitfalls.
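
One pitfall is an inner join that returns far more rows than expected because keys repeat on both sides. The row count can be estimated before merging: for each key, multiply the number of occurrences on the left by the number on the right, then add up the products. The helper name `predict_inner_rows` and the sample data are hypothetical, and the sketch assumes the key column contains no missing values.

```python
import pandas as pd

def predict_inner_rows(df1, df2, key):
    """Estimate inner-join row count from per-key frequencies (assumes no null keys)."""
    left_counts = df1[key].value_counts()
    right_counts = df2[key].value_counts()
    # Multiplication aligns on the key; keys missing from one side
    # become NaN and are skipped by sum()
    return int((left_counts * right_counts).sum())

# Hypothetical data with repeated keys on the right side
left = pd.DataFrame({"customer_id": [1, 2, 3],
                     "name": ["Alice", "Bob", "Charlie"]})
right = pd.DataFrame({"customer_id": [1, 1, 2, 2, 2, 4],
                      "order_id": range(6)})

predicted = predict_inner_rows(left, right, "customer_id")
actual = len(pd.merge(left, right, on="customer_id", how="inner"))

print(f"Predicted rows: {predicted}, actual rows: {actual}")
```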

In [None]:
# Performance comparison: merge vs join
import time

print("=== PERFORMANCE COMPARISON ===")

# Create larger datasets for performance testing
np.random.seed(42)
large_customers = pd.DataFrame({
 'customer_id': range(1, 10001),
 'customer_name': [f'Customer_{i}' for i in range(1, 10001)],
 'city': np.random.choice(['New York', 'Los Angeles', 'Chicago'], 10000)
})

large_orders = pd.DataFrame({
 'order_id': range(1, 50001),
 'customer_id': np.random.randint(1, 10001, 50000),
 'amount': np.random.normal(100, 30, 50000)
})

print(f"Large customers: {large_customers.shape}")
print(f"Large orders: {large_orders.shape}")

# Test merge performance
start_time = time.perf_counter()
merged_result = pd.merge(large_customers, large_orders, on='customer_id', how='inner')
merge_time = time.perf_counter() - start_time

# Test join performance (index setup happens outside the timed region)
customers_indexed = large_customers.set_index('customer_id')
orders_indexed = large_orders.set_index('customer_id')

start_time = time.perf_counter()
joined_result = customers_indexed.join(orders_indexed, how='inner')
join_time = time.perf_counter() - start_time

print(f"\nMerge time: {merge_time:.4f} seconds")
print(f"Join time: {join_time:.4f} seconds")
# A single timing is noisy; a ratio above 1 means join was faster in this run
print(f"Merge/join time ratio: {merge_time/join_time:.2f}")

print(f"\nResults shape - Merge: {merged_result.shape}, Join: {joined_result.shape}")

In [None]:
# Best practices and common pitfalls
print("=== BEST PRACTICES ===")

def analyze_merge_keys(df1, df2, key_col):
    """Analyze merge keys before joining"""
    print(f"\n--- Analyzing merge on '{key_col}' ---")

    # Check for duplicates
    df1_dups = df1[key_col].duplicated().sum()
    df2_dups = df2[key_col].duplicated().sum()

    print(f"Duplicates in left table: {df1_dups}")
    print(f"Duplicates in right table: {df2_dups}")

    # Check for missing values
    df1_missing = df1[key_col].isnull().sum()
    df2_missing = df2[key_col].isnull().sum()

    print(f"Missing values in left table: {df1_missing}")
    print(f"Missing values in right table: {df2_missing}")

    # Check overlap
    left_keys = set(df1[key_col].dropna())
    right_keys = set(df2[key_col].dropna())

    overlap = left_keys & right_keys
    left_only = left_keys - right_keys
    right_only = right_keys - left_keys

    print(f"Keys in both tables: {len(overlap)}")
    print(f"Keys only in left: {len(left_only)}")
    print(f"Keys only in right: {len(right_only)}")

    # Predict result sizes
    if df1_dups == 0 and df2_dups == 0:
        inner_size = len(overlap)
        left_size = len(df1)
        right_size = len(df2)
        outer_size = len(left_keys | right_keys)
    else:
        print("Warning: Duplicates present, result size may be larger than expected")
        inner_size = "Cannot predict (duplicates present)"
        left_size = "Cannot predict (duplicates present)"
        right_size = "Cannot predict (duplicates present)"
        outer_size = "Cannot predict (duplicates present)"

    print("\nPredicted result sizes:")
    print(f"Inner join: {inner_size}")
    print(f"Left join: {left_size}")
    print(f"Right join: {right_size}")
    print(f"Outer join: {outer_size}")

# Analyze our sample data
analyze_merge_keys(df_customers, df_orders, 'customer_id')
analyze_merge_keys(df_customers, df_segments, 'customer_id')

In [None]:
# Data validation after merge
def validate_merge_result(df, expected_rows=None, key_col=None):
    """Validate merge results"""
    print("\n=== MERGE VALIDATION ===")

    print(f"Result shape: {df.shape}")

    # Compare against an explicit expectation (0 is a valid expectation)
    if expected_rows is not None:
        print(f"Expected rows: {expected_rows}")
        if len(df) != expected_rows:
            print("⚠️ Row count doesn't match expectation!")

    # Check for unexpected duplicates
    if key_col and key_col in df.columns:
        duplicates = df[key_col].duplicated().sum()
        if duplicates > 0:
            print(f"⚠️ Found {duplicates} duplicate keys after merge")

    # Check for missing values in any column
    missing_summary = df.isnull().sum()
    critical_missing = missing_summary[missing_summary > 0]

    if len(critical_missing) > 0:
        print("Missing values after merge:")
        print(critical_missing)

    # Data type consistency
    print("\nData types:")
    print(df.dtypes)

    return df

# Example validation
sample_merge = pd.merge(df_customers, df_orders, on='customer_id', how='inner')
validated_result = validate_merge_result(sample_merge, key_col='customer_id')

## Practice Exercises

Apply merging and joining techniques to real-world scenarios:

In [20]:
# Exercise 1: Customer Lifetime Value Analysis
# Create a comprehensive customer analysis by joining:
# - Customer demographics
# - Order history
# - Product information
# - Customer segments
# Calculate CLV metrics for each customer

def calculate_customer_lifetime_value(customers, orders, products, segments):
 """Calculate comprehensive customer lifetime value metrics"""
 # Your implementation here
 pass

# clv_analysis = calculate_customer_lifetime_value(df_customers, df_orders, df_products, df_segments)
# print("Customer Lifetime Value Analysis:")
# print(clv_analysis.head())

In [21]:
# Exercise 2: Data Quality Assessment
# Create a function that analyzes data quality issues when merging multiple datasets:
# - Identify orphaned records
# - Find data inconsistencies
# - Suggest data cleaning steps
# - Provide merge recommendations

# Your code here:


In [22]:
# Exercise 3: Time-series Join Challenge
# Create a complex time-based join scenario:
# - Join orders with time-varying product prices
# - Handle seasonal promotions
# - Calculate accurate historical revenue
# - Account for price changes over time

# Your code here:


## Key Takeaways

1. **Join Types**:
 - **Inner**: Only matching records from both tables
 - **Left**: All records from left table + matching from right
 - **Right**: All records from right table + matching from left
 - **Outer**: All records from both tables

2. **Method Selection**:
 - **`pd.merge()`**: Most flexible, works with any columns
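 - **`.join()`**: Convenient shorthand for index-based joins; it uses the same merge machinery internally, so measure before assuming it is faster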
 - **`pd.concat()`**: For stacking DataFrames vertically/horizontally

3. **Best Practices**:
 - Always analyze merge keys before joining
 - Check for duplicates and missing values
 - Validate results after merging
 - Use appropriate join types for your use case
 - Consider performance implications for large datasets

4. **Common Pitfalls**:
 - Cartesian products from duplicate keys
 - Unexpected result sizes
 - Data type inconsistencies
 - Missing value propagation
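
The data type pitfall is worth a concrete look. When a key is stored as integers in one table and as text in the other, recent pandas versions typically refuse the merge with an error rather than matching the values. A minimal sketch of one fix, using made-up data, is to convert the key to a common dtype first. The name `orders_raw` is hypothetical and stands for data loaded from a file where IDs arrived as strings.

```python
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2, 3],
                          "name": ["Alice", "Bob", "Charlie"]})

# Hypothetical raw orders where the key was read as text
orders_raw = pd.DataFrame({"customer_id": ["1", "2", "2"],
                           "amount": [100, 250, 75]})

# Convert the text key to integers so both key columns share a dtype
orders_clean = orders_raw.assign(
    customer_id=pd.to_numeric(orders_raw["customer_id"]).astype("int64")
)

same_dtype = customers["customer_id"].dtype == orders_clean["customer_id"].dtype
merged = customers.merge(orders_clean, on="customer_id", how="inner")

print(f"Key dtypes match: {same_dtype}")
print(merged)
```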

## Join Type Selection Guide

| Use Case | Recommended Join | Rationale |
|----------|-----------------|----------|
| Customer orders analysis | Inner | Only customers with orders |
| Customer segmentation | Left | Keep all customers, add segment info |
| Order validation | Right | Keep all orders, check customer validity |
| Data completeness analysis | Outer | See all records and identify gaps |
| Performance-critical operations | Index-based join | Can be faster when indexes are already set; benchmark on your data |

## Performance Tips

1. **Index Usage**: Set indexes for frequently joined columns
2. **Data Types**: Ensure consistent data types before joining
3. **Memory Management**: Consider chunking for very large datasets
4. **Join Order**: Filter rows and join the smallest tables first so intermediate results stay small
5. **Validation**: Always validate merge results