# Session 1 - DataFrames - Lesson 9: Pivot Tables and Data Reshaping

## Learning Objectives
- Master pivot table creation and customization
- Understand data reshaping with melt, pivot, stack, and unstack
- Learn cross-tabulation and contingency tables
- Practice with multi-level indexing and hierarchical data
- Apply reshaping techniques to real-world analysis scenarios

## Prerequisites
- Completed Lessons 1-8
- Understanding of aggregation and groupby operations
- Familiarity with Excel pivot tables (helpful but not required)

In [24]:
# Import required libraries
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
# Silence warnings to keep the output compact (note: this also hides pandas deprecation notices)
warnings.filterwarnings('ignore')

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 20)
pd.set_option('display.width', None)

print("Libraries loaded successfully!")

Libraries loaded successfully!


## Creating Sample Dataset

Let's create a synthetic business dataset of 1,000 daily sales records to use in the pivot table examples.

In [25]:
# Create comprehensive business dataset
np.random.seed(42)
n_records = 1000

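# Generate synthetic business data. Each column is sampled independently,
# so 'product' is not tied to 'product_category' (e.g. a 'Magazine' can appear under 'Clothing').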
business_data = {
    'date': pd.date_range('2024-01-01', periods=n_records, freq='D'),
    'salesperson': np.random.choice([
        'Alice Johnson', 'Bob Smith', 'Charlie Brown', 'Diana Prince', 'Eve Wilson',
        'Frank Miller', 'Grace Lee', 'Henry Davis', 'Ivy Chen', 'Jack Robinson'
    ], n_records),
    'region': np.random.choice(['North', 'South', 'East', 'West'], n_records),
    'product_category': np.random.choice(['Electronics', 'Clothing', 'Books', 'Home & Garden'], n_records),
    'product': np.random.choice([
        'Laptop', 'Phone', 'Tablet', 'Headphones', 'Speaker',
        'Shirt', 'Pants', 'Shoes', 'Jacket', 'Hat',
        'Novel', 'Textbook', 'Magazine', 'Comic', 'Cookbook',
        'Plant', 'Tool', 'Furniture', 'Decoration', 'Garden'
    ], n_records),
    'customer_type': np.random.choice(['New', 'Returning', 'VIP'], n_records, p=[0.3, 0.5, 0.2]),
    'sales_channel': np.random.choice(['Online', 'Store', 'Phone'], n_records, p=[0.6, 0.3, 0.1]),
    'quantity': np.random.randint(1, 10, n_records),
    'unit_price': np.random.normal(50, 20, n_records),
    'discount_percent': np.random.choice([0, 5, 10, 15, 20], n_records, p=[0.5, 0.2, 0.2, 0.08, 0.02]),
    'shipping_cost': np.random.normal(8, 3, n_records)
}

df_business = pd.DataFrame(business_data)

# Clean and calculate derived fields
df_business['unit_price'] = np.abs(df_business['unit_price']).round(2)
df_business['shipping_cost'] = np.abs(df_business['shipping_cost']).round(2)
df_business['gross_sales'] = df_business['quantity'] * df_business['unit_price']
df_business['discount_amount'] = df_business['gross_sales'] * df_business['discount_percent'] / 100
df_business['net_sales'] = df_business['gross_sales'] - df_business['discount_amount']
df_business['total_order'] = df_business['net_sales'] + df_business['shipping_cost']

# Add time-based columns
df_business['year'] = df_business['date'].dt.year
df_business['month'] = df_business['date'].dt.month
df_business['quarter'] = df_business['date'].dt.quarter
df_business['day_of_week'] = df_business['date'].dt.day_name()
df_business['month_name'] = df_business['date'].dt.month_name()

print("Business dataset created:")
print(f"Shape: {df_business.shape}")
print("\nFirst few rows:")
print(df_business.head())
print("\nColumn info:")
print(df_business.dtypes)

Business dataset created:
Shape: (1000, 20)

First few rows:
        date   salesperson region product_category   product customer_type  \
0 2024-01-01     Grace Lee  North         Clothing     Shoes     Returning   
1 2024-01-02  Diana Prince   East         Clothing  Magazine     Returning   
2 2024-01-03   Henry Davis   West            Books     Shoes           VIP   
3 2024-01-04    Eve Wilson   West         Clothing     Novel           New   
4 2024-01-05     Grace Lee   East         Clothing    Laptop           VIP   

  sales_channel  quantity  unit_price  discount_percent  shipping_cost  \
0        Online         1       72.81                 0          16.13   
1         Store         9       99.98                15          10.42   
2        Online         6        4.61                 0           8.73   
3         Phone         8       31.51                 0           8.27   
4         Store         3       86.28                15          11.61   

   gross_sales  discount_

## 1. Basic Pivot Tables

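`pivot_table` summarizes one column (`values`) across the categories of an `index` column (rows) and a `columns` column, combining the records in each cell with `aggfunc`.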

In [26]:
# Simple pivot table - sales by region and product category
print("=== BASIC PIVOT TABLES ===")

# Basic pivot: sum of sales by region and product category
basic_pivot = df_business.pivot_table(
    values='net_sales',
    index='region',
    columns='product_category',
    aggfunc='sum',
    fill_value=0
)

print("Sales by Region and Product Category:")
print(basic_pivot.round(0))

# Average order value by customer type and sales channel
avg_order_pivot = df_business.pivot_table(
    values='total_order',
    index='customer_type',
    columns='sales_channel',
    aggfunc='mean',
    fill_value=0
)

print("\nAverage Order Value by Customer Type and Sales Channel:")
print(avg_order_pivot.round(2))

# Count of transactions
transaction_count = df_business.pivot_table(
    values='total_order',
    index='region',
    columns='customer_type',
    aggfunc='count',
    fill_value=0
)

print("\nTransaction Count by Region and Customer Type:")
print(transaction_count)

=== BASIC PIVOT TABLES ===
Sales by Region and Product Category:
product_category    Books  Clothing  Electronics  Home & Garden
region                                                         
East              12377.0   12466.0      16463.0        14140.0
North             13492.0   14938.0      15229.0        17770.0
South             17590.0   16946.0      14552.0        13695.0
West               8157.0   16544.0      12681.0        15652.0

Average Order Value by Customer Type and Sales Channel:
sales_channel  Online   Phone   Store
customer_type                        
New            263.07  244.46  222.16
Returning      246.03  220.45  224.39
VIP            234.35  237.16  253.16

Transaction Count by Region and Customer Type:
customer_type  New  Returning  VIP
region                            
East            60        128   60
North           70        144   53
South           68        138   45
West            61        119   54
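
One way to build intuition for `pivot_table` is to compare it with a `groupby` followed by `unstack`. The sketch below uses a tiny hypothetical dataset (the `toy` frame and its `category` column are illustrative, not part of the lesson data) and should give the same table both ways.

```python
import pandas as pd

# Tiny hypothetical dataset; names and values are illustrative only
toy = pd.DataFrame({
    "region": ["East", "East", "West", "West", "West"],
    "category": ["Books", "Toys", "Books", "Books", "Toys"],
    "net_sales": [10.0, 20.0, 5.0, 15.0, 30.0],
})

# Pivot table: one row per region, one column per category, cells are sums
via_pivot = toy.pivot_table(
    values="net_sales", index="region", columns="category", aggfunc="sum"
)

# Same result: group on both keys, sum, then move 'category' into the columns
via_groupby = toy.groupby(["region", "category"])["net_sales"].sum().unstack()

print(via_pivot)
print((via_pivot == via_groupby).all().all())
```

Thinking of a pivot table as "group, aggregate, unstack" makes it easier to predict what each argument does.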


In [27]:
# Multiple aggregation functions in one pivot table
print("=== MULTIPLE AGGREGATION FUNCTIONS ===")

# Multiple aggregations for comprehensive analysis
multi_agg_pivot = df_business.pivot_table(
    values='net_sales',
    index='region',
    columns='product_category',
    aggfunc=['sum', 'mean', 'count'],
    fill_value=0
)

print("Multiple aggregations (sum, mean, count):")
print(multi_agg_pivot.round(2))

# Different values with different aggregations
mixed_agg_pivot = df_business.pivot_table(
    values=['net_sales', 'quantity', 'total_order'],
    index='region',
    columns='sales_channel',
    aggfunc={
        'net_sales': 'sum',
        'quantity': 'sum',
        'total_order': 'mean'
    },
    fill_value=0
)

print("\nMixed aggregations for different metrics:")
print(mixed_agg_pivot.round(2))

=== MULTIPLE AGGREGATION FUNCTIONS ===
Multiple aggregations (sum, mean, count):
                       sum                                        mean  \
product_category     Books  Clothing Electronics Home & Garden   Books   
region                                                                   
East              12376.85  12465.78    16462.72      14139.75  217.14   
North             13491.69  14938.35    15228.66      17769.95  214.15   
South             17589.76  16945.54    14551.58      13695.40  266.51   
West               8156.64  16544.42    12680.58      15651.63  189.69   

                                                    count           \
product_category Clothing Electronics Home & Garden Books Clothing   
region                                                               
East               197.87      238.59        239.66    57       63   
North              226.34      220.71        257.54    63       66   
South              260.70      234.70        236.1

In [28]:
# Pivot tables with totals and margins
print("=== PIVOT TABLES WITH TOTALS ===")

# Add margins (totals) to pivot table
pivot_with_totals = df_business.pivot_table(
    values='net_sales',
    index='region',
    columns='product_category',
    aggfunc='sum',
    fill_value=0,
    margins=True,
    margins_name='Total'
)

print("Pivot table with row and column totals:")
print(pivot_with_totals.round(0))

# Calculate percentages of total
pivot_percentages = df_business.pivot_table(
    values='net_sales',
    index='region',
    columns='product_category',
    aggfunc='sum',
    fill_value=0,
    margins=True,
    margins_name='Total'
)

# Convert to percentages of the grand total (every cell, including the Total row and column, is divided by it)
total_sales = pivot_percentages.loc['Total', 'Total']
pivot_pct = (pivot_percentages / total_sales * 100).round(2)

print("\nSales distribution as percentages:")
print(pivot_pct)

=== PIVOT TABLES WITH TOTALS ===
Pivot table with row and column totals:
product_category    Books  Clothing  Electronics  Home & Garden     Total
region                                                                   
East              12377.0   12466.0      16463.0        14140.0   55445.0
North             13492.0   14938.0      15229.0        17770.0   61429.0
South             17590.0   16946.0      14552.0        13695.0   62782.0
West               8157.0   16544.0      12681.0        15652.0   53033.0
Total             51615.0   60894.0      58924.0        61257.0  232689.0

Sales distribution as percentages:
product_category  Books  Clothing  Electronics  Home & Garden   Total
region                                                               
East               5.32      5.36         7.07           6.08   23.83
North              5.80      6.42         6.54           7.64   26.40
South              7.56      7.28         6.25           5.89   26.98
West               3.51
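
The percentages above are shares of the grand total. If the question is instead "how is each region's revenue split across categories?", each row can be divided by its own total. A minimal sketch on a hypothetical `toy` frame (names are illustrative only):

```python
import pandas as pd

# Tiny hypothetical dataset; names and values are illustrative only
toy = pd.DataFrame({
    "region": ["East", "East", "West", "West", "West"],
    "category": ["Books", "Toys", "Books", "Books", "Toys"],
    "net_sales": [10.0, 20.0, 5.0, 15.0, 30.0],
})

totals = toy.pivot_table(
    values="net_sales", index="region", columns="category",
    aggfunc="sum", fill_value=0
)

# Row percentages: divide each row by its own total (axis=0 aligns on the row index)
row_pct = totals.div(totals.sum(axis=1), axis=0) * 100

# Column percentages: divide each column by its own total
col_pct = totals.div(totals.sum(axis=0), axis=1) * 100

print(row_pct.round(1))
print(col_pct.round(1))
```

Computing the shares from a pivot without margins avoids having to exclude the totals row and column afterwards.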

## 2. Advanced Pivot Table Techniques

Passing lists to `index` or `columns` produces hierarchical (MultiIndex) pivot tables, and `aggfunc` accepts custom functions as well as built-in names.

In [29]:
# Multi-level index pivot tables
print("=== MULTI-LEVEL INDEX PIVOT TABLES ===")

# Hierarchical rows
hierarchical_pivot = df_business.pivot_table(
    values='net_sales',
    index=['region', 'salesperson'],
    columns='product_category',
    aggfunc='sum',
    fill_value=0
)

print("Hierarchical pivot (Region > Salesperson vs Product Category):")
print(hierarchical_pivot.head(20))

# Hierarchical columns
hierarchical_cols_pivot = df_business.pivot_table(
    values='net_sales',
    index='region',
    columns=['product_category', 'customer_type'],
    aggfunc='sum',
    fill_value=0
)

print("\nHierarchical columns (Product Category > Customer Type):")
print(hierarchical_cols_pivot.round(0))

# Both hierarchical rows and columns
full_hierarchical = df_business.pivot_table(
    values='net_sales',
    index=['region', 'quarter'],
    columns=['product_category', 'sales_channel'],
    aggfunc='sum',
    fill_value=0
)

print("\nFull hierarchical pivot (limited sample):")
print(full_hierarchical.iloc[:8, :6].round(0))  # Show subset for readability

=== MULTI-LEVEL INDEX PIVOT TABLES ===
Hierarchical pivot (Region > Salesperson vs Product Category):
product_category          Books   Clothing  Electronics  Home & Garden
region salesperson                                                    
East   Alice Johnson  1537.0605   838.5400    1792.4760      1693.2745
       Bob Smith      2039.4035  1156.4895    1110.9565       950.9340
       Charlie Brown   435.2560   952.0800    2610.3275      1098.6095
       Diana Prince   1874.3990  1845.6785    1898.2705      1610.8440
       Eve Wilson     1585.4275  1534.6530    1395.2180      1179.5475
       Frank Miller   1106.1950  1476.6940     945.8550       988.5600
       Grace Lee      1556.9905  1298.2700     865.6170       938.2240
       Henry Davis     875.1640   725.9365    2336.0330      1322.1800
       Ivy Chen        797.3240  1608.6365    1110.4660      1724.6260
       Jack Robinson   569.6290  1028.8020    2397.5045      2632.9480
North  Alice Johnson  2314.9745  2757.4270    
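
Once a pivot has a hierarchical row index, the usual next step is selecting parts of it. The sketch below shows a few common selections on a small hypothetical frame (the salesperson names and values are made up for illustration).

```python
import pandas as pd

# Tiny hypothetical dataset; names and values are illustrative only
toy = pd.DataFrame({
    "region": ["East", "East", "East", "West", "West"],
    "salesperson": ["Ann", "Ann", "Ben", "Ann", "Cal"],
    "category": ["Books", "Toys", "Books", "Toys", "Books"],
    "net_sales": [10.0, 20.0, 30.0, 40.0, 50.0],
})

pivot = toy.pivot_table(
    values="net_sales", index=["region", "salesperson"],
    columns="category", aggfunc="sum", fill_value=0
)

east = pivot.loc["East"]                        # all rows under the outer level 'East'
ann = pivot.xs("Ann", level="salesperson")      # cross-section on the inner level
cell = pivot.loc[("East", "Ben"), "Books"]      # a single cell via a tuple key
by_region = pivot.groupby(level="region").sum() # collapse the inner level

print(east)
print(ann)
print(cell)
print(by_region)
```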

In [30]:
# Custom aggregation functions
print("=== CUSTOM AGGREGATION FUNCTIONS ===")

# Define custom aggregation functions
def coefficient_of_variation(series):
    """Calculate coefficient of variation (std/mean)"""
    return series.std() / series.mean() if series.mean() != 0 else 0

def sales_range(series):
    """Calculate range (max - min)"""
    return series.max() - series.min()

def high_value_count(series, threshold=100):
    """Count values above threshold"""
    return (series > threshold).sum()

# Apply custom aggregations
custom_agg_pivot = df_business.pivot_table(
    values='net_sales',
    index='region',
    columns='product_category',
    # String names ('mean', 'std') are the form recent pandas versions recommend over np.mean / np.std
    aggfunc=['mean', 'std', coefficient_of_variation, sales_range],
    fill_value=0
)

print("Custom aggregations pivot table:")
print(custom_agg_pivot.round(2))

# Lambda functions for inline custom aggregations
lambda_agg_pivot = df_business.pivot_table(
    values='net_sales',
    index='region',
    columns='sales_channel',
    aggfunc={
        'net_sales': [
            'mean',
            lambda x: x.quantile(0.75),  # 75th percentile
            lambda x: (x > x.mean()).sum(),  # Count above average
            lambda x: x.max() / x.min() if x.min() > 0 else 0  # Max/Min ratio
        ]
    },
    fill_value=0
)

print("\nLambda function aggregations:")
print(lambda_agg_pivot.round(2))

=== CUSTOM AGGREGATION FUNCTIONS ===
Custom aggregations pivot table:
                    mean                                        std           \
product_category   Books Clothing Electronics Home & Garden   Books Clothing   
region                                                                         
East              217.14   197.87      238.59        239.66  133.32   176.45   
North             214.15   226.34      220.71        257.54  144.81   157.82   
South             266.51   260.70      234.70        236.13  153.97   158.86   
West              189.69   239.77      218.63        244.56  125.47   163.81   

                                           coefficient_of_variation           \
product_category Electronics Home & Garden                    Books Clothing   
region                                                                         
East                  154.31        150.89                     0.61     0.89   
North                 165.37        174.64       
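
A custom aggregation that takes an extra argument, such as a threshold, can be used by wrapping it in a small named function that fixes the argument. The following is a sketch on hypothetical data; `count_above_200` and the `toy` frame are illustrative names, not part of the lesson dataset.

```python
import pandas as pd

def high_value_count(series, threshold=100):
    """Count values strictly above the threshold."""
    return (series > threshold).sum()

def count_above_200(series):
    """Wrapper that fixes the threshold so pivot_table can call it with one argument."""
    return high_value_count(series, threshold=200)

# Tiny hypothetical dataset; names and values are illustrative only
toy = pd.DataFrame({
    "region": ["East", "East", "East", "West", "West"],
    "channel": ["Online", "Online", "Store", "Online", "Store"],
    "net_sales": [50.0, 150.0, 250.0, 300.0, 80.0],
})

# Default threshold of 100
default_counts = toy.pivot_table(
    values="net_sales", index="region", columns="channel",
    aggfunc=high_value_count, fill_value=0
)

# Threshold of 200 via the wrapper
custom_counts = toy.pivot_table(
    values="net_sales", index="region", columns="channel",
    aggfunc=count_above_200, fill_value=0
)

print(default_counts)
print(custom_counts)
```

A named wrapper also gives the result a readable column label when several functions are passed as a list, which an anonymous lambda does not.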

In [31]:
# Time-based pivot tables
print("=== TIME-BASED PIVOT TABLES ===")

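# Sales by calendar month and product category
# (the data runs from 2024-01-01 to 2026-09-26, so month_name pools the same month across years)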
monthly_sales = df_business.pivot_table(
    values='net_sales',
    index='month_name',
    columns='product_category',
    aggfunc='sum',
    fill_value=0
)

# Reorder months correctly
month_order = ['January', 'February', 'March', 'April', 'May', 'June',
               'July', 'August', 'September', 'October', 'November', 'December']
monthly_sales = monthly_sales.reindex([m for m in month_order if m in monthly_sales.index])

print("Monthly sales by product category:")
print(monthly_sales.round(0))

# Day of week analysis
dow_analysis = df_business.pivot_table(
    values=['net_sales', 'quantity'],
    index='day_of_week',
    columns='sales_channel',
    aggfunc={
        'net_sales': 'sum',
        'quantity': 'mean'
    },
    fill_value=0
)

# Reorder days of week
day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
dow_analysis = dow_analysis.reindex([d for d in day_order if d in dow_analysis.index])

print("\nDay of week analysis:")
print(dow_analysis.round(2))

=== TIME-BASED PIVOT TABLES ===
Monthly sales by product category:
product_category   Books  Clothing  Electronics  Home & Garden
month_name                                                    
January           4813.0    7034.0       4631.0         3365.0
February          4022.0    8492.0       3016.0         5593.0
March             6809.0    3461.0       5373.0         6898.0
April             5464.0    5000.0       5719.0         5014.0
May               5767.0    4023.0       5243.0         6586.0
June              4104.0    8360.0       7231.0         2890.0
July              5149.0    3183.0       6009.0         4328.0
August            3052.0    6048.0       6709.0         5114.0
September         4366.0    3026.0       4470.0         8162.0
October           3442.0    4719.0       2349.0         4465.0
November          2562.0    3316.0       3646.0         5111.0
December          2064.0    4231.0       4528.0         3729.0

Day of week analysis:
              net_sales     
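
Because the sample dates run past a single year, grouping on `month_name` adds together the same month from different years. One possible alternative is to pivot on a monthly period, which keeps years apart and sorts chronologically without a manual `reindex`. This is a sketch on hypothetical data; the `period` column is an assumption introduced here.

```python
import pandas as pd

# Tiny hypothetical dataset spanning two years; values are illustrative only
sales = pd.DataFrame({
    "date": pd.to_datetime(
        ["2024-11-15", "2024-12-01", "2025-01-10", "2025-01-20", "2025-11-05"]
    ),
    "category": ["Books", "Books", "Toys", "Books", "Toys"],
    "net_sales": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# Year-month period, e.g. 2024-11 and 2025-11 stay distinct
sales["period"] = sales["date"].dt.to_period("M")

by_period = sales.pivot_table(
    values="net_sales", index="period", columns="category",
    aggfunc="sum", fill_value=0
)

print(by_period)
```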

## 3. Cross-tabulation and Contingency Tables

`pd.crosstab` builds a contingency table: by default it counts how often each combination of two (or more) categorical variables occurs.

In [32]:
# Basic cross-tabulation
print("=== CROSS-TABULATION ===")

# Simple crosstab - count of transactions
basic_crosstab = pd.crosstab(df_business['region'], df_business['customer_type'])
print("Transaction count by Region and Customer Type:")
print(basic_crosstab)

# Crosstab with values (not just counts)
sales_crosstab = pd.crosstab(
    df_business['region'],
    df_business['customer_type'],
    values=df_business['net_sales'],
    aggfunc='sum'
)
print("\nTotal sales by Region and Customer Type:")
print(sales_crosstab.round(0))

# Crosstab with margins (totals)
crosstab_with_margins = pd.crosstab(
    df_business['product_category'],
    df_business['sales_channel'],
    margins=True,
    margins_name='Total'
)
print("\nTransaction count by Product Category and Sales Channel (with totals):")
print(crosstab_with_margins)

=== CROSS-TABULATION ===
Transaction count by Region and Customer Type:
customer_type  New  Returning  VIP
region                            
East            60        128   60
North           70        144   53
South           68        138   45
West            61        119   54

Total sales by Region and Customer Type:
customer_type      New  Returning      VIP
region                                    
East           13293.0    29623.0  12529.0
North          16544.0    32016.0  12869.0
South          17201.0    33234.0  12347.0
West           15336.0    26394.0  11304.0

Transaction count by Product Category and Sales Channel (with totals):
sales_channel     Online  Phone  Store  Total
product_category                             
Books                140     15     74    229
Clothing             169     26     68    263
Electronics          166     21     71    258
Home & Garden        142     28     80    250
Total                617     90    293   1000
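
With no `values` argument, `pd.crosstab` simply counts rows per combination, which is why its output matches the earlier count-based pivot table. The sketch below checks this against an explicit `groupby(...).size()` on a small hypothetical frame.

```python
import pandas as pd

# Tiny hypothetical dataset; names and values are illustrative only
toy = pd.DataFrame({
    "region": ["East", "East", "East", "West", "West"],
    "customer_type": ["New", "VIP", "New", "New", "New"],
})

# Contingency table of counts
ct = pd.crosstab(toy["region"], toy["customer_type"])

# Equivalent: count rows per pair, then spread customer_type into columns
counts = toy.groupby(["region", "customer_type"]).size().unstack(fill_value=0)

print(ct)
print((ct.values == counts.values).all())
```

Combinations that never occur (here West/VIP) appear as 0 in both versions.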


In [33]:
# Normalized cross-tabulation (percentages)
print("=== NORMALIZED CROSS-TABULATION ===")

# Normalize by rows (percentage of each row)
crosstab_row_pct = pd.crosstab(
    df_business['region'],
    df_business['customer_type'],
    normalize='index'  # Normalize by rows
) * 100

print("Row percentages (Customer type distribution within each region):")
print(crosstab_row_pct.round(1))

# Normalize by columns (percentage of each column)
crosstab_col_pct = pd.crosstab(
    df_business['region'],
    df_business['customer_type'],
    normalize='columns'  # Normalize by columns
) * 100

print("\nColumn percentages (Region distribution within each customer type):")
print(crosstab_col_pct.round(1))

# Normalize by total (percentage of grand total)
crosstab_total_pct = pd.crosstab(
    df_business['region'],
    df_business['customer_type'],
    normalize='all'  # Normalize by total
) * 100

print("\nTotal percentages (Percentage of overall total):")
print(crosstab_total_pct.round(1))

=== NORMALIZED CROSS-TABULATION ===
Row percentages (Customer type distribution within each region):
customer_type   New  Returning   VIP
region                              
East           24.2       51.6  24.2
North          26.2       53.9  19.9
South          27.1       55.0  17.9
West           26.1       50.9  23.1

Column percentages (Region distribution within each customer type):
customer_type   New  Returning   VIP
region                              
East           23.2       24.2  28.3
North          27.0       27.2  25.0
South          26.3       26.1  21.2
West           23.6       22.5  25.5

Total percentages (Percentage of overall total):
customer_type  New  Returning  VIP
region                            
East           6.0       12.8  6.0
North          7.0       14.4  5.3
South          6.8       13.8  4.5
West           6.1       11.9  5.4
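
A quick way to confirm which direction a normalization runs is to check what sums to 1. The sketch below does this for the three `normalize` options on a small hypothetical frame.

```python
import pandas as pd

# Tiny hypothetical dataset; names and values are illustrative only
toy = pd.DataFrame({
    "region": ["East", "East", "East", "West", "West"],
    "customer_type": ["New", "VIP", "New", "New", "New"],
})

by_row = pd.crosstab(toy["region"], toy["customer_type"], normalize="index")
by_col = pd.crosstab(toy["region"], toy["customer_type"], normalize="columns")
by_all = pd.crosstab(toy["region"], toy["customer_type"], normalize="all")

print(by_row.sum(axis=1))   # each row sums to 1
print(by_col.sum(axis=0))   # each column sums to 1
print(by_all.values.sum())  # the whole table sums to 1
```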


In [34]:
# Multi-dimensional cross-tabulation
print("=== MULTI-DIMENSIONAL CROSS-TABULATION ===")

# Three-way crosstab
three_way_crosstab = pd.crosstab(
    [df_business['region'], df_business['product_category']],
    df_business['customer_type'],
    values=df_business['net_sales'],
    aggfunc='mean'
)

print("Three-way analysis (Region & Product Category vs Customer Type):")
print(three_way_crosstab.round(2))

# Analysis with multiple aggregations
multi_agg_crosstab = pd.crosstab(
    df_business['region'],
    df_business['sales_channel'],
    values=df_business['net_sales'],
    aggfunc=['count', 'sum', 'mean']
)

print("\nMultiple aggregations crosstab:")
print(multi_agg_crosstab.round(2))

=== MULTI-DIMENSIONAL CROSS-TABULATION ===
Three-way analysis (Region & Product Category vs Customer Type):
customer_type               New  Returning     VIP
region product_category                           
East   Books             188.82     228.53  210.82
       Clothing          206.15     221.17  154.18
       Electronics       240.11     224.41  266.92
       Home & Garden     233.78     254.03  215.29
North  Books             254.64     183.41  270.87
       Clothing          211.37     208.73  279.06
       Electronics       233.40     230.79  176.23
       Home & Garden     248.49     266.38  238.33
South  Books             202.21     273.49  311.87
       Clothing          340.66     234.73  198.67
       Electronics       245.96     220.96  250.30
       Home & Garden     198.59     228.10  318.92
West   Books             223.46     193.70  151.62
       Clothing          248.72     231.62  250.01
       Electronics       251.07     211.88  204.71
       Home & Garden     

## 4. Data Reshaping: Melt, Pivot, Stack, Unstack

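`melt` turns wide data into long format, `pivot` does the reverse, and `stack` / `unstack` move a level between the column labels and the row index.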

In [35]:
# Melting data from wide to long format
print("=== MELTING DATA (WIDE TO LONG) ===")

# Create a wide format dataset first
wide_sales = df_business.pivot_table(
    values='net_sales',
    index='salesperson',
    columns='product_category',
    aggfunc='sum',
    fill_value=0
).reset_index()

print("Wide format data:")
print(wide_sales.head())

# Melt to long format
long_sales = pd.melt(
    wide_sales,
    id_vars=['salesperson'],
    var_name='product_category',
    value_name='total_sales'
)

print("\nLong format data (melted):")
print(long_sales.head(10))

# Melt with multiple ID variables
# First create a more complex wide dataset
complex_wide = df_business.groupby(['region', 'salesperson', 'product_category'])['net_sales'].sum().unstack(fill_value=0).reset_index()

print("\nComplex wide format:")
print(complex_wide.head())

# Melt with multiple ID vars
complex_long = pd.melt(
    complex_wide,
    id_vars=['region', 'salesperson'],
    var_name='product_category',
    value_name='total_sales'
)

print("\nComplex long format:")
print(complex_long.head(10))

=== MELTING DATA (WIDE TO LONG) ===
Wide format data:
product_category    salesperson      Books   Clothing  Electronics  \
0                 Alice Johnson  7260.4360  7069.2950    8358.1560   
1                     Bob Smith  4511.1555  4336.2970    3679.3625   
2                 Charlie Brown  3677.6895  7830.9720   10264.1605   
3                  Diana Prince  4861.2975  6275.8940    3348.5570   
4                    Eve Wilson  6363.1345  7208.5765    5191.2705   

product_category  Home & Garden  
0                     5283.4600  
1                     5164.6310  
2                     4281.9240  
3                     6108.1495  
4                     6582.1870  

Long format data (melted):
     salesperson product_category  total_sales
0  Alice Johnson            Books    7260.4360
1      Bob Smith            Books    4511.1555
2  Charlie Brown            Books    3677.6895
3   Diana Prince            Books    4861.2975
4     Eve Wilson            Books    6363.1345
5   Frank M
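
Melting is reversible when each identifier/variable pair occurs once. The round trip below, on a hypothetical two-row frame, goes wide to long with `melt` and back to wide with `pivot`.

```python
import pandas as pd

# Tiny hypothetical wide table; names and values are illustrative only
wide = pd.DataFrame({
    "salesperson": ["Ann", "Ben"],
    "Books": [10.0, 30.0],
    "Toys": [20.0, 0.0],
})

# Wide -> long: one row per (salesperson, category) pair
long = wide.melt(
    id_vars="salesperson", var_name="category", value_name="total_sales"
)

# Long -> wide again; reset_index and rename_axis restore plain columns
back = (
    long.pivot(index="salesperson", columns="category", values="total_sales")
    .reset_index()
    .rename_axis(None, axis=1)
)

print(long)
print(back)
```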

In [36]:
# Pivot operation (long to wide)
print("=== PIVOT OPERATION (LONG TO WIDE) ===")

# Create long format data
long_data = df_business[['region', 'product_category', 'customer_type', 'net_sales']].copy()
print("Long format data sample:")
print(long_data.head())

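# Pivot with aggregation: region/category pairs repeat in the long data, so pivot_table
# is used here (DataFrame.pivot requires each index/column pair to be unique)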
pivoted_data = long_data.pivot_table(
    values='net_sales',
    index='region',
    columns='product_category',
    aggfunc='sum'
)

print("\nPivoted to wide format:")
print(pivoted_data.round(0))

# Reset index to make it a regular DataFrame
pivoted_reset = pivoted_data.reset_index()
print("\nPivoted with reset index:")
print(pivoted_reset.head())

=== PIVOT OPERATION (LONG TO WIDE) ===
Long format data sample:
  region product_category customer_type  net_sales
0  North         Clothing     Returning     72.810
1   East         Clothing     Returning    764.847
2   West            Books           VIP     27.660
3   West         Clothing           New    252.080
4   East         Clothing           VIP    220.014

Pivoted to wide format:
product_category    Books  Clothing  Electronics  Home & Garden
region                                                         
East              12377.0   12466.0      16463.0        14140.0
North             13492.0   14938.0      15229.0        17770.0
South             17590.0   16946.0      14552.0        13695.0
West               8157.0   16544.0      12681.0        15652.0

Pivoted with reset index:
product_category region       Books    Clothing  Electronics  Home & Garden
0                  East  12376.8490  12465.7800   16462.7240     14139.7475
1                 North  13491.6935  14938
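
The difference between `pivot` and `pivot_table` shows up as soon as an index/column pair repeats. In the sketch below (hypothetical data), `pivot` raises a `ValueError` while `pivot_table` aggregates the duplicates.

```python
import pandas as pd

# Tiny hypothetical long table with a repeated (East, Books) pair
long = pd.DataFrame({
    "region": ["East", "East", "West"],
    "category": ["Books", "Books", "Toys"],
    "net_sales": [10.0, 5.0, 7.0],
})

# pivot only reshapes, so duplicate pairs are an error
try:
    long.pivot(index="region", columns="category", values="net_sales")
    pivot_failed = False
except ValueError as err:
    pivot_failed = True
    print(f"pivot failed: {err}")

# pivot_table aggregates the duplicates instead
summed = long.pivot_table(
    index="region", columns="category", values="net_sales", aggfunc="sum"
)
print(summed)
```

Combinations with no data (here East/Toys and West/Books) are left as NaN unless `fill_value` is given.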

In [37]:
# Stack and Unstack operations
print("=== STACK AND UNSTACK OPERATIONS ===")

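# Create a DataFrame with a MultiIndex. The data is already aggregated to one row per
# group, so pivot_table's default aggfunc ('mean') leaves the sums unchanged.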
multi_index_df = df_business.groupby(['region', 'product_category', 'customer_type'])['net_sales'].sum().reset_index()
multi_pivot = multi_index_df.pivot_table(
    values='net_sales',
    index=['region', 'product_category'],
    columns='customer_type',
    fill_value=0
)

print("Multi-index DataFrame:")
print(multi_pivot.head(10))

# Stack operation (columns to rows)
stacked = multi_pivot.stack()
print("\nAfter stacking (columns become rows):")
print(stacked.head(10))

# Unstack operation (rows to columns)
unstacked = stacked.unstack()
print("\nAfter unstacking (back to original):")
print(unstacked.head(10))

# Unstack different levels
unstacked_level0 = multi_pivot.stack().unstack(level=0)
print("\nUnstacking level 0 (region):")
print(unstacked_level0.head())

=== STACK AND UNSTACK OPERATIONS ===
Multi-index DataFrame:
customer_type                  New   Returning        VIP
region product_category                                  
East   Books             2077.0530   7769.9840  2529.8120
       Clothing          2679.9640   6856.4155  2929.4005
       Electronics       4562.0515   7629.9035  4270.7690
       Home & Garden     3974.1930   7366.8250  2798.7295
North  Books             2546.4320   7153.0835  3792.1780
       Clothing          2959.1455   7514.3230  4464.8800
       Electronics       6068.4815   6692.9180  2467.2650
       Home & Garden     4969.7755  10655.2095  2144.9645
South  Books             2830.9150  10392.7060  4366.1430
       Clothing          6472.4515   8685.0290  1788.0630

After stacking (columns become rows):
region  product_category  customer_type
East    Books             New              2077.0530
                          Returning        7769.9840
                          VIP              2529.8120
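
`unstack` can target a level by name rather than by position, and `fill_value` controls what appears for combinations that are missing. A minimal sketch on a hypothetical Series with a two-level index:

```python
import pandas as pd

# Tiny hypothetical dataset; West has no Toys row
toy = pd.DataFrame({
    "region": ["East", "East", "West"],
    "category": ["Books", "Toys", "Books"],
    "net_sales": [10.0, 20.0, 30.0],
})

s = toy.set_index(["region", "category"])["net_sales"]

# Move 'category' into the columns; the missing West/Toys cell becomes 0
wide_by_category = s.unstack(level="category", fill_value=0)

# Move 'region' into the columns instead
wide_by_region = s.unstack(level="region", fill_value=0)

# stack reverses the operation, turning the columns back into an index level
long_again = wide_by_category.stack()

print(wide_by_category)
print(wide_by_region)
print(long_again)
```

Referring to levels by name keeps the code readable when the index order changes.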
      

## 5. Advanced Reshaping Techniques

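This section combines the previous tools: melting several metric columns at once, pivoting time series, and reshaping data with more than two index levels.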

In [38]:
# Wide to long with multiple value columns
print("=== MULTIPLE VALUE COLUMNS MELTING ===")

# Create dataset with multiple metrics
metrics_wide = df_business.groupby(['region', 'product_category']).agg({
    'net_sales': 'sum',
    'quantity': 'sum',
    'total_order': 'mean'
}).reset_index()

print("Wide format with multiple metrics:")
print(metrics_wide.head())

# Melt multiple value columns
metrics_long = pd.melt(
    metrics_wide,
    id_vars=['region', 'product_category'],
    value_vars=['net_sales', 'quantity', 'total_order'],
    var_name='metric',
    value_name='value'
)

print("\nLong format with multiple metrics:")
print(metrics_long.head(15))

# Alternative: melt and then pivot for different structure
metrics_pivot = metrics_long.pivot_table(
    values='value',
    index=['region', 'metric'],
    columns='product_category',
    fill_value=0
)

print("\nReshaping: Region-Metric vs Product Category:")
print(metrics_pivot.round(2))

=== MULTIPLE VALUE COLUMNS MELTING ===
Wide format with multiple metrics:
  region product_category   net_sales  quantity  total_order
0   East            Books  12376.8490       277   225.021912
1   East         Clothing  12465.7800       266   205.978571
2   East      Electronics  16462.7240       349   245.975420
3   East    Home & Garden  14139.7475       301   247.297246
4  North            Books  13491.6935       296   221.647357

Long format with multiple metrics:
   region product_category     metric       value
0    East            Books  net_sales  12376.8490
1    East         Clothing  net_sales  12465.7800
2    East      Electronics  net_sales  16462.7240
3    East    Home & Garden  net_sales  14139.7475
4   North            Books  net_sales  13491.6935
5   North         Clothing  net_sales  14938.3485
6   North      Electronics  net_sales  15228.6645
7   North    Home & Garden  net_sales  17769.9495
8   South            Books  net_sales  17589.7640
9   South         Clothi

In [39]:
# Creating time series pivot tables
print("=== TIME SERIES PIVOT TABLES ===")

# Daily sales by product category
daily_sales = df_business.groupby(['date', 'product_category'])['net_sales'].sum().reset_index()
daily_pivot = daily_sales.pivot(index='date', columns='product_category', values='net_sales')

print("Daily sales pivot (first 10 days):")
print(daily_pivot.head(10).round(0))

# Fill missing values and calculate rolling averages
daily_pivot_filled = daily_pivot.fillna(0)
rolling_avg = daily_pivot_filled.rolling(window=7).mean()

print("\n7-day rolling average (sample):")
print(rolling_avg.tail(5).round(2))

# Monthly totals by product category (a new variable name avoids overwriting
# monthly_sales from the earlier time-based cell)
monthly_totals = df_business.groupby(['year', 'month', 'product_category'])['net_sales'].sum().reset_index()
monthly_totals['period'] = monthly_totals['year'].astype(str) + '-' + monthly_totals['month'].astype(str).str.zfill(2)
monthly_pivot = monthly_totals.pivot(index='period', columns='product_category', values='net_sales')

print("\nMonthly sales:")
print(monthly_pivot.round(0))

# Calculate month-over-month growth
mom_growth = monthly_pivot.pct_change() * 100
print("\nMonth-over-month growth (%):")
print(mom_growth.round(1))

=== TIME SERIES PIVOT TABLES ===
Daily sales pivot (first 10 days):
product_category  Books  Clothing  Electronics  Home & Garden
date                                                         
2024-01-01          NaN      73.0          NaN            NaN
2024-01-02          NaN     765.0          NaN            NaN
2024-01-03         28.0       NaN          NaN            NaN
2024-01-04          NaN     252.0          NaN            NaN
2024-01-05          NaN     220.0          NaN            NaN
2024-01-06        112.0       NaN          NaN            NaN
2024-01-07          NaN       NaN          NaN          172.0
2024-01-08          NaN       NaN        201.0            NaN
2024-01-09          NaN       NaN          NaN          153.0
2024-01-10         24.0       NaN          NaN            NaN

7-day rolling average (sample):
product_category  Books  Clothing  Electronics  Home & Garden
date                                                         
2026-09-22        32.15       0
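
The monthly key above is built by concatenating strings. As a minimal sketch of an alternative, assuming a small made-up frame (`toy` and its values are illustrative and not taken from `df_business`), `dt.to_period('M')` gives a period key that sorts chronologically, and `pivot_table()` aggregates in one step without a prior `groupby`:

```python
import pandas as pd

# Hypothetical toy data, not the lesson's df_business
toy = pd.DataFrame({
    'date': pd.to_datetime(['2024-01-05', '2024-01-20', '2024-02-03',
                            '2024-02-10', '2024-03-15', '2024-03-16']),
    'product_category': ['Books', 'Clothing', 'Books',
                         'Books', 'Clothing', 'Books'],
    'net_sales': [100.0, 50.0, 120.0, 30.0, 75.0, 90.0],
})

# Monthly period key: sorts chronologically without zero-padded strings
toy['period'] = toy['date'].dt.to_period('M')

# pivot_table aggregates duplicate (period, category) rows itself
monthly = toy.pivot_table(values='net_sales', index='period',
                          columns='product_category', aggfunc='sum',
                          fill_value=0)

# Month-over-month growth in percent; growth from a zero month is infinite
mom = monthly.pct_change() * 100

# Rolling mean that accepts a partial window at the start
rolling = monthly.rolling(window=2, min_periods=1).mean()

print(monthly)
print(mom.round(1))
print(rolling)
```

Note the trade-off in `fill_value=0`: it keeps the table free of NaN, but any month that follows a zero month shows infinite growth, so it may be preferable to leave gaps as NaN before calling `pct_change()`.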

In [40]:
# Complex multi-level reshaping
print("=== COMPLEX MULTI-LEVEL RESHAPING ===")

# Create complex hierarchical data
complex_data = df_business.groupby(['region', 'salesperson', 'quarter', 'product_category']).agg({
    'net_sales': 'sum',
    'quantity': 'sum'
}).reset_index()

print("Complex hierarchical data:")
print(complex_data.head(10))

# Multiple pivot operations
# First pivot: one column per product category, one row per region/salesperson/quarter
salesperson_pivot = complex_data.pivot_table(
    values='net_sales',
    index=['region', 'salesperson', 'quarter'],
    columns='product_category',
    aggfunc='sum',  # explicit; the default is 'mean' (same result here, one row per cell)
    fill_value=0
)

print("\nSalesperson performance by quarter and category:")
print(salesperson_pivot.head(15))

# Stack and unstack for different views
# Unstack quarter (index level 2) to see quarterly comparison
quarterly_comparison = salesperson_pivot.unstack(level='quarter')

print("\nQuarterly comparison view (sample):")
print(quarterly_comparison.iloc[:5, :8].round(0))  # Show subset

# Create summary by region
region_summary = salesperson_pivot.groupby('region').sum()
print("\nRegion summary:")
print(region_summary.round(0))

=== COMPLEX MULTI-LEVEL RESHAPING ===
Complex hierarchical data:
  region    salesperson  quarter product_category  net_sales  quantity
0   East  Alice Johnson        1            Books   710.9865        14
1   East  Alice Johnson        1    Home & Garden   428.9600         8
2   East  Alice Johnson        2            Books   240.5520         5
3   East  Alice Johnson        2         Clothing   436.8720         7
4   East  Alice Johnson        2      Electronics   750.9260        20
5   East  Alice Johnson        2    Home & Garden    96.1745         2
6   East  Alice Johnson        3            Books   223.0520         3
7   East  Alice Johnson        3         Clothing   128.8600         2
8   East  Alice Johnson        3      Electronics   567.8100         9
9   East  Alice Johnson        3    Home & Garden   655.3200        12

Salesperson performance by quarter and category:
product_category                  Books   Clothing  Electronics  Home & Garden
region salesperson   quar
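
To make the `unstack()` step above easier to follow, here is a minimal sketch on a hypothetical four-row frame (`toy` is made up for illustration). It shows a level moving from the row index to the columns, how `fill_value` covers combinations that do not exist, and how `stack()` goes the other way:

```python
import pandas as pd

# Hypothetical toy data; West has no quarter-2 rows on purpose
toy = pd.DataFrame({
    'region': ['East', 'East', 'East', 'West'],
    'quarter': [1, 1, 2, 1],
    'product_category': ['Books', 'Clothing', 'Books', 'Books'],
    'net_sales': [10.0, 20.0, 30.0, 40.0],
})

wide = toy.pivot_table(values='net_sales', index=['region', 'quarter'],
                       columns='product_category', aggfunc='sum',
                       fill_value=0)

# unstack: 'quarter' leaves the row index and becomes an inner column level.
# (West, 2) does not exist in the rows, so fill_value supplies 0 there.
by_quarter = wide.unstack(level='quarter', fill_value=0)

# stack: 'product_category' leaves the columns and joins the row index
long = wide.stack()

# Unstacking the same level again restores the original layout
roundtrip = long.unstack('product_category')

print(by_quarter)
print(long)
```

Referring to levels by name (`level='quarter'`) rather than by position tends to survive later changes to the index order.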

## 6. Business Intelligence Applications

Real-world business analysis using pivot tables and reshaping.

In [41]:
# Sales performance dashboard
print("=== SALES PERFORMANCE DASHBOARD ===")

def create_sales_dashboard(df):
    """Create comprehensive sales dashboard using pivot tables"""
    dashboard = {}
    
    # 1. Regional performance summary
    dashboard['regional_summary'] = df.pivot_table(
        values=['net_sales', 'quantity', 'total_order'],
        index='region',
        aggfunc={
            'net_sales': ['sum', 'mean'],
            'quantity': 'sum',
            'total_order': 'count'
        }
    ).round(2)
    
    # 2. Product category performance
    dashboard['category_performance'] = df.pivot_table(
        values='net_sales',
        index='product_category',
        columns='quarter',
        aggfunc='sum',
        margins=True,
        fill_value=0
    ).round(0)
    
    # 3. Sales channel analysis
    dashboard['channel_analysis'] = df.pivot_table(
        values='net_sales',
        index='sales_channel',
        columns='customer_type',
        aggfunc=['sum', 'mean'],
        fill_value=0
    ).round(2)
    
    # 4. Top performers
    salesperson_performance = df.groupby('salesperson').agg({
        'net_sales': 'sum',
        'quantity': 'sum',
        'total_order': 'count'
    }).round(2)
    dashboard['top_performers'] = salesperson_performance.sort_values('net_sales', ascending=False).head(5)
    
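    # 5. Monthly trends
    # Note: rows are sorted alphabetically by month_name, not in calendar order
    # (unless month_name is an ordered categorical)
    dashboard['monthly_trends'] = df.pivot_table(
        values='net_sales',
        index='month_name',
        columns='product_category',
        aggfunc='sum',
        fill_value=0
    ).round(0)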
    
    return dashboard

# Generate dashboard
sales_dashboard = create_sales_dashboard(df_business)

print("1. Regional Performance Summary:")
print(sales_dashboard['regional_summary'])

print("\n2. Category Performance by Quarter:")
print(sales_dashboard['category_performance'])

print("\n3. Sales Channel Analysis:")
print(sales_dashboard['channel_analysis'])

print("\n4. Top 5 Performers:")
print(sales_dashboard['top_performers'])

=== SALES PERFORMANCE DASHBOARD ===
1. Regional Performance Summary:
       net_sales           quantity total_order
            mean       sum      sum       count
region                                         
East      223.57  55445.10     1193         248
North     230.07  61428.66     1291         267
South     250.13  62782.29     1275         251
West      226.64  53033.27     1152         234

2. Category Performance by Quarter:
quarter                 1        2        3        4       All
product_category                                              
Books             15644.0  15334.0  12567.0   8069.0   51615.0
Clothing          18988.0  17384.0  12257.0  12266.0   60894.0
Electronics       13021.0  18192.0  17188.0  10523.0   58924.0
Home & Garden     15857.0  14490.0  17604.0  13305.0   61257.0
All               63509.0  65400.0  59617.0  44163.0  232689.0

3. Sales Channel Analysis:
                    sum                        mean                  
customer_type      
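
The `monthly_trends` table in the dashboard is indexed by `month_name`, and `pivot_table()` sorts string labels alphabetically. The sketch below, on hypothetical data and assuming `month_name` holds full English month names, shows one way to restore calendar order with `reindex()`:

```python
import calendar
import pandas as pd

# Hypothetical toy data; month_name is assumed to hold full English names
toy = pd.DataFrame({
    'date': pd.to_datetime(['2024-01-10', '2024-02-10',
                            '2024-04-10', '2024-04-11']),
    'net_sales': [10.0, 20.0, 30.0, 5.0],
})
toy['month_name'] = toy['date'].dt.month_name()

# String index labels come back in alphabetical order
alphabetical = toy.pivot_table(values='net_sales', index='month_name',
                               aggfunc='sum')

# calendar.month_name[0] is an empty string, so skip it
month_order = list(calendar.month_name)[1:]

# Keep only months that occur, in calendar order
present = [m for m in month_order if m in alphabetical.index]
chronological = alphabetical.reindex(present)

print(alphabetical)
print(chronological)
```

An ordered `pd.Categorical` on the `month_name` column would achieve the same ordering before pivoting; `reindex()` is shown here because it leaves the source column untouched.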

In [42]:
# Customer segmentation analysis
print("=== CUSTOMER SEGMENTATION ANALYSIS ===")

# Customer behavior analysis
customer_behavior = df_business.pivot_table(
    values=['net_sales', 'quantity', 'discount_amount'],
    index='customer_type',
    columns='sales_channel',
    aggfunc={
        'net_sales': 'mean',
        'quantity': 'mean',
        'discount_amount': 'mean'
    },
    fill_value=0
)

print("Customer behavior by type and channel:")
print(customer_behavior.round(2))

# Purchase patterns analysis
purchase_patterns = pd.crosstab(
    [df_business['customer_type'], df_business['product_category']],
    df_business['quarter'],
    values=df_business['net_sales'],
    aggfunc='sum',
    normalize='columns'
) * 100

print("\nPurchase patterns (% of quarterly sales):")
print(purchase_patterns.round(1))

# Customer value distribution
value_distribution = df_business.groupby('customer_type')['total_order'].describe()
print("\nCustomer value distribution:")
print(value_distribution.round(2))

=== CUSTOMER SEGMENTATION ANALYSIS ===
Customer behavior by type and channel:
              discount_amount               net_sales                  \
sales_channel          Online  Phone  Store    Online   Phone   Store   
customer_type                                                           
New                     11.73  15.24  10.61    255.16  236.46  214.13   
Returning               11.45   9.43  11.81    238.01  212.45  216.42   
VIP                     11.36  15.36  15.59    226.44  229.70  244.02   

              quantity              
sales_channel   Online Phone Store  
customer_type                       
New               5.29  4.78  4.47  
Returning         4.88  5.14  4.62  
VIP               5.03  4.44  5.21  

Purchase patterns (% of quarterly sales):
quarter                            1     2     3     4
customer_type product_category                        
New           Books              6.2   4.4   3.1   4.3
              Clothing           6.9   7.2   5.7  10.
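
The `normalize` argument decides what each percentage is a share of. A minimal sketch on hypothetical data (`toy` is illustrative only) contrasts `normalize='columns'`, as used above, with `normalize='index'`:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'customer_type': ['New', 'New', 'VIP', 'VIP', 'New'],
    'quarter': [1, 2, 1, 2, 2],
    'net_sales': [100.0, 50.0, 300.0, 150.0, 100.0],
})

# Share of each quarter's sales by customer type: every column sums to 100
by_quarter = pd.crosstab(toy['customer_type'], toy['quarter'],
                         values=toy['net_sales'], aggfunc='sum',
                         normalize='columns') * 100

# Share of each customer type's sales by quarter: every row sums to 100
by_type = pd.crosstab(toy['customer_type'], toy['quarter'],
                      values=toy['net_sales'], aggfunc='sum',
                      normalize='index') * 100

print(by_quarter.round(1))
print(by_type.round(1))
```

In this toy data, New customers account for 25% of quarter-1 sales (`by_quarter`), while quarter 1 accounts for 40% of New-customer sales (`by_type`).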

In [43]:
# Cohort analysis using pivot tables
print("=== COHORT ANALYSIS ===")

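# Simplified cohort analysis by quarter
# The dataset has no customer ID, so each salesperson stands in for a "customer":
# the cohort is the earliest quarter in which that salesperson recorded a sale.
# Caveat: 'quarter' is the calendar quarter (1-4) and is not year-aware, so with
# multi-year data every salesperson lands in cohort 1 (see the single row below).
customer_first_purchase = df_business.groupby('salesperson')['quarter'].min().reset_index()
customer_first_purchase.columns = ['salesperson', 'first_quarter']

# Merge back to get cohort information
cohort_data = df_business.merge(customer_first_purchase, on='salesperson')
cohort_data['periods_since_first'] = cohort_data['quarter'] - cohort_data['first_quarter']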

# Create cohort table
cohort_table = cohort_data.pivot_table(
    values='net_sales',
    index='first_quarter',
    columns='periods_since_first',
    aggfunc='mean',
    fill_value=0
)

print("Cohort analysis (average sales by quarters since first purchase):")
print(cohort_table.round(0))

# Activity analysis
# aggfunc='count' tallies transaction rows, not distinct salespeople,
# so the ratios below can exceed 100% and are not true retention rates
cohort_counts = cohort_data.pivot_table(
    values='salesperson',
    index='first_quarter',
    columns='periods_since_first',
    aggfunc='count',
    fill_value=0
)

print("\nCohort activity (number of transactions):")
print(cohort_counts)

# Express each period's transaction count relative to period 0
cohort_retention = cohort_counts.divide(cohort_counts.iloc[:, 0], axis=0) * 100
print("\nActivity relative to period 0 (%):")
print(cohort_retention.round(1))

=== COHORT ANALYSIS ===
Cohort analysis (average sales by quarters since first purchase):
periods_since_first      0      1      2      3
first_quarter                                  
1                    234.0  240.0  219.0  240.0

Cohort activity (number of transactions):
periods_since_first    0    1    2    3
first_quarter                          
1                    271  273  272  184

Activity relative to period 0 (%):
periods_since_first      0      1      2     3
first_quarter                                 
1                    100.0  100.7  100.4  67.9
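
For comparison, here is a minimal sketch of a retention table that counts distinct entities and uses year-aware quarters. It assumes a hypothetical `customer_id` column, which `df_business` does not have, and a made-up `orders` frame; `quarter_index` is an illustrative helper, not a pandas function.

```python
import pandas as pd

# Hypothetical orders; customer_id is an assumed column for illustration
orders = pd.DataFrame({
    'customer_id': ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd'],
    'date': pd.to_datetime(['2024-01-15', '2024-02-10', '2024-05-01',
                            '2024-02-20', '2024-08-05', '2025-01-10',
                            '2024-04-03', '2024-07-19',
                            '2024-05-30']),
})

def quarter_index(dates):
    """Year-aware quarter counter, so Q1 2025 follows Q4 2024."""
    return dates.dt.year * 4 + (dates.dt.quarter - 1)

# First purchase date per customer, broadcast back to every row
orders['first_date'] = orders.groupby('customer_id')['date'].transform('min')

# Cohort label such as '2024Q1', and whole quarters elapsed since first purchase
orders['cohort'] = orders['first_date'].dt.to_period('Q').astype(str)
orders['periods_since_first'] = (quarter_index(orders['date'])
                                 - quarter_index(orders['first_date']))

# nunique counts each customer once per period, however many orders they placed
active = orders.pivot_table(values='customer_id', index='cohort',
                            columns='periods_since_first',
                            aggfunc='nunique', fill_value=0)

# Share of each cohort still active, relative to period 0
retention = active.divide(active[0], axis=0) * 100

print(active)
print(retention.round(1))
```

Because each customer is counted at most once per period, these rates cannot exceed 100%, and a purchase in Q1 2025 by a 2024Q1 customer shows up at period 4 rather than period 0.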


## Practice Exercises

Apply pivot tables and reshaping to complex business scenarios:

In [44]:
# Exercise 1: Multi-dimensional Business Intelligence Report
# Create a comprehensive BI report that includes:
# - Sales performance across multiple dimensions
# - Trend analysis with period-over-period comparisons
# - Customer segmentation insights
# - Product performance matrix
# - Actionable recommendations based on pivot table insights

def create_comprehensive_bi_report(df):
    """Create multi-dimensional business intelligence report"""
    # Your implementation here
    pass

# bi_report = create_comprehensive_bi_report(df_business)
# print("Comprehensive BI Report:")
# for section, data in bi_report.items():
#     print(f"\n{section}:")
#     print(data)

In [45]:
# Exercise 2: Dynamic Pivot Table Generator
# Create a flexible function that can generate pivot tables with:
# - User-specified dimensions (rows, columns, values)
# - Multiple aggregation functions
# - Automatic handling of different data types
# - Export capabilities for different formats

# Your code here:


In [46]:
# Exercise 3: Advanced Reshaping Challenge
# Transform the data through multiple reshaping operations:
# - Convert to time series format
# - Create rolling calculations
# - Build comparison matrices
# - Generate variance analysis reports

# Your code here:


## Key Takeaways

1. **Pivot Tables**:
   - **`pivot_table()`**: Most flexible, handles duplicates with aggregation
   - **`pivot()`**: Simple reshaping, requires unique index-column combinations
   - **`crosstab()`**: Specialized for frequency tables and cross-tabulation

2. **Reshaping Operations**:
   - **`melt()`**: Wide to long format (unpivot)
   - **`pivot()`**: Long to wide format
   - **`stack()`**: Moves column labels into the row index (wide to long)
   - **`unstack()`**: Moves a row-index level into the columns (long to wide)

3. **Best Practices**:
   - Use `fill_value=0` for missing combinations when zero is meaningful (sums, counts); keep NaN for averages
   - Add `margins=True` for totals when needed
   - Choose appropriate aggregation functions for your data
   - Consider data types when reshaping

4. **Business Applications**:
   - Sales performance analysis
   - Customer segmentation
   - Trend analysis and forecasting
   - Cohort and retention analysis
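
The difference between the three pivot tools in point 1 shows up as soon as the data contains duplicate index/column pairs. A minimal sketch on hypothetical data (`toy` is illustrative only):

```python
import pandas as pd

# Hypothetical toy data with a duplicate (North, Laptop) pair
toy = pd.DataFrame({
    'region': ['North', 'North', 'South'],
    'product': ['Laptop', 'Laptop', 'Phone'],
    'sales': [100, 150, 200],
})

# pivot() only reshapes, so duplicate pairs raise a ValueError
try:
    toy.pivot(index='region', columns='product', values='sales')
    pivot_failed = False
except ValueError as exc:
    pivot_failed = True
    print(f"pivot() refused duplicates: {exc}")

# pivot_table() aggregates the duplicates instead
table = toy.pivot_table(values='sales', index='region', columns='product',
                        aggfunc='sum', fill_value=0)

# crosstab() without values counts occurrences of each pair
counts = pd.crosstab(toy['region'], toy['product'])

print(table)
print(counts)
```

Here `pivot_table()` reports 250 for North/Laptop, while `crosstab()` reports 2, the number of rows behind that total.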

## Pivot Table Quick Reference

```python
# Basic pivot table
df.pivot_table(values='sales', index='region', columns='product', aggfunc='sum')

# Multiple aggregations
df.pivot_table(values='sales', index='region', aggfunc=['sum', 'mean', 'count'])

# With margins (totals)
df.pivot_table(values='sales', index='region', columns='product', 
               aggfunc='sum', margins=True)

# Cross-tabulation
pd.crosstab(df['region'], df['product'], normalize='index')

# Reshaping
pd.melt(df, id_vars=['id'], value_vars=['col1', 'col2'])  # Wide to long
df.pivot(index='date', columns='category', values='value')  # Long to wide
```
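
The last two lines of the quick reference are inverses of each other. A minimal sketch on a hypothetical two-row frame shows the round trip; `var_name` and `value_name` are optional `melt()` arguments that name the two new columns:

```python
import pandas as pd

# Hypothetical wide table: one column per quarter
wide = pd.DataFrame({
    'region': ['North', 'South'],
    'Q1': [100, 80],
    'Q2': [120, 90],
})

# Wide to long: one row per (region, quarter) pair
long = pd.melt(wide, id_vars=['region'], value_vars=['Q1', 'Q2'],
               var_name='quarter', value_name='sales')

# Long to wide: pivot back, then flatten the index and drop the columns name
back = (long.pivot(index='region', columns='quarter', values='sales')
            .reset_index()
            .rename_axis(columns=None))

print(long)
print(back)
```

The round trip works here because every `(region, quarter)` pair is unique; with duplicates, `pivot_table()` with an explicit `aggfunc` would be needed instead of `pivot()`.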

## Common Use Cases

| Scenario | Best Tool | Key Parameters |
|----------|-----------|----------------|
| Sales by region/product | `pivot_table()` | `values='sales', index='region', columns='product'` |
| Frequency analysis | `crosstab()` | `normalize='index'` for percentages |
| Time series analysis | `groupby()` + `pivot()` | Dates in `index`, one column per category |
| Wide-to-long tidying | `melt()` | `id_vars` for identifier columns |
| Multi-level analysis | Hierarchical indexing | Multiple columns in `index`/`columns` |