# Session 1 - DataFrames - Lesson 4: Grouping and Aggregation

## Learning Objectives
- Master the `.groupby()` operation for data aggregation
- Learn different aggregation functions and methods
- Understand multi-level grouping and hierarchical indexing
- Practice custom aggregation functions
- Explore advanced grouping techniques

## Prerequisites
- Completed Lessons 1-3
- Understanding of basic statistical concepts (mean, sum, count, etc.)

In [22]:
# Import required libraries
import pandas as pd
import numpy as np
from datetime import datetime, timedelta

# Create comprehensive sample dataset
np.random.seed(42)
n_records = 200

sales_data = {
    'Date': pd.date_range('2024-01-01', periods=n_records, freq='D'),
    'Product': np.random.choice(['Laptop', 'Phone', 'Tablet', 'Monitor', 'Headphones'], n_records),
    'Category': np.random.choice(['Electronics', 'Accessories'], n_records, p=[0.8, 0.2]),
    'Sales': np.random.normal(1000, 300, n_records).astype(int),
    'Quantity': np.random.randint(1, 10, n_records),
    'Region': np.random.choice(['North', 'South', 'East', 'West'], n_records),
    'Salesperson': np.random.choice(['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank'], n_records),
    'Commission_Rate': np.random.choice([0.08, 0.10, 0.12, 0.15], n_records)
}

df_sales = pd.DataFrame(sales_data)
df_sales['Sales'] = np.abs(df_sales['Sales'])  # Ensure non-negative values (abs() can still leave 0)
df_sales['Commission'] = df_sales['Sales'] * df_sales['Commission_Rate']
df_sales['Month'] = df_sales['Date'].dt.month
df_sales['Quarter'] = df_sales['Date'].dt.quarter

print("Dataset created:")
print(f"Shape: {df_sales.shape}")
print("\nFirst few rows:")
print(df_sales.head())

Dataset created:
Shape: (200, 11)

First few rows:
        Date     Product     Category  Sales  Quantity Region Salesperson  \
0 2024-01-01     Monitor  Accessories   1068         6   West       Diana   
1 2024-01-02  Headphones  Electronics    918         1   East       Alice   
2 2024-01-03      Tablet  Accessories   1133         5  North       Diana   
3 2024-01-04  Headphones  Electronics   1340         9   West         Bob   
4 2024-01-05  Headphones  Electronics   1150         2  North         Eve   

   Commission_Rate  Commission  Month  Quarter  
0             0.15      160.20      1        1  
1             0.12      110.16      1        1  
2             0.08       90.64      1        1  
3             0.08      107.20      1        1  
4             0.12      138.00      1        1  


## 1. Basic GroupBy Operations

Understanding the fundamentals of grouping data.

In [23]:
# Simple groupby with single aggregation
print("Total sales by product:")
product_sales = df_sales.groupby('Product')['Sales'].sum()
print(product_sales)
print(f"\nType: {type(product_sales)}")

print("\nAverage sales by region:")
region_avg = df_sales.groupby('Region')['Sales'].mean().round(2)
print(region_avg)

Total sales by product:
Product
Headphones    36032
Laptop        45296
Monitor       47419
Phone         36847
Tablet        34711
Name: Sales, dtype: int64

Type: <class 'pandas.core.series.Series'>

Average sales by region:
Region
East     1030.52
North    1007.14
South     966.86
West      999.78
Name: Sales, dtype: float64
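
The calls above follow a split-apply-combine pattern. The sketch below spells the three steps out by hand on a small hypothetical `toy` frame (not the lesson's `df_sales`), then compares the result with the built-in one-liner. It is meant only as an illustration of what `.groupby(...).sum()` does conceptually, not of how pandas implements it.

```python
import pandas as pd

# Tiny hypothetical frame, used only for illustration
toy = pd.DataFrame({
    'Product': ['Laptop', 'Phone', 'Laptop', 'Phone', 'Tablet'],
    'Sales': [1000, 600, 1200, 400, 300],
})

# Split: iterating a GroupBy yields (group key, sub-DataFrame) pairs
manual_totals = {}
for product, group in toy.groupby('Product'):
    # Apply: reduce each piece to a single number
    manual_totals[product] = group['Sales'].sum()

# Combine: assemble the per-group results into one Series
manual_totals = pd.Series(manual_totals, name='Sales')

# The built-in equivalent
builtin_totals = toy.groupby('Product')['Sales'].sum()

print(manual_totals)
print(builtin_totals)
```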


In [24]:
# Multiple aggregations on the same column
print("Multiple statistics for sales by product:")
product_stats = df_sales.groupby('Product')['Sales'].agg(['count', 'sum', 'mean', 'std']).round(2)
print(product_stats)

print("\nWith custom column names:")
product_stats_named = df_sales.groupby('Product')['Sales'].agg([
    ('Count', 'count'),
    ('Total_Sales', 'sum'),
    ('Average_Sales', 'mean'),
    ('Std_Dev', 'std')
]).round(2)
print(product_stats_named)

Multiple statistics for sales by product:
            count    sum     mean     std
Product                                  
Headphones     36  36032  1000.89  298.06
Laptop         43  45296  1053.40  361.78
Monitor        49  47419   967.73  270.57
Phone          35  36847  1052.77  323.17
Tablet         37  34711   938.14  309.20

With custom column names:
            Count  Total_Sales  Average_Sales  Std_Dev
Product                                               
Headphones     36        36032        1000.89   298.06
Laptop         43        45296        1053.40   361.78
Monitor        49        47419         967.73   270.57
Phone          35        36847        1052.77   323.17
Tablet         37        34711         938.14   309.20
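
The list-of-tuples form used above works, but pandas also supports keyword-style named aggregation, which many readers find easier to scan. A minimal sketch on a hypothetical `toy` frame (the column names on the left are illustrative choices):

```python
import pandas as pd

# Tiny hypothetical frame, used only for illustration
toy = pd.DataFrame({
    'Product': ['Laptop', 'Phone', 'Laptop', 'Phone', 'Tablet'],
    'Sales': [1000, 600, 1200, 400, 300],
})

# Each keyword becomes an output column; the value names the aggregation
stats = toy.groupby('Product')['Sales'].agg(
    Count='count',
    Total_Sales='sum',
    Average_Sales='mean',
)
print(stats)
```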


In [25]:
# Groupby with multiple columns and aggregations
print("Aggregating multiple columns:")
multi_agg = df_sales.groupby('Product').agg({
    'Sales': ['sum', 'mean', 'count'],
    'Quantity': ['sum', 'mean'],
    'Commission': ['sum', 'mean']
}).round(2)
print(multi_agg)

print("\nFlattened column names:")
multi_agg_flat = df_sales.groupby('Product').agg({
    'Sales': ['sum', 'mean', 'count'],
    'Quantity': ['sum', 'mean'],
    'Commission': ['sum', 'mean']
}).round(2)
multi_agg_flat.columns = ['_'.join(col).strip() for col in multi_agg_flat.columns.values]
print(multi_agg_flat.head())

Aggregating multiple columns:
            Sales                Quantity       Commission        
              sum     mean count      sum  mean        sum    mean
Product                                                           
Headphones  36032  1000.89    36      178  4.94    4004.08  111.22
Laptop      45296  1053.40    43      219  5.09    5018.59  116.71
Monitor     47419   967.73    49      253  5.16    5078.17  103.64
Phone       36847  1052.77    35      162  4.63    4121.58  117.76
Tablet      34711   938.14    37      194  5.24    3699.82  100.00

Flattened column names:
            Sales_sum  Sales_mean  Sales_count  Quantity_sum  Quantity_mean  \
Product                                                                       
Headphones      36032     1000.89           36           178           4.94   
Laptop          45296     1053.40           43           219           5.09   
Monitor         47419      967.73           49           253           5.16   
Phone         
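
Joining the two header levels by hand is one way to get flat column names. Another option, sketched here on a hypothetical `toy` frame, is DataFrame-level named aggregation, where each output column is declared as `name=(source_column, function)` and no MultiIndex is created in the first place.

```python
import pandas as pd

# Tiny hypothetical frame, used only for illustration
toy = pd.DataFrame({
    'Product': ['Laptop', 'Phone', 'Laptop', 'Phone'],
    'Sales': [1000, 600, 1200, 400],
    'Quantity': [2, 3, 1, 4],
})

# Output column name = (input column, aggregation)
flat = toy.groupby('Product').agg(
    Sales_sum=('Sales', 'sum'),
    Sales_mean=('Sales', 'mean'),
    Quantity_sum=('Quantity', 'sum'),
)
print(flat)
```

The trade-off is verbosity: every output column has to be listed explicitly, which is convenient for a handful of statistics and tedious for many.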

## 2. Multiple Group Columns

Grouping by multiple categorical variables.

In [26]:
# Group by multiple columns
print("Sales by Region and Product:")
region_product = df_sales.groupby(['Region', 'Product'])['Sales'].sum().round(2)
print(region_product)

print("\nAs DataFrame with reset_index():")
region_product_df = df_sales.groupby(['Region', 'Product'])['Sales'].sum().reset_index()
print(region_product_df.head(10))

Sales by Region and Product:
Region  Product   
East    Headphones     9791
        Laptop        17001
        Monitor       11728
        Phone          6514
        Tablet        10614
North   Headphones    11527
        Laptop         6514
        Monitor       11273
        Phone         13293
        Tablet         7750
South   Headphones     7131
        Laptop        13003
        Monitor       12007
        Phone         10115
        Tablet         7054
West    Headphones     7583
        Laptop         8778
        Monitor       12411
        Phone          6925
        Tablet         9293
Name: Sales, dtype: int64

As DataFrame with reset_index():
  Region     Product  Sales
0   East  Headphones   9791
1   East      Laptop  17001
2   East     Monitor  11728
3   East       Phone   6514
4   East      Tablet  10614
5  North  Headphones  11527
6  North      Laptop   6514
7  North     Monitor  11273
8  North       Phone  13293
9  North      Tablet   7750
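
If the flat DataFrame is the goal from the start, `as_index=False` can replace the trailing `.reset_index()`. A small sketch on a hypothetical `toy` frame showing that both routes give the same table:

```python
import pandas as pd

# Tiny hypothetical frame, used only for illustration
toy = pd.DataFrame({
    'Region': ['East', 'East', 'North', 'North'],
    'Product': ['Laptop', 'Phone', 'Laptop', 'Laptop'],
    'Sales': [100, 200, 300, 400],
})

# Route 1: aggregate to a MultiIndex Series, then move the index into columns
via_reset = toy.groupby(['Region', 'Product'])['Sales'].sum().reset_index()

# Route 2: keep the group keys as ordinary columns during the aggregation
via_flag = toy.groupby(['Region', 'Product'], as_index=False)['Sales'].sum()

print(via_flag)
```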


In [27]:
# Working with hierarchical index
print("Hierarchical indexing example:")
hierarchy = df_sales.groupby(['Region', 'Product', 'Month'])['Sales'].sum()
print("First 15 entries:")
print(hierarchy.head(15))

print("\nAccessing specific groups:")
print("North region, Laptop sales by month:")
try:
    north_laptops = hierarchy.loc[('North', 'Laptop')]
    print(north_laptops)
except KeyError:
    print("No data available for North region Laptops")

print("\nAll North region sales:")
try:
    north_all = hierarchy.loc['North']
    print(north_all.head())
except KeyError:
    print("No data available for North region")

Hierarchical indexing example:
First 15 entries:
Region  Product     Month
East    Headphones  1        2287
                    3        1194
                    4         985
                    5        2030
                    6         883
                    7        2412
        Laptop      1        1585
                    2        3151
                    3        4563
                    4        2966
                    5         919
                    6        2504
                    7        1313
        Monitor     1        4583
                    2         536
Name: Sales, dtype: int64

Accessing specific groups:
North region, Laptop sales by month:
Month
2    1976
3    1141
4    1342
5      43
6     844
7    1168
Name: Sales, dtype: int64

All North region sales:
Product     Month
Headphones  1        1769
            2        1080
            3        2884
            4        1460
            5        4334
Name: Sales, dtype: int64
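
`.loc` with a tuple selects from the outermost level inward. To select on an inner level instead (for example, one product across all regions), `.xs()` with a `level` argument is one option. A minimal sketch on a hypothetical `toy` frame:

```python
import pandas as pd

# Tiny hypothetical frame, used only for illustration
toy = pd.DataFrame({
    'Region': ['East', 'East', 'North', 'North', 'North'],
    'Product': ['Laptop', 'Phone', 'Laptop', 'Laptop', 'Phone'],
    'Month': [1, 1, 1, 2, 2],
    'Sales': [100, 200, 300, 400, 500],
})

hierarchy_toy = toy.groupby(['Region', 'Product', 'Month'])['Sales'].sum()

# Cross-section on an inner level: all Laptop rows, whatever the region
laptops = hierarchy_toy.xs('Laptop', level='Product')
print(laptops)

# Cross-section on the innermost level: everything sold in month 1
january = hierarchy_toy.xs(1, level='Month')
print(january)
```

The selected level is dropped from the result, so `laptops` is indexed by `(Region, Month)` only.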


In [28]:
# Unstacking hierarchical data
print("Unstacking hierarchical data:")
region_product_pivot = df_sales.groupby(['Region', 'Product'])['Sales'].sum().unstack(fill_value=0)
print(region_product_pivot)

print("\nUnstacking different levels:")
product_region_pivot = df_sales.groupby(['Product', 'Region'])['Sales'].sum().unstack(fill_value=0)
print(product_region_pivot)

Unstacking hierarchical data:
Product  Headphones  Laptop  Monitor  Phone  Tablet
Region                                             
East           9791   17001    11728   6514   10614
North         11527    6514    11273  13293    7750
South          7131   13003    12007  10115    7054
West           7583    8778    12411   6925    9293

Unstacking different levels:
Region       East  North  South   West
Product                               
Headphones   9791  11527   7131   7583
Laptop      17001   6514  13003   8778
Monitor     11728  11273  12007  12411
Phone        6514  13293  10115   6925
Tablet      10614   7750   7054   9293
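
The groupby-then-unstack pattern produces the same table as `pivot_table`. The sketch below, on a hypothetical `toy` frame with one missing Region/Product combination, shows both spellings side by side so the `fill_value=0` behaviour is visible.

```python
import pandas as pd

# Tiny hypothetical frame; note there is no North/Phone row
toy = pd.DataFrame({
    'Region': ['East', 'East', 'North', 'North'],
    'Product': ['Laptop', 'Phone', 'Laptop', 'Laptop'],
    'Sales': [100, 200, 300, 400],
})

# Spelling 1: aggregate, then pivot the inner index level into columns
via_unstack = toy.groupby(['Region', 'Product'])['Sales'].sum().unstack(fill_value=0)

# Spelling 2: let pivot_table do the grouping and reshaping in one call
via_pivot = toy.pivot_table(index='Region', columns='Product',
                            values='Sales', aggfunc='sum', fill_value=0)

print(via_unstack)
print(via_pivot)
```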


## 3. Common Aggregation Functions

Explore the most useful aggregation functions.

In [29]:
# Comprehensive aggregation example
print("Comprehensive statistics by salesperson:")
salesperson_stats = df_sales.groupby('Salesperson')['Sales'].agg([
    'count',      # Number of sales
    'sum',        # Total sales
    'mean',       # Average sale
    'median',     # Median sale
    'std',        # Standard deviation
    'min',        # Minimum sale
    'max',        # Maximum sale
    lambda x: x.quantile(0.25),  # 25th percentile
    lambda x: x.quantile(0.75)   # 75th percentile
]).round(2)

# Rename lambda columns
salesperson_stats.columns = ['Count', 'Total', 'Mean', 'Median', 'Std', 'Min', 'Max', 'Q25', 'Q75']
print(salesperson_stats)

Comprehensive statistics by salesperson:
             Count  Total     Mean  Median     Std  Min   Max     Q25      Q75
Salesperson                                                                   
Alice           35  33468   956.23   929.0  288.45  298  1588  841.00  1089.50
Bob             36  36427  1011.86  1050.0  314.32  230  1702  802.25  1196.25
Charlie         37  39529  1068.35  1070.0  329.60  539  1761  806.00  1313.00
Diana           29  28906   996.76  1068.0  325.84   43  1607  831.00  1179.00
Eve             34  35134  1033.35  1046.0  323.72  519  1976  775.00  1159.25
Frank           29  26841   925.55   904.0  296.00  381  1477  745.00  1145.00
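
Lambdas inside `.agg()` show up as `<lambda_0>`, `<lambda_1>` and so on, which is why the columns are renamed above. For quantiles specifically, one alternative is to ask the GroupBy for several quantiles at once and unstack the result. A minimal sketch on a hypothetical `toy` frame:

```python
import pandas as pd

# Tiny hypothetical frame, used only for illustration
toy = pd.DataFrame({
    'Salesperson': ['Alice'] * 4 + ['Bob'] * 4,
    'Sales': [100, 200, 300, 400, 10, 20, 30, 40],
})

# One call returns both quantiles; the quantile level becomes the columns
quartiles = toy.groupby('Salesperson')['Sales'].quantile([0.25, 0.75]).unstack()
quartiles.columns = ['Q25', 'Q75']
print(quartiles)
```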


In [30]:
# Date-based aggregations
print("Monthly sales trends:")
monthly_sales = df_sales.groupby('Month').agg({
    'Sales': ['sum', 'mean', 'count'],
    'Quantity': 'sum',
    'Commission': 'sum'
}).round(2)
print(monthly_sales)

print("\nQuarterly performance:")
quarterly_sales = df_sales.groupby('Quarter').agg({
    'Sales': ['sum', 'mean'],
    'Quantity': 'sum',
    'Salesperson': 'nunique'  # Number of unique salespeople
}).round(2)
print(quarterly_sales)

Monthly sales trends:
       Sales                Quantity Commission
         sum     mean count      sum        sum
Month                                          
1      31482  1015.55    31      157    3324.78
2      29854  1029.45    29      153    3437.00
3      28500   919.35    31      173    3242.74
4      27043   901.43    30      124    2973.03
5      31530  1017.10    31      166    3351.57
6      33686  1122.87    30      147    3770.67
7      18210  1011.67    18       86    1822.45

Quarterly performance:
         Sales          Quantity Salesperson
           sum     mean      sum     nunique
Quarter                                     
1        89836   987.21      483           6
2        92259  1013.84      437           6
3        18210  1011.67       86           6


## 4. Custom Aggregation Functions

Create your own aggregation functions for specific business logic.

In [31]:
# Custom aggregation functions
def sales_range(series):
    """Calculate the range of sales values"""
    return series.max() - series.min()

def high_value_count(series, threshold=1200):
    """Count sales above a threshold"""
    return (series > threshold).sum()

def coefficient_of_variation(series):
    """Calculate coefficient of variation (std/mean)"""
    return series.std() / series.mean() if series.mean() != 0 else 0

print("Custom aggregations by product:")
custom_agg = df_sales.groupby('Product')['Sales'].agg([
    'mean',
    'std',
    sales_range,
    high_value_count,
    coefficient_of_variation
]).round(3)

custom_agg.columns = ['Mean', 'Std_Dev', 'Range', 'High_Value_Count', 'CV']
print(custom_agg)

Custom aggregations by product:
                Mean  Std_Dev  Range  High_Value_Count     CV
Product                                                      
Headphones  1000.889  298.055   1309                 8  0.298
Laptop      1053.395  361.778   1933                14  0.343
Monitor      967.735  270.570   1151                 8  0.280
Phone       1052.771  323.173   1321                11  0.307
Tablet       938.135  309.205   1314                 6  0.330
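
In the cell above, `high_value_count` always runs with its default `threshold=1200`, because `.agg()` calls each function with the group's Series only. One way to vary the threshold, sketched on a hypothetical `toy` frame, is to wrap the function in a lambda inside a named aggregation:

```python
import pandas as pd

def high_value_count(series, threshold=1200):
    """Count sales above a threshold"""
    return (series > threshold).sum()

# Tiny hypothetical frame, used only for illustration
toy = pd.DataFrame({
    'Product': ['Laptop', 'Laptop', 'Laptop', 'Phone', 'Phone'],
    'Sales': [900, 1300, 1600, 1250, 700],
})

counts = toy.groupby('Product')['Sales'].agg(
    Above_1200=high_value_count,                                # default threshold
    Above_1500=lambda s: high_value_count(s, threshold=1500),   # overridden threshold
)
print(counts)
```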


In [32]:
# Lambda functions for quick custom aggregations
print("Lambda function aggregations:")
lambda_agg = df_sales.groupby('Region')['Sales'].agg([
    ('Total', 'sum'),
    ('Average', 'mean'),
    ('Top_10_Percent', lambda x: x.quantile(0.9)),
    ('Above_Average_Count', lambda x: (x > x.mean()).sum()),
    ('Sales_Concentration', lambda x: x.nlargest(5).sum() / x.sum())  # Top 5 sales as a fraction of the total (0-1, not a percentage)
]).round(3)
print(lambda_agg)

Lambda function aggregations:
        Total   Average  Top_10_Percent  Above_Average_Count  \
Region                                                         
East    55648  1030.519          1416.6                   25   
North   50357  1007.140          1342.1                   29   
South   49310   966.863          1346.0                   26   
West    44990   999.778          1469.6                   19   

        Sales_Concentration  
Region                       
East                  0.134  
North                 0.159  
South                 0.160  
West                  0.171  


## 5. Transform and Apply Operations

Learn `.transform()` and `.apply()` for more complex group operations.

In [33]:
# Transform operations - return same size as original
print("Transform operations:")

# Add group statistics as new columns
df_transformed = df_sales.copy()
df_transformed['Product_Avg_Sales'] = df_sales.groupby('Product')['Sales'].transform('mean')
df_transformed['Region_Total_Sales'] = df_sales.groupby('Region')['Sales'].transform('sum')
df_transformed['Sales_vs_Product_Avg'] = df_transformed['Sales'] - df_transformed['Product_Avg_Sales']

print("Sample with transform columns:")
print(df_transformed[['Product', 'Sales', 'Product_Avg_Sales', 'Sales_vs_Product_Avg']].head())

print("\nRanking within groups:")
df_transformed['Sales_Rank_in_Product'] = df_sales.groupby('Product')['Sales'].rank(ascending=False)
print(df_transformed[['Product', 'Sales', 'Sales_Rank_in_Product']].head(10))

Transform operations:
Sample with transform columns:
      Product  Sales  Product_Avg_Sales  Sales_vs_Product_Avg
0     Monitor   1068         967.734694            100.265306
1  Headphones    918        1000.888889            -82.888889
2      Tablet   1133         938.135135            194.864865
3  Headphones   1340        1000.888889            339.111111
4  Headphones   1150        1000.888889            149.111111

Ranking within groups:
      Product  Sales  Sales_Rank_in_Product
0     Monitor   1068                   16.5
1  Headphones    918                   24.0
2      Tablet   1133                   12.0
3  Headphones   1340                    6.0
4  Headphones   1150                   12.0
5       Phone   1318                    7.5
6      Tablet    799                   24.0
7      Tablet    739                   27.0
8      Tablet    836                   22.0
9  Headphones    619                   32.0
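
Because `.transform()` returns one value per original row, its result can be combined directly with the source column. A common use is expressing each row as a share of its group total; the sketch below does this on a hypothetical `toy` frame (`Region_Share` is an illustrative column name).

```python
import pandas as pd

# Tiny hypothetical frame, used only for illustration
toy = pd.DataFrame({
    'Region': ['East', 'East', 'North', 'North', 'North'],
    'Sales': [100, 300, 200, 200, 600],
})

# transform('sum') broadcasts each region's total back onto its rows
region_total = toy.groupby('Region')['Sales'].transform('sum')
toy['Region_Share'] = toy['Sales'] / region_total
print(toy)
```

Within each region the shares add up to 1, which makes a quick sanity check.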


In [34]:
# Apply operations - can return different structures
print("Apply operations:")

def group_summary(group):
    """Return a summary Series for each group"""
    return pd.Series({
        'total_sales': group['Sales'].sum(),
        'avg_sales': group['Sales'].mean(),
        'num_transactions': len(group),
        'top_salesperson': group.loc[group['Sales'].idxmax(), 'Salesperson'],
        'sales_per_quantity': (group['Sales'] / group['Quantity']).mean()
    })

apply_result = df_sales.groupby('Product').apply(group_summary).round(2)
print(apply_result)

print("\nTop performing sale in each region:")
top_sales_by_region = df_sales.groupby('Region').apply(lambda x: x.loc[x['Sales'].idxmax()])
print(top_sales_by_region[['Product', 'Sales', 'Salesperson']])

Apply operations:
            total_sales  avg_sales  num_transactions top_salesperson  \
Product                                                                
Headphones        36032    1000.89                36           Diana   
Laptop            45296    1053.40                43             Eve   
Monitor           47419     967.73                49           Alice   
Phone             36847    1052.77                35             Bob   
Tablet            34711     938.14                37             Bob   

            sales_per_quantity  
Product                         
Headphones              375.94  
Laptop                  345.93  
Monitor                 257.01  
Phone                   373.06  
Tablet                  313.06  

Top performing sale in each region:
           Product  Sales Salesperson
Region                               
East        Laptop   1585     Charlie
North       Laptop   1976         Eve
South       Laptop   1761     Charlie
West    Headphones 
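
For the "top row per group" case, `.apply()` is not strictly required. A sketch of an alternative on a hypothetical `toy` frame: take the row label of each group's maximum with `idxmax()` and look those rows up with `.loc`.

```python
import pandas as pd

# Tiny hypothetical frame, used only for illustration
toy = pd.DataFrame({
    'Region': ['East', 'East', 'North', 'North'],
    'Product': ['Laptop', 'Phone', 'Tablet', 'Laptop'],
    'Sales': [100, 300, 250, 200],
})

# Row label of the largest sale within each region
top_idx = toy.groupby('Region')['Sales'].idxmax()

# Fetch those rows, then index by region to mirror the apply() output
top_rows = toy.loc[top_idx].set_index('Region')
print(top_rows)
```

This relies on the frame having unique row labels; with duplicate labels, `.loc` would return more rows than expected.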



## 6. Filtering Groups

Filter entire groups based on group-level conditions.

In [35]:
# Filter groups based on group characteristics
print("Groups with more than 30 transactions:")
active_products = df_sales.groupby('Product').filter(lambda x: len(x) > 30)
print(f"Original data: {len(df_sales)} rows")
print(f"Filtered data: {len(active_products)} rows")
print("\nProduct transaction counts in filtered data:")
print(active_products['Product'].value_counts())

print("\nGroups with average sales > $1000:")
high_value_products = df_sales.groupby('Product').filter(lambda x: x['Sales'].mean() > 1000)
print("High-value products:")
print(high_value_products.groupby('Product')['Sales'].mean().round(2))

Groups with more than 30 transactions:
Original data: 200 rows
Filtered data: 200 rows

Product transaction counts in filtered data:
Product
Monitor       49
Laptop        43
Tablet        37
Headphones    36
Phone         35
Name: count, dtype: int64

Groups with average sales > $1000:
High-value products:
Product
Headphones    1000.89
Laptop        1053.40
Phone         1052.77
Name: Sales, dtype: float64
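
A group-level condition can also be written as a boolean mask built with `.transform()`. The sketch below, on a hypothetical `toy` frame, shows that the mask selects the same rows as `.filter()`; the mask form avoids calling a Python lambda once per group, which tends to matter when there are many groups.

```python
import pandas as pd

# Tiny hypothetical frame, used only for illustration
toy = pd.DataFrame({
    'Product': ['Laptop', 'Laptop', 'Phone', 'Phone', 'Tablet'],
    'Sales': [1100, 1300, 900, 1000, 1500],
})

# Option 1: filter() keeps or drops whole groups
via_filter = toy.groupby('Product').filter(lambda g: g['Sales'].mean() > 1000)

# Option 2: broadcast the group mean to every row and mask on it
mask = toy.groupby('Product')['Sales'].transform('mean') > 1000
via_mask = toy[mask]

print(via_mask)
```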


In [36]:
# Complex filtering conditions
print("Salespeople with consistent performance:")
# Filter salespeople with at least 20 sales and CV < 0.5
consistent_performers = df_sales.groupby('Salesperson').filter(
    lambda x: len(x) >= 20 and (x['Sales'].std() / x['Sales'].mean()) < 0.5
)

if len(consistent_performers) > 0:
    print("Consistent performers analysis:")
    consistency_analysis = consistent_performers.groupby('Salesperson')['Sales'].agg([
        'count', 'mean', 'std', lambda x: x.std()/x.mean()
    ]).round(3)
    consistency_analysis.columns = ['Count', 'Mean', 'Std', 'CV']
    print(consistency_analysis)
else:
    print("No salespeople meet the consistency criteria")

Salespeople with consistent performance:
Consistent performers analysis:
             Count      Mean      Std     CV
Salesperson                                 
Alice           35   956.229  288.455  0.302
Bob             36  1011.861  314.322  0.311
Charlie         37  1068.351  329.602  0.309
Diana           29   996.759  325.836  0.327
Eve             34  1033.353  323.720  0.313
Frank           29   925.552  296.002  0.320


## 7. Advanced Grouping Techniques

More sophisticated grouping operations.

In [37]:
# Groupby with categorical cuts
print("Grouping by sales value ranges:")
# Create sales categories
df_sales['Sales_Category'] = pd.cut(df_sales['Sales'], 
                                   bins=[0, 500, 1000, 1500, float('inf')],
                                   labels=['Low', 'Medium', 'High', 'Very High'])

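# observed=False keeps every category in the result and avoids the FutureWarning for categorical keys
sales_category_analysis = df_sales.groupby('Sales_Category', observed=False).agg({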
    'Sales': ['count', 'mean', 'sum'],
    'Quantity': 'sum',
    'Commission': 'sum'
}).round(2)
print(sales_category_analysis)

print("\nProduct distribution across sales categories:")
category_product_cross = pd.crosstab(df_sales['Sales_Category'], df_sales['Product'])
print(category_product_cross)

Grouping by sales value ranges:
               Sales                  Quantity Commission
               count     mean     sum      sum        sum
Sales_Category                                           
Low                8   330.88    2647       54     272.61
Medium            92   788.87   72576      478    8110.68
High              89  1202.89  107057      423   11652.60
Very High         11  1638.64   18025       51    1886.35

Product distribution across sales categories:
Product         Headphones  Laptop  Monitor  Phone  Tablet
Sales_Category                                            
Low                      1       2        1      2       2
Medium                  18      17       26     12      19
High                    16      21       21     17      14
Very High                1       3        1      4       2
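
One detail of `pd.cut` worth knowing: bins are right-inclusive by default, so the first bin here is `(0, 500]` and a sale of exactly 0 would fall outside every bin. A short sketch on hypothetical values showing the default behaviour and the `include_lowest=True` option:

```python
import pandas as pd

# Hypothetical values chosen to sit on and around the bin edges
values = pd.Series([0, 500, 501, 2000])
bins = [0, 500, 1000, float('inf')]
labels = ['Low', 'Medium', 'High']

# Default: intervals are (0, 500], (500, 1000], (1000, inf], so 0 is unbinned
default_bins = pd.cut(values, bins=bins, labels=labels)

# include_lowest=True closes the first interval on the left as well
inclusive_bins = pd.cut(values, bins=bins, labels=labels, include_lowest=True)

print(default_bins)
print(inclusive_bins)
```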




In [38]:
# Time-based grouping
print("Weekly sales analysis:")
df_sales['Week'] = df_sales['Date'].dt.isocalendar().week
weekly_analysis = df_sales.groupby('Week').agg({
    'Sales': ['sum', 'mean', 'count'],
    'Product': lambda x: x.mode().iloc[0] if not x.mode().empty else 'None',  # Most common product
    'Salesperson': 'nunique'
}).round(2)
print(weekly_analysis.head(10))

print("\nDay of week analysis:")
df_sales['DayOfWeek'] = df_sales['Date'].dt.day_name()
day_analysis = df_sales.groupby('DayOfWeek')['Sales'].agg(['count', 'mean', 'sum']).round(2)
# Reorder by weekday
day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
day_analysis = day_analysis.reindex([day for day in day_order if day in day_analysis.index])
print(day_analysis)

Weekly sales analysis:
     Sales                    Product Salesperson
       sum     mean count    <lambda>     nunique
Week                                             
1     7726  1103.71     7  Headphones           4
2     6078   868.29     7      Tablet           4
3     7281  1040.14     7     Monitor           5
4     6867   981.00     7      Laptop           4
5     7285  1040.71     7     Monitor           4
6     7994  1142.00     7  Headphones           4
7     6652   950.29     7       Phone           3
8     7125  1017.86     7     Monitor           4
9     6293   899.00     7     Monitor           5
10    7755  1107.86     7       Phone           4

Day of week analysis:
           count     mean    sum
DayOfWeek                       
Monday        29  1017.90  29519
Tuesday       29  1003.07  29089
Wednesday     29   956.72  27745
Thursday      29   963.62  27945
Friday        28  1136.71  31828
Saturday      28  1018.32  28513
Sunday        28   916.64  25666
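
Grouping on an ISO week number works for a single year, but week 1 of one year and week 1 of the next would land in the same group. One alternative, sketched on a hypothetical `toy` frame, is `pd.Grouper`, which bins on the actual dates. The `'W'` frequency used here means weeks ending on Sunday.

```python
import pandas as pd

# Two full weeks of hypothetical daily data; 2024-01-01 is a Monday
toy = pd.DataFrame({
    'Date': pd.date_range('2024-01-01', periods=14, freq='D'),
    'Sales': list(range(1, 15)),
})

# Each bin is labelled with the Sunday that closes the week
weekly = toy.groupby(pd.Grouper(key='Date', freq='W'))['Sales'].sum()
print(weekly)
```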


## 8. Performance Considerations

Tips for efficient groupby operations.

In [39]:
# Efficient groupby operations
import time

# Create larger dataset for timing comparison
large_df = pd.concat([df_sales] * 10, ignore_index=True)
print(f"Large dataset size: {len(large_df)} rows")

# Method 1: Multiple separate groupby calls (less efficient)
# perf_counter() is a monotonic, high-resolution clock intended for timing short code
start_time = time.perf_counter()
result1_sum = large_df.groupby('Product')['Sales'].sum()
result1_mean = large_df.groupby('Product')['Sales'].mean()
result1_count = large_df.groupby('Product')['Sales'].count()
time1 = time.perf_counter() - start_time

# Method 2: Single groupby with agg (more efficient)
start_time = time.perf_counter()
result2 = large_df.groupby('Product')['Sales'].agg(['sum', 'mean', 'count'])
time2 = time.perf_counter() - start_time

print(f"Multiple groupby calls: {time1:.4f} seconds")
print(f"Single groupby with agg: {time2:.4f} seconds")
print(f"Efficiency gain: {time1/time2:.2f}x faster")

# Verify results are the same
print(f"\nResults are equivalent: {result1_sum.equals(result2['sum'])}")

Large dataset size: 2000 rows
Multiple groupby calls: 0.0016 seconds
Single groupby with agg: 0.0007 seconds
Efficiency gain: 2.44x faster

Results are equivalent: True
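
A single timed run on a 2,000-row frame is noisy, so the exact ratio will vary from run to run and machine to machine. For a steadier comparison, one option is the standard-library `timeit` module, which repeats the call many times. The sketch below uses a hypothetical randomly generated `big` frame rather than `large_df`; no particular speed-up is implied.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical data, generated only for this timing sketch
rng = np.random.default_rng(0)
big = pd.DataFrame({
    'Product': rng.choice(['A', 'B', 'C'], 10_000),
    'Sales': rng.integers(100, 2000, 10_000),
})

def separate_calls():
    """Three independent groupby passes"""
    g_sum = big.groupby('Product')['Sales'].sum()
    g_mean = big.groupby('Product')['Sales'].mean()
    g_count = big.groupby('Product')['Sales'].count()
    return g_sum, g_mean, g_count

def single_agg():
    """One groupby pass with three aggregations"""
    return big.groupby('Product')['Sales'].agg(['sum', 'mean', 'count'])

# Run each function 20 times per trial, 3 trials, and keep the best trial
t_separate = min(timeit.repeat(separate_calls, number=20, repeat=3))
t_single = min(timeit.repeat(single_agg, number=20, repeat=3))

print(f"Separate calls: {t_separate:.4f} s for 20 runs")
print(f"Single agg:     {t_single:.4f} s for 20 runs")
```

Taking the minimum over trials is a common convention, since slower trials usually reflect interference from other processes rather than the code itself.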


## Practice Exercises

Apply your grouping and aggregation skills:

In [40]:
# Exercise 1: Sales Performance Analysis
# Create a comprehensive sales performance report that includes:
# - Total and average sales by salesperson and region
# - Commission earned by each salesperson
# - Performance ranking within each region
# - Identify top and bottom performers

# Your code here:
def sales_performance_report(df):
    """Generate comprehensive sales performance report"""
    # Your implementation here
    pass

# sales_performance_report(df_sales)

In [41]:
# Exercise 2: Product Analysis
# Analyze product performance including:
# - Which products are most/least popular (by quantity and sales)
# - Seasonal trends for each product
# - Regional preferences for different products
# - Price consistency across regions

# Your code here:


In [42]:
# Exercise 3: Custom Business Metrics
# Create custom aggregation functions to calculate:
# - Customer acquisition cost (if you have marketing spend data)
# - Sales velocity (sales per day) for each product
# - Market share by region
# - Performance consistency score

# Your code here:


## Key Takeaways

1. **GroupBy Basics**: `.groupby()` splits data into groups by the values of one or more key columns
2. **Aggregation Functions**: Use built-in functions (`sum`, `mean`, `count`) or custom functions
3. **Multiple Aggregations**: Use `.agg()` with lists or dictionaries for multiple operations
4. **Hierarchical Indexing**: Multiple group columns create hierarchical indices
5. **Transform vs Apply**: `.transform()` preserves original size, `.apply()` can return different structures
6. **Filtering Groups**: Use `.filter()` to remove entire groups based on conditions
7. **Performance**: A single `.agg()` call is usually faster than repeating `.groupby()` for each statistic, though the gain depends on data size

## Common Patterns

```python
# Basic aggregation
df.groupby('column')['value'].sum()

# Multiple aggregations
df.groupby('column')['value'].agg(['sum', 'mean', 'count'])

# Multiple columns and aggregations
df.groupby('group_col').agg({
    'col1': ['sum', 'mean'],
    'col2': 'count'
})

# Custom aggregation
df.groupby('column')['value'].agg(lambda x: x.max() - x.min())

# Transform for group statistics
df['group_mean'] = df.groupby('group')['value'].transform('mean')
```