{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Session 1 - DataFrames - Lesson 4: Grouping and Aggregation\n",
"\n",
"## Learning Objectives\n",
"- Master the `.groupby()` operation for data aggregation\n",
"- Learn different aggregation functions and methods\n",
"- Understand multi-level grouping and hierarchical indexing\n",
"- Practice custom aggregation functions\n",
"- Explore advanced grouping techniques\n",
"\n",
"## Prerequisites\n",
"- Completed Lessons 1-3\n",
"- Understanding of basic statistical concepts (mean, sum, count, etc.)"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Dataset created:\n",
"Shape: (200, 11)\n",
"\n",
"First few rows:\n",
" Date Product Category Sales Quantity Region Salesperson \\\n",
"0 2024-01-01 Monitor Accessories 1068 6 West Diana \n",
"1 2024-01-02 Headphones Electronics 918 1 East Alice \n",
"2 2024-01-03 Tablet Accessories 1133 5 North Diana \n",
"3 2024-01-04 Headphones Electronics 1340 9 West Bob \n",
"4 2024-01-05 Headphones Electronics 1150 2 North Eve \n",
"\n",
" Commission_Rate Commission Month Quarter \n",
"0 0.15 160.20 1 1 \n",
"1 0.12 110.16 1 1 \n",
"2 0.08 90.64 1 1 \n",
"3 0.08 107.20 1 1 \n",
"4 0.12 138.00 1 1 \n"
]
}
],
"source": [
"# Import required libraries\n",
"import pandas as pd\n",
"import numpy as np\n",
"from datetime import datetime, timedelta\n",
"\n",
"# Create comprehensive sample dataset\n",
"np.random.seed(42)\n",
"n_records = 200\n",
"\n",
"sales_data = {\n",
" 'Date': pd.date_range('2024-01-01', periods=n_records, freq='D'),\n",
" 'Product': np.random.choice(['Laptop', 'Phone', 'Tablet', 'Monitor', 'Headphones'], n_records),\n",
" 'Category': np.random.choice(['Electronics', 'Accessories'], n_records, p=[0.8, 0.2]),\n",
" 'Sales': np.random.normal(1000, 300, n_records).astype(int),\n",
" 'Quantity': np.random.randint(1, 10, n_records),\n",
" 'Region': np.random.choice(['North', 'South', 'East', 'West'], n_records),\n",
" 'Salesperson': np.random.choice(['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank'], n_records),\n",
" 'Commission_Rate': np.random.choice([0.08, 0.10, 0.12, 0.15], n_records)\n",
"}\n",
"\n",
"df_sales = pd.DataFrame(sales_data)\n",
"df_sales['Sales'] = np.abs(df_sales['Sales']) # Ensure positive values\n",
"df_sales['Commission'] = df_sales['Sales'] * df_sales['Commission_Rate']\n",
"df_sales['Month'] = df_sales['Date'].dt.month\n",
"df_sales['Quarter'] = df_sales['Date'].dt.quarter\n",
"\n",
"print(\"Dataset created:\")\n",
"print(f\"Shape: {df_sales.shape}\")\n",
"print(\"\\nFirst few rows:\")\n",
"print(df_sales.head())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Basic GroupBy Operations\n",
"\n",
"Understanding the fundamentals of grouping data."
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Total sales by product:\n",
"Product\n",
"Headphones 36032\n",
"Laptop 45296\n",
"Monitor 47419\n",
"Phone 36847\n",
"Tablet 34711\n",
"Name: Sales, dtype: int64\n",
"\n",
"Type: <class 'pandas.core.series.Series'>\n",
"\n",
"Average sales by region:\n",
"Region\n",
"East 1030.52\n",
"North 1007.14\n",
"South 966.86\n",
"West 999.78\n",
"Name: Sales, dtype: float64\n"
]
}
],
"source": [
"# Simple groupby with single aggregation\n",
"print(\"Total sales by product:\")\n",
"product_sales = df_sales.groupby('Product')['Sales'].sum()\n",
"print(product_sales)\n",
"print(f\"\\nType: {type(product_sales)}\")\n",
"\n",
"print(\"\\nAverage sales by region:\")\n",
"region_avg = df_sales.groupby('Region')['Sales'].mean().round(2)\n",
"print(region_avg)"
]
},
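{
"cell_type": "markdown",
"metadata": {},
"source": [
"A `GroupBy` object can also be iterated directly, yielding `(key, sub-DataFrame)` pairs. The sketch below is an illustrative addition (its output is not captured in this notebook):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: iterating over a GroupBy yields (key, sub-DataFrame) pairs\n",
"for product, group in df_sales.groupby('Product'):\n",
"    print(f\"{product}: {len(group)} rows, total sales {group['Sales'].sum()}\")\n",
"    break  # show only the first group"
]
},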
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Multiple statistics for sales by product:\n",
" count sum mean std\n",
"Product \n",
"Headphones 36 36032 1000.89 298.06\n",
"Laptop 43 45296 1053.40 361.78\n",
"Monitor 49 47419 967.73 270.57\n",
"Phone 35 36847 1052.77 323.17\n",
"Tablet 37 34711 938.14 309.20\n",
"\n",
"With custom column names:\n",
" Count Total_Sales Average_Sales Std_Dev\n",
"Product \n",
"Headphones 36 36032 1000.89 298.06\n",
"Laptop 43 45296 1053.40 361.78\n",
"Monitor 49 47419 967.73 270.57\n",
"Phone 35 36847 1052.77 323.17\n",
"Tablet 37 34711 938.14 309.20\n"
]
}
],
"source": [
"# Multiple aggregations on the same column\n",
"print(\"Multiple statistics for sales by product:\")\n",
"product_stats = df_sales.groupby('Product')['Sales'].agg(['count', 'sum', 'mean', 'std']).round(2)\n",
"print(product_stats)\n",
"\n",
"print(\"\\nWith custom column names:\")\n",
"product_stats_named = df_sales.groupby('Product')['Sales'].agg([\n",
" ('Count', 'count'),\n",
" ('Total_Sales', 'sum'),\n",
" ('Average_Sales', 'mean'),\n",
" ('Std_Dev', 'std')\n",
"]).round(2)\n",
"print(product_stats_named)"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Aggregating multiple columns:\n",
" Sales Quantity Commission \n",
" sum mean count sum mean sum mean\n",
"Product \n",
"Headphones 36032 1000.89 36 178 4.94 4004.08 111.22\n",
"Laptop 45296 1053.40 43 219 5.09 5018.59 116.71\n",
"Monitor 47419 967.73 49 253 5.16 5078.17 103.64\n",
"Phone 36847 1052.77 35 162 4.63 4121.58 117.76\n",
"Tablet 34711 938.14 37 194 5.24 3699.82 100.00\n",
"\n",
"Flattened column names:\n",
" Sales_sum Sales_mean Sales_count Quantity_sum Quantity_mean \\\n",
"Product \n",
"Headphones 36032 1000.89 36 178 4.94 \n",
"Laptop 45296 1053.40 43 219 5.09 \n",
"Monitor 47419 967.73 49 253 5.16 \n",
"Phone 36847 1052.77 35 162 4.63 \n",
"Tablet 34711 938.14 37 194 5.24 \n",
"\n",
" Commission_sum Commission_mean \n",
"Product \n",
"Headphones 4004.08 111.22 \n",
"Laptop 5018.59 116.71 \n",
"Monitor 5078.17 103.64 \n",
"Phone 4121.58 117.76 \n",
"Tablet 3699.82 100.00 \n"
]
}
],
"source": [
"# Groupby with multiple columns and aggregations\n",
"print(\"Aggregating multiple columns:\")\n",
"multi_agg = df_sales.groupby('Product').agg({\n",
" 'Sales': ['sum', 'mean', 'count'],\n",
" 'Quantity': ['sum', 'mean'],\n",
" 'Commission': ['sum', 'mean']\n",
"}).round(2)\n",
"print(multi_agg)\n",
"\n",
"print(\"\\nFlattened column names:\")\n",
"multi_agg_flat = df_sales.groupby('Product').agg({\n",
" 'Sales': ['sum', 'mean', 'count'],\n",
" 'Quantity': ['sum', 'mean'],\n",
" 'Commission': ['sum', 'mean']\n",
"}).round(2)\n",
"multi_agg_flat.columns = ['_'.join(col).strip() for col in multi_agg_flat.columns.values]\n",
"print(multi_agg_flat.head())"
]
},
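{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an alternative to flattening a `MultiIndex` afterwards, named aggregation (pandas >= 0.25) produces flat, readable column names in a single step. A minimal sketch:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: named aggregation gives flat column names directly\n",
"named = df_sales.groupby('Product').agg(\n",
"    Total_Sales=('Sales', 'sum'),\n",
"    Avg_Sales=('Sales', 'mean'),\n",
"    Units=('Quantity', 'sum')\n",
").round(2)\n",
"print(named)"
]
},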
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Multiple Group Columns\n",
"\n",
"Grouping by multiple categorical variables."
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Sales by Region and Product:\n",
"Region Product \n",
"East Headphones 9791\n",
" Laptop 17001\n",
" Monitor 11728\n",
" Phone 6514\n",
" Tablet 10614\n",
"North Headphones 11527\n",
" Laptop 6514\n",
" Monitor 11273\n",
" Phone 13293\n",
" Tablet 7750\n",
"South Headphones 7131\n",
" Laptop 13003\n",
" Monitor 12007\n",
" Phone 10115\n",
" Tablet 7054\n",
"West Headphones 7583\n",
" Laptop 8778\n",
" Monitor 12411\n",
" Phone 6925\n",
" Tablet 9293\n",
"Name: Sales, dtype: int64\n",
"\n",
"As DataFrame with reset_index():\n",
" Region Product Sales\n",
"0 East Headphones 9791\n",
"1 East Laptop 17001\n",
"2 East Monitor 11728\n",
"3 East Phone 6514\n",
"4 East Tablet 10614\n",
"5 North Headphones 11527\n",
"6 North Laptop 6514\n",
"7 North Monitor 11273\n",
"8 North Phone 13293\n",
"9 North Tablet 7750\n"
]
}
],
"source": [
"# Group by multiple columns\n",
"print(\"Sales by Region and Product:\")\n",
"region_product = df_sales.groupby(['Region', 'Product'])['Sales'].sum().round(2)\n",
"print(region_product)\n",
"\n",
"print(\"\\nAs DataFrame with reset_index():\")\n",
"region_product_df = df_sales.groupby(['Region', 'Product'])['Sales'].sum().reset_index()\n",
"print(region_product_df.head(10))"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Hierarchical indexing example:\n",
"First 15 entries:\n",
"Region Product Month\n",
"East Headphones 1 2287\n",
" 3 1194\n",
" 4 985\n",
" 5 2030\n",
" 6 883\n",
" 7 2412\n",
" Laptop 1 1585\n",
" 2 3151\n",
" 3 4563\n",
" 4 2966\n",
" 5 919\n",
" 6 2504\n",
" 7 1313\n",
" Monitor 1 4583\n",
" 2 536\n",
"Name: Sales, dtype: int64\n",
"\n",
"Accessing specific groups:\n",
"North region, Laptop sales by month:\n",
"Month\n",
"2 1976\n",
"3 1141\n",
"4 1342\n",
"5 43\n",
"6 844\n",
"7 1168\n",
"Name: Sales, dtype: int64\n",
"\n",
"All North region sales:\n",
"Product Month\n",
"Headphones 1 1769\n",
" 2 1080\n",
" 3 2884\n",
" 4 1460\n",
" 5 4334\n",
"Name: Sales, dtype: int64\n"
]
}
],
"source": [
"# Working with hierarchical index\n",
"print(\"Hierarchical indexing example:\")\n",
"hierarchy = df_sales.groupby(['Region', 'Product', 'Month'])['Sales'].sum()\n",
"print(\"First 15 entries:\")\n",
"print(hierarchy.head(15))\n",
"\n",
"print(\"\\nAccessing specific groups:\")\n",
"print(\"North region, Laptop sales by month:\")\n",
"try:\n",
" north_laptops = hierarchy.loc[('North', 'Laptop')]\n",
" print(north_laptops)\n",
"except KeyError:\n",
" print(\"No data available for North region Laptops\")\n",
"\n",
"print(\"\\nAll North region sales:\")\n",
"try:\n",
" north_all = hierarchy.loc['North']\n",
" print(north_all.head())\n",
"except KeyError:\n",
" print(\"No data available for North region\")"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Unstacking hierarchical data:\n",
"Product Headphones Laptop Monitor Phone Tablet\n",
"Region \n",
"East 9791 17001 11728 6514 10614\n",
"North 11527 6514 11273 13293 7750\n",
"South 7131 13003 12007 10115 7054\n",
"West 7583 8778 12411 6925 9293\n",
"\n",
"Unstacking different levels:\n",
"Region East North South West\n",
"Product \n",
"Headphones 9791 11527 7131 7583\n",
"Laptop 17001 6514 13003 8778\n",
"Monitor 11728 11273 12007 12411\n",
"Phone 6514 13293 10115 6925\n",
"Tablet 10614 7750 7054 9293\n"
]
}
],
"source": [
"# Unstacking hierarchical data\n",
"print(\"Unstacking hierarchical data:\")\n",
"region_product_pivot = df_sales.groupby(['Region', 'Product'])['Sales'].sum().unstack(fill_value=0)\n",
"print(region_product_pivot)\n",
"\n",
"print(\"\\nUnstacking different levels:\")\n",
"product_region_pivot = df_sales.groupby(['Product', 'Region'])['Sales'].sum().unstack(fill_value=0)\n",
"print(product_region_pivot)"
]
},
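{
"cell_type": "markdown",
"metadata": {},
"source": [
"The same Region-by-Product table can be built in one call with `pd.pivot_table`, which combines the groupby and the unstack:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: pivot_table is equivalent to groupby(...).sum().unstack()\n",
"pivot = pd.pivot_table(df_sales, values='Sales', index='Region',\n",
"                       columns='Product', aggfunc='sum', fill_value=0)\n",
"print(pivot)"
]
},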
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Common Aggregation Functions\n",
"\n",
"Explore the most useful aggregation functions."
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Comprehensive statistics by salesperson:\n",
" Count Total Mean Median Std Min Max Q25 Q75\n",
"Salesperson \n",
"Alice 35 33468 956.23 929.0 288.45 298 1588 841.00 1089.50\n",
"Bob 36 36427 1011.86 1050.0 314.32 230 1702 802.25 1196.25\n",
"Charlie 37 39529 1068.35 1070.0 329.60 539 1761 806.00 1313.00\n",
"Diana 29 28906 996.76 1068.0 325.84 43 1607 831.00 1179.00\n",
"Eve 34 35134 1033.35 1046.0 323.72 519 1976 775.00 1159.25\n",
"Frank 29 26841 925.55 904.0 296.00 381 1477 745.00 1145.00\n"
]
}
],
"source": [
"# Comprehensive aggregation example\n",
"print(\"Comprehensive statistics by salesperson:\")\n",
"salesperson_stats = df_sales.groupby('Salesperson')['Sales'].agg([\n",
" 'count', # Number of sales\n",
" 'sum', # Total sales\n",
" 'mean', # Average sale\n",
" 'median', # Median sale\n",
" 'std', # Standard deviation\n",
" 'min', # Minimum sale\n",
" 'max', # Maximum sale\n",
" lambda x: x.quantile(0.25), # 25th percentile\n",
" lambda x: x.quantile(0.75) # 75th percentile\n",
"]).round(2)\n",
"\n",
"# Rename lambda columns\n",
"salesperson_stats.columns = ['Count', 'Total', 'Mean', 'Median', 'Std', 'Min', 'Max', 'Q25', 'Q75']\n",
"print(salesperson_stats)"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Monthly sales trends:\n",
" Sales Quantity Commission\n",
" sum mean count sum sum\n",
"Month \n",
"1 31482 1015.55 31 157 3324.78\n",
"2 29854 1029.45 29 153 3437.00\n",
"3 28500 919.35 31 173 3242.74\n",
"4 27043 901.43 30 124 2973.03\n",
"5 31530 1017.10 31 166 3351.57\n",
"6 33686 1122.87 30 147 3770.67\n",
"7 18210 1011.67 18 86 1822.45\n",
"\n",
"Quarterly performance:\n",
" Sales Quantity Salesperson\n",
" sum mean sum nunique\n",
"Quarter \n",
"1 89836 987.21 483 6\n",
"2 92259 1013.84 437 6\n",
"3 18210 1011.67 86 6\n"
]
}
],
"source": [
"# Date-based aggregations\n",
"print(\"Monthly sales trends:\")\n",
"monthly_sales = df_sales.groupby('Month').agg({\n",
" 'Sales': ['sum', 'mean', 'count'],\n",
" 'Quantity': 'sum',\n",
" 'Commission': 'sum'\n",
"}).round(2)\n",
"print(monthly_sales)\n",
"\n",
"print(\"\\nQuarterly performance:\")\n",
"quarterly_sales = df_sales.groupby('Quarter').agg({\n",
" 'Sales': ['sum', 'mean'],\n",
" 'Quantity': 'sum',\n",
" 'Salesperson': 'nunique' # Number of unique salespeople\n",
"}).round(2)\n",
"print(quarterly_sales)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Custom Aggregation Functions\n",
"\n",
"Create your own aggregation functions for specific business logic."
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Custom aggregations by product:\n",
" Mean Std_Dev Range High_Value_Count CV\n",
"Product \n",
"Headphones 1000.889 298.055 1309 8 0.298\n",
"Laptop 1053.395 361.778 1933 14 0.343\n",
"Monitor 967.735 270.570 1151 8 0.280\n",
"Phone 1052.771 323.173 1321 11 0.307\n",
"Tablet 938.135 309.205 1314 6 0.330\n"
]
}
],
"source": [
"# Custom aggregation functions\n",
"def sales_range(series):\n",
" \"\"\"Calculate the range of sales values\"\"\"\n",
" return series.max() - series.min()\n",
"\n",
"def high_value_count(series, threshold=1200):\n",
" \"\"\"Count sales above a threshold\"\"\"\n",
" return (series > threshold).sum()\n",
"\n",
"def coefficient_of_variation(series):\n",
" \"\"\"Calculate coefficient of variation (std/mean)\"\"\"\n",
" return series.std() / series.mean() if series.mean() != 0 else 0\n",
"\n",
"print(\"Custom aggregations by product:\")\n",
"custom_agg = df_sales.groupby('Product')['Sales'].agg([\n",
" 'mean',\n",
" 'std',\n",
" sales_range,\n",
" high_value_count,\n",
" coefficient_of_variation\n",
"]).round(3)\n",
"\n",
"custom_agg.columns = ['Mean', 'Std_Dev', 'Range', 'High_Value_Count', 'CV']\n",
"print(custom_agg)"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Lambda function aggregations:\n",
" Total Average Top_10_Percent Above_Average_Count \\\n",
"Region \n",
"East 55648 1030.519 1416.6 25 \n",
"North 50357 1007.140 1342.1 29 \n",
"South 49310 966.863 1346.0 26 \n",
"West 44990 999.778 1469.6 19 \n",
"\n",
" Sales_Concentration \n",
"Region \n",
"East 0.134 \n",
"North 0.159 \n",
"South 0.160 \n",
"West 0.171 \n"
]
}
],
"source": [
"# Lambda functions for quick custom aggregations\n",
"print(\"Lambda function aggregations:\")\n",
"lambda_agg = df_sales.groupby('Region')['Sales'].agg([\n",
" ('Total', 'sum'),\n",
" ('Average', 'mean'),\n",
" ('Top_10_Percent', lambda x: x.quantile(0.9)),\n",
" ('Above_Average_Count', lambda x: (x > x.mean()).sum()),\n",
"    ('Sales_Concentration', lambda x: x.nlargest(5).sum() / x.sum())  # share of total from the 5 largest sales\n",
"]).round(3)\n",
"print(lambda_agg)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. Transform and Apply Operations\n",
"\n",
"Learn `.transform()` and `.apply()` for more complex group operations."
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Transform operations:\n",
"Sample with transform columns:\n",
" Product Sales Product_Avg_Sales Sales_vs_Product_Avg\n",
"0 Monitor 1068 967.734694 100.265306\n",
"1 Headphones 918 1000.888889 -82.888889\n",
"2 Tablet 1133 938.135135 194.864865\n",
"3 Headphones 1340 1000.888889 339.111111\n",
"4 Headphones 1150 1000.888889 149.111111\n",
"\n",
"Ranking within groups:\n",
" Product Sales Sales_Rank_in_Product\n",
"0 Monitor 1068 16.5\n",
"1 Headphones 918 24.0\n",
"2 Tablet 1133 12.0\n",
"3 Headphones 1340 6.0\n",
"4 Headphones 1150 12.0\n",
"5 Phone 1318 7.5\n",
"6 Tablet 799 24.0\n",
"7 Tablet 739 27.0\n",
"8 Tablet 836 22.0\n",
"9 Headphones 619 32.0\n"
]
}
],
"source": [
"# Transform operations - return same size as original\n",
"print(\"Transform operations:\")\n",
"\n",
"# Add group statistics as new columns\n",
"df_transformed = df_sales.copy()\n",
"df_transformed['Product_Avg_Sales'] = df_sales.groupby('Product')['Sales'].transform('mean')\n",
"df_transformed['Region_Total_Sales'] = df_sales.groupby('Region')['Sales'].transform('sum')\n",
"df_transformed['Sales_vs_Product_Avg'] = df_transformed['Sales'] - df_transformed['Product_Avg_Sales']\n",
"\n",
"print(\"Sample with transform columns:\")\n",
"print(df_transformed[['Product', 'Sales', 'Product_Avg_Sales', 'Sales_vs_Product_Avg']].head())\n",
"\n",
"print(\"\\nRanking within groups:\")\n",
"df_transformed['Sales_Rank_in_Product'] = df_sales.groupby('Product')['Sales'].rank(ascending=False)\n",
"print(df_transformed[['Product', 'Sales', 'Sales_Rank_in_Product']].head(10))"
]
},
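{
"cell_type": "markdown",
"metadata": {},
"source": [
"`.transform()` also accepts a callable, which is useful for per-group normalization such as a z-score. A sketch (the column name `Sales_Z` is illustrative):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: per-group z-score via transform with a callable\n",
"df_z = df_sales.copy()\n",
"df_z['Sales_Z'] = df_z.groupby('Product')['Sales'].transform(\n",
"    lambda s: (s - s.mean()) / s.std()\n",
")\n",
"print(df_z[['Product', 'Sales', 'Sales_Z']].head())"
]
},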
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Apply operations:\n",
" total_sales avg_sales num_transactions top_salesperson \\\n",
"Product \n",
"Headphones 36032 1000.89 36 Diana \n",
"Laptop 45296 1053.40 43 Eve \n",
"Monitor 47419 967.73 49 Alice \n",
"Phone 36847 1052.77 35 Bob \n",
"Tablet 34711 938.14 37 Bob \n",
"\n",
" sales_per_quantity \n",
"Product \n",
"Headphones 375.94 \n",
"Laptop 345.93 \n",
"Monitor 257.01 \n",
"Phone 373.06 \n",
"Tablet 313.06 \n",
"\n",
"Top performing sale in each region:\n",
" Product Sales Salesperson\n",
"Region \n",
"East Laptop 1585 Charlie\n",
"North Laptop 1976 Eve\n",
"South Laptop 1761 Charlie\n",
"West Headphones 1607 Diana\n"
]
}
],
"source": [
"# Apply operations - can return different structures\n",
"print(\"Apply operations:\")\n",
"\n",
"def group_summary(group):\n",
" \"\"\"Return a summary Series for each group\"\"\"\n",
" return pd.Series({\n",
" 'total_sales': group['Sales'].sum(),\n",
" 'avg_sales': group['Sales'].mean(),\n",
" 'num_transactions': len(group),\n",
" 'top_salesperson': group.loc[group['Sales'].idxmax(), 'Salesperson'],\n",
" 'sales_per_quantity': (group['Sales'] / group['Quantity']).mean()\n",
" })\n",
"\n",
"# include_groups=False (pandas >= 2.2) avoids the deprecation warning about operating on grouping columns\n",
"apply_result = df_sales.groupby('Product').apply(group_summary, include_groups=False).round(2)\n",
"print(apply_result)\n",
"\n",
"print(\"\\nTop performing sale in each region:\")\n",
"# include_groups=False requires pandas >= 2.2\n",
"top_sales_by_region = df_sales.groupby('Region').apply(\n",
"    lambda x: x.loc[x['Sales'].idxmax()], include_groups=False\n",
")\n",
"print(top_sales_by_region[['Product', 'Sales', 'Salesperson']])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6. Filtering Groups\n",
"\n",
"Filter entire groups based on group-level conditions."
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Groups with more than 30 transactions:\n",
"Original data: 200 rows\n",
"Filtered data: 200 rows\n",
"\n",
"Product transaction counts in filtered data:\n",
"Product\n",
"Monitor 49\n",
"Laptop 43\n",
"Tablet 37\n",
"Headphones 36\n",
"Phone 35\n",
"Name: count, dtype: int64\n",
"\n",
"Groups with average sales > $1000:\n",
"High-value products:\n",
"Product\n",
"Headphones 1000.89\n",
"Laptop 1053.40\n",
"Phone 1052.77\n",
"Name: Sales, dtype: float64\n"
]
}
],
"source": [
"# Filter groups based on group characteristics\n",
"print(\"Groups with more than 30 transactions:\")\n",
"active_products = df_sales.groupby('Product').filter(lambda x: len(x) > 30)\n",
"print(f\"Original data: {len(df_sales)} rows\")\n",
"print(f\"Filtered data: {len(active_products)} rows\")\n",
"print(\"\\nProduct transaction counts in filtered data:\")\n",
"print(active_products['Product'].value_counts())\n",
"\n",
"print(\"\\nGroups with average sales > $1000:\")\n",
"high_value_products = df_sales.groupby('Product').filter(lambda x: x['Sales'].mean() > 1000)\n",
"print(\"High-value products:\")\n",
"print(high_value_products.groupby('Product')['Sales'].mean().round(2))"
]
},
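{
"cell_type": "markdown",
"metadata": {},
"source": [
"`.filter()` calls its function once per group; for simple size or mean conditions, the same rows can be selected with a transform-based boolean mask, which avoids a second pass over the data. A sketch:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: transform-based equivalent of filter(lambda x: len(x) > 30)\n",
"mask = df_sales.groupby('Product')['Sales'].transform('size') > 30\n",
"active = df_sales[mask]\n",
"print(f\"{len(active)} rows kept\")"
]
},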
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Salespeople with consistent performance:\n",
"Consistent performers analysis:\n",
" Count Mean Std CV\n",
"Salesperson \n",
"Alice 35 956.229 288.455 0.302\n",
"Bob 36 1011.861 314.322 0.311\n",
"Charlie 37 1068.351 329.602 0.309\n",
"Diana 29 996.759 325.836 0.327\n",
"Eve 34 1033.353 323.720 0.313\n",
"Frank 29 925.552 296.002 0.320\n"
]
}
],
"source": [
"# Complex filtering conditions\n",
"print(\"Salespeople with consistent performance:\")\n",
"# Filter salespeople with at least 20 sales and CV < 0.5\n",
"consistent_performers = df_sales.groupby('Salesperson').filter(\n",
" lambda x: len(x) >= 20 and (x['Sales'].std() / x['Sales'].mean()) < 0.5\n",
")\n",
"\n",
"if len(consistent_performers) > 0:\n",
" print(\"Consistent performers analysis:\")\n",
" consistency_analysis = consistent_performers.groupby('Salesperson')['Sales'].agg([\n",
" 'count', 'mean', 'std', lambda x: x.std()/x.mean()\n",
" ]).round(3)\n",
" consistency_analysis.columns = ['Count', 'Mean', 'Std', 'CV']\n",
" print(consistency_analysis)\n",
"else:\n",
" print(\"No salespeople meet the consistency criteria\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 7. Advanced Grouping Techniques\n",
"\n",
"More sophisticated grouping operations."
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Grouping by sales value ranges:\n",
" Sales Quantity Commission\n",
" count mean sum sum sum\n",
"Sales_Category \n",
"Low 8 330.88 2647 54 272.61\n",
"Medium 92 788.87 72576 478 8110.68\n",
"High 89 1202.89 107057 423 11652.60\n",
"Very High 11 1638.64 18025 51 1886.35\n",
"\n",
"Product distribution across sales categories:\n",
"Product Headphones Laptop Monitor Phone Tablet\n",
"Sales_Category \n",
"Low 1 2 1 2 2\n",
"Medium 18 17 26 12 19\n",
"High 16 21 21 17 14\n",
"Very High 1 3 1 4 2\n"
]
}
],
"source": [
"# Groupby with categorical cuts\n",
"print(\"Grouping by sales value ranges:\")\n",
"# Create sales categories\n",
"df_sales['Sales_Category'] = pd.cut(df_sales['Sales'], \n",
" bins=[0, 500, 1000, 1500, float('inf')],\n",
" labels=['Low', 'Medium', 'High', 'Very High'])\n",
"\n",
"# pass observed= explicitly to silence the categorical-groupby FutureWarning\n",
"sales_category_analysis = df_sales.groupby('Sales_Category', observed=False).agg({\n",
" 'Sales': ['count', 'mean', 'sum'],\n",
" 'Quantity': 'sum',\n",
" 'Commission': 'sum'\n",
"}).round(2)\n",
"print(sales_category_analysis)\n",
"\n",
"print(\"\\nProduct distribution across sales categories:\")\n",
"category_product_cross = pd.crosstab(df_sales['Sales_Category'], df_sales['Product'])\n",
"print(category_product_cross)"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Weekly sales analysis:\n",
" Sales Product Salesperson\n",
" sum mean count <lambda> nunique\n",
"Week \n",
"1 7726 1103.71 7 Headphones 4\n",
"2 6078 868.29 7 Tablet 4\n",
"3 7281 1040.14 7 Monitor 5\n",
"4 6867 981.00 7 Laptop 4\n",
"5 7285 1040.71 7 Monitor 4\n",
"6 7994 1142.00 7 Headphones 4\n",
"7 6652 950.29 7 Phone 3\n",
"8 7125 1017.86 7 Monitor 4\n",
"9 6293 899.00 7 Monitor 5\n",
"10 7755 1107.86 7 Phone 4\n",
"\n",
"Day of week analysis:\n",
" count mean sum\n",
"DayOfWeek \n",
"Monday 29 1017.90 29519\n",
"Tuesday 29 1003.07 29089\n",
"Wednesday 29 956.72 27745\n",
"Thursday 29 963.62 27945\n",
"Friday 28 1136.71 31828\n",
"Saturday 28 1018.32 28513\n",
"Sunday 28 916.64 25666\n"
]
}
],
"source": [
"# Time-based grouping\n",
"print(\"Weekly sales analysis:\")\n",
"df_sales['Week'] = df_sales['Date'].dt.isocalendar().week\n",
"weekly_analysis = df_sales.groupby('Week').agg({\n",
" 'Sales': ['sum', 'mean', 'count'],\n",
" 'Product': lambda x: x.mode().iloc[0] if not x.mode().empty else 'None', # Most common product\n",
" 'Salesperson': 'nunique'\n",
"}).round(2)\n",
"print(weekly_analysis.head(10))\n",
"\n",
"print(\"\\nDay of week analysis:\")\n",
"df_sales['DayOfWeek'] = df_sales['Date'].dt.day_name()\n",
"day_analysis = df_sales.groupby('DayOfWeek')['Sales'].agg(['count', 'mean', 'sum']).round(2)\n",
"# Reorder by weekday\n",
"day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']\n",
"day_analysis = day_analysis.reindex([day for day in day_order if day in day_analysis.index])\n",
"print(day_analysis)"
]
},
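{
"cell_type": "markdown",
"metadata": {},
"source": [
"For calendar-aware time grouping, `pd.Grouper` (or `DataFrame.resample`) groups directly on the datetime column without deriving helper columns first. A sketch:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: group by calendar week directly on the Date column\n",
"weekly = df_sales.groupby(pd.Grouper(key='Date', freq='W'))['Sales'].sum()\n",
"print(weekly.head())"
]
},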
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 8. Performance Considerations\n",
"\n",
"Tips for efficient groupby operations."
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Large dataset size: 2000 rows\n",
"Multiple groupby calls: 0.0016 seconds\n",
"Single groupby with agg: 0.0007 seconds\n",
"Efficiency gain: 2.44x faster\n",
"\n",
"Results are equivalent: True\n"
]
}
],
"source": [
"# Efficient groupby operations\n",
"import time\n",
"\n",
"# Create larger dataset for timing comparison\n",
"large_df = pd.concat([df_sales] * 10, ignore_index=True)\n",
"print(f\"Large dataset size: {len(large_df)} rows\")\n",
"\n",
"# Method 1: Multiple separate groupby calls (less efficient)\n",
"start_time = time.time()\n",
"result1_sum = large_df.groupby('Product')['Sales'].sum()\n",
"result1_mean = large_df.groupby('Product')['Sales'].mean()\n",
"result1_count = large_df.groupby('Product')['Sales'].count()\n",
"time1 = time.time() - start_time\n",
"\n",
"# Method 2: Single groupby with agg (more efficient)\n",
"start_time = time.time()\n",
"result2 = large_df.groupby('Product')['Sales'].agg(['sum', 'mean', 'count'])\n",
"time2 = time.time() - start_time\n",
"\n",
"print(f\"Multiple groupby calls: {time1:.4f} seconds\")\n",
"print(f\"Single groupby with agg: {time2:.4f} seconds\")\n",
"print(f\"Efficiency gain: {time1/time2:.2f}x faster\")\n",
"\n",
"# Verify results are the same\n",
"print(f\"\\nResults are equivalent: {result1_sum.equals(result2['sum'])}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Practice Exercises\n",
"\n",
"Apply your grouping and aggregation skills:"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [],
"source": [
"# Exercise 1: Sales Performance Analysis\n",
"# Create a comprehensive sales performance report that includes:\n",
"# - Total and average sales by salesperson and region\n",
"# - Commission earned by each salesperson\n",
"# - Performance ranking within each region\n",
"# - Identify top and bottom performers\n",
"\n",
"# Your code here:\n",
"def sales_performance_report(df):\n",
" \"\"\"Generate comprehensive sales performance report\"\"\"\n",
" # Your implementation here\n",
" pass\n",
"\n",
"# sales_performance_report(df_sales)"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [],
"source": [
"# Exercise 2: Product Analysis\n",
"# Analyze product performance including:\n",
"# - Which products are most/least popular (by quantity and sales)\n",
"# - Seasonal trends for each product\n",
"# - Regional preferences for different products\n",
"# - Price consistency across regions\n",
"\n",
"# Your code here:\n"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [],
"source": [
"# Exercise 3: Custom Business Metrics\n",
"# Create custom aggregation functions to calculate:\n",
"# - Customer acquisition cost (if you have marketing spend data)\n",
"# - Sales velocity (sales per day) for each product\n",
"# - Market share by region\n",
"# - Performance consistency score\n",
"\n",
"# Your code here:\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Key Takeaways\n",
"\n",
"1. **GroupBy Basics**: `.groupby()` splits data into groups based on categorical variables\n",
"2. **Aggregation Functions**: Use built-in functions (`sum`, `mean`, `count`) or custom functions\n",
"3. **Multiple Aggregations**: Use `.agg()` with lists or dictionaries for multiple operations\n",
"4. **Hierarchical Indexing**: Multiple group columns create hierarchical indices\n",
"5. **Transform vs Apply**: `.transform()` preserves original size, `.apply()` can return different structures\n",
"6. **Filtering Groups**: Use `.filter()` to remove entire groups based on conditions\n",
"7. **Performance**: Single `.agg()` calls are more efficient than multiple `.groupby()` operations\n",
"\n",
"## Common Patterns\n",
"\n",
"```python\n",
"# Basic aggregation\n",
"df.groupby('column')['value'].sum()\n",
"\n",
"# Multiple aggregations\n",
"df.groupby('column')['value'].agg(['sum', 'mean', 'count'])\n",
"\n",
"# Multiple columns and aggregations\n",
"df.groupby('group_col').agg({\n",
" 'col1': ['sum', 'mean'],\n",
" 'col2': 'count'\n",
"})\n",
"\n",
"# Custom aggregation\n",
"df.groupby('column')['value'].agg(lambda x: x.max() - x.min())\n",
"\n",
"# Transform for group statistics\n",
"df['group_mean'] = df.groupby('group')['value'].transform('mean')\n",
"```"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.13.3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}