{
|
|
"cells": [
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Session 1 - DataFrames - Lesson 4: Grouping and Aggregation\n",
|
|
"\n",
|
|
"## Learning Objectives\n",
|
|
"- Master the `.groupby()` operation for data aggregation\n",
|
|
"- Learn different aggregation functions and methods\n",
|
|
"- Understand multi-level grouping and hierarchical indexing\n",
|
|
"- Practice custom aggregation functions\n",
|
|
"- Explore advanced grouping techniques\n",
|
|
"\n",
|
|
"## Prerequisites\n",
|
|
"- Completed Lessons 1-3\n",
|
|
"- Understanding of basic statistical concepts (mean, sum, count, etc.)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 22,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Dataset created:\n",
|
|
"Shape: (200, 11)\n",
|
|
"\n",
|
|
"First few rows:\n",
|
|
" Date Product Category Sales Quantity Region Salesperson \\\n",
|
|
"0 2024-01-01 Monitor Accessories 1068 6 West Diana \n",
|
|
"1 2024-01-02 Headphones Electronics 918 1 East Alice \n",
|
|
"2 2024-01-03 Tablet Accessories 1133 5 North Diana \n",
|
|
"3 2024-01-04 Headphones Electronics 1340 9 West Bob \n",
|
|
"4 2024-01-05 Headphones Electronics 1150 2 North Eve \n",
|
|
"\n",
|
|
" Commission_Rate Commission Month Quarter \n",
|
|
"0 0.15 160.20 1 1 \n",
|
|
"1 0.12 110.16 1 1 \n",
|
|
"2 0.08 90.64 1 1 \n",
|
|
"3 0.08 107.20 1 1 \n",
|
|
"4 0.12 138.00 1 1 \n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"# Import required libraries\n",
|
|
"import pandas as pd\n",
|
|
"import numpy as np\n",
|
|
"from datetime import datetime, timedelta\n",
|
|
"\n",
|
|
"# Create comprehensive sample dataset\n",
|
|
"np.random.seed(42)\n",
|
|
"n_records = 200\n",
|
|
"\n",
|
|
"sales_data = {\n",
|
|
" 'Date': pd.date_range('2024-01-01', periods=n_records, freq='D'),\n",
|
|
" 'Product': np.random.choice(['Laptop', 'Phone', 'Tablet', 'Monitor', 'Headphones'], n_records),\n",
|
|
" 'Category': np.random.choice(['Electronics', 'Accessories'], n_records, p=[0.8, 0.2]),\n",
|
|
" 'Sales': np.random.normal(1000, 300, n_records).astype(int),\n",
|
|
" 'Quantity': np.random.randint(1, 10, n_records),\n",
|
|
" 'Region': np.random.choice(['North', 'South', 'East', 'West'], n_records),\n",
|
|
" 'Salesperson': np.random.choice(['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank'], n_records),\n",
|
|
" 'Commission_Rate': np.random.choice([0.08, 0.10, 0.12, 0.15], n_records)\n",
|
|
"}\n",
|
|
"\n",
|
|
"df_sales = pd.DataFrame(sales_data)\n",
|
|
"df_sales['Sales'] = np.abs(df_sales['Sales']) # Ensure non-negative values\n",
|
|
"df_sales['Commission'] = df_sales['Sales'] * df_sales['Commission_Rate']\n",
|
|
"df_sales['Month'] = df_sales['Date'].dt.month\n",
|
|
"df_sales['Quarter'] = df_sales['Date'].dt.quarter\n",
|
|
"\n",
|
|
"print(\"Dataset created:\")\n",
|
|
"print(f\"Shape: {df_sales.shape}\")\n",
|
|
"print(\"\\nFirst few rows:\")\n",
|
|
"print(df_sales.head())"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## 1. Basic GroupBy Operations\n",
|
|
"\n",
|
|
"`.groupby()` splits a DataFrame into groups by key, applies a function to each group, and combines the results into a new Series or DataFrame."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 23,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Total sales by product:\n",
|
|
"Product\n",
|
|
"Headphones 36032\n",
|
|
"Laptop 45296\n",
|
|
"Monitor 47419\n",
|
|
"Phone 36847\n",
|
|
"Tablet 34711\n",
|
|
"Name: Sales, dtype: int64\n",
|
|
"\n",
|
|
"Type: <class 'pandas.core.series.Series'>\n",
|
|
"\n",
|
|
"Average sales by region:\n",
|
|
"Region\n",
|
|
"East 1030.52\n",
|
|
"North 1007.14\n",
|
|
"South 966.86\n",
|
|
"West 999.78\n",
|
|
"Name: Sales, dtype: float64\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"# Simple groupby with single aggregation\n",
|
|
"print(\"Total sales by product:\")\n",
|
|
"product_sales = df_sales.groupby('Product')['Sales'].sum()\n",
|
|
"print(product_sales)\n",
|
|
"print(f\"\\nType: {type(product_sales)}\")\n",
|
|
"\n",
|
|
"print(\"\\nAverage sales by region:\")\n",
|
|
"region_avg = df_sales.groupby('Region')['Sales'].mean().round(2)\n",
|
|
"print(region_avg)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 24,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Multiple statistics for sales by product:\n",
|
|
" count sum mean std\n",
|
|
"Product \n",
|
|
"Headphones 36 36032 1000.89 298.06\n",
|
|
"Laptop 43 45296 1053.40 361.78\n",
|
|
"Monitor 49 47419 967.73 270.57\n",
|
|
"Phone 35 36847 1052.77 323.17\n",
|
|
"Tablet 37 34711 938.14 309.20\n",
|
|
"\n",
|
|
"With custom column names:\n",
|
|
" Count Total_Sales Average_Sales Std_Dev\n",
|
|
"Product \n",
|
|
"Headphones 36 36032 1000.89 298.06\n",
|
|
"Laptop 43 45296 1053.40 361.78\n",
|
|
"Monitor 49 47419 967.73 270.57\n",
|
|
"Phone 35 36847 1052.77 323.17\n",
|
|
"Tablet 37 34711 938.14 309.20\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"# Multiple aggregations on the same column\n",
|
|
"print(\"Multiple statistics for sales by product:\")\n",
|
|
"product_stats = df_sales.groupby('Product')['Sales'].agg(['count', 'sum', 'mean', 'std']).round(2)\n",
|
|
"print(product_stats)\n",
|
|
"\n",
|
|
"print(\"\\nWith custom column names:\")\n",
|
|
"product_stats_named = df_sales.groupby('Product')['Sales'].agg([\n",
|
|
" ('Count', 'count'),\n",
|
|
" ('Total_Sales', 'sum'),\n",
|
|
" ('Average_Sales', 'mean'),\n",
|
|
" ('Std_Dev', 'std')\n",
|
|
"]).round(2)\n",
|
|
"print(product_stats_named)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 25,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Aggregating multiple columns:\n",
|
|
" Sales Quantity Commission \n",
|
|
" sum mean count sum mean sum mean\n",
|
|
"Product \n",
|
|
"Headphones 36032 1000.89 36 178 4.94 4004.08 111.22\n",
|
|
"Laptop 45296 1053.40 43 219 5.09 5018.59 116.71\n",
|
|
"Monitor 47419 967.73 49 253 5.16 5078.17 103.64\n",
|
|
"Phone 36847 1052.77 35 162 4.63 4121.58 117.76\n",
|
|
"Tablet 34711 938.14 37 194 5.24 3699.82 100.00\n",
|
|
"\n",
|
|
"Flattened column names:\n",
|
|
" Sales_sum Sales_mean Sales_count Quantity_sum Quantity_mean \\\n",
|
|
"Product \n",
|
|
"Headphones 36032 1000.89 36 178 4.94 \n",
|
|
"Laptop 45296 1053.40 43 219 5.09 \n",
|
|
"Monitor 47419 967.73 49 253 5.16 \n",
|
|
"Phone 36847 1052.77 35 162 4.63 \n",
|
|
"Tablet 34711 938.14 37 194 5.24 \n",
|
|
"\n",
|
|
" Commission_sum Commission_mean \n",
|
|
"Product \n",
|
|
"Headphones 4004.08 111.22 \n",
|
|
"Laptop 5018.59 116.71 \n",
|
|
"Monitor 5078.17 103.64 \n",
|
|
"Phone 4121.58 117.76 \n",
|
|
"Tablet 3699.82 100.00 \n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"# Aggregate several columns, each with its own list of functions\n",
|
|
"print(\"Aggregating multiple columns:\")\n",
|
|
"multi_agg = df_sales.groupby('Product').agg({\n",
|
|
" 'Sales': ['sum', 'mean', 'count'],\n",
|
|
" 'Quantity': ['sum', 'mean'],\n",
|
|
" 'Commission': ['sum', 'mean']\n",
|
|
"}).round(2)\n",
|
|
"print(multi_agg)\n",
|
|
"\n",
|
|
"print(\"\\nFlattened column names:\")\n",
|
|
"multi_agg_flat = df_sales.groupby('Product').agg({\n",
|
|
" 'Sales': ['sum', 'mean', 'count'],\n",
|
|
" 'Quantity': ['sum', 'mean'],\n",
|
|
" 'Commission': ['sum', 'mean']\n",
|
|
"}).round(2)\n",
|
|
"multi_agg_flat.columns = ['_'.join(col).strip() for col in multi_agg_flat.columns.values]\n",
|
|
"print(multi_agg_flat.head())"
|
|
]
|
|
},
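Flattening a two-level column index by joining names works, but pandas also supports named aggregation, which produces flat column names directly. The sketch below is a minimal illustration on a small hypothetical frame (`toy` stands in for `df_sales`; the output column names are made up for the example).

```python
import pandas as pd

# Hypothetical toy data standing in for df_sales
toy = pd.DataFrame({
    "Product": ["Laptop", "Laptop", "Phone", "Phone", "Phone"],
    "Sales": [1000, 1200, 800, 900, 1000],
    "Quantity": [1, 2, 3, 4, 5],
})

# Named aggregation: output_name=(input_column, function)
summary = toy.groupby("Product").agg(
    Sales_sum=("Sales", "sum"),
    Sales_mean=("Sales", "mean"),
    Quantity_sum=("Quantity", "sum"),
)
print(summary)
```

Each keyword becomes one output column, so no renaming or flattening step is needed afterwards.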
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## 2. Multiple Group Columns\n",
|
|
"\n",
|
|
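"Passing a list of columns to `.groupby()` groups by every observed combination of their values; the result is indexed by a `MultiIndex` with one level per grouping column."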
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 26,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Sales by Region and Product:\n",
|
|
"Region Product \n",
|
|
"East Headphones 9791\n",
|
|
" Laptop 17001\n",
|
|
" Monitor 11728\n",
|
|
" Phone 6514\n",
|
|
" Tablet 10614\n",
|
|
"North Headphones 11527\n",
|
|
" Laptop 6514\n",
|
|
" Monitor 11273\n",
|
|
" Phone 13293\n",
|
|
" Tablet 7750\n",
|
|
"South Headphones 7131\n",
|
|
" Laptop 13003\n",
|
|
" Monitor 12007\n",
|
|
" Phone 10115\n",
|
|
" Tablet 7054\n",
|
|
"West Headphones 7583\n",
|
|
" Laptop 8778\n",
|
|
" Monitor 12411\n",
|
|
" Phone 6925\n",
|
|
" Tablet 9293\n",
|
|
"Name: Sales, dtype: int64\n",
|
|
"\n",
|
|
"As DataFrame with reset_index():\n",
|
|
" Region Product Sales\n",
|
|
"0 East Headphones 9791\n",
|
|
"1 East Laptop 17001\n",
|
|
"2 East Monitor 11728\n",
|
|
"3 East Phone 6514\n",
|
|
"4 East Tablet 10614\n",
|
|
"5 North Headphones 11527\n",
|
|
"6 North Laptop 6514\n",
|
|
"7 North Monitor 11273\n",
|
|
"8 North Phone 13293\n",
|
|
"9 North Tablet 7750\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"# Group by multiple columns\n",
|
|
"print(\"Sales by Region and Product:\")\n",
|
|
"region_product = df_sales.groupby(['Region', 'Product'])['Sales'].sum().round(2)\n",
|
|
"print(region_product)\n",
|
|
"\n",
|
|
"print(\"\\nAs DataFrame with reset_index():\")\n",
|
|
"region_product_df = df_sales.groupby(['Region', 'Product'])['Sales'].sum().reset_index()\n",
|
|
"print(region_product_df.head(10))"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 27,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Hierarchical indexing example:\n",
|
|
"First 15 entries:\n",
|
|
"Region Product Month\n",
|
|
"East Headphones 1 2287\n",
|
|
" 3 1194\n",
|
|
" 4 985\n",
|
|
" 5 2030\n",
|
|
" 6 883\n",
|
|
" 7 2412\n",
|
|
" Laptop 1 1585\n",
|
|
" 2 3151\n",
|
|
" 3 4563\n",
|
|
" 4 2966\n",
|
|
" 5 919\n",
|
|
" 6 2504\n",
|
|
" 7 1313\n",
|
|
" Monitor 1 4583\n",
|
|
" 2 536\n",
|
|
"Name: Sales, dtype: int64\n",
|
|
"\n",
|
|
"Accessing specific groups:\n",
|
|
"North region, Laptop sales by month:\n",
|
|
"Month\n",
|
|
"2 1976\n",
|
|
"3 1141\n",
|
|
"4 1342\n",
|
|
"5 43\n",
|
|
"6 844\n",
|
|
"7 1168\n",
|
|
"Name: Sales, dtype: int64\n",
|
|
"\n",
|
|
"All North region sales:\n",
|
|
"Product Month\n",
|
|
"Headphones 1 1769\n",
|
|
" 2 1080\n",
|
|
" 3 2884\n",
|
|
" 4 1460\n",
|
|
" 5 4334\n",
|
|
"Name: Sales, dtype: int64\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"# Working with hierarchical index\n",
|
|
"print(\"Hierarchical indexing example:\")\n",
|
|
"hierarchy = df_sales.groupby(['Region', 'Product', 'Month'])['Sales'].sum()\n",
|
|
"print(\"First 15 entries:\")\n",
|
|
"print(hierarchy.head(15))\n",
|
|
"\n",
|
|
"print(\"\\nAccessing specific groups:\")\n",
|
|
"print(\"North region, Laptop sales by month:\")\n",
|
|
"try:\n",
|
|
" north_laptops = hierarchy.loc[('North', 'Laptop')]\n",
|
|
" print(north_laptops)\n",
|
|
"except KeyError:\n",
|
|
" print(\"No data available for North region Laptops\")\n",
|
|
"\n",
|
|
"print(\"\\nAll North region sales:\")\n",
|
|
"try:\n",
|
|
" north_all = hierarchy.loc['North']\n",
|
|
" print(north_all.head())\n",
|
|
"except KeyError:\n",
|
|
" print(\"No data available for North region\")"
|
|
]
|
|
},
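`.loc` with a tuple selects from the outermost index level inward. To select on an inner level (for example, one product across all regions), `.xs()` with a `level` argument is one option. This is a minimal sketch on hypothetical toy data; `hier` plays the role of `hierarchy` above.

```python
import pandas as pd

# Hypothetical toy data standing in for df_sales
toy = pd.DataFrame({
    "Region": ["North", "North", "South", "South", "South"],
    "Product": ["Laptop", "Phone", "Laptop", "Laptop", "Phone"],
    "Month": [1, 1, 1, 2, 2],
    "Sales": [100, 200, 300, 400, 500],
})
hier = toy.groupby(["Region", "Product", "Month"])["Sales"].sum()

# Cross-section on an inner level: all Laptop entries, any region or month
laptops = hier.xs("Laptop", level="Product")
print(laptops)

# Cross-section on the innermost level: month 1 for every region and product
month1 = hier.xs(1, level="Month")
print(month1)
```

The selected level is dropped from the result, so `laptops` is indexed by the remaining levels, `Region` and `Month`.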
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 28,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Unstacking hierarchical data:\n",
|
|
"Product Headphones Laptop Monitor Phone Tablet\n",
|
|
"Region \n",
|
|
"East 9791 17001 11728 6514 10614\n",
|
|
"North 11527 6514 11273 13293 7750\n",
|
|
"South 7131 13003 12007 10115 7054\n",
|
|
"West 7583 8778 12411 6925 9293\n",
|
|
"\n",
|
|
"Unstacking different levels:\n",
|
|
"Region East North South West\n",
|
|
"Product \n",
|
|
"Headphones 9791 11527 7131 7583\n",
|
|
"Laptop 17001 6514 13003 8778\n",
|
|
"Monitor 11728 11273 12007 12411\n",
|
|
"Phone 6514 13293 10115 6925\n",
|
|
"Tablet 10614 7750 7054 9293\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"# Unstacking hierarchical data\n",
|
|
"print(\"Unstacking hierarchical data:\")\n",
|
|
"region_product_pivot = df_sales.groupby(['Region', 'Product'])['Sales'].sum().unstack(fill_value=0)\n",
|
|
"print(region_product_pivot)\n",
|
|
"\n",
|
|
"print(\"\\nUnstacking different levels:\")\n",
|
|
"product_region_pivot = df_sales.groupby(['Product', 'Region'])['Sales'].sum().unstack(fill_value=0)\n",
|
|
"print(product_region_pivot)"
|
|
]
|
|
},
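The second pivot above is obtained by changing the order of the grouping columns. The same layout can also be reached from a single grouped result by telling `.unstack()` which level to move into the columns. A minimal sketch on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data standing in for df_sales
toy = pd.DataFrame({
    "Region": ["East", "East", "West", "West"],
    "Product": ["Laptop", "Phone", "Laptop", "Phone"],
    "Sales": [10, 20, 30, 40],
})
totals = toy.groupby(["Region", "Product"])["Sales"].sum()

# Default behaviour: the innermost level (Product) becomes the columns
by_product_cols = totals.unstack(level="Product")
# Choosing the outer level instead puts regions in the columns
by_region_cols = totals.unstack(level="Region")

print(by_product_cols)
print(by_region_cols)
```

Here `by_region_cols` holds the same numbers as `by_product_cols`, with rows and columns swapped.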
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## 3. Common Aggregation Functions\n",
|
|
"\n",
|
|
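"The cells below use the built-in aggregations `count`, `sum`, `mean`, `median`, `std`, `min`, `max`, and `nunique`, plus quantiles computed with lambdas."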
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 29,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Comprehensive statistics by salesperson:\n",
|
|
" Count Total Mean Median Std Min Max Q25 Q75\n",
|
|
"Salesperson \n",
|
|
"Alice 35 33468 956.23 929.0 288.45 298 1588 841.00 1089.50\n",
|
|
"Bob 36 36427 1011.86 1050.0 314.32 230 1702 802.25 1196.25\n",
|
|
"Charlie 37 39529 1068.35 1070.0 329.60 539 1761 806.00 1313.00\n",
|
|
"Diana 29 28906 996.76 1068.0 325.84 43 1607 831.00 1179.00\n",
|
|
"Eve 34 35134 1033.35 1046.0 323.72 519 1976 775.00 1159.25\n",
|
|
"Frank 29 26841 925.55 904.0 296.00 381 1477 745.00 1145.00\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"# Comprehensive aggregation example\n",
|
|
"print(\"Comprehensive statistics by salesperson:\")\n",
|
|
"salesperson_stats = df_sales.groupby('Salesperson')['Sales'].agg([\n",
|
|
" 'count', # Number of sales\n",
|
|
" 'sum', # Total sales\n",
|
|
" 'mean', # Average sale\n",
|
|
" 'median', # Median sale\n",
|
|
" 'std', # Standard deviation\n",
|
|
" 'min', # Minimum sale\n",
|
|
" 'max', # Maximum sale\n",
|
|
" lambda x: x.quantile(0.25), # 25th percentile\n",
|
|
" lambda x: x.quantile(0.75) # 75th percentile\n",
|
|
"]).round(2)\n",
|
|
"\n",
|
|
"# Rename all columns; the two lambdas would otherwise appear as <lambda_0> and <lambda_1>\n",
|
|
"salesperson_stats.columns = ['Count', 'Total', 'Mean', 'Median', 'Std', 'Min', 'Max', 'Q25', 'Q75']\n",
|
|
"print(salesperson_stats)"
|
|
]
|
|
},
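Renaming every column by position is easy to get out of step with the function list. As an alternative, keyword-style named aggregation attaches the name to each function, lambdas included. A minimal sketch, assuming a hypothetical two-person toy frame in place of `df_sales`:

```python
import pandas as pd

# Hypothetical toy data standing in for df_sales
toy = pd.DataFrame({
    "Salesperson": ["Alice"] * 5 + ["Bob"] * 5,
    "Sales": [100, 200, 300, 400, 500, 10, 20, 30, 40, 50],
})

# Each keyword names one output column, so no positional rename is needed
stats = toy.groupby("Salesperson")["Sales"].agg(
    Count="count",
    Median="median",
    Q25=lambda s: s.quantile(0.25),
    Q75=lambda s: s.quantile(0.75),
)
print(stats)
```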
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 30,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Monthly sales trends:\n",
|
|
" Sales Quantity Commission\n",
|
|
" sum mean count sum sum\n",
|
|
"Month \n",
|
|
"1 31482 1015.55 31 157 3324.78\n",
|
|
"2 29854 1029.45 29 153 3437.00\n",
|
|
"3 28500 919.35 31 173 3242.74\n",
|
|
"4 27043 901.43 30 124 2973.03\n",
|
|
"5 31530 1017.10 31 166 3351.57\n",
|
|
"6 33686 1122.87 30 147 3770.67\n",
|
|
"7 18210 1011.67 18 86 1822.45\n",
|
|
"\n",
|
|
"Quarterly performance:\n",
|
|
" Sales Quantity Salesperson\n",
|
|
" sum mean sum nunique\n",
|
|
"Quarter \n",
|
|
"1 89836 987.21 483 6\n",
|
|
"2 92259 1013.84 437 6\n",
|
|
"3 18210 1011.67 86 6\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"# Date-based aggregations\n",
|
|
"print(\"Monthly sales trends:\")\n",
|
|
"monthly_sales = df_sales.groupby('Month').agg({\n",
|
|
" 'Sales': ['sum', 'mean', 'count'],\n",
|
|
" 'Quantity': 'sum',\n",
|
|
" 'Commission': 'sum'\n",
|
|
"}).round(2)\n",
|
|
"print(monthly_sales)\n",
|
|
"\n",
|
|
"print(\"\\nQuarterly performance:\")\n",
|
|
"quarterly_sales = df_sales.groupby('Quarter').agg({\n",
|
|
" 'Sales': ['sum', 'mean'],\n",
|
|
" 'Quantity': 'sum',\n",
|
|
" 'Salesperson': 'nunique' # Number of unique salespeople\n",
|
|
"}).round(2)\n",
|
|
"print(quarterly_sales)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## 4. Custom Aggregation Functions\n",
|
|
"\n",
|
|
"Any function that takes a group's Series and returns a single value can be passed to `.agg()`, which makes it straightforward to encode business-specific metrics."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 31,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Custom aggregations by product:\n",
|
|
" Mean Std_Dev Range High_Value_Count CV\n",
|
|
"Product \n",
|
|
"Headphones 1000.889 298.055 1309 8 0.298\n",
|
|
"Laptop 1053.395 361.778 1933 14 0.343\n",
|
|
"Monitor 967.735 270.570 1151 8 0.280\n",
|
|
"Phone 1052.771 323.173 1321 11 0.307\n",
|
|
"Tablet 938.135 309.205 1314 6 0.330\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"# Custom aggregation functions\n",
|
|
"def sales_range(series):\n",
|
|
" \"\"\"Calculate the range of sales values\"\"\"\n",
|
|
" return series.max() - series.min()\n",
|
|
"\n",
|
|
"def high_value_count(series, threshold=1200):\n",
|
|
" \"\"\"Count sales above a threshold\"\"\"\n",
|
|
" return (series > threshold).sum()\n",
|
|
"\n",
|
|
"def coefficient_of_variation(series):\n",
|
|
" \"\"\"Calculate coefficient of variation (std/mean)\"\"\"\n",
|
|
" return series.std() / series.mean() if series.mean() != 0 else 0\n",
|
|
"\n",
|
|
"print(\"Custom aggregations by product:\")\n",
|
|
"custom_agg = df_sales.groupby('Product')['Sales'].agg([\n",
|
|
" 'mean',\n",
|
|
" 'std',\n",
|
|
" sales_range,\n",
|
|
" high_value_count,\n",
|
|
" coefficient_of_variation\n",
|
|
"]).round(3)\n",
|
|
"\n",
|
|
"custom_agg.columns = ['Mean', 'Std_Dev', 'Range', 'High_Value_Count', 'CV']\n",
|
|
"print(custom_agg)"
|
|
]
|
|
},
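`high_value_count` takes a `threshold` parameter, but the cell above only ever uses its default. One way to apply a different threshold is to wrap the call in a lambda. The sketch below uses hypothetical toy data, and the column names `Over_1200` and `Over_1500` are chosen only for illustration.

```python
import pandas as pd

def high_value_count(series, threshold=1200):
    """Count values strictly above a threshold."""
    return (series > threshold).sum()

# Hypothetical toy data standing in for df_sales
toy = pd.DataFrame({
    "Product": ["Laptop"] * 3 + ["Phone"] * 3,
    "Sales": [900, 1300, 1600, 1100, 1250, 1400],
})

counts = toy.groupby("Product")["Sales"].agg(
    Over_1200=high_value_count,                                   # default threshold
    Over_1500=lambda s: high_value_count(s, threshold=1500),      # custom threshold
)
print(counts)
```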
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 32,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Lambda function aggregations:\n",
|
|
" Total Average Top_10_Percent Above_Average_Count \\\n",
|
|
"Region \n",
|
|
"East 55648 1030.519 1416.6 25 \n",
|
|
"North 50357 1007.140 1342.1 29 \n",
|
|
"South 49310 966.863 1346.0 26 \n",
|
|
"West 44990 999.778 1469.6 19 \n",
|
|
"\n",
|
|
" Sales_Concentration \n",
|
|
"Region \n",
|
|
"East 0.134 \n",
|
|
"North 0.159 \n",
|
|
"South 0.160 \n",
|
|
"West 0.171 \n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"# Lambda functions for quick custom aggregations\n",
|
|
"print(\"Lambda function aggregations:\")\n",
|
|
"lambda_agg = df_sales.groupby('Region')['Sales'].agg([\n",
|
|
" ('Total', 'sum'),\n",
|
|
" ('Average', 'mean'),\n",
|
|
" ('Top_10_Percent', lambda x: x.quantile(0.9)),\n",
|
|
" ('Above_Average_Count', lambda x: (x > x.mean()).sum()),\n",
|
|
"    ('Sales_Concentration', lambda x: x.nlargest(5).sum() / x.sum()) # Top 5 sales as a fraction of total\n",
|
|
"]).round(3)\n",
|
|
"print(lambda_agg)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## 5. Transform and Apply Operations\n",
|
|
"\n",
|
|
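"`.transform()` returns a result aligned row-for-row with the original data, while `.apply()` runs an arbitrary function on each group and can return a scalar, a Series, or a DataFrame."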
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 33,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Transform operations:\n",
|
|
"Sample with transform columns:\n",
|
|
" Product Sales Product_Avg_Sales Sales_vs_Product_Avg\n",
|
|
"0 Monitor 1068 967.734694 100.265306\n",
|
|
"1 Headphones 918 1000.888889 -82.888889\n",
|
|
"2 Tablet 1133 938.135135 194.864865\n",
|
|
"3 Headphones 1340 1000.888889 339.111111\n",
|
|
"4 Headphones 1150 1000.888889 149.111111\n",
|
|
"\n",
|
|
"Ranking within groups:\n",
|
|
" Product Sales Sales_Rank_in_Product\n",
|
|
"0 Monitor 1068 16.5\n",
|
|
"1 Headphones 918 24.0\n",
|
|
"2 Tablet 1133 12.0\n",
|
|
"3 Headphones 1340 6.0\n",
|
|
"4 Headphones 1150 12.0\n",
|
|
"5 Phone 1318 7.5\n",
|
|
"6 Tablet 799 24.0\n",
|
|
"7 Tablet 739 27.0\n",
|
|
"8 Tablet 836 22.0\n",
|
|
"9 Headphones 619 32.0\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"# Transform operations - return same size as original\n",
|
|
"print(\"Transform operations:\")\n",
|
|
"\n",
|
|
"# Add group statistics as new columns\n",
|
|
"df_transformed = df_sales.copy()\n",
|
|
"df_transformed['Product_Avg_Sales'] = df_sales.groupby('Product')['Sales'].transform('mean')\n",
|
|
"df_transformed['Region_Total_Sales'] = df_sales.groupby('Region')['Sales'].transform('sum')\n",
|
|
"df_transformed['Sales_vs_Product_Avg'] = df_transformed['Sales'] - df_transformed['Product_Avg_Sales']\n",
|
|
"\n",
|
|
"print(\"Sample with transform columns:\")\n",
|
|
"print(df_transformed[['Product', 'Sales', 'Product_Avg_Sales', 'Sales_vs_Product_Avg']].head())\n",
|
|
"\n",
|
|
"print(\"\\nRanking within groups:\")\n",
|
|
"df_transformed['Sales_Rank_in_Product'] = df_sales.groupby('Product')['Sales'].rank(ascending=False)\n",
|
|
"print(df_transformed[['Product', 'Sales', 'Sales_Rank_in_Product']].head(10))"
|
|
]
|
|
},
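Because `.transform()` returns one value per original row, its result can be combined with the source column directly. A common use is expressing each row as a share of its group total. A minimal sketch on hypothetical toy data (the column name `Share_of_Region` is illustrative):

```python
import pandas as pd

# Hypothetical toy data standing in for df_sales
toy = pd.DataFrame({
    "Region": ["East", "East", "West", "West", "West"],
    "Sales": [100, 300, 50, 50, 100],
})

# Group total broadcast back to every row of the group
region_total = toy.groupby("Region")["Sales"].transform("sum")
toy["Share_of_Region"] = toy["Sales"] / region_total
print(toy)
```

Within each region the shares add up to 1, which is a quick sanity check on the grouping.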
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 34,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Apply operations:\n",
|
|
" total_sales avg_sales num_transactions top_salesperson \\\n",
|
|
"Product \n",
|
|
"Headphones 36032 1000.89 36 Diana \n",
|
|
"Laptop 45296 1053.40 43 Eve \n",
|
|
"Monitor 47419 967.73 49 Alice \n",
|
|
"Phone 36847 1052.77 35 Bob \n",
|
|
"Tablet 34711 938.14 37 Bob \n",
|
|
"\n",
|
|
" sales_per_quantity \n",
|
|
"Product \n",
|
|
"Headphones 375.94 \n",
|
|
"Laptop 345.93 \n",
|
|
"Monitor 257.01 \n",
|
|
"Phone 373.06 \n",
|
|
"Tablet 313.06 \n",
|
|
"\n",
|
|
"Top performing sale in each region:\n",
|
|
" Product Sales Salesperson\n",
|
|
"Region \n",
|
|
"East Laptop 1585 Charlie\n",
|
|
"North Laptop 1976 Eve\n",
|
|
"South Laptop 1761 Charlie\n",
|
|
"West Headphones 1607 Diana\n"
|
|
]
|
|
},
|
|
{
|
|
"name": "stderr",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"/var/folders/cd/drz_5fvd69ddxy7rzw4p3zx80000gn/T/ipykernel_61733/3089764804.py:14: DeprecationWarning: DataFrameGroupBy.apply operated on the grouping columns. This behavior is deprecated, and in a future version of pandas the grouping columns will be excluded from the operation. Either pass `include_groups=False` to exclude the groupings or explicitly select the grouping columns after groupby to silence this warning.\n",
|
|
" apply_result = df_sales.groupby('Product').apply(group_summary).round(2)\n",
|
|
"/var/folders/cd/drz_5fvd69ddxy7rzw4p3zx80000gn/T/ipykernel_61733/3089764804.py:18: DeprecationWarning: DataFrameGroupBy.apply operated on the grouping columns. This behavior is deprecated, and in a future version of pandas the grouping columns will be excluded from the operation. Either pass `include_groups=False` to exclude the groupings or explicitly select the grouping columns after groupby to silence this warning.\n",
|
|
" top_sales_by_region = df_sales.groupby('Region').apply(lambda x: x.loc[x['Sales'].idxmax()])\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"# Apply operations - can return different structures\n",
|
|
"print(\"Apply operations:\")\n",
|
|
"\n",
|
|
"def group_summary(group):\n",
|
|
" \"\"\"Return a summary Series for each group\"\"\"\n",
|
|
" return pd.Series({\n",
|
|
" 'total_sales': group['Sales'].sum(),\n",
|
|
" 'avg_sales': group['Sales'].mean(),\n",
|
|
" 'num_transactions': len(group),\n",
|
|
" 'top_salesperson': group.loc[group['Sales'].idxmax(), 'Salesperson'],\n",
|
|
" 'sales_per_quantity': (group['Sales'] / group['Quantity']).mean()\n",
|
|
" })\n",
|
|
"\n",
|
|
"apply_result = df_sales.groupby('Product').apply(group_summary).round(2)\n",
|
|
"print(apply_result)\n",
|
|
"\n",
|
|
"print(\"\\nTop performing sale in each region:\")\n",
|
|
"top_sales_by_region = df_sales.groupby('Region').apply(lambda x: x.loc[x['Sales'].idxmax()])\n",
|
|
"print(top_sales_by_region[['Product', 'Sales', 'Salesperson']])"
|
|
]
|
|
},
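The `DeprecationWarning` in the output above appears because `.apply()` was handed the grouping column as part of each group. One way to avoid it, which the warning text itself suggests, is to select the columns the function needs right after `groupby`. A minimal sketch on hypothetical toy data, with a shortened version of `group_summary`:

```python
import pandas as pd

def group_summary(group):
    """Return a small summary Series for one group."""
    return pd.Series({
        "total_sales": group["Sales"].sum(),
        "top_salesperson": group.loc[group["Sales"].idxmax(), "Salesperson"],
    })

# Hypothetical toy data standing in for df_sales
toy = pd.DataFrame({
    "Product": ["Laptop", "Laptop", "Phone", "Phone"],
    "Sales": [1000, 1500, 700, 600],
    "Salesperson": ["Alice", "Bob", "Eve", "Diana"],
})

# Explicit column selection keeps the grouping column out of each group's frame
summary = toy.groupby("Product")[["Sales", "Salesperson"]].apply(group_summary)
print(summary)
```

Recent pandas versions also accept `include_groups=False` in `.apply()`, as the warning mentions; column selection is used here because it does not depend on that argument being available.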
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## 6. Filtering Groups\n",
|
|
"\n",
|
|
"`.filter()` keeps or drops whole groups: it returns every row of each group for which the supplied function returns `True`."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 35,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Groups with more than 30 transactions:\n",
|
|
"Original data: 200 rows\n",
|
|
"Filtered data: 200 rows\n",
|
|
"\n",
|
|
"Product transaction counts in filtered data:\n",
|
|
"Product\n",
|
|
"Monitor 49\n",
|
|
"Laptop 43\n",
|
|
"Tablet 37\n",
|
|
"Headphones 36\n",
|
|
"Phone 35\n",
|
|
"Name: count, dtype: int64\n",
|
|
"\n",
|
|
"Groups with average sales > $1000:\n",
|
|
"High-value products:\n",
|
|
"Product\n",
|
|
"Headphones 1000.89\n",
|
|
"Laptop 1053.40\n",
|
|
"Phone 1052.77\n",
|
|
"Name: Sales, dtype: float64\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"# Filter groups based on group characteristics\n",
|
|
"print(\"Groups with more than 30 transactions:\")\n",
|
|
"active_products = df_sales.groupby('Product').filter(lambda x: len(x) > 30)\n",
|
|
"print(f\"Original data: {len(df_sales)} rows\")\n",
|
|
"print(f\"Filtered data: {len(active_products)} rows\")\n",
|
|
"print(\"\\nProduct transaction counts in filtered data:\")\n",
|
|
"print(active_products['Product'].value_counts())\n",
|
|
"\n",
|
|
"print(\"\\nGroups with average sales > $1000:\")\n",
|
|
"high_value_products = df_sales.groupby('Product').filter(lambda x: x['Sales'].mean() > 1000)\n",
|
|
"print(\"High-value products:\")\n",
|
|
"print(high_value_products.groupby('Product')['Sales'].mean().round(2))"
|
|
]
|
|
},
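A group-level condition can also be expressed as a boolean mask built with `.transform()`, which avoids calling a Python lambda once per group. The sketch below, on hypothetical toy data, shows that both routes select the same rows for the "average sales above 1000" condition.

```python
import pandas as pd

# Hypothetical toy data standing in for df_sales
toy = pd.DataFrame({
    "Product": ["Laptop", "Laptop", "Phone", "Phone", "Tablet"],
    "Sales": [1100, 1300, 800, 900, 1500],
})

# Route 1: filter() with a per-group function
via_filter = toy.groupby("Product").filter(lambda g: g["Sales"].mean() > 1000)

# Route 2: broadcast the group mean to each row, then mask
mask = toy.groupby("Product")["Sales"].transform("mean") > 1000
via_mask = toy[mask]

print(via_filter)
print(via_mask)
```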
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 36,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Salespeople with consistent performance:\n",
|
|
"Consistent performers analysis:\n",
|
|
" Count Mean Std CV\n",
|
|
"Salesperson \n",
|
|
"Alice 35 956.229 288.455 0.302\n",
|
|
"Bob 36 1011.861 314.322 0.311\n",
|
|
"Charlie 37 1068.351 329.602 0.309\n",
|
|
"Diana 29 996.759 325.836 0.327\n",
|
|
"Eve 34 1033.353 323.720 0.313\n",
|
|
"Frank 29 925.552 296.002 0.320\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"# Complex filtering conditions\n",
|
|
"print(\"Salespeople with consistent performance:\")\n",
|
|
"# Filter salespeople with at least 20 sales and CV < 0.5\n",
|
|
"consistent_performers = df_sales.groupby('Salesperson').filter(\n",
|
|
" lambda x: len(x) >= 20 and (x['Sales'].std() / x['Sales'].mean()) < 0.5\n",
|
|
")\n",
|
|
"\n",
|
|
"if len(consistent_performers) > 0:\n",
|
|
" print(\"Consistent performers analysis:\")\n",
|
|
" consistency_analysis = consistent_performers.groupby('Salesperson')['Sales'].agg([\n",
|
|
" 'count', 'mean', 'std', lambda x: x.std()/x.mean()\n",
|
|
" ]).round(3)\n",
|
|
" consistency_analysis.columns = ['Count', 'Mean', 'Std', 'CV']\n",
|
|
" print(consistency_analysis)\n",
|
|
"else:\n",
|
|
" print(\"No salespeople meet the consistency criteria\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## 7. Advanced Grouping Techniques\n",
|
|
"\n",
|
|
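"Grouping keys do not have to be stored columns: they can be derived, such as numeric ranges binned with `pd.cut()` or calendar fields extracted from a datetime column."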
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 37,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Grouping by sales value ranges:\n",
|
|
" Sales Quantity Commission\n",
|
|
" count mean sum sum sum\n",
|
|
"Sales_Category \n",
|
|
"Low 8 330.88 2647 54 272.61\n",
|
|
"Medium 92 788.87 72576 478 8110.68\n",
|
|
"High 89 1202.89 107057 423 11652.60\n",
|
|
"Very High 11 1638.64 18025 51 1886.35\n",
|
|
"\n",
|
|
"Product distribution across sales categories:\n",
|
|
"Product Headphones Laptop Monitor Phone Tablet\n",
|
|
"Sales_Category \n",
|
|
"Low 1 2 1 2 2\n",
|
|
"Medium 18 17 26 12 19\n",
|
|
"High 16 21 21 17 14\n",
|
|
"Very High 1 3 1 4 2\n"
|
|
]
|
|
},
|
|
{
|
|
"name": "stderr",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"/var/folders/cd/drz_5fvd69ddxy7rzw4p3zx80000gn/T/ipykernel_61733/1556724418.py:8: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.\n",
|
|
" sales_category_analysis = df_sales.groupby('Sales_Category').agg({\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"# Groupby with categorical cuts\n",
|
|
"print(\"Grouping by sales value ranges:\")\n",
|
|
"# Create sales categories\n",
|
|
"df_sales['Sales_Category'] = pd.cut(df_sales['Sales'], \n",
|
|
" bins=[0, 500, 1000, 1500, float('inf')],\n",
|
|
" labels=['Low', 'Medium', 'High', 'Very High'])\n",
|
|
"\n",
|
|
"sales_category_analysis = df_sales.groupby('Sales_Category').agg({\n",
|
|
" 'Sales': ['count', 'mean', 'sum'],\n",
|
|
" 'Quantity': 'sum',\n",
|
|
" 'Commission': 'sum'\n",
|
|
"}).round(2)\n",
|
|
"print(sales_category_analysis)\n",
|
|
"\n",
|
|
"print(\"\\nProduct distribution across sales categories:\")\n",
|
|
"category_product_cross = pd.crosstab(df_sales['Sales_Category'], df_sales['Product'])\n",
|
|
"print(category_product_cross)"
|
|
]
|
|
},
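The `FutureWarning` in the output above concerns the `observed` argument, which only matters when grouping by a categorical column such as the one `pd.cut()` produces. Passing it explicitly silences the warning and makes the intent clear. A minimal sketch on hypothetical toy data in which the "Very High" bin is empty:

```python
import pandas as pd

# Hypothetical toy data; no value falls in the "Very High" bin
toy = pd.DataFrame({"Sales": [200, 700, 800, 1200]})
toy["Sales_Category"] = pd.cut(
    toy["Sales"],
    bins=[0, 500, 1000, 1500, float("inf")],
    labels=["Low", "Medium", "High", "Very High"],
)

# observed=False keeps every category, including empty ones
all_bins = toy.groupby("Sales_Category", observed=False)["Sales"].count()
# observed=True keeps only categories that actually occur
seen_bins = toy.groupby("Sales_Category", observed=True)["Sales"].count()

print(all_bins)
print(seen_bins)
```

In the lesson's dataset every bin is populated, so both settings give the same rows there; the difference only shows up when a bin is empty.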
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 38,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Weekly sales analysis:\n",
|
|
" Sales Product Salesperson\n",
|
|
" sum mean count <lambda> nunique\n",
|
|
"Week \n",
|
|
"1 7726 1103.71 7 Headphones 4\n",
|
|
"2 6078 868.29 7 Tablet 4\n",
|
|
"3 7281 1040.14 7 Monitor 5\n",
|
|
"4 6867 981.00 7 Laptop 4\n",
|
|
"5 7285 1040.71 7 Monitor 4\n",
|
|
"6 7994 1142.00 7 Headphones 4\n",
|
|
"7 6652 950.29 7 Phone 3\n",
|
|
"8 7125 1017.86 7 Monitor 4\n",
|
|
"9 6293 899.00 7 Monitor 5\n",
|
|
"10 7755 1107.86 7 Phone 4\n",
|
|
"\n",
|
|
"Day of week analysis:\n",
|
|
" count mean sum\n",
|
|
"DayOfWeek \n",
|
|
"Monday 29 1017.90 29519\n",
|
|
"Tuesday 29 1003.07 29089\n",
|
|
"Wednesday 29 956.72 27745\n",
|
|
"Thursday 29 963.62 27945\n",
|
|
"Friday 28 1136.71 31828\n",
|
|
"Saturday 28 1018.32 28513\n",
|
|
"Sunday 28 916.64 25666\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"# Time-based grouping\n",
|
|
"print(\"Weekly sales analysis:\")\n",
|
|
"df_sales['Week'] = df_sales['Date'].dt.isocalendar().week\n",
|
|
"weekly_analysis = df_sales.groupby('Week').agg({\n",
|
|
" 'Sales': ['sum', 'mean', 'count'],\n",
|
|
" 'Product': lambda x: x.mode().iloc[0] if not x.mode().empty else 'None', # Most common product\n",
|
|
" 'Salesperson': 'nunique'\n",
|
|
"}).round(2)\n",
|
|
"print(weekly_analysis.head(10))\n",
|
|
"\n",
|
|
"print(\"\\nDay of week analysis:\")\n",
|
|
"df_sales['DayOfWeek'] = df_sales['Date'].dt.day_name()\n",
|
|
"day_analysis = df_sales.groupby('DayOfWeek')['Sales'].agg(['count', 'mean', 'sum']).round(2)\n",
|
|
"# Reorder by weekday\n",
|
|
"day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']\n",
|
|
"day_analysis = day_analysis.reindex([day for day in day_order if day in day_analysis.index])\n",
|
|
"print(day_analysis)"
|
|
]
|
|
},
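ISO week numbers repeat every year, so grouping on a `Week` column would merge weeks from different years if the data spanned more than one. An alternative is to group on the date itself with `pd.Grouper`. This is a minimal sketch on hypothetical toy data, assuming weekly bins that end on Sunday (the `"W"` frequency):

```python
import pandas as pd

# Hypothetical toy data: 14 consecutive days starting Monday 2024-01-01
toy = pd.DataFrame({
    "Date": pd.date_range("2024-01-01", periods=14, freq="D"),
    "Sales": range(1, 15),
})

# One bin per calendar week, labelled by the Sunday that closes it
weekly = toy.groupby(pd.Grouper(key="Date", freq="W"))["Sales"].sum()
print(weekly)
```

The index of `weekly` holds real timestamps, so it stays unambiguous across year boundaries and sorts chronologically.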
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## 8. Performance Considerations\n",
|
|
"\n",
|
|
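"Computing several statistics in a single `.agg()` call avoids repeating the grouping step for each one. Timings on a dataset this small are noisy, so treat the measured speedup as indicative only."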
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 39,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Large dataset size: 2000 rows\n",
|
|
"Multiple groupby calls: 0.0016 seconds\n",
|
|
"Single groupby with agg: 0.0007 seconds\n",
|
|
"Efficiency gain: 2.44x faster\n",
|
|
"\n",
|
|
"Results are equivalent: True\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"# Efficient groupby operations\n",
|
|
"import time\n",
|
|
"\n",
|
|
"# Create larger dataset for timing comparison\n",
|
|
"large_df = pd.concat([df_sales] * 10, ignore_index=True)\n",
|
|
"print(f\"Large dataset size: {len(large_df)} rows\")\n",
|
|
"\n",
|
|
"# Method 1: Multiple separate groupby calls (less efficient)\n",
|
|
"start_time = time.perf_counter()  # monotonic, high-resolution timer\n",
"result1_sum = large_df.groupby('Product')['Sales'].sum()\n",
"result1_mean = large_df.groupby('Product')['Sales'].mean()\n",
"result1_count = large_df.groupby('Product')['Sales'].count()\n",
"time1 = time.perf_counter() - start_time\n",
"\n",
"# Method 2: Single groupby with agg (more efficient)\n",
"start_time = time.perf_counter()\n",
"result2 = large_df.groupby('Product')['Sales'].agg(['sum', 'mean', 'count'])\n",
"time2 = time.perf_counter() - start_time\n",
|
|
"\n",
|
|
"print(f\"Multiple groupby calls: {time1:.4f} seconds\")\n",
|
|
"print(f\"Single groupby with agg: {time2:.4f} seconds\")\n",
|
|
"print(f\"Efficiency gain: {time1/time2:.2f}x faster\")\n",
|
|
"\n",
|
|
"# Verify results are the same\n",
|
|
"print(f\"\\nResults are equivalent: {result1_sum.equals(result2['sum'])}\")"
|
|
]
|
|
},
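{
"cell_type": "markdown",
"metadata": {},
"source": [
"A single timed run on 2,000 rows is noisy, so the ratio printed above can change from run to run. As a minimal sketch, the standard-library `timeit` module can repeat each approach and keep the best time. The synthetic frame `df`, its size, and the function names are hypothetical and chosen only for illustration.\n",
"\n",
"```python\n",
"import timeit\n",
"\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Hypothetical synthetic data; size and column values are illustrative\n",
"rng = np.random.default_rng(0)\n",
"df = pd.DataFrame({\n",
"    'Product': rng.choice(['Laptop', 'Phone', 'Tablet'], size=20_000),\n",
"    'Sales': rng.integers(100, 2000, size=20_000),\n",
"})\n",
"\n",
"def separate_calls():\n",
"    # Three independent groupby passes, one per statistic\n",
"    grouped_sum = df.groupby('Product')['Sales'].sum()\n",
"    grouped_mean = df.groupby('Product')['Sales'].mean()\n",
"    grouped_count = df.groupby('Product')['Sales'].count()\n",
"    return grouped_sum, grouped_mean, grouped_count\n",
"\n",
"def single_agg():\n",
"    # One groupby pass computing all three statistics\n",
"    return df.groupby('Product')['Sales'].agg(['sum', 'mean', 'count'])\n",
"\n",
"# Each measurement runs the function 20 times; keep the best of 5 measurements\n",
"t_separate = min(timeit.repeat(separate_calls, number=20, repeat=5))\n",
"t_single = min(timeit.repeat(single_agg, number=20, repeat=5))\n",
"\n",
"print(f'Separate calls: {t_separate:.4f} s per 20 runs')\n",
"print(f'Single agg:     {t_single:.4f} s per 20 runs')\n",
"print(f'Ratio (separate / single): {t_separate / t_single:.2f}')\n",
"```\n",
"\n",
"Taking the minimum over repeats reduces the influence of background activity. The ratio still depends on data size, dtypes, and the pandas version."
]
},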
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Practice Exercises\n",
|
|
"\n",
|
|
"Apply your grouping and aggregation skills:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 40,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Exercise 1: Sales Performance Analysis\n",
|
|
"# Create a comprehensive sales performance report that includes:\n",
|
|
"# - Total and average sales by salesperson and region\n",
|
|
"# - Commission earned by each salesperson\n",
|
|
"# - Performance ranking within each region\n",
|
|
"# - Identify top and bottom performers\n",
|
|
"\n",
|
|
"# Your code here:\n",
|
|
"def sales_performance_report(df):\n",
|
|
" \"\"\"Generate comprehensive sales performance report\"\"\"\n",
|
|
" # Your implementation here\n",
|
|
" pass\n",
|
|
"\n",
|
|
"# sales_performance_report(df_sales)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 41,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Exercise 2: Product Analysis\n",
|
|
"# Analyze product performance including:\n",
|
|
"# - Which products are most/least popular (by quantity and sales)\n",
|
|
"# - Seasonal trends for each product\n",
|
|
"# - Regional preferences for different products\n",
|
|
"# - Price consistency across regions (e.g. Sales / Quantity as a rough unit price)\n",
|
|
"\n",
|
|
"# Your code here:\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 42,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Exercise 3: Custom Business Metrics\n",
|
|
"# Create custom aggregation functions to calculate:\n",
|
|
"# - Customer acquisition cost (if you have marketing spend data)\n",
|
|
"# - Sales velocity (sales per day) for each product\n",
|
|
"# - Market share by region\n",
|
|
"# - Performance consistency score\n",
|
|
"\n",
|
|
"# Your code here:\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Key Takeaways\n",
|
|
"\n",
|
|
"1. **GroupBy Basics**: `.groupby()` splits data into groups based on the values of one or more key columns\n",
|
|
"2. **Aggregation Functions**: Use built-in functions (`sum`, `mean`, `count`) or custom functions\n",
|
|
"3. **Multiple Aggregations**: Use `.agg()` with lists or dictionaries for multiple operations\n",
|
|
"4. **Hierarchical Indexing**: Multiple group columns create hierarchical indices\n",
|
|
"5. **Transform vs Apply**: `.transform()` returns a result aligned with the original rows, while `.apply()` can return a result of any shape\n",
|
|
"6. **Filtering Groups**: Use `.filter()` to remove entire groups based on conditions\n",
|
|
"7. **Performance**: A single `.agg()` call is usually faster than repeating `.groupby()` for each statistic (about 2.4x in the timing run above)\n",
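"\n",
"As a minimal sketch of takeaways 5 and 6, the toy frame below (hypothetical data and column names) contrasts `.transform()`, `.filter()`, and `.apply()`:\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical toy data for illustration\n",
"df = pd.DataFrame({\n",
"    'group': ['a', 'a', 'b', 'b', 'b'],\n",
"    'value': [10, 30, 1, 2, 3],\n",
"})\n",
"\n",
"# transform: one result per original row, aligned with df's index\n",
"df['group_mean'] = df.groupby('group')['value'].transform('mean')\n",
"\n",
"# filter: keep or drop whole groups based on a group-level condition\n",
"big_groups = df.groupby('group').filter(lambda g: g['value'].sum() > 10)\n",
"\n",
"# apply: the function may return any shape, here the largest row per group\n",
"top_rows = df.groupby('group')[['value']].apply(lambda g: g.nlargest(1, 'value'))\n",
"\n",
"print(df)\n",
"print(big_groups)\n",
"print(top_rows)\n",
"```\n",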
|
|
"\n",
|
|
"## Common Patterns\n",
|
|
"\n",
|
|
"```python\n",
|
|
"# Basic aggregation\n",
|
|
"df.groupby('column')['value'].sum()\n",
|
|
"\n",
|
|
"# Multiple aggregations\n",
|
|
"df.groupby('column')['value'].agg(['sum', 'mean', 'count'])\n",
|
|
"\n",
|
|
"# Multiple columns and aggregations\n",
|
|
"df.groupby('group_col').agg({\n",
|
|
" 'col1': ['sum', 'mean'],\n",
|
|
" 'col2': 'count'\n",
|
|
"})\n",
|
|
"\n",
|
|
"# Custom aggregation\n",
|
|
"df.groupby('column')['value'].agg(lambda x: x.max() - x.min())\n",
|
|
"\n",
|
|
"# Transform for group statistics\n",
|
|
"df['group_mean'] = df.groupby('group')['value'].transform('mean')\n",
|
|
"```"
|
|
]
|
|
}
|
|
],
|
|
"metadata": {
|
|
"kernelspec": {
|
|
"display_name": "venv",
|
|
"language": "python",
|
|
"name": "python3"
|
|
},
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 3
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython3",
|
|
"version": "3.13.3"
|
|
}
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 4
|
|
}
|