{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Session 1 - DataFrames - Lesson 4: Grouping and Aggregation\n",
"\n",
"## Learning Objectives\n",
"- Master the `.groupby()` operation for data aggregation\n",
"- Learn different aggregation functions and methods\n",
"- Understand multi-level grouping and hierarchical indexing\n",
"- Practice custom aggregation functions\n",
"- Explore advanced grouping techniques\n",
"\n",
"## Prerequisites\n",
"- Completed Lessons 1-3\n",
"- Understanding of basic statistical concepts (mean, sum, count, etc.)"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Dataset created:\n",
"Shape: (200, 11)\n",
"\n",
"First few rows:\n",
" Date Product Category Sales Quantity Region Salesperson \\\n",
"0 2024-01-01 Monitor Accessories 1068 6 West Diana \n",
"1 2024-01-02 Headphones Electronics 918 1 East Alice \n",
"2 2024-01-03 Tablet Accessories 1133 5 North Diana \n",
"3 2024-01-04 Headphones Electronics 1340 9 West Bob \n",
"4 2024-01-05 Headphones Electronics 1150 2 North Eve \n",
"\n",
" Commission_Rate Commission Month Quarter \n",
"0 0.15 160.20 1 1 \n",
"1 0.12 110.16 1 1 \n",
"2 0.08 90.64 1 1 \n",
"3 0.08 107.20 1 1 \n",
"4 0.12 138.00 1 1 \n"
]
}
],
"source": [
"# Import required libraries\n",
"import pandas as pd\n",
"import numpy as np\n",
"from datetime import datetime, timedelta\n",
"\n",
"# Create comprehensive sample dataset\n",
"np.random.seed(42)\n",
"n_records = 200\n",
"\n",
"sales_data = {\n",
" 'Date': pd.date_range('2024-01-01', periods=n_records, freq='D'),\n",
" 'Product': np.random.choice(['Laptop', 'Phone', 'Tablet', 'Monitor', 'Headphones'], n_records),\n",
" 'Category': np.random.choice(['Electronics', 'Accessories'], n_records, p=[0.8, 0.2]),\n",
" 'Sales': np.random.normal(1000, 300, n_records).astype(int),\n",
" 'Quantity': np.random.randint(1, 10, n_records),\n",
" 'Region': np.random.choice(['North', 'South', 'East', 'West'], n_records),\n",
" 'Salesperson': np.random.choice(['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank'], n_records),\n",
" 'Commission_Rate': np.random.choice([0.08, 0.10, 0.12, 0.15], n_records)\n",
"}\n",
"\n",
"df_sales = pd.DataFrame(sales_data)\n",
"df_sales['Sales'] = np.abs(df_sales['Sales']) # Ensure positive values\n",
"df_sales['Commission'] = df_sales['Sales'] * df_sales['Commission_Rate']\n",
"df_sales['Month'] = df_sales['Date'].dt.month\n",
"df_sales['Quarter'] = df_sales['Date'].dt.quarter\n",
"\n",
"print(\"Dataset created:\")\n",
"print(f\"Shape: {df_sales.shape}\")\n",
"print(\"\\nFirst few rows:\")\n",
"print(df_sales.head())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Basic GroupBy Operations\n",
"\n",
"Understanding the fundamentals of grouping data."
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Total sales by product:\n",
"Product\n",
"Headphones 36032\n",
"Laptop 45296\n",
"Monitor 47419\n",
"Phone 36847\n",
"Tablet 34711\n",
"Name: Sales, dtype: int64\n",
"\n",
"Type: <class 'pandas.core.series.Series'>\n",
"\n",
"Average sales by region:\n",
"Region\n",
"East 1030.52\n",
"North 1007.14\n",
"South 966.86\n",
"West 999.78\n",
"Name: Sales, dtype: float64\n"
]
}
],
"source": [
"# Simple groupby with single aggregation\n",
"print(\"Total sales by product:\")\n",
"product_sales = df_sales.groupby('Product')['Sales'].sum()\n",
"print(product_sales)\n",
"print(f\"\\nType: {type(product_sales)}\")\n",
"\n",
"print(\"\\nAverage sales by region:\")\n",
"region_avg = df_sales.groupby('Region')['Sales'].mean().round(2)\n",
"print(region_avg)"
]
},
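{
"cell_type": "markdown",
"metadata": {},
"source": [
"A `GroupBy` object can also be iterated directly, yielding `(key, sub-DataFrame)` pairs. The sketch below is an illustrative addition (its output is not captured in this notebook):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: iterating over a GroupBy yields (key, sub-DataFrame) pairs\n",
"for product, group in df_sales.groupby('Product'):\n",
"    print(f\"{product}: {len(group)} rows, total sales {group['Sales'].sum()}\")\n",
"    break  # show only the first group"
]
},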
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Multiple statistics for sales by product:\n",
" count sum mean std\n",
"Product \n",
"Headphones 36 36032 1000.89 298.06\n",
"Laptop 43 45296 1053.40 361.78\n",
"Monitor 49 47419 967.73 270.57\n",
"Phone 35 36847 1052.77 323.17\n",
"Tablet 37 34711 938.14 309.20\n",
"\n",
"With custom column names:\n",
" Count Total_Sales Average_Sales Std_Dev\n",
"Product \n",
"Headphones 36 36032 1000.89 298.06\n",
"Laptop 43 45296 1053.40 361.78\n",
"Monitor 49 47419 967.73 270.57\n",
"Phone 35 36847 1052.77 323.17\n",
"Tablet 37 34711 938.14 309.20\n"
]
}
],
"source": [
"# Multiple aggregations on the same column\n",
"print(\"Multiple statistics for sales by product:\")\n",
"product_stats = df_sales.groupby('Product')['Sales'].agg(['count', 'sum', 'mean', 'std']).round(2)\n",
"print(product_stats)\n",
"\n",
"print(\"\\nWith custom column names:\")\n",
"product_stats_named = df_sales.groupby('Product')['Sales'].agg([\n",
" ('Count', 'count'),\n",
" ('Total_Sales', 'sum'),\n",
" ('Average_Sales', 'mean'),\n",
" ('Std_Dev', 'std')\n",
"]).round(2)\n",
"print(product_stats_named)"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Aggregating multiple columns:\n",
" Sales Quantity Commission \n",
" sum mean count sum mean sum mean\n",
"Product \n",
"Headphones 36032 1000.89 36 178 4.94 4004.08 111.22\n",
"Laptop 45296 1053.40 43 219 5.09 5018.59 116.71\n",
"Monitor 47419 967.73 49 253 5.16 5078.17 103.64\n",
"Phone 36847 1052.77 35 162 4.63 4121.58 117.76\n",
"Tablet 34711 938.14 37 194 5.24 3699.82 100.00\n",
"\n",
"Flattened column names:\n",
" Sales_sum Sales_mean Sales_count Quantity_sum Quantity_mean \\\n",
"Product \n",
"Headphones 36032 1000.89 36 178 4.94 \n",
"Laptop 45296 1053.40 43 219 5.09 \n",
"Monitor 47419 967.73 49 253 5.16 \n",
"Phone 36847 1052.77 35 162 4.63 \n",
"Tablet 34711 938.14 37 194 5.24 \n",
"\n",
" Commission_sum Commission_mean \n",
"Product \n",
"Headphones 4004.08 111.22 \n",
"Laptop 5018.59 116.71 \n",
"Monitor 5078.17 103.64 \n",
"Phone 4121.58 117.76 \n",
"Tablet 3699.82 100.00 \n"
]
}
],
"source": [
"# Groupby with multiple columns and aggregations\n",
"print(\"Aggregating multiple columns:\")\n",
"multi_agg = df_sales.groupby('Product').agg({\n",
" 'Sales': ['sum', 'mean', 'count'],\n",
" 'Quantity': ['sum', 'mean'],\n",
" 'Commission': ['sum', 'mean']\n",
"}).round(2)\n",
"print(multi_agg)\n",
"\n",
"print(\"\\nFlattened column names:\")\n",
"multi_agg_flat = df_sales.groupby('Product').agg({\n",
" 'Sales': ['sum', 'mean', 'count'],\n",
" 'Quantity': ['sum', 'mean'],\n",
" 'Commission': ['sum', 'mean']\n",
"}).round(2)\n",
"multi_agg_flat.columns = ['_'.join(col).strip() for col in multi_agg_flat.columns.values]\n",
"print(multi_agg_flat.head())"
]
},
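{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an alternative to flattening a `MultiIndex` afterwards, named aggregation (pandas >= 0.25) produces flat, readable column names in a single step. A minimal sketch:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: named aggregation gives flat column names directly\n",
"named = df_sales.groupby('Product').agg(\n",
"    Total_Sales=('Sales', 'sum'),\n",
"    Avg_Sales=('Sales', 'mean'),\n",
"    Units=('Quantity', 'sum')\n",
").round(2)\n",
"print(named)"
]
},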
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Multiple Group Columns\n",
"\n",
"Grouping by multiple categorical variables."
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Sales by Region and Product:\n",
"Region Product \n",
"East Headphones 9791\n",
" Laptop 17001\n",
" Monitor 11728\n",
" Phone 6514\n",
" Tablet 10614\n",
"North Headphones 11527\n",
" Laptop 6514\n",
" Monitor 11273\n",
" Phone 13293\n",
" Tablet 7750\n",
"South Headphones 7131\n",
" Laptop 13003\n",
" Monitor 12007\n",
" Phone 10115\n",
" Tablet 7054\n",
"West Headphones 7583\n",
" Laptop 8778\n",
" Monitor 12411\n",
" Phone 6925\n",
" Tablet 9293\n",
"Name: Sales, dtype: int64\n",
"\n",
"As DataFrame with reset_index():\n",
" Region Product Sales\n",
"0 East Headphones 9791\n",
"1 East Laptop 17001\n",
"2 East Monitor 11728\n",
"3 East Phone 6514\n",
"4 East Tablet 10614\n",
"5 North Headphones 11527\n",
"6 North Laptop 6514\n",
"7 North Monitor 11273\n",
"8 North Phone 13293\n",
"9 North Tablet 7750\n"
]
}
],
"source": [
"# Group by multiple columns\n",
"print(\"Sales by Region and Product:\")\n",
"region_product = df_sales.groupby(['Region', 'Product'])['Sales'].sum().round(2)\n",
"print(region_product)\n",
"\n",
"print(\"\\nAs DataFrame with reset_index():\")\n",
"region_product_df = df_sales.groupby(['Region', 'Product'])['Sales'].sum().reset_index()\n",
"print(region_product_df.head(10))"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Hierarchical indexing example:\n",
"First 15 entries:\n",
"Region Product Month\n",
"East Headphones 1 2287\n",
" 3 1194\n",
" 4 985\n",
" 5 2030\n",
" 6 883\n",
" 7 2412\n",
" Laptop 1 1585\n",
" 2 3151\n",
" 3 4563\n",
" 4 2966\n",
" 5 919\n",
" 6 2504\n",
" 7 1313\n",
" Monitor 1 4583\n",
" 2 536\n",
"Name: Sales, dtype: int64\n",
"\n",
"Accessing specific groups:\n",
"North region, Laptop sales by month:\n",
"Month\n",
"2 1976\n",
"3 1141\n",
"4 1342\n",
"5 43\n",
"6 844\n",
"7 1168\n",
"Name: Sales, dtype: int64\n",
"\n",
"All North region sales:\n",
"Product Month\n",
"Headphones 1 1769\n",
" 2 1080\n",
" 3 2884\n",
" 4 1460\n",
" 5 4334\n",
"Name: Sales, dtype: int64\n"
]
}
],
"source": [
"# Working with hierarchical index\n",
"print(\"Hierarchical indexing example:\")\n",
"hierarchy = df_sales.groupby(['Region', 'Product', 'Month'])['Sales'].sum()\n",
"print(\"First 15 entries:\")\n",
"print(hierarchy.head(15))\n",
"\n",
"print(\"\\nAccessing specific groups:\")\n",
"print(\"North region, Laptop sales by month:\")\n",
"try:\n",
" north_laptops = hierarchy.loc[('North', 'Laptop')]\n",
" print(north_laptops)\n",
"except KeyError:\n",
" print(\"No data available for North region Laptops\")\n",
"\n",
"print(\"\\nAll North region sales:\")\n",
"try:\n",
" north_all = hierarchy.loc['North']\n",
" print(north_all.head())\n",
"except KeyError:\n",
" print(\"No data available for North region\")"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Unstacking hierarchical data:\n",
"Product Headphones Laptop Monitor Phone Tablet\n",
"Region \n",
"East 9791 17001 11728 6514 10614\n",
"North 11527 6514 11273 13293 7750\n",
"South 7131 13003 12007 10115 7054\n",
"West 7583 8778 12411 6925 9293\n",
"\n",
"Unstacking different levels:\n",
"Region East North South West\n",
"Product \n",
"Headphones 9791 11527 7131 7583\n",
"Laptop 17001 6514 13003 8778\n",
"Monitor 11728 11273 12007 12411\n",
"Phone 6514 13293 10115 6925\n",
"Tablet 10614 7750 7054 9293\n"
]
}
],
"source": [
"# Unstacking hierarchical data\n",
"print(\"Unstacking hierarchical data:\")\n",
"region_product_pivot = df_sales.groupby(['Region', 'Product'])['Sales'].sum().unstack(fill_value=0)\n",
"print(region_product_pivot)\n",
"\n",
"print(\"\\nUnstacking different levels:\")\n",
"product_region_pivot = df_sales.groupby(['Product', 'Region'])['Sales'].sum().unstack(fill_value=0)\n",
"print(product_region_pivot)"
]
},
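{
"cell_type": "markdown",
"metadata": {},
"source": [
"The same Region-by-Product table can be built in one call with `pd.pivot_table`, which combines the groupby and the unstack:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: pivot_table is equivalent to groupby(...).sum().unstack()\n",
"pivot = pd.pivot_table(df_sales, values='Sales', index='Region',\n",
"                       columns='Product', aggfunc='sum', fill_value=0)\n",
"print(pivot)"
]
},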
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Common Aggregation Functions\n",
"\n",
"Explore the most useful aggregation functions."
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Comprehensive statistics by salesperson:\n",
" Count Total Mean Median Std Min Max Q25 Q75\n",
"Salesperson \n",
"Alice 35 33468 956.23 929.0 288.45 298 1588 841.00 1089.50\n",
"Bob 36 36427 1011.86 1050.0 314.32 230 1702 802.25 1196.25\n",
"Charlie 37 39529 1068.35 1070.0 329.60 539 1761 806.00 1313.00\n",
"Diana 29 28906 996.76 1068.0 325.84 43 1607 831.00 1179.00\n",
"Eve 34 35134 1033.35 1046.0 323.72 519 1976 775.00 1159.25\n",
"Frank 29 26841 925.55 904.0 296.00 381 1477 745.00 1145.00\n"
]
}
],
"source": [
"# Comprehensive aggregation example\n",
"print(\"Comprehensive statistics by salesperson:\")\n",
"salesperson_stats = df_sales.groupby('Salesperson')['Sales'].agg([\n",
" 'count', # Number of sales\n",
" 'sum', # Total sales\n",
" 'mean', # Average sale\n",
" 'median', # Median sale\n",
" 'std', # Standard deviation\n",
" 'min', # Minimum sale\n",
" 'max', # Maximum sale\n",
" lambda x: x.quantile(0.25), # 25th percentile\n",
" lambda x: x.quantile(0.75) # 75th percentile\n",
"]).round(2)\n",
"\n",
"# Rename lambda columns\n",
"salesperson_stats.columns = ['Count', 'Total', 'Mean', 'Median', 'Std', 'Min', 'Max', 'Q25', 'Q75']\n",
"print(salesperson_stats)"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Monthly sales trends:\n",
" Sales Quantity Commission\n",
" sum mean count sum sum\n",
"Month \n",
"1 31482 1015.55 31 157 3324.78\n",
"2 29854 1029.45 29 153 3437.00\n",
"3 28500 919.35 31 173 3242.74\n",
"4 27043 901.43 30 124 2973.03\n",
"5 31530 1017.10 31 166 3351.57\n",
"6 33686 1122.87 30 147 3770.67\n",
"7 18210 1011.67 18 86 1822.45\n",
"\n",
"Quarterly performance:\n",
" Sales Quantity Salesperson\n",
" sum mean sum nunique\n",
"Quarter \n",
"1 89836 987.21 483 6\n",
"2 92259 1013.84 437 6\n",
"3 18210 1011.67 86 6\n"
]
}
],
"source": [
"# Date-based aggregations\n",
"print(\"Monthly sales trends:\")\n",
"monthly_sales = df_sales.groupby('Month').agg({\n",
" 'Sales': ['sum', 'mean', 'count'],\n",
" 'Quantity': 'sum',\n",
" 'Commission': 'sum'\n",
"}).round(2)\n",
"print(monthly_sales)\n",
"\n",
"print(\"\\nQuarterly performance:\")\n",
"quarterly_sales = df_sales.groupby('Quarter').agg({\n",
" 'Sales': ['sum', 'mean'],\n",
" 'Quantity': 'sum',\n",
" 'Salesperson': 'nunique' # Number of unique salespeople\n",
"}).round(2)\n",
"print(quarterly_sales)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Custom Aggregation Functions\n",
"\n",
"Create your own aggregation functions for specific business logic."
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Custom aggregations by product:\n",
" Mean Std_Dev Range High_Value_Count CV\n",
"Product \n",
"Headphones 1000.889 298.055 1309 8 0.298\n",
"Laptop 1053.395 361.778 1933 14 0.343\n",
"Monitor 967.735 270.570 1151 8 0.280\n",
"Phone 1052.771 323.173 1321 11 0.307\n",
"Tablet 938.135 309.205 1314 6 0.330\n"
]
}
],
"source": [
"# Custom aggregation functions\n",
"def sales_range(series):\n",
" \"\"\"Calculate the range of sales values\"\"\"\n",
" return series.max() - series.min()\n",
"\n",
"def high_value_count(series, threshold=1200):\n",
" \"\"\"Count sales above a threshold\"\"\"\n",
" return (series > threshold).sum()\n",
"\n",
"def coefficient_of_variation(series):\n",
" \"\"\"Calculate coefficient of variation (std/mean)\"\"\"\n",
" return series.std() / series.mean() if series.mean() != 0 else 0\n",
"\n",
"print(\"Custom aggregations by product:\")\n",
"custom_agg = df_sales.groupby('Product')['Sales'].agg([\n",
" 'mean',\n",
" 'std',\n",
" sales_range,\n",
" high_value_count,\n",
" coefficient_of_variation\n",
"]).round(3)\n",
"\n",
"custom_agg.columns = ['Mean', 'Std_Dev', 'Range', 'High_Value_Count', 'CV']\n",
"print(custom_agg)"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Lambda function aggregations:\n",
" Total Average Top_10_Percent Above_Average_Count \\\n",
"Region \n",
"East 55648 1030.519 1416.6 25 \n",
"North 50357 1007.140 1342.1 29 \n",
"South 49310 966.863 1346.0 26 \n",
"West 44990 999.778 1469.6 19 \n",
"\n",
" Sales_Concentration \n",
"Region \n",
"East 0.134 \n",
"North 0.159 \n",
"South 0.160 \n",
"West 0.171 \n"
]
}
],
"source": [
"# Lambda functions for quick custom aggregations\n",
"print(\"Lambda function aggregations:\")\n",
"lambda_agg = df_sales.groupby('Region')['Sales'].agg([\n",
" ('Total', 'sum'),\n",
" ('Average', 'mean'),\n",
" ('Top_10_Percent', lambda x: x.quantile(0.9)),\n",
" ('Above_Average_Count', lambda x: (x > x.mean()).sum()),\n",
"    ('Sales_Concentration', lambda x: x.nlargest(5).sum() / x.sum())  # share of total from the 5 largest sales\n",
"]).round(3)\n",
"print(lambda_agg)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. Transform and Apply Operations\n",
"\n",
"Learn `.transform()` and `.apply()` for more complex group operations."
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Transform operations:\n",
"Sample with transform columns:\n",
" Product Sales Product_Avg_Sales Sales_vs_Product_Avg\n",
"0 Monitor 1068 967.734694 100.265306\n",
"1 Headphones 918 1000.888889 -82.888889\n",
"2 Tablet 1133 938.135135 194.864865\n",
"3 Headphones 1340 1000.888889 339.111111\n",
"4 Headphones 1150 1000.888889 149.111111\n",
"\n",
"Ranking within groups:\n",
" Product Sales Sales_Rank_in_Product\n",
"0 Monitor 1068 16.5\n",
"1 Headphones 918 24.0\n",
"2 Tablet 1133 12.0\n",
"3 Headphones 1340 6.0\n",
"4 Headphones 1150 12.0\n",
"5 Phone 1318 7.5\n",
"6 Tablet 799 24.0\n",
"7 Tablet 739 27.0\n",
"8 Tablet 836 22.0\n",
"9 Headphones 619 32.0\n"
]
}
],
"source": [
"# Transform operations - return same size as original\n",
"print(\"Transform operations:\")\n",
"\n",
"# Add group statistics as new columns\n",
"df_transformed = df_sales.copy()\n",
"df_transformed['Product_Avg_Sales'] = df_sales.groupby('Product')['Sales'].transform('mean')\n",
"df_transformed['Region_Total_Sales'] = df_sales.groupby('Region')['Sales'].transform('sum')\n",
"df_transformed['Sales_vs_Product_Avg'] = df_transformed['Sales'] - df_transformed['Product_Avg_Sales']\n",
"\n",
"print(\"Sample with transform columns:\")\n",
"print(df_transformed[['Product', 'Sales', 'Product_Avg_Sales', 'Sales_vs_Product_Avg']].head())\n",
"\n",
"print(\"\\nRanking within groups:\")\n",
"df_transformed['Sales_Rank_in_Product'] = df_sales.groupby('Product')['Sales'].rank(ascending=False)\n",
"print(df_transformed[['Product', 'Sales', 'Sales_Rank_in_Product']].head(10))"
]
},
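{
"cell_type": "markdown",
"metadata": {},
"source": [
"`.transform()` also accepts a callable, which is useful for per-group normalization such as a z-score. A sketch (the column name `Sales_Z` is illustrative):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: per-group z-score via transform with a callable\n",
"df_z = df_sales.copy()\n",
"df_z['Sales_Z'] = df_z.groupby('Product')['Sales'].transform(\n",
"    lambda s: (s - s.mean()) / s.std()\n",
")\n",
"print(df_z[['Product', 'Sales', 'Sales_Z']].head())"
]
},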
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Apply operations:\n",
" total_sales avg_sales num_transactions top_salesperson \\\n",
"Product \n",
"Headphones 36032 1000.89 36 Diana \n",
"Laptop 45296 1053.40 43 Eve \n",
"Monitor 47419 967.73 49 Alice \n",
"Phone 36847 1052.77 35 Bob \n",
"Tablet 34711 938.14 37 Bob \n",
"\n",
" sales_per_quantity \n",
"Product \n",
"Headphones 375.94 \n",
"Laptop 345.93 \n",
"Monitor 257.01 \n",
"Phone 373.06 \n",
"Tablet 313.06 \n",
"\n",
"Top performing sale in each region:\n",
" Product Sales Salesperson\n",
"Region \n",
"East Laptop 1585 Charlie\n",
"North Laptop 1976 Eve\n",
"South Laptop 1761 Charlie\n",
"West Headphones 1607 Diana\n"
]
}
],
"source": [
"# Apply operations - can return different structures\n",
"print(\"Apply operations:\")\n",
"\n",
"def group_summary(group):\n",
" \"\"\"Return a summary Series for each group\"\"\"\n",
" return pd.Series({\n",
" 'total_sales': group['Sales'].sum(),\n",
" 'avg_sales': group['Sales'].mean(),\n",
" 'num_transactions': len(group),\n",
" 'top_salesperson': group.loc[group['Sales'].idxmax(), 'Salesperson'],\n",
" 'sales_per_quantity': (group['Sales'] / group['Quantity']).mean()\n",
" })\n",
"\n",
"# include_groups=False (pandas >= 2.2) avoids the deprecation warning about operating on grouping columns\n",
"apply_result = df_sales.groupby('Product').apply(group_summary, include_groups=False).round(2)\n",
"print(apply_result)\n",
"\n",
"print(\"\\nTop performing sale in each region:\")\n",
"# include_groups=False requires pandas >= 2.2\n",
"top_sales_by_region = df_sales.groupby('Region').apply(\n",
"    lambda x: x.loc[x['Sales'].idxmax()], include_groups=False\n",
")\n",
"print(top_sales_by_region[['Product', 'Sales', 'Salesperson']])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6. Filtering Groups\n",
"\n",
"Filter entire groups based on group-level conditions."
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Groups with more than 30 transactions:\n",
"Original data: 200 rows\n",
"Filtered data: 200 rows\n",
"\n",
"Product transaction counts in filtered data:\n",
"Product\n",
"Monitor 49\n",
"Laptop 43\n",
"Tablet 37\n",
"Headphones 36\n",
"Phone 35\n",
"Name: count, dtype: int64\n",
"\n",
"Groups with average sales > $1000:\n",
"High-value products:\n",
"Product\n",
"Headphones 1000.89\n",
"Laptop 1053.40\n",
"Phone 1052.77\n",
"Name: Sales, dtype: float64\n"
]
}
],
"source": [
"# Filter groups based on group characteristics\n",
"print(\"Groups with more than 30 transactions:\")\n",
"active_products = df_sales.groupby('Product').filter(lambda x: len(x) > 30)\n",
"print(f\"Original data: {len(df_sales)} rows\")\n",
"print(f\"Filtered data: {len(active_products)} rows\")\n",
"print(\"\\nProduct transaction counts in filtered data:\")\n",
"print(active_products['Product'].value_counts())\n",
"\n",
"print(\"\\nGroups with average sales > $1000:\")\n",
"high_value_products = df_sales.groupby('Product').filter(lambda x: x['Sales'].mean() > 1000)\n",
"print(\"High-value products:\")\n",
"print(high_value_products.groupby('Product')['Sales'].mean().round(2))"
]
},
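{
"cell_type": "markdown",
"metadata": {},
"source": [
"`.filter()` calls its function once per group; for simple size or mean conditions, the same rows can be selected with a transform-based boolean mask, which avoids a second pass over the data. A sketch:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: transform-based equivalent of filter(lambda x: len(x) > 30)\n",
"mask = df_sales.groupby('Product')['Sales'].transform('size') > 30\n",
"active = df_sales[mask]\n",
"print(f\"{len(active)} rows kept\")"
]
},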
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Salespeople with consistent performance:\n",
"Consistent performers analysis:\n",
" Count Mean Std CV\n",
"Salesperson \n",
"Alice 35 956.229 288.455 0.302\n",
"Bob 36 1011.861 314.322 0.311\n",
"Charlie 37 1068.351 329.602 0.309\n",
"Diana 29 996.759 325.836 0.327\n",
"Eve 34 1033.353 323.720 0.313\n",
"Frank 29 925.552 296.002 0.320\n"
]
}
],
"source": [
"# Complex filtering conditions\n",
"print(\"Salespeople with consistent performance:\")\n",
"# Filter salespeople with at least 20 sales and CV < 0.5\n",
"consistent_performers = df_sales.groupby('Salesperson').filter(\n",
" lambda x: len(x) >= 20 and (x['Sales'].std() / x['Sales'].mean()) < 0.5\n",
")\n",
"\n",
"if len(consistent_performers) > 0:\n",
" print(\"Consistent performers analysis:\")\n",
" consistency_analysis = consistent_performers.groupby('Salesperson')['Sales'].agg([\n",
" 'count', 'mean', 'std', lambda x: x.std()/x.mean()\n",
" ]).round(3)\n",
" consistency_analysis.columns = ['Count', 'Mean', 'Std', 'CV']\n",
" print(consistency_analysis)\n",
"else:\n",
" print(\"No salespeople meet the consistency criteria\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 7. Advanced Grouping Techniques\n",
"\n",
"More sophisticated grouping operations."
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Grouping by sales value ranges:\n",
" Sales Quantity Commission\n",
" count mean sum sum sum\n",
"Sales_Category \n",
"Low 8 330.88 2647 54 272.61\n",
"Medium 92 788.87 72576 478 8110.68\n",
"High 89 1202.89 107057 423 11652.60\n",
"Very High 11 1638.64 18025 51 1886.35\n",
"\n",
"Product distribution across sales categories:\n",
"Product Headphones Laptop Monitor Phone Tablet\n",
"Sales_Category \n",
"Low 1 2 1 2 2\n",
"Medium 18 17 26 12 19\n",
"High 16 21 21 17 14\n",
"Very High 1 3 1 4 2\n"
]
}
],
"source": [
"# Groupby with categorical cuts\n",
"print(\"Grouping by sales value ranges:\")\n",
"# Create sales categories\n",
"df_sales['Sales_Category'] = pd.cut(df_sales['Sales'], \n",
" bins=[0, 500, 1000, 1500, float('inf')],\n",
" labels=['Low', 'Medium', 'High', 'Very High'])\n",
"\n",
"# pass observed= explicitly to silence the categorical-groupby FutureWarning\n",
"sales_category_analysis = df_sales.groupby('Sales_Category', observed=False).agg({\n",
" 'Sales': ['count', 'mean', 'sum'],\n",
" 'Quantity': 'sum',\n",
" 'Commission': 'sum'\n",
"}).round(2)\n",
"print(sales_category_analysis)\n",
"\n",
"print(\"\\nProduct distribution across sales categories:\")\n",
"category_product_cross = pd.crosstab(df_sales['Sales_Category'], df_sales['Product'])\n",
"print(category_product_cross)"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Weekly sales analysis:\n",
" Sales Product Salesperson\n",
" sum mean count <lambda> nunique\n",
"Week \n",
"1 7726 1103.71 7 Headphones 4\n",
"2 6078 868.29 7 Tablet 4\n",
"3 7281 1040.14 7 Monitor 5\n",
"4 6867 981.00 7 Laptop 4\n",
"5 7285 1040.71 7 Monitor 4\n",
"6 7994 1142.00 7 Headphones 4\n",
"7 6652 950.29 7 Phone 3\n",
"8 7125 1017.86 7 Monitor 4\n",
"9 6293 899.00 7 Monitor 5\n",
"10 7755 1107.86 7 Phone 4\n",
"\n",
"Day of week analysis:\n",
" count mean sum\n",
"DayOfWeek \n",
"Monday 29 1017.90 29519\n",
"Tuesday 29 1003.07 29089\n",
"Wednesday 29 956.72 27745\n",
"Thursday 29 963.62 27945\n",
"Friday 28 1136.71 31828\n",
"Saturday 28 1018.32 28513\n",
"Sunday 28 916.64 25666\n"
]
}
],
"source": [
"# Time-based grouping\n",
"print(\"Weekly sales analysis:\")\n",
"df_sales['Week'] = df_sales['Date'].dt.isocalendar().week\n",
"weekly_analysis = df_sales.groupby('Week').agg({\n",
" 'Sales': ['sum', 'mean', 'count'],\n",
" 'Product': lambda x: x.mode().iloc[0] if not x.mode().empty else 'None', # Most common product\n",
" 'Salesperson': 'nunique'\n",
"}).round(2)\n",
"print(weekly_analysis.head(10))\n",
"\n",
"print(\"\\nDay of week analysis:\")\n",
"df_sales['DayOfWeek'] = df_sales['Date'].dt.day_name()\n",
"day_analysis = df_sales.groupby('DayOfWeek')['Sales'].agg(['count', 'mean', 'sum']).round(2)\n",
"# Reorder by weekday\n",
"day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']\n",
"day_analysis = day_analysis.reindex([day for day in day_order if day in day_analysis.index])\n",
"print(day_analysis)"
]
},
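{
"cell_type": "markdown",
"metadata": {},
"source": [
"For calendar-aware time grouping, `pd.Grouper` (or `DataFrame.resample`) groups directly on the datetime column without deriving helper columns first. A sketch:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: group by calendar week directly on the Date column\n",
"weekly = df_sales.groupby(pd.Grouper(key='Date', freq='W'))['Sales'].sum()\n",
"print(weekly.head())"
]
},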
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 8. Performance Considerations\n",
"\n",
"Tips for efficient groupby operations."
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Large dataset size: 2000 rows\n",
"Multiple groupby calls: 0.0016 seconds\n",
"Single groupby with agg: 0.0007 seconds\n",
"Efficiency gain: 2.44x faster\n",
"\n",
"Results are equivalent: True\n"
]
}
],
"source": [
"# Efficient groupby operations\n",
"import time\n",
"\n",
"# Create larger dataset for timing comparison\n",
"large_df = pd.concat([df_sales] * 10, ignore_index=True)\n",
"print(f\"Large dataset size: {len(large_df)} rows\")\n",
"\n",
"# Method 1: Multiple separate groupby calls (less efficient)\n",
"start_time = time.time()\n",
"result1_sum = large_df.groupby('Product')['Sales'].sum()\n",
"result1_mean = large_df.groupby('Product')['Sales'].mean()\n",
"result1_count = large_df.groupby('Product')['Sales'].count()\n",
"time1 = time.time() - start_time\n",
"\n",
"# Method 2: Single groupby with agg (more efficient)\n",
"start_time = time.time()\n",
"result2 = large_df.groupby('Product')['Sales'].agg(['sum', 'mean', 'count'])\n",
"time2 = time.time() - start_time\n",
"\n",
"print(f\"Multiple groupby calls: {time1:.4f} seconds\")\n",
"print(f\"Single groupby with agg: {time2:.4f} seconds\")\n",
"print(f\"Efficiency gain: {time1/time2:.2f}x faster\")\n",
"\n",
"# Verify results are the same\n",
"print(f\"\\nResults are equivalent: {result1_sum.equals(result2['sum'])}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Practice Exercises\n",
"\n",
"Apply your grouping and aggregation skills:"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [],
"source": [
"# Exercise 1: Sales Performance Analysis\n",
"# Create a comprehensive sales performance report that includes:\n",
"# - Total and average sales by salesperson and region\n",
"# - Commission earned by each salesperson\n",
"# - Performance ranking within each region\n",
"# - Identify top and bottom performers\n",
"\n",
"# Your code here:\n",
"def sales_performance_report(df):\n",
" \"\"\"Generate comprehensive sales performance report\"\"\"\n",
" # Your implementation here\n",
" pass\n",
"\n",
"# sales_performance_report(df_sales)"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [],
"source": [
"# Exercise 2: Product Analysis\n",
"# Analyze product performance including:\n",
"# - Which products are most/least popular (by quantity and sales)\n",
"# - Seasonal trends for each product\n",
"# - Regional preferences for different products\n",
"# - Price consistency across regions\n",
"\n",
"# Your code here:\n"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [],
"source": [
"# Exercise 3: Custom Business Metrics\n",
"# Create custom aggregation functions to calculate:\n",
"# - Customer acquisition cost (if you have marketing spend data)\n",
"# - Sales velocity (sales per day) for each product\n",
"# - Market share by region\n",
"# - Performance consistency score\n",
"\n",
"# Your code here:\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Key Takeaways\n",
"\n",
"1. **GroupBy Basics**: `.groupby()` splits data into groups based on categorical variables\n",
"2. **Aggregation Functions**: Use built-in functions (`sum`, `mean`, `count`) or custom functions\n",
"3. **Multiple Aggregations**: Use `.agg()` with lists or dictionaries for multiple operations\n",
"4. **Hierarchical Indexing**: Multiple group columns create hierarchical indices\n",
"5. **Transform vs Apply**: `.transform()` preserves original size, `.apply()` can return different structures\n",
"6. **Filtering Groups**: Use `.filter()` to remove entire groups based on conditions\n",
"7. **Performance**: Single `.agg()` calls are more efficient than multiple `.groupby()` operations\n",
"\n",
"## Common Patterns\n",
"\n",
"```python\n",
"# Basic aggregation\n",
"df.groupby('column')['value'].sum()\n",
"\n",
"# Multiple aggregations\n",
"df.groupby('column')['value'].agg(['sum', 'mean', 'count'])\n",
"\n",
"# Multiple columns and aggregations\n",
"df.groupby('group_col').agg({\n",
" 'col1': ['sum', 'mean'],\n",
" 'col2': 'count'\n",
"})\n",
"\n",
"# Custom aggregation\n",
"df.groupby('column')['value'].agg(lambda x: x.max() - x.min())\n",
"\n",
"# Transform for group statistics\n",
"df['group_mean'] = df.groupby('group')['value'].transform('mean')\n",
"```"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.13.3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}