crypto_bot_training/Session_01/PandasDataFrame-exmples/05_adding_modifying_columns.ipynb
2025-06-13 07:25:59 +02:00
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Session 1 - DataFrames - Lesson 5: Adding and Modifying Columns\n",
"\n",
"## Learning Objectives\n",
"- Learn different methods to add new columns to DataFrames\n",
"- Master conditional column creation using various techniques\n",
"- Understand how to modify existing columns\n",
"- Practice with calculated fields and derived columns\n",
"- Explore data type conversions and transformations\n",
"\n",
"## Prerequisites\n",
"- Completed Lessons 1-4\n",
"- Understanding of basic Python operations and functions"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Import required libraries\n",
"import pandas as pd\n",
"import numpy as np\n",
"from datetime import datetime, timedelta\n",
"\n",
"# Create sample dataset\n",
"np.random.seed(42)\n",
"n_records = 150\n",
"\n",
"sales_data = {\n",
" 'Date': pd.date_range('2024-01-01', periods=n_records, freq='D'),\n",
" 'Product': np.random.choice(['Laptop', 'Phone', 'Tablet', 'Monitor'], n_records),\n",
" 'Sales': np.random.normal(1000, 200, n_records).astype(int),\n",
" 'Quantity': np.random.randint(1, 8, n_records),\n",
" 'Region': np.random.choice(['North', 'South', 'East', 'West'], n_records),\n",
" 'Salesperson': np.random.choice(['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'], n_records),\n",
" 'Customer_Type': np.random.choice(['New', 'Returning', 'VIP'], n_records, p=[0.3, 0.6, 0.1])\n",
"}\n",
"\n",
"df_sales = pd.DataFrame(sales_data)\n",
"df_sales['Sales'] = np.abs(df_sales['Sales']) # Ensure positive values\n",
"\n",
"print(\"Original dataset:\")\n",
"print(f\"Shape: {df_sales.shape}\")\n",
"print(\"\\nFirst few rows:\")\n",
"print(df_sales.head())\n",
"print(\"\\nData types:\")\n",
"print(df_sales.dtypes)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Basic Column Addition\n",
"\n",
"Simple methods to add new columns."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Method 1: Direct assignment\n",
"df_modified = df_sales.copy()\n",
"\n",
"# Add simple calculated columns\n",
"df_modified['Revenue'] = df_modified['Sales'] * df_modified['Quantity']\n",
"df_modified['Commission_10%'] = df_modified['Sales'] * 0.10\n",
"df_modified['Sales_per_Unit'] = df_modified['Sales'] / df_modified['Quantity']\n",
"\n",
"print(\"New calculated columns:\")\n",
"print(df_modified[['Sales', 'Quantity', 'Revenue', 'Commission_10%', 'Sales_per_Unit']].head())\n",
"\n",
"# Add constant value column\n",
"df_modified['Year'] = 2024\n",
"df_modified['Currency'] = 'USD'\n",
"df_modified['Department'] = 'Sales'\n",
"\n",
"print(\"\\nConstant value columns added:\")\n",
"print(df_modified[['Year', 'Currency', 'Department']].head())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Method 2: Using assign() method (more functional approach)\n",
"df_assigned = df_sales.assign(\n",
" Revenue=lambda x: x['Sales'] * x['Quantity'],\n",
" Commission_Rate=0.08,\n",
" Commission_Amount=lambda x: x['Sales'] * 0.08,\n",
" Sales_Squared=lambda x: x['Sales'] ** 2,\n",
" Is_High_Volume=lambda x: x['Quantity'] > 5\n",
")\n",
"\n",
"print(\"Using assign() method:\")\n",
"print(df_assigned[['Sales', 'Quantity', 'Revenue', 'Commission_Amount', 'Is_High_Volume']].head())\n",
"\n",
"print(f\"\\nOriginal shape: {df_sales.shape}\")\n",
"print(f\"Modified shape: {df_assigned.shape}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Method 3: Using insert() for specific positioning\n",
"df_insert = df_sales.copy()\n",
"\n",
"# Insert column at specific position (after 'Sales')\n",
"sales_index = df_insert.columns.get_loc('Sales')\n",
"df_insert.insert(sales_index + 1, 'Sales_Tax', df_insert['Sales'] * 0.08)\n",
"df_insert.insert(sales_index + 2, 'Total_with_Tax', df_insert['Sales'] + df_insert['Sales_Tax'])\n",
"\n",
"print(\"Using insert() for positioned columns:\")\n",
"print(df_insert[['Product', 'Sales', 'Sales_Tax', 'Total_with_Tax', 'Quantity']].head())\n",
"print(f\"\\nColumn order: {list(df_insert.columns)}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Conditional Column Creation\n",
"\n",
"Create columns based on conditions and business logic."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Method 1: Using np.where() for simple conditions\n",
"df_conditional = df_sales.copy()\n",
"\n",
"# Simple binary conditions\n",
"df_conditional['High_Sales'] = np.where(df_conditional['Sales'] > 1000, 'Yes', 'No')\n",
"df_conditional['Weekend'] = np.where(df_conditional['Date'].dt.dayofweek >= 5, 'Weekend', 'Weekday')\n",
"df_conditional['Bulk_Order'] = np.where(df_conditional['Quantity'] >= 5, 'Bulk', 'Regular')\n",
"\n",
"print(\"Simple conditional columns:\")\n",
"print(df_conditional[['Sales', 'High_Sales', 'Date', 'Weekend', 'Quantity', 'Bulk_Order']].head())\n",
"\n",
"# Nested conditions\n",
"df_conditional['Sales_Category'] = np.where(df_conditional['Sales'] > 1200, 'High',\n",
" np.where(df_conditional['Sales'] > 800, 'Medium', 'Low'))\n",
"\n",
"print(\"\\nNested conditions:\")\n",
"print(df_conditional[['Sales', 'Sales_Category']].head(10))\n",
"print(\"\\nCategory distribution:\")\n",
"print(df_conditional['Sales_Category'].value_counts())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Method 2: Using pd.cut() for binning numerical data\n",
"df_conditional['Sales_Tier'] = pd.cut(df_conditional['Sales'], \n",
" bins=[0, 500, 800, 1200, float('inf')],\n",
" labels=['Entry', 'Standard', 'Premium', 'Luxury'])\n",
"\n",
"print(\"Using pd.cut() for binning:\")\n",
"print(df_conditional[['Sales', 'Sales_Tier']].head(10))\n",
"print(\"\\nTier distribution:\")\n",
"print(df_conditional['Sales_Tier'].value_counts())\n",
"\n",
"# Using pd.qcut() for quantile-based binning\n",
"df_conditional['Sales_Quintile'] = pd.qcut(df_conditional['Sales'], \n",
" q=5, \n",
" labels=['Bottom 20%', 'Low 20%', 'Mid 20%', 'High 20%', 'Top 20%'])\n",
"\n",
"print(\"\\nUsing pd.qcut() for quantile binning:\")\n",
"print(df_conditional['Sales_Quintile'].value_counts())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Method 3: Using np.select() for multiple conditions\n",
"# Define conditions and choices\n",
"conditions = [\n",
" (df_conditional['Sales'] >= 1200) & (df_conditional['Quantity'] >= 5),\n",
" (df_conditional['Sales'] >= 1000) & (df_conditional['Customer_Type'] == 'VIP'),\n",
" (df_conditional['Sales'] >= 800) & (df_conditional['Region'] == 'North'),\n",
" df_conditional['Customer_Type'] == 'New'\n",
"]\n",
"\n",
"choices = ['Premium Deal', 'VIP Sale', 'North Preferred', 'New Customer']\n",
"default = 'Standard'\n",
"\n",
"df_conditional['Deal_Type'] = np.select(conditions, choices, default=default)\n",
"\n",
"print(\"Using np.select() for complex conditions:\")\n",
"print(df_conditional[['Sales', 'Quantity', 'Customer_Type', 'Region', 'Deal_Type']].head(10))\n",
"print(\"\\nDeal type distribution:\")\n",
"print(df_conditional['Deal_Type'].value_counts())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Using Apply and Lambda Functions\n",
"\n",
"Create complex calculated columns using custom functions."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Simple lambda functions\n",
"df_apply = df_sales.copy()\n",
"\n",
"# Single column transformations\n",
"df_apply['Sales_Log'] = df_apply['Sales'].apply(lambda x: np.log(x))\n",
"df_apply['Product_Length'] = df_apply['Product'].apply(lambda x: len(x))\n",
"df_apply['Days_Since_Start'] = df_apply['Date'].apply(lambda x: (x - df_apply['Date'].min()).days)\n",
"\n",
"print(\"Simple lambda transformations:\")\n",
"print(df_apply[['Sales', 'Sales_Log', 'Product', 'Product_Length', 'Days_Since_Start']].head())\n",
"\n",
"# Multiple column operations using lambda\n",
"df_apply['Efficiency_Score'] = df_apply.apply(\n",
" lambda row: (row['Sales'] * row['Quantity']) / (row['Days_Since_Start'] + 1), \n",
" axis=1\n",
")\n",
"\n",
"print(\"\\nMultiple column lambda:\")\n",
"print(df_apply[['Sales', 'Quantity', 'Days_Since_Start', 'Efficiency_Score']].head())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Custom functions for complex business logic\n",
"def calculate_commission(row):\n",
" \"\"\"Calculate commission based on complex business rules\"\"\"\n",
" base_rate = 0.05\n",
" \n",
" # VIP customers get higher commission\n",
" if row['Customer_Type'] == 'VIP':\n",
" base_rate += 0.02\n",
" \n",
" # High quantity orders get bonus\n",
" if row['Quantity'] >= 5:\n",
" base_rate += 0.01\n",
" \n",
" # Regional multipliers\n",
" region_multipliers = {'North': 1.2, 'South': 1.0, 'East': 1.1, 'West': 0.9}\n",
" multiplier = region_multipliers.get(row['Region'], 1.0)\n",
" \n",
" return row['Sales'] * base_rate * multiplier\n",
"\n",
"def performance_rating(row):\n",
" \"\"\"Calculate performance rating based on multiple factors\"\"\"\n",
" score = 0\n",
" \n",
" # Sales performance (40% weight)\n",
" if row['Sales'] > 1200:\n",
" score += 40\n",
" elif row['Sales'] > 800:\n",
" score += 30\n",
" else:\n",
" score += 20\n",
" \n",
" # Quantity performance (30% weight)\n",
" if row['Quantity'] >= 6:\n",
" score += 30\n",
" elif row['Quantity'] >= 4:\n",
" score += 20\n",
" else:\n",
" score += 10\n",
" \n",
" # Customer type bonus (30% weight)\n",
" customer_bonus = {'VIP': 30, 'Returning': 20, 'New': 15}\n",
" score += customer_bonus.get(row['Customer_Type'], 0)\n",
" \n",
" # Convert to letter grade\n",
" if score >= 85:\n",
" return 'A'\n",
" elif score >= 70:\n",
" return 'B'\n",
" elif score >= 55:\n",
" return 'C'\n",
" else:\n",
" return 'D'\n",
"\n",
"# Apply custom functions\n",
"df_apply['Commission'] = df_apply.apply(calculate_commission, axis=1)\n",
"df_apply['Performance_Rating'] = df_apply.apply(performance_rating, axis=1)\n",
"\n",
"print(\"Custom function results:\")\n",
"print(df_apply[['Sales', 'Quantity', 'Customer_Type', 'Region', 'Commission', 'Performance_Rating']].head())\n",
"\n",
"print(\"\\nPerformance rating distribution:\")\n",
"print(df_apply['Performance_Rating'].value_counts().sort_index())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Date and Time Derived Columns\n",
"\n",
"Extract useful information from datetime columns."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Extract date components\n",
"df_dates = df_sales.copy()\n",
"\n",
"# Basic date components\n",
"df_dates['Year'] = df_dates['Date'].dt.year\n",
"df_dates['Month'] = df_dates['Date'].dt.month\n",
"df_dates['Day'] = df_dates['Date'].dt.day\n",
"df_dates['DayOfWeek'] = df_dates['Date'].dt.dayofweek # 0=Monday, 6=Sunday\n",
"df_dates['DayName'] = df_dates['Date'].dt.day_name()\n",
"df_dates['MonthName'] = df_dates['Date'].dt.month_name()\n",
"\n",
"print(\"Basic date components:\")\n",
"print(df_dates[['Date', 'Year', 'Month', 'Day', 'DayOfWeek', 'DayName', 'MonthName']].head())\n",
"\n",
"# Business-relevant date features\n",
"df_dates['Quarter'] = df_dates['Date'].dt.quarter\n",
"df_dates['Week'] = df_dates['Date'].dt.isocalendar().week\n",
"df_dates['DayOfYear'] = df_dates['Date'].dt.dayofyear\n",
"df_dates['IsWeekend'] = df_dates['Date'].dt.dayofweek >= 5\n",
"df_dates['IsMonthStart'] = df_dates['Date'].dt.is_month_start\n",
"df_dates['IsMonthEnd'] = df_dates['Date'].dt.is_month_end\n",
"\n",
"print(\"\\nBusiness date features:\")\n",
"print(df_dates[['Date', 'Quarter', 'Week', 'IsWeekend', 'IsMonthStart', 'IsMonthEnd']].head(10))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Time-based calculations\n",
"start_date = df_dates['Date'].min()\n",
"df_dates['Days_Since_Start'] = (df_dates['Date'] - start_date).dt.days\n",
"df_dates['Weeks_Since_Start'] = df_dates['Days_Since_Start'] // 7\n",
"\n",
"# Create season column\n",
"def get_season(month):\n",
" if month in [12, 1, 2]:\n",
" return 'Winter'\n",
" elif month in [3, 4, 5]:\n",
" return 'Spring'\n",
" elif month in [6, 7, 8]:\n",
" return 'Summer'\n",
" else:\n",
" return 'Fall'\n",
"\n",
"df_dates['Season'] = df_dates['Month'].apply(get_season)\n",
"\n",
"# Business day calculations\n",
"df_dates['IsBusinessDay'] = df_dates['Date'].dt.dayofweek < 5\n",
"df_dates['BusinessDaysSinceStart'] = df_dates.apply(\n",
" lambda row: np.busday_count(start_date.date(), row['Date'].date()), axis=1\n",
")\n",
"\n",
"print(\"Time-based calculations:\")\n",
"print(df_dates[['Date', 'Days_Since_Start', 'Weeks_Since_Start', 'Season', \n",
" 'IsBusinessDay', 'BusinessDaysSinceStart']].head())\n",
"\n",
"print(\"\\nSeason distribution:\")\n",
"print(df_dates['Season'].value_counts())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. Text and String Manipulations\n",
"\n",
"Create columns based on string operations and text processing."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# String manipulations\n",
"df_text = df_sales.copy()\n",
"\n",
"# Basic string operations\n",
"df_text['Product_Upper'] = df_text['Product'].str.upper()\n",
"df_text['Product_Lower'] = df_text['Product'].str.lower()\n",
"df_text['Product_Length'] = df_text['Product'].str.len()\n",
"df_text['Product_First_Char'] = df_text['Product'].str[0]\n",
"df_text['Product_Last_Three'] = df_text['Product'].str[-3:]\n",
"\n",
"print(\"Basic string operations:\")\n",
"print(df_text[['Product', 'Product_Upper', 'Product_Lower', 'Product_Length', \n",
" 'Product_First_Char', 'Product_Last_Three']].head())\n",
"\n",
"# Text categorization\n",
"df_text['Product_Category'] = df_text['Product'].apply(lambda x: \n",
" 'Computer' if x in ['Laptop', 'Monitor'] else\n",
" 'Mobile' if x in ['Phone', 'Tablet'] else\n",
" 'Other'\n",
")\n",
"\n",
"# Check for patterns\n",
"df_text['Has_Letter_A'] = df_text['Product'].str.contains('a', case=False)\n",
"df_text['Starts_With_L'] = df_text['Product'].str.startswith('L')\n",
"df_text['Ends_With_E'] = df_text['Product'].str.endswith('e')\n",
"\n",
"print(\"\\nText patterns and categorization:\")\n",
"print(df_text[['Product', 'Product_Category', 'Has_Letter_A', 'Starts_With_L', 'Ends_With_E']].head(10))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Create formatted text columns\n",
"df_text['Sales_Formatted'] = df_text['Sales'].apply(lambda x: f\"${x:,.2f}\")\n",
"df_text['Transaction_ID'] = df_text.apply(\n",
" lambda row: f\"{row['Region'][:1]}{row['Product'][:3].upper()}{row.name:04d}\", axis=1\n",
")\n",
"\n",
"# Create summary descriptions\n",
"df_text['Transaction_Summary'] = df_text.apply(\n",
" lambda row: f\"{row['Salesperson']} sold {row['Quantity']} {row['Product']}(s) \"\n",
" f\"for {row['Sales_Formatted']} in {row['Region']} region\", \n",
" axis=1\n",
")\n",
"\n",
"print(\"Formatted text columns:\")\n",
"print(df_text[['Sales_Formatted', 'Transaction_ID']].head())\n",
"print(\"\\nTransaction summaries:\")\n",
"for i, summary in enumerate(df_text['Transaction_Summary'].head(3)):\n",
" print(f\"{i+1}. {summary}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6. Working with Categorical Data\n",
"\n",
"Optimize memory usage and enable category-specific operations."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Convert to categorical data types\n",
"df_categorical = df_sales.copy()\n",
"\n",
"# Check memory usage before\n",
"print(\"Memory usage before categorical conversion:\")\n",
"print(df_categorical.memory_usage(deep=True))\n",
"\n",
"# Convert string columns to categorical\n",
"categorical_columns = ['Product', 'Region', 'Salesperson', 'Customer_Type']\n",
"for col in categorical_columns:\n",
" df_categorical[col] = df_categorical[col].astype('category')\n",
"\n",
"print(\"\\nMemory usage after categorical conversion:\")\n",
"print(df_categorical.memory_usage(deep=True))\n",
"\n",
"print(\"\\nData types after conversion:\")\n",
"print(df_categorical.dtypes)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Working with ordered categories\n",
"# Create ordered categorical for sales performance\n",
"performance_categories = ['Poor', 'Fair', 'Good', 'Excellent']\n",
"df_categorical['Performance_Level'] = pd.cut(\n",
" df_categorical['Sales'],\n",
" bins=[0, 700, 900, 1200, float('inf')],\n",
" labels=performance_categories,\n",
" ordered=True\n",
")\n",
"\n",
"print(\"Ordered categorical data:\")\n",
"print(df_categorical['Performance_Level'].head(10))\n",
"print(\"\\nCategory info:\")\n",
"print(df_categorical['Performance_Level'].cat.categories)\n",
"print(f\"Is ordered: {df_categorical['Performance_Level'].cat.ordered}\")\n",
"\n",
"# Categorical operations\n",
"print(\"\\nPerformance level distribution:\")\n",
"print(df_categorical['Performance_Level'].value_counts().sort_index())\n",
"\n",
"# Add new category\n",
"df_categorical['Performance_Level'] = df_categorical['Performance_Level'].cat.add_categories(['Outstanding'])\n",
"print(f\"\\nCategories after adding 'Outstanding': {df_categorical['Performance_Level'].cat.categories}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 7. Mathematical and Statistical Transformations\n",
"\n",
"Create columns using mathematical functions and statistical transformations."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Mathematical transformations\n",
"df_math = df_sales.copy()\n",
"\n",
"# Common mathematical transformations\n",
"df_math['Sales_Log'] = np.log(df_math['Sales'])\n",
"df_math['Sales_Sqrt'] = np.sqrt(df_math['Sales'])\n",
"df_math['Sales_Squared'] = df_math['Sales'] ** 2\n",
"df_math['Sales_Reciprocal'] = 1 / df_math['Sales']\n",
"\n",
"print(\"Mathematical transformations:\")\n",
"print(df_math[['Sales', 'Sales_Log', 'Sales_Sqrt', 'Sales_Squared', 'Sales_Reciprocal']].head())\n",
"\n",
"# Statistical standardization\n",
"df_math['Sales_Z_Score'] = (df_math['Sales'] - df_math['Sales'].mean()) / df_math['Sales'].std()\n",
"df_math['Sales_Min_Max_Scaled'] = (df_math['Sales'] - df_math['Sales'].min()) / (df_math['Sales'].max() - df_math['Sales'].min())\n",
"\n",
"# Rolling statistics\n",
"df_math = df_math.sort_values('Date')\n",
"df_math['Sales_Rolling_7_Mean'] = df_math['Sales'].rolling(window=7, min_periods=1).mean()\n",
"df_math['Sales_Rolling_7_Std'] = df_math['Sales'].rolling(window=7, min_periods=1).std()\n",
"df_math['Sales_Cumulative_Sum'] = df_math['Sales'].cumsum()\n",
"\n",
"print(\"\\nStatistical transformations:\")\n",
"print(df_math[['Sales', 'Sales_Z_Score', 'Sales_Min_Max_Scaled', \n",
" 'Sales_Rolling_7_Mean', 'Sales_Cumulative_Sum']].head(10))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Rank and percentile columns\n",
"df_math['Sales_Rank'] = df_math['Sales'].rank(ascending=False)\n",
"df_math['Sales_Percentile'] = df_math['Sales'].rank(pct=True) * 100\n",
"df_math['Sales_Rank_by_Region'] = df_math.groupby('Region')['Sales'].rank(ascending=False)\n",
"\n",
"# Binning and discretization\n",
"df_math['Sales_Decile'] = pd.qcut(df_math['Sales'], q=10, labels=range(1, 11))\n",
"df_math['Sales_Tertile'] = pd.qcut(df_math['Sales'], q=3, labels=['Low', 'Medium', 'High'])\n",
"\n",
"print(\"Ranking and binning:\")\n",
"print(df_math[['Sales', 'Sales_Rank', 'Sales_Percentile', 'Sales_Rank_by_Region', \n",
" 'Sales_Decile', 'Sales_Tertile']].head(10))\n",
"\n",
"print(\"\\nDecile distribution:\")\n",
"print(df_math['Sales_Decile'].value_counts().sort_index())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Practice Exercises\n",
"\n",
"Apply your column creation and modification skills:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Exercise 1: Customer Segmentation\n",
"# Create a comprehensive customer segmentation system:\n",
"# - Combine purchase behavior, frequency, and value\n",
"# - Create RFM-like scores (Recency, Frequency, Monetary)\n",
"# - Assign customer segments (e.g., Champion, Loyal, At Risk, etc.)\n",
"\n",
"def create_customer_segmentation(df):\n",
" \"\"\"Create customer segmentation based on purchase patterns\"\"\"\n",
" # Your implementation here\n",
" pass\n",
"\n",
"# segmented_df = create_customer_segmentation(df_sales)\n",
"# print(segmented_df[['Customer_Type', 'Sales', 'Frequency_Score', 'Monetary_Score', 'Segment']].head())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Exercise 2: Performance Metrics Dashboard\n",
"# Create a comprehensive set of KPI columns:\n",
"# - Sales efficiency metrics\n",
"# - Trend indicators (growth rates, momentum)\n",
"# - Comparative metrics (vs. average, vs. target)\n",
"# - Alert flags for unusual patterns\n",
"\n",
"# Your code here:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Exercise 3: Feature Engineering for ML\n",
"# Create features that could be useful for machine learning:\n",
"# - Interaction features (product of two variables)\n",
"# - Polynomial features\n",
"# - Time-based features (seasonality, trends)\n",
"# - Lag features (previous period values)\n",
"\n",
"# Your code here:\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Key Takeaways\n",
"\n",
"1. **Column Assignment**: Use direct assignment (`df['col'] = value`) for simple cases\n",
"2. **Assign Method**: Use `.assign()` for functional programming style and method chaining\n",
"3. **Conditional Logic**: Combine `np.where()`, `pd.cut()`, `pd.qcut()`, and `np.select()` for complex conditions\n",
"4. **Apply Functions**: Use `.apply()` with lambda or custom functions for complex transformations\n",
"5. **Date Features**: Extract meaningful components from datetime columns\n",
"6. **String Operations**: Leverage `.str` accessor for text manipulations\n",
"7. **Categorical Data**: Convert to categories for memory efficiency and special operations\n",
"8. **Mathematical Transformations**: Apply statistical and mathematical functions for data preprocessing\n",
"\n",
"## Performance Tips\n",
"\n",
"1. **Vectorized Operations**: Prefer pandas/numpy operations over loops\n",
"2. **Categorical Types**: Use categorical data for repeated string values\n",
"3. **Memory Management**: Monitor memory usage when creating many new columns\n",
"4. **Method Chaining**: Use `.assign()` for readable method chains\n",
"5. **Avoid apply() When Possible**: Use vectorized operations instead of `.apply()` for better performance\n",
"\n",
"## Common Patterns\n",
"\n",
"```python\n",
"# Simple calculation\n",
"df['new_col'] = df['col1'] * df['col2']\n",
"\n",
"# Conditional column\n",
"df['category'] = np.where(df['value'] > threshold, 'High', 'Low')\n",
"\n",
"# Apply custom function\n",
"df['result'] = df.apply(custom_function, axis=1)\n",
"\n",
"# Date features\n",
"df['month'] = df['date'].dt.month\n",
"\n",
"# String operations\n",
"df['upper'] = df['text'].str.upper()\n",
"```"
]
}
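,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Vectorization vs. apply(): a quick check\n",
"\n",
"To illustrate performance tip 5, the sketch below (assuming the `df_sales` frame built at the top of this notebook) computes the same 5% commission two ways and confirms the results match. The vectorized version avoids one Python-level function call per row, which is why it scales far better on large frames."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Row-wise apply: one Python-level call per row (slow on large frames)\n",
"commission_apply = df_sales.apply(lambda row: row['Sales'] * 0.05, axis=1)\n",
"\n",
"# Vectorized equivalent: a single columnar operation\n",
"commission_vectorized = df_sales['Sales'] * 0.05\n",
"\n",
"# Verify both approaches produce identical values\n",
"print(commission_apply.equals(commission_vectorized))"
]
}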
],
"metadata": {
"kernelspec": {
"display_name": "venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.13.3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}