crypto_bot_training/Session_01/PandasDataFrame-exmples/05_adding_modifying_columns.ipynb
2025-06-13 07:25:59 +02:00
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Session 1 - DataFrames - Lesson 5: Adding and Modifying Columns\n",
"\n",
"## Learning Objectives\n",
"- Learn different methods to add new columns to DataFrames\n",
"- Master conditional column creation using various techniques\n",
"- Understand how to modify existing columns\n",
"- Practice with calculated fields and derived columns\n",
"- Explore data type conversions and transformations\n",
"\n",
"## Prerequisites\n",
"- Completed Lessons 1-4\n",
"- Understanding of basic Python operations and functions"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Import required libraries\n",
"import pandas as pd\n",
"import numpy as np\n",
"from datetime import datetime, timedelta\n",
"\n",
"# Create sample dataset\n",
"np.random.seed(42)\n",
"n_records = 150\n",
"\n",
"sales_data = {\n",
" 'Date': pd.date_range('2024-01-01', periods=n_records, freq='D'),\n",
" 'Product': np.random.choice(['Laptop', 'Phone', 'Tablet', 'Monitor'], n_records),\n",
" 'Sales': np.random.normal(1000, 200, n_records).astype(int),\n",
" 'Quantity': np.random.randint(1, 8, n_records),\n",
" 'Region': np.random.choice(['North', 'South', 'East', 'West'], n_records),\n",
" 'Salesperson': np.random.choice(['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'], n_records),\n",
" 'Customer_Type': np.random.choice(['New', 'Returning', 'VIP'], n_records, p=[0.3, 0.6, 0.1])\n",
"}\n",
"\n",
"df_sales = pd.DataFrame(sales_data)\n",
"df_sales['Sales'] = np.abs(df_sales['Sales']) # Ensure positive values\n",
"\n",
"print(\"Original dataset:\")\n",
"print(f\"Shape: {df_sales.shape}\")\n",
"print(\"\\nFirst few rows:\")\n",
"print(df_sales.head())\n",
"print(\"\\nData types:\")\n",
"print(df_sales.dtypes)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Basic Column Addition\n",
"\n",
"Simple methods to add new columns."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Method 1: Direct assignment\n",
"df_modified = df_sales.copy()\n",
"\n",
"# Add simple calculated columns\n",
"df_modified['Revenue'] = df_modified['Sales'] * df_modified['Quantity']\n",
"df_modified['Commission_10%'] = df_modified['Sales'] * 0.10\n",
"df_modified['Sales_per_Unit'] = df_modified['Sales'] / df_modified['Quantity']\n",
"\n",
"print(\"New calculated columns:\")\n",
"print(df_modified[['Sales', 'Quantity', 'Revenue', 'Commission_10%', 'Sales_per_Unit']].head())\n",
"\n",
"# Add constant value column\n",
"df_modified['Year'] = 2024\n",
"df_modified['Currency'] = 'USD'\n",
"df_modified['Department'] = 'Sales'\n",
"\n",
"print(\"\\nConstant value columns added:\")\n",
"print(df_modified[['Year', 'Currency', 'Department']].head())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Method 2: Using assign() method (more functional approach)\n",
"df_assigned = df_sales.assign(\n",
" Revenue=lambda x: x['Sales'] * x['Quantity'],\n",
" Commission_Rate=0.08,\n",
" Commission_Amount=lambda x: x['Sales'] * 0.08,\n",
" Sales_Squared=lambda x: x['Sales'] ** 2,\n",
" Is_High_Volume=lambda x: x['Quantity'] > 5\n",
")\n",
"\n",
"print(\"Using assign() method:\")\n",
"print(df_assigned[['Sales', 'Quantity', 'Revenue', 'Commission_Amount', 'Is_High_Volume']].head())\n",
"\n",
"print(f\"\\nOriginal shape: {df_sales.shape}\")\n",
"print(f\"Modified shape: {df_assigned.shape}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Method 3: Using insert() for specific positioning\n",
"df_insert = df_sales.copy()\n",
"\n",
"# Insert column at specific position (after 'Sales')\n",
"sales_index = df_insert.columns.get_loc('Sales')\n",
"df_insert.insert(sales_index + 1, 'Sales_Tax', df_insert['Sales'] * 0.08)\n",
"df_insert.insert(sales_index + 2, 'Total_with_Tax', df_insert['Sales'] + df_insert['Sales_Tax'])\n",
"\n",
"print(\"Using insert() for positioned columns:\")\n",
"print(df_insert[['Product', 'Sales', 'Sales_Tax', 'Total_with_Tax', 'Quantity']].head())\n",
"print(f\"\\nColumn order: {list(df_insert.columns)}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Conditional Column Creation\n",
"\n",
"Create columns based on conditions and business logic."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Method 1: Using np.where() for simple conditions\n",
"df_conditional = df_sales.copy()\n",
"\n",
"# Simple binary conditions\n",
"df_conditional['High_Sales'] = np.where(df_conditional['Sales'] > 1000, 'Yes', 'No')\n",
"df_conditional['Weekend'] = np.where(df_conditional['Date'].dt.dayofweek >= 5, 'Weekend', 'Weekday')\n",
"df_conditional['Bulk_Order'] = np.where(df_conditional['Quantity'] >= 5, 'Bulk', 'Regular')\n",
"\n",
"print(\"Simple conditional columns:\")\n",
"print(df_conditional[['Sales', 'High_Sales', 'Date', 'Weekend', 'Quantity', 'Bulk_Order']].head())\n",
"\n",
"# Nested conditions\n",
"df_conditional['Sales_Category'] = np.where(df_conditional['Sales'] > 1200, 'High',\n",
" np.where(df_conditional['Sales'] > 800, 'Medium', 'Low'))\n",
"\n",
"print(\"\\nNested conditions:\")\n",
"print(df_conditional[['Sales', 'Sales_Category']].head(10))\n",
"print(\"\\nCategory distribution:\")\n",
"print(df_conditional['Sales_Category'].value_counts())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Method 2: Using pd.cut() for binning numerical data\n",
"df_conditional['Sales_Tier'] = pd.cut(df_conditional['Sales'], \n",
" bins=[0, 500, 800, 1200, float('inf')],\n",
" labels=['Entry', 'Standard', 'Premium', 'Luxury'])\n",
"\n",
"print(\"Using pd.cut() for binning:\")\n",
"print(df_conditional[['Sales', 'Sales_Tier']].head(10))\n",
"print(\"\\nTier distribution:\")\n",
"print(df_conditional['Sales_Tier'].value_counts())\n",
"\n",
"# Using pd.qcut() for quantile-based binning\n",
"df_conditional['Sales_Quintile'] = pd.qcut(df_conditional['Sales'], \n",
" q=5, \n",
" labels=['Bottom 20%', 'Low 20%', 'Mid 20%', 'High 20%', 'Top 20%'])\n",
"\n",
"print(\"\\nUsing pd.qcut() for quantile binning:\")\n",
"print(df_conditional['Sales_Quintile'].value_counts())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Method 3: Using np.select() for multiple conditions\n",
"# Define conditions and choices\n",
"conditions = [\n",
" (df_conditional['Sales'] >= 1200) & (df_conditional['Quantity'] >= 5),\n",
" (df_conditional['Sales'] >= 1000) & (df_conditional['Customer_Type'] == 'VIP'),\n",
" (df_conditional['Sales'] >= 800) & (df_conditional['Region'] == 'North'),\n",
" df_conditional['Customer_Type'] == 'New'\n",
"]\n",
"\n",
"choices = ['Premium Deal', 'VIP Sale', 'North Preferred', 'New Customer']\n",
"default = 'Standard'\n",
"\n",
"df_conditional['Deal_Type'] = np.select(conditions, choices, default=default)\n",
"\n",
"print(\"Using np.select() for complex conditions:\")\n",
"print(df_conditional[['Sales', 'Quantity', 'Customer_Type', 'Region', 'Deal_Type']].head(10))\n",
"print(\"\\nDeal type distribution:\")\n",
"print(df_conditional['Deal_Type'].value_counts())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Using Apply and Lambda Functions\n",
"\n",
"Create complex calculated columns using custom functions."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Simple lambda functions\n",
"df_apply = df_sales.copy()\n",
"\n",
"# Single column transformations\n",
"df_apply['Sales_Log'] = df_apply['Sales'].apply(lambda x: np.log(x))\n",
"df_apply['Product_Length'] = df_apply['Product'].apply(lambda x: len(x))\n",
"df_apply['Days_Since_Start'] = df_apply['Date'].apply(lambda x: (x - df_apply['Date'].min()).days)\n",
"\n",
"print(\"Simple lambda transformations:\")\n",
"print(df_apply[['Sales', 'Sales_Log', 'Product', 'Product_Length', 'Days_Since_Start']].head())\n",
"\n",
"# Multiple column operations using lambda\n",
"df_apply['Efficiency_Score'] = df_apply.apply(\n",
" lambda row: (row['Sales'] * row['Quantity']) / (row['Days_Since_Start'] + 1), \n",
" axis=1\n",
")\n",
"\n",
"print(\"\\nMultiple column lambda:\")\n",
"print(df_apply[['Sales', 'Quantity', 'Days_Since_Start', 'Efficiency_Score']].head())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Custom functions for complex business logic\n",
"def calculate_commission(row):\n",
" \"\"\"Calculate commission based on complex business rules\"\"\"\n",
" base_rate = 0.05\n",
" \n",
" # VIP customers get higher commission\n",
" if row['Customer_Type'] == 'VIP':\n",
" base_rate += 0.02\n",
" \n",
" # High quantity orders get bonus\n",
" if row['Quantity'] >= 5:\n",
" base_rate += 0.01\n",
" \n",
" # Regional multipliers\n",
" region_multipliers = {'North': 1.2, 'South': 1.0, 'East': 1.1, 'West': 0.9}\n",
" multiplier = region_multipliers.get(row['Region'], 1.0)\n",
" \n",
" return row['Sales'] * base_rate * multiplier\n",
"\n",
"def performance_rating(row):\n",
" \"\"\"Calculate performance rating based on multiple factors\"\"\"\n",
" score = 0\n",
" \n",
" # Sales performance (40% weight)\n",
" if row['Sales'] > 1200:\n",
" score += 40\n",
" elif row['Sales'] > 800:\n",
" score += 30\n",
" else:\n",
" score += 20\n",
" \n",
" # Quantity performance (30% weight)\n",
" if row['Quantity'] >= 6:\n",
" score += 30\n",
" elif row['Quantity'] >= 4:\n",
" score += 20\n",
" else:\n",
" score += 10\n",
" \n",
" # Customer type bonus (30% weight)\n",
" customer_bonus = {'VIP': 30, 'Returning': 20, 'New': 15}\n",
" score += customer_bonus.get(row['Customer_Type'], 0)\n",
" \n",
" # Convert to letter grade\n",
" if score >= 85:\n",
" return 'A'\n",
" elif score >= 70:\n",
" return 'B'\n",
" elif score >= 55:\n",
" return 'C'\n",
" else:\n",
" return 'D'\n",
"\n",
"# Apply custom functions\n",
"df_apply['Commission'] = df_apply.apply(calculate_commission, axis=1)\n",
"df_apply['Performance_Rating'] = df_apply.apply(performance_rating, axis=1)\n",
"\n",
"print(\"Custom function results:\")\n",
"print(df_apply[['Sales', 'Quantity', 'Customer_Type', 'Region', 'Commission', 'Performance_Rating']].head())\n",
"\n",
"print(\"\\nPerformance rating distribution:\")\n",
"print(df_apply['Performance_Rating'].value_counts().sort_index())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Date and Time Derived Columns\n",
"\n",
"Extract useful information from datetime columns."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Extract date components\n",
"df_dates = df_sales.copy()\n",
"\n",
"# Basic date components\n",
"df_dates['Year'] = df_dates['Date'].dt.year\n",
"df_dates['Month'] = df_dates['Date'].dt.month\n",
"df_dates['Day'] = df_dates['Date'].dt.day\n",
"df_dates['DayOfWeek'] = df_dates['Date'].dt.dayofweek # 0=Monday, 6=Sunday\n",
"df_dates['DayName'] = df_dates['Date'].dt.day_name()\n",
"df_dates['MonthName'] = df_dates['Date'].dt.month_name()\n",
"\n",
"print(\"Basic date components:\")\n",
"print(df_dates[['Date', 'Year', 'Month', 'Day', 'DayOfWeek', 'DayName', 'MonthName']].head())\n",
"\n",
"# Business-relevant date features\n",
"df_dates['Quarter'] = df_dates['Date'].dt.quarter\n",
"df_dates['Week'] = df_dates['Date'].dt.isocalendar().week\n",
"df_dates['DayOfYear'] = df_dates['Date'].dt.dayofyear\n",
"df_dates['IsWeekend'] = df_dates['Date'].dt.dayofweek >= 5\n",
"df_dates['IsMonthStart'] = df_dates['Date'].dt.is_month_start\n",
"df_dates['IsMonthEnd'] = df_dates['Date'].dt.is_month_end\n",
"\n",
"print(\"\\nBusiness date features:\")\n",
"print(df_dates[['Date', 'Quarter', 'Week', 'IsWeekend', 'IsMonthStart', 'IsMonthEnd']].head(10))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Time-based calculations\n",
"start_date = df_dates['Date'].min()\n",
"df_dates['Days_Since_Start'] = (df_dates['Date'] - start_date).dt.days\n",
"df_dates['Weeks_Since_Start'] = df_dates['Days_Since_Start'] // 7\n",
"\n",
"# Create season column\n",
"def get_season(month):\n",
" if month in [12, 1, 2]:\n",
" return 'Winter'\n",
" elif month in [3, 4, 5]:\n",
" return 'Spring'\n",
" elif month in [6, 7, 8]:\n",
" return 'Summer'\n",
" else:\n",
" return 'Fall'\n",
"\n",
"df_dates['Season'] = df_dates['Month'].apply(get_season)\n",
"\n",
"# Business day calculations\n",
"df_dates['IsBusinessDay'] = df_dates['Date'].dt.dayofweek < 5\n",
"df_dates['BusinessDaysSinceStart'] = df_dates.apply(\n",
" lambda row: np.busday_count(start_date.date(), row['Date'].date()), axis=1\n",
")\n",
"\n",
"print(\"Time-based calculations:\")\n",
"print(df_dates[['Date', 'Days_Since_Start', 'Weeks_Since_Start', 'Season', \n",
" 'IsBusinessDay', 'BusinessDaysSinceStart']].head())\n",
"\n",
"print(\"\\nSeason distribution:\")\n",
"print(df_dates['Season'].value_counts())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. Text and String Manipulations\n",
"\n",
"Create columns based on string operations and text processing."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# String manipulations\n",
"df_text = df_sales.copy()\n",
"\n",
"# Basic string operations\n",
"df_text['Product_Upper'] = df_text['Product'].str.upper()\n",
"df_text['Product_Lower'] = df_text['Product'].str.lower()\n",
"df_text['Product_Length'] = df_text['Product'].str.len()\n",
"df_text['Product_First_Char'] = df_text['Product'].str[0]\n",
"df_text['Product_Last_Three'] = df_text['Product'].str[-3:]\n",
"\n",
"print(\"Basic string operations:\")\n",
"print(df_text[['Product', 'Product_Upper', 'Product_Lower', 'Product_Length', \n",
" 'Product_First_Char', 'Product_Last_Three']].head())\n",
"\n",
"# Text categorization\n",
"df_text['Product_Category'] = df_text['Product'].apply(lambda x: \n",
" 'Computer' if x in ['Laptop', 'Monitor'] else\n",
" 'Mobile' if x in ['Phone', 'Tablet'] else\n",
" 'Other'\n",
")\n",
"\n",
"# Check for patterns\n",
"df_text['Has_Letter_A'] = df_text['Product'].str.contains('a', case=False)\n",
"df_text['Starts_With_L'] = df_text['Product'].str.startswith('L')\n",
"df_text['Ends_With_E'] = df_text['Product'].str.endswith('e')\n",
"\n",
"print(\"\\nText patterns and categorization:\")\n",
"print(df_text[['Product', 'Product_Category', 'Has_Letter_A', 'Starts_With_L', 'Ends_With_E']].head(10))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Create formatted text columns\n",
"df_text['Sales_Formatted'] = df_text['Sales'].apply(lambda x: f\"${x:,.2f}\")\n",
"df_text['Transaction_ID'] = df_text.apply(\n",
" lambda row: f\"{row['Region'][:1]}{row['Product'][:3].upper()}{row.name:04d}\", axis=1\n",
")\n",
"\n",
"# Create summary descriptions\n",
"df_text['Transaction_Summary'] = df_text.apply(\n",
" lambda row: f\"{row['Salesperson']} sold {row['Quantity']} {row['Product']}(s) \"\n",
" f\"for {row['Sales_Formatted']} in {row['Region']} region\", \n",
" axis=1\n",
")\n",
"\n",
"print(\"Formatted text columns:\")\n",
"print(df_text[['Sales_Formatted', 'Transaction_ID']].head())\n",
"print(\"\\nTransaction summaries:\")\n",
"for i, summary in enumerate(df_text['Transaction_Summary'].head(3)):\n",
" print(f\"{i+1}. {summary}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6. Working with Categorical Data\n",
"\n",
"Optimize memory usage and enable category-specific operations."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Convert to categorical data types\n",
"df_categorical = df_sales.copy()\n",
"\n",
"# Check memory usage before\n",
"print(\"Memory usage before categorical conversion:\")\n",
"print(df_categorical.memory_usage(deep=True))\n",
"\n",
"# Convert string columns to categorical\n",
"categorical_columns = ['Product', 'Region', 'Salesperson', 'Customer_Type']\n",
"for col in categorical_columns:\n",
" df_categorical[col] = df_categorical[col].astype('category')\n",
"\n",
"print(\"\\nMemory usage after categorical conversion:\")\n",
"print(df_categorical.memory_usage(deep=True))\n",
"\n",
"print(\"\\nData types after conversion:\")\n",
"print(df_categorical.dtypes)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Working with ordered categories\n",
"# Create ordered categorical for sales performance\n",
"performance_categories = ['Poor', 'Fair', 'Good', 'Excellent']\n",
"df_categorical['Performance_Level'] = pd.cut(\n",
" df_categorical['Sales'],\n",
" bins=[0, 700, 900, 1200, float('inf')],\n",
" labels=performance_categories,\n",
" ordered=True\n",
")\n",
"\n",
"print(\"Ordered categorical data:\")\n",
"print(df_categorical['Performance_Level'].head(10))\n",
"print(\"\\nCategory info:\")\n",
"print(df_categorical['Performance_Level'].cat.categories)\n",
"print(f\"Is ordered: {df_categorical['Performance_Level'].cat.ordered}\")\n",
"\n",
"# Categorical operations\n",
"print(\"\\nPerformance level distribution:\")\n",
"print(df_categorical['Performance_Level'].value_counts().sort_index())\n",
"\n",
"# Add new category\n",
"df_categorical['Performance_Level'] = df_categorical['Performance_Level'].cat.add_categories(['Outstanding'])\n",
"print(f\"\\nCategories after adding 'Outstanding': {df_categorical['Performance_Level'].cat.categories}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 7. Mathematical and Statistical Transformations\n",
"\n",
"Create columns using mathematical functions and statistical transformations."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Mathematical transformations\n",
"df_math = df_sales.copy()\n",
"\n",
"# Common mathematical transformations\n",
"df_math['Sales_Log'] = np.log(df_math['Sales'])\n",
"df_math['Sales_Sqrt'] = np.sqrt(df_math['Sales'])\n",
"df_math['Sales_Squared'] = df_math['Sales'] ** 2\n",
"df_math['Sales_Reciprocal'] = 1 / df_math['Sales']\n",
"\n",
"print(\"Mathematical transformations:\")\n",
"print(df_math[['Sales', 'Sales_Log', 'Sales_Sqrt', 'Sales_Squared', 'Sales_Reciprocal']].head())\n",
"\n",
"# Statistical standardization\n",
"df_math['Sales_Z_Score'] = (df_math['Sales'] - df_math['Sales'].mean()) / df_math['Sales'].std()\n",
"df_math['Sales_Min_Max_Scaled'] = (df_math['Sales'] - df_math['Sales'].min()) / (df_math['Sales'].max() - df_math['Sales'].min())\n",
"\n",
"# Rolling statistics\n",
"df_math = df_math.sort_values('Date')\n",
"df_math['Sales_Rolling_7_Mean'] = df_math['Sales'].rolling(window=7, min_periods=1).mean()\n",
"df_math['Sales_Rolling_7_Std'] = df_math['Sales'].rolling(window=7, min_periods=1).std()\n",
"df_math['Sales_Cumulative_Sum'] = df_math['Sales'].cumsum()\n",
"\n",
"print(\"\\nStatistical transformations:\")\n",
"print(df_math[['Sales', 'Sales_Z_Score', 'Sales_Min_Max_Scaled', \n",
" 'Sales_Rolling_7_Mean', 'Sales_Cumulative_Sum']].head(10))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Rank and percentile columns\n",
"df_math['Sales_Rank'] = df_math['Sales'].rank(ascending=False)\n",
"df_math['Sales_Percentile'] = df_math['Sales'].rank(pct=True) * 100\n",
"df_math['Sales_Rank_by_Region'] = df_math.groupby('Region')['Sales'].rank(ascending=False)\n",
"\n",
"# Binning and discretization\n",
"df_math['Sales_Decile'] = pd.qcut(df_math['Sales'], q=10, labels=range(1, 11))\n",
"df_math['Sales_Tertile'] = pd.qcut(df_math['Sales'], q=3, labels=['Low', 'Medium', 'High'])\n",
"\n",
"print(\"Ranking and binning:\")\n",
"print(df_math[['Sales', 'Sales_Rank', 'Sales_Percentile', 'Sales_Rank_by_Region', \n",
" 'Sales_Decile', 'Sales_Tertile']].head(10))\n",
"\n",
"print(\"\\nDecile distribution:\")\n",
"print(df_math['Sales_Decile'].value_counts().sort_index())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Practice Exercises\n",
"\n",
"Apply your column creation and modification skills:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Exercise 1: Customer Segmentation\n",
"# Create a comprehensive customer segmentation system:\n",
"# - Combine purchase behavior, frequency, and value\n",
"# - Create RFM-like scores (Recency, Frequency, Monetary)\n",
"# - Assign customer segments (e.g., Champion, Loyal, At Risk, etc.)\n",
"\n",
"def create_customer_segmentation(df):\n",
" \"\"\"Create customer segmentation based on purchase patterns\"\"\"\n",
" # Your implementation here\n",
" pass\n",
"\n",
"# segmented_df = create_customer_segmentation(df_sales)\n",
"# print(segmented_df[['Customer_Type', 'Sales', 'Frequency_Score', 'Monetary_Score', 'Segment']].head())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Exercise 2: Performance Metrics Dashboard\n",
"# Create a comprehensive set of KPI columns:\n",
"# - Sales efficiency metrics\n",
"# - Trend indicators (growth rates, momentum)\n",
"# - Comparative metrics (vs. average, vs. target)\n",
"# - Alert flags for unusual patterns\n",
"\n",
"# Your code here:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Exercise 3: Feature Engineering for ML\n",
"# Create features that could be useful for machine learning:\n",
"# - Interaction features (product of two variables)\n",
"# - Polynomial features\n",
"# - Time-based features (seasonality, trends)\n",
"# - Lag features (previous period values)\n",
"\n",
"# Your code here:\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Key Takeaways\n",
"\n",
"1. **Column Assignment**: Use direct assignment (`df['col'] = value`) for simple cases\n",
"2. **Assign Method**: Use `.assign()` for functional programming style and method chaining\n",
"3. **Conditional Logic**: Combine `np.where()`, `pd.cut()`, `pd.qcut()`, and `np.select()` for complex conditions\n",
"4. **Apply Functions**: Use `.apply()` with lambda or custom functions for complex transformations\n",
"5. **Date Features**: Extract meaningful components from datetime columns\n",
"6. **String Operations**: Leverage `.str` accessor for text manipulations\n",
"7. **Categorical Data**: Convert to categories for memory efficiency and special operations\n",
"8. **Mathematical Transformations**: Apply statistical and mathematical functions for data preprocessing\n",
"\n",
"## Performance Tips\n",
"\n",
"1. **Vectorized Operations**: Prefer pandas/numpy operations over loops\n",
"2. **Categorical Types**: Use categorical data for repeated string values\n",
"3. **Memory Management**: Monitor memory usage when creating many new columns\n",
"4. **Method Chaining**: Use `.assign()` for readable method chains\n",
"5. **Avoid apply() When Possible**: Use vectorized operations instead of `.apply()` for better performance\n",
"\n",
"## Common Patterns\n",
"\n",
"```python\n",
"# Simple calculation\n",
"df['new_col'] = df['col1'] * df['col2']\n",
"\n",
"# Conditional column\n",
"df['category'] = np.where(df['value'] > threshold, 'High', 'Low')\n",
"\n",
"# Apply custom function\n",
"df['result'] = df.apply(custom_function, axis=1)\n",
"\n",
"# Date features\n",
"df['month'] = df['date'].dt.month\n",
"\n",
"# String operations\n",
"df['upper'] = df['text'].str.upper()\n",
"```"
]
}
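,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Vectorization vs. apply(): a quick check\n",
"\n",
"To illustrate performance tip 5, the sketch below (assuming the `df_sales` frame built at the top of this notebook) computes the same 5% commission two ways and confirms the results match. The vectorized version avoids one Python-level function call per row, which is why it scales far better on large frames."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Row-wise apply: one Python-level call per row (slow on large frames)\n",
"commission_apply = df_sales.apply(lambda row: row['Sales'] * 0.05, axis=1)\n",
"\n",
"# Vectorized equivalent: a single columnar operation\n",
"commission_vectorized = df_sales['Sales'] * 0.05\n",
"\n",
"# Verify both approaches produce identical values\n",
"print(commission_apply.equals(commission_vectorized))"
]
}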
],
"metadata": {
"kernelspec": {
"display_name": "venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.13.3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}