{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Session 1 - DataFrames - Lesson 5: Adding and Modifying Columns\n", "\n", "## Learning Objectives\n", "- Learn different methods to add new columns to DataFrames\n", "- Master conditional column creation using various techniques\n", "- Understand how to modify existing columns\n", "- Practice with calculated fields and derived columns\n", "- Explore data type conversions and transformations\n", "\n", "## Prerequisites\n", "- Completed Lessons 1-4\n", "- Understanding of basic Python operations and functions" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Import required libraries\n", "import pandas as pd\n", "import numpy as np\n", "from datetime import datetime, timedelta\n", "\n", "# Create sample dataset\n", "np.random.seed(42)\n", "n_records = 150\n", "\n", "sales_data = {\n", " 'Date': pd.date_range('2024-01-01', periods=n_records, freq='D'),\n", " 'Product': np.random.choice(['Laptop', 'Phone', 'Tablet', 'Monitor'], n_records),\n", " 'Sales': np.random.normal(1000, 200, n_records).astype(int),\n", " 'Quantity': np.random.randint(1, 8, n_records),\n", " 'Region': np.random.choice(['North', 'South', 'East', 'West'], n_records),\n", " 'Salesperson': np.random.choice(['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'], n_records),\n", " 'Customer_Type': np.random.choice(['New', 'Returning', 'VIP'], n_records, p=[0.3, 0.6, 0.1])\n", "}\n", "\n", "df_sales = pd.DataFrame(sales_data)\n", "df_sales['Sales'] = np.abs(df_sales['Sales']) # Ensure positive values\n", "\n", "print(\"Original dataset:\")\n", "print(f\"Shape: {df_sales.shape}\")\n", "print(\"\\nFirst few rows:\")\n", "print(df_sales.head())\n", "print(\"\\nData types:\")\n", "print(df_sales.dtypes)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Basic Column Addition\n", "\n", "Simple methods to add new columns." 
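, "\n",
"\n",
"A quick caution before the examples (a minimal sketch; `df_demo` and `Sales_Doubled` are illustrative names, not part of the lesson's dataset): assigning a new column to a DataFrame that is itself a slice of another DataFrame can raise pandas' `SettingWithCopyWarning`, which is why the cells below always work on an explicit `.copy()`.\n",
"\n",
"```python\n",
"# Copy first, then assign: the new column lands on the copy only\n",
"df_demo = df_sales.copy()\n",
"df_demo['Sales_Doubled'] = df_demo['Sales'] * 2\n",
"```"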
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Method 1: Direct assignment\n", "df_modified = df_sales.copy()\n", "\n", "# Add simple calculated columns\n", "df_modified['Revenue'] = df_modified['Sales'] * df_modified['Quantity']\n", "df_modified['Commission_10%'] = df_modified['Sales'] * 0.10\n", "df_modified['Sales_per_Unit'] = df_modified['Sales'] / df_modified['Quantity']\n", "\n", "print(\"New calculated columns:\")\n", "print(df_modified[['Sales', 'Quantity', 'Revenue', 'Commission_10%', 'Sales_per_Unit']].head())\n", "\n", "# Add constant value column\n", "df_modified['Year'] = 2024\n", "df_modified['Currency'] = 'USD'\n", "df_modified['Department'] = 'Sales'\n", "\n", "print(\"\\nConstant value columns added:\")\n", "print(df_modified[['Year', 'Currency', 'Department']].head())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Method 2: Using assign() method (more functional approach)\n", "df_assigned = df_sales.assign(\n", " Revenue=lambda x: x['Sales'] * x['Quantity'],\n", " Commission_Rate=0.08,\n", " Commission_Amount=lambda x: x['Sales'] * 0.08,\n", " Sales_Squared=lambda x: x['Sales'] ** 2,\n", " Is_High_Volume=lambda x: x['Quantity'] > 5\n", ")\n", "\n", "print(\"Using assign() method:\")\n", "print(df_assigned[['Sales', 'Quantity', 'Revenue', 'Commission_Amount', 'Is_High_Volume']].head())\n", "\n", "print(f\"\\nOriginal shape: {df_sales.shape}\")\n", "print(f\"Modified shape: {df_assigned.shape}\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Method 3: Using insert() for specific positioning\n", "df_insert = df_sales.copy()\n", "\n", "# Insert column at specific position (after 'Sales')\n", "sales_index = df_insert.columns.get_loc('Sales')\n", "df_insert.insert(sales_index + 1, 'Sales_Tax', df_insert['Sales'] * 0.08)\n", "df_insert.insert(sales_index + 2, 'Total_with_Tax', df_insert['Sales'] + 
df_insert['Sales_Tax'])\n", "\n", "print(\"Using insert() for positioned columns:\")\n", "print(df_insert[['Product', 'Sales', 'Sales_Tax', 'Total_with_Tax', 'Quantity']].head())\n", "print(f\"\\nColumn order: {list(df_insert.columns)}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Conditional Column Creation\n", "\n", "Create columns based on conditions and business logic." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Method 1: Using np.where() for simple conditions\n", "df_conditional = df_sales.copy()\n", "\n", "# Simple binary conditions\n", "df_conditional['High_Sales'] = np.where(df_conditional['Sales'] > 1000, 'Yes', 'No')\n", "df_conditional['Weekend'] = np.where(df_conditional['Date'].dt.dayofweek >= 5, 'Weekend', 'Weekday')\n", "df_conditional['Bulk_Order'] = np.where(df_conditional['Quantity'] >= 5, 'Bulk', 'Regular')\n", "\n", "print(\"Simple conditional columns:\")\n", "print(df_conditional[['Sales', 'High_Sales', 'Date', 'Weekend', 'Quantity', 'Bulk_Order']].head())\n", "\n", "# Nested conditions\n", "df_conditional['Sales_Category'] = np.where(df_conditional['Sales'] > 1200, 'High',\n", " np.where(df_conditional['Sales'] > 800, 'Medium', 'Low'))\n", "\n", "print(\"\\nNested conditions:\")\n", "print(df_conditional[['Sales', 'Sales_Category']].head(10))\n", "print(\"\\nCategory distribution:\")\n", "print(df_conditional['Sales_Category'].value_counts())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Method 2: Using pd.cut() for binning numerical data\n", "df_conditional['Sales_Tier'] = pd.cut(df_conditional['Sales'], \n", " bins=[0, 500, 800, 1200, float('inf')],\n", " labels=['Entry', 'Standard', 'Premium', 'Luxury'])\n", "\n", "print(\"Using pd.cut() for binning:\")\n", "print(df_conditional[['Sales', 'Sales_Tier']].head(10))\n", "print(\"\\nTier distribution:\")\n", 
"print(df_conditional['Sales_Tier'].value_counts())\n", "\n", "# Using pd.qcut() for quantile-based binning\n", "df_conditional['Sales_Quintile'] = pd.qcut(df_conditional['Sales'], \n", " q=5, \n", " labels=['Bottom 20%', 'Low 20%', 'Mid 20%', 'High 20%', 'Top 20%'])\n", "\n", "print(\"\\nUsing pd.qcut() for quantile binning:\")\n", "print(df_conditional['Sales_Quintile'].value_counts())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Method 3: Using np.select() for multiple conditions\n", "# Define conditions and choices (evaluated in order; the first match wins)\n", "conditions = [\n", " (df_conditional['Sales'] >= 1200) & (df_conditional['Quantity'] >= 5),\n", " (df_conditional['Sales'] >= 1000) & (df_conditional['Customer_Type'] == 'VIP'),\n", " (df_conditional['Sales'] >= 800) & (df_conditional['Region'] == 'North'),\n", " df_conditional['Customer_Type'] == 'New'\n", "]\n", "\n", "choices = ['Premium Deal', 'VIP Sale', 'North Preferred', 'New Customer']\n", "default = 'Standard'\n", "\n", "df_conditional['Deal_Type'] = np.select(conditions, choices, default=default)\n", "\n", "print(\"Using np.select() for complex conditions:\")\n", "print(df_conditional[['Sales', 'Quantity', 'Customer_Type', 'Region', 'Deal_Type']].head(10))\n", "print(\"\\nDeal type distribution:\")\n", "print(df_conditional['Deal_Type'].value_counts())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Using Apply and Lambda Functions\n", "\n", "Create complex calculated columns using custom functions."
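, "\n",
"\n",
"A rule of thumb worth stating up front: many `.apply()` calls have a vectorized equivalent that is far faster. A small sketch using `df_sales` from the setup cell (`log_slow` and `log_fast` are illustrative names):\n",
"\n",
"```python\n",
"# Both produce the same values; the vectorized form avoids a Python-level loop\n",
"log_slow = df_sales['Sales'].apply(lambda x: np.log(x))\n",
"log_fast = np.log(df_sales['Sales'])\n",
"```"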
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Simple lambda functions\n", "df_apply = df_sales.copy()\n", "\n", "# Single column transformations\n", "df_apply['Sales_Log'] = df_apply['Sales'].apply(lambda x: np.log(x))\n", "df_apply['Product_Length'] = df_apply['Product'].apply(lambda x: len(x))\n", "df_apply['Days_Since_Start'] = df_apply['Date'].apply(lambda x: (x - df_apply['Date'].min()).days)\n", "\n", "print(\"Simple lambda transformations:\")\n", "print(df_apply[['Sales', 'Sales_Log', 'Product', 'Product_Length', 'Days_Since_Start']].head())\n", "\n", "# Multiple column operations using lambda\n", "df_apply['Efficiency_Score'] = df_apply.apply(\n", " lambda row: (row['Sales'] * row['Quantity']) / (row['Days_Since_Start'] + 1), \n", " axis=1\n", ")\n", "\n", "print(\"\\nMultiple column lambda:\")\n", "print(df_apply[['Sales', 'Quantity', 'Days_Since_Start', 'Efficiency_Score']].head())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Custom functions for complex business logic\n", "def calculate_commission(row):\n", " \"\"\"Calculate commission based on complex business rules\"\"\"\n", " base_rate = 0.05\n", " \n", " # VIP customers get higher commission\n", " if row['Customer_Type'] == 'VIP':\n", " base_rate += 0.02\n", " \n", " # High quantity orders get bonus\n", " if row['Quantity'] >= 5:\n", " base_rate += 0.01\n", " \n", " # Regional multipliers\n", " region_multipliers = {'North': 1.2, 'South': 1.0, 'East': 1.1, 'West': 0.9}\n", " multiplier = region_multipliers.get(row['Region'], 1.0)\n", " \n", " return row['Sales'] * base_rate * multiplier\n", "\n", "def performance_rating(row):\n", " \"\"\"Calculate performance rating based on multiple factors\"\"\"\n", " score = 0\n", " \n", " # Sales performance (40% weight)\n", " if row['Sales'] > 1200:\n", " score += 40\n", " elif row['Sales'] > 800:\n", " score += 30\n", " else:\n", " score += 20\n", " \n", 
" # Quantity performance (30% weight)\n", " if row['Quantity'] >= 6:\n", " score += 30\n", " elif row['Quantity'] >= 4:\n", " score += 20\n", " else:\n", " score += 10\n", " \n", " # Customer type bonus (30% weight)\n", " customer_bonus = {'VIP': 30, 'Returning': 20, 'New': 15}\n", " score += customer_bonus.get(row['Customer_Type'], 0)\n", " \n", " # Convert to letter grade\n", " if score >= 85:\n", " return 'A'\n", " elif score >= 70:\n", " return 'B'\n", " elif score >= 55:\n", " return 'C'\n", " else:\n", " return 'D'\n", "\n", "# Apply custom functions\n", "df_apply['Commission'] = df_apply.apply(calculate_commission, axis=1)\n", "df_apply['Performance_Rating'] = df_apply.apply(performance_rating, axis=1)\n", "\n", "print(\"Custom function results:\")\n", "print(df_apply[['Sales', 'Quantity', 'Customer_Type', 'Region', 'Commission', 'Performance_Rating']].head())\n", "\n", "print(\"\\nPerformance rating distribution:\")\n", "print(df_apply['Performance_Rating'].value_counts().sort_index())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Date and Time Derived Columns\n", "\n", "Extract useful information from datetime columns." 
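, "\n",
"\n",
"Note that the `.dt` accessor used throughout this section only works on `datetime64` columns. Our `Date` column was built with `pd.date_range`, so it already has the right dtype; if dates arrive as strings (for example from a CSV), convert them first. A minimal sketch (the `raw` frame is hypothetical):\n",
"\n",
"```python\n",
"raw = pd.DataFrame({'Date': ['2024-01-01', '2024-01-02']})  # string dates\n",
"raw['Date'] = pd.to_datetime(raw['Date'])  # now the .dt accessor works\n",
"print(raw['Date'].dt.day_name())\n",
"```"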
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Extract date components\n", "df_dates = df_sales.copy()\n", "\n", "# Basic date components\n", "df_dates['Year'] = df_dates['Date'].dt.year\n", "df_dates['Month'] = df_dates['Date'].dt.month\n", "df_dates['Day'] = df_dates['Date'].dt.day\n", "df_dates['DayOfWeek'] = df_dates['Date'].dt.dayofweek # 0=Monday, 6=Sunday\n", "df_dates['DayName'] = df_dates['Date'].dt.day_name()\n", "df_dates['MonthName'] = df_dates['Date'].dt.month_name()\n", "\n", "print(\"Basic date components:\")\n", "print(df_dates[['Date', 'Year', 'Month', 'Day', 'DayOfWeek', 'DayName', 'MonthName']].head())\n", "\n", "# Business-relevant date features\n", "df_dates['Quarter'] = df_dates['Date'].dt.quarter\n", "df_dates['Week'] = df_dates['Date'].dt.isocalendar().week\n", "df_dates['DayOfYear'] = df_dates['Date'].dt.dayofyear\n", "df_dates['IsWeekend'] = df_dates['Date'].dt.dayofweek >= 5\n", "df_dates['IsMonthStart'] = df_dates['Date'].dt.is_month_start\n", "df_dates['IsMonthEnd'] = df_dates['Date'].dt.is_month_end\n", "\n", "print(\"\\nBusiness date features:\")\n", "print(df_dates[['Date', 'Quarter', 'Week', 'IsWeekend', 'IsMonthStart', 'IsMonthEnd']].head(10))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Time-based calculations\n", "start_date = df_dates['Date'].min()\n", "df_dates['Days_Since_Start'] = (df_dates['Date'] - start_date).dt.days\n", "df_dates['Weeks_Since_Start'] = df_dates['Days_Since_Start'] // 7\n", "\n", "# Create season column\n", "def get_season(month):\n", " if month in [12, 1, 2]:\n", " return 'Winter'\n", " elif month in [3, 4, 5]:\n", " return 'Spring'\n", " elif month in [6, 7, 8]:\n", " return 'Summer'\n", " else:\n", " return 'Fall'\n", "\n", "df_dates['Season'] = df_dates['Month'].apply(get_season)\n", "\n", "# Business day calculations\n", "df_dates['IsBusinessDay'] = df_dates['Date'].dt.dayofweek < 5\n", 
"df_dates['BusinessDaysSinceStart'] = df_dates.apply(\n", " lambda row: np.busday_count(start_date.date(), row['Date'].date()), axis=1\n", ")\n", "\n", "print(\"Time-based calculations:\")\n", "print(df_dates[['Date', 'Days_Since_Start', 'Weeks_Since_Start', 'Season', \n", " 'IsBusinessDay', 'BusinessDaysSinceStart']].head())\n", "\n", "print(\"\\nSeason distribution:\")\n", "print(df_dates['Season'].value_counts())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. Text and String Manipulations\n", "\n", "Create columns based on string operations and text processing." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# String manipulations\n", "df_text = df_sales.copy()\n", "\n", "# Basic string operations\n", "df_text['Product_Upper'] = df_text['Product'].str.upper()\n", "df_text['Product_Lower'] = df_text['Product'].str.lower()\n", "df_text['Product_Length'] = df_text['Product'].str.len()\n", "df_text['Product_First_Char'] = df_text['Product'].str[0]\n", "df_text['Product_Last_Three'] = df_text['Product'].str[-3:]\n", "\n", "print(\"Basic string operations:\")\n", "print(df_text[['Product', 'Product_Upper', 'Product_Lower', 'Product_Length', \n", " 'Product_First_Char', 'Product_Last_Three']].head())\n", "\n", "# Text categorization\n", "df_text['Product_Category'] = df_text['Product'].apply(lambda x: \n", " 'Computer' if x in ['Laptop', 'Monitor'] else\n", " 'Mobile' if x in ['Phone', 'Tablet'] else\n", " 'Other'\n", ")\n", "\n", "# Check for patterns\n", "df_text['Has_Letter_A'] = df_text['Product'].str.contains('a', case=False)\n", "df_text['Starts_With_L'] = df_text['Product'].str.startswith('L')\n", "df_text['Ends_With_E'] = df_text['Product'].str.endswith('e')\n", "\n", "print(\"\\nText patterns and categorization:\")\n", "print(df_text[['Product', 'Product_Category', 'Has_Letter_A', 'Starts_With_L', 'Ends_With_E']].head(10))" ] }, { "cell_type": "code", "execution_count": null, "metadata": 
{}, "outputs": [], "source": [ "# Create formatted text columns\n", "df_text['Sales_Formatted'] = df_text['Sales'].apply(lambda x: f\"${x:,.2f}\")\n", "df_text['Transaction_ID'] = df_text.apply(\n", " lambda row: f\"{row['Region'][:1]}{row['Product'][:3].upper()}{row.name:04d}\", axis=1\n", ")\n", "\n", "# Create summary descriptions\n", "df_text['Transaction_Summary'] = df_text.apply(\n", " lambda row: f\"{row['Salesperson']} sold {row['Quantity']} {row['Product']}(s) \"\n", " f\"for {row['Sales_Formatted']} in {row['Region']} region\", \n", " axis=1\n", ")\n", "\n", "print(\"Formatted text columns:\")\n", "print(df_text[['Sales_Formatted', 'Transaction_ID']].head())\n", "print(\"\\nTransaction summaries:\")\n", "for i, summary in enumerate(df_text['Transaction_Summary'].head(3)):\n", " print(f\"{i+1}. {summary}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 6. Working with Categorical Data\n", "\n", "Optimize memory usage and enable category-specific operations." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Convert to categorical data types\n", "df_categorical = df_sales.copy()\n", "\n", "# Check memory usage before\n", "print(\"Memory usage before categorical conversion:\")\n", "print(df_categorical.memory_usage(deep=True))\n", "\n", "# Convert string columns to categorical\n", "categorical_columns = ['Product', 'Region', 'Salesperson', 'Customer_Type']\n", "for col in categorical_columns:\n", " df_categorical[col] = df_categorical[col].astype('category')\n", "\n", "print(\"\\nMemory usage after categorical conversion:\")\n", "print(df_categorical.memory_usage(deep=True))\n", "\n", "print(\"\\nData types after conversion:\")\n", "print(df_categorical.dtypes)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Working with ordered categories\n", "# Create ordered categorical for sales performance\n", "performance_categories = ['Poor', 'Fair', 
'Good', 'Excellent']\n", "df_categorical['Performance_Level'] = pd.cut(\n", " df_categorical['Sales'],\n", " bins=[0, 700, 900, 1200, float('inf')],\n", " labels=performance_categories,\n", " ordered=True\n", ")\n", "\n", "print(\"Ordered categorical data:\")\n", "print(df_categorical['Performance_Level'].head(10))\n", "print(\"\\nCategory info:\")\n", "print(df_categorical['Performance_Level'].cat.categories)\n", "print(f\"Is ordered: {df_categorical['Performance_Level'].cat.ordered}\")\n", "\n", "# Categorical operations\n", "print(\"\\nPerformance level distribution:\")\n", "print(df_categorical['Performance_Level'].value_counts().sort_index())\n", "\n", "# Add new category\n", "df_categorical['Performance_Level'] = df_categorical['Performance_Level'].cat.add_categories(['Outstanding'])\n", "print(f\"\\nCategories after adding 'Outstanding': {df_categorical['Performance_Level'].cat.categories}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 7. Mathematical and Statistical Transformations\n", "\n", "Create columns using mathematical functions and statistical transformations." 
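, "\n",
"\n",
"One caveat before applying these: `np.log` and `1 / x` are undefined at zero. The `Sales` values in this lesson are positive in practice, but on data that may contain zeros, `np.log1p` (log of 1 + x) is a common safeguard. A small illustrative sketch (the `vals` series is hypothetical):\n",
"\n",
"```python\n",
"vals = pd.Series([0, 10, 100])\n",
"safe_log = np.log1p(vals)  # defined at 0, where np.log would return -inf\n",
"```"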
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Mathematical transformations\n", "df_math = df_sales.copy()\n", "\n", "# Common mathematical transformations\n", "df_math['Sales_Log'] = np.log(df_math['Sales'])\n", "df_math['Sales_Sqrt'] = np.sqrt(df_math['Sales'])\n", "df_math['Sales_Squared'] = df_math['Sales'] ** 2\n", "df_math['Sales_Reciprocal'] = 1 / df_math['Sales']\n", "\n", "print(\"Mathematical transformations:\")\n", "print(df_math[['Sales', 'Sales_Log', 'Sales_Sqrt', 'Sales_Squared', 'Sales_Reciprocal']].head())\n", "\n", "# Statistical standardization\n", "df_math['Sales_Z_Score'] = (df_math['Sales'] - df_math['Sales'].mean()) / df_math['Sales'].std()\n", "df_math['Sales_Min_Max_Scaled'] = (df_math['Sales'] - df_math['Sales'].min()) / (df_math['Sales'].max() - df_math['Sales'].min())\n", "\n", "# Rolling statistics\n", "df_math = df_math.sort_values('Date')\n", "df_math['Sales_Rolling_7_Mean'] = df_math['Sales'].rolling(window=7, min_periods=1).mean()\n", "df_math['Sales_Rolling_7_Std'] = df_math['Sales'].rolling(window=7, min_periods=1).std()\n", "df_math['Sales_Cumulative_Sum'] = df_math['Sales'].cumsum()\n", "\n", "print(\"\\nStatistical transformations:\")\n", "print(df_math[['Sales', 'Sales_Z_Score', 'Sales_Min_Max_Scaled', \n", " 'Sales_Rolling_7_Mean', 'Sales_Cumulative_Sum']].head(10))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Rank and percentile columns\n", "df_math['Sales_Rank'] = df_math['Sales'].rank(ascending=False)\n", "df_math['Sales_Percentile'] = df_math['Sales'].rank(pct=True) * 100\n", "df_math['Sales_Rank_by_Region'] = df_math.groupby('Region')['Sales'].rank(ascending=False)\n", "\n", "# Binning and discretization\n", "df_math['Sales_Decile'] = pd.qcut(df_math['Sales'], q=10, labels=range(1, 11))\n", "df_math['Sales_Tertile'] = pd.qcut(df_math['Sales'], q=3, labels=['Low', 'Medium', 'High'])\n", "\n", "print(\"Ranking and 
binning:\")\n", "print(df_math[['Sales', 'Sales_Rank', 'Sales_Percentile', 'Sales_Rank_by_Region', \n", " 'Sales_Decile', 'Sales_Tertile']].head(10))\n", "\n", "print(\"\\nDecile distribution:\")\n", "print(df_math['Sales_Decile'].value_counts().sort_index())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Practice Exercises\n", "\n", "Apply your column creation and modification skills:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "# Exercise 1: Customer Segmentation\n", "# Create a comprehensive customer segmentation system:\n", "# - Combine purchase behavior, frequency, and value\n", "# - Create RFM-like scores (Recency, Frequency, Monetary)\n", "# - Assign customer segments (e.g., Champion, Loyal, At Risk, etc.)\n", "\n", "def create_customer_segmentation(df):\n", " \"\"\"Create customer segmentation based on purchase patterns\"\"\"\n", " # Your implementation here\n", " pass\n", "\n", "# segmented_df = create_customer_segmentation(df_sales)\n", "# print(segmented_df[['Customer_Type', 'Sales', 'Frequency_Score', 'Monetary_Score', 'Segment']].head())" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "# Exercise 2: Performance Metrics Dashboard\n", "# Create a comprehensive set of KPI columns:\n", "# - Sales efficiency metrics\n", "# - Trend indicators (growth rates, momentum)\n", "# - Comparative metrics (vs. average, vs. 
target)\n", "# - Alert flags for unusual patterns\n", "\n", "# Your code here:\n" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "# Exercise 3: Feature Engineering for ML\n", "# Create features that could be useful for machine learning:\n", "# - Interaction features (product of two variables)\n", "# - Polynomial features\n", "# - Time-based features (seasonality, trends)\n", "# - Lag features (previous period values)\n", "\n", "# Your code here:\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Key Takeaways\n", "\n", "1. **Column Assignment**: Use direct assignment (`df['col'] = value`) for simple cases\n", "2. **Assign Method**: Use `.assign()` for functional programming style and method chaining\n", "3. **Conditional Logic**: Combine `np.where()`, `pd.cut()`, `pd.qcut()`, and `np.select()` for complex conditions\n", "4. **Apply Functions**: Use `.apply()` with lambda or custom functions for complex transformations\n", "5. **Date Features**: Extract meaningful components from datetime columns\n", "6. **String Operations**: Leverage `.str` accessor for text manipulations\n", "7. **Categorical Data**: Convert to categories for memory efficiency and special operations\n", "8. **Mathematical Transformations**: Apply statistical and mathematical functions for data preprocessing\n", "\n", "## Performance Tips\n", "\n", "1. **Vectorized Operations**: Prefer pandas/numpy operations over loops\n", "2. **Categorical Types**: Use categorical data for repeated string values\n", "3. **Memory Management**: Monitor memory usage when creating many new columns\n", "4. **Method Chaining**: Use `.assign()` for readable method chains\n", "5. 
**Avoid apply() When Possible**: Use vectorized operations instead of `.apply()` for better performance\n", "\n", "## Common Patterns\n", "\n", "```python\n", "# Simple calculation\n", "df['new_col'] = df['col1'] * df['col2']\n", "\n", "# Conditional column\n", "df['category'] = np.where(df['value'] > threshold, 'High', 'Low')\n", "\n", "# Apply custom function\n", "df['result'] = df.apply(custom_function, axis=1)\n", "\n", "# Date features\n", "df['month'] = df['date'].dt.month\n", "\n", "# String operations\n", "df['upper'] = df['text'].str.upper()\n", "```" ] } ], "metadata": { "kernelspec": { "display_name": "venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.3" } }, "nbformat": 4, "nbformat_minor": 4 }