# Session 1 - DataFrames - Lesson 5: Adding and Modifying Columns

## Learning Objectives
- Learn different methods to add new columns to DataFrames
- Master conditional column creation using various techniques
- Understand how to modify existing columns
- Practice with calculated fields and derived columns
- Explore data type conversions and transformations

## Prerequisites
- Completed Lessons 1-4
- Understanding of basic Python operations and functions

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
from datetime import datetime, timedelta

# Create sample dataset
np.random.seed(42)
n_records = 150

sales_data = {
    'Date': pd.date_range('2024-01-01', periods=n_records, freq='D'),
    'Product': np.random.choice(['Laptop', 'Phone', 'Tablet', 'Monitor'], n_records),
    'Sales': np.random.normal(1000, 200, n_records).astype(int),
    'Quantity': np.random.randint(1, 8, n_records),
    'Region': np.random.choice(['North', 'South', 'East', 'West'], n_records),
    'Salesperson': np.random.choice(['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'], n_records),
    'Customer_Type': np.random.choice(['New', 'Returning', 'VIP'], n_records, p=[0.3, 0.6, 0.1])
}

df_sales = pd.DataFrame(sales_data)
df_sales['Sales'] = np.abs(df_sales['Sales'])  # Ensure non-negative values

print("Original dataset:")
print(f"Shape: {df_sales.shape}")
print("\nFirst few rows:")
print(df_sales.head())
print("\nData types:")
print(df_sales.dtypes)

## 1. Basic Column Addition

Three simple ways to add new columns: direct assignment, `.assign()`, and `.insert()`.

In [None]:
# Method 1: Direct assignment
df_modified = df_sales.copy()

# Add simple calculated columns
df_modified['Revenue'] = df_modified['Sales'] * df_modified['Quantity']
df_modified['Commission_10%'] = df_modified['Sales'] * 0.10
df_modified['Sales_per_Unit'] = df_modified['Sales'] / df_modified['Quantity']

print("New calculated columns:")
print(df_modified[['Sales', 'Quantity', 'Revenue', 'Commission_10%', 'Sales_per_Unit']].head())

# Add constant value column
df_modified['Year'] = 2024
df_modified['Currency'] = 'USD'
df_modified['Department'] = 'Sales'

print("\nConstant value columns added:")
print(df_modified[['Year', 'Currency', 'Department']].head())
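
One detail of direct assignment that the cell above does not exercise: a list or array is assigned by position, while a `Series` is aligned on index labels. The sketch below uses a hypothetical toy frame (not the lesson dataset) with a non-default index to make the difference visible.

```python
import pandas as pd

# Hypothetical toy frame with a non-default index
df = pd.DataFrame({'Sales': [100, 200, 300]}, index=[10, 20, 30])

# A plain list is assigned by position, so its length must match the frame
df['Quantity'] = [1, 2, 3]

# A Series is aligned on index labels, not on position;
# labels missing from the Series become NaN in the new column
bonus = pd.Series([5, 7], index=[30, 10])
df['Bonus'] = bonus

print(df)
```

Here the row labelled `20` has no matching entry in `bonus`, so it receives `NaN` and the column becomes floating point. If positional assignment is what you want, passing `bonus.to_numpy()` (with a matching length) sidesteps the alignment.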

In [None]:
# Method 2: Using assign() method (more functional approach)
df_assigned = df_sales.assign(
    Revenue=lambda x: x['Sales'] * x['Quantity'],
    Commission_Rate=0.08,
    Commission_Amount=lambda x: x['Sales'] * 0.08,
    Sales_Squared=lambda x: x['Sales'] ** 2,
    Is_High_Volume=lambda x: x['Quantity'] > 5
)

print("Using assign() method:")
print(df_assigned[['Sales', 'Quantity', 'Revenue', 'Commission_Amount', 'Is_High_Volume']].head())

print(f"\nOriginal shape: {df_sales.shape}")
print(f"Modified shape: {df_assigned.shape}")
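
Two properties of `.assign()` are easy to miss: it returns a new DataFrame and leaves the original untouched, and a later keyword argument can refer to a column created by an earlier one in the same call. A minimal sketch on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data, not the lesson dataset
df = pd.DataFrame({'Sales': [1000, 400], 'Quantity': [2, 4]})

result = df.assign(
    # Created first...
    Revenue=lambda d: d['Sales'] * d['Quantity'],
    # ...and already visible to the next keyword argument
    Revenue_Share=lambda d: d['Revenue'] / d['Revenue'].sum(),
).sort_values('Revenue_Share', ascending=False)

print(result)
print("Original columns:", list(df.columns))  # still only Sales and Quantity
```

Because nothing is mutated in place, the whole expression can sit inside a longer method chain without an intermediate `.copy()`.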

In [None]:
# Method 3: Using insert() for specific positioning
df_insert = df_sales.copy()

# Insert column at specific position (after 'Sales')
sales_index = df_insert.columns.get_loc('Sales')
df_insert.insert(sales_index + 1, 'Sales_Tax', df_insert['Sales'] * 0.08)
df_insert.insert(sales_index + 2, 'Total_with_Tax', df_insert['Sales'] + df_insert['Sales_Tax'])

print("Using insert() for positioned columns:")
print(df_insert[['Product', 'Sales', 'Sales_Tax', 'Total_with_Tax', 'Quantity']].head())
print(f"\nColumn order: {list(df_insert.columns)}")

## 2. Conditional Column Creation

Create columns based on conditions and business logic.

In [None]:
# Method 1: Using np.where() for simple conditions
df_conditional = df_sales.copy()

# Simple binary conditions
df_conditional['High_Sales'] = np.where(df_conditional['Sales'] > 1000, 'Yes', 'No')
df_conditional['Weekend'] = np.where(df_conditional['Date'].dt.dayofweek >= 5, 'Weekend', 'Weekday')
df_conditional['Bulk_Order'] = np.where(df_conditional['Quantity'] >= 5, 'Bulk', 'Regular')

print("Simple conditional columns:")
print(df_conditional[['Sales', 'High_Sales', 'Date', 'Weekend', 'Quantity', 'Bulk_Order']].head())

# Nested conditions
df_conditional['Sales_Category'] = np.where(
    df_conditional['Sales'] > 1200, 'High',
    np.where(df_conditional['Sales'] > 800, 'Medium', 'Low')
)

print("\nNested conditions:")
print(df_conditional[['Sales', 'Sales_Category']].head(10))
print("\nCategory distribution:")
print(df_conditional['Sales_Category'].value_counts())

In [None]:
# Method 2: Using pd.cut() for binning numerical data
df_conditional['Sales_Tier'] = pd.cut(
    df_conditional['Sales'],
    bins=[0, 500, 800, 1200, float('inf')],
    labels=['Entry', 'Standard', 'Premium', 'Luxury']
)

print("Using pd.cut() for binning:")
print(df_conditional[['Sales', 'Sales_Tier']].head(10))
print("\nTier distribution:")
print(df_conditional['Sales_Tier'].value_counts())

# Using pd.qcut() for quantile-based binning
df_conditional['Sales_Quintile'] = pd.qcut(
    df_conditional['Sales'],
    q=5,
    labels=['Bottom 20%', 'Low 20%', 'Mid 20%', 'High 20%', 'Top 20%']
)

print("\nUsing pd.qcut() for quantile binning:")
print(df_conditional['Sales_Quintile'].value_counts())

In [None]:
# Method 3: Using np.select() for multiple conditions
# Define conditions and choices
conditions = [
    (df_conditional['Sales'] >= 1200) & (df_conditional['Quantity'] >= 5),
    (df_conditional['Sales'] >= 1000) & (df_conditional['Customer_Type'] == 'VIP'),
    (df_conditional['Sales'] >= 800) & (df_conditional['Region'] == 'North'),
    df_conditional['Customer_Type'] == 'New'
]

choices = ['Premium Deal', 'VIP Sale', 'North Preferred', 'New Customer']
default = 'Standard'

df_conditional['Deal_Type'] = np.select(conditions, choices, default=default)

print("Using np.select() for complex conditions:")
print(df_conditional[['Sales', 'Quantity', 'Customer_Type', 'Region', 'Deal_Type']].head(10))
print("\nDeal type distribution:")
print(df_conditional['Deal_Type'].value_counts())
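
With `np.select()`, the order of the condition list matters: when several conditions are true for a row, the first one in the list wins. The sketch below, on hypothetical toy data, shows how a broad condition placed first can shadow a narrower one.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({'Sales': [1500, 1100, 700]})

# Broad condition first: '> 1200' can never win, because every such row
# already matched '> 1000'
shadowed = np.select(
    [df['Sales'] > 1000, df['Sales'] > 1200],
    ['Medium', 'High'],
    default='Low',
)

# Narrowest condition first: each row gets the most specific label
ordered = np.select(
    [df['Sales'] > 1200, df['Sales'] > 1000],
    ['High', 'Medium'],
    default='Low',
)

print(list(shadowed))
print(list(ordered))
```

This is why the `Deal_Type` conditions above are listed from most to least specific.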

## 3. Using Apply and Lambda Functions

Create complex calculated columns using custom functions.

In [None]:
# Simple lambda functions
df_apply = df_sales.copy()

# Single column transformations
df_apply['Sales_Log'] = df_apply['Sales'].apply(lambda x: np.log(x))
df_apply['Product_Length'] = df_apply['Product'].apply(lambda x: len(x))
first_date = df_apply['Date'].min()  # Compute once instead of once per row
df_apply['Days_Since_Start'] = df_apply['Date'].apply(lambda x: (x - first_date).days)

print("Simple lambda transformations:")
print(df_apply[['Sales', 'Sales_Log', 'Product', 'Product_Length', 'Days_Since_Start']].head())

# Multiple column operations using lambda
df_apply['Efficiency_Score'] = df_apply.apply(
    lambda row: (row['Sales'] * row['Quantity']) / (row['Days_Since_Start'] + 1),
    axis=1
)

print("\nMultiple column lambda:")
print(df_apply[['Sales', 'Quantity', 'Days_Since_Start', 'Efficiency_Score']].head())

In [None]:
# Custom functions for complex business logic
def calculate_commission(row):
    """Calculate commission based on complex business rules"""
    base_rate = 0.05

    # VIP customers get higher commission
    if row['Customer_Type'] == 'VIP':
        base_rate += 0.02

    # High quantity orders get bonus
    if row['Quantity'] >= 5:
        base_rate += 0.01

    # Regional multipliers
    region_multipliers = {'North': 1.2, 'South': 1.0, 'East': 1.1, 'West': 0.9}
    multiplier = region_multipliers.get(row['Region'], 1.0)

    return row['Sales'] * base_rate * multiplier

def performance_rating(row):
    """Calculate performance rating based on multiple factors"""
    score = 0

    # Sales performance (40% weight)
    if row['Sales'] > 1200:
        score += 40
    elif row['Sales'] > 800:
        score += 30
    else:
        score += 20

    # Quantity performance (30% weight)
    if row['Quantity'] >= 6:
        score += 30
    elif row['Quantity'] >= 4:
        score += 20
    else:
        score += 10

    # Customer type bonus (30% weight)
    customer_bonus = {'VIP': 30, 'Returning': 20, 'New': 15}
    score += customer_bonus.get(row['Customer_Type'], 0)

    # Convert to letter grade
    if score >= 85:
        return 'A'
    elif score >= 70:
        return 'B'
    elif score >= 55:
        return 'C'
    else:
        return 'D'

# Apply custom functions
df_apply['Commission'] = df_apply.apply(calculate_commission, axis=1)
df_apply['Performance_Rating'] = df_apply.apply(performance_rating, axis=1)

print("Custom function results:")
print(df_apply[['Sales', 'Quantity', 'Customer_Type', 'Region', 'Commission', 'Performance_Rating']].head())

print("\nPerformance rating distribution:")
print(df_apply['Performance_Rating'].value_counts().sort_index())
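
The functions above hard-code their rates. When a rule needs to be tunable, `DataFrame.apply()` forwards extra keyword arguments to the function, so the parameters can live at the call site. The function name, parameter names, and data below are hypothetical and only illustrate the mechanism.

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({
    'Sales': [1000, 500],
    'Customer_Type': ['VIP', 'New'],
})

def commission(row, base_rate, vip_bonus=0.0):
    """Hypothetical rule: flat rate plus an optional VIP bonus."""
    rate = base_rate
    if row['Customer_Type'] == 'VIP':
        rate += vip_bonus
    return row['Sales'] * rate

# Keyword arguments after axis=1 are passed through to commission()
df['Commission'] = df.apply(commission, axis=1, base_rate=0.05, vip_bonus=0.02)

print(df)
```

The same function can then be reused with different rates without editing its body.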

## 4. Date and Time Derived Columns

Extract useful information from datetime columns.
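
The `.dt` accessor only works on datetime columns. The lesson dataset is built with `pd.date_range`, so it already has the right dtype, but dates loaded from files often arrive as strings. A minimal sketch of the conversion step, assuming a hypothetical `Order_Date` column in ISO format:

```python
import pandas as pd

# Hypothetical raw data: dates stored as strings, two of them invalid
raw = pd.DataFrame({'Order_Date': ['2024-01-15', '2024-02-30', 'not a date']})

# errors='coerce' turns unparseable values into NaT instead of raising
raw['Order_Date'] = pd.to_datetime(
    raw['Order_Date'], format='%Y-%m-%d', errors='coerce'
)

# .dt now works; rows with NaT yield NaN for the extracted component
raw['Month'] = raw['Order_Date'].dt.month

print(raw)
print(raw.dtypes)
```

Checking `raw['Order_Date'].isna().sum()` after the conversion is a quick way to see how many values failed to parse.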

In [None]:
# Extract date components
df_dates = df_sales.copy()

# Basic date components
df_dates['Year'] = df_dates['Date'].dt.year
df_dates['Month'] = df_dates['Date'].dt.month
df_dates['Day'] = df_dates['Date'].dt.day
df_dates['DayOfWeek'] = df_dates['Date'].dt.dayofweek # 0=Monday, 6=Sunday
df_dates['DayName'] = df_dates['Date'].dt.day_name()
df_dates['MonthName'] = df_dates['Date'].dt.month_name()

print("Basic date components:")
print(df_dates[['Date', 'Year', 'Month', 'Day', 'DayOfWeek', 'DayName', 'MonthName']].head())

# Business-relevant date features
df_dates['Quarter'] = df_dates['Date'].dt.quarter
df_dates['Week'] = df_dates['Date'].dt.isocalendar().week
df_dates['DayOfYear'] = df_dates['Date'].dt.dayofyear
df_dates['IsWeekend'] = df_dates['Date'].dt.dayofweek >= 5
df_dates['IsMonthStart'] = df_dates['Date'].dt.is_month_start
df_dates['IsMonthEnd'] = df_dates['Date'].dt.is_month_end

print("\nBusiness date features:")
print(df_dates[['Date', 'Quarter', 'Week', 'IsWeekend', 'IsMonthStart', 'IsMonthEnd']].head(10))

In [None]:
# Time-based calculations
start_date = df_dates['Date'].min()
df_dates['Days_Since_Start'] = (df_dates['Date'] - start_date).dt.days
df_dates['Weeks_Since_Start'] = df_dates['Days_Since_Start'] // 7

# Create season column
def get_season(month):
    if month in [12, 1, 2]:
        return 'Winter'
    elif month in [3, 4, 5]:
        return 'Spring'
    elif month in [6, 7, 8]:
        return 'Summer'
    else:
        return 'Fall'

df_dates['Season'] = df_dates['Month'].apply(get_season)

# Business day calculations
df_dates['IsBusinessDay'] = df_dates['Date'].dt.dayofweek < 5
df_dates['BusinessDaysSinceStart'] = df_dates.apply(
    lambda row: np.busday_count(start_date.date(), row['Date'].date()), axis=1
)

print("Time-based calculations:")
print(df_dates[['Date', 'Days_Since_Start', 'Weeks_Since_Start', 'Season',
                'IsBusinessDay', 'BusinessDaysSinceStart']].head())

print("\nSeason distribution:")
print(df_dates['Season'].value_counts())

## 5. Text and String Manipulations

Create columns based on string operations and text processing.

In [None]:
# String manipulations
df_text = df_sales.copy()

# Basic string operations
df_text['Product_Upper'] = df_text['Product'].str.upper()
df_text['Product_Lower'] = df_text['Product'].str.lower()
df_text['Product_Length'] = df_text['Product'].str.len()
df_text['Product_First_Char'] = df_text['Product'].str[0]
df_text['Product_Last_Three'] = df_text['Product'].str[-3:]

print("Basic string operations:")
print(df_text[['Product', 'Product_Upper', 'Product_Lower', 'Product_Length',
               'Product_First_Char', 'Product_Last_Three']].head())

# Text categorization
df_text['Product_Category'] = df_text['Product'].apply(
    lambda x: 'Computer' if x in ['Laptop', 'Monitor']
    else 'Mobile' if x in ['Phone', 'Tablet']
    else 'Other'
)

# Check for patterns
df_text['Has_Letter_A'] = df_text['Product'].str.contains('a', case=False)
df_text['Starts_With_L'] = df_text['Product'].str.startswith('L')
df_text['Ends_With_E'] = df_text['Product'].str.endswith('e')

print("\nText patterns and categorization:")
print(df_text[['Product', 'Product_Category', 'Has_Letter_A', 'Starts_With_L', 'Ends_With_E']].head(10))
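
The `Product` column in this lesson has no missing values, so the pattern checks above always return clean booleans. With real data, missing strings propagate through `.str` methods as missing results. The sketch below uses a hypothetical object-dtype Series with one missing entry and shows the `na` parameter of `str.contains()`.

```python
import pandas as pd

# Hypothetical Series with a missing product name
products = pd.Series(['Laptop', None, 'Phone'], dtype=object)

# Default behaviour on object dtype: the missing entry stays missing
has_a_default = products.str.contains('a', case=False)

# na=False treats missing entries as "no match", giving a pure boolean result
has_a = products.str.contains('a', case=False, na=False)

print(has_a_default.tolist())
print(has_a.tolist())
```

A pure boolean result matters when the column is used as a filter mask, since a mask containing missing values cannot be used for boolean indexing.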

In [None]:
# Create formatted text columns
df_text['Sales_Formatted'] = df_text['Sales'].apply(lambda x: f"${x:,.2f}")
df_text['Transaction_ID'] = df_text.apply(
    lambda row: f"{row['Region'][:1]}{row['Product'][:3].upper()}{row.name:04d}", axis=1
)

# Create summary descriptions
df_text['Transaction_Summary'] = df_text.apply(
    lambda row: f"{row['Salesperson']} sold {row['Quantity']} {row['Product']}(s) "
                f"for {row['Sales_Formatted']} in {row['Region']} region",
    axis=1
)

print("Formatted text columns:")
print(df_text[['Sales_Formatted', 'Transaction_ID']].head())
print("\nTransaction summaries:")
for i, summary in enumerate(df_text['Transaction_Summary'].head(3)):
    print(f"{i+1}. {summary}")

## 6. Working with Categorical Data

Converting repeated string values to the `category` dtype reduces memory usage and enables category-specific operations such as ordering.

In [None]:
# Convert to categorical data types
df_categorical = df_sales.copy()

# Check memory usage before
print("Memory usage before categorical conversion:")
print(df_categorical.memory_usage(deep=True))

# Convert string columns to categorical
categorical_columns = ['Product', 'Region', 'Salesperson', 'Customer_Type']
for col in categorical_columns:
    df_categorical[col] = df_categorical[col].astype('category')

print("\nMemory usage after categorical conversion:")
print(df_categorical.memory_usage(deep=True))

print("\nData types after conversion:")
print(df_categorical.dtypes)

In [None]:
# Working with ordered categories
# Create ordered categorical for sales performance
performance_categories = ['Poor', 'Fair', 'Good', 'Excellent']
df_categorical['Performance_Level'] = pd.cut(
    df_categorical['Sales'],
    bins=[0, 700, 900, 1200, float('inf')],
    labels=performance_categories,
    ordered=True
)

print("Ordered categorical data:")
print(df_categorical['Performance_Level'].head(10))
print("\nCategory info:")
print(df_categorical['Performance_Level'].cat.categories)
print(f"Is ordered: {df_categorical['Performance_Level'].cat.ordered}")

# Categorical operations
print("\nPerformance level distribution:")
print(df_categorical['Performance_Level'].value_counts().sort_index())

# Add new category
df_categorical['Performance_Level'] = df_categorical['Performance_Level'].cat.add_categories(['Outstanding'])
print(f"\nCategories after adding 'Outstanding': {df_categorical['Performance_Level'].cat.categories}")
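
What an ordered categorical buys you in practice is comparison and sorting by category rank rather than alphabetically. A minimal sketch, assuming a hypothetical Series that reuses the four performance labels from above:

```python
import pandas as pd

levels = ['Poor', 'Fair', 'Good', 'Excellent']

# Hypothetical ratings, converted to an ordered categorical
perf = pd.Series(['Good', 'Poor', 'Excellent', 'Fair'])
perf = perf.astype(pd.CategoricalDtype(categories=levels, ordered=True))

# Comparisons follow the declared order, not alphabetical order
at_least_good = perf >= 'Good'

# Sorting and min/max follow the same order
sorted_perf = perf.sort_values()
best = perf.max()

print(at_least_good.tolist())
print(sorted_perf.tolist())
print(best)
```

On an unordered categorical, the `>=` comparison raises a `TypeError`, which is a useful guard against accidentally ranking labels that have no natural order.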

## 7. Mathematical and Statistical Transformations

Create columns using mathematical functions and statistical transformations.

In [None]:
# Mathematical transformations
df_math = df_sales.copy()

# Common mathematical transformations
df_math['Sales_Log'] = np.log(df_math['Sales'])
df_math['Sales_Sqrt'] = np.sqrt(df_math['Sales'])
df_math['Sales_Squared'] = df_math['Sales'] ** 2
df_math['Sales_Reciprocal'] = 1 / df_math['Sales']

print("Mathematical transformations:")
print(df_math[['Sales', 'Sales_Log', 'Sales_Sqrt', 'Sales_Squared', 'Sales_Reciprocal']].head())

# Statistical standardization
df_math['Sales_Z_Score'] = (df_math['Sales'] - df_math['Sales'].mean()) / df_math['Sales'].std()
df_math['Sales_Min_Max_Scaled'] = (df_math['Sales'] - df_math['Sales'].min()) / (df_math['Sales'].max() - df_math['Sales'].min())

# Rolling statistics
df_math = df_math.sort_values('Date')
df_math['Sales_Rolling_7_Mean'] = df_math['Sales'].rolling(window=7, min_periods=1).mean()
df_math['Sales_Rolling_7_Std'] = df_math['Sales'].rolling(window=7, min_periods=1).std()
df_math['Sales_Cumulative_Sum'] = df_math['Sales'].cumsum()

print("\nStatistical transformations:")
print(df_math[['Sales', 'Sales_Z_Score', 'Sales_Min_Max_Scaled',
               'Sales_Rolling_7_Mean', 'Sales_Cumulative_Sum']].head(10))
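
The log and reciprocal transformations above assume strictly positive values. The sample `Sales` column is very unlikely to contain a zero, but real data often does, and `np.log(0)` yields `-inf` while `1 / 0` yields `inf`. A hedged sketch of two common guards, on hypothetical toy values:

```python
import numpy as np
import pandas as pd

# Hypothetical values including a zero
sales = pd.Series([0, 9, 99])

# log1p computes log(1 + x), which is defined at zero
log_sales = np.log1p(sales)

# Replace zeros with NaN before dividing, so the result is NaN rather than inf
reciprocal = 1 / sales.replace(0, np.nan)

print(log_sales)
print(reciprocal)
```

Whether `NaN` or some other fill value is the right outcome for zero sales depends on the analysis; this only shows the mechanics.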

In [None]:
# Rank and percentile columns
df_math['Sales_Rank'] = df_math['Sales'].rank(ascending=False)
df_math['Sales_Percentile'] = df_math['Sales'].rank(pct=True) * 100
df_math['Sales_Rank_by_Region'] = df_math.groupby('Region')['Sales'].rank(ascending=False)

# Binning and discretization
df_math['Sales_Decile'] = pd.qcut(df_math['Sales'], q=10, labels=range(1, 11))
df_math['Sales_Tertile'] = pd.qcut(df_math['Sales'], q=3, labels=['Low', 'Medium', 'High'])

print("Ranking and binning:")
print(df_math[['Sales', 'Sales_Rank', 'Sales_Percentile', 'Sales_Rank_by_Region',
               'Sales_Decile', 'Sales_Tertile']].head(10))

print("\nDecile distribution:")
print(df_math['Sales_Decile'].value_counts().sort_index())

## Practice Exercises

Apply your column creation and modification skills:

In [18]:
# Exercise 1: Customer Segmentation
# Create a comprehensive customer segmentation system:
# - Combine purchase behavior, frequency, and value
# - Create RFM-like scores (Recency, Frequency, Monetary)
# - Assign customer segments (e.g., Champion, Loyal, At Risk, etc.)

def create_customer_segmentation(df):
    """Create customer segmentation based on purchase patterns"""
    # Your implementation here
    pass

# segmented_df = create_customer_segmentation(df_sales)
# print(segmented_df[['Customer_Type', 'Sales', 'Frequency_Score', 'Monetary_Score', 'Segment']].head())

In [19]:
# Exercise 2: Performance Metrics Dashboard
# Create a comprehensive set of KPI columns:
# - Sales efficiency metrics
# - Trend indicators (growth rates, momentum)
# - Comparative metrics (vs. average, vs. target)
# - Alert flags for unusual patterns

# Your code here:


In [20]:
# Exercise 3: Feature Engineering for ML
# Create features that could be useful for machine learning:
# - Interaction features (product of two variables)
# - Polynomial features
# - Time-based features (seasonality, trends)
# - Lag features (previous period values)

# Your code here:


## Key Takeaways

1. **Column Assignment**: Use direct assignment (`df['col'] = value`) for simple cases
2. **Assign Method**: Use `.assign()` for functional programming style and method chaining
3. **Conditional Logic**: Combine `np.where()`, `pd.cut()`, `pd.qcut()`, and `np.select()` for complex conditions
4. **Apply Functions**: Use `.apply()` with lambda or custom functions for complex transformations
5. **Date Features**: Extract meaningful components from datetime columns
6. **String Operations**: Leverage `.str` accessor for text manipulations
7. **Categorical Data**: Convert to categories for memory efficiency and special operations
8. **Mathematical Transformations**: Apply statistical and mathematical functions for data preprocessing

## Performance Tips

1. **Vectorized Operations**: Prefer pandas/numpy operations over loops
2. **Categorical Types**: Use categorical data for repeated string values
3. **Memory Management**: Monitor memory usage when creating many new columns
4. **Method Chaining**: Use `.assign()` for readable method chains
5. **Avoid apply() When Possible**: Use vectorized operations instead of `.apply()` for better performance
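
To make the last tip concrete, here is one possible vectorized rewrite of the row-wise commission rule from Section 3, checked against the `.apply()` version. The toy data is hypothetical, and the rewrite is a sketch rather than the only way to express the rule.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, including a region missing from the multiplier table
df = pd.DataFrame({
    'Sales': [1000, 800, 1200],
    'Quantity': [5, 2, 7],
    'Customer_Type': ['VIP', 'New', 'Returning'],
    'Region': ['North', 'West', 'Unknown'],
})
region_multipliers = {'North': 1.2, 'South': 1.0, 'East': 1.1, 'West': 0.9}

def commission_row(row):
    """Row-wise version, same rules as calculate_commission in Section 3."""
    rate = 0.05
    if row['Customer_Type'] == 'VIP':
        rate += 0.02
    if row['Quantity'] >= 5:
        rate += 0.01
    return row['Sales'] * rate * region_multipliers.get(row['Region'], 1.0)

df['Commission_Apply'] = df.apply(commission_row, axis=1)

# Vectorized version: each `if` becomes np.where, the dict lookup becomes map
rate = (
    0.05
    + np.where(df['Customer_Type'] == 'VIP', 0.02, 0.0)
    + np.where(df['Quantity'] >= 5, 0.01, 0.0)
)
multiplier = df['Region'].map(region_multipliers).fillna(1.0)
df['Commission_Vectorized'] = df['Sales'] * rate * multiplier

print(df[['Commission_Apply', 'Commission_Vectorized']])
```

Both columns hold the same values; the vectorized form operates on whole columns at once, which typically scales better than calling a Python function per row.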

## Common Patterns

```python
# Simple calculation
df['new_col'] = df['col1'] * df['col2']

# Conditional column
df['category'] = np.where(df['value'] > threshold, 'High', 'Low')

# Apply custom function
df['result'] = df.apply(custom_function, axis=1)

# Date features
df['month'] = df['date'].dt.month

# String operations
df['upper'] = df['text'].str.upper()
```
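
The patterns above all create new columns. Modifying an existing column uses the same assignment syntax with the existing name on the left-hand side. A minimal sketch on hypothetical messy data, covering a type conversion, a string clean-up, and a conditional update with `.loc`:

```python
import pandas as pd

# Hypothetical messy data
df = pd.DataFrame({
    'Sales': ['1000', '250', '1800'],      # numbers stored as strings
    'Region': ['north', 'South', 'EAST'],  # inconsistent casing
})

# Data type conversion: overwrite the column with its integer version
df['Sales'] = df['Sales'].astype(int)

# String clean-up: normalize casing in place
df['Region'] = df['Region'].str.title()

# Conditional update: cap values above 1500, selecting rows and column with .loc
df.loc[df['Sales'] > 1500, 'Sales'] = 1500

print(df)
print(df.dtypes)
```

Using `.loc[mask, column]` in a single indexing step matters here: the chained form `df[mask]['Sales'] = ...` may assign to a temporary copy and leave `df` unchanged.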