# Session 1 - DataFrames - Lesson 2: Basic Operations

## Learning Objectives
- Learn essential methods to explore DataFrame structure
- Understand how to get basic information about your data
- Master data inspection techniques
- Practice with summary statistics

## Prerequisites
- Completed Lesson 1: Creating DataFrames
- Basic understanding of pandas DataFrames

In [22]:
# Import required libraries
import pandas as pd
import numpy as np
from datetime import datetime, timedelta

# Set display options for better output
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', 50)

## Creating Sample Dataset

Let's create a small sales dataset (20 daily records, 6 columns) to practice basic operations.

In [43]:
# Create a small sales dataset (20 daily records)
# Seed NumPy's global RNG so later sample() calls without random_state are repeatable
np.random.seed(42)

sales_data = {
    'Date': pd.date_range('2024-01-01', periods=20, freq='D'),
    'Product': ['Laptop', 'Phone', 'Tablet', 'Laptop', 'Phone'] * 4,
    'Sales': [1200, 800, 600, 1100, 850, 1300, 750, 650, 1250, 900,
              1150, 820, 700, 1180, 880, 1220, 780, 620, 1300, 850],
    'Region': ['North', 'South', 'East', 'West', 'North'] * 4,
    'Salesperson': ['John', 'Sarah', 'Mike', 'Lisa', 'Tom'] * 4,
    'Commission_Rate': [0.10, 0.12, 0.08, 0.11, 0.09] * 4
}

df_sales = pd.DataFrame(sales_data)
print("Sales Dataset Created!")
print(f"Dataset shape: {df_sales.shape}")

Sales Dataset Created!
Dataset shape: (20, 6)


## 1. Viewing Data

`head()`, `tail()`, and `sample()` each return a handful of rows, so you can inspect the data without printing the whole DataFrame.

In [None]:
# View first few rows
print("First 5 rows (default):")
print(df_sales.head())

print("\nFirst 3 rows:")
print(df_sales.head(3))

In [None]:
# View last few rows
print("Last 5 rows (default):")
print(df_sales.tail())

print("\nLast 3 rows:")
print(df_sales.tail(3))

In [44]:
# Sample random rows
print("Random sample of 5 rows:")
print(df_sales.sample(5))

print("\nRandom sample with different random state:")
print(df_sales.sample(3, random_state=10))

Random sample of 5 rows:
         Date Product  Sales Region Salesperson  Commission_Rate
0  2024-01-01  Laptop   1200  North        John             0.10
17 2024-01-18  Tablet    620   East        Mike             0.08
15 2024-01-16  Laptop   1220  North        John             0.10
1  2024-01-02   Phone    800  South       Sarah             0.12
8  2024-01-09  Laptop   1250   West        Lisa             0.11

Random sample with different random state:
         Date Product  Sales Region Salesperson  Commission_Rate
7  2024-01-08  Tablet    650   East        Mike             0.08
10 2024-01-11  Laptop   1150  North        John             0.10
5  2024-01-06  Laptop   1300  North        John             0.10
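
To make the role of `random_state` concrete, here is a minimal sketch. It uses a small hypothetical frame (`df`, not the lesson's `df_sales`) and shows that a fixed `random_state` returns the same rows on every call, and that `frac=` samples a fraction of the rows instead of a fixed count.

```python
import pandas as pd

# Hypothetical toy frame standing in for df_sales
df = pd.DataFrame({
    "Product": ["Laptop", "Phone", "Tablet", "Laptop", "Phone"],
    "Sales": [1200, 800, 600, 1100, 850],
})

# The same random_state yields the same rows on every call
first = df.sample(3, random_state=10)
second = df.sample(3, random_state=10)
print(first.equals(second))  # True

# frac= samples a fraction of the rows instead of a fixed count
forty_percent = df.sample(frac=0.4, random_state=10)
print(len(forty_percent))  # 2
```

Without `random_state`, each call draws from NumPy's global random state, so results change between calls unless that state was seeded beforehand.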


## 2. DataFrame Information

`info()`, `shape`, `dtypes`, and the related attributes below describe the DataFrame's structure: its size, column types, and index.

In [None]:
# Comprehensive information about the DataFrame
print("DataFrame Info:")
df_sales.info()

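# memory_usage='deep' measures the actual size of string values instead of estimating it
print("\nInfo with deep memory usage:")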
df_sales.info(memory_usage='deep')

In [None]:
# Basic properties
print(f"Shape (rows, columns): {df_sales.shape}")
print(f"Number of rows: {len(df_sales)}")
print(f"Number of columns: {len(df_sales.columns)}")
print(f"Total elements: {df_sales.size}")
print(f"Dimensions: {df_sales.ndim}")

In [None]:
# Column and index information
print("Column names:")
print(df_sales.columns.tolist())

print("\nData types:")
print(df_sales.dtypes)

print("\nIndex information:")
print(f"Index: {df_sales.index}")
print(f"Index type: {type(df_sales.index)}")
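
Two related tools can complement the structure checks above: `memory_usage(deep=True)` reports bytes per column, and `select_dtypes()` picks columns by type family rather than by name. The sketch below assumes a small hypothetical frame (`df`) with one datetime, one text, and one integer column.

```python
import pandas as pd

# Hypothetical three-row frame with mixed column types
df = pd.DataFrame({
    "Date": pd.date_range("2024-01-01", periods=3, freq="D"),
    "Product": ["Laptop", "Phone", "Tablet"],
    "Sales": [1200, 800, 600],
})

# Bytes used by the index and each column; deep=True also measures the strings
per_column_bytes = df.memory_usage(deep=True)
print(per_column_bytes)

# Select columns by dtype family instead of listing names
numeric_cols = df.select_dtypes(include="number").columns.tolist()
datetime_cols = df.select_dtypes(include="datetime").columns.tolist()
print(numeric_cols)   # ['Sales']
print(datetime_cols)  # ['Date']
```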

## 3. Summary Statistics

Understand your data through statistical summaries.

In [None]:
# Summary statistics for numerical columns
print("Summary statistics:")
print(df_sales.describe())

print("\nRounded to 2 decimal places:")
print(df_sales.describe().round(2))

In [None]:
# Summary statistics for all columns (including non-numeric)
print("Summary for all columns:")
print(df_sales.describe(include='all'))

In [None]:
# Individual statistics
print("Individual Statistical Measures:")
print(f"Mean sales: {df_sales['Sales'].mean():.2f}")
print(f"Median sales: {df_sales['Sales'].median():.2f}")
print(f"Standard deviation: {df_sales['Sales'].std():.2f}")
print(f"Minimum sales: {df_sales['Sales'].min()}")
print(f"Maximum sales: {df_sales['Sales'].max()}")
print(f"Sales range: {df_sales['Sales'].max() - df_sales['Sales'].min()}")

In [None]:
# Quantiles and percentiles
print("Quantiles for Sales:")
print(f"25th percentile (Q1): {df_sales['Sales'].quantile(0.25)}")
print(f"50th percentile (Q2/Median): {df_sales['Sales'].quantile(0.50)}")
print(f"75th percentile (Q3): {df_sales['Sales'].quantile(0.75)}")
print(f"90th percentile: {df_sales['Sales'].quantile(0.90)}")

print("\nCustom quantiles:")
quantiles = df_sales['Sales'].quantile([0.1, 0.3, 0.7, 0.9])
print(quantiles)
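
Quantiles combine naturally into the interquartile range (IQR = Q3 - Q1), a measure of spread that is less sensitive to extreme values than the full range. The sketch below uses a hypothetical five-value Series (`sales`) so the arithmetic is easy to check by hand, and also shows `agg()` as one way to collect several statistics in a single call.

```python
import pandas as pd

# Hypothetical five-value series; sorted it reads 600, 800, 850, 1100, 1200
sales = pd.Series([1200, 800, 600, 1100, 850], name="Sales")

q1 = sales.quantile(0.25)
q3 = sales.quantile(0.75)
iqr = q3 - q1
print(q1, q3, iqr)  # 800.0 1100.0 300.0

# Several statistics in one call; the result is a Series indexed by statistic name
summary = sales.agg(["mean", "median", "std", "min", "max"])
print(summary)
```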

## 4. Counting and Unique Values

Understand the distribution of categorical data.

In [None]:
# Count unique values in each column
print("Number of unique values per column:")
print(df_sales.nunique())

print("\nUnique values in 'Product' column:")
print(df_sales['Product'].unique())

print("\nValue counts for 'Product':")
print(df_sales['Product'].value_counts())

In [45]:
# Value counts with percentages
print("Product distribution (counts and percentages):")
product_counts = df_sales['Product'].value_counts()
product_percentages = df_sales['Product'].value_counts(normalize=True) * 100

distribution = pd.DataFrame({
    'Count': product_counts,
    'Percentage': product_percentages.round(1)
})
print(distribution)

Product distribution (counts and percentages):
         Count  Percentage
Product
Laptop       8        40.0
Phone        8        40.0
Tablet       4        20.0


In [None]:
# Cross-tabulation
print("Cross-tabulation of Product vs Region:")
crosstab = pd.crosstab(df_sales['Product'], df_sales['Region'])
print(crosstab)

print("\nAs percentages of the grand total:")
crosstab_pct = pd.crosstab(df_sales['Product'], df_sales['Region'], normalize='all') * 100
print(crosstab_pct.round(1))
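
The `normalize` argument controls what each percentage is relative to. As a minimal sketch on a hypothetical six-row frame (`df`), the block below contrasts `normalize='index'` (each row sums to 100) with `normalize='columns'` (each column sums to 100), and uses `margins=True` to append totals.

```python
import pandas as pd

# Hypothetical six-row frame
df = pd.DataFrame({
    "Product": ["Laptop", "Phone", "Tablet", "Laptop", "Phone", "Laptop"],
    "Region": ["North", "South", "East", "West", "North", "North"],
})

# Share of each product's rows per region: every row sums to 100
row_pct = pd.crosstab(df["Product"], df["Region"], normalize="index") * 100
print(row_pct.round(1))

# Share of each region's rows per product: every column sums to 100
col_pct = pd.crosstab(df["Product"], df["Region"], normalize="columns") * 100
print(col_pct.round(1))

# margins=True appends an 'All' row and column holding the totals
with_totals = pd.crosstab(df["Product"], df["Region"], margins=True)
print(with_totals)
```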

## 5. Data Quality Checks

Essential checks for data quality and integrity.

In [None]:
# Check for missing values
print("Missing values per column:")
print(df_sales.isnull().sum())

print("\nPercentage of missing values:")
missing_percentages = (df_sales.isnull().sum() / len(df_sales)) * 100
print(missing_percentages.round(2))

print("\nAny missing values in dataset?", df_sales.isnull().any().any())

In [None]:
# Check for duplicates
print(f"Number of duplicate rows: {df_sales.duplicated().sum()}")
print(f"Any duplicate rows? {df_sales.duplicated().any()}")

# Check for duplicates based on specific columns
print(f"\nDuplicate combinations of Date and Salesperson: {df_sales.duplicated(['Date', 'Salesperson']).sum()}")
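
Since `df_sales` is clean, these checks all report zero. To see what they return on imperfect data, the sketch below builds a hypothetical frame (`dirty`) containing one missing value and one repeated row.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one NaN and one repeated row
dirty = pd.DataFrame({
    "Product": ["Laptop", "Phone", "Phone", "Tablet"],
    "Sales": [1200.0, 800.0, 800.0, np.nan],
})

missing_per_column = dirty.isnull().sum()
print(missing_per_column["Sales"])  # 1

# By default only the second and later copies are flagged
duplicate_mask = dirty.duplicated()
print(duplicate_mask.sum())  # 1

# keep=False flags every copy, including the first occurrence
all_copies_mask = dirty.duplicated(keep=False)
print(all_copies_mask.sum())  # 2
```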

## 6. Quick Data Exploration

Rapid exploration techniques to understand your data.

In [None]:
# Quick exploration function
def quick_explore(df, column_name):
    """Quick exploration of a specific column"""
    print(f"=== Quick Exploration: {column_name} ===")
    col = df[column_name]

    print(f"Data type: {col.dtype}")
    print(f"Non-null values: {col.count()}/{len(col)}")
    print(f"Unique values: {col.nunique()}")

    # is_numeric_dtype also covers int32, float32, and nullable numeric dtypes
    if pd.api.types.is_numeric_dtype(col):
        print(f"Min: {col.min()}, Max: {col.max()}")
        print(f"Mean: {col.mean():.2f}, Median: {col.median():.2f}")
    else:
        modes = col.mode()
        print(f"Most common: {modes.iloc[0] if not modes.empty else 'N/A'}")
        print(f"Sample values: {col.unique()[:5].tolist()}")
    print()

# Explore different columns
for col in ['Sales', 'Product', 'Region']:
    quick_explore(df_sales, col)
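
When a helper like this branches on column type, comparing the dtype against the names `'int64'` and `'float64'` misses other numeric types. The sketch below, on a hypothetical frame (`df`) with `int32` and `float32` columns, compares that name check with `pandas.api.types.is_numeric_dtype`.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical frame using 32-bit numeric dtypes
df = pd.DataFrame({
    "Sales": pd.Series([1200, 800, 600], dtype="int32"),
    "Rate": pd.Series([0.10, 0.12, 0.08], dtype="float32"),
    "Product": ["Laptop", "Phone", "Tablet"],
})

results = {}
for name in df.columns:
    by_name = df[name].dtype in ["int64", "float64"]  # matches only these two dtypes
    by_helper = is_numeric_dtype(df[name])            # matches any numeric dtype
    results[name] = (by_name, by_helper)
    print(f"{name}: name check={by_name}, is_numeric_dtype={by_helper}")
```

Here the name check reports `False` for all three columns, while `is_numeric_dtype` correctly identifies `Sales` and `Rate` as numeric.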

## Practice Exercises

Test your understanding with these exercises:

In [40]:
# Exercise 1: Create a larger dataset and explore it
# Create a dataset with 100 rows and at least 5 columns
# Include different data types (numeric, categorical, datetime)

# Your code here:


In [41]:
# Exercise 2: Write a function that provides a complete data profile
# Include: shape, data types, missing values, unique values, and basic stats

def data_profile(df):
    """Provide a comprehensive data profile"""
    # Your code here:
    pass

# Test your function
# data_profile(df_sales)

In [42]:
# Exercise 3: Find interesting insights from the sales data
# Questions to answer:
# 1. Which product has the highest average sales?
# 2. Which region has the most consistent sales (lowest standard deviation)?
# 3. What's the total commission earned by each salesperson?

# Your code here:


## Key Takeaways

1. **`.head()` and `.tail()`** are essential for quick data inspection
2. **`.info()`** provides comprehensive DataFrame structure information
3. **`.describe()`** gives statistical summaries for numerical columns
4. **`.nunique()` and `.value_counts()`** help understand categorical data
5. **Always check for missing values** and duplicates in your data
6. **Statistical measures** (mean, median, std) provide insights into data distribution
7. **Cross-tabulation** helps understand relationships between categorical variables

## Common Gotchas

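- `.describe()` skips text (object) columns by default when the DataFrame has numeric columns (use `include='all'` for all columns)
- Missing values are skipped by default in statistics such as `.mean()`, `.std()`, and `.sum()` (`skipna=True`), so the results describe only the non-missing rows
- On large datasets, prefer `.head()` or `.sample()` over printing the whole DataFrame; `info(memory_usage='deep')` inspects every string value and can be slow
- Check `.dtypes` before analysis: numbers or dates stored as text will not be summarized as numeric data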
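
The effect of missing values on statistics is easy to overlook, because pandas skips them silently by default. As a minimal sketch on a hypothetical three-value Series (`sales`):

```python
import numpy as np
import pandas as pd

# Hypothetical series with one missing value
sales = pd.Series([1200.0, np.nan, 600.0])

# Default skipna=True: the mean is taken over the two non-missing values
print(sales.mean())  # 900.0

# skipna=False propagates the missing value instead
print(sales.mean(skipna=False))  # nan

# count() excludes missing values, len() does not
print(sales.count(), len(sales))  # 2 3
```

Comparing `count()` with `len()` is a quick way to see how many rows actually contributed to a statistic.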