{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Session 1 - DataFrames - Lesson 2: Basic Operations\n",
    "\n",
    "## Learning Objectives\n",
    "- Learn essential methods to explore DataFrame structure\n",
    "- Understand how to get basic information about your data\n",
    "- Master data inspection techniques\n",
    "- Practice with summary statistics\n",
    "\n",
    "## Prerequisites\n",
    "- Completed Lesson 1: Creating DataFrames\n",
    "- Basic understanding of pandas DataFrames"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Import required libraries\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "from datetime import datetime, timedelta\n",
    "\n",
    "# Set display options for better output\n",
    "pd.set_option('display.max_columns', None)\n",
    "pd.set_option('display.width', None)\n",
    "pd.set_option('display.max_colwidth', 50)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Creating Sample Dataset\n",
    "\n",
    "Let's create a small sales dataset (20 rows, 6 columns) to practice basic operations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Sales Dataset Created!\n",
      "Dataset shape: (20, 6)\n"
     ]
    }
   ],
   "source": [
    "# Create a small sales dataset (20 daily records)\n",
    "# Seed NumPy's global RNG so the unseeded .sample() call later on\n",
    "# gives the same rows when the cells are run in order\n",
    "np.random.seed(42)\n",
    "\n",
    "sales_data = {\n",
    "    'Date': pd.date_range('2024-01-01', periods=20, freq='D'),\n",
    "    'Product': ['Laptop', 'Phone', 'Tablet', 'Laptop', 'Phone'] * 4,\n",
    "    'Sales': [1200, 800, 600, 1100, 850, 1300, 750, 650, 1250, 900,\n",
    "              1150, 820, 700, 1180, 880, 1220, 780, 620, 1300, 850],\n",
    "    'Region': ['North', 'South', 'East', 'West', 'North'] * 4,\n",
    "    'Salesperson': ['John', 'Sarah', 'Mike', 'Lisa', 'Tom'] * 4,\n",
    "    'Commission_Rate': [0.10, 0.12, 0.08, 0.11, 0.09] * 4\n",
    "}\n",
    "\n",
    "df_sales = pd.DataFrame(sales_data)\n",
    "print(\"Sales Dataset Created!\")\n",
    "print(f\"Dataset shape: {df_sales.shape}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Viewing Data\n",
    "\n",
    "`head()`, `tail()`, and `sample()` let you inspect a few rows without printing the whole DataFrame."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# View first few rows\n",
    "print(\"First 5 rows (default):\")\n",
    "print(df_sales.head())\n",
    "\n",
    "print(\"\\nFirst 3 rows:\")\n",
    "print(df_sales.head(3))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# View last few rows\n",
    "print(\"Last 5 rows (default):\")\n",
    "print(df_sales.tail())\n",
    "\n",
    "print(\"\\nLast 3 rows:\")\n",
    "print(df_sales.tail(3))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Random sample of 5 rows:\n",
      "         Date  Product  Sales  Region  Salesperson  Commission_Rate\n",
      "0  2024-01-01   Laptop   1200   North         John             0.10\n",
      "17 2024-01-18   Tablet    620    East         Mike             0.08\n",
      "15 2024-01-16   Laptop   1220   North         John             0.10\n",
      "1  2024-01-02    Phone    800   South        Sarah             0.12\n",
      "8  2024-01-09   Laptop   1250    West         Lisa             0.11\n",
      "\n",
      "Random sample with different random state:\n",
      "         Date  Product  Sales  Region  Salesperson  Commission_Rate\n",
      "7  2024-01-08   Tablet    650    East         Mike             0.08\n",
      "10 2024-01-11   Laptop   1150   North         John             0.10\n",
      "5  2024-01-06   Laptop   1300   North         John             0.10\n"
     ]
    }
   ],
   "source": [
    "# Sample random rows (no random_state: draws from NumPy's global RNG)\n",
    "print(\"Random sample of 5 rows:\")\n",
    "print(df_sales.sample(5))\n",
    "\n",
    "# Passing random_state makes this sample identical on every run\n",
    "print(\"\\nRandom sample with different random state:\")\n",
    "print(df_sales.sample(3, random_state=10))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. DataFrame Information\n",
    "\n",
    "`info()`, `shape`, `dtypes`, and `index` describe the structure of a DataFrame: its size, column types, and row labels."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Comprehensive information about the DataFrame\n",
    "print(\"DataFrame Info:\")\n",
    "df_sales.info()\n",
    "\n",
    "# memory_usage='deep' also counts the contents of object (string) columns\n",
    "print(\"\\nMemory usage:\")\n",
    "df_sales.info(memory_usage='deep')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Basic properties\n",
    "print(f\"Shape (rows, columns): {df_sales.shape}\")\n",
    "print(f\"Number of rows: {len(df_sales)}\")\n",
    "print(f\"Number of columns: {len(df_sales.columns)}\")\n",
    "print(f\"Total elements: {df_sales.size}\")\n",
    "print(f\"Dimensions: {df_sales.ndim}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Column and index information\n",
    "print(\"Column names:\")\n",
    "print(df_sales.columns.tolist())\n",
    "\n",
    "print(\"\\nData types:\")\n",
    "print(df_sales.dtypes)\n",
    "\n",
    "print(\"\\nIndex information:\")\n",
    "print(f\"Index: {df_sales.index}\")\n",
    "print(f\"Index type: {type(df_sales.index)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Summary Statistics\n",
    "\n",
    "`describe()` and the individual statistics methods (`mean()`, `median()`, `std()`, `quantile()`) summarize how the values in a column are distributed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Summary statistics for numerical columns\n",
    "print(\"Summary statistics:\")\n",
    "print(df_sales.describe())\n",
    "\n",
    "print(\"\\nRounded to 2 decimal places:\")\n",
    "print(df_sales.describe().round(2))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Summary statistics for all columns (including non-numeric)\n",
    "print(\"Summary for all columns:\")\n",
    "print(df_sales.describe(include='all'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Individual statistics\n",
    "print(\"Individual Statistical Measures:\")\n",
    "print(f\"Mean sales: {df_sales['Sales'].mean():.2f}\")\n",
    "print(f\"Median sales: {df_sales['Sales'].median():.2f}\")\n",
    "print(f\"Standard deviation: {df_sales['Sales'].std():.2f}\")\n",
    "print(f\"Minimum sales: {df_sales['Sales'].min()}\")\n",
    "print(f\"Maximum sales: {df_sales['Sales'].max()}\")\n",
    "print(f\"Sales range: {df_sales['Sales'].max() - df_sales['Sales'].min()}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Quantiles and percentiles\n",
    "print(\"Quantiles for Sales:\")\n",
    "print(f\"25th percentile (Q1): {df_sales['Sales'].quantile(0.25)}\")\n",
    "print(f\"50th percentile (Q2/Median): {df_sales['Sales'].quantile(0.50)}\")\n",
    "print(f\"75th percentile (Q3): {df_sales['Sales'].quantile(0.75)}\")\n",
    "print(f\"90th percentile: {df_sales['Sales'].quantile(0.90)}\")\n",
    "\n",
    "print(\"\\nCustom quantiles:\")\n",
    "quantiles = df_sales['Sales'].quantile([0.1, 0.3, 0.7, 0.9])\n",
    "print(quantiles)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Counting and Unique Values\n",
    "\n",
    "`nunique()`, `unique()`, `value_counts()`, and `pd.crosstab()` show how values are distributed in categorical columns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Count unique values in each column\n",
    "print(\"Number of unique values per column:\")\n",
    "print(df_sales.nunique())\n",
    "\n",
    "print(\"\\nUnique values in 'Product' column:\")\n",
    "print(df_sales['Product'].unique())\n",
    "\n",
    "print(\"\\nValue counts for 'Product':\")\n",
    "print(df_sales['Product'].value_counts())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Product distribution (counts and percentages):\n",
      "         Count  Percentage\n",
      "Product                   \n",
      "Laptop       8        40.0\n",
      "Phone        8        40.0\n",
      "Tablet       4        20.0\n"
     ]
    }
   ],
   "source": [
    "# Value counts with percentages\n",
    "print(\"Product distribution (counts and percentages):\")\n",
    "product_counts = df_sales['Product'].value_counts()\n",
    "product_percentages = df_sales['Product'].value_counts(normalize=True) * 100\n",
    "\n",
    "distribution = pd.DataFrame({\n",
    "    'Count': product_counts,\n",
    "    'Percentage': product_percentages.round(1)\n",
    "})\n",
    "print(distribution)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Cross-tabulation: count rows for each Product/Region pair\n",
    "print(\"Cross-tabulation of Product vs Region:\")\n",
    "crosstab = pd.crosstab(df_sales['Product'], df_sales['Region'])\n",
    "print(crosstab)\n",
    "\n",
    "# normalize='all' divides every cell by the total number of rows\n",
    "print(\"\\nAs a percentage of all rows:\")\n",
    "crosstab_pct = pd.crosstab(df_sales['Product'], df_sales['Region'], normalize='all') * 100\n",
    "print(crosstab_pct.round(1))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. Data Quality Checks\n",
    "\n",
    "Check for missing values and duplicate rows before analyzing the data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Check for missing values\n",
    "print(\"Missing values per column:\")\n",
    "print(df_sales.isnull().sum())\n",
    "\n",
    "print(\"\\nPercentage of missing values:\")\n",
    "missing_percentages = (df_sales.isnull().sum() / len(df_sales)) * 100\n",
    "print(missing_percentages.round(2))\n",
    "\n",
    "print(\"\\nAny missing values in dataset?\", df_sales.isnull().any().any())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Check for duplicates\n",
    "print(f\"Number of duplicate rows: {df_sales.duplicated().sum()}\")\n",
    "print(f\"Any duplicate rows? {df_sales.duplicated().any()}\")\n",
    "\n",
    "# Check for duplicates based on specific columns\n",
    "print(f\"\\nDuplicate combinations of Date and Salesperson: {df_sales.duplicated(['Date', 'Salesperson']).sum()}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6. Quick Data Exploration\n",
    "\n",
    "A small helper function can print the key facts about any single column."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Quick exploration function\n",
    "def quick_explore(df, column_name):\n",
    "    \"\"\"Quick exploration of a specific column\"\"\"\n",
    "    print(f\"=== Quick Exploration: {column_name} ===\")\n",
    "    col = df[column_name]\n",
    "    \n",
    "    print(f\"Data type: {col.dtype}\")\n",
    "    print(f\"Non-null values: {col.count()}/{len(col)}\")\n",
    "    print(f\"Unique values: {col.nunique()}\")\n",
    "    \n",
    "    # is_numeric_dtype covers every integer/float width, not just int64/float64\n",
    "    if pd.api.types.is_numeric_dtype(col):\n",
    "        print(f\"Min: {col.min()}, Max: {col.max()}\")\n",
    "        print(f\"Mean: {col.mean():.2f}, Median: {col.median():.2f}\")\n",
    "    else:\n",
    "        # mode() returns an empty Series when the column has no non-null values\n",
    "        modes = col.mode()\n",
    "        print(f\"Most common: {modes.iloc[0] if not modes.empty else 'N/A'}\")\n",
    "        print(f\"Sample values: {col.unique()[:5].tolist()}\")\n",
    "    print()\n",
    "\n",
    "# Explore different columns\n",
    "for col in ['Sales', 'Product', 'Region']:\n",
    "    quick_explore(df_sales, col)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Practice Exercises\n",
    "\n",
    "Test your understanding with these exercises:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Exercise 1: Create a larger dataset and explore it\n",
    "# Create a dataset with 100 rows and at least 5 columns\n",
    "# Include different data types (numeric, categorical, datetime)\n",
    "\n",
    "# Your code here:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Exercise 2: Write a function that provides a complete data profile\n",
    "# Include: shape, data types, missing values, unique values, and basic stats\n",
    "\n",
    "def data_profile(df):\n",
    "    \"\"\"Provide a comprehensive data profile\"\"\"\n",
    "    # Your code here:\n",
    "    pass\n",
    "\n",
    "# Test your function\n",
    "# data_profile(df_sales)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Exercise 3: Find interesting insights from the sales data\n",
    "# Questions to answer:\n",
    "# 1. Which product has the highest average sales?\n",
    "# 2. Which region has the most consistent sales (lowest standard deviation)?\n",
    "# 3. What's the total commission earned by each salesperson?\n",
    "\n",
    "# Your code here:\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Key Takeaways\n",
    "\n",
    "1. **`.head()` and `.tail()`** are essential for quick data inspection\n",
    "2. **`.info()`** provides comprehensive DataFrame structure information\n",
    "3. **`.describe()`** gives statistical summaries for numerical columns\n",
    "4. **`.nunique()` and `.value_counts()`** help understand categorical data\n",
    "5. **Always check for missing values** and duplicates in your data\n",
    "6. **Statistical measures** (mean, median, std) provide insights into data distribution\n",
    "7. **Cross-tabulation** helps understand relationships between categorical variables\n",
    "\n",
    "## Common Gotchas\n",
    "\n",
    "- By default `.describe()` skips object (string) columns such as `Product`; pass `include='all'` to summarize every column\n",
    "- pandas skips missing values (`NaN`) in `mean()`, `std()`, and similar methods by default, so check `isnull().sum()` to see how many rows each statistic actually covers\n",
    "- On large datasets, inspect with `head()` or `sample()` rather than printing the whole DataFrame, and use `info(memory_usage='deep')` to see the real memory footprint\n",
    "- Check `dtypes` before analysis: numbers or dates stored as strings are treated as text"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.13.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}