{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Session 1 - DataFrames - Lesson 2: Basic Operations\n",
"\n",
"## Learning Objectives\n",
"- Learn essential methods to explore DataFrame structure\n",
"- Understand how to get basic information about your data\n",
"- Master data inspection techniques\n",
"- Practice with summary statistics\n",
"\n",
"## Prerequisites\n",
"- Completed Lesson 1: Creating DataFrames\n",
"- Basic understanding of pandas DataFrames"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [],
"source": [
"# Import required libraries\n",
"import pandas as pd\n",
"import numpy as np\n",
"from datetime import datetime, timedelta\n",
"\n",
"# Set display options for better output\n",
"pd.set_option('display.max_columns', None)\n",
"pd.set_option('display.width', None)\n",
"pd.set_option('display.max_colwidth', 50)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Creating Sample Dataset\n",
"\n",
"Let's create a comprehensive sales dataset to practice basic operations."
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Sales Dataset Created!\n",
"Dataset shape: (20, 6)\n"
]
}
],
"source": [
"# Create a comprehensive sales dataset\n",
"np.random.seed(42)\n",
"\n",
"sales_data = {\n",
" 'Date': pd.date_range('2024-01-01', periods=20, freq='D'),\n",
" 'Product': ['Laptop', 'Phone', 'Tablet', 'Laptop', 'Phone'] * 4,\n",
" 'Sales': [1200, 800, 600, 1100, 850, 1300, 750, 650, 1250, 900,\n",
" 1150, 820, 700, 1180, 880, 1220, 780, 620, 1300, 850],\n",
" 'Region': ['North', 'South', 'East', 'West', 'North'] * 4,\n",
" 'Salesperson': ['John', 'Sarah', 'Mike', 'Lisa', 'Tom'] * 4,\n",
" 'Commission_Rate': [0.10, 0.12, 0.08, 0.11, 0.09] * 4\n",
"}\n",
"\n",
"df_sales = pd.DataFrame(sales_data)\n",
"print(\"Sales Dataset Created!\")\n",
"print(f\"Dataset shape: {df_sales.shape}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Viewing Data\n",
"\n",
"These methods help you quickly inspect your data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# View first few rows\n",
"print(\"First 5 rows (default):\")\n",
"print(df_sales.head())\n",
"\n",
"print(\"\\nFirst 3 rows:\")\n",
"print(df_sales.head(3))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# View last few rows\n",
"print(\"Last 5 rows (default):\")\n",
"print(df_sales.tail())\n",
"\n",
"print(\"\\nLast 3 rows:\")\n",
"print(df_sales.tail(3))"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Random sample of 5 rows:\n",
" Date Product Sales Region Salesperson Commission_Rate\n",
"0 2024-01-01 Laptop 1200 North John 0.10\n",
"17 2024-01-18 Tablet 620 East Mike 0.08\n",
"15 2024-01-16 Laptop 1220 North John 0.10\n",
"1 2024-01-02 Phone 800 South Sarah 0.12\n",
"8 2024-01-09 Laptop 1250 West Lisa 0.11\n",
"\n",
"Random sample with different random state:\n",
" Date Product Sales Region Salesperson Commission_Rate\n",
"7 2024-01-08 Tablet 650 East Mike 0.08\n",
"10 2024-01-11 Laptop 1150 North John 0.10\n",
"5 2024-01-06 Laptop 1300 North John 0.10\n"
]
}
],
"source": [
"# Sample random rows\n",
"print(\"Random sample of 5 rows:\")\n",
"print(df_sales.sample(5))\n",
"\n",
"print(\"\\nRandom sample with different random state:\")\n",
"print(df_sales.sample(3, random_state=10))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. DataFrame Information\n",
"\n",
"Get detailed information about your DataFrame structure."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Comprehensive information about the DataFrame\n",
"print(\"DataFrame Info:\")\n",
"df_sales.info()\n",
"\n",
"print(\"\\nMemory usage:\")\n",
"df_sales.info(memory_usage='deep')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Basic properties\n",
"print(f\"Shape (rows, columns): {df_sales.shape}\")\n",
"print(f\"Number of rows: {len(df_sales)}\")\n",
"print(f\"Number of columns: {len(df_sales.columns)}\")\n",
"print(f\"Total elements: {df_sales.size}\")\n",
"print(f\"Dimensions: {df_sales.ndim}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Column and index information\n",
"print(\"Column names:\")\n",
"print(df_sales.columns.tolist())\n",
"\n",
"print(\"\\nData types:\")\n",
"print(df_sales.dtypes)\n",
"\n",
"print(\"\\nIndex information:\")\n",
"print(f\"Index: {df_sales.index}\")\n",
"print(f\"Index type: {type(df_sales.index)}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Summary Statistics\n",
"\n",
"Understand your data through statistical summaries."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Summary statistics for numerical columns\n",
"print(\"Summary statistics:\")\n",
"print(df_sales.describe())\n",
"\n",
"print(\"\\nRounded to 2 decimal places:\")\n",
"print(df_sales.describe().round(2))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Summary statistics for all columns (including non-numeric)\n",
"print(\"Summary for all columns:\")\n",
"print(df_sales.describe(include='all'))"
]
},
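{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a small extra sketch: `.describe()` also accepts a `percentiles` argument, so you can swap the default quartiles for any cut points you care about (the median is always included)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Summary statistics with custom percentiles (replaces the default 25/50/75 split)\n",
"print(df_sales.describe(percentiles=[0.1, 0.5, 0.9]).round(2))"
]
},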
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Individual statistics\n",
"print(\"Individual Statistical Measures:\")\n",
"print(f\"Mean sales: {df_sales['Sales'].mean():.2f}\")\n",
"print(f\"Median sales: {df_sales['Sales'].median():.2f}\")\n",
"print(f\"Standard deviation: {df_sales['Sales'].std():.2f}\")\n",
"print(f\"Minimum sales: {df_sales['Sales'].min()}\")\n",
"print(f\"Maximum sales: {df_sales['Sales'].max()}\")\n",
"print(f\"Sales range: {df_sales['Sales'].max() - df_sales['Sales'].min()}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Quantiles and percentiles\n",
"print(\"Quantiles for Sales:\")\n",
"print(f\"25th percentile (Q1): {df_sales['Sales'].quantile(0.25)}\")\n",
"print(f\"50th percentile (Q2/Median): {df_sales['Sales'].quantile(0.50)}\")\n",
"print(f\"75th percentile (Q3): {df_sales['Sales'].quantile(0.75)}\")\n",
"print(f\"90th percentile: {df_sales['Sales'].quantile(0.90)}\")\n",
"\n",
"print(\"\\nCustom quantiles:\")\n",
"quantiles = df_sales['Sales'].quantile([0.1, 0.3, 0.7, 0.9])\n",
"print(quantiles)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Counting and Unique Values\n",
"\n",
"Understand the distribution of categorical data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Count unique values in each column\n",
"print(\"Number of unique values per column:\")\n",
"print(df_sales.nunique())\n",
"\n",
"print(\"\\nUnique values in 'Product' column:\")\n",
"print(df_sales['Product'].unique())\n",
"\n",
"print(\"\\nValue counts for 'Product':\")\n",
"print(df_sales['Product'].value_counts())"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Product distribution (counts and percentages):\n",
" Count Percentage\n",
"Product \n",
"Laptop 8 40.0\n",
"Phone 8 40.0\n",
"Tablet 4 20.0\n"
]
}
],
"source": [
"# Value counts with percentages\n",
"print(\"Product distribution (counts and percentages):\")\n",
"product_counts = df_sales['Product'].value_counts()\n",
"product_percentages = df_sales['Product'].value_counts(normalize=True) * 100\n",
"\n",
"distribution = pd.DataFrame({\n",
" 'Count': product_counts,\n",
" 'Percentage': product_percentages.round(1)\n",
"})\n",
"print(distribution)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Cross-tabulation\n",
"print(\"Cross-tabulation of Product vs Region:\")\n",
"crosstab = pd.crosstab(df_sales['Product'], df_sales['Region'])\n",
"print(crosstab)\n",
"\n",
"print(\"\\nWith percentages:\")\n",
"crosstab_pct = pd.crosstab(df_sales['Product'], df_sales['Region'], normalize='all') * 100\n",
"print(crosstab_pct.round(1))"
]
},
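{
"cell_type": "markdown",
"metadata": {},
"source": [
"`pd.crosstab` can also append row and column totals via `margins=True`, which is a quick way to sanity-check the counts."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Cross-tabulation with row/column totals added as a 'Total' margin\n",
"crosstab_totals = pd.crosstab(df_sales['Product'], df_sales['Region'],\n",
"                              margins=True, margins_name='Total')\n",
"print(crosstab_totals)"
]
},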
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. Data Quality Checks\n",
"\n",
"Essential checks for data quality and integrity."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Check for missing values\n",
"print(\"Missing values per column:\")\n",
"print(df_sales.isnull().sum())\n",
"\n",
"print(\"\\nPercentage of missing values:\")\n",
"missing_percentages = (df_sales.isnull().sum() / len(df_sales)) * 100\n",
"print(missing_percentages.round(2))\n",
"\n",
"print(\"\\nAny missing values in dataset?\", df_sales.isnull().any().any())"
]
},
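{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see *why* missing values matter, here is a quick sketch on a throwaway copy of the data: by default pandas skips `NaN` when computing statistics (`skipna=True`), which silently changes the sample size behind the number you get back."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Demonstrate the effect of a missing value on statistics (on a copy, so df_sales is untouched)\n",
"df_demo = df_sales.copy()\n",
"df_demo.loc[0, 'Sales'] = np.nan\n",
"\n",
"print(f\"Mean with NaN skipped (default): {df_demo['Sales'].mean():.2f}\")\n",
"print(f\"Mean with skipna=False: {df_demo['Sales'].mean(skipna=False)}\")  # NaN propagates\n",
"print(f\"Non-null count: {df_demo['Sales'].count()} of {len(df_demo)}\")"
]
},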
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Check for duplicates\n",
"print(f\"Number of duplicate rows: {df_sales.duplicated().sum()}\")\n",
"print(f\"Any duplicate rows? {df_sales.duplicated().any()}\")\n",
"\n",
"# Check for duplicates based on specific columns\n",
"print(f\"\\nDuplicate combinations of Date and Salesperson: {df_sales.duplicated(['Date', 'Salesperson']).sum()}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6. Quick Data Exploration\n",
"\n",
"Rapid exploration techniques to understand your data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Quick exploration function\n",
"def quick_explore(df, column_name):\n",
" \"\"\"Quick exploration of a specific column\"\"\"\n",
" print(f\"=== Quick Exploration: {column_name} ===\")\n",
" col = df[column_name]\n",
" \n",
" print(f\"Data type: {col.dtype}\")\n",
" print(f\"Non-null values: {col.count()}/{len(col)}\")\n",
" print(f\"Unique values: {col.nunique()}\")\n",
" \n",
" if col.dtype in ['int64', 'float64']:\n",
" print(f\"Min: {col.min()}, Max: {col.max()}\")\n",
" print(f\"Mean: {col.mean():.2f}, Median: {col.median():.2f}\")\n",
" else:\n",
" print(f\"Most common: {col.mode().iloc[0] if not col.mode().empty else 'N/A'}\")\n",
" print(f\"Sample values: {col.unique()[:5].tolist()}\")\n",
" print()\n",
"\n",
"# Explore different columns\n",
"for col in ['Sales', 'Product', 'Region']:\n",
" quick_explore(df_sales, col)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Practice Exercises\n",
"\n",
"Test your understanding with these exercises:"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [],
"source": [
"# Exercise 1: Create a larger dataset and explore it\n",
"# Create a dataset with 100 rows and at least 5 columns\n",
"# Include different data types (numeric, categorical, datetime)\n",
"\n",
"# Your code here:\n"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [],
"source": [
"# Exercise 2: Write a function that provides a complete data profile\n",
"# Include: shape, data types, missing values, unique values, and basic stats\n",
"\n",
"def data_profile(df):\n",
" \"\"\"Provide a comprehensive data profile\"\"\"\n",
" # Your code here:\n",
" pass\n",
"\n",
"# Test your function\n",
"# data_profile(df_sales)"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [],
"source": [
"# Exercise 3: Find interesting insights from the sales data\n",
"# Questions to answer:\n",
"# 1. Which product has the highest average sales?\n",
"# 2. Which region has the most consistent sales (lowest standard deviation)?\n",
"# 3. What's the total commission earned by each salesperson?\n",
"\n",
"# Your code here:\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Key Takeaways\n",
"\n",
"1. **`.head()` and `.tail()`** are essential for quick data inspection\n",
"2. **`.info()`** provides comprehensive DataFrame structure information\n",
"3. **`.describe()`** gives statistical summaries for numerical columns\n",
"4. **`.nunique()` and `.value_counts()`** help understand categorical data\n",
"5. **Always check for missing values** and duplicates in your data\n",
"6. **Statistical measures** (mean, median, std) provide insights into data distribution\n",
"7. **Cross-tabulation** helps understand relationships between categorical variables\n",
"\n",
"## Common Gotchas\n",
"\n",
"- `.describe()` only includes numeric columns by default (use `include='all'` for all columns)\n",
"- Missing values can affect statistical calculations\n",
"- Large datasets might need memory-efficient exploration techniques\n",
"- Always verify data types are correct for your analysis"
]
}
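,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"One memory-efficiency technique worth knowing (a sketch, not the only option): repetitive string columns can be stored as the `category` dtype, which usually shrinks memory usage considerably on larger datasets."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Compare memory usage before and after converting repetitive string columns to 'category'\n",
"before = df_sales.memory_usage(deep=True).sum()\n",
"\n",
"df_compact = df_sales.copy()\n",
"for col in ['Product', 'Region', 'Salesperson']:\n",
"    df_compact[col] = df_compact[col].astype('category')\n",
"\n",
"after = df_compact.memory_usage(deep=True).sum()\n",
"print(f\"object columns:   {before} bytes\")\n",
"print(f\"category columns: {after} bytes\")"
]
}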
],
"metadata": {
"kernelspec": {
"display_name": "venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.13.3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}