{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Session 1 - DataFrames - Lesson 2: Basic Operations\n", "\n", "## Learning Objectives\n", "- Learn essential methods to explore DataFrame structure\n", "- Understand how to get basic information about your data\n", "- Master data inspection techniques\n", "- Practice with summary statistics\n", "\n", "## Prerequisites\n", "- Completed Lesson 1: Creating DataFrames\n", "- Basic understanding of pandas DataFrames" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "# Import required libraries\n", "import pandas as pd\n", "import numpy as np\n", "from datetime import datetime, timedelta\n", "\n", "# Set display options for better output\n", "pd.set_option('display.max_columns', None)\n", "pd.set_option('display.width', None)\n", "pd.set_option('display.max_colwidth', 50)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creating Sample Dataset\n", "\n", "Let's create a comprehensive sales dataset to practice basic operations." 
] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Sales Dataset Created!\n", "Dataset shape: (20, 6)\n" ] } ], "source": [ "# Create a comprehensive sales dataset\n", "np.random.seed(42)\n", "\n", "sales_data = {\n", " 'Date': pd.date_range('2024-01-01', periods=20, freq='D'),\n", " 'Product': ['Laptop', 'Phone', 'Tablet', 'Laptop', 'Phone'] * 4,\n", " 'Sales': [1200, 800, 600, 1100, 850, 1300, 750, 650, 1250, 900,\n", " 1150, 820, 700, 1180, 880, 1220, 780, 620, 1300, 850],\n", " 'Region': ['North', 'South', 'East', 'West', 'North'] * 4,\n", " 'Salesperson': ['John', 'Sarah', 'Mike', 'Lisa', 'Tom'] * 4,\n", " 'Commission_Rate': [0.10, 0.12, 0.08, 0.11, 0.09] * 4\n", "}\n", "\n", "df_sales = pd.DataFrame(sales_data)\n", "print(\"Sales Dataset Created!\")\n", "print(f\"Dataset shape: {df_sales.shape}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Viewing Data\n", "\n", "These methods help you quickly inspect your data." 
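] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A quick aside (a minimal sketch using the `df_sales` frame created above): `sample()` can also draw a *fraction* of the rows via `frac=` instead of a fixed count, and `random_state=` makes the draw reproducible." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Sample 20% of the rows (20 rows * 0.2 = 4 rows); random_state fixes the draw\n", "print(df_sales.sample(frac=0.2, random_state=1))"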
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# View first few rows\n", "print(\"First 5 rows (default):\")\n", "print(df_sales.head())\n", "\n", "print(\"\\nFirst 3 rows:\")\n", "print(df_sales.head(3))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# View last few rows\n", "print(\"Last 5 rows (default):\")\n", "print(df_sales.tail())\n", "\n", "print(\"\\nLast 3 rows:\")\n", "print(df_sales.tail(3))" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Random sample of 5 rows:\n", " Date Product Sales Region Salesperson Commission_Rate\n", "0 2024-01-01 Laptop 1200 North John 0.10\n", "17 2024-01-18 Tablet 620 East Mike 0.08\n", "15 2024-01-16 Laptop 1220 North John 0.10\n", "1 2024-01-02 Phone 800 South Sarah 0.12\n", "8 2024-01-09 Laptop 1250 West Lisa 0.11\n", "\n", "Random sample with different random state:\n", " Date Product Sales Region Salesperson Commission_Rate\n", "7 2024-01-08 Tablet 650 East Mike 0.08\n", "10 2024-01-11 Laptop 1150 North John 0.10\n", "5 2024-01-06 Laptop 1300 North John 0.10\n" ] } ], "source": [ "# Sample random rows\n", "print(\"Random sample of 5 rows:\")\n", "print(df_sales.sample(5))\n", "\n", "print(\"\\nRandom sample with different random state:\")\n", "print(df_sales.sample(3, random_state=10))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. DataFrame Information\n", "\n", "Get detailed information about your DataFrame structure." 
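] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Besides the `info()` report, two related calls are worth a quick sketch (using the same `df_sales` frame): `memory_usage(deep=True)` returns per-column memory in bytes, and `select_dtypes()` filters columns by data type." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Per-column memory in bytes; deep=True measures string contents, not just pointers\n", "print(df_sales.memory_usage(deep=True))\n", "\n", "# Keep only the numeric columns\n", "print(\"\\nNumeric columns:\", df_sales.select_dtypes(include='number').columns.tolist())"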
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Comprehensive information about the DataFrame\n", "print(\"DataFrame Info:\")\n", "df_sales.info()\n", "\n", "# memory_usage='deep' reports actual object memory, not just pointer sizes\n", "print(\"\\nSame report with exact (deep) memory usage:\")\n", "df_sales.info(memory_usage='deep')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Basic properties\n", "print(f\"Shape (rows, columns): {df_sales.shape}\")\n", "print(f\"Number of rows: {len(df_sales)}\")\n", "print(f\"Number of columns: {len(df_sales.columns)}\")\n", "print(f\"Total elements: {df_sales.size}\")\n", "print(f\"Dimensions: {df_sales.ndim}\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Column and index information\n", "print(\"Column names:\")\n", "print(df_sales.columns.tolist())\n", "\n", "print(\"\\nData types:\")\n", "print(df_sales.dtypes)\n", "\n", "print(\"\\nIndex information:\")\n", "print(f\"Index: {df_sales.index}\")\n", "print(f\"Index type: {type(df_sales.index)}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Summary Statistics\n", "\n", "Understand your data through statistical summaries."
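] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Two useful variations, sketched here against the `Sales` column: `describe()` accepts a custom `percentiles` list, and `.agg()` computes only the statistics you name." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# describe() with custom percentiles (the median is always included)\n", "print(df_sales['Sales'].describe(percentiles=[0.05, 0.95]))\n", "\n", "# Compute just the statistics you ask for\n", "print(\"\\n\", df_sales['Sales'].agg(['mean', 'median', 'std']))"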
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Summary statistics for numerical columns\n", "print(\"Summary statistics:\")\n", "print(df_sales.describe())\n", "\n", "print(\"\\nRounded to 2 decimal places:\")\n", "print(df_sales.describe().round(2))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Summary statistics for all columns (including non-numeric)\n", "print(\"Summary for all columns:\")\n", "print(df_sales.describe(include='all'))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Individual statistics\n", "print(\"Individual Statistical Measures:\")\n", "print(f\"Mean sales: {df_sales['Sales'].mean():.2f}\")\n", "print(f\"Median sales: {df_sales['Sales'].median():.2f}\")\n", "print(f\"Standard deviation: {df_sales['Sales'].std():.2f}\")\n", "print(f\"Minimum sales: {df_sales['Sales'].min()}\")\n", "print(f\"Maximum sales: {df_sales['Sales'].max()}\")\n", "print(f\"Sales range: {df_sales['Sales'].max() - df_sales['Sales'].min()}\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Quantiles and percentiles\n", "print(\"Quantiles for Sales:\")\n", "print(f\"25th percentile (Q1): {df_sales['Sales'].quantile(0.25)}\")\n", "print(f\"50th percentile (Q2/Median): {df_sales['Sales'].quantile(0.50)}\")\n", "print(f\"75th percentile (Q3): {df_sales['Sales'].quantile(0.75)}\")\n", "print(f\"90th percentile: {df_sales['Sales'].quantile(0.90)}\")\n", "\n", "print(\"\\nCustom quantiles:\")\n", "quantiles = df_sales['Sales'].quantile([0.1, 0.3, 0.7, 0.9])\n", "print(quantiles)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Counting and Unique Values\n", "\n", "Understand the distribution of categorical data." 
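] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`value_counts()` also works on a whole DataFrame (pandas 1.1+), counting unique *combinations* of columns. A minimal sketch with `df_sales`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Count each unique (Product, Region) combination\n", "print(df_sales[['Product', 'Region']].value_counts())"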
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Count unique values in each column\n", "print(\"Number of unique values per column:\")\n", "print(df_sales.nunique())\n", "\n", "print(\"\\nUnique values in 'Product' column:\")\n", "print(df_sales['Product'].unique())\n", "\n", "print(\"\\nValue counts for 'Product':\")\n", "print(df_sales['Product'].value_counts())" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Product distribution (counts and percentages):\n", " Count Percentage\n", "Product \n", "Laptop 8 40.0\n", "Phone 8 40.0\n", "Tablet 4 20.0\n" ] } ], "source": [ "# Value counts with percentages\n", "print(\"Product distribution (counts and percentages):\")\n", "product_counts = df_sales['Product'].value_counts()\n", "product_percentages = df_sales['Product'].value_counts(normalize=True) * 100\n", "\n", "distribution = pd.DataFrame({\n", " 'Count': product_counts,\n", " 'Percentage': product_percentages.round(1)\n", "})\n", "print(distribution)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Cross-tabulation\n", "print(\"Cross-tabulation of Product vs Region:\")\n", "crosstab = pd.crosstab(df_sales['Product'], df_sales['Region'])\n", "print(crosstab)\n", "\n", "print(\"\\nWith percentages:\")\n", "crosstab_pct = pd.crosstab(df_sales['Product'], df_sales['Region'], normalize='all') * 100\n", "print(crosstab_pct.round(1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. Data Quality Checks\n", "\n", "Essential checks for data quality and integrity." 
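] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The sample dataset happens to be complete, so to watch the missing-value tools do something we can inject a gap into a copy. A sketch (the `df_demo` copy is introduced here purely for illustration):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Work on a copy so df_sales stays intact; df_demo is illustrative only\n", "df_demo = df_sales.copy()\n", "df_demo['Sales'] = df_demo['Sales'].astype('float64')  # int columns cannot hold NaN\n", "df_demo.loc[2, 'Sales'] = np.nan\n", "\n", "print(f\"Missing values in Sales: {df_demo['Sales'].isnull().sum()}\")\n", "print(f\"Rows after dropna(): {len(df_demo.dropna())}\")\n", "print(f\"Missing after median fill: {df_demo['Sales'].fillna(df_demo['Sales'].median()).isnull().sum()}\")"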
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Check for missing values\n", "print(\"Missing values per column:\")\n", "print(df_sales.isnull().sum())\n", "\n", "print(\"\\nPercentage of missing values:\")\n", "missing_percentages = (df_sales.isnull().sum() / len(df_sales)) * 100\n", "print(missing_percentages.round(2))\n", "\n", "print(\"\\nAny missing values in dataset?\", df_sales.isnull().any().any())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Check for duplicates\n", "print(f\"Number of duplicate rows: {df_sales.duplicated().sum()}\")\n", "print(f\"Any duplicate rows? {df_sales.duplicated().any()}\")\n", "\n", "# Check for duplicates based on specific columns\n", "print(f\"\\nDuplicate combinations of Date and Salesperson: {df_sales.duplicated(['Date', 'Salesperson']).sum()}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 6. Quick Data Exploration\n", "\n", "Rapid exploration techniques to understand your data." 
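] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another rapid first look, sketched here with `select_dtypes()`: restrict `describe()` to the text columns to get count, unique, top and freq for every categorical field at once." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Summary of the object-dtype (text) columns only\n", "print(df_sales.select_dtypes(include='object').describe())"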
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Quick exploration function\n", "def quick_explore(df, column_name):\n", " \"\"\"Quick exploration of a specific column\"\"\"\n", " print(f\"=== Quick Exploration: {column_name} ===\")\n", " col = df[column_name]\n", " \n", " print(f\"Data type: {col.dtype}\")\n", " print(f\"Non-null values: {col.count()}/{len(col)}\")\n", " print(f\"Unique values: {col.nunique()}\")\n", " \n", " if col.dtype in ['int64', 'float64']:\n", " print(f\"Min: {col.min()}, Max: {col.max()}\")\n", " print(f\"Mean: {col.mean():.2f}, Median: {col.median():.2f}\")\n", " else:\n", " print(f\"Most common: {col.mode().iloc[0] if not col.mode().empty else 'N/A'}\")\n", " print(f\"Sample values: {col.unique()[:5].tolist()}\")\n", " print()\n", "\n", "# Explore different columns\n", "for col in ['Sales', 'Product', 'Region']:\n", " quick_explore(df_sales, col)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Practice Exercises\n", "\n", "Test your understanding with these exercises:" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [], "source": [ "# Exercise 1: Create a larger dataset and explore it\n", "# Create a dataset with 100 rows and at least 5 columns\n", "# Include different data types (numeric, categorical, datetime)\n", "\n", "# Your code here:\n" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [], "source": [ "# Exercise 2: Write a function that provides a complete data profile\n", "# Include: shape, data types, missing values, unique values, and basic stats\n", "\n", "def data_profile(df):\n", " \"\"\"Provide a comprehensive data profile\"\"\"\n", " # Your code here:\n", " pass\n", "\n", "# Test your function\n", "# data_profile(df_sales)" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [], "source": [ "# Exercise 3: Find interesting insights from the sales data\n", "# Questions to 
answer:\n", "# 1. Which product has the highest average sales?\n", "# 2. Which region has the most consistent sales (lowest standard deviation)?\n", "# 3. What's the total commission earned by each salesperson?\n", "\n", "# Your code here:\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Key Takeaways\n", "\n", "1. **`.head()` and `.tail()`** are essential for quick data inspection\n", "2. **`.info()`** provides comprehensive DataFrame structure information\n", "3. **`.describe()`** gives statistical summaries for numerical columns\n", "4. **`.nunique()` and `.value_counts()`** help understand categorical data\n", "5. **Always check for missing values** and duplicates in your data\n", "6. **Statistical measures** (mean, median, std) provide insights into data distribution\n", "7. **Cross-tabulation** helps understand relationships between categorical variables\n", "\n", "## Common Gotchas\n", "\n", "- `.describe()` only includes numeric columns by default (use `include='all'` for all columns)\n", "- Missing values can affect statistical calculations\n", "- Large datasets might need memory-efficient exploration techniques\n", "- Always verify data types are correct for your analysis" ] } ], "metadata": { "kernelspec": { "display_name": "venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.3" } }, "nbformat": 4, "nbformat_minor": 4 }