crypto_bot_training/Session_01/PandasDataFrame-exmples/02_basic_operations.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Session 1 - DataFrames - Lesson 2: Basic Operations\n",
    "\n",
    "## Learning Objectives\n",
    "- Learn essential methods to explore DataFrame structure\n",
    "- Understand how to get basic information about your data\n",
    "- Master data inspection techniques\n",
    "- Practice with summary statistics\n",
    "\n",
    "## Prerequisites\n",
    "- Completed Lesson 1: Creating DataFrames\n",
    "- Basic understanding of pandas DataFrames"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Import required libraries\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "from datetime import datetime, timedelta\n",
    "\n",
    "# Set display options for better output\n",
    "pd.set_option('display.max_columns', None)\n",
    "pd.set_option('display.width', None)\n",
    "pd.set_option('display.max_colwidth', 50)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Creating Sample Dataset\n",
    "\n",
    "Let's create a small sales dataset (20 rows, 6 columns) to practice basic operations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Sales Dataset Created!\n",
      "Dataset shape: (20, 6)\n"
     ]
    }
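   ],
   "source": [
    "# Create a small sales dataset\n",
    "# Seed NumPy's global RNG: the values below are fixed, but later .sample() calls\n",
    "# without random_state draw from this RNG\n",
    "np.random.seed(42)\n",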
    "\n",
    "sales_data = {\n",
    "    'Date': pd.date_range('2024-01-01', periods=20, freq='D'),\n",
    "    'Product': ['Laptop', 'Phone', 'Tablet', 'Laptop', 'Phone'] * 4,\n",
    "    'Sales': [1200, 800, 600, 1100, 850, 1300, 750, 650, 1250, 900,\n",
    "              1150, 820, 700, 1180, 880, 1220, 780, 620, 1300, 850],\n",
    "    'Region': ['North', 'South', 'East', 'West', 'North'] * 4,\n",
    "    'Salesperson': ['John', 'Sarah', 'Mike', 'Lisa', 'Tom'] * 4,\n",
    "    'Commission_Rate': [0.10, 0.12, 0.08, 0.11, 0.09] * 4\n",
    "}\n",
    "\n",
    "df_sales = pd.DataFrame(sales_data)\n",
    "print(\"Sales Dataset Created!\")\n",
    "print(f\"Dataset shape: {df_sales.shape}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Viewing Data\n",
    "\n",
    "`.head()`, `.tail()`, and `.sample()` let you inspect a few rows without printing the whole DataFrame."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# View first few rows\n",
    "print(\"First 5 rows (default):\")\n",
    "print(df_sales.head())\n",
    "\n",
    "print(\"\\nFirst 3 rows:\")\n",
    "print(df_sales.head(3))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# View last few rows\n",
    "print(\"Last 5 rows (default):\")\n",
    "print(df_sales.tail())\n",
    "\n",
    "print(\"\\nLast 3 rows:\")\n",
    "print(df_sales.tail(3))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Random sample of 5 rows:\n",
      "         Date Product  Sales Region Salesperson  Commission_Rate\n",
      "0  2024-01-01  Laptop   1200  North        John             0.10\n",
      "17 2024-01-18  Tablet    620   East        Mike             0.08\n",
      "15 2024-01-16  Laptop   1220  North        John             0.10\n",
      "1  2024-01-02   Phone    800  South       Sarah             0.12\n",
      "8  2024-01-09  Laptop   1250   West        Lisa             0.11\n",
      "\n",
      "Random sample with different random state:\n",
      "         Date Product  Sales Region Salesperson  Commission_Rate\n",
      "7  2024-01-08  Tablet    650   East        Mike             0.08\n",
      "10 2024-01-11  Laptop   1150  North        John             0.10\n",
      "5  2024-01-06  Laptop   1300  North        John             0.10\n"
     ]
    }
   ],
   "source": [
    "# Sample random rows; pass random_state to get the same rows on every run\n",
    "print(\"Random sample of 5 rows:\")\n",
    "print(df_sales.sample(5))\n",
    "\n",
    "print(\"\\nRandom sample with different random state:\")\n",
    "print(df_sales.sample(3, random_state=10))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. DataFrame Information\n",
    "\n",
    "`.info()`, `.shape`, `.dtypes`, and `.index` describe a DataFrame's structure: its size, column types, non-null counts, and memory footprint."
   ]
  },
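  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Comprehensive information about the DataFrame\n",
    "print(\"DataFrame Info:\")\n",
    "df_sales.info()\n",
    "\n",
    "# memory_usage='deep' inspects the contents of object (string) columns\n",
    "# instead of estimating their size\n",
    "print(\"\\nDataFrame Info with deep memory usage:\")\n",
    "df_sales.info(memory_usage='deep')"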
   ]
  },
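  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`.info(memory_usage='deep')` reports a single total for the whole DataFrame. For a per-column breakdown, `DataFrame.memory_usage(deep=True)` returns the bytes used by the index and by each column. The cell below is a minimal sketch on a hypothetical frame (`demo`, invented for illustration and not part of the lesson data). Exact byte counts depend on your pandas version and platform."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Hypothetical illustration data (not df_sales): 300 rows, one text and one integer column\n",
    "demo = pd.DataFrame({\n",
    "    'Product': ['Laptop', 'Phone', 'Tablet'] * 100,\n",
    "    'Sales': list(range(300))\n",
    "})\n",
    "\n",
    "# Bytes per column; deep=True measures the stored strings themselves\n",
    "per_column = demo.memory_usage(deep=True)\n",
    "print(per_column)\n",
    "print(f\"Total: {per_column.sum()} bytes\")"
   ]
  },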
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Basic properties\n",
    "print(f\"Shape (rows, columns): {df_sales.shape}\")\n",
    "print(f\"Number of rows: {len(df_sales)}\")\n",
    "print(f\"Number of columns: {len(df_sales.columns)}\")\n",
    "print(f\"Total elements: {df_sales.size}\")\n",
    "print(f\"Dimensions: {df_sales.ndim}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Column and index information\n",
    "print(\"Column names:\")\n",
    "print(df_sales.columns.tolist())\n",
    "\n",
    "print(\"\\nData types:\")\n",
    "print(df_sales.dtypes)\n",
    "\n",
    "print(\"\\nIndex information:\")\n",
    "print(f\"Index: {df_sales.index}\")\n",
    "print(f\"Index type: {type(df_sales.index)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Summary Statistics\n",
    "\n",
    "Understand your data through statistical summaries."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Summary statistics (numeric columns; pandas 2.x also summarizes datetime columns)\n",
    "print(\"Summary statistics:\")\n",
    "print(df_sales.describe())\n",
    "\n",
    "print(\"\\nRounded to 2 decimal places:\")\n",
    "print(df_sales.describe().round(2))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Summary statistics for all columns (including non-numeric)\n",
    "print(\"Summary for all columns:\")\n",
    "print(df_sales.describe(include='all'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Individual statistics\n",
    "print(\"Individual Statistical Measures:\")\n",
    "print(f\"Mean sales: {df_sales['Sales'].mean():.2f}\")\n",
    "print(f\"Median sales: {df_sales['Sales'].median():.2f}\")\n",
    "print(f\"Standard deviation: {df_sales['Sales'].std():.2f}\")\n",
    "print(f\"Minimum sales: {df_sales['Sales'].min()}\")\n",
    "print(f\"Maximum sales: {df_sales['Sales'].max()}\")\n",
    "print(f\"Sales range: {df_sales['Sales'].max() - df_sales['Sales'].min()}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Quantiles and percentiles\n",
    "print(\"Quantiles for Sales:\")\n",
    "print(f\"25th percentile (Q1): {df_sales['Sales'].quantile(0.25)}\")\n",
    "print(f\"50th percentile (Q2/Median): {df_sales['Sales'].quantile(0.50)}\")\n",
    "print(f\"75th percentile (Q3): {df_sales['Sales'].quantile(0.75)}\")\n",
    "print(f\"90th percentile: {df_sales['Sales'].quantile(0.90)}\")\n",
    "\n",
    "print(\"\\nCustom quantiles:\")\n",
    "quantiles = df_sales['Sales'].quantile([0.1, 0.3, 0.7, 0.9])\n",
    "print(quantiles)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Counting and Unique Values\n",
    "\n",
    "Understand the distribution of categorical data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Count unique values in each column\n",
    "print(\"Number of unique values per column:\")\n",
    "print(df_sales.nunique())\n",
    "\n",
    "print(\"\\nUnique values in 'Product' column:\")\n",
    "print(df_sales['Product'].unique())\n",
    "\n",
    "print(\"\\nValue counts for 'Product':\")\n",
    "print(df_sales['Product'].value_counts())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Product distribution (counts and percentages):\n",
      "         Count  Percentage\n",
      "Product                   \n",
      "Laptop       8        40.0\n",
      "Phone        8        40.0\n",
      "Tablet       4        20.0\n"
     ]
    }
   ],
   "source": [
    "# Value counts with percentages\n",
    "print(\"Product distribution (counts and percentages):\")\n",
    "product_counts = df_sales['Product'].value_counts()\n",
    "product_percentages = df_sales['Product'].value_counts(normalize=True) * 100\n",
    "\n",
    "distribution = pd.DataFrame({\n",
    "    'Count': product_counts,\n",
    "    'Percentage': product_percentages.round(1)\n",
    "})\n",
    "print(distribution)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Cross-tabulation\n",
    "print(\"Cross-tabulation of Product vs Region:\")\n",
    "crosstab = pd.crosstab(df_sales['Product'], df_sales['Region'])\n",
    "print(crosstab)\n",
    "\n",
    "print(\"\\nAs percentages of all rows (normalize='all'):\")\n",
    "crosstab_pct = pd.crosstab(df_sales['Product'], df_sales['Region'], normalize='all') * 100\n",
    "print(crosstab_pct.round(1))"
   ]
  },
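  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`normalize='all'` divides each cell by the grand total. `pd.crosstab` also accepts `normalize='index'` (each row sums to 100%) and `normalize='columns'` (each column sums to 100%). The cell below is a minimal sketch on a small hypothetical frame (`demo`, invented for illustration and not part of the lesson data)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Hypothetical illustration data, not the lesson's df_sales\n",
    "demo = pd.DataFrame({\n",
    "    'Product': ['Laptop', 'Laptop', 'Phone', 'Phone', 'Phone', 'Tablet'],\n",
    "    'Region': ['North', 'West', 'North', 'South', 'South', 'East']\n",
    "})\n",
    "\n",
    "# Share of each region within a product: every row sums to 100\n",
    "by_product = pd.crosstab(demo['Product'], demo['Region'], normalize='index') * 100\n",
    "print(by_product.round(1))\n",
    "\n",
    "# Share of each product within a region: every column sums to 100\n",
    "by_region = pd.crosstab(demo['Product'], demo['Region'], normalize='columns') * 100\n",
    "print(by_region.round(1))"
   ]
  },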
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. Data Quality Checks\n",
    "\n",
    "Essential checks for data quality and integrity."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Check for missing values\n",
    "print(\"Missing values per column:\")\n",
    "print(df_sales.isnull().sum())\n",
    "\n",
    "print(\"\\nPercentage of missing values:\")\n",
    "missing_percentages = (df_sales.isnull().sum() / len(df_sales)) * 100\n",
    "print(missing_percentages.round(2))\n",
    "\n",
    "print(\"\\nAny missing values in dataset?\", df_sales.isnull().any().any())"
   ]
  },
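  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The sales data has no gaps, so every check above reports zero. As a minimal sketch of what missing data does to these checks, the next cell uses a small hypothetical Series (`demo`, invented for illustration and not part of the lesson data) with one `NaN`. It shows that `.count()` ignores nulls and that `.mean()` skips them unless `skipna=False` is passed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical illustration data (not from df_sales): one of four values is missing\n",
    "demo = pd.Series([100.0, 200.0, np.nan, 400.0])\n",
    "\n",
    "print(f\"Missing values: {demo.isnull().sum()}\")\n",
    "print(f\"Non-null values: {demo.count()} of {len(demo)}\")\n",
    "\n",
    "# NaN is skipped by default (skipna=True), so the mean is taken over three values\n",
    "print(f\"Mean with NaN skipped: {demo.mean():.2f}\")\n",
    "\n",
    "# With skipna=False, a single NaN makes the result NaN\n",
    "print(f\"Mean with skipna=False: {demo.mean(skipna=False)}\")"
   ]
  },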
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Check for duplicates\n",
    "print(f\"Number of duplicate rows: {df_sales.duplicated().sum()}\")\n",
    "print(f\"Any duplicate rows? {df_sales.duplicated().any()}\")\n",
    "\n",
    "# Check for duplicates based on specific columns\n",
    "print(f\"\\nDuplicate combinations of Date and Salesperson: {df_sales.duplicated(['Date', 'Salesperson']).sum()}\")"
   ]
  },
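  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`df_sales` contains no repeated rows, so both counts above are 0. The sketch below uses a tiny hypothetical frame (`demo`, invented for illustration and not part of the lesson data) to show what `.duplicated()` flags: by default only the second and later copies are marked, while `keep=False` marks every copy."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Hypothetical illustration data: the first two rows are identical\n",
    "demo = pd.DataFrame({\n",
    "    'Date': ['2024-01-01', '2024-01-01', '2024-01-02'],\n",
    "    'Salesperson': ['John', 'John', 'Sarah'],\n",
    "    'Sales': [1200, 1200, 800]\n",
    "})\n",
    "\n",
    "# Default keep='first': the first occurrence is not counted as a duplicate\n",
    "repeat_mask = demo.duplicated()\n",
    "print(f\"Rows flagged with keep='first': {repeat_mask.sum()}\")\n",
    "print(demo[repeat_mask])\n",
    "\n",
    "# keep=False flags every row that has at least one identical copy\n",
    "all_copies_mask = demo.duplicated(keep=False)\n",
    "print(f\"\\nRows flagged with keep=False: {all_copies_mask.sum()}\")"
   ]
  },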
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6. Quick Data Exploration\n",
    "\n",
    "Rapid exploration techniques to understand your data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Quick exploration function\n",
    "def quick_explore(df, column_name):\n",
    "    \"\"\"Print the dtype, null count, unique count, and basic stats of one column.\"\"\"\n",
    "    print(f\"=== Quick Exploration: {column_name} ===\")\n",
    "    col = df[column_name]\n",
    "    \n",
    "    print(f\"Data type: {col.dtype}\")\n",
    "    print(f\"Non-null values: {col.count()}/{len(col)}\")\n",
    "    print(f\"Unique values: {col.nunique()}\")\n",
    "    \n",
    "    # is_numeric_dtype covers every integer/float width, not just int64/float64\n",
    "    if pd.api.types.is_numeric_dtype(col):\n",
    "        print(f\"Min: {col.min()}, Max: {col.max()}\")\n",
    "        print(f\"Mean: {col.mean():.2f}, Median: {col.median():.2f}\")\n",
    "    else:\n",
    "        modes = col.mode()  # empty when the column has no non-null values\n",
    "        print(f\"Most common: {modes.iloc[0] if not modes.empty else 'N/A'}\")\n",
    "        print(f\"Sample values: {col.unique()[:5].tolist()}\")\n",
    "    print()\n",
    "\n",
    "# Explore different columns\n",
    "for col in ['Sales', 'Product', 'Region']:\n",
    "    quick_explore(df_sales, col)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Practice Exercises\n",
    "\n",
    "Test your understanding with these exercises:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Exercise 1: Create a larger dataset and explore it\n",
    "# Create a dataset with 100 rows and at least 5 columns\n",
    "# Include different data types (numeric, categorical, datetime)\n",
    "\n",
    "# Your code here:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Exercise 2: Write a function that provides a complete data profile\n",
    "# Include: shape, data types, missing values, unique values, and basic stats\n",
    "\n",
    "def data_profile(df):\n",
    "    \"\"\"Provide a comprehensive data profile\"\"\"\n",
    "    # Your code here:\n",
    "    pass\n",
    "\n",
    "# Test your function\n",
    "# data_profile(df_sales)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Exercise 3: Find interesting insights from the sales data\n",
    "# Questions to answer:\n",
    "# 1. Which product has the highest average sales?\n",
    "# 2. Which region has the most consistent sales (lowest standard deviation)?\n",
    "# 3. What's the total commission earned by each salesperson?\n",
    "\n",
    "# Your code here:\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Key Takeaways\n",
    "\n",
    "1. **`.head()` and `.tail()`** are essential for quick data inspection\n",
    "2. **`.info()`** provides comprehensive DataFrame structure information\n",
    "3. **`.describe()`** gives statistical summaries for numeric columns by default\n",
    "4. **`.nunique()` and `.value_counts()`** help understand categorical data\n",
    "5. **Always check for missing values** and duplicates in your data\n",
    "6. **Statistical measures** (mean, median, std) provide insights into data distribution\n",
    "7. **Cross-tabulation** helps understand relationships between categorical variables\n",
    "\n",
    "## Common Gotchas\n",
    "\n",
    "- `.describe()` summarizes only numeric columns by default (plus datetime columns in pandas 2.x); use `include='all'` to cover every column\n",
    "- Missing values are skipped by default (`skipna=True`) in `.mean()`, `.sum()`, and similar methods, so a statistic may describe fewer rows than `len(df)`\n",
    "- On large datasets, avoid printing everything: check memory with `.info(memory_usage='deep')` and inspect a `.head()` or `.sample()` instead\n",
    "- Always verify data types are correct for your analysis"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.13.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}