{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Session 1 - DataFrames - Lesson 1: Creating DataFrames\n", "\n", "## Learning Objectives\n", "- Understand different methods to create pandas DataFrames\n", "- Learn to create DataFrames from dictionaries, lists, and NumPy arrays\n", "- Practice with various data types and structures\n", "\n", "## Prerequisites\n", "- Basic Python knowledge\n", "- Understanding of lists and dictionaries" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Pandas version: 2.2.3\n", "NumPy version: 2.2.6\n" ] } ], "source": [ "# Import required libraries\n", "import pandas as pd\n", "import numpy as np\n", "from datetime import datetime, timedelta\n", "\n", "print(f\"Pandas version: {pd.__version__}\")\n", "print(f\"NumPy version: {np.__version__}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Method 1: Creating DataFrame from Dictionary\n", "\n", "This is the most common and intuitive way to create a DataFrame." 
] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Student DataFrame:\n", "      Name  Age Grade  Score\n", "0    Alice   23     A     95\n", "1      Bob   25     B     87\n", "2  Charlie   22     A     92\n", "3    Diana   24     C     78\n", "4      Eve   23     B     89\n", "\n", "Shape: (5, 4)\n", "Data types:\n", "Name     object\n", "Age       int64\n", "Grade    object\n", "Score     int64\n", "dtype: object\n" ] } ], "source": [ "# Creating DataFrame from dictionary\n", "student_data = {\n", "    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],\n", "    'Age': [23, 25, 22, 24, 23],\n", "    'Grade': ['A', 'B', 'A', 'C', 'B'],\n", "    'Score': [95, 87, 92, 78, 89]\n", "}\n", "\n", "df_students = pd.DataFrame(student_data)\n", "print(\"Student DataFrame:\")\n", "print(df_students)\n", "print(f\"\\nShape: {df_students.shape}\")\n", "print(f\"Data types:\\n{df_students.dtypes}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Method 2: Creating DataFrame from Lists\n", "\n", "You can create a DataFrame from separate lists by using each list as a column in a dictionary. All lists must have the same length." 
] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Cities DataFrame:\n", "       City  Population_Million    Country\n", "0  New York                 8.4        USA\n", "1    London                 8.9         UK\n", "2     Tokyo                13.9      Japan\n", "3     Paris                 2.1     France\n", "4    Sydney                 5.3  Australia\n", "\n", "Index: [0, 1, 2, 3, 4]\n", "Columns: ['City', 'Population_Million', 'Country']\n" ] } ], "source": [ "# Creating DataFrame from separate lists\n", "cities = ['New York', 'London', 'Tokyo', 'Paris', 'Sydney']\n", "populations = [8.4, 8.9, 13.9, 2.1, 5.3]\n", "countries = ['USA', 'UK', 'Japan', 'France', 'Australia']\n", "\n", "df_cities = pd.DataFrame({\n", "    'City': cities,\n", "    'Population_Million': populations,\n", "    'Country': countries\n", "})\n", "\n", "print(\"Cities DataFrame:\")\n", "print(df_cities)\n", "print(f\"\\nIndex: {df_cities.index.tolist()}\")\n", "print(f\"Columns: {df_cities.columns.tolist()}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Method 3: Creating DataFrame from NumPy Array\n", "\n", "This method is useful when working with numerical data or when you need random data for testing." 
] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Random DataFrame:\n", "      Column_A  Column_B  Column_C\n", "Row1        52        93        15\n", "Row2        72        61        21\n", "Row3        83        87        75\n", "Row4        75        88        24\n", "Row5         3        22        53\n", "\n", "Summary statistics:\n", "        Column_A   Column_B   Column_C\n", "count   5.000000   5.000000   5.000000\n", "mean   57.000000  70.200000  37.600000\n", "std    32.272279  29.693434  25.530374\n", "min     3.000000  22.000000  15.000000\n", "25%    52.000000  61.000000  21.000000\n", "50%    72.000000  87.000000  24.000000\n", "75%    75.000000  88.000000  53.000000\n", "max    83.000000  93.000000  75.000000\n" ] } ], "source": [ "# Creating DataFrame from NumPy array\n", "np.random.seed(42)  # For reproducible results\n", "# Random integers from 1 to 99 (the upper bound of randint is exclusive)\n", "random_data = np.random.randint(1, 100, size=(5, 3))\n", "\n", "df_random = pd.DataFrame(random_data,\n", "                         columns=['Column_A', 'Column_B', 'Column_C'],\n", "                         index=['Row1', 'Row2', 'Row3', 'Row4', 'Row5'])\n", "\n", "print(\"Random DataFrame:\")\n", "print(df_random)\n", "print(\"\\nSummary statistics:\")\n", "print(df_random.describe())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Method 4: Creating DataFrame with Custom Index\n", "\n", "You can specify custom row labels (index) when creating DataFrames." 
] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Products DataFrame with Custom Index:\n", "         Product  Price  Stock\n", "PROD001   Laptop   1200     15\n", "PROD002    Phone    800     50\n", "PROD003   Tablet    600     30\n", "PROD004  Monitor    300     20\n", "\n", "Accessing by index label 'PROD002':\n", "Product    Phone\n", "Price        800\n", "Stock         50\n", "Name: PROD002, dtype: object\n" ] } ], "source": [ "# Creating DataFrame with custom index\n", "product_data = {\n", "    'Product': ['Laptop', 'Phone', 'Tablet', 'Monitor'],\n", "    'Price': [1200, 800, 600, 300],\n", "    'Stock': [15, 50, 30, 20]\n", "}\n", "\n", "# Custom index using product codes\n", "custom_index = ['PROD001', 'PROD002', 'PROD003', 'PROD004']\n", "df_products = pd.DataFrame(product_data, index=custom_index)\n", "\n", "print(\"Products DataFrame with Custom Index:\")\n", "print(df_products)\n", "print(\"\\nAccessing by index label 'PROD002':\")\n", "print(df_products.loc['PROD002'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Method 5: Creating Empty DataFrame and Adding Data\n", "\n", "Sometimes you need to start with an empty DataFrame and add data incrementally." 
] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Empty DataFrame:\n", "Empty DataFrame\n", "Columns: [Date, Temperature, Humidity, Pressure]\n", "Index: []\n", "Shape: (0, 4)\n", "\n", "DataFrame after adding data:\n", "         Date  Temperature  Humidity  Pressure\n", "0  2024-01-01         22.5        65    1013.2\n", "1  2024-01-02         24.1        68    1015.1\n", "2  2024-01-03         21.8        72    1012.8\n" ] } ], "source": [ "# Creating empty DataFrame with specified columns\n", "columns = ['Date', 'Temperature', 'Humidity', 'Pressure']\n", "df_weather = pd.DataFrame(columns=columns)\n", "\n", "print(\"Empty DataFrame:\")\n", "print(df_weather)\n", "print(f\"Shape: {df_weather.shape}\")\n", "\n", "# Adding data row by row (not recommended for large datasets;\n", "# building it in one call is faster: pd.DataFrame(weather_data, columns=columns))\n", "weather_data = [\n", "    ['2024-01-01', 22.5, 65, 1013.2],\n", "    ['2024-01-02', 24.1, 68, 1015.1],\n", "    ['2024-01-03', 21.8, 72, 1012.8]\n", "]\n", "\n", "for row in weather_data:\n", "    df_weather.loc[len(df_weather)] = row\n", "\n", "print(\"\\nDataFrame after adding data:\")\n", "print(df_weather)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Practice Exercises\n", "\n", "Try these exercises to reinforce your learning:" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "# Exercise 1: Create a DataFrame from dictionary with employee information\n", "# Include: Employee ID, Name, Department, Salary, Years of Experience\n", "\n", "# Your code here:\n", "employee_data = {\n", "    # Add your data here\n", "}\n", "\n", "# df_employees = pd.DataFrame(employee_data)\n", "# print(df_employees)" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "# Exercise 2: Create a DataFrame using NumPy with 6 rows and 4 columns\n", "# Use column names: 'A', 'B', 'C', 'D'\n", "# Use row indices: 'R1', 'R2', 'R3', 'R4', 'R5', 'R6'\n", "\n", "# Your code here:\n" ] }, { "cell_type": "code", 
"execution_count": 27, "metadata": {}, "outputs": [], "source": [ "# Exercise 3: Create a DataFrame with mixed data types\n", "# Include at least one string, integer, float, and boolean column\n", "\n", "# Your code here:\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Key Takeaways\n", "\n", "1. **Dictionary method** is most intuitive for creating DataFrames\n", "2. **NumPy arrays** are useful for numerical data and testing\n", "3. **Custom indices** provide meaningful row labels\n", "4. **Empty DataFrames** can be useful but avoid adding rows one by one for large datasets\n", "5. Always check the **shape** and **data types** of your DataFrame after creation\n" ] } ], "metadata": { "kernelspec": { "display_name": "venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.3" } }, "nbformat": 4, "nbformat_minor": 4 }