crypto_bot_training/Session_01/PandasDataFrame-exmples/01_creating_dataframes.ipynb
2025-06-13 07:25:59 +02:00

{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Session 1 - DataFrames - Lesson 1: Creating DataFrames\n",
"\n",
"## Learning Objectives\n",
"- Understand different methods to create pandas DataFrames\n",
"- Learn to create DataFrames from dictionaries, lists, and NumPy arrays\n",
"- Practice with various data types and structures\n",
"\n",
"## Prerequisites\n",
"- Basic Python knowledge\n",
"- Understanding of lists and dictionaries"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Pandas version: 2.2.3\n",
"NumPy version: 2.2.6\n"
]
}
],
"source": [
"# Import required libraries\n",
"import pandas as pd\n",
"import numpy as np\n",
"from datetime import datetime, timedelta\n",
"\n",
"print(f\"Pandas version: {pd.__version__}\")\n",
"print(f\"NumPy version: {np.__version__}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Method 1: Creating DataFrame from Dictionary\n",
"\n",
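"Passing a dictionary is the most common and intuitive way to create a DataFrame: each key becomes a column name and each list supplies that column's values, so all lists must have the same length."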
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Student DataFrame:\n",
" Name Age Grade Score\n",
"0 Alice 23 A 95\n",
"1 Bob 25 B 87\n",
"2 Charlie 22 A 92\n",
"3 Diana 24 C 78\n",
"4 Eve 23 B 89\n",
"\n",
"Shape: (5, 4)\n",
"Data types:\n",
"Name object\n",
"Age int64\n",
"Grade object\n",
"Score int64\n",
"dtype: object\n"
]
}
],
"source": [
"# Creating DataFrame from dictionary\n",
"student_data = {\n",
" 'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],\n",
" 'Age': [23, 25, 22, 24, 23],\n",
" 'Grade': ['A', 'B', 'A', 'C', 'B'],\n",
" 'Score': [95, 87, 92, 78, 89]\n",
"}\n",
"\n",
"df_students = pd.DataFrame(student_data)\n",
"print(\"Student DataFrame:\")\n",
"print(df_students)\n",
"print(f\"\\nShape: {df_students.shape}\")\n",
"print(f\"Data types:\\n{df_students.dtypes}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Method 2: Creating DataFrame from Lists\n",
"\n",
"You can create DataFrames from separate lists by combining them in a dictionary."
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Cities DataFrame:\n",
" City Population_Million Country\n",
"0 New York 8.4 USA\n",
"1 London 8.9 UK\n",
"2 Tokyo 13.9 Japan\n",
"3 Paris 2.1 France\n",
"4 Sydney 5.3 Australia\n",
"\n",
"Index: [0, 1, 2, 3, 4]\n",
"Columns: ['City', 'Population_Million', 'Country']\n"
]
}
],
"source": [
"# Creating DataFrame from separate lists\n",
"cities = ['New York', 'London', 'Tokyo', 'Paris', 'Sydney']\n",
"populations = [8.4, 8.9, 13.9, 2.1, 5.3]\n",
"countries = ['USA', 'UK', 'Japan', 'France', 'Australia']\n",
"\n",
"df_cities = pd.DataFrame({\n",
" 'City': cities,\n",
" 'Population_Million': populations,\n",
" 'Country': countries\n",
"})\n",
"\n",
"print(\"Cities DataFrame:\")\n",
"print(df_cities)\n",
"print(f\"\\nIndex: {df_cities.index.tolist()}\")\n",
"print(f\"Columns: {df_cities.columns.tolist()}\")"
]
},
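{
"cell_type": "markdown",
"metadata": {},
"source": [
"When the data arrives as rows rather than as columns, a list of row lists can be passed directly, with the labels supplied through `columns`. The cell below is a minimal sketch of that variation; `city_rows`, `df_cities_rows`, and `df_zipped` are illustrative names and not part of the original lesson."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: build a cities table from a list of rows\n",
"# (city_rows, df_cities_rows and df_zipped are illustrative names)\n",
"import pandas as pd\n",
"\n",
"city_rows = [\n",
"    ['New York', 8.4, 'USA'],\n",
"    ['London', 8.9, 'UK'],\n",
"    ['Tokyo', 13.9, 'Japan'],\n",
"]\n",
"\n",
"# Each inner list is one row; columns= supplies the column labels\n",
"df_cities_rows = pd.DataFrame(city_rows, columns=['City', 'Population_Million', 'Country'])\n",
"print(df_cities_rows)\n",
"\n",
"# Separate column lists can also be paired into rows with zip\n",
"more_cities = ['Paris', 'Sydney']\n",
"more_populations = [2.1, 5.3]\n",
"df_zipped = pd.DataFrame(list(zip(more_cities, more_populations)),\n",
"                         columns=['City', 'Population_Million'])\n",
"print(df_zipped)"
]
},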
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Method 3: Creating DataFrame from NumPy Array\n",
"\n",
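"This method is useful when working with numerical data or when you need random data for testing. Each row of a 2D array becomes a DataFrame row, and the `columns` and `index` arguments supply the labels."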
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Random DataFrame:\n",
" Column_A Column_B Column_C\n",
"Row1 52 93 15\n",
"Row2 72 61 21\n",
"Row3 83 87 75\n",
"Row4 75 88 24\n",
"Row5 3 22 53\n",
"\n",
"Summary statistics:\n",
" Column_A Column_B Column_C\n",
"count 5.000000 5.000000 5.000000\n",
"mean 57.000000 70.200000 37.600000\n",
"std 32.272279 29.693434 25.530374\n",
"min 3.000000 22.000000 15.000000\n",
"25% 52.000000 61.000000 21.000000\n",
"50% 72.000000 87.000000 24.000000\n",
"75% 75.000000 88.000000 53.000000\n",
"max 83.000000 93.000000 75.000000\n"
]
}
],
"source": [
"# Creating DataFrame from NumPy array\n",
"np.random.seed(42) # For reproducible results\n",
"random_data = np.random.randint(1, 100, size=(5, 3))\n",
"\n",
"df_random = pd.DataFrame(random_data, \n",
" columns=['Column_A', 'Column_B', 'Column_C'],\n",
" index=['Row1', 'Row2', 'Row3', 'Row4', 'Row5'])\n",
"\n",
"print(\"Random DataFrame:\")\n",
"print(df_random)\n",
"print(\"\\nSummary statistics:\")\n",
"print(df_random.describe())"
]
},
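{
"cell_type": "markdown",
"metadata": {},
"source": [
"NumPy also offers a newer `Generator` interface through `np.random.default_rng`, which keeps the random state in an object instead of relying on the global seed. The cell below is a hedged sketch of the same idea; the values it produces will differ from the `np.random.seed(42)` output, and `rng` and `df_rng` are illustrative names."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: random test data with the Generator API\n",
"# (rng and df_rng are illustrative names)\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"rng = np.random.default_rng(42)  # seeded generator object\n",
"\n",
"# integers() excludes the upper bound by default, like randint\n",
"data = rng.integers(1, 100, size=(5, 3))\n",
"\n",
"df_rng = pd.DataFrame(data,\n",
"                      columns=['Column_A', 'Column_B', 'Column_C'],\n",
"                      index=[f'Row{i}' for i in range(1, 6)])\n",
"\n",
"print(df_rng)\n",
"print(f'Shape: {df_rng.shape}')"
]
},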
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Method 4: Creating DataFrame with Custom Index\n",
"\n",
"You can specify custom row labels (index) when creating DataFrames."
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Products DataFrame with Custom Index:\n",
" Product Price Stock\n",
"PROD001 Laptop 1200 15\n",
"PROD002 Phone 800 50\n",
"PROD003 Tablet 600 30\n",
"PROD004 Monitor 300 20\n",
"\n",
"Accessing by index label 'PROD002':\n",
"Product Phone\n",
"Price 800\n",
"Stock 50\n",
"Name: PROD002, dtype: object\n"
]
}
],
"source": [
"# Creating DataFrame with custom index\n",
"product_data = {\n",
" 'Product': ['Laptop', 'Phone', 'Tablet', 'Monitor'],\n",
" 'Price': [1200, 800, 600, 300],\n",
" 'Stock': [15, 50, 30, 20]\n",
"}\n",
"\n",
"# Custom index using product codes\n",
"custom_index = ['PROD001', 'PROD002', 'PROD003', 'PROD004']\n",
"df_products = pd.DataFrame(product_data, index=custom_index)\n",
"\n",
"print(\"Products DataFrame with Custom Index:\")\n",
"print(df_products)\n",
"print(\"\\nAccessing by index label 'PROD002':\")\n",
"print(df_products.loc['PROD002'])"
]
},
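{
"cell_type": "markdown",
"metadata": {},
"source": [
"An index can also be assigned after creation by promoting an existing column with `set_index`. The cell below is a small sketch of that alternative; the `Code` column and the names `df_catalog` and `df_by_code` are assumptions made for illustration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: promote a column to the index after creation\n",
"# (the Code column, df_catalog and df_by_code are illustrative)\n",
"import pandas as pd\n",
"\n",
"df_catalog = pd.DataFrame({\n",
"    'Code': ['PROD001', 'PROD002', 'PROD003'],\n",
"    'Product': ['Laptop', 'Phone', 'Tablet'],\n",
"    'Price': [1200, 800, 600]\n",
"})\n",
"\n",
"# set_index returns a new DataFrame; df_catalog itself is unchanged\n",
"df_by_code = df_catalog.set_index('Code')\n",
"\n",
"print(df_by_code)\n",
"print(df_by_code.loc['PROD002'])"
]
},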
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Method 5: Creating Empty DataFrame and Adding Data\n",
"\n",
"Sometimes you need to start with an empty DataFrame and add data incrementally."
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Empty DataFrame:\n",
"Empty DataFrame\n",
"Columns: [Date, Temperature, Humidity, Pressure]\n",
"Index: []\n",
"Shape: (0, 4)\n",
"\n",
"DataFrame after adding data:\n",
" Date Temperature Humidity Pressure\n",
"0 2024-01-01 22.5 65 1013.2\n",
"1 2024-01-02 24.1 68 1015.1\n",
"2 2024-01-03 21.8 72 1012.8\n"
]
}
],
"source": [
"# Creating empty DataFrame with specified columns\n",
"columns = ['Date', 'Temperature', 'Humidity', 'Pressure']\n",
"df_weather = pd.DataFrame(columns=columns)\n",
"\n",
"print(\"Empty DataFrame:\")\n",
"print(df_weather)\n",
"print(f\"Shape: {df_weather.shape}\")\n",
"\n",
"# Adding data row by row (not recommended for large datasets:\n",
"# each insertion enlarges the DataFrame, which gets slower as it grows)\n",
"weather_data = [\n",
" ['2024-01-01', 22.5, 65, 1013.2],\n",
" ['2024-01-02', 24.1, 68, 1015.1],\n",
" ['2024-01-03', 21.8, 72, 1012.8]\n",
"]\n",
"\n",
"for row in weather_data:\n",
" df_weather.loc[len(df_weather)] = row\n",
"\n",
"print(\"\\nDataFrame after adding data:\")\n",
"print(df_weather)"
]
},
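{
"cell_type": "markdown",
"metadata": {},
"source": [
"For larger datasets, a common alternative to row-by-row insertion is to collect the rows in a plain Python list and build the DataFrame in one step. The cell below is a minimal sketch under that assumption; `weather_rows` and `df_weather_bulk` are illustrative names, and converting `Date` with `pd.to_datetime` is an optional extra step."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: collect rows first, then build the DataFrame once\n",
"# (weather_rows and df_weather_bulk are illustrative names)\n",
"import pandas as pd\n",
"\n",
"columns = ['Date', 'Temperature', 'Humidity', 'Pressure']\n",
"weather_rows = []\n",
"\n",
"for row in [\n",
"    ['2024-01-01', 22.5, 65, 1013.2],\n",
"    ['2024-01-02', 24.1, 68, 1015.1],\n",
"    ['2024-01-03', 21.8, 72, 1012.8],\n",
"]:\n",
"    weather_rows.append(row)  # appending to a list is cheap\n",
"\n",
"# One constructor call instead of one enlargement per row\n",
"df_weather_bulk = pd.DataFrame(weather_rows, columns=columns)\n",
"\n",
"# Optional: parse the date strings into datetime values\n",
"df_weather_bulk['Date'] = pd.to_datetime(df_weather_bulk['Date'])\n",
"\n",
"print(df_weather_bulk)\n",
"print(df_weather_bulk.dtypes)"
]
},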
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Practice Exercises\n",
"\n",
"Try these exercises to reinforce your learning:"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [],
"source": [
"# Exercise 1: Create a DataFrame from a dictionary with employee information\n",
"# Include: Employee ID, Name, Department, Salary, Years of Experience\n",
"\n",
"# Your code here:\n",
"employee_data = {\n",
" # Add your data here\n",
"}\n",
"\n",
"# df_employees = pd.DataFrame(employee_data)\n",
"# print(df_employees)"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [],
"source": [
"# Exercise 2: Create a DataFrame using NumPy with 6 rows and 4 columns\n",
"# Use column names: 'A', 'B', 'C', 'D'\n",
"# Use row indices: 'R1', 'R2', 'R3', 'R4', 'R5', 'R6'\n",
"\n",
"# Your code here:\n"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [],
"source": [
"# Exercise 3: Create a DataFrame with mixed data types\n",
"# Include at least one string, integer, float, and boolean column\n",
"\n",
"# Your code here:\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Key Takeaways\n",
"\n",
"1. The **dictionary method** is the most intuitive way to create DataFrames\n",
"2. **NumPy arrays** are useful for numerical data and testing\n",
"3. **Custom indices** provide meaningful row labels\n",
"4. **Empty DataFrames** can be useful, but avoid adding rows one by one for large datasets; collect the rows first and build the DataFrame once\n",
"5. Always check the **shape** and **data types** of your DataFrame after creation\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.13.3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}