Getting Started

Series and DataFrames

Introduction

pandas is a core library for data manipulation in Python. It provides two main data structures, Series and DataFrame, which most data handling and analysis work is built on. In this post, we’ll introduce both structures and walk through the most common operations for exploring and manipulating data.

1. Introduction to Series

A Series is a one-dimensional labeled array, capable of holding any data type (integers, strings, floats, etc.).

Creating a Series

You can create a Series from a list or array:

import pandas as pd

# Creating a Series from a list
data = [1, 2, 3, 4, 5]
series = pd.Series(data)
print(series)
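
The same constructor also accepts a NumPy array or a dictionary. The following is a minimal sketch; the variable names and sample values are made up for illustration.

```python
import numpy as np
import pandas as pd

# From a NumPy array: the Series keeps the array's dtype
arr = np.array([1.5, 2.5, 3.5])
float_series = pd.Series(arr)
print(float_series.dtype)  # float64

# From a dictionary: the keys become the index labels
prices = pd.Series({'apple': 3, 'banana': 1, 'cherry': 7})
print(prices.index.tolist())  # ['apple', 'banana', 'cherry']
```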

Setting Custom Index

You can assign custom labels to the data:

index = ['a', 'b', 'c', 'd', 'e']
series = pd.Series(data, index=index)
print(series)
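
To see what the custom index actually changes, it can help to inspect the labels and the values separately. This short sketch rebuilds the Series so it runs on its own; the `mismatch_error` variable is only there for illustration.

```python
import pandas as pd

series = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

# The labels and the values are stored separately
print(series.index.tolist())  # ['a', 'b', 'c', 'd', 'e']
print(series.tolist())        # [1, 2, 3, 4, 5]

# The index must have one label per value; a mismatch raises ValueError
mismatch_error = None
try:
    pd.Series([1, 2, 3], index=['x', 'y'])
except ValueError as exc:
    mismatch_error = exc
print(mismatch_error)
```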

Accessing Elements

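You can access elements in a Series by label or by integer position. For positions, use .iloc[]: on a Series with non-integer labels, plain series[0] only worked as a positional fallback in older pandas versions and is deprecated in recent releases.

# Access by label
print(series['a'])

# Access by position
print(series.iloc[0])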
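
The explicit accessors .loc[] (labels) and .iloc[] (positions) also support slicing, and the two differ in how they treat the end of the slice. A small sketch, using the same sample Series rebuilt here so the block is self-contained:

```python
import pandas as pd

series = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

by_label = series.loc['b']      # value stored under label 'b'
by_position = series.iloc[1]    # value at position 1 (the second element)

# Label slices include the end label; position slices exclude the end position
label_slice = series.loc['b':'d']
position_slice = series.iloc[1:3]
print(label_slice.tolist())     # [2, 3, 4]
print(position_slice.tolist())  # [2, 3]
```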

2. Introduction to DataFrames

A DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns).

Creating a DataFrame

You can create a DataFrame from dictionaries, lists, or NumPy arrays:

# Creating a DataFrame from a dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)
print(df)

Creating a DataFrame from a List of Lists

data = [['Alice', 25, 'New York'], ['Bob', 30, 'Los Angeles'], ['Charlie', 35, 'Chicago']]
df = pd.DataFrame(data, columns=['Name', 'Age', 'City'])
print(df)
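
NumPy arrays were mentioned above as a third source, so here is a brief sketch of that case. The array contents and the column names 'x' and 'y' are placeholders chosen for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric data: 3 rows, 2 columns
values = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

# Each array row becomes a DataFrame row; column names are supplied separately
df_from_array = pd.DataFrame(values, columns=['x', 'y'])
print(df_from_array.shape)  # (3, 2)
print(df_from_array)
```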

3. Common Operations on Series and DataFrames

Accessing Data

By Column: Selecting a single column with square brackets returns it as a Series:
```python
print(df['Name'])
```
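
Passing a list of column names instead of a single name returns a DataFrame rather than a Series. A minimal sketch, rebuilding the sample data so it runs on its own:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
})

name_column = df['Name']       # single label -> Series
subset = df[['Name', 'Age']]   # list of labels -> DataFrame
print(type(name_column).__name__)  # Series
print(type(subset).__name__)       # DataFrame
```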

By Row: Use .loc[] for label-based indexing or .iloc[] for position-based indexing:

print(df.loc[0])  # First row by label
print(df.iloc[0])  # First row by position
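
With the default index, labels and positions happen to be the same numbers, so the two calls above return the same row. The difference shows up once the index holds something else. The sketch below assumes we use the Name column as the index, which is a choice made for this example only.

```python
import pandas as pd

people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}).set_index('Name')

row_by_label = people.loc['Bob']    # row whose index label is 'Bob'
row_by_position = people.iloc[1]    # second row, whatever its label
print(row_by_label)

# .loc also takes a row label and a column label to pick a single cell
age_of_bob = people.loc['Bob', 'Age']
print(age_of_bob)  # 30
```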

Filtering Data

You can filter data based on conditions:

# Filter by Age > 30
filtered_df = df[df['Age'] > 30]
print(filtered_df)
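
Conditions can be combined with & (and) and | (or). Each condition needs its own parentheses because these operators bind more tightly than comparisons. A small sketch on the same sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
})

# Both conditions must hold
both = df[(df['Age'] > 25) & (df['City'] == 'Chicago')]
print(both['Name'].tolist())    # ['Charlie']

# At least one condition must hold
either = df[(df['Age'] < 30) | (df['City'] == 'Chicago')]
print(either['Name'].tolist())  # ['Alice', 'Charlie']
```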

Sorting Data

Sort the data by a specific column:

# Sorting by Age
sorted_df = df.sort_values(by='Age')
print(sorted_df)
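
sort_values() sorts in ascending order by default and returns a new DataFrame. The sketch below shows descending order and a second column used as a tie-breaker; the extra row for 'Dana' is hypothetical and was added only to create a tie on Age.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Dana'],
    'Age': [25, 30, 35, 30],
})

# Oldest first
oldest_first = df.sort_values(by='Age', ascending=False)
print(oldest_first)

# Age descending, then Name ascending to break ties
by_age_then_name = df.sort_values(by=['Age', 'Name'], ascending=[False, True])
print(by_age_then_name['Name'].tolist())  # ['Charlie', 'Bob', 'Dana', 'Alice']

# The original DataFrame keeps its row order
print(df['Name'].tolist())  # ['Alice', 'Bob', 'Charlie', 'Dana']
```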

4. Grouping Data

Grouping splits the rows into groups that share a key value so that each group can be aggregated separately. The groupby() method handles the splitting.

Grouping by a Column

grouped = df.groupby('City')
print(grouped['Age'].mean())  # Find average age per city
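
In the sample data every city appears only once, so each group holds a single row. Grouping becomes more interesting when keys repeat. The data below is hypothetical, made up so that two people share each city.

```python
import pandas as pd

people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Dana'],
    'Age': [25, 30, 35, 45],
    'City': ['New York', 'Chicago', 'Chicago', 'New York'],
})

# One result row per distinct city
mean_age = people.groupby('City')['Age'].mean()
print(mean_age)  # Chicago 32.5, New York 35.0

# Number of rows in each group
group_sizes = people.groupby('City').size()
print(group_sizes)
```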

Multiple Aggregations

Pass a dictionary to agg() to apply different aggregation functions to each column:

grouped = df.groupby('City').agg({'Age': ['mean', 'max'], 'Name': 'count'})
print(grouped)
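
The dictionary form produces two-level column labels such as ('Age', 'mean'). If flat column names are easier to work with, named aggregation is one alternative. This is a sketch on hypothetical data; the output names mean_age, max_age, and headcount are our own choices.

```python
import pandas as pd

people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Dana'],
    'Age': [25, 30, 35, 45],
    'City': ['New York', 'Chicago', 'Chicago', 'New York'],
})

# Each keyword is an output column: (source column, aggregation function)
summary = people.groupby('City').agg(
    mean_age=('Age', 'mean'),
    max_age=('Age', 'max'),
    headcount=('Name', 'count'),
)
print(summary)
```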

5. Handling Missing Data

Missing data is common in real-world datasets. pandas provides several methods to handle it.

Detecting Missing Data

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', None],
    'Age': [25, None, 35],
})
print(df.isnull())  # Detect missing values
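
A table of True/False values is hard to read on larger datasets, so it is common to summarize it. The sketch below counts missing values per column and selects the rows that contain at least one; isna() is an alias of isnull().

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', None],
    'Age': [25, None, 35],
})

# True counts as 1, so summing gives the number of missing values per column
missing_per_column = df.isna().sum()
print(missing_per_column)

# Rows where any column is missing
rows_with_missing = df[df.isna().any(axis=1)]
print(rows_with_missing)
```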

Filling Missing Data

You can fill missing values with a fixed value or with a computed statistic such as the column mean:

df['Age'] = df['Age'].fillna(df['Age'].mean())  # Fill with the mean of the column
print(df)
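
fillna() also accepts a dictionary that maps column names to fill values, which lets us treat each column differently in one call. A minimal sketch; the placeholder 'Unknown' is an arbitrary choice for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', None],
    'Age': [25, None, 35],
})

# A different fill value per column; the result is a new DataFrame
filled = df.fillna({'Name': 'Unknown', 'Age': df['Age'].mean()})
print(filled)
```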

Dropping Missing Data

Alternatively, you can drop rows with missing values:

df = df.dropna()  # Drop rows with missing values
print(df)
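
By default dropna() removes every row that has at least one missing value. The subset and how parameters narrow that down. The sketch below starts again from the unfilled sample data so the differences are visible.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', None],
    'Age': [25, None, 35],
})

complete_rows = df.dropna()                 # any missing value drops the row
named_rows = df.dropna(subset=['Name'])     # only look at the Name column
not_all_missing = df.dropna(how='all')      # drop only rows where everything is missing

print(complete_rows.index.tolist())    # [0]
print(named_rows.index.tolist())       # [0, 1]
print(not_all_missing.index.tolist())  # [0, 1, 2]
```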

6. Practical Example: Exploring and Manipulating Data

Let’s work with a sample dataset to see how pandas can be used in practice.

Problem: A dataset contains information about employees, including their name, age, department, and salary. We will explore and manipulate this data.

import pandas as pd

# Sample DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [25, 30, 35, 40, 45],
    'Department': ['HR', 'IT', 'IT', 'Sales', 'HR'],
    'Salary': [50000, 60000, 70000, 80000, 90000]
}
df = pd.DataFrame(data)

# Filtering data
it_department = df[df['Department'] == 'IT']
print(it_department)

# Sorting by Salary
sorted_by_salary = df.sort_values(by='Salary', ascending=False)
print(sorted_by_salary)

# Grouping by Department and calculating average salary
avg_salary_by_dept = df.groupby('Department')['Salary'].mean()
print(avg_salary_by_dept)
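
These steps can also be chained. As one possible follow-up on the same employee data, the sketch below filters and sorts in a single expression and then picks the department with the highest average salary. The variable names are ours.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [25, 30, 35, 40, 45],
    'Department': ['HR', 'IT', 'IT', 'Sales', 'HR'],
    'Salary': [50000, 60000, 70000, 80000, 90000]
})

# Employees older than 30, highest salary first
over_30 = df[df['Age'] > 30].sort_values(by='Salary', ascending=False)
print(over_30['Name'].tolist())  # ['Eva', 'David', 'Charlie']

# Label of the department with the highest average salary
avg_salary_by_dept = df.groupby('Department')['Salary'].mean()
top_department = avg_salary_by_dept.idxmax()
print(top_department)  # Sales
```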

Conclusion

In this post, we’ve covered the fundamentals of using pandas to manipulate and analyze data. You now know how to create Series and DataFrames; filter, sort, and group data; and handle missing values. These are essential skills for working with data in Python.