Getting Started
Series and DataFrames
Introduction
The pandas library is the backbone of data manipulation in Python. It provides two main structures, Series and DataFrames, which are essential for handling and analyzing data. In this post, we’ll introduce both structures and walk through the most common operations for exploring and manipulating data.
1. Introduction to Series
A Series is a one-dimensional labeled array, capable of holding any data type (integers, strings, floats, etc.).
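To make "labeled" concrete, here is a minimal sketch that builds a Series from a dictionary, so the keys become the labels. The variable name and the temperature values are made up for illustration.

```python
import pandas as pd

# Hypothetical data: dictionary keys become the index labels
temps = pd.Series({"mon": 21.5, "tue": 19.0, "wed": 23.1})

print(temps.index.tolist())  # the labels: ['mon', 'tue', 'wed']
print(temps.dtype)           # float64, inferred from the values
print(temps["tue"])          # 19.0, looked up by label
```

Every value is stored alongside its label, which is what separates a Series from a plain list.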
Creating a Series
You can create a Series from a list or array:
```python
import pandas as pd

# Creating a Series from a list
data = [1, 2, 3, 4, 5]
series = pd.Series(data)
print(series)
```
Setting Custom Index
You can assign custom labels to the data:
```python
index = ['a', 'b', 'c', 'd', 'e']
series = pd.Series(data, index=index)
print(series)
```
Accessing Elements
You can access elements in a Series using labels or integer positions:
```python
# Access by label
print(series['a'])
# Access by position (.iloc is explicit; series[0] on a labeled index is deprecated)
print(series.iloc[0])
```
2. Introduction to DataFrames
A DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns).
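As a rough illustration of "heterogeneous" and "size-mutable", the sketch below builds a small DataFrame whose columns have different dtypes, then adds and removes a column. The frame name, column names, and values are hypothetical.

```python
import pandas as pd

# Hypothetical data: each column can hold a different type
df_demo = pd.DataFrame({
    "name": ["Ann", "Ben"],
    "score": [88.5, 92.0],
    "passed": [True, True],
})
print(df_demo.dtypes)  # one dtype per column

# Size-mutable: add a column, then drop it again
df_demo["rank"] = [2, 1]
print(df_demo.shape)   # (2, 4)
df_demo = df_demo.drop(columns="rank")
print(df_demo.shape)   # (2, 3)
```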
Creating a DataFrame
You can create a DataFrame from dictionaries, lists, or NumPy arrays:
```python
# Creating a DataFrame from a dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)
print(df)
```
Creating a DataFrame from a List of Lists
```python
data = [['Alice', 25, 'New York'], ['Bob', 30, 'Los Angeles'], ['Charlie', 35, 'Chicago']]
df = pd.DataFrame(data, columns=['Name', 'Age', 'City'])
print(df)
```
3. Common Operations on Series and DataFrames
Accessing Data
By Column: You can access columns of a DataFrame as if they were Series:
```python
print(df['Name'])
```
By Row: Use .loc[] for label-based indexing or .iloc[] for position-based indexing:
```python
print(df.loc[0])   # First row by label
print(df.iloc[0])  # First row by position
```
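With the default index, label 0 and position 0 happen to be the same row, so the two accessors look interchangeable. The sketch below uses a hypothetical frame with custom integer labels (10, 20, 30) to show where they differ.

```python
import pandas as pd

# Hypothetical frame whose row labels are not 0, 1, 2
people = pd.DataFrame(
    {"Name": ["Alice", "Bob", "Charlie"], "Age": [25, 30, 35]},
    index=[10, 20, 30],
)

by_label = people.loc[10]     # the row whose label is 10
by_position = people.iloc[0]  # the first row, whatever its label
print(by_label["Name"], by_position["Name"])

# No row carries the label 0 here, so .loc[0] raises KeyError
try:
    people.loc[0]
except KeyError:
    print("no row labeled 0")
```

Once the index has been changed, for example after filtering or sorting, .iloc is the safer choice when you mean "the n-th row".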
Filtering Data
You can filter data based on conditions:
```python
# Filter by Age > 30
filtered_df = df[df['Age'] > 30]
print(filtered_df)
```
Sorting Data
Sort the data by a specific column:
```python
# Sorting by Age
sorted_df = df.sort_values(by='Age')
print(sorted_df)
```
4. Grouping Data
Grouping data is useful for aggregation. The groupby() method is a powerful tool for this.
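Grouping is easiest to see when a key repeats. As a small sketch on made-up order data (the frame and column names are assumptions for illustration), groupby() splits the rows by key, applies the aggregation to each group, and combines the results into one Series:

```python
import pandas as pd

# Hypothetical orders: each city appears more than once
orders = pd.DataFrame({
    "city": ["Chicago", "Boston", "Chicago", "Boston", "Chicago"],
    "amount": [10, 20, 30, 40, 50],
})

# Split by city, sum each group's amounts, combine into one Series
totals = orders.groupby("city")["amount"].sum()
print(totals)

# Iterating over the groups shows the "split" step directly
for city, group in orders.groupby("city"):
    print(city, len(group))
```

The result is indexed by the group key, so totals["Boston"] gives 60 and totals["Chicago"] gives 90.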
Grouping by a Column
```python
grouped = df.groupby('City')
print(grouped['Age'].mean())  # Find average age per city
```
Multiple Aggregations
You can apply multiple aggregation functions:
```python
grouped = df.groupby('City').agg({'Age': ['mean', 'max'], 'Name': 'count'})
print(grouped)
```
5. Handling Missing Data
Missing data is common in real-world datasets. pandas provides several methods to handle it.
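As a quick orientation before the individual methods, the sketch below counts missing values per column on a hypothetical frame of sensor readings (the names and values are invented). It also shows that aggregations such as mean() skip missing values by default.

```python
import numpy as np
import pandas as pd

# Hypothetical readings with gaps in both columns
readings = pd.DataFrame({
    "sensor": ["a", "b", None, "d"],
    "value": [1.0, np.nan, 3.0, np.nan],
})

# isna() marks missing cells; summing the booleans counts them per column
missing_per_column = readings.isna().sum()
print(missing_per_column)

# mean() ignores NaN by default, so this averages 1.0 and 3.0
print(readings["value"].mean())  # 2.0
```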
Detecting Missing Data
```python
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', None],
    'Age': [25, None, 35],
})
print(df.isnull())  # Detect missing values
```
Filling Missing Data
You can fill missing values with a fixed value or with a computed statistic, such as the column mean:
```python
df['Age'] = df['Age'].fillna(df['Age'].mean())  # Fill with the mean of the column
print(df)
```
Dropping Missing Data
Alternatively, you can drop rows with missing values:
```python
df = df.dropna()  # Drop rows with missing values
print(df)
```
6. Practical Example: Exploring and Manipulating Data
Let’s work with a sample dataset to see how pandas can be used in practice.
Problem: A dataset contains information about employees, including their name, age, department, and salary. We will explore and manipulate this data.
```python
import pandas as pd

# Sample DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [25, 30, 35, 40, 45],
    'Department': ['HR', 'IT', 'IT', 'Sales', 'HR'],
    'Salary': [50000, 60000, 70000, 80000, 90000]
}
df = pd.DataFrame(data)

# Filtering data
it_department = df[df['Department'] == 'IT']
print(it_department)

# Sorting by Salary
sorted_by_salary = df.sort_values(by='Salary', ascending=False)
print(sorted_by_salary)

# Grouping by Department and calculating average salary
avg_salary_by_dept = df.groupby('Department')['Salary'].mean()
print(avg_salary_by_dept)
```
Conclusion
In this post, we’ve learned the fundamentals of using pandas to manipulate and analyze data. You now know how to create Series and DataFrames, filter, sort, group data, and handle missing values. These are essential skills for working with data in Python.