
Python Data Analysis Basics: Mastering Pandas Core Functions
- Development, Data Science
- 20 Jun, 2024
Pandas: the Excel of the Python data ecosystem
Why has Python become the dominant language in data science and machine learning? Largely because of its excellent ecosystem of libraries specialized in handling data, and Pandas sits at the center of that ecosystem.
Pandas lets you manipulate and analyze tabular data, much like a spreadsheet in Excel or a table in a relational (SQL) database, directly in Python. It is no exaggeration to say that more than 80% of a data analyst's work, including preprocessing, cleaning, filtering, and grouping large datasets, is done with Pandas.
In this post, we will take a quick look at Pandas' two core data structures and essential data manipulation methods.
Getting started with Pandas
To use Pandas, you must first install and import the library. By convention, it is imported using the alias pd.
# Install (Terminal)
# pip install pandas
# import
import pandas as pd
import numpy as np # NumPy, the numerical computing library, is usually imported alongside Pandas.
1. The heart of Pandas, two core data structures
Pandas provides two special containers (data structures) to store data: Series and DataFrame.
Series
A Series is a one-dimensional array. It is easiest to think of it as ‘one column’ of a table in Excel. It resembles Python's built-in list, except that every value carries a label, called the Index, that you can use to look it up.
# Convert list to Pandas Series
data = ['Apple', 'Banana', 'Cherry']
s = pd.Series(data, index=['a', 'b', 'c'])
print(s)
# Output result:
# a     Apple
# b    Banana
# c    Cherry
# dtype: object
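As a minimal sketch of what the Index adds over a plain list, the snippet below reads the same value by label and by integer position. The variable names (`by_label`, `by_position`, `s_default`) are only illustrative.

```python
import pandas as pd

# Same Series as above: three values labelled 'a', 'b', 'c'
s = pd.Series(['Apple', 'Banana', 'Cherry'], index=['a', 'b', 'c'])

by_label = s['b']         # look up by index label
by_position = s.iloc[1]   # look up by integer position (0-based)
print(by_label, by_position)

# Without an explicit index, Pandas assigns the default labels 0, 1, 2, ...
s_default = pd.Series(['Apple', 'Banana', 'Cherry'])
print(list(s_default.index))
```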
DataFrame
A DataFrame is a two-dimensional table made up of rows and columns. You can think of it as several Series lined up side by side, each one forming a column. This is the structure you will work with most in real data analysis.
# Create DataFrame using dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'London', 'Paris']
}
df = pd.DataFrame(data)
print(df)
# Output result:
#       Name  Age      City
# 0    Alice   25  New York
# 1      Bob   30    London
# 2  Charlie   35     Paris
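To make the "columns are Series" idea concrete, here is a small sketch that pulls one column out of the DataFrame and adds a derived one. The column name `AgeNextYear` is hypothetical and used only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'London', 'Paris'],
})

# Each column of a DataFrame is itself a Series
age_column = df['Age']
print(type(age_column).__name__)   # Series
print(df.shape)                    # (3, 3): 3 rows, 3 columns

# Arithmetic on a column applies to every row at once;
# assigning to a new name adds a column (hypothetical 'AgeNextYear')
df['AgeNextYear'] = df['Age'] + 1
print(df)
```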
2. Load external data
In real projects you rarely type data directly into the code. Instead, you load it from files saved in CSV or Excel format. Pandas provides powerful functions for reading and writing files in a variety of formats.
# Read CSV file
df_csv = pd.read_csv('sales_data.csv')
# Read Excel file (reading .xlsx files requires the optional openpyxl package)
df_excel = pd.read_excel('report.xlsx')
# Preview the first 5 rows (essential to understand what the data looks like!)
print(df_csv.head())
# Summary of overall information of the data frame (missing values, data type, etc.)
df_csv.info()
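If you do not have a CSV file at hand, the same calls can be tried on in-memory text. The sketch below is only an illustration: the column names and values are made up, and `io.StringIO` stands in for a file such as sales_data.csv.

```python
import io
import pandas as pd

# Hypothetical CSV content; note the empty 'units' value in the last row
csv_text = """date,product,units,price
2024-01-05,Widget,3,9.99
2024-01-06,Gadget,1,24.50
2024-01-06,Widget,,9.99
"""

# read_csv accepts a path or any file-like object;
# parse_dates converts the named column to datetime values
df_sales = pd.read_csv(io.StringIO(csv_text), parse_dates=['date'])

print(df_sales.head())
df_sales.info()

# Writing works the same way; index=False leaves out the row labels
out = io.StringIO()
df_sales.to_csv(out, index=False)
```

The empty field is read as NaN, which is why `info()` reports one fewer non-null value for that column.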
3. Core fundamentals of data selection and filtering
At the core of Pandas is indexing: pulling out just the information you want from a large DataFrame.
Select a column
# Select only the 'Name' column (results are returned in Series format)
names = df['Name']
# When selecting multiple columns at the same time, group them into a list.
subset = df[['Name', 'City']]
Filtering rows that meet conditions (Boolean Indexing)
This works like the filter feature in Excel: it extracts only the rows that satisfy a given condition.
# Keep only people aged 30 or older
over_30 = df[df['Age'] >= 30]
# When combining multiple conditions (AND: &, OR: |)
# Caution: You must use parentheses () for each conditional expression.
condition_df = df[(df['Age'] >= 30) & (df['City'] == 'London')]
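It can help to look at the condition on its own. The sketch below shows that a comparison produces a Series of True/False values (the mask), and then combines masks with OR, NOT, and `isin`. The parentheses matter because Python's `&` and `|` bind more tightly than comparison operators such as `>=`.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'London', 'Paris'],
})

# The condition alone is a boolean Series, one value per row
mask = df['Age'] >= 30
print(mask.tolist())   # [False, True, True]

# OR: under 30, or living in Paris
either = df[(df['Age'] < 30) | (df['City'] == 'Paris')]

# NOT: everyone who does not live in London
not_london = df[~(df['City'] == 'London')]

# isin: shorthand for several == checks joined with OR
in_cities = df[df['City'].isin(['London', 'Paris'])]
```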
Select data from specific location (loc, iloc)
loc[]: Select based on the ‘label (name)’ of the index.
iloc[]: Select based on the ‘position (integer number)’ of the index.
# Get the whole row whose index label is 0 (label-based)
row_0 = df.loc[0]
# Get the value at row position 0, column position 1 (position-based)
val = df.iloc[0, 1]
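With the default index 0, 1, 2, labels and positions happen to coincide, so the difference is easy to miss. The following sketch uses string labels (hypothetical, chosen for illustration) to separate the two, and shows that a `loc` slice includes its end label while an `iloc` slice excludes its end position.

```python
import pandas as pd

df = pd.DataFrame(
    {'Age': [25, 30, 35], 'City': ['New York', 'London', 'Paris']},
    index=['alice', 'bob', 'charlie'],   # hypothetical string labels
)

age_by_label = df.loc['bob', 'Age']   # row label, column label
age_by_position = df.iloc[1, 0]       # row position, column position

# Slicing differs too:
label_slice = df.loc['alice':'bob']   # end label included: 2 rows
position_slice = df.iloc[0:1]         # end position excluded: 1 row

print(age_by_label, age_by_position, len(label_slice), len(position_slice))
```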
4. Handling missing data
Real-world data often has empty cells (shown as NaN, "Not a Number") or is messy and full of invalid values. Handling missing values during the data cleaning (preprocessing) stage, before any analysis begins, is essential.
# Check if there are any missing values
print(df.isnull().sum())
# Delete rows containing at least one missing value
df_dropped = df.dropna()
# Fill missing values with another value (e.g. 0 or the mean value of a column)
# Passing a dict limits the fill to the named column, so text columns are left untouched
mean_age = df['Age'].mean()
df_filled = df.fillna({'Age': mean_age})
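The example DataFrame used so far has no gaps, so these calls have nothing to do. Here is a minimal sketch on a copy of the data with two values deliberately removed. The placeholder `'Unknown'` is an assumption chosen for illustration.

```python
import numpy as np
import pandas as pd

# Same people as before, with two values removed on purpose
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, np.nan, 35],
    'City': ['New York', 'London', None],
})

# Number of missing values in each column
missing_counts = df.isnull().sum()
print(missing_counts)

# Drop every row that has at least one missing value (only Alice remains)
df_dropped = df.dropna()

# Fill each column with a value that suits it: the mean for 'Age',
# a placeholder string for 'City'
df_filled = df.fillna({'Age': df['Age'].mean(), 'City': 'Unknown'})
print(df_filled)
```

Note that `mean()` skips NaN by default, so the fill value here is the mean of 25 and 35.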
Next steps
So far, we have looked only at the most basic structures and functions of Pandas. Beyond these, Pandas has extensive and powerful built-in features such as data merging (Merge, Concat), grouping and aggregation (GroupBy), and time series data processing.
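As a small taste of two of those features, the sketch below groups rows by city and merges in a second table. Both tables are made up for illustration, including the extra person 'Dana' and the `countries` lookup.

```python
import pandas as pd

# Hypothetical data: the earlier people plus one more Londoner
people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Dana'],
    'Age': [25, 30, 35, 40],
    'City': ['New York', 'London', 'Paris', 'London'],
})
# Hypothetical lookup table mapping each city to a country
countries = pd.DataFrame({
    'City': ['New York', 'London', 'Paris'],
    'Country': ['USA', 'UK', 'France'],
})

# GroupBy: average age per city
mean_age_by_city = people.groupby('City')['Age'].mean()
print(mean_age_by_city)

# Merge: attach the country to each person, like a SQL LEFT JOIN on 'City'
merged = people.merge(countries, on='City', how='left')
print(merged)
```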
Rather than trying to memorize every function up front, download a CSV dataset that interests you from Kaggle or a public data portal, load it yourself, and play around with it. There is no better way to learn than working with real data and wrestling with the error messages that come up.







