Python Data Analysis Basics: Mastering Pandas Core Functions

Pandas: the Excel of the Python data ecosystem

Why has Python become the dominant language in data science and machine learning? Much of the credit goes to its ecosystem of libraries built for working with data, and Pandas sits at the center of that ecosystem.

Pandas makes it easy to manipulate and analyze tabular data, the kind you would find in an Excel spreadsheet or a relational database (SQL) table, from within Python. A large share of a data analyst's day-to-day work, including preprocessing, cleaning, filtering, and grouping, is commonly done with Pandas.

In this post, we will take a quick look at Pandas' two core data structures and essential data manipulation methods.

Getting started with Pandas

To use Pandas, you must first install and import the library. By convention, it is imported using the alias pd.

# Install (Terminal)
# pip install pandas

# import
import pandas as pd
import numpy as np  # NumPy, the numerical computing library, is usually imported alongside Pandas

1. The heart of Pandas, two core data structures

Pandas provides two special containers (data structures) to store data: Series and DataFrame.

Series

A Series is a one-dimensional array. You can think of it as a single column of an Excel table. It resembles a basic Python list, with one difference: every value carries a label, called the index, that you can use to look the value up.

# Convert list to Pandas Series
data = ['Apple', 'Banana', 'Cherry']
s = pd.Series(data, index=['a', 'b', 'c'])
print(s)

# Output result:
# a     Apple
# b    Banana
# c    Cherry
# dtype: object
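
To make the role of the index concrete, here is a minimal sketch that reuses the fruit Series from above and looks values up by label and by position. The variable names (by_label, by_position, picked) are only illustrative.

```python
import pandas as pd

# Same Series as above: three values labelled 'a', 'b', 'c'
s = pd.Series(['Apple', 'Banana', 'Cherry'], index=['a', 'b', 'c'])

by_label = s['b']        # look up by index label
by_position = s.iloc[1]  # look up by integer position
print(by_label, by_position)  # Banana Banana

# Passing several labels returns a smaller Series
picked = s[['a', 'c']]
print(list(picked.index))  # ['a', 'c']
```

With a plain list you could only use the position; the labels are what set a Series apart.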

DataFrame

A DataFrame is a two-dimensional table made up of rows and columns, and its contents can be modified. You can think of it as several Series, one per column, sharing the same index. This is the structure you will work with most in real data analysis.

# Create DataFrame using dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'London', 'Paris']
}
df = pd.DataFrame(data)
print(df)

# Output result:
#       Name  Age      City
# 0    Alice   25  New York
# 1      Bob   30    London
# 2  Charlie   35     Paris
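
As a rough illustration of the "columns are Series" idea, the sketch below rebuilds the same table, inspects its shape, and adds a derived column. The column name AgeNextYear is a made-up example, not something from a real dataset.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'London', 'Paris'],
})

print(df.shape)                  # (3, 3): 3 rows, 3 columns
print(list(df.columns))          # ['Name', 'Age', 'City']
print(type(df['Age']).__name__)  # Series

# Adding a column is a plain assignment; 'AgeNextYear' is a hypothetical name
df['AgeNextYear'] = df['Age'] + 1
print(df['AgeNextYear'].tolist())  # [26, 31, 36]
```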

2. Load external data

In practice, you rarely type data directly into your code. Instead, you load it from files saved in formats such as CSV or Excel. Pandas provides functions for reading and writing a wide variety of file formats.

# Read CSV file
df_csv = pd.read_csv('sales_data.csv')

# Read Excel file (.xlsx files need an engine such as openpyxl installed)
df_excel = pd.read_excel('report.xlsx')

# Preview the first 5 rows (essential to understand what the data looks like!)
print(df_csv.head())

# Summarize the DataFrame: column data types, non-null counts, memory usage
df_csv.info()
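
If you do not have a CSV file at hand, the following sketch lets you try the same calls without one. It assumes a small, made-up CSV held in a string (the column names date, product, and amount are hypothetical) and uses io.StringIO so that read_csv can treat the string like a file.

```python
import io
import pandas as pd

# Hypothetical CSV content; the last row has an empty 'amount' field
csv_text = """date,product,amount
2024-01-05,Widget,120
2024-01-06,Gadget,80
2024-01-07,Widget,
"""

# parse_dates asks pandas to convert the 'date' column to datetimes
sales = pd.read_csv(io.StringIO(csv_text), parse_dates=['date'])

print(sales.head(2))  # first two rows
print(sales.shape)    # (3, 3)
print(int(sales['amount'].isnull().sum()))  # 1, the empty field was read as NaN

# Writing works the same way in reverse; index=False omits the row labels
csv_out = sales.to_csv(index=False)
```

Calling to_csv with a file path instead of no argument would write the result to disk.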

3. Core fundamentals of data selection and filtering

A core Pandas skill is pulling out just the information you want from a large DataFrame.

Select a column

# Select only the 'Name' column (returns a Series)
names = df['Name']

# To select multiple columns at once, pass them as a list (returns a DataFrame)
subset = df[['Name', 'City']]

Filtering rows that meet conditions (Boolean Indexing)

Boolean indexing is similar to the filter function in Excel. It extracts only the rows that satisfy a given condition.

# Keep only people aged 30 or older
over_30 = df[df['Age'] >= 30]

# When combining multiple conditions (AND: &, OR: |)
# Caution: You must use parentheses () for each conditional expression.
condition_df = df[(df['Age'] >= 30) & (df['City'] == 'London')]
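
The filters above can be extended in a few directions. The sketch below, which rebuilds the same small table so it runs on its own, shows the True/False mask behind the filter, an OR condition, and negation with ~ combined with isin(). The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'London', 'Paris'],
})

# The comparison itself produces a Series of True/False values (the "mask")
mask = df['Age'] >= 30
print(mask.tolist())  # [False, True, True]

# OR: younger than 30, or living in Paris
either = df[(df['Age'] < 30) | (df['City'] == 'Paris')]
print(either['Name'].tolist())  # ['Alice', 'Charlie']

# NOT (~) combined with isin(): everyone outside the listed cities
not_listed = df[~df['City'].isin(['London', 'Paris'])]
print(not_listed['Name'].tolist())  # ['Alice']
```

The parentheses matter because in Python & and | bind more tightly than comparison operators such as >= and ==.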

Select data from specific location (loc, iloc)

  • loc[]: Select based on the index 'label (name)'.
  • iloc[]: Select based on the 'position (integer number)'.

# Get all data in the row whose index label is 0 (label-based)
row_0 = df.loc[0]

# Get the value at row position 0, column position 1 (position-based)
val = df.iloc[0, 1]
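
With the default 0, 1, 2 index, labels and positions happen to coincide, which hides the difference between the two. The sketch below assumes a DataFrame with string row labels (the IDs u1, u2, u3 are made up) to show where loc and iloc diverge.

```python
import pandas as pd

# Same people as above, but with hypothetical string row labels
people = pd.DataFrame(
    {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]},
    index=['u1', 'u2', 'u3'],
)

print(people.loc['u2', 'Name'])  # Bob  (row label, column label)
print(people.iloc[1, 0])         # Bob  (row position, column position)

# Slices differ: loc includes the end label, iloc excludes the end position
by_label = people.loc['u1':'u2']
by_position = people.iloc[0:1]
print(len(by_label), len(by_position))  # 2 1

# loc also accepts a boolean mask together with a list of columns
aged_30_plus = people.loc[people['Age'] >= 30, ['Name']]
print(aged_30_plus['Name'].tolist())  # ['Bob', 'Charlie']
```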

4. Handling missing data

Real-world data often has missing entries (shown as NaN, "Not a Number") or is messy and full of invalid values. Handling missing values is an essential part of the data cleaning (preprocessing) stage before analysis begins.

# Count the missing values in each column
print(df.isnull().sum())

# Delete rows containing at least one missing value
df_dropped = df.dropna()

# Fill missing values with another value (e.g. 0 or the column mean)
# Passing a dict fills only the 'Age' column, so text columns are left untouched
mean_age = df['Age'].mean()
df_filled = df.fillna({'Age': mean_age})
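
The example table used so far has no gaps, so these calls do not change anything when run on it. Here is a minimal sketch with a few values deliberately left missing; the data, including the extra person 'Dana', is invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps: np.nan marks a missing number, None a missing string
raw = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Dana'],
    'Age': [25, np.nan, 35, 30],
    'City': ['New York', 'London', None, 'Paris'],
})

missing_per_column = raw.isnull().sum()
print(int(missing_per_column['Age']))  # 1

# dropna() removes every row that has at least one missing value
complete = raw.dropna()
print(complete['Name'].tolist())  # ['Alice', 'Dana']

# Fill each column with a value that suits its type
filled = raw.fillna({'Age': raw['Age'].mean(), 'City': 'Unknown'})
print(filled['Age'].tolist())   # [25.0, 30.0, 35.0, 30.0]
print(filled['City'].tolist())  # ['New York', 'London', 'Unknown', 'Paris']
```

Whether to drop or fill depends on the analysis: dropping loses rows, while filling keeps them at the cost of introducing estimated values.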

Next steps

So far, we have covered only the basics of Pandas. Beyond these, Pandas has extensive built-in support for merging data (merge, concat), grouping and aggregation (groupby), and time series processing.
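
As a small preview only, and not a full treatment, the sketch below touches groupby, merge, and concat on two tiny invented tables. All table and column names here are hypothetical.

```python
import pandas as pd

# Hypothetical sales and price tables
sales = pd.DataFrame({
    'product': ['Widget', 'Gadget', 'Widget', 'Gadget'],
    'units': [3, 1, 2, 5],
})
prices = pd.DataFrame({
    'product': ['Widget', 'Gadget'],
    'unit_price': [10, 25],
})

# GroupBy: total units per product
units_per_product = sales.groupby('product')['units'].sum()
print(int(units_per_product['Widget']), int(units_per_product['Gadget']))  # 5 6

# Merge: attach the matching price to every sale, then compute revenue
merged = sales.merge(prices, on='product', how='left')
merged['revenue'] = merged['units'] * merged['unit_price']
print(merged['revenue'].tolist())  # [30, 25, 20, 125]

# Concat: stack two tables that share the same columns
stacked = pd.concat([sales, sales], ignore_index=True)
print(len(stacked))  # 8
```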

Rather than trying to memorize every function up front, download a CSV dataset that interests you from Kaggle or a public data portal, load it yourself, and experiment with it. There is no better way to learn than working with real data and working through the error messages you run into.
