Data Analysis with Pandas
The de-facto library for data manipulation and analysis in Python
Introduction to Pandas
Pandas is a powerful and flexible Python library designed for data manipulation and analysis. It provides data structures and functions needed to work with structured data seamlessly. For financial bankers, Pandas offers a way to perform complex data analysis tasks efficiently, making it an invaluable tool for exploring and interpreting financial data.
Pandas introduces two primary data structures that are central to data analysis:
- Series:
- Description: A one-dimensional array-like object that can hold various data types (e.g., integers, strings, floats). It is similar to a column in a spreadsheet.
- Usage Example: A Series can represent stock prices over time.
- DataFrame:
- Description: A two-dimensional tabular data structure that can hold multiple data types (e.g., Series, lists, dictionaries). It is similar to a spreadsheet or SQL table.
- Usage Example: A DataFrame can represent a financial dataset with multiple columns (e.g., stock prices, trading volume, market capitalization).
Core Functions and Methods
Pandas provides a range of functions and methods to facilitate data analysis:
-
Data Loading:
-
Read CSV: Import data from CSV files, which are commonly used in financial reporting.
-
Read Excel: Load data from Excel spreadsheets.
-
-
Data Exploration:
-
View Data: Quickly view the first few rows of a DataFrame to understand its structure.
-
Summary Statistics: Get a quick summary of numerical data.
-
-
Data Cleaning:
-
Handling Missing Values: Identify and handle missing data, which is crucial for accurate analysis.
-
Data Transformation: Modify data, such as converting data types or normalizing values.
-
-
Data Analysis:
-
Filtering Data: Extract subsets of data based on conditions.
-
Grouping and Aggregation: Summarize data by grouping and calculating statistics.
-
-
Data Visualization:
-
Basic Plotting: Create simple plots directly from a DataFrame to visualize trends and patterns.
-
Advantages of Using Pandas for Data Analysis
-
Efficiency: Pandas can handle large datasets efficiently, which is essential for financial data analysis involving extensive historical records or real-time data streams.
-
Flexibility: It supports various data formats and allows for complex manipulations, making it suitable for diverse financial data needs.
-
Integration: Pandas integrates well with other Python libraries such as NumPy for numerical operations, Matplotlib for plotting, and Scikit-learn for machine learning, creating a robust data analysis ecosystem.
-
Data Handling: It provides powerful tools for cleaning, transforming, and analyzing data, helping to ensure that your analysis is accurate and meaningful.
Using Pandas
to measure the growth of IDX
The following is code you’ve seen (and hopefully executed) in the quickstart section. It reads a CSV file containing the historical market capitalization of the Indonesian Stock Exchange (IDX) and plots the market capitalization over time.
What you see here is us creating a DataFrame
object from a CSV file hosted remotely on GitHub.
We then plot the market capitalization data using the plot()
method from Pandas,
and display the plot using plt.show()
from Matplotlib. With pandas
, we use the
square bracket notation bbri_nasabah[]
to access columns in the DataFrame.
Indexing by Columns
df['market_cap']
ordf.market_cap
selects the ‘market_cap’ column from the DataFrame.- In cases where the column name has spaces or special characters, you can use
df['column name']
to access the column, but outside of those cases, you can usedf.column_name
as well.
- In cases where the column name has spaces or special characters, you can use
- To select multiple columns, you can pass a list of column names:
df[['currency', 'market_cap']]
.
When you select a single column from a DataFrame, you get a Series
object in return.
This is in constrast to selecting multiple columns, which returns another DataFrame with
the selected columns.
Indexing by columns is useful when you want to perform operations on specific columns or when you need to extract a subset of the data for analysis. It is also useful when you want to reorder columns in the DataFrame since you can pass a list of column names in the order you want them to be displayed.
It is important to note the lack of any assignment operator in the above code snippet, which means that the DataFrame is not modified in place. If you want to save the reordered DataFrame, you need to assign it to a new variable or overwrite the existing DataFrame.
Indexing by Rows
To select data (“slicing”) by rows, you can use the loc
and iloc
accessors. To select
a single row, you can use the loc
accessor with the row index label. Just like indexing
a single column, this returns a Series object.
When it’s inconvenient to use the row index label, or when it is yet unknown, you can
use the iloc
accessor to select rows by their integer index.
To select multiple rows, you can pass a list of row index labels or integers to the loc
or iloc
accessors, respectively.
Dropping Rows and Columns
To drop rows or columns from a DataFrame, you can use the drop()
method. The drop()
method returns a new DataFrame with the specified rows or columns removed.
pandas
DataFrames Attribute
What we’ve seen above are accessors to the DataFrame object. These are shortcuts that help you slice out specific parts of the DataFrames.
In this section, we’ll shift our attention to the DataFrame object itself, starting
with the shape
attribute. The shape
attribute returns a tuple representing the
dimensions of the DataFrame, with the first element being the number of rows and
the second element being the number of columns.
Another useful attribute is columns
, which returns the column labels of the DataFrame.
The dtypes
attribute returns the data types of each column in the DataFrame.
The .values
attribute returns the underlying NumPy array of the DataFrame.
This could be useful when you need to perform operations that are easier with NumPy
arrays, such as matrix multiplication or reshaping.
The .T
attribute returns the transpose of the DataFrame, which swaps the rows and columns,
making it more readable when you have a large number of columns relative to rows.
But remember, these attributes are read-only and do not modify the DataFrame in place, much like the indexing methods we’ve seen earlier.
The last attribute we’ll cover is .index
, which provides the row labels of the
DataFrame. This index is like a list of row labels. In spreadsheets, this would be
the row numbers on the leftmost column. When we read in a CSV file, the index is
automatically generated as a RangeIndex
object, unless we specify a column to be the index
through index_col
parameter in pd.read_csv()
.
In our case, it does make sense to set the index to the ‘year’ column, as it is unique and immediately clear what that row represents.
With that, our slicing and indexing operations will be more intuitive, as we can now use the year as the row label to access data.
When we combine the above with the .drop()
method, remember to now use the new index (row name)
in the drop()
method.
We can, of course, chain these .drop()
operations together, but remember that the DataFrame is not modified in place
unless you reassign it to the same variable.
And to reset the index back to the default RangeIndex
, you can use the reset_index()
method.
Creating new columns
Being creative in data analysis can help unlock insights that are not immediately obvious. In the field of machine learning, this is sometimes referred to as feature engineering.
For example, supposed we would like to train a machine learning model to predict the movement of the commodities market based on the market capitalization of the IDX. We could create a new column that represents the percentage change in market capitalization from the previous year.
Then, we can create a new column that returns either ‘positive’ or ‘negative’ based on the percentage change in market capitalization.
You probably won’t be using lambda
functions a great deal when starting out
with Python and Pandas, but it’s good to know that they exist and can be used
to write concise code.
-
.apply(lambda x: ...)
: Applies a function to each element of the column- Lambda Function:
lambda x: 'positive' if x > 0 else 'negative'
- Lambda Function:
-
df['change_direction']
: Creates a new column in the DataFrame namedchange_direction
and assigns the results of the.apply()
method to this column.
Checkpoint
Up to this part of the exercise, you should be able to:
- Use
pandas
to read in data from CSV files and Excel spreadsheets. - Access columns in a DataFrame using the square bracket notation.
- Use the
loc
andiloc
accessors to select rows in a DataFrame. - Drop rows and columns from a DataFrame using the
drop()
method. - Access DataFrame attributes like
shape
,columns
,dtypes
,values
,T
, andindex
. - Set and reset the index of a DataFrame.
- Create new columns in a DataFrame using existing columns.
Putting everything together, here’s a full example of what’s been covered so far:
Boolean Indexing
Boolean indexing is a powerful feature in Pandas that allows you to filter data based on conditions. You can use boolean arrays to filter rows based on specific criteria.
You can also combine multiple conditions using logical operators like &
(and) and |
(or).
When you want to filter for the year when the market capitalization change has
declined the most, you can use the idxmin()
method to get the index of the minimum value.
Similarly, the idxmax()
method can be used to get the index of the maximum value.
1997 being the year where IDX has shrunk the most in terms of market capitalization shouldn’t come as a surprise to anyone who still has vivid memories of the Asian Financial Crisis.
You can verify the number above by using conditional indexing to filter for the year with the minimum market capitalization change.
Another common use case of a boolean-indexed DataFrame is to count the number of rows that satisfy a condition.
Helpful methods for DataFrames
Pandas provides a wide range of methods to help you explore and manipulate your data. Here are some commonly used methods:
head()
: Returns the firstn
rows of the DataFrame.tail()
: Returns the lastn
rows of the DataFrame.describe()
: Generates descriptive statistics that summarize the central tendency, dispersion, and shape of a dataset’s distribution.info()
: Prints a concise summary of a DataFrame, including the index dtype and column dtypes, non-null values, and memory usage.unique()
: Returns unique values in a column.value_counts()
: Returns a Series containing counts of unique values.sort_values()
: Sorts the DataFrame by the values along either axis.dropna()
: Removes missing values from the DataFrame, either by dropping rows (axis=0
) or columns (axis=1
).
I’ll demonstrate the use of value_counts()
and sort_values()
on the DataFrame that
we’ve been working with.
And sorting the DataFrame by the market capitalization change percentage in descending order.
Group By and Aggregation
We’ve seen how to filter and sort data in a DataFrame. To summarize data, we can use the groupby()
method to group data based on one or more columns and then apply an aggregation function to each group.
A common aggregation function is mean()
, which calculates the average value for each group.
If you have matplotlib
installed, you can visualize the results using a bar plot.
Summary
You have just learned a tremendous amount of information about Pandas, a powerful library for data manipulation and analysis in Python. You now know how to:
- Load data from CSV files and Excel spreadsheets.
- Access columns and rows in a DataFrame using indexing and slicing.
- Drop rows and columns from a DataFrame.
- Use DataFrame attributes to get information about the DataFrame.
- Create new columns in a DataFrame.
- Filter data using boolean indexing.
- Use helpful methods for exploring and manipulating DataFrames, such as
value_counts()
andsort_values()
. - Group data and perform aggregation using
groupby()
.
Exercises
To solidify these learnings, you might be required to answer a few questions with a completely new dataset. The recommended time for the graded quiz is 45 minutes, but feel free to take as much time as you need.
To attempt these questions, make sure you’ve run the code snippets provided in this chapter.
Your dataset should have all the necessary columns (4, after dropping currency
) and rows (29) for the exercises.
Google Classroom Graded Assignment
Head over to Google Classroom or the assigned learning management system to complete the exercises and submit your work.
Lab: Data Manipulation and Analysis w/ Pandas
Using Pandas, perform a series of data manipulation and analysis tasks on Indonesia banking sector data provided by Sectors.
Author
This chapter is written by Samuel Chan, an analytics consultant at Supertype with over 11 years of experience of enterprise AI consulting across Singapore, China (DianDian, 600634:SH), Japan (TWP Dai Nippon, TYO:7912; gumi Inc, TWO:3903; SEGA, TYO:6460) and Indonesia (Emtek, Adaro Group of Companies, Central Bank of Indonesia, Bursa Efek Indonesia, BCA). He has long-term consulting experience with leading financial institutions in the region, and is the co-founder of Algoritma Data Science Education Center, Supertype, Sectors, and formerly HyperGrowth, a marketing automation and chatbot platform startup that he sold in 2016.
Samuel is an avid open source contributor and guest lecturer at several universities across Indonesia and Singapore. He is currently ranked #1 in Indonesia (and top 2% worldwide) on Stack Overflow for R and Python topics (with 111 badges and contributions exceeding 2 million reach).
Contributors
- Gerald Bryan, senior analytics consultant at Supertype