Lab: Data Manipulation and Analysis
Using Pandas, perform a series of data manipulation and analysis tasks on Indonesia banking sector data provided by Sectors.
To complete this exercise, you will need to use the Pandas library to perform a series of data manipulation and analysis tasks on valuation of Indonesia banking institutions. This exercise is worth 15 points.
Pre-Lab Preparation & Prerequisites
Environment Setup and API Keys
You will need to have a proper Python environment set up, with a `.env` file containing the necessary API keys.
Data Analysis w/ Pandas
Review the Pandas tutorial to familiarize yourself with the library and its capabilities.
Submission Template (Google Colab)
Use the provided Google Colab template to complete the exercises and submit them for grading (a total of 15 points).
Google Classroom
Head over to Google Classroom to submit your assignments and receive feedback.
Financial Ratios for Value Investing
When it comes to investing, the specific style of “value investing” popularized by Benjamin Graham and Warren Buffett is based on the idea of buying stocks that are undervalued relative to their intrinsic value.
One way to determine if a stock is undervalued is to look at a company’s financial ratios. These ratios can help investors understand how a company is performing financially and whether its stock is a good investment.
In particular, we’ll be looking at the Price-to-Earnings (P/E) ratio, the Price-to-Book (P/B) ratio, and the Price-to-Sales (P/S) ratio.

P/E ratio = Price per share / Earnings per share

- Where the Price per share is the current stock price,
- and Earnings per share is the company’s net income divided by the number of shares outstanding.

In a way, this measures how much investors are willing to pay for each dollar of earnings a company generates.

P/B ratio = Price per share / Book value per share

- Where the Price per share is the current stock price,
- and the Book value per share is the company’s total assets minus its total liabilities, divided by the number of shares outstanding.

The P/B ratio reflects the value that investors are willing to pay for each dollar of book value.

P/S ratio = Price per share / Sales per share

- Where the Price per share is the current stock price,
- and the Sales per share is the company’s total revenue divided by the number of shares outstanding.

The P/S ratio is particularly useful for companies that have negative earnings or are in the growth phase, as it measures how much investors are willing to pay for each dollar of sales.
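As a quick numeric illustration of the P/E formula, here is a sketch with entirely made-up figures (they do not correspond to any real bank):

```python
# Made-up figures for illustration only (in IDR)
price_per_share = 4_500.0
net_income = 30_000_000_000_000.0       # annual net income
shares_outstanding = 150_000_000_000.0

# Earnings per share = net income / shares outstanding
earnings_per_share = net_income / shares_outstanding   # 200.0

# P/E = price per share / earnings per share
pe_ratio = price_per_share / earnings_per_share        # 22.5
print(pe_ratio)
```

The P/B and P/S ratios follow the same pattern, substituting book value per share or sales per share in the denominator.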
Margin of Safety
In value investing, a lower ratio is generally considered better, as it indicates that the stock is undervalued relative to its financial performance. When a stock is trading at a higher P/E or P/B ratio relative to its peers, value investors typically consider it overvalued; this is where the concept of “margin of safety” comes into play. Margin of safety is simply the difference between the intrinsic value of a stock and its market price, and a stock with a higher P/E is considered to have a lower margin of safety, unless its growth rate is significantly higher than that of its peers.
In the following exercise, you will be working with data directly from a financial data API to extract the historical financial ratios of Indonesia’s leading banking institutions.
The data contains one row per bank per year, with columns for the ticker symbol, the company name, and the P/E, P/B, and P/S ratios.
By comparing the historical P/E ratios of different banks to their peers and their own historical averages, an analyst can get a sense of whether a bank is trading at a premium or discount to its intrinsic value, and whether it is a good value investment relative to its historical norm.
With the right data and analysis tools, a data analyst can construct a comprehensive picture of the banking sector’s valuation, and estimate a reasonable range for a stock’s intrinsic value. If the current P/E ratio is at the lower end of its historical range and the company’s growth prospects are as strong as ever, it may indicate a good value investment opportunity with a high margin of safety.
Banking Institutions’ historical valuation
For your convenience, a copy of the dataset has been provided in the `datasets` folder. Using what you’ve learned in the Pandas tutorial, you should know how to read this data into a DataFrame:
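A minimal sketch of loading the data with `pd.read_csv`. The file name and column names below are assumptions based on the exercises that follow, and the figures are made up; with the real file you would simply call `pd.read_csv("datasets/<filename>.csv")`:

```python
import pandas as pd
from io import StringIO

# Stand-in for the provided CSV file -- column names are assumptions
# based on the exercises below, and the numbers are made up
sample_csv = StringIO(
    "symbol,company_name,year,pe,pb,ps\n"
    "BBRI.JK,Bank Rakyat Indonesia,2022,12.5,2.1,3.4\n"
    "BBCA.JK,Bank Central Asia,2022,24.3,4.9,8.1\n"
)
df = pd.read_csv(sample_csv)
print(df.shape)
```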
Once you have the data loaded, perform a series of exercises that are worth 15 points in total.
1. Basic Data Exploration
- Objective: Learn to explore the basic structure of a DataFrame
- Task: Write Python code to import the dataset and display the first 5 rows of the DataFrame. Then, report its shape (number of rows and columns).
- Hint: Use the `head()` method and the `shape` attribute.
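As an illustration of the hinted methods (on a small stand-in DataFrame, not the lab dataset):

```python
import pandas as pd

# Small stand-in DataFrame -- the real one is loaded from the CSV
df = pd.DataFrame({
    "symbol": ["BBRI.JK", "BBCA.JK", "BMRI.JK"],
    "pe": [12.5, 24.3, 11.8],
})

print(df.head())  # displays up to the first 5 rows
print(df.shape)   # a (rows, columns) tuple -- an attribute, so no parentheses
```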
2. Filtering Data
- Objective: Learn to filter data based on specific conditions
- Task: Write Python code to filter and display rows where the
pe
is greater than 15 and whensymbol
is equal toBRIS.JK
. Name this new DataFramebris_high
. - Hint: Use boolean indexing
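The boolean indexing pattern looks like this (with made-up rows standing in for the lab data):

```python
import pandas as pd

# Made-up rows to illustrate the pattern
df = pd.DataFrame({
    "symbol": ["BRIS.JK", "BRIS.JK", "BBCA.JK"],
    "pe": [18.2, 12.1, 24.3],
})

# Wrap each condition in parentheses and combine them with &
bris_high = df[(df["pe"] > 15) & (df["symbol"] == "BRIS.JK")]
print(bris_high)
```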
3. Calculate Average Values
- Objective: Learn to compute statistical summaries on a DataFrame
- Task: Write Python code to calculate the median `pe`, `pb`, and `ps` values for the year 2022.
- Hint: Use the `median()` method after filtering the DataFrame for the year 2022.
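The filter-then-aggregate pattern, sketched on a stand-in DataFrame with made-up values:

```python
import pandas as pd

# Made-up rows spanning two years
df = pd.DataFrame({
    "year": [2021, 2022, 2022, 2022],
    "pe":   [10.0, 12.0, 14.0, 16.0],
    "pb":   [1.0,  2.0,  3.0,  4.0],
    "ps":   [2.0,  3.0,  4.0,  5.0],
})

# Filter to 2022 first, then take the median of the ratio columns
medians = df[df["year"] == 2022][["pe", "pb", "ps"]].median()
print(medians)
```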
4. Sort Data, Reorder and Drop Columns
- Objective: Learn to sort data and reorder columns
- Task: Write Python code to sort the DataFrame by `ps` in descending order, then reorder the columns so that `symbol` comes first, followed by `year`, `ps`, `pe`, and `pb`. Drop the `company_name` column.
- Hint: Use the `sort_values()` method, and optionally the `drop()` method to remove the `company_name` column. Recall that in `sort_values`, you can set the `ascending` parameter to `False` to sort in descending order. You may also use the `columns` parameter of the `reindex()` method to reorder the columns.
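A sketch of the sort-and-reorder pattern on a stand-in DataFrame; note that leaving `company_name` out of the `reindex` column list also drops it, so a separate `drop()` call becomes optional:

```python
import pandas as pd

# Made-up rows with the same columns as the lab dataset
df = pd.DataFrame({
    "company_name": ["Bank A", "Bank B"],
    "symbol": ["AAAA.JK", "BBBB.JK"],
    "year": [2022, 2022],
    "pe": [10.0, 12.0],
    "pb": [1.0, 2.0],
    "ps": [1.5, 3.0],
})

# Sort by ps descending, then reindex to the desired column order;
# omitting company_name from the list drops that column
result = (
    df.sort_values("ps", ascending=False)
      .reindex(columns=["symbol", "year", "ps", "pe", "pb"])
)
print(result)
```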
5. Grouping and Aggregating Data
- Objective: Learn to group data and perform aggregate operations
- Task: Write Python code to group the bank valuation data by `symbol` and calculate the average `pe`, `pb`, and `ps` values for each group. Display the results. Now, do the same but grouped by `year`, calculating the average `pe`, `pb`, and `ps` values for each year.
- Hint: Use the `groupby()` method followed by the `mean(numeric_only=True)` method.
In past versions of `pandas`, the `mean()` method would include all numeric columns by default. There is now a `FutureWarning` indicating that `numeric_only` will default to `False` in the future. We can either specify `numeric_only=True` to avoid the warning, or use the `select_dtypes()` method to keep only the numeric columns before calling `mean()`.
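Putting the hint and the note together, a minimal sketch on a stand-in DataFrame with made-up values:

```python
import pandas as pd

# Made-up rows; two entries for one bank so the average is visible
df = pd.DataFrame({
    "symbol": ["AAAA.JK", "AAAA.JK", "BBBB.JK"],
    "company_name": ["Bank A", "Bank A", "Bank B"],
    "pe": [10.0, 14.0, 20.0],
})

# numeric_only=True skips non-numeric columns such as company_name
avg_by_symbol = df.groupby("symbol").mean(numeric_only=True)
print(avg_by_symbol)
```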
Dive Deeper
The data you’ve been provided with is not hand-curated; it is sourced from the Sectors Financial Data API. If you’re feeling adventurous, you can continue with this section, which explains how the data is fetched from the API and subsequently processed into a CSV file.
These topics may feel a bit advanced, and you are not required to complete them for the lab. They are included here for those who have a personal interest in learning more about how the “data pipeline” works in the real world.
Nested Data Structures into DataFrames
The data is fetched from the API using the `requests` library, as you have seen in the Quick Start guide. To actually use `fetch_data`, we need to specify a URL, which can be constructed by reading the corresponding API documentation. For example, to fetch the historical P/E ratios of BBRI, one could do:
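The original `fetch_data` code block is not reproduced here; the following is a minimal sketch of the pattern, in which the header name, endpoint URL, and environment-variable name are assumptions to be checked against the Sectors API documentation:

```python
import os
import requests
import pandas as pd

def fetch_data(url: str) -> dict:
    """Fetch JSON from an API endpoint, authenticating with a key from the .env file."""
    # The header name and key variable are assumptions -- check the API docs
    headers = {"Authorization": os.environ.get("SECTORS_API_KEY", "")}
    response = requests.get(url, headers=headers)
    response.raise_for_status()
    return response.json()

# Hypothetical endpoint -- take the real path from the API documentation
url = "https://api.sectors.app/v1/company/report/BBRI/?sections=valuation"
# data = fetch_data(url)          # a nested JSON structure (dict containing lists of dicts)
# df = pd.json_normalize(data)    # flatten the top-level keys into columns
```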
This looks like a great deal more code than what you’ve seen so far, but it’s essentially the same concepts you’ve learned, with the addition of `pd.json_normalize` to convert the JSON response into a DataFrame.
Here’s the output:
We’ll find that the `historical_valuation` key contains the data we’re interested in. We can extract it by first using `.explode()` to expand the list of dictionaries into separate rows. Since each row now holds a single dictionary object, we can then use `pd.json_normalize` to expand that dictionary into separate columns:
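The explode-then-normalize pattern can be sketched end to end as follows (the nested structure below is a made-up stand-in for the API response):

```python
import pandas as pd

# Made-up stand-in for the nested API response
df = pd.DataFrame({
    "symbol": ["BBRI.JK"],
    "historical_valuation": [[
        {"year": 2021, "pe": 15.2},
        {"year": 2022, "pe": 12.7},
    ]],
})

# explode(): one row per dictionary in the list
exploded = df.explode("historical_valuation").reset_index(drop=True)

# json_normalize(): one column per key in each dictionary
ratios = pd.json_normalize(exploded["historical_valuation"].tolist())
result = exploded[["symbol"]].join(ratios)
print(result)
```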
That is the essence of how we fetch data from an API endpoint that sometimes, as in this case, returns nested data structures.
- `json_normalize`: converts JSON data into a DataFrame
- `explode`: expands a list of dictionaries into separate rows
We perform the above iteratively for each bank in the `banks` list, and then concatenate the DataFrames together to form the final dataset.
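That final step can be sketched as follows; the ticker list and output file name are assumptions, and each per-bank frame is a one-row stand-in for what `fetch_data` plus `explode`/`json_normalize` would produce:

```python
import pandas as pd

# Hypothetical ticker list -- the actual banks list may differ
banks = ["BBRI", "BBCA", "BMRI", "BBNI"]

frames = []
for bank in banks:
    # In the real pipeline each frame comes from fetch_data + explode +
    # json_normalize; a one-row stand-in keeps this example self-contained
    frames.append(pd.DataFrame({"symbol": [f"{bank}.JK"], "pe": [10.0]}))

# Concatenate the per-bank frames into one dataset
dataset = pd.concat(frames, ignore_index=True)
# dataset.to_csv("bank_valuation.csv", index=False)  # hypothetical output file
print(dataset)
```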