SQL and Databases
The theoretical and practical aspects of database management and querying.
Recommended Pre-Reading
A historical perspective on enterprise data management
A great way to learn about the evolution of enterprise data management and how it has shaped the relational database management systems we use today.
Creating SQLite Databases from CSV Files
A 30-min, step-by-step guide to creating SQLite databases from CSV files.
Introduction to SQL Queries w/ Pandas
A Jupyter Notebook that highlights the 15 most common SQL queries and how to perform them in Pandas.
Introduction
Much of the data we work with in the financial industry is stored in databases and accessed using SQL.
To really unlock the power of data, your ability and general familiarity with SQL will be a huge advantage in your career, and is the reason why it’s often a requirement for many data-related roles.
This module will cover the theoretical and practical aspects of database management and the querying language SQL. If you do not have a database connection set up (as is expected for most learners and beginners), the best alternative is to use SQLite, which is a lightweight database engine that can be used without a server.
Follow the instructions in the Recommended Pre-Reading section above to set up a SQLite database from a CSV file. If you’re following this module as part of a course, you will be required to complete that setup to be able to follow along with the lessons.
Grading Rubric
- Convert a CSV to a SQLite Database, 4 points
- SQL Queries in Practice, 12 points
- Renaming columns with
ALTER TABLE
, 1 point SELECT
,LIMIT
andOFFSET
in queries, 1 point- Returning unique values with
DISTINCT
, 1 point - Filtering data with
WHERE
, 1 point - Pattern matching with
LIKE
, 1 point - Joining tables with
LEFT JOIN
, 1 point - Aggregation and grouping with
GROUP BY
, 1 point - Using
CAST AS
to change data types, 1 point - Using
HAVING TO
to filter grouped data, 1 point - Using subquery to nest a
SELECT
under anotherSELECT
, 1 point - Using
BETWEEN
to filter data within a range, 1 point - Using
CREATE VIEW
to pack a query into a named object, 1 point
- Renaming columns with
For the most part, you will also need a database client to interact with your database. These ranged from
full fledged workbenches, to simple command line tools. I am using the SQLite
VSCode
extension by alexcvzz, which suffice for the purpose of this module.
Connecting to a Database
Very often, to make a connection to a database, you will need to provide a connection string and
use a library to help with the connection. Among the most popular choices for Python users would be
sqlalchemy
(additional lessons to SQLAlchemy)
which acts as an Object Relational Mapper (ORM), but there are also libraries specific to the database:
- PostgreSQL:
psycopg2
- MySQL:
mysql-connector-python
ormysqlclient
In addition to an ORM or an adapter library, you would also need to prepare the connection details for the database. This would include the database name, the user, the password, and the host.
Depending on the tools you use, I’ve provided a list of examples below to help you make the first connection to a database and perform a simple query.
Database queries with Pandas
SQL stands for Structured Query Language, and is a domain-specific language used in programming and designed for managing data held in a relational database management system (RDBMS).
Notebook Exercise
Looking for some structured practice with using SQL directly in Pandas?
Try your hands on the exercises in 4_SQL_Query before moving into the lab exercises below.
Lab
All of the following exercise is to be performed on the sqlite
database you created from the CSV files:
metadata
table: fromidx_subsector_metadata_080824.csv
reports
table: fromidx_sector_reports_080824.csv
- the database itself should be called
sectors.db
Your submission should be a folder with all the SQL files you created for each exercise.
Renaming columns with ALTER TABLE
and RENAME COLUMN
Let’s start off simple. Create a file and name it 01_rename.sql
. Use ALTER TABLE <table_name> RENAME COLUMN <existing_name> TO <new_name>
to make alterations to the column in metadata
named sub_sector_id
. Rename it to sub_sector_uid
.
If you use the VSCode extension, you can also run the query directly to verify that your answer is correct.
- CTRL + SHIFT + P to open the command palette
- Type
SQLite: New Query
- Type in your query
- CTRL + SHIFT + P +
SQLite: Run Query
or use the shortcut (CTRL + SHIFT + Q)
You may need to refresh your database view to see this change take effect. Alternatively, you can
also use PRAGMA table_info(metadata);
to see the column names.
SELECT
, LIMIT
and OFFSET
in queries
Now create a second file and call it 02_select.sql
.
Use SELECT
to query the reports
table and return the first 10 rows with the following columns:
sub_sector
total_companies
filtered_median_pe
Add an OFFSET
clause to your query to get the first 10 rows after the first 5 rows (skipping the first 5 rows).
Return unique values with DISTINCT
and as
aliasing
Create a third file and name it 03_distinct.sql
.
Use SELECT
to query the metadata
table and return the first 10 unique sectors in the table.
Add a DISTINCT
clause to your query to return only unique values in the
sector
column. This will remove any duplicate values from the result set.
Name the column Indonesian Sectors
using the AS
keyword.
Use a LIMIT
clause to return only the first 10 rows.
Filtering data with WHERE
Create a fourth file and name it 04_where.sql
.
From the reports
table, select only rows where the sub-sector has
at least 15 companies and the average market cap of that sector (avg_market_cap
) is lesser than 500_000_000_000 (“500 billion”).
Return the following columns:
sub_sector
total_companies
avg_market_cap
You can join multiple conditions using the AND
keyword.
Pattern matching with LIKE
.
Create a fifth file and name it 05_like.sql
.
In SQL, you can use wildcards for constructing patterns.
-
%
wildcard matches any sequence of zero or more characters. -
_
wildcard matches any single character. -
b%
matchesbank
,bytes
,bit
-
%ion
matchesinstitution
,renovation
-
%it% matches any string containing
it` -
b_nk
matchesbank
,bunk
-
l__n
matchesloan
,lean
“A” LIKE “a” is true (non case sensitive). To set it to
case sensitive, run PRAGMA case_sensitive_like = true
;
Select all rows from the metadata
table where the sub_sector
column contains the word Goods
or Services
(both case insensitive),
regardless of the position of these words in the string.
Return all columns.
Joining tables with LEFT JOIN
Create a sixth file and name it 06_left_join.sql
.
Use a LEFT JOIN
to join the metadata
and reports
tables on the sub_sector
column.
Return the following columns:
sector
sub_sector
total_companies
avg_market_cap
Aggregation and grouping with GROUP BY
Create a seventh file and name it 07_group_by.sql
.
This file should contain two queries.
For the first query, use COUNT
to return the number of rows in each group
of sector
in the metadata
table. This gives us an idea of how many sub sectors
are in each sector.
Name the columns sector
and total_sub_sectors
.
For the second query, use AVG
to return the average number of companies of each sub sector by
grouping on the sector
column.
Name the columns sector
, sub_sector
, total_companies
, and avg_sub_sector_companies
.
This query should be similar to the one in your LEFT JOIN
exercise
since you will need to join the two tables to get the total_companies
column.
Using CAST AS
and ORDER BY
Since aggregation is commonly applied on numeric columns, this may present a little bit of problem when our columns are in character strings. You may alter the database table and change the column types, or you may use CAST AS
to cast the column values from character to either integer
or decimal
.
Create an eighth file and name it 08_cast_as.sql
.
Use CAST AS
to change the total_companies
and avg_market_cap
columns in the reports
table to an integer.
Finally, use ORDER BY
to sort the result set by the total_companies
column in descending order. Return
only the first ten rows and the following columns:
sub_sector
total_companies
as companies_numavg_market_cap
as market_cap_average
For practice’s sake, you should also try this exercise without changing the column types and see if you can get the same result when ordering by a character column.
If you get the following results, you did not cast the columns correctly:
Using HAVING TO
to filter grouped data
HAVING
has to be used with GROUP BY
. They act similarly to how WHERE
works on rows, except HAVING
filters on the group levels.
Create a ninth file and name it 09_having.sql
.
Perform the second exercise in the seventh file again, but this
time return only sectors where the avg_sub_sector_companies
is greater than 30.
Use a subquery to nest a SELECT
under another
Subqueries allow you to put together more complex queries while maintaining some readability. Fundamentally, a subquery is just a nested level of SELECT
statement.
Create a tenth file and name it 10_subquery.sql
.
First create a subquery that returns all rows from metadata
where the sector is either Financials
or Technology
(case-sensitive!).
Then use this subquery to filter the reports
table, so the returned set
only contains rows where the sub_sector
is in the (earlier) subquery result.
Return the following columns in your query:
sub_sector
min_company_pe
max_company_pe
Using BETWEEN
to filter data within a range
Create a eleventh file and name it 11_between.sql
.
Your query should be similar to the one in the tenth file above, with a small adjustment.
- In your subquery (the inner query), return all rows from
metadata
where thesector
is eitherFinancials
orInfrastructures
and when thesub_sector_uid
is between 3 and 20. - In your outer query, return only rows where the
avg_yoy_q_revenue_growth
is between 0 and 0.3.
Note that BETWEEN
is inclusive on both ends, so BETWEEN 0 AND 17
includes 0 and 17.
Return the following columns:
sub_sector
avg_yoy_q_revenue_growth
min_company_pe
max_company_pe
Notice that:
- If you have performed the exercises earlier, your column names may have be altered. We’re working with the
sub_sector_uid
column after the renaming exercise. BETWEEN
, as you’d expect, expect two values separated byAND
and that these are numbers or dates. You might have to cast the column to a number if it’s not already.
Using CREATE VIEW
to pack a query into a named object
For the final exercise, create a file and name it 12_create_view.sql
.
A view is the result set of a stored query. It is a virtual table you define in the database that can be queried like a table, except that it does not store data.
Once you have defined a view, you can query it like a table — it can include columns from any number of tables, involving complex joins, filters and aggregations.
We create a view with the CREATE VIEW
statement.
In your file, create a view named watchlist
from reports
where the following conditions are met:
avg_market_cap
is greater than the average ofavg_market_cap
. This means we’re only watching sectors with companies that are generally larger than the average.avg_yoy_q_revenue_growth
is greater than 0.1. This narrow our watchlist to sectors with companies that are growing at a rate of at least 10% year-on-year.median_yield_ttm
greater than 0.03. This further narrows our watchlist to sectors with companies that are, on average, paying out at least 3% worth of dividiend yield.
Return all columns. You may need to perform type casting to ensure that the columns are of the correct type.
Now, use PRAGMA
to print out the columns of the watchlist
view.
Depending on your database editor or application, you may need to refresh the explorer to see your views being created.
You can now attempt a query directly on this view to see if it works as expected.
Benefits and Advantages of Views
Views present the following benefits:
- Simplify complex queries: Views can encapsulate complex queries and present a simplified, more understandable interface. Users can query the view without needing to know the details of the underlying tables and joins.
- Data security: Views can restrict access to sensitive data. By creating views that only include certain columns or rows, you can control what data users see and access.
- Consistency: Views ensure consistent results for complex queries. If you have a commonly used query, encapsulating it in a view ensures that any changes to the query logic are applied uniformly across all users and applications accessing that view.
- Abstraction: Views abstract the underlying database schema. If the schema changes, you can often adjust the view without modifying the application code that uses the view.
- Maintainaibility: If you need to update how data is presented or aggregated, you can modify the view definition without changing the application code that uses it.
- Materialized views: Some databases support materialized views, which store the results of the view query in a table-like structure. This can improve query performance by pre-computing the results and storing them for faster access. In cases where materialized view isn’t supported, you can create a table from the view. It is important to distinguish this from regular views, which are virtual and are generally just queries that are run when the view is queried.