Abstract
In this lab session we look at setting up a SQL server, creating and populating a database, and making joins between different tables.
The check session for this practical is 7th November 2024.
- This practical should prepare you for the course assessment. Ensure that you have a solid understanding of the material, with particular emphasis on the AWS database setup. You should be able to use the same database that you set up here for the final assessment.
- In that assessment, you will work with datasets that require initial setup processes. Start your work on dataset access and database setup early to avoid being blocked from work on subsequent stages later.
- Some tasks will require you to develop skills for searching for multiple solutions and experimenting with different approaches, which lecture content may not cover. This environment closely resembles real-world data science and software engineering challenges, where there might not be a unique correct solution.
Brief History of the Cloud
In the early days of the internet, companies could make use of open source software, such as the Apache web server and the MySQL database, to provide online stores or other capabilities. But as part of that, they normally had to provide their own server farms. For example, the earliest server for running PageRank from 1996 was hosted on custom-made hardware built in a case of Mega Bloks (see Figure \(\ref{the-first-google-computer-at-stanford}\)).
By September 2000, Google operated 5000 PCs for searching and web crawling, using the Linux operating system.1
Cloud computing gives you access to computing at a similar scale, but on an as-needed basis. AWS launched S3 cloud storage in March 2006 and the Elastic Compute Cloud followed in August 2006. These services were inspired by the challenges Amazon faced scaling its own web service (known as Obidos) for running the world’s largest e-commerce site. Recent market share estimates indicate that AWS still retains around a third of the cloud infrastructure market.2
UK Housing Datasets
The UK Price Paid data for housing dates back to 1995 and contains millions of transactions. This database is available at the gov.uk site. The total data is over 4 gigabytes in size, and it is available as a single file or as multiple files split by year and half-year. For example, the first part of the data for 2018 is stored at http://prod.publicdata.landregistry.gov.uk.s3-website-eu-west-1.amazonaws.com/pp-2018-part1.csv. Applying the divide and conquer principle, we will download the split data, because each of these files is less than 100 MB, which makes them easier to manage.
Let’s first download the two files that contain the price paid data for the transactions that took place in the year 1995:
import requests

# Base URL where the dataset is stored
base_url = "http://prod.publicdata.landregistry.gov.uk.s3-website-eu-west-1.amazonaws.com"

# Downloading part 1 from 1995
file_name_part_1 = "/pp-1995-part1.csv"
url = base_url + file_name_part_1
response = requests.get(url)
if response.status_code == 200:
    with open("." + file_name_part_1, "wb") as file:
        file.write(response.content)

# Downloading part 2 from 1995
file_name_part_2 = "/pp-1995-part2.csv"
url = base_url + file_name_part_2
response = requests.get(url)
if response.status_code == 200:
    with open("." + file_name_part_2, "wb") as file:
        file.write(response.content)
The data is downloaded as CSV files, which you can see in the file explorer of this notebook. You can see that the two pieces of code that download the data are quite similar. It makes sense to use a for loop to automate the way we access the dataset for the different years. The following code will download the data from 1996 to 2010.
# Base URL where the dataset is stored
base_url = "http://prod.publicdata.landregistry.gov.uk.s3-website-eu-west-1.amazonaws.com"
# File name with placeholders
file_name = "/pp-<year>-part<part>.csv"

for year in range(1996, 2011):
    print("Downloading data for year: " + str(year))
    for part in range(1, 3):
        url = base_url + file_name.replace("<year>", str(year)).replace("<part>", str(part))
        response = requests.get(url)
        if response.status_code == 200:
            with open("." + file_name.replace("<year>", str(year)).replace("<part>", str(part)), "wb") as file:
                file.write(response.content)
If we think of reusability, it would be good to create a function that can be called from anywhere in your code.
import requests
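The function definition itself is not reproduced in this handout; the following is a minimal sketch, consistent with the call to download_price_paid_data below. It simply wraps the download loop above, and the inclusive year range is an assumption based on how the function is called.

def download_price_paid_data(year_from, year_to):
    """Download the UK Price Paid CSV files for the given (inclusive) range of years."""
    # Base URL where the dataset is stored (same as above); relies on the
    # `import requests` at the top of this cell.
    base_url = "http://prod.publicdata.landregistry.gov.uk.s3-website-eu-west-1.amazonaws.com"
    file_name = "/pp-<year>-part<part>.csv"
    for year in range(year_from, year_to + 1):
        print("Downloading data for year: " + str(year))
        for part in range(1, 3):
            local_name = file_name.replace("<year>", str(year)).replace("<part>", str(part))
            response = requests.get(base_url + local_name)
            if response.status_code == 200:
                with open("." + local_name, "wb") as file:
                    file.write(response.content)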
Now we can call the function to download the data between two given years. For example, let’s download the data from 2011 to 2020 by calling the defined function.
download_price_paid_data(2011, 2020)
Exercise 1
Add this function to your fynesse library and download the data from 2021 to 2024 using your library.
Cloud Hosted Database
The size of the data makes it unwieldy to manipulate directly in Python frameworks such as pandas. As a result, we will host and access the data in a relational database.
Using the following ideas:

1. a cloud hosted database (such as MariaDB hosted on the AWS RDS service),
2. SQL code wrapped in appropriately structured Python functions,
3. joining database tables,

you will construct a database containing tables that hold all house prices, latitudes and longitudes from the UK house price database since 1995.
Important Notes
You will manipulate large datasets throughout this practical and the final assessment. If your database is blocked or unresponsive after an operation, have a look at the database dashboard to see whether it is using too much CPU, and reboot the database if that is the case. If you encounter problems with the online notebook (e.g., interrupted connections with the AWS server), you can use a local IDE to work on your machine.
SQL Database Server
A typical machine learning installation might have you running a database from a cloud service (such as AWS, Azure or Google Cloud Platform). That cloud service would host the database for you, and you would pay according to the number of queries made. Popular SQL server software includes MariaDB (https://mariadb.org/), which is open source, or Microsoft’s SQL Server.

Many start-up companies were formed on the back of a MySQL server hosted on top of AWS. But since MySQL was sold to Sun, and then passed on to Oracle, the open source community has turned its attention to MariaDB; here are the AWS instructions on how to set up MariaDB.
If you were designing your own ride-hailing app, or any other major commercial software, you would want to investigate whether you would need to set up a central SQL server in one of these frameworks.
Creating a MariaDB Server on AWS
In this section, we’ll review the setup required to create a MariaDB
server on AWS. The earliest AWS services of S3 and EC2 gave storage and
compute. Together these could be combined to host a database service.
Today cloud providers also provide machines that are already set up to
provide a database service. We will make use of AWS’s Relational
Database Service to provide our MariaDB
server.
- Log in to your AWS account and go to the AWS RDS console here.
- Make sure the region is set to Europe (London) which is denoted as eu-west-2.
- Scroll down to “Create Database”. Do not create an Aurora database instance. Standard Create should be selected. In the box below, which is titled Engine Options, you should select MariaDB. You can leave the Version as it’s set.
- In the box below that, make sure you select Free tier.
- Name your database server. For your setup we suggest you use database-ads-<CRSid> for the name, where <CRSid> corresponds to your CRSid. This way, every student has an independent database.
- Set a master password for accessing the database server as admin.
- Leave the DB instance class as it is.
- Leave the DB instance size at the default setting. Leave the storage type and allocated storage at the default settings of General Purpose (SSD) and 20.
- Disable autoscaling in the Storage Autoscaling option.
- In the connectivity settings, leave the VPC selection as Default VPC and enable Publicly accessible so that you’ll have an IP address for your database.
- In VPC security group select Create new to create a new security group for the instance.
- Write ADSMariaDB as the group name for the VPC security group.
- Leave the rest as it is and select Create database at the bottom to launch the database server.
Your database server will take a few minutes to launch.
While it’s launching you can check the access rules for the database server here.
- Select the Default security group.
- The source of the active inbound rule must be set to 0.0.0.0/0. This means you can connect from any source using IPv4.

A wrong inbound rule can cause your connection to the database from this notebook to fail.
Note: by setting the inbound rule to 0.0.0.0/0 we have opened up access to any IP address. If this were production code you wouldn’t do this; you would specify a range of addresses or the specific address of the compute server that needs to access the system. Because we’re using Google Colab or another notebook client to access the database, and we can’t control the IP address of that access, for simplicity we’ve set it up so that any IP address can access the database. That is not good practice for production systems.
Connecting to your Database Server
Before you start, you’re going to need the username and password you
set-up above for accessing the database server. You will need to make
use of it when your client connects to the server. It’s good practice to
never expose passwords in your code directly. So to protect your password, we’re going to create a credentials.yaml file locally that will store your username and password so that the client can access the server without ever showing your password in the notebook. This file will also store the URL and port of your database.
You can get these details from your database connectivity and security
details.
import yaml
from ipywidgets import interact_manual, Text, Password
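The widget cell itself is not reproduced here; a minimal sketch is shown below. It assumes a helper named write_credentials (the name and widget labels are illustrative, and 3306 is MariaDB’s standard port) and saves the keys username, password, url and port that are read back in the cell after next.

def write_credentials(username, password, url, port):
    # Store the database credentials locally so they never appear in the notebook.
    credentials_dict = {
        "username": username,
        "password": password,
        "url": url,
        "port": port,
    }
    with open("credentials.yaml", "w") as file:
        yaml.dump(credentials_dict, file)

# Build an interactive form with a "Run Interact" button.
interact_manual(
    write_credentials,
    username=Text(description="Username:"),
    password=Password(description="Password:"),
    url=Text(description="URL:"),
    port=Text(description="Port:", value="3306"),
)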
If you click Run Interact
then the credentials you’ve
selected will be saved in the yaml
file.
Then, we can read the server credentials using:
with open("credentials.yaml") as file:
    credentials = yaml.safe_load(file)
username = credentials["username"]
password = credentials["password"]
url = credentials["url"]
port = credentials["port"]
SQL Commands
We have all the required data to interact with our database server. There are two main ways to do that. The first one is using magic SQL commands. For this option, we need to install pymysql and load the sql extension:
%pip install pymysql
%load_ext sql
We can now test our first database server connection using magic SQL. The first line establishes the connection and the second one lists the databases. For now, you should only see the databases that the engine has installed by default.
%sql mariadb+pymysql://$username:$password@$url?local_infile=1
%sql SHOW databases
This connection also enables the uploading of local files as part of
the connection (i.e., local_infile=1
). We will use this
property later.
Database Schema
As a first step after establishing a connection, we should create our own database in the server. We will use this relational database to store, structure, and access the different datasets we will manipulate during this course. We will use SQL code for this, once again via magic commands:
%%sql
SET SQL_MODE = "NO_AUTO_VALUE_ON_ZERO";
SET time_zone = "+00:00";

CREATE DATABASE IF NOT EXISTS `ads_2024` DEFAULT CHARACTER SET utf8 COLLATE utf8_bin;
%sql SHOW databases
After the database is created in our server, we must tell it which database we will use. You will need to run this command after you create a new connection:
%%sql
USE `ads_2024`;
A database is composed of tables where data records are stored in rows. The attributes of each record are the columns. We must define a schema to create a table in a database. The schema tells the database and the server what to expect in the columns of the table (i.e., the names and data types of the columns).
The schema is defined by the data we want to store. If we want to store the UK Price Paid data, we should have a look at its description first. We should also have a look at the data itself. Let’s do that for the second half of 2021.
import pandas as pd

file_name = "./pp-2021-part2.csv"
data = pd.read_csv(file_name)
data.info(verbose=True)
Based on the columns of the dataframe, we should now define the equivalent schema of the table in the SQL database. We will once more use SQL magic commands to create a table with the equivalent schema:
# WARNING: If you run this command after you have uploaded data to the table (in the steps below), you will delete the uploaded data, as this command first drops the table if it exists (DROP TABLE IF EXISTS `pp_data`;).
%%sql
--
-- Table structure for table `pp_data`
--
USE `ads_2024`;
DROP TABLE IF EXISTS `pp_data`;
CREATE TABLE IF NOT EXISTS `pp_data` (
  `transaction_unique_identifier` tinytext COLLATE utf8_bin NOT NULL,
  `price` int(10) unsigned NOT NULL,
  `date_of_transfer` date NOT NULL,
  `postcode` varchar(8) COLLATE utf8_bin NOT NULL,
  `property_type` varchar(1) COLLATE utf8_bin NOT NULL,
  `new_build_flag` varchar(1) COLLATE utf8_bin NOT NULL,
  `tenure_type` varchar(1) COLLATE utf8_bin NOT NULL,
  `primary_addressable_object_name` tinytext COLLATE utf8_bin NOT NULL,
  `secondary_addressable_object_name` tinytext COLLATE utf8_bin NOT NULL,
  `street` tinytext COLLATE utf8_bin NOT NULL,
  `locality` tinytext COLLATE utf8_bin NOT NULL,
  `town_city` tinytext COLLATE utf8_bin NOT NULL,
  `district` tinytext COLLATE utf8_bin NOT NULL,
  `county` tinytext COLLATE utf8_bin NOT NULL,
  `ppd_category_type` varchar(2) COLLATE utf8_bin NOT NULL,
  `record_status` varchar(2) COLLATE utf8_bin NOT NULL,
  `db_id` bigint(20) unsigned NOT NULL
) DEFAULT CHARSET=utf8 COLLATE=utf8_bin AUTO_INCREMENT=1;
The schema defines an id field in the table (i.e., db_id), which must be unique and will play the role of primary key, a crucial concept in relational databases. The following code sets up the primary key for our table and makes it auto-increment when a new row (i.e., record) is inserted into the table.
%%sql
--
-- Primary key for table `pp_data`
--
ALTER TABLE `pp_data`
ADD PRIMARY KEY (`db_id`);

ALTER TABLE `pp_data`
MODIFY db_id bigint(20) unsigned NOT NULL AUTO_INCREMENT, AUTO_INCREMENT=1;
We’ve created our first table in the database with its respective primary key. Now we need to populate it. There are different methods to do that, some of them more efficient than others. In our case, given the size of our dataset, we will take advantage of the CSV files we downloaded in the first part of this lab. The command LOAD DATA LOCAL INFILE allows uploading data to the table from a CSV file. We must specify the name of the local file, the name of the table, and the format of the CSV file we want to use (i.e., separators, enclosers, line termination characters, etc.).
The following command uploads the data of the transactions that took place in 1995.
%sql USE `ads_2024`;
%sql LOAD DATA LOCAL INFILE "./pp-1995-part1.csv" INTO TABLE `pp_data` FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED by '"' LINES STARTING BY '' TERMINATED BY '\n';
%sql LOAD DATA LOCAL INFILE "./pp-1995-part2.csv" INTO TABLE `pp_data` FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED by '"' LINES STARTING BY '' TERMINATED BY '\n';
If we want to upload the data for all the years, we will need a command for each CSV in our dataset. Alternatively, we can use Python code together with SQL magic commands as follows:
# WARNING: This code will take a long time to finish (i.e., more than 30 minutes) given our dataset's size. The print informs the uploading progress by year.
for year in range(1996, 2025):
    print("Uploading data for year: " + str(year))
    for part in range(1, 3):
        file_name = "./pp-" + str(year) + "-part" + str(part) + ".csv"
        %sql LOAD DATA LOCAL INFILE "$file_name" INTO TABLE `pp_data` FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED by '"' LINES STARTING BY '' TERMINATED BY '\n';
Now that we have uploaded the data, we can retrieve it from the table. We can select the first 5 elements in the pp_data table using this command:
%%sql
USE `ads_2024`;
SELECT * FROM `pp_data` LIMIT 5;
We can also count the number of rows in our table. This can take more than 5 minutes to finish. There are almost 30 million records in the dataset.
%sql select count(*) from `pp_data`;
Postal Codes Dataset
The UK Price Paid dataset is now available in our database. This dataset provides useful information about housing transactions in the UK. A more complete analysis is enabled if we join it with additional datasets. That is the goal of the rest of this practical. The Open Postcode Geo dataset provides additional information about the houses. It is a dataset of British postcodes with easting, northing, latitude, and longitude, and with additional fields for geospace applications, including postcode area, postcode district, postcode sector, incode, and outcode.
Your task now is to make this dataset available and accessible in our database. The data you need can be found at this url: https://www.getthedata.com/downloads/open_postcode_geo.csv.zip. It will need to be unzipped before use.
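One possible way to fetch and unzip the file from Python, using the standard library’s zipfile module (a sketch; extracting into the current directory is just an example):

import io
import zipfile

import requests

# Download the zipped Open Postcode Geo data and extract the CSV into the
# current working directory.
postcode_url = "https://www.getthedata.com/downloads/open_postcode_geo.csv.zip"
response = requests.get(postcode_url)
if response.status_code == 200:
    with zipfile.ZipFile(io.BytesIO(response.content)) as zip_file:
        zip_file.extractall(".")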
You may find the following schema useful for the postcode data (developed by Christian and Neil):
%%sql
--
-- Table structure for table `postcode_data`
--
DROP TABLE IF EXISTS `postcode_data`;
CREATE TABLE IF NOT EXISTS `postcode_data` (
  `postcode` varchar(8) COLLATE utf8_bin NOT NULL,
  `status` enum('live','terminated') NOT NULL,
  `usertype` enum('small', 'large') NOT NULL,
  `easting` int unsigned,
  `northing` int unsigned,
  `positional_quality_indicator` int NOT NULL,
  `country` enum('England', 'Wales', 'Scotland', 'Northern Ireland', 'Channel Islands', 'Isle of Man') NOT NULL,
  `latitude` decimal(11,8) NOT NULL,
  `longitude` decimal(10,8) NOT NULL,
  `postcode_no_space` tinytext COLLATE utf8_bin NOT NULL,
  `postcode_fixed_width_seven` varchar(7) COLLATE utf8_bin NOT NULL,
  `postcode_fixed_width_eight` varchar(8) COLLATE utf8_bin NOT NULL,
  `postcode_area` varchar(2) COLLATE utf8_bin NOT NULL,
  `postcode_district` varchar(4) COLLATE utf8_bin NOT NULL,
  `postcode_sector` varchar(6) COLLATE utf8_bin NOT NULL,
  `outcode` varchar(4) COLLATE utf8_bin NOT NULL,
  `incode` varchar(3) COLLATE utf8_bin NOT NULL,
  `db_id` bigint(20) unsigned NOT NULL
) DEFAULT CHARSET=utf8 COLLATE=utf8_bin;
And again you’ll want to set up a primary key for the new table.
%%sql
ALTER TABLE `postcode_data`
ADD PRIMARY KEY (`db_id`);

ALTER TABLE `postcode_data`
MODIFY `db_id` bigint(20) unsigned NOT NULL AUTO_INCREMENT, AUTO_INCREMENT=1;
Exercise 2
Upload the postcode dataset to your database using the LOAD DATA LOCAL INFILE command.
Joining Tables
When we have two tables with data that describe the same object, it makes sense to join the tables together to enrich our knowledge about the object. For example, while the pp_data table tells us about the transactions a house has been involved in, the postcode_data table tells us details about the location of the house. By joining both, we could answer more interesting questions, like: what are the coordinates of the most expensive house in 2024?
The join of the tables must be done by matching the columns the tables share. In this case, the pp_data and postcode_data tables share the column postcode. This operation can take a long time because the number of records in each table is huge.
You will find the issue of operations taking too long when handling large datasets in different scenarios. An appropriate strategy to overcome such issues is to index the tables. Indexing is a way to organise the data so that queries are more time-efficient. The task now is to select the right columns to create the index on. This decision depends on the columns we use in the SQL operations. In the join case, we are selecting and matching records in each table using the postcode column. It makes sense then to create an index on this column in each table. The following code indexes the table pp_data by postcode.
# WARNING: Given the size of the table, this operation takes around 8 minutes.
# If your database is not responsive, check the status of your database on the AWS dashboard. You can restart the database from the dashboard.
%%sql
USE `ads_2024`;
CREATE INDEX idx_pp_postcode ON pp_data(postcode);
Let’s now try joining the tables for the year 2024:
%%sql
USE `ads_2024`;
select * from pp_data as pp inner join postcode_data as po on pp.postcode = po.postcode where pp.date_of_transfer BETWEEN '2024-01-01' AND '2024-12-31' limit 5;
Exercise 3
The index made a difference in the time the join operation took to finish. Write the code to index the table postcode_data by postcode.
And, let’s try the join again:
%%sql
USE `ads_2024`;
select * from pp_data as pp inner join postcode_data as po on pp.postcode = po.postcode where pp.date_of_transfer BETWEEN '2024-01-01' AND '2024-12-31' limit 5;
Exercise 4
Do you see any difference after adding the new index? Why?
Database Python Client
Now let’s focus on the second way of interacting with the database server. We can use Python code to create a client that communicates with our database server. For this, we need to install the following libraries.
%pip install ipython-sql
%pip install PyMySQL
Let’s create a method in Python to establish a database connection wherever we like. It should look like the following code:
import pymysql
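The connection function itself is not reproduced in this handout. A minimal sketch is given below; the name create_connection and its argument names are assumptions, but it returns the conn object used later and enables local_infile so that LOAD DATA LOCAL INFILE keeps working through this client.

def create_connection(user, password, host, database, port=3306):
    """Create a connection to the MariaDB server, returning None on failure."""
    conn = None
    try:
        conn = pymysql.connect(
            user=user,
            passwd=password,
            host=host,
            port=port,
            local_infile=1,  # allow LOAD DATA LOCAL INFILE through this connection
            db=database,
        )
        print(f"Connection to {host} established.")
    except Exception as e:
        print(f"Error connecting to the MariaDB server: {e}")
    return conn

For example, with the credentials loaded earlier, a call such as create_connection(user=username, password=password, host=url, database="ads_2024", port=int(port)) would return the connection object.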
Please add the code above to your fynesse library. We can now call this function to get a connection:
#Write your code to establish a connection using your fynesse library
Now let’s define a Python method that uploads the result of the join operation between the tables pp_data and postcode_data to a new table. For this, we first need to create the table that will store this data.
%%sql
USE `ads_2024`;
--
-- Table structure for table `prices_coordinates_data`
--
DROP TABLE IF EXISTS `prices_coordinates_data`;
CREATE TABLE IF NOT EXISTS `prices_coordinates_data` (
  `price` int(10) unsigned NOT NULL,
  `date_of_transfer` date NOT NULL,
  `postcode` varchar(8) COLLATE utf8_bin NOT NULL,
  `property_type` varchar(1) COLLATE utf8_bin NOT NULL,
  `new_build_flag` varchar(1) COLLATE utf8_bin NOT NULL,
  `tenure_type` varchar(1) COLLATE utf8_bin NOT NULL,
  `locality` tinytext COLLATE utf8_bin NOT NULL,
  `town_city` tinytext COLLATE utf8_bin NOT NULL,
  `district` tinytext COLLATE utf8_bin NOT NULL,
  `county` tinytext COLLATE utf8_bin NOT NULL,
  `country` enum('England', 'Wales', 'Scotland', 'Northern Ireland', 'Channel Islands', 'Isle of Man') NOT NULL,
  `latitude` decimal(11,8) NOT NULL,
  `longitude` decimal(10,8) NOT NULL,
  `db_id` bigint(20) unsigned NOT NULL
) DEFAULT CHARSET=utf8 COLLATE=utf8_bin AUTO_INCREMENT=1;
We should define the primary key for this table too.
%%sql
ALTER TABLE `prices_coordinates_data`
ADD PRIMARY KEY (`db_id`);

ALTER TABLE `prices_coordinates_data`
MODIFY `db_id` bigint(20) unsigned NOT NULL AUTO_INCREMENT, AUTO_INCREMENT=1;
Indexing the table pp_data by date will be useful for populating prices_coordinates_data. Building this index can take around 8 minutes.
%%sql
USE `ads_2024`;
CREATE INDEX idx_pp_date_transfer ON pp_data(date_of_transfer);
Now that the table exists in our database, let’s create a method for uploading the joined data. This method will upload the data for a given year and will use the logic we have used before, but in Python code.
import csv
import time
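The implementation is not reproduced in this handout; below is a minimal sketch of housing_upload_join_data, consistent with the call that follows and with the prices_coordinates_data schema above. It filters pp_data by date (using the index we just created), joins on postcode, writes the result to an intermediate CSV (the name output_file.csv is illustrative), and bulk-loads it.

def housing_upload_join_data(conn, year):
    """Join pp_data and postcode_data for one year and upload the result
    into the prices_coordinates_data table."""
    start_date = str(year) + "-01-01"
    end_date = str(year) + "-12-31"

    cur = conn.cursor()
    print("Selecting data for year: " + str(year))
    cur.execute(
        "SELECT pp.price, pp.date_of_transfer, po.postcode, pp.property_type, "
        "pp.new_build_flag, pp.tenure_type, pp.locality, pp.town_city, "
        "pp.district, pp.county, po.country, po.latitude, po.longitude "
        "FROM (SELECT price, date_of_transfer, postcode, property_type, "
        "new_build_flag, tenure_type, locality, town_city, district, county "
        f"FROM pp_data WHERE date_of_transfer BETWEEN '{start_date}' AND '{end_date}') AS pp "
        "INNER JOIN postcode_data AS po ON pp.postcode = po.postcode"
    )
    rows = cur.fetchall()

    # Write the joined rows to an intermediate CSV file, then bulk-load it.
    csv_file_path = "output_file.csv"
    with open(csv_file_path, "w", newline="") as csvfile:
        csv_writer = csv.writer(csvfile)
        csv_writer.writerows(rows)

    print("Storing data for year: " + str(year))
    cur.execute(
        f"LOAD DATA LOCAL INFILE '{csv_file_path}' INTO TABLE `prices_coordinates_data` "
        "FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"' "
        "LINES STARTING BY '' TERMINATED BY '\\n';"
    )
    conn.commit()
    print("Data stored for year: " + str(year))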
Now, let’s upload the joined data for 2024. This upload is going to take a long time given the size of our datasets:
housing_upload_join_data(conn, 2024)
Exercise 5
Add the housing_upload_join_data function to your fynesse library and write the code to upload the joined data for 2023.
To finalise this lab, let’s have a look at the structure of your database running the following code:
tables = %sql SHOW TABLES;

for row in tables:
    table_name = row[0]
    print(f"\nTable: {table_name}")

    table_status = %sql SHOW TABLE STATUS LIKE '{table_name}';
    approx_row_count = table_status[0][4] if table_status else 'Unable to fetch row count'
    print("\nApprox Row Count:", approx_row_count//100000/10, "M")

    first_5_rows = %sql SELECT * FROM `{table_name}` LIMIT 5;
    print(first_5_rows)

    indices = %sql SHOW INDEX FROM `{table_name}`;
    if indices:
        print("\nIndices:")
        for index in indices:
            print(f" - {index[2]} ({index[10]}): Column {index[4]}")
    else:
        print("\nNo indices set on this table.")
Exercise 6
Write the output of the above code:
Summary
In this practical, we have explored how to persist a couple of datasets in a relational database to facilitate future access. We configured a cloud-hosted database server and had a look at two ways to interact with it. Then, we explored how to join tables using SQL and the benefits of indexing. The tables you created in this practical will be used throughout the course, and we expect you to use them for your final assignment. In the following practical, you will assess this data using different methods, from visualisations to statistical analysis.
Thanks!
For more information on these subjects and more you might want to check the following resources.
- book: The Atomic Human
- twitter: @lawrennd
- podcast: The Talking Machines
- newspaper: Guardian Profile Page
- blog: http://inverseprobability.com