Building a Data Warehouse in 2 Hours using Amazon Redshift
Lessons learned from our hands-on workshop

Our primary goal was to showcase the power and ease of building a data warehouse using Amazon Redshift. To load source data from Amazon S3 efficiently, we used an AWS Marketplace partner product (Matillion ETL for Redshift) as our data load tool. To complete a typical enterprise business scenario, we used another AWS Marketplace partner product (Tableau) to generate a data visualization in the form of a dashboard. Another goal was to build a reference use case following AWS best practices, such as using an IAM user with least-privilege permissions and running the solution components inside an AWS VPC. The image above shows our scenario.

The Workshop Team at AWS re:Invent
At left is a picture of the great team I worked with at AWS re:Invent. It included contractor Kim Schmidt, AWS team members, and vendors from AWS Marketplace partners Matillion and Tableau. Kim has recorded a series of YouTube screencasts related to this blog post at this location:
The Business Problem and Dashboard Goal
As with all successful data warehouse projects, we started with the source data and related it to the business questions of interest. Our source data revolved around flights and weather, so we expected our solution to enable us to answer questions, and to display the results, such as the following:

• “Which airline carriers had the most delays per year?”
• “Which airports had the greatest percentage of flight delays based on weather conditions (such as rain)?”
• “Which airplane types had the most weather-related delays?”
For our scenario, we used two public sample data sets. The first was US airport flight information from 1995 to 2008; this flight data set included every flight to or from a US airport (and whether it left on time or not). The second was public weather data taken from NOAA, including the daily weather readings for each US airport. Our solution dashboard (built with Tableau) is shown below.
How to Use this Blog Post
There are 3 different approaches you could take when reading this blog post, depending on your time, level of expertise, and the depth of knowledge you want to gain:

• Approach 1 – Read the post for information and (optionally) watch the short, included screencasts.
• Approach 2 – Use the pre-built artifacts (scripts, jobs, etc.): open, explore, and then run them on an AWS environment that you set up.
• Approach 3 – Build everything from scratch (including using your own data if so inclined). NOTE: You will have to modify your setup steps based on the size and complexity of your source data.

To that end, we’ll detail the steps you’ll need to take and we’ll add reference scripts and artifacts as we go. For Approach 1, simply read this entire post. For Approach 2, read the “The Workshop Exercises” section so that you understand what steps to take in your own environment. To set up your AWS environment, you can either click to set it up via the AWS console or you can use our AWS CLI script. For Approach 3, read the same material as for Approach 2, but also read everything under the header “Build your own Data Warehouse on AWS.” We will provide step-by-step instructions to build the Matillion ETL jobs and also the Tableau workbook. Shown below is our reference architecture.

re:Invent AWS Data Warehouse Workshop Architecture
The Workshop Exercises
Exercise 0 – Environment
At re:Invent we pre-provisioned one workshop environment for each student team to use; each environment included these services and data:

• One Amazon Redshift cluster using 1 x dc1.large node, launched in an AWS VPC
• The public flights data and weather data in the following bucket: s3://mtln-flight-data
• One EC2 instance (launched from an AWS Marketplace AMI) running Matillion ETL for Redshift
• The Matillion instance used an AWS Elastic IP address
• The Matillion instance ran in the same AWS VPC as the Redshift cluster
• The Matillion jobs solution file (“FinalSolution.json”) to load and process the data, on each desktop
• One data analytics tool (Tableau, installed on each desktop), including the JDBC driver for Redshift
• A unique IAM user login for each team, with appropriate permissions assigned
Exercise 1 – Review Environment
After seeing a demo of the AWS Console (Redshift) environment and the AWS Marketplace (Matillion and Tableau) environments, the first exercise was to have the student teams log in and explore their pre-provisioned environments. They reviewed the following aspects of their setup and took notes on new features they saw in each:

1. IAM users (best practice)
2. VPC (best practice)
3. Redshift single-node cluster
4. Matillion via EC2 -> Matillion browser interface
Exercise 2 – Open, Review and Run Matillion Load (Orchestration) Jobs
After seeing a demo of the data scenario and source files, the student teams next saw a demo of the Matillion load jobs. The instructor imported the Matillion load jobs and showed how to examine the load flow. In this exercise, students then imported the two Matillion load (orchestration) jobs (“Import Flight Data” and “Import Weather Data”) that were on their desktops, reviewed them, and ran them. They then examined the output in Matillion during the job processing and also in the Redshift console (“Query Execution” and “Load” tabs). The data load takes approximately 5 minutes; during the load, the instructor demonstrated additional Matillion capabilities while students waited for their data to finish loading.

Exercise 3 – Open, Review and Run Matillion Transformation Job
After seeing a demo of the data transformation job, in this exercise the student teams imported and ran their own transformation job. The instructor imported the Matillion transformation job and reviewed the job steps in detail while the student jobs completed. Students examined the output in Matillion during job processing and also in the Redshift console (“Query Execution” and “Load” tabs).

Exercise 4 – Connect to Tableau and Visualize the Results
After seeing a demo of how to connect a desktop installation of Tableau to their Redshift cluster, in this exercise the student teams connected to Tableau (on their desktops). Subsequently the instructor demonstrated how to implement joins in Tableau from the source Redshift data, and the student teams then performed the joins. In the final step, the instructor demonstrated a Tableau visualization using the data, and the student teams then created one or more visualizations based on their data.
Building your own Data Warehouse on AWS
To start, you will want to download the workshop script and sample files from GitHub: https://github.com/lynnlangit/AWS-Redshift-Matillion-Workshop

First, let’s get started with a copy of Matillion ETL for Redshift. The AMI is available for a 14-day free trial from the Marketplace. Please follow our getting started instructions; for the impatient, the key points are:

● Attach an IAM role to the instance that has access to Redshift (AmazonRedshiftReadOnlyAccess), S3 (AmazonS3FullAccess) and optionally SNS. If you omit this, it is possible to enter credentials later.
● Run in an AWS VPC. When in a VPC, ensure the instance has internet access for connecting to S3.
● Once started, connect to the instance with a web browser at http://<server_name_or_ip>/
Connecting to AWS Redshift from within Matillion
Once the instance is started and the software is launched, you should see the following screen. Fill in the details for your cluster.
If the test succeeds then Matillion ETL can talk to Redshift and we are ready to start building ETL jobs.
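If you want to verify connectivity independently of Matillion, you can run a quick sanity check from any SQL client (psql or SQL Workbench/J, for example) against the same cluster endpoint. This is just an illustrative check, not part of the workshop materials:

```sql
-- Confirm the cluster is reachable and inspect the target schema.
SELECT version();                       -- Redshift reports a PostgreSQL-compatible version string
SELECT current_database(), current_user;

-- Tables will appear here as the orchestration jobs create them.
SELECT table_name
FROM   information_schema.tables
WHERE  table_schema = 'public';
```

If these queries succeed with the same host, port, database and credentials you gave Matillion, the connection test should succeed too.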
Orientation
This orientation diagram shows the key elements of the tool on a main Transformation canvas. We will drill into the detail further as we go. Also see this video that gives a good product overview.
Loading our Weather Data The paradigm we follow in Matillion ETL for Redshift is to first load our untransformed source data into Amazon Redshift, using an orchestration job (ready for transformation by a transformation job). So, that’s what we’ll do now. When you first start a project a sample orchestration job and transformation job are created to help new users orient themselves in the tool. You can keep these for reference if you wish, or remove them. Our first job is to create a new orchestration job that will load our weather data from the S3 bucket.
1. From the Project menu select Add orchestration job.
2. Name the job Import Weather Data and click OK.
3. Now we have a blank canvas. In the components panel, expand Orchestration -> Flow and drag the Start component onto the canvas. This component simply indicates where the orchestration job starts. It will have green borders (i.e. validated as OK).
4. Next we create a table to hold our list of weather stations. Select the Orchestration -> DDL -> Create/Replace Table component and drag it onto the canvas. This will require some input in the properties area. The important properties to set here are the New Table Name and the metadata.
5. The Table Metadata should be set up as follows; this matches the column format of the text input data.
6. The Distribution Style is set to “All”, meaning this data will be copied to all nodes of the Redshift cluster. We choose All because this is a small table, and this is the most efficient and performant way to store it on Redshift.
7. We define the “USAF” column as the Sort Key, as this is the column’s natural key and we will use it for joining later.
8. Finally, in order to validate, our component needs an input connection. To do this, select the Start component on the canvas and click on the small gray circle to its right. This will then allow you to draw a connector to the Create/Replace Table component, indicating that this happens next.
9. Now that our component is valid, we can run it ad hoc and create the table. To do this, right-click on the component and select Run Component. The table will be created in Redshift.
10. Next we will load some data into our table from S3. Drag on a new component: Orchestration -> Load/Unload -> S3 Load.
11. The S3 Load component has a long list of properties, but most can be left at their defaults. The important ones are shown below:
● Target Table Name: station list. If your create component worked correctly, you should be able to simply select your table from the list.
● Load Columns: Choose All.
● S3 URL Location: s3://mtln-flight-data/weather. This is the public bucket where the data is kept.
● S3 Object Prefix: ish-history.csv. This is the name of the file (or object) in the bucket. In this case it’s a single file, but you can use a prefix and process multiple files.
● Data File Type: CSV. It’s a comma-separated file.
● CSV Quoter: “ (elements are quoted).
● Region: eu-west-1. The region of the S3 bucket the data is loaded from.
12. The rest can be left as default.
13. Once again, right-click and Run Component in order to load the data into the table. Note that the task panel will show the number of rows transferred.
14. Now we repeat the steps above (from step 4) to create a table for the main weather data, called raw_weather. This table holds a lot of data, so we will use “Even” distribution, which spreads the data evenly across the cluster. The weather data columns look like this:
Column name, data type, size, and decimal places:

• STN: Numeric(6,0)
• WBAN: Numeric(5,0)
• YEAR: Numeric(4,0)
• MODA: Numeric(4,0)
• TEMP: Numeric(6,1)
• PRCP: Numeric(5,2)
• VISIB: Numeric(6,1)
• WDSP: Numeric(6,1)
15. The weather data is a delimited file, so for this file type we use the following settings:

● Target Table Name: raw_weather. If your create component worked OK, you should be able to simply select your table from the list.
● Load Columns: Choose All.
● S3 URL Location: s3://mtln-flight-data/weather. This is the public bucket where the data is kept.
● S3 Object Prefix: weather_simple. This is the name of the file (or object) in the bucket. In this case it’s a prefix, hence Redshift will automatically load all matching files. Multiple files are great for large quantities of data because the data can be loaded in parallel by the cluster.
● Data File Type: Delimited. It’s a delimited file.
● Delimiter: , (it’s a comma-separated file).
● Compression Method: Gzip. The data is compressed with GZIP.
● Region: eu-west-1. The region of the S3 bucket the data is loaded from.
16. Now, finally, we can complete the orchestration job.
If we run this job now we should have both tables populated.
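Under the covers, the Create/Replace Table and S3 Load components generate ordinary Redshift SQL. As a rough sketch of the equivalent statements for the raw_weather table (the exact SQL Matillion emits will differ, and the IAM role ARN below is a placeholder you would substitute with your own):

```sql
-- Roughly what the Create/Replace Table component generates for raw_weather.
-- DISTSTYLE EVEN spreads the rows evenly across the slices of the cluster.
CREATE TABLE raw_weather (
    stn   NUMERIC(6,0),
    wban  NUMERIC(5,0),
    year  NUMERIC(4,0),
    moda  NUMERIC(4,0),
    temp  NUMERIC(6,1),
    prcp  NUMERIC(5,2),
    visib NUMERIC(6,1),
    wdsp  NUMERIC(6,1)
) DISTSTYLE EVEN;

-- Roughly what the S3 Load component generates: a parallel COPY from the
-- public bucket. Because 'weather_simple' is a prefix, all matching objects
-- are loaded in parallel by the cluster.
COPY raw_weather
FROM 's3://mtln-flight-data/weather/weather_simple'
IAM_ROLE 'arn:aws:iam::<your-account>:role/<your-role>'  -- placeholder
DELIMITER ','
GZIP
REGION 'eu-west-1';

-- Quick sanity check after the load (compare with the task panel row count):
SELECT COUNT(*) FROM raw_weather;
```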
Importing the Flight data
Next it’s time to import the flight data. However, to avoid too much repetition, we will do this by importing a pre-built job. This demonstrates how jobs can be reused via the Export/Import functionality in Matillion ETL for Redshift.
1. Download the Flight Data.json from here.
2. Select Project -> Import Jobs.
3. Click Browse... and choose the file you downloaded, then select the “Import Flight Data.json” file and click OK.
4. Now open the orchestration job by double-clicking Orchestration -> Import Flight Data.
5. Review the job and, when you are happy with what it is doing, re-validate and then run it.
6. The job will take a few minutes to load all the data.
Note: The Flight Data job also pre-creates the output table “Flights Analysis” that we will use for analysis of the results of our transformation.
Creating the Transformation Job
We have done the E and the L (Extract, from S3, and Load, into Redshift), so we are ready to start the fun bit: the T (Transform). In the next section we will build a transformation job that joins our flights and weather data and outputs the result into a new table designed for easy analysis. We will join the two data sets using the airport code, which exists in both. After that we’ll add some simple calculations: all the sorts of things you’ll do, at scale, in a real-life business scenario. So let’s get started. Our first challenge is that our flights data doesn’t contain much friendly airplane information, other than the tail number (tailnum). People doing data analysis will want that information available, so let’s start with a simple join.
1. First we need to import a partially completed transformation job: use Project -> Import Jobs and import the file called Transform Weather and Flight Data.json.
2. Once imported, double-click to open the job, and let us begin by adding our flights data flow to the existing data flows.
3. Right-click on the job and choose Revalidate Job to build all the views.
Note: Before we start adding new components, take a moment to look at what the existing components are doing. You will notice we have a transformation for the weather data with joins, calculators, filters and aggregates. You can click through these and see how they are configured as we go.
4. Remove the note that says “Build flight data flow here” by right-clicking and selecting Delete Note.
5. In Components, add Data -> Read -> Table Input to the job.
6. Select the component and set the table name to raw_flights; for simplicity’s sake, select all column names.
7. Repeat the above steps for the table raw_plane_info. Now we have something to join.
8. Add a Data -> Join -> Join component and wire it up like so.
9. Now let us configure the join as follows:
● Name: Join Plane Info.
● Main Table: raw_flights. This is the main flow that we will be joining to. Note that the join can support multiple flows, not just two.
● Main Table Alias: flights. This is used later when we specify the join condition.
● Joins: Join Table 1: raw_plane_info; Join Alias 1: planes; Join Type 1: Inner. This describes that we will inner join to the plane info table. Note: you can do multiple joins, and join flows to themselves if you require.
● Join Expression 1 (f_inner_a): "flights"."tailnum" = "planes"."tailnum". This describes our join. This time it’s simply joining on the tail number, but it could be much more complex if needed. We will look at the calculation editor in more detail later.
● Output Columns: All available. These are the columns of data that flow out of the component. Sometimes it’s possible to get an error here; a useful trick is to delete all the columns and allow the component to re-validate (by clicking OK). This will automatically re-add all valid output columns.
10. If the component is now valid, this is a good time to stop and explore what is going on.
Matillion ETL for Redshift has created a view for each of the three components created so far, and Amazon Redshift has ensured that each view is valid, i.e. the SQL syntax is correct and the view is physically allowed to exist. Let’s look at what Matillion ETL for Redshift can tell us about our data flow so far. The Sample tab allows us to look at the output of our flow so far and also indicates the number of rows involved at this step.
The Metadata tab shows us the data types and columns involved in the output.
The SQL tab shows the SQL of the generated view; this allows you to see exactly what the tool is doing at each step.
The Plan tab will tell you about how Redshift will tackle the query.
And finally, Help is context-sensitive help for the component.
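For illustration, the view behind the join component would look something like the following. The view name and the extra plane columns shown here are assumptions for the sketch (Matillion generates its own internal view names, and the real SQL will differ):

```sql
-- Hypothetical sketch of the view behind the "Join Plane Info" component:
-- an inner join of the two Table Input flows on the tail number.
CREATE OR REPLACE VIEW join_plane_info_view AS
SELECT flights.*,          -- all flight columns flow through
       planes."type"       -- plane info columns join in (names assumed)
FROM   raw_flights AS flights
INNER JOIN raw_plane_info AS planes
       ON "flights"."tailnum" = "planes"."tailnum";
```

This is exactly the kind of statement the SQL tab exposes, one view per component.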
11. Next we add a Data -> Transform -> Filter component to remove all privately owned planes from the dataset. It’s OK to filter the data later down the flow, as the query optimiser will usually improve this when it performs the actual query. Set it up like this:
● Name: Filter Private Planes.
● Filter Conditions: Input Column: type; Qualifier: Not; Comparator: Equal to; Value: Individual. Note again how we are relying on the output of the previous component.
● Combine Conditions: AND.
12. Next in the flow we add a Data -> Transform -> Calculator component. The calculator is a powerful component that allows us to do in-flow calculations across a row of data.
13. The main element of the Calculator component is the Calculations editor.
14. Since the source data has no actual date column, we construct one from the year, month and dayofmonth fields. We also add the following calculated flags:
● Is Departure Delayed: CASE "depdelay" > 0 WHEN true THEN 'Yes' ELSE 'No' END. Sets a simple flag for the departure delay. This sort of field makes life easier for analysts.
● Is Long Delay: CASE "depdelay" > "airtime" * 0.2 WHEN true THEN 'Yes' ELSE 'No' END. Another flag, this time identifying flights that are “Long Delayed”, i.e. the delay was more than 20% of the overall flight time.
● Is Flight Diverted: CASE WHEN "diverted" = 1 THEN 'Yes' ELSE 'No' END. Here we convert a 1-or-0 flag to a more user-friendly Yes or No.
15. Our flow now looks like this.
16. Now we can add our output table for analysis. Add a Data -> Write -> Table Output component. Set this up as below:
● Name: Analysis Flights.
● Target Table Name: Analysis Flights. This table was created when we ran the “Import Flight Data” orchestration job. The columns are already set up to correctly compress your data.
● Fix Data Type Mismatches: No. Not needed here as the types are correct, but sometimes it can be useful to allow Matillion to attempt to map data types.
● Column Mapping: see below. This maps the column names in your flow to the physical columns in the table.
● Truncate: Truncate. Means that every time we add data, this table will first be truncated.

The column mappings are set up like this:
17. Now our flights analysis flow is complete. The whole job looks like this. To run everything from end to end, right-click and select Run Job.
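To make the ELT idea concrete, the flights branch of this job corresponds to SQL along these lines. This is a simplified sketch: table and column names follow the workshop steps, the output column list is abbreviated, and the SQL Matillion actually generates (layered views feeding the Table Output) will differ:

```sql
-- Simplified SQL equivalent of the flights flow: join to plane info,
-- filter out privately owned planes, add the calculated flags, then
-- truncate-and-load the analysis table.
TRUNCATE TABLE "Analysis Flights";

INSERT INTO "Analysis Flights"
SELECT flights.*,
       CASE "depdelay" > 0
            WHEN true THEN 'Yes' ELSE 'No' END           AS is_departure_delayed,
       CASE "depdelay" > "airtime" * 0.2
            WHEN true THEN 'Yes' ELSE 'No' END           AS is_long_delay,
       CASE WHEN "diverted" = 1 THEN 'Yes' ELSE 'No' END AS is_flight_diverted
FROM   raw_flights AS flights
INNER JOIN raw_plane_info AS planes
       ON "flights"."tailnum" = "planes"."tailnum"
WHERE  "planes"."type" <> 'Individual';   -- the Filter Private Planes step
```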
18. You can watch the execution in your task list.
19. This will leave us with 5 analysis tables populated and ready to work with in an analysis tool such as Tableau.

So we are done. Our data is neatly prepared and ready for analysis. Our jobs can now be versioned and scheduled, so the data can be updated regularly if required.

Note: Don’t forget the collaboration features in the tool. Send the URL of the job you happen to be working on to a colleague, and then work on the job together, collaboratively, in real time, just like in Google Docs.
About Matillion
Front end – Matillion ETL for Redshift is an entirely browser-based tool, launched as an AMI from the AWS Marketplace. As such it runs inside your existing AWS account and can be up and running in a few minutes. Matillion has been designed specifically for Redshift.

Back end – Matillion ETL for Redshift uses an ELT architecture, pushing down data transformations to Amazon Redshift. The tool takes advantage of Redshift’s ability to layer many views, whilst still optimizing the execution plan accordingly. Each transformation component generates a corresponding view in Redshift, and Matillion ETL for Redshift keeps these views in sync. This approach has some significant real-world advantages.
● ‘ELT’ is several orders of magnitude faster than ‘ETL’. This is because the data remains in the database, which understands its structure and how to transform it most efficiently. This is as opposed to ‘ETL’, where the data has to be expensively extracted from the database before being transformed in memory and then reloaded.
● Amazon Redshift robustly validates the views as they are created, so the user can be confident that if the view is successfully created, that part of the job will work. This avoids time wasted debugging an ETL job after it has been created
● Matillion ETL for Redshift allows you to ‘sample’ the data at any point in your flow. This can be extremely useful when debugging and understanding complex data flows
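The view layering described above can be pictured as a stack of views, one per transformation component, each built on the previous one. The view names below are invented for illustration:

```sql
-- Each transformation component becomes one view, layered on the one
-- before it. Redshift validates each view as it is created, and its
-- planner optimizes across the whole stack when the final query runs.
CREATE VIEW step1_join AS
    SELECT f."tailnum", f."depdelay", p."type"
    FROM raw_flights f
    JOIN raw_plane_info p ON f."tailnum" = p."tailnum";

CREATE VIEW step2_filter AS
    SELECT * FROM step1_join WHERE "type" <> 'Individual';

CREATE VIEW step3_calc AS
    SELECT *,
           CASE WHEN "depdelay" > 0 THEN 'Yes' ELSE 'No' END AS is_departure_delayed
    FROM step2_filter;
```

Sampling the data “at any point in the flow” then amounts to selecting a few rows from the view at that step.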
Conclusion
As we can see with this relatively simple data set, the key to good analysis is good data preparation... and the fastest way to do that is using an ELT-based tool such as Matillion, running over an MPP columnar database like Amazon Redshift. Of course, this is a very simple job, and real-world applications are much more complex. That is when the advantage of a tool like this really comes alive. The graphical job development, the collaborative nature of Matillion, versioning support, built-in scheduling, et al., all serve to make your ETL jobs far more enjoyable to create and far more valuable once created.
Tableau for the re:Invent Workshop

0a. Connect to Data
o Connect to the Redshift data source and select the public schema, using the correct port.

0b. Join Data – use LEFT joins!
o Drag the Analysis Flights table onto the connection window.
o Drag Analysis Carriers onto the connection window (it will automatically join on carrier code). Turn it into a left join.
o Drag Analysis Weather onto the connection window; turn it into a left join as well.
o Drag Analysis Airports onto the connection window; left join.
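Behind the scenes, Tableau issues SQL equivalent to a chain of left joins like the following. The join key column names here are assumptions based on the step descriptions (Tableau will use whatever keys the drag-and-drop join dialog shows):

```sql
-- Roughly the join Tableau builds from the drag-and-drop steps above.
-- Left joins keep every flight row even when the matching carrier,
-- weather reading or airport record is missing.
SELECT f.*, c.carrier_name, w.weather_delay, a.city
FROM   "Analysis Flights"  f
LEFT JOIN "Analysis Carriers" c ON f.carrier = c.carrier_code
LEFT JOIN "Analysis Weather"  w ON f.origin  = w.airport_code
LEFT JOIN "Analysis Airports" a ON f.origin  = a.airport_code;
```

This is why left joins matter in step 0b: an inner join would silently drop flights with no matching weather reading.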
1. Examine how many flights by each carrier
o New sheet (name it Flights by Carrier).
o Double-click Number of Records (at the bottom of Measures).
o Right-click, hold, and drag Date from Analysis Flights onto Columns; you will see a number of different ways to display dates. Select Week(Date) towards the bottom (the green one).
o Drag Carrier Name from Analysis Carriers onto Color.
o On the Marks card, click the drop-down that says “Automatic” and select “Area”.
o OPTIONAL: Click the drop-down on the color legend, select “Cyclic”, and click Assign Palette.
2. Where are the weather delays?
o New sheet (name it Avg Weather Delays).
o Double-click City.
o Drag Weather Delay onto Color. Change it to an average by clicking the drop-down on the green “pill” and selecting Measure -> Average.
o Drag Number of Records onto Size.
o Click the drop-down on the color legend and change the palette to red.
3. What is the total delay by carrier?
o New sheet (name it Total Delay by Carrier).
o Drag Carrier onto Rows.
o Drag Arrival Delay onto Columns.
o Change Arrival Delay to an average.
o Sort descending (the sort button on the right).
o Click Color and change to grey.
4. Create a dashboard
o Click New Dashboard.
o Double-click Flights by Carrier (or Sheet 1).
o Drag and drop Avg Weather Delays below that.
o Drag and drop Total Delay by Carrier to the left of both.
o Change Fit to “Entire View” for Total Delay by Carrier.
o Click the drop-down on Total Delay by Carrier and select “Use as filter”.
o Click on one of the carriers to see the number of flights they have had, where their flights are going to and which