HospETL Healthcare Analytics Platform, Angela Razzell, Insight Data Engineering Fellowship, New York

HospETL - Delivering a Healthcare Analytics Platform

Jan 08, 2017

Data & Analytics

Angela Razzell
Transcript
Page 1: HospETL - Delivering a Healthcare Analytics Platform

HospETL Healthcare Analytics Platform

Angela Razzell Insight Data Engineering Fellowship New York

Page 2: HospETL - Delivering a Healthcare Analytics Platform

My motivation

- Deliver an analytics platform in Amazon Redshift for randomly generated healthcare data.
- Take a deep dive into Amazon Redshift as a distributed data warehouse system.
- Redshift is widely used in business, and efficient analytics is important for supplying operational insight.

Page 3: HospETL - Delivering a Healthcare Analytics Platform

Technologies

AMAZON REDSHIFT

Page 4: HospETL - Delivering a Healthcare Analytics Platform

Technologies

AMAZON REDSHIFT

- 4 x dc1.large nodes
- Total capacity: 640 GB
- 4 x $0.25 / hour
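The capacity and cost figures on this slide follow from per-node arithmetic. The sketch below reproduces them, assuming 160 GB of storage per dc1.large node (the figure quoted in the node table later in the deck) and the $0.25 per node-hour price shown here. The 730-hour month is an assumption added for illustration and does not come from the slides.

```python
# Per-node figures: storage is the dc1.large value quoted later in the deck,
# price is the per-node hourly rate quoted on this slide.
NODES = 4
STORAGE_PER_NODE_GB = 160
PRICE_PER_NODE_HOUR_USD = 0.25
HOURS_PER_MONTH = 730  # assumption: average month, cluster running continuously

total_capacity_gb = NODES * STORAGE_PER_NODE_GB
hourly_cost_usd = NODES * PRICE_PER_NODE_HOUR_USD
monthly_cost_usd = hourly_cost_usd * HOURS_PER_MONTH

print(f"Total capacity: {total_capacity_gb} GB")
print(f"Hourly cost: ${hourly_cost_usd:.2f}")
print(f"Approx. monthly cost: ${monthly_cost_usd:.2f}")
```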

Page 5: HospETL - Delivering a Healthcare Analytics Platform

Schema

Reference tables: ref_doctor, ref_hospital, ref_eddisposal, ref_edcomplaint, ref_diagnosis

Table | Rows | Columns
patient | 10 million | 21
elective_bookings | 53+ million | 13
(ED) encounter | 40 million | 10
admissions & appts | 28 million | 12
patient_diagnosis | 40 million | 21

Page 6: HospETL - Delivering a Healthcare Analytics Platform

Columnar Compression

Types | How it works | Examples
Raw | No compression; use for large domains | Identifiers
Bytedict | Creates a dictionary of unique values; optimal for a limited number of unique values | Dept. code
LZO | Creates a dictionary of repeating character sequences; use for very long character strings | Comments
Runlength | Stores repeat counts of values; use for consecutive repeating values | Doctor code
Text255 & Text32k | Creates a dictionary of unique words for repeating text | Address
Delta | Records the difference between values that follow each other; optimal for consecutive integer values | Gender code
Mostly | Stores values in a smaller standard storage size; optimal when the column data type is larger than most values | BIGINT columns
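Two of the encodings in the table above are easy to illustrate in a few lines. The sketch below is conceptual only: it shows the idea behind a byte dictionary and a delta encoding, not Redshift's actual on-disk format. The function names and sample values are hypothetical.

```python
def bytedict_encode(values):
    """Replace each value with a small index into a dictionary of unique values."""
    dictionary = []   # unique values in order of first appearance
    index_of = {}     # value -> position in the dictionary
    codes = []
    for value in values:
        if value not in index_of:
            index_of[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index_of[value])
    return dictionary, codes


def delta_encode(values):
    """Keep the first value, then store each value as the difference from its predecessor."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas):
    """Rebuild the original values by accumulating the stored differences."""
    values = []
    running = 0
    for delta in deltas:
        running += delta
        values.append(running)
    return values


# Hypothetical sample data
dept_codes = ["ED", "ED", "CARD", "ED", "CARD"]
dictionary, codes = bytedict_encode(dept_codes)
print(dictionary, codes)

booking_ids = [1000, 1001, 1002, 1005]
deltas = delta_encode(booking_ids)
print(deltas)
```

A column with few distinct values shrinks to short dictionary indexes, and a column of nearly consecutive integers shrinks to small differences that fit in fewer bytes than the original values.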

Page 8: HospETL - Delivering a Healthcare Analytics Platform

Columnar Compression: Runlength

Doctor code | Original size (bytes) | Compressed value | Compressed size (bytes)
C1 | 2 | {2,C1} | 3
C1 | 2 | - | 0
C22 | 3 | {4,C22} | 4
C22 | 3 | - | 0
C22 | 3 | - | 0
C22 | 3 | - | 0
C101 | 4 | {1,C101} | 5
Total | 20 | | 12
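The run-length example on this slide can be reproduced with a short sketch. It assumes the same byte accounting as the slide: one byte per character of the doctor code and one byte for each run count. The function names are hypothetical, and this illustrates the idea rather than Redshift's internal implementation.

```python
from itertools import groupby


def runlength_encode(values):
    """Collapse consecutive repeats into (count, value) pairs."""
    return [(len(list(group)), value) for value, group in groupby(values)]


def sizes_in_bytes(values):
    """Return (original, compressed) sizes using the slide's accounting:
    one byte per character, plus one byte per run count."""
    original = sum(len(value) for value in values)
    compressed = sum(1 + len(value) for _, value in runlength_encode(values))
    return original, compressed


doctor_codes = ["C1", "C1", "C22", "C22", "C22", "C22", "C101"]
runs = runlength_encode(doctor_codes)
original, compressed = sizes_in_bytes(doctor_codes)
print(runs)
print(original, compressed)
```

The saving depends on the values being adjacent: the same seven codes in shuffled order would produce more runs and could end up larger than the original.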

Page 9: HospETL - Delivering a Healthcare Analytics Platform

No columnar compression or keys

50%

25%

Page 10: HospETL - Delivering a Healthcare Analytics Platform

Add columnar compression

20%

10%

Page 11: HospETL - Delivering a Healthcare Analytics Platform

Add columnar compression and keys

15%

Page 12: HospETL - Delivering a Healthcare Analytics Platform

Challenges

- Creating a schema from scratch.

- Generating and loading large datasets.

- Learning Redshift and how to optimize it.

Page 13: HospETL - Delivering a Healthcare Analytics Platform

About me

- Worked in Data Migration for an IT system implementation project and in Business Intelligence at an NHS Trust.
- M.Eng in Engineering Mathematics from the University of Bristol.
- Interests include hiking and swimming.

Page 14: HospETL - Delivering a Healthcare Analytics Platform

Demo

- www.hospETL.website

Page 15: HospETL - Delivering a Healthcare Analytics Platform

Encryption

- AWS Key Management Service (KMS)
  - Automatically integrates with Redshift
  - $1 a month
- Hardware Security Module (HSM)
  - Requires client and server certificates to configure a trusted connection to Amazon Redshift
  - Monthly fee plus $5000 initial cost

Page 16: HospETL - Delivering a Healthcare Analytics Platform

Redshift cluster

- Set up a Redshift cluster with 4 dc1.large nodes: four nodes with two slices each.

Node size | vCPU | ECU | RAM (GiB) | Slices per node | Storage per node | Node range | Total capacity
dc1.large | 2 | 7 | 15 | 2 | 160 GB SSD | 1-32 | 5.12 TB
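The node figures above determine how many slices share the work and how much storage a cluster has. The sketch below derives both for the 4-node cluster and for the 32-node maximum. The helper name `cluster_summary` is hypothetical, and the rows-per-slice figure assumes a perfectly even distribution of the 40 million row encounter table, which a real distribution key may not achieve.

```python
# dc1.large figures taken from the node table on this slide
SLICES_PER_NODE = 2
STORAGE_PER_NODE_GB = 160
MAX_NODES = 32


def cluster_summary(nodes):
    """Return total slices and storage for a dc1.large cluster of the given size."""
    if not 1 <= nodes <= MAX_NODES:
        raise ValueError(f"dc1.large clusters support 1-{MAX_NODES} nodes")
    return {
        "slices": nodes * SLICES_PER_NODE,
        "storage_gb": nodes * STORAGE_PER_NODE_GB,
    }


project_cluster = cluster_summary(4)
largest_cluster = cluster_summary(MAX_NODES)

# Assumption: rows spread evenly across slices
rows_per_slice = 40_000_000 // project_cluster["slices"]

print(project_cluster)
print(largest_cluster)
print(rows_per_slice)
```

The 32-node result of 5120 GB matches the 5.12 TB total capacity in the table.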

Page 17: HospETL - Delivering a Healthcare Analytics Platform

Columnar Compression

Types | How it works | Use case | Examples
Raw | No compression | Large domain | Identifiers
Bytedict | Creates a dictionary of unique values | Limited unique values | Dept. code
LZO | Creates a dictionary of repeating character sequences | Very long character strings | Comments
Runlength | Stores repeated value counts | Consecutive repeating values | Dr code
Text255 & Text32k | Creates a dictionary of unique words for repeating text | Repeating words within a string | Address
Delta | Records the difference between values that follow each other | Consecutive integer values | Gender code
Mostly | Stores values in a smaller standard storage size | Column data type is larger than most values | BIGINT columns
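In Redshift, an encoding is chosen per column with the `ENCODE` keyword in `CREATE TABLE`, alongside the distribution and sort keys. The sketch below builds such a statement as a string. The table name, column names, types, and key choices are hypothetical illustrations and are not the schema used in this project. The pairing of columns to encodings follows the use cases in the table above.

```python
# Hypothetical columns: (name, SQL type, encoding). Not the author's schema.
COLUMNS = [
    ("encounter_id", "BIGINT", "RAW"),          # large domain identifier
    ("patient_id", "BIGINT", "MOSTLY32"),       # most values fit in 32 bits
    ("dept_code", "CHAR(4)", "BYTEDICT"),       # limited unique values
    ("doctor_code", "VARCHAR(8)", "RUNLENGTH"), # consecutive repeating values
    ("arrival_date", "DATE", "DELTA"),          # small differences between rows
    ("address", "VARCHAR(255)", "TEXT255"),     # repeating words within a string
    ("comments", "VARCHAR(4000)", "LZO"),       # very long character strings
]


def build_create_table(table, columns, distkey, sortkey):
    """Build a Redshift CREATE TABLE statement with per-column encodings."""
    names = {name for name, _, _ in columns}
    for key in (distkey, sortkey):
        if key not in names:
            raise ValueError(f"unknown key column: {key}")
    column_lines = ",\n".join(
        f"    {name} {sql_type} ENCODE {encoding}"
        for name, sql_type, encoding in columns
    )
    return (
        f"CREATE TABLE {table} (\n{column_lines}\n)\n"
        f"DISTKEY ({distkey})\n"
        f"SORTKEY ({sortkey});"
    )


ddl = build_create_table("encounter", COLUMNS, "patient_id", "arrival_date")
print(ddl)
```

Run against a loaded table, Redshift's `ANALYZE COMPRESSION` command reports suggested encodings, which can serve as a check on choices made by hand.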