Presenter: Date: Note: Company: eMail: Marc Bouma June 5, 2014 UMC Utrecht [email protected]

Data Vault ReConnect Speed Presenting PM Part Three

May 25, 2015


Data & Analytics

Hans Hultgren

Third set of 5x5 Speed Presenting Updates:

1) Research Data Platform - Marc Bouma
2) Same-As Struggles - Sander Robijns
3) Groups of Links - Kasper de Graaf
4) Ensemble Model & MPP - Juan-José van der Linden
5) Data Vault on SAP HANA - Remco Broekmans
Transcript
Page 1: Data Vault ReConnect Speed Presenting PM Part Three

Presenter: Marc Bouma
Date: June 5, 2014
Company: UMC Utrecht
eMail: [email protected]

Page 2: Data Vault ReConnect Speed Presenting PM Part Three

Our Dot on the Horizon

- Central point for delivering healthcare processes data for medical research

- Integrate various sources

- Historize, trace and pseudonymize all data used

Page 3: Data Vault ReConnect Speed Presenting PM Part Three

Our Journey

- Learning and adapting to Data Vault: not everybody is a modeler (Shu Ha Ri)

- Script, code, build, try, test, throw away and start again

- Testing overrated?

- Architecture improvements: performance issues SAS/Microsoft, performance issues loading scripts, automate DV load

- From Chaos to SCRUM

Page 4: Data Vault ReConnect Speed Presenting PM Part Three

Our Obstacles

- Registration for healthcare process vs. usability for research

- Questionnaires: sources or generic models?

- Performance: Do we really need all complete texts? Do we really need 20 years of lab results?

- The usual: conflicting interests, politics etc.

Page 5: Data Vault ReConnect Speed Presenting PM Part Three

Our preliminary results

- 2013: selection of 5 major Studies as starting showcases proved difficult

- 2014: had to choose 5 new showcases from 25 applicants

- Started as Research Data Platform, now growing towards Enterprise Data Platform (including Education and BI)

- Architecture now stable

Page 6: Data Vault ReConnect Speed Presenting PM Part Three

Lessons learned

• Automate when possible
• Invest in a team of skilled pioneers
• Models rule everything
• Adapt agility, teach agility

Page 7: Data Vault ReConnect Speed Presenting PM Part Three

Presenter: Sander Robijns
Date: June 5, 2014
Company: Estrenuo BVBA
eMail: [email protected]
Twitter: @srobijns

Page 8: Data Vault ReConnect Speed Presenting PM Part Three

The Issue

No enterprise-wide business keys

Page 9: Data Vault ReConnect Speed Presenting PM Part Three

The Current Approach

Using recursive links on hubs to identify the same-as relationship

Page 10: Data Vault ReConnect Speed Presenting PM Part Three

The Struggle

Getting the facts reported under a single business key
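
To make the struggle concrete, here is a minimal sketch of one way to roll facts up to a single business key once same-as pairs are known. It is not the presenter's implementation: the key values, the `resolve_same_as` helper and the "smallest key wins" rule are all assumptions made for illustration.

```python
from collections import defaultdict

def resolve_same_as(pairs):
    """Map every business key to one master key, given same-as pairs (union-find)."""
    parent = {}

    def find(key):
        parent.setdefault(key, key)
        while parent[key] != key:
            parent[key] = parent[parent[key]]  # path compression
            key = parent[key]
        return key

    for left, right in pairs:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            # Assumption: the alphabetically smallest key becomes the master.
            master, other = sorted((root_left, root_right))
            parent[other] = master
    return {key: find(key) for key in parent}

# Hypothetical rows of a recursive same-as link on one hub.
same_as_link = [("CRM-17", "ERP-A9"), ("ERP-A9", "WEB-3"), ("CRM-42", "WEB-8")]
master_key = resolve_same_as(same_as_link)

# Hypothetical facts, each recorded under a source-specific business key.
facts = [("ERP-A9", 100), ("WEB-3", 50), ("CRM-42", 10)]
totals = defaultdict(int)
for business_key, amount in facts:
    totals[master_key.get(business_key, business_key)] += amount

print(dict(totals))
```

The transitive chain (CRM-17, ERP-A9, WEB-3) is the part that makes reporting awkward: a single same-as lookup is not enough, the pairs have to be followed until a master is reached.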

Page 11: Data Vault ReConnect Speed Presenting PM Part Three

The Future Approach

Master Data Management will take away some of the struggles

Page 12: Data Vault ReConnect Speed Presenting PM Part Three

The Lesson Learned

Get the enterprise-wide business keys in place first using data governance

Page 13: Data Vault ReConnect Speed Presenting PM Part Three

Presenter: Kasper de Graaf
Date: June 5, 2014
Company: Occurro
eMail: [email protected]
Twitter: kdgraaf

Page 14: Data Vault ReConnect Speed Presenting PM Part Three

Groups of Links: context at hospital

Imagine the following:

• An operation (surgery) is executed by a group of people (first surgeon, second surgeon, assistant, anesthesiologist, etc.)

• An operation is planned a couple of weeks in advance

• Whenever the planning changes in the source the complete group is sent to the EDW

Page 15: Data Vault ReConnect Speed Presenting PM Part Three

Group of Links: the Data

{Time}  operation_no  employee_no  role
T=1     19354         John         OP1
T=1     19354         Jane         OP2
T=1     19354         Chris        ANA
T=2     19354         John         OP1
T=2     19354         Mary         ANA
T=3     19354         Jane         OP1
T=3     19354         Chris        ANA

Please note: the actual operation with operation_no 19354 is executed by Jane (OP1) and Chris (ANA)

Page 16: Data Vault ReConnect Speed Presenting PM Part Three

Groups of Links: the Problem

Standard Data Vault loading routines cannot handle this situation:

operation_no  employee_no  role  load_dts
19354         John         OP1   T=1
19354         Jane         OP2   T=1
19354         Chris        ANA   T=1
19354         Mary         ANA   T=2
19354         Jane         OP1   T=3
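
The five rows above are what an insert-only link load produces from the three deliveries. A minimal sketch, assuming the link key is the full combination operation_no + employee_no + role and that a row is only inserted the first time that combination is seen (the function and variable names are made up for this example):

```python
from typing import Dict, List, Tuple

LinkKey = Tuple[int, str, str]  # (operation_no, employee_no, role)

def standard_link_load(link: Dict[LinkKey, int], staged: List[LinkKey], load_dts: int) -> None:
    """Insert-only load: each key combination is stored once, with its first load_dts."""
    for key in staged:
        link.setdefault(key, load_dts)

# The three deliveries from the source system (T=1, T=2, T=3).
deliveries = {
    1: [(19354, "John", "OP1"), (19354, "Jane", "OP2"), (19354, "Chris", "ANA")],
    2: [(19354, "John", "OP1"), (19354, "Mary", "ANA")],
    3: [(19354, "Jane", "OP1"), (19354, "Chris", "ANA")],
}

link_table: Dict[LinkKey, int] = {}
for load_dts, staged in deliveries.items():
    standard_link_load(link_table, staged, load_dts)

for (operation_no, employee_no, role), load_dts in link_table.items():
    print(operation_no, employee_no, role, f"T={load_dts}")
```

Nothing in the resulting table says that only Jane (OP1) and Chris (ANA) belong to the latest planning: rows that dropped out of the group are never marked, and a row that comes back (Chris/ANA at T=3) leaves no trace because it already exists.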

Page 17: Data Vault ReConnect Speed Presenting PM Part Three

Groups of Links: the Problem

Using end-dating of a link (preferably a validity satellite) cannot handle this problem either:

operation_no  employee_no  role  load_dts  Active?
19354         John         OP1   T=1       No (T=3)
19354         Jane         OP2   T=1       Yes
19354         Chris        ANA   T=1       No (T=3)
19354         Mary         ANA   T=2       Yes
19354         Jane         OP1   T=3       Yes

BK of link used: operation_no + role

Page 18: Data Vault ReConnect Speed Presenting PM Part Three

Groups of Links: our solution

1. Add a validity satellite to the link (for end-dating)
2. Tell the metadata of the automation tool this is a group validity satellite with BK=operation_no
3. Whenever an existing operation_no is present in the staging layer, set all current links to Active=No
4. Process as usual

• Remark: because the same row can come back (e.g. John/OP1), it will be set to Active=No and Active=Yes at the same time. There can therefore be no unique index on the BK of the validity satellite, and some cleaning up is required after loading.
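
The four steps can be sketched in a few lines. This is an illustration only, not the code from the PDI framework mentioned in the deck: the in-memory list standing in for the validity satellite, the record layout and the function names are assumptions.

```python
from typing import Dict, List, Set, Tuple

LinkKey = Tuple[int, str, str]  # (operation_no, employee_no, role)

def current_links(validity: List[dict]) -> Set[LinkKey]:
    """Return the links whose most recently appended validity record is active."""
    latest: Dict[LinkKey, bool] = {}
    for record in validity:  # records are appended in load order
        latest[record["link"]] = record["active"]
    return {key for key, active in latest.items() if active}

def load_group_batch(validity: List[dict], staged: List[LinkKey], load_dts: int) -> None:
    """Step 3: deactivate the whole group. Step 4: process the staged rows as usual."""
    staged_groups = {operation_no for operation_no, _, _ in staged}
    for key in sorted(current_links(validity)):
        if key[0] in staged_groups:  # group BK = operation_no
            validity.append({"link": key, "load_dts": load_dts, "active": False})
    for key in staged:
        validity.append({"link": key, "load_dts": load_dts, "active": True})

group_validity: List[dict] = []
load_group_batch(group_validity, [(19354, "John", "OP1"), (19354, "Jane", "OP2"), (19354, "Chris", "ANA")], 1)
load_group_batch(group_validity, [(19354, "John", "OP1"), (19354, "Mary", "ANA")], 2)
load_group_batch(group_validity, [(19354, "Jane", "OP1"), (19354, "Chris", "ANA")], 3)

print(sorted(current_links(group_validity)))
```

After the third delivery only Jane (OP1) and Chris (ANA) are current, which matches the actual operation. The sketch also shows the situation from the remark: John/OP1 receives an Active=No and an Active=Yes record with the same load_dts at T=2, so a unique index on link key + load_dts would fail and the redundant pair has to be cleaned up afterwards.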

Page 19: Data Vault ReConnect Speed Presenting PM Part Three

Groups of Links: special thanks to …

St. Antonius Hospital (for having the problem)
Edwin Weber (for coding the solution)
Get your copy of the solution: http://sourceforge.net/projects/pdidatavaultfw/

Page 20: Data Vault ReConnect Speed Presenting PM Part Three

Presenter: Juan-José van der Linden
Date: June 5, 2014
Note: DV, MPP
Company: QOSQO
eMail: [email protected]
Twitter: @delostilos

Page 21: Data Vault ReConnect Speed Presenting PM Part Three

SMP => MPP => AMPP

SMP: Symmetric Multiprocessing
MPP: Massively Parallel Processing
AMPP: Asymmetric MPP (SMP + MPP)

Page 22: Data Vault ReConnect Speed Presenting PM Part Three

Primary key => distribution key

hub -< satellite join
- data redistribution
- join local in parallel

Hub:
BK           SID
Ensemble     1
Dimensional  2

Satellite:
SID  LDTS        INFO
1    2001-01-01  My first DV
1    2014-06-05  DV Masters
2    1997-08-02  DM manifesto

Node 1 Node 2

Page 23: Data Vault ReConnect Speed Presenting PM Part Three

Hub SID => distribution key

hub -< satellite join
- join local in parallel

Hub:
BK           SID
Ensemble     1
Dimensional  2

Satellite:
SID  LDTS        INFO
1    2001-01-01  First DV
1    2014-06-05  DV Masters
2    1997-08-02  DM manifesto

Node 1 Node 2
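
The difference between the two slides can be illustrated with a small simulation. This is a sketch under stated assumptions: `node_for` is a made-up placement rule (a CRC32 hash modulo the node count), since real MPP engines use their own hashing, and the hub is assumed to be distributed on its SID in both cases.

```python
import zlib

NODES = 2  # the slides show two nodes

def node_for(key) -> int:
    """Illustrative placement rule: a stable hash of the distribution key."""
    return zlib.crc32(repr(key).encode("utf-8")) % NODES

hub = [{"bk": "Ensemble", "sid": 1}, {"bk": "Dimensional", "sid": 2}]
satellite = [
    {"sid": 1, "ldts": "2001-01-01", "info": "My first DV"},
    {"sid": 1, "ldts": "2014-06-05", "info": "DV Masters"},
    {"sid": 2, "ldts": "1997-08-02", "info": "DM manifesto"},
]

def rows_to_redistribute(satellite_key) -> int:
    """Count satellite rows that do not sit on the node of their hub row."""
    hub_node = {row["sid"]: node_for(row["sid"]) for row in hub}
    return sum(1 for row in satellite if node_for(satellite_key(row)) != hub_node[row["sid"]])

# Satellite distributed on its primary key (SID + LDTS): placement is unrelated to the hub.
moved_pk = rows_to_redistribute(lambda row: (row["sid"], row["ldts"]))
# Satellite distributed on the hub SID: every row lands next to its hub row.
moved_sid = rows_to_redistribute(lambda row: row["sid"])

print("distributed on primary key:", moved_pk, "row(s) to move before the join")
print("distributed on hub SID:", moved_sid, "row(s) to move before the join")
```

With the hub SID as distribution key on both tables the count is zero by construction, because the same key goes through the same placement rule. With the primary key, whether a given row has to move depends on the hash, so the join can no longer be guaranteed to run locally.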

Page 24: Data Vault ReConnect Speed Presenting PM Part Three

Link SID => distribution key

Default L_SID, 1:N & N:M
- data redistribution
- join local in parallel

Link:
H_MID  H_SID  L_SID
1      A      1
1      B      2

Link satellite:
L_SID  LDTS        LDTS_END    CURRENT
1      2001-01-01  2006-01-01  N
1      2014-06-05  9999-12-31  Y
2      2006-01-01  2014-06-05  N

Link:
H_MID  H_SID  L_SID
1      A      1
1      B      2

Link satellite:
L_SID  H_MID  H_SID  LDTS        LDTS_END
1      1      A      2001-01-01  2006-01-01
1      1      B      2014-06-05  9999-12-31
2      1      A      2006-01-01  2014-06-05

1:N => H_MID on link satellite
- join local in parallel
H_MID is the ensemble identifier!

Node 1 Node 2

Page 25: Data Vault ReConnect Speed Presenting PM Part Three

Use the ensemble identifier if possible!

H_SID H_SID LDTS INFO

L_SID? H_SID H_MID H_SID ? L_SID ? LDTS INFO

Distributing data efficiently to ensure good performance in an MPP database.
- If distribution is uneven, one node may become a bottleneck for the whole execution

Try to minimize data movement between nodes
- Data redistribution may occur when joining tables

Ensemble
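
The bottleneck caused by uneven distribution can be checked before choosing a distribution key. A minimal sketch, assuming integer surrogate keys and a simple modulo placement; the `skew` helper, the node count and the sample key sets are all hypothetical.

```python
from collections import Counter
from typing import List

def skew(keys: List[int], nodes: int) -> float:
    """Largest node's row count divided by the ideal even share (1.0 = perfectly even)."""
    per_node = Counter(key % nodes for key in keys)
    return max(per_node.values()) / (len(keys) / nodes)

NODES = 4

# A key with many distinct values spreads the rows evenly.
even_keys = list(range(1000))
# A key where 90% of the rows share one value piles them onto a single node.
skewed_keys = [7] * 900 + list(range(100))

print("even key skew:", skew(even_keys, NODES))
print("skewed key skew:", skew(skewed_keys, NODES))
```

A skew close to 1.0 means every node does a similar amount of work. The further above 1.0 it climbs, the longer the busiest node keeps the whole query waiting, so a co-locating key such as the ensemble identifier is only a good choice when its values are also spread reasonably evenly.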

Page 26: Data Vault ReConnect Speed Presenting PM Part Three

Presenter: Remco Broekmans
Date: June 5, 2014
Note: Example for ReConnect
Company: Coarem
eMail: [email protected]
Twitter: RemcoBroekmans

Page 27: Data Vault ReConnect Speed Presenting PM Part Three

SAP #Hana is a column store #database which brings #efficiency in storage and access - #in-memory.

Page 28: Data Vault ReConnect Speed Presenting PM Part Three

SAP #Hana seems to benefit from its technical #architecture when using 1 broad Satellite per #Hub - #benefit: no need for #PIT, fewer tables

Page 29: Data Vault ReConnect Speed Presenting PM Part Three

Splitting #Sat’s in #rate-of-change as efficient in storage as column store

#multiple Sat’s to prefer if data coming from multiple sources (#write efficiency)

Page 30: Data Vault ReConnect Speed Presenting PM Part Three

#referential join will only perform the join if data from the joined tables is used

create 1 #PIT per #Hub (not as #SQL view)

Page 31: Data Vault ReConnect Speed Presenting PM Part Three

#Lesson: DV is #efficient way of storing data

#Lesson: #SQL views can’t be read by Hana Studio

#Lesson: #Hana is still evolving