Data Vault ReConnect Speed Presenting PM Part Three

Presenter: Marc Bouma
Date: June 5, 2014
Company: UMC Utrecht
eMail: [email protected]

Our Dot on the Horizon

- Central point for delivering healthcare processes data for medical research

- Integrate various sources

- Historize, trace and pseudonymize all data used

Our Journey

- Learning and adapting to Data Vault: not everybody is a modeler (Shu Ha Ri)

- Script, code, build, try, test, throw away and start again

- Testing overrated?

- Architecture improvements
  - Performance issues SAS/Microsoft
  - Performance issues loading scripts
  - Automate DV load

- From Chaos to SCRUM

Our Obstacles

- Registration for healthcare process vs. usability for research

- Questionnaires: sources or generic models?

- Performance: Do we really need all complete texts? Do we really need 20 years of lab results?

- The usual: conflicting interests, politics etc.

Our preliminary results

- 2013: selection of 5 major Studies as starting showcases proved difficult

- 2014: had to choose 5 new showcases from 25 applicants

- Started as Research Data Platform, now growing towards Enterprise Data Platform (including Education and BI)

- Architecture now stable

Lessons learned

• Automate when possible

• Invest in a team of skilled pioneers

• Models rule everything

• Adapt agility, teach agility

Presenter: Sander Robijns
Date: June 5, 2014
Company: Estrenuo BVBA
eMail: [email protected]
Twitter: @srobijns

The Issue

No enterprise-wide business keys

The Current Approach

Using recursive links on hubs to identify the same-as relationship
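As an illustration only, the same-as relationship can be followed until every business key maps to one representative key per group. The sketch below assumes the recursive link is available as pairs of business keys; the function name `resolve_master_keys`, the example keys, and the rule of keeping the smallest key as representative are hypothetical and not taken from the presentation.

```python
from typing import Dict, Iterable, Tuple


def resolve_master_keys(same_as_links: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Map every business key to one representative key per same-as group."""
    parent: Dict[str, str] = {}

    def find(key: str) -> str:
        # Follow parent pointers to the group's root, shortening the path on the way.
        parent.setdefault(key, key)
        while parent[key] != key:
            parent[key] = parent[parent[key]]
            key = parent[key]
        return key

    for left, right in same_as_links:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            # Assumption: the smallest key becomes the representative of the merged group.
            keep, drop = sorted((root_left, root_right))
            parent[drop] = keep

    return {key: find(key) for key in list(parent)}


# Hypothetical same-as pairs from three source systems.
links = [("CRM-17", "ERP-A9"), ("ERP-A9", "WEB-3"), ("CRM-40", "ERP-B2")]
master = resolve_master_keys(links)
```

Facts recorded under any key of a group can then be reported under `master[key]`. This only works as well as the same-as links themselves, which is why governed enterprise-wide keys remain the preferable starting point.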

The Struggle

Getting the facts reported under a single business key

The Future Approach

Master Data Management will take away some of the struggles

The Lesson Learned

Get the enterprise-wide business keys in place first using data governance

Presenter: Kasper de Graaf
Date: June 5, 2014
Company: Occurro
eMail: [email protected]
Twitter: kdgraaf

Groups of Links: context at hospital

Imagine the following:

• An operation (surgery) is executed by a group of people (first surgeon, second surgeon, assistant, anesthesiologist, etc.)

• An operation is planned a couple of weeks in advance

• Whenever the planning changes in the source, the complete group is sent to the EDW

Groups of Links: the Data

{Time}  operation_no  employee_no  role
T=1     19354         John         OP1
        19354         Jane         OP2
        19354         Chris        ANA
T=2     19354         John         OP1
        19354         Mary         ANA
T=3     19354         Jane         OP1
        19354         Chris        ANA

Please note: the actual operation with operation_no 19354 is executed by Jane (OP1) and Chris (ANA)

Groups of Links: the Problem

Standard Data Vault loading routines cannot handle this situation:

operation_no  employee_no  role  load_dts
19354         John         OP1   T=1
19354         Jane         OP2   T=1
19354         Chris        ANA   T=1
19354         Mary         ANA   T=2
19354         Jane         OP1   T=3
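To make the problem concrete, here is a minimal sketch of an insert-only link load applied to the three deliveries. The function `load_link_insert_only` and the tuple layout are assumptions for illustration, not the loading routine of any particular tool.

```python
from typing import List, Tuple

LinkKey = Tuple[int, str, str]  # (operation_no, employee_no, role)
LinkRow = Tuple[LinkKey, int]   # (link key, load_dts)


def load_link_insert_only(link: List[LinkRow], staged: List[LinkKey], load_dts: int) -> None:
    """Append link keys that are not in the link yet; never update or delete."""
    known = {key for key, _ in link}
    for key in staged:
        if key not in known:
            link.append((key, load_dts))
            known.add(key)


# The three deliveries of the complete group, keyed by load time T.
deliveries = {
    1: [(19354, "John", "OP1"), (19354, "Jane", "OP2"), (19354, "Chris", "ANA")],
    2: [(19354, "John", "OP1"), (19354, "Mary", "ANA")],
    3: [(19354, "Jane", "OP1"), (19354, "Chris", "ANA")],
}

link: List[LinkRow] = []
for load_dts, staged in sorted(deliveries.items()):
    load_link_insert_only(link, staged, load_dts)
```

The resulting `link` holds the same five rows as the table above. Nothing in it says that John, Mary and Jane/OP2 were dropped from the group, so the final team (Jane as OP1, Chris as ANA) cannot be derived from the link alone.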

Groups of Links: the Problem

Using end-dating of a link (preferably with a validity satellite) cannot handle this problem either:

operation_no  employee_no  role  load_dts  Active?
19354         John         OP1   T=1       No (T=3)
19354         Jane         OP2   T=1       Yes
19354         Chris        ANA   T=1       No (T=3)
19354         Mary         ANA   T=2       Yes
19354         Jane         OP1   T=3       Yes

BK of link used: operation_no + role

Groups of Links: our solution

1. Add a validity satellite to the link (for end-dating)

2. Tell the metadata of the automation tool this is a group validity satellite with BK=operation_no

3. Whenever an existing operation_no is present in the staging layer, set all current links to Active=No

4. Process as usual

• Remark: because the same row can come back (e.g. John/OP1), it will be set to Active=No and Active=Yes at the same time. There can therefore be no unique index on the BK of the validity satellite, and some cleaning up is required after loading.
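The steps above can be sketched roughly as follows. This is not the code from the referenced framework: the insert-only satellite layout, the function names, and the cleanup rule (dropping the contradictory No/Yes pair of a link that was already active) are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

LinkKey = Tuple[int, str, str]     # (operation_no, employee_no, role)
SatRow = Tuple[LinkKey, int, bool]  # (link key, load_dts, active)


def current_state(sat: List[SatRow]) -> Dict[LinkKey, bool]:
    """Latest Active flag per link key; rows are assumed to be in load order."""
    state: Dict[LinkKey, bool] = {}
    for key, _, active in sat:
        state[key] = active
    return state


def load_group_validity(sat: List[SatRow], staged: List[LinkKey], load_dts: int) -> None:
    state = current_state(sat)
    staged_ops = {key[0] for key in staged}  # group BK = operation_no

    # Step 3: set all current links of every delivered operation to Active=No.
    for key, active in state.items():
        if active and key[0] in staged_ops:
            sat.append((key, load_dts, False))

    # Step 4: process as usual, every staged link becomes Active=Yes.
    for key in staged:
        sat.append((key, load_dts, True))

    # Cleanup (assumed rule): a link that was active and came back now has a
    # No row and a Yes row for the same load; remove both, nothing changed.
    unchanged = {key for key, active in state.items() if active} & set(staged)
    sat[:] = [row for row in sat if not (row[1] == load_dts and row[0] in unchanged)]


deliveries = {
    1: [(19354, "John", "OP1"), (19354, "Jane", "OP2"), (19354, "Chris", "ANA")],
    2: [(19354, "John", "OP1"), (19354, "Mary", "ANA")],
    3: [(19354, "Jane", "OP1"), (19354, "Chris", "ANA")],
}

sat: List[SatRow] = []
for load_dts, staged in sorted(deliveries.items()):
    load_group_validity(sat, staged, load_dts)

active_now = {key for key, active in current_state(sat).items() if active}
```

With the example deliveries, `active_now` ends up containing only Jane/OP1 and Chris/ANA, which matches the team that actually executed operation 19354.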

Groups of Links: special thanks to …

St. Antonius Hospital (for having the problem)

Edwin Weber (for coding the solution)

Get your copy of the solution: http://sourceforge.net/projects/pdidatavaultfw/

Presenter: Juan-José van der Linden
Date: June 5, 2014
Note: DV, MPP
Company: QOSQO
eMail: [email protected]
Twitter: @delostilos

SMP => MPP => AMPP

SMP: Symmetric Multiprocessing
MPP: Massively Parallel Processing
AMPP: Asymmetric MPP (SMP + MPP)

Primary key => distribution key

hub -< satellite join
- data redistribution
- join local in parallel

Hub
BK           SID
Ensemble     1
Dimensional  2

Satellite
SID  LDTS        INFO
1    2001-01-01  My first DV
1    2014-06-05  DV Masters
2    1997-08-02  DM manifesto

Node 1 Node 2

Hub SID => distribution key

hub -< satellite join
- join local in parallel

Hub
BK           SID
Ensemble     1
Dimensional  2

Satellite
SID  LDTS        INFO
1    2001-01-01  First DV
1    2014-06-05  DV Masters
2    1997-08-02  DM manifesto

Node 1 Node 2
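The difference between distributing the satellite on its primary key (SID + LDTS) and on the hub SID can be illustrated with a toy model. The hash function below (`zlib.crc32` modulo the node count) is only a stand-in for whatever a real MPP database uses, and the helper names are hypothetical.

```python
import zlib


def node_for(distribution_value, node_count: int = 2) -> int:
    """Toy hash distribution: stable hash of the key value modulo the node count."""
    return zlib.crc32(repr(distribution_value).encode()) % node_count


hub = [("Ensemble", 1), ("Dimensional", 2)]  # (BK, SID), distributed on SID
sat = [
    (1, "2001-01-01", "My first DV"),        # (SID, LDTS, INFO)
    (1, "2014-06-05", "DV Masters"),
    (2, "1997-08-02", "DM manifesto"),
]


def rows_needing_redistribution(sat_distribution_key) -> int:
    """Count satellite rows stored on a different node than their hub row."""
    hub_node = {sid: node_for(sid) for _, sid in hub}
    return sum(1 for row in sat if node_for(sat_distribution_key(row)) != hub_node[row[0]])


# Satellite distributed on its full primary key (SID, LDTS).
by_primary_key = rows_needing_redistribution(lambda row: (row[0], row[1]))

# Satellite distributed on the hub SID only.
by_hub_sid = rows_needing_redistribution(lambda row: row[0])
```

`by_hub_sid` is always 0, because a satellite row hashes to the same node as its hub row and the join can run locally on each node. `by_primary_key` depends on the hash and may be anything from 0 to the number of satellite rows; every row counted there has to move before the join.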

Link SID => distribution key

Default L_SID, 1:N & N:M
- data redistribution
- join local in parallel

Link
H_MID  H_SID  L_SID
1      A      1
1      B      2

Link satellite
L_SID  LDTS        LDTS_END    CURRENT
1      2001-01-01  2006-01-01  N
1      2014-06-05  9999-12-31  Y
2      2006-01-01  2014-06-05  N

1:N => H_MID on link satellite
- join local in parallel

H_MID is the ensemble identifier!

Link
H_MID  H_SID  L_SID
1      A      1
1      B      2

Link satellite
L_SID  H_MID  H_SID  LDTS        LDTS_END
1      1      A      2001-01-01  2006-01-01
1      1      B      2014-06-05  9999-12-31
2      1      A      2006-01-01  2014-06-05

Node 1 Node 2

Use the ensemble identifier if possible!

(diagram: hub with H_SID; satellite with H_SID, LDTS, INFO; link with L_SID?, H_SID, H_MID; link satellite with H_SID?, L_SID?, LDTS, INFO)

Distribute data efficiently to ensure good performance in an MPP database.
- If the distribution is uneven, one node may become a bottleneck for the whole execution

Try to minimize data movement between nodes.
- Data redistribution may occur when joining tables
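A quick way to reason about uneven distribution is to count rows per node for a candidate distribution key. The sketch below again uses `zlib.crc32` as a stand-in hash; the functions `rows_per_node` and `skew` and the sample data are assumptions for illustration.

```python
import zlib
from collections import Counter
from typing import Iterable, List


def rows_per_node(key_values: Iterable, node_count: int) -> List[int]:
    """Number of rows each node would receive under toy hash distribution."""
    counts = Counter(zlib.crc32(repr(v).encode()) % node_count for v in key_values)
    return [counts.get(node, 0) for node in range(node_count)]


def skew(counts: List[int]) -> float:
    """Largest node load relative to a perfectly even load (1.0 means even)."""
    total = sum(counts)
    return max(counts) / (total / len(counts)) if total else 1.0


# A low-cardinality key such as a CURRENT flag: 990 rows 'N', 10 rows 'Y'.
flag_counts = rows_per_node(["N"] * 990 + ["Y"] * 10, node_count=4)

# A high-cardinality key such as a surrogate id.
sid_counts = rows_per_node(range(1000), node_count=4)
```

With only two distinct values, at most two of the four nodes receive any rows, so `skew(flag_counts)` is close to 4 and one node does nearly all the work. A surrogate id spreads rows over all nodes, which is one reason hub or ensemble identifiers tend to be good distribution keys.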


Presenter: Remco Broekmans
Date: June 5, 2014
Note: Example for ReConnect
Company: Coarem
eMail: [email protected]
Twitter: RemcoBroekmans

SAP #Hana is a column store #database which brings #efficiency in storage and access - #in-memory.

SAP #Hana, given its technical #architecture, seems to benefit from using 1 broad Satellite per #Hub - #benefit: no need for #PIT, fewer tables

Splitting #Sat’s by #rate-of-change: as efficient in storage as column store

#multiple Sat’s are preferable if data comes from multiple sources (#write efficiency)

#referential join will only perform the join if data from the joined tables is used

create 1 #PIT per #Hub (not as #SQL view)

#Lesson: DV is an #efficient way of storing data

#Lesson: #SQL views can’t be read by Hana Studio

#Lesson: #Hana is still evolving
