Top Banner
When Facts and Dimensions Alone Aren't the Answer: Logically Reversing the Star Schema Kathryn Watson
55

When Facts and Dimensions Alone Aren't the Answer: Logically Reversing the Star Schema

Oct 19, 2014

Download

Technology

What is Reverse Star Schema?

Why and when would I use a Reverse Star Schema?

How would I implement a Reverse Star Schema?

What about data integrity?

In this slideshare, we'll walk through a real life implementation within the healthcare industry.
Welcome message from author
This document is posted to help you gain knowledge. Please leave a comment to let me know what you think about it! Share it to your friends and learn new things together.
Transcript
Page 1: When Facts and Dimensions Alone Aren't the Answer: Logically Reversing the Star Schema

When Facts and Dimensions Alone Aren't the Answer:

Logically Reversing the Star Schema

Kathryn Watson

Page 2: When Facts and Dimensions Alone Aren't the Answer: Logically Reversing the Star Schema

Solution Architect at Perficient

Economist & Financial Analyst by education,

BI Professional by passion

Email: [email protected]

LinkedIn: www.linkedin.com/pub/kathryn-watson/a/431/354/

Kathryn Watson

Page 3: When Facts and Dimensions Alone Aren't the Answer: Logically Reversing the Star Schema

What is a Reverse Star Schema?

Why and when would I use a Reverse Star Schema?

How would I implement a Reverse Star Schema?

What about data integrity?

While addressing these questions, we’ll walk through a

real life implementation within the healthcare industry.

The Questions We’ll Answer Today:

Page 4: When Facts and Dimensions Alone Aren't the Answer: Logically Reversing the Star Schema

The Star Schema has long been the foundation of

effective BI applications● Optimized for the speedy return of business facts and their

corresponding descriptors

● Supports the manner in which businesses ask typical business

questions

Let’s Start With the Basics

Page 5: When Facts and Dimensions Alone Aren't the Answer: Logically Reversing the Star Schema

Show me A, filtered by B and aggregated by C & D

Let’s Start With the Basics (cntd.)

A

B C

D

Page 6: When Facts and Dimensions Alone Aren't the Answer: Logically Reversing the Star Schema

Show me Z, filtered and aggregated by W, X & Y

Let’s Start With the Basics (cntd.)

Z

W X

Y

Page 7: When Facts and Dimensions Alone Aren't the Answer: Logically Reversing the Star Schema

Data warehouses commonly include multiple facts

These multiple facts often share dimensions, called

conformed dimensions

Let’s Start With the Basics (cntd.)

Page 8: When Facts and Dimensions Alone Aren't the Answer: Logically Reversing the Star Schema

Conformed dimensions facilitate answering multi-

faceted business questions.

Show me A & Z, aggregated and/or filtered by N

Let’s Start With the Basics (cntd.)

A

B C

D

Z

W X

Y

N

Page 9: When Facts and Dimensions Alone Aren't the Answer: Logically Reversing the Star Schema

As BI implementations mature, the questions your

users are asking often do as well.

Show me Z, filtered by B & C and aggregated by Y

And Continue Along the BI Maturity Model

A

B C

D

Z

W X

Y

N

Page 10: When Facts and Dimensions Alone Aren't the Answer: Logically Reversing the Star Schema

In this scenario:● B and C are not dimensions of Z

● B and C are not subsets of Y

● B and C are not directly related to the Fact (Z) they are required to

affect

And Continue Along the BI Maturity Model (cntd.)

A

B C

D

Z

W X

Y

N

Page 11: When Facts and Dimensions Alone Aren't the Answer: Logically Reversing the Star Schema

We’ve now encountered a very practical use case for

the Reverse Star Schema.

And Continue Along the BI Maturity Model (cntd.)

Page 12: When Facts and Dimensions Alone Aren't the Answer: Logically Reversing the Star Schema

A typical Reverse Star Schema flips the concept of

what information is being returned to the user. Rather

than factual information being the star of the show, the

dimensional information is the main event.

In all Reverse Star Schemas, the relationships between

facts and dimensions are flipped; rather than a one-to-

many dimension to fact relationship, the Reverse Star

Schema introduces a one-to-many or many-to-many

fact to dimension relationship.

So, What is a Reverse Star Schema?

Page 13: When Facts and Dimensions Alone Aren't the Answer: Logically Reversing the Star Schema

Sometimes, the return of the intersection of

dimensional data is truly all the user needs

More often, the Reverse Star Schema is the avenue

to establish cause-and-effect relationships amongst

indirectly related data concepts

When and Why Would I Use This?

Page 14: When Facts and Dimensions Alone Aren't the Answer: Logically Reversing the Star Schema

Let’s return to the scenario we showcased previously:

show me Z, filtered by B & C and aggregated by Y

When and Why Would I Use This? (cntd.)

A

B C

D

Z

W X

Y

N

Page 15: When Facts and Dimensions Alone Aren't the Answer: Logically Reversing the Star Schema

B & C’s relationship to Z is established through fact A

and conformed dimension N

When and Why Would I Use This? (cntd.)

A

B C

D

Z

W X

Y

N

Page 16: When Facts and Dimensions Alone Aren't the Answer: Logically Reversing the Star Schema

The values of A & N aren’t of interest to the user

A & N are simply avenues through which the needed

attributes are accessed

When and Why Would I Use This? (cntd.)

A

B C

D

Z

W X

Y

N

Page 17: When Facts and Dimensions Alone Aren't the Answer: Logically Reversing the Star Schema

Let’s apply this concept to the real world:

When and Why Would I Use This? (cntd.)

One of the nation’s top cancer care centers was seeking an on-

demand means to identify patient cohorts.

Physicians should be able to refine cohorts by basic tumor

information and various disconnected factors of cancer care:

Tumors – initial presentation date and type

Histologies – cancer type and structure over time

Stagings – cancer severity

Statuses – active, in remission, etc.

Treatments – chemotherapy, radiation, other

A filter applied to any of these concepts should affect the results

of all concepts, even though the concepts are independent of

each other.

Page 18: When Facts and Dimensions Alone Aren't the Answer: Logically Reversing the Star Schema

Let’s apply this concept to the real world (cntd.):During the analysis phase, it was confirmed that everything was

directly related to the patient’s instance of cancer (Tumor).

All other concepts, though factual concepts in their own right, for

this purpose were informational.

It’s all about the Tumor. Everything else is informational.

When and Why Would I Use This? (cntd.)

Page 19: When Facts and Dimensions Alone Aren't the Answer: Logically Reversing the Star Schema

Much like our use case scenario, a multi-fact Star Schema

emerged.

● 5 Facts with Independent Dimensions● Tumors/Documents (F_CLNCL_DOC)

● Histologies (F_TMR_HISTOL)

● Stagings (F_TMR_STG)

● Statuses (F_TMR_STAT)

● Treatments (F_TMR_TRTMT)

● 1 Conformed Dimension shared by all facts● Tumors (D_PT_TMR)

Let’s apply this concept to the real world (cntd.):

When and Why Would I Use This? (cntd.)

Page 20: When Facts and Dimensions Alone Aren't the Answer: Logically Reversing the Star Schema

Let’s apply this concept to the real world (cntd.):

When and Why Would I Use This? (cntd.)

Page 21: When Facts and Dimensions Alone Aren't the Answer: Logically Reversing the Star Schema

Implemented in OBIEE, the Dimensional Model = OBIEE Physical

Layer, except for:

● EIW Dimensions● Date dimension aliased for each of its purposes (17 aliases)

● Provider dimension aliased for each of its purposes (3 aliases)

● 4 Additional Facts● Users stated the requirement to analyze treatments as separate entities for some use-

cases, and as consolidated activities in other use-cases

● This is achieved by placing database views on top of F_TMR_TRTMT, with one view

for each treatment type

Let’s apply this concept to the real world (cntd.):

When and Why Would I Use This? (cntd.)

Page 22: When Facts and Dimensions Alone Aren't the Answer: Logically Reversing the Star Schema

Let’s apply this concept to the real world (cntd.):

When and Why Would I Use This? (cntd.)

Page 23: When Facts and Dimensions Alone Aren't the Answer: Logically Reversing the Star Schema

Let’s apply this concept to the real world (cntd.):

When and Why Would I Use This? (cntd.)

But the classic Star Schema did not support our main

requirement:

A filter applied to any of these concepts should affect the results of all

concepts, even though the concepts are independent of each other.

It’s all about the Tumor. Everything else is informational.

In the classic Star Schema design, each fact and its dimensions

are independent of the other facts.

Filters applied to the dimensions of one fact do not affect the

results returned in another fact.

Page 24: When Facts and Dimensions Alone Aren't the Answer: Logically Reversing the Star Schema

Ok, How Would I implement This?

To fill our requirement that filters applied to the dimension of one

fact should affect the results returned in another fact:

● “Sub-Facts” (Statuses, Stagings, Histologies, Treatments) should be

treated as dimensions of the central fact (Tumors/Documents) through

the conformed dimension (Tumors).

● This would create a reverse snowflake schema● One fact: F_CLNCL_DOC (Tumors/Documents)

● Central Dimension: D_PT_TMR (Tumors)

● All other tables are dimensions that flake off of D_PT_TMR and their “Sub-Facts”.

(One to many join between D_PT_TMR and other facts reversed, becoming many to

one)

Continuing with our real world example:

Page 25: When Facts and Dimensions Alone Aren't the Answer: Logically Reversing the Star Schema

Ok, How Would I implement This? (cntd.)

So, should our Dimensional Model/RPD Physical Layer be re-

designed to support this?

No.

Continuing with our real world example:

Page 26: When Facts and Dimensions Alone Aren't the Answer: Logically Reversing the Star Schema

Ok, How Would I implement This? (cntd.)

The Dimensional Model/RPD Physical Layer should remain As Is

(a classic Star Schema)

● The “Sub-Facts” truly are facts/transactions. They can be analyzed

independently

● Current requirements dictate that the “Sub-Facts” are simply avenues

to descriptive attributes

● However, future requirements may dictate the need to analyze these

“Sub-Facts” individually; we don’t want to lose this capability through a

redesign

Continuing with our real world example:

Page 27: When Facts and Dimensions Alone Aren't the Answer: Logically Reversing the Star Schema

Ok, How Would I implement This? (cntd.)

The Reverse Star Schema is built logically in the RPD BMM layer:

● Individual Dimension Tables are brought over from the Physical Layer

● The Logical Table Source (LTS) for each Dimension Table is edited to

include its “Sub-Fact” table as well as D_PT_TMR

Continuing with our real world example:

Page 28: When Facts and Dimensions Alone Aren't the Answer: Logically Reversing the Star Schema

Ok, How Would I implement This? (cntd.)

The Reverse Star Schema is built logically in the RPD BMM layer

(cntd.):

● The dimension table is then logically joined directly to the central fact

(FACT_ALS_F_CLNCL_DOC)

Continuing with our real world example:

Page 29: When Facts and Dimensions Alone Aren't the Answer: Logically Reversing the Star Schema

Ok, How Would I implement This? (cntd.)

The Reverse Star Schema is built logically in the RPD BMM layer

(cntd.):

● The joins in the Physical Layer remain the same.

Continuing with our real world example:

Dimension DimensionFact Fact

Page 30: When Facts and Dimensions Alone Aren't the Answer: Logically Reversing the Star Schema

Ok, How Would I implement This? (cntd.)

This process is completed until all dimension tables join directly

to the central fact (FACT_ALS_F_CLNCL_DOC) in the BMM

“Sub-Facts” were not brought in as logical tables in the BMM for

this implementation, as there were no requirements for them.

● These “Sub-Facts” can be brought in as independent facts later if

needed, creating a multi-facted Star Schema

● If a “Sub-Fact” is brought in at a later time, it need simply be joined to

its relevant Dimensions; OBIEE will recognize that only one table in the

LTS need be used for the Join

Continuing with our real world example:

Page 31: When Facts and Dimensions Alone Aren't the Answer: Logically Reversing the Star Schema

Ok, How Would I implement This? (cntd.)

Continuing with our real world example:

Page 32: When Facts and Dimensions Alone Aren't the Answer: Logically Reversing the Star Schema

What about data integrity?

Continuing with our real world example:Aggregation is fine so long as your aggregations are all count

distinct.

But what about when your aggregation is something other than a

count distinct?

Ay, there’s the rub! Luckily, OBIEE has a quick and easy

solution: level based measures and/or measures based upon

dimensions

Page 33: When Facts and Dimensions Alone Aren't the Answer: Logically Reversing the Star Schema

What about data integrity?

Continuing with our real world example:We had to set our non-count distinct aggregations to be Based on

Dimensions in order to achieve correct aggregation of our

measures

● For the conformed dimension (DIM_ALS_D_PT_TMR or Tumors), take

the Distinct Average of the measure

● For all Other Dimensions, simply Average the Measure

Page 34: When Facts and Dimensions Alone Aren't the Answer: Logically Reversing the Star Schema

What about data integrity?

Continuing with our real world example:Why does this work?

● OBIEE will first average distinct the measure at the Tumor level.

Since the Tumor level is the lowest granularity, this will equate to the

original measure

● OBIEE will then average the population of these distinct averages by

whatever other dimensions are brought into the request

Page 35: When Facts and Dimensions Alone Aren't the Answer: Logically Reversing the Star Schema

What about data integrity?

Continuing with our real world example:Why is this necessary?

● In our data set, our “Sub-Facts” can have multiple instances per tumor.

● We have transitioned our “Sub-Facts” to act as avenues between their

own dimensions and the central fact (FACT_F_CLNCL_DOC) for the

formation of the Star Schema in the BMM

● As a result, we can (and do) experience “ballooning”

Page 36: When Facts and Dimensions Alone Aren't the Answer: Logically Reversing the Star Schema

What about data integrity?

Continuing with our real world example:Why is this necessary (cntd.)?

Two Tumors, each with one row of data:

vs.

Same two Tumors, with Staging Date added to the query. The 1st Tumor now has two rows

of data, one for each of its Staging Dates. The 2nd Tumor still has only one row of data, as it

has only one Staging Date.

Page 37: When Facts and Dimensions Alone Aren't the Answer: Logically Reversing the Star Schema

What about data integrity?

Continuing with our real world example:Why is this necessary? (cntd.)

● With the inflated data, we can’t simply average: tumors that have

ballooned out would be weighted more heavily in our average

calculation than those that have not

349.67

If the Aggregation Rule in our RPD was simply set to “Average,”

OBIEE would calculate the average of the above as 349.67

[ (493 + 493 + 63) / 3 ]

Page 38: When Facts and Dimensions Alone Aren't the Answer: Logically Reversing the Star Schema

What about data integrity?

Continuing with our real world example:Why is this necessary? (cntd.)

● To remove the “weight” given to a measure when its tumor appears in

the results more than once, we have to tell OBIEE to base the

aggregation on Dimensions

Page 39: When Facts and Dimensions Alone Aren't the Answer: Logically Reversing the Star Schema

What about data integrity?

Continuing with our real world example:Why is this necessary? (cntd.)

● OBIEE would first calculate the Average at the Tumor Level AVG(DISTINCT "SCD-Tumor Management"."FACT_ALS_F_CLNCL_DOC"."Follow Up Time")

Page 40: When Facts and Dimensions Alone Aren't the Answer: Logically Reversing the Star Schema

What about data integrity?

Continuing with our real world example:Why is this necessary? (cntd.)

● OBIEE would then average this averageAVG("SCD-Tumor Management"."FACT_ALS_F_CLNCL_DOC"."Follow Up Time")

● Our average calculates correctly, at 278[ (493 + 63) / 2 ]

Page 41: When Facts and Dimensions Alone Aren't the Answer: Logically Reversing the Star Schema

Questions?

Page 42: When Facts and Dimensions Alone Aren't the Answer: Logically Reversing the Star Schema

Additional Implementation Notes

Non Star Schema Structures:● Non-Star Schema structures can work and do not necessarily

cause incorrect results

● However, they are not ideal and can greatly complicate

implementation, expansion, and maintenance

Page 43: When Facts and Dimensions Alone Aren't the Answer: Logically Reversing the Star Schema

Additional Implementation Notes (cntd.)

Non Star Schema Structures (cntd.):● To demonstrate, let’s start with a simple snowflake structure

Page 44: When Facts and Dimensions Alone Aren't the Answer: Logically Reversing the Star Schema

Additional Implementation Notes (cntd.)

Non Star Schema Structures (cntd.):● And then create a Dimensional Hierarchy on

DIM_PROVIDERS

Page 45: When Facts and Dimensions Alone Aren't the Answer: Logically Reversing the Star Schema

Additional Implementation Notes (cntd.)

Non Star Schema Structures (cntd.):● OBIEE names your hierarchy DIM_PATIENTSDim, but you

created it on DIM_PROVIDERS. Why?

● OBIEE considers the dimension that is joined directly to the fact to be

the “true”, as it’s the lowest level dimension

● DIM_PROVIDERS is connected to DIM_PATIENTS at a higher level,

and OBIEE has consolidating both tables into one dimensional

hierarchy, with DIM_PROVIDERS the parent level of DIM_PATIENTS

Page 46: When Facts and Dimensions Alone Aren't the Answer: Logically Reversing the Star Schema

Additional Implementation Notes (cntd.)

Non Star Schema Structures (cntd.):● What would have happened if DIM_PATIENTS had 2 flakes?

Dimension

Dimension Dimension

Dimension

Fact

Page 47: When Facts and Dimensions Alone Aren't the Answer: Logically Reversing the Star Schema

Additional Implementation Notes (cntd.)

Non Star Schema Structures (cntd.):● You’d end up with 1 dimensional hierarchy with 2 separate

trees

Page 48: When Facts and Dimensions Alone Aren't the Answer: Logically Reversing the Star Schema

Additional Implementation Notes (cntd.)

Non Star Schema Structures (cntd.):● So why does this complicate things?

● Level-based measures are based upon Dimensional Hierarchies

● In a Snowflake Schema, constructing many types of level-based

measures becomes impossible, or at the least very complex.

Page 49: When Facts and Dimensions Alone Aren't the Answer: Logically Reversing the Star Schema

Additional Implementation Notes (cntd.)

Non Star Schema Structures (cntd.):● Let’s calculate the average Follow-Up Days for a Provider by

Age Bucket

● Because Providers and Age Buckets are in the same

Dimensional Hierarchy, we can’t do this calculation.

● OBIEE limits you to only one level per Dimensional Hierarchy

Page 50: When Facts and Dimensions Alone Aren't the Answer: Logically Reversing the Star Schema

Additional Implementation Notes (cntd.)

RPD Consistency Check Warnings:● Because you have mapped multiple tables in the LTS’s of

your Dimensions, OBIEE expects you to use those tables in

the Column Mappings

● If you do not make use of a mapped table, the Consistency

Checker will give you a warning:

Page 51: When Facts and Dimensions Alone Aren't the Answer: Logically Reversing the Star Schema

Additional Implementation Notes (cntd.)

RPD Consistency Check Warnings (cntd.):● To resolve this warning, create a new Logical Column in your

table and map it to a column in the unused table

Page 52: When Facts and Dimensions Alone Aren't the Answer: Logically Reversing the Star Schema

Additional Implementation Notes (cntd.)

Fabricated Data: Missing Data● In our data set we were faced with the fact that a Tumor may not

be present in all of the major subject areas (Statuses, Stagings,

Histologies, Treatments)

● In the instances where a Tumor was not present in a subject

area, that Tumor would fall out of the data population if a

dimension from that subject area was brought in

Users did not want this fallout to occur and they emphasized

that it was just as important to see where a Tumor did not

have information in a given subject area as it was to see what

information a tumor did have in a given subject area

Page 53: When Facts and Dimensions Alone Aren't the Answer: Logically Reversing the Star Schema

Additional Implementation Notes (cntd.)

Fabricated Data: Missing Data (cntd.)● RPD Resolution Possibility

● This fallout could have been prevented through Outer Joins in the OBIEE

RPD

● The Logical Mappings of all dimension attributes could be changed to

include an IFNULL clause [It was required that a NULL/missing object state

its status (Missing)]

● With hundreds of attributes this is neither sustainable nor realistic

Page 54: When Facts and Dimensions Alone Aren't the Answer: Logically Reversing the Star Schema

Additional Implementation Notes (cntd.)

Fabricated Data: Missing Data (cntd.)● Data Creation in Database Tables Resolution Possibility

● Each and every Dimension Table in our Dimensional Model has an entry for

a key of -2 and attribute codes, names, descriptions, etc. of ‘!Record Does

Not Exist in Source’

● At the end of the ETL process, the central conformed dimension

(D_PT_TMR) was LEFT OUTER JOINed to each of the dimensional

database “Sub-Facts”

● F_TMR_STAT

● F_TMR_STG

● F_TMR_HISTOL

● F_TMR_TRTMT

Page 55: When Facts and Dimensions Alone Aren't the Answer: Logically Reversing the Star Schema

Additional Implementation Notes (cntd.)

Fabricated Data: Missing Data (cntd.)● Data Creation in Database Tables Resolution Possibility (cntd.)

● Where a match was not found in a Fact table, a row was inserted in that

Fact table for the missing Tumor ID. All Dimensional Keys were set to -2,

!Record Does Not Exist in Source

● Users then gained the ability to pull tumors with missing information.