A Framework for Building Resilient Data Warehouses using a Mandala Topology Architecture

Michel JACQUES
Information Assembly SPRL, Brussels, Belgium

Introduction

Constructing ETL work for a DW project is a complex affair, one that requires planning. This framework uses a DW architecture based on a Mandala Topology. At its core is an agile, step-by-step approach to identifying ETL work units:

1. Identifying the external users
2. Positioning the main data flows
3. Decomposing a flow into its internal layers
4. Incorporating them within the DW application platform (Mandala)
5. Extending the conceptual EDM with a functional level of detail
6. Classifying data model entities and their dependencies
7. Combining the Functional ER model with Topic Areas & Tiers
8. Enumerating the ETL work units in matrix format

The method focuses on the topological relationship between all the DW artifacts, in order to comprehensively improve DW planning and design.

1. Users Perspectives & Interfaces

[Figure: "Data Application: DW Platform is Usage-Driven". Labels: STAR, EDM, STAG, META, MART, SINK; Integration, Data Stewards, Analytics, Quality Control.]

Each user type has a different perspective and role towards the DW. Existing users include: data stewards, analytical end-users, quality control masters, DW administrators & integration managers. Each user has their own set of information requirements and accesses the DW via a specific interface (STAG, META, MART, or SINK). The STAR can only be accessed indirectly via the other interfaces, which ensures data integrity & security and prevents dependencies caused by different user demands. The users are part of an iterative feedback loop that improves future data content and quality. Hence a multi-perspective, faceted data warehouse architecture acting as a crossway between these end-users is comprehensive and non-discriminatory at an organizational level.

2. The Four Major Streams of Data

Source data is in an unrefined state (to various degrees) and must have its data elements differentiated into a DW core data model (IN) in order to be able to recombine them effectively afterwards for analytics (OUT). During this transition, two other types of data elements are produced: metadata (UP) and erroneous data (DOWN) streams. No data should be lost in this closed system. Thus the position & direction of a data stream determine its purpose, its means, and its destination. This naturally leads to asymmetries between streams, which must be accounted for in the ETL design.

“Just about every [DW] process has side effects; but they can be deliberate and sustaining instead of unintentional and pernicious…and we can also be inspired by it to design some positive side effects to our own enterprises instead of focusing exclusively on a single end.” (p.80)
Ref: W. McDonough and M. Braungart, Cradle to Cradle, North Point Press, NY, 2002
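The four streams of section 2 and the "no data should be lost" rule can be illustrated with a minimal Python sketch. Everything named here (the StreamRouter class, the required-field check, the sample records) is a hypothetical illustration and not part of the framework itself; the only idea taken from the text is that every source record ends up in either the core model (IN) or the erroneous-data store (DOWN), while metadata (UP) is produced along the way and analytics (OUT) recombines the core data.

```python
from dataclasses import dataclass, field


@dataclass
class StreamRouter:
    """Hypothetical router: sends each source record to IN or DOWN and logs to UP."""
    into_core: list = field(default_factory=list)  # IN: records accepted into the core model
    rejected: list = field(default_factory=list)   # DOWN: erroneous records, kept rather than dropped
    metadata: list = field(default_factory=list)   # UP: one audit entry per routed record

    def route(self, record: dict, required: tuple) -> str:
        # A record is treated as erroneous when a required field is missing or empty
        # (an assumed, deliberately simple quality rule).
        missing = [k for k in required if record.get(k) in (None, "")]
        stream = "DOWN" if missing else "IN"
        (self.rejected if missing else self.into_core).append(record)
        self.metadata.append({"stream": stream, "missing": missing})
        return stream

    def publish(self) -> dict:
        """OUT: recombine core records for analytics, here a record count per course."""
        counts = {}
        for rec in self.into_core:
            counts[rec["course_code"]] = counts.get(rec["course_code"], 0) + 1
        return counts

    def is_closed(self, n_source: int) -> bool:
        """Closed-system check: every source record landed in IN or DOWN."""
        return len(self.into_core) + len(self.rejected) == n_source


source = [
    {"course_code": "C1", "participant": "P1"},
    {"course_code": "", "participant": "P2"},  # missing course code, goes DOWN
    {"course_code": "C2", "participant": "P3"},
]
router = StreamRouter()
for rec in source:
    router.route(rec, required=("course_code", "participant"))
```

Calling router.is_closed(len(source)) after the load is one way to turn the closed-system statement into an automated check.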
3. The Five ETL Layers & Chirality

[Figure: "Data Application: Platform for 5 Architectural Layers". Labels: Extraction, Staging, Integration, Publishing, User Access; STAG, STAR, EDM, MART; IN, OUT.]

The purpose of an ETL is to increase the integration & analytical capability of sourced data. An ETL data path consists of 5 layers, each conducting a different set of data transformations. The first two “IN” stages integrate the data to enable multiple interpretations, while the last two “OUT” stages specialise the data such that it becomes fit-for-purpose. This mirror-effect is referred to as ETL chirality.

The following are important elements when selecting the most appropriate data modelling technique to use:
a) the degree of convergence built into the data during ETL;
b) the number of unique pathways in the dimensional model;
c) increased data flow resilience by decreased data reliance; and
d) ability for decoupling of model components.

Data disturbances will occur from external sources in an unexpected, subtle or extreme manner, which requires a faster data recovery by minimising data reload to only what is relevant. Resilience also involves cyclic transfer of information across data applications, reinforcing each system's data quality and monitoring usage.

4. DW Mandala Topology Architecture

[Figure: Mandala diagram. Labels: META, Authoring, STAR, Trn Maps, Profiles, Attendance, Roster, Trainings, SINK, Erroneous Data Repo.]

The Buddhist Mandala metaphor helps visualize an architecture in which topologically related DW components, including data modeling, ETL data flows, and surrounding actors & applications, interact functionally. The architecture is similar to a road crossway that gives essential context and movement to data and operations. The context is composed of 5 distinct locations (STAG, META, SINK, MART, & STAR) that provide a clear logical structure determining data, flows, modeling patterns, access and security, user groups, and integration methods. Moreover, locations enable data persistence, facilitating data recovery and transformations.

5. Functional ER Data Model

[Figure: "Enterprise Data Model: Conceptual Model".]

The conceptual model is business-oriented, while the logical model is focused on content and application. The functional data model, proposed herein, places itself in the gap between the two. It extends the number of entity types from the initial fact and dimension with new types for holding hierarchies, dim. ids, details & associations. This improves history-keeping and makes the core data model more resilient while decoupling the associated data flows.

There is a need to extend our “vocabulary of forms” when modelling. This involves adding functional features so as to harmonise “form with function” and thus achieve a greater decoupling of DW artefacts, whilst maintaining data cohesion.
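The five layers and the mirror symmetry described in section 3 can be made concrete with a short Python sketch. It assumes that the five labels from the layer figure (Extraction, Staging, Integration, Publishing, User Access) are the five layers in path order; the "CORE" tag for the middle layer and the function names are illustrative choices, not terms from the framework. The recovery helper further assumes that each layer persists its output, so that only the failed layer and those after it need to be re-run.

```python
# Assumed layer order, taken from the labels of the layer figure.
LAYERS = ("Extraction", "Staging", "Integration", "Publishing", "User Access")


def side(layer: str) -> str:
    """Classify a layer as an IN stage, an OUT stage, or the middle layer."""
    i = LAYERS.index(layer)
    mid = len(LAYERS) // 2
    if i < mid:
        return "IN"    # first two stages: integrate the data
    if i > mid:
        return "OUT"   # last two stages: specialise the data
    return "CORE"      # hypothetical tag for the middle layer


def mirror(layer: str) -> str:
    """Return the layer at the mirrored position of the path (the chirality pairing)."""
    i = LAYERS.index(layer)
    return LAYERS[len(LAYERS) - 1 - i]


def layers_to_rerun(failed: str) -> tuple:
    """Layers to re-execute after a failure, assuming earlier layers persisted their output."""
    return LAYERS[LAYERS.index(failed):]


for name in LAYERS:
    print(f"{name:12} side={side(name):4} mirror={mirror(name)}")
```

Under these assumptions a failure in Publishing only requires re-running Publishing and User Access, which is one reading of "minimising data reload to only what is relevant".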
6. Volatility of Functional Entities

This functional data model applies additional data entity classes, giving it increased ETL flexibility. The classes follow a step-wise approach whereby they become increasingly volatile (data susceptibility to change). Since data in lower classes is used to derive new data in upper classes, the data volatility in lower classes will be less than that in upper classes. Volatility determines which data flows are performed in parallel or in sequence. This principle drastically reduces ETL execution and development time.

ETL Decomposition Process: Classes as Tiers Revisited
DEPENDENCY LAYERS as TIERS

TIERS      DATA CLASSES
T00        Metadata (static)
T01        Dimensions (base, details & struct)
T02        Facts (events & profiles)
T03        Grids (associations)
T04, T05   Densification & Conformations (ballasting & mappings)
T06, T07   Data Marts (lov, hier & msr)
T08, T09   Segments & KPI (compositions & formula)
T99        Topic Area Promotion

The 6 classes are mapped to 9 tiers that preserve this volatility ordering.

7. Agile DW Meets Functional ER Model

[Figure: "ETL Decomposition Process: Topic Areas & Tiers for EDM". ER diagram; labels include topic area codes PPA-1, TMA-2, TRA-3, EVA-4, TNA-5 and tiers T0 to T4. Entities shown include PARTICIPANTS3, PARTICIPANTS_PROFILES_ACTUAL, COURSES3, TRAINING_MAP_ACTUAL, TRAINING_NEEDS_ACTUAL, TRAININGS_EVALUATIONS, TRAININGS_ATTENDANCE_PAG, TRAININGS_ROSTER_PTG, TIMES3, ORGANISATIONS3.]

The ETL decomposition process requires an entity to have both a class and a theme. For each entity the process allocates:
a) a tier corresponding to a class;
b) a topic area corresponding to a theme; and
c) a priority corresponding to prevailing business needs.

An entity defines the work unit in which it is contained. A work unit is the smallest functional artefact determining the granularity of resource allocation. Regrouping work units is as follows: Work unit >> Module >> Topic Area >> Enterprise Data Model (EDM).

The concept of Tiers and Topic Areas is borrowed from R. Hughes, Agile Data Warehousing, iUniverse Inc, Bloomington, 2008.

8. Work Matrix from Method & Conclusion

[Figure: "ETL Decomposition Process: Work Units and Modules". Matrix of TOPIC against TIERS, with Promotion.]

Development progress follows an iterative and mostly top-down approach: one topic area, one module, one tier at a time, although units within the same tier can be developed in parallel. Once there is a Reference Model (built prototype) for a Tier work unit, estimates for all units can be based on the reference model and adjusted according to variable difficulty factors. Once completed & tested, work units and modules are then promoted to the next development environment.

Conclusion: The advantages of implementing such a topological architecture include: greater scalability of additional data themes, enhanced performance of data flows, increased resilience of decoupled artifacts, sturdier quality control, and lower operating and development costs. The framework provides a comprehensive, reproducible, and proven DW architecture solution.

Ecole Centrale Paris, Châtenay-Malabry
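The decomposition of sections 6 to 8 (a tier from the entity's class, a topic area from its theme, a priority from business needs, then a matrix of work units) can be sketched in Python as follows. This is an illustration under assumptions, not the author's tooling: the entity names echo the ER figure, but their class, topic and priority assignments are invented for the example, and only the three classes that map to a single tier (T01 to T03) are used. The Module level of the regrouping hierarchy is left out for brevity.

```python
from collections import defaultdict

# Class-to-tier mapping for the three single-tier classes of the tier table.
CLASS_TO_TIER = {"Dimensions": "T01", "Facts": "T02", "Grids": "T03"}

# Hypothetical topic-area priorities (lower number = more urgent business need).
TOPIC_PRIORITY = {"PPA": 1, "TRA": 2}

# (entity, class, theme/topic area): illustrative assignments only.
ENTITIES = [
    ("PARTICIPANTS", "Dimensions", "PPA"),
    ("PARTICIPANTS_PROFILES", "Facts", "PPA"),
    ("COURSES", "Dimensions", "TRA"),
    ("TRAININGS_ATTENDANCE", "Grids", "TRA"),
]


def work_matrix(entities):
    """Group entities into work units keyed by (topic area, tier)."""
    matrix = defaultdict(list)
    for name, cls, topic in entities:
        matrix[(topic, CLASS_TO_TIER[cls])].append(name)
    return dict(matrix)


def build_order(matrix):
    """Tiers run in sequence; work units sharing a tier may be developed in parallel."""
    by_tier = defaultdict(list)
    for topic, tier in matrix:
        by_tier[tier].append(topic)
    return [
        (tier, sorted(by_tier[tier], key=TOPIC_PRIORITY.get))
        for tier in sorted(by_tier)
    ]


matrix = work_matrix(ENTITIES)
for tier, topics in build_order(matrix):
    print(tier, "parallel work units for topics:", topics)
```

Each (topic area, tier) key is one cell of the work matrix; listing the tiers in ascending order reflects the rule that lower, less volatile tiers are built before the tiers that derive data from them.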