Objectives
- Definition of terms
- Reasons for information gap between information needs and availability
- Reasons for need of data warehousing
- Describe three levels of data warehouse architectures
- List four steps of data reconciliation
- Describe two components of star schema
- Estimate fact table size
- Design a data mart

Definition
Data Warehouse: A subject-oriented, integrated, time-variant, nonupdatable collection of data used in support of management decision-making processes
- Subject-oriented: e.g. customers, patients, students, products
- Integrated: Consistent naming conventions, formats, encoding structures; from multiple data sources
- Time-variant: Can study trends and changes
- Nonupdatable: Read-only, periodically refreshed
Data Mart: A data warehouse that is limited in scope

Need for Data Warehousing
- Integrated, company-wide view of high-quality information (from disparate databases)
- Separation of operational and informational systems and data (for improved performance)

Data Warehouse Architectures
- Generic Two-Level Architecture
- Independent Data Mart
- Dependent Data Mart and Operational Data Store
- Logical Data Mart and @ctive

Data Reconciliation
Typical operational data is:
- Transient – not historical
- Not normalized (perhaps due to denormalization for performance)
- Restricted in scope – not comprehensive
- Sometimes poor quality – inconsistencies and errors
After ETL, data should be:
- Detailed – not summarized yet
- Historical – periodic
- Normalized – 3rd normal form or higher
- Comprehensive – enterprise-wide perspective
- Timely – data should be current enough to assist decision-making
- Quality controlled – accurate with full integrity
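
The scrub-and-transform part of reconciliation can be sketched in a few lines of Python. This is a minimal illustration, not a production ETL job: the two source systems, their field names, the gender encoding map, and the "first source wins" rule are all assumptions made up for the example. It shows inconsistent operational records being brought to one naming convention, one format, and one encoding, with a snapshot date attached so the result is historical rather than transient.

```python
from datetime import date

# Hypothetical operational extracts: two source systems that describe the
# same customers with different field names and encodings (all made up).
billing_rows = [
    {"cust_id": "C001", "name": " Ann Lee ", "gender": "F"},
    {"cust_id": "C002", "name": "Bob Roy", "gender": "M"},
]
support_rows = [
    {"customer": "c002", "full_name": "Bob Roy", "sex": 1},
    {"customer": "c003", "full_name": "Cy Dow", "sex": None},
]

# Assumed mapping from each source's gender encoding to one shared code.
GENDER_CODES = {"F": "F", "M": "M", 0: "F", 1: "M"}


def transform(raw_id, raw_name, raw_gender, source, snapshot):
    """Scrub and transform one record into the reconciled format."""
    return {
        "customer_id": raw_id.strip().upper(),         # consistent key format
        "name": " ".join(raw_name.split()),            # trim stray whitespace
        "gender": GENDER_CODES.get(raw_gender, "U"),   # "U" = unknown
        "source": source,
        "snapshot_date": snapshot,                     # makes the row periodic
    }


def reconcile(billing, support, snapshot):
    """Return one reconciled row per customer; the first source seen wins."""
    reconciled = {}
    for r in billing:
        row = transform(r["cust_id"], r["name"], r["gender"], "billing", snapshot)
        reconciled.setdefault(row["customer_id"], row)
    for r in support:
        row = transform(r["customer"], r["full_name"], r["sex"], "support", snapshot)
        reconciled.setdefault(row["customer_id"], row)
    return [reconciled[key] for key in sorted(reconciled)]


rows = reconcile(billing_rows, support_rows, date(2024, 1, 31))
for row in rows:
    print(row)
```

Customer C002 appears in both sources but yields a single reconciled row, and the missing gender for C003 is recorded explicitly as unknown instead of being left inconsistent.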

- Ease of use for decision support applications
- Fast response to predefined user queries
- Customized data for particular target audiences
- Ad-hoc query support
- Data mining capabilities

Issues Regarding Star Schema
- Dimension table keys must be surrogate (non-intelligent and non-business related), because:
  - Keys may change over time
  - Length/format consistency
- Granularity of Fact Table – what level of detail do you want?
  - Transactional grain – finest level
  - Aggregated grain – more summarized
  - Finer grain leads to better market basket analysis capability
  - Finer grain leads to more dimension tables, more rows in fact table
- Duration of the database – how much history should be kept?
  - Natural duration – 13 months or 5 quarters
  - Financial institutions may need longer duration
  - Older data is more difficult to source and cleanse
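
Grain and duration together drive the size of the fact table. A common back-of-the-envelope estimate multiplies the number of members in each dimension, scales by the fraction of combinations that actually occur, and then multiplies by the row width. The sketch below assumes that approach; the dimension counts, the 50% density, and the 6 fields of 4 bytes each are hypothetical numbers chosen only to make the arithmetic visible.

```python
def estimate_fact_rows(dimension_sizes, density=1.0):
    """Rows = product of dimension cardinalities x share of combinations used."""
    rows = 1
    for size in dimension_sizes.values():
        rows *= size
    return round(rows * density)


def estimate_fact_bytes(rows, n_fields, bytes_per_field):
    """Bytes = rows x fields per row x average bytes per field."""
    return rows * n_fields * bytes_per_field


# Hypothetical retail star schema with a monthly grain and the
# 13-month natural duration mentioned above.
dimensions = {"store": 1_000, "product": 10_000, "period": 13}

rows = estimate_fact_rows(dimensions, density=0.5)   # assume half the
                                                     # combinations occur
size = estimate_fact_bytes(rows, n_fields=6, bytes_per_field=4)

print(f"{rows:,} rows")    # 65,000,000 rows
print(f"{size:,} bytes")   # 1,560,000,000 bytes
```

Raising the `period` count models either a longer duration or a finer time grain, and the row estimate grows in direct proportion.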

The User Interface
Metadata (data catalog)
- Identify subjects of the data mart
- Identify dimensions and facts
- Indicate how data is derived from enterprise data warehouses, including derivation rules
- Indicate how data is derived from operational data store, including derivation rules
- Identify available reports and predefined queries
- Identify data analysis techniques (e.g. drill-down)
- Identify responsible people

The use of a set of graphical tools that provides users with multidimensional views of their data and allows them to analyze the data using simple windowing techniques

Relational OLAP (ROLAP)
- Traditional relational representation

OLAP Operations
- Cube slicing – come up with 2-D view of data
- Drill-down – going from summary to more detailed
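
Both operations can be illustrated on a tiny in-memory cube. The sketch below is only an illustration in plain Python, not how an OLAP server is implemented; the fact rows and the three dimensions (product, region, month) are made-up sample data. Slicing fixes one product to obtain a 2-D region-by-month view, and drill-down breaks one product's summary total into per-region detail.

```python
from collections import defaultdict

# Hypothetical fact rows: (product, region, month, units_sold).
facts = [
    ("shoes", "east", "2024-01", 10),
    ("shoes", "west", "2024-01", 7),
    ("shoes", "east", "2024-02", 12),
    ("hats", "east", "2024-01", 4),
    ("hats", "west", "2024-02", 6),
]


def slice_cube(rows, product):
    """Fix one product and return a 2-D (region, month) -> units view."""
    view = defaultdict(int)
    for prod, region, month, units in rows:
        if prod == product:
            view[(region, month)] += units
    return dict(view)


def total_by_product(rows):
    """Summary level: total units per product."""
    totals = defaultdict(int)
    for prod, _region, _month, units in rows:
        totals[prod] += units
    return dict(totals)


def drill_down(rows, product):
    """Detail level: split one product's total by region."""
    detail = defaultdict(int)
    for prod, region, _month, units in rows:
        if prod == product:
            detail[region] += units
    return dict(detail)


print(slice_cube(facts, "shoes"))
print(total_by_product(facts))     # {'shoes': 29, 'hats': 10}
print(drill_down(facts, "shoes"))  # {'east': 22, 'west': 7}
```

The drill-down detail always sums back to the summary figure it came from, here 22 + 7 = 29 units of shoes.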

Data Mining and Visualization
Knowledge discovery using a blend of statistical, AI, and computer graphics techniques
Goals:
- Explain observed events or conditions
- Confirm hypotheses
- Explore data for new or unexpected relationships
Techniques
- Statistical regression
- Decision tree induction
- Clustering and signal processing
- Affinity
- Sequence association
- Case-based reasoning
- Rule discovery
- Neural nets
- Fractals
Data visualization – representing data in graphical/multimedia formats for analysis
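
As one small illustration of the affinity technique listed above, the sketch below counts how often pairs of items occur together in the same transaction, which is the basic idea behind market basket analysis. The baskets are invented sample data and `pair_support` is a hypothetical helper name; real tools use more scalable algorithms than this exhaustive pair count.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions: each basket is the set of items bought together.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
]


def pair_support(transactions):
    """Return the share of transactions that contain each item pair."""
    counts = Counter()
    for basket in transactions:
        # Sort so that each pair always appears in the same order.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    n = len(transactions)
    return {pair: count / n for pair, count in counts.items()}


support = pair_support(baskets)
for pair, share in sorted(support.items(), key=lambda kv: -kv[1]):
    print(pair, share)
```

With this sample data, bread and milk appear together in 3 of 5 baskets (support 0.6), while the other two pairs each appear in 2 of 5 (support 0.4).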