Post on 03-Jan-2016
Data Warehousing
Definition
• Data Warehouse:
– A subject-oriented, integrated, time-variant, non-updatable collection of data used in support of management decision-making processes
– Subject-oriented: e.g. customers, patients, students, products
– Integrated: consistent naming conventions, formats, and encoding structures; drawn from multiple data sources
– Time-variant: contains a time dimension so that it may be used to study trends and changes
– Non-updatable: read-only, periodically refreshed
• Data Mart:
– A data warehouse that is limited in scope
Need for Data Warehousing
• Integrated, company-wide view of high-quality information (from disparate databases)
• Separation of operational and informational (decision support) systems and data (for improved performance)
Data Warehouse Architectures
• Generic Two-Level Architecture
• Independent Data Mart
All involve some form of extraction, transformation and loading (ETL)
Figure 11-2: Generic two-level data warehousing architecture
• ETL feeds one company-wide warehouse
• Periodic extraction: data in the warehouse is not completely current
Figure 11-3: Independent data mart data warehousing architecture
• Data marts: mini-warehouses, limited in scope
• Separate ETL for each independent data mart
• Data access complexity due to multiple data marts
The ETL Process
• Capture/Extract
• Scrub (data cleansing)
• Transform:
– Convert data from the format of the source to the format of the data warehouse.
• Load and Index
ETL = Extract, transform, and load
Load/Index = place transformed data into the warehouse and create indexes
Refresh mode: bulk rewriting of target data at periodic intervals
Update mode: only changes in source data are written to the data warehouse
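The difference between the two load modes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the row layout, the `transform` rule, and the function names `load_refresh` and `load_update` are hypothetical and do not come from the slides.

```python
# Hypothetical source rows; the field names are illustrative only.
def transform(row):
    """Convert a source-format row to the (assumed) warehouse format."""
    return {"cid": row["id"], "city": row["city"].strip().title()}

def load_refresh(source_rows):
    """Refresh mode: rebuild the whole target from the current source."""
    return {row["id"]: transform(row) for row in source_rows}

def load_update(warehouse, changed_rows):
    """Update mode: write only the changed source rows into the target."""
    for row in changed_rows:
        warehouse[row["id"]] = transform(row)
    return warehouse

source = [{"id": 1, "city": " dallas "}, {"id": 2, "city": "AUSTIN"}]
warehouse = load_refresh(source)

# Later, only customer 2 changes in the source system.
changes = [{"id": 2, "city": "houston"}]
warehouse = load_update(warehouse, changes)
```

Refresh mode touches every target row on each run, while update mode touches only the rows that changed, which is why it depends on some way of capturing source changes.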
Figure 11-10: Steps in data reconciliation
Index
• Bitmap index
• Join index
Figure 6-8: Bitmap index organization
• Bitmap saves on space requirements
• Rows: possible values of the attribute
• Columns: table rows
• Each bit indicates whether the attribute of that table row has that value
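As a rough illustration of that layout, the Python sketch below builds one bit list per attribute value, with one bit per table row. The sample rows and the helper name `build_bitmap_index` are assumptions made for the example; a real DBMS stores and compresses bitmaps very differently.

```python
def build_bitmap_index(rows, attribute):
    """Return {value: [bits]} with one bit per table row."""
    values = sorted({row[attribute] for row in rows})
    return {v: [1 if row[attribute] == v else 0 for row in rows] for v in values}

# Hypothetical customer rows.
rows = [
    {"cid": 1, "rating": "A"},
    {"cid": 2, "rating": "B"},
    {"cid": 3, "rating": "A"},
    {"cid": 4, "rating": "C"},
]
index = build_bitmap_index(rows, "rating")

# Rows with rating A or C: OR the two bitmaps, then read off the set bits.
a_or_c = [a | c for a, c in zip(index["A"], index["C"])]
matches = [rows[i]["cid"] for i, bit in enumerate(a_or_c) if bit]
```

Combining conditions becomes a bitwise operation on the bitmaps, so the table rows are only read for the final matches.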
Figure 6-9: Join indexes speed up join operations
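One way to picture a join index is as a precomputed list of matching row positions from the two tables, so the join does not have to compare values at query time. The sketch below assumes two small hypothetical tables joined on a `city` column; the table contents and the name `build_join_index` are illustrative only.

```python
# Hypothetical tables that share a "city" column.
stores = [{"store_id": 10, "city": "Dallas"}, {"store_id": 20, "city": "Austin"}]
customers = [
    {"cid": 1, "city": "Austin"},
    {"cid": 2, "city": "Dallas"},
    {"cid": 3, "city": "Austin"},
]

def build_join_index(left, right, column):
    """Precompute (left_pos, right_pos) pairs whose join column matches."""
    positions = {}
    for j, r in enumerate(right):
        positions.setdefault(r[column], []).append(j)
    return [(i, j) for i, l in enumerate(left) for j in positions.get(l[column], [])]

join_index = build_join_index(stores, customers, "city")

# The join itself is now a lookup by position; no values are compared.
joined = [(stores[i]["store_id"], customers[j]["cid"]) for i, j in join_index]
```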
Star Schema for Data Warehouse
• Objectives
– Ease of use for decision support applications
– Fast response to predefined user queries
– Customized data for particular target audiences
Also called “dimensional model”
• Dimension:
– Any category used in analyzing data, such as time, geography, and product line.
Figure 11-13: Components of a star schema
• Fact tables contain factual or quantitative data
• Dimension tables contain descriptions about the subjects of the business
• 1:N relationship between dimension tables and fact tables
• Excellent for ad-hoc queries, but bad for online transaction processing
• Dimension tables are denormalized to maximize performance
Figure 11-14 Star schema example
Fact table provides statistics for sales broken down by product, period and store dimensions
Figure 11-15 Star schema with sample data
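A star schema along these lines can be tried out with Python's built-in sqlite3 module. The sketch below is a minimal example, not the schema from the figures: the table names, column names, and sample rows are all assumptions chosen to match the product, period, and store dimensions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables (hypothetical columns).
CREATE TABLE product (product_id INTEGER PRIMARY KEY, description TEXT);
CREATE TABLE period  (period_id  INTEGER PRIMARY KEY, year INTEGER, quarter INTEGER);
CREATE TABLE store   (store_id   INTEGER PRIMARY KEY, city TEXT);
-- Fact table: one foreign key per dimension plus the numeric facts.
CREATE TABLE sales (
    product_id   INTEGER REFERENCES product(product_id),
    period_id    INTEGER REFERENCES period(period_id),
    store_id     INTEGER REFERENCES store(store_id),
    units_sold   INTEGER,
    dollars_sold REAL
);
INSERT INTO product VALUES (1, 'Widget'), (2, 'Gadget');
INSERT INTO period  VALUES (1, 2015, 1), (2, 2015, 2);
INSERT INTO store   VALUES (1, 'Dallas'), (2, 'Austin');
INSERT INTO sales VALUES
    (1, 1, 1, 10, 100.0),
    (1, 2, 1,  5,  50.0),
    (2, 1, 2,  3,  90.0),
    (2, 2, 2,  4, 120.0);
""")

# Typical star query: join the fact table to a dimension and aggregate.
query = """
SELECT s.city, SUM(f.units_sold), SUM(f.dollars_sold)
FROM sales f
JOIN store s ON s.store_id = f.store_id
GROUP BY s.city
ORDER BY s.city;
"""
totals = conn.execute(query).fetchall()
```

Each dimension row can appear in many fact rows, which is the 1:N relationship between dimension tables and the fact table.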
On-Line Analytical Processing (OLAP) Tools
• A set of graphical tools that provides users with multidimensional views of their data and allows them to analyze the data using simple windowing techniques
• Relational OLAP (ROLAP)
– Traditional relational representation
• Multidimensional OLAP (MOLAP)
– Cube structure
• OLAP Operations
– Cube slicing: come up with a 2-D view of the data
– Drill-down: going from summary to more detailed views
Figure 11-23 Slicing a data cube
Figure 11-24 Example of drill-down
Summary report
Drill-down with color added
Starting with summary data, users can obtain details for particular cells
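Slicing and drill-down can be sketched on a tiny in-memory cube. The example assumes a hypothetical cube keyed by (product, period, store) with units sold as the measure; the data and the function names are illustrative and not tied to any OLAP product.

```python
from collections import defaultdict

# Hypothetical cube cells: (product, period, store) -> units sold.
cube = {
    ("Widget", "Q1", "Dallas"): 10,
    ("Widget", "Q2", "Dallas"): 5,
    ("Gadget", "Q1", "Austin"): 3,
    ("Gadget", "Q2", "Austin"): 4,
    ("Widget", "Q1", "Austin"): 2,
}

def slice_cube(cube, store):
    """Slice: fix the store dimension to get a 2-D product-by-period view."""
    return {(p, t): v for (p, t, s), v in cube.items() if s == store}

def summarize(cube):
    """Summary report: total units per product."""
    totals = defaultdict(int)
    for (p, _t, _s), v in cube.items():
        totals[p] += v
    return dict(totals)

def drill_down(cube, product):
    """Drill-down: the per-period detail behind one summary cell."""
    detail = defaultdict(int)
    for (p, t, _s), v in cube.items():
        if p == product:
            detail[t] += v
    return dict(detail)

austin_view = slice_cube(cube, "Austin")
summary = summarize(cube)
widget_detail = drill_down(cube, "Widget")
```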
Data Mining and Visualization
• Knowledge discovery using a blend of statistical, AI, and computer graphics techniques
• Goals:
– Explain observed events or conditions
– Confirm hypotheses
– Explore data for new or unexpected relationships
• Techniques
– Statistical regression
– Decision tree induction
– Clustering and signal processing
– Affinity
– Sequence association
– Case-based reasoning
– Rule discovery
– Neural nets
– Fractals
• Data visualization: representing data in graphical/multimedia formats for analysis
Pivot Table
• Excel:
– Drill Down, Roll Up
• Access CrossTab query
SQL GROUPING SETS
• GROUPING SETS: aggregate once for each grouping set listed and return all results together.
– SELECT CITY, RATING, COUNT(CID) FROM HCUSTOMERS
– GROUP BY GROUPING SETS(CITY, RATING, (CITY, RATING), ())
– ORDER BY CITY;
• Note: () indicates that an overall total is desired.
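To see what rows such a query returns, the Python sketch below emulates it on a few made-up HCUSTOMERS rows. The sample data and the helper `grouping_sets` are assumptions for illustration; a column that is not part of the current grouping set is shown as None, much as SQL shows it as NULL.

```python
from collections import Counter

# Hypothetical HCUSTOMERS rows: (CID, CITY, RATING).
customers = [
    (1, "Dallas", "A"),
    (2, "Dallas", "B"),
    (3, "Austin", "A"),
]

def grouping_sets(rows, sets):
    """Count rows once per grouping set; ungrouped columns become None."""
    columns = {"CITY": 1, "RATING": 2}  # column name -> position in the row
    result = []
    for group_cols in sets:
        counts = Counter(
            tuple(row[columns[c]] if c in group_cols else None
                  for c in ("CITY", "RATING"))
            for row in rows
        )
        result.extend((city, rating, n) for (city, rating), n in counts.items())
    return result

# Same four grouping sets as the query: CITY, RATING, (CITY, RATING), ().
result = grouping_sets(
    customers,
    [("CITY",), ("RATING",), ("CITY", "RATING"), ()],
)
```

The empty grouping set contributes the single overall-total row, here `(None, None, 3)`.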
SQL CUBE
• Perform aggregations for all possible combinations of columns indicated.
– SELECT CITY, RATING, COUNT(CID) FROM HCUSTOMERS
– GROUP BY CUBE(CITY, RATING)
– ORDER BY CITY, RATING;
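CUBE can be read as shorthand for a GROUPING SETS list that contains every subset of the named columns. The short sketch below enumerates those subsets; the function name `cube_grouping_sets` is made up for the example.

```python
from itertools import combinations

def cube_grouping_sets(columns):
    """All subsets of the columns, from the full set down to the empty set."""
    return [
        combo
        for size in range(len(columns), -1, -1)
        for combo in combinations(columns, size)
    ]

sets = cube_grouping_sets(["CITY", "RATING"])
# sets == [("CITY", "RATING"), ("CITY",), ("RATING",), ()]
```

With n columns, CUBE produces 2^n grouping sets, so the result grows quickly as columns are added.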
SQL ROLLUP
• The ROLLUP extension causes cumulative subtotals to be calculated for the columns indicated. If multiple columns are indicated, subtotals are performed for each of the columns except the far-right column, followed by a grand total. ROLLUP(CITY, RATING) therefore returns one row per (CITY, RATING) pair, a subtotal per CITY, and an overall total.
– SELECT CITY, RATING, COUNT(CID) FROM HCUSTOMERS
– GROUP BY ROLLUP(CITY, RATING)
– ORDER BY CITY, RATING;
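ROLLUP corresponds to the grouping sets formed by the prefixes of the column list, which is why column order matters. A small sketch, with a made-up function name:

```python
def rollup_grouping_sets(columns):
    """Prefixes of the column list, from all columns down to the grand total."""
    return [tuple(columns[:n]) for n in range(len(columns), -1, -1)]

sets = rollup_grouping_sets(["CITY", "RATING"])
# sets == [("CITY", "RATING"), ("CITY",), ()]
```

There is no ("RATING",) set here, so no subtotal is produced for the far-right column on its own.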