Data Warehousing and elements of Data Mining · 2016-08-25 · • Data analysis system characteristics: FASMI – OLAP Report 1995 ... DW and elements of DM OLTP vs. OLAP Maurizio
• What is a data warehouse?
• A multi-dimensional data model
• Data warehouse architecture
• Data warehouse implementation
• OLAP analysis
• From data warehousing to data mining
• Principles of data mining
DW and elements of DM · Maurizio Pighin
What is a Data Warehouse?
• Defined in many different ways, but not rigorously.
  – A decision support database that is maintained separately from the organization’s operational database
  – Supports information processing by providing a solid platform of consolidated, historical data for analysis.
What is a Data Warehouse?
• “A data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management’s decision-making process.” – W. H. Inmon (1985)
• “A single, complete and consistent data warehouse, obtained by different sources, available to final users to be immediately utilized” – IBM Systems Journal (1990)
• Data warehousing:
  – The process of constructing and using data warehouses
Data Warehouse - Subject-Oriented
• Organized around major subjects, such as customer, product, sales.
• Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing.
• Provide a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process.
Data Warehouse - Integrated
• Constructed by integrating multiple, heterogeneous data sources
  – relational databases, flat files, on-line transaction records
• Data cleaning and data integration techniques are applied.
  – Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources
    • E.g., Hotel price: currency, tax, breakfast covered, etc.
  – When data is moved to the warehouse, it is converted.
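The hotel-price example can be sketched in a few lines of Python. This is only an illustration: the record layout, the exchange rate, the tax rate and the breakfast surcharge are invented here, since the slides give no concrete values.

```python
from dataclasses import dataclass

# Hypothetical conversion constants (assumptions, not taken from the slides).
EUR_PER_USD = 0.9       # assumed exchange rate
TAX_RATE = 0.10         # assumed tax added to net prices
BREAKFAST_EUR = 12.0    # assumed price of one breakfast


@dataclass
class SourcePrice:
    """A hotel price as delivered by one operational source."""
    amount: float
    currency: str             # "EUR" or "USD"
    tax_included: bool
    breakfast_included: bool


def to_warehouse_price(p: SourcePrice) -> float:
    """Convert to one warehouse convention: EUR, tax and breakfast included."""
    amount = p.amount
    if p.currency == "USD":
        amount *= EUR_PER_USD
    elif p.currency != "EUR":
        raise ValueError(f"unknown currency: {p.currency}")
    if not p.tax_included:
        amount *= 1 + TAX_RATE
    if not p.breakfast_included:
        amount += BREAKFAST_EUR
    return round(amount, 2)


# Two sources quoting prices under different conventions.
already_conformant = to_warehouse_price(SourcePrice(100.0, "EUR", True, True))
needs_conversion = to_warehouse_price(SourcePrice(100.0, "USD", False, False))
```

Once every source has been converted to the same convention, prices from different systems can be compared and aggregated directly.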
Data Warehouse - Time Variant
• The time horizon for the data warehouse is significantly longer than that of operational systems.
  – Operational database: current value data.
  – Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)
• Every key structure in the data warehouse
  – Contains an element of time, explicitly or implicitly
  – But the key of operational data may or may not contain an element of time
Why do we need all that?
• Operational databases are for On Line Transaction Processing (OLTP)
  – automate day-to-day operations (purchasing, banking etc)
  – transactions access (and modify!) a few records at a time
  – database design is application (process) oriented
  – metric: transactions/sec
Why do we need all that?
• Data Warehouse is for On Line Analytical Processing (OLAP)
  – complex queries that access millions of records
  – need historical data for trend analysis
  – long scans would interfere with normal operations
  – synchronizing data-intensive queries among physically separated databases would be a nightmare!
  – metric: query response time
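The two access patterns can be contrasted with a small sketch using Python's built-in sqlite3 module. The orders table, its columns and its contents are hypothetical and only serve to show the difference between a keyed single-record update and a full scan with aggregation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, year INTEGER, amount REAL, status TEXT)"
)
# 1000 made-up orders spread over the years 2010-2014.
rows = [(i, 2010 + i % 5, 10.0 * (i % 7 + 1), "open") for i in range(1, 1001)]
conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", rows)

# OLTP-style work: a short transaction that touches one record by key.
conn.execute("UPDATE orders SET status = 'shipped' WHERE id = ?", (42,))
status = conn.execute("SELECT status FROM orders WHERE id = 42").fetchone()[0]

# OLAP-style work: a read-only query that scans every record and aggregates.
sales_by_year = conn.execute(
    "SELECT year, SUM(amount), COUNT(*) FROM orders "
    "GROUP BY year ORDER BY year"
).fetchall()
```

On a real operational system the second query would compete with the short transactions for the same resources, which is the interference the slide refers to.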
• Traditional heterogeneous DB integration:
  – Build wrappers/mediators on top of heterogeneous databases
  – Query driven approach
    • When a query is posed to a client site, a meta-dictionary is used to translate the query into queries appropriate for individual heterogeneous sites involved, and the results are integrated into a global answer set
    • Complex information filtering, compete for resources
• Data warehouse: update-driven, high performance
  – Information from heterogeneous sources is integrated in advance and stored in warehouses for direct query and analysis
  – Data analysis and decision making
• Distinct features (OLTP vs. OLAP):
  – System orientation: process vs. business subject
  – Data contents: current, detailed vs. historical, consolidated
  – Database design: ER + application vs. Multidimensional + subject
  – View: current, local vs. evolutionary, integrated
  – Access patterns: update vs. read-only but complex queries
                      OLTP                         OLAP
users                 clerk, IT professional       knowledge worker
function              day to day operations        decision support
DB design             application-oriented         subject-oriented
data                  current, up-to-date
unit of work          short, simple transaction    complex query
# records accessed    tens                         millions
# users               thousands                    hundreds
DB size               100MB-GB                     100GB-TB
metric                transaction throughput       query throughput, response
Why Separate Data Warehouse?
• High performance for both systems
  – DBMS - tuned for OLTP: access methods, indexing, concurrency control, recovery
  – Warehouse - tuned for OLAP: complex OLAP queries, multidimensional view, consolidation.
• Different functions and different data:
  – missing data: Decision Support requires historical data which operational DBs do not typically maintain
  – data consolidation: Decision Support requires consolidation (aggregation, summarization) of data from heterogeneous sources
  – data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled
Multidimensional model
• A data warehouse is based on a multidimensional data model which views data in the form of a data cube (hypercube)
• A hypercube is a multidimensional array which represents a particular event
• We define a “fact” as a point of this multidimensional array, obtained by crossing existing co-ordinates
  – Dimension: fact co-ordinate
  – Measure: numerical value characterizing the event
Multidimensional model - example
• A data cube, such as sales, allows numerical data (measures) to be modeled and viewed in multiple dimensions
  – Measures, such as transaction value (dollars_sold), quantity (item_quantity)
  – Dimensions, such as item (item_name, brand, type), or time (day, week, month, quarter, year), or customer (customer_name, city, region, state)
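A very small sales cube can be sketched as a Python dictionary whose keys are the dimension co-ordinates and whose values are the measures. The measure names dollars_sold and item_quantity come from the slide; the items, quarters, cities, figures and the roll_up helper are made up for illustration.

```python
from collections import defaultdict

# Each fact is one cell of the cube: (item, quarter, city) -> measures.
facts = {
    ("TV", "Q1", "Rome"): {"dollars_sold": 1200.0, "item_quantity": 3},
    ("TV", "Q2", "Rome"): {"dollars_sold": 800.0, "item_quantity": 2},
    ("PC", "Q1", "Milan"): {"dollars_sold": 1500.0, "item_quantity": 1},
    ("PC", "Q1", "Rome"): {"dollars_sold": 3000.0, "item_quantity": 2},
}


def roll_up(cube, keep):
    """Sum the measures over every dimension whose index is not in `keep`."""
    totals = defaultdict(lambda: {"dollars_sold": 0.0, "item_quantity": 0})
    for coords, measures in cube.items():
        key = tuple(coords[i] for i in keep)
        for name, value in measures.items():
            totals[key][name] += value
    return dict(totals)


by_item = roll_up(facts, keep=(0,))                # view by item only
by_quarter_city = roll_up(facts, keep=(1, 2))      # view by quarter and city
```

Choosing which dimensions to keep is what "viewing the data in multiple dimensions" amounts to: the same facts yield a different summary for each choice.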
Measures
• Every fact can contain more than one measure
• A measure may be
  – Saved on the Data Warehouse (effective)
  – Run-time evaluated from effective measures
  – Implicit (presence or absence of a fact)
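The three kinds of measure can be illustrated as follows. This is a sketch with invented data: the average-price measure and the enrolment example are assumptions chosen only to show a derived and an implicit measure.

```python
# Effective measures: stored with the fact in the warehouse.
fact = {"dollars_sold": 1200.0, "item_quantity": 3}


def average_price(f):
    """Derived measure: evaluated at run time from the stored measures."""
    return f["dollars_sold"] / f["item_quantity"]


# Implicit measure: the fact carries no number, only its presence counts.
# Here each (student, course) pair records that an enrolment exists.
enrolments = {
    ("anna", "databases"),
    ("luca", "databases"),
    ("anna", "statistics"),
}


def enrolled_count(course):
    """Count the facts present for one course."""
    return sum(1 for _, c in enrolments if c == course)


price = average_price(fact)
databases_students = enrolled_count("databases")
```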
OLAP Server Architectures
• Relational OLAP (ROLAP)
  – Use relational or extended-relational DBMS to store and manage warehouse data and OLAP middleware to support missing pieces
  – Include optimization of DBMS backend, implementation of aggregation navigation logic, and additional tools and services
• Modeling data warehouses: dimensions & measures on ROLAP Systems
  – Star schema: A fact table in the middle connected to a set of dimension tables
  – Snowflake schema: A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to snowflake
  – Fact constellations: Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation
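A star schema can be sketched with Python's built-in sqlite3 module. The table names (fact_sales, dim_item, dim_time) and all rows are hypothetical; the attribute and measure names follow the sales example used earlier in the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: one row per item and per day.
CREATE TABLE dim_item (
    item_id INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT);
CREATE TABLE dim_time (
    time_id INTEGER PRIMARY KEY, day TEXT, quarter TEXT, year INTEGER);

-- Fact table in the middle: foreign keys to the dimensions plus measures.
CREATE TABLE fact_sales (
    item_id INTEGER REFERENCES dim_item(item_id),
    time_id INTEGER REFERENCES dim_time(time_id),
    dollars_sold REAL,
    item_quantity INTEGER);

INSERT INTO dim_item VALUES
    (1, 'TV 42', 'Acme', 'electronics'),
    (2, 'Sofa', 'Comfy', 'furniture');
INSERT INTO dim_time VALUES
    (1, '2016-01-15', 'Q1', 2016),
    (2, '2016-04-20', 'Q2', 2016);
INSERT INTO fact_sales VALUES
    (1, 1, 1200.0, 3),
    (1, 2, 800.0, 2),
    (2, 1, 500.0, 1);
""")

# A typical star join: aggregate a measure by attributes of two dimensions.
sales_by_type_quarter = conn.execute("""
    SELECT i.type, t.quarter, SUM(f.dollars_sold)
    FROM fact_sales AS f
    JOIN dim_item AS i ON i.item_id = f.item_id
    JOIN dim_time AS t ON t.time_id = f.time_id
    GROUP BY i.type, t.quarter
    ORDER BY i.type, t.quarter
""").fetchall()
```

A snowflake variant would, for example, move brand out of dim_item into its own table referenced by a key; a fact constellation would add a second fact table sharing dim_item and dim_time.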
General Architecture
• Enterprise warehouse
  – collects all of the information about subjects spanning the entire organization
• Data Mart
  – a subset of corporate-wide data that is of value to a specific group of users. Its scope is confined to specific, selected groups, such as marketing data mart
    • Independent vs. dependent (directly from warehouse) data mart
• Data extraction:
  – get data from multiple, heterogeneous, and external sources
• Data cleaning:
  – detect errors in the data and rectify them when possible
• Data transformation:
  – convert data from legacy or host format to warehouse format
• Load:
  – sort, summarize, consolidate, compute views, check integrity, and build indices and partitions
• Refresh:
  – propagate the updates from the data sources to the warehouse
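The extraction, cleaning, transformation, load and refresh steps can be sketched as a chain of small Python functions. Everything here is a simplification under stated assumptions: the sources are in-memory lists, the warehouse is a plain dictionary, cleaning only drops rows that cannot be repaired, and the field names customer and amount are invented.

```python
def extract(sources):
    """Extraction: gather raw rows from several heterogeneous sources."""
    return [row for source in sources for row in source]


def clean(rows):
    """Cleaning: keep rows with a customer and a parseable amount."""
    cleaned = []
    for row in rows:
        try:
            amount = float(row["amount"])
            customer = row["customer"]
        except (KeyError, TypeError, ValueError):
            continue  # unrepairable row: dropped in this sketch
        if customer:
            cleaned.append({"customer": customer, "amount": amount})
    return cleaned


def transform(rows):
    """Transformation: bring customer codes to one warehouse format."""
    return [
        {"customer": row["customer"].strip().upper(), "amount": row["amount"]}
        for row in rows
    ]


def load(warehouse, rows):
    """Load: consolidate (sum) the amounts per customer."""
    for row in rows:
        key = row["customer"]
        warehouse[key] = warehouse.get(key, 0.0) + row["amount"]
    return warehouse


def refresh(warehouse, new_sources):
    """Refresh: propagate newly arrived source rows to the warehouse."""
    return load(warehouse, transform(clean(extract(new_sources))))


crm = [{"customer": " acme ", "amount": "100.5"},
       {"customer": "Beta", "amount": "n/a"}]
billing = [{"customer": "ACME", "amount": 50},
           {"customer": "beta", "amount": None}]

warehouse = refresh({}, [crm, billing])                      # initial load
snapshot = dict(warehouse)
warehouse = refresh(warehouse, [[{"customer": "beta", "amount": "20"}]])
```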
Data Warehouse Design Process
• Typical data warehouse design process (bottom-up approach)
  – Choose a business process to model, e.g., orders, invoices, etc.
  – Choose the grain (atomic level of data) of the business process
  – Choose the dimensions that will apply to each fact table record
  – Choose the measures that will populate each fact table record
  – Design the architecture of the DW
  – Design the ETL
  – Install and test
• Advantages
  – Results in a short time
  – Not too expensive
  – Gives management a clear perspective of the OLAP world
What Is Data Mining?
• Data mining (knowledge discovery in databases):
  – Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases
• Alternative names:
  – Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
Classification: Definition
• Given a collection of records (training set)
• Each record contains a set of attributes, one of the attributes is the class.
• Find a model for the class attribute as a function of the values of the other attributes.
• Goal: previously unseen records should be assigned a class as accurately as possible.
• Methodology: a test set is used to determine the accuracy of the model. Usually, the given data set is randomly divided into training and test sets, with the training set used to build the model and the test set used to validate it.
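The methodology above can be sketched in plain Python. The slides do not name a particular model, so a 1-nearest-neighbour classifier is used here purely as a stand-in; the data set, the 30% test fraction and the fixed random seed are likewise assumptions.

```python
import math
import random


def split(records, test_fraction=0.3, seed=0):
    """Randomly divide the records into a training set and a test set."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def predict(training, attributes):
    """Stand-in model: return the class of the closest training record."""
    nearest = min(training, key=lambda rec: math.dist(rec[0], attributes))
    return nearest[1]


def accuracy(training, test):
    """Fraction of test records whose class is predicted correctly."""
    correct = sum(1 for attrs, label in test
                  if predict(training, attrs) == label)
    return correct / len(test)


# Made-up records: (attributes, class). The two classes are far apart.
records = (
    [((float(i), float(i % 3)), "low") for i in range(10)]
    + [((100.0 + i, float(i % 3)), "high") for i in range(10)]
)

training_set, test_set = split(records)
model_accuracy = accuracy(training_set, test_set)
```

The test records play no part in building the model, so the accuracy measured on them estimates how well previously unseen records would be classified.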
Classification Example
Classification: Application
• Direct Marketing
  – Goal: Reduce cost of mailing by targeting a set of consumers likely to buy a new cell-phone product.
  – Approach:
    • Use the data for a similar product introduced before.
    • We know which customers decided to buy and which decided otherwise. This {buy, don’t buy} decision forms the class attribute.
    • Collect various demographic, lifestyle, and company-interaction related information about all such customers.
      – Type of business, where they stay, how much they earn, etc.
    • Use this information as input attributes to learn a classifier model.
Clustering Definition
• Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that
  – Data points in one cluster are more similar to one another.
  – Data points in separate clusters are less similar to one another.
• Similarity Measures
  – Euclidean Distance if attributes are continuous.
  – Other Problem-specific Measures
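As an illustration of clustering with Euclidean distance, here is a minimal k-means sketch. The slides do not prescribe an algorithm, so k-means is an assumed choice; the points and the initial centroids are invented and fixed so that the run is repeatable.

```python
import math


def assign(points, centroids):
    """Put each point in the cluster of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)),
                      key=lambda i: math.dist(p, centroids[i]))
        clusters[nearest].append(p)
    return clusters


def mean(cluster):
    """Component-wise mean of the points of one cluster."""
    return tuple(sum(values) / len(cluster) for values in zip(*cluster))


def k_means(points, centroids, iterations=10):
    """Alternate assignment and centroid update a fixed number of times."""
    for _ in range(iterations):
        clusters = assign(points, centroids)
        # An empty cluster keeps its previous centroid.
        centroids = [mean(c) if c else old
                     for c, old in zip(clusters, centroids)]
    return assign(points, centroids), centroids


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
clusters, centroids = k_means(points, centroids=[points[0], points[3]])
```

With continuous attributes, math.dist gives the Euclidean distance; a problem-specific measure could be swapped in by replacing that call.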
Deviation Detection
• The search for “outliers”
• Outlier: exception, element out of range
• The search is based on the same principles as clustering
• Concentrates the effort on finding elements “far” from the others
• Search method
  – Statistical
    • Can be used if a statistical distribution can be evaluated
  – Distance based
    • Search for elements which maximize the distance from the other elements of the set
  – Deviation based
    • Search for elements which maximize the deviance from the other elements of the set.
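Two of the search methods can be sketched in Python. Both functions are illustrative assumptions rather than the method of the slides: the distance-based one returns the element with the largest mean Euclidean distance to the others, and the statistical one flags values more than a chosen number of standard deviations from the mean, which presumes roughly normal data. The sample data and the threshold of 2 are invented.

```python
import math
import statistics


def distance_based_outlier(points):
    """Return the point with the largest mean distance to all the others."""
    def mean_distance(index):
        p = points[index]
        others = [q for j, q in enumerate(points) if j != index]
        return sum(math.dist(p, q) for q in others) / len(others)

    return points[max(range(len(points)), key=mean_distance)]


def statistical_outliers(values, threshold=2.0):
    """Values farther than `threshold` standard deviations from the mean."""
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)
    return [v for v in values if abs(v - mu) > threshold * sigma]


points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (10.0, 10.0)]
far_point = distance_based_outlier(points)

values = [10, 11, 9, 10, 12, 10, 11, 9, 10, 50]
unusual_values = statistical_outliers(values)
```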