Page 1: Data Mining Concepts & Techniques · 2015-01-08 · Naeem Ahmed, Department of Software Engineering, Mehran University of Engineering and Technology, Jamshoro · Email: naeemmahoto@gmail.com

Naeem Ahmed

Department of Software Engineering, Mehran University of Engineering and Technology, Jamshoro

Email: naeemmahoto@gmail.com

Data Mining Concepts & Techniques Lecture No. 01

Databases, Data warehouse

Database
•  A database is a large, integrated collection of data
•  A database management system (DBMS) is a software system designed to store, manage and facilitate access to the database
•  A schema is a description of a database
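
The three terms can be seen side by side in a minimal sketch. It assumes Python's built-in `sqlite3` module as the DBMS; the `customer` table and its columns are hypothetical and not part of the lecture.

```python
import sqlite3

# SQLite plays the role of the DBMS; the in-memory database is the data collection.
conn = sqlite3.connect(":memory:")

# The schema describes the database: which tables exist and what columns they have.
# The table and column names are hypothetical.
conn.execute(
    "CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL, city TEXT)"
)
conn.execute("INSERT INTO customer (name, city) VALUES (?, ?)", ("Ali", "Jamshoro"))
conn.commit()

# The DBMS stores the schema itself and lets us query it like any other data.
schema_sql = conn.execute(
    "SELECT sql FROM sqlite_master WHERE type = 'table' AND name = 'customer'"
).fetchone()[0]
rows = conn.execute("SELECT id, name, city FROM customer").fetchall()
print(schema_sql)
print(rows)
```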

What is a Data Warehouse?
•  Basically a very large database…
    –  Not all very large databases are data warehouses, but all data warehouses are pretty large databases
    –  Nowadays a warehouse is considered to start at around 800 GB and to go up to several TB
    –  It spans several servers and needs an impressive amount of computing power

What is a Data Warehouse?
•  More specifically, a collective data repository
    –  Containing snapshots of the operational data (history)
    –  Obtained through data cleansing and ETL (Extract-Transform-Load)
    –  Useful for analytics
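
A minimal sketch of the Extract-Transform-Load idea, assuming an in-memory list stands in for both the operational source and the warehouse. The field names, the cleansing rules and the sample rows are hypothetical.

```python
from datetime import date

# Hypothetical rows as they might come out of an operational system.
operational_rows = [
    {"customer": " Ali ", "amount": "120.50"},
    {"customer": "SARA", "amount": "80"},
    {"customer": "", "amount": "15"},  # dirty row: missing customer name
]

def extract(source):
    """Extract: read the rows from the operational source."""
    return list(source)

def transform(rows):
    """Transform: cleanse the rows (drop incomplete ones, normalize names and types)."""
    cleaned = []
    for row in rows:
        name = row["customer"].strip()
        if not name:
            continue
        cleaned.append({"customer": name.title(), "amount": float(row["amount"])})
    return cleaned

def load(warehouse, rows, snapshot_date):
    """Load: append the cleansed rows to the warehouse, stamped with the snapshot date."""
    for row in rows:
        warehouse.append({**row, "snapshot_date": snapshot_date})

warehouse = []
load(warehouse, transform(extract(operational_rows)), date(2015, 1, 8))
print(warehouse)
```

Stamping each loaded row with a snapshot date is what lets the warehouse keep history rather than only the current state.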

What is a Data Warehouse?
•  Compared to other solutions it…
    –  Is suitable for tactical/strategic focus
    –  Implies a small number of transactions
    –  Implies large transactions spanning a long period of time

Some Definitions
•  Ralph Kimball: “a copy of transaction data specifically structured for query and analysis”
•  Bill Inmon (father of data warehousing, in 1993): a data warehouse is a
    –  subject-oriented
    –  integrated
    –  non-volatile
    –  time-variant
   collection of data in support of management’s decisions

Data Warehouse
•  Subject oriented: Data is arranged by subject area rather than by application. Data is organized so that all the data elements relating to the same real-world event or object are linked together
    –  Typical subject areas in DWs are Customer, Product, Order, Claim, Account, …

Data Warehouse
•  Subject oriented:
    –  Example: customer as subject in a DW
        •  The DW is organized in this case by customer
        •  It may consist of 10, 100 or more physical tables, all related

Data Warehouse
•  Integrated: Data is collected from the multiple, diverse operational systems of an organization and stored in a consistent form
    –  E.g., gender, measurement, conflicting keys, consistency, …
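
A minimal sketch of what making the data consistent can look like for the gender and measurement examples. The source encodings, the `integrate` helper and the sample records are hypothetical.

```python
# Hypothetical source systems that encode the same facts differently.
GENDER_CODES = {"m": "M", "male": "M", "1": "M", "f": "F", "female": "F", "0": "F"}
TO_CM = {"cm": 1.0, "inch": 2.54}  # 1 inch is exactly 2.54 cm

def integrate(record):
    """Map one source record onto the warehouse's single agreed representation."""
    return {
        "gender": GENDER_CODES[str(record["gender"]).strip().lower()],
        "height_cm": round(record["height"] * TO_CM[record["unit"]], 1),
    }

source_a = {"gender": "male", "height": 180, "unit": "cm"}
source_b = {"gender": 0, "height": 65, "unit": "inch"}
print(integrate(source_a))
print(integrate(source_b))
```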

Data Warehouse
•  Non-volatile: Data in the data warehouse is never overwritten or deleted. Once committed, the data is static, read-only, and retained for future reporting. Data is loaded, but not updated
    –  When subsequent changes occur, a new snapshot record is written
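
The snapshot behaviour can be sketched with an append-only store. This is an illustration under stated assumptions, not a warehouse implementation: the `SnapshotStore` class, its methods and the sample data are hypothetical.

```python
from datetime import date

class SnapshotStore:
    """Append-only store: records are added, never overwritten or deleted."""

    def __init__(self):
        self._rows = []

    def load(self, key, value, as_of):
        # A change is recorded as a new snapshot row; older rows stay untouched.
        self._rows.append({"key": key, "value": value, "as_of": as_of})

    def history(self, key):
        return [r for r in self._rows if r["key"] == key]

    def latest(self, key):
        return max(self.history(key), key=lambda r: r["as_of"])

store = SnapshotStore()
store.load("customer-1", {"city": "Hyderabad"}, date(2014, 1, 1))
store.load("customer-1", {"city": "Jamshoro"}, date(2015, 1, 8))  # customer moved
print(len(store.history("customer-1")))
print(store.latest("customer-1")["value"]["city"])
```

Because the earlier row survives, reports can still show where the customer was before the change.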

Data Warehouse
•  Time-variant: Changes to the data in the data warehouse are tracked and recorded so that reports can be produced showing changes over time
    –  Different environments have different time horizons associated with them
        •  While a 60-to-90-day time horizon is normal for operational systems, a data warehouse has a 5-to-10-year horizon

Data Warehouse vs. Operational Database

Data Warehouse         Operational Database
•  Subject oriented    •  Application oriented
•  Integrated          •  Multiple diverse sources
•  Non-volatile        •  Updateable
•  Time-variant        •  Real-time, current

OnLine Transaction Processing
•  OLTP (OnLine Transaction Processing):
    –  Also known as operational data, it represents day-to-day operational business activities:
        •  Purchasing, sales, production, distribution, …
    –  Typically used for data entry and retrieval transaction processing
    –  Reflects only the current state of the data

OnLine Analytical Processing
•  OLAP (OnLine Analytical Processing):
    –  Represents front-end analytics based on a DW repository
    –  It provides information for activities like:
        •  Resource planning, capital budgeting, marketing initiatives, ...
    –  It is decision oriented

OLTP vs. DW
•  Properties

Operational DB             DW
Mostly updates             Mostly reads
Many small transactions    Queries long, complex
MB-TB of data              GB-PB of data
Raw data                   Summarized data
Clerical users             Decision makers
Up-to-date data            May be slightly outdated

OLTP vs. DW

                     OLTP                                  Data Warehouse
users                clerk, IT professional                knowledge worker
function             day-to-day operations                 decision support
DB design            application-oriented                  subject-oriented
data                 current, up-to-date, detailed,        historical, summarized, multidimensional,
                     flat relational, isolated             integrated, consolidated
usage                repetitive                            ad-hoc
access               read/write, index/hash on prim. key   lots of scans
unit of work         short, simple transaction             complex query
# records accessed   tens                                  millions
# users              thousands                             hundreds
DB size              100 MB-GB                             100 GB-TB
metric               transaction throughput                query throughput, response

Applications of DW
•  A DW is the base repository for front-end analytics
    –  OLAP
    –  KDD
    –  Data visualization
    –  Reporting

KDD (Knowledge Discovery in Databases): a data mining process

Lifecycle of DW
•  Classical SDLC vs. DW SDLC
•  DW SDLC is almost the opposite of classical SDLC

Lifecycle of DW
•  Classical SDLC vs. DW SDLC
•  Because it is the opposite of SDLC, DW SDLC is also called CLDS

Architecture of DW
•  Basic Architecture

Data Warehouse Architecture

Data Mart
•  A data mart is a special-purpose subset of enterprise data for a particular function or application (it may contain detail or summary data or both)
•  Data mart types:
    –  Independent: created directly from operational systems to a separate physical data store
    –  Logical: exists as a subset of an existing data warehouse
    –  Dependent: created from a data warehouse to a separate physical data store

Phases

Data Modeling
•  Conceptual Design
    –  Transforms data requirements into a conceptual model
    –  The conceptual model describes data entities, relationships, constraints, etc. at a high level
        •  Does not contain any implementation details
        •  Independent of the software and hardware used
•  Logical Design
    –  Maps the conceptual data model to the logical data model used by the DBMS
        •  E.g., relational model, dimensional model, ...
        •  The technology-independent conceptual model is adapted to the DBMS software used
•  Physical Design
    –  Creates the internal structures needed to store and manage data efficiently
        •  Table spaces, indexes, access paths, ...
        •  Depends on the hardware and DBMS software used

DW Modeling
•  Data Modeling
    –  Conceptual Modeling:
        •  Multidimensional Entity Relationship (ME/R) Model
        •  Multidimensional UML (mUML)
    –  Logical Modeling:
        •  Cubes, Dimensions, Hierarchies
    –  Physical Modeling:
        •  Star, Snowflake, Array storage

DW Modeling
•  Components
    –  Facts: a fact is a focus of interest for decision-making, e.g., sales, shipments, ...
    –  Measures: attributes that describe facts from different points of view, e.g., each sale is measured by its revenue
    –  Dimensions: discrete attributes which determine the granularity adopted to represent facts, e.g., product, store, date
    –  Hierarchies: are made up of dimension attributes
        •  Determine how facts may be aggregated and selected, e.g., day – month – quarter – year
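
The four components can be sketched in a few lines. The `Sale` class, the `time_hierarchy` helper and the sample values are hypothetical and only mirror the examples on the slide (sales, revenue, product/store/date, day to year).

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Sale:
    """One fact: a sale, located by its dimensions and described by a measure."""
    product: str      # dimension
    store: str        # dimension
    day: date         # dimension (finest level of the time hierarchy)
    revenue: float    # measure

def time_hierarchy(day):
    """Levels of the time hierarchy: day - month - quarter - year."""
    return {
        "day": day.isoformat(),
        "month": f"{day.year}-{day.month:02d}",
        "quarter": f"{day.year}-Q{(day.month - 1) // 3 + 1}",
        "year": str(day.year),
    }

sale = Sale(product="Laptop", store="Karachi", day=date(2014, 5, 20), revenue=950.0)
print(time_hierarchy(sale.day))
```

Each level of the hierarchy is a coarser grouping key for the same fact, which is what makes aggregation along the hierarchy possible.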

OLAP
•  A decision support system (DSS) that supports ad-hoc querying, i.e., enables managers and analysts to interactively manipulate data
•  Analysis of information in a database for the purpose of making management decisions
•  The idea is to allow users to easily and quickly manipulate and visualize the data through multidimensional views (i.e., different perspectives)
•  OLAP (OnLine Analytical Processing) analyzes historical data (terabytes) using complex queries

OLAP
•  OLAP Council definition:
    –  A category of software technology that enables analysts, managers and executives to gain insight into data through fast, consistent, interactive access to a wide variety of possible views of information that has been transformed from raw data to reflect the real dimensionality of the enterprise as understood by the user
•  OLAP is implemented in a multi-user client/server mode and offers consistently rapid response to queries, regardless of database size and complexity

OLAP
•  OLAP primarily involves aggregating large amounts of diverse data
•  OLAP functionality provides dynamic multi-dimensional analysis, supporting analytical and navigational activities
•  OLAP functionality is provided by the OLAP Server
•  The OLAP Council defines an OLAP Server as:
    –  ‘A high capacity, multi-user data manipulation engine specifically designed to support and operate on multi-dimensional data structures.’

OLTP vs. OLAP

OLTP                               OLAP
Operational processing             Informational processing
Transaction-oriented               Analysis-oriented
For operational staff              For managers, executives & analysts
Daily operations                   Decision support
Current, up-to-date data           Historical data
Primitive, highly detailed data    Summarized, consolidated data
Detailed, flat relational views    Summarized, multi-dimensional views
Short, simple transactions         Complex aggregate queries
Read/write                         Mostly read only
Index on keys                      Many scans
Many users                         Small number of users
Large databases                    Very large databases

OLTP vs. OLAP
•  On-Line Transaction Processing
    –  Transfer $100 from my savings account to my checking account
•  On-Line Analytical Processing
    –  What is the average balance of accounts by customer groups, account types, areas, account managers, and their combinations?
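
The two kinds of request can be contrasted in a small sketch. It assumes Python's built-in `sqlite3` module; the `account` table, its columns and the balances are hypothetical, and only the account type is used as a grouping attribute.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account (id INTEGER PRIMARY KEY, owner TEXT, type TEXT, balance REAL)"
)
conn.executemany(
    "INSERT INTO account (owner, type, balance) VALUES (?, ?, ?)",
    [
        ("me", "savings", 500.0),
        ("me", "checking", 100.0),
        ("other", "savings", 300.0),
        ("other", "checking", 200.0),
    ],
)

# OLTP: a short read/write transaction touching two rows.
with conn:  # commits on success, rolls back if an exception is raised
    conn.execute(
        "UPDATE account SET balance = balance - 100 WHERE owner = 'me' AND type = 'savings'"
    )
    conn.execute(
        "UPDATE account SET balance = balance + 100 WHERE owner = 'me' AND type = 'checking'"
    )

# OLAP: a read-only aggregate query that scans every row.
avg_by_type = conn.execute(
    "SELECT type, AVG(balance) FROM account GROUP BY type ORDER BY type"
).fetchall()
print(avg_by_type)
```

The transfer changes two rows and must be atomic; the analytical query changes nothing but reads the whole table.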

DW Queries
•  DW queries are big queries
    –  Involve a large portion of the data
    –  Read-only queries
        •  No updates
    –  Redundancy is a necessity
        •  Materialized views, special-purpose indexes, de-normalized schemas
    –  Data is refreshed periodically
        •  E.g., daily or weekly
    –  Their purpose is to analyze data
        •  OLAP (OnLine Analytical Processing)

OLAP operations
•  Typical OLAP operations
    –  Roll-up
    –  Drill-down
    –  Slice and dice
    –  Pivot (rotate)
•  Other operations
    –  Aggregate functions
    –  Ranking and comparing
    –  Drill-across
    –  Drill-through

Roll-up
•  Roll-up (drill-up)
    –  Taking the current aggregation level of fact values and doing a further aggregation
    –  Summarize data by
        •  Climbing up a hierarchy (hierarchical roll-up)
        •  Dimensional reduction
        •  Or a mix of these two techniques
    –  Used for obtaining an increased generalization
        •  E.g., from Time.Week to Time.Year

Roll-up
•  Hierarchical roll-ups
    –  Performed on the fact table and some dimension tables by climbing up the attribute hierarchies
        •  E.g., climb the Time hierarchy to Quarter and the Article hierarchy to Prod. group
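
A hierarchical roll-up can be sketched in plain Python. The fact rows, the article-to-group mapping and the helper names are hypothetical; only the idea of climbing Time to Quarter and Article to product group comes from the slide.

```python
from collections import defaultdict

# Hypothetical fact rows: (article, month, client, amount).
sales = [
    ("Laptop", 1, "c1", 10),
    ("Laptop", 2, "c2", 20),
    ("CellPhone", 2, "c1", 5),
    ("Juice", 4, "c1", 3),
]
# Hypothetical hierarchies: article -> product group, month -> quarter.
product_group = {"Laptop": "Electronics", "CellPhone": "Electronics", "Juice": "Food"}

def quarter(month):
    return f"Q{(month - 1) // 3 + 1}"

def hierarchical_rollup(rows):
    """Aggregate amounts at the (product group, quarter, client) level."""
    totals = defaultdict(int)
    for article, month, client, amount in rows:
        totals[(product_group[article], quarter(month), client)] += amount
    return dict(totals)

print(hierarchical_rollup(sales))
```

All three dimensions are still present in the result; two of them are simply expressed at a coarser level.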

Roll-up
•  Dimensional roll-ups
    –  Are done solely on the fact table by dropping one or more dimensions
        •  E.g., drop the Client dimension
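
Dropping a dimension amounts to summing over all of its values. A minimal sketch with hypothetical fact rows and a hypothetical `drop_client` helper:

```python
from collections import defaultdict

# Hypothetical fact rows: (article, month, client, amount).
sales = [
    ("Laptop", 1, "c1", 10),
    ("Laptop", 1, "c2", 20),
    ("Juice", 1, "c1", 3),
]

def drop_client(rows):
    """Dimensional roll-up: sum amounts over all clients, keeping article and month."""
    totals = defaultdict(int)
    for article, month, _client, amount in rows:
        totals[(article, month)] += amount
    return dict(totals)

print(drop_client(sales))
```

No dimension table is consulted here, which is why this kind of roll-up works on the fact table alone.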

Roll-up
•  Climbing above the top in hierarchical roll-up
    –  In the extreme case, a hierarchical roll-up above the top level of an attribute hierarchy (attribute “ALL”) can be viewed as a dimensional roll-up

Drill-down
•  Drill-down (roll-down)
    –  Reverse of roll-up
    –  Represents a de-aggregation operation
        •  From a higher level of summary to a lower level of summary (detailed data)
    –  Introducing new dimensions
    –  Requires the existence of materialized finer-grained data
        •  One can’t drill down without having the data
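
A minimal sketch of why drill-down needs the finer-grained data: the yearly totals alone cannot be split back into quarters, so the detailed rows have to be kept. The data and helper names are hypothetical.

```python
from collections import defaultdict

# Detailed (finer-grained) facts must be kept: (year, quarter, amount). Hypothetical data.
detailed = [(2014, "Q1", 40), (2014, "Q2", 60), (2015, "Q1", 30)]

def by_year(rows):
    """Summary level: one total per year."""
    totals = defaultdict(int)
    for year, _quarter, amount in rows:
        totals[year] += amount
    return dict(totals)

def drill_down(rows, year):
    """Go from the yearly total back to the quarterly values of one year."""
    return {quarter: amount for y, quarter, amount in rows if y == year}

print(by_year(detailed))
print(drill_down(detailed, 2014))
```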

Roll-up drill-down example

Slice
•  Slice: a subset of the multi-dimensional array corresponding to a single value for one or more dimensions and projecting on the rest of the dimensions
    –  E.g., project on Geo (store) and Time from values corresponding to Laptops in the product dimension:
       π_{StoreId, TimeId, Amount}(σ_{ArticleId = LaptopId}(Sales))

Slice
•  Amounts to an equality select condition
•  WHERE clause in SQL
    –  E.g., slice Laptops
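
A slice on Laptops can be sketched as a SQL query with an equality condition in the WHERE clause. The sketch assumes Python's built-in `sqlite3` module; the `sales` table, its column names and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (store_id TEXT, time_id TEXT, article_id TEXT, amount INTEGER)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("s1", "t1", "Laptop", 12),
        ("s2", "t1", "CellP", 7),
        ("s1", "t2", "Laptop", 5),
    ],
)

# Slice: fix the product dimension to a single value (selection, WHERE)
# and keep the remaining dimensions plus the measure (projection, SELECT list).
laptop_slice = conn.execute(
    "SELECT store_id, time_id, amount FROM sales "
    "WHERE article_id = ? ORDER BY time_id, store_id",
    ("Laptop",),
).fetchall()
print(laptop_slice)
```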

Slice
•  Slicing means taking a slice out of a cube by fixing the values of a chosen set of dimensions
    –  E.g., sales where city = ‘Karachi’ and date = ‘20/1/2014’

[Figure: sales cube with stores s1, s2, s3 as columns, products p1, p2 as rows and one layer per day (day 1, day 2). The slice TIME = day 1 keeps only the day 1 layer, with row values p1: 12, 50 and p2: 11, 8]

Dice
•  Dice: amounts to a range select condition on one dimension, or to an equality select condition on more than one dimension
    –  E.g., range SELECT:
       π_{StoreId, TimeId, Amount}(σ_{ArticleId ∈ {Laptop, CellP}}(Sales))

Dice
•  E.g., equality SELECT on 2 dimensions, Product and Time:
   π_{StoreId, Amount}(σ_{ArticleId = Laptop ∧ MonthID = December}(Sales))
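
Both dice variants can be sketched as a selection followed by a projection. The fact rows, field names and helper functions are hypothetical; the conditions mirror the two examples (articles in {Laptop, CellP}, and Laptop in December).

```python
# Hypothetical fact rows of the Sales cube.
sales = [
    {"store": "s1", "time": "t1", "month": "December", "article": "Laptop", "amount": 12},
    {"store": "s2", "time": "t2", "month": "November", "article": "CellP", "amount": 7},
    {"store": "s3", "time": "t3", "month": "December", "article": "Juice", "amount": 3},
]

def dice_articles(rows, articles):
    """Set-membership selection on one dimension, projected on store, time, amount."""
    return [(r["store"], r["time"], r["amount"]) for r in rows if r["article"] in articles]

def dice_article_month(rows, article, month):
    """Equality selection on two dimensions, projected on store and amount."""
    return [
        (r["store"], r["amount"])
        for r in rows
        if r["article"] == article and r["month"] == month
    ]

print(dice_articles(sales, {"Laptop", "CellP"}))
print(dice_article_month(sales, "Laptop", "December"))
```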

Pivot

[Figure: data cube with the axes Product (Juice, Cola, Milk, Cream), Date (3/1, 3/2, 3/3, 3/4) and Region; sample cell values 10, 47, 30, 12]

A pivot is a two-dimensional layout of the summary data

The x and y axes are the dimensions, and the intersection cell for any two dimension values contains the value of the measure

Pivot
•  Pivot (rotate): re-arranging data for viewing purposes
    –  The simplest view of pivoting is that it selects two dimensions to aggregate the measure
        •  The aggregated values are often displayed in a grid where each point in the (x, y) coordinate system corresponds to an aggregated value of the measure
        •  The x and y coordinate values are the values of the selected two dimensions
    –  The result of pivoting is also called cross-tabulation
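
A cross-tabulation can be sketched in plain Python by choosing two dimensions and summing the measure over everything else. The fact rows and the `pivot` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical fact rows: (city, day, product, amount).
sales = [
    ("Karachi", "day 1", "p1", 12),
    ("Karachi", "day 1", "p2", 11),
    ("Karachi", "day 2", "p1", 44),
    ("Jamshoro", "day 1", "p1", 50),
]

def pivot(rows):
    """Cross-tabulate: rows are cities, columns are days, cells hold summed amounts."""
    table = defaultdict(lambda: defaultdict(int))
    for city, day, _product, amount in rows:
        table[city][day] += amount
    return {city: dict(days) for city, days in table.items()}

crosstab = pivot(sales)
for city, days in crosstab.items():
    print(city, days)
```

The product dimension does not appear in the grid; its values are aggregated into each (city, day) cell.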

Pivot
•  Consider pivoting the following data

Pivot
•  Pivoting on City and Day

OLAP query languages
•  Getting from OLAP operations to the data
    –  As in the relational model, through queries
        •  In OLTP one has SQL as the standard query language
    –  However, OLAP operations are hard to express in SQL
    –  There is no standard query language for OLAP
    –  Choices are:
        •  SQL-99 for ROLAP
            –  GROUPING SETS, ROLLUP and CUBE operators
        •  MDX (Multidimensional Expressions) for both MOLAP and ROLAP
            –  Similar to SQL; used especially in MOLAP solutions, while in ROLAP it is mapped to SQL
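
To give a feel for what a CUBE operator computes, the following sketch emulates it in plain Python: it sums the measure for every subset of the dimensions. This is an illustration only, not how a ROLAP engine evaluates the operator. The data and names are hypothetical, and the string "ALL" marks a dimension that has been aggregated away (SQL reports such columns as NULL).

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact rows: (product, city, amount).
DIMENSIONS = ("product", "city")
sales = [("p1", "Karachi", 12), ("p1", "Jamshoro", 50), ("p2", "Karachi", 11)]

def cube(rows):
    """Sum the measure for every subset of the dimensions, as a CUBE operator would.

    A dimension that has been aggregated away is reported as "ALL".
    """
    result = defaultdict(int)
    indexes = range(len(DIMENSIONS))
    for size in range(len(DIMENSIONS) + 1):
        for kept in combinations(indexes, size):
            for row in rows:
                key = tuple(row[i] if i in kept else "ALL" for i in indexes)
                result[key] += row[-1]
    return dict(result)

totals = cube(sales)
print(totals[("ALL", "ALL")])
print(totals[("p1", "ALL")])
print(totals[("ALL", "Karachi")])
```

With n dimensions there are 2^n groupings, which is one reason such queries are awkward to write by hand as separate GROUP BY statements.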