Mar 07, 2020
Naeem Ahmed
Department of Software Engineering, Mehran University of Engineering and Technology, Jamshoro
Email: [email protected]
Data Mining Concepts & Techniques Lecture No. 01
Databases, Data warehouse
Database
• A database is a large, integrated collection of data
• A database management system (DBMS) is a software system designed to store, manage and facilitate access to the database
• A schema is a description of a database
What is a Data Warehouse?
• Basically a very large database…
  – Not all very large databases are data warehouses, but all data warehouses are pretty large databases
  – Nowadays a warehouse is considered to start at around 800 GB and to go up to several TB
  – It spans several servers and needs an impressive amount of computing power
What is a Data Warehouse?
• More specifically, a collective data repository
  – Containing snapshots of the operational data (history)
  – Obtained through data cleansing and ETL (Extract-Transform-Load)
  – Useful for analytics
What is a Data Warehouse?
• Compared to other solutions it…
  – Is suitable for a tactical/strategic focus
  – Implies a small number of transactions
  – Implies large transactions spanning a long period of time
Some Definitions
• Ralph Kimball: “a copy of transaction data specifically structured for query and analysis”
• Bill Inmon (father of data warehousing, in 1993): A Data Warehouse is a
  – subject oriented
  – integrated
  – non-volatile
  – time-variant
  collection of data in support of management’s decisions
Data Warehouse
• Subject oriented: Data is arranged by subject area rather than by application. Data is organized so that all the data elements relating to the same real-world event or object are linked together
  – Typical subject areas in DWs are Customer, Product, Order, Claim, Account, …
Data Warehouse
• Subject oriented:
  – Example: customer as subject in a DW
    • The DW is organized in this case by the customer
    • It may consist of 10, 100 or more physical tables, all related
Data Warehouse
• Integrated: Data is collected from multiple, diverse sources among an organization's operational systems and is made consistent before it is stored
  – E.g., encoding of gender, units of measurement, conflicting keys, consistency, …
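The integration step can be illustrated with a minimal sketch. Everything here is assumed for illustration: the two source systems (`crm_rows`, `billing_rows`), their field names, and the target conventions (gender as "M"/"F", height in centimetres, integer customer keys).

```python
# Hypothetical rows from two operational systems that disagree on
# key format, gender encoding and unit of measurement.
crm_rows = [{"cust_id": "C-17", "gender": "M", "height_cm": 180.0}]
billing_rows = [{"customer": 17, "sex": "male", "height_in": 70.0}]

# Assumed warehouse convention: a single gender code per value.
GENDER_CODES = {"m": "M", "male": "M", "f": "F", "female": "F"}

def from_crm(row):
    """Map a CRM row to the warehouse convention."""
    return {
        "customer_key": int(row["cust_id"].split("-")[1]),  # "C-17" -> 17
        "gender": GENDER_CODES[row["gender"].lower()],
        "height_cm": round(row["height_cm"], 1),
    }

def from_billing(row):
    """Map a billing row to the warehouse convention."""
    return {
        "customer_key": row["customer"],
        "gender": GENDER_CODES[row["sex"].lower()],
        "height_cm": round(row["height_in"] * 2.54, 1),  # inches -> cm
    }

integrated = [from_crm(r) for r in crm_rows] + [from_billing(r) for r in billing_rows]
```

After this step both rows share the same key, gender code and unit, so they can be stored side by side.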
Data Warehouse
• Non-volatile: Data in the data warehouse is never over-written or deleted. Once committed, the data is static, read-only, and retained for future reporting. Data is loaded, but not updated
  – When subsequent changes occur, a new snapshot record is written.
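A minimal sketch of the "new snapshot instead of update" idea, assuming a hypothetical in-memory `SnapshotStore` class (not a real DW product API) and made-up customer data:

```python
from datetime import date

class SnapshotStore:
    """Append-only store: rows are loaded, never updated or deleted."""

    def __init__(self):
        self._rows = []

    def load(self, key, attributes, as_of):
        # A change in the source produces a NEW snapshot row.
        self._rows.append({"key": key, "as_of": as_of, **attributes})

    def history(self, key):
        """All snapshots for a key, oldest first."""
        matching = (r for r in self._rows if r["key"] == key)
        return sorted(matching, key=lambda r: r["as_of"])

    def current(self, key):
        """Most recent snapshot for a key."""
        return self.history(key)[-1]

store = SnapshotStore()
store.load(17, {"city": "Jamshoro"}, date(2019, 1, 1))
store.load(17, {"city": "Karachi"}, date(2020, 3, 7))  # customer moved
```

Both snapshots remain available through `store.history(17)`, so reports can still show where the customer was at any earlier date.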
Data Warehouse
• Time-variant: The changes to the data in the data warehouse are tracked and recorded so that reports can be produced showing changes over time.
  – Different environments have different time horizons associated
    • While for operational systems a 60-to-90 day time horizon is normal, a data warehouse has a 5-to-10 year horizon
Data Warehouse vs. Operational Database
Data Warehouse | Operational Database
Subject oriented | Application oriented
Integrated | Multiple diverse sources
Non-volatile | Updateable
Time-variant | Real-time, current
OnLine Transaction Processing
• OLTP (OnLine Transaction Processing):
  – Also known as operational data processing, it represents day-to-day operational business activities:
    • Purchasing, sales, production, distribution, …
  – Typically used for data entry and retrieval transactions
  – Reflects only the current state of the data
OnLine Analytical Processing
• OLAP (OnLine Analytical Processing):
  – Represents front-end analytics based on a DW repository
  – It provides information for activities like:
    • Resource planning, capital budgeting, marketing initiatives, ...
  – It is decision oriented
OLTP vs. DW
• Properties
Operational DB | DW
Mostly updates | Mostly reads
Many small transactions | Long, complex queries
MB-TB of data | GB-PB of data
Raw data | Summarized data
Clerical users | Decision makers
Up-to-date data | May be slightly outdated
OLTP vs. DW
Feature | OLTP | Data Warehouse
users | clerk, IT professional | knowledge worker
function | day-to-day operations | decision support
DB design | application-oriented | subject-oriented
data | current, up-to-date, detailed, flat relational, isolated | historical, summarized, multidimensional, integrated, consolidated
usage | repetitive | ad-hoc
access | read/write, index/hash on prim. key | lots of scans
unit of work | short, simple transaction | complex query
# records accessed | tens | millions
# users | thousands | hundreds
DB size | 100MB-GB | 100GB-TB
metric | transaction throughput | query throughput, response
Applications of DW
• A DW is the base repository for front-end analytics
  – OLAP
  – KDD
  – Data visualization
  – Reporting
KDD (Knowledge Discovery in Databases): a data mining process
Lifecycle of DW
• Classical SDLC vs. DW SDLC
• DW SDLC is almost the opposite of classical SDLC
• Because it is the opposite of SDLC, DW SDLC is also called CLDS (SDLC spelled backwards)
Architecture of DW
• Basic Architecture
Data Warehouse Architecture
Data Mart
• A data mart is a special purpose subset of enterprise data for a particular function or application (it may contain detail or summary data or both).
• Data Mart types:
  – Independent: created directly from operational systems to a separate physical data store
  – Logical: exists as a subset of an existing data warehouse
  – Dependent: created from the data warehouse to a separate physical data store
Phases
Data Modeling
• Conceptual Design
  – Transforms data requirements to a conceptual model
  – The conceptual model describes data entities, relationships, constraints, etc. at a high level
    • Does not contain any implementation details
    • Independent of the software and hardware used
• Logical Design
  – Maps the conceptual data model to the logical data model used by the DBMS
    • E.g., relational model, dimensional model, ...
    • The technology-independent conceptual model is adapted to the DBMS software used
• Physical Design
  – Creates internal structures needed to efficiently store/manage data
    • Table spaces, indexes, access paths, ...
    • Depends on the hardware and DBMS software used
DW Modeling
• Data Modeling
  – Conceptual Modeling:
    • Multidimensional Entity Relationship (ME/R) Model
    • Multidimensional UML (mUML)
  – Logical Modeling:
    • Cubes, Dimensions, Hierarchies
  – Physical Modeling:
    • Star, Snowflake, Array storage
DW Modeling
• Components
  – Facts: a fact is a focus of interest for decision-making, e.g., sales, shipments, ...
  – Measures: attributes that describe facts from different points of view, e.g., each sale is measured by its revenue
  – Dimensions: discrete attributes which determine the granularity adopted to represent facts, e.g., product, store, date
  – Hierarchies: are made up of dimension attributes
    • Determine how facts may be aggregated and selected, e.g., day - month - quarter - year
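How these components fit together can be sketched in a few lines. The fact rows, their values and the helper names (`time_level`, `revenue_by`) are assumptions made for illustration: each sale is a fact, `revenue` is its measure, `product`, `store` and `date` are its dimensions, and the date hierarchy day - month - quarter - year decides how facts are aggregated.

```python
from collections import defaultdict
from datetime import date

# Hypothetical sales facts: three dimensions plus one measure (revenue).
sales = [
    {"product": "Laptop", "store": "S1", "date": date(2020, 1, 15), "revenue": 900.0},
    {"product": "Laptop", "store": "S2", "date": date(2020, 2, 3), "revenue": 950.0},
    {"product": "Phone", "store": "S1", "date": date(2020, 4, 20), "revenue": 400.0},
]

def time_level(d, level):
    """Map a date to a member of the given level of the time hierarchy."""
    if level == "day":
        return d.isoformat()
    if level == "month":
        return f"{d.year}-{d.month:02d}"
    if level == "quarter":
        return f"{d.year}-Q{(d.month - 1) // 3 + 1}"
    if level == "year":
        return str(d.year)
    raise ValueError(f"unknown level: {level}")

def revenue_by(level):
    """Aggregate the revenue measure at one level of the time hierarchy."""
    totals = defaultdict(float)
    for row in sales:
        totals[time_level(row["date"], level)] += row["revenue"]
    return dict(totals)
```

Calling `revenue_by("quarter")` and then `revenue_by("year")` aggregates the same facts at coarser and coarser granularity.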
OLAP
• A decision support system (DSS) that supports ad-hoc querying, i.e., enables managers and analysts to interactively manipulate data.
• Analysis of information in a database for the purpose of making management decisions
• The idea is to allow users to easily and quickly manipulate and visualize the data through multidimensional views (i.e., different perspectives)
• OLAP (OnLine Analytical Processing) analyzes historical data (terabytes) using complex queries
OLAP
• OLAP Council definition:
– A category of software technology that enables analysts, managers and executives to gain insight into data through fast, consistent, interactive access to a wide variety of possible views of information that has been transformed from raw data to reflect the real dimensionality of the enterprise as understood by the user
• OLAP is implemented in a multi-user client/server mode and offers consistently rapid response to queries, regardless of database size and complexity.
OLAP
• OLAP primarily involves aggregating large amounts of diverse data
• OLAP functionality provides dynamic multi-dimensional analysis, supporting analytical and navigational activities
• OLAP functionality is provided by the OLAP Server
• The OLAP Council defines an OLAP Server as:
  – ‘A high capacity, multi-user data manipulation engine specifically designed to support and operate on multi-dimensional data structures.’
OLTP vs. OLAP
OLTP | OLAP
Operational processing | Informational processing
Transaction-oriented | Analysis-oriented
For operational staff | For managers, executives & analysts
Daily operations | Decision support
Current, up-to-date data | Historical data
Primitive, highly detailed data | Summarized, consolidated data
Detailed, flat relational views | Summarized, multi-dimensional views
Short, simple transactions | Complex aggregate queries
Read/write | Mostly read only
Index on keys | Many scans
Many users | Small number of users
Large databases | Very large databases
OLTP vs. OLAP
• On-Line Transaction Processing
  – Transfer $100 from my savings account to my checking account
• On-Line Analytical Processing
  – What is the average balance of accounts by customer groups, account types, areas, account managers, and their combinations?
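The contrast between the two requests can be sketched with Python's built-in `sqlite3` module. The `account` table, its columns and its rows are hypothetical; only two of the grouping attributes from the slide are modeled.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account ("
    "id INTEGER PRIMARY KEY, owner TEXT, type TEXT, "
    "customer_group TEXT, balance REAL)"
)
conn.executemany(
    "INSERT INTO account VALUES (?, ?, ?, ?, ?)",
    [
        (1, "me", "savings", "retail", 500.0),
        (2, "me", "checking", "retail", 100.0),
        (3, "acme", "checking", "corporate", 9000.0),
    ],
)

def transfer(conn, src, dst, amount):
    """OLTP-style work: a short transaction touching two rows by key."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute("UPDATE account SET balance = balance - ? WHERE id = ?", (amount, src))
        conn.execute("UPDATE account SET balance = balance + ? WHERE id = ?", (amount, dst))

transfer(conn, 1, 2, 100.0)

# OLAP-style work: a read-only aggregate that scans the whole table.
avg_by_group = conn.execute(
    "SELECT customer_group, type, AVG(balance) FROM account "
    "GROUP BY customer_group, type ORDER BY customer_group, type"
).fetchall()
```

The transfer reads and writes two rows located by primary key, while the analytical query reads every row and writes nothing.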
DW Queries
• DW queries are big queries
  – Involve a large portion of the data
  – Read-only queries
    • No updates
  – Redundancy is a necessity
    • Materialized views, special-purpose indexes, de-normalized schemas
  – Data is refreshed periodically
    • E.g., daily or weekly
  – Their purpose is to analyze data
    • OLAP (OnLine Analytical Processing)
OLAP operations
• Typical OLAP operations
  – Roll-up
  – Drill-down
  – Slice and dice
  – Pivot (rotate)
• Other operations
  – Aggregate functions
  – Ranking and comparing
  – Drill-across
  – Drill-through
Roll-up
• Roll-up (drill-up)
  – Taking the current aggregation level of fact values and doing a further aggregation
  – Summarize data by
    • Climbing up a hierarchy (hierarchical roll-up)
    • Dimensional reduction
    • Or a mix of these two techniques
  – Used for obtaining an increased generalization
    • E.g., from Time.Week to Time.Year
Roll-up
• Hierarchical roll-ups
  – Performed on the fact table and some dimension tables by climbing up the attribute hierarchies
    • E.g., climbing the Time hierarchy to Quarter and the Article hierarchy to Prod. group
Roll-up
• Dimensional roll-ups
  – Are done solely on the fact table by dropping one or more dimensions
    • E.g., drop the Client dimension
Roll-up
• Climbing above the top in hierarchical roll-up
  – In the extreme case, a hierarchical roll-up above the top level of an attribute hierarchy (attribute “ALL”) can be viewed as converting to a dimensional roll-up
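The roll-up variants above can be sketched as one small aggregation helper. The fact rows and the `aggregate` function are hypothetical, and the week-to-year and article-to-group mappings are stored directly on each row to keep the sketch short.

```python
from collections import defaultdict

# Hypothetical facts at the finest grain: week x article x client.
facts = [
    {"week": "2020-W01", "year": 2020, "article": "Laptop", "group": "Computers", "client": "A", "amount": 10},
    {"week": "2020-W02", "year": 2020, "article": "Laptop", "group": "Computers", "client": "A", "amount": 5},
    {"week": "2020-W02", "year": 2020, "article": "Laptop", "group": "Computers", "client": "B", "amount": 3},
    {"week": "2020-W02", "year": 2020, "article": "CellP", "group": "Phones", "client": "B", "amount": 7},
]

def aggregate(rows, keys):
    """Sum the amount measure, grouping by the given attributes."""
    totals = defaultdict(int)
    for r in rows:
        totals[tuple(r[k] for k in keys)] += r["amount"]
    return dict(totals)

# Hierarchical roll-up: Week -> Year and Article -> Prod. group, Client kept.
hierarchical = aggregate(facts, ["year", "group", "client"])

# Dimensional roll-up: drop the Client dimension entirely.
dimensional = aggregate(facts, ["week", "article"])

# Rolling every dimension up to "ALL" leaves a single grand total.
grand_total = aggregate(facts, [])
```

Drill-down is the reverse direction: going from `hierarchical` back to the weekly figures is only possible because the finer-grained `facts` are still stored.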
Drill-down
• Drill-down (roll-down)
  – Reverse of roll-up
  – Represents a de-aggregation operation
    • From a higher level of summary to a lower level of summary (detailed data)
  – Introducing new dimensions
  – Requires the existence of materialized finer-grained data
    • One cannot drill down without having the finer-grained data
Roll-up drill-down example
Slice
• Slice: a subset of the multi-dimensional array corresponding to a single value for one or more dimensions, projected onto the remaining dimensions
  – E.g., project on Geo (store) and Time from values corresponding to Laptops in the product dimension:
    $\pi_{StoreId,\, TimeId,\, Amount}(\sigma_{ArticleId = LaptopId}(Sales))$
Slice
• Amounts to an equality select condition
• WHERE clause in SQL
  – E.g., slice Laptops
Slice
• Slicing means taking a slice out of a cube, given a fixed value for a selected set of dimensions
  – E.g., sales where city = ‘Karachi’ and date = ‘20/1/2014’
[Figure: sales cube over products (p1, p2), stores (s1, s2, s3) and days (day 1, day 2); slicing on TIME = day 1 yields the day 1 product-by-store table]
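A slice can be sketched as a filter on one dimension followed by a projection onto the others. The cube contents and the `slice_cube` helper below are hypothetical and do not reproduce the figure's exact cells.

```python
# Hypothetical sales cube stored as {(article, store, time): amount}.
sales = {
    ("Laptop", "s1", "day1"): 12,
    ("Laptop", "s3", "day1"): 50,
    ("CellP", "s1", "day1"): 11,
    ("Laptop", "s1", "day2"): 44,
}

def slice_cube(cube, article):
    """Equality select on the article dimension, projected onto (store, time)."""
    return {
        (store, time): amount
        for (a, store, time), amount in cube.items()
        if a == article
    }

laptops = slice_cube(sales, "Laptop")
```

The result has one dimension fewer than the cube: the article dimension is fixed to a single value and disappears from the keys.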
Dice
• Dice: amounts to a range select condition on one dimension, or to an equality select condition on more than one dimension
  – E.g., range SELECT:
    $\pi_{StoreId,\, TimeId,\, Amount}(\sigma_{ArticleId \in \{Laptop,\, CellP\}}(Sales))$
Dice
• E.g., equality SELECT on 2 dimensions, Product and Time:
  $\pi_{StoreId,\, Amount}(\sigma_{ArticleId = Laptop \,\wedge\, MonthID = December}(Sales))$
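Both kinds of dice can be sketched with one filter function. The rows and the `dice` helper are hypothetical; a set with several members plays the role of the range condition, and single-member sets on two dimensions play the role of the equality conditions.

```python
# Hypothetical fact rows with article, store and month dimensions.
sales = [
    {"article": "Laptop", "store": "s1", "month": "December", "amount": 12},
    {"article": "CellP", "store": "s2", "month": "December", "amount": 8},
    {"article": "Tablet", "store": "s1", "month": "November", "amount": 4},
    {"article": "Laptop", "store": "s2", "month": "November", "amount": 50},
]

def dice(rows, **allowed):
    """Keep rows whose value on each named dimension is in the allowed set."""
    return [
        r for r in rows
        if all(r[dim] in values for dim, values in allowed.items())
    ]

# Range condition on one dimension: article in {Laptop, CellP}.
range_dice = dice(sales, article={"Laptop", "CellP"})

# Equality conditions on two dimensions: article = Laptop and month = December.
equality_dice = dice(sales, article={"Laptop"}, month={"December"})
```

Unlike a slice, a dice keeps all dimensions in the result and simply returns a smaller sub-cube.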
Pivot
[Figure: data cube with axes Product (Juice, Cola, Milk, Cream), Date (3/1, 3/2, 3/3, 3/4) and Region; visible cell values: 10, 47, 30, 12]
A pivot is a two-dimensional layout of the summary data
The x and y axes are the dimensions, and the intersection cell for any two dimension values contains the value of the measure
Pivot
• Pivot (rotate): re-arranging data for viewing purposes
  – The simplest view of pivoting is that it selects two dimensions over which to aggregate the measure
    • The aggregated values are often displayed in a grid where each point in the (x, y) coordinate system corresponds to an aggregated value of the measure
    • The x and y coordinate values are the values of the selected two dimensions
  – The result of pivoting is also called cross-tabulation
Pivot
• Consider pivoting the following data
Pivot
• Pivoting on City and Day
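Since the slide's data table is not reproduced here, the following sketch uses made-up rows to show what pivoting on City and Day amounts to. The `pivot` helper and all values are assumptions.

```python
from collections import defaultdict

# Hypothetical rows; product is the dimension that gets aggregated away.
rows = [
    {"city": "Karachi", "day": "Mon", "product": "Juice", "sales": 10},
    {"city": "Karachi", "day": "Mon", "product": "Cola", "sales": 5},
    {"city": "Karachi", "day": "Tue", "product": "Juice", "sales": 7},
    {"city": "Jamshoro", "day": "Mon", "product": "Milk", "sales": 3},
]

def pivot(rows, row_dim, col_dim, measure):
    """Cross-tabulate: sum the measure for each (row_dim, col_dim) pair."""
    table = defaultdict(lambda: defaultdict(int))
    for r in rows:
        table[r[row_dim]][r[col_dim]] += r[measure]
    return {key: dict(cols) for key, cols in table.items()}

by_city_day = pivot(rows, "city", "day", "sales")
```

Each outer key is a grid row (a city), each inner key is a grid column (a day), and each cell holds the aggregated measure.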
OLAP query languages
• Getting from OLAP operations to the data
  – As in the relational model, through queries
    • In OLTP one has SQL as the standard query language
  – However, OLAP operations are hard to express in SQL
  – There is no standard query language for OLAP
  – Choices are:
    • SQL-99 for ROLAP
      – Grouping Set, Roll-up, Cube operators
    • MDX (Multidimensional Expressions) for both MOLAP and ROLAP
      – Similar to SQL; used especially in MOLAP solutions, in ROLAP it is mapped to SQL
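What the SQL-99 Roll-up and Cube grouping operators compute can be sketched in plain Python. This is an emulation for illustration, not SQL or MDX syntax; the rows and helper names are hypothetical, and the string "ALL" stands in for the NULL placeholder that SQL uses in aggregated-away columns.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact rows with two dimensions and one measure.
rows = [
    {"store": "s1", "article": "Laptop", "amount": 12},
    {"store": "s1", "article": "CellP", "amount": 11},
    {"store": "s2", "article": "Laptop", "amount": 50},
]

def group_by(rows, dims, all_dims):
    """One GROUP BY: dimensions not in dims are replaced by 'ALL'."""
    totals = defaultdict(int)
    for r in rows:
        key = tuple(r[d] if d in dims else "ALL" for d in all_dims)
        totals[key] += r["amount"]
    return dict(totals)

def cube(rows, all_dims):
    """Cube: union of GROUP BY results over every subset of the dimensions."""
    result = {}
    for k in range(len(all_dims) + 1):
        for dims in combinations(all_dims, k):
            result.update(group_by(rows, dims, all_dims))
    return result

def rollup(rows, all_dims):
    """Roll-up: GROUP BY over each prefix of the dimension list only."""
    result = {}
    for k in range(len(all_dims), -1, -1):
        result.update(group_by(rows, all_dims[:k], all_dims))
    return result

cube_result = cube(rows, ["store", "article"])
rollup_result = rollup(rows, ["store", "article"])
```

With two dimensions, the cube yields four groupings (both, store only, article only, grand total) while the roll-up yields three, since it omits the article-only grouping.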