Data Warehousing and Data Mining Lecture 1 Introduction Wei Liu School of Computer Science and Software Engineering Faculty of Engineering, Computing and Mathematics CITS3401 CITS5504 Acknowledgement: The Lecture Slides are adapted from the original slides from Han’s textbook.
• Two projects: 20% each
– An analysis of a business scenario through an OLAP tool.
• We will be using JEDOX, an Excel plug-in, for the Data Warehousing project.
– http://www.jedox.com/en/services/downloads
– An analysis of a data mining and exploration problem using WEKA.
• Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code.
• http://www.cs.waikato.ac.nz/ml/weka/
• Mid-semester Test: 10% – at the lecture venue after the study break
• Final Examination: 50%
• Project Specifications and Instructions will be available on the course website.
– Data mining, data warehousing, multimedia databases, and Web databases
• 2000s
– Stream data management and mining
– Data mining and its applications
– Web technology (XML, data integration) and global information systems
Why Data Mining
Summary:
– Data is abundant, yet data archives are seldom visited.
– The volume of data far exceeds human ability for comprehension.
– Intuitive decision making is prone to biases and errors, and is extremely time-consuming and costly.
– Data mining tools perform data analysis and uncover important
data patterns, contributing greatly to business strategies,
knowledge bases, and scientific and medical research.
Data Tombs
Nuggets of knowledge
What is Data Mining?
• Data mining (knowledge discovery from data)
– Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data
– Data mining: a misnomer? (Knowledge mining from data)
• Alternative names
– Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
• Watch out: Is everything “data mining”? These are not:
– Simple search and query processing
– (Deductive) expert systems
What is Data Mining?
• Tremendous amount of data (terabytes to petabytes)
• High dimensionality and high complexity of data
– Structured, unstructured, heterogeneous data
• Scalable
• Data mining involves integration of multiple disciplines:
– Machine learning
– Pattern recognition
– Statistics
– Databases
– Business Intelligence
– Big data
• Efficient: Derived knowledge is new, interesting, informative and can be used for sophisticated applications (decision making, process control, information management, …)
Data Mining: Confluence of Multiple Disciplines
[Figure: Data Mining at the center, surrounded by Database Technology, Statistics, Machine Learning, Pattern Recognition, Algorithm, Visualization, and Other Disciplines]
Steps of Knowledge Discovery (KDD) Process
• This is a view from typical database systems and data warehousing communities
• Data mining plays an essential role in the knowledge discovery process
[Figure: KDD process flow: Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
Data Warehousing and Mining Framework
KDD Process: Several Key Steps
• Learning the application domain
– relevant prior knowledge and goals of application
• Creating a target data set: data selection
• Data cleaning and preprocessing: (may take 60% of effort!)
– visualization, transformation, removing redundant patterns, etc.
• Use of discovered knowledge
Multi-Dimensional View of Data Mining
• Data to be mined
– Database data (extended-relational, object-oriented, heterogeneous, legacy), data warehouse, transactional data, stream, spatiotemporal, time-series, sequence, text and web, multi-media, graphs & social and information networks
• Knowledge to be mined (or: Data mining functions)
– Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.
– Descriptive vs. predictive data mining
– Multiple/integrated functions and mining at multiple levels
• Techniques utilized (methodologies)
– Data-intensive, data warehouse (OLAP), machine learning, statistics, pattern recognition, visualization, high-performance, etc.
• Applications adapted
– Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc.
Data Mining: On What Kinds of Data?
• Structured and semi-structured data
– Relational database / object-relational data
– Data warehouse
– Transactional database
• Unstructured data
– Data streams and sensor data
– Text data and web data
– Time-series data, temporal data, sequence data (incl. bio-sequences)
– Graphs, social networks and information networks
– Spatial data, spatiotemporal data and multimedia data
Relational Database
• A relational database is a collection of tables, each of which is assigned a unique name.
• Each table consists of a set of attributes (columns or fields) and usually stores a large set of tuples (records or rows).
• Each tuple in a relational table represents an object identified by a unique key and described by a set of attribute values.
• A semantic data model, such as the entity relationship data model, is often constructed for relational databases.
• An ER data model represents the database as a set of entities and their relationships.
Relational Database
• Relational data can be accessed by database queries
written in a relational language such as SQL.
• A given query is transformed into a set of relational
operations such as join, selection and projection,
and is then optimized for efficient processing.
• Efficiency of retrieval, efficiency of update and
integrity are the key requirements of a good
relational database.
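As a rough illustration of the relational operations named above, the sketch below implements selection, projection and join over two small in-memory tables. The table contents and the helper names (`select`, `project`, `join`) are invented for this example; a real DBMS would optimize the query plan rather than run these steps naively.

```python
# Toy relations as lists of dicts; all rows are invented for illustration.
customer = [
    {"cust_id": 1, "name": "Smith", "city": "Perth"},
    {"cust_id": 2, "name": "Jones", "city": "Sydney"},
]
purchase = [
    {"cust_id": 1, "item": "laptop"},
    {"cust_id": 1, "item": "printer"},
    {"cust_id": 2, "item": "camera"},
]


def select(rows, predicate):
    """Selection: keep the tuples that satisfy the predicate."""
    return [row for row in rows if predicate(row)]


def project(rows, attributes):
    """Projection: keep only the named attributes of each tuple."""
    return [{a: row[a] for a in attributes} for row in rows]


def join(left, right, key):
    """Equi-join on a shared attribute, using a naive nested loop."""
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]


# "Which items did customers from Perth buy?"
perth_items = project(
    select(join(customer, purchase, "cust_id"), lambda r: r["city"] == "Perth"),
    ["name", "item"],
)
print(perth_items)
```

The nested-loop join is the simplest possible strategy; the optimization step mentioned in the slide is where a real system would pick a cheaper plan, for example by applying the selection before the join.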
An Example - AllElectronics
• Four relational tables: customer, item, employee and
branch.
• Each relation consists of a set of attributes.
Example of Queries
• Show me a list of all items that were sold in the last quarter
• Show me the total sales of the last month, grouped by branch
• Which salesperson has the highest amount of sales?
• How many sales transactions occurred in the month of September?
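Queries like these map directly onto SQL. The sketch below uses Python's built-in `sqlite3` module with a single hypothetical `sales` table; the table layout, column names and rows are assumptions made for the example and are not the actual AllElectronics schema.

```python
import sqlite3

# In-memory database with one hypothetical table; all rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (
        trans_id INTEGER, item TEXT, branch TEXT,
        salesperson TEXT, sale_date TEXT, amount REAL
    );
    INSERT INTO sales VALUES
        (1, 'laptop',  'Perth',  'Ann', '2023-09-03', 1200.0),
        (2, 'printer', 'Perth',  'Bob', '2023-09-15',  300.0),
        (3, 'camera',  'Sydney', 'Ann', '2023-09-20',  650.0),
        (4, 'laptop',  'Sydney', 'Cat', '2023-08-11', 1100.0);
""")

# Total sales for one month (here September 2023), grouped by branch.
by_branch = conn.execute("""
    SELECT branch, SUM(amount) FROM sales
    WHERE sale_date BETWEEN '2023-09-01' AND '2023-09-30'
    GROUP BY branch ORDER BY branch
""").fetchall()

# Salesperson with the highest total amount of sales.
top = conn.execute("""
    SELECT salesperson, SUM(amount) AS total FROM sales
    GROUP BY salesperson ORDER BY total DESC LIMIT 1
""").fetchone()

# Number of sales transactions in the month of September.
september_count = conn.execute(
    "SELECT COUNT(*) FROM sales WHERE strftime('%m', sale_date) = '09'"
).fetchone()[0]

print(by_branch, top, september_count)
conn.close()
```

Each query only reports what is stored; none of them says why September sales differ between branches, which is the gap that warehousing and mining aim to fill.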
Purpose of relational databases
• The main purpose of a relational database is to store
data correctly and retrieve data on demand.
• This type of data processing is sometimes called
Online Transaction Processing (OLTP).
• Relational databases are passive data repositories in
the sense that a query only shows you what is
stored in the database, but cannot tell you much
about the meaning or trend of the data.
Data Warehouse of AllElectronics
• A data warehouse is a repository of information collected from multiple sources, stored under a unified schema, and usually residing at a single site.
• The need is to provide an analysis of the company’s sales per item type per branch for a specified period.
Data Warehouse
• The data warehouse may store a summary of the transactions per item type for each store or, summarized to a higher level, for each sales region.
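The two summary levels can be sketched with a plain aggregation. The rows, store names and the `summarize` helper below are hypothetical; an OLAP tool would perform the same roll-up from store to region over a data cube.

```python
from collections import defaultdict

# (store, region, item_type, amount); all rows are invented for illustration.
transactions = [
    ("Perth CBD", "West", "computer", 1200.0),
    ("Perth CBD", "West", "printer", 300.0),
    ("Fremantle", "West", "computer", 900.0),
    ("Sydney CBD", "East", "computer", 1500.0),
    ("Sydney CBD", "East", "computer", 500.0),
]


def summarize(rows, level):
    """Total sales per (level value, item type); level is 'store' or 'region'."""
    index = {"store": 0, "region": 1}[level]
    totals = defaultdict(float)
    for row in rows:
        totals[(row[index], row[2])] += row[3]
    return dict(totals)


per_store = summarize(transactions, "store")    # lower-level summary
per_region = summarize(transactions, "region")  # rolled up to sales region
print(per_store)
print(per_region)
```

Storing the summaries rather than recomputing them from raw transactions is what makes analytical queries over long periods cheap.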
Transactional Database
• A transactional database consists of a file where each record represents a transaction.
• Supports nested relations
• Transaction id: Items, Customer name, date…
• Sample Queries:
– Show me all the items purchased by ‘X’
– How many transactions include item number ‘Y’?
– Market basket data analysis: Which items sold well together? (frequent itemsets)
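A minimal sketch of the sample queries over a transactional database, assuming each record maps a transaction id to its set of items. The transaction ids, items and the `min_count` threshold are invented for the example, and counting every pair by brute force only works for tiny data.

```python
from collections import Counter
from itertools import combinations

# Transaction id -> set of purchased items; all data is invented.
transactions = {
    "T100": {"bread", "milk", "butter"},
    "T200": {"bread", "milk"},
    "T300": {"milk", "eggs"},
    "T400": {"bread", "butter"},
}

# "How many transactions include item 'milk'?"
with_milk = sorted(tid for tid, items in transactions.items() if "milk" in items)

# Market basket analysis: count how often each pair of items occurs together.
pair_counts = Counter()
for items in transactions.values():
    pair_counts.update(combinations(sorted(items), 2))

# Keep the pairs that occur in at least min_count transactions.
min_count = 2
frequent_pairs = {pair: n for pair, n in pair_counts.items() if n >= min_count}
print(with_milk, frequent_pairs)
```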
Knowledge View: What Knowledge to be Mined?
• Data summary in multidimensional space
– Data cube and OLAP (On-Line Analytical Processing)
• Pattern discovery
– Mining frequent patterns, association and correlation
– Applying pattern mining in many other tasks
• Classification and predictive modelling
– Model construction based on some training examples
– Prediction of new data based on constructed models
• Cluster analysis: How to group data to form new categories?
• Outlier analysis: Discovery of anomalies and rare events
• Trend and evolution analysis
Data Mining Function: (1) Characterization and Discrimination
• Data can be associated with classes or concepts (e.g., classes of items: computers, printers; concepts of customers: bigSpender, budgetSpender, …).
• Multidimensional concept description:
– Characterization: summarizing the class in general (e.g., a general specification of products whose sales increased by 10%, …; a profile of customers who spend more than $1000 a year).
– Discrimination: comparison of the target class with a contrast class (e.g., compare two groups of customers, those who shop for computer products regularly versus those who rarely shop for such products, drilling down on dimensions such as occupation, age, etc.).
Data Mining Function: (2) Association and Correlation Analysis
• Frequent patterns (or frequent itemsets)
– What items are frequently purchased together?
• Association, correlation vs. causality
– A typical association rule
• Milk → Bread [0.5%, 75%] (support, confidence)
– Are strongly associated items also strongly correlated?
• How to mine such patterns and/or rules efficiently in
large datasets? (single- or multi-dimensional
association, minimum support threshold)
• How to use such patterns for classification, clustering,
and other applications?
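To make the two numbers attached to an association rule concrete, the sketch below computes support and confidence for Milk → Bread over a handful of invented baskets. The data is chosen so that the confidence comes out at 75%; the support differs from the slide's 0.5% because the toy dataset is tiny. The function names are assumptions for this example.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that also
    contain the consequent: support(A and B) / support(A)."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Eight invented market baskets.
baskets = [
    {"milk", "bread"},
    {"milk", "bread", "eggs"},
    {"milk", "bread", "butter"},
    {"milk", "eggs"},
    {"bread"},
    {"eggs"},
    {"butter"},
    {"eggs", "butter"},
]

rule_support = support(baskets, {"milk", "bread"})        # 3 of 8 baskets
rule_confidence = confidence(baskets, {"milk"}, {"bread"})  # 3 of 4 milk baskets
print(f"Milk -> Bread [{rule_support:.1%}, {rule_confidence:.0%}]")
```

A minimum support threshold would simply discard any rule whose `rule_support` falls below the chosen cut-off before confidence is even computed.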
Data Mining Function: (3) Classification
• Classification and label prediction
– Construct models (functions) based on some training examples or rules (e.g., predict the kind of response (good, mild, no) to a sales campaign from price, brand, category, place_made, …)
– Describe and distinguish classes or concepts for future prediction
• E.g., classify countries based on (climate), or classify cars based on (gas mileage)
• Invisible data mining: web search, stock market analysis
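Returning to classification: the two phases, model construction from training examples and prediction for new data, can be sketched with a one-attribute threshold model for the "classify cars based on gas mileage" example. The mileage values, the labels and the function names are invented, and a single cut point is a deliberately simple stand-in for real classifiers such as those in Weka.

```python
# Training examples: (gas mileage in mpg, class label); values are invented.
training = [
    (15.0, "low"), (18.0, "low"), (22.0, "low"),
    (30.0, "high"), (35.0, "high"), (41.0, "high"),
]


def fit_threshold(examples):
    """Model construction: choose the cut point with the fewest training errors."""
    values = sorted(v for v, _ in examples)
    # Candidate cut points are the midpoints between neighbouring values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def errors(cut):
        return sum((v >= cut) != (label == "high") for v, label in examples)

    return min(candidates, key=errors)


def predict(cut, mileage):
    """Prediction: label a new car using the learned cut point."""
    return "high" if mileage >= cut else "low"


cut = fit_threshold(training)
print(cut, predict(cut, 28.0), predict(cut, 19.5))
```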
Classification of Data Mining Systems
• According to the kinds of database mined:
– relational, transactional, …, spatial, text, stream data, …, or World Wide Web
• According to the kinds of knowledge mined:
– Based on mining functionalities (e.g., characterization, discrimination, association, …); can be multiple and/or integrated data mining; can be distinguished based on granularity; regular or irregular pattern (outlier) mining
• According to the techniques utilized:
– degree of user interaction involved (autonomous, interactive, query-driven), method of analysis (machine learning, pattern recognition, statistics, neural networks, …), combining merits of individual aspects
• According to the applications adapted:
– Finance, telecommunication, DNA, stock market, …; an all-purpose data mining system may not fit domain-specific mining.
Summary (so far)
• Data mining: Discovering interesting patterns and knowledge
from massive amounts of data
• A natural evolution of science and information technology, in
great demand, with wide applications
• A KDD process includes data cleaning, data integration, data
selection, transformation, data mining, pattern evaluation, and
knowledge presentation
• Mining can be performed on a variety of data
• Data mining functionalities: characterization, discrimination,
association, classification, clustering, trend and outlier
analysis, etc.
• Data mining technologies and applications
Evaluation of Knowledge
• Is all mined knowledge interesting?
– One can mine a tremendous number of “patterns”
– Some may fit only a certain dimension space
• time, location, …
– Some may not be representative, may be transient, …
• Evaluation of mined knowledge → directly mine only interesting knowledge?
– Descriptive vs. predictive
– Coverage
– Typicality vs. novelty
– Accuracy
– Timeliness
– …
Are All the “Discovered” Patterns Interesting?
• Data mining may generate thousands of patterns: Not all of them