Page 1: DMML1_overview.ppt

Data Mining (and machine learning)

DM Lecture 1: Overview of DM, and overview of the DM part of the DM&ML module

Many of these slides are highly derivative of Nick Taylor’s slides used for this module in previous years

Page 2: DMML1_overview.ppt

Overview of My Lectures

All at: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

• 25/9: Overview of DM (and of these 8 lectures)
• 02/10: Data Cleaning - usually a necessary first step for large amounts of data
• 09/10: Basic Statistics for Data Miners - essential knowledge, and very useful
• 16/10: Basket Data/Association Rules (Apriori algorithm) - a classic algorithm, used much in industry
• NO THURSDAY LECTURE OCTOBER 23rd
• 30/10: Cluster Analysis and Clustering - simple algorithms that tell you much about the data
• NO THURSDAY LECTURE NOVEMBER 6th
• 13/11: Similarity and Correlation Measures - making sure you do clustering appropriately for the given data
• 20/11: Regression - the simplest algorithm for predicting data/class values
• 27/11: A Tour of Other Methods and their Essential Details - every important method you may learn about in future

Page 3: DMML1_overview.ppt

Data Mining - Definition & Goal

• Definition
– Data Mining is the exploration and analysis of large quantities of data in order to discover meaningful patterns and rules

• Goal
– To permit some other goal to be achieved or performance to be improved through a better understanding of the data

Page 4: DMML1_overview.ppt

Some examples of huge databases

Retail basket data: much commercial DM is done with this. In one store, 18,000 baskets per month

Tesco has >500 stores. Per year, roughly 100,000,000 baskets?
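The yearly figure can be checked with a quick back-of-envelope calculation. This is only a sketch: it assumes every store sees the same 18,000 baskets per month as the single store quoted above, and it takes ">500 stores" as exactly 500, so the result is a rough lower bound.

```python
# Figures from the slide; the variable names are just for illustration.
baskets_per_store_per_month = 18_000
months_per_year = 12
stores = 500  # the slide says ">500", so this is a lower bound

# Assumption: every store has the same monthly volume as the one quoted.
baskets_per_year = baskets_per_store_per_month * months_per_year * stores
print(f"{baskets_per_year:,}")  # 108,000,000
```

So "100,000,000 baskets" is the right order of magnitude under these assumptions.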

The Internet: more than 15,000,000,000 pages

Lots of datasets: UCI Machine Learning repository

How can we begin to understand and exploit such datasets? Especially the big ones?

Page 5: DMML1_overview.ppt

Like this …

Page 6: DMML1_overview.ppt

and this …

Page 7: DMML1_overview.ppt

and this …

Page 8: DMML1_overview.ppt

or this … (see http://www.cs.umd.edu/hcil/treemap-history/)

Page 9: DMML1_overview.ppt

or this …

• see http://websom.hut.fi/websom/milliondemo/html/root.html

Page 10: DMML1_overview.ppt

Data Mining - Basics

• Data Mining is the process of discovering patterns and inferring associations in raw data

• Data Mining is a collection of powerful techniques intended to analyse large amounts of data

• There is no single Data Mining approach

• Data Mining can employ a range of techniques, either individually or in combination with each other

Page 11: DMML1_overview.ppt

Data Mining – Why is it important?

• Data are being generated in enormous quantities
• Data are being collected over long periods of time
• Data are being kept for long periods of time
• Computing power is formidable and cheap
• A variety of Data Mining software is available

Page 12: DMML1_overview.ppt

Data Mining – History

• The approach has roots going back over 40 years

• In the early 1960s Data Mining was called statistical analysis, and the pioneers were statistical software companies such as SPSS

• By the late 1980s these traditional techniques had been augmented by new methods such as machine induction, artificial neural networks, evolutionary computing, etc.

Page 13: DMML1_overview.ppt
Page 14: DMML1_overview.ppt

Data Mining – Two Major Types

• Directed (Farming) – Attempts to explain or categorise some particular target field such as income, medical disorder, genetic characteristic, etc.

• Undirected (Exploring) – Attempts to find patterns or similarities among groups of records without the use of a particular target field or collection of predefined classes

• Compare with Supervised and Unsupervised systems in machine learning
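A small sketch may help contrast the two types. The records, the field names and the tiny two-means loop are all invented for illustration; the loop stands in for a real clustering algorithm and assumes the values contain at least two distinct numbers.

```python
from statistics import mean

# Hypothetical toy records: (age, has_disorder). Invented for illustration.
records = [(23, False), (25, False), (31, False),
           (58, True), (62, True), (67, False)]

def mean_age_by_target(rows):
    """Directed: a target field is given and we try to explain it."""
    groups = {}
    for age, target in rows:
        groups.setdefault(target, []).append(age)
    return {target: mean(ages) for target, ages in groups.items()}

def two_means(values, iterations=10):
    """Undirected: ignore any target and group similar values together."""
    lo, hi = min(values), max(values)  # initial cluster centres
    for _ in range(iterations):
        low_group = [v for v in values if abs(v - lo) <= abs(v - hi)]
        high_group = [v for v in values if abs(v - lo) > abs(v - hi)]
        lo, hi = mean(low_group), mean(high_group)
    return sorted(low_group), sorted(high_group)

print(mean_age_by_target(records))
print(two_means([age for age, _ in records]))
```

Note that the undirected grouping puts the 67-year-old with the other older records even though that record's target value is False: the groups come from similarity alone, not from the target field.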

Page 15: DMML1_overview.ppt

Data Mining – Tasks

• Classification - Example: high risk for cancer or not
• Estimation - Example: household income
• Prediction - Example: credit card balance transfer average amount
• Affinity Grouping - Example: people who buy X, often also buy Y with a probability of Z
• Clustering - similar to classification but no predefined classes
• Description and Profiling – Identifying characteristics which explain behaviour - Example: “More men watch football on TV than women”
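For the Affinity Grouping task, the "probability of Z" can be estimated as the fraction of baskets containing X that also contain Y. A minimal sketch follows; the baskets and the function name `confidence` are hypothetical, chosen only for illustration.

```python
# Hypothetical baskets, invented for illustration.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "jam"},
]

def confidence(baskets, x, y):
    """Estimate P(basket contains y | basket contains x)."""
    with_x = [b for b in baskets if x in b]
    if not with_x:
        return None  # x is never bought, so the rule cannot be evaluated
    return sum(1 for b in with_x if y in b) / len(with_x)

z = confidence(baskets, "bread", "butter")
print(round(z, 2))  # 0.67
```

Here three baskets contain bread and two of those also contain butter, so people who buy bread also buy butter with an estimated probability of about 0.67.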

Page 16: DMML1_overview.ppt

Data Warehousing

• Note that Data Mining is very generic and can be used for detecting patterns in almost any data
– Retail data
– Genomes
– Climate data
– Etc.

• Data Warehousing, on the other hand, is almost exclusively used to describe the storage of data in the commercial sector

Page 17: DMML1_overview.ppt
Page 18: DMML1_overview.ppt

Data Warehousing - Definitions

“A subject-oriented, integrated, time-variant and nonvolatile collection of data in support of management's decision making process”

W. H. Inmon, "What is a Data Warehouse?" Prism Tech Topic, Vol. 1, No. 1, 1995 -- a very influential definition.

“A copy of transaction data, specifically structured for query and analysis”

Ralph Kimball, from his 2000 book, “The Data Warehouse Toolkit”

Page 19: DMML1_overview.ppt

Data Warehouse – why?

For organisational learning to take place data from many sources must be gathered together over time and organised in a consistent and useful way

Data Warehousing allows an organisation to remember its data and what it has learned about its data

Data Mining techniques make use of the data in a Data Warehouse and subsequently add their results to it

Page 20: DMML1_overview.ppt
Page 21: DMML1_overview.ppt

Data Warehouse - Contents

• A Data Warehouse is a copy of transaction data specifically structured for querying, analysis and reporting

• The data will normally have been transformed when it was copied into the Data Warehouse

• The contents of a Data Warehouse, once acquired, are fixed and cannot be updated or changed later by the transaction system - but they can be added to of course
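The three points above (a copy, transformed on the way in, then fixed but extendable) can be sketched as a toy append-only store. The `Warehouse` class, its method names and the pence-to-pounds transformation are hypothetical and do not represent any real product's API.

```python
from types import MappingProxyType

class Warehouse:
    """Hypothetical append-only store, for illustration only."""

    def __init__(self):
        self._rows = []

    def load(self, transaction):
        # Transform on the way in: here, pence are converted to pounds.
        row = {"store": transaction["store"],
               "amount_gbp": transaction["amount_pence"] / 100}
        # Keep a read-only view so a loaded row cannot be changed later.
        self._rows.append(MappingProxyType(row))

    def rows(self):
        # Hand out a tuple so callers cannot reorder or delete rows.
        return tuple(self._rows)

warehouse = Warehouse()
warehouse.load({"store": "A", "amount_pence": 1250})
warehouse.load({"store": "B", "amount_pence": 300})  # adding is allowed
print(len(warehouse.rows()))
```

Trying to assign to a field of a loaded row raises `TypeError`, which mirrors the idea that the contents can be added to but not updated.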

Page 22: DMML1_overview.ppt

Data Marts

• A Data Mart is a smaller, more focused Data Warehouse – a mini-warehouse

• A Data Mart will normally reflect the business rules of a specific business unit within an enterprise – identifying data relevant to that unit’s activities

Page 23: DMML1_overview.ppt

From Data Warehousing to Machine Learning, via Data Marts

Page 24: DMML1_overview.ppt

The Big Challenge for Data Mining

• The largest challenge that a Data Miner may face is the sheer volume of data in the Data Warehouse

• It is very important, then, that summary data also be available to get the analysis started

• The sheer volume of data may mask the important relationships in which the Data Miner is interested

• Being able to overcome the volume and interpret the data is essential to successful Data Mining
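As a rough illustration of what "summary data" might look like, the sketch below collapses detail rows into one count and total per store. The rows and the function name `summarise` are invented; a real warehouse would hold vastly more rows and would usually compute such summaries in the database.

```python
from collections import defaultdict

# Hypothetical detail rows: (store, amount). Invented for illustration.
sales = [("A", 12.5), ("B", 3.0), ("A", 7.5), ("C", 20.0), ("B", 1.0)]

def summarise(rows):
    """Collapse detail rows into a per-store count and total."""
    summary = defaultdict(lambda: {"count": 0, "total": 0.0})
    for store, amount in rows:
        summary[store]["count"] += 1
        summary[store]["total"] += amount
    return dict(summary)

for store, stats in sorted(summarise(sales).items()):
    print(store, stats["count"], stats["total"])
```

A summary like this is small enough to inspect by eye, which is one way to get the analysis started before turning to the full volume of data.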

Page 25: DMML1_overview.ppt

What happens in practice …

Data Miners, both “farmers” and “explorers”, are expected to utilise Data Warehouses to give guidance and answer a limitless variety of questions

The value of a Data Warehouse and Data Mining lies in a new and changed appreciation of the meaning of the data

There are limitations, though: a Data Warehouse cannot correct problems with its data, although it may help to identify them more clearly

Page 26: DMML1_overview.ppt

Which brings us to “data cleaning”, next week …