Data Mining MTAT.03.183 (4AP = 6EAP)
Introduction

Jaak Vilo

2009 Fall

Lecturer

• 1986‐1991 U Tartu

• 1991‐1999 U Helsinki (sequence pattern discovery)

• 1999‐2002 EMBL‐EBI, UK (bioinformatics)

• 2002‐ EGeen ‐> Quretec (Biobank and Data Management)

• U Tartu, professor (Bioinformatics), 2007

– EXCS – Center of Excellence

– STACC – Software Technologies and Applications Competence Center (Tarkvara TAK)

– research projects

Jaak Vilo and other authors, UT: Data Mining 2009

Students

• >80 registered

• Estonian vs Foreign

• MSc 1st year / 2nd year?

• BSc, PhD?

• Non IT/CS?

• Why this class? Expectations? (ESSCaSS’08,09…)

Course

• http://courses.cs.ut.ee/2009/dm/

• List: [email protected]

• Lectures: 10:15, Liivi 2‐403

• Seminars: 12:15, Liivi 2‐403

• Prof. Jaak Vilo, vilo@ut.ee

• http://www.quicktopic.com/43/H/eWqhydvFpUN

• Other? Skype?

Seminars

• Three types:

1. Homework: presentations/discussions

2. Guest lectures, visitors

3. Practical labs/training (no concrete plans yet)

• Participation is obligatory (>75%)

Grading requirements

• Participation! >75% of seminars

• Homework (30%) (min 50% of assignments)

• Projects/essays (30%)

• Exam (40%)

• Total: 100% + thresholds

• All deadlines are strict.

Homework

• Tasks/assignments

– 5 tasks/week + possibly bonuses

– About every 2 weeks (irregular)

• Report/mark all completed tasks

– written reports on tasks

– ready to present fully to class

– there will be some uploading system

– and/or paper sheets in class

• Deadline always before class start (Thu, 12:15)

4AP = 6EAP

• 4 weeks (4x40h=160h) of intensive work

– assuming basic knowledge of BSc material

• 1/3 in class

• 1/3 reading, homeworks

• 1/3 projects, writing, … 

What is Data Mining?

• Data ‐> Information, Knowledge, Insight

– new, interesting, nontrivial, useful …

• Data size ‐> Algorithmic challenge

• Predictive, useful ‐> theoretical and economic challenge

• Why? By practical demand and need…

Textbooks

• Han, Kamber: Data Mining: Concepts and Techniques, Second Edition (The Morgan Kaufmann Series in Data Management Systems) Google Books, web

• Chakrabarti et al.: Data Mining: Know It All. Morgan Kaufmann, 2008 @Elsevier @Amazon @Google

• Bramer: Principles of Data Mining (Springer, 2007) @Amazon @Springer @Google

• David J. Hand, Heikki Mannila and Padhraic Smyth: Principles of Data Mining (MIT Press, 2001) @MIT Press @Google

• Trevor Hastie, Robert Tibshirani, Jerome Friedman: The Elements of Statistical Learning: Data Mining, Inference, and Prediction (Springer, 2009) @Tibshirani @Amazon

• Han, Kamber: Data Mining: Concepts and Techniques, Second Edition (The Morgan Kaufmann Series in Data Management Systems)

• TOC: http://www.cs.uiuc.edu/homes/hanj/bk2/toc.pdf

What’s it all about?

Data DB

• Statistics

• Patterns in data

• Learning

• Classification

• Knowledge / Information /

• Algorithms

• Prediction

• …

Sources of data (growth)

• devices

• net/web

• logs

• transactional db

• consumer

• multimedia(!)

• science

• cheaper storage, compute power

• …

Why Data Mining?

The Explosive Growth of Data: from terabytes to petabytes

Data collection and data availability

Automated data collection tools, database systems, Web, computerized society

Major sources of abundant data

Business: Web, e-commerce, transactions, stocks, …

Science: Remote sensing, bioinformatics, scientific simulation, …

Society and everyone: news, digital cameras, YouTube

We are drowning in data, but starving for knowledge!

“Necessity is the mother of invention”—Data mining—Automated analysis of massive data sets

Jiawei Han, Micheline Kamber, and Jian Pei Data Mining: Concepts and Techniques 16


Evolution of Sciences

Before 1600, empirical science

1600-1950s, theoretical science

Each discipline has grown a theoretical component. Theoretical models often motivate experiments and generalize our understanding.

1950s-1990s, computational science

Over the last 50 years, most disciplines have grown a third, computational branch (e.g. empirical, theoretical, and computational ecology, or physics, or linguistics).

Computational Science traditionally meant simulation. It grew out of our inability to find closed-form solutions for complex mathematical models.

1990-now, data science

The flood of data from new scientific instruments and simulations

The ability to economically store and manage petabytes of data online

The Internet and computing Grid that makes all these archives universally accessible

Scientific info. management, acquisition, organization, query, and visualization tasks scale almost linearly with data volumes. Data mining is a major new challenge!

Jim Gray and Alex Szalay, The World Wide Telescope: An Archetype for Online Science, Comm. ACM, 45(11): 50-54, Nov. 2002

Evolution of Database Technology

1960s: Data collection, database creation, IMS and network DBMS

1970s: Relational data model, relational DBMS implementation

1980s: RDBMS, advanced data models (extended-relational, OO, deductive, etc.); application-oriented DBMS (spatial, scientific, engineering, etc.)

1990s: Data mining, data warehousing, multimedia databases, and Web databases

2000s: Stream data management and mining; data mining and its applications; Web technology (XML, data integration) and global information systems

Examples from Machine Learning

• 1950s – checkers (Arthur Samuel, 1959)

• 1960s – NN – perceptron and its limitations

• 1970s – expert systems, decision trees (ID3), …

• 1980s – Neural Networks, PAC learning, …

• 1990s – Data mining, ILP, Ensembles

• 2000s – SVM, Kernels, Graphical Models, …

Chapter 1. Introduction

Why Data Mining?

What Is Data Mining?

A Multi-Dimensional View of Data Mining

Data Mining Functionalities: What Kinds of Patterns Can Be Mined?

Data Mining: On What Kind of Data?

Time and Ordering: Sequential Pattern, Trend and Evolution Analysis

Structure and Network Analysis

Evaluation of Knowledge

Applications of Data Mining

Major Challenges in Data Mining

A Brief History of Data Mining and Data Mining Society

Summary

What Is Data Mining?

Data mining (knowledge discovery from data)

Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amount of data

Data mining: a misnomer?

Alternative names

Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.

Watch out: Is everything “data mining”?

Simple search and query processing

(Deductive) expert systems

Knowledge Discovery (KDD) Process

This is a view from typical database systems and data warehousing communities

Data mining plays an essential role in the knowledge discovery process

[Figure: the KDD process. Databases -> Data Cleaning and Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation]

Example: A Web Mining Framework

Web mining usually involves

Data cleaning

Data integration from multiple sources

Warehousing the data

Data cube construction

Data selection for data mining

Data mining

Presentation of the mining results

Patterns and knowledge to be used or stored into knowledge-base

Data Mining in Business Intelligence

[Figure: layered pyramid with increasing potential to support business decisions toward the top. Layers, bottom to top: Data Sources (paper, files, Web documents, scientific experiments, database systems); Data Preprocessing/Integration, Data Warehouses; Data Exploration (statistical summary, querying, and reporting); Data Mining (information discovery); Data Presentation (visualization techniques); Decision Making. Roles shown alongside, bottom to top: DBA, Data Analyst, Business Analyst, End User.]

Collaborative filtering

– Amazon, Netflix

• Collaborative filtering systems usually take two steps:

– Look for users who share the same rating patterns with the active user (the user whom the prediction is for).

– Use the ratings from those like‐minded users found in step 1 to calculate a prediction for the active user

Netflix prize

• http://www.netflixprize.com/

• http://en.wikipedia.org/wiki/Netflix_Prize

18K movies

480K customers ~ 100M ratings

???? Test on 2.8M withheld ratings

Social network

• Graph of connections

• Social network mining

Web

• Interlinked web sites and pages

• Directed Graph of links

• Information Retrieval, PageRank

• Web mining
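
As a rough illustration of how PageRank scores pages on a directed graph of links, here is a minimal power-iteration sketch. The four-page graph `LINKS`, the damping factor of 0.85 and the fixed iteration count are assumptions chosen for the example, not values from the lecture.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank. links maps a page to the pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets a small baseline share (the "random jump").
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page divides its rank evenly among its outgoing links.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A page without outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank

# Hypothetical toy web: page -> pages it links to
LINKS = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(LINKS)
```

Page "c" collects links from three pages and ends up with the highest score, while "d", which nobody links to, keeps only the baseline share.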

Web usage mining

• Software and web usage logs

• Typical use patterns

• User groups, their preferences, behavior

• Can you predict their goals and help to achieve them?

– distributed online transactions, queries, … (Google, etc)

Biomedical data mining

• Analyse:

– DNA

– Genotype information

– disease histories

– find associated genes

– predict and classify diseases and outcomes

– discover “how biology works”

– …

Combinatorial Data Mining Algorithms (research seminar, Sven Laur, PhD)

Basic ideas and techniques

– How to find frequent sets in databases

– How to find frequent motifs in sequences

Algorithmic problems

– Depth‐first vs breadth‐first search

– How to avoid combinatorial explosion

Interpretation of results

– Which patterns are important enough?
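
To make the frequent-set problem concrete, here is a minimal sketch of a level-wise (breadth-first) search in the style of the Apriori algorithm. It limits the combinatorial explosion by only extending itemsets whose subsets are already known to be frequent. The basket data, the support threshold and the function name `frequent_itemsets` are hypothetical; the seminar may well cover other algorithms.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for itemsets found in at least min_support transactions."""
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    items = {i for t in transactions for i in t}
    level = {frozenset([i]) for i in items}
    level = {s for s in level if support(s) >= min_support}
    result = {}
    k = 1
    while level:
        for s in level:
            result[s] = support(s)
        k += 1
        # Join: build size-k candidates from unions of frequent (k-1)-sets.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune: a candidate can be frequent only if all its (k-1)-subsets are.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in level for sub in combinations(c, k - 1))
        }
        level = {c for c in candidates if support(c) >= min_support}
    return result

# Hypothetical market-basket data
BASKETS = [
    {"bread", "milk"},
    {"bread", "beer", "eggs"},
    {"milk", "beer", "bread"},
    {"bread", "milk", "beer"},
    {"milk", "eggs"},
]
found = frequent_itemsets(BASKETS, min_support=3)
```

With a threshold of 3, the pair {milk, beer} occurs in only two baskets, so the triple {bread, milk, beer} is pruned without ever being counted against the data.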

Combinatorial Data Mining Algorithms (research seminar, Sven Laur, PhD)

Other important aspects

– How to handle noisy data

– Random sampling vs linear scan

Applications and extensions

– Association rules in practice

– Log analysis. Episode rules and usability

– Graph mining and biochemistry

Combinatorial Data Mining Algorithms (research seminar, Sven Laur, PhD)

Administrative details

• Combinatorial Data Mining Algorithms

• Gives 3 EAP (2 old AP)

• Takes place on Wednesdays in L122

• First seminar is on 16th of September

• Each participant has to give a presentation 

• Project work is combined with DM course

• http://courses.cs.ut.ee/2009/fast‐counting/

Research at U Tartu

• BIIT – http://biit.cs.ut.ee/

• STACC – Software Technologies and Applications Competence Center

– companies and universities

– Skype, Regio, Delfi, Quretec, …

– Research problems, topics, scholarships 

Research topics

• Publications => Projects, funding

• Relevant to STACC, companies

• Can lead to job offers

UT CS department

• Job offers:

• courses.cs.ut.ee ‐ web site development

– UT CS department courses web development

– Other sysadmin and Department development tasks

– …