Data Mining I
Summer semester 2019
Lecture 1: Introduction
Lectures: Prof. Dr. Eirini Ntoutsi
TAs: Tai Le Quy, Vasileios Iosifidis, Maximilian Idahl, Shaheer Asghar, Wazed Ali
Fakultät für Elektrotechnik und InformatikInstitut für Verteilte Systeme
AG Intelligente Systeme - Data Mining group
About me
03/2016 – present: Associate Professor, Faculty of Electrical Engineering & Computer Science, Leibniz University Hannover; L3S Research Center (since May 2016)
02/2012 – 02/2016: Post-doctoral researcher & lecturer, Institute for Informatics, LMU Munich, Germany
02/2010 – 01/2012: Alexander von Humboldt postdoc fellow, Institute for Informatics, LMU Munich, Germany
2009: Data Mining Expert, Hellenic Telecommunications Organization (OTE), Athens, Greece
04/2007 – 02/2009: Co-Founder and AI expert
NeeMo Startup, Greece
09/2003 – 09/2008: PhD in Data Mining, University of Piraeus, Athens, Greece
09/2001 – 09/2003: MSc, Computer Science/Text Mining, Polytechnic School, University of Patras, Greece
09/1996 – 09/2001: Diploma, Computer Engineering and Informatics/AI Games, Polytechnic School, University of Patras, Greece
Current focus areas:
• Data Stream Mining/Adaptive Machine Learning (learning from streaming data)
• Responsible AI: Fairness-Aware Machine Learning
Outline
■ Why study Data Mining?
■ Why do we need Data Mining?
■ What is the KDD (Knowledge Discovery in Databases) process?
■ Main data mining tasks
■ Course logistics
■ Things you should know from this lecture
■ Homework/Tutorial
Data Mining I @SS19: Introduction
Why study Data Mining/Machine Learning – famous quotes*
■ “A breakthrough in machine learning would be worth ten Microsofts” (Bill Gates, Chairman, Microsoft)
■ “Machine learning is the next Internet” (Tony Tether, Director, DARPA)
■ “Machine learning is the hot new thing” (John Hennessy, President, Stanford)
■ “Web rankings today are mostly a matter of machine learning” (Prabhakar Raghavan, Dir. Research, Yahoo)
■ “Machine learning is going to result in a real revolution” (Greg Papadopoulos, Former CTO, Sun)
■ “Machine learning today is one of the hottest aspects of computer science” (Steve Ballmer, CEO, Microsoft)
*Source: Pedro Domingos http://courses.cs.washington.edu/courses/cse446/15sp/slides/intro.pdf
Disclaimer: I use the terms data mining and machine learning (sometimes also Artificial Intelligence (AI)) interchangeably here and throughout the lecture. We will discuss the similarities/differences later. In both cases, we talk about learning from data.
Data Mining – Data Science – Big Data – Machine Learning – Deep Learning Analytics …
■ New fancy words for knowledge discovery from data
❑ Data mining and machine learning have focused on knowledge discovery from data for decades
❑ Well-defined set of tasks and solutions
■ Big data and analytics are more business terms and are ill-defined
■ The same holds today for AI
“Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it.”
Source: Dan Ariely, Duke University
Ever-increasing interest … the “rebranding” effect
Source: Google trends, query on 9.4.2019
Why study Data Mining – Data Scientist: the sexiest job of the 21st century
“If ‘sexy’ means having rare qualities that are much in demand, data scientists are already there. They are difficult and expensive to hire and, given the very competitive market for their services, difficult to retain. There simply aren’t a lot of people with their combination of scientific background and computational and analytical skills.”
Source: Harvard Business Review. Data Scientist: The Sexiest Job of the 21st Century. October 2012 link
■ The Internet of Things (IoT) is the network of physical objects or "things" embedded with electronics, software, sensors, and network connectivity, which enables these objects to collect and exchange data.
During 2008, the number of things connected to the internet surpassed the number of people on earth… By 2020 there will be 50 billion … vs 7.3 billion people (2015).
These things are everything: smartphones, tablets, refrigerators … even cattle.
“Increasingly, scientific breakthroughs will be powered by advanced computing capabilities that help researchers manipulate and explore massive datasets.”
– The Fourth Paradigm, Microsoft
Examples of e-science applications:
• Earth and environment
• Health and wellbeing
■ Are there people from Physics, Medicine, Engineering Sciences in the audience?
Outline
■ Why study Data Mining?
■ Why do we need Data Mining?
■ What is the KDD (Knowledge Discovery in Databases) process?
■ Main data mining tasks
■ Course logistics
■ Things you should know from this lecture
■ Homework/Tutorial
What is KDD
Knowledge Discovery in Databases (KDD) is the nontrivial process of identifying valid, novel, potentially
useful, and ultimately understandable patterns in data.
[Fayyad, Piatetsky-Shapiro, and Smyth 1996]
Remarks:
● valid: the discovered patterns should also hold for new, previously unseen problem instances.
● novel: at least to the system and preferably to the user
● potentially useful: they should lead to some benefit to the user or task
● ultimately understandable: the end user should be able to interpret the patterns either immediately or after some post-processing
Clarification: The term databases does not refer exclusively to relational databases storing structured data; it can be any data storage, holding structured, semi-structured, or unstructured data.
The KDD process and the Data Mining step
[Fayyad, Piatetsky-Shapiro & Smyth, 1996]

Data → Target data → Preprocessed data → Transformed data → Patterns → Knowledge

■ Selection:
❑ Select a relevant dataset or focus on a subset of a dataset
❑ File / DB
■ Preprocessing/Cleaning:
❑ Integration of data from different data sources
❑ Noise removal
❑ Missing values
■ Transformation:
❑ Select useful features
❑ Feature transformation/discretization
❑ Dimensionality reduction
■ Data Mining:
❑ Search for patterns of interest
■ Evaluation:
❑ Evaluate patterns based on interestingness measures
❑ Statistical validation of the models
❑ Visualization
❑ Descriptive statistics
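The stages above can be sketched as a chain of plain Python functions. This is a toy illustration, not the process definition: the records and the filtering choices below are invented.

```python
# Hypothetical sketch of the KDD pipeline stages as plain functions.

def selection(db):
    """Selection: keep only records where both measurements are present."""
    return [r for r in db if r.get("width") is not None and r.get("height") is not None]

def preprocessing(records):
    """Cleaning: drop obvious noise (non-positive measurements)."""
    return [r for r in records if r["width"] > 0 and r["height"] > 0]

def transformation(records):
    """Transformation: keep only the two useful numeric features."""
    return [(r["width"], r["height"]) for r in records]

def data_mining(points):
    """Mining: search for a (trivial) pattern of interest - the mean point."""
    n = len(points)
    return (sum(w for w, _ in points) / n, sum(h for _, h in points) / n)

def evaluation(pattern):
    """Evaluation: judge/report the pattern (here: just package it)."""
    return {"mean_width": pattern[0], "mean_height": pattern[1]}

db = [{"width": 2.6, "height": 4.5}, {"width": 3.7, "height": 7.3},
      {"width": -1.0, "height": 2.0}, {"width": 4.1, "height": None}]
knowledge = evaluation(data_mining(transformation(preprocessing(selection(db)))))
print(knowledge)
```

The noisy record and the record with a missing value are dropped by the cleaning stages, so the mined "pattern" is computed from the two valid instances only.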
A modern version: The Data Science process
The interdisciplinary nature of KDD 1/2
KDD draws on: Machine Learning, Databases, Statistics, Data visualization, Pattern recognition, Algorithms, and other disciplines.
The interdisciplinary nature of KDD 2/2
■ Statistics: model-based inference; focus on numerical data [Berthold & Hand 1999]
■ Machine Learning: theory + methods; focus on small datasets [Mitchell 1997]
■ Databases: scalability to large datasets; new data types (web data, micro-arrays, social data ...); integration with commercial databases [Chen, Han & Yu 1996]
KDD sits at the intersection of the three.
How do machines learn?
■ ML “gives computers the ability to learn without being explicitly programmed” (Arthur Samuel, 1959)
■ We don’t codify the solution. We don’t even know it!
■ Data is the key & the learning algorithm
Data + Algorithms → Models → (semi-)automatic decision making
How can we build computer programs that automatically improve with experience?
Tom Mitchell, Machine Learning book
More formally: How do machines learn?
■ A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
Tom Mitchell, Machine Learning 1997.
■ Example: A backgammon learning problem
❑ Task T: playing backgammon
❑ Performance measure P: % of games won against opponents
❑ Training experience E: playing practice games against itself
■ Example: Exam performance
❑ Task T: predict whether a student will pass the final DM exam or not
❑ Experience E: historical records of students that took the DM exam
❑ Performance measure P: % of correctly identified students
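The exam-performance example can be made concrete with a toy sketch: a deliberately simple "learned" rule (a made-up study-time threshold) is scored against invented historical records, and P comes out as a percentage.

```python
# Toy illustration of Mitchell's (T, E, P) framing for the exam task.
# The records and the threshold rule are invented; a real learner would
# fit such a rule from E instead of hard-coding it.

historical = [  # experience E: (hours_studied, passed_exam)
    (2, False), (9, True), (5, True), (1, False), (7, True),
]

def predict(hours, threshold=4):
    """A deliberately simple rule for task T: predict pass iff hours > threshold."""
    return hours > threshold

# Performance measure P: % of correctly identified students.
correct = sum(predict(hours) == passed for hours, passed in historical)
P = 100.0 * correct / len(historical)
print(P)  # 100.0 on this toy data
```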
(Machine) Learning from experience/feedback 1/2
■ “Experience comes in terms of data (the so-called instances or examples) from the specific problem/application”
■ Datasets consist of instances (also known as examples or objects)
❑ e.g., in a university database: students, professors, courses, grades,…
❑ e.g., in a library database: books, users, loans, publishers, ….
❑ e.g., in a movie database: movies, actors, director,…
■ Instances are described through features (also known as attributes or variables)
❑ E.g. a course is described in terms of a title, description, lecturer, teaching frequency etc.
❑ An easy to visualize example: if our data are in a database table, the rows are the instances and the columns are the features.
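A minimal sketch of this rows-are-instances, columns-are-features view, with invented course records:

```python
# Invented mini-dataset: each row is an instance (a course), each column a feature.
features = ["title", "lecturer", "frequency"]          # columns
instances = [                                          # rows
    ("Data Mining I", "Ntoutsi", "summer semester"),
    ("Databases", "N.N.", "winter semester"),
]

# An instance viewed as a feature -> value mapping:
course = dict(zip(features, instances[0]))
print(course["title"])  # prints "Data Mining I"
```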
(Machine) Learning from experience/feedback 2/2
■ Besides the instance description, we might also have feedback on those instances from some “teacher”/“expert”
❑ E.g., whether a student passed the exam
■ The direct feedback is known as a label, i.e., each instance is associated with a label → labeled dataset
■ But we might have no feedback at all → unlabeled dataset
■ There might also be indirect feedback
Lecture 2 is devoted to getting to know our data!
Short break (5’) – Modeling student data for the exam performance task
■ Recall our learning example
■ Example: Exam performance
❑ Task T: predict whether a student will pass the final DM exam or not
❑ Experience E: historical records of students that took the DM exam
❑ Performance measure P: % of correctly identified students
■ If students are the learning instances, what sort of features could I use to describe each of them?
■ What could be the feedback (direct, indirect) for the learning model (if any)?
Outline
■ Why study Data Mining?
■ Why do we need Data Mining?
■ What is the KDD (Knowledge Discovery in Databases) process?
■ Main data mining tasks
■ Course logistics
■ Things you should know from this lecture
■ Homework/Tutorial
Different learning tasks
Based on the feedback we have on the data, we can distinguish between:
■ Direct-feedback instances → Supervised learning
❑ the correct response/label is provided for each instance by the “teacher”
❑ e.g., good or bad product
■ No-feedback instances → Unsupervised learning
❑ no evaluation/label of the instances is provided, since there is no “teacher”
❑ e.g., no information on whether a product is good or bad, just the description of the product/instance
■ Indirect-feedback instances → Reinforcement learning
❑ less feedback is given: not the proper action, but only an evaluation of the chosen action, is given by the teacher
Different learning tasks: Supervised learning
■ Supervised learning/ Predictive:
❑ A description of the instances and their class labels is available (training set)
❑ The goal is to learn a mapping from the instances to the class labels, i.e., given a future unseen instance to predict its class label
■ Typical examples covered in this lecture:
❑ Classification
❑ Outlier detection
❑ Regression
Classification: an example
■ The goal is to learn a mapping from the “height, width” space to the class space (nails, screws, paper clips)
■ For a new object, the result of the classification is one of the class labels {nails, screws, paper clips}
Figure: objects plotted in the width [cm] vs. height [cm] space, grouped into Screw, Nails, and Paper clips.
instance | width | height | class
1 | 2.6 | 4.5 | Screw
2 | 3.7 | 7.3 | Nails
3 | 4.1 | 6.5 | Paper Clips
4 | 8.5 | 8.1 | Screw
5 | 9.5 | 5.5 | Nails
… | … | … | …
New object
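One way such a mapping can be learned is nearest-neighbour classification. The sketch below uses the table's five labelled instances and a hypothetical new object; 1-NN is just one of many possible classifiers, not the method the slide prescribes.

```python
import math

# Labelled instances from the table above: (width, height, class)
training = [
    (2.6, 4.5, "Screw"), (3.7, 7.3, "Nails"), (4.1, 6.5, "Paper Clips"),
    (8.5, 8.1, "Screw"), (9.5, 5.5, "Nails"),
]

def classify(width, height):
    """1-NN: return the class of the closest labelled instance."""
    _, label = min((math.dist((width, height), (w, h)), c) for w, h, c in training)
    return label

print(classify(2.5, 4.4))  # closest to instance 1 -> "Screw"
```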
Classification applications 1/2
■ Application: Fraud Detection
❑ Goal: Predict fraudulent cases in credit card transactions.
❑ Approach:
■ Use credit card transactions and information on the account holder as attributes.
❑ When does the customer buy, what do they buy, how often do they pay on time, etc.
■ Label past transactions as fraud or fair transactions. This forms the class attribute.
■ Learn a model for the class of the transactions.
■ Use this model to detect fraud by observing credit card transactions on an account.
Classification applications 2/2
■ Application: Churn prediction in telco
❑ Goal: Predict whether a customer is likely to be lost to a competitor
❑ Approach:
■ Use detailed record of transactions with each of the past and present customers, to find attributes.
❑ How often the customer calls, where they call, what time of day they call most, their financial status, marital status, etc.
■ Label the customers as loyal or disloyal (class attribute).
■ Find a model for customer loyalty
■ Use this model to predict churn and organize possible retention strategies.
Example: Google News
A huge variety of classification algorithms
Decision trees k nearest neighbours
Support vector machines
Neural networks Bayesian classifiers
Ensembles
Supervised learning: Regression
■ Similar to classification, but the target to be learned is continuous rather than discrete.
■ Goal: Predict a value of a given continuous valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency.
Given this data: a friend has a 750-square-foot house – how much can they expect to get for it?
Source: Andrew Ng ML course, Coursera
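A hedged sketch of how such a prediction could be made with ordinary least squares; the size/price pairs below are invented, so the resulting number is purely illustrative.

```python
# Invented (size, price) pairs; prices in $1000s. With real data the
# fitted coefficients would of course differ.
sizes  = [500, 1000, 1500, 2000]
prices = [100, 180, 260, 340]

n = len(sizes)
mean_x, mean_y = sum(sizes) / n, sum(prices) / n

# Ordinary least squares for the line: price = intercept + slope * size
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, prices))
         / sum((x - mean_x) ** 2 for x in sizes))
intercept = mean_y - slope * mean_x

predicted = intercept + slope * 750  # prediction for a 750 sq ft house
print(predicted)
```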
Application: Precision farming
■ Create a production curve depending on multiple parameters like soil characteristics, weather, and fertilizers used.
■ Only the appropriate amount of fertilizers given the environmental settings (soil, weather) will result in maximum yield.
■ Controlling the effects of over-fertilization on the environment is also important
Figure: production curve – yield as a function of fertilizer amount, given environmental parameters such as soil characteristics, weather, and water capacity.
Different learning tasks: Unsupervised learning
■ Unsupervised learning/ Descriptive:
❑ Only a description of the instances is available
❑ No feedback/labels are available
❑ The goal is to discover groups of similar instances
■ Typical subtasks covered in this lecture:
❑ clustering
❑ association rules mining
❑ outlier detection
Clustering: an example
■ Each point described in terms of its height and width
■ No information on the actual classes (nails, paper clips) is available to the clustering algorithm.
Figure: the same points in the width [cm] vs. height [cm] space, grouped into Cluster 1 and Cluster 2.
instance | width | height
1 | 2.6 | 4.5
2 | 3.7 | 7.3
3 | 4.1 | 6.5
4 | 8.5 | 8.1
5 | 9.5 | 5.5
… | … | …
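Groups like Cluster 1 and Cluster 2 can be found, for instance, with k-means. Below is a tiny sketch (k = 2, hand-picked starting centroids) on the five instances from the table; note that no class labels are used anywhere.

```python
import math

# Unlabelled points from the table above: (width, height)
points = [(2.6, 4.5), (3.7, 7.3), (4.1, 6.5), (8.5, 8.1), (9.5, 5.5)]
centroids = [points[0], points[-1]]  # hand-picked initial centroids

for _ in range(10):  # a few assign/update rounds; converges quickly here
    clusters = [[], []]
    for p in points:
        # assignment step: each point joins its nearest centroid
        nearest = min(range(2), key=lambda i: math.dist(p, centroids[i]))
        clusters[nearest].append(p)
    # update step: each centroid moves to the mean of its cluster
    centroids = [(sum(x for x, _ in c) / len(c),
                  sum(y for _, y in c) / len(c)) for c in clusters]

print(clusters)  # the three small objects and the two large ones separate
```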
Clustering applications 1/2
Application: Market Segmentation
■ Goal: subdivide a market into distinct subsets of customers where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.
■ Approach:
❑ Collect different attributes of customers based on their geographical and lifestyle related information.
■ E.g., age, income, education, family status, ….
❑ Find clusters of similar customers.
❑ Measure the clustering quality by observing buying patterns of customers in same cluster vs. those from different clusters.
Clustering applications 2/2
Application: Document clustering
■ Find groups of documents (topics) that are similar to each other based on the important terms appearing in them.
■ Approach:
❑ Identify important terms in each document.
❑ Form a similarity measure between documents.
❑ Cluster based on the similarity measure.
■ Gain:
❑ Help the end user to navigate in the collection of documents (based on the extracted clusters).
❑ Utilize the clusters to relate a new document or search term to clustered documents.
■ Check, for example, Google News.
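The three steps above can be sketched with the simplest possible choices: terms are whitespace tokens and the similarity measure is the Jaccard coefficient over term sets. The mini "documents" are invented; real systems use weighted terms (e.g., TF-IDF).

```python
# Invented mini-corpus; terms() is the crudest possible "important terms" step.
docs = {
    "d1": "stock market prices fall",
    "d2": "market prices rise on stock news",
    "d3": "team wins football final",
}

def terms(text):
    """Step 1: extract terms (here: plain whitespace tokens)."""
    return set(text.split())

def jaccard(a, b):
    """Step 2: similarity = shared terms / all terms."""
    return len(a & b) / len(a | b)

# Step 3 would cluster documents on these similarities; here we just inspect them.
print(jaccard(terms(docs["d1"]), terms(docs["d2"])))  # same topic: high-ish
print(jaccard(terms(docs["d1"]), terms(docs["d3"])))  # unrelated: 0.0
```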
Example: Google News
A huge variety of clustering algorithms
Partitioning methods (k-Means)
Grid-based methods (CLIQUE)
Density-based methods (DBSCAN)
Hierarchical methods
Constraint-based methods
Model-based methods (EM)
Unsupervised learning: Association rules mining
■ Task: Find all rules in the database of the following form:
If x, y, z are contained in a set M, then t is also contained in M with a probability of at least X%.
Example: In 5 out of 5 cases (100%) it holds that: if b, c are in the basket, then d is as well.
• a = milk  • b = cheese  • c = wine
• d = pasta  • e = yogurt  • f = apples
Application: Market basket analysis
■ Result:
❑ Items frequently purchased together are better positioned close to each other: e.g., since diapers are often purchased together with beer ⇒ place beer on the way from the diapers to the checkout
❑ Generate recommendations for customers with similar baskets: e.g., customers that bought “Star Wars” might also be interested in “The Lord of the Rings”.
Shopping baskets are collected in a data warehouse.
Possible generalizations:
• Paprika chips → Snacks
• Enrichment of customer data
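The rule from the slide ("if b, c then d" in 5 out of 5 cases) can be recomputed from raw baskets. The five baskets below are hypothetical but constructed to reproduce the 100% figure; item codes follow the legend above (b = cheese, c = wine, d = pasta).

```python
# Five hypothetical baskets, each a set of item codes.
baskets = [
    {"a", "b", "c", "d"}, {"b", "c", "d"}, {"b", "c", "d", "e"},
    {"b", "c", "d", "f"}, {"a", "b", "c", "d", "e"},
]

antecedent, consequent = {"b", "c"}, {"d"}
with_ante = [t for t in baskets if antecedent <= t]    # baskets containing b and c
with_both = [t for t in with_ante if consequent <= t]  # ... that also contain d

support = len(with_both) / len(baskets)       # fraction of all baskets
confidence = len(with_both) / len(with_ante)  # P(d | b, c)
print(support, confidence)  # 1.0 1.0 -> "in 5 out of 5 cases (100%)"
```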
■ Groups of 2 (Please form the teams by yourselves)
■ Goal: learn how to run a data mining case study, from data preprocessing to transformation, learning algorithm, evaluation, and presentation of the results. Both the analysis and the presentation parts are important.
■ We will use Kaggle for result submission (but you have to submit the report separately)
■ We will have a poster session at the end where each team presents its results
■ Bonus scheme
❑ Pass both projects: you move up to the next best grade
■ e.g., from 1.7 → 1.3
❑ Each member “inherits” the grade of the group
❑ Extra bonus for those who score best in Kaggle (system) & those with the best poster (voting)
■ What is the KDD (Knowledge Discovery in Databases) process?
■ Main data mining tasks
■ Course logistics
■ Things you should know from this lecture
■ Homework/ Tutorial
Things you should know from this lecture
■ KDD definition
■ KDD process
■ DM step
■ Supervised vs Unsupervised learning
■ Main DM tasks
❑ Clustering
❑ Classification
❑ Regression
❑ Association rules mining
❑ Outlier detection
Outline
■ Why study Data Mining?
■ Why do we need Data Mining?
■ What is the KDD (Knowledge Discovery in Databases) process?
■ Main data mining tasks
■ Course logistics
■ Things you should know from this lecture
■ Homework/Tutorial
Homework/ Tutorial
■ Homework: Think of some real-world applications that you find suitable for Data Mining.
❑ Why?
❑ What type of patterns would you look for?
❑ Would you approach it as a supervised or unsupervised learning task?
■ Readings:
❑ Tan P.-N., Steinbach M., Kumar V., Introduction to Data Mining, Chapter 1.
❑ U. Fayyad, G. Piatetsky-Shapiro, P. Smyth (1996), “From Data Mining to Knowledge Discovery: An Overview”, in Advances in Knowledge Discovery and Data Mining, U. Fayyad et al. (Eds.), AAAI/MIT Press.
Acknowledgement
■ The slides are based on
❑ KDD I lecture at LMU Munich (Johannes Aßfalg, Christian Böhm, Karsten Borgwardt, Martin Ester, Eshref Januzaj, Karin Kailing, Peer Kröger, Eirini Ntoutsi, Jörg Sander, Matthias Schubert, Arthur Zimek, Andreas Züfle)
❑ Introduction to Data Mining book slides at http://www-users.cs.umn.edu/~kumar/dmbook/
❑ Pedro Domingos’ Machine Learning course slides at the University of Washington
❑ Machine Learning book by T. Mitchell, slides at http://www.cs.cmu.edu/~tom/mlbook-chapter-slides.html
❑ Thanks to all TAs who contributed to their improvement, namely Vasileios Iosifidis, Damianos Melidis, Tai Le Quy, Han Tran