
Data Mining: an Introduction

Aug 11, 2014


An Introduction to Data Mining,
Clustering, Classification, Regression, Text Mining
Transcript
Page 1: Data Mining: an Introduction


DATA MINING AND MACHINE LEARNING IN A NUTSHELL

AN INTRODUCTION TO DATA MINING

Mohammad-Ali Abbasi
http://www.public.asu.edu/~mabbasi2/

SCHOOL OF COMPUTING, INFORMATICS, AND DECISION SYSTEMS ENGINEERING
ARIZONA STATE UNIVERSITY

http://dmml.asu.edu/

Page 2: Data Mining: an Introduction


INTRODUCTION

• The data production rate has increased dramatically (Big Data), and we are able to store much more data than before
  – E.g., purchase data, social media data, mobile phone data
• Businesses and customers need useful or actionable knowledge, and want to gain insight from raw data for various purposes
  – It's not just searching data or databases

Data mining helps us extract new information and uncover hidden patterns from stored and streaming data

Page 3: Data Mining: an Introduction


DATA MINING

• Extracting or “mining” knowledge from large amounts of data, or big data

• Data-driven discovery and modeling of hidden patterns in big data

• Extracting implicit, previously unknown, unexpected, and potentially useful information/knowledge from data

The process of discovering hidden patterns in large data sets

It utilizes methods at the intersection of artificial intelligence, machine learning, statistics, and database systems

Page 4: Data Mining: an Introduction


DATA MINING STORIES

• “My bank called and said that they saw that I bought two surfboards at Laguna Beach, California.” - credit card fraud detection

• The NSA is using data mining to analyze telephone call data to track al-Qaeda activities

• Walmart uses data mining to control product distribution based on typical customer buying patterns at individual stores

Page 5: Data Mining: an Introduction


DATA MINING VS. DATABASES

• Data mining is the process of extracting hidden and actionable patterns from data

• Database systems store and manage data
  – Queries return part of the stored data
  – Queries do not extract hidden patterns
• Examples of querying databases:
  – Find all employees with income more than $250K
  – Find the top-spending customers of the last month
  – Find all students from the engineering college with a GPA above average

Page 6: Data Mining: an Introduction


EXAMPLES OF DATA MINING APPLICATIONS

• Identifying fraudulent credit card transactions or spam emails
  – Given a user's purchase history and a new transaction, identify whether the transaction is fraudulent
  – Determine whether a given email is spam or not
• Extracting purchase patterns from existing records
  – beer ⇒ diapers (80%)

• Forecasting future sales and needs according to some given samples

• Extracting groups of like-minded people in a given network

Page 7: Data Mining: an Introduction


BASIC DATA MINING TASKS

• Classification
  – Assign data into predefined classes
    • Spam detection, fraudulent credit card detection

• Regression
  – Predict a real value for a given data instance
    • Predict the price of a given house

• Clustering
  – Group similar items together into clusters
    • Detect communities in a given social network

Page 8: Data Mining: an Introduction


DATA

Page 9: Data Mining: an Introduction


DATA INSTANCES

• A collection of properties and features related to an object or person
  – A patient's medical record
  – A user's profile
  – A gene's information

• Instances are also called examples, records, data points, or observations

Data instance: features or attributes + class label

Page 10: Data Mining: an Introduction


DATA TYPES

• Nominal (categorical)
  – No comparison is defined
  – E.g., {male, female}
• Ordinal
  – Comparable, but the difference is not defined
  – E.g., {low, medium, high}
• Interval
  – Addition and subtraction are defined, but not division
  – E.g., 3:08 PM, calendar dates
• Ratio
  – All arithmetic operations, including ratios, are defined
  – E.g., height, weight, money quantities

Page 11: Data Mining: an Introduction


SAMPLE DATASET

outlook   temperature  humidity  windy  play
sunny         85          85     FALSE  no
sunny         80          90     TRUE   no
overcast      83          86     FALSE  yes
rainy         70          96     FALSE  yes
rainy         68          80     FALSE  yes
rainy         65          70     TRUE   no
overcast      64          65     TRUE   yes
sunny         72          95     FALSE  no
sunny         69          70     FALSE  yes
rainy         75          80     FALSE  yes
sunny         75          70     TRUE   yes
overcast      72          90     TRUE   yes
overcast      81          75     FALSE  yes
rainy         71          91     TRUE   no

(Column data types annotated on the slide: nominal, ordinal, interval)

Page 12: Data Mining: an Introduction


DATA QUALITY

When making data ready for data mining algorithms, data quality needs to be assured
• Noise
  – Noise is distortion of the data
• Outliers
  – Outliers are data points that are considerably different from the other data points in the dataset
• Missing Values
  – Missing feature values in data instances

• Duplicate data

Page 13: Data Mining: an Introduction


DATA PREPROCESSING

• Aggregation
  – When multiple attributes need to be combined into a single attribute, or when the scale of the attributes changes
• Discretization
  – From continuous values to discrete values
• Feature Selection
  – Choose relevant features
• Feature Extraction
  – Create a mapping to new features from the original features
• Sampling
  – Random sampling
  – Sampling with or without replacement
  – Stratified sampling

Page 14: Data Mining: an Introduction


CLASSIFICATION

Page 15: Data Mining: an Introduction


CLASSIFICATION

Learning patterns from labeled data and classifying new data with labels (categories)
– For example, we want to classify an e-mail as "legitimate" or "spam"

(Figure: classifier)

Page 16: Data Mining: an Introduction


CLASSIFICATION: THE PROCESS

• In classification, we are given a set of labeled examples

• These examples are records/instances in the format (x, y) where x is a vector and y is the class attribute, commonly a scalar

• The classification task is to build a model that maps x to y

• Our task is to find a mapping f such that f(x) = y

Page 17: Data Mining: an Introduction


CLASSIFICATION: THE PROCESS

Page 18: Data Mining: an Introduction


CLASSIFICATION: AN EMAIL EXAMPLE

• A set of emails is given, where users have manually identified each as spam versus non-spam

• Our task is to use a set of features, such as the words in the email (x), to identify the spam/non-spam status of the email (y)

• In this case, the classes are y = {spam, non-spam}

• How would this be dealt with in a social setting?

Page 19: Data Mining: an Introduction


CLASSIFICATION ALGORITHMS

• Decision tree learning

• Naive Bayes learning

• K-nearest neighbor classifier

Page 20: Data Mining: an Introduction


DECISION TREE

• A decision tree is learned from the dataset (training data with known classes) and later applied to predict the class attribute value of new data (test data with unknown classes) where only the feature values are known

Page 21: Data Mining: an Introduction


DECISION TREE INDUCTION

Page 22: Data Mining: an Introduction


ID3, A DECISION TREE ALGORITHM

Use information gain (entropy) to determine how well an attribute separates the training data according to the class attribute value:

Entropy(D) = −p+ log2(p+) − p− log2(p−)

– p+ is the proportion of positive examples in D
– p− is the proportion of negative examples in D

In a dataset containing ten examples, where 7 have a positive class attribute value and 3 have a negative class attribute value [7+, 3−]:

Entropy(D) = −(7/10) log2(7/10) − (3/10) log2(3/10) ≈ 0.88

If the numbers of positive and negative examples in the set are equal, the entropy is 1
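As a quick check of these numbers, here is a minimal Python sketch (my own illustration, not part of the original slides) that computes the entropy of a binary-labeled set:

    import math

    def entropy(p_pos, p_neg):
        """Entropy of a binary-labeled set, from the class proportions."""
        # 0 * log2(0) is taken as 0 by convention
        return -sum(p * math.log2(p) for p in (p_pos, p_neg) if p > 0)

    print(entropy(7/10, 3/10))  # the [7+, 3-] dataset above -> ~0.88
    print(entropy(5/10, 5/10))  # balanced classes -> 1.0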

Page 23: Data Mining: an Introduction


DECISION TREE: EXAMPLE 1

outlook   temperature  humidity  windy  play
sunny         85          85     FALSE  no
sunny         80          90     TRUE   no
overcast      83          86     FALSE  yes
rainy         70          96     FALSE  yes
rainy         68          80     FALSE  yes
rainy         65          70     TRUE   no
overcast      64          65     TRUE   yes
sunny         72          95     FALSE  no
sunny         69          70     FALSE  yes
rainy         75          80     FALSE  yes
sunny         75          70     TRUE   yes
overcast      72          90     TRUE   yes
overcast      81          75     FALSE  yes
rainy         71          91     TRUE   no

Page 24: Data Mining: an Introduction


DECISION TREE: EXAMPLE 2

(Figure: two learned decision trees, with class labels at the leaves)

Page 25: Data Mining: an Introduction


NAIVE BAYES CLASSIFIER

For two random variables X and Y, Bayes' theorem states that

P(Y | X) = P(X | Y) · P(Y) / P(X)

where Y is the class variable and X represents the instance features.

The class attribute value for instance X is then the most probable class:

y = argmax over y of P(y | X)

Assuming that the features are independent given the class:

P(X | Y) = P(x1 | Y) · P(x2 | Y) · … · P(xm | Y)
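As an illustration (not from the slides), a minimal sketch of this computation over the outlook and windy columns of the sample weather dataset; the encoding and helper names are my own:

    from collections import Counter, defaultdict

    # (outlook, windy, play) rows from the sample weather dataset
    rows = [("sunny", "FALSE", "no"), ("sunny", "TRUE", "no"),
            ("overcast", "FALSE", "yes"), ("rainy", "FALSE", "yes"),
            ("rainy", "FALSE", "yes"), ("rainy", "TRUE", "no"),
            ("overcast", "TRUE", "yes"), ("sunny", "FALSE", "no"),
            ("sunny", "FALSE", "yes"), ("rainy", "FALSE", "yes"),
            ("sunny", "TRUE", "yes"), ("overcast", "TRUE", "yes"),
            ("overcast", "FALSE", "yes"), ("rainy", "TRUE", "no")]

    prior = Counter(play for _, _, play in rows)      # counts for P(y)
    counts = defaultdict(Counter)                     # counts for P(x_i | y)
    for outlook, windy, play in rows:
        counts[("outlook", outlook)][play] += 1
        counts[("windy", windy)][play] += 1

    def predict(outlook, windy):
        scores = {}
        for y, n_y in prior.items():
            p = n_y / len(rows)                            # P(y)
            p *= counts[("outlook", outlook)][y] / n_y     # P(outlook | y)
            p *= counts[("windy", windy)][y] / n_y         # P(windy | y)
            scores[y] = p    # P(X) is omitted: it is the same for every y
        return max(scores, key=scores.get)

    print(predict("sunny", "TRUE"))  # -> "no" for this data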

Page 26: Data Mining: an Introduction


NBC: AN EXAMPLE

Page 27: Data Mining: an Introduction


NEAREST NEIGHBOR CLASSIFIER

• k-nearest neighbor employs the neighbors of a data point to perform classification

• The instance being classified is assigned the label held by the majority of its k neighbors

• When k = 1, the closest neighbor’s label is used as the predicted label for the instance being classified

• For determining the neighbors, distance is computed based on some distance metric, e.g., Euclidean distance

Page 28: Data Mining: an Introduction


K-NN: ALGORITHM

1. The dataset, the number of neighbors (k), and the instance i are given

2. Compute the distance between i and all other data points in the dataset

3. Pick the k closest neighbors

4. The class label for the data point i is the one that the majority of the neighbors holds (if more than one class ties, select one of them randomly)
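A minimal sketch of these four steps in Python (my own illustration; the data points are hypothetical):

    import math
    from collections import Counter

    def knn_predict(train, query, k):
        """train: list of (feature_vector, label) pairs; query: a feature vector."""
        # Steps 2-3: sort by Euclidean distance to the query, keep the k closest
        neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
        # Step 4: majority vote (most_common breaks ties arbitrarily)
        return Counter(label for _, label in neighbors).most_common(1)[0][0]

    train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"),
             ((4.0, 4.2), "B"), ((4.1, 3.9), "B")]
    print(knn_predict(train, (1.1, 1.0), k=3))  # -> "A"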

Page 29: Data Mining: an Introduction


K-NEAREST NEIGHBOR: EXAMPLE

• Depending on k, different labels can be predicted for the green circle
• In our example, k = 3 and k = 5 generate different labels for the instance
• For k = 10, we can choose either triangle or rectangle

(Figure: the query instance and its neighborhoods for k = 3, k = 5, and k = 10)

Page 30: Data Mining: an Introduction


K-NEAREST NEIGHBOR: EXAMPLE

Data instance  Outlook  Temperature  Humidity  Similarity  Label | k  Prediction
      2           1          1          1          3         N   | 1      N
      1           1          0          1          2         N   | 2      N
      4           0          1          1          2         Y   | 3      N
      3           0          0          1          1         Y   | 4      ?
      5           1          0          0          1         Y   | 5      Y
      6           0          0          0          0         N   | 6      ?
      7           0          0          0          0         Y   | 7      Y

Similarity between row 8 and the other data instances
(similarity = 1 if the attributes have the same value, otherwise similarity = 0)

Page 31: Data Mining: an Introduction


EVALUATING CLASSIFICATION PERFORMANCE

• As the class labels are discrete, we can measure accuracy by dividing the number of correctly predicted labels (C) by the total number of instances (N)
• Accuracy = C / N
• Error rate = 1 − Accuracy

• More sophisticated approaches of evaluation will be discussed later

Page 32: Data Mining: an Introduction


REGRESSION

Page 33: Data Mining: an Introduction


REGRESSION

Regression analysis includes techniques for modeling and analyzing the relationship between a dependent variable and one or more independent variables
• Regression analysis is widely used for prediction and forecasting
• It can be used to infer relationships between the independent and dependent variables

Page 34: Data Mining: an Introduction


REGRESSION

In regression, we deal with real numbers as class values (Recall that in classification, class values or labels are categories)

y ≈ f(X)

Regressors: x1, x2, …, xm

Dependent variable: y ∈ R

Our task is to find the relation between y and the vector (x1, x2, …, xm)

Page 35: Data Mining: an Introduction


LINEAR REGRESSION

In linear regression, we assume the relation between the class attribute y and the feature set x to be linear:

y ≈ w0 + w1x1 + w2x2 + … + wmxm

where w represents the vector of regression coefficients
• The regression problem can be solved by estimating w using the provided dataset and the labels y
  – Least squares is often used to solve the problem

Page 36: Data Mining: an Introduction


SOLVING LINEAR REGRESSION PROBLEMS

• The problem of regression can be solved by estimating w using the dataset provided and the labels y
  – "Least squares" is a popular method to solve regression problems

Page 37: Data Mining: an Introduction


LEAST SQUARES

Find W that minimizes ‖Y − XW‖², for the matrix of regressors X and the labels Y
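In matrix form the minimizer is W = (XᵀX)⁻¹XᵀY. A small NumPy sketch of this (my own illustration, using the first rows of the later salary example as data):

    import numpy as np

    # Rows are instances; the leading column of 1s models the intercept w0
    X = np.array([[1.0, 3.0], [1.0, 8.0], [1.0, 9.0], [1.0, 13.0]])
    Y = np.array([30.0, 57.0, 64.0, 72.0])

    # Closed-form least squares: solve (X^T X) W = X^T Y
    W = np.linalg.solve(X.T @ X, X.T @ Y)
    # np.linalg.lstsq does the same job and is numerically safer
    W_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
    print(W, W_lstsq)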

Page 38: Data Mining: an Introduction


LEAST SQUARES

Page 39: Data Mining: an Introduction


REGRESSION COEFFICIENTS

• When there is only one independent variable:

y = w0 + w1x

• With two independent variables:

y = w0 + w1x1 + w2x2

Page 40: Data Mining: an Introduction


LINEAR REGRESSION: EXAMPLE

Years of experience Salary ($K)

3 30

8 57

9 64

13 72

3 36

6 43

11 59

21 90

1 20

16 83
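A one-variable fit of this table with NumPy (my own sketch, not from the slides):

    import numpy as np

    years  = np.array([3, 8, 9, 13, 3, 6, 11, 21, 1, 16], dtype=float)
    salary = np.array([30, 57, 64, 72, 36, 43, 59, 90, 20, 83], dtype=float)

    w1, w0 = np.polyfit(years, salary, deg=1)   # fit salary = w0 + w1 * years
    print(f"salary ~ {w0:.1f} + {w1:.1f} * years")
    print(np.polyval([w1, w0], 10.0))           # predicted salary ($K) at 10 years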

Page 41: Data Mining: an Introduction


EVALUATING REGRESSION PERFORMANCE

• The labels cannot be predicted precisely
• A margin must be set to accept or reject predictions
  – For example, when the observed temperature is 71, any prediction in the range 71 ± 0.5 can be considered a correct prediction

Page 42: Data Mining: an Introduction


CLUSTERING

Page 43: Data Mining: an Introduction


CLUSTERING

• Clustering is a form of unsupervised learning
  – Clustering algorithms do not have examples showing how the samples should be grouped together
• Clustering algorithms look for patterns or structures in the data that are of interest
• Clustering algorithms group similar items together

Grouping together items that are similar in some way, according to some criteria

Page 44: Data Mining: an Introduction


CLUSTERING: EXAMPLE

Page 45: Data Mining: an Introduction


MEASURING SIMILARITY IN CLUSTERING ALGORITHMS

• The goal is to group similar items together
• Different similarity measures can be used to find similar items
• The similarity measure is usually critical to a clustering algorithm

The most popular (dis)similarity measures for continuous features are Euclidean distance and Pearson linear correlation

Page 46: Data Mining: an Introduction


EUCLIDEAN DISTANCE – A DISSIMILARITY MEASURE

d(x, y) = sqrt( (x1 − y1)² + (x2 − y2)² + … + (xn − yn)² )

• Here n is the number of dimensions in the data vector

Page 47: Data Mining: an Introduction


PEARSON LINEAR CORRELATION

• Pearson correlation shifts the profiles (subtracting the means) and scales them by the standard deviations (i.e., makes the data have mean = 0 and std = 1):

ρ(x, y) = (1/n) Σi ((xi − x̄)/σx) · ((yi − ȳ)/σy)

• It is always between −1 and +1 (perfectly anti-correlated and perfectly correlated)
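A small NumPy sketch of both measures (my own illustration, with hypothetical vectors):

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0])
    y = np.array([2.0, 4.0, 6.0, 8.0])

    # Euclidean distance: square root of the summed squared differences
    print(np.sqrt(np.sum((x - y) ** 2)))   # same as np.linalg.norm(x - y)

    # Pearson correlation: standardize both vectors, then average the products
    zx = (x - x.mean()) / x.std()
    zy = (y - y.mean()) / y.std()
    print(np.mean(zx * zy))                # 1.0 here: perfectly correlated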

Page 48: Data Mining: an Introduction


SIMILARITY MEASURES: MORE DEFINITIONS

Page 49: Data Mining: an Introduction


CLUSTERING

• Distance-based algorithms

– K-Means

• Hierarchical algorithms

Page 50: Data Mining: an Introduction


K-MEANS

k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean
• Finding the globally optimal partition into k clusters is computationally expensive (NP-hard). However, there are efficient heuristic algorithms that are commonly employed and converge quickly to an optimum that might not be global.

Page 51: Data Mining: an Introduction


K-MEANS

• Given a set of observations (x1, x2, …, xn), where each observation is a d-dimensional real vector, k-means clustering aims to partition the n observations into k sets (k ≤ n), S = {S1, S2, …, Sk}, so as to minimize the within-cluster sum of squares:

arg min over S of Σ(i = 1..k) Σ(x ∈ Si) ‖x − μi‖²

where μi is the mean of the points in Si

Page 52: Data Mining: an Introduction


K-MEANS: ALGORITHM

Given data points xi and an initial set of k centroids m1(1), …, mk(1), the algorithm proceeds as follows:

• Assignment step: Assign each data point to the cluster Si with the closest centroid (each data point goes into exactly one cluster)

• Update step: Recalculate each mean as the centroid of the data points assigned to its cluster
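A compact NumPy sketch of these two steps (my own illustration; it assumes no cluster ever becomes empty):

    import numpy as np

    def kmeans(X, k, n_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        centroids = X[rng.choice(len(X), size=k, replace=False)]  # initial centroids
        for _ in range(n_iter):
            # Assignment step: each point joins the cluster with the closest centroid
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            assign = dists.argmin(axis=1)
            # Update step: each centroid becomes the mean of its assigned points
            new_centroids = np.array([X[assign == j].mean(axis=0) for j in range(k)])
            if np.allclose(new_centroids, centroids):
                break
            centroids = new_centroids
        return assign, centroids

    # The 8 points used in the example on the next slide
    X = np.array([[1, 1], [2, 1], [1, 2], [2, 2],
                  [4, 4], [4, 5], [5, 4], [5, 5]], dtype=float)
    print(kmeans(X, k=2))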

Page 53: Data Mining: an Introduction


K-MEANS: AN EXAMPLE

Data point   X   Y
    1        1   1
    2        2   1
    3        1   2
    4        2   2
    5        4   4
    6        4   5
    7        5   4
    8        5   5

                Cluster 1                 Cluster 2
Step    Data point   Centroid     Data point   Centroid
 1          1        (1.0, 1.0)       2        (2.0, 1.0)

Page 54: Data Mining: an Introduction


RUNNING K-MEANS ON IRIS DATASET

Page 55: Data Mining: an Introduction


HIERARCHICAL CLUSTERING

Hierarchical clustering is a method of cluster analysis which seeks to build a hierarchy of clusters.
• Strategies for hierarchical clustering generally fall into two types:
  – Agglomerative: a "bottom-up" approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
  – Divisive: a "top-down" approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.

Page 56: Data Mining: an Introduction


HIERARCHICAL ALGORITHMS

• Initially, the n data points are considered as either 1 cluster or n clusters in hierarchical clustering

• These clusters are gradually split or merged (divisive or agglomerative hierarchical clustering algorithms, respectively), depending on the type of the algorithm

• This continues until the desired number of clusters is reached

Page 57: Data Mining: an Introduction


HIERARCHICAL AGGLOMERATIVE CLUSTERING

• Start with each data point as its own cluster
• Keep merging the most similar pairs of data points/clusters until only one big cluster is left
• This is called a bottom-up or agglomerative method

This produces a binary tree or dendrogram
– The final cluster is the root and each data point is a leaf
– The height of the bars indicates how close the points are

Page 58: Data Mining: an Introduction


HIERARCHICAL CLUSTERING: AN EXAMPLE

Page 59: Data Mining: an Introduction


MERGING THE DATA POINTS IN HIERARCHICAL CLUSTERING

• Average Linkage
  – Each cluster ci is associated with a mean vector μi, which is the mean of all the data items in the cluster
  – The distance between two clusters ci and cj is then just d(μi, μj)
• Single Linkage
  – The minimum of all pairwise distances between points in the two clusters
• Complete Linkage
  – The maximum of all pairwise distances between points in the two clusters
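As an illustration (not from the slides), SciPy's hierarchical clustering exposes these choices through its method argument; note that SciPy's 'average' is the mean of all pairwise distances, while the centroid-distance variant described above is called 'centroid' there:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    X = np.array([[1, 1], [2, 1], [1, 2], [4, 4], [4, 5], [5, 4]], dtype=float)

    # method can be 'single', 'complete', 'average', 'centroid', ...
    Z = linkage(X, method='single')
    print(fcluster(Z, t=2, criterion='maxclust'))  # cut the dendrogram into 2 clusters
    # scipy.cluster.hierarchy.dendrogram(Z) draws the tree with matplotlib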

Page 60: Data Mining: an Introduction


LINKAGE IN HIERARCHICAL CLUSTERING: EXAMPLE

(Figure: example clusterings produced by single, average, and complete linkage)

Page 61: Data Mining: an Introduction


EVALUATING THE CLUSTERINGS

• Evaluation with ground truth
• Evaluation without ground truth

When we are given objects of two different kinds, the perfect clustering would be that objects of the same type are clustered together.

Page 62: Data Mining: an Introduction


EVALUATION WITH GROUND TRUTH

When ground truth is available, the evaluator has prior knowledge of what the clustering should be
– That is, we know the correct clustering assignments

• Measures
  – Precision and Recall, or F-Measure
  – Purity
  – Normalized Mutual Information (NMI)

Page 63: Data Mining: an Introduction


PRECISION AND RECALL

• True Positive (TP):
  – When similar points are assigned to the same cluster
  – This is considered a correct decision
• True Negative (TN):
  – When dissimilar points are assigned to different clusters
  – This is considered a correct decision
• False Negative (FN):
  – When similar points are assigned to different clusters
  – This is considered an incorrect decision
• False Positive (FP):
  – When dissimilar points are assigned to the same cluster
  – This is considered an incorrect decision
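For clustering, these counts are usually taken over pairs of points. A sketch (my own illustration, with hypothetical labelings):

    from itertools import combinations

    def pairwise_counts(labels_true, clusters):
        """Count pairs of points by (same true label?, same cluster?)."""
        tp = fp = fn = tn = 0
        for i, j in combinations(range(len(labels_true)), 2):
            same_label = labels_true[i] == labels_true[j]
            same_cluster = clusters[i] == clusters[j]
            if same_label and same_cluster:
                tp += 1   # similar points, same cluster: correct
            elif not same_label and same_cluster:
                fp += 1   # dissimilar points, same cluster: incorrect
            elif same_label and not same_cluster:
                fn += 1   # similar points, different clusters: incorrect
            else:
                tn += 1   # dissimilar points, different clusters: correct
        return tp, fp, fn, tn

    tp, fp, fn, tn = pairwise_counts([1, 1, 1, 2, 2, 2], [1, 1, 2, 2, 2, 2])
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    print(precision, recall)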

Page 64: Data Mining: an Introduction


PRECISION AND RECALL: EXAMPLE 1

Page 65: Data Mining: an Introduction


F-MEASURE

• To consolidate precision and recall into one measure, we can use the harmonic mean of precision and recall:

F = 2 · P · R / (P + R)

Computed for the same example, we get F = 0.54

Page 66: Data Mining: an Introduction


PURITY

• In purity, we assume the majority label of a cluster represents the cluster
• Hence, we compare the label of each member against the majority label to evaluate the algorithm
• Purity is then defined as the fraction of instances that have labels equal to their cluster's majority label:

Purity = (1/n) Σi maxj ni,j

where ni,j is the number of instances in cluster i that have ground-truth label Lj; the max picks out the count of cluster i's majority label Mi

• Purity can easily be tampered with; consider points being put into singleton clusters (of size 1) or into one very large cluster
• In both cases, purity does not make much sense
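A short sketch of this measure (my own illustration, with hypothetical labelings):

    from collections import Counter

    def purity(labels_true, clusters):
        """Fraction of points whose label equals their cluster's majority label."""
        matched = 0
        for c in set(clusters):
            members = [labels_true[i] for i, cl in enumerate(clusters) if cl == c]
            matched += Counter(members).most_common(1)[0][1]  # majority-label count
        return matched / len(labels_true)

    print(purity([1, 1, 1, 2, 2, 2], [1, 1, 2, 2, 2, 2]))  # -> 5/6 ~ 0.83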

Page 67: Data Mining: an Introduction


MUTUAL INFORMATION

The mutual information of two random variables is a quantity that measures the mutual dependence of the two random variables:

I(X; Y) = Σx Σy p(x, y) log( p(x, y) / (p(x) p(y)) )

• p(x, y) is the joint probability distribution function of X and Y
• p(x) and p(y) are the marginal probability distribution functions of X and Y, respectively

Page 68: Data Mining: an Introduction


NORMALIZED MUTUAL INFORMATION

Normalized Mutual Information (NMI) is derived from information theory: the Mutual Information (MI) between the clusterings found and the labels is normalized by an upper bound on MI, a mean of the entropies (H) of the labels and of the clusterings found

Page 69: Data Mining: an Introduction


NORMALIZED MUTUAL INFORMATION

• NMI values close to one indicate high similarity between clusterings found and labels

• Values close to zero indicate high dissimilarity between them

• where h indexes the found clusters and l the labels,
• nh and nl are the number of data points in cluster h and in label class l, respectively,
• nh,l is the number of points in cluster h that have label l,
• n is the size of the dataset

Page 70: Data Mining: an Introduction


NORMALIZED MUTUAL INFORMATION: EXAMPLE

Partition a (labels):          [1,1,1,1,1,1,1, 2,2,2,2,2,2,2]
Partition b (clusters found):  [1,1,1,1,1,2,2, 1,2,2,2,2,2,2]

Cluster sizes:  n(h=1) = 6,  n(h=2) = 8
Label sizes:    n(l=1) = 7,  n(l=2) = 7

n(h,l)   l=1  l=2
h=1       5    1
h=2       2    6

n = 14
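This example can be checked with scikit-learn (my own illustration, not part of the slides); its 'arithmetic' averaging normalizes MI by the mean of the two entropies, as described above:

    from sklearn.metrics import normalized_mutual_info_score

    a = [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2]  # labels
    b = [1, 1, 1, 1, 1, 2, 2, 1, 2, 2, 2, 2, 2, 2]  # clustering found

    print(normalized_mutual_info_score(a, b, average_method='arithmetic'))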

Page 71: Data Mining: an Introduction


EVALUATION WITHOUT GROUND TRUTH

• Use domain experts

• Use quality measures such as SSE
  – SSE: the sum of the squared errors over all clusters

• Use two or more clustering algorithms, compare the results, and pick the algorithm with the better quality measure

Page 72: Data Mining: an Introduction


TEXT MINING

Page 73: Data Mining: an Introduction


TEXT MINING

• In social media, most of the data that is available online is in text format

• A common way to perform data mining on text is to convert the text data into a tabular format and then apply standard data mining methods to it

• The process of converting text data into tabular data is called vectorization

Page 74: Data Mining: an Introduction


TEXT MINING PROCESS

A set of linguistic, statistical, and machine learning techniques that model and structure the information content of textual sources for business intelligence, exploratory data analysis, research, or investigation

Page 75: Data Mining: an Introduction


TEXT PREPROCESSING

Text preprocessing aims to make the input documents more consistent, to facilitate the text representation needed for most text analytics tasks
• Methods:
  – Stop word removal
    • Stop word removal eliminates words using a stop word list, in which the words are considered too general to be meaningful, e.g., the, a, is, at, which
  – Stemming
    • Stemming reduces inflected (or sometimes derived) words to their stem, base, or root form
      – For example, "watch", "watching", and "watched" are all represented as "watch"

Page 76: Data Mining: an Introduction


TEXT REPRESENTATION

• The most common way to model documents is to transform them into sparse numeric vectors and then deal with them with linear algebraic operations

• This representation is called “Bag of Words”

• Methods:
  – Vector space model
  – tf-idf

Page 77: Data Mining: an Introduction


VECTOR SPACE MODEL

• In the vector space model, we start with a set of documents, D

• Each document is a set of words
• The goal is to convert these textual documents to vectors:

di = (w1,i, w2,i, …, wN,i)

where di is document i and wj,i is the weight for word j in document i

The weight can be set to 1 when the word exists in the document and 0 when it does not, or it can be set to the number of times the word is observed in the document

Page 78: Data Mining: an Introduction


VECTOR SPACE MODEL: AN EXAMPLE

• Documents:
  – d1: data mining and social media mining
  – d2: social network analysis
  – d3: data mining

• Reference vector:
  – (social, media, mining, network, analysis, data)

• Vector representation:

       analysis  data  media  mining  network  social
  d1       0       1      1      1       0        1
  d2       1       0      0      0       1        1
  d3       0       1      0      1       0        0

Page 79: Data Mining: an Introduction


TF-IDF (TERM FREQUENCY-INVERSE DOCUMENT FREQUENCY)

tf-idf of term t, document d, and document corpus D is calculated as follows:

tf-idf(t, d, D) = tf (t, d) * idf (t, D)

where tf(t, d) is the frequency of term t in document d, and

idf(t, D) = log( |D| / |{d ∈ D : t ∈ d}| )

in which |D| is the total number of documents in the corpus and |{d ∈ D : t ∈ d}| is the number of documents where the term t appears

Page 80: Data Mining: an Introduction


TF-IDF: AN EXAMPLE

Consider the words "apple" and "the", which appear 10 and 20 times, respectively, in document 1 (d1), which contains 100 words. Let |D| = 20, with the word "apple" appearing only in d1 and the word "the" appearing in all 20 documents.
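Working the numbers through (my own computation, assuming tf is the within-document frequency and the base-10 logarithm used in the table on the next slide):

tf(apple, d1) = 10/100 = 0.1;  idf(apple, D) = log(20/1) ≈ 1.30;  tf-idf ≈ 0.13
tf(the, d1) = 20/100 = 0.2;    idf(the, D) = log(20/20) = 0;      tf-idf = 0

So "the" carries no weight despite being frequent, because it appears in every document.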

Page 81: Data Mining: an Introduction


TF-IDF: AN EXAMPLE

• Documents:
  – d1: data mining and social media mining
  – d2: social network analysis
  – d3: data mining

• tf-idf representation:

                 analysis  data  media  mining  network  social
  df(w)              1       2      1      2       1        2
  log(N/df(w))     0.48    0.18   0.48   0.18    0.48     0.18
  d1, tf             0       1      1      2       0        1
  d2, tf             1       0      0      0       1        1
  d3, tf             0       1      0      1       0        0
  d1, tf-idf       0.00    0.18   0.48   0.35    0.00     0.18
  d2, tf-idf       0.48    0.00   0.00   0.00    0.48     0.18
  d3, tf-idf       0.00    0.18   0.00   0.18    0.00     0.00
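The table can be reproduced with a few lines of Python (my own sketch; it assumes raw counts for tf and a base-10 logarithm):

    import math

    docs = {"d1": "data mining and social media mining".split(),
            "d2": "social network analysis".split(),
            "d3": "data mining".split()}
    vocab = ["analysis", "data", "media", "mining", "network", "social"]

    N = len(docs)
    df = {w: sum(w in words for words in docs.values()) for w in vocab}

    for name, words in docs.items():
        vec = [round(words.count(w) * math.log10(N / df[w]), 2) for w in vocab]
        print(name, vec)  # matches the tf-idf rows above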

Page 82: Data Mining: an Introduction


SENTIMENT ANALYSIS

• Sentiment analysis or opinion mining refers to the application of natural language processing, computational linguistics, and text analytics to identify and extract subjective information in source materials

• It aims to determine the attitude of a speaker or a writer with respect to some topic or the overall contextual polarity of a document.

Page 83: Data Mining: an Introduction


POLARITY ANALYSIS

• The basic task in opinion mining is classifying the polarity of a given document or text
  – The polarity could be positive, negative, or neutral

• Methods:
  – Naïve Bayes
  – Pointwise Mutual Information (PMI)

Page 84: Data Mining: an Introduction


MEASURING POLARITY, NAÏVE BAYES

• Bayes' rule, for class c (the polarity) and document d:

P(c | d) = P(d | c) · P(c) / P(d)

• If we consider the occurrences of the features (words) in the document to be independent given the class:

P(c | d) ∝ P(c) · P(w1 | c) · P(w2 | c) · … · P(wm | c)

Page 85: Data Mining: an Introduction


MEASURING POLARITY, MAXIMUM ENTROPY

• Z(d) is the normalization factor
• λi,c is the feature-weight parameter and shows the importance of each feature
• Fi,c is defined as a feature/class function for feature fi and class c

Page 86: Data Mining: an Introduction


MEASURING POLARITY, POINTWISE MUTUAL INFORMATION

• The PMI of two words measures how much more often they co-occur than they would by chance:

PMI(word1, word2) = log2( P(word1, word2) / (P(word1) · P(word2)) )

• P(word) is estimated from the number of results returned by a search engine in response to a search for word
• P(word1, word2) is estimated from the number of results for a search for word1 and word2 together

Page 87: Data Mining: an Introduction


Mohammad-Ali Abbasi (Ali) is a Ph.D. student at the Data Mining and Machine Learning Lab, Arizona State University. His research interests include data mining, machine learning, social computing, and social media behavior analysis.

http://www.public.asu.edu/~mabbasi2/