Top Banner
1 Concepts and Techniques — Chapter 2 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign Simon Fraser University ©2013 Han, Kamber, and Pei. All rights reserved.
54

1 Data Mining: Concepts and Techniques — Chapter 2 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign Simon Fraser.

Dec 21, 2015

Download

Documents

Johnathan Evans
Welcome message from author
This document is posted to help you gain knowledge. Please leave a comment to let me know what you think about it! Share it to your friends and learn new things together.
Transcript
Page 1: 1 Data Mining: Concepts and Techniques — Chapter 2 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign Simon Fraser.

1

Data Mining: Concepts and

Techniques

— Chapter 2 —

Jiawei Han, Micheline Kamber, and Jian Pei

University of Illinois at Urbana-Champaign

Simon Fraser University

©2013 Han, Kamber, and Pei. All rights reserved.

Page 2: 1 Data Mining: Concepts and Techniques — Chapter 2 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign Simon Fraser.

2

Chapter 2: Getting to Know Your Data

Data Objects and Attribute Types

Basic Statistical Descriptions of Data

Data Visualization

Measuring Data Similarity and Dissimilarity

Summary

Page 3: 1 Data Mining: Concepts and Techniques — Chapter 2 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign Simon Fraser.

3

Types of Data Sets

Record Relational records Data matrix, e.g., numerical matrix,

crosstabs Document data: text documents: term-

frequency vector Transaction data

Graph and network World Wide Web Social or information networks Molecular Structures

Ordered Video data: sequence of images Temporal data: time-series Sequential Data: transaction

sequences Genetic sequence data

Spatial, image and multimedia: Spatial data: maps Image data: Video data:

Page 4: 1 Data Mining: Concepts and Techniques — Chapter 2 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign Simon Fraser.

4

Data Objects

Data sets are made up of data objects. A data object represents an entity. Examples:

sales database: customers, store items, sales medical database: patients, treatments university database: students, professors, courses

Also called samples , examples, instances, data points, objects, tuples.

Data objects are described by attributes. Database rows -> data objects; columns ->attributes.

Page 5: 1 Data Mining: Concepts and Techniques — Chapter 2 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign Simon Fraser.

5

Attributes

Attribute (or dimensions, features, variables): a data field, representing a characteristic or feature of a data object. E.g., customer _ID, name, address

Types: Nominal Binary Ordinal Numeric

Interval-scaled Ratio-scaled

Page 6: 1 Data Mining: Concepts and Techniques — Chapter 2 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign Simon Fraser.

6

Attribute Types

Nominal: categories, states, or “names of things”, order not meaningful Hair_color = {auburn, black, blond, brown, grey, red, white}

Binary Nominal attribute with only 2 states (0 and 1) Symmetric binary: both outcomes equally important

e.g., gender Asymmetric binary: outcomes not equally important.

e.g., medical test (positive vs. negative) Convention: assign 1 to most important outcome (e.g., HIV

positive) Ordinal

Values have a meaningful order (ranking) but magnitude between successive values is not known.

Size = {small, medium, large}, grades, army rankings

Page 7: 1 Data Mining: Concepts and Techniques — Chapter 2 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign Simon Fraser.

7

Numeric Attribute Types

Quantity (integer or real-valued) Interval

Measured on a scale of equal-sized units Values have order

E.g., temperature in C˚or F˚, calendar dates No true zero-point

Ratio Inherent zero-point We can speak of values as being an order of magnitude

larger than the unit of measurement (10 K˚ is twice as high as 5 K˚). e.g., temperature in Kelvin, length, counts,

monetary quantities

Page 8: 1 Data Mining: Concepts and Techniques — Chapter 2 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign Simon Fraser.

8

Discrete vs. Continuous Attributes

Discrete Attribute Has only a finite or countably infinite set of values

E.g., zip codes, profession, or the set of words in a collection of documents

Continuous Attribute Has real numbers as attribute values

E.g., temperature, height, or weight Continuous attributes are typically represented as floating-

point variables

Page 9: 1 Data Mining: Concepts and Techniques — Chapter 2 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign Simon Fraser.

9

Chapter 2: Getting to Know Your Data

Data Objects and Attribute Types

Basic Statistical Descriptions of Data

Data Visualization

Measuring Data Similarity and Dissimilarity

Summary

Page 10: 1 Data Mining: Concepts and Techniques — Chapter 2 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign Simon Fraser.

10

Basic Statistical Descriptions of Data

Measures of central tendency Mean, median, mode

Dispersion of data range, quartiles and interquartile range, five-number

summary and boxplots, variance and standard deviation

Page 11: 1 Data Mining: Concepts and Techniques — Chapter 2 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign Simon Fraser.

11

Measuring the Central Tendency

Mean:

Note: n is sample size and N is population size.

Weighted arithmetic mean:

Trimmed mean: chopping extreme values

Median:

Middle value if odd number of values, or average of the

middle two values otherwise

Estimated by interpolation (for grouped data):

Mode

Value that occurs most frequently in the data

Unimodal, bimodal, trimodal

Empirical formula:

n

iixn

x1

1

n

ii

n

iii

w

xwx

1

1

)(3 medianmeanmodemean

Median interval

Page 12: 1 Data Mining: Concepts and Techniques — Chapter 2 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign Simon Fraser.

April 19, 2023 Data Mining: Concepts and Techniques 12

Symmetric vs. Skewed Data

Median, mean and mode of symmetric, positively and negatively skewed data

positively skewed negatively skewed

symmetric

Page 13: 1 Data Mining: Concepts and Techniques — Chapter 2 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign Simon Fraser.

13

Measuring the Dispersion of Data

Quartiles: Q1 (25th percentile), Q3 (75th percentile)

Inter-quartile range: IQR = Q3 – Q1

Five number summary: min, Q1, median, Q3, max

Boxplot: A simple graphical device to display the overall shape of a

distribution, including the outliers. Ends of the box are the quartiles; median is

marked within the box; add whiskers to mark min and max, and plot outliers

individually

Outlier: Values less than Q1-1.5*IQR and greater than Q3+1.5*IQR are

outliers

Page 14: 1 Data Mining: Concepts and Techniques — Chapter 2 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign Simon Fraser.

14

Boxplot Analysis

Draw a box plot for the following dataset.

10.2, 14.1, 14.4. 14.4, 14.4, 14.5, 14.5, 14.6, 14.7, 14.7, 14.7, 14.9, 15.1, 15.9, 16.4

Here,

Q2(median) = 14.6

Q1 = 14.4

Q3 = 14.9

IQR = Q3 – Q1 = 14.9-14.4 = 0.5

Outliers will be any points below Q1 – 1.5×IQR = 14.4 – 0.75 = 13.65 or above Q3 + 1.5×IQR = 14.9 + 0.75 = 15.65.

So, the outliers are at 10.2, 15.9, and 16.4.

The ends of the box are at 14.4 and 14.9.

The median 14.6 is marked within the box.

The whiskers extend to 14.1 and 15.1.

The outliers 10.2, 15.9, and 16.4 are plotted individually.

Page 15: 1 Data Mining: Concepts and Techniques — Chapter 2 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign Simon Fraser.

15

Measuring the Dispersion of Data

Variance and standard deviation

Variance:

Standard deviation σ is the square root of variance σ2

Page 16: 1 Data Mining: Concepts and Techniques — Chapter 2 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign Simon Fraser.

16

Example of Standard Deviation

Find out the Mean, the Variance, and the Standard Deviation of the following

dataset.

9, 2, 5, 4, 12, 7, 8, 11, 9, 3, 7, 4, 12, 5, 4, 10, 9, 6, 9, 4

Here, the mean = 7

Using the formula for variance,

σ2 = 8.9

Therefore, the standard deviation is σ = 2.98

Page 17: 1 Data Mining: Concepts and Techniques — Chapter 2 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign Simon Fraser.

17

Graphic Displays of Basic Statistical Descriptions

Boxplot: graphic display of five-number summary

Histogram: x-axis are values, y-axis represents frequencies

Quantile plot: each value xi is paired with fi indicating that

approximately fi * 100% of data are below the value xi

Quantile-quantile (q-q) plot: graphs the quantiles of one

univariant distribution against the corresponding quantiles of

another

Scatter plot: each pair of values is a pair of coordinates and

plotted as points in the plane

Page 18: 1 Data Mining: Concepts and Techniques — Chapter 2 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign Simon Fraser.

18

Histogram Example

Page 19: 1 Data Mining: Concepts and Techniques — Chapter 2 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign Simon Fraser.

19

Quantile Plot

Used to check whether your data is normal

To make a quantile plot:

If the data distribution is close to normal, the plotted points

will lie close to a slopped straight line

Page 20: 1 Data Mining: Concepts and Techniques — Chapter 2 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign Simon Fraser.

20

Quantile Plot Example

Create a quantile plot for the following dataset.

3, 5, 1, 4, 10

First sort the data: 1, 3, 4, 5, 10

Calculate the sample quantiles: 0.1, 0.3, 0.5, 0.7, 0.9

Plot the graph

Page 21: 1 Data Mining: Concepts and Techniques — Chapter 2 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign Simon Fraser.

21

Scatter Plot

Page 22: 1 Data Mining: Concepts and Techniques — Chapter 2 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign Simon Fraser.

22

Positively and Negatively Correlated Data

Page 23: 1 Data Mining: Concepts and Techniques — Chapter 2 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign Simon Fraser.

23

Uncorrelated Data

Page 24: 1 Data Mining: Concepts and Techniques — Chapter 2 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign Simon Fraser.

24

Chapter 2: Getting to Know Your Data

Data Objects and Attribute Types

Basic Statistical Descriptions of Data

Data Visualization

Measuring Data Similarity and Dissimilarity

Summary

Page 25: 1 Data Mining: Concepts and Techniques — Chapter 2 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign Simon Fraser.

25

Data Visualization

Why data visualization? Gain insight into the data Search for patterns, trends, structure, irregularities, relationships among

data Categorization of visualization methods:

Pixel-oriented visualization techniques Geometric projection visualization techniques Icon-based visualization techniques Hierarchical visualization techniques Visualizing complex data and relations

Page 26: 1 Data Mining: Concepts and Techniques — Chapter 2 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign Simon Fraser.

26

Pixel-Oriented Visualization Techniques

For a data set of m dimensions, create m windows on the screen, one for each dimension

The m dimension values of a record are mapped to m pixels at the corresponding positions in the windows

The colors of the pixels reflect the corresponding values

(a) Income (b) Credit Limit

(c) transaction volume

(d) age

Page 27: 1 Data Mining: Concepts and Techniques — Chapter 2 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign Simon Fraser.

27

Geometric Projection Visualization Techniques

Visualization of geometric transformations and projections of the data

Helps users find interesting projections of multidimensional data Methods

Scatterplot and scatterplot matrices

Page 28: 1 Data Mining: Concepts and Techniques — Chapter 2 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign Simon Fraser.

28

Icon-Based Visualization Techniques

Visualization of the data values as features of icons Typical visualization methods

Chernoff Faces Stick Figures

Page 29: 1 Data Mining: Concepts and Techniques — Chapter 2 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign Simon Fraser.

29

Chernoff Faces

A way to display multidimensional data of up to 18 dimensions as a cartoon human face

Components of the faces such as eyes, ears, mouth, and nose represent values of the dimensions by their shape, size, placement and orientation

Page 30: 1 Data Mining: Concepts and Techniques — Chapter 2 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign Simon Fraser.

Data Mining: Concepts and Techniques 30

A census data figure showing age, income, gender, education, etc.

used

by

perm

issi

on o

f G

. G

rinst

ein,

Uni

vers

ity o

f M

assa

chus

ette

s at

Low

ell

Stick Figure

A 5-piece stick figure (1 body and 4 limbs)

Two attributes mapped to the two axes, remaining attributes mapped to angle or length of limbs

Page 31: 1 Data Mining: Concepts and Techniques — Chapter 2 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign Simon Fraser.

31

Hierarchical Visualization Techniques

Visualization of the data using a hierarchical partitioning into subspaces instead of visualizing all dimensions at the same time

Methods Dimensional Stacking Worlds-within-Worlds Tree-Map Cone Trees InfoCube

Page 32: 1 Data Mining: Concepts and Techniques — Chapter 2 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign Simon Fraser.

32

Used by permission of M. Ward, Worcester Polytechnic Institute

Visualization of oil mining data with longitude and latitude mapped to the outer x-, y-axes and ore grade and depth mapped to the inner x-, y-axes

Dimensional Stacking

Page 33: 1 Data Mining: Concepts and Techniques — Chapter 2 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign Simon Fraser.

Visualizing Complex Data and Relations

Visualizing non-numerical data: text and social networks Tag cloud: visualizing user-generated tags

The importance of tag is represented by font size/color

Newsmap: Google News Stories in 2005

Page 34: 1 Data Mining: Concepts and Techniques — Chapter 2 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign Simon Fraser.

34

Chapter 2: Getting to Know Your Data

Data Objects and Attribute Types

Basic Statistical Descriptions of Data

Data Visualization

Measuring Data Similarity and Dissimilarity

Summary

Page 35: 1 Data Mining: Concepts and Techniques — Chapter 2 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign Simon Fraser.

35

Similarity and Dissimilarity

Similarity Numerical measure of how alike two data objects are Value is higher when objects are more alike Often falls in the range [0,1]

Dissimilarity Numerical measure of how different two data objects are Lower when objects are more alike

Two data structures commonly used to measure the above Data matrix Dissimilarity matrix

Page 36: 1 Data Mining: Concepts and Techniques — Chapter 2 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign Simon Fraser.

36

Data Matrix and Dissimilarity Matrix

Data matrix Stores n data points with p

dimensions Two modes – stores both

objects and attributes

Dissimilarity matrix Stores n data points, but

registers only the dissimilarity between objects i and j

A triangular matrix Single mode as it only

stores dissimilarity values

npx...nfx...n1x

...............ipx...ifx...i1x

...............1px...1fx...11x

0...)2,()1,(

:::

)2,3()

...ndnd

0dd(3,1

0d(2,1)

0

Page 37: 1 Data Mining: Concepts and Techniques — Chapter 2 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign Simon Fraser.

37

Proximity Measure for Nominal Attributes

Can take 2 or more states, e.g., map_color may have attributes red, yellow, blue and green

Dissimilarity can be computed based on the ratio of mismatches m: # of matches, p: total # of attributes

pmpjid ),(

Page 38: 1 Data Mining: Concepts and Techniques — Chapter 2 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign Simon Fraser.

38

Dissimilarity between Nominal Attributes

Here, we have one nominal attribute, test-1, so p=1.

The dissimilarity matrix is as shown below:

Page 39: 1 Data Mining: Concepts and Techniques — Chapter 2 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign Simon Fraser.

39

Proximity Measure for Binary Attributes

A contingency table for binary data

Dissimilarity for symmetric binary

attributes:

Dissimilarity for asymmetric binary

attributes :

Jaccard coefficient (similarity

measure for asymmetric binary

attributes):

Object i

Object j

Page 40: 1 Data Mining: Concepts and Techniques — Chapter 2 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign Simon Fraser.

40

Dissimilarity between Binary Variables

Example

Gender is a symmetric attribute, others are asymmetric binary Let the values Y and P be 1, and the value N 0 Suppose the distance between patients is computed based only on the

asymmetric attributes

Name Gender Fever Cough Test-1 Test-2 Test-3 Test-4

Jack M Y N P N N NMary F Y N P N P NJim M Y P N N N N

75.0211

21),(

67.0111

11),(

33.0102

10),(

maryjimd

jimjackd

maryjackd

Page 41: 1 Data Mining: Concepts and Techniques — Chapter 2 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign Simon Fraser.

41

Distance on Numeric Data: Minkowski Distance

Minkowski distance: A popular distance measure

where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p-dimensional data objects, and h is the order (the distance so defined is also called L-h norm)

Properties d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (Positive definiteness) d(i, j) = d(j, i) (Symmetry) d(i, j) d(i, k) + d(k, j) (Triangle Inequality)

A distance that satisfies these properties is a metric

Page 42: 1 Data Mining: Concepts and Techniques — Chapter 2 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign Simon Fraser.

42

Special Cases of Minkowski Distance

h = 1: Manhattan distance

h = 2: Euclidean distance

h . supremum distance This is the maximum difference between any component (attribute)

of the vectors

)||...|||(|),( 22

22

2

11 pp jx

ix

jx

ix

jx

ixjid

||...||||),(2211 pp jxixjxixjxixjid

Page 43: 1 Data Mining: Concepts and Techniques — Chapter 2 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign Simon Fraser.

43

Example: Minkowski DistanceDissimilarity Matrices

point attribute 1 attribute 2x1 1 2x2 3 5x3 2 0x4 4 5

L x1 x2 x3 x4x1 0x2 5 0x3 3 6 0x4 6 1 7 0

L2 x1 x2 x3 x4x1 0x2 3.61 0x3 2.24 5.1 0x4 4.24 1 5.39 0

L x1 x2 x3 x4

x1 0x2 3 0x3 2 5 0x4 3 1 5 0

Manhattan (L1)

Euclidean (L2)

Supremum

0 2 4

2

4

x1

x2

x3

x4

Page 44: 1 Data Mining: Concepts and Techniques — Chapter 2 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign Simon Fraser.

44

Proximity Measures for Ordinal Variables

Let M represent the number of possible values that an ordinal attribute can have replace each xif by its corresponding rank map the range of each variable onto [0, 1] by replacing i-th

object in the f-th variable by

compute the dissimilarity using any of the distance measures for numeric attributes, e.g., Euclidean distance

11

f

ifif M

rz

},...,1{fif

Mr

Page 45: 1 Data Mining: Concepts and Techniques — Chapter 2 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign Simon Fraser.

45

Proximity Measures for Ordinal Variables

Consider the data in the adjacent table: Here, the attribute Test has three states: fair, good and

excellent, so Mf=3 For step 1, the four attribute values are assigned the

ranks 3,1,2 and 3 respectively. Step 2 normalizes the ranking by mapping rank 1 to

0.0, rank 2 to 0.5 and rank 3 to 1.0 For step 3, using Euclidean distance, a dissimilarity

matrix is obtained as shown Therefore, students 1 and 2 are most dissimilar, as are

students 2 and 4

Student Test

1 Excellent

2 Fair

3 Good

4 Excellent

Page 46: 1 Data Mining: Concepts and Techniques — Chapter 2 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign Simon Fraser.

46

Attributes of Mixed Type

A database may contain all attribute types : Nominal, symmetric binary, asymmetric binary, numeric, ordinal

One may use a weighted formula to combine their effects

For nominal and ordinal attributes, use the technique mentioned earlier to compute dissimilarity matrix

For numeric attributes use the following formula to calculate dissimilarity

)(1

)()(1),(

fij

pf

fij

fij

pf

djid

Page 47: 1 Data Mining: Concepts and Techniques — Chapter 2 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign Simon Fraser.

47

Attributes of Mixed Type - Example

Consider the data in the table:

The dissimilarity matrices for the nominal and ordinal data are shown to the right computed using the methods discussed before

To compute dissimilarity matrix for the numeric attribute, maxhxh=64, minhxh=22. Using the formula from previous slide, the dissimilarity matrix is obtained as shown below:

Page 48: 1 Data Mining: Concepts and Techniques — Chapter 2 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign Simon Fraser.

48

Attributes of Mixed Type - Example

The three dissimilarity matrices can now be used to compute the overall dissimilarity between two objects using the equation

The resulting dissimilarity matrix is

)(1

)()(1),(

fij

pf

fij

fij

pf

djid

Page 49: 1 Data Mining: Concepts and Techniques — Chapter 2 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign Simon Fraser.

49

Cosine Similarity Cosine similarity is a measure of similarity that can be used to compare

documents. A document can be represented by thousands of attributes, each recording the

frequency of a particular word (such as keywords) or phrase in the document called term-frequency vector.

Cosine measure: If d1 and d2 are two term-frequency vectors, then cos(x, y) = (x y) /||x|| ||y|| ,

where indicates dot product, ||x|| is the length of vector x defined as . Similarly, ||y|| is the length of vector y.

A cosine value of 0 means that the two vectors are at 90 degrees to each other and there is no match.

The closer the cosine value to 1, the greater the match between the vectors

Page 50: 1 Data Mining: Concepts and Techniques — Chapter 2 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign Simon Fraser.

50

Example: Cosine Similarity

Ex: Find the similarity between documents 1 and 2 from previous slide.d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

Using the formula, cos(d1, d2) = (d1 d2) /||d1|| ||d2|| ,

d1d2 = 5*3+0*0+3*2+0*0+2*1+0*1+0*1+2*1+0*0+0*1 = 25

||d1||= (5*5+0*0+3*3+0*0+2*2+0*0+0*0+2*2+0*0+0*0)0.5=(42)0.5 = 6.481

||d2||= (3*3+0*0+2*2+0*0+1*1+1*1+0*0+1*1+0*0+1*1)0.5=(17)0.5 = 4.12

cos(d1, d2 ) = 0.94

The cosine similarity shows that the two documents are quite similar.

Page 51: 1 Data Mining: Concepts and Techniques — Chapter 2 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign Simon Fraser.

51

Chapter 2: Getting to Know Your Data

Data Objects and Attribute Types

Basic Statistical Descriptions of Data

Data Visualization

Measuring Data Similarity and Dissimilarity

Summary

Page 52: 1 Data Mining: Concepts and Techniques — Chapter 2 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign Simon Fraser.

Summary

Data attribute types: nominal, binary, ordinal, interval-scaled, ratio-scaled

Many types of data sets, e.g., numerical, text, graph, image, etc. Gain insight into the data by:

Basic statistical data description: central tendency, dispersion, graphical displays

Data visualization: Measure data similarity

Above steps are the beginning of data preprocessing

Page 53: 1 Data Mining: Concepts and Techniques — Chapter 2 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign Simon Fraser.

Exercise 1 of 2

Find the dissimilarity value between Alyssa and Chris and between Alyssa and Diane.

Student Gender

HairColor

Test1 Test2

Alyssa F Black Excellent

80

Chris M Black Good 85

Jessica F Brown Fair 55

Diane F Blonde Excellent

80

Page 54: 1 Data Mining: Concepts and Techniques — Chapter 2 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign Simon Fraser.

Exercise 2 of 2

Given two objects represented by the tuples (22, 1, 42, 10) and (20, 0, 36, 8), compute the: Euclidean distance Manhattan distance Supremum distance Cosine similarity