
Homework 1

36-350: Data Mining

SOLUTIONS

1. (a) What is the bag-of-words representation of the sentence “to be or not to be”?
Answer: A vector with one component for each word in our dictionary, all of them zero except for the following:

be  not  or  to
 2    1   1   2

This is the format given by

table(c("to","be","or","not","to","be"))

(b) Suppose we search for the above sentence via the keyword “be”. What is the bag-of-words representation for this query, and what is the Euclidean distance from the sentence?
Answer: A vector whose only non-zero component is that for “be”, where the count is 1. The Euclidean distance is

$$\sqrt{(2-1)^2 + (1-0)^2 + (1-0)^2 + (0-2)^2} = \sqrt{7}$$
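As a quick numeric check (my own addition; the named vectors below just write out the two bags of words over the dictionary be, not, or, to):

sentence <- c(be=2, not=1, or=1, to=2) # bag of words for the sentence
query <- c(be=1, not=0, or=0, to=0) # bag of words for the query "be"
sqrt(sum((sentence - query)^2)) # sqrt(7), about 2.65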

(c) Describe how weighting words by inverse-document-frequency (IDF) should help when making a Web query for “The Principles of Data Mining.”
Answer: It keeps us from wasting time on words like “the” and “of”, and emphasizes the less-common, more-informative words “principles”, “data” and “mining”; something titled “Data Mining Principles” is a good match.
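To make this concrete, here is a minimal sketch of IDF weighting on a toy three-document corpus, using the same formula, w = log(number of documents / number of documents containing the word), as the get.idf.weights function in question 6 below:

docs <- list(c("the","principles","of","data","mining"),
             c("the","art","of","war"),
             c("the","data"))
lexicon <- unique(unlist(docs))
# For each word, count the documents containing it at least once
doc.freq <- sapply(lexicon, function(w) sum(sapply(docs, function(d) w %in% d)))
round(log(length(docs)/doc.freq), 2)
# "the" gets weight log(3/3) = 0; "principles" and "mining" get log(3/1) = 1.1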

(d) Describe a simple text search that could not be carried out effectively using a bag-of-words representation (no matter what distance measure is used). “Simple” means no actual understanding of English is required.
Answer: There are many; but a search for the exact phrase “to be or not to be” is impossible.

2. (a) What is the Euclidean distance between each of the vectors (1, 0, 0), (1, 4, 5), and (10, 0, 0)?
Answer: The distance between the first and second vectors is $\sqrt{41} \approx 6.4$; between the first and third is 9; and between the second and third is $\sqrt{122} \approx 11$.

(b) Divide each vector by its sum. How do the relative distances change?
Answer: The first and third vectors become the closest pair.

(c) Divide each vector by its Euclidean length. How do the relative distances change?
Answer: The first and third vectors again become the closest pair.
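These claims are easy to verify in base R (a quick check of my own; dist() computes all pairwise Euclidean distances):

x <- rbind(c(1,0,0), c(1,4,5), c(10,0,0))
dist(x) # plain distances: 6.40, 9, 11.05
dist(x / rowSums(x)) # sum-normalized: rows 1 and 3 now coincide
dist(x / sqrt(rowSums(x^2))) # length-normalized: again rows 1 and 3 coincide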

(d) Suppose we’re using the bag-of-words representation for similarity searching with a Euclidean metric. Describe how the previous parts of the question illustrate a potential problem if we do not normalize for document length.
Answer: Documents with the same distribution of words but different sizes will appear to be very far apart.

(e) Consider the conventional searching scheme where the user picks a set of keywords and the system returns all documents containing those keywords. Describe how the previous parts of the question illustrate a potential problem with this type of search.
Answer: The second vector contains the first keyword, but only as a small part of its content, and returning it as a match will lower the precision.

3. (a) Create document vectors for each of the stories in the music and art folders. Give the commands you used.
Answer: I’ll use the read.directory function, so I’ll comment it first (in somewhat more detail than you would need to).

# Read in all the files in a directory as vectors of character strings
# Inputs: directory name (dirname), flag for verbose output (verbose)
# Calls: read.doc() in 01.R
# Output: List of vectors, each containing the words of one file in order
read.directory <- function(dirname,verbose=FALSE) {
  stories = list() # Make a list to store the story-vectors in
  filenames = dir(dirname,full.names=TRUE) # List the files in the directory
  # Using the full.names option means we can pass the results to other
  # commands without worrying about the current working directory, etc.
  for (i in 1:length(filenames)) { # For each file,
    if(verbose) {
      print(filenames[i])
      # Print the filename before we try anything with it (if we're verbose)
    }
    stories[[i]] = read.doc(filenames[i]) # Read in file, stick it in the list
    # read.doc only works on XML files, should really check that first!
  }
  return(stories) # Return the list
}

Now I can read in the stories:

> library(XML)
> source("~/teaching/350/hw/01/01.R")
> music <- read.directory("~/teaching/350/hw/01/nyt_corpus/music")
> art <- read.directory("~/teaching/350/hw/01/nyt_corpus/art")
> length(music)
[1] 45
> length(art)
[1] 57
> is.list(music)
[1] TRUE
> is.list(art)
[1] TRUE
> length(music[[1]])
[1] 413
> is.vector(music[[1]])
[1] TRUE

(b) What command would you use to extract the 37th word of story number 1595645 in art? (That word is “experiencing”.) Give a command to count the number of times the word “the” appears in that story. (There are at least two ways to do this. The correct answer is 103.)
Answer: The list art contains a vector which has story number 1595645, but which? I can either count on a directory listing, or I can have R count for me:

> which(dir("~/teaching/350/hw/01/nyt_corpus/art")=="1595645.xml")
[1] 48

(Note that here I don’t use the full.names option to dir(), since I don’t want the full path to the files.) So the story I want is art[[48]]. The 37th word is

> art[[48]][37]
[1] "experiencing"

One way to see how often “the” appears is to do this:

> sum(art[[48]] == "the")
[1] 103

The expression inside sum() creates a logical vector, which is TRUE at the words which are equal to “the” and FALSE elsewhere. The sum of a logical vector is the number of TRUE components, i.e., the number of instances of “the”. Alternately, use table:

> table(art[[48]])["the"]
the
103

Applied to a vector, table creates an array (R pedantry: really, an object of class table, which inherits from the class array) where the columns are the distinct values from the vector, and entries are how often those values occur. The values themselves become the column names, so we can check how often “the” occurs without having to know where it comes in the vocabulary list (place 633, as it happens). R will print the column name when you access the column, but it will do arithmetic on it as usual.

> table(art[[48]])["the"] + 17
the
120

(c) Give the commands you would use to construct a bag-of-words data frame from the document vectors for the art and music stories. Hint: consider lapply.
Answer: I’ll take the hint.

> art.bow <- lapply(art,table)
> music.bow <- lapply(music,table)
> is.list(music.bow)
[1] TRUE
> length(music.bow)
[1] 45

lapply is one of the most useful functions in R: it takes two arguments, a data structure (vector, list, array, ...) and a function, and returns a list which it gets by applying the function to each element of the data structure. (It has a variant, sapply, which will try to automatically simplify the result down to a vector, but that’s not what’s wanted here.) The table command we’ve already discussed. The last two commands just check that things are working as they ought to.
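For instance, on a toy list (an illustration of the lapply/sapply difference, not part of the assignment):

lapply(list(a=1:3, b=1:10), length) # a list: $a is 3, $b is 10
sapply(list(a=1:3, b=1:10), length) # simplified to a named vector: a=3, b=10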

> nyt.frame <- make.BoW.frame(c(art.bow,music.bow))
> dim(nyt.frame)
[1]  102 4431

The number of rows should equal the total number of stories, and 45 + 57 = 102.

(d) Create distance matrices from this data frame for (a) the straight Euclidean distance, (b) the distance with word-count normalization and (c) the distance with vector-length scaling, and then for all three again with inverse-document-frequency weighting. Give the commands you use.
Answer: The distances command can be used for all of these, with the right scalings and normalization. There are already functions in 01.R to do these.

dist.plain = distances(nyt.frame)
dist.wordcount = distances(div.by.sum(nyt.frame))
dist.euclen = distances(div.by.euc.length(nyt.frame))
nyt.frame.idf = idf.weight(nyt.frame)
dist.idf.plain = distances(nyt.frame.idf)
dist.idf.wordcount = distances(div.by.sum(nyt.frame.idf))
dist.idf.euclen = distances(div.by.euc.length(nyt.frame.idf))

Quick checks that these are acting sensibly:

> dim(dist.idf.euclen)
[1] 102 102
> dim(dist.plain)
[1] 102 102
> round(dist.plain[1:5,1:5],2)
       [,1]   [,2]   [,3]   [,4]   [,5]
[1,]   0.00 163.82 116.99 129.27 132.92
[2,] 163.82   0.00  87.26  87.42  78.60
[3,] 116.99  87.26   0.00  68.26  78.89
[4,] 129.27  87.42  68.26   0.00  80.46
[5,] 132.92  78.60  78.89  80.46   0.00

So I get 102 × 102 matrices which are symmetric and have zeroes down the diagonal; looks good.

(e) For each of the six different distance measures, what is the average distance between stories in the same category and between stories in different categories? (Include the R command you use to compute this; don’t do it by hand!)
Answer: There are multiple ways to do this. The simplest is to realize that, in this case, the first 57 stories are all art, and the last 45 are all music. So if d is a distance matrix, the within-category entries are d[1:57,1:57] and d[58:102,58:102], and the between-category entries are d[1:57,58:102]. So

mean(c(d[1:57,1:57],d[58:102,58:102]))

would give the average distance between stories in the same category, and similarly mean(d[58:102,1:57]) for the between-category average. Of course, that trick relies on the categories forming two solid blocks. Here’s another way which isn’t quite so fragile. First, we create a vector which gives the class labels.

class.labels = c(rep("art",57),rep("music",45))

Now create a logical matrix which says whether or not two documents belong to different classes.

are.different = outer(class.labels,class.labels,"!=")

The outer() function takes three arguments: two data structures and another function. (Here the function is !=, which I put in quotes so that R realizes I’m naming a function, and not asking it to evaluate an expression.) It returns a matrix which it gets from applying the function to each pair of components from its first two arguments. Here those first two arguments are vectors of length 102, so what it gives back is a 102 × 102 matrix, where are.different[i,j] shows whether class.labels[i] != class.labels[j]. In other words, it’s TRUE if documents i and j belong to different classes. And a logical array picks out elements from another array:

mean(d[are.different])

is the average distance between classes. To average within classes,

mean(d[!are.different])

Not only does this work if the classes are intermingled (we just have to get the class.labels vector right), we can also use it to avoid including the distance from a document to itself in the within-class average:

are.same = !are.different
diag(are.same) = FALSE
mean(d[are.same])
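Putting the pieces together, all six averages can be computed in one pass over a list of the distance matrices (a sketch of my own; the helper summarize.distances and the list names are not from the original solution, and are.different is the matrix defined above):

summarize.distances <- function(d) {
  within.self = !are.different # within-class pairs, including each story with itself
  within.noself = within.self
  diag(within.noself) = FALSE # drop the zero self-distances
  c(within.with.self = mean(d[within.self]),
    within.without.self = mean(d[within.noself]),
    between = mean(d[are.different]))
}
dist.list = list(plain=dist.plain, sum.normed=dist.wordcount,
                 length.normed=dist.euclen, idf.plain=dist.idf.plain,
                 idf.sum.normed=dist.idf.wordcount,
                 idf.length.normed=dist.idf.euclen)
round(sapply(dist.list, summarize.distances), 2) # rows = averages, columns = measures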

Mean distance                       without IDF                       with IDF
                           plain  sum-normed  length-normed  plain  sum-normed  length-normed
within-class, with self      76      0.10         0.73         88      0.13         1.3
within-class, without self   78      0.10         0.74         90      0.13         1.4
between-class                78      0.11         0.78         88      0.13         1.4

This makes for a fairer comparison of the in-class and cross-class averages, but there’s no penalty on this homework for not doing this.

(f) Create multidimensional scaling plots for the different distances, and describe what you see. Include the code you used, the plots, and explanations for the code.
Answer: I’ll use the basic command cmdscale(), which returns a matrix whose columns give the new coordinates. In fact, I wrote a function which calls cmdscale(), and then plots the points with different colors and/or shapes for the different classes. This exploits the useful graphics trick that you can control the color and shape of individual points by making the col and pch arguments to the plotting command into vectors.

# Multidimensional-scaling plot for labeled points, with color and/or shape
# indicating classes
# Inputs: Distance matrix (distances)
#         vector of class labels (labels)
#         vector of class color names (class.colors)
#           defaults to red and blue
#         vector of class point shapes (class.shapes) --- see ?pch
#           defaults to circles and squares
#         optional plotting parameters
# Calls: cmdscale()
# Output: invisibly returns matrix of coordinates for the MDS image points
my.mds <- function(distances, labels, class.colors=c("red","blue"),
                   class.shapes=c(21,22), ...) {
  # Should really check that:
  # distances is a matrix of non-negative numbers, size = (length of labels)^2
  # class.colors and class.shapes have reasonable lengths
  label.values = unique(labels) # What are the different labels?
  names(class.colors) = label.values # Allows access by class value as opposed
  names(class.shapes) = label.values # to just position
  num.labels = length(label.values)
  # Re-cycle the colors if need be
  # Useful if we want to have only one color but multiple shapes
  if (length(class.colors) < num.labels) {
    class.colors = rep(class.colors,length.out=num.labels)
  }
  # Ditto shapes
  if (length(class.shapes) < num.labels) {
    class.shapes = rep(class.shapes,length.out=num.labels)
  }
  cols.vec = class.colors[labels] # Vector giving class colors in order
  shapes.vec = class.shapes[labels] # Ditto shapes
  mds.coords = cmdscale(distances) # Get the MDS coordinates
  plot(mds.coords,col=cols.vec,pch=shapes.vec,...) # Actual plotting
  return(invisible(mds.coords))
}
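The six panels of Figure 1 can then be produced along these lines (a sketch of my own; the main= titles match the panel labels, but the exact plotting parameters behind the figure may have differed):

par(mfrow=c(2,3)) # arrange the six panels in a 2x3 grid
my.mds(dist.plain, class.labels, main="No normalization, no weighting")
my.mds(dist.wordcount, class.labels, main="Word-count normalization, no weighting")
my.mds(dist.euclen, class.labels, main="Euclidean length normalization, no weighting")
my.mds(dist.idf.plain, class.labels, main="No normalization, IDF weighting")
my.mds(dist.idf.wordcount, class.labels, main="Word-count normalization, IDF weighting")
my.mds(dist.idf.euclen, class.labels, main="Euclidean length normalization, IDF weighting")
par(mfrow=c(1,1)) # back to one panel per figure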

[Figure 1 appears here: six MDS scatterplots, one per combination of normalization (none, word-count, Euclidean length) and weighting (without and with IDF); see the caption below.]

Figure 1: MDS plots. Top row, without IDF. Bottom row, with IDF. Left column, un-normalized vectors. Middle column, normalized by word-count. Right column, normalized by Euclidean length. Red circles are art, blue squares music.

Only with both IDF weights and Euclidean-length normalization do we get anything like a reasonable separation of the two categories, though admittedly in some cases the view is obscured by a few outliers, which seem to compress the other points into a single blob.

4. Comment the sq.Euc.dist function: that is, go over it and explain, in English, what each line does, and how the lines work together to calculate the function.

Answer: The key observation is that $\|\vec{x} - \vec{y}\|^2 = \|\vec{x}\|^2 - 2\,\vec{x} \cdot \vec{y} + \|\vec{y}\|^2$ for any vectors $\vec{x}$ and $\vec{y}$. (Exercise: show this!) The function uses this to efficiently compute the squared distance, avoiding explicit iteration loops (which are slow in R) and avoiding computing the same $\|\vec{x}\|^2$ and $\|\vec{y}\|^2$ terms over and over again.

sq.Euc.dist <- function(x,y=x) { # Set default value for second argument
  x <- as.matrix(x) # Convert argument to a matrix to make sure it can be
                    # manipulated as such
  y <- as.matrix(y) # Ditto
  nr=nrow(x) # Count the number of rows of the first argument
             # This is the number of rows in the output matrix
  nc=nrow(y) # Count number of rows in the second argument
             # This is the number of COLUMNS in the output
  x2 <- rowSums(x^2) # Find sum of squares for each x vector
  xsq = matrix(x2,nrow=nr,ncol=nc) # Make a matrix where each COLUMN is a copy
                                   # of x2 --- see help(matrix) for details on
                                   # "recycling" of arguments --- matrix is
                                   # sized for output
  y2 <- rowSums(y^2) # Find sum of squares for each y vector
  # Make a matrix where each ROW is a copy of y2, again sized for output
  ysq = matrix(y2,nrow=nr,ncol=nc,byrow=TRUE)
  # Make a matrix whose [i,j] entry is the dot product of x[i,] and y[j,]
  xy = x %*% t(y) # You should check that this has nr rows and nc columns!
  d = xsq + ysq - 2*xy # Add partial result matrices
  # Remember that for vectors x and y
  #   |x-y|^2 = |x|^2 - 2x*y + |y|^2
  # writing * for the dot product. So d has as its [i,j] entry the squared
  # norm of x[i,] plus the squared norm of y[j,] minus twice their dot product
  # Make the diagonal EXACTLY zero if the two arguments are the same
  if(identical(x,y)) diag(d) = 0
  # Need to use the identical() function to see if two whole objects are the
  # same --- using "x==y" here would give a MATRIX of Boolean values
  # Distances are >= 0, negative values are presumably from numerical errors in
  # calculating numbers close to zero; fix them
  d[which(d < 0)] = 0
  return(d) # Return the tidied-up matrix of squared distances
}

This is more detailed than it strictly needs to be.
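As an extra sanity check (my own addition, not required by the problem), sq.Euc.dist should agree with the square of the pairwise distances from base R's dist():

x <- matrix(rnorm(12), nrow=4) # four random vectors in R^3
max(abs(sq.Euc.dist(x) - as.matrix(dist(x))^2)) # should be zero, up to rounding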

5. (a) Explain what the “cosine distance” has to do with cosines.
Answer: From vector algebra, we know that for any vectors $\vec{x}$ and $\vec{y}$,

$$\vec{x} \cdot \vec{y} \equiv \sum_i x_i y_i = \|\vec{x}\| \, \|\vec{y}\| \cos\theta$$

where $\theta$ is the angle between the vectors. The cosine distance is the dot product divided by the product of the norms, so it’s that cosine.

(b) Calculate, by hand, the cosine distances between the three vectors in question 2.
Answer: The cosine distance between the first and the third vector is clearly 1, and between either of them and the second vector it is $\approx 0.15$.
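We can confirm the 0.15 directly (a quick check of my own in base R):

x1 <- c(1,0,0); x2 <- c(1,4,5)
sum(x1*x2) / (sqrt(sum(x1^2)) * sqrt(sum(x2^2))) # 1/sqrt(42), about 0.154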

(c) Write a function to calculate the matrix of cosine distances (really, similarities) between all the vectors in a data frame. Hint: you may want to use the distances function. Check that your function agrees with your answer to the previous part.
Answer: The distances function can take an optional argument which is a function to apply to each pair of vectors from the two matrices. So first let’s define functions which compute the dot product, Euclidean norm, and the cosine distance for a pair of vectors.

dot.product = function(x,y) {
  return(sum(x*y)) # With vectors, * does component-by-component multiplication
}

euc.norm = function(x) {
  return(sqrt(dot.product(x,x)))
}

cosine.dist.pair = function(x,y) {
  return(dot.product(x,y)/(euc.norm(x)*euc.norm(y)))
}

Now our real function is easy:

cosine.dist.1 = function(x) {
  return(distances(x,fun=cosine.dist.pair))
}

This is not the slickest way to do it, because we calculate the norm of each vector $2n-1$ times. (Exercise: Why is it $2n-1$?) This is faster, at least when n is large:

cosine.dist.2 = function(x) {
  y = div.by.euc.length(x)
  return(distances(y,fun=dot.product))
}

Exercise: Why does this work? How does this avoid recomputing the norms of the vectors?
Let’s check this:

> question2vectors <- matrix(c(1,0,0,1,4,5,10,0,0),nrow=3,byrow=TRUE)
> question2vectors
     [,1] [,2] [,3]
[1,]    1    0    0
[2,]    1    4    5
[3,]   10    0    0
> round(cosine.dist.1(question2vectors),2)
     [,1] [,2] [,3]
[1,] 1.00 0.15 1.00
[2,] 0.15 1.00 0.15
[3,] 1.00 0.15 1.00
> round(cosine.dist.2(question2vectors),2)
     [,1] [,2] [,3]
[1,] 1.00 0.15 1.00
[2,] 0.15 1.00 0.15
[3,] 1.00 0.15 1.00

so both versions work.
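To see the speed difference, one can time the two versions on a larger random matrix (a rough benchmark sketch of my own; distances() and div.by.euc.length() come from the course’s 01.R, and exact timings will vary by machine):

x <- matrix(runif(500*50), nrow=500) # 500 vectors of length 50
system.time(cosine.dist.1(x)) # recomputes each norm many times
system.time(cosine.dist.2(x)) # normalizes each row once, then just dot products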

6. Write a function to find the document which best matches a given query string. The function should take two arguments, (1) the query, as a single character string, and (2) the bag-of-words matrix, and return the row number corresponding to the best-matching document. You can pick the distance measurement, but you should include inverse-document-frequency weighting.

Answer: The easiest way to do this is to use the nearest.points function in the provided code. The one tricky bit is making sure that the query string is turned into a bag-of-words vector using the same dictionary as the documents. The R manipulations here are based on the ones which standardize.ragged, in 01.R, uses to ensure that all the stories have a common lexicon.

# Return the index of the document which best matches a given query string
# First we convert the query string into a document vector, then into
# a bag of words, and then we remove terms not in the lexicon of the data
# frame storing the bag-of-words vectors for the documents
# Note that these last removed words add the same amount to the (squared)
# distance between the query and any document's bag of words, so those words
# do not change which document is closest
# Inputs: query, as a character vector (query)
#         data frame of bag-of-word vectors (BoW.frame)
# Presumes: a data frame has been created for bag-of-word vectors, column
#           names being words
# Calls: strip.text(), get.idf.weights(), nearest.points()
# Output: Index of the document, name of the corresponding row in data frame
query.by.similarity <- function(query,BoW.frame) {
  # Prepare the query BoW vector for comparison with the target documents
  query.vec = strip.text(query) # Turn the query into a vector of words
  query.BoW = table(query.vec) # Turn it into a bag of words
  lexicon = colnames(BoW.frame) # PRESUMES: BoW.frame has been made so words are
                                # the column names
  query.vocab = names(query.BoW) # What words appear in the query?
  # Restrict the query to words in the lexicon
  query.lex = query.BoW[intersect(query.vocab,lexicon)]
  # Add zero entries for lexicon words not found in the query
  query.lex[setdiff(lexicon,query.vocab)] = 0
  # query.lex now has all the right entries, but the words are out of order!
  query.lex = query.lex[lexicon] # Why does that work?
  # Finally, turn it into a one-row matrix
  q = t(as.matrix(query.lex))
  # Do IDF scaling on the targets AND the query
  # The function idf.weight calculates the weights and re-scales the
  # data-frame but we need the weights to re-scale the query as well so we
  # write our own function get.idf.weights() --- see below
  # Get the weights
  idf = get.idf.weights(BoW.frame)
  # Scale the columns in BoW.frame
  BoW = scale.cols(BoW.frame,idf)
  # Scale the columns in q
  q = q * idf
  # Scale the rows by Euclidean length
  BoW = div.by.euc.length(BoW)
  # Ditto the query
  q = q/sqrt(sum(q^2))
  # Find the closest match between q and BoW
  best.index = nearest.points(q,BoW)$which
  # Use row names from the data for a little extra comprehensibility
  best.name = rownames(BoW)[best.index]
  return(list(best.index=best.index,best.name=best.name))
}

# Return the vector of IDF weights from a data frame
# The core of idf.weight(), but leaving out the actual re-scaling
# Input: data frame
# Output: weight vector
get.idf.weights <- function(x) {
  doc.freq <- colSums(x>0) # Number of documents in which each word appears
  doc.freq[doc.freq == 0] <- 1 # Guard against dividing by zero for absent words
  w <- log(nrow(x)/doc.freq) # IDF weight: log(number of docs/document frequency)
  return(w)
}

Just having the row-number of the matching story is not quite so useful as having a more mnemonic label too, so I’ve set the code up to also return row names. Let’s make them:

rownames(nyt.frame) = c(outer("art",1:57,paste,sep="."),
                        outer("music",1:45,paste,sep="."))

We saw outer() already: the first instance here is going to create a vector reading art.1, art.2, ... art.57, which will then be concatenated with music.1, music.2, ... music.45, and this vector, of length 102, will become the row-names of the data frame.

Testing:

> query.by.similarity(art[[5]],nyt.frame)
$best.index
[1] 5
$best.name
[1] "art.5"
> query.by.similarity(art[[40]],nyt.frame)
$best.index
[1] 40
$best.name
[1] "art.40"
> query.by.similarity(music[[12]],nyt.frame)
$best.index
[1] 69
$best.name
[1] "music.12"
> query.by.similarity(music[[31]],nyt.frame)
$best.index
[1] 88
$best.name
[1] "music.31"

So it passes the basic check, and gives reasonable results for queries like“abstract expressionism” and “rhythm and blues”.
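For instance (a usage sketch of my own; which stories come back will depend on the corpus, so no output is reproduced here):

query.by.similarity("abstract expressionism", nyt.frame)$best.name
query.by.similarity("rhythm and blues", nyt.frame)$best.name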
