Stat 154 Final Project: Text Mining
Taylor Hines, Haley Muhlestein, Lawrence Ye, Ricardo Tanasescu
December 15, 2014
1. Data Description:
On cursory inspection, the training data has a number of notable features. First of all, the data contains 4180 messages. About 14% of the messages are labeled as spam (565 vs. 3615). Their length (about 20-25 words) suggests these messages are SMS or text messages; furthermore, they seem to be mostly European. For example, 45 lines contain the pound symbol for the British currency. A number of lines also contain the German "ü" (u with umlaut). The spam messages have some common trends: they seem commerce related, i.e. they are trying to promote or sell something. As a result they contain a variety of items that we are likely to adapt for power features later on in the project, like prices, phone numbers, and coupon codes. They also seem to have more capital letters, as well as punctuation.
2. Feature Creation:
We sorted the training data using a Python program written by the team (see Appendix I). For the purpose of identifying unique words appearing in the training data, we converted all capital letters to lowercase and removed all punctuation (although some power features used punctuation and were created before the punctuation was removed). Words were defined as any string of characters surrounded by white space on both sides. These refinements reduced the unique words from about 11,000 to 8,212. Parsing out the unique words was an initial challenge, due to the potentially large number of calculations iterating across 4,000 lines and 20-25 words per line. However, we were able to complete this task efficiently by compiling all the words in a list and then simply using the set function.
One initial challenge we encountered was the encoding differences between Macs and PCs. As a result, Python does not parse the text characters the same way, so our script ran differently depending on which machine we used. We were able to find a solution by directly instructing Python to use 'utf-8' encoding.
Generating the frequency matrix posed a larger challenge: essentially, the task is to analyze each line, compare the words to our 8,212 unique words, and count their frequency. To do this iteratively would result in roughly 25 million steps, a prohibitively costly strategy (checking 8,000 words one by one across 4,000 lines is about 25 million comparisons). To solve this problem, our initial thought was to use a Python dictionary structure; however, we were able to do this even more easily by simply creating a data frame assigning the unique words to the columns (essentially like keys in a dictionary), then iterating across the words in the data and comparing them to the column labels. If a word is found in the labels, the entry in that column is increased by 1; if not, we move on to the next word. Our entire parsing function runs in about a minute, as opposed to hours with the iterative approach.
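The column-label lookup described above amounts to a hash-based count. A minimal Python sketch of the same idea (toy messages, not the project data; the real script used a pandas data frame, whereas this sketch uses plain lists and a `Counter`):

```python
from collections import Counter

# Toy stand-ins for the parsed training messages (placeholder data)
messages = [["free", "entry", "win", "win"],
            ["ok", "see", "u", "later"]]

# Unique words define the columns of the frequency matrix
vocab = sorted(set(w for line in messages for w in line))
col = {word: j for j, word in enumerate(vocab)}  # O(1) lookup, like the data frame labels

# Counting each line once avoids scanning all ~8,000 columns per word
freq = [[0] * len(vocab) for _ in messages]
for i, line in enumerate(messages):
    for word, n in Counter(line).items():
        freq[i][col[word]] = n
```

Either way, the key point is replacing a linear scan over the vocabulary with a constant-time lookup per word.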
The Python script outputs a frequency matrix with counts for each unique word (each word defining a column). We considered converting these counts into proportions in the Python script but decided to delay this step for a few different reasons. The first was practical: integers take up much less space in memory than long decimal observations. The proportion matrix was almost half a gigabyte, as opposed to 68 MB for the count matrix. The second reason was the issue of zero-sum rows: if they are present, dividing by the zero row sum results in NAs, so they need to be removed carefully. Furthermore, this step has to follow feature selection, because as features are removed, the likelihood of zero rows increases.
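The order of operations matters here: drop zero-sum rows first, then divide. A minimal Python sketch with placeholder values (the real matrix is 4180 x 8095):

```python
# Toy count matrix standing in for the real one (placeholder values)
counts = [[2, 1],   # a normal message
          [0, 0]]   # a zero-sum row: every word was a removed stop word

# Drop zero-sum rows first; dividing by a zero row sum would produce NAs
counts = [row for row in counts if sum(row) > 0]

# Convert each surviving row of counts to within-message proportions
proportions = [[c / sum(row) for c in row] for row in counts]
```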
Before feature selection the matrix is 4180 x 8095. Using the complete space, only 2 rows have a zero row sum (i.e. contain only stop words).
a. See the Python script in Appendix I for the parsing function.
b. Read the data into R
# Data are rows of texts, columns are stripped individual words
status <- read.csv(file = "dfstatus.csv", header = TRUE)
status = status[,-1]
status = as.character(status)
# Remove columns that are the common words:
remove_list = c('a','able','about','across','after','all','almost','also','am','among',
    'an','and','any','are','as','at','be','because','been','but','by','can','cannot',
    'could','dear','did','do','does','either','else','ever','every','for','from','get',
    'got','had','has','have','he','her','hers','him','his','how','however','i','if',
    'in','into','is','it','its','just','least','let','like','likely','may','me','might',
    'most','must','my','neither','no','nor','not','of','off','often','on','only','or',
    'other','our','own','rather','said','say','says','she','should','since','so','some',
    'than','that','the','their','them','then','there','these','they','this','tis','to',
    'too','twas','us','wants','was','we','were','what','when','where','which','while',
    'who','whom','why','will','with','would','yet','you','your')
c. Derive word feature matrix: As previously mentioned, we found it necessary to perform the unsupervised feature selection first, so that we can then determine which rows have a zero row sum and convert to a proportion matrix. This step will follow question 3.
d. Improve feature matrix: Again, these suggested improvements were already performed by the Python script. For example, we already converted all the characters to lowercase so that case would not differentiate words, and similarly with punctuation. You can see below that at this point, the matrix is 4180 x 8095.
print(dim(start_df))
## [1] 4180 8095
3. Unsupervised Feature Filtering
Here we will accomplish a few tasks. First we want to determine the most common and rare features in the data. It should be noted that there are two different notions of rareness: the first is the raw number of appearances; the second is a relative determination, i.e. what percentage of the messages a feature appears in. We will examine both.
i. Calculate column sums to determine rare and common features.
col_sum <- apply(start_df, 2, sum)
print(c("mean =", round(mean(col_sum),2)))
## [1] "mean =" "5.1"
print(c("less than 3 appearances =", sum(table(col_sum)[1:3])))
## [1] "less than 3 appearances =" "6397"
print(c("rare features =", table(col_sum)[1:3]))
##                        1      2      3
## "rare features ="  "4581" "1238"  "578"
print(c("features appearing more than 100 = ", length(which(col_sum >100))))
## [1] "features appearing more than 100 = "
## [2] "43"
print(c("common features =", table(col_sum)[123:125]))
##                       436   840
## "common features ="   "1"   "1"
##                      1401
##                       "1"
print(c("most common", which(col_sum > 500)))
##                  .1      u
## "most common"   "1" "7316"
hist(col_sum[col_sum > 5], breaks=100, xlim=c(1,150),
     main="Distribution of Word Frequencies (min 5)",
     xlab="Total Appearances")
[Figure: histogram "Distribution of Word Frequencies (min 5)"; x-axis: Total Appearances (0-150); y-axis: Frequency]
Here we observe a few things. Judging by column sums (total appearances), there are many rare words and only a few common ones. The average word occurs only 5 times, and 6,397 words appear no more than 3 times (4,581 of which appear only once), whereas only 43 words appear more than 100 times. Two words occur dramatically more often than the others: the single character "u" appearing 840 times and ".1" occurring 1,401 times. Due to the tremendous skew in the data, this histogram only shows the distribution of words that appear at least 5 times; otherwise all that is visible is a gigantic spike at the left of the distribution.
ii. Calculate the percentage of messages a feature appears in:
pct <- rep(0,ncol(start_df))
for(i in 1:ncol(start_df)){
  pct[i] <- as.numeric(table(start_df[,i])[1])
}
The other notion of frequency is relative: the percentage of messages containing a feature. Interestingly, by this measure there really aren't any frequent features. The most prevalent features again are the characters "u" and ".1", yet they only appear in 14% and 18% of the training messages, respectively. Only 20 words appear in more than 3% of messages. Based on this, we decided to remove only ".1" as the most common feature. Additionally, features that only appear once or twice are not likely to be useful for prediction, so we will remove these rare words as well. The code for how we did this is in Appendix III.
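This relative notion is the document frequency of each word. A minimal Python sketch (toy matrix with placeholder values, not the project data):

```python
# Toy frequency matrix: rows are messages, columns are word counts (placeholder values)
freq = [[2, 0, 1],
        [0, 0, 1],
        [1, 0, 1]]

n_messages = len(freq)
# For each column: fraction of messages in which that word appears at least once
doc_freq = [sum(1 for row in freq if row[j] > 0) / n_messages
            for j in range(len(freq[0]))]
```

A raw count and a document frequency can disagree: a word repeated many times in one message is frequent by the first measure but rare by the second.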
Our final word matrix is 4135 x 2275. This means we have reduced the word features by about 75%.
4. Power Features:
We designed 9 power features, mostly surrounding special characters, numbers, and capitalization. They are as follows:
• Number of periods (dummy variable): whether an observation has 2 or more uninterrupted dots (..) in the sentence. Idea: as we found out, people who send spam usually try to be professional, so they would not use casual things like ". . . ..??" in their emails.
• Longest string (numeric variable): the length of the longest uppercase word in an observation. If there is no uppercase word in an observation, this variable will be 0. Spam messages often contain a lot of emphasis, notably on something they wish to sell, and as such they often list the key words in uppercase characters. For ham, whether in a professional context or daily conversation, lots of uppercase characters can be perceived as rude or bullying and so tend not to be used as often.
• Capital letters in a row (numeric variable): the average length of uppercase words in an observation. Here we just want to cover what Power_longcap (longest string) fails to cover. For example, an observation may have a lot of uninterrupted uppercase words, none of which is long enough to be distinguished in Power_longcap. In this situation, Power_avecap will help capture the missing information.
• Special character (dummy variable): whether an observation has the special character string "<#>"; we record it since such messages are likely to be ham in general. We first identified this seemingly random set of special characters through manual inspection. It appears mostly in ham messages. After cursory research, we found that this string comes from HTML code: "<" means less than and ">" means greater than. We are not completely sure why these appear; however, we feel they may be predictive.
• Websites (dummy variable): whether an observation contains a website. Many spam messages contain a weblink. They appear in a variety of formats; however, for this feature we looked for "www.".
• Money (dummy variable): whether an observation has patterns that match monetary amounts (like $1.00, £40, etc.). Most spam messages are trying to sell something, and as such they often contain prices.
• Phone numbers (dummy variable): whether an observation contains a phone number. Similar to currency symbols, spam messages often contain phone numbers (patterns like 324-540-231, 2345-234-1234, or 23452).
• Number of numbers (numeric variable): how many digits an observation has. Spam messages also contain a variety of other numeric information: codes, dates, prices, numbers, etc. This variable accounts for this.
• Number of caps (numeric variable): how many uppercase letters an observation has. Idea: spam tends to use uppercase letters more often than ham.
To parse these features from the original data, we wrote another Python program, similar to the original. This program can be found in Appendix I (it was run at the same time the word matrix was created).
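Condensed Python versions of a few of these features are sketched below (simplified illustrations on a made-up message; the project's exact patterns and loops are in Appendix I):

```python
import re

# One toy message (hypothetical example, not from the training data)
msg = "WIN a FREE prize!! Call 324-540-231 or visit www.example.com"

has_website  = int("www." in msg.lower())                  # Websites dummy
n_caps       = sum(ch.isupper() for ch in msg)             # Number of caps
n_numbers    = sum(ch.isdigit() for ch in msg)             # Number of numbers
cap_runs     = re.findall(r"[A-Z][A-Z]+", msg)             # runs of 2+ capitals
longest_caps = max((len(r) for r in cap_runs), default=0)  # Longest string
```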
5. Create combined feature matrix:
# Read the output matrix from power python program
power_df <- read.csv(file = 'dfpower.csv', header = TRUE)
power_df = power_df[,-1]
##
## Call:
## svm(formula = as.factor(label) ~ ., data = data, kernel = "linear",
##     cost = 8, scale = FALSE)
##
##
## Parameters:
##    SVM-Type:  C-classification
##  SVM-Kernel:  linear
##        cost:  8
##       gamma:  0.0004396
##
## Number of Support Vectors:  1009
##
##  ( 684 325 )
##
##
## Number of Classes:  2
##
## Levels:
##  -1 1
We tried using the built-in tune function to determine both the ideal cost parameter and the cross-validated error; however, it ran far too slowly, so we wrote our own function to achieve this task:
i. Cross-Validation
# This function outputs a vector of errors with associated cost labels, as well as a plot
hand.cv <- function(data, cost, k=5, kernal="linear", title="CV Errors by Cost"){
#cost <- c(1:10,15)
#errors <- hand.cv(data, cost=cost, k=10, title="10-Fold CV by Cost")
Below are the output plots from the cross-validation. From the larger plot, we restricted our focus to the range of values 5-10 for the cost parameter. After multiple trials, the results were consistent, showing very small differences in errors in this range. Based on the bottom plot we chose 8 for the cost parameter.
There is a cross-validation error rate of 2.88%. The error is about evenly split between the hams and the spams. The ppv was .9091 and the npv was .9807, meaning the ability of our model to predict true negative values is a little better than its ability to predict true positives. The dimension of the feature matrix is 4,140 by 2,277.
To determine the predictive ability of the power features, we used two methods. We first used Random Forest on only the power features. We then fit an SVM model using the power features.
A. Random Forest
# Use Power Matrix loaded in part 5 and append status column
rf.data <- cbind(status, power_df)
names(rf.data)[1] = "status"
Random Forest has two parameters: the number of trees and the number of variables to consider at each split. As overfitting is not a risk, we do not need to cross-validate the number of trees; however, we will use CV to determine the optimal number of variables:
library(randomForest)
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
#mtry <- c(1:9)
#errors <- rf.cv(rf.data, mtry=mtry, k=10, title="10 Fold CV by Number of Variables")
We can see from the plot below that the best value for the "mtry" parameter (the number of variables considered at each split) is 3. This conforms to the rule of thumb of selecting the square root of p; in this case p = 9.
We see from this output that the performance of the Random Forest on the power features is very similar to the SVM on the word matrix. The overall error is 1.39%, lower than the SVM model. The ham error is quite a bit smaller than the ham error for the SVM model, at .25% compared to 1.23% previously. The spam error is also slightly lower at 8.84%. While the spam error and the ham error are quite a bit lower for random forests, ppv and npv are almost the same, although the ppv is better for the random forest model at 0.9823. Again, we see a nearly ideal ROC curve. To see how this was generated, see Appendix IV.
B. SVM
Here we fit an SVM model to just the power features. There are a number of considerations. Because SVM operates on a metric (a notion of distance), it will likely be important to scale the data; otherwise we will be giving greater weight to variables that are nominally larger. Just like with the word matrix, we will also need to use cross-validation to select an optimal value for the cost parameter.
The cost that yields the lowest error is not very consistent, but it is generally around .002 for both the scaled and unscaled models. The error is around .02. Because we have dummy variables in the data, we can also manually scale only the continuous variables:
# capital letters in a row, number of numbers, longest string, number of caps,
# and number of periods are continuous, while the other variables are dummy variables
scaledcost = c(.01, .05, .1, .15, .2)
data_power_cont_scaled[c(1,2,4,5,6)] = scale(data_power_cont_scaled[c(1,2,4,5,6)])
error_cont_scaled = hand.cv_power(data_power_cont_scaled, cost = scaledcost, k = 5,
When only the continuous features are scaled, the lowest error rate is still .02009. This error rate is obtained with a higher cost of 0.15. Because it is generally advisable to scale one's features in some way when using SVM, we will scale the continuous features and use a cost of 0.15.
While the overall performance of this model is similar to that of the word matrix, it is notably inferior in predicting spam, with a 14% error rate for this category. However, this model predicts ham VERY well, with only a .17% error rate.
9. Combined Feature Matrix
Although we already created a combined feature matrix in part 5, we will add some of the insights from part 8; notably, we will scale the power features and append them to the word matrix. Given the very small observation values in the proportion matrix, this step is likely even more important. As with the other SVM models, we will need to determine an appropriate value for the cost parameter through cross-validation:
# We need to remove the N/A rows so that the power features have the same rows
# as the proportion matrix
sub_power_df <- data_power[-nas,]
combined_df <- cbind(prop_data, sub_power_df)
v. Comparison: The error for the combined matrix is 1.31%, smaller than the SVM power error of 1.99% and the SVM word matrix error of 2.75%. The ham error is .45%, still very good, and the spam error is 6.54%, almost two times smaller than the spam error for the power features alone. The ham error is actually slightly higher for the combined matrix than for the power feature matrix, at .45% versus .14%, but the spam error is much smaller, making the overall error smaller and the combined performance better. The ppv and npv show the same phenomenon: spam detection is more accurate, but ham detection is less accurate than with the power feature matrix.
Compared to the word matrix SVM model, the combined model is quite a bit better. The combined model has lower spam and ham errors, and also has better ppv and npv than the word feature SVM model.
The Random Forest model using only power features, at a 1.44% overall error rate, was almost as accurate as the SVM model using the combined data frame, so a random forest model using power features and word features may be even more accurate than the combined SVM model. If we were to continue researching spam detection, we would probably look more into using a random forest model.
Based on these observations, we see that the combined SVM model has the best overall performance. We are trading a big improvement in predicting spam for a small cost in ham prediction performance.
# For some reason the predict methods run fine in the program, but generate errors
# when trying to knit the PDF. Therefore, we are writing the output to csv and then
# reading the same object back in to generate the PDF.
The validation set error is 1.43%, which is slightly higher than the cross-validated error for the best model of 1.14%. The spam error is almost the same for the validation set and the cross-validated best model, but the ham error is higher for the validation set, at .83%, than the cross-validated error. The ppv is similarly slightly lower for the validation set, while the npv is almost the same.
Our models have shown there is a tradeoff between accurately predicting ham and spam, so we must consider which type of errors to prioritize. False negatives are spam messages that are categorized as ham; these are inconvenient. False positives are ham messages that are predicted to be spam; these are messages we really wish to receive but don't. Therefore we want to minimize false positives and maximize true negatives. This makes npv more important than ppv, because we want the highest possible ratio of true negatives to negative predictions. The second most important factor is minimizing the false positive rate. This is the bottom axis on our ROC curve. As usual, we want the ROC curve to hug the top left corner, showing a minimal FP rate.
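The quantities this prioritization rests on are simple ratios of confusion-matrix counts; a small sketch with hypothetical counts (not our model's actual confusion matrix), treating spam as the positive class:

```python
# Illustrative confusion counts (hypothetical): positive = spam, negative = ham
tp, fp, tn, fn = 50, 5, 520, 10

ppv = tp / (tp + fp)   # of messages flagged as spam, the fraction that truly are
npv = tn / (tn + fn)   # of messages kept as ham, the fraction that truly are
fpr = fp / (fp + tn)   # ham wrongly flagged as spam: the ROC curve's x-axis
```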
Conclusion:
As just discussed, the final model gives us the best combined performance, with an overall error rate of 1.4%. The ham error of .83% is slightly higher than in some of the other models; however, this comes with the tradeoff of halving the spam error to only 5.5%. This is a very good improvement for less than a .5% increase in ham error. As discussed, npv is more important than ppv, and the rate of 99.2% npv is very strong.
Through the process of developing our model, we learned how choosing better models through cross-validation requires finesse and judgment calls, as some of our cross-validation attempts were not very stable and the best cost or number of variables to use was not obvious. We also learned that coordinating programming projects is difficult and some sort of version control system would be useful. We also ran into lots of computational barriers, and although these barriers might not be a problem if we had access to more powerful computers, it was helpful and interesting to figure out computationally cheaper methods to create our models.
Haley Muhlestein, Ricardo Tanasescu, Lawrence Ye, Taylor Hines
12/15/2014 Stat 154 Final Project Appendix
I. Training Set Python Code

TestEmails = open('test_examples_unlb.txt','r', encoding = 'utf-8').readlines()
emails = open('train_msgs.txt','r', encoding = 'utf-8').readlines()

#---------------------------------------------------------------------------------------------------
#----------------------------------- TRAINING SET CODE CHUNK ---------------------------------------

# Break emails by status and content
words = list()
for line in emails:
    for word in line.split("\t"):
        words.append(word)

# Create lists
status = list()
content = list()
for i in range(0,len(words),1):
    if i%2==0:
        status.append(words[i])   # status is the response vector
    else:
        content.append(words[i])  # This is the content of each email

# creates a dummy variable that is 1 if the text contains five or more numbers in a row
import re
phone_numbers = list()
phone_pattern = re.compile('|'.join([
    r'.*(\d{4})*\D*(\d{3})\D*(\d{4}).*',
    r'.*(\d{3})*\D*(\d{3})\D*(\d{4}).*',
    r'.*\d{5}.*',
]))

# creates a dummy variable that is 1 if the text contains money
money = list()
money_pattern = re.compile('|'.join([
    r'.*\$?(\d*\.\d{1,2}).*',
    r'.*\£?(\d*\.\d{1,2}).*',
    r'.*\£(\d+).*',
    r'.*\£(\d+\.?).*',
    r'.*\$(\d+).*',
    r'.*\$(\d+\.?).*',
]))
number_of_numbers = list()
number_of_uppers = list()
for line in content:
    if re.match(phone_pattern, line):
        phone_numbers.append(1)
    else:
        phone_numbers.append(0)
    if re.match(money_pattern, line):
        money.append(1)
    else:
        money.append(0)
    num_count = 0
    upper_count = 0
    for letter in line:
        if letter.isdigit():
            num_count = num_count + 1
        elif letter.isupper():
            upper_count = upper_count + 1
    number_of_uppers.append(upper_count)
    number_of_numbers.append(num_count)

# Remove \n at end of each email that is stored in content:
import string

# empty list to put lines without punctuation except "." in
content_sans_punc = list()
# take punctuation except (".") out of each line
for line in content:
    content_sans_punc.append(''.join(word.strip('!"#$%&\'()*+,-/:;<=>?@[\\]^_`{|}~') for word in line))

content_new = [line.strip("\r\n") for line in content]

### Power: count how many dots an email has
count_dot = [line.count("..") for line in content_new]

### Then take "." out of each line
content_after = []
for line in content_new:
    content_after.append(''.join(word.strip('.') for word in line))
### Power: length of the longest string in each sentence
Caps = [re.findall("[A-Z][A-Z]+", line) for line in content_after]
power_loncap = list()
for ele in Caps:
    if len(ele)==0:
        power_loncap.append(len(ele))
    else:
        power_loncap.append(max([len(mem) for mem in ele]))

### Power: the average length of uninterrupted sequences of capital letters
import statistics
power_avecap = list()
for ele in Caps:
    if len(ele)==0:
        power_avecap.append(len(ele))
    else:
        power_avecap.append(statistics.mean([len(mem) for mem in ele]))

### Power: special character containing <#>
special_power = [line.count("<#>") for line in content_after]

content_splitted = [line.rsplit(" ") for line in content_after]

# Change each word into lower case
lower_case_content = list()
for word in content_after:
    lower_case_content.append(word.lower())

### Power: website, www.
website_power = [line.count("www") for line in lower_case_content]

# Split each email by words. Now every email is broken into words:
content_splitted = [line.rsplit(" ") for line in lower_case_content]

# Put all words in one single list:
totalWords = list()
for lines in content_splitted:
    for words in lines:
        totalWords.append(words)

# Get unique words
uniqueWords = set(totalWords)
uniqueWords = list(uniqueWords)

# Create a data frame:
import pandas as pd
dfstatus = pd.DataFrame(status)
dfstatus.to_csv('dfstatus.csv')
df = pd.DataFrame(data=0, index=range(0,len(content_splitted),1), columns=sorted(uniqueWords))

# create dictionary for power features
powerfeatures = {"number of numbers": number_of_numbers,
                 "number of caps": number_of_uppers,
                 "money": money,
                 "phone numbers": phone_numbers,
                 "number of periods": count_dot,
                 "longest string": power_loncap,
                 "capital letters in a row": power_avecap,
                 "special character": special_power,
                 "websites": website_power}
dfpower = pd.DataFrame(powerfeatures)
dfpower.to_csv('dfpower.csv')

print(content[12])
print(content[164])
print(content[191])

############# Calculating the frequency ##########
for row in range(0,len(df),1):
    contentWords = content_splitted[row]
    for i in range(0,len(contentWords),1):
        if contentWords[i] in df.columns:
            df[contentWords[i]][row:row+1] = df[contentWords[i]][row:row+1]+1
        else:
            df[contentWords[i]][row:row+1] = 1
##################################################
df.to_csv('lawrence.csv')
# Output the unique words (to be used in test.py)
f = open('unique.txt', 'w', encoding = 'utf-8')
for t in sorted(uniqueWords):
    line = ''.join(str(x) for x in t)
    f.write(line + '\n')
f.close()

#-------------------------------- END OF TRAINING SET CODE CHUNK------------------------------------
#---------------------------------------------------------------------------------------------------

II. Test Set Python Code

TestEmails = open('train_msgs.txt','r', encoding = 'utf-8').readlines()
unique = open('unique.txt','r', encoding = 'utf-8').readlines()
uniqueWords = [line.strip("\r\n") for line in unique]

#---------------------------------------------------------------------------------------------------
#------------------------------------ TEST SET CODE CHUNK ------------------------------------------

# Remove \n at end of each email that is stored in content:
import string

# empty list to put lines without punctuation in
content_sans_puncTEST = list()
# take punctuation out of each line
for line in TestEmails:
    line = line.replace('\t',' ')
    content_sans_puncTEST.append(''.join(word.strip(string.punctuation) for word in line))
#print(content_sans_puncTEST[260:270])

lower_case_contentTEST = list()
for word in content_sans_puncTEST:
    lower_case_contentTEST.append(word.lower())

content_newTEST = [line.strip("\r\n") for line in lower_case_contentTEST]
# Split each email by words. Now every email is broken into words:
content_splittedTEST = [line.rsplit(" ") for line in content_newTEST]
print(content_splittedTEST[260:270])

# Create a data frame:
import pandas as pd
maxLenTEST = 0
for line in content_splittedTEST:
    if len(line) > maxLenTEST:
        maxLenTEST = len(line)
cols = list()
for i in range(0, maxLenTEST):
    cols.append(str(i))
dfTEST = pd.DataFrame(data=0, index=range(0,len(content_splittedTEST),1), columns=sorted(uniqueWords))

############# Calculating the frequency ##########
for row in range(0,len(dfTEST),1):
    contentWordsTEST = content_splittedTEST[row]
    for i in range(0,len(contentWordsTEST),1):
        if contentWordsTEST[i] in dfTEST.columns:
            dfTEST[contentWordsTEST[i]][row:row+1] = dfTEST[contentWordsTEST[i]][row:row+1]+1
##################################################
print(dfTEST[".1"])
dfTEST.to_csv('test.csv')

#-------------------------------- END OF TEST SET CODE CHUNK-----------------------------------------
#----------------------------------------------------------------------------------------------------

III. Remove rare features

row_sum <- apply(sub_data, 1, sum)
table(row_sum)[1]
nas <- which(row_sum == 0)
sub_data <- sub_data[-nas,]