Stat 154 Final Project: Text Mining
Taylor Hines, Haley Muhlestein, Lawrence Ye, Ricardo Tanasescu
December 15, 2014
1. Data Description:
On cursory inspection, the training data has a number of notable features. First of all, the data contains 4180 messages. About 14% of the messages are labeled as spam (565 vs. 3615). Their length (about 20-25 words) suggests these messages are SMS or text messages; furthermore, they seem to be mostly European. For example, 45 lines contain the pound symbol for the British currency. A number of lines also contain the German "ü" (u with umlaut). The spam messages have some common trends: they seem commerce related, i.e. they are trying to promote or sell something. As a result they contain a variety of items that we are likely to adapt for power features later on in the project, like prices, phone numbers, and coupon codes. They also seem to have more capital letters, as well as punctuation.
2. Feature Creation:
We sorted the training data using a Python program written by the team (see Appendix I). For the purpose of identifying unique words appearing in the training data, we converted all capital letters to lowercase and removed all punctuation (although some power features used punctuation and were created before the punctuation was removed). Words were defined as any string of characters surrounded by white space on both sides. These refinements reduced the unique words from about 11,000 to 8,212. Parsing out the unique words was an initial challenge, due to the potentially large number of calculations iterating across 4,000 lines and 20-25 words per line. However, we were able to complete this task efficiently by compiling all the words in a list and then simply using the set function.
One initial challenge we encountered was the encoding differences between Macs and PCs. As a result, Python does not parse the text characters the same way, so our script ran differently depending on which machine we used. We were able to find a solution by directly instructing Python to use 'utf-8' encoding.
Generating the frequency matrix posed a larger challenge: essentially, the task is to analyze each line, compare the words to our 8,212 unique words, and count their frequency. To do this iteratively would result in roughly 25 million steps, a prohibitively costly strategy (checking 8,000 words one by one across 4,000 lines is about 25 million comparisons). To solve this problem, our initial thought was to use a Python dictionary structure; however, we were able to do this even more easily by simply creating a data frame assigning the unique words to the columns (essentially like keys in a dictionary), then iterating across the words in the data and comparing them to the column labels. If a word is found in the labels, the entry in that column is increased by 1; if not, we move on to the next word. Our entire parsing function runs in about a minute, as opposed to hours with the iterative approach.
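The column-label lookup described above amounts to a hash-based count. A minimal Python sketch of the same idea (toy messages, not the project data; the real script used a pandas data frame, whereas this sketch uses plain lists and a `Counter`):

```python
from collections import Counter

# Toy stand-ins for the parsed training messages (placeholder data)
messages = [["free", "entry", "win", "win"],
            ["ok", "see", "u", "later"]]

# Unique words define the columns of the frequency matrix
vocab = sorted(set(w for line in messages for w in line))
col = {word: j for j, word in enumerate(vocab)}  # O(1) lookup, like the data frame labels

# Counting each line once avoids scanning all ~8,000 columns per word
freq = [[0] * len(vocab) for _ in messages]
for i, line in enumerate(messages):
    for word, n in Counter(line).items():
        freq[i][col[word]] = n
```

Either way, the key point is replacing a linear scan over the vocabulary with a constant-time lookup per word.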
The Python script outputs a frequency matrix with counts for each unique word (each word defining a column). We considered converting these counts into proportions in the Python script but decided to delay this step for a few different reasons. The first was practical: integers take up much less space in memory than long decimal observations. The proportion matrix was almost half a gigabyte, as opposed to 68 MB for the count matrix. The second reason was the issue of zero-sum rows: if they are present, dividing by the zero row sum results in NAs, so they need to be removed carefully. Furthermore, this step has to follow feature selection, because as features are removed, the likelihood of zero rows increases.
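The order of operations matters here: drop zero-sum rows first, then divide. A minimal Python sketch with placeholder values (the real matrix is 4180 x 8095):

```python
# Toy count matrix standing in for the real one (placeholder values)
counts = [[2, 1],   # a normal message
          [0, 0]]   # a zero-sum row: every word was a removed stop word

# Drop zero-sum rows first; dividing by a zero row sum would produce NAs
counts = [row for row in counts if sum(row) > 0]

# Convert each surviving row of counts to within-message proportions
proportions = [[c / sum(row) for c in row] for row in counts]
```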
Before feature selection the matrix is 4180 x 8095. Using the complete space, only 2 rows have a zero row sum (i.e. contain only stop words).
a. See the Python script in Appendix I for the parsing function.
b. Read the data into R
# Data are rows of texts, columns are stripped individual words
status <- read.csv(file = "dfstatus.csv", header = TRUE)
status = status[,-1]
status = as.character(status)
# Remove columns that are the common words:
remove_list = c('a','able','about','across','after','all','almost','also','am','among',
    'an','and','any','are','as','at','be','because','been','but','by','can','cannot',
    'could','dear','did','do','does','either','else','ever','every','for','from','get',
    'got','had','has','have','he','her','hers','him','his','how','however','i','if',
    'in','into','is','it','its','just','least','let','like','likely','may','me','might',
    'most','must','my','neither','no','nor','not','of','off','often','on','only','or',
    'other','our','own','rather','said','say','says','she','should','since','so','some',
    'than','that','the','their','them','then','there','these','they','this','tis','to',
    'too','twas','us','wants','was','we','were','what','when','where','which','while',
    'who','whom','why','will','with','would','yet','you','your')
c. Derive word feature matrix: As previously mentioned, we found it necessary to perform the unsupervised feature selection first, so that we can then determine which rows have a zero row sum and convert to a proportion matrix. This step will follow question 3.
d. Improve feature matrix: Again, these suggested improvements were already performed by the Python script. For example, we already converted all the characters to lowercase so that case would not differentiate words, and similarly with punctuation. You can see below that at this point, the matrix is 4180 x 8095.
print(dim(start_df))
## [1] 4180 8095
3. Unsupervised Feature Filtering
Here we will accomplish a few tasks. First we want to determine the most common and rare features in the data. It should be noted that there are two different notions of rareness: the first is the raw number of appearances; the second is a relative determination, i.e. what percentage of the messages a feature appears in. We will examine both.
i. Calculate column sums to determine rare and common features.
col_sum <- apply(start_df, 2, sum)
print(c("mean =", round(mean(col_sum),2)))
## [1] "mean =" "5.1"
print(c("less than 3 appearances =", sum(table(col_sum)[1:3])))
## [1] "less than 3 appearances =" "6397"
print(c("rare features =", table(col_sum)[1:3]))
##                        1      2      3
## "rare features ="  "4581" "1238"  "578"
print(c("features appearing more than 100 = ", length(which(col_sum >100))))
## [1] "features appearing more than 100 = "
## [2] "43"
print(c("common features =", table(col_sum)[123:125]))
##                       436   840
## "common features ="   "1"   "1"
##                      1401
##                       "1"
print(c("most common", which(col_sum > 500)))
##                  .1      u
## "most common"   "1" "7316"
hist(col_sum[col_sum > 5], breaks=100, xlim=c(1,150),
     main="Distribution of Word Frequencies (min 5)",
     xlab="Total Appearances")
[Figure: histogram "Distribution of Word Frequencies (min 5)"; x-axis: Total Appearances (0-150); y-axis: Frequency]
Here we observe a few things. Judging by column sums (total appearances), there are many rare words and only a few common ones. The average word occurs only 5 times, and 6,397 words appear no more than 3 times (4,581 of which appear only once), whereas only 43 words appear more than 100 times. Two words occur dramatically more often than the others: the single character "u" appearing 840 times and ".1" occurring 1,401 times. Due to the tremendous skew in the data, this histogram only shows the distribution of words that appear at least 5 times; otherwise all that is visible is a gigantic spike at the left of the distribution.
ii. Calculate the percentage of messages a feature appears in:
pct <- rep(0,ncol(start_df))
for(i in 1:ncol(start_df)){
  pct[i] <- as.numeric(table(start_df[,i])[1])
}
The other notion of frequency is relative: the percentage of messages containing a feature. Interestingly, by this measure there really aren't any frequent features. The most prevalent features again are the characters "u" and ".1", yet they only appear in 14% and 18% of the training messages, respectively. Only 20 words appear in more than 3% of messages. Based on this, we decided to remove only ".1" as the most common feature. Additionally, features that only appear once or twice are not likely to be useful for prediction, so we will remove these rare words as well. The code for how we did this is in Appendix III.
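This relative notion is the document frequency of each word. A minimal Python sketch (toy matrix with placeholder values, not the project data):

```python
# Toy frequency matrix: rows are messages, columns are word counts (placeholder values)
freq = [[2, 0, 1],
        [0, 0, 1],
        [1, 0, 1]]

n_messages = len(freq)
# For each column: fraction of messages in which that word appears at least once
doc_freq = [sum(1 for row in freq if row[j] > 0) / n_messages
            for j in range(len(freq[0]))]
```

A raw count and a document frequency can disagree: a word repeated many times in one message is frequent by the first measure but rare by the second.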
Our final word matrix is 4135 x 2275. This means we have reduced the word features by about 75%.
4. Power Features:
We designed 9 power features, mostly surrounding special characters, numbers, and capitalization. They are as follows:
• Number of periods (dummy variable): whether an observation has 2 or more uninterrupted dots (..) in the sentence. Idea: as we found out, people who send spam usually try to be professional, so they would not use casual things like ". . . ..??" in their emails.
• Longest string (numeric variable): the length of the longest uppercase word in an observation. If there is no uppercase word in an observation, this variable will be 0. Spam messages often contain a lot of emphasis, notably on something they wish to sell, and as such they often list the key words in uppercase characters. For ham, whether in a professional context or daily conversation, lots of uppercase characters can be perceived as rude or bullying and so tend not to be used as often.
• Capital letters in a row (numeric variable): the average length of uppercase words in an observation. Here we just want to cover what Power_longcap (longest string) fails to cover. For example, an observation may have a lot of uninterrupted uppercase words, none of which is long enough to be distinguished in Power_longcap. In this situation, Power_avecap will help capture the missing information.
• Special character (dummy variable): whether an observation has the special character string "<#>"; we record it since such messages are likely to be ham in general. We first identified this seemingly random set of special characters through manual inspection. It appears mostly in ham messages. After cursory research, we found that this string comes from HTML code: "<" means less than and ">" means greater than. We are not completely sure why these appear; however, we feel they may be predictive.
• Websites (dummy variable): whether an observation contains a website. Many spam messages contain a weblink. They appear in a variety of formats; however, for this feature we looked for "www.".
• Money (dummy variable): whether an observation has patterns that match monetary amounts (like $1.00, £40, etc.). Most spam messages are trying to sell something, and as such they often contain prices.
• Phone numbers (dummy variable): whether an observation contains a phone number. Similar to currency symbols, spam messages often contain phone numbers (patterns like 324-540-231, 2345-234-1234, or 23452).
• Number of numbers (numeric variable): how many digits an observation has. Spam messages also contain a variety of other numeric information: codes, dates, prices, numbers, etc. This variable accounts for this.
• Number of caps (numeric variable): how many uppercase letters an observation has. Idea: spam tends to use uppercase letters more often than ham.
To parse these features from the original data, we wrote another Python program, similar to the original. This program can be found in Appendix I (it was run at the same time the word matrix was created).
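Condensed Python versions of a few of these features are sketched below (simplified illustrations on a made-up message; the project's exact patterns and loops are in Appendix I):

```python
import re

# One toy message (hypothetical example, not from the training data)
msg = "WIN a FREE prize!! Call 324-540-231 or visit www.example.com"

has_website  = int("www." in msg.lower())                  # Websites dummy
n_caps       = sum(ch.isupper() for ch in msg)             # Number of caps
n_numbers    = sum(ch.isdigit() for ch in msg)             # Number of numbers
cap_runs     = re.findall(r"[A-Z][A-Z]+", msg)             # runs of 2+ capitals
longest_caps = max((len(r) for r in cap_runs), default=0)  # Longest string
```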
5. Create combined feature matrix:
# Read the output matrix from power python program
power_df <- read.csv(file = 'dfpower.csv', header = TRUE)
power_df = power_df[,-1]
##
## Call:
## svm(formula = as.factor(label) ~ ., data = data, kernel = "linear",
##     cost = 8, scale = FALSE)
##
##
## Parameters:
##    SVM-Type:  C-classification
##  SVM-Kernel:  linear
##        cost:  8
##       gamma:  0.0004396
##
## Number of Support Vectors:  1009
##
##  ( 684 325 )
##
##
## Number of Classes:  2
##
## Levels:
##  -1 1
We tried using the built-in tune function to determine both the ideal cost parameter and the cross-validated error; however, it ran far too slowly, so we wrote our own function to achieve this task:
i. Cross-Validation
# This function outputs a vector of errors with associated cost labels, as well as a plot
hand.cv <- function(data, cost, k=5, kernal="linear", title="CV Errors by Cost"){
#cost <- c(1:10,15)
#errors <- hand.cv(data, cost=cost, k=10, title="10-Fold CV by Cost")
Below are the output plots from the cross-validation. From the larger plot, we restricted our focus to the range of values 5-10 for the cost parameter. After multiple trials, the results were consistent, showing very small differences in errors in this range. Based on the bottom plot we chose 8 for the cost parameter.
There is a cross-validation error rate of 2.88%. The error is about evenly split between the hams and the spams. The ppv was .9091 and the npv was .9807, meaning the ability of our model to predict true negative values is a little better than its ability to predict true positives. The dimension of the feature matrix is 4,140 by 2,277.
To determine the predictive ability of the power features, we used two methods. We first used Random Forest on only the power features. We then fit an SVM model using the power features.
A. Random Forest
# Use Power Matrix loaded in part 5 and append status column
rf.data <- cbind(status, power_df)
names(rf.data)[1] = "status"
Random Forest has two parameters: the number of trees and the number of variables to consider at each split. As overfitting is not a risk, we do not need to cross-validate the number of trees; however, we will use CV to determine the optimal number of variables:
library(randomForest)
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
#mtry <- c(1:9)
#errors <- rf.cv(rf.data, mtry=mtry, k=10, title="10 Fold CV by Number of Variables")
We can see from the plot below that the best value for the "mtry" parameter (the number of variables considered at each split) is 3. This conforms to the rule of thumb of selecting the square root of p; in this case p = 9.
We see from this output that the performance of the Random Forest on the power features is very similar to the SVM on the word matrix. The overall error is 1.39%, lower than the SVM model. The ham error is quite a bit smaller than the ham error for the SVM model, at .25% compared to 1.23% previously. The spam error is also slightly lower at 8.84%. While the spam error and the ham error are quite a bit lower for random forests, ppv and npv are almost the same, although the ppv is better for the random forest model at 0.9823. Again, we see a nearly ideal ROC curve. To see how this was generated, see Appendix IV.
B. SVM
Here we fit an SVM model to just the power features. There are a number of considerations. Because SVM operates on a metric (a notion of distance), it will likely be important to scale the data; otherwise we will be giving greater weight to variables that are nominally larger. Just like with the word matrix, we will also need to use cross-validation to select an optimal value for the cost parameter.
The cost that yields the lowest error is not very consistent, but it is generally around .002 for both the scaled and unscaled models. The error is around .02. Because we have dummy variables in the data, we can also manually scale only the continuous variables:
# capital letters in a row, number of numbers, longest string, number of caps,
# and number of periods are continuous, while the other variables are dummy variables
scaledcost = c(.01, .05, .1, .15, .2)
data_power_cont_scaled[c(1,2,4,5,6)] = scale(data_power_cont_scaled[c(1,2,4,5,6)])
error_cont_scaled = hand.cv_power(data_power_cont_scaled, cost = scaledcost, k = 5,
When only the continuous features are scaled, the lowest error rate is still .02009. This error rate is obtained with a higher cost of 0.15. Because it is generally advisable to scale one's features in some way when using SVM, we will scale the continuous features and use a cost of 0.15.
While the overall performance of this model is similar to that of the word matrix, it is notably inferior in predicting spam, with a 14% error rate for this category. However, this model predicts ham VERY well, with only a .17% error rate.
9. Combined Feature Matrix
Although we already created a combined feature matrix in part 5, we will add some of the insights from part 8; notably, we will scale the power features and append them to the word matrix. Given the very small observation values in the proportion matrix, this step is likely even more important. As with the other SVM models, we will need to determine an appropriate value for the cost parameter through cross-validation:
# We need to remove the N/A rows so that the power features have the same rows
# as the proportion matrix
sub_power_df <- data_power[-nas,]
combined_df <- cbind(prop_data, sub_power_df)
v. Comparison: The error for the combined matrix is 1.31%, smaller than the SVM power error of 1.99% and the SVM word matrix error of 2.75%. The ham error is .45%, still very good, and the spam error is 6.54%, almost two times smaller than the spam error for the power features alone. The ham error is actually slightly higher for the combined matrix than for the power feature matrix, at .45% versus .14%, but the spam error is much smaller, making the overall error smaller and the combined performance better. The ppv and npv show the same phenomenon: spam detection is more accurate, but ham detection is less accurate than with the power feature matrix.
Compared to the word matrix SVM model, the combined model is quite a bit better. The combined model has lower spam and ham errors, and also has better ppv and npv than the word feature SVM model.
The Random Forest model using only power features, at a 1.44% overall error rate, was almost as accurate as the SVM model using the combined data frame, so a random forest model using power features and word features may be even more accurate than the combined SVM model. If we were to continue researching spam detection, we would probably look more into using a random forest model.
Based on these observations, we see that the combined SVM model has the best overall performance. We are trading a big improvement in predicting spam for a small cost in ham prediction performance.
# For some reason the predict methods run fine in the program, but generate errors
# when trying to knit the PDF. Therefore, we are writing the output to csv and then
# reading the same object back in to generate the PDF.
The validation set error is 1.43%, which is slightly higher than the cross-validated error for the best model of 1.14%. The spam error is almost the same for the validation set and the cross-validated best model, but the ham error is higher for the validation set, at .83%, than the cross-validated error. The ppv is similarly slightly lower for the validation set, while the npv is almost the same.
Our models have shown there is a tradeoff between accurately predicting ham and spam, so we must consider which type of errors to prioritize. False negatives are spam messages that are categorized as ham; these are inconvenient. False positives are ham messages that are predicted to be spam; these are messages we really wish to receive but don't. Therefore we want to minimize false positives and maximize true negatives. This makes npv more important than ppv, because we want the highest possible ratio of true negatives to negative predictions. The second most important factor is minimizing the false positive rate. This is the bottom axis on our ROC curve. As usual, we want the ROC curve to hug the top left corner, showing a minimal FP rate.
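The quantities this prioritization rests on are simple ratios of confusion-matrix counts; a small sketch with hypothetical counts (not our model's actual confusion matrix), treating spam as the positive class:

```python
# Illustrative confusion counts (hypothetical): positive = spam, negative = ham
tp, fp, tn, fn = 50, 5, 520, 10

ppv = tp / (tp + fp)   # of messages flagged as spam, the fraction that truly are
npv = tn / (tn + fn)   # of messages kept as ham, the fraction that truly are
fpr = fp / (fp + tn)   # ham wrongly flagged as spam: the ROC curve's x-axis
```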
Conclusion:
As just discussed, the final model gives us the best combined performance, with an overall error rate of 1.4%. The ham error of .83% is slightly higher than in some of the other models; however, this comes with the tradeoff of halving the spam error to only 5.5%. This is a very good improvement for less than a .5% increase in ham error. As discussed, npv is more important than ppv, and the rate of 99.2% npv is very strong.
Through the process of developing our model, we learned how choosing better models through cross-validation requires finesse and judgment calls, as some of our cross-validation attempts were not very stable and the best cost or number of variables to use was not obvious. We also learned that coordinating programming projects is difficult and some sort of version control system would be useful. We also ran into lots of computational barriers, and although these barriers might not be a problem if we had access to more powerful computers, it was helpful and interesting to figure out computationally cheaper methods to create our models.
Haley Muhlestein, Ricardo Tanasescu, Lawrence Ye, Taylor Hines
12/15/2014 Stat 154 Final Project Appendix
I. Training Set Python Code

TestEmails = open('test_examples_unlb.txt','r', encoding = 'utf-8').readlines()
emails = open('train_msgs.txt','r', encoding = 'utf-8').readlines()

#---------------------------------------------------------------------------------------------------
#----------------------------------- TRAINING SET CODE CHUNK ---------------------------------------

# Break emails by status and content
words = list()
for line in emails:
    for word in line.split("\t"):
        words.append(word)

# Create lists
status = list()
content = list()
for i in range(0,len(words),1):
    if i%2==0:
        status.append(words[i])   # status is the response vector
    else:
        content.append(words[i])  # This is the content of each email

# creates a dummy variable that is 1 if the text contains five or more numbers in a row
import re
phone_numbers = list()
phone_pattern = re.compile('|'.join([
    r'.*(\d{4})*\D*(\d{3})\D*(\d{4}).*',
    r'.*(\d{3})*\D*(\d{3})\D*(\d{4}).*',
    r'.*\d{5}.*',
]))

# creates a dummy variable that is 1 if the text contains money
money = list()
money_pattern = re.compile('|'.join([
    r'.*\$?(\d*\.\d{1,2}).*',
    r'.*\£?(\d*\.\d{1,2}).*',
    r'.*\£(\d+).*',
    r'.*\£(\d+\.?).*',
    r'.*\$(\d+).*',
    r'.*\$(\d+\.?).*',
]))
number_of_numbers = list()
number_of_uppers = list()
for line in content:
    if re.match(phone_pattern, line):
        phone_numbers.append(1)
    else:
        phone_numbers.append(0)
    if re.match(money_pattern, line):
        money.append(1)
    else:
        money.append(0)
    num_count = 0
    upper_count = 0
    for letter in line:
        if letter.isdigit():
            num_count = num_count + 1
        elif letter.isupper():
            upper_count = upper_count + 1
    number_of_uppers.append(upper_count)
    number_of_numbers.append(num_count)

# Remove \n at end of each email that is stored in content:
import string

# empty list to put lines without punctuation except "." in
content_sans_punc = list()
# take punctuation except (".") out of each line
for line in content:
    content_sans_punc.append(''.join(word.strip('!"#$%&\'()*+,-/:;<=>?@[\\]^_`{|}~') for word in line))

content_new = [line.strip("\r\n") for line in content]

### Power: count how many dots an email has
count_dot = [line.count("..") for line in content_new]

### Then take "." out of each line
content_after = []
for line in content_new:
    content_after.append(''.join(word.strip('.') for word in line))
### Power: length of the longest string in each sentence
Caps = [re.findall("[A-Z][A-Z]+", line) for line in content_after]
power_loncap = list()
for ele in Caps:
    if len(ele)==0:
        power_loncap.append(len(ele))
    else:
        power_loncap.append(max([len(mem) for mem in ele]))

### Power: the average length of uninterrupted sequences of capital letters
import statistics
power_avecap = list()
for ele in Caps:
    if len(ele)==0:
        power_avecap.append(len(ele))
    else:
        power_avecap.append(statistics.mean([len(mem) for mem in ele]))

### Power: special character containing <#>
special_power = [line.count("<#>") for line in content_after]

content_splitted = [line.rsplit(" ") for line in content_after]

# Change each word into lower case
lower_case_content = list()
for word in content_after:
    lower_case_content.append(word.lower())

### Power: website, www.
website_power = [line.count("www") for line in lower_case_content]

# Split each email by words. Now every email is broken into words:
content_splitted = [line.rsplit(" ") for line in lower_case_content]

# Put all words in one single list:
totalWords = list()
for lines in content_splitted:
    for words in lines:
        totalWords.append(words)

# Get unique words
uniqueWords = set(totalWords)
uniqueWords = list(uniqueWords)

# Create a data frame:
import pandas as pd
dfstatus = pd.DataFrame(status)
dfstatus.to_csv('dfstatus.csv')
df = pd.DataFrame(data=0, index=range(0,len(content_splitted),1), columns=sorted(uniqueWords))

# create dictionary for power features
powerfeatures = {"number of numbers": number_of_numbers,
                 "number of caps": number_of_uppers,
                 "money": money,
                 "phone numbers": phone_numbers,
                 "number of periods": count_dot,
                 "longest string": power_loncap,
                 "capital letters in a row": power_avecap,
                 "special character": special_power,
                 "websites": website_power}
dfpower = pd.DataFrame(powerfeatures)
dfpower.to_csv('dfpower.csv')

print(content[12])
print(content[164])
print(content[191])

############# Calculating the frequency ##########
for row in range(0,len(df),1):
    contentWords = content_splitted[row]
    for i in range(0,len(contentWords),1):
        if contentWords[i] in df.columns:
            df[contentWords[i]][row:row+1] = df[contentWords[i]][row:row+1]+1
        else:
            df[contentWords[i]][row:row+1] = 1
##################################################
df.to_csv('lawrence.csv')
# Output the unique words (to be used in test.py)
f = open('unique.txt', 'w', encoding = 'utf-8')
for t in sorted(uniqueWords):
    line = ''.join(str(x) for x in t)
    f.write(line + '\n')
f.close()

#-------------------------------- END OF TRAINING SET CODE CHUNK------------------------------------
#---------------------------------------------------------------------------------------------------

II. Test Set Python Code

TestEmails = open('train_msgs.txt','r', encoding = 'utf-8').readlines()
unique = open('unique.txt','r', encoding = 'utf-8').readlines()
uniqueWords = [line.strip("\r\n") for line in unique]

#---------------------------------------------------------------------------------------------------
#------------------------------------ TEST SET CODE CHUNK ------------------------------------------

# Remove \n at end of each email that is stored in content:
import string

# empty list to put lines without punctuation in
content_sans_puncTEST = list()
# take punctuation out of each line
for line in TestEmails:
    line = line.replace('\t',' ')
    content_sans_puncTEST.append(''.join(word.strip(string.punctuation) for word in line))
#print(content_sans_puncTEST[260:270])

lower_case_contentTEST = list()
for word in content_sans_puncTEST:
    lower_case_contentTEST.append(word.lower())

content_newTEST = [line.strip("\r\n") for line in lower_case_contentTEST]
# Split each email by words. Now every email is broken into words:
content_splittedTEST = [line.rsplit(" ") for line in content_newTEST]
print(content_splittedTEST[260:270])

# Create a data frame:
import pandas as pd
maxLenTEST = 0
for line in content_splittedTEST:
    if len(line) > maxLenTEST:
        maxLenTEST = len(line)
cols = list()
for i in range(0, maxLenTEST):
    cols.append(str(i))
dfTEST = pd.DataFrame(data=0, index=range(0,len(content_splittedTEST),1), columns=sorted(uniqueWords))

############# Calculating the frequency ##########
for row in range(0,len(dfTEST),1):
    contentWordsTEST = content_splittedTEST[row]
    for i in range(0,len(contentWordsTEST),1):
        if contentWordsTEST[i] in dfTEST.columns:
            dfTEST[contentWordsTEST[i]][row:row+1] = dfTEST[contentWordsTEST[i]][row:row+1]+1
##################################################
print(dfTEST[".1"])
dfTEST.to_csv('test.csv')

#-------------------------------- END OF TEST SET CODE CHUNK-----------------------------------------
#----------------------------------------------------------------------------------------------------

III. Remove rare features

row_sum <- apply(sub_data, 1, sum)
table(row_sum)[1]
nas <- which(row_sum == 0)
sub_data <- sub_data[-nas,]