Experiments and Results on Click stream analysis using R
DB 297C Data Analytics – Project Report Term I (2013-14)
Group Information GROUP NO: 11
TEAM MEMBERS:
Bisen Vikratsingh Mohansingh - MT2012036
Kodamasimham Pridhvi - MT2012066
Vaibhav Singh Rajput - MT2012145
Dataset Description
Blue Martini Software approached several clients using its Customer Interaction System to volunteer their data, and a small dot-com company called Gazelle.com, a legwear and legcare retailer, agreed. The data was made available in two formats: original data and aggregated data. Among the data collected by the Blue Martini application server, the following three categories are relevant:
Customer information, which includes customer ID, registration information, and registration form questionnaire responses.
Order information: the order header includes date/time, discount, tax, total amount, payment, shipping, status, and session ID; each order line includes quantity, price, product, date/time, assortment, and status.
Click stream information: each session includes starting and ending date/time, cookie, browser, referrer, visit count, and user agent; each page view includes date/time, sequence number, URL, processing time, product, and assortment.
The initial dataset was a 25 MB .csv file (14,000 rows × 296 columns). Using SQL queries we removed the columns containing only NULL values, leaving 208 columns. We then removed crawler records and records with just one page view (i.e. Session_time_elapsed = 0.0), bringing the row count down to ~5,000. We manually removed a few more irrelevant columns (browser/OS, day, date info, etc.), leaving 128 columns. Finally, to keep only the most frequently visited pages, we removed columns whose visit sum was below 10, leaving 113 columns.
Final size: 6.5 MB (~5,000 rows × 113 columns)
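For reference, the filtering steps described above can also be sketched in R (a hypothetical sketch, not the SQL actually used; the file name raw.csv and the column name Session_time_elapsed are assumptions):

# Hypothetical sketch of the pre-processing steps described above
raw <- read.csv("raw.csv")                       # 14000 x 296 export

# 1. Drop columns that contain only NULL/NA values
d <- raw[, colSums(!is.na(raw)) > 0]

# 2. Drop single-page-view sessions (crawler filtering depends on the
#    export's crawler flag and is omitted here)
d <- d[d$Session_time_elapsed > 0.0, ]

# 3. Drop page-view columns whose total visit count is below 10;
#    non-numeric columns are kept regardless
visit_sum <- sapply(d, function(x) if (is.numeric(x)) sum(x, na.rm = TRUE) else Inf)
d <- d[, visit_sum >= 10]

write.csv(d, "dataset.csv", row.names = FALSE)   # ~5000 x 113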
Summary of top 5 observations
Rule Based Classification:
Rule-based methods (rule discovery, or rule extraction from data) are data mining techniques aimed at understanding data structures, providing a comprehensible description instead of only a black-box prediction. Sets of rules are useful if the rules are not too numerous, are comprehensible, and have sufficiently high accuracy.
From the result of the experiment we can see the rules being generated; we show some sample rules in this report, out of a total of 182 generated rules. The class to which each rule belongs is shown at the end of the rule, together with the actual number of rows / number of misclassifications covered by that particular rule.
Association Rules:
Association rules were selected based on two factors, lift and support: we kept rules having lift greater than 1 and min-support > 0.5. A total of 377,564 rules were generated, out of which we applied filters and selected a few rules that showed some interesting patterns.
Result 1
Rule-based classification generates a set of rules on which the classification takes place; a sample of the rule set generated by the model is shown below (the output is truncated):
q2_part
O/p: PART decision list
Num_EllenTracy_Product_Views <= 1 AND Num_main_assortment_Template_Views > 1 AND
Here we open a connection to the database, read the data from the data file, and write it into the table we created earlier in the database; rows are inserted directly as they are read from the file.
>>python da.py script.sql
Inside the script we need to specify the path where our .csv data file exists; the script will then read from the CSV and insert the rows into the database.
After loading the data into the DBMS, we remove the columns consisting entirely of NULL values using simple SQL queries, as mentioned above.
After performing this data pre-processing on the given dataset, we export the table into a .csv file for analysis in R.
Now we will analyze the data using R.
import MySQLdb

# Open the CSV data file to be loaded
myfile = open("path where required csv is there", 'r')

# Connect to the MySQL database
db = MySQLdb.connect(host="localhost",  # your host, usually localhost
                     user="root",       # your username
                     passwd="root",     # your password
                     db="da1")          # name of the database
cur = db.cursor()

# Read the file line by line and insert each row into the table
for line in myfile:
    print line                          # echo the row being loaded
    my_line_list = line.strip().split(',')
    # Build a comma-separated list of quoted values
    # (values are not escaped; acceptable for this trusted export)
    values = ",".join("'" + str(value) + "'" for value in my_line_list)
    final_query = "insert into question1 values (" + values + ");"
    cur.execute(final_query)

# Commit the inserts and close the connections
db.commit()
cur.close()
db.close()
myfile.close()
Classification
Random Forest
Objective: To generate a model by building decision trees and to identify important features using random forest.
Description: Random forests are an ensemble learning method for classification (and regression) that operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes output by individual trees.
Procedure:
1. After the data pre-processing, the dataset is loaded into the R environment:
question1 <- read.csv("dataset.csv")
dim(question1)  # 5220 103 ----- number of rows and columns
2. After loading the dataset, we divide it into 70% trainData and 30% testData:
div <- sample(2, nrow(question1), replace = TRUE, prob = c(0.7, 0.3))
This generates two non-overlapping samples containing 70% and 30% of the rows: 'nrow' gives the number of rows in our dataset, 'prob' decides the division ratio, and 'sample' assigns 1 or 2 to each row to mark which sample that row belongs to. To create trainData the command is:
trainData <- question1[div == 1, ]
dim(trainData)  # 3670 103 --- dimensions of trainData
This copies into trainData all the rows of the dataset that were marked 1 by the sample. Similarly for testData:
testData <- question1[div == 2, ]
dim(testData)  # 1550 103 --- dimensions of testData
3. After generating trainData and testData, we load the required package 'randomForest' into R:
library(randomForest)
4. We define the target variable and the independent variables in the formula to be used when generating the model:
myformula <- Session_Continues ~ .
Here 'Session_Continues' is the target variable, with classes 'true' and 'false', and all the remaining columns are the independent variables on which basis the target variable is classified, as represented by '~ .'.
5. Once the formula is decided, we apply it to generate the model from trainData using the function 'randomForest' and store the model in 'rf':
rf <- randomForest(myformula, data = trainData, ntree = 100, proximity = T)
'ntree' specifies how many trees the algorithm should grow to obtain an accurate model; 'proximity' requests the proximity matrix, a measure of how often pairs of cases end up in the same terminal node, which is useful when examining the error behaviour of the forest.
6. We can see the classification result with:
-->rf
Output:
Call:
 randomForest(formula = myformula, data = trainData, ntree = 100, proximity = T)
               Type of random forest: classification
                     Number of trees: 100
No. of variables tried at each split: 10

        OOB estimate of error rate: 34.17%
Confusion matrix:
      False True class.error
False  2280  173  0.07052589
True   1081  136  0.88824979
From this result we can see that we are getting an error of about 34%.
7. To see a generated classification tree:
-->getTree(rf, 1)
Output:
  left daughter right daughter split var status prediction
1             2              3         3      1          0
2             4              5        29      1          0
3             6              7       105      1          0
4             8              9        98      1          0
5            10             11        36      1          0
6            12             13        34      1          0
7             0              0         0     -1          2
If status is -1, that node is a leaf node of the decision tree, and prediction (1 or 2) gives the class it is classified to. We can get any tree's information using the above command, just by specifying the randomForest object 'rf' and the tree number n, where 1 ≤ n ≤ 100.
8. We can plot the error rates of the generated trees with:
plot(rf)
This produces the graph shown in figure (1) in the observations.
9. We can also find the features that contribute most to the decision trees using:
importance(rf)
This gives each feature and its mean decrease in Gini index, from which we can decide which features most affect our decision trees.
10. The randomForest object also provides many more components that can be used for further analysis.
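As a follow-up sketch (not part of the original run), the held-out testData from step 2 can be used to check the model; only standard randomForest functions are assumed:

# Predict on the 30% held-out split and tabulate the confusion matrix
pred <- predict(rf, newdata = testData)
table(observed = testData$Session_Continues, predicted = pred)

# Variable importance, sorted by mean decrease in Gini index
imp <- importance(rf)
head(imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), , drop = FALSE], 10)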
Relative absolute error                99.9541 %
Root relative squared error            99.9958 %
Coverage of cases (0.95 level)        100      %
Mean rel. region size (0.95 level)    100      %
Total Number of Instances            1305

=== Confusion Matrix ===
   a   b   <-- classified as
 734 124 |  a = FALSE
 301 136 |  b = TRUE
From this result we can see that we are getting a classification rate of around 68%-72%, which is a better rate than the decision tree.
Observation
Rule-based classification generates a set of rules on which the classification takes place; the set of rules from the generated model can be seen below:
q2_part
O/p: PART decision list
Num_EllenTracy_Product_Views <= 1 AND Num_main_assortment_Template_Views > 1 AND
From the result you can see the rules being generated; we show some sample rules out of a total of 182 generated rules. At the end of each rule the class to which the rule belongs is shown, along with the actual number of rows / number of misclassifications covered by that particular rule.
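For reproducibility, such a PART decision list can also be produced from R through the RWeka interface to WEKA; this is a sketch under the assumption that the trainData/testData split from the Random Forest section is reused:

library(RWeka)   # R interface to WEKA; PART builds a rule-based decision list
# Train a PART decision list on the training split
part_model <- PART(Session_Continues ~ ., data = trainData)
part_model                  # prints the decision list (182 rules in our run)
summary(part_model)         # training accuracy and confusion matrix
# Evaluate on the held-out split
evaluate_Weka_classifier(part_model, newdata = testData)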
Conclusion
From the above observations and results we can see that a successful rule-based model was built, with an accuracy above 70%, for identifying whether a user will continue his session or not.
Clustering
Objective: To group visitors of the website whose page-view patterns are similar and to identify their interests.
Approach: Clustering is a methodology in data analysis that can be used to group objects based on their similarities. We make use of the WEKA tool for this analysis.
Preprocessing:
1. Remove all spam data by deleting records with just one page view.
2. There are about 500+ dimensions, which is not feasible to analyze, so for dimensionality reduction:
a. Go to the Select attributes panel of WEKA.
b. Manual: remove all session data, browser information, and the most common pages.
c. Auto: calculate information gain and select the top 25 attributes (see the sketch below).
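Step (c) can also be sketched from R through RWeka's information-gain evaluator; the data frame q3 and the class column name Spend_over_12 here are placeholders, not names from the actual run:

library(RWeka)
# Rank attributes by information gain with respect to the class attribute
ig <- InfoGainAttributeEval(Spend_over_12 ~ ., data = q3)
# Keep the 25 highest-ranked attributes plus the class itself
top25 <- names(sort(ig, decreasing = TRUE))[1:25]
q3_reduced <- q3[, c(top25, "Spend_over_12")]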
Process:
Experiment I
In this experiment we make 5 clusters of the given instances (users) and analyze their purchase habits.
K-means
Steps:
1. Import the reduced dataset into WEKA.
2. Select SimpleKMeans.
3. Set the distance function to Euclidean.
4. Specify k (the number of clusters).
5. Click Start to generate the clusters (an equivalent R sketch is shown below).
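The same run can be approximated outside WEKA; here is a hypothetical base-R sketch, with q3_reduced.csv standing in for the exported reduced dataset:

# Sketch: reproducing the SimpleKMeans run (-N 5 -I 500 -S 10) in base R
q3 <- read.csv("q3_reduced.csv")

# Base kmeans needs numeric input: keep numeric columns and impute
# missing values with the column mean (WEKA uses mean/mode replacement)
num <- q3[sapply(q3, is.numeric)]
num[] <- lapply(num, function(x) { x[is.na(x)] <- mean(x, na.rm = TRUE); x })

set.seed(10)                                    # seed, cf. -S 10
km <- kmeans(num, centers = 5, iter.max = 500)  # cf. -N 5, -I 500
table(km$cluster)   # cluster sizes, cf. "Clustered Instances"
km$tot.withinss     # within-cluster sum of squared errors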
Results:
(A) Using Euclidean distance
=== Run information ===
Scheme:     weka.clusterers.SimpleKMeans -N 5 -A "weka.core.EuclideanDistance -R first-last" -I 500 -num-slots 1 -S 10
Relation:   q3_added-weka.filters.unsupervised.attribute.Remove-V-R7,10,9,102,127,112,66,87,56,70,111,106,26,59,20,18,104,113,17,1,54,14,115,63,73,15
Instances:  1781
Attributes: 26
            City, Customer_ID, US_State, Num_Sheer_Look_Product_Views, Num_CT_Waist_Control_Views, Num_PH_Category_Views, Num_main_shopping_cart_Template_Views, Num_Replenishable_Stock_Views, Num_account_Template_Views, Num_main_login2_Template_Views, Num_Sandal_Foot_Views, Num_HasDressingRoom_True_Views, Num_Legwear_Product_Views, Num_products_productDetailLegwear_Template_Views, Num_DonnaKaran_Product_Views, Num_AmericanEssentials_Product_Views, Num_Basic_Product_Views, Num_WDCS_Category_Views, Num_Oroblu_Product_Views, WhichDoYouWearMostFrequent, Num_products_Template_Views, Home_Market_Value, Num_WAS_Category_Views, Num_main_vendor_Template_Views, Num_main_freegift_Template_Views, Spend_over_$12_per_order_on_average
Test mode:  evaluate on training data

=== Clustering model (full training set) ===
kMeans
======
Number of iterations: 12
Within cluster sum of squared errors: 4789.607514501406
Missing values globally replaced with mean/mode

Time taken to build model (full training data): 0.14 seconds

=== Model and evaluation on training set ===
Clustered Instances
0   263 (15%)
1   273 (15%)
2   368 (21%)
3   484 (27%)
4   393 (22%)
Observation:
Cluster 0
o High income
o Spend avg > $12 => potential customers (value)
o Purchase nylon more than cotton (nylon is costlier than cotton)
o More men's products than the other clusters => this cluster might contain more men
o Frequent use of the search bar
o Affluent visitors, most of them with above-average home/asset values
Cluster 2
o General visitors
o Buy cheap products
Cluster 3
o Interested mostly in offers/free gift products
o Highest visit to checkout page => potential customer (frequency)
Cluster 4
o No special pattern observed
Experiment II
Here we validate whether page-view data can be used for identifying potential customers using clustering. The data is labeled by average purchase > $12, with 1368 instances labeled False and 413 labeled True.
=== Run information ===
Scheme:     weka.clusterers.SimpleKMeans -N 2 -A "weka.core.EuclideanDistance -R first-last" -I 500 -num-slots 1 -S 10
Relation:   q3_added-weka.filters.unsupervised.attribute.Remove-V-R7,10,9,102,127,112,66,87,56,70,111,106,26,59,20,18,104,113,17,1,54,14,115,63,73,15
Instances:  1781
Attributes: 26
            City, Customer_ID, US_State, Num_Sheer_Look_Product_Views, Num_CT_Waist_Control_Views, Num_PH_Category_Views, Num_main_shopping_cart_Template_Views, Num_Replenishable_Stock_Views, Num_account_Template_Views, Num_main_login2_Template_Views, Num_Sandal_Foot_Views, Num_HasDressingRoom_True_Views, Num_Legwear_Product_Views, Num_products_productDetailLegwear_Template_Views, Num_DonnaKaran_Product_Views, Num_AmericanEssentials_Product_Views, Num_Basic_Product_Views, Num_WDCS_Category_Views, Num_Oroblu_Product_Views, WhichDoYouWearMostFrequent, Num_products_Template_Views, Home_Market_Value, Num_WAS_Category_Views, Num_main_vendor_Template_Views, Num_main_freegift_Template_Views
Ignored:    Spend_over_$12_per_order_on_average
Test mode:  Classes to clusters evaluation on training data

=== Clustering model (full training set) ===
kMeans
======
Number of iterations: 5
Within cluster sum of squared errors: 4913.0928856548035
Missing values globally replaced with mean/mode

Time taken to build model (full training data): 0.05 seconds

=== Model and evaluation on training set ===
Clustered Instances
0    561 (31%)
1   1220 (69%)

Class attribute: Spend_over_$12_per_order_on_average
Classes to Clusters:
   0    1   <-- assigned to cluster
 402  966 | False
 159  254 | True

Cluster 0 <-- True
Cluster 1 <-- False

Incorrectly clustered instances: 656.0 (36.8332 %)
Observation:
Only about 63% of the data is correctly clustered, as the data is heavily biased toward the False (< $12 average spending) class. Nevertheless, clustering gives us good insight into purchase and page-view patterns.
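The same classes-to-clusters check can be sketched in base R, reusing the q3 and num objects from the Experiment I sketch (the mangled column name below is an assumption; read.csv rewrites characters such as $ in headers):

# The class attribute is a factor, so it is not part of the numeric matrix 'num'
set.seed(10)
km2 <- kmeans(num, centers = 2, iter.max = 500)
# Cross-tabulate cluster assignments against the true labels
table(cluster = km2$cluster,
      class = q3$Spend_over_.12_per_order_on_average)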
Association Rules
Objective: To identify some interesting patterns in the users' page views, and also the killer pages (pages after which visitors tend to abandon the site).
Description:
Association rule learning is a popular and well-researched method for discovering interesting relations between variables in large databases. It is intended to identify strong rules discovered in databases using different measures of interestingness. The measures used in our analysis are lift, confidence, and support.
Procedure:
1. To perform association rule mining, we first converted the dataset into a binary matrix indicating, for each session, whether the visitor viewed each page or not.
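A minimal sketch of this binarization (question1 stands in for whichever columns were kept for this analysis):

# Convert page-view counts to logical flags: TRUE if the page was viewed.
# Non-numeric columns are left unchanged (they should be factors before
# the conversion to transactions below).
assoc <- as.data.frame(lapply(question1, function(x)
  if (is.numeric(x)) x > 0 else x))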
2. The "arules" package provides the association rule functionality:
library(arules)
3. Now we load the converted dataset into R for rule generation; we used only the important columns, based on the mean decrease in Gini index obtained from the randomForest results.
4. After loading the data, we convert it to transactions with the following command:
dataTrans <- as(assoc, "transactions")
5. Now we apply the "apriori" algorithm to generate the rules; a parameter list giving the support, confidence, and minimum length of each rule can be passed in.
rules <- apriori(dataTrans)
Called like this, it generates all rules using the default min-support of 0.1 and min-confidence of 0.8, including all the subset rules based on the frequent itemsets of attributes.
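To make the thresholds explicit (a sketch; the values mirror apriori's documented defaults, plus an assumed minimum rule length of 2 to skip trivial one-item rules):

rules <- apriori(dataTrans,
                 parameter = list(support = 0.1, confidence = 0.8, minlen = 2))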
6. To see how many rules were generated, simply print the object:
rules
Around 377,564 rules were generated, out of which we were interested only in rules having LEAVE or CONTINUE on the RHS, to check whether a person will continue or leave after seeing certain pages.
7. From all the generated rules we retrieved a subset that showed some interesting patterns.
Observation:
We were able to see some interesting patterns in the generated rules. Most of the visitors in our dataset were female, and accordingly most of the rules contained "NUM_OF_WOMEN_PRODUCT_VIEWS" in almost every transaction. Some of the brands were rarely or never visited according to the rules. We were also able to identify some of the killer pages based on user preferences: after visiting certain pages, users would withdraw at that same page every time.
Results: Some of the rules, sorted by their "lift" values, are shown below:
rulesLeave <- subset(rules, subset = rhs %pin% "LEAVE")  # get the rules with LEAVE on the RHS
inspect(head(sort(rulesLeave,by="lift"),20))
O/p:
lhs rhs support confidence lift
1 {CONTINUE=YES} => {LEAVE=NO} 0.425 1 2.352941
2 {Num_Women_Product_Views=Yes,
CONTINUE=YES} => {LEAVE=NO} 0.110 1 2.352941
17 {Num_Women_Product_Views=Yes,
Num_Men_Product_Views=No,
CONTINUE=YES} => {LEAVE=NO} 0.100 1 2.352941
18 {Num_MAS_Category_Views=No,
Num_Women_Product_Views=Yes,
CONTINUE=YES} => {LEAVE=NO} 0.105 1 2.352941
19 {Num_MDS_Category_Views=No,
Num_Women_Product_Views=Yes,
CONTINUE=YES} => {LEAVE=NO} 0.105 1 2.352941
20 {Num_MCS_Category_Views=No,
Num_Women_Product_Views=Yes,
CONTINUE=YES} => {LEAVE=NO} 0.110 1 2.352941
Some of the interesting rules are shown above. A few of the generated rules picked at random are:
inspect(head(rulesLeave, 6))
O/p:
lhs rhs support confidence lift
3 {Num_Women_Product_Views=Yes,
CONTINUE=YES} => {LEAVE=NO} 0.110 1 2.352941
4 {Num_Women_Product_Views=Yes,
CONTINUE=NO} => {LEAVE=YES} 0.100 1 1.739130
5 {Num_Women_Product_Views=No,
CONTINUE=YES} => {LEAVE=NO} 0.315 1 2.352941
6 {Num_CT_Waist_Control_Views=No,
CONTINUE=YES} => {LEAVE=NO} 0.360 1 2.352941
For CONTINUE, the analogous subset gives some examples:
rulesContinue <- subset(rules, subset = rhs %pin% "CONTINUE")  # rules with CONTINUE on the RHS
inspect(head(rulesContinue, 4))
O/p:
lhs rhs support confidence lift
3 {Num_Women_Product_Views=Yes,
LEAVE=NO} => {CONTINUE=YES} 0.110 1 2.352941
4 {Num_Women_Product_Views=Yes,
LEAVE=YES} => {CONTINUE=NO} 0.100 1 1.739130
5 {Num_Women_Product_Views=No,
LEAVE=NO} => {CONTINUE=YES} 0.315 1 2.352941
6 {Num_CT_Waist_Control_Views=No,
LEAVE=NO} => {CONTINUE=YES} 0.360 1 2.352941
Conclusion: We were able to find some interesting patterns in users' page views and to identify some of the killer pages, such as "Num_CT_Waist_Control_Views" and "Num_MAS_Category_Views".