Experiments and Results on Click stream analysis using R
DB 297C Data Analytics – Project Report Term I (2013-14)
Group Information GROUP NO: 11
TEAM MEMBERS:
Bisen Vikratsingh Mohansingh - MT2012036
Kodamasimham Pridhvi - MT2012066
Vaibhav Singh Rajput - MT2012145
Dataset Description
Blue Martini Software approached several clients using its Customer Interaction System to volunteer their data, and a small dot-com company called Gazelle.com, a legwear and legcare retailer, agreed. The data was made available in two formats: original data and aggregated data. Among the data collected by the Blue Martini application server, the following three categories are relevant:
Customer information, which includes customer ID, registration information, and registration form questionnaire responses.
Order information: the order header includes date/time, discount, tax, total amount, payment, shipping, status, and session ID; each order line includes quantity, price, product, date/time, assortment, and status.
Click stream information: each session includes starting and ending date/time, cookie, browser, referrer, visit count, and user agent; each page view includes date/time, sequence number, URL, processing time, product, and assortment.
The initial dataset was a 25 MB .csv file (14,000 rows × 296 columns). Using SQL queries we removed the columns containing only NULL values, leaving 208 columns. We then removed crawler records and records with just one page view (i.e. Session_time_elapsed = 0.0), bringing the row count down to ~5,000. We manually removed a few more irrelevant columns (browser/OS, day, date info, etc.), leaving 128 columns. Finally, to keep only the most frequently visited pages, we removed columns whose visit sum was below 10, leaving 113 columns.
Final size: 6.5 MB (~5,000 rows × 113 columns)
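For reference, the filtering steps described above can also be sketched in R (a hypothetical sketch, not the SQL actually used; the file name raw.csv and the column name Session_time_elapsed are assumptions):

# Hypothetical sketch of the pre-processing steps described above
raw <- read.csv("raw.csv")                       # 14000 x 296 export

# 1. Drop columns that contain only NULL/NA values
d <- raw[, colSums(!is.na(raw)) > 0]

# 2. Drop single-page-view sessions (crawler filtering depends on the
#    export's crawler flag and is omitted here)
d <- d[d$Session_time_elapsed > 0.0, ]

# 3. Drop page-view columns whose total visit count is below 10;
#    non-numeric columns are kept regardless
visit_sum <- sapply(d, function(x) if (is.numeric(x)) sum(x, na.rm = TRUE) else Inf)
d <- d[, visit_sum >= 10]

write.csv(d, "dataset.csv", row.names = FALSE)   # ~5000 x 113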
Summary of top 5 observations
Rule Based Classification:
Rule-based methods (rule discovery, or rule extraction from data) are data mining techniques aimed at understanding data structures, providing a comprehensible description instead of only a black-box prediction. Sets of rules are useful if the rules are not too numerous, are comprehensible, and have sufficiently high accuracy.
From the result of the experiment we can see the rules being generated; we show some sample rules in this report, out of a total of 182 generated rules. The class to which each rule belongs is shown at the end of the rule, together with the actual number of rows / number of misclassifications covered by that particular rule.
Association Rules:
Association rules were selected based on two factors, lift and support: we kept rules having lift greater than 1 and min-support > 0.5. A total of 377,564 rules were generated, out of which we applied filters and selected a few rules that showed some interesting patterns.
Result 1
Rule-based classification generates a set of rules on which the classification takes place; a sample of the rule set generated by the model is shown below (the output is truncated):
q2_part
O/p: PART decision list
Num_EllenTracy_Product_Views <= 1 AND Num_main_assortment_Template_Views > 1 AND
Here we open a connection to the database, read the data from the data file, and write it into the table we created earlier in the database; rows are inserted directly as they are read from the file.
>>python da.py script.sql
Inside the script we need to specify the path where our .csv data file exists; the script will then read from the CSV and insert the rows into the database.
After loading the data into the DBMS, we remove the columns consisting entirely of NULL values using simple SQL queries, as mentioned above.
After performing this data pre-processing on the given dataset, we export the table into a .csv file for analysis in R.
Now we will analyze the data using R.
import MySQLdb

# Open the CSV data file to be loaded
myfile = open("path where required csv is there", 'r')

# Connect to the MySQL database
db = MySQLdb.connect(host="localhost",  # your host, usually localhost
                     user="root",       # your username
                     passwd="root",     # your password
                     db="da1")          # name of the database
cur = db.cursor()

# Read the file line by line and insert each row into the table
for line in myfile:
    print line                          # echo the row being loaded
    my_line_list = line.strip().split(',')
    # Build a comma-separated list of quoted values
    # (values are not escaped; acceptable for this trusted export)
    values = ",".join("'" + str(value) + "'" for value in my_line_list)
    final_query = "insert into question1 values (" + values + ");"
    cur.execute(final_query)

# Commit the inserts and close the connections
db.commit()
cur.close()
db.close()
myfile.close()
Classification
Random Forest
Objective: To generate a model by building decision trees and to identify important features using random forest.
Description: Random forests are an ensemble learning method for classification (and regression) that operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes output by individual trees.
Procedure:
1. After the data pre-processing, the dataset is loaded into the R environment:
question1 <- read.csv("dataset.csv")
dim(question1)  # 5220 103 ----- number of rows and columns
2. After loading the dataset, we divide it into 70% trainData and 30% testData:
div <- sample(2, nrow(question1), replace = TRUE, prob = c(0.7, 0.3))
This generates two non-overlapping samples containing 70% and 30% of the rows: 'nrow' gives the number of rows in our dataset, 'prob' decides the division ratio, and 'sample' assigns 1 or 2 to each row to mark which sample that row belongs to. To create trainData the command is:
trainData <- question1[div == 1, ]
dim(trainData)  # 3670 103 --- dimensions of trainData
This copies into trainData all the rows of the dataset that were marked 1 by the sample. Similarly for testData:
testData <- question1[div == 2, ]
dim(testData)  # 1550 103 --- dimensions of testData
3. After generating trainData and testData, we load the required package 'randomForest' into R:
library(randomForest)
4. We define the target variable and the independent variables in the formula to be used when generating the model:
myformula <- Session_Continues ~ .
Here 'Session_Continues' is the target variable, with classes 'true' and 'false', and all the remaining columns are the independent variables on which basis the target variable is classified, as represented by '~ .'.
5. Once the formula is decided, we apply it to generate the model from trainData using the function 'randomForest' and store the model in 'rf':
rf <- randomForest(myformula, data = trainData, ntree = 100, proximity = T)
'ntree' specifies how many trees the algorithm should grow to obtain an accurate model; 'proximity' requests the proximity matrix, a measure of how often pairs of cases end up in the same terminal node, which is useful when examining the error behaviour of the forest.
6. We can see the classification result with:
-->rf
Output:
Call:
 randomForest(formula = myformula, data = trainData, ntree = 100, proximity = T)
               Type of random forest: classification
                     Number of trees: 100
No. of variables tried at each split: 10

        OOB estimate of error rate: 34.17%
Confusion matrix:
      False True class.error
False  2280  173  0.07052589
True   1081  136  0.88824979
From this result we can see that we are getting an error of about 34%.
7. To see a generated classification tree:
-->getTree(rf, 1)
Output:
  left daughter right daughter split var status prediction
1             2              3         3      1          0
2             4              5        29      1          0
3             6              7       105      1          0
4             8              9        98      1          0
5            10             11        36      1          0
6            12             13        34      1          0
7             0              0         0     -1          2
If status is -1, that node is a leaf node of the decision tree, and prediction (1 or 2) gives the class it is classified to. We can get any tree's information using the above command, just by specifying the randomForest object 'rf' and the tree number n, where 1 ≤ n ≤ 100.
8. We can plot the error rates of the generated trees with:
plot(rf)
This produces the graph shown in figure (1) in the observations.
9. We can also find the features that contribute most to the decision trees using:
importance(rf)
This gives each feature and its mean decrease in Gini index, from which we can decide which features most affect our decision trees.
10. The randomForest object also provides many more components that can be used for further analysis.
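As a follow-up sketch (not part of the original run), the held-out testData from step 2 can be used to check the model; only standard randomForest functions are assumed:

# Predict on the 30% held-out split and tabulate the confusion matrix
pred <- predict(rf, newdata = testData)
table(observed = testData$Session_Continues, predicted = pred)

# Variable importance, sorted by mean decrease in Gini index
imp <- importance(rf)
head(imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), , drop = FALSE], 10)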
Relative absolute error                99.9541 %
Root relative squared error            99.9958 %
Coverage of cases (0.95 level)        100      %
Mean rel. region size (0.95 level)    100      %
Total Number of Instances            1305

=== Confusion Matrix ===
   a   b   <-- classified as
 734 124 |  a = FALSE
 301 136 |  b = TRUE
From this result we can see that we are getting a classification rate of around 68%-72%, which is a better rate than the decision tree.
Observation
Rule-based classification generates a set of rules on which the classification takes place; the set of rules from the generated model can be seen below:
q2_part
O/p: PART decision list
Num_EllenTracy_Product_Views <= 1 AND Num_main_assortment_Template_Views > 1 AND
From the result you can see the rules being generated; we show some sample rules out of a total of 182 generated rules. At the end of each rule the class to which the rule belongs is shown, along with the actual number of rows / number of misclassifications covered by that particular rule.
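For reproducibility, such a PART decision list can also be produced from R through the RWeka interface to WEKA; this is a sketch under the assumption that the trainData/testData split from the Random Forest section is reused:

library(RWeka)   # R interface to WEKA; PART builds a rule-based decision list
# Train a PART decision list on the training split
part_model <- PART(Session_Continues ~ ., data = trainData)
part_model                  # prints the decision list (182 rules in our run)
summary(part_model)         # training accuracy and confusion matrix
# Evaluate on the held-out split
evaluate_Weka_classifier(part_model, newdata = testData)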
Conclusion
From the above observations and results we can see that a successful rule-based model was built, with an accuracy above 70%, for identifying whether a user will continue his session or not.
Clustering
Objective: To group visitors of the website whose page-view patterns are similar and to identify their interests.
Approach: Clustering is a methodology in data analysis that can be used to group objects based on their similarities. We make use of the WEKA tool for this analysis.
Preprocessing:
1. Remove all spam data by deleting records with just one page view.
2. There are about 500+ dimensions, which is not feasible to analyze, so for dimensionality reduction:
a. Go to the Select attributes panel of WEKA.
b. Manual: remove all session data, browser information, and the most common pages.
c. Auto: calculate information gain and select the top 25 attributes (see the sketch below).
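Step (c) can also be sketched from R through RWeka's information-gain evaluator; the data frame q3 and the class column name Spend_over_12 here are placeholders, not names from the actual run:

library(RWeka)
# Rank attributes by information gain with respect to the class attribute
ig <- InfoGainAttributeEval(Spend_over_12 ~ ., data = q3)
# Keep the 25 highest-ranked attributes plus the class itself
top25 <- names(sort(ig, decreasing = TRUE))[1:25]
q3_reduced <- q3[, c(top25, "Spend_over_12")]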
Process:
Experiment I
In this experiment we make 5 clusters of the given instances (users) and analyze their purchase habits.
K-means
Steps:
1. Import the reduced dataset into WEKA.
2. Select SimpleKMeans.
3. Set the distance function to Euclidean.
4. Specify k (the number of clusters).
5. Click Start to generate the clusters (an equivalent R sketch is shown below).
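The same run can be approximated outside WEKA; here is a hypothetical base-R sketch, with q3_reduced.csv standing in for the exported reduced dataset:

# Sketch: reproducing the SimpleKMeans run (-N 5 -I 500 -S 10) in base R
q3 <- read.csv("q3_reduced.csv")

# Base kmeans needs numeric input: keep numeric columns and impute
# missing values with the column mean (WEKA uses mean/mode replacement)
num <- q3[sapply(q3, is.numeric)]
num[] <- lapply(num, function(x) { x[is.na(x)] <- mean(x, na.rm = TRUE); x })

set.seed(10)                                    # seed, cf. -S 10
km <- kmeans(num, centers = 5, iter.max = 500)  # cf. -N 5, -I 500
table(km$cluster)   # cluster sizes, cf. "Clustered Instances"
km$tot.withinss     # within-cluster sum of squared errors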
Results:
(A) Using Euclidean distance
=== Run information ===
Scheme:     weka.clusterers.SimpleKMeans -N 5 -A "weka.core.EuclideanDistance -R first-last" -I 500 -num-slots 1 -S 10
Relation:   q3_added-weka.filters.unsupervised.attribute.Remove-V-R7,10,9,102,127,112,66,87,56,70,111,106,26,59,20,18,104,113,17,1,54,14,115,63,73,15
Instances:  1781
Attributes: 26
            City, Customer_ID, US_State, Num_Sheer_Look_Product_Views, Num_CT_Waist_Control_Views, Num_PH_Category_Views, Num_main_shopping_cart_Template_Views, Num_Replenishable_Stock_Views, Num_account_Template_Views, Num_main_login2_Template_Views, Num_Sandal_Foot_Views, Num_HasDressingRoom_True_Views, Num_Legwear_Product_Views, Num_products_productDetailLegwear_Template_Views, Num_DonnaKaran_Product_Views, Num_AmericanEssentials_Product_Views, Num_Basic_Product_Views, Num_WDCS_Category_Views, Num_Oroblu_Product_Views, WhichDoYouWearMostFrequent, Num_products_Template_Views, Home_Market_Value, Num_WAS_Category_Views, Num_main_vendor_Template_Views, Num_main_freegift_Template_Views, Spend_over_$12_per_order_on_average
Test mode:  evaluate on training data

=== Clustering model (full training set) ===
kMeans
======
Number of iterations: 12
Within cluster sum of squared errors: 4789.607514501406
Missing values globally replaced with mean/mode

Time taken to build model (full training data): 0.14 seconds

=== Model and evaluation on training set ===
Clustered Instances
0   263 (15%)
1   273 (15%)
2   368 (21%)
3   484 (27%)
4   393 (22%)
Observation:
Cluster 0
o High income
o Spend avg > $12 => potential customers (value)
o Purchase nylon more than cotton (nylon is costlier than cotton)
o More men's products than the other clusters => this cluster might contain more men
o Frequent use of the search bar
o Affluent visitors, most of them with above-average home/asset values
Cluster 2
o General visitors
o Buy cheap products
Cluster 3
o Interested mostly in offers/free gift products
o Highest visit to checkout page => potential customer (frequency)
Cluster 4
o No special pattern observed
Experiment II
Here we validate whether page-view data can be used for identifying potential customers using clustering. The data is labeled by average purchase > $12, with 1368 instances labeled False and 413 labeled True.
=== Run information ===
Scheme:     weka.clusterers.SimpleKMeans -N 2 -A "weka.core.EuclideanDistance -R first-last" -I 500 -num-slots 1 -S 10
Relation:   q3_added-weka.filters.unsupervised.attribute.Remove-V-R7,10,9,102,127,112,66,87,56,70,111,106,26,59,20,18,104,113,17,1,54,14,115,63,73,15
Instances:  1781
Attributes: 26
            City, Customer_ID, US_State, Num_Sheer_Look_Product_Views, Num_CT_Waist_Control_Views, Num_PH_Category_Views, Num_main_shopping_cart_Template_Views, Num_Replenishable_Stock_Views, Num_account_Template_Views, Num_main_login2_Template_Views, Num_Sandal_Foot_Views, Num_HasDressingRoom_True_Views, Num_Legwear_Product_Views, Num_products_productDetailLegwear_Template_Views, Num_DonnaKaran_Product_Views, Num_AmericanEssentials_Product_Views, Num_Basic_Product_Views, Num_WDCS_Category_Views, Num_Oroblu_Product_Views, WhichDoYouWearMostFrequent, Num_products_Template_Views, Home_Market_Value, Num_WAS_Category_Views, Num_main_vendor_Template_Views, Num_main_freegift_Template_Views
Ignored:    Spend_over_$12_per_order_on_average
Test mode:  Classes to clusters evaluation on training data

=== Clustering model (full training set) ===
kMeans
======
Number of iterations: 5
Within cluster sum of squared errors: 4913.0928856548035
Missing values globally replaced with mean/mode

Time taken to build model (full training data): 0.05 seconds

=== Model and evaluation on training set ===
Clustered Instances
0    561 (31%)
1   1220 (69%)

Class attribute: Spend_over_$12_per_order_on_average
Classes to Clusters:
   0    1   <-- assigned to cluster
 402  966 | False
 159  254 | True

Cluster 0 <-- True
Cluster 1 <-- False

Incorrectly clustered instances: 656.0 (36.8332 %)
Observation:
Only about 63% of the data is correctly clustered, as the data is heavily biased toward the False (< $12 average spending) class. Nevertheless, clustering gives us good insight into purchase and page-view patterns.
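The same classes-to-clusters check can be sketched in base R, reusing the q3 and num objects from the Experiment I sketch (the mangled column name below is an assumption; read.csv rewrites characters such as $ in headers):

# The class attribute is a factor, so it is not part of the numeric matrix 'num'
set.seed(10)
km2 <- kmeans(num, centers = 2, iter.max = 500)
# Cross-tabulate cluster assignments against the true labels
table(cluster = km2$cluster,
      class = q3$Spend_over_.12_per_order_on_average)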
Association Rules
Objective: To identify some interesting patterns in the users' page views, and also the killer pages (pages after which visitors tend to abandon the site).
Description:
Association rule learning is a popular and well-researched method for discovering interesting relations between variables in large databases. It is intended to identify strong rules discovered in databases using different measures of interestingness. The measures used in our analysis are lift, confidence, and support.
Procedure:
1. To perform association rule mining, we first converted the dataset into a binary matrix indicating, for each session, whether the visitor viewed each page or not.
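A minimal sketch of this binarization (question1 stands in for whichever columns were kept for this analysis):

# Convert page-view counts to logical flags: TRUE if the page was viewed.
# Non-numeric columns are left unchanged (they should be factors before
# the conversion to transactions below).
assoc <- as.data.frame(lapply(question1, function(x)
  if (is.numeric(x)) x > 0 else x))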
2. The "arules" package provides the association rule functionality:
library(arules)
3. Now we load the converted dataset into R for rule generation; we used only the important columns, based on the mean decrease in Gini index obtained from the randomForest results.
4. After loading the data, we convert it to transactions with the following command:
dataTrans <- as(assoc, "transactions")
5. Now we apply the "apriori" algorithm to generate the rules; a parameter list giving the support, confidence, and minimum length of each rule can be passed in.
rules <- apriori(dataTrans)
Called like this, it generates all rules using the default min-support of 0.1 and min-confidence of 0.8, including all the subset rules based on the frequent itemsets of attributes.
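To make the thresholds explicit (a sketch; the values mirror apriori's documented defaults, plus an assumed minimum rule length of 2 to skip trivial one-item rules):

rules <- apriori(dataTrans,
                 parameter = list(support = 0.1, confidence = 0.8, minlen = 2))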
6. To see how many rules were generated, simply print the object:
rules
Around 377,564 rules were generated, out of which we were interested only in rules having LEAVE or CONTINUE on the RHS, to check whether a person will continue or leave after seeing certain pages.
7. From all the generated rules we retrieved a subset that showed some interesting patterns.
Observation:
We were able to see some interesting patterns in the generated rules. Most of the visitors in our dataset were female, and accordingly most of the rules contained "NUM_OF_WOMEN_PRODUCT_VIEWS" in almost every transaction. Some of the brands were rarely or never visited according to the rules. We were also able to identify some of the killer pages based on user preferences: after visiting certain pages, users would withdraw at that same page every time.
Results: Some of the rules, sorted by their "lift" values, are shown below:
rulesLeave <- subset(rules, subset = rhs %pin% "LEAVE")  # get the rules with LEAVE on the RHS
inspect(head(sort(rulesLeave,by="lift"),20))
O/p:
lhs rhs support confidence lift
1 {CONTINUE=YES} => {LEAVE=NO} 0.425 1 2.352941
2 {Num_Women_Product_Views=Yes,
CONTINUE=YES} => {LEAVE=NO} 0.110 1 2.352941
17 {Num_Women_Product_Views=Yes,
Num_Men_Product_Views=No,
CONTINUE=YES} => {LEAVE=NO} 0.100 1 2.352941
18 {Num_MAS_Category_Views=No,
Num_Women_Product_Views=Yes,
CONTINUE=YES} => {LEAVE=NO} 0.105 1 2.352941
19 {Num_MDS_Category_Views=No,
Num_Women_Product_Views=Yes,
CONTINUE=YES} => {LEAVE=NO} 0.105 1 2.352941
20 {Num_MCS_Category_Views=No,
Num_Women_Product_Views=Yes,
CONTINUE=YES} => {LEAVE=NO} 0.110 1 2.352941
Some of the interesting rules are shown above. A few of the generated rules picked at random are:
inspect(head(rulesLeave, 6))
O/p:
lhs rhs support confidence lift
3 {Num_Women_Product_Views=Yes,
CONTINUE=YES} => {LEAVE=NO} 0.110 1 2.352941
4 {Num_Women_Product_Views=Yes,
CONTINUE=NO} => {LEAVE=YES} 0.100 1 1.739130
5 {Num_Women_Product_Views=No,
CONTINUE=YES} => {LEAVE=NO} 0.315 1 2.352941
6 {Num_CT_Waist_Control_Views=No,
CONTINUE=YES} => {LEAVE=NO} 0.360 1 2.352941
For CONTINUE, the analogous subset gives some examples:
rulesContinue <- subset(rules, subset = rhs %pin% "CONTINUE")  # rules with CONTINUE on the RHS
inspect(head(rulesContinue, 4))
O/p:
lhs rhs support confidence lift
3 {Num_Women_Product_Views=Yes,
LEAVE=NO} => {CONTINUE=YES} 0.110 1 2.352941
4 {Num_Women_Product_Views=Yes,
LEAVE=YES} => {CONTINUE=NO} 0.100 1 1.739130
5 {Num_Women_Product_Views=No,
LEAVE=NO} => {CONTINUE=YES} 0.315 1 2.352941
6 {Num_CT_Waist_Control_Views=No,
LEAVE=NO} => {CONTINUE=YES} 0.360 1 2.352941
Conclusion: We were able to find some interesting patterns in users' page views and to identify some of the killer pages, such as "Num_CT_Waist_Control_Views" and "Num_MAS_Category_Views".