Nov, 2002 Banerjee and Ghosh 1 Characterizing Visitors to a Website Across Multiple Sessions NGDM Workshop, Nov 2002 Arindam Banerjee Joydeep Ghosh

Dec 15, 2015


Paige Wilken

Characterizing Visitors to a Website Across Multiple Sessions

NGDM Workshop, Nov 2002

Arindam Banerjee, Joydeep Ghosh


Motivation

Why Characterize or Predict web user behavior?

• Site-centric view: Personalization, sticky websites

• User-centric view: personal agents for information acquisition

• Universalist approaches: PageRank, web metrics, …


Clustering Users from Web Logs

• Wide variety of web behavior: segment users based on surfing behavior as a first step to further analysis.

• User: set of sessions

• Session: sequence of (page ID, time spent on that page) tuples

– How to cluster sets of sequences?
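The user/session structure described above can be written down as plain Python types. This is only an illustrative sketch: the type names (`PageVisit`, `Session`, `UserSessions`), the user key, and the sample values are hypothetical and do not come from the talk.

```python
from typing import Dict, List, Tuple

# A page visit is a (page ID, time spent on that page) tuple.
PageVisit = Tuple[str, float]
# A session is an ordered sequence of page visits.
Session = List[PageVisit]
# A user is a set of sessions; here keyed by a hypothetical user ID.
UserSessions = Dict[str, List[Session]]

# Made-up data, only to show the shape of the structure.
users: UserSessions = {
    "user_a": [
        [("/", 10), ("/articles", 120)],
        [("/coffeehouse", 45)],
    ],
}
```

Clustering users therefore means clustering sets of sequences, which is the question the slide raises.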


The Approach

• Cluster Sessions

– Session Similarity Measure

– Session Similarity Graph

• Outlier Detection

– Graph Partitioning

• Create a Cluster Space

• Cluster users in this Space


A Similarity Measure for Sessions

1. Overlap between two sessions represented by the longest common subsequence (LCS)

2. Obtain session similarity using LCS and time information:

session similarity = (time similarity in LCS) x (importance of LCS)

• The similarity component: average min-max similarity for each page in the LCS

• The importance component: average of the fraction of overall session time spent in the LCS

• Both components lie in [0, 1]
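One way to write the measure as a formula is given below. The notation is an assumption introduced here, not taken from the slides: $L$ is the length of the LCS, $t_1^{(l)}$ and $t_2^{(l)}$ are the times the two sessions spend on the $l$-th LCS page, and $T_1$, $T_2$ are the total times of the two sessions.

```latex
\[
\mathrm{sim}(s_1, s_2)
  = \underbrace{\frac{1}{L} \sum_{l=1}^{L}
      \frac{\min\bigl(t_1^{(l)}, t_2^{(l)}\bigr)}{\max\bigl(t_1^{(l)}, t_2^{(l)}\bigr)}}_{\text{similarity component}}
    \times
    \underbrace{\frac{1}{2} \left(
      \frac{\sum_{l=1}^{L} t_1^{(l)}}{T_1} + \frac{\sum_{l=1}^{L} t_2^{(l)}}{T_2}
    \right)}_{\text{importance component}}
\]
```

Each min/max ratio and each time fraction is between 0 and 1, so the product is as well.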


Session Clustering

• Find the pairwise similarity values between all pairs of sessions; record only similarities above a threshold

• Incrementally construct similarity graph G

– the vertices are the sessions, the edge weights are the session similarity values

– no isolated vertices (discard “outliers”)

• Balanced Graph Partitioning

– we used Metis [Karypis, Kumar]
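The graph construction step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `build_similarity_graph`, the toy `set_overlap` similarity, and the threshold value are all assumptions, and the partitioning step itself (done with Metis in the talk) is not shown.

```python
from itertools import combinations

def build_similarity_graph(items, similarity, threshold):
    """Return (edges, outliers) for a thresholded similarity graph.

    edges maps an index pair (i, j), i < j, to its similarity weight;
    outliers lists the indices of vertices left without any edge.
    """
    edges = {}
    for (i, a), (j, b) in combinations(enumerate(items), 2):
        weight = similarity(a, b)
        if weight > threshold:  # record only similarities above the threshold
            edges[(i, j)] = weight
    connected = {v for pair in edges for v in pair}
    outliers = [i for i in range(len(items)) if i not in connected]
    return edges, outliers

def set_overlap(a, b):
    """Toy stand-in for the session similarity measure (hypothetical)."""
    return len(a & b) / len(a | b)

toy_sessions = [{"/", "/articles"}, {"/", "/articles", "/authors"}, {"/contests"}]
edges, outliers = build_similarity_graph(toy_sessions, set_overlap, threshold=0.5)
```

Comparing all pairs takes quadratic time in the number of sessions, which fits the later remark that forming the similarity graph is slow.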


The Cluster Space

• Given: each session assigned to one of k clusters (sets)

• Sessions of a user are distributed among the k sets

– vector u = [u1 u2 … uk]T where ui = number of sessions of the user belonging to cluster i

• Stage II : User Clustering

– find pairwise similarity values using the extended Jaccard measure

– partition similarity graph

• Gives l user clusters and a set of outlier users
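The mapping into the cluster space and the user-level similarity can be sketched as below. The slides do not spell out the extended Jaccard formula; the code assumes the common definition x·y / (‖x‖² + ‖y‖² − x·y). The function names and the convention of returning 0.0 for two all-zero vectors are also assumptions.

```python
def user_vector(session_labels, k):
    """Count how many of a user's sessions fall in each of the k session clusters."""
    u = [0] * k
    for label in session_labels:  # labels are cluster indices 0 .. k-1
        u[label] += 1
    return u

def extended_jaccard(x, y):
    """Extended Jaccard similarity, assumed to be x.y / (|x|^2 + |y|^2 - x.y)."""
    dot = sum(a * b for a, b in zip(x, y))
    denom = sum(a * a for a in x) + sum(b * b for b in y) - dot
    return dot / denom if denom else 0.0  # assumption: two zero vectors score 0

u = user_vector([0, 2, 2], k=3)  # one session in cluster 0, two in cluster 2
v = user_vector([0, 1], k=3)     # one session each in clusters 0 and 1
similarity_uv = extended_jaccard(u, v)
```

The resulting pairwise values can then be thresholded and partitioned in the same way as the session similarity graph.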


The Dataset : Sulekha.com


Dataset details

• Logs over a one month period

• Raw log size 184 MB

• 453,953 files accessed

• 37,753 sessions in all

• 23,310 sessions after some preprocessing/filtering

• 2,493 users


Results : Session Clusters

Cluster 1 – interest in coffeehouse, contests
-(/,12)(/movies,6)(/contests,178)
-(/contests,142)
-(/coffeehouse,5)(/contests,183)
-(/contests,172)
-(/,10)(/contests,143)

Cluster 2 – glance through home, articles
-(/,22)(/articles,22)
-(/,20)(/articles,20)
-(/,21)(/articles,21)
-(/,19)(/articles,19)
-(/,20)(/articles,19)

Cluster 3 – interest in author, articles
-(/,148)(/authors,6)(/articles,77)
-(/authors,290)(/articles,290)
-(/authors,295)(/articles,295)
-(/,33)(/authors,90)(/articles,475)
-(/,32)(/authors,91)(/articles,425)

Cluster 4 – read articles
-(/,39)(/articles,98)(/misc,17)(/articles,2649)
-(/,9)(/articles,2666)
-(/authors,26)(/articles,2561)
-(/misc,20)(/articles,77)(/misc,32)(/articles,43)(/authors,16)(/articles,2373.1)


Results : User Clusters

• user : [(128.194.xxx.xxx)]
– (/authors,3)(/articles,129)
– (/authors,8)(/articles,8)
– (/authors,80)(/articles,2141)

• user : [(209.30.xxx.xxx)]
– (/home,77)(/articles,111)(/authors,93)(/articles,629)(/misc,58)(/coffeehouse,75)(/wo-men,967)
– (/articles,2627)

• user : [(171.68.xxx.xxx)]
– (/home,323)(/articles,24)(/authors,45)(/articles,1290)

A user cluster : people who read the articles


Results : User Clusters

• user : [(152.170.xxx.xxx)]
– (/home,21)(/wo-men,1075)(/philosophy,52)

• user : [(209.244.xxx.xxx)]
– (/home,5)(/coffeehouse,94)(/wo-men,75)(/movies,75)(/wo-men,31)
– (/home,52)(/philosophy,67)(/wo-men,955)(/philosophy,26)(/coffeehouse,382)(/biztech,298)(/philosophy,290)
– (/home,17)(/coffeehouse,12)(/wo-men,15)(/personal,6)(/biztech,94)(/coffeehouse,2)(/philosophy,1093)

A user cluster : people interested in wo-men, philosophy, coffeehouse


Results : User Clusters

• user : [(216.154.xxx.xxx)]
– (/coffeehouse,12)(/biztech,25)(/books,48)
– (/coffeehouse,13)(/biztech,26)(/books,19)

• user : [(204.220.xxx.xxx)]
– (/coffeehouse,162)
– (/coffeehouse,40)

• user : [(32.100.xxx.xxx)]
– (/coffeehouse,12)(/contests,12)
– (/coffeehouse,43)(/contests,44)

A user cluster : people interested in coffeehouse – bookmarked it!


Result Visualization using CLUSION [Strehl & Ghosh 01]

[Figure: two panels, Sessions and Users]


Conclusions

• Segmentation: a basic pre-processing step for Web Mining

• Similarity measure + Cluster Space Concept: applicable to clustering of sets of any data-structure

• For certain websites, time spent on the pages matters

– not handled by current commercial tools

• Outlier detection before clustering is important

• Results QA-ed by human subjects

– Results for clusters & outliers at both levels were subjectively good

• No good way to find cluster quality analytically

• Formation of similarity graph is a slow process


Future Work

• Improve the present method by:

– using cluster seeds for cluster growing

– using alternative clustering algorithms for each stage

– studying the effect of thresholds, number of clusters on performance

– studying the importance of order of page-visits

– studying the importance of balanced clustering


Backup


Issues : Choice of Parameters

• Number of session clusters, k, should be chosen appropriately

• Thresholds for forming session & user similarity graphs:

– threshold value should be chosen after looking at the distribution of edge weights
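Looking at the distribution of edge weights could be done with a simple histogram, as in the sketch below. It assumes the weights lie in [0, 1]; the function names, the number of bins, and the sample weights are hypothetical.

```python
def edge_weight_histogram(weights, bins=10):
    """Count edge weights in equal-width bins over [0, 1]."""
    counts = [0] * bins
    for w in weights:
        index = min(int(w * bins), bins - 1)  # a weight of exactly 1.0 goes in the last bin
        counts[index] += 1
    return counts

def fraction_kept(weights, threshold):
    """Fraction of candidate edges that survive a given threshold."""
    if not weights:
        return 0.0
    return sum(1 for w in weights if w > threshold) / len(weights)

sample_weights = [0.05, 0.15, 0.15, 0.95, 1.0]  # made-up values
histogram = edge_weight_histogram(sample_weights)
kept = fraction_kept(sample_weights, threshold=0.5)
```

A threshold placed in a sparse region of the histogram would separate weak edges from strong ones; how sparse is sparse enough remains a judgment call.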


Related Work

• Research in Web Mining:

– Extraction of navigational patterns : Spiliopoulou, Faulstich

– Ordering relationships : Mannila, Meek

– Surfing prediction : Pitkow, Pirolli

– Clustering web usage sessions : Fu, Sandhu, Shih


Example

• Sessions :

– Session1 = [(a,8) (b,100) (d,8) (c,5) (e,23) (a,5)]

– Session2 = [(b,5) (d,12) (f,1) (a,7) (c,5)]

• LCS pages = [(b)(d)(c)]

• Corresponding index and time sequences:

– Index1 = [(1)(2)(3)], Time1 = [(100) (8) (5)]

– Index2 = [(0)(1)(4)], Time2 = [(5) (12) (5)]

• Similarity over each LCS page: min/max of the two times

– Similarity on page b = 5/100 = 0.05

– Similarity on page d = 8/12 = 0.67

– Similarity on page c = 5/5 = 1.00


Example (contd.)

• The similarity component = (0.05 + 0.67 + 1.00)/3 = 0.57

• The importance component:

– Fraction of time spent in the LCS by Session1 = 113/149 = 0.76

– Fraction of time spent in the LCS by Session2 = 22/30 = 0.73

– The mean = (0.76 + 0.73)/2 = 0.75

• The overall similarity = 0.57 x 0.75 = 0.43
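The worked example can be reproduced with a short sketch of the measure. The function names are hypothetical, and one detail is an assumption: the two sessions have more than one longest common subsequence (both b, d, c and b, d, a have length 3), and the slides do not say how ties are broken. The backtracking rule below happens to select b, d, c, matching the example. Time values are assumed to be positive.

```python
def lcs_index_pairs(pages_a, pages_b):
    """Return matched index pairs (i, j) of one longest common subsequence."""
    n, m = len(pages_a), len(pages_b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if pages_a[i - 1] == pages_b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        if pages_a[i - 1] == pages_b[j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif dp[i - 1][j] >= dp[i][j - 1]:  # tie-break (assumption): drop from the first session
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return pairs

def session_similarity(s1, s2):
    """(average min/max time ratio over LCS pages) x (mean fraction of session time in the LCS)."""
    pairs = lcs_index_pairs([p for p, _ in s1], [p for p, _ in s2])
    if not pairs:
        return 0.0
    t1 = [s1[i][1] for i, _ in pairs]
    t2 = [s2[j][1] for _, j in pairs]
    time_sim = sum(min(a, b) / max(a, b) for a, b in zip(t1, t2)) / len(pairs)
    total1 = sum(t for _, t in s1)
    total2 = sum(t for _, t in s2)
    importance = (sum(t1) / total1 + sum(t2) / total2) / 2
    return time_sim * importance

session1 = [("a", 8), ("b", 100), ("d", 8), ("c", 5), ("e", 23), ("a", 5)]
session2 = [("b", 5), ("d", 12), ("f", 1), ("a", 7), ("c", 5)]
pairs = lcs_index_pairs([p for p, _ in session1], [p for p, _ in session2])
overall = session_similarity(session1, session2)
```

With these inputs the matched indices are (1, 0), (2, 1), (3, 4) and the overall similarity rounds to 0.43, in line with the example. A different tie-breaking rule could pick b, d, a instead and give a different value.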


Issues : Session Resolution

• Generate coarse resolution paths making use of the concept hierarchy of the website

• Reduces computations; Increases interpretability of results

Original path: (/authors/ramesh_mahadevan.html,3)(/articles/rm_phattas.html,75)(/articles/rm_desidads.html,39)
Concept-level path: (/authors,3)(/articles,114)

Original path: (/authors/arun_sampath.html,109)(/philosophy/messages/1951.html,102)(/philosophy/messages/1953.html,46)(/,3)(/philosophy/messages/1954.html,69)
Concept-level path: (/authors,109)(/philosophy,148)(/,3)(/philosophy,69)
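The coarsening shown in these two examples can be sketched as follows. The sketch assumes that the concept of a page is its top-level directory and that consecutive visits to the same concept are merged by adding their times; both rules are inferred from the examples, and the function names are hypothetical.

```python
def concept_of(path):
    """Map a page path to its top-level concept, e.g. '/articles/x.html' -> '/articles'."""
    first = path.strip("/").split("/")[0]
    return "/" + first  # the site root '/' maps to itself

def to_concept_path(session):
    """Collapse a session to concept level, merging consecutive visits to the same concept."""
    result = []
    for path, time_spent in session:
        concept = concept_of(path)
        if result and result[-1][0] == concept:
            result[-1] = (concept, result[-1][1] + time_spent)
        else:
            result.append((concept, time_spent))
    return result

original_1 = [
    ("/authors/ramesh_mahadevan.html", 3),
    ("/articles/rm_phattas.html", 75),
    ("/articles/rm_desidads.html", 39),
]
original_2 = [
    ("/authors/arun_sampath.html", 109),
    ("/philosophy/messages/1951.html", 102),
    ("/philosophy/messages/1953.html", 46),
    ("/", 3),
    ("/philosophy/messages/1954.html", 69),
]
concept_1 = to_concept_path(original_1)
concept_2 = to_concept_path(original_2)
```

Only adjacent visits are merged, so the second example keeps two separate /philosophy entries on either side of the visit to the root page.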


Comments

• Results QA-ed by human subject

– Results for clusters & outliers at both levels were subjectively good

– No good way to find cluster quality analytically

• Clustering algorithms for the two stages

– Stage I : Graph partitioning works well for large sparse graphs, so it is desirable in this stage

– Stage II : Since the space is not high-dimensional, any reasonable clustering algorithm should be adequate

• Cluster space

– Gives a general framework for mapping any non-vector clustering problem to an equivalent vector clustering problem