Analysing Clickstream Data: From Anomaly Detection to Visitor Profiling
Peter I. Hofgesang, Wojtek Kowalczyk
ECML/PKDD Discovery Challenge, October 7, 2005, Porto, Portugal
Dec 13, 2015
Web server data
• 7 internet shops (home electronics)
• 80.000 visitors (IP-addresses) in 25 days
• 0.5 million sessions
• 3 million clicks (records in a log file)
• Example record:
11;1076262912;193.170.198.122;eb5cbe50997fcb7f9155c6c194c832a8;/znacka/?c=162&tisk=ano;http://www.google.com./search?hl=cs&q=Sennheiser+HD+650&btnG=Vyhledat+Googlem&lr=lang_cs
• Objective: discover interesting patterns!
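The example record can be split into fields as sketched below. The slides do not label the fields, so the interpretation (shop identifier, Unix timestamp, IP address, session identifier, requested page, referrer) and all names such as `Click` and `parse_record` are assumptions made for illustration.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class Click(NamedTuple):
    # Field names are hypothetical; the slides only show the raw record.
    shop_id: int
    timestamp: datetime
    ip: str
    session_id: str
    page: str
    referrer: str


def parse_record(line: str) -> Click:
    # Split on the first five semicolons only, so that any ';' inside the
    # last field (assumed to be the referrer) is preserved.
    shop_id, unix_time, ip, session_id, page, referrer = (
        line.rstrip("\n").split(";", 5)
    )
    return Click(
        shop_id=int(shop_id),
        timestamp=datetime.fromtimestamp(int(unix_time), tz=timezone.utc),
        ip=ip,
        session_id=session_id,
        page=page,
        referrer=referrer,
    )


record = (
    "11;1076262912;193.170.198.122;eb5cbe50997fcb7f9155c6c194c832a8;"
    "/znacka/?c=162&tisk=ano;"
    "http://www.google.com./search?hl=cs&q=Sennheiser+HD+650"
    "&btnG=Vyhledat+Googlem&lr=lang_cs"
)
click = parse_record(record)
print(click.shop_id, click.timestamp.isoformat(), click.page)
```

Limiting the split to five separators keeps the parser tolerant of semicolons in the referrer URL; a semicolon inside the page field would still break it.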
Data Mining Process
[Figure: data mining process pipeline. Stages as labelled: INPUT DATA (web access log data), DATABASE, DATA PREPARATION / PREPROCESSING, SESSION IDENTIFICATION & DETECTION OF ANOMALIES, PROFILE MINING. Labelled outputs: basic statistics, detection of anomalies, shop information, identified sessions (based on a new definition), mixture model (probability over content types), tree of profile sequences.]
Anomalies/Strange things I
• Multiple IP-addresses per session
  – 2 IP-addresses: 3.051 sessions
  – 3 IP-addresses: 362 sessions
  – 4 IP-addresses: 113 sessions
  – …
  – 22 IP-addresses: 1 session
  – Some sessions involve IPs from different countries
• A few sessions (12) refer to multiple shops
Anomalies/Strange things II
• Sessions with long duration
  – 476 sessions longer than 24 hours (up to 18 days)
• Very intensive sessions
  – 2.865 sessions with more than 100 visited pages
  – 19 sessions with more than 1.000 visited pages
  – 2 sessions with more than 10.000 visited pages
• Frequent IP-addresses with short sessions
  – E.g.: 29.320 sessions in less than 20 hours from 147.229.205.80
• “Parallel sessions”
  – Overlapping sequences of clicks from the same IP to the same shop within a short period with multiple SIDs (Opening a new window? Making a transaction?)
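Checks of this kind can be sketched as simple aggregations over the original session identifiers. The toy clicks and all variable names below are hypothetical. The 24-hour and 100-page thresholds come from the slides, but this is not necessarily how the authors computed their counts.

```python
from collections import defaultdict

# Hypothetical toy clicks: (session_id, ip, unix_time, shop_id).
clicks = [
    ("s1", "10.0.0.1", 0, 11),
    ("s1", "10.0.0.2", 60, 11),
    ("s1", "10.0.0.1", 90_000, 11),  # more than 24 hours after the first click
    ("s2", "10.0.0.3", 100, 12),
    ("s2", "10.0.0.3", 160, 12),
]

# Collect, per session identifier, the IPs, shops and click times seen.
ips = defaultdict(set)
shops = defaultdict(set)
times = defaultdict(list)
for sid, ip, t, shop in clicks:
    ips[sid].add(ip)
    shops[sid].add(shop)
    times[sid].append(t)

DAY = 24 * 60 * 60  # seconds

multi_ip = {sid for sid, s in ips.items() if len(s) > 1}
multi_shop = {sid for sid, s in shops.items() if len(s) > 1}
long_running = {sid for sid, ts in times.items() if max(ts) - min(ts) > DAY}
intensive = {sid for sid, ts in times.items() if len(ts) > 100}

print(multi_ip, multi_shop, long_running, intensive)
```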
Anomalies/Strange things III
• Sequences of short sessions that together form a single session
Example: clicks from 62.209.194.163 (31 Jan 04)
09:40:09 /dt/?c=13654;http://www.shop5.cz/
09:41:21 /dt/param.php?id=115;
09:41:21 /;
09:41:37 /ls/?id=20;http://www.shop5.cz/dt/?c=13654
09:41:42 /;
09:42:24 /ls/?&id=20&view=1,2,3,8&pozice=20;http://www.shop5.cz/ls/…
09:42:25 /;
09:42:48 /ls/?&id=20&view=1,2,3,8;http://www.shop5.cz/ls/?&id=20& …
09:42:48 /;
09:42:53 /ls/?&id=20&view=1,2,3,8&pozice=40;http://www.shop5.cz/ls/…
Each click has a different session identifier!
Fixing the data
• A new definition of “session”:
A chronologically ordered sequence of “clicks” from the same IP-address to the same shop with no gaps longer than 30 minutes
• Sessions longer than 50 clicks are ignored (12.000)
• Number of sessions dropped from 522.410 to 281.153
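The new session definition can be sketched as follows: sort clicks by IP-address, shop and time, then start a new session whenever the gap to the previous click exceeds 30 minutes. The tuple layout and the name `sessionize` are assumptions for illustration, and the slides do not say whether a gap of exactly 30 minutes splits a session; here it does not.

```python
from itertools import groupby

GAP = 30 * 60     # maximum allowed gap between consecutive clicks, in seconds
MAX_CLICKS = 50   # sessions longer than this are ignored


def sessionize(clicks):
    """Group (ip, shop_id, unix_time, page) tuples into sessions."""
    sessions = []
    # Sort so that clicks of one (ip, shop) pair are adjacent and in time order.
    ordered = sorted(clicks, key=lambda c: (c[0], c[1], c[2]))
    for _, group in groupby(ordered, key=lambda c: (c[0], c[1])):
        current = []
        for click in group:
            # A gap longer than GAP closes the running session.
            if current and click[2] - current[-1][2] > GAP:
                sessions.append(current)
                current = []
            current.append(click)
        sessions.append(current)
    # Drop very long sessions, as on the slide.
    return [s for s in sessions if len(s) <= MAX_CLICKS]


# Hypothetical toy clicks.
toy = [
    ("1.1.1.1", 5, 0, "/"),
    ("1.1.1.1", 5, 600, "/a"),
    ("1.1.1.1", 5, 2401, "/b"),  # 1801 s after the previous click: new session
    ("1.1.1.1", 6, 10, "/"),     # same IP, different shop: separate session
    ("2.2.2.2", 5, 5, "/"),
]
toy_sessions = sessionize(toy)
print(sorted(len(s) for s in toy_sessions))
```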
Old and New Sessions
Session length (clicks)   Count (old)   Count (new)
 1                        318.523       65.258
 2                         24.762       31.821
 3                         17.353       18.828
 4                         15.351       16.332
 5                         15.361       15.509
 6                         13.455       13.448
 7                         10.958       10.883
 8                          9.045        9.095
 9                          7.939        8.070
10                          7.028        7.091
...                        ...          ...
Visitor Profiling
Motivation: On the internet, each shop is just “one click away”. A user who is not satisfied with the service simply moves on to the next shop and will likely never return.
Visitor Profiling Scheme
I. Clustering of user sessions
II. Analysis/interpretation of the clusters
III. Assign a cluster label to each session
IV. Analysis of the profile sequences
Clustering
• Cadez et al. (2001) - predictive profiles from historical transaction data
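• Mixture of multinomials:

$$p(y_{ij}) = \sum_{k=1}^{K} \alpha_k \prod_{c=1}^{C} \theta_{kc}^{\,n_{ijc}}$$

• Full data likelihood:

$$p(D \mid \Theta) = \prod_{i=1}^{N} p(D_i \mid \Theta)$$

• The unknown parameters $\{\alpha_1, \ldots, \alpha_K\}$ and $\{\theta_1, \ldots, \theta_K\}$ are estimated by the expectation maximization (EM) algorithm.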
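A minimal EM sketch for a mixture of multinomials is given below. It assumes each session is summarised by a vector of counts over C content types, treats sessions as independent draws (ignoring the visitor index), and adds a small smoothing constant to avoid zero probabilities. The function name, the toy counts and these simplifications are assumptions, not the authors' implementation.

```python
import math
import random


def em_multinomial_mixture(counts, K, iters=100, seed=0, smooth=1e-3):
    """Fit a K-component mixture of multinomials to rows of category counts.

    Returns (alpha, theta, resp): mixture weights, per-component category
    probabilities, and per-session membership probabilities.
    """
    rng = random.Random(seed)
    N, C = len(counts), len(counts[0])
    alpha = [1.0 / K] * K
    # Random initialisation of theta; each row is normalised to sum to 1.
    theta = []
    for _ in range(K):
        row = [rng.random() + 0.5 for _ in range(C)]
        total = sum(row)
        theta.append([v / total for v in row])

    resp = []
    for _ in range(iters):
        # E-step: membership probabilities, computed in log space.
        resp = []
        for n in counts:
            logs = [
                math.log(alpha[k])
                + sum(n[c] * math.log(theta[k][c]) for c in range(C))
                for k in range(K)
            ]
            m = max(logs)
            w = [math.exp(v - m) for v in logs]
            s = sum(w)
            resp.append([v / s for v in w])

        # M-step: re-estimate smoothed weights and category probabilities.
        for k in range(K):
            nk = sum(r[k] for r in resp)
            alpha[k] = (nk + smooth) / (N + K * smooth)
            row = [
                smooth + sum(r[k] * n[c] for r, n in zip(resp, counts))
                for c in range(C)
            ]
            total = sum(row)
            theta[k] = [v / total for v in row]
    return alpha, theta, resp


# Hypothetical toy data: two groups of sessions over three content types.
counts = [[9, 1, 0], [8, 2, 0], [10, 0, 1], [0, 1, 9], [1, 0, 8], [0, 2, 10]]
alpha, theta, resp = em_multinomial_mixture(counts, K=2)
# Hard cluster label per session: the component with the highest membership.
labels = [max(range(2), key=r.__getitem__) for r in resp]
print(labels)
```

Taking the most probable component per session gives the cluster label used in step III of the profiling scheme. EM only finds a local optimum, so in practice several random restarts would be compared.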
Interpretation of the clusters
Profile 1: General overview of the products
Profile 2: Focused search
Profile 3: Potential buyers
Profile 4: Parameter based search
The transitions of profiles
From \ To   P1       P2       P3       P4
P1          0.7208   0.1592   0.0621   0.0579
P2          0.5908   0.2828   0.0710   0.0553
P3          0.5022   0.1616   0.2873   0.0489
P4          0.6000   0.1702   0.0685   0.1613
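Each row of the table sums to approximately 1, which suggests that an entry is the fraction of sessions with the row profile that are followed by a session with the column profile. Under that reading, a matrix of this kind could be estimated from per-visitor profile sequences as sketched below; the function name and the toy sequences are hypothetical.

```python
from collections import Counter


def transition_matrix(sequences, labels):
    """Row-normalised counts of label -> next-label transitions."""
    pair_counts = Counter()
    for seq in sequences:
        # Count each pair of consecutive profile labels.
        pair_counts.update(zip(seq, seq[1:]))
    matrix = {}
    for a in labels:
        row_total = sum(pair_counts[(a, b)] for b in labels)
        matrix[a] = {
            b: (pair_counts[(a, b)] / row_total if row_total else 0.0)
            for b in labels
        }
    return matrix


# Hypothetical profile sequences, one per visitor.
sequences = [["P1", "P1", "P2"], ["P2", "P1"], ["P1", "P3", "P3"]]
matrix = transition_matrix(sequences, ["P1", "P2", "P3", "P4"])
for profile, row in matrix.items():
    print(profile, row)
```

Rows with no observed transitions (here P4) are left as all zeros rather than normalised.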
Tree of user profiles
Tree of potential buyers
Conclusion
• We spotted several anomalies, so background information about pre-processing & data preparation is important
• Important features were missing (who is a buyer?)
• Four clear user profiles