Analysing Clickstream Data: From Anomaly Detection to Visitor Profiling
Peter I. Hofgesang (hpi@few.vu.nl), Wojtek Kowalczyk (wojtek@few.vu.nl)
ECML/PKDD Discovery Challenge, October 7, 2005, Porto, Portugal




Web server data

• 7 internet shops (home electronics)
• 80.000 visitors (IP-addresses) in 25 days
• 0.5 million sessions
• 3 million clicks (records in a log file)
• Example record:

11;1076262912;193.170.198.122;eb5cbe50997fcb7f9155c6c194c832a8;/znacka/?c=162&tisk=ano;http://www.google.com./search?hl=cs&q=Sennheiser+HD+650&btnG=Vyhledat+Googlem&lr=lang_cs

• Objective: discover interesting patterns!
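The example record can be split into typed fields. The sketch below is only an illustration: the slides do not name the fields, so the names (shop_id, timestamp, ip, session_id, page, referrer), their order, and the reading of the second field as a Unix time in UTC are assumptions based on the record's appearance.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Click:
    # Field names are assumptions; the slides show only the raw record.
    shop_id: int
    timestamp: datetime
    ip: str
    session_id: str
    page: str
    referrer: str


def parse_click(line: str) -> Click:
    # Split into at most 6 fields so a ';' inside the referrer is preserved.
    shop, unixtime, ip, sid, page, referrer = line.rstrip("\n").split(";", 5)
    when = datetime.fromtimestamp(int(unixtime), tz=timezone.utc)
    return Click(int(shop), when, ip, sid, page, referrer)


record = (
    "11;1076262912;193.170.198.122;eb5cbe50997fcb7f9155c6c194c832a8;"
    "/znacka/?c=162&tisk=ano;"
    "http://www.google.com./search?hl=cs&q=Sennheiser+HD+650"
    "&btnG=Vyhledat+Googlem&lr=lang_cs"
)
click = parse_click(record)
print(click.shop_id, click.timestamp.isoformat(), click.page)
```

Under these assumptions the record is a click on shop 11 in February 2004, arriving from a Google search for a specific headphone model.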


Data Mining Process

[Process diagram. Stages, in the order shown: INPUT DATA (web access log data), DATABASE, DATA PREPARATION, PREPROCESSING, SESSION IDENTIFICATION & DETECTION OF ANOMALIES, PROFILE MINING. Outputs shown: basic statistics, detection of anomalies, shop information, identified sessions (based on a new definition), mixture model (probability over content types), tree of profile sequences.]


Anomalies/Strange things I

• Multiple IP-addresses per session
– 2 IP-addresses: 3.051 sessions
– 3 IP-addresses: 362 sessions
– 4 IP-addresses: 113 sessions
– …
– 22 IP-addresses: 1 session
– Some sessions involve IPs from different countries

• A few sessions (12) refer to multiple shops
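Counts like the ones above can be obtained by grouping clicks by session identifier. The following is a minimal sketch, assuming each click is available as a (session_id, ip, shop_id) tuple; the function names and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def ip_count_histogram(clicks):
    """Map 'number of distinct IPs' -> 'number of sessions with that many IPs'."""
    ips = defaultdict(set)
    for sid, ip, _shop in clicks:
        ips[sid].add(ip)
    return Counter(len(addresses) for addresses in ips.values())


def multi_shop_sessions(clicks):
    """Return the session ids whose clicks refer to more than one shop."""
    shops = defaultdict(set)
    for sid, _ip, shop in clicks:
        shops[sid].add(shop)
    return sorted(sid for sid, seen in shops.items() if len(seen) > 1)


# Toy data, not taken from the challenge log.
toy = [
    ("a", "1.1.1.1", 11), ("a", "2.2.2.2", 11),  # one session, two IPs
    ("b", "1.1.1.1", 11), ("b", "1.1.1.1", 12),  # one session, two shops
    ("c", "3.3.3.3", 11),
]
hist = ip_count_histogram(toy)
suspicious = multi_shop_sessions(toy)
print(dict(hist), suspicious)
```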


Anomalies/Strange things II

• Sessions with long duration
– 476 sessions longer than 24 hours (up to 18 days)

• Very intensive sessions
– 2.865 sessions with more than 100 visited pages
– 19 sessions with more than 1.000 visited pages
– 2 sessions with more than 10.000 visited pages

• Frequent IP-addresses with short sessions
– E.g.: 29.320 sessions in less than 20 hours from 147.229.205.80

• “Parallel sessions”
– Overlapping sequences of clicks from the same IP to the same shop within a short period with multiple SIDs (Opening a new window? Making a transaction?)
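The first two checks (long duration, very intensive sessions) reduce to per-session click counts and time spans. This is a sketch under the assumption that clicks are given as (session_id, unix_time) pairs; the thresholds of 24 hours and 100 pages come from the slide, while the function names and toy data are hypothetical.

```python
from collections import defaultdict

DAY = 24 * 60 * 60  # seconds


def session_stats(clicks):
    """Return {session_id: (number_of_clicks, duration_in_seconds)}."""
    times = defaultdict(list)
    for sid, t in clicks:
        times[sid].append(t)
    return {sid: (len(ts), max(ts) - min(ts)) for sid, ts in times.items()}


def flag_sessions(stats, max_duration=DAY, max_clicks=100):
    """Return (long_sessions, intensive_sessions) as sorted lists of ids."""
    long_sessions = sorted(s for s, (_n, d) in stats.items() if d > max_duration)
    intensive = sorted(s for s, (n, _d) in stats.items() if n > max_clicks)
    return long_sessions, intensive


# Toy data: "x" spans 25 hours, "y" has 101 clicks, "z" is ordinary.
toy = [("x", 0), ("x", 25 * 3600)]
toy += [("y", t) for t in range(101)]
toy += [("z", 0), ("z", 30), ("z", 90)]
long_sessions, intensive = flag_sessions(session_stats(toy))
print(long_sessions, intensive)
```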


Anomalies/Strange things III

• Sequences of short sessions that together form one session

Example: clicks from 62.209.194.163 (31 Jan 04)

09:40:09 /dt/?c=13654;http://www.shop5.cz/
09:41:21 /dt/param.php?id=115;
09:41:21 /;
09:41:37 /ls/?id=20;http://www.shop5.cz/dt/?c=13654
09:41:42 /;
09:42:24 /ls/?&id=20&view=1,2,3,8&pozice=20;http://www.shop5.cz/ls/…
09:42:25 /;
09:42:48 /ls/?&id=20&view=1,2,3,8;http://www.shop5.cz/ls/?&id=20& …
09:42:48 /;
09:42:53 /ls/?&id=20&view=1,2,3,8&pozice=40;http://www.shop5.cz/ls/…

Each click has a different session identifier!


Fixing the data

• A new definition of “session”: a chronologically ordered sequence of “clicks” from the same IP-address to the same shop with no gaps longer than 30 minutes

• Sessions longer than 50 clicks are ignored (12.000)

• The number of sessions dropped from 522.410 to 281.153
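The new session definition can be sketched as follows. This is an illustration, not the authors' code: it assumes clicks are (shop_id, ip, unix_time, page) tuples, and the function and constant names are hypothetical. Only the 30-minute gap and the 50-click cutoff come from the slide.

```python
from collections import defaultdict

GAP = 30 * 60      # maximum allowed gap inside a session, in seconds
MAX_CLICKS = 50    # sessions longer than this are ignored


def sessionize(clicks, gap=GAP, max_clicks=MAX_CLICKS):
    """Group clicks by (shop, IP), order them by time, and start a new
    session whenever the gap to the previous click exceeds `gap`."""
    by_visitor = defaultdict(list)
    for shop, ip, t, page in clicks:
        by_visitor[(shop, ip)].append((t, page))

    sessions = []
    for key in sorted(by_visitor):
        current = []
        for t, page in sorted(by_visitor[key]):
            if current and t - current[-1][0] > gap:
                sessions.append(current)
                current = []
            current.append((t, page))
        sessions.append(current)
    # Drop the very long sessions, as on the slide.
    return [s for s in sessions if len(s) <= max_clicks]


toy = [
    (5, "62.0.0.1", 0, "/"), (5, "62.0.0.1", 60, "/a"),
    (5, "62.0.0.1", 60 + 1801, "/b"),            # gap > 30 min: new session
    (6, "62.0.0.1", 10, "/"),                    # other shop: separate session
]
toy += [(7, "62.0.0.2", t, "/") for t in range(51)]  # 51 clicks: ignored
sessions = sessionize(toy)
print([len(s) for s in sessions])
```

Because the original session identifier is not used at all, a burst of clicks from one IP to one shop within a few minutes ends up in a single session even if every click carries a different identifier.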


Old and New Sessions

Session Length   Count Old   Count New
1                318.523     65.258
2                24.762      31.821
3                17.353      18.828
4                15.351      16.332
5                15.361      15.509
6                13.455      13.448
7                10.958      10.883
8                9.045       9.095
9                7.939       8.070
10               7.028       7.091
...              ...         ...


Visitor Profiling

Motivation: On the internet each shop is just “one click away”. If a user is not satisfied with the service, he/she just goes to the next one and will likely never return.


Visitor Profiling Scheme

I. Clustering of user sessions

II. Analysis/interpretation of the clusters

III. Assign a cluster label to each session

IV. Analysis of the profile sequences


Clustering

• Cadez et al. (2001) - predictive profiles from historical transaction data

• Mixture of multinomials:

$p(y_{ij}) = \sum_{k=1}^{K} \alpha_k \prod_{c=1}^{C} \theta_{kc}^{n_{ijc}}$

• Full data likelihood:

$p(D \mid \Theta) = \prod_{i=1}^{N} p(D_i \mid \Theta)$

• The unknown parameters $\{\alpha_1, \ldots, \alpha_K\}$ and $\{\theta_1, \ldots, \theta_K\}$ are estimated by the expectation maximization (EM) algorithm.
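A compact EM loop for a mixture of multinomials might look like the sketch below. It is a simplified illustration, not the procedure of Cadez et al. or of the authors. It treats every session as one vector of click counts per content type, and the deterministic initialisation, the add-one smoothing, the fixed number of iterations, and all names are assumptions.

```python
import math


def fit_multinomial_mixture(counts, K, n_iter=30):
    """EM for a mixture of K multinomials.

    counts[i][c] = number of clicks of session i on content type c.
    Returns (alpha, theta, resp): mixing weights, per-cluster category
    probabilities, and per-session cluster responsibilities.
    """
    N, C = len(counts), len(counts[0])
    alpha = [1.0 / K] * K
    # Assumed initialisation: seed each cluster from an evenly spaced
    # session, with add-one smoothing so no probability is exactly zero.
    theta = []
    for k in range(K):
        seed = [v + 1.0 for v in counts[k * N // K]]
        theta.append([v / sum(seed) for v in seed])
    resp = [[1.0 / K] * K for _ in range(N)]

    for _ in range(n_iter):
        # E-step: responsibilities, computed in log space for stability.
        for i, n in enumerate(counts):
            logp = [
                math.log(alpha[k])
                + sum(n[c] * math.log(theta[k][c]) for c in range(C))
                for k in range(K)
            ]
            top = max(logp)
            w = [math.exp(v - top) for v in logp]
            resp[i] = [v / sum(w) for v in w]
        # M-step: re-estimate weights and category probabilities
        # (add-one smoothing is an assumption, not part of the slides).
        for k in range(K):
            r_k = sum(resp[i][k] for i in range(N))
            alpha[k] = (r_k + 1.0) / (N + K)
            num = [
                1.0 + sum(resp[i][k] * counts[i][c] for i in range(N))
                for c in range(C)
            ]
            theta[k] = [v / sum(num) for v in num]
    return alpha, theta, resp


def hard_labels(resp):
    """Assign each session to its most probable cluster."""
    return [max(range(len(r)), key=r.__getitem__) for r in resp]


# Toy sessions over three content types: two obvious groups.
sessions = [[9, 1, 0], [8, 2, 0], [10, 0, 1], [0, 1, 9], [1, 2, 8], [0, 0, 10]]
alpha, theta, resp = fit_multinomial_mixture(sessions, K=2)
labels = hard_labels(resp)
print(labels)
```

The hard labels correspond to step III of the profiling scheme (assign a cluster label to each session). In practice the number of clusters and the initialisation would need more care than this toy setup suggests.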


Interpretation of the clusters

Profile 1 General overview of the products

Profile 2 Focused search

Profile 3 Potential buyers

Profile 4 Parameter based search


The transitions of profiles

      P1       P2       P3       P4
P1    0.7208   0.1592   0.0621   0.0579
P2    0.5908   0.2828   0.0710   0.0553
P3    0.5022   0.1616   0.2873   0.0489
P4    0.6000   0.1702   0.0685   0.1613
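Each row of the table sums to 1 (up to rounding), which suggests that rows are the profile of one session and columns the profile of the visitor's next session. The slides do not say how the matrix was estimated; one plausible way, sketched here as an assumption, is to count consecutive pairs of profile labels in each visitor's session sequence and normalise per row. The function name and toy sequences are hypothetical.

```python
def transition_matrix(sequences, n_profiles):
    """Estimate first-order transition probabilities between profiles.

    sequences: one list of profile indices (0-based) per visitor,
               in chronological order of that visitor's sessions.
    """
    counts = [[0] * n_profiles for _ in range(n_profiles)]
    for seq in sequences:
        for current, following in zip(seq, seq[1:]):
            counts[current][following] += 1

    matrix = []
    for row in counts:
        total = sum(row)
        # A profile that is never followed by another session gets a zero row.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


# Toy label sequences for two visitors and two profiles.
toy = [[0, 0, 1], [1, 0]]
matrix = transition_matrix(toy, n_profiles=2)
print(matrix)
```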


Tree of user profiles


Tree of potential buyers


Conclusion

• We spotted several anomalies: background information about pre-processing & data preparation is important

• Important features were missing (who is a buyer?)

• Four clear user profiles