Page 1:

Web Usage Mining

Modelling: frequent-pattern mining I (sequence mining with WUM), classification, and clustering

Prof. Dr. Bettina Berendt, Humboldt Univ. Berlin, Germany

www.berendt.de

Page 2:

Please note

These slides use and/or refer to a lot of material available on the Internet. To reduce clutter, credits and hyperlinks are given in the following ways:

Slides adapted from other people‘s materials: at bottom of slide

Pictures, screenshots etc.: URL visible in screenshot or given in PPT „Comments“ field

Literature, software: On the accompanying Web site http://vasarely.wiwi.hu-berlin.de/WebMining07/

Thanks to the Internet community!

You are invited to re-use these materials, but please give the proper credit.

Page 3:

Stages of knowledge discovery discussed in this lecture

Application understanding

Page 4:

An addendum to the association rules: main interestingness measures of association rules (and a recommendation for postprocessing the result set)

Support of a rule A → B

= no. of instances with A and B / no. of all instances

Confidence of a rule A → B

= no. of instances with A and B / no. of instances with A

= support (A & B) / support (A)

Lift of a rule A → B

= support (A & B) / [ support (A) * support (B) ]

What does this measure, and in what numerical interval can it be?

Deleting redundant rules from the result set:

If you have A → B and A & C → B, the second rule is redundant.
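To make the three measures concrete, here is a minimal Python sketch over a made-up list of transactions (item names and numbers are purely illustrative):

# Support, confidence and lift of a rule A -> B over toy transactions.
transactions = [
    {"A", "B"}, {"A", "B", "C"}, {"A", "C"}, {"B"}, {"A", "B"},
]

def support(itemset):
    # fraction of transactions that contain every item of the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    # support(A & B) / support(A)
    return support(antecedent | consequent) / support(antecedent)

def lift(antecedent, consequent):
    # support(A & B) / (support(A) * support(B)); measures deviation from
    # independence: 1 = independent, > 1 = positively correlated,
    # < 1 = negatively correlated (range: 0 to infinity)
    return support(antecedent | consequent) / (support(antecedent) * support(consequent))

print(support({"A", "B"}))       # 0.6
print(confidence({"A"}, {"B"}))  # 0.75
print(lift({"A"}, {"B"}))        # 0.9375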

Page 5:

Agenda

Sequence mining: tool WUM (case study “school search”)

Classification: method Naïve Bayes (case study “happiness”)

Clustering: tool DocumentAtlas (case study “EU proposals”)

A very short note on other uses of clustering (e.g. in query mining)

Some observations on privacy ...

Best-practice „design patterns“ with open-source tools

Page 6:

Demonstration of WUM

Page 7:

The site

Business understanding / problem definition:

* How do users search in this online catalog?

* Which search criteria are popular?

* Which are efficient?

[Berendt & Spiliopoulou, VLDB Journal 2000]

Page 8:

The concept hierarchies / site ontology (excerpt)

SEITE1-...LI (1st page of a list) or SEITEn-...LI (further page)

LA („Land“ = state), SA („Schulart“ = type of school), SU („Suche“ = search)

Page 9:

Sequence mining – one result pattern: successful search for a school in Germany

a refinement

a repetition

a continuation

one example pattern

select t
from node a b, template a * b as t
where a.url startswith "SEITE1-"
  and a.occurrence = 1
  and b.url contains "1SCHULE"
  and b.occurrence = 1
  and (b.support / a.support) >= 0.2

(Berendt & Spiliopoulou, VLDB J. 2000)

/liste.html?offset=920&zeilen=20&anzahl=1323&sprache=de&sw_kategorie=de&erscheint=&suchfeld=&suchwert=&staat=de&region=by&schultyp=

/liste.html?offset=920&zeilen=20&anzahl=1323&sprache=de&sw_kategorie=de&erscheint=&suchfeld=&suchwert=&staat=de&region=by&schultyp=

Page 10:

Sequences

Page 11:

Generalized sequences, navigation patterns, hits in WUM

Page 12:

Aggregated Logs: The basic internal representation in WUM

Page 13:

The confidence measure for generalized sequences

Page 14:

Templates in the query language MINT, g-sequences, and navigation patterns

Page 15:

Interestingness measures: Support (hits) and confidence

Page 16:

Aggregated Logs, queries, and query results

Page 17:

The basic idea of the WUM algorithm

Page 18:

MINT can express 3 types of constraints (“predicates“)

Page 19:

The WUM gseqm algorithm

(B predicates)

Page 20:

Agenda

Sequence mining: tool WUM (case study “school search”)

Classification: method Naïve Bayes (case study “happiness”)

Clustering: tool DocumentAtlas (case study “EU proposals”)

A very short note on other uses of clustering (e.g. in query mining)

Some observations on privacy ...

Best-practice „design patterns“ with open-source tools

Page 21:

“What makes people happy?” – a corpus-based approach to finding happiness

Page 22:

Bayes‘ formula and its use for classification

1. Joint probabilities and conditional probabilities: basics

P(A & B) = P(A|B) * P(B) = P(B|A) * P(A)
P(A|B) = ( P(B|A) * P(A) ) / P(B)   (Bayes’ formula)
P(A): prior probability of A (a hypothesis, e.g. that an object belongs to a certain class)
P(A|B): posterior probability of A (given the evidence B)

2. Estimation:

Estimate P(A) by the frequency of A in the training set (i.e., the number of A instances divided by the total number of instances).
Estimate P(B|A) by the frequency of B within the class-A instances (i.e., the number of class-A instances that have B divided by the total number of class-A instances).

3. Decision rule for classifying an instance:

If there are two possible hypotheses/classes (A and ~A), choose the one that is more probable given the evidence (~A is „not A“).
If P(A|B) > P(~A|B), choose A.
The denominators are equal, so: if ( P(B|A) * P(A) ) > ( P(B|~A) * P(~A) ), choose A.

Page 23:

Simplifications and Naive Bayes

4. Simplify by setting the priors equal (i.e., by using as many instances of class A as of class ~A)

If P(B|A) > P(B|~A), choose A

5. More than one kind of evidence

General formula:

P(A | B1 & B2) = P(A & B1 & B2) / P(B1 & B2)
              = P(B1 & B2 | A) * P(A) / P(B1 & B2)
              = P(B1 | B2 & A) * P(B2 | A) * P(A) / P(B1 & B2)

Enter the „naive“ assumption: B1 and B2 are independent given A

P(A | B1 & B2 ) = P(B1|A) * P(B2|A) * P(A) / P(B1 & B2)

By reasoning as in 3. and 4. above, the last two terms can be omitted

If (P(B1|A) * P(B2|A) ) > (P(B1|~A) * P(B2|~A) ), choose A

The generalization to n kinds of evidence is straightforward.

These kinds of evidence are often called features in machine learning.
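A minimal sketch of this decision rule in Python, with two binary kinds of evidence and made-up training counts (all numbers are illustrative, not from the slides):

# Naive Bayes decision with two kinds of evidence B1, B2 and equal priors
# (simplification 4 above). Counts are made up for illustration.
n_A, n_notA = 50, 50                 # training instances per class
b1_in_A, b2_in_A = 40, 10            # how many class-A instances show B1 / B2
b1_in_notA, b2_in_notA = 5, 30       # the same for class ~A

def score(b1_count, b2_count, n):
    # P(B1|class) * P(B2|class), estimated by relative frequencies
    return (b1_count / n) * (b2_count / n)

# classify an instance that shows both B1 and B2
if score(b1_in_A, b2_in_A, n_A) > score(b1_in_notA, b2_in_notA, n_notA):
    print("choose A")      # 0.8 * 0.2 = 0.16
else:
    print("choose ~A")     # 0.1 * 0.6 = 0.06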

Page 24:

Example: Texts as bags of words

Common representations of texts

Set: can contain each element (word) at most once

Bag (aka multiset): can contain each word multiple times (most common representation used in text mining)

Hypotheses and evidence

A = The blog is a happy blog, the email is a spam email, etc.

~A = The blog is a sad blog, the email is a proper email, etc.

Bi refers to the ith word occurring in the whole corpus of texts

Estimation for the bag-of-words representation:

Example estimation of P(B1|A) :

number of occurrences of the first word in all happy blogs, divided by the total number of words in happy blogs (etc.)
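A minimal sketch of this estimation over a tiny made-up corpus (the posts are invented; Laplace smoothing is added so unseen words do not get probability 0, which is an assumption beyond the slide):

from collections import Counter

# Made-up "happy" and "sad" blog posts in bag-of-words form.
happy_posts = ["sunny day at the beach", "great day with friends"]
sad_posts   = ["rainy day stuck inside", "lost my keys again"]

def word_counts(posts):
    counts = Counter()
    for p in posts:
        counts.update(p.lower().split())
    return counts

happy_counts, sad_counts = word_counts(happy_posts), word_counts(sad_posts)
vocab = set(happy_counts) | set(sad_counts)

def p_word_given_class(word, counts):
    # occurrences of the word in the class / total number of words in the class
    # (+1 and +len(vocab) are Laplace smoothing, an assumption not on the slide)
    return (counts[word] + 1) / (sum(counts.values()) + len(vocab))

print(p_word_given_class("day", happy_counts))   # higher for the happy class
print(p_word_given_class("day", sad_counts))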

Page 25:

WEKA – NaiveBayes and NaiveBayesMultinomial

The WEKA classifier learning scheme NaiveBayesMultinomial implements this model of „the probability that a word occurs in a document given that the document is in that class“.

Its output is a table giving these probabilities

The WEKA classifier learning scheme NaiveBayes assumes that the attributes are normally distributed.

Needed when the attributes are numerical and not necessarily 0 | 1.
Its output describes the parameters of these normal distributions.
Explanation of the annotations of the attributes:

http://grb.mnsu.edu/grbts/doc/manual/Naive_Bayes.html

Explanation of the error measures: http://grb.mnsu.edu/grbts/doc/manual/Error_Measurements.html#sec:error

Page 26:

The „happiness factor“ of Mihalcea & Liu (2006)

“Starting with the features identified as important by the Naïve Bayes classifier (a threshold of 0.3 was used in the feature selection process), we selected all those features that had a total corpus frequency higher than 150, and consequently calculate the happiness factor of a word as the ratio between the number of occurrences in the happy blogposts and the total frequency in the corpus.”

What is the relation to the Naïve Bayes estimators?
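A minimal sketch of that computation (all counts are made up; Mihalcea & Liu worked over a much larger blog corpus):

# Happiness factor of a word: occurrences in happy blogposts / total corpus
# frequency, computed only for words above the corpus-frequency threshold.
happy_counts = {"friends": 120, "rain": 30, "birthday": 90}     # made-up counts
total_counts = {"friends": 150, "rain": 160, "birthday": 100}

MIN_CORPUS_FREQ = 150   # the threshold named in the quote; the counts here are toy values

happiness_factor = {
    w: happy_counts.get(w, 0) / freq
    for w, freq in total_counts.items()
    if freq >= MIN_CORPUS_FREQ
}
print(happiness_factor)   # {'friends': 0.8, 'rain': 0.1875}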

Page 27:

Agenda

Sequence mining: tool WUM (case study “school search”)

Classification: method Naïve Bayes (case study “happiness”)

Clustering: tool DocumentAtlas (case study “EU proposals”)

A very short note on other uses of clustering (e.g. in query mining)

Some observations on privacy ...

Best-practice „design patterns“ with open-source tools

Page 28:

Clustering by information contained in the objects to be clustered (here: documents contain text) – www.kartoo.com

Page 29:

The basic idea of clustering: group similar things

[Scatter plot: points plotted along Attribute 1 and Attribute 2, falling into Group 1 and Group 2]

Based on http://rakaposhi.eas.asu.edu/cse494/notes/f02-clustering.ppt

Page 30:

Idea and Applications

Clustering is the process of grouping a set of physical or abstract objects into classes of similar objects.

It is also called unsupervised learning.

It is a common and important task that finds many applications.

Applications in text analysis / Web content mining, e.g. for search engines:

Structuring search results

Suggesting related pages

Automatic directory construction/update

Finding near identical/duplicate pages

Applications in Web usage mining

Customer/user segmentation

User segmentation for recommender systems / personalization

Based on http://rakaposhi.eas.asu.edu/cse494/notes/f02-clustering.ppt

Page 31:

Concepts in Clustering

Defining distance between points: Cosine distance (which you already know)

Overlap distance

A good clustering is one where
(Intra-cluster distance) the sum of distances between objects in the same cluster is minimized,
(Inter-cluster distance) while the distances between different clusters are maximized.

Objective to minimize: F(Intra, Inter)

Clusters can be evaluated with “internal” as well as “external” measures

Internal measures are related to the inter/intra cluster distance

External measures are related to how representative the current clusters are of the “true” classes

– See entropy and F-measure

Cosine similarity: cos(Q, R) = (Q · R) / (||Q|| ||R||)
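A minimal sketch of that measure for two made-up term-frequency vectors (cosine distance is then 1 minus the similarity):

import math

# Cosine similarity between two made-up term-frequency vectors Q and R.
Q = [3, 0, 1, 2]
R = [1, 1, 0, 2]

dot  = sum(q * r for q, r in zip(Q, R))
norm = math.sqrt(sum(q * q for q in Q)) * math.sqrt(sum(r * r for r in R))
cosine_similarity = dot / norm
print(cosine_similarity, 1 - cosine_similarity)   # similarity, distance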

Based on http://rakaposhi.eas.asu.edu/cse494/notes/f02-clustering.ppt

Page 32:

Inter/Intra Cluster Distances

Intra-cluster distance

(Sum/Min/Max/Avg) the (absolute/squared) distance between

- All pairs of points in the cluster OR

- Between the centroid and all points in the cluster OR

- Between the “medoid” and all points in the cluster

Inter-cluster distance

Sum the (squared) distance between all pairs of clusters

Where distance between two clusters is defined as:

- distance between their centroids/medoids

- (Spherical clusters)

- Distance between the closest pair of points belonging to the clusters

- (Chain shaped clusters)

From http://rakaposhi.eas.asu.edu/cse494/notes/f02-clustering.ppt
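A minimal sketch of the centroid-based variants of these two distances, for two small made-up 2-D clusters (squared Euclidean distance is used):

import numpy as np

# Two made-up 2-D clusters.
c1 = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5]])
c2 = np.array([[8.0, 8.0], [8.5, 9.0], [9.0, 8.5]])

def centroid(points):
    return points.mean(axis=0)

def intra(points):
    # average squared distance between the centroid and all points in the cluster
    return np.mean(np.sum((points - centroid(points)) ** 2, axis=1))

def inter(a, b):
    # squared distance between the two centroids (the "spherical clusters" variant)
    return np.sum((centroid(a) - centroid(b)) ** 2)

print(intra(c1), intra(c2), inter(c1, c2))   # small, small, large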

Page 33:

How hard is clustering?

One idea is to consider all possible clusterings, and pick the one that has the best inter- and intra-cluster distance properties

Suppose we are given n points, and would like to cluster them into k clusters

How many possible clusterings? Approximately k^n / k!

• Too hard to do it brute force or optimally
• Solution: Iterative optimization algorithms
  – Start with a clustering, iteratively improve it (e.g. K-means)

From http://rakaposhi.eas.asu.edu/cse494/notes/f02-clustering.ppt

Page 34:

Classical clustering methods

Partitioning methods

k-Means (and EM), k-Medoids

Hierarchical methods

agglomerative, divisive, BIRCH

Model-based clustering methods

From http://rakaposhi.eas.asu.edu/cse494/notes/f02-clustering.ppt

Page 35:

K-means

Works when we know k, the number of clusters we want to find

Idea:

Randomly pick k points as the “centroids” of the k clusters

Loop: For each point, put the point in the cluster to whose centroid it is closest

Recompute the cluster centroids

Repeat loop (until there is no change in clusters between two consecutive iterations).

Iterative improvement of the objective function: Sum of the squared distance from each point to the centroid of its cluster

From http://rakaposhi.eas.asu.edu/cse494/notes/f02-clustering.ppt
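A minimal K-means sketch following the loop above (data, k, and the random seed are made up; there is no guard for empty clusters, so this is only an illustration):

import numpy as np

rng = np.random.default_rng(0)
# Made-up 2-D data: two blobs.
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(6, 1, (20, 2))])
k = 2

# Randomly pick k points as the initial centroids.
centroids = X[rng.choice(len(X), k, replace=False)]

for _ in range(100):
    # Put each point in the cluster with the closest centroid.
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Recompute the cluster centroids; stop when nothing changes any more.
    new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

print(centroids)   # one centroid per cluster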

Page 36:

K-means example (K = 2). For a more complex simulation, see http://www.cs.tu-bs.de/rob/lehre/bv/Kmeans/Kmeans.html

[Figure: pick seeds → reassign clusters → compute centroids → reassign clusters → compute centroids → reassign clusters → converged]

Based on http://rakaposhi.eas.asu.edu/cse494/notes/f02-clustering.ppt

Page 37:

A map of documents, grouped by their topics

Page 38:

DocumentAtlas: A two-step procedure

1. Latent semantic indexing: Project documents into a semantic space (dimensionality reduction and identification of commonalities even if vocabulary is different)

2. Multidimensional scaling: Project that space into 2D, preserving the distances as well as possible

Input: a set of documents

Output: a „document map“
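DocumentAtlas itself belongs to the TextGarden library; as an illustration of the same two-step idea, here is a sketch using scikit-learn (TF-IDF plus truncated SVD for the latent semantic space, then MDS down to 2D; the documents are made up):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import MDS

docs = [
    "school search bavaria",
    "school list germany",
    "privacy personalization users",
    "privacy data protection",
]

# Step 1: latent semantic indexing (TF-IDF vectors projected by truncated SVD).
X = TfidfVectorizer().fit_transform(docs)
latent = TruncatedSVD(n_components=3).fit_transform(X)

# Step 2: multidimensional scaling of the latent space into 2D,
# preserving pairwise distances as well as possible.
coords = MDS(n_components=2, random_state=0).fit_transform(latent)
print(coords)   # one (x, y) point per document for the "document map"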

Page 39:

Agenda

Sequence mining: tool WUM (case study “school search”)

Classification: method Naïve Bayes (case study “happiness”)

Clustering: tool DocumentAtlas (case study “EU proposals”)

A very short note on other uses of clustering (e.g. in query mining)

Some observations on privacy ...

Best-practice „design patterns“ with open-source tools

Page 40:

Clustering by information contained in the objects to be clustered (here: documents contain text) – www.kartoo.com

Page 41:

Clustering by information associated with the objects to be clustered (here: photos are associated with tags) – www.flickr.com

Page 42:

Clustering by information associated with the objects to be clustered (here: queries are associated with document texts) – (1)

Page 43:

Clustering by information associated with the objects to be clustered ... (2) – Baeza-Yates, Query Mining, ECIR 2005

1. Create instances of past ( query – result set ) combinations

2. Cluster them by the textual similarity of the (viewed) result documents

3. Use this to recommend a better / an additional query

[Figure: Query 1 → Result set 1, Query 2 → Result set 2; the new user's query is matched against these past combinations and a similar past query is recommended]
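A minimal sketch of this idea (queries, result-document texts, and the new user's viewed text are all made up; a nearest-neighbour lookup over TF-IDF vectors stands in for the clustering step):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Step 1: past (query, viewed-result text) combinations (made up).
past = {
    "schools bavaria": "list of schools in bavaria grammar school munich",
    "schulen bayern":  "schule bayern munich grammar school list",
    "privacy online":  "privacy data protection online users",
}
queries = list(past)

# Step 2: represent each past query by the text of its viewed result documents.
vec = TfidfVectorizer()
doc_vectors = vec.fit_transform(past.values())

# Step 3: for a new user, recommend the past query whose result documents
# are most similar to the documents the user has just viewed.
viewed_text = ["munich school list bavaria"]
sims = cosine_similarity(vec.transform(viewed_text), doc_vectors)[0]
print("recommend:", queries[sims.argmax()])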

Page 44:

Ranking by similarity and popularity: Examples

Page 45:

Agenda

Sequence mining: tool WUM (case study “school search”)

Classification: method Naïve Bayes (case study “happiness”)

Clustering: tool DocumentAtlas (case study “EU proposals”)

A very short note on other uses of clustering (e.g. in query mining)

Some observations on privacy ...

Best-practice „design patterns“ with open-source tools

Page 46:

Internet users are worried about their privacy ...

(results from a meta-study of 30 questionnaire-based studies [TK03])

Page 47:

... but are they really? An online shop with a difference

[Berendt, Günther, & Spiekermann, Communications of the ACM, 2005]

Page 48:

Privacy-related behaviour

Shopping for jackets

Shopping for cameras

[Berendt, Data Mining and Knowledge Discovery, 2002], [Berendt, Postproc. WebKDD 2002]

Page 49:

Finding: People are willing to exchange privacy for personalization benefits

Users would provide, in return for personalized content, information on their name (88%), education (88%), age (86%), hobbies (83%), salary (59%), or credit card number (13%).

27% of Internet users think tracking allows the site to provide information tailored to specific users.

73% of online users find it useful if a site remembers basic information such as name and address.

People are willing to give information to receive a personalized online experience: 51% or 40%, depending on the study.

[TK03]

Page 50:

User-centric evaluation: An experimental investigation of the effect of explaining the personalization-privacy tradeoff

[KT05] compared the effects of traditional privacy statements with that of a contextualized explanation on users’ willingness to answer questions about themselves and their (product) preferences.

In the contextualized-explanation condition, participants

answered 8.3% more questions (gave at least one answer) (p<0.001),

gave 19.6% more answers (p<0.001),

purchased 33% more often (p<0.07),

stated that their data had helped the Web store to select better books (p<0.035) – even though the recommendations were static and identical for both groups.

(screenshot from Teltzrow, M. & Kobsa, A. (2004). Communication of Privacy and Personalization in E-Business. In Proceedings of the Workshop “WHOLES: A Multiple View of Individual Privacy in a Networked World”, Stockholm, Sweden.)

Page 51:

But what is privacy? Is it only about data protection?

Phillips, D.J. 2004. “Privacy Policy and PETs: The Influence of Policy Regimes on the Development and Social Implications of Privacy Enhancing Technologies.” New Media & Society 6(6): 691-706

freedom from intrusion

construction of the public/private divide

separation of identities

protection from surveillance (the right to choose belonging)

Page 52:

Also: whose privacy? Stakeholders and privacy interests: a (partially) fictitious example

users of the system:

passengers

system administrators

Other stakeholders:

airport administration

airport security

airlines

duty-free shop

Page 53:

Different privacy interests of the different stakeholders

Page 54:

Agenda

Sequence mining: tool WUM (case study “school search”)

Classification: method Naïve Bayes (case study “happiness”)

Clustering: tool DocumentAtlas (case study “EU proposals”)

A very short note on other uses of clustering (e.g. in query mining)

Some observations on privacy ...

Best-practice „design patterns“ with open-source tools

Page 55:

In the preparation of a log file (recommendations for open-source tools are shown in green)

1. Use qualitative methods for application understanding (read!)

2. Inspect the site and the URLs for data understanding

1. Generate Analog reports for getting base statistics of usage

2. Build concept system / hierarchy and mapping: URLs → concepts (notation: WUMprep regex)

3. Use WUMprep for data preparation

1. Remove unwanted entries (pictures etc.)

2. Sessionize

3. Remove robots

4. Replace URLs by concepts

5. (Build a database)

4. Use WEKA for modelling

1. Transform log file into ARFF (WUMprep4WEKA)

2. Cluster, classify, find association rules, ...

5. Use WUM for modelling

6. Select patterns based on objective interestingness measures (support, confidence, lift, ...) and on subjective interestingness measures (unexpected? Application-relevant?)

7. Present results in tabular, textual and graphical form (use Excel, ...)

8. Interpret the results

9. Make recommendations for site improvement etc.
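As a small illustration of step 3.2 (sessionizing), here is a sketch in plain Python that groups requests by IP with a 30-minute timeout; in practice this step is done with WUMprep, and the log entries below are made up:

from datetime import datetime, timedelta

# Made-up (ip, timestamp, url) log entries, already sorted by time.
log = [
    ("10.0.0.1", "2007-01-15 10:00:05", "/index.html"),
    ("10.0.0.1", "2007-01-15 10:02:10", "/liste.html?region=by"),
    ("10.0.0.2", "2007-01-15 10:03:00", "/index.html"),
    ("10.0.0.1", "2007-01-15 11:30:00", "/index.html"),   # > 30 min gap: new session
]

TIMEOUT = timedelta(minutes=30)
sessions, last_seen = {}, {}

for ip, ts, url in log:
    t = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
    # start a new session for this IP if it is new or has been idle too long
    if ip not in last_seen or t - last_seen[ip] > TIMEOUT:
        sessions.setdefault(ip, []).append([])
    sessions[ip][-1].append(url)
    last_seen[ip] = t

print(sessions)   # per IP: a list of sessions, each a list of requested URLs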

Page 56:

In the case study:

1. Use qualitative methods for application understanding (read!)

2. Inspect the site and the URLs for data understanding

1. Generate Analog reports for getting base statistics of usage

2. Build concept system / hierarchy and mapping: URLs → concepts (notation: WUMprep regex)

3. Use WUMprep for data preparation

1. Remove unwanted entries (pictures etc.)

2. Sessionize

3. Remove robots

4. Replace URLs by concepts

5. (Build a database)

4. Use WEKA for modelling

1. Transform log file into ARFF (WUMprep4WEKA)

2. Cluster, classify, find association rules, ...

5. Use WUM for modelling

6. Select patterns based on objective interestingness measures (support, confidence, lift, ...) and on subjective interestingness measures (unexpected? Application-relevant?)

7. Present results in tabular, textual and graphical form (use Excel, ...)

8. Interpret the results

9. Make recommendations for site improvement etc.

done

Page 57:

The preparation of texts (e.g., for an automatic version of step 2.2)

Is quite involved when done properly

(a good introduction to preprocessing for text mining can be found in

Grobelnik, M., & Mladenic, D. Text Mining Tutorial.

http://eprints.pascal-network.org/archive/00000017/01/Tutorial_Marko.pdf )

However, as a first step, you can also use the raw text of documents (generated with only a few of the tools in the TextGarden library).

Page 58:

Thank you!