Advanced Topics in Data Mining: Web Mining
Transcript
Page 1: Web mining

Advanced Topics in Data Mining:

Web Mining

Page 2: Web mining

Web Mining

Page 3: Web mining

Web Mining
• Applications are being ported to the Web at a rapid pace
• On-line services, such as America Online (AOL) and CompuServe (merged into AOL), are anxious to know user access patterns, not just “search” in the Web
• How does Amazon do it?
• Understanding Web user behavior is important
  – It can improve Web page organization
  – It can increase Web server performance
  – It can exploit Web advertising
  – It can increase business opportunity

Page 4: Web mining

Amazon Web Page

Association Rules

Page 5: Web mining

More Information Desired

• Collecting statistical information (page hits) only is insufficient, since:
  – The hit frequency of a page depends not only on its content but also on its location
  – The number of users accessing a page is not available
  – Information on which pages are accessed together is not available
• Data mining in the Web (Web Mining)
  – Web Access Pattern Collection
  – Web User Pattern Mining

Page 6: Web mining

Web Access Pattern Collection

• Server-Based Data Collection
  – Who is visiting a given Web site, and what are they doing?
• Agent-Based Data Collection
  – Which Web sites has a particular user visited?

Page 7: Web mining

Server-Based Data Collection

• Examine the logs collected by HTTPd
  – Access Log (IP, Time, Access Data), Referred Log (AB), Error Log, …
  – We can combine some of them for our use if necessary
• Problems
  – The use of proxy servers
  – The effect of caching

Page 8: Web mining

Server-Based Data Collection

Page 9: Web mining

Access Log

IP/Domain Name    Time    Access Data

Page 10: Web mining

Referred Log

The caching problem is not considered

Page 11: Web mining

Server-Based Data Collection
• Has to be done in accordance with technology advances
  – The use of Active Server Pages (Session ID available)
    • The use of proxy servers
    • The effect of caching
  – HTTPd 1.1
• Limitation
  – Can only capture user behavior while users are within this site

Page 12: Web mining

Agent-Based Data Collection
• Understanding individual Web behavior needs client-based data collection
• Results are useful
  – Better Personalized Service
  – Improved Web Page Organization
  – Better Pricing Policies
• Methods
  – Applets can only read/write files in their source servers
    • a big security constraint
  – Using Active Components (ActiveX Control) and PlugIns
    • APCS (Access Pattern Collection Server)

Page 13: Web mining

APCS

Page 14: Web mining

APCS

Page 15: Web mining

APCS

Page 16: Web mining

APCS

Page 17: Web mining

APCS

Page 18: Web mining

Agent-Based Data Collection

• Very difficult to do for non-registered users in the current Web environment
  – It has to be conducted with users’ consent

• Very dependent upon available Web technologies

Page 19: Web mining

Web User Pattern Mining

• Web user pattern mining is to discover user access patterns in Web servers

• Pattern discovery and analysis tools
  – Some existing Web tools provide mechanisms for reporting user activity in the servers
  – Web Trends (http://www.webtrends.com.tw/)
  – Open Market (http://www.openmarket.com/)
  – Net.Genesis (http://www.netgen.com/)

Page 20: Web mining

Path Traversal Patterns Mining
• Mining path traversal patterns in a distributed information-providing environment (WWW) where documents or objects are linked together (via hyperlinks) to facilitate interactive access
• Solution procedure consists of three steps:
  – Convert the original sequence of log data into a set of maximal forward references (MF)
    • Filter out the effect of some backward references
      – Backward references are mainly made for ease of traveling; filtering them concentrates the mining on meaningful user access sequences
      – Some objects are visited because of their locations rather than their content
  – Determine the frequent traversal patterns, i.e., large reference sequences, from the maximal forward references obtained
  – Determine the maximal reference sequences from large reference sequences (Trivial)

Page 21: Web mining

Step1: MF References
• Suppose the traversal log contains the following traversal path for a user:
  – A, B, C, D, C, B, E, G, H, G, W, A, O, U, O, V

The set of maximal forward references is {ABCD, ABEGH, ABEGW, AOU, AOV}

When a backward reference occurs, a forward reference path terminates.
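The conversion on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the original implementation: the function name and the `moving_forward` flag are assumptions made here, and each page is assumed to be a single-character label as in the example.

```python
def maximal_forward_references(traversal):
    """Split one user's traversal path into maximal forward references."""
    path = []               # current forward path from the starting page
    moving_forward = False  # True while the last move was a forward reference
    results = []
    for page in traversal:
        if page in path:
            # Backward reference: the forward path ends here.
            # Emit it only if we were moving forward just before.
            if moving_forward:
                results.append("".join(path))
            # Back up to the revisited page.
            del path[path.index(page) + 1:]
            moving_forward = False
        else:
            path.append(page)
            moving_forward = True
    # The log ends while still moving forward: emit the last path.
    if moving_forward and path:
        results.append("".join(path))
    return results


mf = maximal_forward_references(list("ABCDCBEGHGWAOUOV"))
print(mf)
```

Run on the traversal path above, the sketch yields the five references listed on the slide.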

Page 22: Web mining

Step1: Another Example

Page 23: Web mining

Step1: Arrange Database

Encoding

Page 24: Web mining

Step1: Database Reduction

Database Reduction

Page 25: Web mining

Step2: Find Frequent Reference Sequences

• Two algorithms for finding Frequent Traversal Patterns (Frequent Reference Sequences, Frequent Consecutive Subsequences)
  – Full-Scan (FS) Algorithm
    • FS utilizes key ideas of the DHP algorithm
  – Selective-Scan (SS) Algorithm
    • SS reduces the number of database scans

Page 26: Web mining

Full-Scan (FS) Algorithm

ScanDB-1

Generate L1 & Hash Table

Page 27: Web mining

Generate L1 & Hash Table

ScanDB-1

h(x,y) = [ ( order of x ) * 23 + ( order of y ) ] mod 17
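As a rough illustration of how this hash function could fill the hash table during the first scan, the sketch below hashes every pair of consecutive pages. Two things are assumptions made here, not taken from the slides: "order of x" is read as the 1-based alphabetical position of the page label, and the input is the set of maximal forward references from the Step 1 example.

```python
from collections import Counter


def order(page):
    # ASSUMPTION: "order of x" is the 1-based alphabetical position (A=1, B=2, ...).
    return ord(page) - ord("A") + 1


def h(x, y):
    # Hash function from the slide: [(order of x) * 23 + (order of y)] mod 17
    return (order(x) * 23 + order(y)) % 17


def bucket_counts(references):
    """Count, per hash bucket, the consecutive page pairs seen in one scan."""
    buckets = Counter()
    for ref in references:
        for x, y in zip(ref, ref[1:]):
            buckets[h(x, y)] += 1
    return buckets


# Maximal forward references from the Step 1 example.
references = ["ABCD", "ABEGH", "ABEGW", "AOU", "AOV"]
buckets = bucket_counts(references)
print(h("A", "B"), buckets[h("A", "B")])
```

In the DHP style, a pair would only become a candidate in C2 if its bucket count reaches the support threshold; the threshold itself is not given on this slide.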

Page 28: Web mining

Generate C2

Page 29: Web mining

Generate L2 & Reduce DB

ScanDB-2

Page 30: Web mining

Generate L2 & Reduce DB

ScanDB-2

Page 31: Web mining

Generate C3, L3 & Reduce DB

ScanDB-3

Page 32: Web mining

Generate C4, L4 & Reduce DB

ScanDB-4

Page 33: Web mining

Selective-Scan (SS) Algorithm

ScanDB-3

Page 34: Web mining

Step 3: Generate Frequent Traversal Patterns

Maximal Reference Sequences
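A minimal sketch of this step, assuming that a large reference sequence is maximal when it is not contained as a consecutive subsequence in any other large reference sequence. The input set below is hypothetical and only serves to exercise the function; it is not the data shown on the slide.

```python
def maximal_reference_sequences(large_sequences):
    """Keep the sequences that are not a consecutive part of another one."""
    return sorted(
        s for s in large_sequences
        if not any(s != t and s in t for t in large_sequences)
    )


# HYPOTHETICAL large reference sequences, for illustration only.
large = {"A", "B", "D", "E", "AB", "BE", "AD", "ABE"}
maximal = maximal_reference_sequences(large)
print(maximal)
```

With single-character page labels, consecutive containment reduces to Python's substring test, which is why the step is called trivial.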

Page 35: Web mining

WAP-Mine Algorithm
• The key consideration is how to facilitate the tedious support counting and candidate generating operations in the mining procedure
• Given a Web Access Sequence database WAS and a support threshold ξ, mine the complete set of ξ-patterns of WAS

User ID Web Access Sequence

100 abdac

200 eaebcac

300 babfaec

400 afbacfc

WAS

Page 36: Web mining

WAP-Mine Algorithm

(1) Scan WAS once, find all frequent-1 events

(2) Scan WAS again, construct a WAP-tree

(3) Recursively mine the WAP-tree using conditional search

Access patterns

Page 37: Web mining

Find All Frequent-1 Events

User ID Web Access Sequence

100 abdac

200 eaebcac

300 babfaec

400 afbacfc

Item Support Frequency

a 4

b 4

c 4

d 1

e 2

f 2

Min_Sup=75%

User ID Web Access Sequence Frequent Subsequence

100 abdac abac

200 eaebcac abcac

300 babfaec babac

400 afbacfc abacc
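The first scan can be sketched as follows. The function names are illustrative assumptions; the data and the 75% threshold are the ones on this slide. An event counts at most once per access sequence.

```python
import math
from collections import Counter

WAS = {100: "abdac", 200: "eaebcac", 300: "babfaec", 400: "afbacfc"}


def frequent_events(was, min_sup):
    """Return (frequent events, per-event support) for a relative threshold."""
    threshold = math.ceil(min_sup * len(was))
    support = Counter()
    for sequence in was.values():
        support.update(set(sequence))  # count each event once per sequence
    frequent = {event for event, count in support.items() if count >= threshold}
    return frequent, support


def frequent_subsequence(sequence, frequent):
    """Drop the non-frequent events, keeping the original order."""
    return "".join(event for event in sequence if event in frequent)


frequent, support = frequent_events(WAS, 0.75)
filtered = {uid: frequent_subsequence(seq, frequent) for uid, seq in WAS.items()}
print(sorted(frequent), filtered)
```

With four sequences, 75% corresponds to a support count of 3, so only a, b and c survive.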

Page 38: Web mining

WAP-Tree Construction

• Using frequent events to register all count information for further mining

User ID Frequent Subsequence

100 abac

200 abcac

300 babac

400 abacc
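A simplified sketch of the construction: each frequent subsequence is inserted into a prefix tree and the count of every node on its path is incremented. The class and function names are assumptions, and the header table with its node links, which WAP-mine uses for the conditional search, is left out for brevity.

```python
class WapNode:
    def __init__(self, event=None):
        self.event = event
        self.count = 0
        self.children = {}


def build_wap_tree(frequent_subsequences):
    """Insert every frequent subsequence into a shared prefix tree."""
    root = WapNode()
    for sequence in frequent_subsequences:
        node = root
        for event in sequence:
            if event not in node.children:
                node.children[event] = WapNode(event)
            node = node.children[event]
            node.count += 1
    return root


def dump(node, depth=0):
    """Render the tree as indented 'event:count' lines."""
    lines = []
    for event in sorted(node.children):
        child = node.children[event]
        lines.append("  " * depth + f"{event}:{child.count}")
        lines.extend(dump(child, depth + 1))
    return lines


tree = build_wap_tree(["abac", "abcac", "babac", "abacc"])
print("\n".join(dump(tree)))
```

Three of the four sequences share the prefix "ab", so they share one branch, and only "babac" starts a second branch under the root.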

Page 39: Web mining

Mining Web Access Patterns from WAP-Tree

Conditional Sequence Based on c

Sequence Count
aba 2
ab 1
abca 1
ab -1
baba 1
abac 1
aba -1

Sequence Count
aba 1
abca 1
baba 1
abac 1

Item Sup Frequency
a 4
b 4
c 2

Generate Web Access Patterns: ac, bc

Page 40: Web mining

Mining Web Access Patterns from WAP-Tree

Conditional Sequence Based on ac

Sequence Count

ab 3

b 1

bab 1

b -1

Sequence Count

ab 3

bab 1

Item Sup Frequency

a 4

b 4

Generate Web Access Patterns: aac, bac

Page 41: Web mining

Mining Web Access Patterns from WAP-Tree

Conditional Sequence Based on bac

Sequence Count

a 3

ba 1

Item Sup Frequency

a 4

b 1

Generate Web Access Patterns: abac

Page 42: Web mining

Mining Web Access Patterns from WAP-Tree

Conditional Sequence Based on abac

No Web Access Patterns are Generated

Sequence Count

a 4
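The patterns produced in this walk-through can be cross-checked by brute force, without the WAP-tree. The sketch below is not WAP-mine itself: it simply counts, for each pattern, how many frequent subsequences contain it as a (not necessarily consecutive) subsequence. The function names are assumptions made here.

```python
SEQUENCES = ["abac", "abcac", "babac", "abacc"]
MIN_COUNT = 3  # 75% of four sequences


def is_subsequence(pattern, sequence):
    """True if the events of pattern occur in sequence in the same order."""
    remaining = iter(sequence)
    return all(event in remaining for event in pattern)


def support(pattern, sequences):
    return sum(is_subsequence(pattern, s) for s in sequences)


generated = ["ac", "bc", "aac", "bac", "abac"]
supports = {p: support(p, SEQUENCES) for p in generated}
print(supports, support("cc", SEQUENCES))
```

Every pattern named on the slides reaches the threshold, while "cc" is supported by only two sequences, which agrees with c having a count of 2 in the conditional sequence base of c.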

Page 43: Web mining

Mining for Web Transactions

• To capture Web customer buying behavior
  – It is not just the market basket transaction, i.e., the set of items bought by a customer in a single purchase (Association Rules)
  – It is not just Web user travel patterns (Path Traversal Patterns)
  – It is an extension of path traversal patterns
• Exploring the relationship between traveling and buying

Page 44: Web mining

Mining for Web Transactions

Web Transaction

Algorithm WR (Web-transaction-Record)

Web Transaction Records <Path: a Set of Purchases>

Algorithm WTM, MTSPJ, MTSPC

Frequent Transaction Patterns

Web Transaction Association Rules

Page 45: Web mining

Mining for Web Transactions

• Web-transaction-Record (WR) Algorithm
  – Extract meaningful Web transaction records from the given Web transaction
• WTM (Web Transaction Mining) Algorithm
  – Mining Web Transaction Patterns
• MTS (Maximal Transaction Segment) Algorithms are improved versions of WTM

Page 46: Web mining

Mining for Web Transactions

Page 47: Web mining

Mining for Web Transactions

Page 48: Web mining

WTM Algorithm

• It joins the purchased itemsets for generating candidate transaction patterns

• WTM employs a two-level hash tree, called Web transaction tree, to store candidate transaction patterns
  – WTM hashes not only each item but also each purchase in the path

Page 49: Web mining

WTM Algorithm

Web Transaction DATABASE

WT_ID Path Purchase

100
ABCE B{i1}, C{i2}, E{i4}
ABFGH B{i1}, H{i6}
ASJL S{i7}, L{i9}

200
ABCE B{i1}, C{i2}, E{i4}
ASJLQ S{i7}, Q{i10}

300
ABCE B{i1}, E{i4}
ABFG B{i1}, G{i5}
ASJL S{i7}, J{i8}, L{i9}

400
ABD D{i3}
ABFG G{i5}
ASJLQ S{i7}, J{i8}, Q{i10}

Page 50: Web mining

Support Count

WT_ID Path Purchase

100

ABCE B{i1}, C{i2}, E{i4}

ABFGH B{i1}, H{i6}

ASJL S{i7}, L{i9}

200

ABCE B{i1}, C{i2}, E{i4}

ASJLQ S{i7}, Q{i10}

Path Purchase Support Count

AB B{i1} 2

ABC C{i2} 2
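The support counts on this slide can be reproduced with a short sketch. It assumes that a Web transaction supports a pattern <path : purchases> when one of its records has a path starting with the pattern's path and contains all of the pattern's purchases, and that each WT_ID is counted once. The data is the Web transaction database from the slides; the function name and the matching rule are assumptions chosen because they reproduce the counts shown.

```python
DATABASE = {
    100: [("ABCE", {"B{i1}", "C{i2}", "E{i4}"}),
          ("ABFGH", {"B{i1}", "H{i6}"}),
          ("ASJL", {"S{i7}", "L{i9}"})],
    200: [("ABCE", {"B{i1}", "C{i2}", "E{i4}"}),
          ("ASJLQ", {"S{i7}", "Q{i10}"})],
    300: [("ABCE", {"B{i1}", "E{i4}"}),
          ("ABFG", {"B{i1}", "G{i5}"}),
          ("ASJL", {"S{i7}", "J{i8}", "L{i9}"})],
    400: [("ABD", {"D{i3}"}),
          ("ABFG", {"G{i5}"}),
          ("ASJLQ", {"S{i7}", "J{i8}", "Q{i10}"})],
}


def support_count(database, path, purchases):
    """Number of Web transactions with a record matching <path : purchases>."""
    count = 0
    for records in database.values():
        if any(record_path.startswith(path) and purchases <= record_items
               for record_path, record_items in records):
            count += 1  # each WT_ID is counted at most once
    return count


first_two = {wt_id: DATABASE[wt_id] for wt_id in (100, 200)}
print(support_count(first_two, "AB", {"B{i1}"}),
      support_count(first_two, "ABC", {"C{i2}"}))
```

Restricted to transactions 100 and 200, both patterns get a support count of 2, as on this slide; over all four transactions <AB : B{i1}> rises to 3.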

Page 51: Web mining

WTM Algorithm

C1

Path Purchase Sup.

AB B{i1} 3

ABC C{i2} 2

ABD D{i3} 1

ABCE E{i4} 3

ABFG G{i5} 2

ABFGH H{i6} 1

AS S{i7} 4

ASJ J{i8} 2

ASJL L{i9} 2

ASJLQ Q{i10} 2

Path Purchase Sup.

AB B{i1} 3

ABC C{i2} 2

ABCE E{i4} 3

ABFG G{i5} 2

AS S{i7} 4

ASJ J{i8} 2

ASJL L{i9} 2

ASJLQ Q{i10} 2

T1

Support Count >= 2

Page 52: Web mining

WTM Algorithm

C2 (28 candidates in total)

Path Purchase Sup.

ABC B{i1} C{i2} 2

ABCE B{i1} E{i4} 3

AS B{i1} S{i7} 0

ASJ B{i1} J{i8} 0

ASJLQ J{i8} Q{i10} 1

ASJLQ L{i9} Q{i10} 0

Support Count >= 2

T2

Path Purchase Sup.

ABC B{i1} C{i2} 2

ABCE B{i1} E{i4} 3

ABCE C{i2} E{i4} 2

ASJ S{i7} J{i8} 2

ASJL S{i7} L{i9} 2

ASJLQ S{i7} Q{i10} 2

Page 53: Web mining

WTM Algorithm

C3

Path Purchase Sup.

ABCE B{i1} C{i2} E{i4} 2

Support Count >= 2

T3

Path Purchase Sup.

ABCE B{i1} C{i2} E{i4} 2

Page 54: Web mining

WTM Disadvantages

• WTM may generate a lot of unqualified candidate transaction patterns without utilizing the paths of frequent transaction patterns

• This will degrade the performance

Page 55: Web mining

MTSPJ Algorithm

• Algorithm MTSPJ uses the maximal transaction segment, which contains the frequent transaction patterns and the maximal path, to solve the unqualified candidate transaction pattern problem

• MTSPJ generates candidate transaction patterns only when the leaf node of the Web transaction tree is reached

Page 56: Web mining

MTSPJ Algorithm

Web Transaction DATABASE

WT_ID Path Purchase

100
ABCE B{i1}, C{i2}, E{i4}
ABFGH B{i1}, H{i6}
ASJL S{i7}, L{i9}

200
ABCE B{i1}, C{i2}, E{i4}
ASJLQ S{i7}, Q{i10}

300
ABCE B{i1}, E{i4}
ABFG B{i1}, G{i5}
ASJL S{i7}, J{i8}, L{i9}

400
ABD D{i3}
ABFG G{i5}
ASJLQ S{i7}, J{i8}, Q{i10}

[Figure: traversal tree over nodes A, B, C, D, E, F, G, H, S, J, L, Q]

Page 57: Web mining

MTSPJ Algorithm

C1

Path Purchase Sup.

AB B{i1} 3

ABC C{i2} 2

ABD D{i3} 1

ABCE E{i4} 3

ABFG G{i5} 2

ABFGH H{i6} 1

AS S{i7} 4

ASJ J{i8} 2

ASJL L{i9} 2

ASJLQ Q{i10} 2

Path Purchase Sup.

AB B{i1} 3

ABC C{i2} 2

ABCE E{i4} 3

ABFG G{i5} 2

AS S{i7} 4

ASJ J{i8} 2

ASJL L{i9} 2

ASJLQ Q{i10} 2

T1

Support Count >= 2

[Figure: traversal tree over nodes A, B, C, E, F, G, S, J, L, Q]

Page 58: Web mining

MTSPJ Algorithm

[Figure: traversal tree over nodes A, B, C, E, F, G, S, J, L, Q]

Maximal Transaction Segment

Path Purchase

ABCE B{i1} C{i2} E{i4}

C2

Path Purchase Sup.

ABC B{i1} C{i2} 2

ABCE B{i1} E{i4} 3

ABCE C{i2} E{i4} 2

Maximal Transaction Segment

Path Purchase

ABFG B{i1} G{i5}

C2

Path Purchase Sup.

ABFG B{i1} G{i5} 1

Maximal Transaction Segment

Path Purchase

ASJLQ S{i7} J{i8} L{i9} Q{i10}

C2

Path Purchase Sup.

ASJ S{i7} J{i8} 2

ASJL S{i7} L{i9} 2

ASJL J{i8} L{i9} 1

ASJLQ S{i7} Q{i10} 2

ASJLQ J{i8} Q{i10} 1

ASJLQ L{i9} Q{i10} 0

Page 59: Web mining

MTSPJ Algorithm

C2

Path Purchase Sup.

ABC B{i1} C{i2} 2

ABCE B{i1} E{i4} 3

ABCE C{i2} E{i4} 2

ABFG B{i1} G{i5} 1

ASJ S{i7} J{i8} 2

ASJL S{i7} L{i9} 2

ASJL J{i8} L{i9} 1

ASJLQ S{i7} Q{i10} 2

ASJLQ J{i8} Q{i10} 1

ASJLQ L{i9} Q{i10} 0

Path Purchase Sup.

ABC B{i1} C{i2} 2

ABCE B{i1} E{i4} 3

ABCE C{i2} E{i4} 2

ASJ S{i7} J{i8} 2

ASJL S{i7} L{i9} 2

ASJLQ S{i7} Q{i10} 2

T2

Page 60: Web mining

MTSPJ Algorithm

[Figure: traversal tree over nodes A, B, C, E, S, J, L, Q]

Maximal Transaction Segment

Path Purchase

ABCE B{i1} C{i2} E{i4}

C3

Path Purchase Sup.

ABCE B{i1} C{i2} E{i4} 2

Page 61: Web mining

MTSPC Algorithm

C1

Path Purchase Sup.

AB B{i1} 3

ABC C{i2} 2

ABD D{i3} 1

ABCE E{i4} 3

ABFG G{i5} 2

ABFGH H{i6} 1

AS S{i7} 4

ASJ J{i8} 2

ASJL L{i9} 2

ASJLQ Q{i10} 2

Path Purchase Sup.

AB B{i1} 3

ABC C{i2} 2

ABCE E{i4} 3

ABFG G{i5} 2

AS S{i7} 4

ASJ J{i8} 2

ASJL L{i9} 2

ASJLQ Q{i10} 2

T1

Support Count >= 2

[Figure: traversal tree over nodes A, B, C, E, F, G, S, J, L, Q]

MTSPC utilizes the LC (Large Count) to Filter Candidates

Page 62: Web mining

MTSPC Algorithm

K=1

[Figure: traversal tree over nodes A, B, C, E, F, G, S, J, L, Q]

Maximal Transaction Segment

Maximal Path Item LC

ABCE

B{i1} 1

C{i2} 1

E{i4} 1

|I| = 3 > 1 (K-1)

C2

Path Purchase Sup.

ABC B{i1} C{i2} 2

ABCE B{i1} E{i4} 3

ABCE C{i2} E{i4} 2

Maximal Transaction Segment

Maximal Path Item LC

ABFG

B{i1} 1

G{i5} 1

|I| = 2 > 1

C2

Path Purchase Sup.

ABFG B{i1} G{i5} 1

Maximal Transaction Segment

Maximal Path Item LC

ASJLQ

S{i7} 1

J{i8} 1

L{i9} 1

Q{i10} 1

|I| = 4 > 1

C2

Path Purchase Sup.

ASJ S{i7} J{i8} 2

ASJL S{i7} L{i9} 2

ASJL J{i8} L{i9} 1

ASJLQ S{i7} Q{i10} 2

ASJLQ J{i8} Q{i10} 1

ASJLQ L{i9} Q{i10} 0

Page 63: Web mining

MTSPC Algorithm

C2

Path Purchase Sup.

ABC B{i1} C{i2} 2

ABCE B{i1} E{i4} 3

ABCE C{i2} E{i4} 2

ABFG B{i1} G{i5} 1

ASJ S{i7} J{i8} 2

ASJL S{i7} L{i9} 2

ASJL J{i8} L{i9} 1

ASJLQ S{i7} Q{i10} 2

ASJLQ J{i8} Q{i10} 1

ASJLQ L{i9} Q{i10} 0

Path Purchase Sup.

ABC B{i1} C{i2} 2

ABCE B{i1} E{i4} 3

ABCE C{i2} E{i4} 2

ASJ S{i7} J{i8} 2

ASJL S{i7} L{i9} 2

ASJLQ S{i7} Q{i10} 2

T2

Page 64: Web mining

MTSPC Algorithm

K=2

[Figure: traversal tree over nodes A, B, C, E, S, J, L, Q]

Maximal Transaction Segment

Maximal Path Item LC

ABCE

B{i1} 2

C{i2} 2

E{i4} 2

|I| = 3 > 2

C3

Path Purchase

ABCE B{i1} C{i2} E{i4}

Maximal Transaction Segment

Maximal Path Item LC

ASJLQ

S{i7} 3

J{i8} 1

L{i9} 1

Q{i10} 1

|I| = 1 < 2

No Generations

Path Purchase Sup.

ABC B{i1} C{i2} 2

ABCE B{i1} E{i4} 3

ABCE C{i2} E{i4} 2

ASJ S{i7} J{i8} 2

ASJL S{i7} L{i9} 2

ASJLQ S{i7} Q{i10} 2

T2
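The LC filter for K=2 can be sketched as follows. The rule coded here is an interpretation inferred from the numbers on the slides, not a statement of the original algorithm: LC of an item is taken to be the number of frequent K-patterns containing it, I is the set of items in a maximal transaction segment with LC >= K, and candidates of size K+1 are generated only when |I| > K. All function names are assumptions.

```python
from collections import Counter

# Frequent 2-patterns (T2) from the slide: (path, purchases).
T2 = [
    ("ABC", ("B{i1}", "C{i2}")),
    ("ABCE", ("B{i1}", "E{i4}")),
    ("ABCE", ("C{i2}", "E{i4}")),
    ("ASJ", ("S{i7}", "J{i8}")),
    ("ASJL", ("S{i7}", "L{i9}")),
    ("ASJLQ", ("S{i7}", "Q{i10}")),
]

# Maximal transaction segments: maximal path -> purchased items on it.
SEGMENTS = {
    "ABCE": ["B{i1}", "C{i2}", "E{i4}"],
    "ASJLQ": ["S{i7}", "J{i8}", "L{i9}", "Q{i10}"],
}


def large_counts(frequent_patterns):
    """LC: in how many frequent patterns does each item occur?"""
    lc = Counter()
    for _path, purchases in frequent_patterns:
        lc.update(purchases)
    return lc


def qualified_items(segment_items, lc, k):
    # ASSUMPTION: an item qualifies when its LC is at least k.
    return [item for item in segment_items if lc[item] >= k]


def should_generate(segment_items, lc, k):
    # ASSUMPTION: a (k+1)-pattern needs more than k qualified items.
    return len(qualified_items(segment_items, lc, k)) > k


lc = large_counts(T2)
for path, items in SEGMENTS.items():
    print(path, qualified_items(items, lc, 2), should_generate(items, lc, 2))
```

Under these assumptions the segment ABCE keeps three qualified items and yields the candidate B{i1} C{i2} E{i4}, while ASJLQ keeps only S{i7} and generates nothing.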

Page 65: Web mining

Mining for Web Transactions

• <ABCE : B{1}, E{4}> = 2

• <AB : B{1}> = 3

• We can derive <ABCE : B{1} => E{4}>
  – support_count(<ABCE : B{1} => E{4}>) = 2
  – confidence(<ABCE : B{1} => E{4}>) =
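The right-hand side of the confidence is missing from the transcript. If the slide follows the usual association-rule definition, which is an assumption here, the confidence is the support count of the whole pattern divided by the support count of the antecedent. With the two counts quoted on this slide, that reading would give:

```latex
\mathrm{confidence}(\langle ABCE : B\{1\} \Rightarrow E\{4\} \rangle)
  = \frac{\mathrm{support\_count}(\langle ABCE : B\{1\}, E\{4\} \rangle)}
         {\mathrm{support\_count}(\langle AB : B\{1\} \rangle)}
  = \frac{2}{3} \approx 0.67
```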

Page 66: Web mining

Summary
• Data mining in the Web is an area of growing importance
  – In particular, with the emergence of electronic commerce (EC)
  – More and more applications will benefit from the knowledge obtained from data mining
• Web Mining = Web Data Collection + Traditional Data Mining?
• Important Issues
  – Incremental Web Mining