1 WHOWEDA : Warehouse of Web Data Sanjay Kumar Madria Department of Computer Science Purdue University, West Lafayette, IN 47907 [email protected]

Dec 30, 2015

Page 1: WHOWEDA : Warehouse of Web Data Sanjay Kumar Madria Department of Computer Science

1

WHOWEDA : Warehouse of Web Data

Sanjay Kumar Madria

Department of Computer Science

Purdue University, West Lafayette, IN 47907

[email protected]



WWW

• collection of multimedia documents in the form of web pages connected via hyperlinks.


Characteristics of WWW

• WWW is a set of directed graphs

• data in the WWW has a heterogeneous nature

• unstructured versus structured information

• no central authority to manage information

• Dynamic versus static information

• Web information discoveries - search engines


As the WWW grows, it becomes more chaotic

• The Web is a fast-growing, distributed, non-administered global information resource

• WWW allows access to text, image, video, sound and graphic data

• more business organizations creating web servers

• more chaotic environment to locate information of interest

• lost in hyperspace syndrome


Does it affect the corporate world?

• Lack of credibility of data
– Different sites with different data
– Same site, different data

• Historical information is not available
– Previous versions of web data
– How does web data change with time
– Summarization over time

• Data to information

• Reduction in productivity
– Analysis is manual


How users find web sites

• Indexes and search engines 75
• UseNet newsgroups 44
• Cool lists 27
• New lists 24
• Listservers 23
• Print ads 21
• Word-of-mouth and e-mail 17
• Linked web advertisement 4


Limitations of Search Engines

• Do not exploit hyperlinks

• search is limited to string matching

• Queries are evaluated on archived data rather than up-to-date data; no indexing on current data

• low accuracy

• replicated results

• no further manipulation possible


Limitations of Search Engines

• ERROR 404!

• No efficient document management

• Query results cannot be further manipulated

• No efficient means for knowledge discovery


Current Research Projects

• Web Query System
– W3QS, WebSQL, AKIRA, NetQL, RAW, WebLog

• Semistructured Data
– LOREL, UnQL, WebOQL

• Website Management System
– STRUDEL

• Web Warehouse
– WHOWEDA


WHOWEDA - Key Objectives

• Design a suitable data model to represent web information

• development of web algebra and query language

• Maintenance of Web data

• Development of knowledge discovery and web mining tools

• Web warehouse


WHOWEDA - What?

• WareHouse Of Web Data
– Subject-oriented
– Integrated
– Temporal
– Granularity - lower, higher
– Some summary
– Not updatable
– Alternative information sources


What is a Web Warehouse?

• Subject-oriented, integrated, time-variant, non-volatile repository of web data for direct querying and analysis for some sort of decision making

• A process whereby organizations or individuals extract value from their Web informational assets through the use of special stores called web warehouses


WHOWEDA! www.cais.ntu.edu.sg:8000/~whoweda

• A WareHouse Of WEb DAta

• Web Information Coupling Model (WICM)
– Web Objects
– Web Schema

• Web Information Coupling Algebra

• Web Information Maintenance

• Web Mining and Knowledge discovery


[Figure: WHOWEDA architecture. Components shown: WWW, Web Warehouse, Web Marts, Warehouse Concept Mart, Web Information Coupling System, Web Information Maintenance System, Web Information Mining System, Web Querying & Analysis Component, User.]


[Figure: Web manipulation components. Elements shown: WWW, Web Warehouse, Warehouse Concept Mart, User, Web Query & Display, Pre processing, Global Web Manipulation, Local Web Manipulation, Global Web Coupling, Global Ranking, Web Select, Web Project, Local Web Coupling, Local Ranking, Web Join, Web Union, Web Intersection, Schema Tightness, Schema Search, Schema Match, Data Visualization.]


Web Objects

• Node - url, title, format, size, date, text

• Link - source-url, target-url, label, link-type

• Web tuple

• Web table

• Web schema

• Web database
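
The web objects listed above can be sketched as plain data structures. This is only an illustrative sketch, not WHOWEDA's actual implementation: the class names, the `issues.html` URL, and the `"local"` link-type value are assumptions chosen for the example, while the attribute names follow the slide.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Node:
    """A web document, with the node attributes named on the slide."""
    url: str
    title: str = ""
    format: str = ""
    size: int = 0
    date: str = ""
    text: str = ""


@dataclass(frozen=True)
class Link:
    """A hyperlink, with the link attributes named on the slide."""
    source_url: str
    target_url: str
    label: str = ""
    link_type: str = ""


@dataclass
class WebTuple:
    """Binds node and link variables (e.g. "x", "e") to concrete instances."""
    nodes: Dict[str, Node]
    links: Dict[str, Link]


# In this sketch a web table is simply a list of web tuples.
WebTable = List[WebTuple]

home = Node(url="http://www.panacea.org/", title="Panacea")
issues = Node(url="http://www.panacea.org/issues.html", title="Issues")  # hypothetical URL
to_issues = Link(source_url=home.url, target_url=issues.url,
                 label="Issues", link_type="local")

diseases: WebTable = [WebTuple(nodes={"x": home, "y": issues},
                               links={"e": to_issues})]
```

A web schema and a web database would sit on top of these: the schema describes which variables a table's tuples bind, and the database is a collection of tables.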


Web Schema

• Metadata in the warehouse

• Structural ‘summary’ of web table

• Information Coupling using a Query graph

• Query graph -> Web schema

• Directed graph represented by an ordered 4-tuple:
– Set of node variables
– Set of link variables
– Connectivities
– Predicates


[Figure: Brief organization of Information Square's web site. Pages shown: Information Square's homepage; Headline articles 1 to n; News@TCS (list of video files); News specials; Airport info; List of links to local news; List of links to world news; Local news 1 to k; World news 1 to t.]


[Figure: Schema graph with node variables x, y, z, w and link variables e, f, g, h, annotated with: label CONTAINS "Local News"; target_url CONTAINS "newshub/specials"; url CONTAINS "local"; label CONTAINS "World News"; url CONTAINS "world"; target_url CONTAINS "article"; url CONTAINS "headlines".]


[Figure: Pages shown: Information Square's homepage; Headline article 1; News specials; List of links to local news; List of links to world news; Local news 1; World news 1.]


Schema - example

• Node variables: Xn = { x, y, z, w }

• Link variable: Xl = { e, f, g }

• Connectivities: C = { x<e>y and x<fg->z and x<fh->w }
– The symbol represents an anonymous node variable, a node variable not restricted by any predicate.


• Predicates

P = { x.url = "http://www.mediacity.com.sg/i-square",
y.url CONTAINS "headlines",
e.target_url CONTAINS "article",
f.target_url CONTAINS "newshub/specials",
g.label CONTAINS "Local News",
z.url CONTAINS "local",
h.label CONTAINS "World News",
w.url CONTAINS "world" }
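
To make the 4-tuple concrete, the sketch below stores a schema as node variables, link variables, connectivities and predicates, and checks a candidate binding against the predicates. It is a simplified illustration under stated assumptions: the `WebSchema` class, the `holds` and `satisfies` helpers, the tuple encoding of predicates, and the article URL are all hypothetical, and only a subset of the predicates above is used.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# A predicate is encoded as (variable, attribute, operator, value).
Predicate = Tuple[str, str, str, str]
Binding = Dict[str, Dict[str, str]]  # variable -> attribute -> value


@dataclass
class WebSchema:
    node_vars: List[str]
    link_vars: List[str]
    connectivities: List[str]
    predicates: List[Predicate]


def holds(pred: Predicate, binding: Binding) -> bool:
    """Evaluate one predicate against the attribute values bound to its variable."""
    var, attr, op, value = pred
    actual = binding[var][attr]
    if op == "EQUALS":
        return actual == value
    if op == "CONTAINS":
        return value in actual
    raise ValueError(f"unsupported operator: {op}")


def satisfies(schema: WebSchema, binding: Binding) -> bool:
    """A binding satisfies the schema when every predicate holds."""
    return all(holds(p, binding) for p in schema.predicates)


schema = WebSchema(
    node_vars=["x", "y"],
    link_vars=["e"],
    connectivities=["x<e>y"],
    predicates=[
        ("x", "url", "EQUALS", "http://www.mediacity.com.sg/i-square"),
        ("y", "url", "CONTAINS", "headlines"),
        ("e", "target_url", "CONTAINS", "article"),
    ],
)

# Hypothetical article URL, made up for the example.
article = "http://www.mediacity.com.sg/i-square/headlines/article1.html"
good = {"x": {"url": "http://www.mediacity.com.sg/i-square"},
        "y": {"url": article},
        "e": {"target_url": article}}
bad = {**good, "y": {"url": "http://www.mediacity.com.sg/i-square/sports.html"}}
```

Here `satisfies(schema, good)` is true and `satisfies(schema, bad)` is false, because the second binding's `y.url` does not contain "headlines".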


Query Graph - Example 1

• Query graph - same as schema except that it has one more parameter to control the results returned.

• Informally, it is a directed connected graph consisting of nodes, links and keywords imposed on them.

• Produce a list of diseases with their symptoms, evaluation procedures and treatment starting from the web site at http://www.panacea.org/

• Web table Diseases


[Figure: Query graph for web table Diseases. Start node x at http://www.panacea.org/; further variables y, z, w, q, e, f, g, p; labels shown: "List of Diseases", "Issues", "Symptoms list", "Symptoms", "Treatment list", "Treatment", "Evaluation".]


[Figure: Example web tuple of web table Diseases. Start node x0 at http://www.panacea.org/; instances y1, z1, w1, q1, e1, f1, g1, p2; labels shown: "List of Diseases", "Issues", "AIDS", "Symptoms list", "Symptoms", "Treatment list", "Treatment", "Evaluation", "Elisa Test".]


Example 2

• Produce a list of drugs, and their uses and side effects starting from the web site at http://www.panacea.org/

• Web table Drugs


[Figure: Query graph for web table Drugs. Start node a at http://www.panacea.org/; further variables b, c, d, r, s, k; labels shown: "List of Diseases", "Drug list", "Issues", "Uses", "Use", "Side effects".]


[Figure: Example web tuple of web table Drugs. Start node a0 at http://www.panacea.org/; instances b1, c1, d1, r1, s1, k1; labels shown: "List of Diseases", "Drug list", "Issues", "AIDS", "Indavir", "Use", "Uses of Indavir", "Side effects", "Side effects of Indavir".]


Query Language

• Starting from the CS department home page at NTU, find all documents that are linked through paths of length less than two containing only local links, and that contain “database” in their text.


• COUPLE WEBTABLE W FROM WWW

SUCH THAT NODE i, j IN WWW AND LINK e, f, g IN WWW AND i<e|f,g>j

WHERE i.url EQUALS “http://www.ntu.edu.sg” AND j.text CONTAINS “database” AND f.link-type EQUALS local AND g.link-type EQUALS local;
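
One way to picture the evaluation of such a query is a bounded breadth-first traversal over a link graph. The sketch below runs on a small in-memory graph rather than the live WWW and is not the WHOWEDA query processor: the function name `couple_from`, the `pages` dictionary layout, the page URLs below the start page, and the choice to follow only local links up to a depth of two are assumptions made for the example.

```python
from collections import deque
from typing import Dict, List


def couple_from(start_url: str, pages: Dict[str, dict],
                max_depth: int, keyword: str) -> List[str]:
    """Return URLs reachable from start_url via at most max_depth local links
    whose text contains keyword (the start page itself is not reported)."""
    results = []
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth > 0 and keyword in pages[url]["text"]:
            results.append(url)
        if depth == max_depth:
            continue  # do not expand beyond the path-length bound
        for target, link_type in pages[url]["links"]:
            if link_type == "local" and target in pages and target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return results


# Toy link graph; every URL except the start page is hypothetical.
pages = {
    "http://www.ntu.edu.sg": {
        "text": "home page",
        "links": [("http://www.ntu.edu.sg/a", "local"),
                  ("http://example.org/x", "global")]},
    "http://www.ntu.edu.sg/a": {
        "text": "database group",
        "links": [("http://www.ntu.edu.sg/b", "local")]},
    "http://www.ntu.edu.sg/b": {
        "text": "database course",
        "links": [("http://www.ntu.edu.sg/c", "local")]},
    "http://www.ntu.edu.sg/c": {"text": "database archive", "links": []},
    "http://example.org/x": {"text": "database elsewhere", "links": []},
}

found = couple_from("http://www.ntu.edu.sg", pages, max_depth=2, keyword="database")
```

With this toy graph, `found` contains the pages `/a` and `/b`: page `/c` lies three links away and the external page is reached only through a non-local link.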


Web Algebra

• Formal foundation of data representation and manipulation in a web warehouse

• Web operators:
– Information access operator
– Information manipulation operators
– Web schema operators
– Data visualization operators


Information access operator

• Global Web Coupling


Information Manipulation

– Web select
– Web project
– Local web coupling
– Web join
– Web cartesian product
– Web union
– Web intersect


Web Select

• Extracts web tuples from web tables satisfying certain conditions on node and link variables and on connectivities

• Input is a select schema

• Output is a web table satisfying the select schema


• Select the tuples of W1 that contain world news about Indonesia since May 1, 1998.

• Web select on W1 with select schema Ms, where

Ms = < Xsn, Xsl, Cs, Ps >,

Xsn = { x, w }, Xsl = { },

Cs = { },

Ps = { x.date > "1May1998", w.text CONTAINS "Indonesia" }
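
The selection above can be pictured as a filter over the tuples of W1. The following is a minimal sketch, assuming web tuples are dictionaries from variable names to attribute dictionaries and predicates are plain Python callables; the `web_select` function and the sample tuples are made up for illustration and do not show how WHOWEDA derives the resulting schema.

```python
from datetime import date
from typing import Callable, Dict, List

WebTuple = Dict[str, dict]  # variable name -> attribute dictionary


def web_select(table: List[WebTuple],
               predicates: List[Callable[[WebTuple], bool]]) -> List[WebTuple]:
    """Keep only the web tuples that satisfy every predicate of the select schema."""
    return [t for t in table if all(p(t) for p in predicates)]


# Invented sample data standing in for web table W1.
W1 = [
    {"x": {"date": date(1998, 5, 20)}, "w": {"text": "Indonesia: markets reopen"}},
    {"x": {"date": date(1998, 4, 2)}, "w": {"text": "Indonesia: earlier report"}},
    {"x": {"date": date(1998, 6, 1)}, "w": {"text": "Weather in Europe"}},
]

# Ps = { x.date > "1May1998", w.text CONTAINS "Indonesia" }
Ps = [
    lambda t: t["x"]["date"] > date(1998, 5, 1),
    lambda t: "Indonesia" in t["w"]["text"],
]

selected = web_select(W1, Ps)
```

Only the first sample tuple passes both predicates; the second is too old and the third does not mention Indonesia.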


• Xn’ = { x, y, z, w }, Xl’ = { e, f, g }

• C’ = { x<e>y and x<fg->z and x<fh->w }

• P’ = { x.url = "http://www.mediacity.com.sg/i-square",
x.date > "1May1998",
e.target_url CONTAINS "article",
f.target_url CONTAINS "newshub/specials",
g.label CONTAINS "Local News",
z.url CONTAINS "local",
h.label CONTAINS "World News",
w.url CONTAINS "world",
w.text CONTAINS "Indonesia" }


Web Information Coupling System

• A database system to couple related web information

• Global web Coupling and Local Web Coupling


Global Coupling - Information Access

• To integrate data from the Web

• To create historical data

• To couple related information from the WWW satisfying a query graph

• Operator to create web tables

• From web with no schema to web table with web schema


Why local web coupling?

• Directly querying the WWW to gather this information is an expensive and repetitive affair

• Web documents containing similar information can reside in different web tables in a web warehouse

• A mechanism to gather this similar information by additional manipulation of the materialized web tables


Local Web Couple operator

• Two web tuples w_i and w_j can be coupled if there exists at least one pair of nodes from w_i and w_j which contain similar information.


Local Web Couple operator

• The web couple operator is basically a web cartesian product followed by web select:

• We denote web couple by the symbol:

[Formula garbled in source: it expresses the coupled web table W in terms of the input web tables W_i and W_j.]


Web Coupling


• M2 = < Xn”, Xl”, C”, P” > for W2

• Xn” = { s, t, u }, Xl” = { k, l, m, n }

• C” = { s<kl>t and s<mn>u }

• P” = { s.url = "http://www.asia1.com.sg/straitstimes/",
k.label = "REGION",
l.target_url = "http://www.asia1.com.sg/straitstimes/pages/sea*.html",
m.label = "WORLD",
n.target_url = "http://www.asia1.com.sg/straitstimes/pages/wrld*.html" }


• Couple W1 and W2 under coupling condition q, where

• q = (x.date = s.date) & (w.text CONTAINS "Indonesia") & (t.text CONTAINS "Indonesia")


• Xn* = { x, y, z, w, s, t, u }, Xl* = { e, f, g, k, l, m, n }

• C* = { x<e>y and x<fg->z and x<fh->w and s<kl>t and s<mn>u }

• P* = { x.url = "http://www.mediacity.com.sg/i-square",

• e.target_url CONTAINS "article",

• f.target_url CONTAINS "newshub/specials",

• g.label CONTAINS "Local News",

• z.url CONTAINS "local",

• h.label CONTAINS "World News",

• w.url CONTAINS "world",

• s.url = "http://www.asia1.com.sg/straitstimes/",


• k.label = “REGION”, l.target_url = “http://www.asia1.com.sg/straitstimes/pages/sea*.html”,

• m.label = “WORLD”,

• n.target_url = “http://www.asia1.com.sg/straitstimes/pages/wrld*.html”,

• x.date = s.date,

• w.text CONTAINS “Indonesia”,

• t.text CONTAINS “Indonesia"}


Local Web Coupling

• Initiated explicitly by the user

• User provides the pair of node variables and the keyword set based on which coupling is to be performed

• Coupling nodes in each pair of web tuples in the input web tables must satisfy one of the coupling conditions


Construction of coupled table

• First perform a web cartesian product on the two web tables

• For each web tuple in the resultant web table
– the specified instances of node variables are inspected to determine whether the web tuple satisfies the coupling compatibility condition(s)


Construction of coupled table

– If a pair of nodes satisfies none of the conditions, the corresponding web tuple is rejected

– Otherwise, the web tuple is stored in a separate web table
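
The construction just described (cartesian product, then a per-tuple check of the coupling nodes) can be sketched as follows. This is an illustrative simplification: the slides do not spell out the coupling compatibility conditions, so the sketch assumes one condition only (both coupling nodes mention a shared keyword), and the function names, the dictionary-based tuples and the sample data are hypothetical. It also assumes the two tables use disjoint variable names.

```python
from itertools import product
from typing import Dict, List, Tuple

WebTuple = Dict[str, dict]  # variable name -> attribute dictionary


def web_cartesian_product(t1: List[WebTuple], t2: List[WebTuple]) -> List[WebTuple]:
    """Concatenate every tuple of t1 with every tuple of t2 (n*m results)."""
    return [{**a, **b} for a, b in product(t1, t2)]


def local_web_couple(t1: List[WebTuple], t2: List[WebTuple],
                     node_pairs: List[Tuple[str, str]],
                     keywords: List[str]) -> List[WebTuple]:
    """Keep the concatenated tuples in which some coupling pair shares a keyword."""
    coupled = []
    for tup in web_cartesian_product(t1, t2):
        if any(kw in tup[a]["text"] and kw in tup[b]["text"]
               for a, b in node_pairs for kw in keywords):
            coupled.append(tup)  # stored in the separate, coupled web table
        # otherwise the web tuple is rejected
    return coupled


# Invented sample data standing in for two materialized web tables.
W1 = [{"w": {"text": "Indonesia unrest"}}, {"w": {"text": "Sports results"}}]
W2 = [{"t": {"text": "Indonesia economy"}}, {"t": {"text": "Local weather"}}]

coupled = local_web_couple(W1, W2, node_pairs=[("w", "t")], keywords=["Indonesia"])
```

Of the four concatenated tuples, only the one pairing the two Indonesia documents survives, which also illustrates why a tuple-level coupled table never holds more than n*m tuples.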


Types of web coupling

• System driven web coupling: In this case the system decides which node variables are to be coupled (coupling nodes). If at least one pair of coupling nodes cannot be identified then the web tables cannot be coupled.


Types of web coupling

• User driven web coupling: In this case the user decides which are the node variables to be coupled (coupling nodes).

• Coupling is performed only on those user specified node variable(s).


Types of web coupling

• Attribute driven web coupling: In this case the user specifies the coupling attributes.

• Coupling is performed only on those user specified coupling attribute(s).


Attribute driven web coupling

COUPLE TABLE3

FROM TABLE1 AND TABLE2

ON ATTRIBUTE “TEXT”

AT SCHEMA/TUPLE(optional)


Types of web coupling

• Value driven web coupling: In this case the user specifies the values of the attributes of the nodes on which coupling should be performed.

• Coupling is performed only on those user specified attribute values.


Value driven web coupling

COUPLE TABLE3

FROM TABLE1 AND TABLE2

ON VALUE “Software Agents”

AT SCHEMA/TUPLE(optional)


Schema level web coupling

• We inspect the schemas to decide whether the two web tables can be coupled.

• If coupling conditions cannot be identified then the two web tables cannot be coupled.

• We do not inspect the web tuples in the web table.

• Number of web tuples coupled will be n*m.


Tuple level web coupling

• We inspect the web tuples of the two input web tables to identify nodes with similar information.

• The number of web tuples in the coupled web table is <= n*m


Why two levels?

• A schema does not capture all the information of the web documents in a web table, so it is not always possible to identify a coupling condition by inspecting the schemas.

• It is possible to find coupling nodes which are not defined in the schemas.


Why two levels?

• Tuple level coupling gives us a means to correlate web documents containing similar information from the web tables (that cannot be identified from their schemas) at the expense of additional processing.


Join Processing in Web Databases


Web Join

• Concatenate tuples based on identical nodes or documents

• Inputs are two web tables and their schemas

• Output is a joined table

• Types
– Pi-web join, theta-web join, outer joins, web composition, semi web join


Web Join

• Used for combining related data from various web tables

• Mechanism to detect changes

• Mechanism to find alternative web document in case of “Document Not Found” error


Web Join Operator

• Information manipulation operator

• Manipulate information residing in a web database to derive additional information

• Harness useful, composite information from two web tables

• Capitalize on the reuse of retrieved data from the WWW in order to reduce execution time of queries


Joinable Nodes

• Node variables participating in the web join process

• Expressed as a pair

• Each node in the pair should have identical URLs


Web Join

• Combine two web tables by concatenating a web tuple of one web table with a web tuple of other web table whenever there exist joinable nodes

• Joinable nodes are identified from the schemas of the two web tables

• URLs of the joinable nodes are identical
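
The join described above can be sketched as concatenating tuples whose joinable nodes carry identical URLs. This is a simplified illustration, not the WHOWEDA join algorithm: the joinable pair is passed in by hand instead of being derived from the two schemas, the `web_join` function and the dictionary-based tuples are assumptions, and every URL other than http://www.panacea.org/ is hypothetical.

```python
from typing import Dict, List, Tuple

WebTuple = Dict[str, dict]  # variable name -> attribute dictionary


def web_join(t1: List[WebTuple], t2: List[WebTuple],
             joinable: List[Tuple[str, str]]) -> List[WebTuple]:
    """Concatenate a tuple of t1 with a tuple of t2 whenever every joinable
    node pair (v1 from t1, v2 from t2) has identical URLs."""
    drop = {v2 for _, v2 in joinable}  # the shared node is kept only once
    joined = []
    for a in t1:
        for b in t2:
            if all(a[v1]["url"] == b[v2]["url"] for v1, v2 in joinable):
                merged = dict(a)
                merged.update({v: n for v, n in b.items() if v not in drop})
                joined.append(merged)
    return joined


# Invented sample tuples standing in for the Diseases and Drugs web tables.
diseases = [{"x": {"url": "http://www.panacea.org/"},
             "y": {"url": "http://www.panacea.org/issues.html"}}]
drugs = [{"a": {"url": "http://www.panacea.org/"},
          "b": {"url": "http://www.panacea.org/druglist.html"}},
         {"a": {"url": "http://www.example.org/"},
          "b": {"url": "http://www.example.org/druglist.html"}}]

joined = web_join(diseases, drugs, joinable=[("x", "a")])
```

Only the first Drugs tuple starts at the same URL as the Diseases tuple, so the joined table holds a single tuple with the variables x, y and b.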


[Figure: Joined schema of web tables Diseases and Drugs. Start node x at http://www.panacea.org/; variables shown: y, z, w, q, e, f, g, p, b, c, d, r, s, k; labels shown: "List of Diseases", "Issues", "Symptoms list", "Symptoms", "Treatment list", "Treatment", "Evaluation", "Drug list", "Uses", "Use", "Side effects".]


[Figure: Example joined web tuple. Start node x0 at http://www.panacea.org/; instances shown: y1, z1, w1, q1, e1, f1, g1, p2, b1, c1, d1, r1, s1, k1; labels shown: "AIDS", "AIDS treatment", "Symptoms of AIDS", "Evaluation", "Elisa Test", "Indavir", "Uses of Indavir", "Side effects of Indavir".]


Join Existence

• Given two web tables, we determine if these two web tables are joinable

• Inspect the schemas of the web tables

• Satisfy joinability conditions based on:
– node predicates
– link predicates
– node and link predicates
– locus of a node relative to a joinable node


Join Construction

• To construct a joined schema, we construct:
– node set
– link set
– connectivity set
– predicate set

• Construction of joined table
– Concatenating the web tuples of the two input tables over the joinable nodes


Web Bags

• Existence of identical web tuples.

• Created due to web project operation.

• Structure based mining

• Used for discovering
– Visible nodes
– Luminous nodes
– Luminous paths


Definitions

• Visibility of a web document or node D in a web table W measures the number of different web documents in W that have links to D

• Luminosity - reverse of visibility, the number of other distinct documents that are linked from D

• Luminous paths - a set of inter-linked nodes which occurs a number of times in a web table
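
Read as graph measures, visibility counts distinct incoming neighbours and luminosity counts distinct outgoing neighbours. The sketch below computes both from a flat list of (source URL, target URL) pairs; that link-list representation, the function names, the short placeholder document names, and the decision to ignore self-links are assumptions for the example rather than details taken from the slides.

```python
from typing import List, Tuple

LinkList = List[Tuple[str, str]]  # (source_url, target_url) pairs found in web table W


def visibility(links: LinkList, d: str) -> int:
    """Number of different documents that have a link to d."""
    return len({src for src, dst in links if dst == d and src != d})


def luminosity(links: LinkList, d: str) -> int:
    """Number of other distinct documents that are linked from d."""
    return len({dst for src, dst in links if src == d and dst != d})


# Placeholder document names; the repeated ("A", "D") pair counts only once.
links = [("A", "D"), ("B", "D"), ("A", "D"),
         ("D", "E"), ("D", "F"), ("D", "D")]

vis_d = visibility(links, "D")
lum_d = luminosity(links, "D")
```

For this link list, document D has visibility 2 (linked from A and B) and luminosity 2 (links to E and F).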


Steps to find visible nodes

• Input: Web table W, node variable x, visibility threshold v

• Output: Set of visible nodes

• Create a web table from W where each web tuple contains distinct instances of node x and the preceding node which is linked to x

• Eliminate the nodes linked to x in each tuple of the web table using web project



Steps to find visible nodes

• Check if the collection of web tuples of node x thus created is a web bag by comparing their URLs

• Create multiplets for each collection of identical nodes

• For each multiplet calculate the node visibility

• Determine the multiplets with node visibility greater than the threshold

• Create the visible node set
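
The steps for finding visible nodes can be sketched with a counter standing in for the multiplets. This is a rough illustration under assumptions: web tuples are dictionaries of attribute dictionaries, the function name `visible_nodes` and the variable names `y` and `z` are chosen for the example, the target URLs are borrowed from the figures in this deck, and the preceding-page URLs are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Set

WebTuple = Dict[str, dict]  # variable name -> attribute dictionary


def visible_nodes(table: List[WebTuple], pred_var: str, node_var: str,
                  threshold: int) -> Set[str]:
    """Return URLs of node_var instances linked from more than `threshold`
    distinct preceding documents."""
    # Step 1: distinct (preceding node, node) instance pairs from W.
    pairs = {(t[pred_var]["url"], t[node_var]["url"]) for t in table}
    # Step 2: web project away the preceding node; identical tuples may
    # now appear, so the result is a web bag.
    bag = [node_url for _, node_url in pairs]
    # Step 3: one multiplet per collection of identical nodes; its size is
    # the node visibility.
    multiplets = Counter(bag)
    # Step 4: keep the multiplets whose visibility exceeds the threshold.
    return {url for url, vis in multiplets.items() if vis > threshold}


desc = "http://www.cancer.org/desc.html"
skin = "http://www.disease.com/cancer/skin.htm"
# Preceding-page URLs are invented for the example.
W = [
    {"y": {"url": "http://www.panacea.org/cancer.html"}, "z": {"url": desc}},
    {"y": {"url": "http://www.example.org/links.html"}, "z": {"url": desc}},
    {"y": {"url": "http://www.panacea.org/cancer.html"}, "z": {"url": desc}},
    {"y": {"url": "http://www.panacea.org/cancer.html"}, "z": {"url": skin}},
]

visible = visible_nodes(W, pred_var="y", node_var="z", threshold=1)
```

Here `desc.html` is linked from two distinct pages and passes a threshold of 1, while `skin.htm` is linked from only one page and does not.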


Steps to find luminous nodes

• Input: Web table W, node variable x, luminosity threshold l

• Output: Set of luminous nodes

• Steps are similar to that of visible node discovery

• We consider the nodes linked from x in place of nodes linked to x



Steps to find luminous paths

• Create the collection of multiplets

• Compute path luminosity for each multiplet

• If the path luminosity value of a multiplet is greater than or equal to threshold then a path in the multiplet is a luminous path

• Otherwise, we create a collection of linear web tuples from the above collection of web tuples


Steps to find luminous paths

• This is to identify whether there exists a subset of inter-linked nodes between x and y that forms a luminous path

• We repeat the procedure to compute path luminosity for this set of inter-linked nodes


[Figure: Web Schema. Node variables x (http://www.panacea.org/), y and z; link variables e and f; labels shown: "Diseases", "Cancer".]


[Figure: Web Table. Five web tuples, each with x0 (http://www.panacea.org/), y0, e0, f0 and one instance of z: z1 (http://www.cancer.org/desc.html), z1, z2, z4, z1.]


[Figure: Projected schema. Node variable z, labelled "Cancer".]


[Figure: Web Table after eliminating x and y. Remaining instances of z: z1 (http://www.cancer.org/desc.html), z1, z2, z4, z1.]


[Figure: Projected schema. Node variables x (http://www.panacea.org/), y and z; link variable e; labels shown: "Diseases", "Cancer".]


[Figure: Web Bag. Five web tuples starting at x0 (http://www.panacea.org/): three tuples (x0, y0, z1) with z1 at http://www.cancer.org/desc.html, one tuple (x0, y0, z2) with z2 at http://www.disease.com/cancer/skin.htm, and one tuple (x0, y0, z4) with z4 at http://www.jhu.edu/medical/research/cancer.htm.]


[Figure: After removal of identical tuples. Three web tuples remain: (x0, y0, z1) with http://www.cancer.org/desc.html, (x0, y0, z2) with http://www.disease.com/cancer/skin.htm, and (x0, y0, z4) with http://www.jhu.edu/medical/research/cancer.htm.]


CancerCancer

z1z1 http://www.cancer.org/desc.htmlhttp://www.cancer.org/desc.html

CancerCancer

z1z1 http://www.cancer.org/desc.htmlhttp://www.cancer.org/desc.html

http://www.cancer.org/desc.htmlhttp://www.cancer.org/desc.html

CancerCancer

z2z2

http://www.disease.com/cancer/skin.htmhttp://www.disease.com/cancer/skin.htm

http://www.jhu.edu/medical/research/cancer.htmhttp://www.jhu.edu/medical/research/cancer.htm

CancerCancer

z1z1

CancerCancer

z4z4

Page 89: WHOWEDA : Warehouse of Web Data Sanjay Kumar Madria Department of Computer Science

[Figure: nodes z1, z2 and z4 with the link label "Cancer" and the pages http://www.cancer.org/desc.html, http://www.disease.com/cancer/skin.htm and http://www.jhu.edu/medical/research/cancer.htm.]

Page 90: WHOWEDA : Warehouse of Web Data Sanjay Kumar Madria Department of Computer Science

[Figure: Visible Nodes. Nodes z1, z2 and z4 with the link label "Cancer" and the pages http://www.cancer.org/desc.html, http://www.disease.com/cancer/skin.htm and http://www.jhu.edu/medical/research/cancer.htm.]

Page 91: WHOWEDA : Warehouse of Web Data Sanjay Kumar Madria Department of Computer Science

[Figure: Luminous Paths]

Page 92: WHOWEDA : Warehouse of Web Data Sanjay Kumar Madria Department of Computer Science

More Operators . . .

• Web schema operators:
  – Schema tightness operator, Schema match operator, Schema search operator

• Data visualization operators:
  – Ranking operators (Global & Local), Web Nest, Web Un-nest, Web Coalesce, Web Expand, Web Pack, Web Unpack, Web Sort

Page 93: WHOWEDA : Warehouse of Web Data Sanjay Kumar Madria Department of Computer Science

Partitioning of web tables

• Partitioning web tables
  – restructured easily
  – indexed easily
  – monitored easily
  – reorganized easily

• By
  – time
  – schema tree structure
  – keywords
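
Two of the partitioning criteria named above, time and keywords, can be sketched in Python as follows. This is only an illustration under stated assumptions: each web tuple is reduced to a dictionary with a `url`, a `fetched` date and a set of `keywords`; the function names and the fetch dates are hypothetical and do not come from WHOWEDA. Partitioning by schema tree structure is not covered.

```python
from collections import defaultdict
from datetime import date


def partition_by_month(web_table):
    """Group rows by the (year, month) in which they were fetched."""
    partitions = defaultdict(list)
    for row in web_table:
        partitions[(row["fetched"].year, row["fetched"].month)].append(row)
    return dict(partitions)


def partition_by_keyword(web_table):
    """Group rows by keyword; a row with several keywords lands in several partitions."""
    partitions = defaultdict(list)
    for row in web_table:
        for keyword in row["keywords"]:
            partitions[keyword].append(row)
    return dict(partitions)


# Illustrative rows: the URLs appear in the slides, dates and keywords are assumed.
web_table = [
    {"url": "http://www.cancer.org/desc.html",
     "fetched": date(1999, 1, 5), "keywords": {"cancer"}},
    {"url": "http://www.disease.com/cancer/skin.htm",
     "fetched": date(1999, 1, 20), "keywords": {"cancer", "skin"}},
    {"url": "http://www.jhu.edu/medical/research/cancer.htm",
     "fetched": date(1999, 2, 3), "keywords": {"cancer", "research"}},
]

by_month = partition_by_month(web_table)
by_keyword = partition_by_keyword(web_table)
```

Time partitions are disjoint, whereas keyword partitions may overlap; whether overlapping partitions are acceptable is a design choice the slides leave open.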

Page 94: WHOWEDA : Warehouse of Web Data Sanjay Kumar Madria Department of Computer Science

Warehouse Concept Mart (WCMart)

• Subject oriented

• Concept generation.

• Manually -> Autonomous.

• Used for:
  – Ranking tuples
  – Global web coupling
  – Content based mining

Page 95: WHOWEDA : Warehouse of Web Data Sanjay Kumar Madria Department of Computer Science

Mining in Web Warehouse

• Web Structure Mining

• Web Content Mining

• Web Usage Mining

Page 96: WHOWEDA : Warehouse of Web Data Sanjay Kumar Madria Department of Computer Science

Web Data Refinement

• Improve web schema - schema tightness operator

• Partition web tables based on content and structure

Page 97: WHOWEDA : Warehouse of Web Data Sanjay Kumar Madria Department of Computer Science

Partitioning of web tables

• Partitioning web tables
  – restructured easily
  – indexed easily
  – monitored easily
  – reorganized easily

• By
  – time
  – schema tree structure
  – keywords

Page 98: WHOWEDA : Warehouse of Web Data Sanjay Kumar Madria Department of Computer Science

[Figure: Warehouse Concept Mart (shown twice) and WWW.]

Page 99: WHOWEDA : Warehouse of Web Data Sanjay Kumar Madria Department of Computer Science

[Figure: Web Information Manipulation Operators, with the labels "Lower-level Granularity" and "Higher-level Granularity".]

Page 100: WHOWEDA : Warehouse of Web Data Sanjay Kumar Madria Department of Computer Science

[Figure: architecture diagram with the components Web Information Coupling System, Web Information Mining System, Warehouse Concept Mart, WWW, Web Warehouse, Web Querying & Analysis Component, and User.]

Page 101: WHOWEDA : Warehouse of Web Data Sanjay Kumar Madria Department of Computer Science

What type of information can be summarized?

• Structural

• Content-based
  – time-variant analysis
  – snapshot analysis
  – compare one period with another
  – trend analysis

Page 102: WHOWEDA : Warehouse of Web Data Sanjay Kumar Madria Department of Computer Science

Structural Summarization

• Most volatile documents
  – Sites which change frequently
  – Rate of change over time
  – a pointer to directly access documents which change rapidly

• Most visible nodes, luminous nodes, luminous paths
  – Change with time
  – Decrease or increase - Analyze the reason
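
One possible reading of "rate of change over time" for finding the most volatile documents is sketched below in Python. It assumes the warehouse keeps dated snapshots of each page and measures volatility as the number of content changes per day between consecutive snapshots; `change_rate`, `most_volatile` and the snapshot data are hypothetical, and the slides do not prescribe this particular measure.

```python
from datetime import date


def change_rate(snapshots):
    """Changes per day for one URL; `snapshots` is a date-sorted list of (date, content)."""
    if len(snapshots) < 2:
        return 0.0
    changes = sum(
        1 for (_, old), (_, new) in zip(snapshots, snapshots[1:]) if old != new
    )
    days = (snapshots[-1][0] - snapshots[0][0]).days
    return changes / days if days > 0 else 0.0


def most_volatile(history, top=1):
    """Return the `top` URLs with the highest change rate."""
    ranked = sorted(history, key=lambda url: change_rate(history[url]), reverse=True)
    return ranked[:top]


# Assumed snapshot history: URLs are from the slides, dates and contents are made up.
history = {
    "http://www.cancer.org/desc.html": [
        (date(1999, 1, 1), "v1"),
        (date(1999, 1, 2), "v2"),
        (date(1999, 1, 3), "v3"),
        (date(1999, 1, 5), "v4"),
    ],
    "http://www.disease.com/cancer/skin.htm": [
        (date(1999, 1, 1), "a"),
        (date(1999, 1, 3), "a"),
        (date(1999, 1, 5), "b"),
    ],
}

print(most_volatile(history))
```

The ranked list could serve as the "pointer" to rapidly changing documents mentioned above; a changes-per-day measure only sees changes that happen between snapshots, so its accuracy depends on how often pages are re-fetched.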

Page 103: WHOWEDA : Warehouse of Web Data Sanjay Kumar Madria Department of Computer Science

Content Summarization

• What can be aggregated in a web page?
  – Number of links with identical labels
  – Number of keywords

• Changes in content with time
  – Comparing the changes

• Open question

• XML will improve the ability to analyze web data
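
The two page-level aggregates named above, the number of links with identical labels and the number of keywords, can be sketched with Python's standard `html.parser` module. This is a minimal illustration, assuming a link label is the text inside an `<a>` element and a keyword count is a case-insensitive whole-word match over the page text; `PageSummarizer`, `summarize` and the sample page are hypothetical.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class PageSummarizer(HTMLParser):
    """Collects anchor labels and text fragments from one HTML page."""

    def __init__(self):
        super().__init__()
        self.labels = []     # text of each <a>...</a>
        self.text = []       # all text fragments on the page
        self._anchor = None  # fragments of the anchor currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._anchor = []

    def handle_data(self, data):
        self.text.append(data)
        if self._anchor is not None:
            self._anchor.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._anchor is not None:
            self.labels.append("".join(self._anchor).strip())
            self._anchor = None


def summarize(html, keyword):
    """Return (link-label counts, number of occurrences of `keyword`)."""
    parser = PageSummarizer()
    parser.feed(html)
    parser.close()
    label_counts = Counter(parser.labels)
    pattern = r"\b" + re.escape(keyword) + r"\b"
    hits = re.findall(pattern, " ".join(parser.text), flags=re.IGNORECASE)
    return label_counts, len(hits)


# Made-up sample page for illustration only.
page = (
    '<p>Skin cancer research</p>'
    '<a href="a.htm">Cancer</a> <a href="b.htm">Cancer</a> '
    '<a href="c.htm">Diseases</a>'
)
label_counts, keyword_hits = summarize(page, "cancer")
```

Running the same summary on snapshots of a page taken at different times gives one simple way to compare changes in content over time.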

Page 104: WHOWEDA : Warehouse of Web Data Sanjay Kumar Madria Department of Computer Science

Summary

• Current status:
  – Mechanism for accessing and manipulating web information in WHOWEDA
  – Implementing various web operators and query language

• Future research
  – What types of information can be summarized?
  – What types of knowledge can be mined?
  – Refine web warehouse architecture

• www.cais.ntu.edu.sg:8000/~whoweda