dawak-mining-slides.ppt

Dawak'99 copy-right@sanjaymadria

Research Issues in Web Data Mining

Sanjay Kumar Madria
Department of Computer Science

Purdue University, West Lafayette, [email protected]

Sourav Bhowmick, Ng Wee Keong and Lim Ee Peng

Nanyang Technological University, Singapore


WHOWEDA! www.cais.ntu.edu.sg:8000/~whoweda

• A WareHouse Of WEb DAta
• Web Information Coupling Model (WICM)
  – Web Objects
  – Web Schema
• Web Information Coupling Algebra
• Web Information Maintenance
• Web Mining and Knowledge discovery


Web Objects

• Node - url, title, format, size, date, text

• Link - source-url, target-url, label, link-type

• Web tuple
• Web table
• Web schema
• Web database

[Figure: WHOWEDA architecture. Components shown: Web Information Coupling System, Web Information Maintenance System, Web Information Mining System, Warehouse Concept Mart, WWW, Web Warehouse, Web Marts, Web Querying & Analysis Component (Web Query & Display), User, Global Web Manipulation, Local Web Manipulation.
Operators shown: Pre-processing, Global Web Coupling, Local Web Coupling, Global Ranking, Local Ranking, Web Select, Web Project, Web Join, Web Union, Web Intersection, Schema Search, Schema Match, Schema Tightness, Data Visualization.]


Web Schema

• Structural ‘summary’ of web table
• Information Coupling using a Query graph
• Query graph -> Web schema
• Directed graph as ordered 4-tuple:
  – Set of node variables
  – Set of link variables
  – Connectivities
  – Predicates



[Figure: Brief organization of Information Square's web site. Pages shown: Information Square's homepage; Headline article 1 to Headline article n; News@TCS; News specials; Airport info; (List of video files); List of links to local news; List of links to world news; Local news 1 to Local news k; World news 1 to World news t.]


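[Figure: Query graph over the site, with node variables x, y, z, w and link variables e, f, g, h. Predicates shown: y: url CONTAINS "headlines"; e: target_url CONTAINS "article"; f: target_URL CONTAINS "newshub/specials"; g: label CONTAINS "Local News"; z: url CONTAINS "local"; h: label CONTAINS "World News"; w: url CONTAINS "world".]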


[Figure: A sample web tuple. Pages shown: Information Square's homepage; Headline article 1; News specials; List of links to local news; List of links to world news; Local news 1; World news 1.]


Schema - example

• Node variables: Xn = { x, y, z, w }
• Link variables: Xl = { e, f, g, h }
• Connectivities: C = { x<e>y and x<fg>z and x<fh>w }
• Predicates: P = {
  x.url = "http://www.mediacity.com.sg/i-square",
  y.url CONTAINS "headlines",
  e.target_url CONTAINS "article",
  f.target_url CONTAINS "newshub/specials",
  g.label CONTAINS "Local News",
  z.url CONTAINS "local",
  h.label CONTAINS "World News",
  w.url CONTAINS "world" }
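The schema example can be written down directly as the ordered 4-tuple of node variables, link variables, connectivities and predicates. The sketch below is illustrative only: the class `WebSchema`, the helpers `contains`, `equals`, `is_well_formed` and `satisfies`, and the page URLs in `binding` are assumptions made for this example, not part of WHOWEDA. The link variable h is included in the link set because the connectivities and predicates use it.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set, Tuple


@dataclass
class WebSchema:
    node_vars: Set[str]
    link_vars: Set[str]
    # Each connectivity: (source node, sequence of link variables, target node)
    connectivities: List[Tuple[str, Tuple[str, ...], str]]
    # Each predicate: (variable, attribute, test applied to the attribute value)
    predicates: List[Tuple[str, str, Callable[[str], bool]]]


def contains(text: str) -> Callable[[str], bool]:
    """Stand-in for the CONTAINS operator on the slides."""
    return lambda value: text in value


def equals(text: str) -> Callable[[str], bool]:
    return lambda value: value == text


schema = WebSchema(
    node_vars={"x", "y", "z", "w"},
    link_vars={"e", "f", "g", "h"},
    connectivities=[("x", ("e",), "y"), ("x", ("f", "g"), "z"), ("x", ("f", "h"), "w")],
    predicates=[
        ("x", "url", equals("http://www.mediacity.com.sg/i-square")),
        ("y", "url", contains("headlines")),
        ("e", "target_url", contains("article")),
        ("f", "target_url", contains("newshub/specials")),
        ("g", "label", contains("Local News")),
        ("z", "url", contains("local")),
        ("h", "label", contains("World News")),
        ("w", "url", contains("world")),
    ],
)


def is_well_formed(s: WebSchema) -> bool:
    """Every variable used in a connectivity or predicate must be declared."""
    declared = s.node_vars | s.link_vars
    used = {v for src, links, dst in s.connectivities for v in (src, *links, dst)}
    used |= {var for var, _, _ in s.predicates}
    return used <= declared


def satisfies(s: WebSchema, binding: Dict[str, Dict[str, str]]) -> bool:
    """Check one candidate web tuple (variable -> attributes) against all predicates."""
    return all(test(binding[var][attr]) for var, attr, test in s.predicates)


# Hypothetical instance: every URL below the homepage is invented for illustration.
base = "http://www.mediacity.com.sg/i-square"
binding = {
    "x": {"url": base},
    "y": {"url": base + "/headlines/article1.html"},
    "e": {"target_url": base + "/headlines/article1.html"},
    "f": {"target_url": base + "/newshub/specials/index.html"},
    "g": {"label": "Local News"},
    "z": {"url": base + "/local/news1.html"},
    "h": {"label": "World News"},
    "w": {"url": base + "/world/news1.html"},
}
print(is_well_formed(schema), satisfies(schema, binding))
```

Keeping predicates as (variable, attribute, test) triples makes the schema a plain data structure that can be stored alongside the web table it summarizes.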


Query Graph - Example

• Query graph - same as schema
• Informally, it is a directed, connected graph consisting of nodes and links, with keywords imposed on them.

• Produce a list of diseases with their symptoms, evaluation procedures and treatment starting from the web site at http://www.panacea.org/

• Web table Diseases

[Figure: Query graph for the web table Diseases, starting at http://www.panacea.org/. Variables shown: x, y, z, w, p, q, e, f, g. Labels shown: List of Diseases, Issues, Symptoms list, Symptoms, Treatment list, Treatment, Evaluation.]


[Figure: A web tuple of the web table Diseases, starting at http://www.panacea.org/. Instances shown: x0, y1, z1, w1, p2, q1, e1, f1, g1. Labels shown: AIDS, Issues, Symptoms list, Symptoms, Treatment list, Treatment, Evaluation, Elisa Test.]


Example 2

• Produce a list of drugs, and their uses and side effects starting from the web site at http://www.panacea.org/

• Web table Drugs


[Figure: Query graph for the web table Drugs, starting at http://www.panacea.org/. Variables shown: a, b, c, d, r, s, k. Labels shown: Drug list, Issues, Uses, Use, Side effects, List of Diseases.]

[Figure: A web tuple of the web table Drugs, starting at http://www.panacea.org/. Instances shown: a0, b1, c1, d1, r1, s1, k1. Labels shown: Drug list, Issues, AIDS, Indavir, Uses of Indavir, Use, Side effects, Side effects of Indavir.]


WWW Data Mining

• Web structure mining: mining the structures and links of web documents.
• Web content mining: the automatic search of information resources available on-line.
• Web usage mining: mining data from server access logs, user registrations or profiles, user sessions or transactions, etc.


Web Structure Mining: Issues

• Measuring the frequency of local links (links within the same server) in the web tuples of a web table.
  – Such web tuples carry more information about inter-related documents that exist at the same server.
  – Measures the completeness of the web site, in the sense that most of the closely related information is available at the same site (server).
  – For example, an airline’s home page will have more local links connecting the “routing information with air-fares and schedules”.


• Measuring the frequency of web tuples in a web table containing interior links, that is, links within the same file.
  – Measures a web document’s ability to cross-reference other related web pages within the same document.
  – Measures the flow of the web documents.


• Measuring the frequency of web tuples in a web table containing global links, that is, links which span different web sites.
  – Measures the visibility of the web documents and their ability to relate similar or related documents across different sites.
  – For example, research documents related to “semi-structured data” will be available at many sites, and such sites should be visible to other related sites by providing cross references under popular phrases such as “more related links”.
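The three link-frequency measures above (local, interior and global links) can be sketched as follows. This is a minimal illustration that assumes a web tuple is represented as a list of (source URL, target URL) pairs; the function names `link_type` and `tuple_frequency` and all example.com and example.org URLs are hypothetical.

```python
from urllib.parse import urldefrag, urlsplit


def link_type(source_url: str, target_url: str) -> str:
    """Classify a link as interior (same file), local (same server) or global."""
    source_doc = urldefrag(source_url).url  # drop any #fragment
    target_doc = urldefrag(target_url).url
    if source_doc == target_doc:
        return "interior"
    if urlsplit(source_doc).netloc.lower() == urlsplit(target_doc).netloc.lower():
        return "local"
    return "global"


def tuple_frequency(table, kind: str) -> float:
    """Fraction of web tuples in the table that contain at least one link of this kind."""
    if not table:
        return 0.0
    hits = sum(
        1 for web_tuple in table
        if any(link_type(src, dst) == kind for src, dst in web_tuple)
    )
    return hits / len(table)


# Hypothetical airline site used only for illustration.
ROUTES = "http://air.example.com/routes.html"
FARES = "http://air.example.com/fares.html"
SCHEDULES = "http://air.example.com/schedules.html"
OTHER = "http://other.example.org/info.html"

web_table = [
    [(ROUTES, ROUTES + "#asia"), (ROUTES, FARES)],  # interior and local links
    [(ROUTES, FARES), (FARES, SCHEDULES)],          # local links only
    [(ROUTES, OTHER)],                              # global link only
    [(ROUTES, FARES), (FARES, OTHER)],              # local and global links
]

for kind in ("local", "interior", "global"):
    print(kind, tuple_frequency(web_table, kind))
```

Comparing host names is only one possible reading of "same server"; mirrored hosts or several sites on one machine would need a different test.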


• Measuring the frequency of identical web tuples that appear in a web table or among the web tables.
  – Measures the replication of web documents and may help in identifying mirrored sites.
• What are the in-degree and out-degree of each node (web document)? What do high and low in- and out-degrees mean?
• Locating links to popular web sites in the web tuples of a table.
• Comparing the number of web tuples returned in response to a query on a popular phrase such as “Bio-science” with the number returned for queries containing keywords like “earth-science”.
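The replication measure mentioned above, the frequency of identical web tuples within a table or across tables, reduces to counting repeated tuples. The sketch below assumes each web tuple is reduced to a tuple of its document URLs; the function name `duplicate_tuples` and the example.com URLs are hypothetical.

```python
from collections import Counter


def duplicate_tuples(*tables):
    """Return the web tuples that occur more than once across the given tables."""
    counts = Counter(web_tuple for table in tables for web_tuple in table)
    return {web_tuple: n for web_tuple, n in counts.items() if n > 1}


# Hypothetical web tables; each web tuple is a tuple of document URLs.
table_a = [
    ("http://a.example.com/", "http://a.example.com/p1"),
    ("http://a.example.com/", "http://a.example.com/p2"),
    ("http://a.example.com/", "http://a.example.com/p1"),
]
table_b = [
    ("http://a.example.com/", "http://a.example.com/p2"),
    ("http://b.example.com/", "http://b.example.com/q1"),
]

print(duplicate_tuples(table_a))           # repeats inside one table
print(duplicate_tuples(table_a, table_b))  # repeats across both tables
```

Exact URL equality will not detect mirrored sites on its own, since mirrors use different host names; comparing page content or URL paths would be needed for that.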


• Discovering the nature of the hyperlinks in the web sites of a particular domain.
  – What information do they provide, and how are they related conceptually?
  – Is it possible to extract a conceptual hierarchy for designing web sites of a particular domain?
• Generalizing the flow of information in web sites representing a particular domain.


Web Bags and Web Structure Mining

• Most search engines fail to handle the following knowledge discovery goals:
  – Locate the most visible web sites or documents for reference, that is, sites or documents that can be reached by many paths (high fan-in).
  – Locate the most luminous web sites or documents for reference, that is, web sites or documents with the largest number of outgoing links.
  – Find the most traversed path for a particular query result, to identify the set of most popular interlinked web documents that have been traversed frequently to obtain the query result.
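The three goals above can be illustrated on a small web bag. The sketch assumes a web bag is a list of paths (tuples of document identifiers) in which duplicates are kept, takes fan-in as the number of distinct documents linking to a node and fan-out as the number of distinct documents it links to, and treats the most traversed path as the most frequent path in the bag. These definitions, the function names and the node names are assumptions for illustration, not the formal definitions used in WHOWEDA.

```python
from collections import Counter, defaultdict


def fan_in_out(bag):
    """Return (visibility, luminosity): distinct incoming and outgoing neighbours per node."""
    incoming = defaultdict(set)
    outgoing = defaultdict(set)
    for path in bag:
        for src, dst in zip(path, path[1:]):  # consecutive documents form a link
            incoming[dst].add(src)
            outgoing[src].add(dst)
    visibility = {node: len(srcs) for node, srcs in incoming.items()}
    luminosity = {node: len(dsts) for node, dsts in outgoing.items()}
    return visibility, luminosity


def most_traversed(bag):
    """Return the most frequent path in the bag together with its count."""
    return Counter(bag).most_common(1)[0]


# Hypothetical web bag: the path a -> b -> d occurs twice.
bag = [("a", "b", "d"), ("a", "c", "d"), ("a", "b", "d"), ("e", "d")]

visibility, luminosity = fan_in_out(bag)
print(max(visibility, key=visibility.get))  # most visible document
print(max(luminosity, key=luminosity.get))  # most luminous document
print(most_traversed(bag))
```

Keeping duplicates matters only for the path count: a set of tuples would lose the information that one path was returned more often than the others.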


Applications of Visibility

• Association rules
• E-commerce


Consider a query graph involving keywords such as "types of restaurants" and "items", and the results returned.

Query graph labels: www.test.com, items, a, restaurants; node variables x and z.

Web tuples returned:

  Start page      Item     Restaurant type         Restaurant
  www.test.com    Pizza    Italian restaurants     Milano-R
  www.test.com    Pizza    European restaurants    Paris-R
  www.test.com    Pasta    Italian restaurants     Milano-R

Node instances shown for x and z: X1 Z1, X1 Z1, X1 Z2


• From the results returned, find the most visible pages. Assume Z1 is the most visible page for the given threshold.
• This gives an estimate of the different restaurants selling pizza.
• A lower threshold gives the set (Z1, Z2) as visible pages, which together sell both pizza and pasta.
• Generalize to rules such as: of the 66% of restaurants that offer pizza to their customers, 33% also offer pasta.
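The effect of the threshold can be reproduced from the z instances of the three returned tuples (Z1, Z1, Z2). The sketch assumes the threshold is a minimum number of occurrences, which the slides do not state explicitly; the function name `visible_pages` is hypothetical.

```python
from collections import Counter

# z instances of the three web tuples, as shown on the slide.
z_column = ["Z1", "Z1", "Z2"]


def visible_pages(targets, threshold):
    """Pages reached at least `threshold` times (assumed meaning of the threshold)."""
    counts = Counter(targets)
    return {page for page, n in counts.items() if n >= threshold}


print(visible_pages(z_column, 2))  # only Z1 passes the higher threshold
print(visible_pages(z_column, 1))  # lowering the threshold admits Z2 as well
```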


E-commerce application

• My web site’s visibility is going down!!!!


Application - Luminosity

• Association rules such as: of the X% of all companies which make a product “A”, Y% also make the set of products “B and C”.
• Example - certain companies (33%), if they make product A, also make products B and C.
• Company C makes only product A.
• That is, of the 66% of companies which make product “A”, 33% also make products B and C.


Consider the following web tuples in a web table.

  Source URL             Target URL                      Labels shown
  www.eleccompany.com    www.elecproduct.org/productA    company A, product A, Product A
  www.eleccompany.com    www.elecproduct.org/productB    company A, product B, Product B
  www.eleccompany.com    www.elecproduct.org/productC    company A, product C, Product C
  www.eleccompany.com    www.elecproduct.org/productB    company B, product B, Product B
  www.eleccompany.com    www.elecproduct.org/productA    company C, product A, Product A

Node instances shown: X1 Z1, X1 Z2, X1 Z3, X1 Z2, X1 Z2
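The company and product labels in these web tuples are enough to compute the figures behind a rule of the form "companies that make A also make B and C". The sketch below is one possible reading: it reports the fraction of all companies that make A, the fraction of all companies that make A, B and C, and the conventional confidence of the rule. The function names are hypothetical. Under this reading the 66% and 33% on the slides correspond to the first two fractions, both taken over all three companies.

```python
from collections import defaultdict

# (company, product) pairs read off the five web tuples on the slide.
rows = [
    ("company A", "product A"),
    ("company A", "product B"),
    ("company A", "product C"),
    ("company B", "product B"),
    ("company C", "product A"),
]


def products_by_company(pairs):
    made = defaultdict(set)
    for company, product in pairs:
        made[company].add(product)
    return made


def rule_stats(made, antecedent, consequent):
    """Return (support of antecedent, support of both, confidence) over all companies."""
    total = len(made)
    with_antecedent = [c for c, products in made.items() if antecedent <= products]
    with_both = [c for c in with_antecedent if consequent <= made[c]]
    support_antecedent = len(with_antecedent) / total
    support_both = len(with_both) / total
    confidence = len(with_both) / len(with_antecedent) if with_antecedent else 0.0
    return support_antecedent, support_both, confidence


made = products_by_company(rows)
print(rule_stats(made, {"product A"}, {"product B", "product C"}))
```

Two of the three companies make product A and one of the three makes A, B and C, so the confidence in the usual association-rule sense is one half.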


Web Content Mining

• What does it mean to mine content from the web?
• Is extracting information from a very small subset of all HTML web pages also an instance of web data mining?
• Mining a subset of web pages stored in one or more web tables is a more feasible option.
• Similarities and differences between web content mining in a web warehouse context and conventional data mining.


• Selection of the type of data in the WWW on which to do web content mining.
• Cleaning of the selected data to mine effectively.
• Types of knowledge that can be discovered in a web warehouse context.
• Discovery of the types of information hidden in a web warehouse which are useful for decision making.
• Specifying, measuring and justifying the interestingness of the discovered knowledge.
• The kinds of knowledge to be discovered are: generalized relation, characteristic rule, discriminant rule, classification rule, association rule, and deviation rule.


• Are data mining techniques applicable to web mining and, if yes, how? For example, we are interested in generating rules of the following type: 40% of web tuples (i.e., web pages) returned in response to a “travel information query from Hong Kong to Macau” suggest that the popular means of traveling is by ferry.
• Deriving additional knowledge in a web warehouse for web content mining.
• Mining previously unknown knowledge in a web warehouse.
• Presentation of discovered knowledge to the users to expedite complex decision making.


Web Usage Mining

• Discovery of user access patterns from web servers: user profiles, access patterns for pages, etc.
• Used for efficient and effective web site management and for understanding user behavior.
• In WHOWEDA, the user initiates a coupling framework to collect related information.
• For example, coupling a query graph “to find the hotel information” with the query graph “to find the places of interest”.
• From such couplings, we can generate user access patterns of the coupling framework, such as “50% of users who query “hotel” also couple their query with “places of interest””.
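A pattern such as "50% of users who query hotel also couple their query with places of interest" is a simple conditional frequency over coupling sessions. The sketch assumes each session is recorded as the set of concepts a user coupled; the session data and the function name `coupling_confidence` are hypothetical.

```python
def coupling_confidence(sessions, first, second):
    """Fraction of sessions containing `first` that also contain `second`."""
    with_first = [s for s in sessions if first in s]
    if not with_first:
        return 0.0
    return sum(1 for s in with_first if second in s) / len(with_first)


# Hypothetical coupling log: one set of coupled query concepts per user.
sessions = [
    {"hotel", "places of interest"},
    {"hotel"},
    {"hotel", "places of interest"},
    {"hotel", "car rental"},
    {"flights"},
]

print(coupling_confidence(sessions, "hotel", "places of interest"))
```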


• Find coupled concepts from the coupling framework.
• Helps in organizing web sites.
• For example, web documents that provide information on “hotels” should also have hyperlinks to web pages providing information on “places of interest”.


Warehouse Concept Mart

• Knowledge discovery in web data becomes more and more complex due to the large amount of data on the WWW.
• Build concept hierarchies involving web data to use them in knowledge discovery.
• A collection of concept hierarchies is called a Warehouse Concept Mart (WCM).
• A concept mart is built by extracting and generalizing terms from web documents to represent the classification knowledge of a given class hierarchy.


• Unclassified words can be clustered based on their common properties. Once the clusters are decided, the keywords can be labeled with their corresponding clusters, and the common features of the terms are summarized to form the concept description.
• Associate a weight with each level of the concept mart to evaluate the importance of a term with respect to the concept level in the concept hierarchy.


Web Concept Mart Applications

• Intelligent answering of web queries
  – Supply a threshold for a given keyword in the warehouse concept mart; words whose weight exceeds the given threshold can be taken into consideration when answering the query.
  – Use different levels of concepts in the warehouse concept mart, or provide approximate answers.
  – Provide the user with some knowledge for framing the global coupling query graph.

• Example - DBMS and Oracle
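Threshold-based query answering with a concept mart can be sketched as a lookup over weighted terms. Everything in the example is an assumption made for illustration: the dictionary layout, the function name `expand_query`, the weights, and the terms other than DBMS and Oracle.

```python
# Hypothetical concept mart: concept -> {related term: weight}. All weights are invented.
concept_mart = {
    "DBMS": {"Oracle": 0.9, "Sybase": 0.7, "spreadsheet": 0.2},
}


def expand_query(mart, keyword, threshold):
    """Terms under `keyword` whose weight exceeds the user-supplied threshold."""
    related = mart.get(keyword, {})
    return sorted(term for term, weight in related.items() if weight > threshold)


print(expand_query(concept_mart, "DBMS", 0.5))  # terms considered when answering a "DBMS" query
print(expand_query(concept_mart, "DBMS", 0.8))  # a stricter threshold keeps fewer terms
```

Raising or lowering the threshold is one way to move between precise and approximate answers for the same keyword.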

Web mining and Concept Mart

• Association rule mining techniques can mine the association between words appearing in the concept mart at various levels and in the web tuples returned as the result of a query.
• Mining knowledge at multiple levels may help WWW users to find interesting rules that are otherwise difficult to discover.
• A knowledge discovery process may climb up and step down to different concept levels in the warehouse concept mart, guided by the user’s interactions and instructions, including different threshold values.



• Capture the flow of web sites of a particular domain; helpful in locating information.


Conclusions

• Presented web mining issues in the context of the web warehousing project called WHOWEDA (Warehouse of Web Data).
• Discussed web mining issues with respect to web structure, web content and web usage.
• Our focus is to design tools and techniques for web mining to generate useful knowledge from WWW data.
• We are working on formal algorithms to generate association rules and classification rules.
