Genre and Task for Web Page Filtering Michael Shepherd Web Information Filtering Lab Faculty of Computer Science Dalhousie University.

Genre and Task for Web Page Filtering

Michael Shepherd

Web Information Filtering Lab

Faculty of Computer Science

Dalhousie University

Research Team

• Students
– Lei Dong
– Alistair Kennedy
– Richong Zhang

• Faculty
– Carolyn Watters
– Jack Duffy

Overview

• Introduction

• Genre

• Task

• Summary

Introduction

• The focus of our current research is the investigation of filtering techniques for the Web

• This includes context-aware retrieval, where context includes:
– Adaptive user modeling
– The user’s “task”
  • Information need
  • What it is the user is trying to do

• We are moving to incorporate the notions of genre and task and to evaluate the impact that these have on filtering

[Diagram labels: Filtering; Genre; Task; User Profiles]

Motivation for Research

• The Web has billions of documents

• Average query is 2-3 words

One document will satisfy our information need!

But it’s more than just search

• “Browsing or surfing the Web represents the main model for web use, especially among younger users.” (Hunter)

• Three general types [Marchionini]
– Directed browsing: explicit information need
– Semi-directed browsing: less well-defined need
– Undirected browsing: there is no real goal and the user is “surfing”

Browsing Continuum

[Diagram: a continuum running from Surfing to Searching]

Motivated Behaviour

• Intrinsically Motivated Behaviour
– “… is that which appears to be spontaneously initiated by the person in pursuit of no other goal than the activity itself.” [Enzle, Wright, Redondo]
– “… engaging in a task for its enjoyment value…” [Deci, Ryan]

• Extrinsically Motivated Behaviour
– “… motivation is to engage in an activity as a means to an end … participation will result in desirable outcomes such as reward …” [Pintrich, Schunk]

Task and Information Need Continuum

[Diagram: a continuum running from general information gathering (“I’m shopping for a computer”) to an explicit information need (“I want the price on the Dell Inspiron Notebook computer”)]

So, one document may not satisfy the information need

Search Engine Results

[Chart: Number of Relevant Hits (y-axis, 0 to 9) by Ranked Screens, 10 hits per screen (x-axis, screens 1 to 11)]

Optimal Results, After Filtering

[Chart: Number of Relevant Hits (y-axis, 0 to 12) by Ranked Screens, 10 hits per screen (x-axis, screens 1 to 10)]

Why look at Genre and Task?

Filtering Based on Adaptive User Profiles and IR-type of Task

• Intrinsic Motivation
– Fine-grained filtering of the Web is not feasible when the browsing task is “undirected”

• Extrinsic Motivation
– Fine-grained filtering of the Web is feasible when there is an explicit information need

Genre

• A genre is a “classifying statement”

• It allows us to recognize items that are similar even in the midst of great diversity
– Newspapers
– Mystery novels
– Office memos

• Socially recognized communicative purpose

• Generally characterized by the tuple: <content, form>

Cybergenre

• Genre on the web

• Characterized by the tuple

<content, form, functionality>

• Where functionality is the functionality afforded by the new medium, i.e., the web

[Diagram: taxonomy of cybergenres]

cybergenre
– extant
  • replicated (example: electronic newspaper)
  • variant (example: multimedia newspaper)
– novel
  • emergent (example: personalized newspaper)
  • spontaneous (example: FAQ)

Recognizing Genres of Web Pages

• The number of cybergenres is increasing, with different estimates putting the number at well over 1000 (depends on granularity)

• It is difficult to know the boundaries of a genre and to know when one has crossed from one genre into another genre

• It is difficult to know when a web page represents the emergence of a new genre

Research Problems

• How can we identify automatically the genre of a web page?

• What features should be used in describing web pages?

• How can we make this adaptive to recognize:
– New genres when they emerge?
– Genre classes that are fuzzy and genres that slide from one class to another?

Research Questions

• Can we identify home pages?

• Can we distinguish among the sub-genres:
– personal, corporate and organization home pages?

• What influence does the functionality attribute have in distinguishing these genres and sub-genres?

Machine Learning Model and Dataset

• The dataset consisted of 321 web pages
– 17 were classified manually as belonging to two of the three home page sub-genres
– 94 corporate home pages
– 93 personal home pages
– 74 organization home pages
– 77 noise pages

• Neural Net Model
– Single classifier with three target output classes
– Three different classifiers, one for each of three target output classes

Features

Content
– Number of meta tags used
– Does the page contain any phone numbers?
– List of most common words appearing in between 16% and 40% of all documents

Form
– Number of images
– Does the page have its own domain, or is it in a sub-directory within a domain?
– Size of file in bytes
– Number of words in the page

Functionality
– Number of links in the web page
– Number of e-mail links
– Proportion of links that are navigational links to other web pages within the same site
– Proportion of links that are links to locations within the same page
– Proportion of links that are links to other pages on other sites
– Number of form inputs
– Is the first tag a script tag?
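The functionality features listed above could be extracted roughly as follows. This is a minimal sketch using only the Python standard library, not the extraction code used in the study: the class and function names are hypothetical, and counting e-mail links in the denominator of the link proportions is an assumption.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkFeatureParser(HTMLParser):
    """Collects link and form statistics from one HTML page (hypothetical helper)."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.page_host = urlparse(page_url).netloc
        self.counts = {"links": 0, "email": 0, "same_page": 0,
                       "same_site": 0, "other_site": 0, "form_inputs": 0}
        self.first_tag = None

    def handle_starttag(self, tag, attrs):
        if self.first_tag is None:
            self.first_tag = tag          # remember the first tag seen
        if tag == "input":
            self.counts["form_inputs"] += 1
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        self.counts["links"] += 1
        if href.lower().startswith("mailto:"):
            self.counts["email"] += 1
        elif href.startswith("#"):
            self.counts["same_page"] += 1
        elif urlparse(urljoin(self.page_url, href)).netloc == self.page_host:
            self.counts["same_site"] += 1
        else:
            self.counts["other_site"] += 1


def functionality_features(html, page_url):
    parser = LinkFeatureParser(page_url)
    parser.feed(html)
    c = parser.counts
    total = c["links"] or 1               # avoid division by zero on link-free pages
    return {
        "num_links": c["links"],
        "num_email_links": c["email"],
        "prop_same_site": c["same_site"] / total,
        "prop_same_page": c["same_page"] / total,
        "prop_other_site": c["other_site"] / total,
        "num_form_inputs": c["form_inputs"],
        "first_tag_is_script": parser.first_tag == "script",
    }


# Made-up page used only to exercise the sketch.
sample = """<html><body>
<a href="#top">Top</a>
<a href="/about.html">About</a>
<a href="http://other.example.org/">Elsewhere</a>
<a href="mailto:me@example.com">Mail</a>
<form><input type="text"><input type="submit"></form>
</body></html>"""
features = functionality_features(sample, "http://www.example.com/index.html")
```

The resulting dictionary would form part of the input feature vector, alongside the content and form features.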

Terms Selected as Features

Class                    Terms
Personal Home Page       my, me, i, t
Corporate Home Page      we, services, service, available, fax, our, us, com, contact, copyright, free, amp
Organization Home Page   events, community, organization, 2004, help, its, members, news, information

Neural Net Categorization

[Diagram labels: Data Set of Web Pages of Known Genre Type; Input Feature Vector; Neural Net; Target Categories, including Personal Home Page and Corporate Home Page]

Evaluation

• Recall
– The proportion of web pages of genre type Gi that are correctly categorized into category Ci

• Precision
– The proportion of web pages categorized into category Ci that are of genre type Gi

\[ F\text{-measure} = \frac{2 \times \text{recall} \times \text{precision}}{\text{recall} + \text{precision}} \]

F-measure(Gi) = the quality of the classifier with respect to web pages of genre type Gi
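The three measures can be computed per genre from lists of true and predicted labels. The following is a small sketch; the function name and the example labels are made up for illustration and are not taken from the study.

```python
def precision_recall_f(true_labels, predicted_labels, genre):
    """Return (precision, recall, F-measure) for one genre type."""
    pairs = list(zip(true_labels, predicted_labels))
    correct = sum(1 for t, p in pairs if t == genre and p == genre)
    categorized = sum(1 for _, p in pairs if p == genre)   # pages placed in the category
    actual = sum(1 for t, _ in pairs if t == genre)        # pages truly of this genre
    precision = correct / categorized if categorized else 0.0
    recall = correct / actual if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * recall * precision / (recall + precision)
    return precision, recall, f_measure


# Hypothetical labels for six pages.
true_labels = ["personal", "personal", "corporate", "corporate", "noise", "personal"]
predicted = ["personal", "corporate", "corporate", "corporate", "personal", "personal"]
p, r, f = precision_recall_f(true_labels, predicted, "personal")
```

Here two of the three pages categorized as personal are truly personal, and two of the three truly personal pages are found, so precision, recall and F-measure all come out at two thirds.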

10-Fold Cross Validation

• Used when data set is small in order to obtain statistically valid results

[Diagram: the data set is divided into ten partitions of 10% each; each partition in turn serves as the test set (Test Set 1, Test Set 2, Test Set 3, …) while the remaining 90% is the training set]
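A sketch of how the ten train/test splits might be generated is shown below. The function name, the fixed random seed and the use of integer page ids are assumptions made for illustration.

```python
import random


def k_fold_splits(items, k=10, seed=0):
    """Yield (training set, test set) pairs; each item is in exactly one test set."""
    items = list(items)
    random.Random(seed).shuffle(items)          # shuffle once, reproducibly
    folds = [items[i::k] for i in range(k)]     # k roughly equal partitions
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


pages = list(range(321))                        # stand-in ids for 321 web pages
splits = list(k_fold_splits(pages))
```

A classifier would be trained and scored once per split, and the ten scores averaged.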


F-measures using separate classifiers with noise pages

                         <content, form, functionality>   <content, form>   Significant Difference
Personal Home Page       .711                             .702              -
Corporate Home Page      .666                             .637              .005
Organization Home Page   .553                             .555              -

F-measures using single classifier with noise pages

                         <content, form, functionality>   <content, form>   Significant Difference
Personal Home Page       .712                             .698              .05
Corporate Home Page      .650                             .644              -
Organization Home Page   .537                             .536              -

Misclassification tables: Single Classifier

               P      C      O      Non-home
Personal       62.2   3.1    8.2    22.2
Corporate      3.7    56.5   14.8   25.4
Organization   4.8    12.2   36.5   25.9
Noise Pages    11.1   7.4    6.7    52.9

Genre Summary

• We can distinguish home pages from noise pages

• We can distinguish personal home pages from corporate and organization home pages, but distinguishing between corporate and organization home pages is difficult

• The feature set needs much more attention

Open Questions

• What is an appropriate feature set?

• Full evaluation of functionality attribute

• What ML model to use?
– Accuracy and scalability

• Adaptive
– Track recognized genres as they evolve
– Recognize the introduction of a novel genre not seen previously
– Is this like topic detection and tracking?

Genre and Task on the Web?

Columns: Group, Genre, Task, Recognition

• Topics
– Genre: Home page, location, special topics
– Task: Cultural, shopping, news, health
– Recognition: url only host name, short, lots of graphics

• Publications
– Genre: Articles, publications, news
– Task: Scholarly research, news, financial
– Recognition: Hierarchical structure, longer, few graphics

• Products
– Genre: Product info, reviews, order forms
– Task: Shopping, news, computing
– Recognition: Short, prices, phone numbers

• Educational
– Genre: Glossary, course list, instructional material
– Task: Educational pursuits
– Recognition: edu domain, education lexicon

• FAQ
– Genre: FAQ
– Task: Health, self-help
– Recognition: Metadata and headings, structure

Roussinov, et al., Genre Based Navigation on the Web, HICSS’34

Yahoo Directory

Yahoo Directory

• Yahoo categories are created and maintained manually
– Creator of a web site submits a description
– Editors review these

• Can we automatically classify a web page by task?

Experiment

• Creation of data set

• Data cleaning

• 10-fold cross validation
– Feature selection (IG)
– Principal component analysis
– Build Decision Tree
– Testing

Creation of Data Set

• Selected 120 web pages randomly from Yahoo directories in each of:
– Shopping
– Health
– Education

• Selected 70 web pages (NSHE) that are not shopping, health or education

• Total of 430 Web pages

• Validated by 3 raters

Data Cleaning

• XML, HTML tags – <href>, <img>, <p>

• Pictures, Audio files, Video files

• Scripts – <javascript>

• Stop words

• Porter’s stemming algorithm
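The cleaning steps above might look roughly like this with the Python standard library. This is a sketch under assumptions: the stop-word list is a tiny illustrative sample, the names are hypothetical, and Porter stemming is left out because the standard library has no stemmer (a stemmer would be applied to the returned tokens as a final step).

```python
import re
from html.parser import HTMLParser

# Tiny illustrative stop-word list; a real list would be much longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "on"}


class TextExtractor(HTMLParser):
    """Keeps visible text, dropping tags and the contents of script elements."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._in_script = 0

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script += 1

    def handle_endtag(self, tag):
        if tag == "script" and self._in_script:
            self._in_script -= 1

    def handle_data(self, data):
        if not self._in_script:
            self.parts.append(data)


def clean_page(html):
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    text = " ".join(extractor.parts).lower()
    tokens = re.findall(r"[a-z]+", text)          # keep alphabetic tokens only
    return [tok for tok in tokens if tok not in STOP_WORDS]


html = ('<html><head><script>var price = 1;</script></head>'
        '<body><p>The price of the <img src="x.png"> notebook</p></body></html>')
tokens = clean_page(html)   # ['price', 'notebook']
```

Images, audio and video are referenced through tags, so dropping the tags also drops them from the text.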

Feature Selection Using the Information Gain (IG)

• Employed as a term goodness criterion

• Based on Information Theory
– The number of “bits of information” gained by knowing the term is present or absent

Information Gain (IG)

• A measure of importance of the feature for predicting the presence of the class.

The information gain of term t is defined to be

\[ G(t) = -\sum_{i=1}^{m} P_r(c_i)\log P_r(c_i) + P_r(t)\sum_{i=1}^{m} P_r(c_i \mid t)\log P_r(c_i \mid t) + P_r(\bar{t})\sum_{i=1}^{m} P_r(c_i \mid \bar{t})\log P_r(c_i \mid \bar{t}) \]

where \(\{c_i\}_{i=1}^{m}\) denotes the set of categories in the target space.
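The definition can be evaluated directly from per-category document counts. The sketch below assumes base-2 logarithms (so the result is in bits), estimates the probabilities from raw counts, and uses made-up category sizes; it is not intended to reproduce the IG values reported in the study.

```python
from math import log2


def plogp(p):
    """p * log2(p), with the usual convention that 0 * log(0) = 0."""
    return p * log2(p) if p > 0 else 0.0


def information_gain(docs_with_term, docs_per_category):
    """docs_with_term[i]: documents in category i containing the term;
    docs_per_category[i]: all documents in category i."""
    n = sum(docs_per_category)
    n_t = sum(docs_with_term)            # documents where the term is present
    n_not_t = n - n_t                    # documents where the term is absent
    gain = -sum(plogp(c / n) for c in docs_per_category)
    if n_t:
        gain += (n_t / n) * sum(plogp(w / n_t) for w in docs_with_term)
    if n_not_t:
        gain += (n_not_t / n) * sum(
            plogp((c - w) / n_not_t)
            for w, c in zip(docs_with_term, docs_per_category))
    return gain


category_sizes = [100, 100, 100]                          # hypothetical sizes
uniform = information_gain([30, 30, 30], category_sizes)  # term spread evenly
skewed = information_gain([90, 5, 5], category_sizes)     # term favours one class
perfect = information_gain([100, 0, 0], category_sizes)   # term identifies one class
```

A term that appears equally often in every category gains nothing, while a term concentrated in one category scores higher, which is why terms tied to a single category rank near the top.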

Information gain (IG)

Number of documents in which term appears in each category

Term        Health   Shopping   Education   IG value
Educ        23       4          93          0.352725188
Diseas      49       1          0           0.200911642
Medic       57       5          2           0.19171664
Health      79       15         19          0.188112451
Teacher     0        1          46          0.185452451
School      7        6          60          0.170352452
Price       5        50         1           0.16546535
Item        2        52         7           0.156980483
Ship        1        43         2           0.149329261
Student     6        3          51          0.148850138
Custom      7        50         4           0.133412067
Accessori   0        32         0           0.130532457
Cancer      32       0          0           0.130532457
Doctor      36       1          1           0.124860273
Public      16       3          51          0.12081971
Shop        10       55         9           0.120056849
Heart       33       2          0           0.116157938
Cart        0        35         4           0.114589777
Medicin     37       2          2           0.113854763
Physician   27       0          0           0.10811821
Risk        26       0          0           0.103738132

Information gain (IG)

[Chart: Information Gain (y-axis, 0 to 0.4) plotted over terms (x-axis ticks from 1 to 9961); annotation: 300 features]

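Document Term Matrix

\[
\begin{array}{c|ccc}
 & t_1 & \cdots & t_n \\ \hline
d_1 & a_{11} & \cdots & a_{1n} \\
\vdots & \vdots & \ddots & \vdots \\
d_m & a_{m1} & \cdots & a_{mn}
\end{array}
\]

Rows: 324 Documents (108 in each of Health, Shopping and Education)

Columns: 300 terms as identified by the Information Gain measure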

Principal Component Analysis

• Identifies patterns in data and expresses the data in such a way as to highlight their similarities and differences

• Once these patterns have been found in the data, we can reduce the number of dimensions without much loss of information

PCA

• Calculate covariance matrix of original data

• Calculate eigenvalues and eigenvectors of covariance matrix

• The eigenvector with the largest eigenvalue identifies the principal component

• The principal component is the eigenvector that expresses the most significant relationship among the data dimensions

Principal Component Eigenvalues

First 3 eigenvectors carry most of the information

Matrix Projection

• After determining which components or eigenvectors to use, project the original document-term matrix into this new space
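The PCA steps and the projection can be sketched in plain Python. Note the substitution: instead of a full eigendecomposition, this sketch uses power iteration, which finds only the eigenvector with the largest eigenvalue (the first principal component). The function names and the small data matrix are made up for illustration.

```python
import math


def covariance_matrix(rows):
    """Return (covariance matrix, mean-centered rows) for a documents x terms matrix."""
    n, dims = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dims)]
    centered = [[r[j] - means[j] for j in range(dims)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(dims)]
           for a in range(dims)]
    return cov, centered


def principal_component(cov, iterations=200):
    """Power iteration: converges to the eigenvector with the largest eigenvalue."""
    size = len(cov)
    v = [1.0] + [0.5] * (size - 1)               # arbitrary non-zero starting vector
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(size)) for i in range(size)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient gives the matching eigenvalue (v has unit length).
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(size))
                     for i in range(size))
    return eigenvalue, v


def project(centered, component):
    """Project each centered row onto one component, giving one score per document."""
    return [sum(x * c for x, c in zip(row, component)) for row in centered]


# Made-up 4 documents x 3 terms matrix.
data = [[2.0, 1.0, 0.0], [4.0, 2.0, 0.0], [6.0, 3.0, 1.0], [8.0, 4.0, 1.0]]
cov, centered = covariance_matrix(data)
eigenvalue, pc1 = principal_component(cov)
scores = project(centered, pc1)
```

Keeping several components, as with the first three eigenvectors mentioned above, would require a full eigendecomposition or repeated deflation, which this sketch does not attempt.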

Decision Tree

• Flow-chart-like tree structure

• Each internal node denotes a test on an attribute

• Each branch represents an outcome of the test

• Leaf nodes represent classes or class distributions.

• Used for classification

Decision Tree

• The tree’s generation process could be seen as the generation of rules.

• First, build a tree from a known training data set.

• Then, use this tree to predict the classes of a new data set. A decision tree makes the rules in the data visible and easy to understand.
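The build-then-predict process can be illustrated with a very small tree learner over term-present/term-absent features. This is a simplified ID3-style sketch, not the decision tree software used in the experiment (which ran on PCA-projected data); the function names and the six training rows are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def build_tree(rows, labels, features):
    """Return a class label (leaf) or a tuple (feature, subtree_if_true, subtree_if_false)."""
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not features:
        return majority

    def split_entropy(feature):
        # Weighted entropy of the two subsets produced by testing this feature.
        score = 0.0
        for value in (True, False):
            subset = [l for r, l in zip(rows, labels) if r[feature] == value]
            if subset:
                score += len(subset) / len(labels) * entropy(subset)
        return score

    best = min(features, key=split_entropy)
    remaining = [f for f in features if f != best]
    branches = []
    for value in (True, False):
        pairs = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        if pairs:
            sub_rows, sub_labels = zip(*pairs)
            branches.append(build_tree(list(sub_rows), list(sub_labels), remaining))
        else:
            branches.append(majority)             # empty branch: fall back to majority
    return (best, branches[0], branches[1])


def classify(tree, row):
    while isinstance(tree, tuple):                # internal node: test one attribute
        feature, if_true, if_false = tree
        tree = if_true if row[feature] else if_false
    return tree                                   # leaf: a class label


# Hypothetical training data: is each stemmed term present on the page?
features = ["price", "diseas", "school"]
rows = [
    {"price": True, "diseas": False, "school": False},
    {"price": True, "diseas": False, "school": False},
    {"price": False, "diseas": True, "school": False},
    {"price": False, "diseas": True, "school": False},
    {"price": False, "diseas": False, "school": True},
    {"price": False, "diseas": False, "school": True},
]
labels = ["Shopping", "Shopping", "Health", "Health", "Education", "Education"]
tree = build_tree(rows, labels, features)
prediction = classify(tree, {"price": False, "diseas": True, "school": False})
```

Each path from the root to a leaf reads as a rule, for example "price absent and diseas present implies Health".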

Decision Tree

Confusion Matrix (rows: Original Categories; columns: Target Categories)

            Health   Shopping   Education   NSHE
Health      10.0     0.8        0.5         0.7
Shopping    0.8      9.9        0.1         1.2
Education   0.9      0.1        9.2         1.8
NSHE        0.6      1.1        1.6         3.7

Precision and Recall

            Health   Shopping   Education   NSHE   Recall
Health      10.0     0.8        0.5         0.7    0.83
Shopping    0.8      9.9        0.1         1.2    0.83
Education   0.9      0.1        9.2         1.8    0.77
NSHE        0.6      1.1        1.6         3.7    0.53
Precision   0.81     0.83       0.81        0.50
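The precision and recall figures follow from the confusion matrix: recall divides each diagonal entry by its row total (original category), and precision divides it by its column total (target category). A short sketch with a hypothetical function name:

```python
categories = ["Health", "Shopping", "Education", "NSHE"]

# Rows: original categories; columns: target categories (values from the slide).
confusion = [
    [10.0, 0.8, 0.5, 0.7],
    [0.8, 9.9, 0.1, 1.2],
    [0.9, 0.1, 9.2, 1.8],
    [0.6, 1.1, 1.6, 3.7],
]


def per_class_metrics(matrix):
    size = len(matrix)
    recall = [matrix[i][i] / sum(matrix[i]) for i in range(size)]
    precision = [matrix[i][i] / sum(row[i] for row in matrix) for i in range(size)]
    return precision, recall


precision, recall = per_class_metrics(confusion)
for name, p, r in zip(categories, precision, recall):
    print(f"{name:10s} precision={p:.3f} recall={r:.3f}")
```

The computed values agree with the reported precision and recall to within rounding.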

Conclusion and Future Work

• As a filter, this approach would identify 80% of pages in Health, Shopping or Education

• Evaluate other classifiers

• System has to be scaled up:
– More tasks, such as entertainment and sports
– Larger data set with more noise

• Add form and functionality features to determine if there are recognizable genres of tasks

How do I see these filters working?

[Diagram: Query → Search Engine → Search Results → Filter By Task (input: Task) → Filter By Genre (input: Genre) → Filtered Results]
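One way to read the diagram is as two filters applied in sequence to the search results. The sketch below is purely illustrative: the function, the toy classifiers and the result records are all hypothetical stand-ins, and the trained task and genre classifiers would take the place of the two lambdas.

```python
def filter_results(results, task, genre, classify_task, classify_genre):
    """Keep only results whose predicted task and predicted genre match the request."""
    by_task = [page for page in results if classify_task(page) == task]
    return [page for page in by_task if classify_genre(page) == genre]


# Toy stand-ins for the trained classifiers (assumptions, not the real models).
toy_task = lambda page: "Shopping" if "price" in page["text"] else "Other"
toy_genre = lambda page: page["genre_hint"]

search_results = [
    {"url": "http://shop.example.com/", "text": "price cart ship",
     "genre_hint": "corporate home page"},
    {"url": "http://blog.example.com/", "text": "my price list",
     "genre_hint": "personal home page"},
    {"url": "http://news.example.com/", "text": "events news",
     "genre_hint": "organization home page"},
]
filtered = filter_results(search_results, "Shopping", "corporate home page",
                          toy_task, toy_genre)
```

Passing the classifiers in as arguments keeps the filtering step independent of whichever models are eventually chosen.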

Thank You

Web Information Filtering Lab

http://www.cs.dal.ca/wifl/
