Automatic Discovery and Classification of search interface to the Hidden Web Dean Lee and Richard Sia Dec 2 nd 2003

Dec 20, 2015

Transcript
Page 1:

Automatic Discovery and Classification of search interface to the Hidden Web

Dean Lee and Richard Sia

Dec 2nd 2003

Page 2:

Goals and Motivation

Hidden Web sites are informative
No current search engine can index them (even Google)

Page 3:

Search Interface

[Screenshot; labeled element: search terms]

Page 4:

Search Results

[Screenshot; labeled element: results]

Page 5:

Goals and Motivation

Hidden Web sites are informative
No current search engine can index them (even Google)
Next-generation search engine
    Automatic discovery of search interfaces
    Classification/categorization of hidden websites
    Generating queries to search interfaces
    Crawling and indexing of these web pages

Page 6:

Tasks

Crawling

Search Interface Detection

Domain classification

Page 7:

Crawling

2.2M URLs from dmoz; 1.7M eventually crawled
Crawled in November 2003
20G / 4G before / after compression
Root-level web pages only, e.g. http://www.ucla.edu
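
A minimal sketch of restricting a URL list to root-level pages. The slides do not say how the dmoz URLs were normalized, so the helper names `root_url` and `root_level_only` and the choice to lowercase the host are assumptions made for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def root_url(url):
    """Reduce a URL to its site root, e.g. http://www.ucla.edu/ (assumed normalization)."""
    parts = urlsplit(url)
    # Keep scheme and host; drop path, query and fragment.
    return urlunsplit((parts.scheme, parts.netloc.lower(), "/", "", ""))


def root_level_only(urls):
    """Keep one root page per site, preserving first-seen order."""
    seen = set()
    roots = []
    for url in urls:
        root = root_url(url)
        if root not in seen:
            seen.add(root)
            roots.append(root)
    return roots
```

Crawling one page per site rather than every page is what keeps the crawl in the range of millions of fetches instead of billions.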

Page 8:

Why root-level only?

80% of search interfaces are contained in root-level pages (from UIUC)
Efficient, cost effective
    3B web pages compared to 8M web sites

Page 9:

Search Interface Classification

Most search interfaces are inside <Form> </Form> tags
Identify specific features (e.g. keywords, special tags, etc.) that are common to all search interfaces

Page 10:

Search Interface Classification

Potential attributes we’ve considered

Page 11:

Action count

Page 12:

Select count

Page 13:

Password field
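
The attributes named on the preceding slides (action count, select count, password field) could be extracted along the following lines. The slides name the attributes but do not define them, so the definitions here are assumptions: `action_count` counts forms with a non-empty `action`, `select_count` counts `<select>` elements inside forms, and `has_password` flags any password input inside a form. The class and function names are hypothetical.

```python
from html.parser import HTMLParser


class FormFeatureParser(HTMLParser):
    """Collect page-level form features (assumed definitions, see lead-in)."""

    def __init__(self):
        super().__init__()
        self.in_form = False
        self.features = {"action_count": 0, "select_count": 0, "has_password": False}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so <Form> arrives as "form".
        attrs = dict(attrs)
        if tag == "form":
            self.in_form = True
            if attrs.get("action"):
                self.features["action_count"] += 1
        elif self.in_form and tag == "select":
            self.features["select_count"] += 1
        elif self.in_form and tag == "input":
            # A missing type attribute defaults to a text input.
            if (attrs.get("type") or "text").lower() == "password":
                self.features["has_password"] = True

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_form = False


def form_features(html):
    """Return the feature dict for one HTML page."""
    parser = FormFeatureParser()
    parser.feed(html)
    parser.close()
    return parser.features
```

A password field is a plausible negative signal, since it suggests a login form rather than a search form; that reading is an interpretation, not something the slides state.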

Page 14:

Training sets for C4.5

Initially only a positive training set
Several classification iterations using real web data
For each iteration, add correctly classified pages to the positive and negative training sets
For misclassified web pages, do the same
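
The iterative procedure above can be sketched as a loop. This is an illustration under stated assumptions, not the authors' tool: `train` stands in for C4.5 (a toy nearest-neighbour learner is used so the example runs without extra libraries), `manual_label` stands in for the human check, and every function name is hypothetical.

```python
def train_1nn(training):
    """Toy stand-in for C4.5: return the label of the nearest training example."""
    snapshot = list(training)  # freeze the training set for this iteration

    def classify(x):
        nearest = min(
            snapshot,
            key=lambda ex: sum((a - b) ** 2 for a, b in zip(ex[0], x)),
        )
        return nearest[1]

    return classify


def grow_training_set(positives, batches, train, manual_label):
    """Start from positives only, then add manually verified pages each iteration.

    Returns the final training set and the number of misclassified pages per iteration.
    """
    training = [(x, True) for x in positives]
    errors_per_iteration = []
    for batch in batches:
        classify = train(training)
        checked = []
        errors = 0
        for x in batch:
            predicted = classify(x)
            actual = manual_label(x)
            if predicted != actual:
                errors += 1
            # Correct and misclassified pages are both added under the verified label.
            checked.append((x, actual))
        training.extend(checked)
        errors_per_iteration.append(errors)
    return training, errors_per_iteration
```

Tracking the per-iteration error count is one way to judge when further iterations stop paying off.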

Page 15:

Training set

3 iterations seem sufficient

Page 16:

Results

Checked via random sampling: select 100 random web pages and manually check the correctness of the classification
91.5% accuracy: correctly identifies search interfaces (precision)
87.5% accuracy: correctly identifies non-search interfaces
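
A small sketch of how the two figures could be computed from a manually checked sample. It assumes each sampled page is recorded as a (predicted, actual) pair and that the second figure is measured over pages predicted to be non-search interfaces; the slides do not spell out either detail, and the function name is hypothetical.

```python
def per_class_accuracy(sample):
    """sample: list of (predicted_is_search, actually_is_search) boolean pairs.

    Returns (precision on predicted search interfaces,
             fraction of predicted non-search pages that are truly non-search).
    """
    predicted_pos = [actual for predicted, actual in sample if predicted]
    predicted_neg = [actual for predicted, actual in sample if not predicted]
    precision = sum(predicted_pos) / len(predicted_pos) if predicted_pos else None
    negative_accuracy = (
        sum(not actual for actual in predicted_neg) / len(predicted_neg)
        if predicted_neg
        else None
    )
    return precision, negative_accuracy
```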

Page 17:

Results

Random sampling estimation: 124311 search interfaces currently exist in our data set (1.7M crawled root-level pages)
OCLC estimated about 8.7M unique websites in 2003
Total number of search interfaces on the web (upper bound), scaling by the ratio of all websites to crawled websites and by the 80% share of search interfaces found at root level:

\[ 124311 \times \frac{8.7}{1.7} \times \frac{1}{0.8} \approx 800\text{K} \]

Page 19:

Domain Classification

Manually extract domain-specific keywords
    Cars – odometer, mileage, airbag, acura, …
    Books – ISBN, author, title, publication, …
240 keywords used
4 target categories {Books, Cars, Entertainment, Travel} + “Others”
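
A minimal sketch of turning page text into keyword-count features. Only the keywords shown on the slide are included, since the full list of 240 is not given; the tokenization rule and the names `DOMAIN_KEYWORDS` and `keyword_counts` are assumptions.

```python
import re

# Only the example keywords from the slide; the full 240-keyword list is not given.
DOMAIN_KEYWORDS = {
    "Cars": ["odometer", "mileage", "airbag", "acura"],
    "Books": ["isbn", "author", "title", "publication"],
}


def keyword_counts(text, keywords_by_domain=DOMAIN_KEYWORDS):
    """Count occurrences of each domain keyword in a page's text (case-insensitive)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    counts = {}
    for keywords in keywords_by_domain.values():
        for keyword in keywords:
            counts[keyword] = tokens.count(keyword)
    return counts
```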

Page 20:

Domain Classification: Naive Bayes classifier

Bad result
    Keywords used are not specific enough to distinguish between domains
    Websites span different topics
Probabilistic
Trap of analysis based on content only
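
For reference, a keyword-based Naive Bayes classifier can be sketched as below. This is a generic multinomial Naive Bayes with Laplace smoothing, written for illustration; the slides do not describe the exact variant, features, or training data used, and the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(examples, vocabulary):
    """examples: list of (tokens, label). Returns (log_priors, log_likelihoods)."""
    label_counts = Counter(label for _, label in examples)
    word_counts = defaultdict(Counter)
    for tokens, label in examples:
        word_counts[label].update(t for t in tokens if t in vocabulary)
    log_priors = {c: math.log(n / len(examples)) for c, n in label_counts.items()}
    log_likelihoods = {}
    for c in label_counts:
        # Laplace smoothing: add one count per vocabulary word.
        total = sum(word_counts[c].values()) + len(vocabulary)
        log_likelihoods[c] = {
            w: math.log((word_counts[c][w] + 1) / total) for w in vocabulary
        }
    return log_priors, log_likelihoods


def classify(tokens, model):
    """Return the label with the highest posterior log score."""
    log_priors, log_likelihoods = model
    scores = {
        c: log_priors[c]
        + sum(log_likelihoods[c][t] for t in tokens if t in log_likelihoods[c])
        for c in log_priors
    }
    return max(scores, key=scores.get)
```

Because every keyword contributes to every category's score, a page that mixes vocabulary from several domains still gets forced into a single category, which is consistent with the difficulty the slide reports for websites that span topics.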

Page 21:

Domain Classification: C4.5 classification tree

“Better” result
    More are classified as “Others”
Deterministic
Improvement needed
    More keywords
    Link structure
    Analysis of search results

Page 22:

Conclusion

A tool for automatic search interface detection
Rough estimate of the total number of search interfaces (size of the Hidden Web)
Domain classification
    Still needs improvement

Page 23:

Some statistics

Precision
    Books – 34%
    Cars – 41%
    Entertainment – 48%
    Travel – 58%

Some examples
    http://www.barnesandnoble.com – Books
    http://www.amazon.com – Entertainment
    http://www.travelocity.com – Travel
    http://www.cnn.com – Others
    http://www.latimes.com – Cars
    http://www.nih.gov – Travel
    http://www.healthfinder.gov – Others