Querying Web Pages with Database Query Languages
by
Xiaoyu Yang
Graduate Program in Computer Science
Submitted in partial fulfillment
of the requirements for the degree of
Master of Science
Faculty of Graduate Studies
The University of Western Ontario
London, Ontario
November 1998
© Xiaoyu Yang 1998
National Library of Canada
Acquisitions and Bibliographic Services
395 Wellington Street, Ottawa ON K1A 0N4, Canada
The author has granted a non-exclusive licence allowing the National Library of Canada to reproduce, loan, distribute or sell copies of this thesis in microform, paper or electronic formats.
The author retains ownership of the copyright in this thesis. Neither the thesis nor substantial extracts from it may be printed or otherwise reproduced without the author's permission.
ABSTRACT
As the World Wide Web is growing at a phenomenal rate, it becomes more and more
difficult to retrieve information of interest from the enormous number of resources that are
available. Currently, there are two ways to retrieve information from the Web, namely,
navigation/browsing and searching by search engines. However, these search methods
have significant limitations, such as the "lost-in-hyperspace" phenomenon, the ignorance
of the hypertext structure, etc. These drawbacks motivated the development of a flexible
and powerful web query system.
This thesis presents a prototype system developed to query the Web with database
query languages. In our prototype system, the Web is modeled as a labeled directed graph
which can be stored in a relational database. A parser was designed and implemented in
our prototype system to extract the information of a web page from the source HTML file
and store it into the database. Three query facilities are developed in the prototype system,
namely, the content query, the structure query and the advanced query, which can be used
to pose queries on both the content and the hypertext structure of web pages. Extensive
experiments have been performed to test the prototype system. The testing results show
that database query languages can be used successfully in querying the Web.
ACKNOWLEDGMENTS
I wish to express my most sincere gratitude and appreciation to my supervisor, Dr.
Sylvia Osborn, for her time, invaluable guidance, support, understanding and
encouragement during the course of this work.
Thanks are also extended to all my friends and fellow graduate students for their
suggestions, encouragement and for the friendly environment they provided.
Most of all, I would like to thank my parents and my husband. Their love and support
are invaluable.
TABLE OF CONTENTS
CERTIFICATE OF EXAMINATION
ABSTRACT
ACKNOWLEDGMENTS
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
Chapter 1 Introduction
1.1 Motivation
1.2 Objectives
1.3 Thesis Organization
Chapter 2 An Overview of Hypertext and the World Wide Web
2.1 An Introduction to Hypertext
2.1.1 A Brief History of Hypertext
2.1.2 Hypertext Concepts
2.2 The World Wide Web
2.2.1 A Brief History of the World Wide Web
2.2.2 The World Wide Web Concepts
2.3 HyperText Markup Language (HTML)
2.3.1 A Brief History of HTML
2.3.2 Common HTML Tags
2.3.3 Examples
2.4 Searching the World Wide Web
2.4.1 Navigation/Browsing
2.4.2 Searching by Search Engines
Chapter 3 Related Work of the Web Querying
3.1 Issues of Querying the Web
3.2 Modeling the World Wide Web
3.2.1 Object Exchange Model (OEM)
3.2.2 Araneus Data Model (ADM)
3.3 Web Querying Systems
3.3.1 WebSQL, W3QS, WebLog
3.3.2 Wrappers Used in Querying the Web
3.3.3 Summary of Web Querying Systems
3.4 A Review of Database Query Languages
Chapter 4 The Prototype Web Querying System
4.1 Data Model
4.2 A Relational View of World Wide Web
4.3 System Overview
4.4 Mapping HTML Files to the Database
4.4.1 Extracting Information from HTML Files
4.4.1.1 Extracting Title Information
4.4.1.2 Extracting Hyperlinks and Description
4.4.1.3 Extracting Other Information
4.4.2 Storing the Information in the Database
4.5 Query Facilities
4.5.1 Content Query
4.5.2 Structure Query
4.5.3 Advanced Query
4.6 User Interfaces
4.7 Supplementary Functions
4.8 Summary
Chapter 5 Querying the Web Pages
5.1 Query Methods
5.1.1 Content Query
5.1.2 Structure Query
5.1.3 Advanced Query
5.2 Experimental Results
5.2.1 Content Query
5.2.2 Structure Query
5.2.3 Advanced Query
5.3 Summary
Chapter 6 Discussions and Future Work
6.1 Discussions
6.2 Future Work
References
Vita
LIST OF FIGURES
Figure 2.1 A Sample Hypertext Structure
Figure 2.2 Common Tags in HTML
Figure 2.3 A Sample HTML File
Figure 2.4 A Sample Web Page
Figure 3.1 An OEM Graph
Figure 3.2 A Sample ADM Scheme
Figure 3.3 A Sample Web Page
Figure 3.4 The Architecture of the WebSQL System
Figure 3.5 [caption illegible in source]
Figure 4.1 An Example of Labeled Directed Graph Model
Figure 4.2 A Sample Web Page for Course Description
Figure 4.3 The System Architecture
Figure 4.4 The Query Interface for Content Query
Figure 4.5 Results of a Content Query
Figure 4.6 The Query Interface for Structure Query
Figure 4.7 Results of a Structure Query
Figure 4.8 The Query Interface for Advanced Query
Figure 4.9 Results of an Advanced Query
Figure 4.10 The User Interface for Accessing Query Facilities
Figure 4.11 The User Interface for Supplementary Functions
LIST OF TABLES
Table 4.1 Relation webpage with 1 Tuple
Table 4.2 Relation webpage-d with 1 Tuple
Table 4.3 Relation links with 13 Tuples
Chapter 1
Introduction
1.1 Motivation
Since its creation in 1989, the World Wide Web has been growing at a phenomenal
rate. As a global information resource residing on the Internet, the World Wide Web
contains a large amount of data relevant to almost all domains of human activity:
education, business, entertainment, art, science, politics, religion, etc. There are currently
tens of millions of documents on the Web, and the number is growing. The explosion of
the World Wide Web, on the one side, is providing more and more information; on the
other side, however, [text illegible in source] is
the difficulty of retrieving specific information of interest to the user from the enormous
number of resources that are available [1]. The most common technology used for
searching the Web is based on browsing the web pages by following links or searching by
sending information retrieval requests to "index servers" [2]. While navigation and
browsing are useful, they can lead to the well-known "lost-in-hyperspace" phenomenon.
On the other hand, limitations also exist in using search engines to seek out
information from the Web. One major limitation of search engines is that they provide only
keyword search, which means this kind of search cannot effectively use the way
information is structured in Web documents [3]. Therefore, complex queries regarding
hypertext structures are not allowed. For example, there is no way to find out the
hypertext links of interest within a given web page by using search engines. Under these
circumstances, a flexible and powerful search system for querying the Web is really
needed.
1.2 Objectives
The main objective of this thesis is to design and implement a prototype system,
which can be used to query web pages with database query languages. In order to
overcome the major limitation of search engines mentioned in Section 1.1, our
prototype system should be able to support structure queries, which are queries
posed on the hypertext structures of the web pages. However, a new issue arises by
adding this facility to our prototype system, i.e., how to represent the structure of the web
pages. Thus, one of the objectives of this thesis is to build up a data model that is
comprehensive enough to capture some important aspects involved in querying the Web.
Unlike other web querying systems, our prototype system exploits an existing
database query language to query the web pages. Accordingly, another objective of this
thesis is to explore the benefit that we can get from using the database query language to
query the web pages.
1.3 Thesis Organization
This thesis consists of six chapters.
Chapter 1 provides an introduction to the thesis.
Chapter 2 gives an overview of Hypertext and the World Wide Web. HyperText
Markup Language and the methods currently used to search the web are also introduced.
Chapter 3 introduces the issues regarding querying the Web and presents several
related works for querying the World Wide Web.
Chapter 4 focuses on the design and implementation of the prototype system. In this
chapter, the data model used to model the structure of the web pages is described. A
system overview is provided. The method of mapping from HTML files to the database
and the query facilities developed in this prototype system are also introduced.
Chapter 5 provides and discusses the experimental results obtained from three
different types of queries, i.e., content query, structure query and advanced query.
Chapter 6 concludes this thesis and offers recommendations for future work.
Chapter 2
An Overview of Hypertext and the World Wide Web
The World Wide Web can be considered as a huge hypertext system on the Internet,
where the hypertext nodes are simply HTML files residing on the file systems of certain
Internet hosts. This chapter provides an overview of the hypertext concept and an
introduction to the World Wide Web.
2.1 An Introduction to Hypertext
Hypertext is text with links. It differs from traditional text in providing quick access
[remainder of sentence illegible in source]. Hypertext structure is the
fundamental structure of the World Wide Web, by which Web documents are organized.
2.1.1 A Brief History of Hypertext
Hypertext has a surprisingly rich history compared to the World Wide Web. The first
system we would now describe as a hypertext system was proposed by Vannevar Bush as
early as 1945. This system, the Memex, was never implemented, but was only described in
theory in Bush's paper [23]. It was described as "... a device in which an individual stores
his books, records, and communications, and which is mechanized so that it may be
consulted with exceeding speed and flexibility." The actual word "hypertext" was coined
by Ted Nelson in 1965. Nelson was an early hypertext pioneer with his Xanadu system,
which he has been developing ever since. Parts of Xanadu do work and have been a
product from the Xanadu Operating Company since 1990. The basic Xanadu idea is that
of a repository for everything that anybody has ever written, giving a truly universal
hypertext system. Nelson views hypertext as a literary medium and he believes that
"everything is deeply intertwingled" and therefore has to be on-line together. A final event
was the extremely rapid growth of hypertext on the Internet in the mid-1990s,
spearheaded by the specification of the World Wide Web by Tim Berners-Lee and
colleagues at CERN (the European Center for Nuclear Physics Research in Geneva,
Switzerland). Detailed information about the history of hypertext can be found in [4].
2.1.2 Hypertext Concepts
The simplest way to define hypertext is to compare it with traditional text. All
traditional text is sequential, i.e., there is a single linear sequence defining the order in
which the text is to be read. Generally, when we read a book, we read page one first, and
then page two, and then page three, and so on. Hypertext, however, is non-sequential, that
is, there is no single order that determines the sequence in which the text is to be read.
Usually, hypertext presents several different options for readers to explore rather than a
single stream of information.
Figure 2.1: A Sample Hypertext Structure
Figure 2.1 illustrates a sample hypertext structure. In this figure, A, B, ..., F represent
units of information, which are called nodes; a bar (-) represents an anchor, and an arrow (->) represents
a link. Each of the nodes may have pointers to other units, and these pointers are called
links. Links provide the mechanism whereby nodes are connected to one another. The
node from which a link originates is called the reference. Points within the reference
where links are defined are referred to as anchors. The node at which a link ends is called
the referent [4]. As can be seen, the entire hypertext structure forms a network of nodes
and links. Readers move about this network in an activity that is often referred to as
browsing or navigating to emphasize that users must actively determine the order in which
they read the nodes. For example, if a reader is currently reading node A, the next node
the reader can choose is B, D or E. If the reader selects B, then the reader can
either read all the text in node B or jump to node C or F, and so on.
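The node-and-link structure described above can be sketched as a small adjacency map. This is only an illustration: it contains just the links that the text mentions (A to B, D and E; B to C and F), and the names `links`, `next_choices` and `browse` are invented for the example.

```python
# Hypothetical adjacency map for part of Figure 2.1: each key is a node
# (the reference) and each value lists the nodes its links end at
# (the referents). Only links named in the text are included.
links = {
    "A": ["B", "D", "E"],
    "B": ["C", "F"],
}


def next_choices(node):
    """Return the nodes a reader can jump to from `node` (empty if none known)."""
    return links.get(node, [])


def browse(start, picks):
    """Follow a sequence of reader choices, rejecting links that do not exist."""
    path = [start]
    for target in picks:
        if target not in next_choices(path[-1]):
            raise ValueError(f"no link from {path[-1]} to {target}")
        path.append(target)
    return path
```

Calling `browse("A", ["B", "F"])` follows the reading order used in the example above: start at A, select B, then jump to F.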
2.2 The World Wide Web
The most widely used system for hypertext is the World Wide Web. It is also one of
the newest Internet services. The World Wide Web has the ability to combine text, audio,
video, graphics, etc. together. Its hypertext structure provides quick access to other
related Web documents. Now, the World Wide Web is emerging as the newest and
most exciting tool for locating and displaying information on the Internet.
2.2.1 A Brief History of the World Wide Web
The history of the World Wide Web is fairly short. It was developed at CERN in the late
1980s. The purpose of the World Wide Web was to allow anyone at CERN to easily
access and display documents that were stored on a server anywhere on the Internet. By
the end of 1990, the researchers at CERN had a text-mode browser and a graphical
browser for the NeXT computer. During 1991, the World Wide Web was released for
general usage at CERN. Initially, access was restricted to hypertext and UseNet news
articles. As the project advanced, interfaces to other Internet services were added, such as
WAIS, anonymous FTP, Telnet, and Gopher. In 1992, the World Wide Web project was
made public. People began to create their own Web servers to make their information
available to the Internet and to design easy-to-use interfaces to the World Wide Web. By
the end of 1993, browsers had been developed for many different computer systems,
including X Windows, Apple Macintosh, and PC/Windows. By the summer of 1994, the
World Wide Web had become one of the most popular ways to access Internet resources.
2.2.2 The World Wide Web Concepts
The World Wide Web uses a client-server architecture for distributed hypertext that
can be accessed over the Internet. Servers run specialized software, called HTTPD
(HyperText Transfer Protocol Daemon), which accepts requests that arrive over the
network, performs a function in response to each request, and then returns the results to
the requester. Servers are also regarded as a collection of Web documents including
hypertext files, images, video clips, sound files, etc., which can be shared over the World
Wide Web. An executing program becomes a "client" when it is able to send a request to a
server, await a response, and process that response [4]. A Web browser is such a client
that knows how to interpret and display documents that it finds on the World Wide Web;
examples are Netscape and Microsoft Internet Explorer. All the servers provide their data
to the client software in a standardized format called HTML (HyperText Markup
Language) through a standard communication protocol called HTTP (HyperText Transfer
Protocol). This combination of HTML and HTTP constitutes the hypertext abstract
machine and is the only point at which client and server computers need to agree.
The World Wide Web has a standard way of referencing a document by using a
Uniform Resource Locator (URL), no matter what the document's type is, for example,
text, sound file, etc. A URL is a complete description of a document, containing the
location of the document you want to retrieve. The location could be on your local disk or
on an Internet site halfway around the world. A URL can be set up to be absolute or
relative. An absolute URL contains the complete address of the document that is being
referenced, including the host name, directory path, and file name. The formal syntax of an
absolute URL is:
<Protocol>://<Host>/<Path>/<Filename>#<Location>
where <Protocol> is a protocol that the Web browser can use to retrieve documents,
such as http, ftp, gopher, news, mail, etc.; <Host> is the server name; <Path> is a Unix-style
path for the file; <Filename> is the actual file name and <Location> is a textual
label in the file. For example, the following URL is an absolute URL:
[example URL missing in source; its parts were labeled Protocol, Host, Path, File Name, Location]
However, if the destination document is on the same Web server as the source document,
a relative URL may be used. A relative URL omits the protocol and host, or even the
path; that is, a relative URL only specifies the subdirectory, if available, and the file name.
The following example illustrates a relative URL that can be found in our example HTML
file in Figure 2.3.
[example URL missing in source; its parts were labeled Subdirectory, File Name]
It is worth mentioning that the equivalent absolute URL can always be constructed
from the URL of the current document and the relative URL in the current document.
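The decomposition of an absolute URL and the construction of an absolute URL from a relative one can be illustrated with Python's standard `urllib.parse` module. The sketch below is not taken from the thesis; the host `www.example.org` and all paths and file names are hypothetical values chosen for the illustration.

```python
from urllib.parse import urljoin, urlsplit

# Hypothetical absolute URL containing all five parts named in the text.
absolute = "http://www.example.org/docs/guide/intro.html#section2"

parts = urlsplit(absolute)
protocol = parts.scheme                          # the <Protocol> part
host = parts.netloc                              # the <Host> part
path, _, filename = parts.path.rpartition("/")   # <Path> and <Filename>
location = parts.fragment                        # the <Location> label

# A hypothetical relative URL found inside the document above:
# a subdirectory followed by a file name.
relative = "examples/first.html"

# The equivalent absolute URL is built from the URL of the current
# document and the relative URL.
resolved = urljoin(absolute, relative)
```

Here `resolved` keeps the protocol, host and directory of the current document and replaces only the file name with the relative part.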
2.3 HyperText Markup Language (HTML)
HTML is the language used when writing a document that is to be displayed through
the World Wide Web. It will be apparent from the example in Figure 2.3 that HTML is a
fairly simple markup language that describes how a document is structured. It is therefore
easy for people to write HTML files for distribution over the World Wide Web, and this
simplicity has been one of the factors in the success and growth of the World Wide Web.
2.3.1 A Brief History of HTML
HTML was originally developed by Tim Berners-Lee while at CERN, and
popularized by the Mosaic browser developed at the National Center for Supercomputing
Applications (NCSA). During the course of the 1990s it has blossomed with the explosive
growth of the World Wide Web. In 1994, HTML 2.0 was developed to codify common
practice. HTML+ (1993) and HTML 3.0 (1995) proposed much richer versions of
HTML. In 1996, the efforts of the World Wide Web Consortium's HTML Working Group
to codify common practice resulted in HTML 3.2. Now, HTML 4.0 is the latest version
with more powerful and mature features.
2.3.2 Common HTML Tags
HTML is an application of SGML (Standard Generalized Markup Language) [4]. It
defines a collection of tags that can be used to publish on-line documents with headings,
texts, tables, lists, photos, etc. and to retrieve on-line information via hypertext links.
These tags also provide a means to enable images, sound and even animation to be
embedded in Web documents and to design forms for conducting transactions with remote
services, for use in searching for information, making reservations, ordering products, etc.
Figure 2.2 contains some of the commonly used tags in an HTML document.
The most important feature of HTML is its ability to insert hypertext links into an
HTML document so that other HTML documents can be linked by these links. Hypertext
links in an HTML file are pointers from keywords appearing in the document to a
destination. The destination could be another HTML document or a resource such as an
external image, a video clip, or a sound file. HTML supports hypertext links through an
anchor tag <A> in the form of:
<A HREF="DestURL">AnchorText</A>
where HREF stands for Hypertext REFerence; DestURL is the URL of the
destination document and AnchorText is the text to appear as an anchor when the
document in which this hypertext link is defined is displayed by the Web browser.
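To make the anchor form concrete, the following sketch collects (DestURL, AnchorText) pairs from an HTML string using Python's standard `html.parser` module. It is an illustration only and is not the parser developed in this thesis; the class name `AnchorCollector` and the function `extract_anchors` are invented for the example.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect (DestURL, AnchorText) pairs from <A HREF="..."> ... </A> tags."""

    def __init__(self):
        super().__init__()
        self.anchors = []      # finished (href, text) pairs
        self._href = None      # HREF of the anchor currently open, if any
        self._text = []        # text pieces seen inside the open anchor

    def handle_starttag(self, tag, attrs):
        # html.parser passes tag and attribute names in lowercase.
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.anchors.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_anchors(html_source):
    """Return the list of (DestURL, AnchorText) pairs found in html_source."""
    parser = AnchorCollector()
    parser.feed(html_source)
    parser.close()
    return parser.anchors
```

Because the parser normalizes tag names, both `<A HREF=...>` and `<a href=...>` are handled by the same branch.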
2.3.3 Examples
Basically, an HTML document consists of two parts: the head and the body. The head
contains meta-information about the document. It is specified using the tag <title> ...
</title> [remainder of sentence illegible in source]. The body contains all the displayable elements. Users can navigate
over the various documents by activating the hypertext links of interest. The display that is
the result of viewing an HTML document using a browser is called a page. Figures 2.3 and
2.4 show a sample HTML file and the corresponding file displayed in a Web browser.
<A HREF="http://www.uwo.ca/">The University of Western Ontario</A> | <A HREF="http://www.city.london.on.ca/">London</A> | <A HREF="http://www.gov.on.ca/">Ontario</A> | <A HREF="http://canada.gc.ca/">Canada</A> <BR><BR><BR>
/students/xyang/index.html"> since July 1, 1998.</P>
</CENTER><HR><BR><BR>
<IMG SRC="painting.gif"> <FONT SIZE=1><B> Last modified by
<A HREF="mailto:xyang@csd.uwo.ca">Xiaoyu Yang</A> on February 18, 1998.</B></FONT>
</BODY>
</HTML>
Figure 2.3: A Sample HTML File
[Figure 2.4 shows the rendered page, headed "... to Xiaoyu Yang's Homepage", with the links Personal Information, Research Work, My Supervisor, Courses, TA and Interesting Links.]
Figure 2.4: A Sample Web Page
2.4 Searching the World Wide Web
One of the real advantages of the World Wide Web system is that ordinary users can
create web pages that users anywhere on the Internet can display. This feature allows
ordinary users to publish information that can be used by the entire world and also results
in the rapid growth of the amount of information available on the World Wide Web. There
are software programs, graphics, magazine articles, job postings, government reports,
weather maps, and thousands and thousands of documents, and so on. Therefore, it is not
always easy for users to find what they want - or to even know how to find it. Currently,
there are two main methods that are used to search for the information of interest:
navigation/browsing and searching by search engines.
2.4.1 Navigation/Browsing
This is an excellent method of locating information which a user may not have
considered available on the World Wide Web. It involves starting somewhere and just
following the links. This is the simplest way to find information, but is not a reliable method
to find a particular piece of information on the World Wide Web. As mentioned in Chapter
1, readers often experience the "lost-in-hyperspace" phenomenon when navigating the
World Wide Web.
A solution to the navigation problem is to provide users with classified directories
[14], which can guide users to useful resources on a particular subject or of a particular
type. An excellent example of this is Yahoo, which allows the users to search through its
hierarchy. Since documents of a similar subject/type may be grouped together, this method
obviously can narrow the scope of a search. Navigation by classified directories, however,
is sometimes limited by the documents available and there is still a risk that users may
become disoriented or have trouble finding the information they need.
2.4.2 Searching by Search Engines
The need for retrieving information from the World Wide Web has led to the
development of a number of search engines. They search the Web according to the
keywords or phrases specified by the user and return the results which are related to the
keywords or phrases. Typically, search engines are composed of a resource locator (also
known as a robot) and a search interface. Searches are based on an index database which
stores the information of the web pages. The resource locator is run periodically to gather
information from the Web and create and update the index database. The search interface
takes a user query, passes the request to the Web server, which performs a retrieval from
the index database and returns the results. The results appear in hypertext and can
immediately be selected to link to the required documents.
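The division of labor between the index database and the search interface can be sketched with a tiny in-memory keyword index. This is a deliberately simplified illustration, not how any particular search engine is implemented: the page texts stand in for what a robot would gather, and the function names `build_index` and `search` are invented for the example.

```python
from collections import defaultdict


def build_index(pages):
    """Map each lowercase word to the set of URLs whose text contains it.

    `pages` is a dict of URL -> page text, standing in for what the
    resource locator (robot) would have gathered.
    """
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def search(index, query):
    """Return the URLs containing every keyword in `query`, sorted."""
    words = query.lower().split()
    if not words:
        return []
    hits = set.intersection(*(index.get(word, set()) for word in words))
    return sorted(hits)
```

Note that the index records only which words occur on which page. Nothing about the links between pages is stored, which is why a purely keyword-based search cannot answer questions about hypertext structure.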
There are various search engines with different [illegible in source] capabilities on different servers.
Some of the search engines are: AltaVista, InfoSeek, Lycos, etc. Although there are a
large number of search engines available on the Web, they are all used in exactly the same
simple way: type in some text, and get back a hypertext answer which points to things that
were found by the search. It is an easy and useful way to search the World Wide Web by
search engines. However, as Martijn Koster says in his article [6], robots "will become less
effective and more problematic as the Web grows". The major limitation of search engines
is that they support only keyword search, which means it is impossible to pose queries on
hypertext structures. For example, assume we know the URL of the home page of the
Computer Science Department at the University of Western Ontario and we would like to
be able to restrict the search to only pages directly or indirectly reachable from this page.
With currently available search tools, this kind of query is not possible. There are also
some other drawbacks with search engines, such as they cannot be adapted easily to the
requirements of a specific user, their query language is poor and they often return too
many answers, badly ordered and mainly irrelevant. These limitations stimulate the need
for new and powerful search tools.
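The "directly or indirectly reachable" restriction mentioned above is a recursive query over hyperlinks. As a rough sketch of how such a query could look once links are stored in a relational table, the example below uses SQLite's recursive common table expressions through Python's standard `sqlite3` module. It is not the query facility of the prototype system; the table layout `links(src, dst)` and all URLs in the usage are assumptions made for the illustration.

```python
import sqlite3


def reachable_pages(link_rows, start_url):
    """Return all URLs reachable from start_url by following one or more links.

    `link_rows` is an iterable of (src, dst) pairs, one per hyperlink.
    """
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema: one row per hyperlink from page `src` to page `dst`.
    conn.execute("CREATE TABLE links (src TEXT, dst TEXT)")
    conn.executemany("INSERT INTO links VALUES (?, ?)", link_rows)
    # UNION (rather than UNION ALL) discards rows already found, so the
    # recursion terminates even when the links form a cycle.
    rows = conn.execute(
        """
        WITH RECURSIVE reach(url) AS (
            SELECT dst FROM links WHERE src = ?
            UNION
            SELECT links.dst FROM links JOIN reach ON links.src = reach.url
        )
        SELECT url FROM reach ORDER BY url
        """,
        (start_url,),
    ).fetchall()
    conn.close()
    return [url for (url,) in rows]
```

A keyword condition could then be applied only to the pages returned by this function, which is the kind of combined structure and content restriction that keyword-only search tools cannot express.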
Chapter 3
Related Work of the Web Querying
The World Wide Web is a distributed, ever growing, global information resource. The
rapid growth of the Web makes the wealth of information become more and more difficult
to mine. Currently, there are only two ways to search the information available on the
Web: navigation/browsing or searching by search engines. These two methods, however,
have important limitations, as stated in Chapter 1. Thus, the situation here is we have an
invaluable information resource, but cannot use it effectively. This compelling need for
querying the Web in a flexible and powerful way has led to the development of a number
of new web querying languages and systems. In this chapter, we focus on the related work
in the area of querying the Web. First, the important issues and the difficulties existing in
querying the Web are presented. Then, data models used to describe the Web are
introduced. Finally, several on-going Web querying projects are discussed.
3.1 Issues of Querying the Web
One important and fundamental issue of querying the Web is the design of a data
model, which should be comprehensive enough to capture most of the important aspects
involved in querying the Web. On the Web, data consists of files in a particular format,
HTML, with some structuring primitives such as tags and anchors. Generally, the
structure of HTML files is irregular, implicit, partial and frequently changing. These files
do have some structure but it is too irregular to be easily modeled by using a relational or
an object-oriented approach [19], especially when the structure is nested or cyclic.
Accordingly, these kinds of files are called semi-structured files [18]. How to model
semi-structured files is thus an essential issue in querying the Web.
Another important issue in the area of querying the Web is extracting information
from the Web. The irregularity of the structure of web pages results in the difficulty of
extracting the information. This problem has been studied and partially solved for SGML
documents [7, 8]. The idea used here is to map the underlying grammar of the document
to an appropriate database schema. Thus, when the document is parsed by using this
grammar, corresponding objects would be created in the database. However, when dealing
with HTML files in the same way, grammars show important limitations. First, the
structure of HTML files is not always completely defined. Second, the structure can be
irregular and HTML files often contain errors, in the sense that they do not fully comply
with HTML grammar rules; missing tags are a common example of these errors.
Moreover, information gathering on the Web lays its emphasis on navigation via
hyperlinks that relate documents to one another. Under these circumstances, the design of
a parser or other tools used to extract information from the Web becomes more difficult.
A query language is also a very important issue of web querying and an absolutely
necessary component of web querying systems. Basically, this kind of language should
have the power of traditional query languages and also support richer data types, allow
recursive queries, etc. Recently, query languages for the Web have attracted a lot of
attention [10]. Several SQL-like query languages have been designed for the Web,
although these languages still need a sound theoretical foundation. Another trend of web
querying is to exploit existing database query languages. These languages are based on
well-defined theory and are fairly mature, and thus should be able to provide more powerful
query facilities.
3.2 Modeling the World Wide Web
As mentioned above, web modeling is a very important issue of web querying. There
are several data models that are proposed to describe the Web. In this section, two
different data models, namely, OEM (Object Exchange Model) [9, 20, 22] and ADM
(Araneus Data Model) [11] are introduced.
3.2.1 Object Exchange Model (OEM)
Object Exchange Model (OEM) is proposed to represent semi-structured data. It is a
simple, self-describing model with object nesting and identity. Data represented in OEM
can be thought of as a graph, with objects as the vertices and labels on the edges. Entities
are represented by objects. Each object has a unique object identifier (oid), a label and a
value. The label is a string denoting the "meaning" of the object. The value can be from
one of the disjoint basic atomic types, such as integer, real, string, gif, html, audio, etc.
The value can also be a complex object which is a set of sub-objects. An object is thus a
3-tuple: <oid, label, value>. A database D = <O, N> is a set O of objects, a subset N of
which are named objects. The intuition is that named objects provide "entry points" into
the database from which sub-objects can be requested and explored. An OEM database
can also be easily viewed as a relational database with a binary relation VAL(oid,
atomic_value) to specify the values of atomic objects and a ternary relation
MEMBER(oid1, label, oid2) to specify the values of complex objects. As can be seen, the
design of OEM is intended to make it a simple, flexible and powerful data model to
describe semi-structured data.
There are some minor variations on the OEM graph model that are used in querying
the Web [21]. These data models use a labeled graph or a graph schema, with nodes
representing web pages and labels representing hyperlinks.
Figure 3.1 illustrates an OEM graph. In this figure, &19 is the identifier of an object
whose complex value is a set containing, among others, the pair ("category", &17). &17 is
the identifier of an atomic object whose value is "gourmet".
Figure 3.1: An OEM Graph [10]
3.2.2 Araneus Data Model (ADM)
The Araneus Data Model (ADM) is a page-oriented data model for the Web. This
means the main construct of the model is that of a page scheme. Each page scheme
describes the structure of a set of homogeneous pages. Each web page is thus considered
as an object with an identifier (the URL) and a set of attributes, one for each relevant
piece of information in the page. The attributes used to describe a web page can be either
simple, like text, image, or link to another page, or complex. Complex attributes are
essentially lists of items, possibly nested. Based on this perspective, the ADM scheme can
be seen as a collection of page schemes, connected using links. Figure 3.2 shows an ADM
scheme with one of the example page schemes corresponding to a web page shown in
Figure 3.3.
Figure 3.2: A Sample ADM Scheme [11]
Figure 3.3: A Sample Web Page [11]
Figure 3.3 shows a sample web page containing the publications by an author. Its
corresponding page scheme AuthorPage can be found in Figure 3.2, which shows an
ADM scheme for the DB&LP Bibliography home page. For each author in the DB&LP
Bibliography home page there is a similar web page, and all of these web pages share the
same structure. Therefore, they can be described by the same page scheme. The
AuthorPage scheme has two attributes: Name and WorkList, which is a list of
publications, i.e., a set of nested tuples. For each paper listed in the WorkList, there is a
page scheme, ConferencePage or JournalPage. These two page schemes in turn contain
attributes of their own, and so on. As can be seen, an ADM scheme deals with structured web
pages in which data are organized according to precise structures and web pages present
strong regularities. Therefore, it is suitable for building database abstractions of large and
fairly well-structured web sites.
3.3 Web Querying Systems
Several web querying languages and systems have been recently proposed. Most of
the efforts are concerned with issues such as the development of data models and query
languages for the Web, defining formal semantics for the proposed languages and
implementation issues. In some of these systems, such as WebSQL [3, 17], W3QL [13]
and WebLog [1], there is a very simple notion of scheme, and web pages are considered
within a single type, i.e., as nodes in a graph, with at most a fixed set of attributes. In other
words, these kinds of systems use OEM or variants of OEM to model the Web. There is
also a kind of web querying system which intends to exploit the regular structure
presented in web pages and therefore uses a more complex data model. For example, the
Araneus Project [11] uses a page-oriented data model to describe the Web. Another trend
in retrieving information from the Web is to focus on the generation of wrappers, which can
facilitate database-like querying of semi-structured data retrieved directly from Web
servers. This kind of web querying system is called a mediator-based system. Related
work in wrapper generation and mediator-based systems can be found in [12].
In this section, we first introduce WebSQL, a web querying system based on a simple
graph data model. Two other systems similar to WebSQL, i.e., W3QS and WebLog,
are also introduced, but only the differences are emphasized. The mediator-based systems
present a different system architecture because of the existence of the wrapper and
mediator. We illustrate in this section the architecture of the wrapper and mediator and
also introduce the generation of a wrapper.
3.3.1 WebSQL, W3QS, WebLog
WebSQL was developed at the University of Toronto. Its query language is an SQL-like
language for querying Web sources by exploiting the structure and topology of the
document networks. The distinctive feature of WebSQL is that it provides a formal semantics
and emphasizes the distinction between local and remote documents. Figure 3.4 provides a
system overview of WebSQL.
Figure 3.4: The Architecture of the WebSQL System (components shown: User Interface, WebSQL Compiler, Virtual Machine, Query Engine, the World Wide Web)
In the WebSQL system, the User Interface accepts the user query and passes the
query to the WebSQL Compiler, where the user query is parsed and translated into a
custom-designed object language. When the Virtual Machine receives the object code
generated according to the query, it executes the object code and sends the requests to the
Query Engine, which finally performs the query, extracts the information of interest from
the Web and returns the results. After the Query Engine passes the
results to the Virtual Machine, the Virtual Machine turns the results into HTML form
and then displays the results to the user [3].
In WebSQL, the hypertext structure is represented by a graph data model [17]. This
model can then be viewed as a relational model composed of two virtual relations: one for
web documents and the other for anchors in web documents. The relational abstraction of
the Web allows one to use an SQL-like query language to pose queries on both content
and hypertext structure. Although the WebSQL query language is designed as a subset of
SQL, it is only a simulation of SQL. Therefore, it cannot be as powerful as SQL, and a lot of
work needs to be done in designing the query language, such as query optimization, etc.,
which actually has been well studied in existing database systems.
W3QS, developed at the Technion, Israel, is a system for SQL-like querying of the
Web. The system architecture is slightly different from that of WebSQL. The distinctive feature of
W3QS is that it interfaces to user programs and UNIX services for analyzing and filtering
semi-structured information from Web servers. It allows the use of Perl regular
expressions and calls to UNIX programs from the "where" clause of an SQL-like query, and
even calls to Web browsers. Moreover, the language has been designed to be highly
extensible, and tools for managing Web forms encountered during navigation are
presented [13]. Again, advanced database techniques are not exploited in W3QS either.
Unlike the two web querying systems mentioned above, WebLog, developed
at Concordia University, Montreal, and the University of North Carolina, emphasizes
manipulating the internal structure of Web documents. Its query language is based on
Datalog-like recursive rules [1].
3.3.2 Wrappers Used in Querying the Web
In a mediator-based system, wrappers are the essential components built around
individual information sources. They are used to accept queries from the mediator,
translate the query into the appropriate query for the individual source, and return the
results to the mediator. They make the Web sources look like databases that can be
queried through the mediator's query language, i.e., a database query language or a
custom-designed query language. Figure 3.5 shows an example of mediator architecture.
In this figure, sources represent several related Web sources in a particular domain of
interest. All of them should conform to the same format. The mediator here is used to
integrate information from multiple Web sources, i.e., Sources 1, 2 and 3, and it is made for a
particular domain of interest.
Figure 3.5: A Sample Architecture of Mediator and Wrappers
When a wrapper is generated for a new Web source, the following steps are involved.
First, the web pages need to be structured, i.e., identifying sections and sub-sections of
interest on a page. Then, a parser should be built for the source pages to extract the
sections of interest. Finally, communication capabilities between the wrapper, mediator
and Web sources should be added so that wrappers can fetch the pages containing the
requested information from the Web source and return them to the mediator. The key idea
of generating a wrapper is to exploit formatting information in web pages to hypothesize
the underlying structure of a page. Once the correct structure is obtained, a wrapper for
the source can be generated without much effort or time and information of interest can be
obtained. When web pages are loosely structured, such as personal home pages, building a
wrapper becomes a difficult task [12].
3.3.3 Summary of Web Querying Systems
All the current systems used for querying the Web provide a query language. These
languages are either SQL-like or Datalog-like and allow for expressing both
structure-specifying queries, based on the organization of the hypertext, and content queries, mainly
based on information retrieval techniques, e.g., search engines. In mediator-based systems,
a database query language can be used as the mediator's query language, as can other
SQL-like query languages. This kind of system is, however, suitable only in cases where web
pages present strong structure. For web pages that are loosely structured, most of
the systems exploit an SQL-like or Datalog-like query language to query them. None of
them exploits a database query language, and thus they cannot benefit from advanced
database technologies. Having this in mind, we are trying to develop a system that uses an
existing database query language to query the Web and show the power of a database
query language in querying web pages.
3.4 A Review of Database Query Languages
Database systems have been in existence for more than 30 years and have been
successfully used in a wide range of application areas, such as business, industry,
scientific research, engineering, and most recently the World Wide Web.
One of the major purposes of database systems is to store data while providing ad hoc
query facilities to query these data. To accomplish this purpose, Dr. E. F. Codd proposed
the relational data model, based on strong mathematical foundations, in 1970. During the
1970s, research and development work on relational database systems was carried out and
several prototypes were developed. The SQL-based database systems of the 1980s
provided, for the first time, a single language to span the whole range of applications, with
support for multiple views of data and independence from physical data structures. Since
that time, relational databases have grown from strength to strength. Because of the
success of these systems, relational databases have become ubiquitous and SQL has
become a world standard database language. It was, however, soon discovered that
traditional database query languages have limited expressive power. For
example, they support only limited data types and cannot compute arbitrary transitive
closures. Some of the challenges faced by relational systems derive from the need to
store and retrieve new types of very large objects with complex state and behavior, such
as multimedia objects and data from the Web. In the mid-1980s, a new type of database
system emerged to meet these challenges, namely object-oriented database systems.
These systems address many of the weaknesses of relational databases by providing
object-oriented features and supporting richer data types. Recursive traversal of object sets
is also possible in these systems. Meanwhile, extensions to the relational model have been
defined recently by introducing the concepts of an Abstract Data Type (ADT) and nested
relations to the relational model to improve its object-orientation, leading to so-called
object-relational database systems. Relational query languages have been extended
and new features have been added. For example, SQL3 (Structured Query Language) is
an effort to turn ANSI SQL-92 into an object-relational query language. Compared with
SQL-92, the new features of SQL3 include not only further developments and extensions
of existing concepts, but also some completely new concepts. One extension in query
facility is to extend query possibilities, for example, by using recursion. New features of
SQL3 include support for ADTs, nested relations, etc. One example of such systems is
DB2, which is a substantial advance over traditional relational systems. The new features
of DB2 include major innovations in query optimization, recursive union, active databases
(triggers), and stored procedures [24]. It integrates object-oriented ideas with the SQL
language to produce an object-relational database management system and provides new
functions and data types, including data types for storing large objects. More importantly,
it provides a means for users to define additional functions and data types of their own to
meet the specialized needs of their applications.
Having evolved to this point, database technology has become fairly mature. It is well known
that database systems offer efficient and reliable technology to query structured data. It is,
however, a new and challenging issue to apply database techniques to the poorly structured
World Wide Web. Once web pages can be described by a database schema, the
management of web data should be able to profit greatly from database technology. Our
attempt to build the prototype system on top of a database system and then use the
database query language to query the web pages is inspired by this idea. The database
management system we use for our web querying system is DB2 Version 2 for common
servers. As can be seen in the later chapters, we benefit a lot from the powerful query
capabilities of DB2, and by using an existing database query language we are released from
the design and implementation of a new query language for the Web.
Chapter 4
The Prototype Web Querying System
Our prototype system is designed and implemented for querying the Web by using
database query languages. The system provides three kinds of queries: content queries,
which are queries posed on the content of web pages; structure queries, which are queries
posed on the underlying hypertext structures of web pages; and advanced queries, which
are arbitrary queries posed on the virtual relations of web pages. The structure of web
pages is modeled in our prototype system by a simple labeled directed graph. This chapter
describes the low-level design of the prototype system, including the underlying data
model, the virtual relations, the parser used to map the HTML files to the database and
the search facilities developed in the prototype.
4.1 Data Model
The World Wide Web is a large, heterogeneous, distributed collection of documents
connected by hypertext links. At the highest level of abstraction, it can be viewed as a
graph whose nodes are web pages that are identified by URLs and have some arbitrary
attributes. In our prototype system, the World Wide Web is modeled as such a simple
labeled directed graph. This model can be viewed as a variant of OEM.
In our data model, each web page is represented as a node in the graph. Each node
has a unique identifier, a label and a value. The identifier of each node is the URL of the
corresponding web page. The label is a string, which is the AnchorText [see Section 2.3.2]
that describes the hyperlink. The value is a set of attributes describing the node.
Labels are also attached to edges. If a node contains hyperlinks, it must have outgoing
edges to other nodes. If a node has no hyperlinks, it does not have any outgoing edge
and therefore is a leaf node. For any two nodes x, y, there can be at most two edges with
different directions between x and y. A node can have at most one edge that points to the
node itself. As can be seen, the data model is a simplified one since it allows for only one
link in any one direction between two pages. Figure 4.1 illustrates the labeled directed
graph modeling part of the web pages for the Department of Computer Science at the
University of Western Ontario.
Figure 4.1: An Example of Labeled Directed Graph Model
Figure 4.1 illustrates the structure of 9 web pages found in the Department of
Computer Science web site on August 2, 1998. These web pages are represented by
nodes, which are circles filled with light gray. &1, &2, ..., &9 represent the URLs of the
corresponding web pages, i.e., nodes 1 to 9. Labels are descriptions of links between
nodes. A possible edge for a node could be an incoming edge, an outgoing edge or a loop.
An incoming edge of a node is a link that points to this node. For example, node 2 has an
incoming edge from node 1 labeled "About the Department". An outgoing edge is a link
pointing to another node. For example, node 1 has an outgoing edge to node 2 labeled
"About the Department". A node has a loop when it has a link that points to the node itself.
For example, node 9 has a loop labeled "Return to Top".
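The constraints of this data model (one edge at most per direction between two pages, optional loops, leaf nodes without outgoing edges) can be sketched as follows. This is an illustrative Python sketch only, not the prototype's implementation; the class name WebGraph and its method names are assumptions, and only the two labeled edges quoted above are taken from Figure 4.1.

```python
from typing import Dict, List, Optional, Tuple


class WebGraph:
    """Labeled directed graph: nodes are URLs, edges carry anchor-text labels."""

    def __init__(self) -> None:
        # Keyed by (origin, destination), so each direction between two
        # pages can hold at most one edge, as the data model requires.
        self.edges: Dict[Tuple[str, str], Optional[str]] = {}
        self.nodes = set()

    def add_link(self, origin: str, dest: str, label: Optional[str] = None) -> bool:
        """Add an edge; return False if that direction is already present."""
        self.nodes.update((origin, dest))
        if (origin, dest) in self.edges:
            return False
        self.edges[(origin, dest)] = label
        return True

    def outgoing(self, url: str) -> List[str]:
        return sorted(d for (o, d) in self.edges if o == url)

    def incoming(self, url: str) -> List[str]:
        return sorted(o for (o, d) in self.edges if d == url)

    def has_loop(self, url: str) -> bool:
        return (url, url) in self.edges

    def is_leaf(self, url: str) -> bool:
        # A leaf node has no hyperlinks, hence no outgoing edges.
        return not self.outgoing(url)


g = WebGraph()
g.add_link("&1", "&2", "About the Department")  # edge quoted from Figure 4.1
g.add_link("&9", "&9", "Return to Top")         # loop quoted from Figure 4.1
```

Keying the edge dictionary on the (origin, destination) pair is one simple way to enforce the single-link-per-direction simplification; a second link between the same pages in the same direction is rejected rather than stored.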
4.2 A Relational View of the World Wide Web
After the structure of web pages is represented by a labeled directed graph model, we
can easily view the Web as a relational database. The only difficulty here is to define the
value, that is, the set of attributes describing the nodes. The set of attributes could be very
complex, reflecting the internal structure of a web page. Each attribute could be related to
a small piece of information presented in the web page. For example, a number of web
pages which provide information for Course Descriptions can be found by the following
URL: [URL illegible]. Each of these web pages
presents the same structure as that shown in Figure 4.2.
Course Description
Computer Science 411a/b Databases II
A selection from the following topics: dependency theory; object-oriented databases; distributed databases and related algorithms; database hardware; information retrieval.
Antirequisite: The former Computer Science 410a.
Prerequisite: Computer Science 319a/b.
3 lecture hours, half course.
Figure 4.2: A Sample Web Page for Course Description
For these web pages, we can use a set of attributes, such as (Description, Antirequisite,
Prerequisite, Load) to define them. In this way, web pages can be described precisely and
there is less chance of losing information in web pages when they are mapped to the
database. Most web pages, however, are loosely structured, for example, personal home
pages. Almost every one of them has its own unique style. Defining attributes as we do in
the above example is almost impossible. As a tradeoff, we take a minimalist approach to
determining the attributes, which captures only common features of web pages.
Generally, in an HTML file corresponding to a web page, there is always a pair of
title tags, i.e., <title> Title </title>, which provides the title information of that page. This
information can be used as an attribute describing the web page. Also, there is some other
general information that can be found in a web page, such as the name of the author, the
number of links contained in the web page, the last modified date, and the size of the
corresponding HTML file, etc. Hence, a set of attributes used to describe an arbitrary web
page can be obtained, e.g., (title, author, linkno, last_modified, size). Once we have
assigned a value to a node, i.e., a web page, we can associate a node with a tuple in a
webpage relation:
webpage (url, title, author, linkno, last_modified, size)
Here, url represents the URL of the web page and is thus the primary key. Except for
linkno and size, which are integers, all other attributes are character strings. Except for the
primary key, all other attributes may be null. As can be seen, this relation provides the
general information for web pages. It gives a web page a highly abstract description.
However, when web pages are mapped into this relation, some of the information may be
lost. As a result, content queries cannot be executed precisely. In order to overcome this
limitation, we need a supplementary relation for the node. This relation is defined as
follows in our prototype:
webpage_d (url, content)
Each node in the data model is related to a tuple in this relation and the primary key url is
the URL of the node. The attribute content holds the whole HTML file of a node.
Here, we take advantage of the new data type 'CLOB (Character Large Object)'
provided in DB2. This data type can contain up to two gigabytes (2^31 - 1 bytes) of
single-byte character data. It has the ability to hold a whole HTML file. By using this relation,
information in a web page will not be lost when the web page is mapped to the database.
Hence, content queries can be executed precisely. The reason why we use two relations to
describe a web page is that relation webpage_d is used for content queries only and
relation webpage is used for constructing query results.
One motivation for developing new web querying systems is that current search
engines cannot use the structure in Web documents. To address this problem, new web
querying systems should provide the function of querying the hypertext structure. To
implement this function, there should be a relation in which the relationship between two
nodes is represented. In our data model, we capture the information present in a hyperlink
as a tuple in a links relation:
links (url_a, url_b, description)
where url_a and url_b are the URLs of the origin and destination of the link, i.e., url_a and
url_b correspond to nodes A and B with the relationship A → B. All these
attributes are character strings. The primary key for this relation is the combination of
url_a and url_b. Only description may be null.
Based on the labeled directed graph model, the three relations introduced above
model the Web as a relational database. They capture both the information and the
hypertext structure presented in web pages. This relational abstraction of the Web allows
us to use a database query language to pose queries on both content and structure.
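The three relations can be illustrated with a small, self-contained sketch. It is an assumption-laden stand-in rather than the prototype itself: it uses Python with SQLite in place of DB2, a TEXT column in place of the DB2 CLOB type, and hypothetical example.org pages invented for this example. The column names follow the relations defined above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE webpage (
    url           TEXT NOT NULL PRIMARY KEY,
    title         TEXT,
    author        TEXT,
    linkno        INTEGER,
    last_modified TEXT,
    size          INTEGER
);
CREATE TABLE webpage_d (
    url     TEXT NOT NULL PRIMARY KEY,
    content TEXT              -- stands in for the DB2 CLOB column
);
CREATE TABLE links (
    url_a       TEXT NOT NULL,
    url_b       TEXT NOT NULL,
    description TEXT,         -- the only nullable attribute of links
    PRIMARY KEY (url_a, url_b)
);
""")

# Hypothetical pages, invented for this sketch.
home = "http://example.org/index.html"
about = "http://example.org/about.html"
html_home = ('<html><head><title>Home</title></head><body>'
             '<a href="about.html">About the Department</a></body></html>')
html_about = ('<html><head><title>About the Department</title></head>'
              '<body>Research on database systems.</body></html>')

for url, title, linkno, html in [(home, "Home", 1, html_home),
                                 (about, "About the Department", 0, html_about)]:
    conn.execute("INSERT INTO webpage VALUES (?, ?, NULL, ?, NULL, ?)",
                 (url, title, linkno, len(html)))
    conn.execute("INSERT INTO webpage_d VALUES (?, ?)", (url, html))
conn.execute("INSERT INTO links VALUES (?, ?, ?)",
             (home, about, "About the Department"))

# Content query: search webpage_d, build the result from webpage.
content_hits = [row[0] for row in conn.execute(
    "SELECT w.title FROM webpage w JOIN webpage_d d ON d.url = w.url "
    "WHERE d.content LIKE ? ORDER BY w.title", ("%database%",))]

# Structure query: pages that the home page links to.
linked = conn.execute(
    "SELECT url_b, description FROM links WHERE url_a = ?", (home,)).fetchall()
```

The content query shows the division of labour described above: webpage_d is searched, while the tuple in webpage supplies the attributes returned to the user.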
4.3 System Overview
Conceptually, our prototype web querying system has the following components:
Interfaces, that accept the queries, present the results and guide the users to other
functions provided by the system;
A parser, that extracts the information from the HTML files and creates tuples in
the database for each web page in the database;
Query facilities, that invoke the appropriate search processes to provide content
query, structure query and advanced query;
Supplementary functions, that provide facilities for the users to maintain the web
pages stored in the database.
Figure 4.3: The System Architecture (components shown: User, Interface, database, local disk, the World Wide Web)
Users interact with this prototype system via an interface, in which they can choose to
pose a query or do other operations, such as adding a web page to the database, deleting a
web page from the database, or displaying the information of a locally stored HTML file.
When users decide to pose a query, there are three kinds of queries with three different
interfaces provided to the users, namely, content query, structure query and advanced
query. After users specify a query, a corresponding query process is invoked and a query
is performed on the corresponding relations. Finally, the query results are displayed to the
users. Currently, our test web pages are locally stored and parsed by a parser so that
information from a web page can be stored in the three relations introduced in the
previous section.
4.4 Mapping HTML Files to the Database
The key component of mapping an HTML file to a database is the parser. It
provides a means of extracting the information of interest from HTML files and storing
it in a database, i.e., in the three relations mentioned above. In this section we
introduce in detail how the parser maps HTML files to the database.
4.4.1 Extracting Information from HTML Files
In our prototype, the World Wide Web is modeled as a labeled directed graph, which
can be represented by three relations in a relational database. These three relations are:
webpage (url, title, author, linkno, last_modified, size); webpage_d (url, content); links
(url_a, url_b, description). Hence, the information we need to extract from an HTML file
is related to the attributes in these three relations. Assume we are parsing a web page
whose corresponding HTML file is called sample.html. What we would like to extract
from sample.html are the URL of this web page, the title, the author, the number of hyperlinks
contained in this file, the last modified date, the size of the file, all the hyperlinks that can be found
in this file, and their corresponding descriptions. The algorithms used in extracting the
information from an HTML file are described in the next several sections.
4.4.1.1 Extracting title Information
The title element is common to all HTML files. The HTML DTD (Document Type
Definition) specifies that a <title> container be included in an HTML file and that there should
be only one <title> container in any file [4], although there may be exceptions in some
HTML files. Generally, the title should identify the contents of the document in a global
context and the title text should be included between <title> and </title> without any other
markup, such as anchors, paragraph tags or highlighting. The syntax introduced above
makes it easy for the parser to extract the title information from an HTML file. The
algorithm is as simple as looking for the string starting with <title> and ending with
</title> and then extracting the string between <title> and </title>.
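The title extraction just described can be sketched in a few lines. This is a minimal Python illustration, not the prototype's parser; the function name extract_title is an assumption, and the sketch adds case-insensitive matching, which the text above does not state.

```python
import re
from typing import Optional

# Matches the text between <title> and </title>, ignoring tag case and
# allowing the title text to span several lines.
_TITLE_RE = re.compile(r"<title>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def extract_title(html: str) -> Optional[str]:
    """Return the text of the first <title> container, or None if absent."""
    match = _TITLE_RE.search(html)
    if match is None:
        return None          # some HTML files have no title container
    return match.group(1).strip()


sample = "<html><head><TITLE> Course Description </TITLE></head></html>"
title = extract_title(sample)
```

Returning None for a missing title fits the webpage relation, where every attribute other than the primary key may be null.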
4.4.1.2 Extracting Hyperlinks and Description
HTML supports hyperlinks through the anchor tag in the form of
<a href="DestURL"> anchor object </a>
The anchor object within an <a> container consists of text or another type of object, e.g.,
an image. This anchor object, when defined within a web page, defines a hypertext
relationship to another web page. Both the start and end tags of the <a> container must be
specified. It is the obligation of a browser to display an anchor object in a distinctive
manner so that its role is obvious to a reader. Based on this syntax, extracting a hyperlink
from an HTML file has the following steps:
step 1: look for the string starting with "<a ";
Note: Since the definition of a hyperlink starts with "<a" and there may be some
other markup between "<a" and "href", we cannot simply look for "<a href" to
find a hyperlink.
step 2: if "<a " is found, keep on looking for the string starting with "href="";
step 3: if "href="" is found, extract the characters that follow "href="";
step 4: stop when encountering the closing quotation mark (").
By these four steps, we can obtain the DestURL, which is the URL of another web
page. Now, we should continue extracting the anchor object that is the description of the
DestURL.
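The four steps for obtaining the DestURL can be sketched as below. This is an illustrative Python sketch under stated assumptions, not the prototype's parser: the function name extract_dest_urls is hypothetical, the search for "href" is limited to the current tag, and the markup is assumed to be ASCII so that lowercasing does not shift character offsets. Extraction of the anchor object is not covered.

```python
from typing import List


def extract_dest_urls(html: str) -> List[str]:
    """Collect the href values of all anchor tags, following steps 1 to 4."""
    urls: List[str] = []
    lower = html.lower()  # tag and attribute names may be in either case
    pos = 0
    while True:
        # Step 1: look for the start of an anchor tag.
        tag_start = lower.find("<a ", pos)
        if tag_start == -1:
            break
        tag_end = lower.find(">", tag_start)
        if tag_end == -1:
            break
        # Step 2: other attributes may precede href, so search the whole tag.
        href = lower.find('href="', tag_start, tag_end)
        if href != -1:
            # Step 3: the URL starts right after href=" ...
            value_start = href + len('href="')
            # Step 4: ... and stops at the closing quotation mark.
            value_end = html.find('"', value_start, tag_end)
            if value_end != -1:
                urls.append(html[value_start:value_end])
        pos = tag_end + 1
    return urls


page = ('<p><A class="x" HREF="about.html">About the Department</A> and '
        '<a name="top">no link</a> <a href="http://example.org/">Home</a></p>')
dest_urls = extract_dest_urls(page)
```

The second anchor in the sample has no href attribute, so it is skipped, which is the case the note under step 1 is guarding against.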
The anchor object may be a string or something else, e.g., an image. For example, if
the anchor object is an image, it can be defined as follows: