A Relation-Based Page Rank Algorithm for Semantic Web Search Engines

Fabrizio Lamberti, Member, IEEE, Andrea Sanna, and Claudio Demartini, Member, IEEE

Abstract: With the tremendous growth of information available to end users through the Web, search engines come to play an ever more critical role. Nevertheless, because of their general-purpose approach, it is increasingly common for result sets to be burdened with useless pages. The next-generation Web architecture, represented by the Semantic Web, provides a layered architecture that may allow this limitation to be overcome. Several search engines have been proposed that increase information retrieval accuracy by exploiting a key content of Semantic Web resources, that is, relations. However, in order to rank results, most of the existing solutions need to work on the whole annotated knowledge base. In this paper, we propose a relation-based page rank algorithm to be used in conjunction with Semantic Web search engines that simply relies on information that could be extracted from user queries and on annotated resources. Relevance is measured as the probability that a retrieved resource actually contains those relations whose existence was assumed by the user at the time of query definition.

Index Terms: Semantic Web, knowledge retrieval, search process, query formulation.

1 INTRODUCTION

In recent years, with the massive growth of the Web, we have witnessed an explosion of information accessible to Internet users. Nevertheless, at the same time, it has become ever more critical for end users to explore this huge repository and find needed resources by simply following the hyperlink network as foreseen by Berners-Lee and Fischetti in 1999 [4]. Today, search engines constitute the most helpful tools for organizing information and extracting knowledge from the Web [9].
However, it is not uncommon that even the most renowned search engines return result sets including many pages that are definitely useless for the user [18]. This is mainly due to the fact that the very basic relevance criteria underlying their information retrieval strategies rely on the presence of query keywords within the returned pages. It is worth observing that statistical algorithms are applied to “tune” the result and, more importantly, approaches based on the concept of relevance feedback are used in order to maximize the satisfaction of user’s needs. Nevertheless, in some cases, this does not suffice.

In order to show this odd effect, let us see what happens when a user enters a query composed of the following keywords “hotel,” “Rome,” and “historical center” (or “hotel,” “Roma,” and “centro storico”) in the Italian version of the well-known Google search engine. He or she would probably not be astonished to find that the result set actually includes several hotels located in the historical center of Rome, as expected. Another hotel located in a small town at some distance from the Rome city center is also included. However, two hotels located in the historical center of other main Italian cities are also displayed. Finally, three hotels named Roma are included among the 10 most relevant results even if they have nothing to do with the selected city. Only 4 out of the 10 results presented to the user satisfy user needs (even if they seem to satisfy the user query, based on the strategy adopted to process it).

There is no doubt that the user would be able to easily decide which results are really of interest by looking, for example, at the two-line excerpt of the Web page presented in the displayed list or by quickly examining each page. Anyway, the presence of unwanted pages in the result set would force him or her to perform a postprocessing on retrieved information to discard unneeded ones.
Even though several automatic techniques have been recently proposed [32], result refinement remains a time-wasting and click-expensive process, which is even more critical when the result set has to be processed by automatic software agents.

Let us try to analyze in more detail the reason why “out-of-scope” pages are inserted in the result set. When the user entered the query “hotel,” “Rome,” and “historical center,” he or she was assuming the existence of some relations among those terms, such as “hotel” located in the “historical center” of “Rome.” However, when the query was sent to the search engine logic, these hidden details were lost. The search logic usually tries to recover this information by exploiting many text-matching techniques (such as the number of occurrences and distance among terms). Nevertheless, traditional search engines do not have the necessary infrastructure for exploiting relation-based information that belongs to the semantic annotations for a Web page.

The Semantic Web [5] will offer the way for solving this problem at the architecture level. In fact, in the Semantic Web, each page possesses semantic metadata that record additional details concerning the Web page itself. Annotations are based on classes of concepts and relations among them. The “vocabulary” for the annotation is usually expressed by means of an ontology that

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 21, NO. 1, JANUARY 2009

The authors are with the Dipartimento di Automatica ed Informatica, Politecnico di Torino, C.so Duca degli Abruzzi, 24, 10129 Torino, Italy. E-mail: {lamberti, sanna, demartini}@polito.it.
Manuscript received 9 Aug. 2007; revised 26 Feb. 2008; accepted 12 May 2008; published online 2 June 2008.
For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TKDE-2007-08-0412.
Digital Object Identifier no. 10.1109/TKDE.2008.113.

a formal point of view) the steps followed by a user during the process of query definition. Let us imagine that a user is interested in pages containing three generic keywords $k_1$, $k_2$, and $k_3$ (associated with as many generic concepts $c_1$, $c_2$, and $c_3$). The user begins query definition by specifying a pair including a keyword and its related concept. Let us assume that he or she starts with $k_1$ and $c_1$. It is reasonable to assume that after specifying keyword $k_1$, the user inserts a second keyword (for example, $k_2$, together with concept $c_2$) expecting either to find pages where $k_1$ and $k_2$ (that is, $c_1$ and $c_2$) are related in some way or to find pages where $k_1$ is linked to some other keywords/concepts that will be specified later. In a similar way, when he or she specifies $k_3$ and $c_3$, he or she would be expecting to further adjust the result set in order to find pages showing also relations between $k_3$ and $k_1$ (not $k_2$ since in the ontology, there is no relation linking $c_3$ with $c_2$). Let us consider a very trivial example assuming that there exist only two pages $p_1$ and $p_2$ containing all the keywords (and associated concepts) specified by the user. This represents the (initial) result set for the given query. We want to rank those pages in order to present to the user first the page that best fits his or her query. The semantic annotations and page subgraphs for these pages are illustrated in Figs. 4c, 4d, 4e, and 4f. In the first page, both $c_2$ and $c_3$ are linked to $c_1$ through a single relation (Fig. 4c), while in the second page there exist two relations linking $c_3$ to $c_1$. However, $c_2$ is not linked in any way to $c_1$ (Fig. 4f). Since we cannot assume which concepts or relations are more important with respect to the user query, we can provide a significant measure of page relevance by computing the probability that a page is the one of interest to the user (that is, its relevance) by calculating the probability that $c_2$ is linked to $c_1$ and $c_3$ is linked to $c_1$ through the relations in the user’s mind (either $r^1_{12}$ or $r^2_{12}$ and $r^1_{13}$ or $r^2_{13}$, respectively). Let us compute $P(\bar{r}_{ij}, Q, p)$, which is the probability of finding in a particular page $p$ a relation $\bar{r}_{ij}$ between concepts $i$ and $j$ that could be the one of interest to the user (because of query $Q$). According to probability theory, this can be defined as $P(\bar{r}_{ij}, p) = \delta_{ij}/\eta_{ij} = \tau_{ij}$ (note that it does not depend on $Q$). We call it the relation probability. Thus, for the first page, we have $P(\bar{r}_{12}, p_1) = \delta_{12}/\eta_{12} = \tau_{12} = 1/2$ and $P(\bar{r}_{13}, p_1) = \delta_{13}/\eta_{13} = \tau_{13} = 1/2$. For the second page, we have $P(\bar{r}_{12}, p_2) = \delta_{12}/\eta_{12} = \tau_{12} = 0$ and $P(\bar{r}_{13}, p_2) = \delta_{13}/\eta_{13} = \tau_{13} = 1$. Based on the considerations above, we can compute the joint probability $P(Q, p) = P((\bar{r}_{12}, p) \cap (\bar{r}_{13}, p))$. The dependency on $Q$ is due to the fact that only concepts given in $Q$ are taken into account. Since the events $(\bar{r}_{12}, p)$ and $(\bar{r}_{13}, p)$ are not correlated, $P(Q, p)$ can be rewritten as $P(Q, p) = P(\bar{r}_{12}, p) \cdot P(\bar{r}_{13}, p)$. Thus, for the specific example being considered, it is $P(Q, p_1) = 1/4$ and $P(Q, p_2) = 0$, respectively, for the first and second page. This allows placing the first page before the second one in the ordered result set. However, to preserve the behavior of common search strategies, a way for assigning a score different from zero to pages in which there exist concepts not related to other concepts will have to be identified.

Another critical situation is illustrated in Fig. 5. In this case, the user specifies a query composed of concepts $c_1$, $c_2$, and $c_3$ over a novel ontology. Based on the considerations above, a measure of page relevance can be computed by estimating, for each concept, the probability of having a relation between that concept and another concept and that such relation is exactly the one in the user’s mind. However, it can be demonstrated that this probability can be expressed also in different terms, capable of taking into account situations in which a particular concept can be related to more than one concept (that is, the case of the specific example being considered, as well as of common situations in any concrete search scenario). Specifically, the probability that each concept is related to other concepts is given by the probability of having $c_1$ linked to $c_2$ and $c_2$ linked to $c_3$, or $c_1$ linked to $c_2$ and $c_1$ linked to $c_3$, or $c_2$ linked to $c_3$ and $c_1$ linked to $c_3$. The situations above can be modeled again by using graph theory. In fact, having each concept related to at least another concept in the query is equivalent to considering all the possible spanning forests (a collection of spanning trees, one for each connected component in the graph) for page subgraph $G_{Q,p}$ given the query $Q$. In Fig. 6, all the possible spanning forests (trees, in this case) of the page subgraph in Fig. 5d are shown. We call $SF^f_{Q,p}$ the $f$th page spanning forest computed over $G_{Q,p}$. We define $P(SF^f_{Q,p})$ as the probability that $SF^f_{Q,p}$ is the spanning forest of interest to the user. By simplifying the notation and
replacing $\bar{r}_{ij}, p$ with $\bar{r}^p_{ij}$, the probability for page $p$ can be computed as

$$
P(Q, p) = P\Big( \big( (\bar{r}^p_{12} \cap \bar{r}^p_{23}) \cap SF^1_{Q,p} \big) \cup \big( (\bar{r}^p_{12} \cap \bar{r}^p_{13}) \cap SF^2_{Q,p} \big) \cup \big( (\bar{r}^p_{23} \cap \bar{r}^p_{13}) \cap SF^3_{Q,p} \big) \Big). \tag{1}
$$

Fig. 5. (a) An ontology graph. (b) Query subgraph. (c) An example of an annotated page. (d) Page subgraph built upon the given ontology/query.

Since the events are not correlated, it is also

$$
\begin{aligned}
P(Q, p) &= P(\bar{r}^p_{12} \cap \bar{r}^p_{23}) \cdot P(SF^1_{Q,p}) + P(\bar{r}^p_{12} \cap \bar{r}^p_{13}) \cdot P(SF^2_{Q,p}) + P(\bar{r}^p_{23} \cap \bar{r}^p_{13}) \cdot P(SF^3_{Q,p}) \\
&= P(\bar{r}^p_{12}) \cdot P(\bar{r}^p_{23}) \cdot P(SF^1_{Q,p}) + P(\bar{r}^p_{12}) \cdot P(\bar{r}^p_{13}) \cdot P(SF^2_{Q,p}) + P(\bar{r}^p_{23}) \cdot P(\bar{r}^p_{13}) \cdot P(SF^3_{Q,p}),
\end{aligned}
\tag{2}
$$

where $P(\bar{r}_{ij}, p)$ can be replaced with $\tau_{ij} = \delta_{ij}/\eta_{ij}$.

Since the probability for a single page spanning forest to be the one of interest to the user is the same with respect to the remaining ones, if we define $\sigma_{Q,p}$ as the number of spanning forests for $G_{Q,p}$, we have $P(SF^1_{Q,p}) = P(SF^2_{Q,p}) = P(SF^3_{Q,p}) = 1/\sigma_{Q,p}$. Thus, the expression for $P(Q, p)$ can be rewritten again as

$$
P(Q, p) = \frac{P(\bar{r}^p_{12}) \cdot P(\bar{r}^p_{23}) + P(\bar{r}^p_{12}) \cdot P(\bar{r}^p_{13}) + P(\bar{r}^p_{23}) \cdot P(\bar{r}^p_{13})}{\sigma_{Q,p}}, \tag{3}
$$

and according to the definition of relation probability, it is

$$
P(Q, p) = \left[ \tau_{12} \cdot \tau_{23} + \tau_{12} \cdot \tau_{13} + \tau_{23} \cdot \tau_{13} \right] / \sigma_{Q,p}. \tag{4}
$$
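The averaging over spanning trees that leads to (4) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the dictionary layout, and the three relation probabilities are hypothetical (the text does not give the values for Fig. 5), and a page whose subgraph does not connect all query concepts is simply given a score of zero.

```python
from fractions import Fraction
from itertools import combinations


def spanning_trees(nodes, edges):
    """Yield every subset of `edges` that forms a spanning tree over `nodes`."""
    nodes = list(nodes)
    # A spanning tree on n nodes has exactly n - 1 edges and no cycle.
    for subset in combinations(edges, len(nodes) - 1):
        parent = {n: n for n in nodes}  # union-find structure

        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x

        acyclic = True
        for a, b in subset:
            root_a, root_b = find(a), find(b)
            if root_a == root_b:  # this edge would close a cycle
                acyclic = False
                break
            parent[root_a] = root_b
        if acyclic:
            yield subset


def relevance(nodes, tau):
    """Average, over all spanning trees, of the product of relation probabilities."""
    # Only concept pairs with a nonzero relation probability are edges of the page subgraph.
    edges = [edge for edge, value in tau.items() if value > 0]
    trees = list(spanning_trees(nodes, edges))
    if not trees:
        # Assumption: no tree connects every query concept, so the score is zero.
        return Fraction(0)
    total = Fraction(0)
    for tree in trees:
        product = Fraction(1)
        for edge in tree:
            product *= tau[edge]
        total += product
    return total / len(trees)


# Hypothetical relation probabilities for a three-concept query (not taken from the paper).
tau_example = {
    (1, 2): Fraction(1, 2),
    (2, 3): Fraction(1),
    (1, 3): Fraction(1, 3),
}
score = relevance([1, 2, 3], tau_example)
print(score)  # prints 1/3
```

With three mutually linked concepts the enumeration finds three spanning trees, so the result coincides with the closed form (4): (1/2 · 1 + 1/2 · 1/3 + 1 · 1/3) / 3 = 1/3.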

Given the ontology and the query selected for the considered example, (4) can be used to compute a relevance score for each page in the result set and to provide a ranking within the result set itself. As expected, (4) works well also for the example in Fig. 4, where $\sigma_{Q,p} = 1$ (since the page subgraph already constitutes the only spanning forest). Nevertheless, $P(Q, p)$ can still assume a value equal to zero for all those pages in which there exist concepts that do not show any relation with other concepts but are still present, as keywords, in the annotated page. In the following, we will analyze this issue in detail, and we will show how to extend the methodology above in order to come to a general rule for ranking all the pages in the (initial) result set.
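The Fig. 4 example can be checked numerically with a short Python sketch. It assumes, as the worked values 1/2, 1/2, 0, and 1 suggest, that the relation probability is the number of relations linking two concepts in the page divided by the number of relations linking them in the ontology. The names `ETA`, `DELTA`, `relation_probability`, and `page_score` are illustrative and do not come from the paper.

```python
from fractions import Fraction

# Number of ontology relations per concept pair, as implied by the text:
# two between c1 and c2, and two between c1 and c3.
ETA = {(1, 2): 2, (1, 3): 2}

# Number of those relations actually found in each page subgraph (Fig. 4 example).
DELTA = {
    "p1": {(1, 2): 1, (1, 3): 1},  # c2 and c3 each linked to c1 by a single relation
    "p2": {(1, 2): 0, (1, 3): 2},  # c2 not linked to c1, two relations between c3 and c1
}


def relation_probability(delta, eta):
    """Relation probability for one concept pair (assumed to be delta / eta)."""
    return Fraction(delta, eta)


def page_score(page):
    """Product of the relation probabilities over the concept pairs of the query."""
    score = Fraction(1)
    for pair, eta in ETA.items():
        score *= relation_probability(DELTA[page][pair], eta)
    return score


# Rank the pages by decreasing score.
ranking = sorted(DELTA, key=page_score, reverse=True)
print(page_score("p1"), page_score("p2"), ranking)  # prints 1/4 0 ['p1', 'p2']
```

The second page scores zero only because one concept is unlinked, which is the situation that the extension announced in the text is meant to handle.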

We consider again an example represented by two pages (depicted in Fig. 7 and based on the same ontology as in Fig. 5a), where concept $c_4$ (in the first page) and concept $c_2$ (in the second page) do not show any relations with the remaining concepts. If we compute $P(Q, p_1)$ and $P(Q, p_2)$ using (4) (which is still valid since the page annotation refers to the same ontology), we get a relevance score equal to zero. Based on the definition of relevance score provided above, in order to find a score different from zero allowing each page to be ranked with respect to other pages, we have to relax the condition of having each concept related to each other concept. Since by definition, in a spanning forest, there are no cycles, removing one edge means removing a link between a pair of concepts. That is, edges from all the page spanning forests have to be progressively removed, thus obtaining constrained page spanning forests composed of a decreasing number of edges (and, equivalently, of connected concepts). We maintain the term “spanning” in order to recall that each constrained page spanning forest originates from a true spanning forest in which for all the connected components of the graph, all the


Fig. 6. All the possible spanning forests (trees) that could be obtained from $G_{Q,p}$ in Fig. 5d.

Fig. 7. (a) An annotated page $p_1$ where concept $c_4$ is not linked to any other concepts. (b) Page subgraph for a query $Q$ specifying $c_1$, $c_2$, $c_3$, and $c_4$. (c) Annotation of a second page $p_2$, where $c_2$ is not linked to any other concepts. (d) Page subgraph for the same query.
