A Relation-Based Page Rank Algorithm for Semantic Web Search Engines


a formal point of view) the steps followed by a user during the process of query definition. Let us imagine that a user is interested in pages containing three generic keywords $k_1$, $k_2$, and $k_3$ (associated with as many generic concepts $c_1$, $c_2$, and $c_3$). The user begins query definition by specifying a pair consisting of a keyword and its related concept. Let us assume that he or she starts with $k_1$ and $c_1$. It is reasonable to assume that after specifying keyword $k_1$, the user inserts a second keyword (for example, $k_2$, together with concept $c_2$), expecting either to find pages where $k_1$ and $k_2$ (that is, $c_1$ and $c_2$) are related in some way or to find pages where $k_1$ is linked to some other keywords/concepts that will be specified later. In a similar way, when he or she specifies $k_3$ and $c_3$, he or she would expect to further refine the result set in order to find pages that also show relations between $k_3$ and $k_1$ (not $k_2$, since in the ontology there is no relation linking $c_3$ with $c_2$).

Let us consider a very simple example, assuming that there exist only two pages $p_1$ and $p_2$ containing all the keywords (and associated concepts) specified by the user. These pages represent the (initial) result set for the given query. We want to rank them so that the page that best fits the query is presented to the user first. The semantic annotations and page subgraphs for these pages are illustrated in Figs. 4c, 4d, 4e, and 4f. In the first page, both $c_2$ and $c_3$ are linked to $c_1$ through a single relation (Fig. 4c), while in the second page there exist two relations linking $c_3$ to $c_1$, but $c_2$ is not linked in any way to $c_1$ (Fig. 4f).

Since we cannot assume which concepts or relations are more important with respect to the user query, we can provide a significant measure of page relevance by computing the probability that a page is the one of interest to the user (that is, its relevance). This amounts to calculating the probability that $c_2$ is linked to $c_1$ and $c_3$ is linked to $c_1$ through the relations in the user's mind (either $r^1_{12}$ or $r^2_{12}$, and either $r^1_{13}$ or $r^2_{13}$, respectively). Let us compute $P(\bar{r}_{ij}, Q, p)$, which is the probability of finding in a particular page $p$ a relation $\bar{r}_{ij}$ between concepts $i$ and $j$ that could be the one of interest to the user (because of query $Q$). According to probability theory, this can be defined as $P(\bar{r}_{ij}, p) = \delta_{ij}/\eta_{ij} = \tau_{ij}$ (note that it does not depend on $Q$). We call it the relation probability. Thus, for the first page, we have $P(\bar{r}_{12}, p_1) = \delta_{12}/\eta_{12} = \tau_{12} = 1/2$ and $P(\bar{r}_{13}, p_1) = \delta_{13}/\eta_{13} = \tau_{13} = 1/2$. For the second page, we have $P(\bar{r}_{12}, p_2) = \delta_{12}/\eta_{12} = \tau_{12} = 0$ and $P(\bar{r}_{13}, p_2) = \delta_{13}/\eta_{13} = \tau_{13} = 1$. Based on the considerations above, we can compute the joint probability $P(Q, p) = P((\bar{r}_{12}, p) \cap (\bar{r}_{13}, p))$. The dependency on $Q$ is due to the fact that only concepts given in $Q$ are taken into account.
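The relation probabilities quoted above can be reproduced with a short sketch. It assumes that the relation probability is the number of relations linking two concepts in the page annotation divided by the number of relations linking them in the ontology, which is consistent with the values given for the two pages. The function name `relation_probability` and the dictionaries are illustrative only, and the counts are read off the textual description of the example (two ontology relations per concept pair).

```python
from fractions import Fraction


def relation_probability(relations_in_page, relations_in_ontology):
    """Ratio of page relations to ontology relations for one concept pair.

    Assumption: this ratio is the relation probability used in the text.
    """
    if relations_in_ontology <= 0:
        raise ValueError("the ontology must define at least one relation")
    return Fraction(relations_in_page, relations_in_ontology)


# Ontology: two relations between c1 and c2, two between c1 and c3.
ontology_counts = {(1, 2): 2, (1, 3): 2}

# Page annotations as described for the two-page example.
page_counts = {
    "p1": {(1, 2): 1, (1, 3): 1},  # one relation for each pair
    "p2": {(1, 2): 0, (1, 3): 2},  # c2 isolated, two relations c3-c1
}

tau = {
    page: {
        pair: relation_probability(count, ontology_counts[pair])
        for pair, count in counts.items()
    }
    for page, counts in page_counts.items()
}

for page, values in tau.items():
    print(page, {pair: str(prob) for pair, prob in values.items()})
```

Exact fractions are used instead of floats so that the results can be compared directly with the values in the text.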

Since the events $(\bar{r}_{12}, p)$ and $(\bar{r}_{13}, p)$ are not correlated, $P(Q, p)$ can be rewritten as $P(Q, p) = P(\bar{r}_{12}, p) \cdot P(\bar{r}_{13}, p)$. Thus, for the specific example being considered, $P(Q, p_1) = 1/4$ and $P(Q, p_2) = 0$ for the first and second page, respectively. This allows placing the first page before the second one in the ordered result set. However, to preserve the behavior of common search strategies, a way of assigning a nonzero score to pages in which there exist concepts not related to other concepts will have to be identified.

Another critical situation is illustrated in Fig. 5. In this case, the user specifies a query composed of concepts $c_1$, $c_2$, and $c_3$ over a novel ontology. Based on the considerations above, a measure of page relevance can be computed by estimating, for each concept, the probability of having a relation between that concept and another concept and that such relation is exactly the one in the user's mind. However, it can be demonstrated that this probability can also be expressed in different terms, capable of taking into account situations in which a particular concept can be related to more than one concept (as in the specific example being considered, and as commonly happens in any concrete search scenario). Specifically, the probability that each concept is related to other concepts is given by the probability of having $c_1$ linked to $c_2$ and $c_2$ linked to $c_3$, or $c_1$ linked to $c_2$ and $c_1$ linked to $c_3$, or $c_2$ linked to $c_3$ and $c_1$ linked to $c_3$. The situations above can again be modeled by using graph theory. In fact, having each concept related to at least one other concept in the query is equivalent to considering all the possible spanning forests (a collection of spanning trees, one for each connected component in the graph) for page subgraph $G_{Q,p}$ given the query $Q$. In Fig. 6, all the possible spanning forests (trees, in this case) of the page subgraph in Fig. 5d are shown. We call $SF^f_{Q,p}$ the $f$th page spanning forest computed over $G_{Q,p}$. We define $P(SF^f_{Q,p})$ as the probability that $SF^f_{Q,p}$ is the spanning forest of interest to the user. By simplifying the notation and

LAMBERTI ET AL.: A RELATION-BASED PAGE RANK ALGORITHM FOR SEMANTIC WEB SEARCH ENGINES 129

Fig. 5. (a) An ontology graph. (b) Query subgraph. (c) An example of an annotated page. (d) Page subgraph built upon the given ontology/query.

Authorized licensed use limited to: UNIVERSITY OF PLYMOUTH. Downloaded on October 7, 2009 at 02:00 from IEEE Xplore. Restrictions apply.

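replacing $\bar{r}_{ij}, p$ with $\bar{r}^p_{ij}$, the probability for page $p$ can be computed as

$$P(Q, p) = P\Big( \big( (\bar{r}^p_{12} \cap \bar{r}^p_{23}) \cap SF^1_{Q,p} \big) \cup \big( (\bar{r}^p_{12} \cap \bar{r}^p_{13}) \cap SF^2_{Q,p} \big) \cup \big( (\bar{r}^p_{23} \cap \bar{r}^p_{13}) \cap SF^3_{Q,p} \big) \Big). \tag{1}$$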

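Since the events are not correlated, it is also

$$\begin{aligned}
P(Q, p) &= P(\bar{r}^p_{12} \cap \bar{r}^p_{23}) \cdot P(SF^1_{Q,p}) + P(\bar{r}^p_{12} \cap \bar{r}^p_{13}) \cdot P(SF^2_{Q,p}) + P(\bar{r}^p_{23} \cap \bar{r}^p_{13}) \cdot P(SF^3_{Q,p}) \\
&= P(\bar{r}^p_{12}) \cdot P(\bar{r}^p_{23}) \cdot P(SF^1_{Q,p}) + P(\bar{r}^p_{12}) \cdot P(\bar{r}^p_{13}) \cdot P(SF^2_{Q,p}) + P(\bar{r}^p_{23}) \cdot P(\bar{r}^p_{13}) \cdot P(SF^3_{Q,p}),
\end{aligned} \tag{2}$$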

where $P(\bar{r}_{ij}, p)$ can be replaced with $\tau_{ij} = \delta_{ij}/\eta_{ij}$. Since each page spanning forest has the same probability of being the one of interest to the user, if we define $\sigma_{Q,p}$ as the number of spanning forests for $G_{Q,p}$, we have $P(SF^1_{Q,p}) = P(SF^2_{Q,p}) = P(SF^3_{Q,p}) = 1/\sigma_{Q,p}$. Thus, the expression for $P(Q, p)$ can be rewritten again as

$$P(Q, p) = \frac{P(\bar{r}^p_{12}) \cdot P(\bar{r}^p_{23}) + P(\bar{r}^p_{12}) \cdot P(\bar{r}^p_{13}) + P(\bar{r}^p_{23}) \cdot P(\bar{r}^p_{13})}{\sigma_{Q,p}}, \tag{3}$$

and according to the definition of relation probability, it is

$$P(Q, p) = \frac{\tau_{12} \cdot \tau_{23} + \tau_{12} \cdot \tau_{13} + \tau_{23} \cdot \tau_{13}}{\sigma_{Q,p}}. \tag{4}$$

Given the ontology and the query selected for the considered example, (4) can be used to compute a relevance score for each page in the result set and to provide a ranking within the result set itself. As expected, (4) also works well for the example in Fig. 4, where $\sigma_{Q,p} = 1$ (since the page subgraph already constitutes the only spanning forest). Nevertheless, $P(Q, p)$ can still assume a value of zero for all those pages in which there exist concepts that do not show any relation with other concepts but are still present, as keywords, in the annotated page. In the following, we will analyze this issue in detail, and we will show how to extend the methodology above in order to arrive at a general rule for ranking all the pages in the (initial) result set.
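The computation in (4) can be sketched as follows for the simple case where the concepts of the query are expected to form a single connected page subgraph, so that every spanning forest is a spanning tree. This is only an illustration under that assumption: the names `is_spanning_tree` and `page_relevance` are hypothetical, a relation probability of zero is taken to mean that the edge is absent from the page subgraph, and the triangle values in the last call are invented for demonstration and are not taken from Fig. 5.

```python
from fractions import Fraction
from itertools import combinations


def is_spanning_tree(nodes, edges):
    """Return True if `edges` (len(nodes) - 1 of them) contain no cycle."""
    parent = {n: n for n in nodes}

    def find(n):
        # Union-find lookup with path halving.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            return False  # this edge would close a cycle
        parent[root_a] = root_b
    return True  # n - 1 acyclic edges on n nodes connect all of them


def page_relevance(concepts, tau):
    """Average, over all spanning trees, of the product of edge probabilities."""
    edges = [edge for edge, prob in tau.items() if prob > 0]
    trees = [
        tree
        for tree in combinations(edges, len(concepts) - 1)
        if is_spanning_tree(concepts, tree)
    ]
    if not trees:
        return Fraction(0)  # some concept cannot be reached: score is zero
    total = Fraction(0)
    for tree in trees:
        product = Fraction(1)
        for edge in tree:
            product *= tau[edge]
        total += product
    return total / len(trees)


half = Fraction(1, 2)
# Two-page example from the text (values from the relation probabilities).
print(page_relevance([1, 2, 3], {(1, 2): half, (1, 3): half}))        # 1/4
print(page_relevance([1, 2, 3], {(1, 2): Fraction(0), (1, 3): Fraction(1)}))  # 0
# Hypothetical triangle with three spanning trees.
print(page_relevance([1, 2, 3], {(1, 2): half, (2, 3): Fraction(1), (1, 3): half}))
```

Enumerating edge subsets is exponential in the number of edges, which is acceptable for queries with a handful of concepts but would need a dedicated spanning-tree enumeration for larger page subgraphs.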

We consider again an example represented by two pages (depicted in Fig. 7 and based on the same ontology as in Fig. 5a), where concept $c_4$ (in the first page) and concept $c_2$ (in the second page) do not show any relations with the remaining concepts. If we compute $P(Q, p_1)$ and $P(Q, p_2)$ using (4) (which is still valid since the page annotation refers to the same ontology), we get a relevance score equal to zero. Based on the definition of relevance score provided above, in order to find a nonzero score allowing each page to be ranked with respect to other pages, we have to relax the condition of having each concept related to each other concept. Since by definition a spanning forest contains no cycles, removing one edge means removing a link between a pair of concepts. That is, edges from all the page spanning forests have to be progressively removed, thus obtaining constrained page spanning forests composed of a decreasing number of edges (and, equivalently, of connected concepts). We maintain the term “spanning” in order to recall that each constrained page spanning forest originates from a true spanning forest in which for all the connected components of the graph, all the

130 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 21, NO. 1, JANUARY 2009

Fig. 6. All the possible spanning forests (trees) that could be obtained from $G_{Q,p}$ in Fig. 5d.

Fig. 7. (a) An annotated page $p_1$ where concept $c_4$ is not linked to any other concepts. (b) Page subgraph for a query $Q$ specifying $c_1$, $c_2$, $c_3$, and $c_4$. (c) Annotation of a second page $p_2$, where $c_2$ is not linked to any other concepts. (d) Page subgraph for the same query.
