

Clustering Ontology-Based Metadata in the Semantic Web

Alexander Maedche and Valentin Zacharias

FZI Research Center for Information Technologies at the University of Karlsruhe, Research Group WIM
D-76131 Karlsruhe, Germany
{maedche,zach}@fzi.de
http://www.fzi.de/wim

Abstract. The Semantic Web is an extension of the current web in which information is given well-defined meaning, better enabling computers and people to work in cooperation. Recently, different applications based on this vision have been designed, e.g. in the fields of knowledge management, community web portals, e-learning, multimedia retrieval, etc. It is obvious that the complex metadata descriptions generated on the basis of pre-defined ontologies serve as perfect input data for machine learning techniques. In this paper we propose an approach for clustering ontology-based metadata. The main contributions of this paper are the definition of a set of similarity measures for comparing ontology-based metadata and an application study using these measures within a hierarchical clustering algorithm.

1 Introduction

The Web in its current form is an impressive success with a growing number of users and information sources. However, the heavy burden of accessing, extracting, interpreting and maintaining information is left to the human user. Recently, Tim Berners-Lee, the inventor of the WWW, coined the vision of a Semantic Web (footnote 1) in which background knowledge on the meaning of Web resources is stored through the use of machine-processable metadata. The Semantic Web should bring structure to the content of Web pages, being an extension of the current Web, in which information is given a well-defined meaning. Recently, different applications based on this Semantic Web vision have been designed, including scenarios such as knowledge management, information integration, community web portals, e-learning, multimedia retrieval, etc. The Semantic Web relies heavily on formal ontologies that provide shared conceptualizations of specific domains and on metadata defined according to these ontologies, enabling comprehensive and transportable machine understanding.

Our approach relies on a set of similarity measures that make it possible to compute similarities between ontology-based metadata along different dimensions. The similarity measures serve as input to a hierarchical clustering algorithm. The similarity measures and the overall clustering approach have been applied to real-world data, namely the CIA world fact book (footnote 2). In the context of this empirical evaluation and application study we have obtained promising results.

1 http://www.w3.org/2001/sw/

T. Elomaa et al. (Eds.): PKDD, LNAI 2431, pp. 348–360, 2002. © Springer-Verlag Berlin Heidelberg 2002

Organization. Section 2 introduces ontologies and metadata in the context of the Semantic Web. Section 3 focuses on three different dimensions for measuring similarity on ontology-based metadata. Section 4 provides insights into our empirical evaluation and application study and the results we obtained when applying our clustering technique to Semantic Web data. Section 5 gives an overview of related work, before we conclude and outline the next steps within our work.

2 Ontologies and Metadata in the Semantic Web

As introduced earlier, the term "Semantic Web" encompasses efforts to build a new WWW architecture that enhances content with formal semantics. This will enable automated agents to reason about Web content, and carry out more intelligent tasks on behalf of the user. Figure 1 illustrates the relation between "ontology", "metadata" and "Web documents". It depicts a small part of the CIA world fact book ontology. Furthermore, it shows two Web pages, viz. the CIA fact book page about the country Argentina and the home page of the United Nations, respectively, with semantic annotations given in an XML serialization of RDF-based metadata descriptions (footnote 3). For the country and the organization there are metadata definitions denoted by corresponding uniform resource identifiers (URIs) (http://www.cia.org/country#ag and http://www.un.org#org). The URIs are typed with the concepts COUNTRY and ORGANIZATION. In addition, there is a relationship instance between the country and the organization: Argentina isMemberof United Nations.

In the following we introduce an ontology and metadata model. We present only the part of our overall model that is actually used within our ontology-based metadata clustering approach (footnote 4). This model forms the backbone for the definition of similarity measures.

Ontologies. In its classical sense ontology is a philosophical discipline, a branch of philosophy that deals with the nature and the organization of being. In its most prevalent use an ontology refers to an engineering artifact, describing a formal, shared conceptualization of a particular domain of interest [4].

Fig. 1. Ontology, metadata and Web documents

Definition 1 (Ontology Structure). An ontology structure is a 6-tuple $O := \{\mathcal{C}, \mathcal{P}, \mathcal{A}, H^C, prop, att\}$, consisting of two disjoint sets $\mathcal{C}$ and $\mathcal{P}$ whose elements are called concepts and relation identifiers, respectively; a concept hierarchy $H^C$, which is a directed, transitive relation $H^C \subseteq \mathcal{C} \times \mathcal{C}$ also called concept taxonomy, where $H^C(C_1, C_2)$ means that $C_1$ is a sub-concept of $C_2$; and a function $prop : \mathcal{P} \to \mathcal{C} \times \mathcal{C}$ that relates concepts non-taxonomically. The function $dom : \mathcal{P} \to \mathcal{C}$ with $dom(P) := \Pi_1(prop(P))$ gives the domain of $P$, and $range : \mathcal{P} \to \mathcal{C}$ with $range(P) := \Pi_2(prop(P))$ gives its range. For $prop(P) = (C_1, C_2)$ one may also write $P(C_1, C_2)$. A specific kind of relations are attributes $\mathcal{A}$. The function $att : \mathcal{A} \to \mathcal{C}$ relates concepts with literal values (this means $range(A) := STRING$).

2 http://www.cia.gov/cia/publications/factbook/
3 The Resource Description Framework (RDF) is a W3C Recommendation for metadata representation, http://www.w3c.org/RDF.
4 A more detailed definition is available in [7].

Example. Let us consider a short example of an instantiated ontology structure as depicted in Figure 2. Here, on the basis of $\mathcal{C}$ := {COUNTRY, RELIGION, LANGUAGE}, $\mathcal{P}$ := {BELIEVE, SPEAK, BORDERS} and $\mathcal{A}$ := {POPGRW}, the relations BELIEVE(COUNTRY, RELIGION), SPEAK(COUNTRY, LANGUAGE) and BORDERS(COUNTRY, COUNTRY) with their domain/range restrictions and the attribute POPGRW(COUNTRY) are defined.
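The example ontology structure can be written down as plain data. The following is a minimal sketch, not the authors' implementation: the class name `OntologyStructure` and its field names are assumptions made for illustration, and the hierarchy is left empty because the example lists no sub-concept pairs.

```python
from dataclasses import dataclass


@dataclass
class OntologyStructure:
    """Illustrative container for O := {C, P, A, H^C, prop, att} (names are assumptions)."""
    concepts: set      # C
    relations: set     # P
    attributes: set    # A
    hierarchy: set     # H^C as (sub-concept, super-concept) pairs
    prop: dict         # relation identifier -> (domain concept, range concept)
    att: dict          # attribute identifier -> concept it belongs to

    def dom(self, relation):
        """First component of prop(relation)."""
        return self.prop[relation][0]

    def range(self, relation):
        """Second component of prop(relation)."""
        return self.prop[relation][1]


ontology = OntologyStructure(
    concepts={"COUNTRY", "RELIGION", "LANGUAGE"},
    relations={"BELIEVE", "SPEAK", "BORDERS"},
    attributes={"POPGRW"},
    hierarchy=set(),  # assumption: the example lists no sub-concept pairs
    prop={
        "BELIEVE": ("COUNTRY", "RELIGION"),
        "SPEAK": ("COUNTRY", "LANGUAGE"),
        "BORDERS": ("COUNTRY", "COUNTRY"),
    },
    att={"POPGRW": "COUNTRY"},
)

print(ontology.dom("SPEAK"), ontology.range("SPEAK"))  # COUNTRY LANGUAGE
```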

Ontology-Based Metadata. We consider the term metadata as a synonym for instances of ontologies and define a so-called metadata structure as follows:

Definition 2 (Metadata Structure). A metadata structure is a 6-tuple $MD := \{O, \mathcal{I}, \mathcal{L}, inst, inst_r, inst_l\}$ that consists of an ontology $O$, a set $\mathcal{I}$ whose elements are called instance identifiers (correspondingly $\mathcal{C}$, $\mathcal{P}$ and $\mathcal{I}$ are disjoint), a set of literal values $\mathcal{L}$, a function $inst : \mathcal{C} \to 2^{\mathcal{I}}$ called concept instantiation (for $inst(C) = I$ one may also write $C(I)$), and a function $inst_r : \mathcal{P} \to 2^{\mathcal{I} \times \mathcal{I}}$ called relation instantiation (for $inst_r(P) = \{I_1, I_2\}$ one may also write $P(I_1, I_2)$). The attribute instantiation is described via the function $inst_l : \mathcal{P} \to 2^{\mathcal{I} \times \mathcal{L}}$, which relates instances with literal values.

Fig. 2. Example ontology and metadata

Example. Here, the following metadata statements according to the ontology are defined. Let $\mathcal{I}$ := {Finnland, Roman-Catholic, Protestant, Finnish}. inst is applied as follows: inst(Finnland) = COUNTRY, inst(Roman-Catholic) = RELIGION, inst(Protestant) = RELIGION, inst(Finnish) = LANGUAGE. Furthermore, we define relations between the instances and an attribute for the country instance. This is done as follows: We define BELIEVE(Finnland, Roman-Catholic), BELIEVE(Finnland, Protestant), SPEAK(Finnland, Finnish) and POPGRW(Finnland, "1.08").
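The metadata statements of this example can be held in three dictionaries, one per instantiation function. This is only a sketch under assumptions: the dictionary layout and the helper names `concept_of` and `associated` are hypothetical and chosen for readability.

```python
# Concept instantiation inst : C -> 2^I, stored as concept -> set of instances.
inst = {
    "COUNTRY": {"Finnland"},
    "RELIGION": {"Roman-Catholic", "Protestant"},
    "LANGUAGE": {"Finnish"},
}

# Relation instantiation inst_r, stored as relation -> set of (instance, instance) pairs.
inst_r = {
    "BELIEVE": {("Finnland", "Roman-Catholic"), ("Finnland", "Protestant")},
    "SPEAK": {("Finnland", "Finnish")},
}

# Attribute instantiation inst_l, stored as attribute -> set of (instance, literal) pairs.
inst_l = {"POPGRW": {("Finnland", "1.08")}}


def concept_of(instance):
    """Return the concept whose instantiation contains the instance, or None."""
    for concept, instances in inst.items():
        if instance in instances:
            return concept
    return None


def associated(relation, instance):
    """Instances that the given instance points to through the relation."""
    return {target for source, target in inst_r.get(relation, set()) if source == instance}


print(concept_of("Finnish"))                      # LANGUAGE
print(sorted(associated("BELIEVE", "Finnland")))  # ['Protestant', 'Roman-Catholic']
```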

3 Measuring Similarity on Ontology-Based Metadata

As mentioned earlier, clustering of objects requires some kind of similarity measure that is computed between the objects. In our specific case the objects are described via ontology-based metadata that serve as input for measuring similarities. Our approach computes similarities using the instantiated ontology structure and the instantiated metadata structure, both introduced earlier, in parallel. Within the overall similarity computation approach, we distinguish the following three dimensions:

– Taxonomy similarity: Computes the similarity between two instances on the basis of their corresponding concepts and their position in $H^C$.

– Relation similarity: Computes the similarity between two instances on the basis of their relations to other objects.

– Attribute similarity: Computes the similarity between two instances on the basis of their attributes and attribute values.


Taxonomy Similarity. The taxonomic similarity computed between metadata instances relies on the concepts and their position in the concept taxonomy $H^C$. The so-called upwards cotopy (UC) [7] is the underlying measure to compute the semantic distance in a concept hierarchy.

Definition 3 (Upwards Cotopy (UC)).

$$UC(C_i, H^C) := \{C_j \in \mathcal{C} \mid H^C(C_i, C_j) \lor C_j = C_i\}.$$

The semantic characteristics of $H^C$ are utilized: The attention is restricted to super-concepts of a given concept $C_i$ and the reflexive relationship of $C_i$ to itself. Based on the definition of the upwards cotopy (UC) the concept match (CM) is then defined:

Definition 4 (Concept Match).

$$CM(C_1, C_2) := \frac{|UC(C_1, H^C) \cap UC(C_2, H^C)|}{|UC(C_1, H^C) \cup UC(C_2, H^C)|}.$$

Example. Figure 3 depicts the example scenario for computing CM graphically. The upwards cotopy of CHRISTIANISM is given by $UC(CHRISTIANISM, H^C)$ = {CHRISTIANISM, RELIGION, ROOT}. The upwards cotopy of MUSLIM is computed as $UC(MUSLIM, H^C)$ = {MUSLIM, RELIGION, ROOT}. Based on the upwards cotopy one can compute the concept match CM between two given specific concepts. The two cotopies share two concepts (RELIGION and ROOT) out of the four concepts in their union, so the concept match CM between MUSLIM and CHRISTIANISM is given as 1/2.
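The upwards cotopy and the concept match of Definitions 3 and 4 can be sketched in a few lines. The function names and the `PARENT` mapping are assumptions for illustration; the mapping encodes only the part of the hierarchy that the example states (CHRISTIANISM and MUSLIM below RELIGION, RELIGION below ROOT).

```python
def upwards_cotopy(concept, parent):
    """Return the concept together with all of its super-concepts."""
    cotopy = {concept}
    frontier = [concept]
    while frontier:
        current = frontier.pop()
        for sup in parent.get(current, ()):
            if sup not in cotopy:
                cotopy.add(sup)
                frontier.append(sup)
    return cotopy


def concept_match(c1, c2, parent):
    """Size of the shared cotopy divided by the size of the combined cotopy."""
    uc1 = upwards_cotopy(c1, parent)
    uc2 = upwards_cotopy(c2, parent)
    return len(uc1 & uc2) / len(uc1 | uc2)


# Direct super-concepts per concept (assumed fragment of the Figure 3 hierarchy).
PARENT = {
    "CHRISTIANISM": ["RELIGION"],
    "MUSLIM": ["RELIGION"],
    "RELIGION": ["ROOT"],
}

print(concept_match("CHRISTIANISM", "MUSLIM", PARENT))  # 0.5
```

Storing a list of direct super-concepts per concept lets the same sketch handle hierarchies in which a concept has more than one parent.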

Fig. 3. Example for computing similarities

Definition 5 (Taxonomy Similarity).

$$TS(I_1, I_2) = \begin{cases} 1 & \text{if } I_1 = I_2 \\ \dfrac{CM(C(I_1), C(I_2))}{2} & \text{otherwise} \end{cases}$$

The taxonomy similarity between Shia Muslim and Protestant results in 1/4.
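A minimal sketch of Definition 5 follows. It assumes a tree-shaped hierarchy (one parent per concept) and assumes that Shia Muslim is typed with MUSLIM and Protestant with CHRISTIANISM; neither typing is spelled out in the text, but together they reproduce the stated value of 1/4. All names are hypothetical.

```python
def upwards_cotopy(concept, parent):
    """Concept plus all super-concepts; assumes at most one parent per concept."""
    cotopy = {concept}
    while concept in parent:
        concept = parent[concept]
        cotopy.add(concept)
    return cotopy


def concept_match(c1, c2, parent):
    uc1, uc2 = upwards_cotopy(c1, parent), upwards_cotopy(c2, parent)
    return len(uc1 & uc2) / len(uc1 | uc2)


def taxonomy_similarity(i1, i2, concept_of, parent):
    """1 for identical instances, otherwise half the concept match of their concepts."""
    if i1 == i2:
        return 1.0
    return concept_match(concept_of[i1], concept_of[i2], parent) / 2


PARENT = {"CHRISTIANISM": "RELIGION", "MUSLIM": "RELIGION", "RELIGION": "ROOT"}
CONCEPT_OF = {"Shia Muslim": "MUSLIM", "Protestant": "CHRISTIANISM"}  # assumed typing

print(taxonomy_similarity("Shia Muslim", "Protestant", CONCEPT_OF, PARENT))  # 0.25
```

Halving the concept match means that two distinct instances of the same concept reach at most 0.5, so only identical instances obtain a taxonomy similarity of 1.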


Relation Similarity. Our algorithm is based on the assumption that if two instances have the same relation to a third instance, they are more likely similar than two instances that have relations to totally different instances. Thus, the similarity of two instances depends on the similarity of the instances they have relations to. The similarity of the referred instances is once again calculated using taxonomic similarity. For example, assume we are given two concepts COUNTRY and RELIGION and a relation BELIEVE(COUNTRY, RELIGION). The algorithm will infer that specific countries believing in Catholicism and Protestantism are more similar than either of these two compared to Hinduism, because more countries have both Catholics and Protestants than a combination of either of these and Hindus.

After this overview, let us turn to the details of defining the similarity on relations. We are comparing two instances $I_1$ and $I_2$, $I_1, I_2 \in \mathcal{I}$. From the definition of the ontology we know that there is a set of relations $P_1$ that allow instance $I_1$ either as domain, as range or both (likewise there is a set $P_2$ for $I_2$). Only the intersection $P_{co} = P_1 \cap P_2$ will be of interest for relation similarity, because differences between $P_1$ and $P_2$ are determined by the taxonomic relations, which are already taken into account by the taxonomic similarity. The set $P_{co}$ of relations is divided into relations allowing $I_1$ and $I_2$ as range, $P_{co\text{-}I}$, and those that allow $I_1$ and $I_2$ as domain, $P_{co\text{-}O}$.

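Definition 6 (Incoming $P_{co\text{-}I}$ and Outgoing $P_{co\text{-}O}$ Relations). Given $O := \{\mathcal{C}, \mathcal{P}, \mathcal{A}, H^C, prop, att\}$ and instances $I_1$ and $I_2$ let:

$$H^{trans} := \{(a, b) : \exists a_1 \ldots a_n \in \mathcal{C} : H^C(a, a_1) \ldots H^C(a_n, b)\}$$

$$P_{co\text{-}I_i}(I_i) := \{R : R \in \mathcal{P} \land (C(I_i), range(R)) \in H^{trans}\}$$

$$P_{co\text{-}O_i}(I_i) := \{R : R \in \mathcal{P} \land (C(I_i), dom(R)) \in H^{trans}\}$$

$$P_{co\text{-}I}(I_i, I_j) := P_{co\text{-}I_i}(I_i) \cap P_{co\text{-}I_i}(I_j)$$

$$P_{co\text{-}O}(I_i, I_j) := P_{co\text{-}O_i}(I_i) \cap P_{co\text{-}O_i}(I_j)$$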
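The common incoming and outgoing relations of two instances can be sketched as set intersections. One assumption is made loudly here: a concept is treated as matching a domain or range if it equals that concept or is one of its sub-concepts, so that a COUNTRY instance qualifies for relations whose domain is COUNTRY itself. The function names and the `PROP` table (taken from the earlier example ontology) are illustrative only.

```python
def is_a(concept, target, parent):
    """True if target equals concept or is reached by following parent links.

    Including the concept itself is an assumption of this sketch.
    """
    while concept is not None:
        if concept == target:
            return True
        concept = parent.get(concept)
    return False


def outgoing(concept, prop, parent):
    """Relations that accept the concept as domain."""
    return {r for r, (dom, _) in prop.items() if is_a(concept, dom, parent)}


def incoming(concept, prop, parent):
    """Relations that accept the concept as range."""
    return {r for r, (_, rng) in prop.items() if is_a(concept, rng, parent)}


def common_relations(c1, c2, prop, parent):
    """Return (common incoming, common outgoing) relations for two concepts."""
    co_in = incoming(c1, prop, parent) & incoming(c2, prop, parent)
    co_out = outgoing(c1, prop, parent) & outgoing(c2, prop, parent)
    return co_in, co_out


PROP = {
    "BELIEVE": ("COUNTRY", "RELIGION"),
    "SPEAK": ("COUNTRY", "LANGUAGE"),
    "BORDERS": ("COUNTRY", "COUNTRY"),
}
PARENT = {}  # no sub-concepts are needed for this toy case

co_in, co_out = common_relations("COUNTRY", "COUNTRY", PROP, PARENT)
print(sorted(co_in))   # ['BORDERS']
print(sorted(co_out))  # ['BELIEVE', 'BORDERS', 'SPEAK']
```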

In the following we will only look at $P_{co\text{-}O}$, but everything applies to $P_{co\text{-}I}$ as well. Before we continue we have to note an interesting aspect: For a given ontology with a relation $P_x$ there is a minimum similarity greater than zero between any two instances that are source or target of an instance relation, $MinSim_s(P_x)$ and $MinSim_t(P_x)$ (footnote 5). Ignoring this will increase the similarity of two instances with relations to the most different instances when compared to two instances that simply do not define this relation. This is especially troublesome when dealing with missing values. For each relation $P_n \in P_{co\text{-}O}$ and each instance $I_i$ there exists a set of instance relations $P_n(I_i, I_x)$. We will call the set of instances $I_x$ the associated instances $A_s$.

Definition 7 (Associated Instances).

$$A_s(P, I) := \{I_x : I_x \in \mathcal{I} \land P(I, I_x)\}$$

5 Range and domain specify a concept and any two instances of this concept or one of its sub-concepts will have a taxonomic similarity bigger than zero.

The task of comparing the instances $I_1$ and $I_2$ with respect to relation $P_n$ boils down to comparing $A_s(P_n, I_1)$ with $A_s(P_n, I_2)$. This is done as follows:

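Definition 8 (Similarity for One Relation).

$$OR(I_1, I_2, P) = \begin{cases}
MinSim_t(P) & \text{if } A_s(P, I_1) = \emptyset \lor A_s(P, I_2) = \emptyset \\
\dfrac{\sum_{a \in A_s(P, I_1)} \max\{sim(a, b) \mid b \in A_s(P, I_2)\}}{|A_s(P, I_1)|} & \text{if } |A_s(P, I_1)| \geq |A_s(P, I_2)| \\
\dfrac{\sum_{a \in A_s(P, I_2)} \max\{sim(a, b) \mid b \in A_s(P, I_1)\}}{|A_s(P, I_2)|} & \text{otherwise}
\end{cases}$$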

Finally, the results for all $P_n \in P_{co\text{-}O}$ and $P_n \in P_{co\text{-}I}$ are combined by calculating their arithmetic mean.

Definition 9 (Relational Similarity).

$$RS(I_1, I_2) := \frac{\sum_{p \in P_{co\text{-}I}} OR(I_1, I_2, p) + \sum_{p \in P_{co\text{-}O}} OR(I_1, I_2, p)}{|P_{co\text{-}I}| + |P_{co\text{-}O}|}$$
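Definitions 8 and 9 can be sketched as two small functions. This is an illustration under assumptions rather than the authors' code: the instance similarity `sim` is passed in as a callable, a single `min_sim` value stands in for the per-relation minimum similarity, and returning 0 when there is no common relation is a choice made only for this sketch.

```python
def one_relation_similarity(assoc1, assoc2, sim, min_sim):
    """Best-match average over the larger set of associated instances."""
    if not assoc1 or not assoc2:
        return min_sim  # a missing relation falls back to the minimum similarity
    if len(assoc1) >= len(assoc2):
        larger, smaller = assoc1, assoc2
    else:
        larger, smaller = assoc2, assoc1
    return sum(max(sim(a, b) for b in smaller) for a in larger) / len(larger)


def relational_similarity(pairs, sim, min_sim=0.0):
    """Arithmetic mean over all common relations.

    pairs holds one (assoc1, assoc2) tuple per common incoming or outgoing relation.
    """
    if not pairs:
        return 0.0  # assumption: no common relation yields no relational evidence
    total = sum(one_relation_similarity(a1, a2, sim, min_sim) for a1, a2 in pairs)
    return total / len(pairs)


def exact(a, b):
    """Toy instance similarity used only for this illustration."""
    return 1.0 if a == b else 0.0


print(one_relation_similarity({"x", "y"}, {"x"}, exact, 0.1))              # 0.5
print(relational_similarity([({"x", "y"}, {"x"}), ({"z"}, {"z"})], exact))  # 0.75
```

Averaging over the larger of the two sets means that unmatched associated instances lower the score instead of being ignored.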

The last problem that remains is the recursive nature of the process of calculating similarities, which may lead to infinite cycles. It can be easily solved by imposing a maximum depth for the recursion. After reaching this maximum depth the arithmetic mean of taxonomic and attribute similarity is returned.

Example. Assume that, based on Figure 3, we compare Finnland and Germany. We see that the set of common relations only contains the belief relation. As the next step we compare the sets of instances associated with Germany and Finnland through the belief relation: {Roman-Catholicism, Protestant} for Germany and {Protestant} for Finnland. The similarity function for Protestant compared with Protestant returns one because they are equal, but the similarity of Protestant compared with Roman-Catholicism once again depends on their relational similarity. If we assume the maximum depth of recursion is set to one, the relational similarity between Roman-Catholicism and Protestant is 0.5 (footnote 6). So finally the relational similarity between Finnland and Germany in this example is 0.75.
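As a check on the numbers in this example, the computation can be written out with Definition 8, taking the similarity between Roman-Catholicism and Protestant to be 0.5. Germany has the larger set of associated instances, so the sum runs over its two religions; the abbreviations RC (Roman-Catholicism) and Pr (Protestant) are introduced here only for brevity.

$$OR(Finnland, Germany, BELIEVE) = \frac{sim(RC, Pr) + sim(Pr, Pr)}{|A_s(BELIEVE, Germany)|} = \frac{0.5 + 1}{2} = 0.75$$

Because the belief relation is the only common relation, the arithmetic mean of Definition 9 runs over a single term and the relational similarity equals this value.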

Attribute Similarity. Attribute similarity focuses on similar attribute values to determine the similarity between two instances. As attributes are very similar to relations (footnote 7), most of what is said for relations also applies here.

Definition 10 (Compared Attributes for Two Instances).

$$P_{A_i}(I_i) := \{A : A \in \mathcal{A}\}$$

$$P_A(I_i, I_j) := P_{A_i}(I_i) \cap P_{A_i}(I_j)$$

6 The set of associated instances for Protestant contains Finnland and Germany, the set for Roman-Catholicism just Germany.

7 In RDF attributes are actually relations with a range of literal.


Definition 11 (Attribute Values).

$$A_s(A, I_i) := \{L_x : L_x \in \mathcal{L} \land A(I_i, L_x)\}$$

Here the members of the sets $A_s$ are not instances but literals, so we need a new similarity method to compare literals. Because attributes can be names, dates of birth, the population of a country, income etc., comparing them in a meaningful way is very difficult. We decided to try to parse the attribute values as a known data type (so far only date or number) (footnote 8) and to do the comparison on the parsed values. If it is not possible to parse all values of a specific attribute, we ignore this attribute. But even if numbers are compared, translating a numeric difference to a similarity value in [0, 1] can be difficult. For example, when comparing the attribute population of a country, a difference of 4 should yield a similarity value very close to 1, but when comparing the attribute "average number of children per woman" the same numeric difference should result in a similarity value close to 0. To take this into account, we first find the maximum difference between values of this attribute and then calculate the similarity as 1 − (Difference/maxDifference).
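The normalization described above can be sketched for numeric attributes as follows. Date parsing is left out, the function names are hypothetical, and treating an attribute whose values are all equal as fully similar is an assumption of this sketch.

```python
def parse_number(value):
    """Return the value as a float, or None if it cannot be parsed."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def make_literal_similarity(all_values):
    """Build a similarity function for one attribute from all of its observed values.

    Returns None when some value cannot be parsed, meaning the attribute is ignored.
    """
    parsed = [parse_number(v) for v in all_values]
    if not parsed or any(p is None for p in parsed):
        return None
    max_difference = max(parsed) - min(parsed)

    def similarity(x, y):
        if max_difference == 0:
            return 1.0  # assumption: identical values everywhere count as fully similar
        return 1.0 - abs(float(x) - float(y)) / max_difference

    return similarity


children = make_literal_similarity(["1.0", "3.0", "5.0"])
print(children("1.0", "3.0"))                     # 0.5
print(make_literal_similarity(["1.0", "n/a"]))    # None
```

With this scaling the same absolute difference maps to a small similarity loss for attributes with a wide value range and to a large loss for attributes with a narrow one.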

Definition 12 (Literal Similarity).

$$slsim : A \times A \to [0, 1]$$

$$mlsim(A) := \max\{slsim(A_1, A_2) : A_1 \in A \land A_2 \in A\}$$

$$lsim(A_i, A_j, A) := \frac{slsim(A_i, A_j)}{mlsim(A)}$$

And last but not least, unlike for relations, the minimal similarity when comparing attributes is always zero.

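Definition 13 (Similarity for One Attribute).

$$OA(I_1, I_2, A) := \begin{cases}
0 & \text{if } A_s(A, I_1) = \emptyset \lor A_s(A, I_2) = \emptyset \\
\dfrac{\sum_{a \in A_s(A, I_1)} \max\{lsim(a, b, A) \mid b \in A_s(A, I_2)\}}{|A_s(A, I_1)|} & \text{if } |A_s(A, I_1)| \geq |A_s(A, I_2)| \\
\dfrac{\sum_{a \in A_s(A, I_2)} \max\{lsim(a, b, A) \mid b \in A_s(A, I_1)\}}{|A_s(A, I_2)|} & \text{otherwise}
\end{cases}$$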

Definition 14 (Attribute Similarity).

$$AS(I_1, I_2) := \frac{\sum_{a \in P_A(I_1, I_2)} OA(I_1, I_2, a)}{|P_A(I_1, I_2)|}$$
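Definitions 13 and 14 follow the same best-match pattern as the relation case, with zero as the fallback for missing values. The sketch below assumes that a literal similarity function is already available per attribute; the attribute names, value ranges and numbers in the example are invented for illustration and do not come from the data set.

```python
def one_attribute_similarity(values1, values2, lsim):
    """Best-match average over the larger value set; 0 if either side has no value."""
    if not values1 or not values2:
        return 0.0
    if len(values1) >= len(values2):
        larger, smaller = values1, values2
    else:
        larger, smaller = values2, values1
    return sum(max(lsim(a, b) for b in smaller) for a in larger) / len(larger)


def attribute_similarity(attrs1, attrs2, lsims):
    """Arithmetic mean of the per-attribute similarities.

    attrs1 and attrs2 map attribute -> list of values;
    lsims maps attribute -> literal similarity function.
    """
    if not lsims:
        return 0.0  # assumption: nothing to compare
    total = sum(
        one_attribute_similarity(attrs1.get(a, []), attrs2.get(a, []), f)
        for a, f in lsims.items()
    )
    return total / len(lsims)


# Hypothetical literal similarities with made-up value ranges of 4 and 100.
LSIMS = {
    "POPGRW": lambda x, y: 1.0 - abs(x - y) / 4.0,
    "INFMORT": lambda x, y: 1.0 - abs(x - y) / 100.0,
}
country_a = {"POPGRW": [1.0], "INFMORT": [10.0]}
country_b = {"POPGRW": [3.0]}  # INFMORT value is missing

print(attribute_similarity(country_a, country_b, LSIMS))  # 0.25
```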

8 For simple string data types one may use a notion of string similarity: The edit distance formulated by Levenshtein [6] is a well-established method for weighting the difference between two strings. It measures the minimum number of token insertions, deletions, and substitutions required to transform one string into another using a dynamic programming algorithm. For example, the edit distance, ed, between the two lexical entries "TopHotel" and "Top Hotel" equals 1, ed("TopHotel", "Top Hotel") = 1, because one insertion operation changes the string "TopHotel" into "Top Hotel".
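The edit distance mentioned in footnote 8 can be sketched with the usual dynamic programming recurrence. The sketch works on single characters as tokens, which is an assumption; the function name is hypothetical.

```python
def edit_distance(source, target):
    """Minimum number of insertions, deletions and substitutions between two strings."""
    # previous[j] holds the distance between the processed prefix of source and target[:j].
    previous = list(range(len(target) + 1))
    for i, source_char in enumerate(source, start=1):
        current = [i]
        for j, target_char in enumerate(target, start=1):
            cost = 0 if source_char == target_char else 1
            current.append(min(
                previous[j] + 1,         # delete source_char
                current[j - 1] + 1,      # insert target_char
                previous[j - 1] + cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


print(edit_distance("TopHotel", "Top Hotel"))  # 1
```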


Combined Measure. The combined measure uses the three dimensions introduced above in a common measure. This is done by calculating the weighted arithmetic mean of attribute, relation and taxonomy similarity.

Definition 15 (Similarity Measure).

$$sim(I_i, I_j) := \frac{t \times TS(I_i, I_j) + r \times RS(I_i, I_j) + a \times AS(I_i, I_j)}{t + r + a}$$

The weights may be adjusted according to the data set to which the measures are applied. For example, within our empirical evaluation we used a weight of 2 for relation similarity, because most of the overall information of the ontology and the associated metadata was contained in the relations.
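The weighted mean of Definition 15 is a one-line computation. In the sketch below only the relation weight of 2 is taken from the text; the weights of 1 for taxonomy and attribute similarity and the three input values are assumptions chosen for illustration.

```python
def combined_similarity(ts, rs, attr_sim, t=1.0, r=1.0, a=1.0):
    """Weighted arithmetic mean of taxonomy, relation and attribute similarity."""
    return (t * ts + r * rs + a * attr_sim) / (t + r + a)


# Illustrative input values; r=2.0 mirrors the relation weight used in the evaluation.
score = combined_similarity(ts=0.25, rs=0.75, attr_sim=0.5, r=2.0)
print(score)  # 0.5625
```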

Hierarchical Clustering. Based on the similarity measures introduced above we may now apply a clustering technique. Hierarchical clustering algorithms are preferable for concept-based learning. They produce hierarchies of clusters, and their output therefore contains more information than that of non-hierarchical algorithms. [8] describes the bottom-up algorithm we use within our approach. It starts with a separate cluster for each object. In each step, the two most similar clusters are determined and merged into a new cluster. The algorithm terminates when one large cluster containing all objects has been formed.
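The bottom-up procedure can be sketched as follows. This is not the algorithm of [8] verbatim: cluster similarity is taken here as the similarity of the closest pair of members (single linkage), and the toy similarity on numbers is made up for the example.

```python
from itertools import combinations


def single_linkage(cluster1, cluster2, sim):
    """Similarity of two clusters, taken as that of their most similar pair of members."""
    return max(sim(a, b) for a in cluster1 for b in cluster2)


def agglomerate(objects, sim):
    """Merge the two most similar clusters until one cluster remains.

    Returns the list of merges in the order they happened.
    """
    clusters = [frozenset([obj]) for obj in objects]  # one cluster per object
    merges = []
    while len(clusters) > 1:
        left, right = max(
            combinations(clusters, 2),
            key=lambda pair: single_linkage(pair[0], pair[1], sim),
        )
        clusters = [c for c in clusters if c not in (left, right)]
        clusters.append(left | right)
        merges.append((left, right))
    return merges


def toy_similarity(a, b):
    """Made-up similarity on numbers: closer values are more similar."""
    return 1.0 / (1.0 + abs(a - b))


merges = agglomerate([1, 2, 10, 12, 50], toy_similarity)
for left, right in merges:
    print(sorted(left), "+", sorted(right))
```

The recorded merge order is the cluster hierarchy: reading it from the last merge backwards gives the tree from the single root cluster down to the individual objects.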

4 Empirical Evaluation

We have empirically evaluated our approach for clustering ontology-based metadata based on the different similarity measures and the clustering algorithm introduced above. We used the well-known CIA world fact book data set as input (footnote 9), available in the form of a MONDIAL database (footnote 10). Due to a lack of currently available ontology-based metadata on the Web, we converted a subset of MONDIAL to RDF and modeled a corresponding RDF-Schema for the database (on the basis of the ER model also provided by MONDIAL). Our subset of the MONDIAL database contained the concepts COUNTRY, LANGUAGE, ETHNIC-GROUP, RELIGION and CONTINENT. The relations contained were

– SPEAK(COUNTRY, LANGUAGE),
– BELONG(COUNTRY, ETHNIC-GROUP),
– BELIEVE(COUNTRY, RELIGION),
– BORDERS(COUNTRY, COUNTRY) and
– ENCOMPASSES(COUNTRY, CONTINENT).

We also converted the attributes infant mortality and population growth of the concept COUNTRY. As there is no pre-classification of countries, we decided to empirically evaluate the clusters against the country clusters we know and use in our daily life (like European countries, Scandinavian countries, Arabic countries etc.). Sadly there is no further taxonomic information for the concepts RELIGION, ETHNIC-GROUP or LANGUAGE available within the data set. For our experiments we used the already introduced bottom-up clustering algorithm with a single linkage computation strategy using cosine measure.

9 http://www.cia.gov/cia/publications/factbook/
10 http://www.informatik.uni-freiburg.de/~may/Mondial/

Using only relation similarity. Using only the relations of countries for measuring similarities we got clusters resembling many real world country clusters, like the European countries, the former Soviet republics in the Caucasus or such small clusters as {Austria, Germany}. A particularly interesting example is the cluster of Scandinavian countries depicted in Figure 4, because our data nowhere contains a value like "scandinavian language" or an ethnic group "scandinavian" (footnote 11).

Fig. 4. Example clustering result – Scandinavian countries

Figure 5 shows another interesting cluster of countries that we know as the Middle East (footnote 12). The politically interested reader will immediately recognize that Israel is missing. This can be easily explained by observing that Israel, while geographically in the Middle East, is in terms of language, religion and ethnic group a very different country. More troublesome is that Oman is missing too. This can only be explained by turning to the data set used to calculate the similarities, where we see that Oman is missing many values, for example any relation to language or ethnic group.

Using only attribute similarity. When using only attributes of countries for measuring similarities we had to restrict the clustering to infant mortality and population growth. As infant mortality and population growth are good indicators for the wealth of a country, we got clusters such as industrialized countries or very poor countries.

11 The meaning of the acronyms in the picture is: N: Norway, SF: Finland, S: Sweden, DK: Denmark and IS: Iceland.
12 The meaning of the acronyms used in the picture is: Q: Qatar, KWT: Kuwait, UAE: United Arab Emirates, SA: Saudi Arabia, JOR: Jordan, RL: Lebanon, IRQ: Iraq, SYR: Syria, YE: Yemen.


Fig. 5. Example clustering result – Middle East

Combining relation and attribute similarity. At first surprisingly, the clusters generated with the combination of attribute and relation similarity closely resemble the clusters generated only with relation similarity. But after checking the attribute values of the countries this actually increased our confidence in the algorithm, because countries that are geographically close together and are similar in terms of ethnic group, religion and language are almost always also similar in terms of population growth and infant mortality. In the few cases where this was not the case the countries were rated far apart. For example, Saudi Arabia and Iraq lost their position in the core Middle East cluster depicted because of their high infant mortality (footnote 13).

Summary of results. Due to the lack of pre-classified countries and due to the subjectivity of clustering in general, we had to restrict our evaluation procedure to an empirical evaluation of the clusters we obtained against the country clusters we know and use in our daily life. We have seen that using our attribute and relation similarity measures combined with a hierarchical clustering algorithm results in reasonable clusters of countries, taking into account the very different aspects by which a country may be described and classified.

5 Related Work

One work closely related to ours was done by Bisson [1]. In [1] it is argued that object-based representation systems should use the notion of similarity instead of the subsumption criterion for classification and categorization. The similarity between attributes is obtained by calculating the similarity between the values for common attributes (taking upper and lower bound for this attribute into account) and combining them. For a symmetrical similarity measure they are combined by dividing the weighted sum of the similarity values for the common attributes by the weights of all attributes that occur in one of the compared individuals. For an asymmetrical similarity measure the sum is divided using just the weights for the attributes that occur in the first argument individual, which makes it possible to calculate the degree of inclusion between first and second argument. The similarity for relations is calculated by using the similarity of the individuals that are connected through these relations. The resulting similarity measures are then again combined in the above described symmetrical or asymmetrical way. Compared to the algorithm proposed here, the approach proposed by Bisson does not take ontological background knowledge into account.

13 It may be surprising for such a rich country, but according to the CIA world fact book the infant mortality rate in Saudi Arabia (51 deaths per 1000 live born children) much closer resembles that of sanctioned Iraq (60) than that of much poorer countries like Syria (33) or Lebanon (28).

Similar to our approach, a distance-based clustering is introduced in [3] that uses RIBL (Relational Instance-Based Learning) for distance computations. RIBL as introduced in [5] is an adaptation of a propositional instance-based learner to a first order representation. It uses distance weighted k-nearest neighbor learning to classify test cases. In order to calculate the distance between examples, RIBL computes for each example a conjunction of literals describing the objects that are represented by the arguments of the example fact. Given an example fact, RIBL first collects all facts from the knowledge base containing at least one of the arguments also contained in the example fact. Depending on a parameter set by the user, the system may then continue to collect all facts that contain at least one of the arguments contained in the earlier selected facts (this goes on until a specified depth is reached). After selecting these facts the algorithm then goes on to calculate the similarity between the examples in a manner similar to the one used by Bisson or described in this paper: The similarity of the objects depends on the similarity of their attribute values and on the similarity of the objects related to them. The calculation of the similarity value is augmented by predicate and attribute weight estimation based on classification feedback (footnote 14). But like Bisson's approach, RIBL does not use ontological background knowledge (footnote 15).

In the context of Semantic Web research, an approach for clustering RDF statements to obtain and refine an ontology has been introduced by [2]. The authors present a method for learning concept hierarchies by systematically generating the most specific generalization of all possible sets of resources, in essence building a subsumption hierarchy using both the intension and extension of newly formed concepts. If an ontology is already present, its information is used to find generalizations, for example generalizing "type of Max is Cat" and "type of Moritz is Dog" to "type of Max, Moritz is Mammal". Unlike the authors of [2], we deliberately chose a distance-based rather than a subsumption-based clustering because, as for example [2] points out, subsumption-based criteria are not well equipped to deal with incomplete or incoherent information (something we expect to be very common within the Semantic Web).

14 Weight estimation was not used in [3].

15 It may seem obvious that ontological background information could be included as facts in the knowledge base, but the results would not be comparable to our approach. Assume we are comparing u1 and u2 and have the facts instance of(u1,c1) and instance of(u2,c2). Comparing u1 and u2 with respect to instance of would lead to comparing c1 and c2, which in turn lets the algorithm select all facts containing c1 and c2, that is, all instances of c1 and c2 and their descriptions. Assuming a single root concept and a high depth parameter, sooner or later all facts will be selected, resulting not only in a long runtime but also in a very low impact of the taxonomic relations.
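The Cat/Dog example given for [2] can be illustrated with a minimal sketch of a taxonomy lookup. This is not the algorithm of [2], which operates on RDF annotations; it only shows how an existing concept hierarchy yields the most specific common superconcept of two types. The taxonomy contents, the single-parent assumption and all function names are hypothetical.

```python
from typing import Dict, List, Optional

# Hypothetical toy taxonomy: each concept maps to its single direct superconcept.
PARENT: Dict[str, str] = {
    "Cat": "Mammal",
    "Dog": "Mammal",
    "Mammal": "Animal",
    "Bird": "Animal",
}


def ancestors(concept: str, parent: Dict[str, str]) -> List[str]:
    """Return the concept followed by its superconcepts, most specific first."""
    chain = [concept]
    while chain[-1] in parent:
        chain.append(parent[chain[-1]])
    return chain


def most_specific_generalization(a: str, b: str, parent: Dict[str, str]) -> Optional[str]:
    """Return the most specific concept subsuming both a and b, or None."""
    b_chain = set(ancestors(b, parent))
    for concept in ancestors(a, parent):
        if concept in b_chain:
            return concept
    return None


# "type of Max is Cat" and "type of Moritz is Dog"
types = {"Max": "Cat", "Moritz": "Dog"}
common = most_specific_generalization(types["Max"], types["Moritz"], PARENT)
print(f"type of Max, Moritz is {common}")  # type of Max, Moritz is Mammal
```

If one of the type statements is missing or refers to a concept outside the taxonomy, the lookup returns None and no generalization is formed. This is one simple way to see why subsumption-based criteria depend on complete and coherent information.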

6 Conclusion

In this paper we have presented an approach towards mining Semantic Web data, focusing on clustering objects described by ontology-based metadata. Our method has been empirically evaluated on the basis of the CIA world fact book data set, which was easy to convert into ontology-based metadata. The results have shown that our clustering method is able to detect commonly known clusters of countries such as the Scandinavian countries or the Middle Eastern countries.

Much work remains to be done. Our empirical evaluation could not be formalized due to the lack of available pre-classifications. The actual problem is that there is no ontological background knowledge. Therefore, we will model country clusters within the CIA world fact book ontology and examine to what degree the algorithm is able to discover these country clusters. This data set may serve as a future reference data set when experimenting with our Semantic Web mining techniques.

Acknowledgments

The research presented in this paper has been partially funded by DaimlerChrysler AG, Woerth, in the HRMore project. We thank Steffen Staab for providing useful input for defining the taxonomic similarity measure. Furthermore, we thank our student Peter Horn, who did the implementation work for our empirical evaluation study.

References

1. G. Bisson. Why and how to define a similarity measure for object based representation systems, 1995.

2. A. Delteil, C. Faron-Zucker, and R. Dieng. Learning ontologies from RDF annotations. In A. Maedche, S. Staab, C. Nedellec, and E. Hovy, editors, Proceedings of the IJCAI-01 Workshop on Ontology Learning (OL-2001), Seattle, August 2001. AAAI Press, Menlo Park, 2001.

3. W. Emde and D. Wettschereck. Relational instance-based learning. In Proceedings of the 13th International Conference on Machine Learning, 1996.

4. T. R. Gruber. A translation approach to portable ontology specifications. Knowledge Acquisition, 6(2):199–221, 1993.

5. M. Kirsten and S. Wrobel. Relational distance-based clustering. In Proceedings of ILP-98, LNAI 1449, pages 261–270. Springer, 1998.

6. V. I. Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. Cybernetics and Control Theory, 10(8):707–710, 1966.

7. A. Maedche, S. Staab, N. Stojanovic, R. Studer, and Y. Sure. SEmantic PortAL: The SEAL approach. To appear in: D. Fensel et al., editors, Creating the Semantic Web. MIT Press, Cambridge, MA, 2001.

8. C. D. Manning and H. Schuetze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, Massachusetts, 1999.