Vocabulary mining for information retrieval: rough sets and fuzzy sets

Padmini Srinivasan a,*, Miguel E. Ruiz a, Donald H. Kraft b, Jianhua Chen b

a School of Library and Information Science, The University of Iowa, Iowa City, IA 52242, USA
b Department of Computer Science, Louisiana State University, Baton Rouge, LA 70803-4020, USA

Received 20 April 1999; accepted 7 February 2000

Abstract

Vocabulary mining in information retrieval refers to the utilization of the domain vocabulary towards improving the user's query. Most often, queries posed to information retrieval systems are not optimal for retrieval purposes. Vocabulary mining allows one to generalize, specialize or perform other kinds of vocabulary-based transformations on the query in order to improve retrieval performance. This paper investigates a new framework for vocabulary mining that derives from the combination of rough sets and fuzzy sets. The framework allows one to use rough set-based approximations even when the documents and queries are described using weighted, i.e., fuzzy, representations. The paper also explores the application of generalized rough sets and the variable precision models. The problem of coordination between multiple vocabulary views is also examined. Finally, a preliminary analysis of issues that arise when applying the proposed vocabulary mining framework to the Unified Medical Language System (a state-of-the-art vocabulary system) is presented. The proposed framework supports the systematic study and application of different vocabulary views in information retrieval. © 2000 Elsevier Science Ltd. All rights reserved.

Keywords: Vocabulary mining; Generalized rough sets; Fuzzy sets; Multiple vocabulary views; UMLS

1. Introduction

In information retrieval the challenge is to retrieve relevant texts in response to user queries.

Information Processing and Management 37 (2001) 15–38

0306-4573/00/$ - see front matter © 2000 Elsevier Science Ltd. All rights reserved.
PII: S0306-4573(00)00014-5

www.elsevier.com/locate/infoproman

* Corresponding author. Tel.: +1-319-335-5708; fax: +1-319-335-5374.
E-mail address: [email protected] (P. Srinivasan).


Information retrieval technology has matured to the point that we now have reasonably sophisticated operational and research systems. However, increasing the effectiveness of retrieval algorithms remains an important and actively pursued research goal. Query refinement, where the initial query is modified to yield a potentially more effective query, is an important part of information retrieval. This step is critical for users whose queries are not formulated well enough for an effective retrieval run. One alternative for query refinement, referred to here as vocabulary-based query refinement, is to exploit knowledge within a vocabulary that is typically domain specific. A second approach utilizes the vocabulary in documents related to the query, where the related documents may be identified either through relevance or retrieval feedback.

Several families of statistical information retrieval models have received significant and long-term attention, such as the Boolean, vector, probabilistic and fuzzy families of models. The general approach is to create suitable representations (Boolean, weighted, unweighted, etc.) for the query and the document, and to apply a suitable retrieval technique (similarity computation, probability of relevance, etc.) that derives from the adopted model. Query refinement in the Boolean model may occur by changing the query operators, the query terms, or both. At all times the integrity of the term-operator relationships with respect to the user's information needs must be maintained. In the vector model, processes such as Rocchio's and Ide's feedback offer document-based query refinement options (Salton, 1971). Researchers have also investigated the derivation of fuzzy thesauri (Miyamoto, 1990). However, query refinement in these models is an optional feature; in other words, these models allow retrieval to be conducted without any query refinement.

In contrast, the rough set model offers a tight integration between retrieval and vocabulary-based query refinement. In fact, retrieval operates only after first exploring query refinement. Characteristics of the domain vocabulary, i.e., terms and relationships, are automatically utilized to refine the query representation before retrieval begins. An additional advantage is that the model also automatically allows the natural perturbations in vocabularies to influence document representations. In essence, rough sets offer an approach where the domain's vocabulary(ies) can be automatically mined prior to retrieval. Relationships linking terms, such as synonymy, near synonymy or related terms, lexically-related terms, and specific and general terms, can all be automatically mined in order to strengthen retrieval effectiveness.

Our research goal is to explore the application of the family of rough set models to information retrieval. Almost 10 years ago, initial efforts by one of the authors demonstrated some of the potential of rough sets for information retrieval (Das-Gupta, 1988; Srinivasan, 1989, 1991). Since then the area of rough sets has matured significantly, with many exciting advances reported in the literature. We will explore further developments and their potential for information retrieval. In particular, we aim to determine if current extensions to the model will strengthen our previous applications of rough sets to retrieval.

Section 2 provides a brief review of the standard rough set model and our previous application of the model to information retrieval. Section 3 explores the inclusion of fuzzy sets and logic in the rough set framework. Section 4 applies the combination of fuzzy and rough sets to information retrieval. Section 5 describes other extensions to the rough set model and their application to information retrieval; the extensions explored include generalized rough set models and the variable precision rough set model. Section 6 combines these extensions

along with fuzzy notions into a unified and novel framework for collaboratively mining alternative vocabulary views. Section 7 presents a preliminary analysis of issues that arise when applying the proposed framework to the Unified Medical Language System, a state-of-the-art vocabulary system developed by the National Library of Medicine (1998). The final section offers our conclusions and future plans for testing this approach.

2. Pawlak's rough set model

The Rough Set Model (RSM) was proposed by Pawlak in the early 1980s (Pawlak, 1982). It is an extension of standard set theory that supports approximations in decision making. It shares ideas and goals, to some extent, with other tools, such as the Dempster–Shafer theory of evidence (Skowron & Grzymala-Busse, 1994), fuzzy set theory (Pawlak & Skowron, 1994) and discriminant analysis (Krusinska, Slowinski & Stefanowski, 1992). As stated by others (Pawlak, Grzymala-Busse, Slowinski & Ziarko, 1995), one advantage of rough set theory is that it does not require preliminary information about the data, such as probability assignments (as in the Dempster–Shafer theory) or membership values (as in fuzzy set theory). However, it does require an equivalence relation operating on a universe of objects, and offers a pair of approximation operators to characterize different subsets of the universe. Various systems for dealing with approximations in different application contexts (especially in data mining) have been built using these operators (Hu & Cercone, 1995; Millan & Machuca, 1997; Nguyen, Skowron, Synak & Wróblewski, 1997; Øhrn, Vinterbo, Szymański & Komorowski, 1997).

In Pawlak's model an equivalence relation partitions a non-empty universe of objects into disjoint equivalence classes. Objects within an equivalence class are indistinguishable with regard to the relation. Any appropriate equivalence relation may be used for this purpose. The universe and the equivalence relation together define an approximation space. The equivalence classes and the empty set, \emptyset, are considered the elementary or atomic sets in this approximation space. Such an approximation space may be used to describe arbitrary subsets of the universe. This is done using two approximations: the lower and the upper approximations of the subset.

Let R be an equivalence relation that partitions U, a non-empty universe of objects, to create an approximation space apr_R = (U, R). Let the partition be denoted as U/R = {C_1, C_2, ..., C_n}, where C_i is an equivalence class of R. Now, for an arbitrary subset S of U,

the lower approximation of S, \underline{apr}_R(S) = \{x \in C_i \mid C_i \subseteq S\}

and

the upper approximation of S, \overline{apr}_R(S) = \{x \in C_i \mid C_i \cap S \neq \emptyset\}.

These two approximations are, in effect, approximate descriptions of the subset S in the approximation space (U, R). The term `rough set' refers to the pair of approximations (\underline{apr}_R(S), \overline{apr}_R(S)) for reference set S. Also, the accuracy of the approximation for S is |\underline{apr}_R(S)| / |\overline{apr}_R(S)|.
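As a concrete illustration, the two operators can be sketched in Python under the definitions above (the function names and the set encoding are ours, not the paper's):

```python
def lower_approx(partition, s):
    """Union of the equivalence classes C_i wholly contained in S."""
    out = set()
    for c in partition:
        if c <= s:            # C_i is a subset of S
            out |= c
    return out

def upper_approx(partition, s):
    """Union of the equivalence classes C_i that intersect S."""
    out = set()
    for c in partition:
        if c & s:             # C_i and S share at least one element
            out |= c
    return out

def accuracy(partition, s):
    """|lower| / |upper|, taken as 1.0 when the upper approximation is empty."""
    up = upper_approx(partition, s)
    return len(lower_approx(partition, s)) / len(up) if up else 1.0
```

Because the classes of an equivalence relation are disjoint, each element of the universe is inspected exactly once per approximation.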


The RSM may be viewed as an extension of set theory with two additional unary set-theoretic operators: the lower and upper approximation operators (Lin & Liu, 1993). The underlying equivalence relation is serial, reflexive, symmetric, transitive and Euclidean in nature.

A relation R is serial if for every x \in U there exists a y \in U such that xRy holds; in other words, every element in the universe has an R-related element in the same universe. R is reflexive if for all x \in U, xRx holds. R is symmetric if whenever xRy holds, yRx also holds. R is transitive if for every x, y, z \in U, if xRy and yRz hold, then xRz holds. R is Euclidean if for every x, y, z \in U, if xRy and xRz hold, then yRz holds. These properties are not independent of each other. For example, a reflexive relation is also serial, and a symmetric and transitive relation is also Euclidean. When equivalence is used to define the approximation space, and given any two subsets A and B of the universe, the lower approximations satisfy the following properties (Yao, Li, Lin & Liu, 1994):

L1: \underline{apr}_R(A) = \sim \overline{apr}_R(\sim A)

L2: \underline{apr}_R(U) = U

L3: \underline{apr}_R(A \cap B) = \underline{apr}_R(A) \cap \underline{apr}_R(B)

L4: \underline{apr}_R(A \cup B) \supseteq \underline{apr}_R(A) \cup \underline{apr}_R(B)

L5: A \subseteq B \Rightarrow \underline{apr}_R(A) \subseteq \underline{apr}_R(B)

L6: \underline{apr}_R(\emptyset) = \emptyset

L7: \underline{apr}_R(A) \subseteq A

L8: \underline{apr}_R(A) \subseteq \underline{apr}_R(\underline{apr}_R(A))

L9: \overline{apr}_R(A) \subseteq \underline{apr}_R(\overline{apr}_R(A))

Similarly the upper approximations satisfy the following properties:

U1: \overline{apr}_R(A) = \sim \underline{apr}_R(\sim A)

U2: \overline{apr}_R(\emptyset) = \emptyset

U3: \overline{apr}_R(A \cup B) = \overline{apr}_R(A) \cup \overline{apr}_R(B)

U4: \overline{apr}_R(A \cap B) \subseteq \overline{apr}_R(A) \cap \overline{apr}_R(B)

U5: A \subseteq B \Rightarrow \overline{apr}_R(A) \subseteq \overline{apr}_R(B)

U6: \overline{apr}_R(U) = U

U7: A \subseteq \overline{apr}_R(A)

U8: \overline{apr}_R(\overline{apr}_R(A)) \subseteq \overline{apr}_R(A)

U9: \overline{apr}_R(\underline{apr}_R(A)) \subseteq \underline{apr}_R(A)

The following two properties are also satisfied:

K: \underline{apr}_R(\sim A \cup B) \subseteq \sim \underline{apr}_R(A) \cup \underline{apr}_R(B)

LU: \underline{apr}_R(A) \subseteq \overline{apr}_R(A)

Properties L1 and U1 indicate that these approximations are dual operators. Many researchers have applied and studied these operators quite extensively for data mining in several domains, such as health care and engineering (for example, see Hu & Cercone, 1995; Millan & Machuca, 1997; Nguyen et al., 1997; Øhrn et al., 1997; Tsumoto et al., 1995). Our interest is in information retrieval, i.e., the problem of identifying potentially relevant texts in response to queries.
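On a small universe these properties can be checked exhaustively. The sketch below uses a toy partition of our own to verify the L1/U1 duality, the inclusions L7 and U7 (and hence LU), and the distribution properties L3 and U3 over every pair of subsets:

```python
import itertools

def lower(partition, s):
    # Union of the classes contained in s
    return set().union(set(), *(c for c in partition if c <= s))

def upper(partition, s):
    # Union of the classes intersecting s
    return set().union(set(), *(c for c in partition if c & s))

U = {0, 1, 2, 3, 4, 5}
partition = [{0, 1}, {2}, {3, 4, 5}]   # toy equivalence classes (our own example)

subsets = [set(c) for r in range(len(U) + 1) for c in itertools.combinations(U, r)]
for A in subsets:
    assert lower(partition, A) == U - upper(partition, U - A)   # L1/U1 duality
    assert lower(partition, A) <= A <= upper(partition, A)      # L7, U7, hence LU
    for B in subsets:
        assert lower(partition, A & B) == lower(partition, A) & lower(partition, B)  # L3
        assert upper(partition, A | B) == upper(partition, A) | upper(partition, B)  # U3
print("all properties hold on this universe")
```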


2.1. Application of Pawlak's RSM to information retrieval

In previous work we showed that a vocabulary of terms can be modeled for information retrieval applications using the RSM (Das-Gupta, 1988; Srinivasan, 1989, 1991). The model was applied by considering the domain's vocabulary (individual words and phrases) as the universe U of objects. R represented the equivalence relation defined by the term synonymy relationship and was used to create a partition of U such that terms within a class are synonyms of each other. Documents and queries, represented by vectors, were compared via their approximations in the approximation space apr_R = (U, R), as illustrated in the following example.

Example 1. Let T = {t1, t2, ..., t10} represent a vocabulary partitioned by R, the synonymy relation, such that T/R = {C1, C2, C3, C4, C5}, as defined below:

C1 = {t1, t4, t6}
C2 = {t3, t7}
C3 = {t5, t8, t10}
C4 = {t9}
C5 = {t2}

Let D1, a document, and Q1, a query, be defined as subsets of T:

D1 = {t2, t3, t4, t7}
Q1 = {t1, t2, t3}

We can then define the rough set (\underline{apr}_R(D1), \overline{apr}_R(D1)) with reference set D1:

\underline{apr}_R(D1) = {t2, t3, t7}
\overline{apr}_R(D1) = {t1, t2, t3, t4, t6, t7}

We can also define the rough set (\underline{apr}_R(Q1), \overline{apr}_R(Q1)) with reference set Q1:

\underline{apr}_R(Q1) = {t2}
\overline{apr}_R(Q1) = {t1, t2, t3, t4, t6, t7}

The following interpretations may be given to these approximations in the context of our retrieval application. The lower approximation identifies properties (i.e., terms) that definitely describe the subset (i.e., document or query) of interest. In contrast, the upper approximation identifies features that possibly describe the subset. These definite and possible features are, of course, determined largely by the underlying approximation space (i.e., vocabulary) and the relation R (i.e., synonymy). Thus, the vocabulary T partitioned by synonymy indicates that the query Q1 is definitely described by t2 and is possibly described by t1, t2, t3, t4, t6, t7. Notice that the lower approximation automatically narrows the query/document to its core description, while the upper approximation expands the description to the extent permitted by the vocabulary space.

2.1.1. Comparing D1 and Q1

In our previous work a number of comparison methods were designed. For example, a document and query were considered `roughly equal' if they had identical lower and upper approximations. They were `top equal' if they had identical upper approximations. In the present work we adopt a slightly different strategy, as described next. Two subsets of U, say S1 and S2, may be compared in the approximation space (U, R) with a pair of asymmetric similarity measures. Using the lower approximations, the asymmetric similarity between S1 and S2, with S2 as the focus, can be computed in the following way: first let

B_l = \underline{apr}_R(S_2) - (\underline{apr}_R(S_1) \cap \underline{apr}_R(S_2))

and

B_u = \overline{apr}_R(S_2) - (\overline{apr}_R(S_1) \cap \overline{apr}_R(S_2))

where '-' represents the bounded difference, then calculate:

\underline{Similarity}_R(S_1, S_2) = 1 - card(B_l) / card(\underline{apr}_R(S_2))   (1)

This will equal 0 for no match and 1 for maximum match between S2 and S1, keeping S2 as the focus in the comparison. If card(\underline{apr}_R(S_2)) = 0 then \underline{Similarity}_R(S_1, S_2) is set to equal 0. In the same way,

\overline{Similarity}_R(S_1, S_2) = 1 - card(B_u) / card(\overline{apr}_R(S_2))   (2)

This will equal 0 for no match and 1 for maximum match between S2 and S1, keeping S2 as the focus in the comparison. If card(\overline{apr}_R(S_2)) = 0 then \overline{Similarity}_R(S_1, S_2) is set to equal 0. Applying these similarity measures with the query as the central focus in Example 1:

\underline{Similarity}_R(D1, Q1) = 1 and \overline{Similarity}_R(D1, Q1) = 1
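Eqs. (1) and (2) can be sketched directly in Python (the helper names are ours; S2 is the focus of the comparison):

```python
def lower(partition, s):
    return set().union(set(), *(c for c in partition if c <= s))

def upper(partition, s):
    return set().union(set(), *(c for c in partition if c & s))

def sim_lower(partition, s1, s2):
    """Eq. (1): similarity via lower approximations, keeping s2 as the focus."""
    a2 = lower(partition, s2)
    if not a2:
        return 0.0
    b_l = a2 - (lower(partition, s1) & a2)   # bounded difference
    return 1.0 - len(b_l) / len(a2)

def sim_upper(partition, s1, s2):
    """Eq. (2): similarity via upper approximations, keeping s2 as the focus."""
    a2 = upper(partition, s2)
    if not a2:
        return 0.0
    b_u = a2 - (upper(partition, s1) & a2)
    return 1.0 - len(b_u) / len(a2)
```

On Example 1, with Q1 as the focus, both measures evaluate to 1.0, as in the text; the asymmetry shows when the roles are reversed and D1 becomes the focus.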

The overall retrieval strategy could be to use either similarity measure or, perhaps, some weighted combination of the two. We are currently investigating mechanisms for combining these two similarity measures into one retrieval status value. In our previous work, we also showed how the vocabulary model quite naturally yields document clusters, another important application within information retrieval. However, there were some limitations. These include, for example, the inability to use weighted descriptions of documents and queries and the inability to utilize term relationships other than synonymy. In the current research, we solve the first problem by using a combination of rough set and fuzzy set theories. We also consider recent extensions to the RSM that provide further flexibility, as for example, in being able to accommodate other types of vocabulary relationships. Finally, we also propose a method for combining multiple vocabulary relationships, i.e., vocabulary views.

3. Combining rough set and fuzzy set notions

The motivation to include fuzzy information in the rough set framework is to enable users to specify approximate descriptions of queries. Similarly, greater flexibility is offered when documents are described using term weights. Several researchers have studied the combination of rough and fuzzy notions (Dubois & Prade, 1990, 1992; Lin, 1992; Yao, 1997; Yao et al., 1997). We base our efforts on the approach proposed by Yao (1997), where he investigates combinations based on \alpha-level sets. As a preliminary step, the following expressions taken from Yao (1997) show how membership functions may be used to compute the two rough set approximations in Pawlak's standard model.

Assume S to be a set of interest in a universe U and R an equivalence relation for U. If we let \mu_S denote the membership function for S, then

\mu_{\underline{apr}_R(S)}(x) = \inf\{\mu_S(y) \mid y \in U, (x, y) \in R\}

\mu_{\overline{apr}_R(S)}(x) = \sup\{\mu_S(y) \mid y \in U, (x, y) \in R\}

and if we let \mu_R denote the membership function for R, then

\mu_{\underline{apr}_R(S)}(x) = \inf\{1 - \mu_R(x, y) \mid y \notin S\}

\mu_{\overline{apr}_R(S)}(x) = \sup\{\mu_R(x, y) \mid y \in S\}

These may be combined to give:

\mu_{\underline{apr}_R(S)}(x) = \inf\{\max(\mu_S(y), 1 - \mu_R(x, y)) \mid y \in U\}   (3)

\mu_{\overline{apr}_R(S)}(x) = \sup\{\min(\mu_S(y), \mu_R(x, y)) \mid y \in U\}   (4)

Using \mu_S to define the approximations, an element x belongs to the lower approximation of S (with membership = 1) if all elements equivalent to x belong to S (i.e., \mu_S(y) = 1). Using \mu_R to define the approximations, an element x belongs to the lower approximation of S (with membership = 1) if all terms not in S are not equivalent to x (i.e., \mu_R(x, y) = 0). Finally, using both \mu_S and \mu_R to define the approximations, an element x belongs to the lower approximation of S (with membership = 1) if, across all terms in the universe, whenever a term is equivalent to x (i.e., \mu_R(x, y) = 1) it is present in S (i.e., \mu_S(y) = 1). Thus, we see that these functions may be used to compute the approximations in the degenerate case where \mu_S and \mu_R take the values

of 0 or 1, i.e., S and R are crisp sets. As Yao shows, these functions are also relevant when S and R are fuzzy sets.
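Eqs. (3) and (4) translate directly into code. In this sketch (names ours) \mu_S is a dict of non-zero memberships and \mu_R a two-argument function:

```python
def fuzzy_lower(mu_S, mu_R, universe):
    """Eq. (3): membership of each x in the lower approximation of S."""
    return {x: min(max(mu_S.get(y, 0.0), 1.0 - mu_R(x, y)) for y in universe)
            for x in universe}

def fuzzy_upper(mu_S, mu_R, universe):
    """Eq. (4): membership of each x in the upper approximation of S."""
    return {x: max(min(mu_S.get(y, 0.0), mu_R(x, y)) for y in universe)
            for x in universe}
```

In the degenerate case where \mu_S and \mu_R take only the values 0 and 1, these reduce to the crisp Pawlak approximations.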

3.1. Rough fuzzy sets

A rough fuzzy set is derived from the approximation of a fuzzy set in a crisp approximation space. Let F be a fuzzy set in an approximation space apr_R = (U, R), with R being an equivalence relation. The \alpha-cut, \alpha \in (0, 1], of a fuzzy set is defined as:

F_\alpha = \{x \in U \mid \mu_F(x) \geq \alpha\}

With any given F_\alpha as a reference set, a rough set (\underline{apr}_R(F_\alpha), \overline{apr}_R(F_\alpha)) may be defined using the standard Pawlak model. More generally, with the fuzzy set F as the reference set, a rough fuzzy set (\underline{apr}_R(F), \overline{apr}_R(F)) may be defined, where each approximation is itself a fuzzy set. The membership value of x belonging to the fuzzy set \underline{apr}_R(F) is the minimum of the membership values of elements in the equivalence class containing x, and the membership value of x belonging to \overline{apr}_R(F) is the maximum. Thus, Eqs. (3) and (4) may be used to determine the memberships in \underline{apr}_R(F) and \overline{apr}_R(F).

3.2. Fuzzy rough sets

A fuzzy rough set is derived from the approximation of a crisp set in a fuzzy approximation space. Consider a fuzzy approximation space apr_{\tilde R} = (U, \tilde R), with \tilde R representing a fuzzy similarity relation. Similar to an \alpha-cut for a fuzzy set, it is possible to apply a \beta-cut, with \beta \in (0, 1], on the fuzzy similarity relation \tilde R such that each of \tilde R's \beta-level sets is an equivalence relation. Thus, we may apply the standard Pawlak model for each \beta-cut, and so for a given subset S of U we can derive a rough set (\underline{apr}_{R_\beta}(S), \overline{apr}_{R_\beta}(S)) in each of the \beta-level equivalence relations. More generally, we get a fuzzy rough set (\underline{apr}_{\tilde R}(S), \overline{apr}_{\tilde R}(S)) in \tilde R, where each approximation is a fuzzy set whose membership values may also be determined by Eqs. (3) and (4).

3.3. Fuzzy sets and fuzzy approximation spaces

This is a more general model that allows the approximation of a fuzzy set in a fuzzy approximation space. Thus, we can use \alpha-cuts on the fuzzy set F to get crisp sets and \beta-cuts on the fuzzy similarity relation \tilde R to get equivalence relations. This yields a family of rough sets (\underline{apr}_{R_\beta}(F_\alpha), \overline{apr}_{R_\beta}(F_\alpha)), \alpha \in (0, 1], \beta \in (0, 1]. The combination may be interpreted in three different ways: as a family of rough sets, a family of rough fuzzy sets, or a family of fuzzy rough sets, depending upon how these rough sets are grouped. Irrespective of interpretation, generalized versions of Eqs. (3) and (4), as shown below, may be used to determine memberships in the fuzzy approximation sets:

\mu_{\underline{apr}_G(D)}(x) = \inf\{\max(\mu_D(y), 1 - \mu_G(x, y)) \mid y \in U\}   (5)

\mu_{\overline{apr}_G(D)}(x) = \sup\{\min(\mu_D(y), \mu_G(x, y)) \mid y \in U\}   (6)

The variable G can stand for either an equivalence relation or a fuzzy similarity relation. The variable D can stand for either a crisp or a fuzzy subset.

4. Combining rough and fuzzy set models for information retrieval

Yao's scheme for combining rough and fuzzy sets is important for us because it allows us toexplore the following situations:

. Fuzzy documents.

. Fuzzy queries.

. Fuzzy similarity relations for the vocabulary spaces.

4.1. Situation 1: Crisp vocabulary space and fuzzy vectors

This is an application of rough fuzzy sets, where the vocabulary is partitioned using an equivalence relation based on term synonymy. The document and/or query vector is a fuzzy vector. As described before, the approximations of fuzzy sets in this space will yield fuzzy sets.

Example 2. As in the previous example, let T = {t1, t2, ..., t10} represent a vocabulary partitioned by R, the synonymy relation, such that T/R = {C1, C2, C3, C4, C5}, as defined previously. Assume the fuzzy document FD2 and the fuzzy query FQ2 as defined below:

FD2 = {0.9/t1, 0.7/t4, 0.5/t3, 0.8/t9}
FQ2 = {0.5/t1, 0.2/t3, 0.3/t2, 0.5/t9}

Then, applying the same membership functions as in Eqs. (5) and (6), the fuzzy lower and upper approximations for FD2 may be derived:

\underline{apr}_R(FD2) = {0.8/t9}
\overline{apr}_R(FD2) = {0.9/t1, 0.9/t4, 0.9/t6, 0.5/t3, 0.5/t7, 0.8/t9}

Similarly,

\underline{apr}_R(FQ2) = {0.3/t2, 0.5/t9}
\overline{apr}_R(FQ2) = {0.5/t1, 0.5/t4, 0.5/t6, 0.2/t3, 0.2/t7, 0.3/t2, 0.5/t9}

We must modify Eqs. (1) and (2) slightly to involve \alpha-cuts for the fuzzy information to yield

\underline{Similarity}_R(S_1, S_2)_\alpha = 1 - card(B_l)_\alpha / card(\underline{apr}_R(S_2))_\alpha   (7)

This will equal 0 for no match and 1 for maximum match between S2 and S1, keeping S2 as the focus in the comparison. If card(\underline{apr}_R(S_2))_\alpha = 0 then \underline{Similarity}_R(S_1, S_2)_\alpha is set to equal 0. In the same way,

\overline{Similarity}_R(S_1, S_2)_\alpha = 1 - card(B_u)_\alpha / card(\overline{apr}_R(S_2))_\alpha   (8)

This will equal 0 for no match and 1 for maximum match between S2 and S1, keeping S2 as the focus in the comparison. If card(\overline{apr}_R(S_2))_\alpha = 0 then \overline{Similarity}_R(S_1, S_2)_\alpha is set to equal 0. Thus, we can compute:

B_l = \underline{apr}_R(FQ2) - (\underline{apr}_R(FD2) \cap \underline{apr}_R(FQ2)) = {0.3/t2}

Therefore, by Eq. (7), \underline{Similarity}_R(FD2, FQ2)_0 = 1 - 1/2 = 0.5 and \underline{Similarity}_R(FD2, FQ2)_{0.4} = 1.0. We also have

B_u = \overline{apr}_R(FQ2) - (\overline{apr}_R(FD2) \cap \overline{apr}_R(FQ2)) = {0.3/t2}

Therefore, by Eq. (8), \overline{Similarity}_R(FD2, FQ2)_0 = 1 - 1/7 = 0.86 and \overline{Similarity}_R(FD2, FQ2)_{0.4} = 1.0.
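Under the \alpha-cut convention used in the worked example (level 0 read as the support of the fuzzy set), the computation can be sketched as follows; all helper names are ours:

```python
def class_min_approx(partition, mu):
    """Lower approximation of a fuzzy set in a crisp partition: class-wise minimum."""
    out = {}
    for c in partition:
        m = min(mu.get(t, 0.0) for t in c)
        out.update({t: m for t in c})
    return {t: v for t, v in out.items() if v > 0.0}

def class_max_approx(partition, mu):
    """Upper approximation: class-wise maximum."""
    out = {}
    for c in partition:
        m = max(mu.get(t, 0.0) for t in c)
        out.update({t: m for t in c})
    return {t: v for t, v in out.items() if v > 0.0}

def cut(mu, alpha):
    """Alpha-cut; alpha = 0 is read as the support, matching the worked example."""
    return {t for t, v in mu.items() if v >= alpha and v > 0.0}

def similarity(partition, mu_d, mu_q, alpha, approx):
    """Eqs. (7)/(8): alpha-level similarity with the query mu_q as the focus."""
    aq = approx(partition, mu_q)
    ad = approx(partition, mu_d)
    inter = {t: min(ad.get(t, 0.0), v) for t, v in aq.items()}
    b = {t: max(v - inter[t], 0.0) for t, v in aq.items()}   # bounded difference
    denom = cut(aq, alpha)
    return 1.0 - len(cut(b, alpha)) / len(denom) if denom else 0.0
```

With the partition and vectors of Example 2, this reproduces the values in the text: 0.5 and 1.0 for the lower measure at \alpha = 0 and 0.4, and approximately 0.86 and 1.0 for the upper measure.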

4.2. Situation 2: Fuzzy vocabulary space and fuzzy document/query vectors

We now introduce a fuzzy approximation space. This is analogous to Yao's fuzzy rough set model. A fuzzy approximation space is created by a fuzzy similarity relation \tilde R, which has the following properties:

reflexive: \mu_{\tilde R}(x, x) = 1 for x \in U

symmetric: \mu_{\tilde R}(x, y) = \mu_{\tilde R}(y, x) for x, y \in U

transitive: \mu_{\tilde R}(x, z) \geq \min(\mu_{\tilde R}(x, y), \mu_{\tilde R}(y, z)) for x, y, z \in U
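A relation supplied as data can be screened for these three properties before being used to build an approximation space. A minimal checker (our own sketch, not from the paper):

```python
def is_fuzzy_similarity(mu, universe, tol=1e-9):
    """True iff mu is reflexive, symmetric and max-min transitive on the universe."""
    for x in universe:
        if abs(mu(x, x) - 1.0) > tol:              # reflexivity
            return False
        for y in universe:
            if abs(mu(x, y) - mu(y, x)) > tol:     # symmetry
                return False
            for z in universe:
                if mu(x, z) + tol < min(mu(x, y), mu(y, z)):  # max-min transitivity
                    return False
    return True
```

The cubic loop is adequate for small vocabulary fragments; for a full vocabulary one would restrict the transitivity check to each term's non-zero neighborhood.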

Example 3. This is the most general case, where the synonymy relation is fuzzy and the vectors are also fuzzy. Let:

[t1]_{\tilde R2} = {1/t1, 0.9/t4, 0.4/t6}
[t2]_{\tilde R2} = {1/t2, 0.8/t3, 0.9/t5}
[t3]_{\tilde R2} = {0.8/t2, 1/t3, 0.8/t5}
[t4]_{\tilde R2} = {0.9/t1, 1/t4, 0.4/t6}
[t5]_{\tilde R2} = {0.9/t2, 0.8/t3, 1/t5}
[t6]_{\tilde R2} = {0.4/t1, 0.4/t4, 1/t6}
[t7]_{\tilde R2} = {1/t7, 0.6/t10}
[t8]_{\tilde R2} = {1/t8}
[t9]_{\tilde R2} = {1/t9}
[t10]_{\tilde R2} = {0.6/t7, 1/t10}

where [ti]_{\tilde R2} represents the fuzzy set of terms similar to term ti. Let:

FD2 = {0.9/t1, 0.7/t4, 0.5/t3, 0.8/t9}
FQ2 = {0.5/t1, 0.2/t3, 0.3/t2, 0.5/t9}

Thus, the fuzzy approximations in this fuzzy approximation space are:

\underline{apr}_{\tilde R2}(FD2) = {0.6/t1, 0.2/t3, 0.6/t4, 0.8/t9}
\overline{apr}_{\tilde R2}(FD2) = {0.9/t1, 0.5/t2, 0.5/t3, 0.9/t4, 0.5/t5, 0.4/t6, 0.8/t9}
\underline{apr}_{\tilde R2}(FQ2) = {0.1/t1, 0.1/t2, 0.2/t3, 0.5/t9}
\overline{apr}_{\tilde R2}(FQ2) = {0.5/t1, 0.3/t2, 0.3/t3, 0.5/t4, 0.3/t5, 0.4/t6, 0.5/t9}

Since

B_l = {0.1/t2}
B_u = \emptyset

we have

\underline{Similarity}_{\tilde R2}(FD2, FQ2)_0 = 1 - 1/4 = 0.75
\underline{Similarity}_{\tilde R2}(FD2, FQ2)_{0.4} = 1 - 0/1 = 1.0
\overline{Similarity}_{\tilde R2}(FD2, FQ2)_0 = 1 - 0/7 = 1.0
\overline{Similarity}_{\tilde R2}(FD2, FQ2)_{0.4} = 1 - 0/4 = 1.0
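The approximation vectors in Example 3 follow from Eqs. (5) and (6). A sketch that encodes the relation row-wise (variable names ours):

```python
U = ["t%d" % i for i in range(1, 11)]

# Fuzzy similarity relation of Example 3, stored row-wise; omitted pairs have degree 0.
R = {
    "t1":  {"t1": 1.0, "t4": 0.9, "t6": 0.4},
    "t2":  {"t2": 1.0, "t3": 0.8, "t5": 0.9},
    "t3":  {"t2": 0.8, "t3": 1.0, "t5": 0.8},
    "t4":  {"t1": 0.9, "t4": 1.0, "t6": 0.4},
    "t5":  {"t2": 0.9, "t3": 0.8, "t5": 1.0},
    "t6":  {"t1": 0.4, "t4": 0.4, "t6": 1.0},
    "t7":  {"t7": 1.0, "t10": 0.6},
    "t8":  {"t8": 1.0},
    "t9":  {"t9": 1.0},
    "t10": {"t7": 0.6, "t10": 1.0},
}

def mu_R(x, y):
    return R[x].get(y, 0.0)

def lower(mu_D):
    """Eq. (5): inf over y of max(mu_D(y), 1 - mu_R(x, y))."""
    return {x: min(max(mu_D.get(y, 0.0), 1.0 - mu_R(x, y)) for y in U) for x in U}

def upper(mu_D):
    """Eq. (6): sup over y of min(mu_D(y), mu_R(x, y))."""
    return {x: max(min(mu_D.get(y, 0.0), mu_R(x, y)) for y in U) for x in U}
```

Running this on FD2 and FQ2 reproduces the four approximation vectors listed above (up to floating-point rounding); for instance, the lower approximation of FD2 assigns t1 the degree 1 - 0.4 = 0.6, forced by t6's zero membership in FD2.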

The two situations described above represent the inclusion of different levels of fuzzy information into the rough set-based information retrieval model. In each case we can consistently handle the combinations of crisp and fuzzy information. The main advantage gained is in the modeling power. Fuzzy weights offer a reasonably straightforward means to represent the varying degrees of association between information objects and vocabulary terms, and also among vocabulary terms. Thus, these fuzzy extensions to the rough set approach make the information retrieval model more realistic.

4.2.1. Further extensions to the model

The inclusion of fuzzy notions within the rough set framework does indeed offer added realism when modeling vocabularies and making approximate comparisons between documents and queries for information retrieval. However, certain limitations make this a less than perfect solution. Many of the interesting vocabulary-based relationships are not equivalence or fuzzy similarity relations. Relations identifying a term's `specific-', `lexically-' and `statistically-related' terms have different properties. Nevertheless, these are important relations and we would like to use them for information retrieval.

We also wish to allow multiple vocabulary relations to collaboratively support information retrieval. In other words, if each relation is regarded as a distinct view of the vocabulary, then it is important to be able to (optionally) consider multiple views while conducting information retrieval. This corresponds to many realistic situations where a search session may involve synonyms, specific terms, lexically-related terms, etc.

5. Extensions to the rough set model

In recent work various extensions to Pawlak's RSM have been proposed. The followingsections examine some of these extensions and their potential in achieving the goals mentionedabove.

5.1. Generalized rough set model

Some of the most interesting extensions to Pawlak's rough sets, from the retrieval perspective, are those that substitute the equivalence relation with more general binary relations (Lin, 1989; Lingras & Yao, 1998; Yao et al., 1994; Zakowski, 1983). Such substitutions are motivated by the fact that the requirement of equivalence may be too restrictive for certain applications (such as information retrieval). Hence, a number of alternatives have been proposed, such as compatibility relations (Zakowski, 1983), which are also the basis for neighborhood systems (Lin, 1989). This approach has, for example, been used for data mining from incomplete databases (Lingras & Yao, 1998). The properties of the underlying relation of the rough set model are significant, since they determine the properties of the two approximation operators. For example, with compatibility relations, which are only reflexive and symmetric, the approximation operators do not satisfy properties L8, L9, U8 and U9.

The recently proposed (Yao et al., 1994) general classification of rough set models allows us to study these models based on the properties of the underlying binary relation. This classification considers various types of relations besides equivalence and compatibility relations by drawing parallels between rough sets and modal logic. Just as a modal logic system is an extension of propositional logic with two modal operators (the necessity and possibility operators), rough sets may be viewed as an extension of standard set theory with the additional lower and upper approximation operators. Properties of the approximation operators in a rough set model (listed before) relate to the axioms of the modal operators in a modal logic system. Properties L1–L5, U1–U5 and K are satisfied independent of the type of binary relation. The remaining properties LU, L6–L9 and U6–U9, which depend upon the characteristics of the binary relation, are used to classify the various rough set models. For example, with a serial binary relation, property LU holds. LU in combination with L2 and U2 yields L6 and U6. With a transitive relation, property L8 holds. In general, any subset of these properties may define a class of rough set models. Given that the properties are not independent, and using results from modal logic, fifteen different classes of rough set models are constructed. Pawlak's rough set model, based on equivalence, is the strongest one, while the weakest rough set model does not require any of the additional properties to hold. The advantage offered by general binary relations is the more flexible application of the rough set approach.

Irrespective of the type of binary relation underlying the rough set model, we may compute

lower and upper approximations using the general scheme shown below. Consider an arbitrarybinary relation R1 on U. That is, aR1b implies that b is R1-related to a. Thus, R1 may be usedto create a class of R1-related terms for a given term as for the term a below

$R_1(a) = \{x \in U \mid a R_1 x\}$

Given classes defined by such binary relations on the universe, we can define, for a subset $S \subseteq U$,

$\underline{apr}_{R_1}(S) = \{x \mid R_1(x) \subseteq S\}$

$\overline{apr}_{R_1}(S) = \{x \mid R_1(x) \cap S \neq \emptyset\}$

The set $\underline{apr}_{R_1}(S)$ consists of all those elements whose R1-related elements are all in S. The set $\overline{apr}_{R_1}(S)$ consists of those elements such that at least one R1-related element is in S. The pair $(\underline{apr}_{R_1}(S), \overline{apr}_{R_1}(S))$ is referred to as the generalized rough set induced by R1 with reference set S. In the case where R1 is an equivalence relation, one gets the standard rough set model. Thus, we see that subsets in the universe may be described using these two approximations even with more general binary relations.
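To make the scheme concrete, the two operators can be sketched in a few lines of Python. The relation, universe and target set below are invented for illustration, not taken from the paper; the code follows the set definitions just given, so an element with an empty R1-class falls vacuously into the lower approximation, one symptom of the weaker properties that hold for general binary relations.

```python
def r_class(R, x):
    """The class of R-related elements for x, with R given as a set of (a, b) pairs."""
    return {b for (a, b) in R if a == x}

def lower_apr(R, S, universe):
    """Elements x whose entire class R(x) lies inside S."""
    return {x for x in universe if r_class(R, x) <= S}

def upper_apr(R, S, universe):
    """Elements x with at least one R-related element in S."""
    return {x for x in universe if r_class(R, x) & S}

# A hypothetical non-serial, non-symmetric relation on U = {1, 2, 3, 4}.
U = {1, 2, 3, 4}
R = {(1, 2), (1, 3), (2, 3), (3, 4)}   # element 4 has an empty class
S = {2, 3}

lo, up = lower_apr(R, S, U), upper_apr(R, S, U)
print(lo, up)
```

Note that with this non-serial relation the lower approximation is not contained in the upper one, mirroring the failure of property LU discussed above.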

5.2. The generalized rough set model and information retrieval

In information retrieval there are several vocabulary-based relations of interest, e.g., specific term, general term, lexically-related term, and statistically-related term. The `specific term' relation is not serial, reflexive, symmetric or Euclidean, but only transitive. Similar features hold for the general term relationship. In contrast, the lexically related term relationship is also symmetric. Thus, we are faced with alternative binary relations that exhibit varying properties. The generalized rough set model is, therefore, highly relevant and immediately yields a more flexible modeling approach. We now illustrate the application of the generalized rough set model to information retrieval. We continue to consider fuzzy document and query descriptions in this analysis.

5.3. Situation 3: Generalized binary relations and fuzzy vectors

Example 4. Assume the R1 (specific term) relation as given below. The interpretation is that t4, t6, t7, t8 and t10 define the set of specific terms for t1. Notice that t10 is a specific term for t6, t8 and t9. Notice also that not all terms have specific terms in the vocabulary space. Finally, as expected, these relationships implicitly define the hierarchical connections between terms. Since, in theory, each term may have its own class of specific terms, there could be almost as many `specific term' classes as there are terms in the vocabulary:

$R_1(t1) = \{t4, t6, t7, t8, t10\}$

$R_1(t2) = \{t3, t5, t7, t9, t10\}$

$R_1(t3) = \{t7, t9, t10\}$

$R_1(t4) = \{t7, t8, t10\}$

$R_1(t5) = \{t9, t10\}$

$R_1(t6) = \{t10\}$

$R_1(t8) = \{t10\}$

$R_1(t9) = \{t10\}$

Now, given a fuzzy document and query defined on the same universe of terms,

$FD3 = \{0.3/t1, 0.4/t3, 0.5/t9, 0.2/t10\}$

$FQ3 = \{0.2/t3, 0.4/t4, 0.5/t7\}$

we can compute the lower and upper approximations for the document and query using the same method as depicted in Eqs. (3) and (4):


$\underline{apr}_{R_1}(FD3) = \{0.2/t5, 0.2/t6, 0.2/t8, 0.2/t9\}$

$\overline{apr}_{R_1}(FD3) = \{0.2/t1, 0.5/t2, 0.5/t3, 0.2/t4, 0.2/t6, 0.2/t8, 0.2/t9\}$

$\underline{apr}_{R_1}(FQ3) = \{\}$

$\overline{apr}_{R_1}(FQ3) = \{0.5/t1, 0.5/t2, 0.5/t3, 0.5/t4\}$

The following interpretations may be given to these approximations. From the point of view of R1 (the specific term relationship on the vocabulary), the lower approximation identifies t5, t6, t8 and t9 as representing FD3, with a fuzzy membership value of 0.2 for each term. Thus, the specific term relationship provides a particular view of the vocabulary that offers two alternatives for representing the document (and the query). The lower approximation identifies weighted terms that are definitely recommended, and the upper approximation identifies weighted terms that are possibly recommended by the vocabulary view. Now, similarity between the query and the document may be computed using Eqs. (7) and (8):

$\underline{Similarity}_{R_1}(FD3, FQ3)_0 = 0$

$\overline{Similarity}_{R_1}(FD3, FQ3)_0 = 0.5$

$\overline{Similarity}_{R_1}(FD3, FQ3)_{0.4} = 1$
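The approximations in Example 4 can be reproduced mechanically. The sketch below assumes that Eqs. (3) and (4), which are defined earlier in the paper, take the minimum and maximum membership in the fuzzy set over each term's R1-class (the usual rough-fuzzy construction), dropping terms with an empty class or a zero result. Under that reading, the lower approximations above are reproduced exactly, and the upper approximation of FD3 additionally carries the entry 0.5/t5.

```python
# R1 (specific term) classes and the fuzzy document/query from Example 4.
R1 = {
    "t1": {"t4", "t6", "t7", "t8", "t10"},
    "t2": {"t3", "t5", "t7", "t9", "t10"},
    "t3": {"t7", "t9", "t10"},
    "t4": {"t7", "t8", "t10"},
    "t5": {"t9", "t10"},
    "t6": {"t10"},
    "t8": {"t10"},
    "t9": {"t10"},
}
FD3 = {"t1": 0.3, "t3": 0.4, "t9": 0.5, "t10": 0.2}
FQ3 = {"t3": 0.2, "t4": 0.4, "t7": 0.5}

def fuzzy_lower(R, F):
    """Assumed Eq. (3): least membership in F over each term's class; zeros dropped."""
    out = {x: min(F.get(y, 0.0) for y in cls) for x, cls in R.items()}
    return {x: m for x, m in out.items() if m > 0}

def fuzzy_upper(R, F):
    """Assumed Eq. (4): greatest membership in F over each term's class; zeros dropped."""
    out = {x: max(F.get(y, 0.0) for y in cls) for x, cls in R.items()}
    return {x: m for x, m in out.items() if m > 0}

print(fuzzy_lower(R1, FD3))   # {'t5': 0.2, 't6': 0.2, 't8': 0.2, 't9': 0.2}
print(fuzzy_lower(R1, FQ3))   # {} -- the empty lower approximation above
```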

6. Variable precision rough sets

Another interesting extension to the rough set model is the variable precision rough set model (Wong & Ziarko, 1987; Ziarko, 1993). In Pawlak's standard model, an element belongs to the lower approximation of a set S if all its related elements belong to S. For the upper approximation, at least one of its related elements should be in S. In graded rough sets (Yao & Wong, 1992), the degree of overlap is considered. For some n, the following are defined with graded rough sets:

$\underline{apr}_{R,n}(S) = \{x \mid |R(x)| - |S \cap R(x)| \le n\}$

$\overline{apr}_{R,n}(S) = \{x \mid |S \cap R(x)| > n\}$

Thus, x belongs to $\underline{apr}_{R,n}(S)$ if at most n members of R(x) are not in S, and it belongs to $\overline{apr}_{R,n}(S)$ if more than n members of R(x) are in S. This gives us a family of graded approximation operators by simply varying n. Variable precision rough sets (Wong & Ziarko, 1987; Ziarko, 1993) offer a probabilistic approach to rough sets by extending the idea of graded rough sets. In essence, the size of R(x) is also considered. Thus, we now have

$\underline{apr}_{R,\gamma}(S) = \{x \mid |S \cap R(x)| \, / \, |R(x)| \ge 1 - \gamma\}$

$\overline{apr}_{R,\gamma}(S) = \{x \mid |S \cap R(x)| \, / \, |R(x)| \ge \gamma\}$

where $\gamma \in [0, 1)$. Variable precision rough sets in essence smooth the nature of the approximations. This extension is important for information retrieval because of the different types of vocabulary relations one may encounter. It is possible that the appropriate value for $\gamma$ is relation-dependent. The optimal $\gamma$ for a relation that is somewhat loosely defined, i.e., that yields large classes, is perhaps different from that for a relation that is very tightly defined, such as synonymy. Further insights will be gained empirically.

Given our information retrieval context with fuzzy document and query vectors, the above equations are modified slightly. For a fuzzy set F and a binary relation R, defined on the universe which also contains element x, we have:

$\mu_{\underline{apr}_{R,\gamma}(F)}(x) = \sup \{\beta \mid |F_\beta \cap R(x)| \, / \, |R(x)| \ge 1 - \gamma\} \quad (9)$

$\mu_{\overline{apr}_{R,\gamma}(F)}(x) = \sup \{\beta \mid |F_\beta \cap R(x)| \, / \, |R(x)| \ge \gamma\} \quad (10)$

where $\beta$ represents the largest membership threshold value that allows the $\beta$-cut on F, $F_\beta$, to satisfy the given condition. Thus, $\gamma$ sets two thresholds on the membership function. By setting $\gamma \in [0, 0.5)$, we can ensure that the threshold for the lower approximation is higher than the threshold for the upper approximation.
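For the crisp case, the graded and variable precision operators reduce to simple counting. The sketch below uses a made-up relation and reference set, and reads both variable precision conditions as "at least" comparisons; it illustrates how n and $\gamma$ loosen the standard definitions.

```python
def graded_lower(R, S, n):
    """x qualifies if at most n members of R(x) fall outside S."""
    return {x for x, cls in R.items() if len(cls - S) <= n}

def graded_upper(R, S, n):
    """x qualifies if more than n members of R(x) fall inside S."""
    return {x for x, cls in R.items() if len(cls & S) > n}

def vp_lower(R, S, g):
    """Majority inclusion: |S intersect R(x)| / |R(x)| >= 1 - g."""
    return {x for x, cls in R.items() if cls and len(cls & S) / len(cls) >= 1 - g}

def vp_upper(R, S, g):
    """Partial overlap: |S intersect R(x)| / |R(x)| >= g."""
    return {x for x, cls in R.items() if cls and len(cls & S) / len(cls) >= g}

# Hypothetical classes and reference set.
R = {"a": {"p", "q", "r", "s"}, "b": {"p", "q"}, "c": {"r", "s"}}
S = {"p", "q", "r"}

# n = 0 recovers the standard model; n = 1 tolerates one stray element per class.
print(graded_lower(R, S, 0), graded_lower(R, S, 1))
print(vp_lower(R, S, 0.3), vp_upper(R, S, 0.3))
```

With $\gamma = 0.3$, element "a" (overlap ratio 0.75) passes the lower threshold of 0.7, which the strict Pawlak lower approximation would reject.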

6.1. Variable precision rough sets and information retrieval

Example 5. Assume the same information given in Example 4. In addition, let us set $\gamma = 0.3$. Thus, the threshold for the lower approximation is 0.7 and that for the upper approximation is 0.3. Then, using Eqs. (9) and (10):

$\underline{apr}_{R_1,\gamma}(FD3) = \{0.2/t3, 0.5/t5, 0.2/t6, 0.2/t8, 0.2/t9\}$

$\overline{apr}_{R_1,\gamma}(FD3) = \{0.4/t1, 0.2/t3, 0.5/t5, 0.2/t6, 0.2/t8, 0.2/t9\}$

$\underline{apr}_{R_1,\gamma}(FQ3) = \{\}$


$\overline{apr}_{R_1,\gamma}(FQ3) = \{0.4/t1, 0.2/t2, 0.5/t3, 0.5/t4\}$

Thus,

$B_l = \{\}$

$B_u = \{0.2/t2, 0.5/t4\}$

$\underline{Similarity}_{R_1}(FD3, FQ3)_0 = 0$

$\overline{Similarity}_{R_1}(FD3, FQ3)_0 = 1 - 2/4 = 0.5$
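Eqs. (9) and (10) can be sketched directly from their definitions: scan candidate $\beta$ levels (the distinct membership values of F) and keep the largest level whose $\beta$-cut meets the ratio test. The fuzzy set and relation below are invented for illustration rather than taken from Example 5.

```python
def beta_cut(F, beta):
    """Crisp set of elements whose membership in F is at least beta."""
    return {t for t, m in F.items() if m >= beta}

def fvp_apr(R, F, threshold):
    """Per element x: sup of beta with |F_beta & R(x)| / |R(x)| >= threshold."""
    levels = sorted(set(F.values()))
    out = {}
    for x, cls in R.items():
        if not cls:
            continue
        ok = [b for b in levels if len(beta_cut(F, b) & cls) / len(cls) >= threshold]
        if ok:
            out[x] = max(ok)
    return out

def fvp_lower(R, F, g):
    return fvp_apr(R, F, 1 - g)   # Eq. (9)

def fvp_upper(R, F, g):
    return fvp_apr(R, F, g)       # Eq. (10)

# Hypothetical data: one class mixing a strongly and a weakly weighted term.
F = {"a": 0.8, "b": 0.4}
R = {"c": {"a", "b"}, "d": {"b"}}
print(fvp_lower(R, F, 0.3), fvp_upper(R, F, 0.3))
```

For element "c", the lower approximation settles at 0.4 (only the 0.4-cut covers at least 70% of its class) while the upper rises to 0.8, showing how $\gamma$ separates the two thresholds.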

7. Multiple vocabulary views for information retrieval

The previous section considers a vocabulary with a single view, i.e., the view offered by the specific term relation. We now consider multiple views of the same vocabulary working in concert. This is important because an information retrieval vocabulary system expresses different kinds of relationships, e.g., more specific, more general, synonymy, and statistically-related. Any and all of these relationships may be relevant for a given retrieval goal. For instance, when searching for `data mining' literature one may use both the general synonym `knowledge discovery' and the specific search terms `rough sets' and `ID3'. This suggests that a modeling approach that allows multiple views of the vocabulary to collaborate in suggesting important search terms is worth investigating. To simplify the analysis, we consider two distinct views on the given vocabulary, as defined by two distinct binary relations on the same universe of terms. The extension to more than two relations is straightforward.

Example 6. Consider the binary relation R1 as representing the `specific term' relationship and R2 as representing the `lexical variant' relationship. Now, given a fuzzy document and query defined on the same universe U of terms,

$FD4 = \{0.9/t1, 0.5/t3, 0.2/t4, 0.8/t6, 0.5/t9\}$

$FQ4 = \{0.5/t1, 0.3/t2, 0.7/t3, 0.9/t9\}$

assume that with R1 and $\gamma_1$ we have

$\underline{apr}_{R_1\gamma_1}(FD4) = \{0.2/t1, 0.2/t2, 0.5/t4\}$

$\overline{apr}_{R_1\gamma_1}(FD4) = \{0.5/t1, 0.5/t2, 0.5/t4\}$

$\underline{apr}_{R_1\gamma_1}(FQ4) = \{0.7/t1, 0.2/t2, 0.9/t4\}$


$\overline{apr}_{R_1\gamma_1}(FQ4) = \{0.8/t1, 0.5/t2, 0.9/t4, 0.3/t5\}$

Also assume that with R2 and $\gamma_2$ we have

$\underline{apr}_{R_2\gamma_2}(FD4) = \{0.2/t2, 0.5/t3, 0.5/t5\}$

$\overline{apr}_{R_2\gamma_2}(FD4) = \{0.8/t2, 0.5/t3, 0.5/t5\}$

$\underline{apr}_{R_2\gamma_2}(FQ4) = \{0.5/t3, 0.7/t5\}$

$\overline{apr}_{R_2\gamma_2}(FQ4) = \{0.5/t3, 0.7/t5\}$

Notice that each vocabulary view has its own optimal value for $\gamma$, allowing us to fine tune the variable precision model for each relation. (However, it is also possible that a single common value for $\gamma$ works optimally across all relations.) Each view makes its own recommendations regarding the weighted terms that definitely and possibly represent the query (or document). Thus, the specific term view recommends 0.2/t1, 0.2/t2, 0.5/t4, while the lexical relation view recommends 0.2/t2, 0.5/t3, 0.5/t5 as definitely describing the document. This analysis suggests that there are two alternative methods for combining the terms suggested by the different views. We can be highly selective and retain only those terms that are suggested by both views. Alternatively, we may select terms suggested by either view. Thus, the fuzzy AND and the fuzzy OR may be appropriate for the two options, respectively. Assuming that both views must effect selection we have

$\underline{apr}_{R_1\gamma_1\,\mathrm{AND}\,R_2\gamma_2}(FD4) = \underline{apr}_{R_1\gamma_1}(FD4) \cap \underline{apr}_{R_2\gamma_2}(FD4) = \{0.2/t2\}$

$\overline{apr}_{R_1\gamma_1\,\mathrm{AND}\,R_2\gamma_2}(FD4) = \overline{apr}_{R_1\gamma_1}(FD4) \cap \overline{apr}_{R_2\gamma_2}(FD4) = \{0.5/t2\}$

$\underline{apr}_{R_1\gamma_1\,\mathrm{AND}\,R_2\gamma_2}(FQ4) = \underline{apr}_{R_1\gamma_1}(FQ4) \cap \underline{apr}_{R_2\gamma_2}(FQ4) = \{\}$

$\overline{apr}_{R_1\gamma_1\,\mathrm{AND}\,R_2\gamma_2}(FQ4) = \overline{apr}_{R_1\gamma_1}(FQ4) \cap \overline{apr}_{R_2\gamma_2}(FQ4) = \{0.3/t5\}$

Assuming that either view may e�ect selection we have

$\underline{apr}_{R_1\gamma_1\,\mathrm{OR}\,R_2\gamma_2}(FD4) = \underline{apr}_{R_1\gamma_1}(FD4) \cup \underline{apr}_{R_2\gamma_2}(FD4) = \{0.2/t1, 0.2/t2, 0.5/t3, 0.5/t4, 0.5/t5\}$

$\overline{apr}_{R_1\gamma_1\,\mathrm{OR}\,R_2\gamma_2}(FD4) = \overline{apr}_{R_1\gamma_1}(FD4) \cup \overline{apr}_{R_2\gamma_2}(FD4) = \{0.5/t1, 0.8/t2, 0.5/t3, 0.5/t4, 0.5/t5\}$

$\underline{apr}_{R_1\gamma_1\,\mathrm{OR}\,R_2\gamma_2}(FQ4) = \underline{apr}_{R_1\gamma_1}(FQ4) \cup \underline{apr}_{R_2\gamma_2}(FQ4) = \{0.7/t1, 0.2/t2, 0.5/t3, 0.9/t4, 0.7/t5\}$

$\overline{apr}_{R_1\gamma_1\,\mathrm{OR}\,R_2\gamma_2}(FQ4) = \overline{apr}_{R_1\gamma_1}(FQ4) \cup \overline{apr}_{R_2\gamma_2}(FQ4) = \{0.8/t1, 0.5/t2, 0.5/t3, 0.9/t4, 0.7/t5\}$
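The AND and OR combinations used here are the standard fuzzy intersection (minimum membership over the common terms) and fuzzy union (maximum membership over all terms). A short sketch, checked against the lower approximations of FD4 in Example 6:

```python
def fuzzy_and(F, G):
    """Fuzzy intersection: min membership, over terms present in both views."""
    return {t: min(F[t], G[t]) for t in F.keys() & G.keys()}

def fuzzy_or(F, G):
    """Fuzzy union: max membership, over terms present in either view."""
    return {t: max(F.get(t, 0.0), G.get(t, 0.0)) for t in F.keys() | G.keys()}

# Lower approximations of FD4 under the two views (from Example 6).
lower_r1 = {"t1": 0.2, "t2": 0.2, "t4": 0.5}
lower_r2 = {"t2": 0.2, "t3": 0.5, "t5": 0.5}

print(fuzzy_and(lower_r1, lower_r2))   # both views must recommend a term
print(fuzzy_or(lower_r1, lower_r2))    # either view may recommend a term
```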


Notice how the original query representation has the term t9, which does not appear even in the (more optimistic) upper approximations of either view. This is because the approximation is generated with the particular vocabulary view. Thus, utilizing the two vocabulary views alone for retrieval risks loss of original terms. To solve this problem, we may treat the original representations as the `original' (Ro) view of these objects. These may then be combined with the other views using the AND or OR operations. Thus:

$\underline{apr}_{R_1\gamma_1\,\mathrm{AND}\,R_2\gamma_2\,\mathrm{AND}\,R_o}(FD4) = \{\}$

$\overline{apr}_{R_1\gamma_1\,\mathrm{AND}\,R_2\gamma_2\,\mathrm{AND}\,R_o}(FD4) = \{\}$

$\underline{apr}_{R_1\gamma_1\,\mathrm{AND}\,R_2\gamma_2\,\mathrm{AND}\,R_o}(FQ4) = \{\}$

$\overline{apr}_{R_1\gamma_1\,\mathrm{AND}\,R_2\gamma_2\,\mathrm{AND}\,R_o}(FQ4) = \{\}$

and

$\underline{apr}_{R_1\gamma_1\,\mathrm{OR}\,R_2\gamma_2\,\mathrm{OR}\,R_o}(FD4) = \{0.9/t1, 0.2/t2, 0.5/t3, 0.5/t4, 0.5/t5, 0.8/t6, 0.5/t9\}$

$\overline{apr}_{R_1\gamma_1\,\mathrm{OR}\,R_2\gamma_2\,\mathrm{OR}\,R_o}(FD4) = \{0.9/t1, 0.8/t2, 0.5/t3, 0.5/t4, 0.5/t5, 0.8/t6, 0.5/t9\}$

$\underline{apr}_{R_1\gamma_1\,\mathrm{OR}\,R_2\gamma_2\,\mathrm{OR}\,R_o}(FQ4) = \{0.7/t1, 0.3/t2, 0.7/t3, 0.9/t4, 0.7/t5, 0.9/t9\}$

$\overline{apr}_{R_1\gamma_1\,\mathrm{OR}\,R_2\gamma_2\,\mathrm{OR}\,R_o}(FQ4) = \{0.8/t1, 0.5/t2, 0.7/t3, 0.9/t4, 0.7/t5, 0.9/t9\}$

Thus, we have

$\underline{Similarity}_{R_1\,\mathrm{AND}\,R_2\,\mathrm{AND}\,R_o}(FD4, FQ4)_0 = 0$

$\overline{Similarity}_{R_1\,\mathrm{AND}\,R_2\,\mathrm{AND}\,R_o}(FD4, FQ4)_0 = 0$

$\underline{Similarity}_{R_1\,\mathrm{OR}\,R_2\,\mathrm{OR}\,R_o}(FD4, FQ4)_0 = 1 - 5/6 = 0.17$

$\overline{Similarity}_{R_1\,\mathrm{OR}\,R_2\,\mathrm{OR}\,R_o}(FD4, FQ4)_0 = 1 - 4/6 = 0.33$

Thus, multiple views of the same vocabulary offer alternative approaches for term selection. These views may operate in concert to yield either the common denominator representation or the union representation. However, it is evident from the simple example that as the number of views increases, the AND operation is likely to become overly restrictive. This is in fact to be expected, since the views are really quite different from each other. It does not make much sense to expect terms to be both specific terms and lexically related to the query terms.


Hence, the OR operator is more suitable given the nature of interesting relations in the information retrieval domain. Another alternative to consider is to apply the different views in sequence, with each step offering an OR-ed combination of the current view and the previous step's final representation. Thus, this modeling approach allows significant flexibility when combining the various binary relations that may be observed in a vocabulary scheme.
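The sequential alternative just described, folding each view's recommendation into the running representation with fuzzy OR, amounts to a simple loop. Starting from the original (Ro) representation of FD4 and OR-ing in the two upper approximations from Example 6 reproduces the combined upper representation computed earlier.

```python
def fuzzy_or(F, G):
    """Fuzzy union: max membership over terms present in either set."""
    return {t: max(F.get(t, 0.0), G.get(t, 0.0)) for t in set(F) | set(G)}

def combine_in_sequence(original, view_recommendations):
    """Fold each view's weighted-term recommendation into the running representation."""
    rep = dict(original)
    for rec in view_recommendations:
        rep = fuzzy_or(rep, rec)
    return rep

FD4 = {"t1": 0.9, "t3": 0.5, "t4": 0.2, "t6": 0.8, "t9": 0.5}
upper_r1 = {"t1": 0.5, "t2": 0.5, "t4": 0.5}   # upper approximation under R1, gamma1
upper_r2 = {"t2": 0.8, "t3": 0.5, "t5": 0.5}   # upper approximation under R2, gamma2

print(combine_in_sequence(FD4, [upper_r1, upper_r2]))
```

Because fuzzy OR is associative and commutative, the sequential fold gives the same result as combining all views at once; the sequential form simply lets each view be applied, and possibly tuned, one step at a time.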

8. Preliminary analysis of the UMLS metathesaurus

We now show the relevance of the vocabulary mining framework established in previous sections by examining some of the properties of a real world vocabulary system used for information retrieval. We will see how such a vocabulary offers a variety of relations with different properties. These relations may be utilized for information retrieval either independently or in particular combinations. Our framework offers the ability to research retrieval effectiveness under different conditions.

The 1998 edition of the UMLS Metathesaurus (National Library of Medicine, 1998) is an integration of more than 40 healthcare-related vocabularies. It contains 476,313 concepts. For each concept the Metathesaurus presents a wide variety of information, such as definition, synonyms and parts of speech. In this section we focus on eight different types of relations that may be gleaned for the UMLS concepts, as described in Table 1. Column 1 identifies each relation. Columns 3 and 4 provide information pertaining to the related terms for a single Metathesaurus concept, 1,2-dipalmitoylphosphatidylcholine, represented here as C1. Thus, there are nine synonyms identified for the concept, with three examples given.

The ancestor relation is derived from the UMLS component vocabularies that are hierarchically organized with more general/specific concept links¹. The parent term relation is a subset of the ancestor term relation. R5 represents the allowable qualifying terms, i.e., terms that add further specificity to the semantics of the qualified term. R7 represents related concepts (other than synonymy, ancestor, child, parent, sibling). R8 represents the co-occurring concept relation, which has the largest number of entries (380) for our example concept. The Metathesaurus classifies these co-occurrences further into five different classes, such as `co-occurrence of primary or main subject headings in citations to the published literature'. We ignore these classes in this example. However, it should be noted that each co-occurrence class may be regarded as a separate relation. This example shows that the UMLS offers many relations besides synonymy that are potentially useful for information retrieval. The vocabulary mining framework defined here allows one to systematically study retrieval effectiveness using these relations either independently or in various combinations.

These eight binary relations differ in several respects. For instance, R1, the synonymy relation, is an equivalence relation with all its concomitant properties. R2, the ancestor relation, and R6, the narrower relation, are transitive, but not serial, reflexive or symmetric. R3, the parent relation, and R5, qualifying terms, have none of these properties. R4, sibling, and R7, related

¹ The numbers indicate the hierarchical distance from the concept in focus; the smaller the number, the greater the distance in the source vocabulary.


Table 1
Eight UMLS-based relations

ID   Type               Set size   Sample entries for UMLS concept 1,2-dipalmitoylphosphatidylcholine
R1   Synonym            9          1,2-dipalmitoyl-glycerophosphocholine, dipalmitoylphosphatidylcholine, dipalmitoyllecithin
R2   Ancestor term      14         Phospholipids – 5, glycerophosphates – 6, phosphatidic acids – 7, phosphatidylcholines – 8
R3   Parent term        3          Phosphatidylcholines
R4   Sibling term       1          Dimyristoylphosphatidylcholine
R5   Qualifier term     31         Administration and dosage, adverse effects, analogs and derivatives
R6   Narrower term      6          1,3-DG-2-P, 1,3-dipalmitoyl-glycero-2-phosphocholine, colfosceril palmitate
R7   Related term       1          Dipalmitoylphosphatidyl: mass concentration: point in time: serum: quantitative
R8   Co-occurring term  380        Acetophenones, acids, alcohols, laurates


term relations are symmetric, while R8, the co-occurrence relation, may be symmetric (depending upon the definition of co-occurrence implemented in the UMLS²). Not only do R4 (and R7) differ from R8 in semantics, but they also differ significantly in frequencies. By including generalized relations, the proposed framework allows one to mine these different relations either individually or in combination.

There are other differences between the relations. For example, the co-occurrence relation

tends to yield many more related terms than R4, the sibling relation. Thus, the question arises: how does one combine the multiple vocabulary views in such a way that the combination remains somewhat neutral to significant differences in class size across relations? Relations with 10 entries in their classes and relations with more than 300 entries on average should be able to collaborate, if necessary, during retrieval. The proposed vocabulary mining framework includes the variable precision rough set extension. This offers the parameter $\gamma$, which allows us some control over this aspect.

Another aspect to consider is that relations may differ in their level of abstraction and

granularity. R2, the ancestor relation, yields terms at different levels of abstraction compared with R1, the synonymy relation. Similarly, the co-occurrence relation may be subdivided into finer grain relations. Again, one must be able to control such differences. The proposed framework allows one to enforce some degree of consistency in the level of abstraction by defining the relation appropriately.

Finally, it is clear that some relations are not independent of each other. For example, the

parent relation is a subset of the ancestor relation. Clearly, using both is somewhat redundant. The choice between the more general relation and the more specific one is possibly context dependent. With some queries, the parent relation is likely to be more useful than the ancestor. This aspect may be investigated empirically within the proposed framework.

To conclude, the example shows that our rough and fuzzy set-based vocabulary mining framework is motivated by real world complex vocabularies, such as the UMLS. It is also evident that a number of decisions will need to be made when applying the proposed rough set framework. The core issue underlying these decisions is, in fact, the very definitions of the different relations/views that can be derived from the given vocabulary. Once the views are defined, other aspects arise, such as which views to select for a given query and how to combine them. These and other aspects related to vocabulary mining will be examined empirically in future research.
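The relation properties discussed in this section (symmetry, transitivity, seriality, reflexivity) can be checked mechanically once a relation has been extracted as a set of term pairs. The sketch below uses invented pairs loosely modeled on the sibling and ancestor relations; the term names are hypothetical abbreviations, not actual Metathesaurus entries.

```python
def is_reflexive(pairs, universe):
    return all((x, x) in pairs for x in universe)

def is_serial(pairs, universe):
    """Every element must be related to at least one element."""
    heads = {a for a, _ in pairs}
    return all(x in heads for x in universe)

def is_symmetric(pairs):
    return all((b, a) in pairs for a, b in pairs)

def is_transitive(pairs):
    return all((a, d) in pairs
               for a, b in pairs for c, d in pairs if b == c)

# Invented examples: a sibling-style relation and an ancestor-style relation.
sibling = {("dppc", "dmpc"), ("dmpc", "dppc")}
ancestor = {("dppc", "phosphatidylcholines"),
            ("phosphatidylcholines", "phospholipids"),
            ("dppc", "phospholipids")}

print(is_symmetric(sibling), is_transitive(sibling))
print(is_symmetric(ancestor), is_transitive(ancestor))
```

Such checks decide which class of the generalized rough set model a mined relation falls into before it is used for approximation.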

9. Conclusion

The exploration of domain vocabularies for information retrieval has been an active research area for several decades. Our contribution is a retrieval model within a vocabulary mining framework where individual vocabulary relations or views may be characterized and studied.

² These relation properties are determined by looking at how they are expressed in the UMLS database. For example, the parent relation is not transitive, since the individual instances may come from different component vocabularies and the transitive relation often does not span vocabularies.


The framework also supports systematic study to identify effective combination methods and combinations of vocabulary views for information retrieval. The framework is based on Pawlak's theory of rough sets and some of its extensions. The ability to automatically mine domain vocabularies to adjust query and document representations is of significant value. Data mining research, including efforts based on rough sets, has typically focused on discovering knowledge from highly structured databases, such as relational databases. Our work extends data mining goals into the realm of relatively unstructured, textual databases. Equally challenging is our goal of just-in-time discovery of the relevant vocabulary relations for a given user query.

Future plans include testing this model in various domains. As shown here, the Unified Medical Language System (UMLS), implemented by the National Library of Medicine, offers a rich vocabulary system with various types of binary relations. WordNet is another example of a rich vocabulary system that offers an interesting test option for the future.

References

Das-Gupta, P. (1988). Rough sets and information retrieval. In Y. Chiaramella (Ed.), Proceedings of the 11th International Conference of the Association for Computing Machinery Special Interest Group on Information Retrieval (ACM SIGIR), Grenoble, France (pp. 567–582). ACM Press.

Dubois, D., & Prade, H. (1990). Rough fuzzy sets and fuzzy rough sets. International Journal of General Systems, 17, 191–209.

Dubois, D., & Prade, H. (1992). Putting rough sets and fuzzy sets together. In R. Slowinski (Ed.), Intelligent decision support: handbook of applications and advances of the rough sets theory (pp. 204–232). Boston: Kluwer.

Hu, X., & Cercone, N. (1995). Mining knowledge rules from databases: a rough set approach. In Proceedings of the 12th International Conference on Data Engineering, New Orleans (pp. 96–105).

Krusinska, E., Slowinski, R., & Stefanowski, J. (1992). Discriminant versus rough set approach to vague data analysis. Applied Stochastic Models and Data Analysis, 8, 43–56.

Lin, T. Y. (1989). Neighborhood systems and approximation in database and knowledge base systems. In Proceedings of the Fourth International Symposium on Methodologies of Intelligent Systems.

Lin, T. Y. (1992). Topological and fuzzy rough sets. In R. Slowinski (Ed.), Intelligent decision support: handbook of applications and advances in rough sets theory (pp. 287–304). Boston: Kluwer.

Lin, T. Y., & Liu, Q. (1993). Rough approximate operators. In Proceedings of the International Workshop on Rough Sets and Knowledge Discovery (pp. 255–257).

Lingras, P. J., & Yao, Y. Y. (1998). Data mining using extensions of the rough set model. Journal of the American Society for Information Science, 49(5), 415–422.

Millan, M., & Machuca, F. (1997). Using the rough set theory to exploit the data mining potential in relational database systems. In RSSC'97 (pp. 344–347).

Miyamoto, S. (1990). Fuzzy sets in information retrieval and cluster analysis. Dordrecht, The Netherlands: Kluwer.

National Library of Medicine (1998). Unified Medical Language System (UMLS) knowledge sources (9th ed.). MD: NLM.

Nguyen, H. S., Skowron, A., Synak, P., & Wróblewski, J. (1997). Knowledge discovery in databases: rough set approach. In M. Mares, R. Meisar, V. Novak, & J. Ramik (Eds.), Proceedings of the Seventh International Fuzzy Systems Association World Congress (IFSA'97), Prague, 25–29 June (vol. 2, pp. 204–209). Academia.

Øhrn, A., Vinterbo, S., Szymański, P., & Komorowski, J. (1997). Modeling cardiac patient set residuals using rough sets. In Proceedings of the AMIA Annual Fall Symposium (formerly SCAMC), Nashville, TN (pp. 203–207).

Pawlak, Z. (1982). Rough sets. International Journal of Computer and Information Sciences, 11, 341–356.

Pawlak, Z., & Skowron, A. (1994). Rough membership functions. In R. R. Yager, M. Fedrizzi, & J. Kacprzyk (Eds.), Advances in the Dempster–Shafer theory of evidence (pp. 251–271). New York: Wiley.

Pawlak, Z., Grzymala-Busse, J., Slowinski, R., & Ziarko, W. (1995). Rough sets. Communications of the ACM, 38(11), 89–95.

Salton, G. (1971). The SMART retrieval system: experiments in automatic document processing. New Jersey: Prentice-Hall.

Skowron, A., & Grzymala-Busse, J. W. (1994). From rough set theory to evidence theory. In R. R. Yager, M. Fedrizzi, & J. Kacprzyk (Eds.), Advances in the Dempster–Shafer theory of evidence (pp. 193–236). New York: Wiley.

Srinivasan, P. (1989). Intelligent information retrieval using rough set approximations. Information Processing and Management, 25(4), 347–361.

Srinivasan, P. (1991). The importance of rough approximations for information retrieval. International Journal of Man–Machine Studies, 34, 657–671.

Tsumoto, S., Ziarko, W., Shan, N., & Tanaka, H. (1995). Knowledge discovery in clinical databases based on variable precision rough sets model. In Proceedings of the 19th Annual Symposium on Computer Applications in Medical Care, Journal of the American Medical Informatics Association Supplement, New Orleans (pp. 270–274).

Wong, S. K. M., & Ziarko, W. (1987). Comparison of the probabilistic approximate classification and the fuzzy set model. Fuzzy Sets and Systems, 21, 357–362.

Yao, Y. Y., & Wong, S. K. M. (1992). A decision theoretic framework for approximating concepts. International Journal of Man–Machine Studies, 37, 793–809.

Yao, Y. Y., Li, X., Lin, T. Y., & Liu, Q. (1994). Representation and classification of rough set models. In T. Y. Lin, & A. M. Wildberger (Eds.), Soft computing: Proceedings of the 3rd International Workshop on Rough Sets and Soft Computing (RSSC'94) (pp. 44–47). San Diego, CA: The Society for Computer Simulation.

Yao, Y. Y. (1997). Combination of rough and fuzzy sets based on α-level sets. In T. Y. Lin, & N. Cercone (Eds.), Rough sets and data mining: analysis for imprecise data (pp. 301–321). Boston: Kluwer.

Yao, Y. Y., Wong, S. K. M., & Lin, T. Y. (1997). A review of rough set models. In T. Y. Lin, & N. Cercone (Eds.), Rough sets and data mining: analysis for imprecise data (pp. 47–73). Boston: Kluwer.

Zakowski, W. (1983). Approximations in the space (U, Π). Demonstratio Mathematica, XVI, 761–769.

Ziarko, W. (1993). Variable precision rough set model. Journal of Computer and System Sciences, 46, 39–59.
