ISSN 0972 - 9038 International Journal of Computer Science & Applications Volume 4 Issue 2 July 2007 Special Issue on Communications, Interactions and Interoperability in Information Systems Editor-in-Chief Rajendra Akerkar Editors of Special Issue Colette Rolland, Oscar Pastor and Jean-Louis Cavarero
International Journal of Computer Science & Applications Vol. 4, No.2, July 2007
ADVISORY EDITOR Douglas Comer Department of Computer Science, Purdue University, USA
EDITOR-IN-CHIEF Rajendra Akerkar Technomathematics Research Foundation 204/17 KH, New Shahupuri, Kolhapur 416001, INDIA
MANAGING EDITOR David Camacho Universidad Carlos III de Madrid, Spain
ASSOCIATE EDITORS Ngoc Thanh Nguyen Wroclaw University of Technology, Poland
Pawan Lingras Saint Mary's University, Halifax, Nova Scotia, Canada.
COUNCIL OF EDITORS Stuart Aitken, University of Edinburgh, UK; Tetsuo Asano, JAIST, Japan; Costin Badica, University of Craiova, Romania; JF Baldwin, University of Bristol, UK; Pavel Brazdil, LIACC/FEP, University of Porto, Portugal; Ivan Bruha, McMaster University, Canada; Jacques Calmet, Universität Karlsruhe, Germany; Narendra S. Chaudhari, Nanyang Technological University, Singapore; Walter Daelemans, University of Antwerp, Belgium; K. V. Dinesha, IIIT, Bangalore, India; David Hung-Chang Du, University of Minnesota, USA; Hai-Bin Duan, Beihang University, P. R. China; Yakov I. Fet, Russian Academy of Sciences, Russia; Maria Ganzha, Gizycko Private Higher Educational Institute, Gizycko, Poland; S. K. Gupta, IIT, New Delhi, India; Henry Hexmoor, University of Arkansas, Fayetteville, USA; Ray Jarvis, Monash University, Victoria, Australia; Peter Kacsuk, MTA SZTAKI Research Institute, Budapest, Hungary
Huan Liu, Arizona State University, USA; Pericles Loucopoulos, UMIST, Manchester, UK; Wolfram-Manfred Lippe, University of Muenster, Germany; Lorraine McGinty, University College Dublin, Belfield, Ireland; C. R. Muthukrishnan, Indian Institute of Technology, Chennai, India; Marcin Paprzycki, SWPS and IBS PAN, Warsaw, Poland; Lalit M. Patnaik, Indian Institute of Science, Bangalore, India; Dana Petcu, Western University of Timisoara, Romania; Shahram Rahimi, Southern Illinois University, Illinois, USA; Sugata Sanyal, Tata Institute of Fundamental Research, Mumbai, India; Dharmendra Sharma, University of Canberra, Australia; Ion O. Stamatescu, FEST, Heidelberg, Germany; José M. Valls Ferrán, Universidad Carlos III, Spain; Rajeev Wankar, University of Hyderabad, Hyderabad, India; Krzysztof Wecel, The Poznan University of Economics, Poland
Editorial Office: Technomathematics Research Foundation, 204/17 KH, New Shahupuri, Kolhapur 416001, India. E-mail: [email protected] Copyright 2007 by Technomathematics Research Foundation. All rights reserved. This journal issue or parts thereof may not be reproduced in any form or by any means, electrical or mechanical, including photocopying, recording, or any information storage and retrieval system now known or to be invented, without written permission from the copyright owner. Permission to quote from this journal is granted provided that the customary acknowledgement is given to the source. International Journal of Computer Science & Applications (ISSN 0972 – 9038) is a high-quality electronic journal published semi-annually by Technomathematics Research Foundation, Kolhapur, India. The www-site of IJCSA is http://www.tmrfindia.org/ijcsa.html
Contents
Editorial
1. A New Quantitative Trust Model for Negotiating Agents using Argumentation – Jamal Bentahar, Concordia Institute for Information Systems Engineering; John-Jules Ch. Meyer, Department of Information and Computing Sciences, Utrecht University, The Netherlands
2. Protocol Management Systems as a Middleware for Inter-Organizational Workflow
Adaptability of Methods for Processing XML Data using Relational Databases – the State of the Art and Open Problems¹

Irena Mlynkova and Jaroslav Pokorny
Charles University, Faculty of Mathematics and Physics, Department of Software Engineering, Malostranske nam. 25, 118 00 Prague 1, Czech Republic
{irena.mlynkova,jaroslav.pokorny}@mff.cuni.cz
Abstract. As XML technologies have become a standard for data representation, it is inevitable to propose and implement efficient techniques for managing XML data. A natural alternative is to exploit the tools and functions offered by (object-)relational database systems. Unfortunately, this approach has many opponents, especially due to the inefficiency caused by structural differences between XML data and relations. On the other hand, (object-)relational databases have a long theoretical and practical history and represent a mature technology, i.e. they can offer properties that no native XML database can offer yet. In this paper we study techniques which make it possible to improve XML processing based on relational databases, so-called adaptive or flexible mapping methods. We provide an overview of existing approaches, classify their main features, and sum up the most important findings and characteristics. Finally, we discuss possible improvements and the corresponding key problems.
Keywords: XML-to-relational mapping, state of the art, adaptability, relational databases
1 Introduction
Without any doubt, XML [9] is currently one of the most popular formats for data representation. It is well-defined, easy to use, and involves various recommendations such as languages for structural specification, transformation, querying, updating, etc. This popularity has invoked an enormous endeavor to propose more efficient methods and tools for managing and processing XML data. The four most popular ones are methods which store XML data in a file system, methods which store and process XML data using an (object-)relational database system, methods which exploit a pure object-oriented approach, and native methods that use special indices, numbering schemes [17], and/or data structures [12] proposed particularly for the tree structure of XML data.
Naturally, each of the approaches has both keen advocates and opponents. The situation is especially unfavorable for the file-system-based and object-oriented methods. The former suffer from the inability to query the data without additional preprocessing, whereas the latter fail especially in finding a corresponding efficient and comprehensive tool. The highest-performance techniques are the native ones, since they are designed specifically for XML processing and need not artificially adapt existing structures to a new purpose. But the most widely used in practice are methods which exploit features of (object-)relational databases. The reason is that these are still regarded as universal data processing tools, and their long theoretical and practical history can guarantee a reasonable level of reliability. Contrary to native methods, it is not necessary to start "from scratch"; we can rely on a mature and verified technology, i.e. on properties that no native XML database can offer yet.

¹ This work was supported in part by the Czech Science Foundation (GACR), grant number 201/06/0756.

International Journal of Computer Science & Applications, Vol. IV, No. II, pp. 43 - 62
Under a closer investigation, the database-based² methods can be further classified and analyzed [19]. We usually distinguish generic methods, which store XML data regardless of the existence of a corresponding XML schema (e.g. [10] [16]), schema-driven methods, based on structural information from an existing schema of the XML documents (e.g. [26] [18]), and user-defined methods, which leave all the storage decisions in the hands of future users (e.g. [2] [1]).
Techniques of the first type usually view an XML document as a directed labelled tree with several types of nodes. We can further distinguish generic techniques which purely store the components of the tree and their mutual relationships [10] and techniques which store additional structural information, usually using a kind of numbering schema [16]. Such a schema speeds up certain types of queries, but usually at the cost of inefficient data updates. The fact that these methods do not exploit a possibly existing XML schema can be regarded as both an advantage and a disadvantage: on one hand they do not depend on its existence, but on the other hand they cannot exploit the additional structural information. Together with the finding that a significant portion of real XML documents (52% [5] of randomly crawled or 7.4% [20] of semi-automatically collected³) have no schema at all, they seem to be the most practical choice.
By contrast, schema-driven methods have the contradictory (dis)advantages. The situation is even worse for methods which are based particularly on XML Schema [28] [7] definitions (XSDs) and focus on their special features [18]. As expected, XSDs are used even less (only for 0.09% [5] of randomly crawled or 38% [20] of semi-automatically collected XML documents), and even when they are used, they often (in 85% of cases [6]) define so-called local tree grammars [22], i.e. languages that can be defined using DTD [9] as well. The most exploited "non-DTD" features are usually simple types [6], whose absence in DTD is crucial, but which have only a side optimization effect on XML data processing.
Another problem of purely schema-driven methods is that the information XML schemes provide is not satisfactory. Analysis of both XML documents and XML schemes together [20] shows that XML schemes are too general. Typical examples are recursion or the "*" operator, which allow theoretically infinitely deep or wide XML documents. Naturally, XML schemes also cannot provide any information about, e.g., the retrieval frequency of an element / attribute or the way it is retrieved. Thus not only XML schemes but also the corresponding
² In the rest of the paper the term "database" represents an (object-)relational database.
³ Data collected with the interference of a human operator who removes damaged, artificial, too simple, or otherwise useless XML data.
XML documents and XML queries need to be taken into account to get an overall notion of the demanded XML-processing application.
The last mentioned type of approach, i.e. the user-defined one, is a bit different. It does not involve methods for automatic database storage, but rather tools for specifying the target database schema and the required XML-to-relational mapping. It is commonly offered by most known (object-)relational database systems [3] as a feature that enables users to define what suits them most instead of being restricted by the disadvantages of a particular technique. Nevertheless, the key problem is evident – it assumes that the user is skilled in both database and XML technologies.
Apparently, the advantages of all three approaches are closely related to the particular situation. Thus it is advisable to propose a method which is able to exploit the current situation, or at least to conform to it. If we analyze database-based methods more deeply, we can distinguish so-called flexible or adaptive methods (e.g. [13] [25] [29] [31]). They take into account a given sample set of XML data and/or XML queries which specify the future usage, and adapt the resulting database schema to them. Such techniques naturally have better performance results than the fixed ones (e.g. [10] [16] [26] [18]), i.e. methods which use a pre-defined set of mapping rules and heuristics regardless of the intended future usage. Nevertheless, they also have a great disadvantage – the fact that the target database schema is adapted only once. Thus, if the expected usage changes, the efficiency of such techniques can become even worse than in the corresponding fixed case. Consequently, the adaptability needs to be dynamic.
The idea of adapting a technique to a sample set of data is closely related to analyses of typical features of real XML documents [20]. If we combine the two ideas, we can assume that a method which focuses especially on common features will be more efficient than a general one. A similar observation is already exploited, e.g., in techniques which represent XML documents as a set of points in multidimensional space [14]. The efficiency of such techniques depends strongly on the depth of the XML documents or the number of distinct paths. Fortunately, XML analyses confirm that real XML documents are surprisingly shallow – the average depth does not exceed 10 levels [5] [20].
Considering all the mentioned points, adaptive enhancement of XML-processing methods, focusing on given or typical situations, seems to be a promising type of improvement. In this paper we study adaptive techniques from various points of view. We provide an overview of existing approaches, we classify them and their main features, and we sum up the most important findings and characteristics. Finally, we discuss possible improvements and the corresponding key problems. The analysis should serve as a starting point for a proposal of an enhancement of existing adaptive methods as well as of an unprecedented approach. Thus we also discuss possible improvements of the weak points of existing methods and solutions to the stated open problems.
The paper is structured as follows: Section 2 contains a brief introduction to the formalism used throughout the paper. Section 3 describes and classifies the existing related works, both practical and theoretical, and Section 4 sums up their main characteristics. Section 5 discusses possible ways of improving the recent approaches, and finally, Section 6 provides conclusions.
2 Basic Terms

Before we begin to describe and classify the adaptive methods, we state several basic terms used in the rest of the text.
An XML document is usually viewed as a directed labelled tree with several types of nodes, whose edges represent the relationships among them. Side structures, such as entities, comments, CDATA sections, processing instructions, etc., are omitted without loss of generality.
Definition 1 An XML document is a directed labelled tree T = (V, E, ΣE, ΣA, Γ, lab, r), where V is a finite set of nodes, E ⊆ V × V is a set of edges, ΣE is a finite set of element names, ΣA is a finite set of attribute names, Γ is a finite set of text values, lab : V → ΣE ∪ ΣA ∪ Γ is a surjective function which assigns a label to each v ∈ V, whereas v is an element if lab(v) ∈ ΣE, an attribute if lab(v) ∈ ΣA, or a text value if lab(v) ∈ Γ, and r is the root node of the tree.
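Definition 1 maps naturally onto a small data structure. The following sketch (class and function names are ours, purely illustrative) builds such a tree for a two-element document:

```python
from dataclasses import dataclass, field

# A minimal rendering of Definition 1: each node carries a label that is
# an element name, an attribute name, or a text value; edges are stored
# as parent-to-child references.
@dataclass
class Node:
    label: str
    kind: str                 # "element", "attribute", or "text"
    children: list = field(default_factory=list)

def add_child(parent: Node, child: Node) -> Node:
    """Record an edge (parent, child) of the tree T = (V, E, ...)."""
    parent.children.append(child)
    return child

# The document <book><title>XML</title></book> as a tree rooted at r:
r = Node("book", "element")
title = add_child(r, Node("title", "element"))
add_child(title, Node("XML", "text"))
```

The `kind` field stands in for the partition of labels into ΣE, ΣA, and Γ.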
A schema of an XML document is usually described using DTD or XML Schema, which describe the allowed structure of an element using its content model. An XML document is valid against a schema if each element matches its content model. (For the sake of paper length we state the definitions for DTDs only.)
Definition 2 A content model α over a set of element names Σ′E is a regular expression defined as α = ε | pcdata | f | (α1, α2, ..., αn) | (α1|α2|...|αn) | β* | β+ | β?, where ε denotes the empty content model, pcdata denotes the text content, f ∈ Σ′E, "," and "|" stand for concatenation and union (of content models α1, α2, ..., αn), and "*", "+", and "?" stand for zero or more, one or more, and optional occurrence(s) (of content model β), respectively.
Definition 3 An XML schema S is a four-tuple (Σ′E, Σ′A, ∆, s), where Σ′E is a finite set of element names, Σ′A is a finite set of attribute names, ∆ is a finite set of declarations of the form e → α or e → β, where e ∈ Σ′E, α is a content model over Σ′E, and β ⊆ Σ′A, and s ∈ Σ′E is a start symbol.
To simplify the XML-to-relational mapping process, an XML schema is often transformed into a graph representation. Probably the first occurrence of this representation, the so-called DTD graph, can be found in [26]. There are also various other types of graph representations of an XML schema; where necessary, we mention the slight differences later in the text.
Definition 4 A schema graph of a schema S = (Σ′E, Σ′A, ∆, s) is a directed, labelled graph G = (V, E, lab′), where V is a finite set of nodes, E ⊆ V × V is a set of edges, lab′ : V → Σ′E ∪ Σ′A ∪ {"|", "*", "+", "?", ","} ∪ {pcdata} is a surjective function which assigns a label to each v ∈ V, and s is the root node of the graph.
The core idea of XML-to-relational mapping methods is to decompose a given schema graph into fragments, which are mapped to corresponding relations.
Definition 5 A fragment f of a schema graph G is each of its connected subgraphs.
Definition 6 A decomposition of a schema graph G is a set of its fragments {f1, ..., fn}, where each v ∈ V is a member of at least one fragment.
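Definitions 5 and 6 boil down to a coverage condition: fragments may overlap, but together they must contain every node of the schema graph. A minimal check, with hypothetical names and fragments represented simply as node sets:

```python
def is_decomposition(nodes, fragments):
    """Check Definition 6: each node of the schema graph belongs to
    at least one fragment (fragments are allowed to overlap)."""
    covered = set().union(*fragments) if fragments else set()
    return set(nodes) <= covered

# Graph nodes {a, b, c} split into two overlapping fragments:
print(is_decomposition({"a", "b", "c"}, [{"a", "b"}, {"b", "c"}]))  # True
print(is_decomposition({"a", "b", "c"}, [{"a", "b"}]))              # False
```

Connectedness of each fragment (Definition 5) would require the edge set as well and is omitted here.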
3 Existing Adaptive Approaches

Up to now, only a few papers have focused on a proposal of an adaptive database-based XML-processing method. We distinguish two main directions – cost-driven and user-driven. Techniques of the former group choose the most efficient XML-to-relational storage strategy automatically. They usually evaluate a subset of possible mappings and choose the best one according to the given sample of XML data, query workload, etc. The main advantage is expressed by the adverb "automatically", i.e. without necessary or undesirable user interference. By contrast, techniques of the latter group also support several storage strategies, but the final decision is left in the hands of users. We distinguish these techniques from the user-defined ones, since their approach is slightly different: by default they offer a fixed mapping, but users can influence the mapping process by annotating fragments of the input XML schema with the demanded storage strategies. Similarly to the user-defined techniques, this approach also assumes a skilled user, but most of the work is done by the system itself. The user is expected to help the mapping process, not to perform it.
3.1 Cost-Driven Techniques
As mentioned above, cost-driven techniques choose the best storage strategy for a particular application automatically, without any interference from a user. Thus the user can influence the mapping process only through the provided XML schema, the set of sample XML documents or data statistics, the set of XML queries and possibly their weights, etc.
Each of the techniques can be characterized by the following five features:
1. an initial XML schema Sinit,
2. a set of XML schema transformations T = {t1, t2, ..., tn}, where ∀ i : ti transforms a given schema S into a schema Si,
3. a fixed XML-to-relational mapping function fmap which transforms a given XML schema S into a relational schema R,
4. a set of sample data Dsample characterizing the future application, which usually consists of a set of XML documents {d1, d2, ..., dk} valid against Sinit and a set of XML queries {q1, q2, ..., ql} over Sinit, possibly with corresponding weights {w1, w2, ..., wl}, ∀ i : wi ∈ ⟨0, 1⟩, and
5. a cost function fcost which evaluates the cost of a given relational schema R with regard to the set Dsample.

The required result is an optimal relational schema Ropt, i.e. a schema where fcost(Ropt, Dsample) is minimal.
A naive but illustrative cost-driven storage strategy, based on a "brute force" idea, is depicted in Algorithm 1. It first generates a set of possible XML schemes S, using transformations from the set T and starting from the initial schema Sinit (lines 1 – 4). Then it searches for the schema s ∈ S with minimal cost fcost(fmap(s), Dsample) (lines 5 – 12) and returns the corresponding optimal relational schema Ropt = fmap(s).
1: S ← {Sinit}
2: while ∃ t ∈ T, s ∈ S : t(s) ∉ S do
3:   S ← S ∪ {t(s)}
4: end while
5: costopt ← ∞
6: for all s ∈ S do
7:   Rtmp ← fmap(s)
8:   costtmp ← fcost(Rtmp, Dsample)
9:   if costtmp < costopt then
10:    Ropt ← Rtmp ; costopt ← costtmp
11:  end if
12: end for
13: return Ropt
Obviously, the complexity of such an algorithm depends strongly on the set T. It can be proven that even a simple set of transformations causes the problem of finding the optimal schema to be NP-hard [29] [31] [15]. Thus the existing techniques in fact search for a suboptimal solution using various heuristics, greedy strategies, approximation algorithms, terminal conditions, etc. We can also observe that fixed methods can be considered a special type of cost-driven method, where T = ∅, Dsample = ∅, and fcost(R, ∅) = const for all R.
3.1.1 Hybrid Object-Relational Mapping
One of the first attempts at a cost-driven adaptive approach is a method called Hybrid object-relational mapping [13]. It is based on the observation that if XML documents are mostly semi-structured, a "classical" decomposition of the less structured XML parts into relations leads to inefficient query processing caused by plenty of join operations. The algorithm exploits the idea of storing well structured parts in relations and semi-structured parts using the so-called XML data type, which supports path queries and XML-aware full-text operations. The fixed mapping for structured parts is similar to the classical Hybrid algorithm [26], whereas, in addition, it exploits NF² relations using constructs such as set-of, tuple-of, and list-of. The main concern of the method is to identify the structured and semi-structured parts. It consists of the following steps:
1. A schema graph G1 = (V1, E1, lab′1) is built for a given DTD.
2. For each v ∈ V1 a measure of significance ωv (see below) is determined.
3. Each v ∈ V1 which satisfies the following conditions is identified:
   (a) v is not a leaf node.
   (b) For v and all its descendants vi; 1≤i≤k : ωv < ωLOD and ωvi < ωLOD, where ωLOD is the required level of detail of the resulting schema.
   (c) v does not have a parent node which would satisfy the conditions.
4. Each fragment f ⊆ G1 which consists of a previously identified node v and its descendants is replaced with an attribute node having the XML data type, resulting in a schema graph G2.
5. G2 is mapped to a relational schema using a fixed mapping strategy.
The measure of significance ωv of a node v is defined as:

ωv = (1/2)·ωSv + (1/4)·ωDv + (1/4)·ωQv = (1/2)·ωSv + (1/4)·card(Dv)/card(D) + (1/4)·card(Qv)/card(Q)   (1)

where ωSv is derived from the DTD structure as a combination of weights expressing the position of v in the graph and the complexity of its content model (see [13]), D ⊆ Dsample is the set of all given documents, Dv ⊆ D is the set of documents containing v, Q ⊆ Dsample is the set of all given queries, and Qv ⊆ Q is the set of queries containing v.
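Equation (1) is straightforward to evaluate once the document and query statistics are gathered; a small sketch with illustrative inputs (the structural score ωSv is taken as given, since its derivation is specific to [13]):

```python
def significance(omega_s, docs, docs_with_v, queries, queries_with_v):
    """Equation (1): weight 1/2 for the structural score and 1/4 each
    for the fraction of documents and queries that contain node v."""
    omega_d = len(docs_with_v) / len(docs)
    omega_q = len(queries_with_v) / len(queries)
    return 0.5 * omega_s + 0.25 * omega_d + 0.25 * omega_q

# Node v appears in 2 of 4 sample documents and 1 of 2 sample queries:
w = significance(0.6, docs=range(4), docs_with_v=range(2),
                 queries=range(2), queries_with_v=range(1))
print(w)  # 0.5*0.6 + 0.25*0.5 + 0.25*0.5 = 0.55
```

A node is then stored as an XML-typed attribute when its ωv (and that of all its descendants) falls below the chosen level of detail ωLOD.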
As we can see, the algorithm optimizes the naive approach mainly by the facts that the schema graph is preprocessed, i.e. ωv is determined for each v ∈ V1, that the set of transformations T is a singleton, and that the transformation is performed only if the current node satisfies the above mentioned conditions (a) – (c). Obviously, the preprocessing ensures that the complexity of the search algorithm is given by K1 · card(V1) + K2 · card(E1), where K1, K2 ∈ N. On the other hand, the optimization is too restrictive in terms of the number of possible XML-to-relational mappings.
3.1.2 FlexMap Mapping
Another example of an adaptive cost-driven method was implemented as the so-called FlexMap framework [25]. The algorithm optimizes the naive approach using a simple greedy strategy, as depicted in Algorithm 2. The main differences in comparison with the naive approach are the choice of the least expensive transformation at each iteration (lines 3 – 9) and the termination of the search if there is no transformation t ∈ T that can reduce the current (sub)optimum (lines 10 – 14).
The set T of XML-to-XML transformations involves the following operations:
• Inlining and outlining – inverse operations which enable storing the columns of a subelement / attribute either in the parent table or in a separate table
• Splitting and merging elements – inverse operations which enable storing a shared element⁴ either in a common table or in separate tables
• Associativity and commutativity
• Union distribution and factorization – inverse operations which enable separating out the components of a union using the equation (a, (b|c)) = ((a, b)|(a, c))
• Splitting and merging repetitions – exploitation of the equation (a+) = (a, a*)
• Simplifying unions – exploitation of the equation (a|b) ⊆ (a?, b?)

⁴ An element with multiple parent elements in the schema – see [26].
International Journal of Computer Science & ApplicationsVol. IV, No. II, pp. 43 - 62
1: Sopt ← Sinit ; Ropt ← fmap(Sopt) ; costopt ← fcost(Ropt, Dsample)
2: loop
3:   costmin ← ∞
4:   for all t ∈ T do
5:     costt ← fcost(fmap(t(Sopt)), Dsample)
6:     if costt < costmin then
7:       tmin ← t ; costmin ← costt
8:     end if
9:   end for
10:  if costmin < costopt then
11:    Sopt ← tmin(Sopt) ; Ropt ← fmap(Sopt) ; costopt ← fcost(Ropt, Dsample)
12:  else
13:    break
14:  end if
15: end loop
16: return Ropt
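Algorithm 2 translates into a short greedy loop; as before, the transformations and cost function below are toy placeholders, not the FlexMap implementation:

```python
def greedy_search(s_init, transforms, f_map, f_cost, d_sample):
    """Algorithm 2: at each iteration apply the cheapest transformation;
    terminate when no transformation improves the current (sub)optimum."""
    s_opt = s_init
    cost_opt = f_cost(f_map(s_opt), d_sample)
    while True:
        # lines 3-9: find the least expensive transformation
        costs = [(f_cost(f_map(t(s_opt)), d_sample), i)
                 for i, t in enumerate(transforms)]
        cost_min, i_min = min(costs)
        if cost_min < cost_opt:          # lines 10-14: accept or stop
            s_opt = transforms[i_min](s_opt)
            cost_opt = cost_min
        else:
            return f_map(s_opt)

# Toy instance: increment/decrement an integer "schema";
# the cost is its distance from 3, so the greedy walk stops at 3.
result = greedy_search(0, [lambda s: s + 1, lambda s: s - 1],
                       f_map=lambda s: s,
                       f_cost=lambda r, d: abs(r - 3),
                       d_sample=None)
print(result)  # 3
```

Like any greedy strategy, this can stop in a local optimum; it merely guarantees that no single transformation improves the result further.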
Note that except for commutativity and simplifying unions, the transformations generate an equivalent schema in terms of equivalence of the sets of document instances. Commutativity does not retain the order of the schema, whereas simplifying unions generates a more general schema, i.e. a schema with a larger set of document instances. (However, only inlining and outlining were implemented and experimentally tested in the FlexMap system.)
The fixed mapping again uses a strategy similar to the Hybrid algorithm, but it is applied locally to each fragment of the schema specified by the transformation rules stated by the search algorithm. For example, elements determined to be outlined are not inlined, though a "traditional" Hybrid algorithm would do so.
The process of evaluating fcost is significantly optimized. A naive approach would require construction of a particular relational schema, loading sample XML data into the relations, and cost analysis of the resulting relational structures. The FlexMap evaluation exploits an XML Schema-aware statistics framework, StatiX [11], which analyzes the structure of a given XSD and XML documents and computes their statistical summary, which is then "mapped" to relational statistics with regard to the fixed XML-to-relational mapping. Together with the sample query workload, these are used as an input for a classical relational optimizer which estimates the resulting cost. Thus no relational schema has to be constructed, and as the statistics are updated at each XML-to-XML transformation, the XML documents need to be processed only once.
3.1.3 An Adjustable and Adaptable Method (AAM)
The following method, which is also based on the idea of searching a space of possible mappings, is presented in [29] as an Adjustable and adaptable method (AAM). In this case the authors adapt the given problem to the features of genetic algorithms. It is also the first paper to mention that the problem of finding a relational schema R for a given set of XML documents and queries Dsample, s.t. fcost(R, Dsample) is minimal, is NP-hard in the size of the data.
The set T of XML-to-XML transformations consists of inlining and outlining of subelements. For the purpose of the genetic algorithm, each transformed schema is represented as a bit string, where each bit corresponds to an edge of the schema graph and is set to 1 if the element the edge points to is stored in a separate table, or 0 if it is stored in the parent table. The bits set to 1 represent "borders" among fragments, whereas each fragment is stored in one table corresponding to the so-called Universal table [10]. The extreme instances correspond to "one table for the whole schema" (in case of the 00...0 bit string), resulting in many null values, and "one table per element" (in case of the 11...1 bit string), resulting in many join operations.
Similarly to the previous strategy, the algorithm chooses only the best possible continuation at each iteration. The algorithm consists of the following steps:

1. The initial population P0 (i.e. the set of bit strings) is generated randomly.
2. The following steps are repeated until the terminating conditions are met:
   (a) Each member of the current population Pi is evaluated and only the best representatives are selected for further reproduction.
   (b) The next generation Pi+1 is produced by the genetic operators crossover, mutation, and propagate.
The algorithm terminates either after a certain number of transformations or when a good-enough schema is achieved.
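A minimal sketch of this search, assuming the bit-string representation described above and a stand-in fitness (cost) function; the selection, crossover, and mutation details are our own simplifications, not taken from [29]:

```python
import random

def genetic_search(n_bits, fitness, pop_size=20, generations=40, seed=0):
    """Schemas are bit strings (bit = 1: the element behind that edge goes
    to a separate table). The fittest half survives each generation; the
    rest is bred by one-point crossover plus a single point mutation."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)]   # initial population
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness)                           # lower cost = fitter
        survivors = pop[:pop_size // 2]                 # selection/propagate
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            cut = rng.randrange(1, n_bits)              # one-point crossover
            child = a[:cut] + b[cut:]
            i = rng.randrange(n_bits)                   # point mutation
            child[i] ^= 1
            children.append(child)
        pop = survivors + children
    return min(pop, key=fitness)

# Toy cost: the optimum outlines exactly one element (one bit set to 1),
# balancing null values against join operations.
best = genetic_search(8, fitness=lambda bits: abs(sum(bits) - 1))
```

In a real setting the fitness function would be the fcost of equation (2) below, evaluated on the relational schema induced by the bit string.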
The cost function fcost is expressed as:

fcost(R, Dsample) = fM(R, Dsample) + fQ(R, Dsample) = Σl=1..q Cl ∗ Rl + (Σi=1..m Si ∗ PSi + Σk=1..n Jk ∗ PJk)   (2)

where fM is a space-cost function, in which Cl is the number of columns and Rl the number of rows of the table Tl created for the l-th element in the schema and q is the number of all elements in the schema; and fQ is a query-cost function, in which Si is the cost and PSi the probability of the i-th select query, Jk is the cost and PJk the probability of the k-th join query, m is the number of select queries in Dsample, and n is the number of join queries in Dsample. In other words, fM represents the total memory cost of the mapping instance, whereas fQ represents the total query cost. The probabilities PSi and PJk enable specifying which elements will (not) be often retrieved and which sets of elements will (not) be often combined in searches. Also note that this algorithm represents another way of finding a reasonable suboptimal solution in the theoretically infinite set of possibilities – using (in this case two) terminal conditions.
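Given concrete table sizes and query statistics, equation (2) reduces to two weighted sums; a sketch with made-up numbers:

```python
def aam_cost(tables, selects, joins):
    """Equation (2): f_M sums columns x rows over all tables; f_Q sums
    the cost of each select and join query weighted by its probability."""
    f_m = sum(cols * rows for cols, rows in tables)          # space cost
    f_q = (sum(s * p for s, p in selects)                    # select cost
           + sum(j * p for j, p in joins))                   # join cost
    return f_m + f_q

# Two tables (3x100 and 2x50 cells), one select query (cost 10,
# probability 0.8) and one join query (cost 40, probability 0.5):
cost = aam_cost(tables=[(3, 100), (2, 50)],
                selects=[(10.0, 0.8)],
                joins=[(40.0, 0.5)])
print(cost)  # 300 + 100 + 8.0 + 20.0 = 428.0
```

The genetic search then simply minimizes this value over the population of bit strings.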
3.1.4 Hill Climbing Approach

The last but not least cost-driven adaptive representative can be found in paper [31]. The approach is again based on a greedy type of algorithm, in this case a hill climbing strategy, depicted in Algorithm 3.
Algorithm 3 Hill Climbing Algorithm
Input: Sinit, T, fmap, Dsample, fcost
Output: Ropt

1: Sopt ← Sinit ; Ropt ← fmap(Sopt) ; costopt ← fcost(Ropt, Dsample)
2: Ttmp ← T
3: while Ttmp ≠ ∅ do
4:   t ← any member of Ttmp
5:   Ttmp ← Ttmp \ {t}
6:   Rtmp ← fmap(t(Sopt))
7:   costtmp ← fcost(Rtmp, Dsample)
8:   if costtmp < costopt then
9:     Sopt ← t(Sopt) ; Ropt ← Rtmp ; costopt ← costtmp
10:    Ttmp ← T
11:  end if
12: end while
13: return Ropt
As we can see, the hill climbing strategy differs from the simple greedy strategy depicted in Algorithm 2 in the way it chooses the appropriate transformation t ∈ T. In the previous case the least expensive transformation that can reduce the current (sub)optimum is chosen; in this case it is the first such transformation found. The schema transformations are based on the idea of vertical (V) or horizontal (H) cutting and merging of the given XML schema fragment(s). The set T consists of the following four types of (pairwise inverse) operations:
• V-Cut(f, (u, v)) – cuts fragment f into fragments f1 and f2, s.t. f1 ∪ f2 = f, where (u, v) is an edge from f1 to f2, i.e. u ∈ f1 and v ∈ f2
• V-Merge(f1, f2) – merges fragments f1 and f2 into the fragment f = f1 ∪ f2
• H-Cut(f, (u, v)) – splits fragment f into twin fragments f1 and f2 horizontally from the edge (u, v), where u ∉ f and v ∈ f, s.t. ext(f1) ∪ ext(f2) = ext(f) and ext(f1) ∩ ext(f2) = ∅ ⁵ ⁶
• H-Merge(f1, f2) – merges two twin fragments f1 and f2 into one fragment f s.t. ext(f1) ∪ ext(f2) = ext(f)
As we can observe, the V-Cut and V-Merge operations are similar to outlining and inlining of the fragment f2 out of or into the fragment f1. Conversely, the H-Cut operation corresponds to the splitting of elements used in the FlexMap mapping, i.e. duplication of the shared part, and the H-Merge operation corresponds to the inverse merging of elements.
5 ext(fi) is the set of all instance fragments conforming to the schema fragment fi.
6 Fragments f1 and f2 are called twins if ext(f1) ∩ ext(f2) = ∅ and for each node u ∈ f1 there is a node v ∈ f2 with the same label, and vice versa.
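As an illustration of the vertical operations, the following sketch models a schema fragment as a set of node names and edges as (parent, child) pairs. The data structures are invented for the example and are not taken from [31].

```python
def v_cut(fragment, edges, cut_edge):
    """V-Cut: split `fragment` at `cut_edge` = (u, v); f2 is the subtree
    rooted at v, f1 is the rest, so that f1 and f2 partition the fragment."""
    u, v = cut_edge
    assert u in fragment and v in fragment
    # collect v and all its descendants within the fragment
    f2 = {v}
    changed = True
    while changed:
        changed = False
        for (p, c) in edges:
            if p in f2 and c in fragment and c not in f2:
                f2.add(c)
                changed = True
    f1 = fragment - f2
    return f1, f2

def v_merge(f1, f2):
    """V-Merge, the inverse operation: the merged fragment is the union."""
    return f1 | f2
```

Cutting at the edge (u, v) and merging the two resulting fragments back recovers the original fragment, reflecting that the operations are pairwise inverse.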
International Journal of Computer Science & Applications, Vol. IV, No. II, pp. 43 - 62
The fixed XML-to-relational mapping maps each fragment fi which consists of nodes {v1, v2, ..., vn} to the relation

Ri = (id(ri) : int, id(ri.parent) : int, lab(v1) : type(v1), ..., lab(vn) : type(vn))

where ri is the root element of fi. Note that such a mapping is again similar to a locally applied Universal table.
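The fragment-to-relation mapping above can be sketched as follows; the naming convention for the generated columns is an assumption for the example.

```python
def fragment_relation(root_label, nodes):
    """Build the column list of R_i for a fragment rooted at `root_label`.
    `nodes` is a list of (label, sql_type) pairs for the nodes v1..vn.
    The relation gets a fragment id, a parent id, and one column per node."""
    columns = [(f"id_{root_label}", "int"),          # id(r_i)
               (f"id_{root_label}_parent", "int")]   # id(r_i.parent)
    columns += [(label, sql_type) for (label, sql_type) in nodes]
    return columns
```

For a fragment rooted at `book` with two atomic nodes, the result mirrors the shape of R_i above: two integer id columns followed by one typed column per node.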
The cost function fcost is expressed as:
fcost(R, Dsample) = ∑_{i=1}^{n} wi · cost(Qi, R)    (3)
where Dsample consists of a sample set of XML documents and a given query workload {(Qi, wi)}i=1,2,...,n, where Qi is an XML query and wi is its weight. The cost function cost(Qi, R) for a query Qi which accesses the fragment set {fi1, ..., fim} is expressed as:
where fij and fik, j ≠ k, are two joined fragments, |Eij| is the number of elements in ext(fij), and Selij is the selectivity of the path from the root to fij estimated using a Markov table. In other words, the formula simulates the cost of joining the relations corresponding to fragments fij and fik.
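Equation (3) itself is straightforward to compute once a per-query cost function is available; a minimal sketch, with `cost_fn` standing in for cost(Qi, R):

```python
def workload_cost(workload, cost_fn, relations):
    """Weighted workload cost of Equation (3).
    `workload` is a list of (query, weight) pairs; `cost_fn(query, relations)`
    estimates the cost of evaluating one query over the relational design."""
    return sum(w * cost_fn(q, relations) for (q, w) in workload)
```

The search algorithms above re-evaluate exactly this sum for every candidate mapping, which is why a cheap analytical `cost_fn` (rather than actually loading data) matters.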
The authors further analyze the influence of the choice of the initial schema Sinit on the efficiency of the search algorithm. They use three types of initial schema decompositions leading to the Binary [10], Shared, or Hybrid [26] mapping. The paper concludes with the finding that a good choice of an initial schema is crucial and can lead to faster searches for the suboptimal mapping.
3.2 User-Driven Techniques
As mentioned above, the most flexible approach is the user-defined mapping, i.e. the idea "to leave the whole process in the hands of a user" who defines both the target database schema and the required mapping. Due to its simple implementation it is supported in most commercial database systems [3]. At first sight the idea is correct – users can decide what suits them most and are not restricted by the disadvantages of a particular technique. The problem is that such an approach assumes users skilled in two complex technologies, and for more complex applications the design of an optimal relational schema is generally an uneasy task.
On this account new techniques – in this paper called user-driven mapping strategies – were proposed. The main difference is that the user can influence a default fixed mapping strategy using annotations which specify the required mapping for particular schema fragments. The set of allowed mappings is naturally limited but still powerful enough to define various mapping strategies.
Each of the techniques is characterized by the following four features:
1. an initial XML schema Sinit,
2. a set of allowed fixed XML-to-relational mappings {f^i_map}i=1,...,n,
3. a set of annotations A, each of which is specified by name, target, allowedvalues, and function, and
4. a default mapping strategy fdef for non-annotated fragments.
3.2.1 MDF
Probably the first approach which faces the mentioned issues is proposed in paper [8] as a Mapping definition framework (MDF). It allows users to specify the required mapping, checks its correctness and completeness, and completes possible incompleteness. The mapping specifications are made by annotating the input XSD with a predefined set of attributes A listed in Table 1.
Attribute | Target | Value | Function
outline | attribute or element | true, false | If the value is true, a separate table is created for the attribute / element. Otherwise, it is inlined.
tablename | attribute, element, or group | string | The string is used as the table name.
columnname | attribute, element, or simple type | string | The string is used as the column name.
sqltype | attribute, element, or simple type | string | The string defines the SQL type of a column.
structurescheme | root element | KFO, Interval, Dewey | Defines the way of capturing the structure of the whole schema.
edgemapping | element | true, false | If the value is true, the element and all its subelements are mapped using the Edge mapping.
maptoclob | attribute or element | true, false | If the value is true, the element / attribute is mapped to a CLOB column.

Table 1: Annotation attributes for MDF
As we can see, the set of allowed XML-to-relational mappings {f^i_map}i=1,...,n involves inlining and outlining of an element / attribute, the Edge mapping [10] strategy, and mapping an element or an attribute to a CLOB column. Furthermore, it enables users to specify the required capturing of the structure of the whole schema using one of the following three approaches:
• Key, Foreign Key, and Ordinal Strategy (KFO) – each node is assigned a unique integer ID and a foreign key pointing to the parent ID; the sibling order is captured using an ordinal value

• Interval Encoding – a unique {start, end} interval is assigned to each node corresponding to preorder and postorder traversal entering times

• Dewey Decimal Classification – each node is assigned a path to the root node described using a concatenation of node IDs along the path
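As an illustration of the third scheme, the following sketch assigns Dewey-style labels by concatenating ordinal positions along the path from the root; the nested-tuple tree encoding is an assumption for the example.

```python
def dewey_label(tree, prefix=""):
    """tree: (name, children) with children a list of subtrees; yields
    (dewey_id, name) pairs in document order. A node's ID is its parent's
    ID extended with the node's ordinal position among its siblings."""
    name, children = tree
    base = prefix or "1"                  # the root gets ID "1"
    yield (base, name)
    for i, child in enumerate(children, start=1):
        yield from dewey_label(child, f"{base}.{i}")
```

A useful property of such labels is that ancestry reduces to a prefix test: node "1.2" is an ancestor of "1.2.1" because its ID is a prefix of the descendant's ID.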
Attributes for specifying the names of tables or columns and the data types of columns can be considered side effects. Non-annotated parts are stored using user-predefined rules, whereas such a mapping is always a fixed one.
3.2.2 XCacheDB

Paper [4] also proposes a user-driven mapping strategy which is implemented and experimentally tested as the XCacheDB system, which considers only unordered and acyclic XML schemes and omits mixed-content elements. The set of annotating attributes A that can be assigned to any node v ∈ Sinit is listed in Table 2.
Attribute | Value | Function
INLINE | ∅ | If placed on a node v, the fragment rooted at v is inlined into the parent table.
TABLE | ∅ | If placed on a node v, a new table is created for the fragment rooted at v.
STORE BLOB | ∅ | If placed on a node v, the fragment rooted at v is also stored in a BLOB column.
BLOB ONLY | ∅ | If placed on a node v, the fragment rooted at v is stored in a BLOB column.
RENAME | string | The value specifies the name of the corresponding table or column created for node v.
DATATYPE | string | The value specifies the data type of the corresponding column created for node v.

Table 2: Annotation attributes for XCacheDB
It enables inlining and outlining of a node, storing a fragment in a BLOB column, specifying table or column names, and specifying column data types. The main difference is in the data redundancy allowed by the attribute STORE BLOB, which enables shredding the data into table(s) and at the same time storing pre-parsed XML fragments in a BLOB column.
The fixed mapping uses a slightly different strategy: each element or attribute node is assigned a unique ID. Each fragment f is mapped to a table Tf which has an attribute avID of the ID data type for each element or attribute node v ∈ f. If v is an atomic node7, Tf also has an attribute av of the same data type as v. For each distinct path that leads to f from a repeatable ancestor v, Tf has a parent reference column of ID type which points to the ID of v.
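A minimal sketch of this table layout, using an invented column-naming convention and marking non-atomic nodes (elements with subelements) with a `None` type:

```python
def xcachedb_table(fragment_nodes):
    """Build the column list of table T_f for one fragment.
    `fragment_nodes` is a list of (name, sql_type_or_None) pairs; a None
    type marks a non-atomic node. Every node contributes an ID column,
    atomic nodes additionally contribute a value column."""
    columns = []
    for name, sql_type in fragment_nodes:
        columns.append((f"{name}_id", "ID"))   # a_vID column for node v
        if sql_type is not None:               # atomic node: value column a_v
            columns.append((name, sql_type))
    return columns
```

Parent reference columns for repeatable ancestors are omitted here for brevity; they would be appended in the same fashion, one per distinct path.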
3.3 Theoretic Issues
Besides proposals of cost-driven and user-driven techniques, there are also papers which discuss the corresponding open issues on a theoretic level.
3.3.1 Data Redundancy
As mentioned above, the XCacheDB system allows a certain degree of redundancy, in particular duplication in BLOB columns and the violation of the BCNF or 3NF condition. Paper [4] also discusses the strategy on a theoretic level and defines four classes of XML schema decompositions. Before we state the definitions we have to note that the approach is based on a slightly different
7An attribute node or an element node having no subelements.
graph representation than Definition 4. The nodes of the graph correspond to elements, attributes, or pcdata, whereas edges are labelled with the corresponding operators.
Definition 7 A schema decomposition is minimal if all edges connecting nodes of different fragments are labelled with "*" or "+".

Definition 8 A schema decomposition is 4NF if all fragments are 4NF fragments. A fragment is 4NF if no two nodes of the fragment are connected by a "*" or "+" labelled edge.

Definition 9 A schema decomposition is non-MVD if all fragments are non-MVD fragments. A fragment is non-MVD if all "*" or "+" labelled edges appear in a single path.

Definition 10 A schema decomposition is inlined if it is non-MVD but it is not a 4NF decomposition. A fragment is inlined if it is non-MVD but it is not a 4NF fragment.
According to these definitions, fixed mapping strategies (e.g. [26] [18]) naturally consider only 4NF decompositions, which are the least space-consuming and seem to be the best choice if we do not consider any other information. Paper [4] shows that having further information (in this particular case given by a user), the choice of another type of decomposition can lead to more efficient query processing, though it requires a certain level of redundancy.
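The fragment classes of Definitions 8–10 can be checked mechanically. The following sketch classifies a tree-shaped fragment given as a list of labelled edges; the encoding is an assumption for the example.

```python
STAR = {"*", "+"}

def classify_fragment(edges):
    """edges: list of (parent, child, label) triples of a tree-shaped
    fragment. Returns "4NF", "inlined", or "MVD" per Definitions 8-10."""
    parent = {c: p for (p, c, _) in edges}
    starred = [(p, c) for (p, c, lab) in edges if lab in STAR]

    def on_path(a, b):
        # True if one of the two edges lies on the root path of the other
        for top, bottom in ((a, b), (b, a)):
            node = bottom[0]              # parent endpoint of the lower edge
            while node in parent:
                if (parent[node], node) == top:
                    return True
                node = parent[node]
        return False

    if not starred:
        return "4NF"                      # no "*"/"+" edge at all
    if all(a == b or on_path(a, b) for a in starred for b in starred):
        return "inlined"                  # non-MVD but not 4NF
    return "MVD"                          # starred edges on diverging paths
```

Two starred edges on one root-to-leaf path keep the fragment non-MVD (hence inlined), while starred edges under different siblings violate the single-path condition.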
3.3.2 Grouping problem
Paper [15] deals with the idea that searching for a (sub)optimal relational decomposition is not only related to the given XML schema, query workload, and XML data, but is also highly influenced by the chosen query translation algorithm8 and the cost model. For theoretic purposes a subset of the problem – the so-called grouping problem – is considered. It deals with possible storage strategies for shared subelements, i.e. either in one common table (the so-called fully grouped strategy) or in separate tables (the so-called fully partitioned strategy). For the analysis of its complexity the authors define two simple cost metrics:
• RelCount – the cost of a relational query is the number of relation instances in the query expression

• RelSize – the cost of a relational query is the sum of the numbers of tuples in the relation instances in the query expression
and three query translation algorithms:
• Naive Translation – performs a join between the relations corresponding to all the elements appearing in the query; a wild-card query9 is converted into a union of several queries, one for each satisfying wild-card substitution
8 An algorithm for translating XML queries into SQL queries.
9 A query containing "//" or "/*" operators.
• Single Scan – a separate relational query is issued for each leaf element, joining all relations on the path until the least common ancestor of all the leaf elements is reached

• Multiple Scan – the Single Scan algorithm is applied to each relation containing a part of the result, and the resulting query consists of the union of the partial queries
On a simple example the authors show that for a wild-card query Q which retrieves a shared fragment f, the fully partitioned strategy performs better with the Naive Translation algorithm, whereas the fully grouped strategy performs better with the Multiple Scan algorithm. Furthermore, they illustrate that the reliability of the chosen cost model is also closely related to the query translation strategy. If a query contains a not very selective predicate, then the optimizer may choose a plan that scans the corresponding relations, and thus RelSize is a good corresponding metric. On the other hand, in the case of a highly selective predicate the optimizer may choose an index lookup plan, and thus RelCount is a good metric.
4 Summary
We can sum up the state of the art of the adaptability of database-based XML-processing methods into the following natural but important findings:

1. As the storage strategy has a crucial impact on query-processing performance, a fixed mapping based on predefined rules and heuristics is not universally efficient.

2. It is not an easy task to choose an optimal mapping strategy for a particular application, and thus it is not advisable to rely only on the user's experience and intuition.

3. As the space of possible XML-to-relational mappings is very large (usually theoretically infinite) and most of the subproblems are even NP-hard, an exhaustive search is often impossible. It is necessary to define search heuristics, approximation algorithms, and/or reliable terminal conditions.

4. The choice of an initial schema can strongly influence the efficiency of the search algorithm. It is reasonable to start with an at least "locally good" schema.

5. A strategy for finding a (sub)optimal XML schema should take into account not only the given schema, query workload, and XML data statistics, but also possible query translations, cost metrics, and their consequences.

6. Cost evaluation of a particular XML-to-relational mapping should not involve the time-consuming construction of the relational schema, loading of XML data, and analysis of the resulting relational structures. It can be optimized using cost estimation of XML queries, XML data statistics, etc.

7. Despite the previous claim, the user should be allowed to influence the mapping strategy. On the other hand, the approach should not demand a full schema specification but should complete the user-given hints.

8. Even though a storage strategy is able to adapt to a given sample of schemes, data, queries, etc., its efficiency is still endangered by later changes of the expected usage.
5 Open Issues

Although each of the existing approaches brings certain interesting ideas and optimizations, there is still space for possible future improvements of the adaptable methods. We describe and discuss them in this section, starting from (in our opinion) the least complex ones.
Missing Input Data As we already know, for cost-driven techniques there are three types of input data – an XML schema Sinit, a set of XML documents {d1, d2, ..., dk}, and a set of XML queries {q1, q2, ..., ql}. The problem of a missing schema Sinit was already outlined in the introduction in connection with the (dis)advantages of generic and schema-driven methods. As we suppose that adaptability is the ability to adapt to the given situation, a method which does not depend on the existence of an XML schema but can exploit the information if it is given is probably a natural first improvement. This idea is also strongly related to the mentioned problem of the choice of a locally good initial schema Sinit. The corresponding questions are: Can the user-given schema be considered a good candidate for Sinit? How can we find a possibly better candidate? Can we find such a candidate for schema-less XML documents? A possible solution can be found in the exploitation of methods for the automatic construction of an XML schema for a given set of XML documents (e.g. [21] [23]). Assuming that documents are more precise sources of structural information, we can expect that a schema generated on their basis will have better characteristics too.
On the other hand, the problem of missing input XML documents can be at least partly solved using reasonable default settings based on general analyses of real XML data (e.g. [5] [20]). Furthermore, the surveys show that real XML data are surprisingly simple, and thus the default mapping strategy does not have to be complex either. It should rather focus on the efficient processing of frequently used XML patterns.
Finally, the presence of a sample query workload is crucial since (to our knowledge) there are no analyses of real XML queries, i.e. no source of information for default settings. The reason is that collecting such real representatives is not as straightforward as in the case of XML documents. Currently the best sources of XML queries are XML benchmarking projects (e.g. [24] [30]), but as the data and especially the queries are supposed to be used for rating the performance of a system in various situations, they cannot be considered an example of a real workload. Naturally, the query statistics can be gathered by the system itself and the schema can be adapted continuously, as discussed later in the text.
Efficient Solution of Subproblems A surprising fact we have encountered is the numerous simplifications of the chosen solutions. As mentioned, some of the techniques omit, e.g., the ordering of elements, mixed contents, or recursion. This is a bit of a confusing finding given the fact that there are proposals for the efficient processing of these XML constructs (e.g. [27]) and that adaptive methods should cope with various situations.
A similar observation can be made for user-driven methods. Though the proposed systems are able to store schema fragments in various ways, the default strategy for non-annotated parts of the schema is again a fixed one. It could be an interesting optimization to join the ideas and search for the (sub)optimal mapping for non-annotated parts using a cost-driven method.
Deeper Exploitation of Information Another open issue is the possible deeper exploitation of the information given by the user. We can identify two main questions: How can the user-given information be better exploited? Is there any other information a user can provide to increase the efficiency?
A possible answer can be found in the idea of pattern matching, i.e. using the user-given schema annotations as "hints" on how to store particular XML patterns. We can naturally predict that structurally similar fragments should be stored similarly, and thus focus on finding these fragments in the rest of the schema. The main problem is how to identify the structurally similar fragments. If we consider the variety of XML-to-XML transformations, two structurally identical fragments can be expressed using "at first glance" different regular expressions. Thus it is necessary to propose particular levels of equivalence of XML schema fragments and algorithms to determine them. Last but not least, such a system should focus on the scalability of the similarity metric and particularly its reasonable default setting (based on existing analyses of real-world data).
Theoretical Analysis of the Problem As the overview shows, there are various types of XML-to-XML transformations, whereas the mentioned ones certainly do not cover the whole set of possibilities. Unfortunately there seems to be no theoretic study of these transformations, their key characteristics, and possible classifications. The study can, among others, focus on equivalent and generalizing transformations and as such serve as a good basis for the pattern matching strategy. Especially interesting will be the question of NP-hardness in connection with the set of allowed transformations and its complexity (similarly to paper [15], which analyzes the theoretical complexity of combinations of cost metrics and query translation algorithms). Such a survey would provide useful information especially for optimizations of the search algorithm.
Dynamic Adaptability The last but not least issue is connected with the most striking disadvantage of adaptive methods – the problem of possible changes of XML queries or XML data that can lead to a crucial worsening of the efficiency. As mentioned above, it is also related to the problem of missing input XML queries and ways to gather them. The question of changes of XML data also opens another wide research area of the updatability of the stored data – a feature that is often omitted in current approaches although its importance is crucial.
The solution to these issues – i.e. a system that is able to adapt dynamically – is obvious and challenging, but it is not an easy task. It should especially avoid total reconstructions of the whole relational schema and the corresponding necessary reinsertion of all the stored data, or such an operation should be done only in very special cases. On the other hand, this "brute-force" approach can serve as an inspiration. Supposing that changes, especially in the case of XML queries, will not be radical, the modifications of the relational schema will be mostly local and we can apply the expensive reconstruction just locally. Furthermore, we can again exploit the idea of pattern matching and find the XML pattern defined by the modified schema fragment in the rest of the schema.
Another question is how often the relational schema should be reconstructed. The natural idea is of course "not too often". But, on the other hand, research can be done on the idea of performing gradual minor changes. It is probable that such an approach will lead to a less expensive (in terms of reconstruction) and at the same time more efficient (in terms of query processing) system. The former hypothesis should be verified; the latter one can be almost certainly expected. The key issue is how to find a reasonable compromise.
6 Conclusion
The main goal of this paper was to describe and discuss the current state of the art and open issues of adaptability in database-based XML-processing methods. Firstly, we have stated the reasons why this topic should still be studied. Then we have provided an overview and classification of the existing approaches and summed up the key findings. Finally, we have discussed the corresponding open issues and their possible solutions. Our aim was to show that the idea of processing XML data using relational databases is still up to date and should be further developed. From the overview we can see that even though there are interesting and inspiring approaches, there is still a variety of open problems whose solutions can further improve database-based XML processing.
Our future work will naturally follow the open issues stated at the end of this paper and especially survey the solutions we have mentioned. Firstly, we will focus on the idea of improving the user-driven techniques using an adaptive algorithm for non-annotated parts of the schema, together with a deeper exploitation of the user-given hints using pattern-matching methods – i.e. a hybrid user-driven cost-based system. Secondly, we will deal with the problem of the missing theoretic study of schema transformations, their classification, and particularly their influence on the complexity of the search algorithm. And finally, on the basis of the theoretical study and the hybrid system, we will study and experimentally analyze the dynamic enhancing of the system.
References
[1] DB2 XML Extender. IBM. http://www.ibm.com/.
[2] Oracle XML DB. Oracle Corporation. http://www.oracle.com/.
[3] S. Amer-Yahia. Storage Techniques and Mapping Schemas for XML. Technical Report TD-5P4L7B, AT&T Labs-Research, 2003.
[4] A. Balmin and Y. Papakonstantinou. Storing and Querying XML Data Using Denormalized Relational Databases. The VLDB Journal, 14(1):30–49, 2005.
[5] D. Barbosa, L. Mignet, and P. Veltri. Studying the XML Web: Gathering Statistics from an XML Sample. World Wide Web, 8(4):413–438, 2005.
[6] G. J. Bex, F. Neven, and J. Van den Bussche. DTDs versus XML Schema: a Practical Study. In WebDB'04: Proc. of the 7th Int. Workshop on the Web and Databases, pages 79–84, New York, NY, USA, 2004. ACM Press.
[7] P. V. Biron and A. Malhotra. XML Schema Part 2: Datatypes (Second Edition). W3C, October 2004.
[8] F. Du, S. Amer-Yahia, and J. Freire. ShreX: Managing XML Documents in Relational Databases. In VLDB'04: Proc. of the 30th Int. Conf. on Very Large Data Bases, pages 1297–1300, Toronto, ON, Canada, 2004. Morgan Kaufmann Publishers Inc.
[9] T. Bray et al. Extensible Markup Language (XML) 1.0 (Fourth Edition). W3C, September 2006.
[10] D. Florescu and D. Kossmann. Storing and Querying XML Data using an RDMBS. IEEE Data Eng. Bull., 22(3):27–34, 1999.
[11] J. Freire, J. R. Haritsa, M. Ramanath, P. Roy, and J. Simeon. StatiX: Making XML Count. In ACM SIGMOD'02: Proc. of the 21st Int. Conf. on Management of Data, pages 181–192, Madison, Wisconsin, USA, 2002. ACM Press.
[12] T. Grust. Accelerating XPath Location Steps. In SIGMOD'02: Proc. of the ACM SIGMOD Int. Conf. on Management of Data, pages 109–120, New York, NY, USA, 2002. ACM Press.
[13] M. Klettke and H. Meyer. XML and Object-Relational Database Systems – Enhancing Structural Mappings Based on Statistics. In Lecture Notes in Computer Science, volume 1997, pages 151–170, 2000.
[14] M. Kratky, J. Pokorny, and V. Snasel. Implementation of XPath Axes in the Multi-Dimensional Approach to Indexing XML Data. In Proc. of Current Trends in Database Technology – EDBT'04 Workshops, pages 46–60, Heraklion, Crete, Greece, 2004. Springer.
[15] R. Krishnamurthy, V. Chakaravarthy, and J. Naughton. On the Difficulty of Finding Optimal Relational Decompositions for XML Workloads: A Complexity Theoretic Perspective. In ICDT'03: Proc. of the 9th Int. Conf. on Database Theory, pages 270–284, Siena, Italy, 2003. Springer.
[16] A. Kuckelberg and R. Krieger. Efficient Structure Oriented Storage of XML Documents Using ORDBMS. In Proc. of the VLDB'02 Workshop EEXTT and CAiSE'02 Workshop DTWeb, pages 131–143, London, UK, 2003. Springer-Verlag.
[17] Q. Li and B. Moon. Indexing and Querying XML Data for Regular Path Expressions. In VLDB'01: Proc. of the 27th Int. Conf. on Very Large Data Bases, pages 361–370, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc.
[18] I. Mlynkova and J. Pokorny. From XML Schema to Object-Relational Database – an XML Schema-Driven Mapping Algorithm. In ICWI'04: Proc. of the IADIS Int. Conf. WWW/Internet, pages 115–122, Madrid, Spain, 2004. IADIS.
[19] I. Mlynkova and J. Pokorny. XML in the World of (Object-)Relational Database Systems. In ISD'04: Proc. of the 13th Int. Conf. on Information Systems Development, pages 63–76, Vilnius, Lithuania, 2004. Springer Science+Business Media, Inc.
[20] I. Mlynkova, K. Toman, and J. Pokorny. Statistical Analysis of Real XML Data Collections. In COMAD'06: Proc. of the 13th Int. Conf. on Management of Data, pages 20–31, New Delhi, India, 2006. Tata McGraw-Hill Publishing Company Limited.
[21] C.-H. Moh, E.-P. Lim, and W. K. Ng. DTD-Miner: A Tool for Mining DTD from XML Documents. In WECWIS'00: Proc. of the 2nd Int. Workshop on Advanced Issues of E-Commerce and Web-Based Information Systems, pages 144–151, Milpitas, CA, USA, 2000. IEEE.
[22] M. Murata, D. Lee, and M. Mani. Taxonomy of XML Schema Languages Using Formal Language Theory. In Proc. of the Extreme Markup Languages Conf., Montreal, Quebec, Canada, 2001.
[23] S. Nestorov, S. Abiteboul, and R. Motwani. Extracting Schema from Semistructured Data. In SIGMOD'98: Proc. of the ACM Int. Conf. on Management of Data, pages 295–306, Seattle, Washington, USA, 1998. ACM Press.
[24] E. Rahm and T. Bohme. XMach-1: A Benchmark for XML Data Management. Database Group Leipzig, 2006.
[25] M. Ramanath, J. Freire, J. Haritsa, and P. Roy. Searching for Efficient XML-to-Relational Mappings. In XSym'03: Proc. of the 1st Int. XML Database Symposium, volume 2824, pages 19–36, Berlin, Germany, 2003. Springer.
[26] J. Shanmugasundaram, K. Tufte, C. Zhang, G. He, D. J. DeWitt, and J. F. Naughton. Relational Databases for Querying XML Documents: Limitations and Opportunities. In VLDB'99: Proc. of the 25th Int. Conf. on Very Large Data Bases, pages 302–314, San Francisco, CA, USA, 1999. Morgan Kaufmann Publishers Inc.
[27] I. Tatarinov, S. D. Viglas, K. Beyer, J. Shanmugasundaram, E. Shekita, and C. Zhang. Storing and Querying Ordered XML Using a Relational Database System. In SIGMOD'02: Proc. of the 21st Int. Conf. on Management of Data, pages 204–215, Madison, Wisconsin, USA, 2002. ACM Press.
[28] H. S. Thompson, D. Beech, M. Maloney, and N. Mendelsohn. XML Schema Part 1: Structures (Second Edition). W3C, October 2004.
[29] W. Xiao-ling, L. Jin-feng, and D. Yi-sheng. An Adaptable and Adjustable Mapping from XML Data to Tables in RDB. In Proc. of the VLDB'02 Workshop EEXTT and CAiSE'02 Workshop DTWeb, pages 117–130, London, UK, 2003. Springer-Verlag.
[30] B. B. Yao and M. T. Ozsu. XBench – A Family of Benchmarks for XML DBMSs. University of Waterloo, School of Computer Science, Database Research Group, 2002.
[31] S. Zheng, J. Wen, and H. Lu. Cost-Driven Storage Schema Selection for XML. In DASFAA'03: Proc. of the 8th Int. Conf. on Database Systems for Advanced Applications, pages 337–344, Kyoto, Japan, 2003. IEEE Computer Society.
Abstract

XML became the most significant standard for data exchange and publication over the internet, but most business data remain stored in relational databases. Access to business data in workflows or Web services is frequently realized by accessing XML views published over relational databases. In these loosely coupled environments, where activities can execute on XML views generated from relational data without a possibility to lock the base data, it is necessary to provide view freshness control and invalidation mechanisms. In this paper we present an invalidation method for XML views published over relational data, developed for our prototype workflow management system.
Keywords: Workflow management, workflow data, XML, XML views, view invalidation
1 Introduction
XML became the most significant standard for data exchange and publication over the internet. Nevertheless, most business data remain stored in relational databases. XML views over relational data are seen as a general way to publish relational data as XML documents. There are many proposals to overcome the mismatch between the flat relational and hierarchical XML models (e.g. [1,2]). Also, commercial relational database management systems offer a possibility to publish relational data as XML (e.g. [3,4]).
The importance of XML technology is increasing tremendously in process management. Web services [5], workflow management systems [6] and B2B standards [7,8] use XML as a data format. Complex XML documents published and exchanged by business processes are usually defined with XML Schema types. Process activities expect and produce XML documents as parameters. XML documents encapsulated in messages (e.g.
International Journal of Computer Science & Applications, Vol. IV, No. II, pp. 63 - 74
Christian Dreier, Johann Eder, Marek Lehman, Juergen Mangler
WSDL) can trigger new process instances. Frequently, hierarchical XML data used by processes and activities have to be translated into the flat relational data model used by external databases. These systems are very often loosely coupled, and it is impossible or very difficult to provide view maintenance. On the other hand, the original data can be accessed and modified by other systems or application programs. Therefore, a method of controlling the freshness of a view and invalidating views becomes vital.
We developed a view invalidation method for our prototype workflow management system. A special data access module, the so-called generic data access plug-in (GDAP), enables the definition of XML views over relational data. GDAP offers a possibility to check the view freshness and can invalidate a stale view. In the case of view update operations, the GDAP automatically checks whether the view is stale before propagating the update to the original database.
The remainder of this paper is organized as follows: Section 2 presents an overall architecture of our prototype workflow management system and introduces the idea of data access plug-ins used to provide uniform access to external data in workflows. Section 3 discusses invalidation mechanisms for XML views defined over relational data, and Section 4 describes their actual implementation in our system. We give some overview of related work in Section 5 and finally draw conclusions in Section 6.
2 Uniform and Flexible Data Access in Workflow Management Systems
Workflow management systems are not intended to provide general data management capabilities, although they have to be able to work with large amounts of data coming from different sources. Business data, describing persistent business information necessary to run an enterprise, may be controlled either by a workflow management system or be managed in external systems (e.g. a corporate database). The workflow management system needs direct data access to make control flow decisions based upon data values. An important drawback is that data external to the workflow management system can only be used indirectly for this purpose, e.g. be queried for control decisions. Therefore most of the activity programming is related to accessing external databases [9].
We propose to provide the workflow management system with a uniform and transparent access method to all business data stored in any data source. The workflow management system should be able to use data coming from external and independent systems to determine a state transition or to pass them between activities as parameters. This is achieved by an abstraction layer called data access plug-ins.
The general architecture of our workflow management prototype is presented in Fig. 1. The workflow engine provides operational functions to support the execution of processes. The workflow repository stores both workflow definition and instance data. The program interaction manager calls programs implementing automated activities. The worklist manager is responsible for the worklists of the human actors and for the interaction with the worklist handlers. The data access plug-in manager is responsible for registering and managing data access plug-ins. Apart from the generic data access plug-in there may be specialized plug-ins for specific data sources (e.g. legacy systems). Our implementation included a generic data access plug-in for relational databases and another one for XML files stored in a file system.
International Journal of Computer Science & Applications, Vol. IV, No. II, pp. 63 - 74
Christian Dreier, Johann Eder, Marek Lehman, Juergen Mangler
[Figure 1 depicts the WfMS components (Workflow Engine, Workflow Repository, Program Interaction Manager, Worklist Manager with its worklist handlers, and the Data Access Plug-in Manager with its data access plug-ins) and their connections to external systems and external data sources.]
Figure 1: Workflow management system architecture with data access plug-ins
2.1 Data Access Plug-ins
Data access plug-ins are reusable and interchangeable wrappers around external data sources which present the content of the underlying data sources to the workflow management system and manage the access to it. The functionality of the external data sources is abstracted in these plug-ins.
Each data access plug-in provides documents in one or several predefined XML Schema types. Both a data access plug-in and the XML Schema types served by this plug-in are registered to the workflow management system. Once registered, a data access plug-in can be reused in many workflow definitions to access external data as XML documents of a given type.
Consider the following frequent scenario: an enterprise has a large database with the customer data stored in several relations and used in many processes. In our approach the company defines a complex XML Schema type describing customer data and implements a data access plug-in which wraps this database and retrieves and stores customer data in XML format. This has several advantages:
• Business data from external systems are accessible by the workflow management system. Thus, these data can be passed to activities and used to make control flow decisions.
• Activities can be parameterized with XML documents of predefined types. The logic for accessing external data sources is hidden in a data access plug-in fetching documents passed to activities at runtime. This allows activities to be truly reusable and independent of the physical data location.
• Making external data access explicit with the data access plug-ins rather than hiding it in the activities improves the understandability, maintainability and auditability of process definitions.
• Both data access plug-ins and XML Schema types are reusable.
• This solution is easily evolvable. If the customer data have to be moved to a different database, it is sufficient to use another data access plug-in. The process definition and activities remain basically unchanged.
The task of a data access plug-in is to translate the operations on XML documents to the underlying data sources. A data access plug-in exposes to the workflow management system a simple interface which allows XML documents to be read, written or created in a collection of many documents of the same XML Schema type. Each document in the collection is identified by a unique identifier. The plug-in must be able to identify the document in the collection given only this identifier.
Each data access plug-in allows an XPath expression to be evaluated on a selected XML document. The XML documents used within a workflow can be used by the workflow engine to control the flow of workflow processing. This is done in conditional split nodes by evaluating XPath conditions on documents. If a given document is stored in an external data source and accessed by a data access plug-in, then the XPath condition has to be evaluated by this plug-in. XPath is also used to access data values in XML documents.
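A minimal sketch of what such a plug-in interface could look like is given below. All names are illustrative, not the prototype's actual API, and the in-memory implementation merely stands in for a wrapper around a real data source:

```python
from abc import ABC, abstractmethod
import xml.etree.ElementTree as ET

class DataAccessPlugin(ABC):
    """Wrapper around an external data source exposing XML documents
    of a registered XML Schema type (illustrative interface)."""

    @abstractmethod
    def read_document(self, doc_id: str) -> str: ...
    @abstractmethod
    def write_document(self, doc_id: str, xml: str) -> None: ...
    @abstractmethod
    def create_document(self, xml: str) -> str: ...
    @abstractmethod
    def evaluate_xpath(self, doc_id: str, xpath: str) -> list: ...

class InMemoryPlugin(DataAccessPlugin):
    """Toy implementation backed by a dict instead of a real data source."""
    def __init__(self):
        self._docs, self._next_id = {}, 0
    def read_document(self, doc_id):
        return self._docs[doc_id]
    def write_document(self, doc_id, xml):
        self._docs[doc_id] = xml
    def create_document(self, xml):
        self._next_id += 1
        doc_id = str(self._next_id)
        self._docs[doc_id] = xml
        return doc_id
    def evaluate_xpath(self, doc_id, xpath):
        # ElementTree supports a limited XPath subset, enough for a sketch
        return ET.fromstring(self._docs[doc_id]).findall(xpath)
```

A real plug-in would implement the same four operations against a database or legacy system, with the workflow engine calling evaluate_xpath in conditional split nodes.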
2.2 Generic Data Access Plug-in for Relational Data Sources
Most business data remain stored in relational databases. Therefore, a generic and expandable solution for relational data sources was needed. A generic data access plug-in (GDAP) offers basic operations and can be extended by users for their specific data sources. GDAP is responsible for the mapping of the hierarchical XML documents used by workflows and activities into the flat relational data model used by external databases. Thus, documents produced by GDAP can be seen as XML views of relational data.
The workflows and activities managed by the workflow management system can run for a long time. In a loosely coupled workflow scenario it is neither reasonable nor possible to lock data in the original database for the processing time of a workflow. At the same time these data can be modified by other systems or workflows. In order to provide optimistic concurrency control, some form of view invalidation is required [4]. Therefore, GDAP provides a view freshness control and view invalidation method. In the case of view update operations GDAP automatically checks that the view is not stale before propagating the update to the original database.
3 Change Detection and View Invalidation
In our GDAP (generic data access plug-in) we analyzed and implemented a mechanism for the invalidation of XML views of relational data by detecting relevant base data changes. Change detection in base relational data can be done in two ways: passively or actively.
In our approach, passive change detection means that a GDAP is informed about changes in relational data used in views managed by this plug-in. Therefore, it is necessary that a (passive) plug-in is able to subscribe certain views to this notification process. Additionally, a view that is not used any longer needs to be unsubscribed. We identified three passive mechanisms for change detection:
1. Passive change detection by use of concepts of active databases: triggers are defined on base relations containing data published in the views, informing the GDAP about changes in the database.
2. Passive change detection by change methods: This mechanism is based on object-oriented and object-relational databases, providing the possibility to implement
change methods. Change methods that implement change operations on the underlying base data can be extended by functionality to inform the plug-ins in case of potential view invalidations.
3. Passive change detection by event mechanisms: This is the most general approach because here an additional publish-subscribe mechanism is assumed. GDAPs subscribe views for different events on the database (e.g. change operations on base data). If an event occurs, a notification is sent to the GDAP, initiating the view invalidation process.
On the other hand, using active change detection techniques, it is the GDAP's own responsibility to check whether the underlying base data has changed, periodically or at defined points in time. Because no concepts of active or object-oriented databases and no publish-subscribe mechanisms are required, these techniques are universal. We distinguish three active change detection methods:
1. A naive approach is to back up the relevant relational base data at view creation time and compare it to the relational base data at the time when the view validity has to be checked. Differences between these two data sets may lead to view invalidation.
2. To avoid storing huge amounts of backup data, a hash function can be used to compute a hash value of the data and back up only this value. When a view validity check becomes necessary, a hash value is computed on the new database state and compared to the backed-up value to determine changes. Notice that in this case a collision can occur, i.e. hash values can be the same for different data, so that a change goes undetected and a stale view is mistakenly treated as fresh (under-invalidation).
3. Usage of change logs: change logs record all changes within the database caused by internal or external actors. Because no database state needs to be backed up at view creation time, less space is used.
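The hash-based variant (method 2) can be sketched as follows; the function names and the row representation are illustrative assumptions, not the paper's implementation:

```python
import hashlib

def snapshot_hash(rows):
    """Hash a deterministic serialization of the relevant base tuples;
    `rows` stands for the result of querying the view's base relations.
    Sorting makes the hash independent of the retrieval order."""
    h = hashlib.sha256()
    for row in sorted(map(repr, rows)):
        h.update(row.encode("utf-8"))
    return h.hexdigest()

def view_may_be_stale(current_rows, stored_hash):
    """Active check: recompute the hash over the current database state and
    compare it to the value stored at view creation time. A differing hash
    signals a change; equal hashes are almost always safe, but a hash
    collision could let a real change pass undetected."""
    return snapshot_hash(current_rows) != stored_hash
```

Only the fixed-size hash is stored per view, which is the space saving the paper describes; the price is the (astronomically unlikely) collision case.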
After changes in the base data have been detected, the second GDAP task is to determine whether the corresponding views became stale, i.e. to check whether they need to be invalidated or not. Not all changes in the base relational data lead to invalid views. We developed an algorithm that considers both the type of the change operation and the view definition to check the view invalidation.
Figure 2 gives an overview of this algorithm. It shows the different effects caused by different change operations: the change of a tuple that is displayed in the view (i.e. that satisfies the selection-condition) always leads to view invalidation. In certain cases change operations on tuples that are not displayed in the view lead to invalidation too: (1) changes of tuples selected by a where-clause make views stale; (2) changes invalidate all views whose view-definitions do not contain a where-clause, if the changed tuples occur in relations appearing in the from-clause of the view-definition.
These cases also apply to deletes and inserts. A tuple inserted by an insert operation invalidates the view if it is displayed in the view, if it is selected by the where-clause, or if the view-definition does not contain a where-clause and the tuple is inserted into a relation occurring in the from-clause of the view-definition.
The same applies to the case of deletes: a view is invalidated if the deleted tuple is displayed in the view, if it is selected by the where-clause, or if it is deleted from a relation occurring in the view-definition's from-clause.
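The decision logic described above can be sketched as a single predicate per changed tuple. The boolean inputs are assumed to be supplied by inspecting the view definition; the function name is illustrative:

```python
def view_invalidated(tuple_shown_in_view: bool,
                     has_where_clause: bool,
                     where_selects_tuple: bool,
                     tuple_in_from_relation: bool) -> bool:
    """Decide, for a single inserted, deleted or updated tuple, whether
    the view must be invalidated (decision tree of Figure 2)."""
    if tuple_shown_in_view:      # tuple satisfies the selection-clause
        return True
    if has_where_clause:         # tuple only matters if the where-clause selects it
        return where_selects_tuple
    # no where-clause: any change in a from-clause relation invalidates the view
    return tuple_in_from_relation
```

The same predicate is applied regardless of whether the change is an insert, delete or update, matching the three paragraphs above.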
[Figure 2 shows the decision tree: if the selection-clause is satisfied (the tuple is shown in the view), the view is invalid. Otherwise, if a where-clause exists and selects the tuple, the view is invalid; if no where-clause exists, the view is invalid if the tuple occurs in a relation occurring in the from-clause, and valid otherwise.]
Figure 2: View invalidation algorithm
If we assume that, for update-propagation reasons, the identifying primary keys of the base tuples are contained in the view, every tuple of the view can be associated with its base tuple and vice versa. Thus, every single change within the base data can be associated with the view.
4 Implementation
To validate our approach we implemented a generic data access plug-in which was integrated into our prototype workflow management system described in Section 2. The current implementation of the GDAP for relational databases (Fig. 3) takes advantage of the XML-DBMS middleware for transferring data between XML documents and relational databases [10]. XML-DBMS maps the XML document to the database according to an object-relational mapping in which element types are generally viewed as classes, and attributes and XML text data as properties of those classes. An XML-based mapping language allows the user to define an XML view of relational data by specifying these mappings. XML-DBMS also supports insert, update and delete operations. In our implementation we follow an assumption made by XML-DBMS that the view updateability problem has already been resolved. XML-DBMS checks only basic properties for view updateability, e.g. the presence of primary keys and obligatory attributes in an updateable view. Other issues, like content and structural duplication, are not addressed.
The GDAP controls the freshness of generated XML views using predefined triggers and so-called view-tuple lists (VTLs). A VTL contains the primary keys of the tuples which were selected into the view. VTLs are managed and stored internally by the GDAP. A sample view-tuple list which contains the primary keys of the displayed tuples in each involved relation is shown in Table 1.
Our GDAP also uses triggers defined on the tables which were used to create a view. These triggers are created together with the view definition and store the primary keys of all tuples inserted, deleted or modified within these relations in a special log. This log is inspected by the GDAP later, and the information gathered during this inspection is used for the invalidation process of the view. Thus, our implementation uses a variant of active change detection with change logs as described in Section 3.
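For illustration, the trigger-and-log mechanism could look as follows. SQLite syntax is used here purely as an example; the prototype's actual DDL and log schema are not given in the paper:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (tupleID INTEGER PRIMARY KEY, name TEXT, salary INTEGER);
CREATE TABLE change_log (relname TEXT, pk INTEGER);

-- created together with the view definition: log every affected primary key
CREATE TRIGGER emp_ins AFTER INSERT ON employees
BEGIN INSERT INTO change_log VALUES ('employees', NEW.tupleID); END;
CREATE TRIGGER emp_upd AFTER UPDATE ON employees
BEGIN INSERT INTO change_log VALUES ('employees', OLD.tupleID); END;
CREATE TRIGGER emp_del AFTER DELETE ON employees
BEGIN INSERT INTO change_log VALUES ('employees', OLD.tupleID); END;
""")
conn.execute("INSERT INTO employees VALUES (1, 'Joe', 1000)")
conn.execute("UPDATE employees SET salary = 1100 WHERE tupleID = 1")
# the GDAP would later inspect change_log to drive the invalidation check
logged = conn.execute("SELECT pk FROM change_log").fetchall()
```

Only primary keys land in the log, so the overhead per change is a single small row per affected tuple.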
Two different VTLs are used in the invalidation process: VTL_old is computed when the view is generated, VTL_new is computed at the time of the invalidation process. The primary keys of modified records in the original tables logged by a trigger are used to
detect possible changes in a view. The algorithm is as follows: for each tuple T in the change log check one of the following (ID_T denotes the identifying primary key of the tuple T):
• Case 1: ID_T is contained both in VTL_old and in VTL_new. This denotes an update operation.
• Case 2: ID_T is contained only in VTL_new. This denotes an insert operation.
• Case 3: ID_T is contained only in VTL_old. This denotes a delete operation.
The check procedure stops as soon as one of Cases 1-3 is true. This means that one of the modified tuples logged in the change log would be selected into the view and the view must be invalidated. But the view should also be invalidated if the selection-condition of the view definition is not satisfied. This is described by the next two cases, which also must
Table 2: View example

tupleID  name  salary  maxSalary
1        Joe   1000    3000
2        Bill  2000    3000
[Figure 4 shows the implemented decision tree: if case 1, case 2 or case 3 is satisfied, the view is invalid; otherwise, if case 4 or case 5 is satisfied, the view is invalid; in all other cases it remains valid.]
Figure 4: View invalidation method implemented in GDAP
be checked. View_old resp. View_new denotes the original view resp. the view generated during the validation checking process:
• Case 4: If VTL_old is not equal to VTL_new, the view is invalid because the set of tuples has changed.
• Case 5: If VTL_old is equal to VTL_new and View_old is not equal to View_new, the view has to be invalidated. That means that the tuples remained the same, but values within these tuples have changed.
To see why it is necessary to check Case 5, consider the following view-definition and the corresponding view listed in Table 2:
SELECT tupleID, name, salary,
       (SELECT max(salary) AS maxSalary
        FROM employees
        WHERE department='IT')
FROM employees
WHERE department='IT' AND tupleID < 3
If the salary of the employee with the maximum salary is changed (notice that this employee is not selected by the selection-condition), still the same tuples are selected, but within the view the value of maxSalary changes.
The invalidation checking in Cases 1-4 does not require view recomputation. Case 5 only needs to be checked if Cases 1-4 are not satisfied. Notice that while Cases 4 and 5 only have to be checked once, Cases 1-3 have to be checked for every tuple occurring in the change log. The invalidation algorithm used in our GDAPs is summarized in Fig. 4.
Case 5 is checked by comparing two views, as shown above. The disadvantages of this comparison are that the recomputation of the view View_new is time and resource consuming, and the comparison itself may be very inefficient. A more efficient way is to replace Case 5 by Case 5a:
Table 3: View example

tupleID  name  salary  comparisonSalary
1        Joe   1000    2500
2        Bill  2000    2500
• Case 5a: If VTL_old is equal to VTL_new and there is an additional sub-select-clause within the select-clause and any relation in its from-clause has been changed, then the view is invalidated.
This way the view shown in Table 2 can be invalidated after the attribute maxSalary is changed.
It is also possible that the sub-select-clause does not contain group functions, as in the following view-definition:
SELECT tupleID, name, salary,
       (SELECT salary AS comparisonSalary
        FROM employees
        WHERE tupleID='10')
FROM employees
WHERE department='IT' AND tupleID='3'
The resulting view is listed in Table 3. If there is a change operation on the salary of the employee with tupleID=10, the view has to be invalidated. This is checked by Case 5a. Additionally, all changes on relations occurring in the from-clause of the sub-select lead to an invalidation of the view. In Case 5a even changes that do not affect the view itself can make it stale. Thus, over-invalidation may occur to a high degree. Still, this mechanism of checking view freshness and view invalidation seems to be a more efficient alternative to Case 5.
5 Related Work
In most existing workflow management systems, the data used to control the flow of the workflow instances (i.e. workflow relevant data) are controlled by the workflow management system itself and stored in the workflow repository. If these data originate in external data sources, then the external data are usually copied into the workflow repository. There is no universal standard for accessing external data in workflow management systems; essentially, each product uses a different solution [11].
There has been recent interest in publishing relational data as XML documents, often called XML views over relational data. Most of these proposals focus on converting queries over XML documents into SQL queries over the relational tables and on efficient methods of tagging and structuring relational data into XML (e.g. [1, 2, 12]).
The view updateability problem is well known in relational and object-relational databases [13]. The mismatch between the flat relational and the hierarchical XML models is an additional challenge. This problem is addressed in [14]. However, most proposals of updateable XML views [15] and commercial RDBMS (e.g. [3]) assume that the XML view updateability problem is already solved. The term view freshness is not used in a uniform way, dealing with the currency and timeliness [16] of a (materialized) view. Additional
dimensions of view freshness regarding the frequency of changes are discussed in [17]. We do not distinguish all these dimensions in this paper. Here view freshness means the validity of a view. Different types of invalidation, including over- and under-invalidation, are discussed in [18].
A view is not fresh (stale) if the data used to generate the view were modified. It is important to detect relevant modifications. In [19] the authors proposed to store both the original XML documents and their materialized XML views in special relational tables and to use an update log to detect relevant updates. A new view generated from the modified base data may differ from a previously generated view. Several methods for change detection in XML documents were proposed (e.g. [20, 21]). The authors of [22] proposed to first store XML in special relational tables and then to use SQL queries to detect content changes of such documents. In [4] before and after images of an updateable XML view are compared in order to find the differences, which are later used to generate corresponding SQL statements responsible for updating the relational data. The before and after images are also used to provide optimistic concurrency control.
6 Conclusions
The data aspect of workflows requires more attention. Since workflows typically access databases for performing activities or making flow decisions, the correct synchronization between the base data and the copies of these data in workflow systems is of great importance for the correctness of the workflow execution. We described a way of recognizing the invalidation of materialized views of relational data used in workflow execution. To check the freshness of generated views our algorithm does not require any special data structures in the RDBMS except a log table and triggers. Additionally, view-tuple lists are managed to store the primary keys of the tuples selected into a view. Thus, only a very small amount of overhead data is stored and can be used to invalidate stale views.
The implemented generic data access plug-in enables the flexible publication of relational data as XML documents used in loosely coupled workflows. This brings obvious advantages for the intra- and interorganizational exchange of data. In particular, it makes the definition of workflows easier and the coupling between workflow system and databases more transparent, since it is no longer necessary to perform all the required checks in the individual activities of a workflow.
Acknowledgements
This work is partly supported by the Commission of the European Union within the FP6 STREP project WS-Diamond.
References
[1] M. Fernandez, Y. Kadiyska, D. Suciu, A. Morishima, and W.-C. Tan, "SilkRoute: A framework for publishing relational data in XML," ACM Trans. Database Syst., vol. 27, no. 4, pp. 438-493, 2002.
[2] J. Shanmugasundaram, J. Kiernan, E. J. Shekita, C. Fan, and J. Funderburk, "Querying XML views of relational data," in VLDB '01: Proceedings of the 27th International Conference on Very Large Data Bases. Morgan Kaufmann Publishers Inc., 2001, pp. 261-270.
[3] Oracle, XML Database Developer's Guide - Oracle XML DB, Release 2 (9.2), Oracle Corporation, October 2002.
[4] M. Rys, "Bringing the internet to your database: Using SQL Server 2000 and XML to build loosely-coupled systems," in Proceedings of the 17th International Conference on Data Engineering (ICDE), April 2-6, 2001, Heidelberg, Germany. IEEE Computer Society, 2001, pp. 465-472.
[5] T. Andrews, F. Curbera, H. Dholakia, Y. Goland, J. Klein, F. Leymann, K. Liu, D. Roller, D. Smith, S. Thatte, I. Trickovic, and S. Weerawarana, "Business process execution language for web services (BPEL4WS)," BEA, IBM, Microsoft, SAP, Siebel Systems, Tech. Rep. 1.1, 5 May 2003.
[6] WfMC, "Process definition interface - XML process definition language (XPDL 2.0)," Workflow Management Coalition, Tech. Rep. WFMC-TC-1025, 2005.
[8] M. Sayal, F. Casati, U. Dayal, and M.-C. Shan, "Integrating workflow management systems with business-to-business interaction standards," in Proceedings of the 18th International Conference on Data Engineering (ICDE'02). IEEE Computer Society, 2002, p. 287.
[9] M. Ader, "Workflow and business process management comparative study. Volume 2," Workflow & Groupware Strategies, Tech. Rep., June 2003.
[10] R. Bourret, "XML-DBMS middleware," viewed: May 2005, http://www.rpbourret.com/xmldbms/index.htm.
[11] N. Russell, A. H. M. t. Hofstede, D. Edmond, and W. v. d. Aalst, "Workflow data patterns," Queensland University of Technology, Brisbane, Australia, Tech. Rep. FIT-TR-2004-01, April 2004.
[12] J. Shanmugasundaram, E. J. Shekita, R. Barr, M. J. Carey, B. G. Lindsay, H. Pirahesh, and B. Reinwald, "Efficiently publishing relational data as XML documents," VLDB J., vol. 10, no. 2-3, pp. 133-154, 2001.
[13] C. Date, An Introduction to Database Systems, Eighth Edition. Addison Wesley, 2003.
[14] L. Wang and E. A. Rundensteiner, "On the updatability of XML views published over relational data," in Conceptual Modeling - ER 2004, 23rd International Conference on Conceptual Modeling, Shanghai, China, November 2004, Proceedings, ser. Lecture Notes in Computer Science, P. Atzeni, W. W. Chu, H. Lu, S. Zhou, and T. W. Ling, Eds., vol. 3288. Springer, 2004, pp. 795-809.
[15] I. Tatarinov, Z. G. Ives, A. Y. Halevy, and D. S. Weld, "Updating XML," in SIGMOD Conference, 2001.
[16] M. Bouzeghoub and V. Peralta, "A framework for analysis of data freshness," in IQIS 2004, International Workshop on Information Quality in Information Systems, 18 June 2004, Paris, France (SIGMOD 2004 Workshop), F. Naumann and M. Scannapieco, Eds. ACM, 2004, pp. 59-67.
[17] M. Bouzeghoub and V. Peralta, "On the evaluation of data freshness in data integration systems," in 20emes Journees de Bases de Donnees Avancees (BDA 2004), 2004.
[18] K. S. Candan, D. Agrawal, W.-S. Li, O. Po, and W.-P. Hsiung, "View invalidation for dynamic content caching in multitiered architectures," in VLDB, 2002, pp. 562-573.
[19] H. Kang, H. Sung, and C. Moon, "Deferred incremental refresh of XML materialized views: Algorithms and performance evaluation," in Database Technologies 2003, Proceedings of the 14th Australasian Database Conference, ADC 2003, Adelaide, South Australia, February 2003, ser. CRPIT, vol. 17. Australian Computer Society, 2003, pp. 217-226.
[20] G. Cobena, S. Abiteboul, and A. Marian, "Detecting changes in XML documents," in Proceedings of the 18th International Conference on Data Engineering (ICDE), 26 February - 1 March 2002, San Jose, CA. IEEE Computer Society, 2002, pp. 41-52.
[21] Y. Wang, D. J. DeWitt, and J. yi Cai, "X-Diff: An effective change detection algorithm for XML documents," in Proceedings of the 19th International Conference on Data Engineering (ICDE), March 5-8, 2003, Bangalore, India, U. Dayal, K. Ramamritham, and T. M. Vijayaraman, Eds. IEEE Computer Society, 2003, pp. 519-530.
[22] E. Leonardi, S. S. Bhowmick, T. S. Dharma, and S. K. Madria, "Detecting content changes on ordered XML documents using relational databases," in Database and Expert Systems Applications, 15th International Conference, DEXA 2004, Zaragoza, Spain, August 30-September 3, 2004, Proceedings, ser. Lecture Notes in Computer Science, F. Galindo, M. Takizawa, and R. Traunmuller, Eds., vol. 3180. Springer, 2004, pp. 580-590.
In this section we formalize the semantics of adding incremental preference or equivalence information on top of already existing base preferences or base equivalences. First, we provide the basic definitions required to model base and amalgamated preferences and equivalence relationships. Then, we provide basic theorems which allow for consistent incremented skyline computation (cf. [1]). Moreover, we show that it suffices to calculate incremental changes on transitively reduced preference diagrams. We show that local changes in the preference graph only result in locally restricted recomputations for the incremented skyline and thus lead to superior performance (cf. [2]).
3.1 Base Preferences and the Induced Pareto Aggregation
In this section we will provide the basic definitions which are prerequisites for the following sections. We will introduce the notions of base preferences, base equivalences, their amalgamated counterparts, a generalized Pareto composition and a generalized skyline. The basic constructs are so-called base preferences defining strict partial orders on attribute domains of database objects (based on [12], [15]):
Definition 1: (Base Preference)
Let D1, D2, ..., Dm be non-empty domains (i.e. sets of attribute values) of the m attributes Attr1, Attr2, ..., Attrm, such that Di is the domain of Attri. Furthermore let O ⊆ D1 × D2 × ... × Dm be a set of database objects and let attri : O → Di be a function mapping each object in O to a value of the domain Di.
Then a Base Preference Pi ⊆ Di² is a strict partial order on the domain Di.
The intended interpretation of (x, y) ∈ Pi with x, y ∈ Di (alternatively written x <Pi y) is "attribute value y (of the domain Di) is better than attribute value x (of the same domain)". This implies that for o1, o2 ∈ O, (attri(o1), attri(o2)) ∈ Pi means "object o2 is better than object o1 with respect to its i-th attribute value".
In addition to specifying preferences on a domain Di, we also allow equivalences to be defined, as given in Definition 2.
Definition 2: (Base Equivalence and Compatibility)
Let O be a set of database objects and Pi a base preference on Di as given in Definition 1. Then we define a Base Equivalence Qi ⊆ Di² as an equivalence relation (i.e. Qi is reflexive, symmetric and transitive) which is compatible with Pi, where compatibility is defined as:
a) Qi ∩ Pi = ∅ (meaning no equivalence in Qi contradicts any strict preference in Pi)
b) Pi ∘ Qi = Qi ∘ Pi = Pi (the domination relationships expressed transitively using Pi and Qi must always be contained in Pi)
In particular, as Qi is an equivalence relation, Qi trivially contains the pairs (x, x) for all x ∈ Di.
The interpretation of base equivalences is similarly intuitive as for base preferences: (x, y) ∈ Qi with x, y ∈ Di (alternatively written x ~Qi y) means "I am indifferent between attribute values x and y of the domain Di".
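For illustration (our own example, not taken from the original text), consider a single domain with a base preference and a compatible base equivalence:

```latex
\[
D = \{\text{red}, \text{green}, \text{blue}\}, \qquad
P = \{(\text{red}, \text{blue}), (\text{green}, \text{blue})\},
\]
\[
Q = \{(\text{red}, \text{green}), (\text{green}, \text{red})\}
    \cup \{(x, x) \mid x \in D\}.
\]
% Compatibility check:
%  a) Q \cap P = \emptyset, since no equivalent pair is also strictly ordered.
%  b) P \circ Q = Q \circ P = P, e.g. (red, green) \in Q composed with
%     (green, blue) \in P yields (red, blue), which is already in P.
```

Here blue is preferred over both red and green, while red and green are indifferent; both compatibility conditions of Definition 2 hold.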
International Journal of Computer Science & Applications, Vol. IV, No. II, pp. 75 - 91
Proof: We know from Theorem 2 that P* = P ∪ (P ∘ E ∘ P) ∪ (Q ∘ E ∘ P) ∪ (P ∘ E ∘ Q), and for preference diagrams PD(P) of P the following holds:
a) P = PD(P)⁺ ⊆ (PD(P)*)⁺
b) (P ∘ E ∘ P) = (P ∘ E) ∘ P ⊆ (P ∘ E ∘ Q) ∘ P = (PD(P)⁺ ∘ E ∘ Q) ∘ PD(P)⁺ ⊆ (PD(P)*)⁺, because (PD(P)⁺ ∘ E ∘ Q) ⊆ (PD(P)*)⁺ and PD(P)⁺ ⊆ (PD(P)*)⁺.
c) Furthermore (P ∘ E ∘ Q) = PD(P)⁺ ∘ E ∘ Q ⊆ (PD(P) ∪ (PD(P) ∘ E ∘ Q))⁺ ⊆ (PD(P)*)⁺
d) And similarly (Q ∘ E ∘ P) = Q ∘ E ∘ PD(P)⁺ ⊆ ((Q ∘ E ∘ PD(P)) ∪ PD(P))⁺ ⊆ (PD(P)*)⁺
Using a) to d) we get P* ⊆ (PD(P)*)⁺ and, since PD(P)* ⊆ P*, we get (PD(P)*)⁺ ⊆ (P*)⁺ = P* and thus (PD(P)*)⁺ = P*.
To calculate PD(P)* we have to consider the terms in PD(P) ∪ (PD(P) ∘ E ∘ Q) ∪ (Q ∘ E ∘ PD(P)): the first term is just the old preference diagram. Since the second and third terms both contain a single edge from E (i.e. either (x, y) or (y, x)), the terms can be written as
(PD(P) ∘ E ∘ Q) = ({a | (a, x) ∈ PD(P)} × Q[y]) ∪ ({a | (a, y) ∈ PD(P)} × Q[x]) and
International Journal of Computer Science & ApplicationsVol. IV, No. II, pp. 75 - 91
Abstract

The Internet has grown exponentially over the past decade and its impact on society has been rising steadily. Nonetheless, two problems persist: first, the available interfaces are not appropriate for a large number of citizens (visually impaired people); second, it is still difficult for many people in our country to access the Internet. On the other hand, the proliferation of cell phones as tools for accessing the Internet has paved the way for new applications and business models. Interacting with Web pages through voice is an ever-increasing technological challenge and an added solution in terms of ubiquitous interfaces. This paper presents a solution for accessing Web contents through a bidirectional voice interface. It also features a practical application of this technology, giving users access to an academic enrolment system by using the telephone to navigate the Web pages of the system.

Keywords: disabled people, voice recognition
1 Introduction

Humanity has long felt the need for mechanisms that make slow, routine and complex tasks easier. Calculation and data storage are two examples, and many tools have been created to address them. The personal computer is a working instrument that carries out many of these tasks. The interconnection of these devices allows information to circulate and gives users the chance to share contents, thus improving task execution efficiency. The Internet is inevitably a consequence of the use and proliferation of computers. Its constant evolution can be framed in terms of several features, such as infrastructure, development tools (increasingly easy to use) and user interfaces. The objective of this paper is to explore this new human-computer interface and its implications for society.
Web interfaces are based on dedicated languages, which are used to write documents that, by means of appropriate software (browsers, interpreters), can be read, visualized and executed. The first programs to present the contents of these documents used simple interfaces that contained only text. Nowadays, they support text in various formats, charts, images, videos and sounds, among others. As information visualization software progresses, along with the need to make this
International Journal of Computer Science & ApplicationsVol. IV, No. II, pp. 111-123
Jacques Schreiber, Gunter Feldens, Eduardo Lawisch, Luciano Alves
information available to more users, more complex interfaces arise, using innovative hardware and software systems. Then come technologies that use voice and dialogue recognition systems to provide and collect data. The motivation for developing this work comes from the ever-expanding fixed and mobile telephone services in Brazil, particularly over the past decade, a development that will allow a broader range of users to access the contents of the Web. There are also motivations of a social character, for example, Internet access for citizens with special needs, such as visually impaired people.
1.1 Voice interfaces – what solutions?

Nowadays, mobile phones are fairly common devices everywhere and increasingly perform activities that go beyond simple conversation. On the other hand, the Internet is currently a giant “database” that is not easily accessible by voice through a cell phone. An interface that recognizes the voice and is capable of producing a coherent dialogue from a text could be the solution for providing these contents. Such devices would be very useful for visually impaired people.
As an initial approach to the problem we could consider WAP (Wireless Application Protocol), a protocol specially designed for mobile phones to access the contents of the Web. This type of access relies on a microbrowser, a navigator specifically designed to work with a small screen and devices with reduced resources. The protocol was designed to make it possible to render multimedia data on mobile phones, as well as on other wireless devices. Its development was essentially based on the existing principles and standards of the Internet (TCP/IP-like) [1]. However, due to the technological limitations of these devices, new solutions were adopted (a dedicated protocol stack) to optimize its utilization within this context. The architecture is similar to that of the Internet, as shown in Figure 1.
Figure 1: WAP architecture – extracted from [6]
In a concise manner, we could say that the WAP architecture functions as an interface that allows communication, more precisely data exchange, between handheld devices and the World Wide Web.
The sequence of actions that occurs when accessing the WWW through the WAP protocol is as follows:

• The microbrowser on the mobile device starts;
• The device seeks a signal;
• A connection is established with the server of the telephone company;
• A Web site is selected to be displayed;
• A request is sent to the proxy using WAP;
• The proxy accesses the desired information on the site by means of the HTTP protocol;
• The proxy encodes the HTTP data into WML;
• The WML-encoded data are sent to the device;
• The microbrowser displays the wireless version of the Web site.
In spite of all this, the protocol has some limitations, among them the inability to establish a channel for transporting audio data in real time. This is a considerable limitation and makes it impossible to use WAP as a solution to supply Web contents through a voice interface.

2 Speech Recognition Systems
Automatic speech recognition (ASR) systems have developed greatly over the past years with the creation of improved algorithms and acoustic models and with the higher processing capacity of computers. With a relatively affordable ASR system installed on a personal computer and a good-quality microphone, good recognition can be achieved if the system is trained for the user's voice. Over a telephone, with a system that has not been trained, the recognition system needs a set of speech grammars to be able to recognize the answers. This is one possible way to increase the chances of recognizing, for example, a name among a rather large number of hypotheses, without the need for many interactions. Speech recognition through mobile phones, which are sometimes used in noisy environments, requires more complex algorithms and simple, well-built grammars. Nowadays, there are many commercial ASR applications in an array of languages and application areas, for example voice portals for finance, banking, telecommunications and trade (www.tellme.com). There have also been advances in speech synthesis and in the transformation of text into speech (TTS, Text-To-Speech). Many of the present TTS systems still have problems producing easily perceived speech. However, a new form of speech synthesis under development is based on the concatenation of waveforms. With this technique, speech is not entirely generated from the text; synthesis also relies on a series of pre-recorded sounds [2].
There are other forms of recognition; for example, mobile phones execute calls and other operations in response to the voice commands of their owner, the habitual user. The recognition system works by using previous recordings of the user's voice, which are then compared with the input data so as to make command recognition possible. This is a rather simple form of recognition which works well and is widely used.
Recognition techniques are also used by systems that run on a traditional operating system in order to allow the user to manipulate programs and access contents by means of a voice interface. An example of this type of application is IN CUBE Voice Command [3]. This application uses a technology specifically developed to facilitate repetitive tasks. It is an application for the Windows (9X, NT, 2000) systems that does not modify their configurations, and versions are also available for other operating systems.
On their own, these systems are not viable solutions to the problem of creating a voice interface for the Web because, as mentioned above, high-processing-capacity systems are needed to make full recognition possible. Nevertheless, as shown later in this paper, this is an important technology and might even become part of the global architecture of a system that presents a
solution to the problem. The applications that execute operating system commands through voice recognition also pose some considerable disadvantages: while these systems are easy to use and reliable, they have no mechanisms for returning in speech form, for example, information obtained from the Internet (e.g. from an XML document).

2.1 The advent of VoiceXML
The Speech Interface Framework working group of the W3C (World Wide Web Consortium) created standards that regulate access to Web contents using a voice interface. Access to information is normally done through visualization programs (browsers), which interpret a markup language (HTML, XML). The Speech Synthesis Markup Language specification is a relevant component of this new set of rules for voice navigators and was designed to provide a richer markup language, based on XML (Extensible Markup Language), so as to make it possible to create applications for this new type of Web interface. The essential rule for working out the markup language is to establish mechanisms which allow the authors of “synthesizable” contents to control such features of speech as pronunciation, volume, and emphasis on certain words or sentences across different platforms. This is how VoiceXML was created: a language based on XML which allows for the development of documents containing dialogues to facilitate access to Web contents [4].
This language, like HTML, is used for human-computer dialogues. However, whereas HTML assumes the existence of a graphic navigator (browser), a monitor, a keyboard and a mouse, VoiceXML assumes the existence of a voice navigator with audio output synthesized by the computer or pre-recorded, and audio input via voice and/or keypad tones. This technology “frees” the Internet for the development of voice applications, drastically simplifying previously difficult tasks and creating new business opportunities. VoiceXML is a specification of the VoiceXML Forum, an industrial consortium comprising more than 300 companies. This forum is now engaged in certifying, testing and disseminating the language, while control of its development is in the hands of the World Wide Web Consortium (W3C). As VoiceXML is a specification, applications that conform to it can be used on different platforms. Telephones were of paramount importance in its development, but it is not restricted to the telephone market and reaches others, for example personal computers. The technology's architecture (Figure 2) is analogous to that of the Internet, differing only in that it includes a gateway whose functions will be explained later on [5].
Figure 2: The architecture of a client-server system based on VoiceXML, extracted from [6]
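To illustrate the kind of document a voice navigator interprets, the following is a minimal VoiceXML sketch (not taken from the paper's system; the form and field names are invented for illustration) that prompts the caller and collects a spoken answer:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="greeting">
    <field name="answer">
      <prompt>Welcome. Do you want to continue? Please say yes or no.</prompt>
      <grammar> [yes no] </grammar>
      <filled>
        <!-- The recognized value is available in the "answer" variable -->
        <prompt>You said <value expr="answer"/>.</prompt>
      </filled>
    </field>
  </form>
</vxml>
```

The gateway's interpreter renders the prompt through TTS, listens for an utterance matching the grammar, and executes the filled block once a value is recognized.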
In a succinct manner, let us look at how information transfer proceeds in a system that implements this technology. First, the user calls a certain number from a mobile phone, or even from a conventional telephone. The call is answered by a computerized system, the VoiceXML gateway, which then sends a request to the document server (which could be anywhere on the Internet), and the server returns the document referenced by the call. A component of the gateway, called the VoiceXML interpreter, executes the commands in order to render the contents as speech. It listens to the answers and passes them on to the speech recognition engine, which is also part of the gateway.
The normalization process followed the Speech Synthesis Markup Requirements for Voice Markup Languages, privileging some aspects, of which the following are of note [4]:

• Interoperability: compatibility with other W3C specifications, including the Dialog Markup Language and the Aural Cascading Style Sheets.
• Generality: supports speech output for an array of applications and varied contents.
• Internationalization: provides speech output in a large number of languages, with the possibility of using several languages simultaneously in a single document.
• Reading generation and capacity: capable of automatically generating easy-to-read documents.
• Consistency: allows predictable control of output regardless of the implementation platform and the characteristics of the speech synthesis device.
• Implementability: the specification should be achievable with existing technology and should require a minimum number of optional features.
Upon analyzing the architecture in detail (Figure 3), it becomes clear that the server (for example, a Web server) processes requests from the client application, the VoiceXML interpreter, which operates within the VoiceXML interpreter context. The server produces VoiceXML documents in reply, which are then processed and interpreted by the VoiceXML interpreter. The VoiceXML interpreter context can monitor data furnished by the user in parallel with the VoiceXML interpreter. For example, one VoiceXML interpreter context may be listening for a user request to access the help system, while another might be soliciting profile alteration requests.
The implementation platform is controlled by the VoiceXML interpreter context and by the VoiceXML interpreter. For example, in an interactive voice application, the VoiceXML interpreter context may be responsible for detecting a call, reading the initial VoiceXML document and answering the call, while the VoiceXML interpreter conducts the dialogue after the answer. The implementation platform generates events in reply to user actions (for example: speech, digit entry, request for call termination) and system events (for example, a timer expiring). Some of these events are handled by the VoiceXML interpreter itself, as specified by the VoiceXML document, while others are handled by the VoiceXML interpreter context.
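To make the event-handling division concrete, here is a hedged sketch of how a VoiceXML document can declare handlers for platform events such as silence or an unrecognized utterance (the noinput and nomatch elements are part of the VoiceXML 2.0 specification; the prompts and field name are invented):

```xml
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="ask">
    <field name="choice">
      <prompt>Please say help or profile.</prompt>
      <grammar> [help profile] </grammar>
      <!-- These events are handled by the interpreter, as specified in the document -->
      <noinput>
        <prompt>Sorry, I did not hear anything. Please say help or profile.</prompt>
      </noinput>
      <nomatch>
        <prompt>Sorry, I did not understand. Please say help or profile.</prompt>
      </nomatch>
    </field>
  </form>
</vxml>
```

Events not caught in the document (such as a request to terminate the call) fall through to the interpreter context.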
Figure 3: Architecture details (components: Server, Document Request, Implementation Platform, VoiceXML Interpreter Context, VoiceXML Interpreter)
Now analyzing the tasks executed by the VoiceXML gateway in more detail (Figure 4), it becomes clear that the interpretation of the scripts and the interaction with the user are the actions it controls. To execute them, the gateway consists of a set of hardware and software elements which form the heart of the VoiceXML technology (the VoiceXML interpreter and the VoiceXML interpreter context described above are also components of the gateway). Essentially, they furnish user interaction mechanisms analogous to browsers in a conventional HTTP service. Calls are answered by the telephony services and by the signal processing component.
Figure 4: The VoiceXML gateway components [5] (VoiceXML interpreter; telephony and signal processing services; speech recognition engine; audio reproduction; TTS services; HTTP clients)

The gateways are fitted into the Web in a manner very similar to IVR (Interactive Voice Response) systems and may be placed before or after the small-scale telephone centers used by many institutions. The architecture allows users to request the transfer of their call to an operator and also allows the technology to implement that request easily.
When a call is received, the VoiceXML interpreter starts to check and execute the instructions contained in the VoiceXML scripts. As mentioned before, when the executing script requests an answer from the user, the interpreter passes control to the recognition system, which “listens to” and interprets the user's reply. The recognition system is totally independent from the other gateway components. The interpreter may use any compatible client/server recognition system and may even change systems during execution in order to improve performance. Another way of collecting data is key recognition, which results in DTMF tones that are interpreted to allow the user to furnish information to the system, such as access passwords.
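A hedged sketch of how DTMF collection can be declared in a VoiceXML document (the built-in digits type is part of VoiceXML 2.0 and accepts keypad input; the field name and prompts are invented, not the paper's actual code):

```xml
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="login">
    <!-- type="digits" accepts spoken digits or keypad (DTMF) tones -->
    <field name="password" type="digits">
      <prompt>Please type your password on the telephone keypad.</prompt>
      <filled>
        <prompt>Thank you. Checking your password now.</prompt>
      </filled>
    </field>
  </form>
</vxml>
```

Because DTMF tones are decoded by the gateway rather than the speech recognizer, this channel is reliable even in noisy environments, which is why it suits sensitive inputs such as passwords.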
3 The Unisc-Phone prototype

The prototype originated from the difficulties faced by the students of Unisc – the University of Santa Cruz do Sul – during the enrolment period. Although efficient in itself, the academic enrolment system implemented on the Web requires access to the Internet and some computer literacy, a distant reality for many students. This gave origin to the idea of creating a system that provides academic information and makes enrolment possible in the disciplines of the course attended by the student, at any place, at any moment and automatically. The idea resulted in the Unisc-Phone project, which uses the telephone, broader in scope and available everywhere, to provide appropriate information to any student who wishes to enroll at the university.

3.1 The functionality of the Unisc-Phone

While designing the project, the main concern of the team was the creation of a grammar flexible enough to provide a dialogue as natural as possible. It would be up to the student to conduct the dialogue with the system: the student could start the dialogue or leave it to the system to ask the questions. If the system assumes control, a set of questions is asked in a pre-defined order; otherwise, the system should be able to interact with the student, “perceiving” their needs and clearing up doubts as they arise. Upon assuming the dialogue, the student could interact with the system in several ways, as follows:
Figure 5: A first example of possible dialogue
<Student> “Hi! I'm a student of the Engineering course, my enrolment number is 54667. I want to know which disciplines I can enroll in.”
<System> “Hello, Mr <XXXX>, please type in your password on the telephone keypad.”
After the user is validated, the dialogue proceeds...
Figure 6: Continuation of the dialogue example

Another dialogue could occur, for example:
Figure 7: A second example of a possible dialogue

These two dialogues are just examples of the many that could take place. Obviously, the VoiceXML technology does not yet provide an intelligent system able to conduct arbitrary dialogue; however, using mixed-initiative forms (the <initial> tag of VoiceXML), a system capable of handling a considerable array of dialogues was implemented, giving the user-student the feeling of interacting with a reasonably intelligent system and conveying confidence and security to this user.
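A hedged sketch of how a mixed-initiative form with <initial> can be structured (the <initial> element is part of VoiceXML 2.0; the grammar file name and field names here are invented for illustration, not the paper's actual code):

```xml
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="enrolment">
    <!-- A form-level grammar lets a single utterance fill several fields at once -->
    <grammar src="enrolment.grxml"/>
    <initial name="start">
      <prompt>How can I help you with your enrolment?</prompt>
    </initial>
    <!-- Fields are prompted individually only if the opening utterance left them empty -->
    <field name="course">
      <prompt>Which course are you attending?</prompt>
    </field>
    <field name="enrolmentNumber" type="digits">
      <prompt>Please say or type your enrolment number.</prompt>
    </field>
  </form>
</vxml>
```

This is what lets a student open with “Hi, I want to enroll in disciplines A and C, my name is...” and have the system ask only for the missing pieces, rather than walking through every question in a fixed order.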
3.2 The system's architecture
A database, now one of the main components of the system, was built during the implementation phase. It contains detailed information on the disciplines (whether or not already attended by the student), their pre-requisites and the available schedules. These data are later used to provide a solution to the student's problem. The global architecture was implemented as shown in Figure 2; however, technologies such as VoiceXML, JSP and MySQL were used, and the architecture was organized in three layers, as shown in Figure 8:
Figure 8: UNISC-Phone’s Three-Layer Architecture
<System> “Mr <XXXX>, the available disciplines are as follows: Discipline A, Discipline B, Discipline C, Discipline D.”
<Student> “I want to enroll in disciplines A and C.”
<System> “You've requested to enroll in disciplines A and C, please confirm!”
<Student> “That's right, OK.”
<Student> “Hi, I want to enroll in disciplines A and C, my name is <XXXXX>.”
<System> “Good morning, Mr <XXXXX>, please state your enrolment number and type in the password on the telephone keypad.”
<Student> “Oh... yes, my enrolment number is 54667 (types in password).”
<System> “You've requested to enroll in disciplines A and C, please confirm!”
<Student> “That's right, OK.”
[Figure 8 layers: Presentation Logic (VoiceXML) – Business Logic (JSP) – Data Access Logic (MySQL)]
For the development of this system, a free JSP application server was used (http://www.eatj.com), along with a free VoiceXML gateway (http://cafe.bevocal.com).
The following phase of the project consisted of developing the VoiceXML documents that make up the interface. The objective was to create a voice interface in Portuguese, but unfortunately free VoiceXML gateways capable of supporting this language do not exist yet. Therefore, we developed a version that does not use the graphic signs typical of the Portuguese language, such as accents and cedillas; even so, the dialogues are understood by Brazilian users. An accurate analysis of the implementation showed that altering the system to support the Portuguese language is relatively simple and intuitive, since the only things to do are to replace the text to be pronounced with Portuguese text and force the platform to use this language (there is a VoiceXML attribute for this: «speak xml:lang="pt-BR"»).

A. System with dialogues in Portuguese

The interface, the dialogues it is supposed to provide, and the information to be produced by the system were studied and designed by the team under the guidance of the professor of Special Topics at Unisc's (University of Santa Cruz do Sul) Computer Science College. After agreeing that the functionalities offered by phone should be identical to the service existing on the Web, it was necessary to implement the dialogues in VoiceXML and also create dynamic pages so that the information contained in the database could be offered to the user-students. Dialogue organization and the corresponding database were established to make the resulting interface comply with the specifications (Figure 9).
Figure 9: Organizing the system’s files
In short, the functionality of each program is as follows:
• DisciplineRN.java: implements a class that supports the retrieval of disciplines suitable for each student; in other words, the methods of this class only seek those disciplines that can be attended, in compliance with the pre-requisites, vacancy availability and viable schedules.
• Functions.java: implements the grammar of the system; in other words, all the plausible dialogues are the result of hundreds of possible combinations of the grammar elements existing in this class.
[Figure 9 files: Start.jsp, Function.java, DisciplineRN.java, EnrollmentRN.java, SaveEnrollment.jsp]
• Enrollment.java: this class implements persistence; in other words, after the enrolment process, the disciplines chosen by the student are entered in the institution's database.
An analysis of Figure 9 shows that the user starts interacting with the system through the program Start.jsp, which contains an initial greeting and asks for the student's name and enrolment number:
Figure 10: The first part of the source code
…
<%try {
    ConnectionBean conBean = new ConnectionBean();
    Connection con = conBean.getConnection();
    DisciplineRN disc = new DisciplineRN(con);
    GeneralED edResearch = new GeneralED();
    String numberEnrolment = request.getParameter("nroMatricula");
    // int nroDia = 2; // Integer.valueOf(request.getParameter("nroDia")).intValue();
    edResearch.put("codeStudent", numberEnrolment);
    String name = disc.listStudent(edResearch);
    DevCollection lstDisciplines;
    String day[] = {"", "", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "end"};
%>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <field name="enrolment">
    <prompt>
      Welcome to Unisc-Phone, now you can do your enrolment at home, by phone.
      Please inform your enrolment number.
    </prompt>
  </field>
  ...
  <block>
    <%=name%>, now we will start with your re-enrolment.
  </block>
  <form id="enrolment">
    <% for (int noDay = 2; noDay <= 6; noDay++) { %>
      <field name="re-enrol_<%=day[noDay]%>">
Figure 11: The second part of the source code

Although built in JSP, the program mixes in fragments of VoiceXML code which provide the interaction with the student. It should be noted that the Portuguese spelling omits all accents and other graphic signs (apostrophes, cedillas). For each day of the week, the program lists the available disciplines, to be chosen by the student. A code analysis shows that the Function class implements a method known as getGrammar:
Figure 12: The getGrammar method

This grammar provides the student with several ways to ask for enrolment. What follows is an example of the dialogue:
        <prompt> Do you want to re-enroll on <%=day[noDay]%>? </prompt>
        <grammar> [yes no] </grammar>
        <filled>
          <if cond="re-enrol_<%=day[noDay]%> == 'yes'">
            Let's get started...
            <goto nextitem="discipline_<%=day[noDay]%>"/>
          <else/>
            You won't have lessons on <%=day[noDay]%>.
            <% if (noDay == 6) { %>
              The re-enrolment process has been concluded, sending data to the server...
              <submit method="post"
                namelist="discipline_<%=day[2]%> discipline_<%=day[3]%> discipline_<%=day[4]%> discipline_<%=day[5]%> discipline_<%=day[6]%>"
                next="http://matvoices.s41.eatj.com/EnrolmentVoice/saveEnrolment.jsp"/>
...
</vxml>
public static String getGrammar(DevCollection listDiscipline, String nameDay) throws Exception {
    String result = "";
    result += "<![CDATA[\n" +
        " (\n" +
        " ?[ I ]\n" +
        " ?[ (I want) (I wish to) (I would like to) ]\n" +
        " ?[ (re-enrol) (do my re-enrolment) ]\n" +
        " ?[ (in the discipline) (in the subject) (for) ]\n" +
        " ?[ of ]\n" +
        " [\n";
Figure 13: The getGrammar method

Other ways for a user-student to ask for his/her enrolment could include requests like “I want to re-enroll for the discipline of Algorithms on Monday”, or “It is my wish to enroll in Logics”. This flexibility is possible thanks to the grammar definitions and the use of mixed-initiative forms. Once the user-student has made his/her choice, the interpreter creates a matrix containing all the intended disciplines, and document interpretation calls the SaveEnrolment.jsp program. This program asks for confirmation and enters the re-enrolment data into the system's database. After confirmation, the system cordially says farewell and ends the enrolment procedure.
4 Conclusions

The rising business opportunities offered by the Internet as a result of never-ending technological improvement, both at the infrastructure and at the interface level, and the growing number of users translate into a diversification of needs at the interface level, triggering the appearance of new technologies. It was this context that led to the creation of the voice interface. VoiceXML comes as a response to these needs thanks to its characteristics, since it allows for human-computer dialogues as a means of providing users with information. Although fully specified, this technology is still at its study and development stage. The companies comprising the VoiceXML Forum are doing their best to spread it rapidly. Nevertheless, there are still some shortfalls, for example the lack of speech recognition and TTS engines for languages like Portuguese that are compatible with the existing gateways.
The system described above represents a step forward for the Brazilian scientific community, which lacks practical applications that materialize its research work and signal the right course toward new forms of human-computer interface within the Brazilian context.
System: Welcome to Unisc-Phone, now you can do your enrolment at home, by phone. Please inform your enrolment number.
User-Student: 56447
System: John, now we will start your re-enrolment process.
System: Do you want to re-enroll on Monday?
User-Student: Yes
System: The disciplines available for you on that day are: Algorithms, Logics and Discrete Mathematics.
User-Student: I would like to re-enroll in Logics.
System: Do you want to enroll on Tuesday?
...
International Journal of Computer Science & Applications, Vol. IV, No. II, pp. 111-123
Jacques Schreiber, Gunter Feldens, Eduardo Lawisch, Luciano Alves
In recent years, ontologies have played a major role in knowledge representation, both in theoretical aspects and in many application domains (e.g., the Semantic Web, Semantic Web Services, Information Retrieval Systems). The structure provided by an ontology lets us reason semantically with concepts. In this paper, we present a novel kind of concept network based on the evolution of a dynamical fuzzy ontology. A dynamical fuzzy ontology lets us manage vague and imprecise information. Fuzzy ontologies have been defined by integrating Fuzzy Set Theory into the ontology domain, so that a truth value is assigned to each concept and relation. In particular, we have examined the case where the truth values change in time according to the queries executed on the represented knowledge domain. Empirically, we show how the concepts and relations evolve towards a power-law statistical distribution. This distribution is the same that characterizes complex network systems. The fuzzy concept network evolution is analyzed as a new case of a scale-free system. Two efficiency measures are evaluated on such a network at different evolution stages. A novel information retrieval algorithm using fuzzy concept networks is also proposed.
Keywords: Fuzzy Ontology, Scale-free Networks, Information Retrieval.
1 Introduction
In recent years, ontologies have played a key role in the Semantic Web [1] area of research. The term has been used in so many areas of Artificial Intelligence [2] (knowledge representation, database design, information retrieval, knowledge management, and so on) that finding a unique meaning for it has become a subtle topic. In a philosophical sense, the term ontology refers to a system of categories intended to achieve a common sense of the world [3]. From the FRISCO Report [4] point of
International Journal of Computer Science & Applications, Vol. IV, No. II, pp. 125-144
view, this agreement has to be reached not only through the relationships between humans and objects, but also through the interactions established between humans. In the Semantic Web, an ontology is a formal conceptualization of a domain of interest, shared among heterogeneous applications. It consists of entities, attributes, relationships and axioms that provide a common understanding of the real world [5, 3, 6, 4]. With the support of ontologies, users and systems can communicate with each other through easy information integration [7]. Ontologies help people and machines to communicate concisely by supporting information exchange based on semantics rather than just syntax.
Nowadays, there are ontology applications where information is often vague and imprecise, for instance the semantic-based applications of the Semantic Web, such as e-commerce, knowledge management, web portals, etc. Thus, one of the key issues in the development of the Semantic Web is to enable machines to exchange meaningful knowledge across heterogeneous applications in order to reach the users' goals. An ontology provides a semantic structure for sharing concepts across different applications in an unambiguous way. However, the conceptual formalism supported by a typical ontology may not be sufficient to represent the uncertain information commonly found in many application domains. For example, keywords extracted from many queries in the same domain may not all have the same relevance, since some keywords may be more significant than others. Therefore, the need emerges to give them different interpretations according to the context. Furthermore, humans use linguistic adverbs and adjectives to specify their interests and needs (e.g., users can be interested in finding "a very fast car", "a wine with a very strong taste", "a fairly cold drink", and so on), so the richness of the natural languages used by humans must also be handled.
A possible solution for treating uncertain data, and hence for tackling these problems, is to incorporate fuzzy logic into ontologies. The aim of fuzzy set theory [8], introduced by L. A. Zadeh [9], is to describe vague concepts through a generalized notion of set, according to which an object may belong to a set with a certain degree (typically a real number in the interval [0,1]). For instance, the semantic content of a statement like "Cabernet is a deep red acidic wine" might have the degree, or truth value, of 0.6. Up to now, fuzzy sets and ontologies have been jointly used to resolve uncertain-information problems in various areas, for example in text retrieval [10, 11, 12], or to generate a scholarly ontology from a database in the ESKIMO [13] and FOGA [14] frameworks. The FOGA framework has recently been applied in the Semantic Web context [15]. However, there is no complete fusion of Fuzzy Set Theory with ontologies in any of these examples. In the literature we can find some attempts to integrate fuzzy logic directly into ontologies, for instance in the context of medical document retrieval [16] and in ontology-based queries [17]. In particular, in [16] the integration is obtained by adding a degree of membership to all terms in the ontology to overcome the overloading problem, while in [17] a query enrichment is performed by inserting a weight that introduces a similarity measure among the taxonomic relations of the ontology. Another proposal is an extension of the ontology domain with fuzzy concepts and relations [18]; however, it is applied only to Chinese news summarization. Two well-formed definitions of fuzzy ontology can be found in [19, 20].
In this paper we refer to the formal definition stated in [19]. A fuzzy ontology is an ontology extended with fuzzy values assigned to its entities and relations. Furthermore, in [19] it has been shown how to insert fuzzy logic into the ontology domain by extending the KAON 1 editor [21] to handle uncertain information directly during the ontology definition, in order to enrich the knowledge domain.
In a recent work, fuzzy ontologies have been used to model knowledge in creative environments [22]. The goal was to build a digitally enhanced environment supporting the creative learning process in architecture and interaction design education. The numerical values are assigned during the fuzzy ontology definition by the domain expert and by the user queries. There is a continuous evolution of new relations among concepts and of new concepts inserted into the fuzzy ontology. This evolutive process lets the concepts arrange themselves into a characteristic topological structure describing a weighted complex network. Such a network is neither a periodic lattice nor a random graph [23]. This network has been introduced into an information retrieval algorithm, which in particular has been adopted in a computer-aided creative environment. Many dynamical systems can be modelled as a network, where vertices are the elements of the system and edges identify the interactions between them [24]. Some examples are biological and chemical systems, neural networks, social interacting species, computer networks, the WWW, and so on [25]. Thus, it is very important to understand the emerging behaviour of a complex network and to study its fundamental properties. In this paper, we present how fuzzy ontology relations evolve in time, producing a structure typical of complex network systems. Two efficiency measures are used to study how well information is exchanged over the network and how closely the concepts are tied [26].
The rest of the paper is organized as follows: Section 2 presents the importance of the use of the fuzzy ontology and the definition of a new concept network based on its evolution in time. Section 3 introduces scale-free network notation, small-world phenomena and efficiency measures on weighted network topologies, while in Section 4 a new information retrieval algorithm exploiting the concept network is discussed. In Section 5 we present some experimental results that confirm the scale-free nature of the fuzzy concept network. In Section 6 some other relevant works found in the literature are presented to situate our scope. Finally, in Section 7 some conclusions are reported.
2 Fuzzy Ontology
In this section we discuss in depth how Fuzzy Set Theory [9] has been integrated into the ontology definition, and we present some preliminary applications of fuzzy ontologies. We will also show how to construct a novel concept network model relying on this definition.
1 The KAON project is a meta-project carried out at the Institute AIFB, University of Karlsruhe, and at the Research Center for Information Technologies (FZI).
Nowadays, knowledge is mainly represented with ontologies. For example, applications of the Semantic Web (i.e., e-commerce, knowledge management, web portals, etc.) are based on ontologies. With the support of ontologies, users and systems can communicate with each other through easy information exchange and integration [7]. Unfortunately, the ontology structure alone is not sufficient to handle all the nuances of natural language. Humans use linguistic adverbs and adjectives to describe what they want. For example, a user may be interested in finding information through web portals about the topic "a fun holiday". But what is "a fun holiday"? How should this type of request be handled? Fuzzy Set Theory, introduced by Zadeh [9], lets us tackle this problem by denoting and reasoning with non-crisp concepts. A degree of truth (typically a real number from the interval [0,1]) is assigned to a sentence. The previous statement "a fun holiday" might have a truth value of 0.8.
First, let us recall the definition of a fuzzy set.
Definition 1 Let U be the universe of discourse, U = {u1, u2, ..., un}, where ui ∈ U is an object of U, and let A be a fuzzy set in U. Then the fuzzy set A can be represented as:

A = {(u1, fA(u1)), (u2, fA(u2)), ..., (un, fA(un))},   (1)

where fA : U → [0, 1] is the membership function of the fuzzy set A; fA(ui) indicates the degree of membership of ui in A.
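Definition 1 can be sketched in a few lines of Python; the universe of wines and the membership degrees are illustrative assumptions, not values from the paper:

```python
# A fuzzy set per Definition 1: a finite collection of pairs (u_i, f_A(u_i)),
# with membership function f_A : U -> [0, 1].
def make_fuzzy_set(memberships):
    """memberships: dict mapping each object u_i of U to its degree f_A(u_i)."""
    if not all(0.0 <= d <= 1.0 for d in memberships.values()):
        raise ValueError("membership degrees must lie in [0, 1]")
    return dict(memberships)

def degree(fuzzy_set, u):
    """f_A(u): degree of membership of u in A (0 for objects outside the set)."""
    return fuzzy_set.get(u, 0.0)

# The vague concept "deep red acidic wine"; the 0.6 degree for Cabernet
# follows the example in the text, the other wines are invented.
deep_red = make_fuzzy_set({"cabernet": 0.6, "merlot": 0.5, "riesling": 0.0})
```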
Finally, we can give the definition of fuzzy ontology presented in [19].
Definition 2 A fuzzy ontology is an ontology extended with fuzzy values assigned through the two functions

g : R → [0, 1],   (2)
h : C → [0, 1],   (3)

where g is defined on the set R of relations and h is defined on the set C of concepts of the ontology.
2.2 Some Applications of the Fuzzy Ontology
From the practical point of view, using the given fuzzy ontology definition we can denote not only non-crisp concepts; we can also directly include property values according to the definition given in [27]. In particular, the knowledge domain has been extended with the quality concept, which becomes an application of the property values. This solution can be used in the tourism context to better define the meaning of a sentence like "this is a hot day". Furthermore, it is a usual practice to extend the set of concepts already present in a query with others that can be derived from an ontology. Generally, given a concept, the query is extended with its parents and children to enrich the set of displayed documents. With a fuzzy ontology it is possible to establish a threshold value (defined by the domain expert) in order to extend queries with instances of concepts that satisfy the chosen value [19]. This approach can be compared with [17], where the
query evaluation is determined through the similarity among the concepts and the hyponymy relations of the ontology. In the literature, the problem of efficient query refinement has been faced with a large number of different approaches over the last years. PASS is a method developed to construct a fuzzy ontology automatically (the associations among the concepts are found by analysing the document keywords [28]) that can be used to refine a user's query [29]. Another possible use of the fuzzy value associated with concepts has been adopted in the context of medical document retrieval to limit the problems due to the overloading of a concept in an ontology [16]. This also permits reducing the number of documents found by hiding those that do not fulfil the user's request.
A relevant goal achieved using fuzzy ontologies has been the direct handling of concept modifiers in the knowledge domain. A concept modifier [30] has the effect of altering the fuzzy value of a property. Given a set of linguistic hedges such as "very", "more or less", "slightly", a concept modifier is a chain of one or more hedges, such as "very slightly" or "very very slightly", and so on. So, a user can write a statement like "Cabernet has a very dry taste". It is necessary to associate a membership modifier to any (linguistic) concept modifier. A membership modifier has a value β > 0 which is used as an exponent to modify the value of the associated concepts [19, 31]. According to their effect on a fuzzy value, hedges can be classified into two groups: concentration type and dilation type. The effect of a concentration modifier is to reduce the grade of a membership value; thus, in this case, it must be β > 1. For instance, the hedge "very" is usually assigned β = 2. So, if "Cabernet has a dry taste" with value 0.8, then Cabernet has a very dry taste with value 0.8² = 0.64. On the contrary, a dilation hedge has the effect of raising a membership value, that is, β ∈ (0, 1); the example is analogous to the previous one. This allows not only the enrichment of the semantics that ontologies usually offer, but also the possibility of making requests without imposing mandatory constraints on the user.
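The hedge mechanism above reduces to exponentiation, and can be sketched as follows; β = 2 for "very" follows the text, while the dilation value chosen for "slightly" is an assumption made for illustration:

```python
# Linguistic hedges applied as membership-modifier exponents beta > 0.
# Concentration type: beta > 1 (lowers the value); dilation type: 0 < beta < 1
# (raises it). beta = 2 for "very" follows the text; 0.5 for "slightly" is assumed.
HEDGES = {"very": 2.0, "slightly": 0.5}

def apply_modifiers(value, chain):
    """Apply a chain of hedges, e.g. ["very"] or ["very", "slightly"],
    to a fuzzy value in [0, 1]."""
    for hedge in chain:
        value = value ** HEDGES[hedge]
    return value

dry = 0.8                                   # "Cabernet has a dry taste": 0.8
very_dry = apply_modifiers(dry, ["very"])   # concentration: 0.8 ** 2 = 0.64
```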
2.3 Fuzzy Concept Network
Every time a query is performed, the fuzzy values given to the concepts or to the relations set by the expert during the ontology definition are updated. In [22] two formulae, to initialize and to update the fuzzy values, are given. Such expressions take the use of concept modifiers into account.
The dynamical behaviour of a fuzzy ontology is also given by the introduction of new concepts when a query is performed. In [22] a system has been presented that allows the fuzzy ontology to adapt to the context in which it is used, in order to propose an exhaustive approach for directly handling knowledge-based fuzzy information. This consists in the determination of a semantic correlation [22] among the entities (i.e., concepts and instances) that are searched together in a query.
Definition 3 A correlation is a binary and symmetric relation between entities. It is characterized by a fuzzy value: corr : O × O → [0, 1], where O = {o1, o2, . . . , on} is the set of the entities contained in the ontology.
This defines the degree of relevance between entities: the closer the corr value is to 1, the more correlated the two considered entities (for instance, two concepts) are. Obviously, an updating formula for each existing correlation is also given. A similar technique is known in the literature as the co-occurrence metric [32, 33].
Integrating the correlation values into the fuzzy ontology is a crucial topic. Indeed, the knowledge of a domain is given also by considering the use of the objects inside the context. An important issue is handling the trade-off between the correct definition of an object (given by the ontology-represented definition of the domain) and the actual meaning assigned to the artifact by humans (i.e., the experience-based context assumed by every person according to his specific knowledge).
Figure 1: Fuzzy concept network.
In this way, the fuzzy ontology reflects all the aspects of the knowledge base and can adapt dynamically to the context in which it is introduced. When a query is executed (e.g., to insert a new document or to search for other documents), new correlations can be created, or existing ones updated by altering their weights. A fuzzy weight is also assigned to the concepts and to the correlations during the definition of the ontological domain by the expert. In this paper, we propose a new concept network for dynamical fuzzy ontologies. A concept network consists of n nodes and a set of directed links [34, 35]. Each node represents a concept or a document and each link is labelled with a real number in [0, 1]. Finally, we can give the definition of the fuzzy concept network.
Definition 4 A Fuzzy Concept Network (FCN) is a complete weighted graph Nf = {O, F, m}, where O denotes the set of the ontology entities. The edges among the nodes are described by the function F : O × O → [0, 1]; if F(oi, oj) = 0 then the entities are considered uncorrelated. In particular, F := corr. Each node oi is characterised by a membership value defined by the function m : O → [0, 1], which determines the importance of the entity on its own in the ontology. By definition, F(oi, oi) = m(oi).
In Fig. 1 a graphical representation of a small fuzzy concept network is shown; in this chart, the edges with F(oi, oj) = 0 are omitted in order to increase readability. The membership values m(oi) are reported beside the respective instances.
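A minimal sketch of the structure in Definition 4, assuming a sparse symmetric-dictionary representation; the entity names and weights are invented for illustration:

```python
# A minimal Fuzzy Concept Network (Definition 4): a weighted graph where
# F(o_i, o_j) in [0, 1] is the symmetric correlation between entities and
# F(o_i, o_i) = m(o_i) is the entity's own membership value.
class FuzzyConceptNetwork:
    def __init__(self, entities):
        self.entities = list(entities)
        self._F = {}                       # symmetric weight storage

    def set_weight(self, oi, oj, value):
        if not 0.0 <= value <= 1.0:
            raise ValueError("weights must lie in [0, 1]")
        self._F[(oi, oj)] = self._F[(oj, oi)] = value

    def F(self, oi, oj):
        # A missing entry means F(o_i, o_j) = 0: the entities are uncorrelated.
        return self._F.get((oi, oj), 0.0)

    def m(self, oi):
        return self.F(oi, oi)              # by definition F(o_i, o_i) = m(o_i)

net = FuzzyConceptNetwork(["wine", "cabernet", "holiday"])
net.set_weight("wine", "cabernet", 0.9)    # strongly correlated concepts
net.set_weight("wine", "wine", 0.7)        # m("wine") = 0.7
```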
3 Small-world behaviour and efficiency measures of a scale-free network
Some important topological properties can be determined from the dynamical nature of the FCN. In particular, the time evolution of the correlations plays a dominant role in this kind of insight. In this section, some formal tools and efficiency measures are presented. These have been adopted to numerically analyse the FCN evolution and the underlying fuzzy ontology.
The study of the structural properties of the networks underlying complex systems can be very important. For instance, the efficiency of communication and navigation over the Net is strongly related to the topological properties of the Internet and of the World Wide Web. The connectivity structure of a population (the set of social contacts) affects the way ideas are diffused. Only very recently have the increasing accessibility of databases of real networks on one side, and the availability of powerful computers on the other, made possible a series of empirical studies on the properties of social networks.
In their seminal work [23], Watts and Strogatz showed that the connection topology of some real networks is neither completely regular nor completely random. These networks, named small-world networks [36], exhibit a high clustering coefficient (a measure of the connectedness of a network), like regular lattices, and a small average distance between two generic points (small characteristic path length), like random graphs. Small average distance and high clustering are not the only common features of complex networks. Albert, Barabasi et al. [37] studied P(k), the degree distribution of a network, and found that many large networks are scale-free, i.e., have a power-law degree distribution P(k) ∝ k^(−γ).
Watts and Strogatz named these networks, which lie somewhere between regular and random networks, small worlds, in analogy with the small-world phenomenon empirically observed in social systems more than 30 years ago [36]. The mathematical characterization of small-world behaviour is based on the evaluation of two quantities: the characteristic path length L, measuring the typical separation between two generic nodes in the network, and the clustering coefficient C, measuring the average cliquishness of a node. Small-world networks are highly clustered, like regular lattices, while having small characteristic path lengths, like random graphs.
A generic network is usually represented by a weighted graph G = (N, K), where N is a finite set of vertices and K ⊆ N × N is the set of edges connecting the nodes. The information related to G is described both by an adjacency matrix A ∈ M(|N|, {0, 1}) and by a weight matrix W ∈ M(|N|, R+). Both matrices A and W are symmetric. The entry aij of A is 1 if there is an edge joining vertex i to vertex j, and 0 otherwise. The matrix W contains a weight wij related to each edge aij. If aij = 0, then wij = ∞. If the condition wij = 1 for every aij = 1 is assumed, the graph G corresponds to an unweighted relational network.
In network analysis a very important quantity is the degree of a vertex, i.e., the number of edges incident with i ∈ N. The degree k(i) ∈ ℕ of a generic vertex i is defined as:

k(i) = |{(i, j) : (i, j) ∈ K}|   (4)
The total number of edges in G can be obtained from the vertex degrees as:

|K| = ( Σ_{i∈N} k(i) ) / 2   (5)

The coefficient 2 in Equation (5) appears in the denominator because each link in A is counted twice. To define L, the shortest path length dij : N × N → R+ ∪ {∞} between two vertices has to be calculated. In unweighted social networks, dij corresponds to the geodesic distance between nodes, measured as the minimum number of edges traversed to get from a vertex i to another vertex j. The distances dij can be calculated using any all-to-all shortest path algorithm (e.g., the Floyd-Warshall algorithm), for either a weighted or a relational network. Let us recall that, according to Definition 4, the weights in the network correspond to the fuzzy correlations joining two concepts in the concept network induced by the fuzzy ontology.
The characteristic path length L of graph G is defined as the average of the shortest path lengths between two generic vertices:

L(G) = ( 1 / (|N|(|N|−1)) ) Σ_{i≠j∈N} dij   (6)
The definition given in Equation (6) is valid for a totally connected G, where at least one finite path connecting any couple of vertices exists. Otherwise, when a node j cannot be reached from a node i, the distance dij = ∞ and the sum in L(G) diverges.
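The distances dij and the characteristic path length of Equation (6) can be computed as follows; the sketch assumes an unweighted graph with unit edge weights:

```python
import math

def shortest_paths(n, edges):
    """All-to-all shortest path lengths d_ij for an unweighted, undirected
    graph on vertices 0..n-1, via the Floyd-Warshall algorithm."""
    d = [[0.0 if i == j else math.inf for j in range(n)] for i in range(n)]
    for i, j in edges:
        d[i][j] = d[j][i] = 1.0
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d

def characteristic_path_length(n, edges):
    """L(G) of Eq. (6): average of d_ij over the |N|(|N|-1) ordered pairs."""
    d = shortest_paths(n, edges)
    return sum(d[i][j] for i in range(n) for j in range(n) if i != j) / (n * (n - 1))

# Path graph 0-1-2: the six ordered-pair distances sum to 8, so L = 8/6 = 4/3.
```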
The clustering coefficient C(G) is a measure depending on the connectivity of the subgraph Gi induced by a generic node i and its neighbours. Formally, a subgraph Gi = (Ni, Ki) of a node i ∈ N can be defined as the pair:

Ni = {j ∈ N : (i, j) ∈ K}   (7)
Ki = {(j, k) ∈ K : j ∈ Ni ∧ k ∈ Ni}   (8)
An upper bound on the cardinality of Ki can be stated according to the following observation: if the degree of a given node is k(i), following Equation (4), then Gi has

|Ki| ≤ k(i)(k(i)−1)/2   (9)
Let us stress that the subgraph Gi does not contain the node i. Gi turns out to be useful in studying the connectivity of the neighbours of a node i after the elimination of the node itself.
The upper bound on the number of edges in a subgraph Gi suggests taking the ratio of the actual number of edges in Gi to the right-hand side of Equation (9). Formally, this ratio is defined as:

Csub(i) = 2|Ki| / ( k(i)(k(i)−1) )   (10)
The quantities Csub(i) are used to calculate the clustering coefficient C(G) as their mean value:

C(G) = ( 1/|N| ) Σ_{i∈N} Csub(i)   (11)
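Equations (7)-(11) can be sketched as follows; the convention that nodes with fewer than two neighbours contribute zero is an assumption, since the text leaves that case implicit:

```python
def clustering_coefficient(n, edges):
    """C(G) of Eqs (7)-(11): the mean over all nodes of
    C_sub(i) = 2|K_i| / (k(i)(k(i)-1)).
    Nodes with fewer than two neighbours contribute 0 (assumed convention)."""
    neigh = [set() for _ in range(n)]
    for i, j in edges:
        neigh[i].add(j)
        neigh[j].add(i)
    total = 0.0
    for i in range(n):
        k = len(neigh[i])
        if k < 2:
            continue
        # |K_i|: number of edges among the neighbours of i (i itself excluded).
        ki = sum(1 for j in neigh[i] for l in neigh[i] if j < l and l in neigh[j])
        total += 2.0 * ki / (k * (k - 1))
    return total / n

# Triangle 0-1-2 with a pendant vertex 3 attached to 0:
# C_sub values are 1/3, 1, 1, 0, so C(G) = (1/3 + 1 + 1 + 0) / 4 = 7/12.
```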
A network exhibits the small-world phenomenon if it is characterized by small values of L and high values of the clustering coefficient C. Scale-free networks are usually identified, in addition to the power-law degree distribution P(k), by small values of both L and C.
When studying real-system networks, such as the collaboration graph of actors or the links among WWW documents, the probability of encountering non-connected graphs is very high. The L(G) and C(G) formalism is clearly not suited to treating these situations. In such cases, the alternative formalism proposed in [38] is much more effective, even in the case of disconnected networks. This approach defines two measures of efficiency which give well-posed characterizations of the mean path length and of the mean node cliquishness, respectively.
To introduce the efficiency coefficients, it is necessary to consider the efficiency εij of a generic pair of nodes (i, j). This quantity measures the speed of information propagation between a node i and a node j; in particular, εij = 1/dij. With this definition, when there is no path in the graph between i and j, dij = ∞ and consistently εij = 0.
The global efficiency of the graph G is then:

Eglob(G) = ( Σ_{i≠j∈G} εij ) / ( |N|(|N|−1) ) = ( 1 / (|N|(|N|−1)) ) Σ_{i≠j∈G} 1/dij   (12)
and the local efficiency, in analogy with C, can be defined as the average efficiency of the local subgraphs:

E(Gi) = ( 1 / (k(i)(k(i)−1)) ) Σ_{l≠m∈Gi} 1/d′lm   (13)

Eloc(G) = ( 1/|N| ) Σ_{i∈G} E(Gi)   (14)
where Gi, as previously defined, is the subgraph of the neighbours of i, which is composed of k(i) nodes.
The two definitions originally given in [24] have the important property that both the global and the local efficiency are normalized quantities, that is, Eglob(G) ≤ 1 and Eloc(G) ≤ 1. The conditions Eglob(G) = 1 and Eloc(G) = 1 hold in the case of a completely connected graph where the weight of the edges is a node-independent positive constant.
In the efficiency-based formalism, the small-world phenomenon emerges for systems with high Eglob (corresponding to low L) and high Eloc (corresponding to high clustering C). Scale-free networks without small-world behaviour show high Eglob and low Eloc.
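The efficiency measures of Equations (12)-(14) can be sketched for the unweighted case as follows; treating nodes with fewer than two neighbours as contributing zero to Eloc is an assumed convention:

```python
import math

def _shortest_paths(n, edges):
    """Floyd-Warshall all-pairs distances on an unweighted, undirected graph."""
    d = [[0.0 if i == j else math.inf for j in range(n)] for i in range(n)]
    for i, j in edges:
        d[i][j] = d[j][i] = 1.0
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d

def global_efficiency(n, edges):
    """E_glob(G) of Eq. (12): mean of eps_ij = 1/d_ij over ordered pairs.
    Disconnected pairs have d_ij = inf, hence eps_ij = 0, so the measure
    stays well defined where L(G) would diverge."""
    d = _shortest_paths(n, edges)
    s = sum(1.0 / d[i][j]
            for i in range(n) for j in range(n)
            if i != j and d[i][j] != math.inf)
    return s / (n * (n - 1))

def local_efficiency(n, edges):
    """E_loc(G) of Eqs (13)-(14): average over nodes i of the global
    efficiency of the subgraph G_i induced by the neighbours of i."""
    neigh = [set() for _ in range(n)]
    for i, j in edges:
        neigh[i].add(j)
        neigh[j].add(i)
    total = 0.0
    for i in range(n):
        nodes = sorted(neigh[i])
        if len(nodes) < 2:
            continue                      # E(G_i) taken as 0 (assumed convention)
        idx = {v: p for p, v in enumerate(nodes)}
        sub_edges = [(idx[a], idx[b]) for a, b in edges
                     if a in neigh[i] and b in neigh[i]]
        total += global_efficiency(len(nodes), sub_edges)
    return total / n
```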
4 Information Retrieval Algorithm using the Fuzzy Concept Network
In the last decade, the rapid and wide development of the Internet has brought online an increasingly large amount of documents and textual
information. The necessity of a better definition of the Information Retrieval System (IRS) emerged in order to retrieve the information considered pertinent to a user query. Information Retrieval is a discipline that involves the organization, storage, retrieval and display of information. IRSs are designed with the objective of providing references to documents containing the information requested by the user [32]. In an IRS, problems arise when uncertainty and vagueness, which appear in many different parts of the retrieval process, need to be handled. On the one hand, an IRS is required to understand queries expressed in natural language. On the other hand, it needs to handle the uncertain representation of a document. In the literature, the many models of IRS are classified into the following categories: boolean logic, vector space, probabilistic and fuzzy logic [39, 40]. However, neither the efficiency nor the effectiveness of these methods is satisfactory [34]. Thus, other approaches have been proposed to directly handle knowledge-based fuzzy information. In preliminary attempts the knowledge was represented by a concept matrix, whose elements identify relevance values among concepts [34]. Other relevant approaches have added fuzzy types to object-oriented database systems [41].
As stated in Section 2, a crucial topic for semantic information handling is facing the trade-off between the proper definition of an object and its "common sense" counterpart. The characteristic weights of the FCN are initially set by an expert of the domain. In particular, the expert sets the initial correlation values on the fuzzy ontology, and the fuzzy concept network construction procedure takes these as initial values for the links among the objects in O (see Definition 4). From then on, the correlation values F(oi, oj) are updated according to the queries (both selections and insertions) performed on the documents.
Above all, the use of the FCN gives the possibility of directly incorporating the semantics expressed by natural language into the graph traversal. This feature lets us obtain intrinsically fuzzy information retrieval algorithms without introducing fuzzification and defuzzification operators. Let us stress that this is possible because fuzzy logic is directly inserted into the knowledge expressed by the fuzzy ontology.
An example of this kind of approach is presented in the following. The original crisp information retrieval algorithm taken into account has been successfully applied to support the creative processes of architects and interaction designers. More in detail, a new formalization of the algorithm adopted in the ATELIER project (see [22] and Section 5.1) is presented, including a brief step-by-step description.
The FCN has been involved in steps (1) and (4) in order to semantically enrichthe results obtained. The algorithm input is the vector of the keywords in thequery. The first step of the algorithm uses these keywords to locate the documents(e.g., stored in a relational database) containing them. The keyword vector isextended with all the new keywords related to each selected document.
In step (1) the queries are extended by navigating the FCN recursively. For each keyword specified in the query, a depth-first visit is performed, stopping the traversal at a fixed depth. In [22] this threshold was set to 3. The edges whose F(oi, oj) is 0 are excluded and the neighbouring keywords are collected without
repetitions. Usually, Information Retrieval techniques simply extend queries with the parents and children of the concepts. In step (2) the final list of neighbouring entities is pruned by navigating the fuzzy ontology; namely, the set of candidates is reduced by excluding the keywords that are not connected by a direct path in the taxonomy, i.e., that are not parents or children of the terms contained in the query. In the third step the documents containing the resulting keywords are extracted from the knowledge base. In the last step, the FCN is used to calculate the relevance of the documents, thus arranging them in the desired order. In particular, thanks to the FCN characterising functions F and m, the weights for the keywords oi in each selected document are determined according to the following equation:
w(oi) = m(oi)^{βoi} · Σ_{oj∈K, oj≠oi} [F(oi, oj)]^{βoi,oj}   (15)

where K is the set of keywords obtained from step (3) and βoi,oj ∈ ℝ is a modifier value used to express the effects of concept modifiers (see [19] and Section 2.2 for details).
The final score of a document is evaluated through a cosine distance among the weights of each keyword; this is done for normalisation purposes. The resulting values are finally sorted in order to obtain a ranking of the documents.
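The weighting step of Equation (15) can be sketched as follows; the membership values, correlations and modifier exponents are invented for illustration, not taken from the paper:

```python
# Keyword weighting per Eq. (15):
#   w(o_i) = m(o_i)^beta_i * sum over o_j in K, o_j != o_i of F(o_i, o_j)^beta_ij
# A single beta per call is a simplifying assumption; the paper allows a
# distinct modifier exponent for each concept and pair.
def keyword_weight(oi, keywords, m, F, beta_node=1.0, beta_edge=1.0):
    """Weight of keyword oi against the other keywords K from step (3).
    m maps entities to membership values; F maps unordered pairs to
    correlations (missing pairs count as 0, i.e. uncorrelated)."""
    total = sum(F.get(frozenset((oi, oj)), 0.0) ** beta_edge
                for oj in keywords if oj != oi)
    return (m[oi] ** beta_node) * total

m = {"wine": 0.7, "cabernet": 0.9, "holiday": 0.4}
F = {frozenset(("wine", "cabernet")): 0.8}
K = ["wine", "cabernet", "holiday"]
w_wine = keyword_weight("wine", K, m, F)    # 0.7 * (0.8 + 0.0) = 0.56
```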
5 Test validation
This section is divided as follows: in the first part we introduce the environment used to test the FCN, whereas in the second part an analytic study of the scale-free properties of these networks is given.
5.1 Description of the experiment
A creative learning environment is the context chosen to study the behaviour of fuzzy concept networks. In particular, the ATELIER (Architecture and Technologies for Inspirational Learning Environments) project has been examined. ATELIER is an EU-funded project that is part of the Disappearing Computer initiative2. The aim of this project is to build a digitally enhanced environment supporting a creative learning process in architecture and interaction design education. The work of the students is supported by many kinds of devices (e.g., large displays,
2http://www.disappearing-computer.net
RFID technology, barcodes, . . . ) and a hyper-media database (HMDB) is used to store all the digital materials produced. An approach has been studied to help and support the creative practices of the students. To achieve this goal, an ontology-driven selection tool has been developed. In a recent contribution [22] it has been shown how dynamical fuzzy ontologies are suitable for this context. Every day the students create a very large number of documents and artifacts, and they collect a lot of material (e.g., digital pictures, notes, videos, and so on). Thus, new concepts are produced during the life cycle of a project. Indeed, the ontology evolves in time, and the need emerges to make a dynamical fuzzy ontology suited to the context taken into account.
As presented in Section 2, the fuzzy ontology is an exhaustive approach to handling knowledge-based fuzzy information. Furthermore, the evolution of the fuzzy concept network is mainly driven by the keywords of the documents inserted in the HMDB and by the concepts written by the users during the definition of a query. The algorithm presented in Section 4 relies heavily on these properties.

The experiments consider all these aspects. We have examined the trend of the fuzzy concept network in three different scenarios: the contribution of the keywords in the HMDB, the contribution of the concepts from the queries, and their combined effects. In the first case, 485 documents have been examined, with an average of four keywords per document, resulting in 431 final correlations. In the second case, 500 queries were performed by users aged from 20 to 60. For each query a user could include up to 5 different concepts and semantically enrich the request using the following list of concept modifiers: little, enough, moderately, quite, very, totally. In this experiment we obtained 232 correlations and 32 new concepts, which were introduced into the fuzzy ontology domain. In the last test we examined both sources (485 documents + 500 queries) jointly: the keywords of the documents and the requests of the users. The number of induced correlations is 615, while the new concepts are 37.
5.2 Analytic results of the experiments
During the construction of the fuzzy concept network, snapshots have been periodically dumped to file (one snapshot every 50 queries) to be analyzed. To obtain a graphical topological representation, a network analysis tool called AGNA3 has been used. AGNA (Applied Graph and Network Analysis) is a platform-independent application designed for scientists and researchers who employ specific mathematical methods such as social network analysis, sociometry and sequential analysis. Specifically, AGNA can assist in the study of communication relations in groups, organizational analysis and team building, kinship relations, or laws of organization in animal behavior. The most recent version, AGNA 2.1.1, has been used to produce the following pictures of the fuzzy concept networks, starting from the weighted adjacency matrix defined in Section 3. The link color intensity is proportional to the function F introduced in Definition 2, so the more marked
3 http://www.geocities.com/imbenta/agna/
lines correspond to F values close to 1. Because of the large number of concepts and correlations, the pictures in Figure 4 are meant to show qualitatively the link distributions and the hub locations; little semantic information about the ontology can be extracted from these pictures.
Both the global and the local efficiency, as defined in Equation (12) and Equation (14), are calculated on each of these snapshots. The evolution of the efficiency measures is reported in Figures 3(a), 3(b) and 3(c). The solid line corresponds to the global efficiency, while the dashed one is the local efficiency.
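For readers who wish to reproduce these measures, a minimal sketch follows. We hedge here: we assume Equations (12) and (14) follow the standard Latora-Marchiori definitions of global and local efficiency [24], and we take the length of a link to be the inverse of its fuzzy correlation weight; both choices are our reading, not necessarily the paper's exact formulation.

```python
import itertools

def shortest_distances(adj):
    """All-pairs shortest path lengths via Floyd-Warshall.
    adj[i][j] is the direct link weight (semantic correlation in [0,1]);
    as an assumption, the 'length' of a link is taken to be 1/weight."""
    n = len(adj)
    INF = float("inf")
    d = [[0.0 if i == j else (1.0 / adj[i][j] if adj[i][j] > 0 else INF)
          for j in range(n)] for i in range(n)]
    for k, i, j in itertools.product(range(n), repeat=3):
        if d[i][k] + d[k][j] < d[i][j]:
            d[i][j] = d[i][k] + d[k][j]
    return d

def global_efficiency(adj):
    """Mean inverse shortest-path length over all ordered node pairs."""
    n = len(adj)
    if n < 2:
        return 0.0
    d = shortest_distances(adj)
    return sum(1.0 / d[i][j] for i in range(n) for j in range(n)
               if i != j and d[i][j] < float("inf")) / (n * (n - 1))

def local_efficiency(adj):
    """Mean, over all nodes, of the global efficiency of the
    subgraph induced by each node's neighbours."""
    n = len(adj)
    total = 0.0
    for i in range(n):
        nbrs = [j for j in range(n) if j != i and adj[i][j] > 0]
        sub = [[adj[a][b] for b in nbrs] for a in nbrs]
        total += global_efficiency(sub)
    return total / n if n else 0.0
```

On a star-shaped network (one hub) this yields a high global and zero local efficiency, the signature of a hub-dominated, unclustered topology discussed below.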
In Figure 3(a) the efficiency evolution for the HMDB is reported. The global efficiency becomes dominant after an initial transient in which the local efficiency dominates. We can therefore deduce the emergence of a hub-connected fuzzy concept network. This observation is graphically confirmed by the network reported in Figure 4(a). To increase readability, the hubs have been placed on the borders of the figure.
It can be clearly seen that the hub concepts are "people", "man", "woman", "hat", "face" and "portrait". These central concepts have been isolated using the betweenness [42] sociometric measure. The high betweenness values of the hubs, with respect to those of the other concepts, confirm the measure obtained through Eglo. Indeed, the mean distance among the concepts is kept low thanks to these points appearing very frequently in the paths from and to all the other nodes. We want to stress that the global efficiency quantifies the presence of hubs in a given network, while the betweenness of the nodes gives a way to identify which concepts actually act as hubs. On the other hand, Figure 3(b) shows how the local efficiency in a fuzzy concept network built using user queries is higher than its global counterpart. This suggests that the network topology lacks hubs; furthermore, many nodes have roughly the same number of neighbours. A confirmation of this analysis is given by the fuzzy concept network reported in Figure 4(b). In this case the betweenness index shows that no particular concept appears with significantly higher frequency in the paths between the other nodes. Finally, in Figure 3(c) the behaviour of the network obtained from both the documents and the queries is reported. In this combined test a total of about 1000 queries is taken into account. It is interesting that the data from the HMDB and the user queries act in a non-linear way on the efficiency measures. The resulting fuzzy concept network shows a dominant Eglo with respect to its Eloc, and some hubs emerge. In particular, Figure 5.2 highlights the fact that the hubs collect a large number of links coming from the other concepts.
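Hub identification via betweenness can be sketched as follows. This is a brute-force illustration (all names are ours): it enumerates unweighted shortest paths explicitly, which is adequate for concept networks of this size; note that it counts each ordered source-target pair, so for an undirected network the scores are twice the conventional values, which does not affect the hub ranking.

```python
from collections import deque

def betweenness(adj_list):
    """Betweenness centrality by enumerating shortest paths.
    adj_list maps node -> set of neighbour nodes (undirected)."""
    nodes = list(adj_list)
    score = {v: 0.0 for v in nodes}
    for s in nodes:
        # BFS distances from s
        dist = {s: 0}
        q = deque([s])
        while q:
            u = q.popleft()
            for w in adj_list[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    q.append(w)
        for t in nodes:
            if t == s or t not in dist:
                continue
            # enumerate all shortest s-t paths by walking back from t
            stack, done = [[t]], []
            while stack:
                p = stack.pop()
                head = p[-1]
                if head == s:
                    done.append(p)
                    continue
                for w in adj_list[head]:
                    if dist.get(w, -1) == dist[head] - 1:
                        stack.append(p + [w])
            for p in done:
                for v in p[1:-1]:  # interior nodes only
                    score[v] += 1.0 / len(done)
    return score
```

A concept with a betweenness far above the mean, like "people" in the combined network, is exactly a node that "appears very frequently in the paths from and to all the other nodes".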
The betweenness index for this fuzzy concept network identifies one main hub, the concept "people", with an extremely high value (5 times the mean value of the other hubs). The other principal hubs are "woman", "man", "landscape", "sea", "portrait", "red" and "fruit". In this case the Freeman General Coefficient evaluated on the indexes is slightly lower than in the HMDB case; this is due to the higher clustering among concepts (higher Eloc), see Table 1. The Freeman Coefficient is a function that consolidates node-level measures into a single value related to the properties of the whole network [42].
The strength of these connections is much more marked than in the similar situation of the HMDB-induced fuzzy concept network. This means that the contribution of the queries reinforces the semantic correlations among the hub
Figure 3: (a) Efficiency measures for the concept network induced by the knowledge base. (b) Efficiency for the user queries. (c) Efficiency for the joined queries and knowledge base documents.
concepts. To confirm the scale-free nature of the hubs in the fuzzy concept networks, we analyzed the statistical distributions of k(i) (see Equation (4)) reported in Figure 5.
In Figures 5(a) and 5(c) the frequencies decrease according to a power law, which agrees very well with the theoretical expectations. The user query distributions, in Figure 5(b), behave differently: their high values of Eloc imply, as already stated, a highly clustered structure.

Let us consider how Eloc is related to other classical social network parameters such as the density and the weighted density [42]. The density reflects the connectedness of a given network with respect to its complete graph. It is easy to note that this criterion is quite similar to what is stated in Equation (14): we can consider Eloc as a mean value of the densities evaluated locally at each node of the fuzzy concept network. Unexpectedly, the numerical results in Table 1 show an inverse proportionality between Eloc and the density; more investigation is required.
The weighted density can be interpreted as a measure of the mean link weight value, namely the mean semantic correlation (see Section 2) among the concepts in the fuzzy concept network. In Table 1 it can be seen that the weighted density values are higher for the systems exhibiting hubs. This is graphically confirmed by Figures 4(a) and 5.2, where the links among the concepts are more marked (i.e., more colored lines correspond to stronger correlations).
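Under a hedged reading of the sociometric definitions in [42], the density and weighted density can be computed from the weighted adjacency matrix as follows (density as the fraction of possible links that are present, weighted density as the mean link value over all possible pairs; function names are ours):

```python
def densities(adj):
    """Density and weighted density of an undirected weighted network.
    adj is a symmetric matrix; adj[i][j] > 0 means a link of that weight."""
    n = len(adj)
    pairs = n * (n - 1) / 2  # number of possible undirected links
    links = sum(1 for i in range(n) for j in range(i + 1, n) if adj[i][j] > 0)
    weight = sum(adj[i][j] for i in range(n) for j in range(i + 1, n))
    return links / pairs, weight / pairs
```

With this reading, a network whose few links are strong (as in the hub-dominated cases) can show a weighted density higher than a network with many weak links, consistent with the trend in Table 1.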
Figure 5: (a) HMDB fuzzy concept network link distribution. (b) User queries fuzzy concept network link distribution. (c) Complete knowledge-base fuzzy concept network link distribution.
Table 1: Comparison of the efficiency measures and sociometric parameters of our fuzzy concept networks.
Fuzzy Concept Network      Eglo    Freeman Coeff.   Eloc    Density   Weighted Density
HMDB                       0.094   0.07             0.074   0.035     0.01
Queries                    0.053   0.02             0.144   0.015     0.006
Complete (HMDB+Queries)    0.141   0.06             0.079   0.036     0.013
6 Related Work
A semantic network (or net) is a graphical notation to represent knowledge through patterns of interconnected nodes and arcs, defining concepts and the semantic relations between them, respectively [43]. This kind of declarative graphic can also be used to support automated reasoning systems that treat certain domain knowledge. Sowa [44] describes six of the most common kinds of semantic networks, among them definitional networks, assertional networks and implicational networks. WordNet (a lexical database for the English language) is an example of a famous semantic network that has been widely used in recent years.
An ontology can be envisioned as a kind of semantic network where the concepts are related to one another using meaningful relationships. To date, the most important semantic relations are meronymy ('A is part of B'), holonymy ('B has A as a part of itself'), hyponymy ('A is a kind of B'), and synonymy ('A denotes the same as B'). In this field, the contribution of this paper is the introduction of a new semantic relation between the entities of the fuzzy ontology, in order to define a semantic network based on the correlations (see Definition 3). The topological structure of the FCN allows an intuitive navigation through a wide collection of information, providing correlations among concepts that are not predictable a priori. In [45] a similar approach is reported: a system is described which is able to extend ontologies in a semi-automatic way. In particular, such a system
has been applied to ontologies related to the contexts covered in the Web mining field of research. Another ontology model somewhat related to our approach is the seed ontology. A seed ontology creates a semantic network through co-occurrence analysis: it considers exclusively how many times keywords are used together. Disambiguation problems are resolved by consulting WordNet. The major advantage of these nets is that they allow the efficient identification of the most probable candidates for inclusion in an extended ontology.
Ontology-based information retrieval approaches are among the most promising methodologies to improve the quality of the responses given to users. The definition of the FCN implies a better calculation of the relevance of searched documents. A different approach to ontology-based information retrieval has been proposed in [46]. In that work a semantic network is built to represent the semantic contents of a document. The topological structure of this network is used in the following way: every time a query is performed, a keyword vector is created in order to select the appropriate concepts that characterize the contents of the documents according to the search criteria. We are investigating how to integrate the FCN with this different kind of semantic network; in fact, both methodologies could be effectively used to achieve the goals of IR systems.
7 Conclusions
In this paper, we considered an extension of a classical ontology incorporating Fuzzy Set Theory. We have reported several case studies and some research areas where it is currently applied. Furthermore, we have analyzed the evolution of the fuzzy ontology in time: the correlations among concepts change according to the different queries submitted, taking into account both document insertions and user queries. In relation to the fuzzy ontology, a new network called the fuzzy concept network has been defined. In this way, the dynamical fuzzy ontology reflects the behaviour of a fuzzy knowledge base. Furthermore, a novel information retrieval algorithm exploiting this data structure has been proposed. We have examined the topological structure of this network as a novel complex network system, starting from the network efficiency evaluations. A highly clustered structure emerges, highlighting the role of some concepts as hubs and the characteristic distribution of weighted links among them. Thus, we have established the scale-free nature of the fuzzy concept network by evaluating two efficiency measures that investigate its local and global properties. Finally, we compared these measures with the parameters commonly used in social network analysis.
Acknowledgements
The work presented in this paper has been partially supported by the ATELIER project (IST-2001-33064).
References
[1] T. Berners-Lee, J. Hendler, and O. Lassila, "The semantic web," Scientific American, vol. 284, pp. 34–43, 2001.
[2] N. Guarino, "Formal ontology and information systems," 1998. [Online]. Available: citeseer.ist.psu.edu/guarino98formal.html
[3] T. Gruber, "A translation approach to portable ontology specifications," Knowledge Acquisition, vol. 5, pp. 199–220, 1993.
[4] E. Falkenberg, W. Hesse, P. Lindgreen, B. Nilsson, J. Oei, C. Rolland, R. Stamper, F. van Assche, A. Verrijn-Stuart, and K. Voss, "FRISCO: A framework of information system concepts," IFIP, The FRISCO Report (Web Edition), 1998.
[5] N. Lammari and E. Métais, "Building and maintaining ontologies: a set of algorithms," Data and Knowledge Engineering, vol. 48, pp. 155–176, 2004.
[6] N. Guarino and P. Giaretta, "Ontologies and knowledge bases: Towards a terminological clarification," in Towards Very Large Knowledge Bases: Knowledge Building and Knowledge Sharing, N. Mars, Ed. Amsterdam: IOS Press, 1995, pp. 25–32.
[7] V. W. Soo and C. Y. Lin, "Ontology-based information retrieval in a multi-agent system for digital library," in 6th Conference on Artificial Intelligence and Applications, 2001, pp. 241–246.
[8] G. Klir and B. Yuan, Fuzzy Sets and Fuzzy Logic: Theory and Applications. Prentice Hall, 1995.
[9] L. A. Zadeh, "Fuzzy sets," Information and Control, vol. 8, pp. 338–353, 1965.
[10] P. Bouquet, J. Euzenat, E. Franconi, L. Serafini, G. Stamou, and S. Tessaris, "Specification of a common framework for characterizing alignment," IST Knowledge Web NoE, Deliverable 2.2.1, 2004.
[11] S. Singh, L. Dey, and M. Abulaish, "A framework for extending fuzzy description logic to ontology based document processing," in Proceedings of AWIC 2004, ser. LNAI, vol. 3034. Springer-Verlag, 2004, pp. 95–104.
[12] M. Abulaish and L. Dey, "Ontology based fuzzy deductive system to handle imprecise knowledge," in Proceedings of the 4th International Conference on Intelligent Technologies (InTech 2003), 2003, pp. 271–278.
[13] C. Matheus, "Using ontology-based rules for situation awareness and information fusion," position paper presented at the W3C Workshop on Rule Languages for Interoperability, April 2005.
[14] T. Quan, S. Hui, and T. Cao, "FOGA: A fuzzy ontology generation framework for scholarly semantic web," in Knowledge Discovery and Ontologies (KDO-2004), Workshop at ECML/PKDD, 2004.
[15] Q. T. Tho, S. C. Hui, A. Fong, and T. H. Cao, "Automatic fuzzy ontology generation for semantic web," IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 6, pp. 842–856, 2006.
[16] D. Parry, "A fuzzy ontology for medical document retrieval," in Proceedings of the Australian Workshop on Data Mining and Web Intelligence (DMWI2004), Dunedin, 2004, pp. 121–126.
[17] T. Andreasen, J. F. Nilsson, and H. E. Thomsen, "Ontology-based querying," in Flexible Query-Answering Systems, 2000, pp. 15–26. [Online]. Available: citeseer.ist.psu.edu/682410.html
[18] C.-S. Lee, Z.-W. Jian, and L.-K. Huang, "A fuzzy ontology and its application to news summarization," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 35, pp. 859–880, 2005.
[19] S. Calegari and D. Ciucci, "Integrating fuzzy logic in ontologies," in ICEIS, Y. Manolopoulos, J. Filipe, P. Constantopoulos, and J. Cordeiro, Eds., 2006, pp. 66–73.
[20] E. Sanchez and T. Yamanoi, "Fuzzy ontologies for the semantic web," in Proceedings of FQAS, 2006, pp. 691–699.
[21] AA.VV., "Karlsruhe Ontology and Semantic Web Tool Suite (KAON)," 2005, http://kaon.semanticweb.org.
[22] S. Calegari and M. Loregian, "Using dynamic fuzzy ontologies to understand creative environments," in LNCS - FQAS, H. L. Larsen, G. Pasi, D. O. Arroyo, T. Andreasen, and H. Christiansen, Eds. Springer, 2006, pp. 404–415.
[23] D. J. Watts and S. H. Strogatz, "Collective dynamics of 'small-world' networks," Nature, vol. 393, no. 6684, pp. 440–442, 1998.
[24] V. Latora and M. Marchiori, "Efficient behavior of small-world networks," Physical Review Letters, vol. 87, no. 19, p. 198701, 2001.
[25] Y. Bar-Yam, Dynamics of Complex Systems. Reading, MA: Addison-Wesley, 1997.
[26] M. E. J. Newman, "Models of the small world: A review," Journal of Statistical Physics, vol. 101, p. 819, 2000. [Online]. Available: http://www.citebase.org/cgi-bin/citations?id=oai:arXiv.org:cond-mat/0001118
[27] P. Radaelli, S. Calegari, and S. Bandini, "Towards fuzzy ontology handling vagueness of natural languages," in LNCS - Rough Sets and Knowledge Technology, G. Wang, J. Peters, A. Skowron, and Y. Yao, Eds. Springer Berlin / Heidelberg, 2006, pp. 693–700.
[28] D. Widyantoro and J. Yen, "A fuzzy ontology-based abstract search engine and its user studies," in Proc. 10th IEEE Int'l Conf. Fuzzy Systems. IEEE Press, 2001, pp. 1291–1294.
[29] ——, "Using fuzzy ontology for query refinement in a personalized abstract search engine," in IFSA World Congress and 20th NAFIPS International Conference. IEEE Press, 2001, pp. 610–615.
[30] L. A. Zadeh, "A fuzzy-set-theoretic interpretation of linguistic hedges," Journal of Cybernetics, vol. 2, pp. 4–34, 1972.
[31] T. D. Khang, H. Störr, and S. Hölldobler, "A fuzzy description logic with hedges as concept modifiers," in Third International Conference on Intelligent Technologies and Third Vietnam-Japan Symposium on Fuzzy Systems and Applications, 2002, pp. 25–34.
[32] V. V. Raghavan and S. K. M. Wong, "A critical analysis of vector space model for information retrieval," Journal of the American Society for Information Science, vol. 37, no. 5, pp. 279–287, 1986.
[33] J. Xu and W. B. Croft, "Improving the effectiveness of information retrieval with local context analysis," ACM Transactions on Information Systems, vol. 18, no. 1, pp. 79–112, 2000.
[34] S.-M. Chen and J.-Y. Wang, "Document retrieval using knowledge-based fuzzy information retrieval techniques," IEEE Transactions on Systems, Man, and Cybernetics, vol. 25, no. 5, pp. 793–803, 1995.
[35] D. Lucarella and R. Morara, "FIRST: Fuzzy information retrieval system," Journal of Information Science, vol. 17, no. 2, pp. 81–91, 1991.
[36] S. Milgram, "The small world problem," Psychology Today, vol. 2, pp. 60–67, 1967.
[37] R. Albert, H. Jeong, and A.-L. Barabási, "Error and attack tolerance of complex networks," Nature, vol. 406, pp. 378–382, 2000.
[38] P. Crucitti, V. Latora, M. Marchiori, and A. Rapisarda, "Efficiency of scale-free networks: Error and attack tolerance," Physica A, vol. 320, pp. 622–642, 2003.
[39] G. Salton and M. McGill, Introduction to Modern Information Retrieval. McGraw-Hill, 1984.
[40] F. Crestani and G. Pasi, "Soft information retrieval: applications of fuzzy sets theory and neural networks," in Neuro-Fuzzy Techniques for Intelligent Information Systems, N. Kasabov and R. Kozma, Eds. Physica Verlag, 1999, pp. 287–315.
[41] N. Marín, O. Pons, and M. A. V. Miranda, "A strategy for adding fuzzy types to an object-oriented database system," International Journal of Intelligent Systems, vol. 16, no. 7, pp. 863–880, 2001.
[42] S. Wasserman and K. Faust, Social Network Analysis. Cambridge: Cambridge University Press, 1994.
[43] J. Lee, M. Kim, and Y. Lee, "Information retrieval based on conceptual distance in is-a hierarchies," Journal of Documentation, vol. 49, no. 2, pp. 188–207, 1993.
[44] J. Sowa, "Semantic networks," in Encyclopedia of Artificial Intelligence, S. C. Shapiro, Ed. Wiley, 1987.
[45] W. Liu, A. Weichselbraun, A. Scharl, and E. Chang, "Semi-automatic ontology extension using spreading activation," Journal of Universal Knowledge Management, vol. 0, no. 1, pp. 50–58, 2005.
[46] M. Baziz, M. Boughanem, N. Aussenac-Gilles, and C. Chrisment, "Semantic cores for representing documents in IR," in Proceedings of the ACM Symposium on Applied Computing, 2005, pp. 1011–1017.
This paper discusses different approaches for integrating biological knowledge in gene expression analysis. We are interested in the fifth step of the microarray analysis procedure, which focuses on knowledge discovery via interpretation of the microarray results. We present a state of the art of methods for processing this step and propose a classification into three facets: prior or knowledge-based, standard or expression-based, and co-clustering. First we briefly discuss the purpose and usefulness of our classification; the following sections then give an insight into each facet. We summarize each section with a comparison between notable approaches.

Keywords: data mining, knowledge discovery, bioinformatics, microarray, biological sources of information, gene expression, integration.
1 Introduction
Nowadays, one of the main challenges in gene expression technologies is to highlight the main co-expressed1 and co-annotated2 gene groups using at least one of the different sources of biological information [1]. In other words, the issue is the interpretation of microarray results via the integration of gene expression profiles with the corresponding biological gene annotations extracted from biological databases.

Analyzing microarray data consists of five steps: protocol and image analysis, statistical data treatment, gene selection, gene classification, and knowledge discovery via data interpretation [2]. Figure 1 illustrates the goal of the fifth analysis step, devoted to interpretation, which is the integration of two domains: the numeric one, represented by the gene expression profiles, and the knowledge one, represented by gene annotations issued from different sources of biological information.

At the beginning of gene expression technologies, research focused on the numeric3 side. A variety of data analysis approaches have thus been reported ([3, 4, 5, 6, 7, 8]) which identify groups of co-expressed genes based only on expression profiles, without taking biological knowledge into account. A common characteristic of purely numerical approaches is that they determine gene groups (or clusters) of potential interest but leave to the expert the task of discovering and interpreting the biological similarities hidden within these groups. These methods are useful, because they guide the analysis of the co-expressed gene groups; nevertheless, their results are often incomplete, because they do not include biological considerations based on the biologists' prior knowledge.
1 Co-expressed gene group: a group of genes with a common expression profile.
2 Co-annotated gene group: a group of genes with the same annotation. A gene annotation is a piece of biological information related to the gene that can be relational, syntactical, functional, etc.
3 We understand by numeric part the analysis of the gene expression measures only, disregarding the biological annotations.
International Journal of Computer Science & Applications, Vol. IV, No. II, pp. 145-163