Ingénierie des systèmes d’information

Ingénierie des systèmes d’information, RSTI série ISI, Volume 10, No 1/2005, pp. 59–79.

Mining XML Clinical Data: The HealthObs System

George Potamias *, Lefteris Koumakis *, Vassilis Moustakis **

* Institute of Computer Science, Foundation for Research and Technology – Hellas (FORTH), Vassilika Vouton, P.O. Box 1385, 711 10 Heraklion, Crete, Greece
[email protected]
[email protected]

** Institute of Computer Science, Foundation for Research and Technology – Hellas (FORTH), Vassilika Vouton, P.O. Box 1385, 711 10 Heraklion, Crete, Greece, and Department of Production Engineering and Management, Technical University of Crete, University Campus, Kounoupidiana, 73100 Chania, Crete, Greece
[email protected]

ABSTRACT: We present HealthObs, an environment for seamless clinical data integration and intelligent processing. HealthObs, and in particular its knowledge discovery component, aims at the discovery of interesting associations from distributed and heterogeneous clinical information systems. HealthObs contributes to the semantic integration of patient health data stored across multiple sources. The system incorporates Association Rule Mining (ARM) operations, which operate on top of XML documents. A real-world case study, based on mining across patient records in the region of Crete, demonstrates the effectiveness, efficiency and reliability of the proposed approach and system.

KEY WORDS: XML mining, association rules, clinical data, HYGEIAnet.



1. Introduction

Information and communication technologies (ICT) have had, and will continue to have, a major impact on a wide range of social functions and activities, including the delivery of health care services. Systems and applications that are based on the integration of distributed and often heterogeneous sources evolve continuously, leading to the realization of information system federations. Each federation aims to satisfy a user community whose members may have different objectives and pose different inquiries when using the federated information resources. For instance, one inquiry may explore interesting associations between disease manifestation, symptom presence and demographic characteristics. In the same context, another inquiry may relate to a specific set of clinical or laboratory features.

Common to all such inquiries is the quest for knowledge believed to be hidden in the multitude of data that reside at the members of the federation. Such an inquiry cannot be satisfied by pre-established system operations, since neither the system developer nor the members of the user community know the inquiry beforehand. Thus, we need a facilitator component able to interface with the federated members in order to extract, analyze and model interesting patterns relevant to a specific inquiry.

This article presents HealthObs (a name composed of the words “health” and “observatory”), an integrator and knowledge discovery facilitator system that enables users to pose inquiries over a federation of distributed and heterogeneous information systems. As its name indicates, the specific realization of HealthObs relates to the medical domain; however, its principles and concrete implementation characteristics are generic and independent of service or application. We have used HealthObs in the context of HYGEIAnet, the integrated health telematics network on the island of Crete, Greece (HYGEIAnet 03). HYGEIAnet is a regional health care information infrastructure at the heart of which resides the integrated electronic health care record (IEHCR) service (Kilman et al., 97, Tsiknakis et al., 2000, Katehakis et al., 2001, Grimson 01). IEHCR acts as a single point of access to patients’ information recorded and kept in HYGEIAnet’s federated clinical information systems.

HealthObs builds on top of the IEHCR infrastructure and aims to turn its standard patient-centric nature into a population-centric one, able to support data mining and knowledge discovery operations across the federated clinical information resources. The contribution of HealthObs, and the respective add-on services it brings, relate to the system’s ability to:

- ease query formulation, via friendly human-system interfaces that are flexible enough to make data exploration inquiries natural to pose,

- semantically homogenize the distributed and heterogeneous data entries, via standard clinical data models and ontology, and represent them uniformly, via XML technology,

- impose and utilize data mining and knowledge discovery operations on top of XML structures,

- offer friendly visualization operations that ease inspection and interpretation of the discovered results, and

- establish the necessary conditions for the customization of the system across various domains, enabled by a specially devised domain editor.

Figure 1. The reference architecture of the HealthObs system: enabling components and their relations.

In its current implementation, HealthObs supports Association Rule Mining, ARM (Agrawal et al., 1994; Mannila et al., 1994), and clustering operations. In this paper we present the ARM component of HealthObs. The results reported herein reflect two years of experience in using the system over HYGEIAnet’s clinical information systems.

The paper is organized as follows. Section 2 presents in detail the system’s enabling components and their functionality. Section 3 presents the domain customization services enabled by the system’s domain editor. Section 4 presents a real-world medical case study that demonstrates the use and utility of the system. Section 5 refers to related work and approaches and compares them with HealthObs. In the last section we conclude and point to future research and development plans.


2. HealthObs: Components, Operations and Functionality

An outline of the reference architecture underlying HealthObs is shown in Figure 1, above. Central to the architecture is a single data-enriched XML file which contains information and data from different information systems. Accumulation of the data and their XML formatting are performed off-line (left part of Figure 1). To this end, the IEHCR services of HYGEIAnet are utilized in order to query the federated clinical information sources and recall the relevant query-specific data items. For each query, and with the aid of custom-made filtering and formatting operations, the respective query-specific XML file is created; a sample of this file is shown in Figure 2. HealthObs bases its operations on such data-enriched XML files. The system’s enabling components, the operations involved, and their functionality are detailed in the sequel.

Figure 2. A sample of a clinical data-enriched XML file.

2.1. Query Formulation

Query formulation supports the representation of the inquiry presented to the system. For instance, a user may decide to investigate the association between a limited number of clinical features (e.g., between biochemical tests: GLUCOSE, CHOLESTEROL, etc., and related diagnoses: ICD9DISEASE).


In Figure 3, below, the system’s feature-selection interface is shown. The features to be selected correspond to the instance elements present in the data-enriched XML file to process (Figure 2), and are syntactically consistent with the DTD (document type definition) grammar accompanying the XML file. Note in Figure 3 a distinctive characteristic of HealthObs, which relates to the specification of the desired form of the association rules: (a) if the user only ticks a feature, then this feature may or may not be present in the rules, i.e., it is a not-obligatory feature (e.g., HDLCHOLESTEROL); (b) if the user not only ticks a feature but also posts an ‘IF’ tick (e.g., GENDERID) or a ‘THEN’ tick (e.g., ICD9DISEASE) on it, then the presence of the feature is obligatory in the ‘IF’ or the ‘THEN’ part of the rules, respectively.

Figure 3. The feature-selection interface of HealthObs.

2.2. Common Term Reference Service (CTRS) and XML Parsing

Upon presentation of the inquiry and selection of the respective query features, HealthObs activates the Common Term Reference Service (CTRS) component. CTRS supports the placement of the query in the context of the domain’s semantics, e.g., the medical ontology involved. Ontology is supported by incorporating into HealthObs the knowledge embedded in the International Classification of Diseases, version 9, ICD9 (NCHS 03) and in the International Classification of Primary Care, ICPC (WICC-ICPC 03). ICD9 and ICPC are domain specific, which implies that an alternative domain customization of HealthObs would need other ontology sources. However, in the medical realization of the system both support standard domain modeling and ensure syntactic consistency. CTRS relies on a COAS (Clinical Object Access Service) interface (COAS 99). Given that COAS offers little support for semantic-level integration, we have incorporated clinical ontology to hide heterogeneity and make individual information sources behave uniformly as a large collection of objects (Baldonado et al., 1996). In this context, CTRS incorporates specifications for the semantics of the domain, such as valid reference ranges of clinical features, which enable the transformation of numerical values to qualitative equivalents (Sciore et al., 1994). CTRS is enabled by, and operates on, a plain-text file in which the domain semantics are explicitly defined: the domain semantics file. The general format of this file is as follows:

SYNTHETIC-FEATURE: <NAME> <ATOMIC-FEATURE-NAME> [<synonym-1, …]

UNCONDITIONED
    <ATOMIC-FEATURE-NAME> <Lower>-<Upper> <NAME-to-Visualize>_<VALUE>

CONDITIONED
    if(SYNTHETIC-FEATURE-x/ATOMIC-FEATURE-y/VALUE=<VALUE-z>)
    <ATOMIC-FEATURE-NAME> <Lower>-<Upper> <NAME-to-Visualize>_<VALUE>

A sample part of this file is shown in Figure 4. Note that the entries of this file correspond to the respective element items present in the data-enriched XML file to process (Figure 2). For editing the domain-semantics file, and customizing it to different domains, we have developed a special tool, the Domain Editor, which operates within the HealthObs environment (see section 3).

Figure 4. A sample of a domain-semantics file: A plain-text file which may be edited directly or, by using the HealthObs’s domain editor.


Activation of the CTRS component results in the creation of an intermediate XML domain-semantics and query-specific schema, XMLdsq (lower part of Figure 5). XMLdsq is a restriction of the given DTD grammar and helps to: (i) focus the inquiry on the user-selected features, and (ii) semantically homogenize the content of the data-enriched XML file (i.e., the file that contains the data recalled from the heterogeneous information systems).

Figure 5. Common Term Reference Service (CTRS): an example of how the domain-semantics and query-specific schema (XMLdsq) is created.

For example (see Figure 5), suppose that the user posts a query to two different clinical information systems, IS-1 and IS-2 (the second system is a virtual exemplar of a Greek clinical information system). The inquiry relates to the discovery of interesting associations between glucose levels, an instance of the BIOCHEMICAL_TEST element, and ICD9-encoded diseases, an instance of the DIAGNOSIS element. Activation of CTRS yields the combined common XMLdsq schema, which includes just these two elements; notice that the CLINICAL element, included in the schema of IS-1, is not included in the resulting common schema. Furthermore, XMLdsq carries all the relevant semantic homogenization information. So, ‘BIOCHEMICAL_TEST’ is used as the standard reference term for ‘BIOCHEMISTRY’ and ‘ΒΙΟΧΗΜΙΚΕΣ_ΕΞΕΤΑΣΕΙΣ’ (i.e., the terms used by the respective information systems to store information about biochemical tests); ‘DIAGNOSIS’ for ‘ΔΙΑΓΝΩΣΗ’, the term used by the Greek clinical information system to store diagnostic information; ‘GLUCOSE’ for ‘BLOOD_SUGAR’ and ‘ΖΑΧΑΡΟ’, the terms used by the respective information systems to store information about glucose levels; and ‘ICD9DISEASE’ for ‘DISEASE’ and ‘ΑΣΘΕΝΕΙΑ’, the terms used to store information about ICD9-encoded diseases. Moreover, XMLdsq also carries information about the reference ranges that correspond to the respective symbolic values of glucose levels and ICD9-encoded diseases, i.e., ‘64.00-125.00’ for ‘NORMAL’ glucose values, and ‘320-389’ for the ICD9 codes linked with the diseases of the ‘NERVOUS’ system. This information is utilized in the course of parsing the data-enriched XML document.
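The homogenization described above can be illustrated with a minimal Python sketch. It is not the HealthObs implementation: the table layout and the function names (`reference_term`, `symbolic_value`, `homogenize`) are assumptions made for illustration, and only the terms and reference ranges quoted in the text are used. Treating ICD9 codes as plain numbers is also a simplifying assumption.

```python
# Hypothetical in-memory tables; terms and ranges are those quoted in the text.
SYNONYMS = {
    "BIOCHEMISTRY": "BIOCHEMICAL_TEST",
    "ΒΙΟΧΗΜΙΚΕΣ_ΕΞΕΤΑΣΕΙΣ": "BIOCHEMICAL_TEST",
    "ΔΙΑΓΝΩΣΗ": "DIAGNOSIS",
    "BLOOD_SUGAR": "GLUCOSE",
    "ΖΑΧΑΡΟ": "GLUCOSE",
    "DISEASE": "ICD9DISEASE",
    "ΑΣΘΕΝΕΙΑ": "ICD9DISEASE",
}

# feature -> list of (lower, upper, symbolic value), bounds inclusive
RANGES = {
    "GLUCOSE": [(64.00, 125.00, "NORMAL")],
    "ICD9DISEASE": [(320, 389, "NERVOUS")],
}


def reference_term(term):
    """Map a source-specific term to the standard reference term."""
    return SYNONYMS.get(term, term)


def symbolic_value(feature, raw):
    """Map a raw value to its symbolic equivalent, or None if no range matches."""
    value = float(raw)  # assumption: values (and ICD9 codes) are numeric
    for lower, upper, label in RANGES.get(feature, []):
        if lower <= value <= upper:
            return label
    return None


def homogenize(source_term, raw):
    """Return the (reference term, symbolic value) pair for one data entry."""
    feature = reference_term(source_term)
    return feature, symbolic_value(feature, raw)
```

For instance, `homogenize("BLOOD_SUGAR", "68.0")` and `homogenize("ΖΑΧΑΡΟ", "68.0")` both map to the same reference term and symbolic value, which is what allows entries from IS-1 and IS-2 to be mined together.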

Parsing. The XML file to parse contains information and data from different information systems. It is parsed using the created common XMLdsq and standard RDF/XML techniques (RDF/XML 2003). The result of parsing is kept in a special tree-based structure, the Prefix-tree, which is an efficient and space-economic data structure. So, there is no need to keep the whole XML file in memory, but just the instantiated Prefix-tree structure. The instantiated Prefix-tree is then processed in order to form the valid association rules.
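The idea of parsing without holding the whole file in memory can be sketched with Python's standard incremental XML parser. This is only an illustration under stated assumptions: the record element name `ENCOUNTER` and the two-level nesting (a group element containing feature elements) are hypothetical, since the actual layout of Figure 2 is not reproduced here, and HealthObs itself feeds a prefix-tree rather than yielding Python sets.

```python
import io
import xml.etree.ElementTree as ET


def iter_transactions(source, record_tag="ENCOUNTER"):
    """Yield one frozenset of (group, feature, value) triples per record.

    `record_tag` is a hypothetical element name for one patient encounter.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != record_tag:
            continue
        items = set()
        for group in elem:            # e.g. BIOCHEMICAL_TEST
            for feature in group:     # e.g. GLUCOSE
                items.add((group.tag, feature.tag, (feature.text or "").strip()))
        yield frozenset(items)
        elem.clear()  # drop the children of the record that was just processed


sample = (
    b"<RECORDS>"
    b"<ENCOUNTER><BIOCHEMICAL_TEST><GLUCOSE>NORMAL</GLUCOSE></BIOCHEMICAL_TEST>"
    b"<DIAGNOSIS><ICD9DISEASE>NERVOUS</ICD9DISEASE></DIAGNOSIS></ENCOUNTER>"
    b"<ENCOUNTER><BIOCHEMICAL_TEST><GLUCOSE>HIGH</GLUCOSE></BIOCHEMICAL_TEST></ENCOUNTER>"
    b"</RECORDS>"
)
transactions = list(iter_transactions(io.BytesIO(sample)))
```

Each record is discarded as soon as its items have been extracted, so memory use is driven by the structure that accumulates the items rather than by the size of the XML file.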

2.3. Association Rule Mining (ARM) and Prefix-Tree

We support discovery of associations across segments of clinical records by using association rule mining (Agrawal et al., 1994; Mannila et al., 1994, Mueller 95), or ARM for short. Using ARM we were able to establish significant and useful associations of the type: X ⇒ Y, where X and Y correspond to clinical data entries.

ARM background. Formally, ARM is defined as follows: let I = {i1, i2, …, im} be a set of literals, called items, and let D be a set of transactions, where each transaction T is a set of items such that T ⊆ I. An association rule is an implication of the form X ⇒ Y, where X ⊂ I, Y ⊂ I, and X ∩ Y = ∅. The rule X ⇒ Y holds in the transaction set D with confidence c if c% of the transactions in D that contain X also contain Y. Confidence measures the significance of the association rule. The rule X ⇒ Y has support s in the transaction set D if s% of the transactions in D contain X ∪ Y. Support measures the usefulness of the association. Given a set of transactions D, ARM discovers the associations whose support and confidence values are at least as high as user-specified thresholds: minimum support (minsup) and minimum confidence (minconf). The itemsets that meet the minsup criterion are called frequent itemsets.
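The two measures can be made concrete with a short Python sketch. The four transactions below are invented toy data, not taken from the case study, and the function names are illustrative.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(x, y, transactions):
    """Fraction of the transactions containing X that also contain Y."""
    support_x = support(x, transactions)
    if support_x == 0:
        return 0.0
    return support(set(x) | set(y), transactions) / support_x


# Toy transaction set D (hypothetical data).
D = [
    {"GLUCOSE=HIGH", "UREA=HIGH", "ICD9DISEASE=HYPERTENSIVE"},
    {"GLUCOSE=HIGH", "UREA=HIGH"},
    {"GLUCOSE=HIGH"},
    {"UREA=HIGH", "ICD9DISEASE=HYPERTENSIVE"},
]
X = {"GLUCOSE=HIGH", "UREA=HIGH"}
Y = {"ICD9DISEASE=HYPERTENSIVE"}
rule_support = support(X | Y, D)      # 1 of 4 transactions contains X ∪ Y
rule_confidence = confidence(X, Y, D)  # 1 of the 2 transactions containing X
```

With these toy data the rule X ⇒ Y has support 25% and confidence 50%; it would be reported only if both values reach minsup and minconf.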

ARM in HealthObs. To adapt ARM to the HealthObs environment we have made two key conventions (Potamias et al., 2001):

- Each transaction corresponds to a specific patient encounter (i.e., an identifiable visit of a patient to a healthcare unit of the federation). Each encounter/visit is uniquely identified by reference to three attributes recorded and present in the respective clinical information systems, namely patientid, information-system-id, and encounter-id (or visit-id). Patient data are recalled anonymously and retrieved in a secure manner by the respective IEHCR security and role-based authorisation services (Potamias et al., 2000).

- Each item is represented by the triplet <SyntheticFeature, AtomicFeature, AtomicFeatureValue>, the entries of which correspond to the respective instance elements present in the XML file to process. For example, SyntheticFeature may correspond to the BIOCHEMICAL_TEST element; AtomicFeature to an element that stands for a specific biochemical test, e.g., GLUCOSE; and AtomicFeatureValue to the value of the atomic feature, e.g., ‘68.0’ for glucose.

As already noted, parsing of the XML file is done with the aid of the common XMLdsq schema, which includes all the needed semantic homogenization information. Assume that during parsing the following nested structure is met:

<BIOCHEMISTRY>
    <BLOOD_SUGAR>68.0</BLOOD_SUGAR>
</BIOCHEMISTRY>

By consulting the created XMLdsq schema, this part is transformed and kept in the in-memory data structures in the following semantically homogenised form (note that the numeric value is transformed to its symbolic equivalent, i.e., 68.0 → NORMAL):

<BIOCHEMICAL_TEST>
    <GLUCOSE>NORMAL</GLUCOSE>
</BIOCHEMICAL_TEST>

Prefix-Tree. At the core of the ARM process is the identification of all frequent itemsets. This task is challenging because the search space is exponential in the number of items occurring in the database. For the clinical domain discussed in this paper the number of features used is 94, and the potential search space contains 2^94 (roughly 2 × 10^28) different itemsets. Generating and counting the support of all itemsets is therefore practically impossible, both in terms of time and space. Note also that a naive search strategy needs multiple scans of the database. Several solutions have therefore been proposed to perform a more directed search through the search space. A widely used approach needs just one scan of the database (here, of the data-enriched XML file) and keeps the search results in a special tree structure, the Prefix-Tree (PT) or trie (Amir et al., 1997; Brin et al., 1997; Bayardo 98; Borgelt and Kruse 2002).

In a PT, every k-itemset (i.e., an itemset of k items) has a node associated with it, as does its (k-1)-prefix (see Figure 6 for an example PT from the medical domain). The root represents the empty set, and so its counter is equal to the number of transactions in the database, since all transactions support the empty set. All the 1-itemsets are attached to the root node, and their branches are labelled by the item they represent. Every other k-itemset is attached to its (k-1)-prefix. Every node stores the last item in the itemset it represents, its support, and its branches. To provide fast selection of an edge while traversing a PT, a hash-table is associated with each node that has children. Entries in the hash buckets are records with a count, an item number and a pointer to the child hash-table. Every hash-table has a dead-pointer to store a linked list of dead items, which are the roots of dead branches. While the root table has to be fairly large, tables deeper in the tree can be much smaller. Since all candidates from one leaf are generated simultaneously, their supports are known, and we can use them to compute the size of the hash-table. At a certain iteration k, all candidate k-itemsets are stored at depth k in the PT. In order to find the candidate itemsets that are contained in a transaction T, we start at the root node. To process a transaction for a node of the PT, (i) we follow the branch corresponding to the first item in the transaction and process the remainder of the transaction recursively for that branch, and (ii) we discard the first item of the transaction and process it recursively for the node itself.
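The counting procedure of steps (i) and (ii) can be sketched in Python as follows. This is a simplified illustration, not the HealthObs code: a Python `dict` stands in for the per-node hash-table, the dead-pointer bookkeeping and table sizing are omitted, and the item numbers and transactions are invented.

```python
class Node:
    """One prefix-tree node: a support counter plus a table of children."""

    def __init__(self):
        self.count = 0
        self.children = {}  # item -> Node; the dict plays the hash-table role


def insert(root, itemset):
    """Store an itemset as the path of its sorted items; return its node."""
    node = root
    for item in sorted(itemset):
        node = node.children.setdefault(item, Node())
    return node


def count_candidates(node, items, k):
    """Increment every stored k-itemset contained in the sorted tuple `items`."""
    if k == 0:
        node.count += 1
        return
    if len(items) < k:
        return  # too few items left to reach depth k
    first, rest = items[0], items[1:]
    child = node.children.get(first)
    if child is not None:
        count_candidates(child, rest, k - 1)  # (i) follow the first item's branch
    count_candidates(node, rest, k)           # (ii) drop the first item, same node


root = Node()
candidates = [{1, 3}, {1, 5}, {3, 5}]  # candidate 2-itemsets (illustrative)
nodes = {frozenset(c): insert(root, c) for c in candidates}
for transaction in [{1, 3, 5}, {1, 5}, {3}]:
    root.count += 1  # every transaction supports the empty set
    count_candidates(root, tuple(sorted(transaction)), 2)
counts = {tuple(sorted(c)): n.count for c, n in nodes.items()}
```

After the loop, `root.count` equals the number of transactions and each candidate node holds the number of transactions that contain it, which is the quantity compared against minsup.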

Figure 6. The Prefix-tree data structure – an example from the clinical domain: (a) Edges represent itemset indices and nodes the occurrence counts of an itemset, (b) Pseudo-code for the Apriori-like generation of frequent itemsets.

Figure 6 demonstrates an example of the generation of frequent itemsets from a PT (taken from the medical case study detailed in section 4.1, below). In this example the user selects 5 features for a focused inquiry. The features refer to the following five items: 1: GLUCOSE=HIGH; 2: CHOLESTEROL=HIGH; 3: UREA=HIGH; 4: TRIGLYCERIDES=HIGH; and 5: ICD9DISEASE=HYPERTENSIVE.


Part (a) of Figure 6 shows the instantiated PT structure, which distinguishes between frequent and candidate sets (minsup is set to 0.01 = 1%). The pseudo-code for the creation of frequent and candidate sets is summarized in part (b) of Figure 6, where Lk denotes the frequent k-itemsets, Ck denotes the candidate k-itemsets, and L denotes the union of the frequent k-itemsets that pass the minsup threshold.
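Since the pseudo-code of Figure 6(b) is not reproduced here, the following Python sketch shows a generic Apriori-style level-wise generation using the same notation (Ck for candidates, Lk for frequent k-itemsets, L for their union). It counts supports by direct scanning rather than with a prefix-tree, and the toy transactions are invented, so it should be read as an illustration of the scheme rather than as the HealthObs procedure.

```python
from itertools import combinations


def apriori(transactions, minsup):
    """Return {frequent itemset: support} for all itemsets with support >= minsup."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def frequent(candidates):
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        return {c: k / n for c, k in counts.items() if k / n >= minsup}

    items = {i for t in transactions for i in t}
    level = frequent({frozenset([i]) for i in items})  # L1
    L = dict(level)
    k = 2
    while level:
        prev = set(level)  # L(k-1)
        # Join step: unions of two frequent (k-1)-itemsets that have k items.
        Ck = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        Ck = {c for c in Ck
              if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        level = frequent(Ck)  # Lk
        L.update(level)
        k += 1
    return L


toy = [{1, 3, 5}, {1, 3, 5}, {1, 2}, {3}, {1, 3}]  # hypothetical transactions
L = apriori(toy, minsup=0.4)
```

On the toy data, item 2 fails the threshold at the first level, so no candidate containing it is ever generated; this pruning is what keeps the search far below the full exponential space.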

Forming Association Rules. After its creation, the PT is processed in order to form the rules that pass the minconf threshold. As already noted, ARM operations in HealthObs offer the ability to specify obligatory and not-obligatory features. As an example of processing the PT and forming rules from the frequent itemsets, refer to Figure 6 (minsup = 1%; minconf is set to 0%, just for the demonstration). Assume that the user specified items 1 and 3 as obligatory in the IF part, item 5 as obligatory in the THEN part, and items 2 and 4 as not obligatory. Because item 1 is obligatory, the only sub-tree parsed and processed is the one that starts from the branch labelled ‘1’. This has the great advantage of reducing the time needed to process the PT. Table 1 presents the valid associations (formed in a depth-first parse of the PT), the valid association rules to form, and the reasons for the formation, or not, of the rules. Even though a number of different association rules may be formed, there is only one rule that is valid according to the user’s requirements: 1,3 ⇒ 5, i.e., IF GLUCOSE=HIGH & UREA=HIGH THEN ICD9DISEASE=HYPERTENSIVE, with support 1%.
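The effect of obligatory IF and THEN items on rule formation can be sketched as below. The sketch filters a table of itemset supports after the fact, whereas the text describes restricting the traversal of the PT itself; the function name and the support values marked as hypothetical are assumptions, while the supports 0.010 for {1,3,5} and 0.007 for {1,2,3,5} follow Table 1.

```python
from itertools import combinations


def form_rules(supports, must_if, must_then, minsup, minconf):
    """Form rules X => Y with must_if inside X and must_then inside Y."""
    must_if, must_then = frozenset(must_if), frozenset(must_then)
    rules = []
    for itemset, sup in supports.items():
        if sup < minsup or not (must_if | must_then) <= itemset:
            continue
        free = sorted(itemset - must_if - must_then)  # not-obligatory items
        for r in range(len(free) + 1):
            for extra in combinations(free, r):
                lhs = must_if | frozenset(extra)
                rhs = itemset - lhs
                if not lhs or not rhs or lhs not in supports:
                    continue
                conf = sup / supports[lhs]
                if conf >= minconf:
                    rules.append((lhs, rhs, sup, conf))
    return rules


supports = {
    frozenset({1}): 0.05, frozenset({3}): 0.04, frozenset({5}): 0.03,  # hypothetical
    frozenset({1, 3}): 0.02,                                           # hypothetical
    frozenset({1, 5}): 0.015, frozenset({3, 5}): 0.012,                # hypothetical
    frozenset({1, 3, 5}): 0.010,      # as in Table 1
    frozenset({1, 2, 3, 5}): 0.007,   # as in Table 1, below minsup
}
rules = form_rules(supports, must_if={1, 3}, must_then={5}, minsup=0.01, minconf=0.0)
```

With these inputs only the rule 1,3 ⇒ 5 is formed; the itemset {1,2,3,5} is skipped because its support is below minsup, and itemsets that lack item 3 never qualify.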

Table 1. Valid association rules that keep user’s IF-THEN requirements.

| Valid Associations | Support | Valid Association Rule | Indication |
|---|---|---|---|
| 1,2,3,4,5 | 0.010 (150/15000) | 1,3 ⇒ 5 | Formed: 1,3 in IF and 5 in THEN |
| | 0.007 (10/15000) | 1,2,3 ⇒ 5 | Not formed: support < minsup |
| | 0.007 (10/15000) | 1,3 ⇒ 2,5 | Not formed: support < minsup |
| | 0.000 (0/15000) | 1,3,4 ⇒ 5 | Not formed: support < minsup |
| | 0.000 (0/15000) | 1,3 ⇒ 4,5 | Not formed: support < minsup |
| | 0.000 (0/15000) | 1,2,3,4 ⇒ 5 | Not formed: support < minsup |
| 1,2,3,5 | 0.010 (150/15000) | 1,3 ⇒ 5 | Formed already |
| | 0.007 (10/15000) | 1,2,3 ⇒ 5 | Not formed: support < minsup |
| | 0.007 (10/15000) | 1,3 ⇒ 2,5 | Not formed: support < minsup |
| 1,2,4,5 | -- | No rule is formed | {3} ⊄ {1,2,4,5} |
| 1,2,5 | -- | No rule is formed | {3} ⊄ {1,2,5} |
| 1,3,4,5 | 0.010 (150/15000) | 1,3 ⇒ 5 | Formed already |
| | 0.000 (0/15000) | 1,3,4 ⇒ 5 | Not formed: support < minsup |
| | 0.000 (0/15000) | 1,3 ⇒ 4,5 | Not formed: support < minsup |
| 1,3,5 | 0.010 (150/15000) | 1,3 ⇒ 5 | Formed already |
| 1,4,5 | -- | No rule is formed | {3} ⊄ {1,4,5} |
| 1,5 | -- | No rule is formed | {3} ⊄ {1,5} |


3. Domain Editor: Domain Customization in HealthObs

HealthObs offers data mining operations over huge volumes of XML-formatted data that come from heterogeneous information sources. In this context, the need to customize the system and to incorporate different application domains into it arises as a major demand. Moreover, the system is intended for users with different levels of computer skills. With these observations in mind, we designed and implemented a flexible and friendly Domain Editor tool that can be called from within the HealthObs environment.

The Domain Editor bases its functionality on the operations offered by the CTRS service (as specified and presented in section 2.2). The tool’s interface is shown in Figure 7. From the ‘Edit’ menu the user selects ‘Intervals’ and loads the domain semantics file for a specific domain. The tool offers options to: (i) delete an entire feature, by deleting all of its reference ranges; (ii) delete a specific reference range of a feature (in the example of Figure 7, both reference ranges ‘SMOKINGID 1-1 YES’ and ‘SMOKINGID 3-3 STOP’ are deleted); (iii) add new features and/or new reference ranges for the various features (in the example of Figure 7, the user merges the ‘YES’ and ‘STOP’ values into the new nominal reference-range value ‘YES-STOP’, thus treating current and former smokers as one group); and (iv) compose entire new features, both conditioned and unconditioned, based on existing ones.
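The merge of the ‘YES’ and ‘STOP’ values can be illustrated with a small Python sketch that operates on reference-range entries of the form quoted above (‘SMOKINGID 1-1 YES’). The helper names are hypothetical, and the sketch relabels existing ranges instead of deleting and re-adding them as the editor does; it also assumes non-negative numeric bounds.

```python
def parse_entry(line):
    """Parse '<FEATURE> <lower>-<upper> <VALUE>' into a tuple."""
    name, bounds, label = line.split()
    lower, upper = (float(b) for b in bounds.split("-"))
    return name, lower, upper, label


def merge_labels(entries, feature, old_labels, new_label):
    """Relabel the given symbolic values of one feature with a single new value."""
    return [
        (name, lower, upper,
         new_label if name == feature and label in old_labels else label)
        for name, lower, upper, label in entries
    ]


def lookup(entries, feature, value):
    """Return the symbolic value whose range covers `value`, or None."""
    for name, lower, upper, label in entries:
        if name == feature and lower <= value <= upper:
            return label
    return None


entries = [parse_entry("SMOKINGID 1-1 YES"), parse_entry("SMOKINGID 3-3 STOP")]
merged = merge_labels(entries, "SMOKINGID", {"YES", "STOP"}, "YES-STOP")
```

After the merge, codes 1 and 3 both resolve to ‘YES-STOP’, so rules mined afterwards treat current and former smokers as a single group.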

Figure 7. The HealthObs Domain Editor - an example of defining new features’ reference-ranges and values.


4. HealthObs in Practice

Using HealthObs in practice entails the following steps:

1. Query formulation. For instance: “find all patient encounters from a specified population of health care centers with pre-specified values for clinical findings, laboratory test results and recorded diagnoses”.

2. Access. The federated clinical information systems of HYGEIAnet are accessed, using the respective IEHCR services.

3. XML file formation. Special filtering and formatting operations are activated on the output of step 2; the data-enriched and query-specific XML file is created, accompanied by its corresponding DTD grammar. Furthermore, a preliminary domain-semantics file is automatically created; this file may be edited directly or by using the Domain Editor.

4. Query formulation. The XML file is loaded into the system, and the user selects features to focus his/her inquiry and specifies the form of the association rules. This operation is visible to, and driven by, the user.

5. Analysis to perform. The user selects the type of analysis, for example ‘association rule discovery’. This operation is visible to, and driven by, the user.

6. CTRS and XML parsing. The CTRS service is automatically consulted; the homogenised intermediate XML domain-semantics and query-specific schema (XMLdsq) is created, and the XML file is parsed. These operations are performed automatically and are hidden from the user.

7. ARM. The prefix-tree is created and processed in order to form the valid association rules. These operations are performed automatically and are hidden from the user.

8. Visualization. The visualization interface of HealthObs (presented in section 4.1) is called and the results are shown to the user.

9. Store. The rules are inspected; the user can select the rules that seem most interesting for the analysis at hand and store them in a file. This file may be loaded in the future for comparison studies (e.g., rules discovered when a particular clinical information system was consulted vs. rules from other clinical information systems).

4.1. A Real-World Case Study: Relating Abnormal Biochemical-test Finding and Circulatory Diseases

The reliability and efficiency of the HealthObs system were tested on various real-world case studies utilizing HYGEIAnet’s federation of distributed clinical information systems. Here we present a case study that aims to ‘discover potentially interesting relations between abnormal biochemical-test findings and diseases of the circulatory system’. The whole analysis was carried out following the usage scenario presented above.


Dataset. About 30,000 patient encounter records, stored and maintained across three health care centers (located in the remote areas of Archanes, Spili, and Anogia on the island of Crete) for the period from January 2001 to December 2001, were retrieved. A total of 94 features are used to represent the domain. The resulting XML file was ~27 MB. The present study focuses on six BIOCHEMICAL_TEST features (GLUCOSE, CHOLESTEROL, HDLCHOLESTEROL, UREA, URICACID, TRIGLYCERIDES) together with the DIAGNOSIS feature ICD9DISEASE.

Analysis. The ICD9DISEASE feature was set as obligatory in the IF part of the rules, and all BIOCHEMICAL_TEST features as not-obligatory. Setting the minsup and minconf thresholds to 0.01 and 0.10, respectively, a total of 90 association rules were discovered. Furthermore, using the Domain Editor we eliminated the ‘NORMAL’ and ‘LOW’ reference ranges of the selected features, because we were interested in pathologic states of the patients and in ‘abnormal’ biochemical-test findings. We also eliminated all ICD9DISEASE entries except those for the ‘CIRCULATORY’ diseases, because we suspected that the selected biochemical tests are related to the diseases of the circulatory system.

Figure 8. Visualization of association rules in HealthObs.


Visualization. Figure 8 presents the visualization interface of HealthObs, where a part of these rules is shown. The first two columns show the rules’ support and confidence figures, respectively. The rules are ordered according to their support level and grouped into support-level categories indicated by different colours (or grey levels): dark-shaded cells indicate higher, and light-shaded cells lower, support group levels. In Figure 8, the support cell for the first rule is dark-shaded, and the cells for the next two rules are light-shaded; all other rules exhibit a 0.005 support level, shown with ‘0’, and are not shaded. In the next columns the symbolic values for each of the selected features are shown, following a two-colour scheme: ‘black’ foreground for values in the IF part of the rule, and ‘green’ (a lighter grey level in the black-and-white presentation of Figure 8) for values in the THEN part. So, the first association rule states (also shown in the figure):

IF DISEASE=HYPERTENSIVE THEN UREA=HIGH (Support: 2%, Confidence: 33%)

Findings. Inspecting the whole list of discovered association rules, we identified just the ones with all, or part, of the BIOCHEMICAL_TEST features in the THEN part (i.e., cells with less-shaded values). In Table 2, we summarise the confidence levels of the association rules that relate different HIGH-valued BIOCHEMICAL_TEST feature-item combinations with CIRCULATORY diseases.

Table 2. Confidence levels of the rules that relate different HIGH-valued BIOCHEMICAL_TEST feature combinations with CIRCULATORY diseases. Looking at each column vertically, we may identify strong associations (higher than the average) between the abnormal state (i.e., HIGH-valued) of a biochemical-test finding and the respective disease.

| | HYP | IHEART | OHEART | ARTER |
|---|---|---|---|---|
| G | 21 | 50 | 0 | 0 |
| C | 17 | 0 | 60 | 16 |
| U | 33 | 50 | 60 | 25 |
| T | 4 | 0 | 80 | 0 |
| GC | 6 | 0 | 0 | 0 |
| GU | 7 | 50 | 0 | 0 |
| GT | 2 | 0 | 0 | 0 |
| CU | 13 | 0 | 40 | 8 |
| CT | 2 | 0 | 60 | 0 |
| UT | 3 | 0 | 60 | 0 |
| GCU | 4 | 0 | 0 | 0 |
| GCT | 1 | 0 | 0 | 0 |
| GUT | 1 | 0 | 0 | 0 |
| CUT | 1 | 0 | 40 | 0 |
| GCUT | 1 | 0 | 0 | 0 |

G: Glucose, C: Cholesterol, U: Urea, T: Triglycerides. HYP: Hypertensive, IHEART: Ischemic-HEART, OHEART: Other HEART, ARTER: ARTERIES.


Looking at the results in Table 2, various interesting findings can be stated. First of all, only the hypertensive (HYP), ischemic-heart (IHEART), other-heart (OHEART) and arteries (ARTER) diseases were associated with high values for the selected biochemical tests. Interestingly, no association was found with cerebrovascular disease. Furthermore, only high levels of glucose (G), cholesterol (C), urea (U), and triglycerides (T) were associated with these diseases. The other two selected features, HDL-cholesterol and uric acid, were not found to be associated with any of the circulatory diseases.

The confidence figures in Table 2 can be interpreted as an indication of the grade to which a particular abnormal biochemical-test finding (i.e., one exhibiting a HIGH value), or a combination of such findings, relates to particular circulatory diseases. For example, an abnormal glucose state relates to ischemic-heart disease with a grade of ‘50’ (second row, third cell in Table 2). A visual representation of the grades by which each abnormal biochemical finding relates to circulatory diseases is shown in Figure 9.
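The reading of Table 2 column by column can be reproduced with a short Python sketch. The figures are those of Table 2; interpreting ‘the average’ as the plain mean of each column over all fifteen rows is an assumption, as are the function names.

```python
DISEASES = ("HYP", "IHEART", "OHEART", "ARTER")

# Confidence levels (%) from Table 2, one row per HIGH-valued test combination.
TABLE2 = {
    "G": (21, 50, 0, 0), "C": (17, 0, 60, 16), "U": (33, 50, 60, 25),
    "T": (4, 0, 80, 0), "GC": (6, 0, 0, 0), "GU": (7, 50, 0, 0),
    "GT": (2, 0, 0, 0), "CU": (13, 0, 40, 8), "CT": (2, 0, 60, 0),
    "UT": (3, 0, 60, 0), "GCU": (4, 0, 0, 0), "GCT": (1, 0, 0, 0),
    "GUT": (1, 0, 0, 0), "CUT": (1, 0, 40, 0), "GCUT": (1, 0, 0, 0),
}


def column_means(table):
    """Mean confidence per disease column (assumed meaning of 'the average')."""
    n = len(table)
    return tuple(sum(row[j] for row in table.values()) / n
                 for j in range(len(DISEASES)))


def above_average(table):
    """Per disease, the test combinations whose confidence exceeds the column mean."""
    means = column_means(table)
    return {d: [k for k, row in table.items() if row[j] > means[j]]
            for j, d in enumerate(DISEASES)}


strong = above_average(TABLE2)
```

Under this assumption, for example, the combinations that stand out for ischemic-heart disease are G, U and GU, each with a confidence of 50 against a column mean of 10.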

The case study above illustrates that HealthObs can support real-world data mining studies over federated clinical data, and gives an indication of the reliability of the system.

Figure 9. Abnormal biochemical-test findings and their relations to specific diseases (refer to Table 2 for a list of them).

Efficiency. For the present case study, parsing of the respective XML file and construction of the corresponding prefix-tree were performed in less than 1 min; processing of the prefix-tree and formation of the discovered association rules took about 2 min. Experiments were conducted on a regular workstation. For an XML file of ~27 MB these timings are more than satisfactory and indicate the efficiency of the specific HealthObs implementation.

[Figure 9 chart. Category axis: G, C, U, T, GC, GU, GT, CU, CT, UT, GCU, GCT, GUT, CUT, GCUT; value axis: 0 to 80; series: HYP, IHEART, OHEART, ARTER.]


5. Related Work

A number of different XML-mining approaches have recently appeared in the literature (Braga et al., 2002; Wan et al., 2003; Edmonds, 2002). Some of them present only theoretical and foundational approaches, while others present system implementations as well.

Before comparing HealthObs with related work, we have to define the exact XML-mining task that our approach addresses. Figure 10 gives a general classification of the different XML-mining tasks. Clearly, HealthObs falls into the category of XML-CONTENT mining. Consequently, we compare HealthObs with related approaches that address the same task. Of course, ‘XML-STRUCTURE mining’ tasks could be used to find a common, standardized XML schema, which could then be utilized in the context of HealthObs mining operations. In the sequel we review similar XML-mining approaches, discussing the advantages and disadvantages of each and comparing it to the HealthObs system.

Figure 10. The different XML-mining tasks: HealthObs falls into the ‘XML-CONTENT mining’ category, with a special focus on the discovery of association rules between XML items.

XMINE (Braga et al., 2002). XMINE has been devised as an operator that extends the XQuery language (W3C/XQuery 01) to support ARM operations. It follows a three-step process: preprocessing, which creates an intermediate relational table; mining, which applies the ‘MINE RULE’ operator, an extension of SQL for ARM operations (Meo et al., 1998), to the created relational table; and post-processing, in which the association rules extracted from the relational data are mapped into an XML representation. Because of the intermediate relational table, XMINE does not work directly with XML-formatted data, as HealthObs does.


Furthermore, the tabulated representation of data suffers from the problem of ‘data sparseness’. When a hierarchical structure is transformed into a flat tabulated form, a large number of row-column entries may be missing (null). This may lead to unreliable results, at least when no clear and rational strategy for the treatment of missing values is defined. XMINE is built in Java on top of the DOM (W3C/DOM 03) interpreter. The DOM infrastructure presupposes that the entire XML file is loaded into memory, which is demanding in terms of space. In HealthObs the XML file is parsed without being loaded into memory; only the economical prefix-tree structure is stored. Moreover, XMINE does not support customization of domain semantics, as HealthObs does with the aid of the CTRS service.
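The difference between loading a whole document and parsing it incrementally can be sketched as follows. This is only an illustration under stated assumptions: it uses Python's standard `xml.etree.ElementTree.iterparse` rather than the HealthObs parser, it keeps a plain counter of itemsets in place of the HealthObs prefix-tree, and the element names (`TRANSACTION`, `G`, `C`, `DIAG`) and the function `count_itemsets` are hypothetical.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter
from itertools import combinations

# Hypothetical sample document; tag names are assumptions for illustration.
SAMPLE = b"""<TRANSACTIONS>
  <TRANSACTION><G>HIGH</G><C>HIGH</C><DIAG>HYP</DIAG></TRANSACTION>
  <TRANSACTION><G>HIGH</G><DIAG>IHEART</DIAG></TRANSACTION>
  <TRANSACTION><G>HIGH</G><C>HIGH</C><DIAG>HYP</DIAG></TRANSACTION>
</TRANSACTIONS>"""


def count_itemsets(source, record_tag="TRANSACTION", max_size=2):
    """Stream the XML and count itemsets of up to max_size items per record."""
    counts = Counter()
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != record_tag:
            continue  # wait until a whole record has been read
        items = sorted(f"{child.tag}={(child.text or '').strip()}" for child in elem)
        for size in range(1, max_size + 1):
            for itemset in combinations(items, size):
                counts[itemset] += 1
        elem.clear()  # drop the children of the record just processed
    return counts


counts = count_itemsets(io.BytesIO(SAMPLE))
```

Only the compact summary (here the counter) outlives each record, which is the property the paragraph above attributes to storing the prefix-tree instead of the full document tree.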

XQuery/Apriori (Wan et al., 2003). This approach implements the well-known Apriori ARM algorithm within the XQuery context. It operates directly on the XML-formatted data items, without creating an intermediate tabular representation. As noted by the developers of XQuery/Apriori, the algorithm is implemented for documents that follow a standard structure, where each item is a leaf node. If the document to be mined has an irregular structure, an extra query is needed to restructure the input document. This may increase the time and space needed to parse the input XML file. In HealthObs, parsing and mining of irregularly structured XML documents is straightforward. For example, the irregular XML structure in Figure 11 could be retrieved, even if ‘FEATURE-2’ in the second TRANSACTION is not a leaf node and does not occur in the first TRANSACTION. As with XMINE, XQuery/Apriori does not support customization of domain semantics.

<TRANSACTIONS>
  <TRANSACTION>
    <FEATURE-1>
      <NAME-1>Value-11</NAME-1>
    </FEATURE-1>
  </TRANSACTION>
  <TRANSACTION>
    <FEATURE-1>
      <NAME-1>Value-21</NAME-1>
    </FEATURE-1>
    <FEATURE-2>Value-22</FEATURE-2>
  </TRANSACTION>
</TRANSACTIONS>

Figure 11. An irregular XML structure.
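One way to handle a structure such as that of Figure 11, where values sit at different depths and a feature occurs in only some transactions, is to treat every value-bearing element as an item identified by its path. The sketch below is an assumption about how this could be done with Python's standard library; it is not the HealthObs implementation, and the helper name `leaf_items` is hypothetical.

```python
import xml.etree.ElementTree as ET

# The structure of Figure 11, reproduced as a string.
FIGURE_11 = """<TRANSACTIONS>
  <TRANSACTION>
    <FEATURE-1><NAME-1>Value-11</NAME-1></FEATURE-1>
  </TRANSACTION>
  <TRANSACTION>
    <FEATURE-1><NAME-1>Value-21</NAME-1></FEATURE-1>
    <FEATURE-2>Value-22</FEATURE-2>
  </TRANSACTION>
</TRANSACTIONS>"""


def leaf_items(elem, prefix=""):
    """Yield (path, text) for every childless element below elem, at any depth."""
    for child in elem:
        path = f"{prefix}/{child.tag}" if prefix else child.tag
        if len(child) == 0:
            yield path, (child.text or "").strip()
        else:
            yield from leaf_items(child, path)


root = ET.fromstring(FIGURE_11)
transactions = [list(leaf_items(t)) for t in root.findall("TRANSACTION")]
```

Each transaction becomes a list of (path, value) pairs of whatever length the record supports, so no restructuring query and no null-padded table are needed before mining.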

XML/DecisionRuleMining (Edmonds 2002). This is another published approach to XML-content mining; it implements the induction of decision trees (and decision rules) from XML-formatted data. The approach suffers from most of the problems referred to above. Its major drawback is the creation of a tabulated representation of the data, actually a vector of all the relevant learning items, on which decision-tree induction is then applied. As noted in the original publication, the implementation is arranged to ignore learning vectors in which either of the data elements considered is empty. Such an approach, especially for decision-tree induction algorithms, may produce unreliable results.

6. Conclusions

The paper has briefly presented the HealthObs system. HealthObs has proved effective in supporting the discovery of interesting medical associations from data originating from distributed and heterogeneous sources. HealthObs represents an interdisciplinary platform, which brings together distinct software components, methods and techniques. Specifically: (a) semantic homogenization of heterogeneous information resources, via methods, operations and tools for the customization of domain semantics and ontology (the implementation is flexible enough to accommodate domains other than the clinical one); (b) ARM operations adapted to work on top of XML-formatted data items; and (c) flexible query-formulation and visualization operations.

The HealthObs approach is generic and can be readily adapted to the specifics of a health care information system or to a domain of inquiry. The platform includes two layers, middleware and application, and adaptation may be limited to application-layer operations.

HealthObs has also been used experimentally to support Internet-based epidemiology (Potamias et al., 2001). Querying and mining are driven towards the investigation of population health dynamics. During an epidemiological investigation, the query presented to the system should carry with it the health indicator, or indicators, of interest. In addition, the discovery of associations enhances the potential for evidence-based medicine, since associations (once validated) may establish a rationale and baseline supporting medical decision making.

In its current form HealthObs incorporates clustering over XML-formatted data as well (not presented in this paper). Looking ahead we plan to enrich HealthObs mining facilities by incorporating other data mining approaches, known to be effective in the discovery of interesting relationships – for instance, decision tree induction, Bayesian learning, etc. The Web-based porting of HealthObs is also included in our development plans.

Acknowledgements

The work presented in this paper was partially supported by two European Union (EU) funded projects related to healthcare telematics and data mining, namely: InterCare, HC 4011 and IRAIA, IST: 10602. Authors remain responsible for work presented herein. Results do not necessarily correspond to official project positions or, to EU policy.


References

Agrawal R., Srikant R., “Fast Algorithms for Mining Association Rules”, 20th Int’l Conf. on Very Large Data Bases, 1994, Santiago, Chile, p. 487-499.

Amir A., Feldman R., Kashi R., “A new and versatile method for association generation”, Information Systems, vol. 2, 1997, p. 333–347.

Baldonado M., Cousins S., “Addressing Heterogeneity in the Networked Information Environment”, New Review of Information Networking, vol. 2, 1996, p. 83-102. [URL: http://citeseer.nj.nec.com/27530.html; accessed September 2003].

Bayardo, R.J. Jr. “Efficiently mining long patterns from databases”, In L.M. Haas and A. Tiwary (Eds) Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, SIGMOD Record, vol. 27 no. 2, 1998, p. 85–93.

Borgelt C., Kruse R. “Induction of association rules: Apriori implementation”, In W. Hardle and B. Ronz (Eds) Proceedings of the 15th Conference on Computational Statistics, 2002, p. 395–400 [URL: http://fuzzy.cs.uni-magdeburg.de/~borgelt/software.html].

Braga D., Campi A., Klemettinen M., Lanzi P.L. “Mining association rules from XML data”, In Proceedings of the 4th International Conference on Data Warehousing and Knowledge Discovery (DaWaK 2002), September 4-6, 2002, Aix-en-Provence, p. 21-30.

Brin S., Motwani R., Ullman J.D., Tsur S. “Dynamic itemset counting and implication rules for market basket data”, In Proceedings of the 1997 ACM SIGMOD International Conference on Management of Data, SIGMOD Record, vol. 26 no. 2, 1997, p. 255–264.

COAS. Clinical Observations Access Service, Final Submission, OMG Document, 1999. [URL: http://cgi.omg.org/cgi-bin/doc?corbamed/99-03-25].

Edmonds A.N. On data mining Tree structured data represented in XML, Scientio, UK, December 2002. [URL: http://www.scientio.com/resources/On%20data%20mining%20Tree%20structure%20data%20in%20XML.pdf; accessed January 2004].

Grimson J., “Delivering the electronic healthcare record for the 21st century”, International Journal of Medical Informatics, vol. 64 no. 2-3, 2001, p. 111 – 127.

HYGEIAnet. The Integrated Health Care Network of Crete (HYGEIAnet) [URL: http://www.hygeianet.gr; accessed September 2003].

Katehakis D.G., Sfakianakis S., Tsiknakis M., Orphanoudakis S.C., “An Infrastructure for Integrated Electronic Health Record Services: The Role of XML (Extensible Markup Language)”, Journal of Medical Internet Research, vol. 3 no. 1, 2001. [URL: http://www.jmir.org/2001/1/e7/; accessed September 2003].

Kilman D., Forslund, D., “An International Collaboratory Based on Virtual Patient Records”, CACM, vol. 40 no. 8, 1997, p. 111-117.

Mannila H., Toivonen H., Verkamo A.I., “Efficient algorithms for discovering association rules”, KDD-94: AAAI Workshop on Knowledge Discovery in Databases, 1994, Seattle, Washington, p. 181-192.

Meo R., Psaila G., Ceri, S. “An extension to SQL for mining association rules”, Data Mining and Knowledge Discovery, vol. 2 no. 2, 1998, p. 195-224.


Mueller A. Fast Sequential and Parallel Algorithms for Association Rule Mining: A Comparison, Technical Report CS-TR-3515, College Park, MD, 1995. [URL: http://citeseer.ist.psu.edu/mueller95fast.html; accessed June 2003].

NCHS. National Center of Health Statistics, International Coding of Diseases, [URL: http://www.cdc.gov/nchs/about/otheract/icd9/abticd9.htm; accessed September 2003].

Potamias G., Moustakis, V., “Knowledge Discovery from Distributed Clinical Data Sources: The Era for Internet-Based Epidemiology”, In 23rd Annual International Conference of the IEEE Engineering in Medicine and Biology Society, 2001, Istanbul, Turkey.

Potamias G., Tsiknakis M., Katehakis D.G., Karabela E., Moustakis V., Orphanoudakis S.C., “Role-Based Access to Patients Clinical Data: The InterCare Approach in the Region of Crete”, In A. Hasman, B. Blobel, J. Dudeck, R. Engelbrecht, G. Gell, H.-U. Prokosch (Eds), Medical Infobahn for Europe, Proceedings of MIE 2000 and GMDS 2000, Hannover, Germany, IOS Press, p. 1074-1079.

RDF/XML. “RDF/XML Syntax Specification (Revised)”, W3C Working Draft 23. [URL: http://www.w3.org/TR/rdf-syntax-grammar/; accessed January 2003].

Sciore E., Siegel M., Rosenthal A., “Using semantic values to facilitate interoperability among heterogeneous information systems”, ACM Transactions on Database Systems, vol. 19 no. 2, 1994, p. 254-290.

Tsiknakis M., Katehakis D.G., Orphanoudakis, S.C. “Information Infrastructure for an Integrated Healthcare Services Network”, Information Technology Applications in Biomedicine (ITAB-ITIS 2000), IEEE EMBS 3rd International Conference, 2000, Arlington, Virginia, USA.

Wan J.W.W., Dobbie G. Extracting Rules from XML Documents using XQuery. In Workshop On Web Information And Data Management, Proceedings of the fifth ACM international workshop on Web information and data management, New Orleans, Louisiana, USA, 2003, p. 94 – 97.

WICC-ICPC. Wonca International Classification Committee - ICPC: International Classification for Primary Care, [URL: http://www.ulb.ac.be/esp/wicc/; accessed September 2003].

W3C/DOM. Document Object Model, Level 1, Specification, Version 1.0, W3C Recommendation 1 October, 1998 [URL: http://www.w3.org/TR/REC-DOM-Level-1/; accessed January 2003].

W3C/XML Path. Version 1.0 (W3C Recommendation). http://www.w3c.org/tr/xpath/, Nov. 1999.

W3C/XQuery. XQuery 1.0: An XML Query Language (W3C Working Draft) [URL: http://www.w3.org/TR/2001/WD-xquery-20011220; accessed Dec. 2001].