
Domain Specific Multi-stage Query Language for Medical Document Repositories

Aastha Madaan
Database Systems Laboratory
University of Aizu, Aizu Wakamatsu
Fukushima, Japan 965-8580
d8131102@u-aizu.ac.jp

Supervised by: Subhash Bhalla
Database Systems Laboratory
University of Aizu, Aizu Wakamatsu
Fukushima, Japan 965-8580
bhalla@u-aizu.ac.jp

ABSTRACT

A vast amount of medical information is increasingly available on the Web. As a result, seeking medical information through queries is gaining importance in the medical domain. Existing keyword-based search engines such as Google and Yahoo fail to meet the needs of health-care workers (who are well-versed in the domain knowledge required for querying); using these engines, they often face results that are irrelevant and not useful for their tasks.

In this paper, we present the need and the challenges for a user-level, domain-specific query language for the specialized document repositories of the medical domain. This topic has not been sufficiently addressed by existing approaches, including SQL-like query languages, general-purpose keyword-based search engines, and document-level indexing-based search. We aim to bridge the gap between the information needs of skilled and semi-skilled domain users and the query capability provided by the query language. Overcoming this challenge can facilitate effective use of the large volume of information on the Web (and in electronic health record (EHR) repositories).

1. INTRODUCTION

The medical domain is complex. Therefore, limited access to target documents through the indexing of a standard search engine is not sufficient. Medical information includes both patient-specific information (EHRs) and knowledge-based information (scientific papers and other literature) [7]. Medical knowledge (terminologies and concepts) has evolved over tens of years. It is available on the Web through Web document repositories (such as MedlinePlus [15]), popular medical literature publications (PubMed [17], Medline [14]), other primary and secondary resources, and EHRs [5]. Querying these resources is required by secondary applications such as evidence-based medicine and the secondary use of EHRs to improve the quality of care [7]. Medical information is utilized by a variety of end-users with complex requirements. Practitioners, specialists and researchers are well-versed in medical knowledge and terminologies. These users have precise queries and expect complete results within time limits (almost real-time). According to the Khresmoi survey [6], end-user requirements vary on the basis of "level of specialty". Moreover, the trustworthiness of information and the authentication of a resource are of major concern for the users.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Articles from this volume were invited to present their results at The 39th International Conference on Very Large Data Bases, August 26th - 30th 2013, Riva del Garda, Trento, Italy.
Proceedings of the VLDB Endowment, Vol. 6, No. 12
Copyright 2013 VLDB Endowment 2150-8097/13/10... $ 10.00.

The widespread use of the WWW has given rise to a range of simple query processors, the search engines. These query a database of semi-structured data (the HTML pages). For example, one can use a search engine to find pages containing the word "villain". However, it is difficult to obtain only the pages in which "villain" appears in the context of a character in a "wild west movie" [4]. In the healthcare domain, end-users vary in their background and experience and have variable needs. These users interact in various contexts; for example, a clinician interacts with patients during clinical care. He (or she) may need to query the medical literature and other document repositories to assess the plan for treatment or patient diagnosis. For such queries, the users need a query language and a schema. The goal of this study is to assist domain experts who are equipped with domain knowledge, but are not well-versed in a query language such as SQL or XQuery, to query Web document repositories. This thesis illustrates the need for in-depth (and granular) querying of medical information on the Web. For example, a physician may have an exploratory evidence-based query (footnote 1), "cases where Helicobacter pylori bacteria cause peptic ulcer", or a hypothesis-directed query (footnote 2), "treatment in case of high fever and dizziness". Conventional search (Yahoo, Google or a localized search menu) returns all the Web documents that match any of the keywords, ranked by a set of criteria defined by the search engine. This may return a large number of documents. The physician has to go over each of the documents to find the one that (exactly) matches his or her expectations. This task is time-consuming and may be abandoned by the physician. Both types of queries may involve multiple filtering conditions (attributes), which make the query complex. To answer such end-user queries, there is a need to facilitate DB-style queries, where a user can query "Helicobacter pylori bacteria" in the context of "symptoms" and "peptic ulcer" in the context of "causes", and can append these filtering conditions (attributes) in multiple interactive stages to receive results (in this case, the name of the disease).

Footnote 1: The evidence-based queries are raised during patient care, where clinicians wish to query the knowledge archives to determine the relevance of signs and symptoms to the potential existence of one or more medical disorders [3].

Footnote 2: The hypothesis-directed queries represent the non-diagnostic intent of seeking information about conditions, including details on potential hypotheses such as treatments, cures and outcomes [3].

Several recent studies have attempted to address the information needs of end-users within various domains. These include work on granular-level domain-specific search, estimating the granularity of information in a document [19], estimating the difficulty of a document, the quality of documents, document summarization, and the use of terminology resources for query refinement [7]. Cross-lingual search is of importance for end-users at all levels [7]. However, several challenges remain for efficient health-care delivery.

Figure 1: A snippet of the Web document "Heart Attack" from the MedlinePlus medical encyclopedia document repository.

In this study, we consider these challenges and outline the ways in which they can be addressed. Here, we focus on querying the medical information resources that form the expert user's personal and external knowledge base. A multi-stage query language is proposed which provides a user-level query calculator to formulate a query using domain concepts. Overcoming these challenges will simplify the querying tasks for expert and novice domain users and enable them to get the desired results.

2. STATE-OF-ART

In this section we describe the state of the art with respect to the complexity of medical information. We consider the various end-users and the recent approaches for querying data that take user preferences into account to facilitate information retrieval and querying from various domain-specific resources. Further, the need for returning granular results to the users for their queries on Web documents is emphasized.

2.1 Query over Data Repositories

Recently, form-based interfaces have been proposed to allow naive users to query a database. These work well with the query logic of the end-users. For complex queries, the number of forms might increase. In the approach proposed by Jayapandian [10], where the forms are clustered based on the query needs of the users, the forms need to be modified to address queries which cannot be performed otherwise.

2.1.1 Granular Querying for Web Document Repositories

Existing (keyword) searches treat Web documents as bags of words and do not exploit the rich structural information available for on-line medical information [13]. Varadarajan proposes returning only the relevant segments of Web documents as the results of a user search, instead of complete Web documents, for efficient querying and searching [22]. The proposed query language aims to present to the users only those segments of the Web documents which satisfy the complete context of the user query. This will make the query results granular and the querying tasks time-efficient for medical domain experts.

2.2 Complex Medical Information

The major barriers to an efficient response when querying medical information include perceived lack of quality, relevance of the content, inaccessibility, and trustworthiness of resources [6]. Inadequate quality assessment of on-line medical information may lead to misleading information. Therefore, the validity and reliability of Internet resources are often questioned for medical information [6]. Moreover, there are no criteria for when a search or query should be terminated. Clinicians often find that the results returned to them do not fit their requirements [6]. Querying medical information may vary depending on the role of the clinician and the way information is presented. While computer-based knowledge resources are useful in addressing these needs, ineffective search skills and lack of time are common barriers to information seeking [8]. The major challenges that arise due to the complexity of medical information are:

1. Identification of suitable query methods, and the results searched.

2. Identification of resources to be queried for the results (knowledge of schema).

3. Identification of results presented to the end-users.

Figure 1 represents a snippet of a structured medical encyclopedia document from the MedlinePlus document repository [15]. It contains the "disease name" as the topic and the associated concepts, such as "causes" and "symptoms", as sub-topics. There may be multiple sub-topics within a given topic, and each document may contain distinct sub-topics depending upon the topic it represents. Each of the topic and sub-topic labels can be presented to the end-users as a query-able attribute. These attributes can significantly ease the task of querying for health-care experts.

2.3 The End-Users

The end-users in the medical domain can be classified on the basis of their knowledge and expertise in the field. The first set of users can be termed the novice users (or the casual users). These users include patients and their relatives. They are not well-versed in medical terminologies and concepts and may use keywords which are not (closely) related to the actual terms. They possess a low ability to pose reasonable queries. These users may or may not have computer expertise. Hence, they utilize the results that are available from general-purpose search engines.

On the other hand, doctors and clinicians are the medical domain-knowledge experts and frequently need to access the knowledge archives (Web document repositories). These users must use reliable sources and require the complete context of the results. They may or may not have computer expertise, but they can pose exact and precise queries. Unlike general Web searchers, the domain experts in the medical domain have in-depth knowledge of the particular information to be located, the details of the resource type and its reliability. They make use of the complex structural information to extract the relevant semantic information.

Therefore, a query language is developed to support the needs of the specialists (domain experts). It can subsequently reduce the learning curve of the novice users. There is a knowledge gap about what exactly is required by the user, which is often resolved by an intermediary agent [11]. It can be eliminated by allowing the user to construct queries interactively on a user-level schema (with query-able attributes).

2.4 Existing Solutions and Issues

Generic search engines may be time-efficient and free, but they often lack specificity in providing relevant and high-quality information for professional use in medical care [6]. Physicians who use a keyword search approach to find, say, a problem from its conditions may find the most relevant link only on the second or third page, whereas they wish to find the relevant results for their queries quickly and efficiently. As a result of these difficulties, clinicians tend to believe that the answers to their (complex, specific and specialized) queries do not exist or exist in a fuzzy state [6].

Thus, the results returned by conventional search engines for medical information suffer in quality for the following reasons:

1. A query with only 1 or 2 terms may not contain enough terms for the search engine to retrieve the information desired by the user.

2. In the document repository of the search engine, there might exist thousands of articles matching the query. This makes it impossible to locate the desired information by simply browsing through the contents of the returned results.

3. Conventional search engines focus on generic information search; domain-specific results are usually not taken into consideration during the search. Thus, a simple keyword-based search does not produce relevant search results in specific domains such as the medical domain.

3. PROBLEM STATEMENT

General practitioners need a query tool which gives 5-10 results per page with a simple layout. These users query the health Web document repositories. Although physicians are time-constrained, they are prepared to devote time to complex queries. The immediate need at the point of care is critical for physicians. Information on drugs, disease descriptions, treatment and clinical trials are the dominant needs [6]. We describe two key features of the proposed approach to address this problem statement.

[Figure 2 content: the user-level schema (labelled "Clinician's Knowledge Base = Web document Repository Layout") queried at four levels of complexity.
Simple: Cases = ? where Causes = "Fever"
Medium: Causes = "Fever", Symptoms = "Vomiting", Medication = ?
Complex: Cases = ? where Symptoms = "Fever" AND Symptoms = "High Blood Pressure"
Recursive: Causes = "Vomiting", Symptoms = "Fever", Remedies = ?, followed by more causes]

Figure 2: Different levels of queries performed by a clinician during health-care delivery.

The proposed query language attempts to overcome the shortcomings of existing approaches through the following two features.

1. Implement the query methodology followed by a medical domain expert to seek diagnostic or hypothesis-directed information, and

2. Present the relevant areas (granular results) of the Web document that match the user's query criteria.

3.1 Multi-Stage Visual Query Language

Querying in the medical domain has a natural multi-stage and transitional progression from symptoms to causes, remedy (and so on) during patient diagnosis. Medical domain users wish to express these complex semantics of patient care and receive precise and complete answers. Database queries can fulfill this requirement more easily than keyword-based Web search. Hence, in this thesis we aim to model the multi-stage process of patient care as a query language over a database of Web documents. At each stage, the user can choose an attribute (filter condition) that he or she wants to append to the query. The system gives users the flexibility to choose the order in which these attributes are accessed and to add conditions on them. Figure 2 describes the various levels of query complexity that occur when a domain expert queries his or her knowledge base during patient diagnosis. The expert may perform a simple query involving a single medical concept (symptoms or causes), say, "Find diseases where fever is a cause". For the query "Find medication when a patient has vomiting due to fever", the user queries two concepts, symptoms and causes. Such queries can be termed "medium-level" complex queries. Further, a clinician may have a more complex query. He or she may query the knowledge base after observing fever as a symptom but, on further observing the patient, may discover "high blood pressure" as another symptom. In such a case, the clinician may wish to query the symptoms (concept) incrementally with two values. In more complex scenarios, a user may recursively query multiple concepts with different values. For example, in Figure 2, for the query "find remedies when a patient is having fever because of vomiting", after reading the remedies a clinician may further wish to query the causes with the value "over-eating".
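The staged refinement described above can be illustrated with a minimal sketch. It is not the proposed query language itself: the class `MultiStageQuery`, its methods `where` and `run`, and the toy `REPOSITORY` (whose contents are invented placeholders, not medical facts) are all assumptions made for illustration. Each call to `where` adds one stage, that is, one condition on one concept.

```python
from dataclasses import dataclass, field

# Toy repository (hypothetical): each document maps a concept (attribute)
# to the text of its segment. The contents are invented placeholders.
REPOSITORY = [
    {"title": "Condition A", "causes": "fever, infection",
     "symptoms": "vomiting, headache", "medication": "drug X"},
    {"title": "Condition B", "causes": "over-eating",
     "symptoms": "vomiting", "medication": "drug Y"},
    {"title": "Condition C", "causes": "fever",
     "symptoms": "high blood pressure, fever", "medication": "drug Z"},
]


@dataclass
class MultiStageQuery:
    """A query built up one stage (concept, value) at a time."""
    conditions: list = field(default_factory=list)

    def where(self, concept, value):
        # Return a new query with one more stage; earlier stages are kept.
        return MultiStageQuery(self.conditions + [(concept.lower(), value.lower())])

    def run(self, repository, select="title"):
        # A document matches when every value occurs inside the segment
        # of the concept it was attached to (not anywhere in the document).
        results = []
        for doc in repository:
            if all(value in doc.get(concept, "").lower()
                   for concept, value in self.conditions):
                if select in doc:
                    results.append(doc[select])
        return results


# Simple query: one concept.
simple = MultiStageQuery().where("causes", "fever")
print(simple.run(REPOSITORY))                        # ['Condition A', 'Condition C']

# Medium query: a second stage on another concept, asking for medication.
medium = simple.where("symptoms", "vomiting")
print(medium.run(REPOSITORY, select="medication"))   # ['drug X']

# Complex query: the same concept queried incrementally with two values.
complex_query = (MultiStageQuery()
                 .where("symptoms", "fever")
                 .where("symptoms", "high blood pressure"))
print(complex_query.run(REPOSITORY))                 # ['Condition C']
```

Because every stage returns a new query object, a user can execute the query after any stage and then continue refining it, which mirrors the execute-or-refine choice described in the text.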


MedlinePlus_Medical_Encyclopedia_Articles
  Article 1: title_id, title_name, title_url, causes, symptoms, Exams_and_Results, Treatment, ...
  Article 2: title_id, title_name, title_url, causes, symptoms, Considerations, Home_Care, ...
  ...
  Article n: title_id, title_name, title_url, causes, Home_Care, When_to_contact_a_medical_professional, What_to_expect_on_an_office_visit, ...

Figure 3: Structure of the user-level schema used for the proposed multi-stage query language.

3.2 User-Level Schema

The concepts and terminologies in the medical domain have evolved from the domain knowledge and experience of the domain experts (clinicians and researchers) over several years. This information does not change frequently and is enriched over a period of time. With the emergence of standards for medical terminologies, for example LOINC [12] for laboratory tests, and ICD (9, 10) [9] and SNOMED-CT [21] for disease codes, medical information is increasingly becoming interoperable across geographically distributed health-care systems. The Web document repository creates a semantic object-level universal schema which can be queried by domain experts and other users. The attributes and data stored in this schema are largely understandable by the users and are easy to query. Such a schema can simplify and enrich the querying experience of the end-users. It facilitates a query language with which these users can easily formulate their queries interactively in a multi-stage manner. We propose to capture the hierarchical schema (mapped to XML form) for Web-based document repositories. Multi-stage querying over an XML-based schema expressed in terms of the users' domain-level knowledge eliminates the need to write complex code for the queries (or complex SQL or XQuery expressions). This approach has been proposed earlier in [20] for granular and precise querying of archetype-based EHRs. Therefore, the integration of the available domain knowledge and programming methodology is required to surmount the difficulty of using query languages.

Figure 3 describes the structure of the user-level schema which is to be generated for the document repository (in this case, the MedlinePlus medical encyclopedia [15]) after transformations. It represents the concepts from each of the documents of the repository, presented to the user as query-able attributes. Figure 4 shows a sample of the XML document corresponding to the "Aarskog Syndrome" document from the MedlinePlus medical encyclopedia [15]. It contains the conceptual-level tags "symptoms" and "causes", which can be implemented as query-able attributes.
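As a rough illustration of how conceptual-level tags can act as query-able attributes, the sketch below queries one XML article with Python's standard `xml.etree.ElementTree` module. The element names follow those listed for Figure 4; the document text is a placeholder rather than the content of any MedlinePlus article, and the helper names `segment` and `matches` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Placeholder article in the user-level (XML) form. Only the tag names are
# taken from the paper; the text content is invented for illustration.
SAMPLE_ARTICLE = """<Article>
  <Title>Example topic</Title>
  <Causes>Placeholder text that mentions fever.</Causes>
  <Symptoms>Placeholder text that mentions vomiting.</Symptoms>
</Article>"""


def segment(article_xml, concept):
    """Return the text of the segment labelled `concept`, or None if absent."""
    root = ET.fromstring(article_xml)
    node = root.find(concept)          # direct child with that tag name
    return None if node is None else (node.text or "").strip()


def matches(article_xml, concept, value):
    """True when `value` occurs inside the given concept's segment only."""
    text = segment(article_xml, concept)
    return text is not None and value.lower() in text.lower()


print(matches(SAMPLE_ARTICLE, "Causes", "fever"))     # True
print(matches(SAMPLE_ARTICLE, "Symptoms", "fever"))   # False
```

The second call returns no match even though the word occurs in the document, which is the difference between querying a concept and keyword search over the whole page.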

4. PROPOSED APPROACH: OUTLINE

To create the user-level schema (previous section), the proposed approach maps the domain knowledge of the experts to the concepts represented in a Web document. It extracts the syntactic structure of the Web document from its meta-data (in the form of headings and sub-headings). These concepts are then used to label the nodes of the hierarchical structure to form a concept-based structure. The node labels define the attributes for querying. We aim to develop a segmentation algorithm which segments the Web document considering the layout features and the semantic organization of the domain-level concepts. It makes use of the concept of Web page segmentation, but considers visual, layout and semantic features, in contrast to the earlier vision-based approach [2].

As part of this thesis, we aim to develop the domain-specific multi-stage query language described in the previous section. Figure 5(a) represents the query formulation process for clinical queries through the proposed multi-stage query language. At each stage of query formulation, the user can dynamically select a medical concept to query, assign a value to it, and then either execute the query and view the results or further refine the query by adding more attributes. The query is executed on the user-level schema. It provides the users with segment-level results. Hence, the proposed query language allows the user to formulate complex DB-style queries using a simple interface and understandable attributes. Figure 5(b) shows the query formulation for the query "Cases where a patient has fever due to affliction of pneumonia and tuberculosis" using the proposed multi-stage query language. The query fetches precise results by querying only specific contexts or segments (Causes or Symptoms).
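The heading-driven part of this idea can be sketched as follows. This is a deliberate simplification and not the proposed segmentation algorithm: it uses only the `h1` to `h3` tags as segment boundaries and ignores the visual and semantic features mentioned above. The class `HeadingSegmenter` and the function `segment_page` are hypothetical names; `html.parser.HTMLParser` is part of the Python standard library.

```python
from html.parser import HTMLParser


class HeadingSegmenter(HTMLParser):
    """Split a page into (heading label, following content) pairs."""
    HEADINGS = {"h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self.segments = []          # list of [label, content]
        self._in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:    # a heading opens a new segment
            self._in_heading = True
            self.segments.append(["", ""])

    def handle_endtag(self, tag):
        if tag in self.HEADINGS:
            self._in_heading = False

    def handle_data(self, data):
        if not self.segments:
            return                  # text before the first heading is ignored
        if self._in_heading:
            self.segments[-1][0] += data   # text of the heading itself
        else:
            self.segments[-1][1] += data   # content that follows the heading


def segment_page(html):
    """Return {heading label: content}, with whitespace normalised."""
    parser = HeadingSegmenter()
    parser.feed(html)
    parser.close()
    return {label.strip(): " ".join(content.split())
            for label, content in parser.segments}
```

For a page of the form `<h1>Topic</h1>...<h2>Causes</h2>...<h2>Symptoms</h2>...`, `segment_page` returns one entry per heading, and the keys can then serve as candidate node labels for the concept-based structure.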

4.1 Tree-structured Repository: Data Model

Let $D = \{d_1, d_2, \ldots, d_n\}$ be the set of Web documents from a Web document repository $R$, where $n$ is the total number of documents in $R$. For the MedlinePlus medical encyclopedia repository, $n$ is nearly 4000 [15]. For each document $d_i$, let $H$ be the heading of the document and let $S = \{s_1, s_2, \ldots, s_m\}$ denote the set of sub-headings contained in $d_i$. Let $c_j$ represent the content enclosed between two sub-headings, or between the heading and a sub-heading, of $d_i$. The hierarchical structure corresponding to $d_i$, representing its coherent segments, is defined with $H$ as the root and the $s_j$ as the first level of internal nodes. Each internal node $s_j$, together with the content $c_j$ that follows it, forms a segment $f_k$. The depth of the tree is determined by the number of levels of headings and sub-headings. Each node label of the hierarchical structure has a 1:1 mapping to a medical concept. Each segment can be represented as $f_k = (s_j + c_j)$. It defines the granularity of the proposed query language.

[Figure 4 content: an Article element with the child elements Title, Causes, Symptoms, Exams_and_Tests, Outlook Prognosis, treatment, Support_Groups, Alternative_Names, UpdateDate and Prevention.]

Figure 4: A sample document in user-level (XML) schema.

Query Results. A query $Q$ can be represented as a combination of item(s) of concern $k_i$ and the concept $con_j$ (disease, causes or symptoms). The item(s) of concern are queried within the concept. The result returned is a fragment $f_k$ which satisfies the user's query criteria. Each segment is demarcated by heading or sub-heading label(s). The results of a user query may belong to the same segment within the same document (intra-segment query), or they may belong to different segments within the same document (inter-segment query). The results may also belong to different topics (Web documents). Such queries can be considered inter-topical queries.
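The data model and the three kinds of query results can be sketched as below. The sketch is an assumption-laden illustration rather than the system's implementation: a segment is stored as a (heading, sub-heading, content) record, a query is a list of (item of concern, concept) pairs, and the names `Segment`, `find_segments` and `classify` as well as the sample data are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    heading: str   # H, the heading (root) of the document the segment is in
    label: str     # s_j, the sub-heading that labels the segment
    content: str   # c_j, the content that follows the sub-heading


def find_segments(segments, query):
    """Return segments whose label equals a queried concept and whose
    content contains the item of concern paired with that concept."""
    hits = []
    for item, concept in query:
        for f in segments:
            if (f.label.lower() == concept.lower()
                    and item.lower() in f.content.lower()
                    and f not in hits):
                hits.append(f)
    return hits


def classify(hits):
    """Label a result set as intra-segment, inter-segment or inter-topical."""
    if not hits:
        return "no result"
    documents = {f.heading for f in hits}
    labels = {(f.heading, f.label) for f in hits}
    if len(documents) > 1:
        return "inter-topical"    # results span several Web documents
    if len(labels) > 1:
        return "inter-segment"    # several segments of one document
    return "intra-segment"        # a single segment of one document


# Placeholder repository content, invented for illustration.
SEGMENTS = [
    Segment("Topic A", "Causes", "fever and infection"),
    Segment("Topic A", "Symptoms", "vomiting"),
    Segment("Topic B", "Causes", "fever"),
]

print(classify(find_segments(SEGMENTS, [("infection", "Causes")])))
print(classify(find_segments(SEGMENTS, [("infection", "Causes"), ("vomiting", "Symptoms")])))
print(classify(find_segments(SEGMENTS, [("fever", "Causes")])))
```

With this placeholder data the three calls yield an intra-segment, an inter-segment and an inter-topical result, in that order.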

5. NEXT STEPS

This thesis will propose query methods (using a user-level schema) for medical domain users. The following steps are needed to complete this thesis.

1. XQBE for on-line medical document repositories. At present, we have attempted to use the XQBE graphical query language [1] on the user-level schema generated by the proposed approach. In this approach, the topic and sub-topic labels (query-able attributes) form the intermediate nodes of the XQBE structure. The values of the attributes are added as leaf nodes. Since the user interface is graphical, users can easily drag and drop nodes to add filtering conditions without any prior knowledge of the schema structure. They can specify the attributes (segments) they wish to receive as results. Usability studies are planned to show that this kind of query language is easier to use and returns relevant results to the users.

2. Multi-stage Query-by-Concept query language for on-line medical documents. In this thesis, we further consider relating the user-level schema to a multi-stage query-by-object query language. The term "object" can be defined as a uniquely identifiable or query-able entity in a specialized domain. In this case, an object is defined as a medical concept, topic or sub-topic label in a medical document repository. The user can dynamically select an attribute to query on the user interface (UI) and subsequently enter the value for that attribute. Further, the user can append more attributes to formulate the desired query. The interface allows the user to query the knowledge archives and knowledge base without the need to understand the structure of the underlying schema or to have any technical expertise [18], [23].

6. EVALUATION

This section presents the planned experimental evaluation of the effectiveness of the proposed query language.

6.1 Datasets and Queries

For the experimental evaluation of the proposed query language, the documents of the MedlinePlus Web document repository [15] have been downloaded and pre-processed. The repository contains 900+ documents describing health topics, 4000+ entries in the medical encyclopedia and 12000+ documents about drugs. It also lists 1350+ organizations that provide this information and 18000+ links to authoritative medical information. All the original links to the repository are preserved during the preprocessing of the data. A set of 50 multi-stage test queries related to various medical concepts (such as diseases and medication) is formulated for testing on this dataset. Further, sample queries have been selected from various discussion forums and blogs on the topic [16]. In addition, a comparative analysis is planned to test the efficiency and effectiveness of these queries against the advanced keyword search provided by the MedlinePlus document repository [15].

6.2 Experimental Evaluation

In order to assess and evaluate the effectiveness and efficiency of the proposed multi-stage query language, we aim to perform a number of experiments on the real-world data (Section 6.1). The first category of experiments aims to evaluate the effectiveness of the proposed approach for creating the conceptual user-level schema from the Web document repository. For this, we plan to measure the accuracy of segmentation of the Web documents through the quantitative measures of precision and recall. These measures can be calculated in terms of the number of accurate, coherent segments discovered and the total number of segments. The second category of experiments aims to test the effectiveness of the query language for the actual end-users. For this, quantitative measures can be used to calculate the search space used by the proposed method as compared to the advanced keyword search of the Web document repository [15]. We also aim to conduct usability studies with the actual end-users (the clinicians). These can establish the effectiveness and efficiency of the query language along with the qualitative dimensions of ease-of-use and understandability for the users.

[Figure 5 diagram. Panel (a), flowchart labels: Clinician; Select attribute; Assign Value; Append condition <OR/AND/NOT>; Add Attribute; Write Query; Execute Query; User Level-Schema (Doctor’s Knowledge Base); Query Result; View Results; Refine Results (Yes/No); End. Panel (b), sample query labels: Clinician; Causes = Pneumonia; AND; Causes = Tuberculosis; AND; Symptoms = Fever; Execute Query; Display Results; Refine Query (Yes/No); End.]

Figure 5: (a) Query formulation through the proposed multi-stage query language on the user-level schema (Figure 3), (b) A sample query formulated using the multi-stage query language.
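The precision and recall measures planned for the segmentation experiments in Section 6.2 could be computed along the following lines. This is a sketch under assumptions: segments are represented as hypothetical (label, start, end) tuples, and a discovered segment counts as accurate only if it exactly matches a manually annotated reference segment; the evaluation may well adopt a softer matching criterion.

```python
def segmentation_precision_recall(discovered, reference):
    """Return (precision, recall) of discovered segments against a reference.

    Assumption: a discovered segment is correct only on an exact match
    with a reference segment.
    """
    discovered, reference = set(discovered), set(reference)
    correct = len(discovered & reference)
    precision = correct / len(discovered) if discovered else 0.0
    recall = correct / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical segments given as (label, first line, last line).
discovered = [("Causes", 0, 3), ("Symptoms", 4, 9), ("Treatment", 10, 12)]
reference = [("Causes", 0, 3), ("Symptoms", 4, 8),
             ("Treatment", 10, 12), ("Prevention", 13, 15)]
precision, recall = segmentation_precision_recall(discovered, reference)
```

Here two of the three discovered segments match the four reference segments, so precision is 2/3 and recall is 2/4.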

7. REFERENCES

[1] D. Braga, A. Campi, and S. Ceri. XQBE (XQuery By Example): A visual interface to the standard XML query language. ACM Trans. Database Syst., 30(2):398–443, June 2005.

[2] D. Cai, S. Yu, J.-R. Wen, and W.-Y. Ma. Extracting content structure for web pages based on visual representation. In Proceedings of the 5th APWeb, pages 406–417. Springer-Verlag, 2003.

[3] M.-A. Cartright, R. W. White, and E. Horvitz. Intentions and attention in exploratory health search. In Proceedings of the 34th Intl. ACM SIGIR conference, pages 65–74, New York, NY, USA, 2011. ACM.

[4] S. Cohen, Y. Kanza, Y. Kogan, W. Nutt, Y. Sagiv, and A. Serebrenik. Equix-a search and query language for XML. Journal of the American Society for Information Science and Technology, 53:2002, 2000.

[5] S. M. Freire, E. Sundvall, D. Karlsson, and P. Lambrix. Performance of XML Databases for Epidemiological Queries in Archetype-Based EHRs. In Proceedings Scandinavian Conference on Health Informatics 2012, volume 70 of Linköping Electronic Conference Proceedings, pages 51–57. Linköping University Electronic Press, 2012.

[6] M. Gschwandtner, M. Kritz, and C. Boyer. Requirements of the health professional research. In Technical Report D8.1.2. Khresmoi Project, 2011.

[7] A. Hanbury. Medical information retrieval, an instance of domain. In SIGIR’12. ACM, August 2012.

[8] S. Hunt, J. J. Cimino, and D. E. Koziol. A comparison of clinicians' access to online knowledge resources using two types of information retrieval applications in an academic hospital setting. J Med Libr Assoc, 101(1):26–31, 2013.

[9] http://www.who.int/classifications/icd/en/, 2011.

[10] M. Jayapandian and H. V. Jagadish. Automating the design and construction of query forms. ICDE, page 125, 2006.

[11] F. Li and H. V. Jagadish. Usability, databases, and HCI. IEEE Data Eng. Bull., 35(3):37–45, 2012.

[12] http://loinc.org/, 2011.

[13] A. Marian and W. Wang. Flexible querying of personal information. IEEE Data Eng. Bull., 32(2):20–27, 2009.

[14] http://www.nlm.nih.gov/bsd/pmresources.html, 2011.

[15] http://www.nlm.nih.gov/medlineplus/, 2009.

[16] http://www.linkedin.com/groups/Choice-OpenEHR-persistence-layer-144276.S.208531138?qid=208adbca-fc26-4ada-bf02-7efe5a9e5661&trk=group_most_recent_rich-0-b-ttl&goback=%2Egmr_144276, 2013.

[17] http://www.ncbi.nlm.nih.gov/pubmed, 2011.

[18] S. A. Rahman, S. Bhalla, and T. Hashimoto. Query-by-object interface for information requirement elicitation in m-commerce. Int. J. Hum. Comput. Interaction, 20(2):135–160, 2006.

[19] X. Y. Raymond, Y. Lau, D. Song, X. Li, and J. Ma. Toward a semantic granularity model for domain-specific information retrieval. ACM Trans. on Information Systems, 29(3), July 2011.

[20] S. Sachdeva and S. Bhalla. Implementing high-level query language interfaces for archetype-based electronic health records database. In COMAD, 2009.

[21] http://www.ihtsdo.org/snomed-ct/, 2011.

[22] R. Varadarajan, V. Hristidis, and T. Li. Beyond single-page web search results. IEEE Transactions on Knowledge and Data Engineering, 20(3):411–424, 2008.

[23] A. Yasir, M. Kumara Swamy, P. Krishna Reddy, and S. Bhalla. Enhanced query-by-object approach for information requirement elicitation in large databases. In Big Data Analytics, volume 7678 of Lecture Notes in Computer Science, pages 26–41. Springer, 2012.
