Refining Health Outcomes of Interest using Formal Concept Analysis and Semantic Query Expansion Olivier Curé 1 , Henri Maurer 2 , Paea Le Pendu 3 , Nigam Shah 3 1: CNRS LIGM lab, UPEM, France 2: Edinburgh University, UK 3: BMIR lab, Stanford University, USA
Problem setting
● Applications need to select, extract, compare and analyze groups of patients using Electronic Health Records (EHRs).
● This requires defining Health Outcomes of Interest (HOIs), e.g. myocardial infarction, chronic obstructive pulmonary disease.
● With clinical text, these definitions should capture variations of terms and ensure good precision and recall of the text-mining process.
Problem setting (2)
● It is not practical to define these HOIs precisely with concept identifiers, e.g. UMLS CUIs.
● We provide a solution that produces and refines HOI definitions from terms provided by the end-user.
● Our solution aims to propose sound and complete definitions in a best-effort way.
Approach overview
[Figure: approach overview diagram. Recoverable labels: Diseases, Procedures, Drugs, Devices; Bioportal - Knowledge; terms, concepts; Terminology3 DB; Semantic Query Expansion; Formal Concept Analysis; Statistics-Based Pruning.]
SQE
● Improve search results by expanding queries with the transitive closure of the subsumption relationship of ontology concepts.
● Queries can be generalized (resp. specialized) via expansions with ancestors (resp. descendants).
● Ex: expanding a query with 'neoplasm' or 'tumor' when searching for 'cancer'.
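The expansion described above can be sketched as a reachability computation over is-a edges. This is only an illustration: the `IS_A` edge list is a hypothetical toy hierarchy, not taken from any real ontology, and the function names are invented for this sketch.

```python
from collections import defaultdict

# Hypothetical toy is-a edges (child, parent); not from a real ontology.
IS_A = [
    ("adenocarcinoma", "carcinoma"),
    ("carcinoma", "cancer"),
    ("cancer", "neoplasm"),
    ("neoplasm", "disease"),
]


def closure(start, edges):
    """Return every node reachable from `start` by following `edges` transitively."""
    graph = defaultdict(set)
    for src, dst in edges:
        graph[src].add(dst)
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def ancestors(term):
    """Generalize: all concepts that subsume `term`."""
    return closure(term, IS_A)


def descendants(term):
    """Specialize: all concepts subsumed by `term` (edges are reversed)."""
    return closure(term, [(parent, child) for child, parent in IS_A])


def expand_query(term, mode):
    """Return the query term plus its ancestors or descendants."""
    extra = ancestors(term) if mode == "generalize" else descendants(term)
    return {term} | extra


general_query = expand_query("cancer", "generalize")
specific_query = expand_query("cancer", "specialize")
```

With this toy hierarchy, generalizing 'cancer' adds 'neoplasm' and 'disease', while specializing adds 'carcinoma' and 'adenocarcinoma'.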
FCA
● Abstract conceptual descriptions from a set of objects described by some attributes.
● Used in machine learning and knowledge management.
● A formal context is a triple (G,M,I): a set of objects G, a set of attributes M, and a binary relation I between G and M.
● A formal context can be represented as a matrix.
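As a minimal sketch of these definitions, the snippet below stores a small formal context, renders it as a boolean matrix, and implements the two derivation operators that FCA builds on. The objects, attributes and incidence pairs are hypothetical placeholders chosen for illustration.

```python
# Hypothetical toy formal context (G, M, I).
G = ["g1", "g2", "g3"]          # objects
M = ["a", "b", "c"]             # attributes
I = {("g1", "a"), ("g1", "b"), ("g2", "b"), ("g2", "c"), ("g3", "b")}


def as_matrix(objects, attributes, incidence):
    """One row per object, one column per attribute; True where (g, m) is in I."""
    return [[(g, m) in incidence for m in attributes] for g in objects]


def common_attributes(objects):
    """Attributes shared by every object in `objects`."""
    return {m for m in M if all((g, m) in I for g in objects)}


def common_objects(attributes):
    """Objects that have every attribute in `attributes`."""
    return {g for g in G if all((g, m) in I for m in attributes)}


matrix = as_matrix(G, M, I)
```

A pair (A, B) is a formal concept when `common_attributes(A) == B` and `common_objects(B) == A`.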
FCA (2)
{1,2}-{CF1,F1,CF2,F2}
{3}-{CF1,F1,MF2,F2}
{6}-{BLF1,F1,MF2,F2}
{4,5}-{BLF1,F1,BLF2,F2}
{1,2,3}-{CF1,F1,F2}
{3,6}-{MF2,F1,F2}
{4,5,6}-{BLF1,F1,F2}
{1,2,3,4,5,6}-{F1,F2}
⊥
⊤
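The lattice nodes listed above can be recomputed from a formal context. The sketch below assumes the context that those nodes imply: each object's attribute set is read off the smallest node that contains it, and the attribute universe is assumed to contain only the attributes that appear in the nodes. The original slide's context matrix is not part of this transcript, so this reconstruction is an assumption.

```python
# Object intents reconstructed from the listed lattice nodes (an assumption).
INTENTS = {
    1: {"CF1", "F1", "CF2", "F2"},
    2: {"CF1", "F1", "CF2", "F2"},
    3: {"CF1", "F1", "MF2", "F2"},
    4: {"BLF1", "F1", "BLF2", "F2"},
    5: {"BLF1", "F1", "BLF2", "F2"},
    6: {"BLF1", "F1", "MF2", "F2"},
}
ALL_ATTRS = frozenset().union(*INTENTS.values())


def extent(attrs):
    """Objects whose attribute set contains every attribute in `attrs`."""
    return frozenset(g for g, has in INTENTS.items() if attrs <= has)


def formal_concepts():
    """Enumerate (extent, intent) pairs by closing object intents under intersection."""
    intents = {ALL_ATTRS}
    for obj_intent in INTENTS.values():
        intents |= {frozenset(i & obj_intent) for i in intents}
    concepts = [(extent(i), i) for i in intents]
    # Largest extent first, i.e. from the top of the lattice downwards.
    return sorted(concepts, key=lambda c: (-len(c[0]), sorted(c[1])))


concepts = formal_concepts()
for ext, intent in concepts:
    print(sorted(ext), sorted(intent))
```

Under these assumptions the enumeration yields the eight extent-intent pairs shown above plus one concept with an empty extent and the full attribute set.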
Method
● SQE: relational database approach
– We use the ontologies stored in Stanford's DB and its materialization of concept subsumption (almost 14 million entries).
● FCA: objects and attributes of the formal context are concept identifiers (UMLS concept identifiers).
Method (3)
● To improve relevance, an FCA-based pruning approach is designed to identify potential concepts among the discovered ones.
● The formal context is composed of matching concepts as objects and candidate concepts as attributes.
● Thus the binary relation corresponds to the subsumption relationship.
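One way to picture this construction is sketched below. It assumes that a pair (g, m) is in the relation when candidate concept m subsumes matching concept g, and that a materialized ancestor table is available. The concept labels, the `ANCESTORS` table and the `build_context` helper are all hypothetical.

```python
# Hypothetical materialized subsumption closure: concept -> all of its ancestors.
ANCESTORS = {
    "familial hypercholesterolemia": {
        "hypercholesterolemia", "hyperlipidemia", "metabolic disease"},
    "secondary hypercholesterolemia": {
        "hypercholesterolemia", "hyperlipidemia", "metabolic disease"},
    "hypertriglyceridemia": {"hyperlipidemia", "metabolic disease"},
}


def build_context(matching, candidates, ancestors):
    """Return (G, M, I) with (g, m) in I when candidate m subsumes matching concept g.

    The direction of the relation is an assumption made for this sketch.
    """
    objects = sorted(matching)
    attributes = sorted(candidates)
    incidence = {
        (g, m)
        for g in objects
        for m in attributes
        if m in ancestors.get(g, set())
    }
    return objects, attributes, incidence


matching = set(ANCESTORS)
candidates = {"hypercholesterolemia", "hyperlipidemia", "metabolic disease"}
G, M, I = build_context(matching, candidates, ANCESTORS)
```

The resulting triple can be handed to any standard FCA algorithm to build the lattice.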
Method (4)
● Ex: 10365: “hyperlipoproteinemia type iv” and 740154: “disease, disorder or finding”
● Standard FCA algorithms are used to define the FCA lattice.
Method (5)
● Qualifying a discovered concept is performed using a top-down navigation of the FCA lattice.
● For each formal concept <Ai,Bi>, we compute the transitive closure of the sub-concepts of Ai (resp. Bi), denoted LAi (resp. LBi).
● If |LBi ∩ LAi| / |LBi| ≥ Θ, where Θ is a predefined pruning threshold, then Bi is a potential concept.
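The threshold test above can be sketched as follows. The `CHILDREN` hierarchy is a hypothetical toy example, and the sketch assumes the closure includes the starting concepts themselves, which the slides do not specify.

```python
# Hypothetical toy hierarchy: concept -> direct sub-concepts.
CHILDREN = {
    "lipid disorder": {"hypercholesterolemia", "hypertriglyceridemia"},
    "hypercholesterolemia": {
        "familial hypercholesterolemia", "secondary hypercholesterolemia"},
}


def sub_concept_closure(concepts):
    """Concepts plus all their transitive sub-concepts (reflexive by assumption)."""
    seen, stack = set(concepts), list(concepts)
    while stack:
        for child in CHILDREN.get(stack.pop(), set()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def coverage_ratio(a_i, b_i):
    """|LBi ∩ LAi| / |LBi| for a formal concept <Ai, Bi>."""
    l_a, l_b = sub_concept_closure(a_i), sub_concept_closure(b_i)
    return len(l_b & l_a) / len(l_b) if l_b else 0.0


def is_potential(a_i, b_i, theta):
    """True when the coverage ratio reaches the pruning threshold theta."""
    return coverage_ratio(a_i, b_i) >= theta


ratio = coverage_ratio({"hypercholesterolemia"}, {"lipid disorder"})
```

In this toy case LAi has 3 concepts and LBi has 5, so the ratio is 3/5: the candidate passes with Θ = 0.5 and is pruned with Θ = 0.75.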
Method (6)
● Concept sets:
– M : Matching
– D : Discovered
– P : Potential
– C : Other concept
Example
● A search for hypercholesterolemia over 18 ontologies provides:
– 20 matching concepts (i.e., FCA objects)
– 102 discovered concepts (i.e., FCA attributes)
● Generates an FCA lattice with 67 formal concepts
● First formal concept satisfying a Θ=.75 pruning threshold is at the 4th level of the lattice: only 4 concepts out of 16 LBi are covered by LAi .
● These 4 concepts have the following preferred labels: “hypercholesterolemia”, “cholesterolosis”, “secondary hypercholesterolemia” and “hyperlipidemia”.
Method (7)
● We include interactions with the end user to validate our potential discoveries.
● Hence the domain expert has the final decision on acceptance/rejection of a proposition.
● Important issue: trade-off between user interactions and precision/recall of results.
● The end user can validate whenever she wants.
● Interactions are performed in a web interface providing additional information on the search (clinical text snippets, number of patients).
Evaluation
● i2b2 obesity NLP reference set used as an evaluation data set
● The gold standard is the set of results from a previous experiment conducted at Stanford.
● Evaluation in terms of specificity, sensitivity and duration of computation (on commodity hardware)
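For reference, sensitivity and specificity can be computed from a predicted set and a gold-standard set as sketched below. The patient identifiers are hypothetical and unrelated to the i2b2 data; the snippet only illustrates the two metrics.

```python
def confusion_counts(predicted, gold, universe):
    """Return (tp, fp, fn, tn) for a predicted set against a gold-standard set."""
    tp = len(predicted & gold)
    fp = len(predicted - gold)
    fn = len(gold - predicted)
    tn = len(universe - predicted - gold)
    return tp, fp, fn, tn


def sensitivity(tp, fn):
    """True positive rate: share of gold-standard positives that were retrieved."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """True negative rate: share of gold-standard negatives that were left out."""
    return tn / (tn + fp)


# Hypothetical patient identifiers, for illustration only.
universe = set(range(1, 11))
gold = {1, 2, 3, 4}
predicted = {2, 3, 4, 5}

tp, fp, fn, tn = confusion_counts(predicted, gold, universe)
sens = sensitivity(tp, fn)
spec = specificity(tn, fp)
```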
Evaluation (2)
● An improvement of 2% in sensitivity and 3% in specificity.
● Computation takes on the order of seconds on a standard laptop.
Evaluation (3)
● More interesting is that some of our false negatives seem to be relevant to the search.
● Some of these false negatives come from the matching approach and also from the potential (i.e. FCA-based) approach:
● Matching example:
– Sitosterolemia for hypercholesterolemia, while the gold standard contains concepts such as “hyperlipoproteinemia type ii”, which confirms the relevance of using a semantic approach.
● Note that among our true positives, depending on the use case, a significant number of items were retrieved from the potential concept set, i.e., using our FCA statistical approach.
Conclusion
● We have proposed a semi-automatic solution for defining HOIs.
● Approach uses SQE and FCA enriched with a statistical approach.
● Our results are comparable to state-of-the-art methods.
● It refines HOI definitions efficiently with relevant terms/concepts.
Future work
● Conduct user-driven evaluations with clinicians and researchers.
● Analyze acceptance/rejection of end-users in practical scenarios.
● Use active learning over past query refinements to improve future queries.
● Study our method's impact on mining EHR clinical notes and on cohort-building tools.