GEM: The GAAIN Entity Mapper Naveen Ashish, Peehoo Dewan, Jose-Luis Ambite and Arthur W. Toga USC Stevens Neuroimaging and Informatics Institute Keck School of Medicine of USC July 9 th , 2015 At the 11 th Data Integration in Life Sciences Conference (DILS) 2015 Marina del Rey
GEM: The GAAIN Entity Mapper
Naveen Ashish, Peehoo Dewan, Jose-Luis Ambite and Arthur W. Toga
USC Stevens Neuroimaging and Informatics Institute
Keck School of Medicine of USC
July 9th, 2015
At the 11th Data Integration in Life Sciences Conference (DILS) 2015
Marina del Rey
Introduction: GAAIN
• GAAIN: Global Alzheimer’s Association Interactive Network
• Subject research data
  - Well structured
  - (Mostly) relational
• Data harmonization
  - Common data model
  - Map datasets to the common model
• Data ownership sensitivity
Data Mapping
The Data Mapping Problem
• Resource intensive
• “On average, converting a database to the OMOP CDM, including mapping terminologies, required the equivalent of four full-time employees for 6 months and significant computational resources for each distributed research partner. Each partner utilized a number of people with a wide range of expertise and skills to complete the project, including project managers, medical informaticists, epidemiologists, database administrators, database developers, system analysts/programmers, research assistants, statisticians, and hardware technicians. Knowledge of clinical medicine was critical to correctly map data to the proper OMOP CDM tables.”
• Complexity of data harmonization
  - Several thousand data elements per dataset
  - Multiple datasets
• Data elements
  - Complex scientific concepts
  - Cryptic names
  - Domain expertise to interpret
Observations
• Rich element information in documentation
  - Data dictionaries!
• Element information
  - Descriptions
  - Metadata
• Need better approaches to matching element names
  - MOMDEMYR1
  - PTGNDR
Data Dictionaries
• Rich element details
Approach
• Extract element description and metadata details from data dictionaries
• Determine element matches based on above
• Block improbable match candidates based on metadata
• Determine element similarity (and thus match likelihood) based on name and description similarity
• Initial version of system knowledge-driven, then added machine-learning classification
GEM: A Software Assistant for Data Mapping
GEM Architecture
Element Extraction
• Extract and segregate element information
Metadata Detail Extraction
• Element categories: four categories
  - (i) Special
  - (ii) Coded
    - Binary
    - Other coded
  - (iii) Numerical
  - (iv) Text
• Classifier: heuristic based
• Other metadata details
  - Cardinality
  - Range (min, max)
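The category assignment and the cardinality/range details above could be sketched as a few rules over an element's observed values. This is only an illustration: the slides say the classifier is heuristic based but do not give its rules, so the function name, the rule order, the `max_codes` threshold, and the way "special" elements are supplied (a caller-provided set of names) are all assumptions.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set, Tuple


@dataclass
class ElementMetadata:
    name: str
    category: str                               # "special", "binary", "coded", "numerical" or "text"
    cardinality: int                            # number of distinct observed values
    value_range: Optional[Tuple[float, float]]  # (min, max) if every value is numeric


def _as_number(value: str) -> Optional[float]:
    """Return the value as a float, or None if it is not numeric."""
    try:
        return float(value)
    except ValueError:
        return None


def extract_metadata(
    name: str,
    values: Iterable[str],
    special_names: Set[str] = frozenset(),  # assumption: "special" elements are listed by hand
    max_codes: int = 10,                    # assumption: few distinct values means "coded"
) -> ElementMetadata:
    distinct = sorted({v.strip() for v in values if v.strip()})
    numbers = [_as_number(v) for v in distinct]
    all_numeric = bool(distinct) and all(n is not None for n in numbers)
    value_range = (min(numbers), max(numbers)) if all_numeric else None

    if name in special_names:
        category = "special"
    elif len(distinct) == 2:
        category = "binary"          # a coded element with exactly two codes
    elif 0 < len(distinct) <= max_codes:
        category = "coded"
    elif all_numeric:
        category = "numerical"
    else:
        category = "text"
    return ElementMetadata(name, category, len(distinct), value_range)
```

For instance, `extract_metadata("PTGENDER", ["1", "2", "1"])` is classified as binary under these rules, while a column of many distinct ages is classified as numerical with its (min, max) range recorded.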
MDB: The Metadata Database
• Extracted detailed metadata per element
  - Source
  - Name
  - Description
  - Legend
  - Cardinality
  - Range
  - Category
Matching: Metadata Based “Blocking”
• Elimination of candidates
  - Eliminate candidates from the second source that are incompatible
• Incompatibility criteria
  - Category mismatch
  - Cardinality mismatch
    - For coded elements
    - Assume normal distribution with SD of 1
  - Range mismatch
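A minimal sketch of metadata-based blocking, under stated assumptions, might look as follows. The slides do not spell out the exact tests, so several choices here are guesses: the "normal distribution with SD of 1" is read as a tolerance on the difference in cardinality between two coded elements (blocked beyond `max_sds` standard deviations, a hypothetical parameter), and a range mismatch is read as two numerical ranges that do not overlap. The `Meta` record and function names are illustrative.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple


class Meta(NamedTuple):
    category: str                               # "special", "coded", "numerical" or "text"
    cardinality: int
    value_range: Optional[Tuple[float, float]]  # (min, max) or None


def is_blocked(src: Meta, tgt: Meta, sd: float = 1.0, max_sds: float = 2.0) -> bool:
    """Return True if tgt is an improbable match for src and can be eliminated."""
    # Category mismatch.
    if src.category != tgt.category:
        return True
    # Cardinality mismatch, applied to coded elements only (assumed reading).
    if src.category == "coded":
        if abs(src.cardinality - tgt.cardinality) > max_sds * sd:
            return True
    # Range mismatch, read here as disjoint numerical ranges (assumption).
    if src.category == "numerical" and src.value_range and tgt.value_range:
        s_lo, s_hi = src.value_range
        t_lo, t_hi = tgt.value_range
        if s_hi < t_lo or t_hi < s_lo:
            return True
    return False


def candidates(src: Meta, targets: Dict[str, Meta]) -> List[str]:
    """Names of target elements that survive blocking for one source element."""
    return [name for name, meta in targets.items() if not is_blocked(src, meta)]
```

Only the surviving candidates would then be passed on to the more expensive name and description matching.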
Matching Text Descriptions
• Employ a regular TF-IDF cosine distance on bag-of-words
• Based on unsupervised topic modeling (LDA)
  - Treat element descriptions as ‘documents’
  - Topic model over these documents
  - Each element (description) has a probability distribution over topics
  - Element similarity (or distance) based on the similarity (or dissimilarity) of the associated topic distributions
Element Name Matching
• Composite element names
  - PTGENDER
  - PATGNDR
  - MOMDEM
  - FHQDEMYR1
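One way to compare composite names like these is to split each name into fragments and expand every fragment to a known word. The slides do not describe how GEM does this, so the following is purely a hypothetical sketch: the vocabulary, the rule that a fragment abbreviates a word when it shares the first letter and is a subsequence of it, and the longest-fragment-first search are all assumptions.

```python
from functools import lru_cache
from typing import Optional, Sequence, Tuple

# Hypothetical vocabulary; a real system would need a domain-specific one.
VOCAB = ("patient", "gender", "mom", "dementia", "year")


def abbreviates(fragment: str, word: str) -> bool:
    """True if fragment shares word's first letter and is a subsequence of word."""
    if not fragment or fragment[0] != word[0]:
        return False
    letters = iter(word)
    # Each `in` test consumes the iterator, which enforces left-to-right order.
    return all(ch in letters for ch in fragment)


def expand(name: str, vocab: Sequence[str] = VOCAB, min_len: int = 2) -> Optional[Tuple[str, ...]]:
    """Split a composite name into vocabulary words, or return None if it cannot be split."""
    name = name.lower()

    @lru_cache(maxsize=None)
    def solve(start: int) -> Optional[Tuple[str, ...]]:
        if start == len(name):
            return ()
        # Try the longest fragment first, down to min_len characters.
        for end in range(len(name), start + min_len - 1, -1):
            fragment = name[start:end]
            for word in vocab:
                if abbreviates(fragment, word):
                    rest = solve(end)
                    if rest is not None:
                        return (word,) + rest
        return None

    return solve(0)


def same_concept(name_a: str, name_b: str) -> bool:
    """Two names match if both expand to the same sequence of words."""
    a, b = expand(name_a), expand(name_b)
    return a is not None and a == b
```

With this vocabulary, PTGENDER and PATGNDR both expand to the same two words and are treated as a match, while a name with fragments outside the vocabulary, such as FHQDEMYR1, cannot be expanded and would need a fallback such as description matching.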
$$\mathrm{TCS}(e_S, e_T) = \frac{\sum_{\text{all elements in } \mathrm{Tab}(e_S)} M(e_S, e_T)}{\min\left(O(\mathrm{Tab}(e_S)),\; O(\mathrm{Tab}(e_T))\right)}$$
Table Correspondence
• Elements generally do match across ‘corresponding’ tables
• Literal table names not scalable as a feature
• Determine table correspondence heuristically, based on knowledge driven match likelihood
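The table correspondence score given with this slide could be computed roughly as below. The slide does not define its symbols, so this sketch makes assumptions: O(Tab(e)) is taken to be the number of elements in the table containing e, M is taken to be a knowledge-driven match likelihood in [0, 1], and each source element contributes its best likelihood against the target table. The function and parameter names are hypothetical.

```python
from typing import Callable, Sequence


def table_correspondence_score(
    source_table: Sequence[str],
    target_table: Sequence[str],
    match: Callable[[str, str], float],
) -> float:
    """Sum of per-element match likelihoods, normalized by the smaller table size."""
    if not source_table or not target_table:
        return 0.0
    # Assumption: each source element counts with its best match in the target table.
    total = sum(max(match(s, t) for t in target_table) for s in source_table)
    return total / min(len(source_table), len(target_table))
```

A candidate pair of elements could then be boosted when their tables score highly against each other, without relying on the literal table names.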
Experimental Results
• Setup
  - Various data dictionaries: ADNI, NACC, DIAN, LAADC, INDD
  - Mapping pairs
    - Pairs of datasets: ADNI-NACC, ADNI-INDD, ADNI-LAADC, …
    - Dataset to GAAIN Common Model (GCM): ADNI-GCM, NACC-GCM, …
• Experiments
  - Mapping accuracy
  - Effectiveness of individual components
    - Topic modeling (text description) match and filtering