1 From documents to datasets and back: challenges and solutions Jan Voskuil CEO Taxonic 25 May 2016 Leiden
Apr 11, 2017
1
From documents to datasets and back: challenges and solutions
Jan Voskuil, CEO Taxonic
25 May 2016, Leiden
2
Disclaimer
The views and opinions expressed in the following PowerPoint slides are those of the individual presenter and should not be attributed to Drug Information Association, Inc. (“DIA”), its directors, officers, employees, volunteers, members, chapters, councils, Special Interest Area Communities or affiliates, or any organisation with which the presenter is employed or affiliated. These PowerPoint slides are the intellectual property of the individual presenter and are protected under the copyright laws of the United States of America and other countries. Used by permission. All rights reserved. Drug Information Association, DIA and DIA logo are registered trademarks or trademarks of Drug Information Association Inc. All other trademarks are the property of their respective owners.
3
Disclosure Statement
I have no real or apparent relevant financial relationships to disclose
I am employed by a regulatory agency, and have nothing to disclose
Please note that DIA is not requesting a numerical amount to be entered for any disclosure, please indicate by marking the check box, and then providing the company name only for those disclosures you may have.
Will any of the relationships reported in the chart above impact your ability to present an unbiased presentation? No
In accordance with the ACPE requirements, if the disclosure statement is not completed or returned, participation in this activity will be refused.
Type of Financial Interest within last 12 months: Name of Commercial Interest
Grants/Research Funding
X Stock Shareholder: Taxonic
Consulting Fees
Employee
Other (Receipt of Intellectual Property Rights/Patent Holder, Speaker’s Bureau)
4
Introduction
Jan Voskuil
• Co-founder and CEO of Taxonic
• Co-founder of OntoPharma (soon)
Semantic Web technologies and natural language processing
Involved in research projects
© 2014 DIA, Inc. All rights reserved.
5
Documents, datasets and back
Currently, products are authorized based on textual documents (SmPC, Module 3, et cetera)
After 2016, the same products are to be authorized based on datasets
Change in the document == change in the dataset
IDMP-readiness includes being able to effectively manage documents, datasets and their interdependencies
6
Research Question
How can we automatically process SmPCs and generate high-quality IDMP-compliant datasets from them?
• Define criteria and measures
• Take stock of known approaches
• Set up experiments
• Productise results
7
Disclaimer:
Not all attribute values can be obtained from documents
However, some 80% can
The rest is obtained from other information systems
IDMP Phase 1
• 55 attributes out of 72
8
Entity extraction for Content Management
Lorem ipsum dolor sit amet
Consectetur adipiscing elit Paris. In hendrerit risus augue, id aliquet massa porttitor porta. Sed molestie dui eu est bibendum, nec ornare risus rhoncus. Mauris vestibulum turpis tellus, ac consequat dolor dapibus id. Mauris id libero leo. Sed dolor ipsum, finibus in iaculis non, accumsan in libero. Morbi mollis tortor a blandit scelerisque. Fusce quis mi massa. Suspendisse vel libero dolor. Donec molestie mattis eleifend. Phasellus nulla sem, pulvinar sed bibendum nec, scelerisque ac ligula.
Donec convallis lectus eget ante posuere pretium. Cras vestibulum pellentesque consectetur. Phasellus finibus erat eu facilisis efficitur. Quisque est dui, interdum nec arcu eu, rho tincidunt enim. Praesent bibendum finibus euismod. Nunc sed mauris id nunc posuere varius eu sed justo.
Entities highlighted in the text: Paris, President Hollande, European Parliament, Hollande, Van Rompuy, Brussels
EXTRACTION
Document Metadata
Places: Paris, Brussels
Institutions: European Parliament
People: Francois Hollande, Herman Van Rompuy
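A minimal sketch of this kind of extraction, assuming a small hand-made concept scheme. The names `CONCEPT_SCHEME` and `extract_metadata` and the sample sentence are illustrative assumptions, not part of any tool mentioned in the talk. The point it shows: the category of each entity follows from the scheme alone, with no look at the surrounding text.

```python
import re
from collections import defaultdict

# Hypothetical concept scheme: surface label -> (category, canonical name).
CONCEPT_SCHEME = {
    "Paris": ("Places", "Paris"),
    "Brussels": ("Places", "Brussels"),
    "European Parliament": ("Institutions", "European Parliament"),
    "President Hollande": ("People", "Francois Hollande"),
    "Hollande": ("People", "Francois Hollande"),
    "Van Rompuy": ("People", "Herman Van Rompuy"),
}


def extract_metadata(text):
    """Return {category: sorted canonical names} for every label found in text."""
    found = defaultdict(set)
    for label, (category, canonical) in CONCEPT_SCHEME.items():
        # Whole-word match so that "Paris" does not fire inside "Parisian".
        if re.search(r"\b" + re.escape(label) + r"\b", text):
            found[category].add(canonical)
    return {category: sorted(names) for category, names in found.items()}


# Made-up sample sentence standing in for the document on the slide.
sample = (
    "President Hollande travelled from Paris to the European Parliament. "
    "Hollande met Van Rompuy in Brussels."
)
metadata = extract_metadata(sample)
```

Both "President Hollande" and "Hollande" resolve to the same canonical person, which is why the metadata lists each person once.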
9
Entity extraction for constructing datasets
Dataset
EXTRACTION
10
Challenge: Recognizing attributes
Indication: headache; Adverse effect: nausea
Indication: nausea; Adverse effect: headache
11
Challenge: Recognizing attributes
Indication: headache; Adverse effect: nausea
Indication: nausea; Adverse effect: headache
With extraction for content management, attributes are inferred from the concept’s concept scheme (“Paris is a location”)
With extraction for dataset generation, attributes are inferred from analysis of the document structure and linguistic analysis of the concept’s context
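The difference can be sketched as follows, using document structure only. The sketch assumes the section numbering of the standard EU SmPC template (4.1 Therapeutic indications, 4.8 Undesirable effects); the term list, the sample texts and the function name `recognize_attributes` are hypothetical. A real extractor would add linguistic analysis of the sentence itself, which this sketch leaves out.

```python
import re

# Mapping from SmPC section number to the attribute its content feeds.
# Sections 4.1 and 4.8 follow the standard EU SmPC template.
SECTION_ATTRIBUTE = {
    "4.1": "Indication",
    "4.8": "Adverse effect",
}

# Hypothetical mini vocabulary of clinical terms to look for.
TERMS = ["headache", "nausea"]

# Simplification: any line starting with "N.N " is treated as a heading.
HEADING = re.compile(r"^(\d+\.\d+)\s+\S.*$")


def recognize_attributes(smpc_text):
    """Return (attribute, term) pairs based on the section a term occurs in."""
    results = []
    current_attribute = None
    for line in smpc_text.splitlines():
        heading = HEADING.match(line.strip())
        if heading:
            # A new section starts: switch (or clear) the active attribute.
            current_attribute = SECTION_ATTRIBUTE.get(heading.group(1))
            continue
        if current_attribute is None:
            continue
        for term in TERMS:
            if re.search(r"\b" + re.escape(term) + r"\b", line.lower()):
                results.append((current_attribute, term))
    return results


# Two made-up SmPC fragments that mention the same terms in swapped roles.
doc_a = """4.1 Therapeutic indications
Treatment of headache.
4.8 Undesirable effects
Nausea has been reported."""

doc_b = """4.1 Therapeutic indications
Treatment of nausea.
4.8 Undesirable effects
Headache has been reported."""
```

The same two terms come out with swapped attributes for `doc_a` and `doc_b`, which a concept scheme alone could not tell apart.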
12
Challenge: anomalies
Some medicinal product names as found in official SmPCs:
“ExampleCo Vet Care Hartmann’s Lactated Ringers Solution for infusion for cattle, horses, sheep, goats, pigs, dogs and cats. (In Spain (RMS): Lactato-RingerVet solución para perfusión para bovino, equino, ovino, caprino, porcino, perros y gatos) (In Germany: Ringer-Lactat-Lösung nach Hartmann B. Braun Vet Care, Infusionslösung für Rinder, Pferde, Schafe, Ziegen, Hunde und Katzen.)”
13
Challenge: anomalies
Some medicinal product names as found in official SmPCs:
“Aminoplasmal ExampleCo 10% E; 5g/ l + 8,9g/l + 6,85g/l + 4,4g/l + 4,7g/l + 4,2g/l +1,6g/l + 6,2g/l + 11,5g/l + 3g/l + 10,5g/l +12g/l + 5,6g/l + 7,2g/l +5,5g/l +2.3g/l + 0.4g/l + 2.858g/l + 0.36g/l + 2.453 g/l + 0.508g/l + 3.581g/l solution for infusion INN: isoleucine; leucine; lysine-hyidrochloride; methionine; phenylalanine; threonine; tryptophan; valine; arginine; histidine; alanine; glycine; asparatic acid; glutamatic acid proline; serine; tyrozine; sodium- acetate trihydrate; sodium-hydroxide; potassium-acetate; magnezijum-chloride, hexahydrate; disodium phosphate dodecahydrate”
14
Challenge: nested concepts
Attribute                 Attribute value
Medicinal product name    EAU POUR PREPARATIONS INJECTABLES ExampleCo, solvant pour préparation parentérale en ampoule
Dose form name part       solvant pour préparation parentérale
Scientific name part      EAU POUR PREPARATIONS INJECTABLES
Invented name part        -
Company name part         ExampleCo
Strength name part        -
Container name part       en ampoule
Time/period name part     -
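One way to represent such a decomposition is sketched below, assuming a simple record type. The class `ProductName` and its field and method names are illustrative, not an IDMP data model. The sketch stores the name parts next to the full name and locates each part inside it, so that every part stays anchored to the text it came from.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProductName:
    full_name: str
    dose_form_part: Optional[str] = None
    scientific_part: Optional[str] = None
    invented_part: Optional[str] = None
    company_part: Optional[str] = None
    strength_part: Optional[str] = None
    container_part: Optional[str] = None
    time_period_part: Optional[str] = None

    def parts(self):
        """Return the name parts that are present, keyed by field name."""
        return {
            name: value
            for name, value in vars(self).items()
            if name != "full_name" and value is not None
        }

    def spans(self):
        """Return {part: (start, end)} offsets of each part in the full name.

        Uses the first literal occurrence; parts not found are left out.
        """
        result = {}
        for name, value in self.parts().items():
            start = self.full_name.find(value)
            if start >= 0:
                result[name] = (start, start + len(value))
        return result

    def unanchored_parts(self):
        """Return the parts that do not occur literally in the full name."""
        return [name for name in self.parts() if name not in self.spans()]


# The example from the table above.
water_for_injection = ProductName(
    full_name=(
        "EAU POUR PREPARATIONS INJECTABLES ExampleCo, "
        "solvant pour préparation parentérale en ampoule"
    ),
    dose_form_part="solvant pour préparation parentérale",
    scientific_part="EAU POUR PREPARATIONS INJECTABLES",
    company_part="ExampleCo",
    container_part="en ampoule",
)
```

Calling `water_for_injection.spans()` gives the character range of each part, and `unanchored_parts()` flags any part that was entered by hand and no longer matches the name.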
15
Challenge: MedDRA codes
“Indicated for treatment of patients with locally advanced or metastatic adenocarcinoma of the pancreas”
Pancreatic adenocarcinoma (LLT=10051971)
Pancreatic adenocarcinoma metastatic (LLT=10033599)
Solution:
• Step 1 – order concepts by relevance
• Step 2 – let user make expert judgement
“Indicated for treatment of patients with locally advanced adenocarcinoma of the pancreas”
“Indicated for treatment of patients with locally metastatic adenocarcinoma of the pancreas”
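Step 1 might look like the sketch below. The scoring is an assumption made for illustration (crude prefix stemming plus label coverage), not the ranking method used in the talk, and no MedDRA software or API is involved; only the two labels and codes from the slide are used. Step 2, the expert judgement, stays with the user.

```python
import re

# Candidate MedDRA lowest level terms (labels and codes as shown on the slide).
CANDIDATES = [
    ("Pancreatic adenocarcinoma", "10051971"),
    ("Pancreatic adenocarcinoma metastatic", "10033599"),
]


def stems(text, length=6):
    """Crude stemming (an assumption): lowercase tokens cut to a fixed prefix."""
    return {token[:length] for token in re.findall(r"[a-z]+", text.lower())}


def rank_candidates(indication_text, candidates=CANDIDATES):
    """Order candidates by how well their label is covered by the text."""
    text_stems = stems(indication_text)
    scored = []
    for label, code in candidates:
        label_stems = stems(label)
        matched = len(label_stems & text_stems)
        coverage = matched / len(label_stems)
        scored.append((coverage, matched, label, code))
    # Highest coverage first; ties go to the label with more matched words.
    scored.sort(key=lambda item: (-item[0], -item[1]))
    return [(label, code, round(coverage, 2)) for coverage, _, label, code in scored]


ambiguous = rank_candidates(
    "Indicated for treatment of patients with locally advanced or metastatic "
    "adenocarcinoma of the pancreas"
)
advanced_only = rank_candidates(
    "Indicated for treatment of patients with locally advanced "
    "adenocarcinoma of the pancreas"
)
```

For the original sentence both candidates are fully covered, so the ranking alone cannot settle the choice and the user has to decide; for the "advanced" reading the non-metastatic term ranks first.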
16
Challenge: multiple products in one SmPC
17
Challenge: multiple products in one SmPC
18
Challenge: multiple products in one SmPC
19
Some results so far
Developing a framework for measuring accuracy
• Statistics
• Representative reference sets
• Sparsity of data
Attribute                 Accuracy
ATC Code                  98.2%
Therapeutic indication    100%
Medicinal product name    100%
Dose form name part       81.2%
Scientific name part      92.0%
Invented name part        98.0%
Company name part         100%
Strength name part        79.6%
Container name part       -
Time/period name part     0%
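A per-attribute accuracy figure of this kind can be computed as sketched below, assuming accuracy means the share of reference values that the extractor reproduces exactly. That definition, the function name `accuracy_per_attribute` and the toy data are assumptions; the talk does not spell out its measurement framework. An attribute that never occurs in the reference set gets no score instead of 0%, which is one way to handle sparse data.

```python
from collections import defaultdict


def accuracy_per_attribute(reference, extracted):
    """Compare extracted values with a verified reference set.

    Both arguments are lists of {attribute: value} dicts, one per document,
    in the same order. Returns {attribute: accuracy}, with None for an
    attribute that never occurs in the reference set.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    attributes = set()
    for ref_doc, ext_doc in zip(reference, extracted):
        attributes.update(ref_doc)
        attributes.update(ext_doc)
        for attribute, expected in ref_doc.items():
            total[attribute] += 1
            if ext_doc.get(attribute) == expected:
                correct[attribute] += 1
    return {
        attribute: (correct[attribute] / total[attribute] if total[attribute] else None)
        for attribute in sorted(attributes)
    }


# Made-up toy data, two documents.
reference = [
    {"ATC Code": "C10AA01", "Strength name part": "20 mg"},
    {"ATC Code": "C10AA01", "Strength name part": "40 mg"},
]
extracted = [
    {"ATC Code": "C10AA01", "Strength name part": "20 mg", "Container name part": "blister"},
    {"ATC Code": "C10AA01", "Strength name part": "40"},
]
scores = accuracy_per_attribute(reference, extracted)
```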
20
The extractor in context
Reference Data Management Platform
Controlled vocabularies (versioned), crosswalks
Identifiers for “Simvastatin”
21
The extractor in context
Reference Data Management Platform
Controlled vocabularies (versioned), crosswalks
Identifiers for “Simvastatin”
SYSTEM A, SYSTEM B
Provisions reference data
Provisions reference data
Dataflow
Translates between vocabularies
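Translating between vocabularies through a crosswalk can be sketched as a plain lookup. Everything concrete here is hypothetical: the vocabulary names `VOCAB_A` and `VOCAB_B`, the identifiers (invented placeholders, not real codes) and the function names. A production platform would also track vocabulary versions, which this sketch omits.

```python
# Hypothetical crosswalk: one row per concept, one identifier per vocabulary.
# All identifiers are invented placeholders, not real codes.
CROSSWALK = [
    {"label": "Simvastatin", "VOCAB_A": "A-0001", "VOCAB_B": "B-SIMVA"},
    {"label": "Example substance", "VOCAB_A": "A-0002"},  # no VOCAB_B mapping yet
]


def build_index(crosswalk, source, target):
    """Map source identifiers to target identifiers for rows that have both."""
    return {
        row[source]: row[target]
        for row in crosswalk
        if source in row and target in row
    }


def translate(identifier, source, target, crosswalk=CROSSWALK):
    """Translate an identifier from the source to the target vocabulary."""
    index = build_index(crosswalk, source, target)
    if identifier not in index:
        raise KeyError(f"No {target} mapping for {source} identifier {identifier!r}")
    return index[identifier]


# SYSTEM A speaks VOCAB_A, SYSTEM B speaks VOCAB_B.
for_system_b = translate("A-0001", "VOCAB_A", "VOCAB_B")
for_system_a = translate("B-SIMVA", "VOCAB_B", "VOCAB_A")
```

Raising an error for a missing mapping, instead of passing the identifier through unchanged, keeps gaps in the crosswalk visible to the data steward.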
22
The extractor in context
Reference Data Management Platform
Controlled vocabularies (versioned), crosswalks
IDMP, IDMP, IDMP
Identifiers for “Simvastatin”
Referentials Management System
API
SYSTEM A, SYSTEM B
Provisions reference data
Provisions reference data
EMA
23
The extractor in context
Reference Data Management Platform
Controlled vocabularies (versioned), crosswalks
Identifiers for “Simvastatin”
24
The extractor in context
Reference Data Management Platform
Controlled vocabularies (versioned), crosswalks
Super-thesaurus
Extractor
Versioned extraction results
IDMP Data hub
Identifiers for “Simvastatin”
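What a versioned extraction result could record is sketched below, under assumptions: the classes `ExtractionResult` and `ExtractionHistory`, their fields and the sample text are all hypothetical and do not describe the actual product. Each result keeps the document version, the vocabulary version and the character offsets of its source text, and a manual override is appended as a new entry so the earlier result stays available for audit.

```python
from dataclasses import dataclass, field, replace
from typing import List, Optional


@dataclass(frozen=True)
class ExtractionResult:
    attribute: str
    value: str
    document_id: str
    document_version: int
    start: int  # character offsets of the supporting text in the document
    end: int
    vocabulary_version: str
    overridden_by: Optional[str] = None  # user who corrected the value, if any


@dataclass
class ExtractionHistory:
    results: List[ExtractionResult] = field(default_factory=list)

    def record(self, result):
        self.results.append(result)

    def latest(self, attribute):
        """Return the most recent result for an attribute."""
        for result in reversed(self.results):
            if result.attribute == attribute:
                return result
        raise KeyError(attribute)

    def override(self, attribute, new_value, user):
        """Append a corrected copy; earlier results are never modified."""
        corrected = replace(self.latest(attribute), value=new_value, overridden_by=user)
        self.results.append(corrected)
        return corrected

    def evidence(self, attribute, document_text):
        """Trace the latest result back to the text it was extracted from."""
        result = self.latest(attribute)
        return document_text[result.start:result.end]


# Made-up document text and extraction result.
text = "Simvastatin 20 mg film-coated tablets"
offset = text.find("20 mg")
history = ExtractionHistory()
history.record(ExtractionResult(
    attribute="Strength name part", value="20 mg",
    document_id="smpc-001", document_version=1,
    start=offset, end=offset + len("20 mg"),
    vocabulary_version="2016-05",
))
```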
25
The extractor in context
Reference Data Management Platform
Controlled vocabularies (versioned), crosswalks
Super-thesaurus
Extractor
Versioned extraction results
IDMP Data hub
SYSTEM A, SYSTEM B
Provisions reference data
Provisions reference data
Translates between vocabularies
Identifiers for “Simvastatin”
Dataflow
26
Benefits of automated extraction
Quality control and feedback on SmPC
Quality control and feedback on XEVMPD datasets
Increased consistency in mapping from text to data and back
Huge time and cost savings
27
What to look for to become IDMP-ready
When selecting extractor tools, look at:
Accuracy
• Compare verified results of manual data entry with results of automated extraction
• Evaluate random samples (p-value)
Traceability
• Trace extracted data back to the text
• Auditability
• Versioning
Ease of use
• Override extraction results manually where necessary
• Support for expert judgment where necessary
Reference data management
• Vocabularies used for extraction also to be used for data governance and enterprise data integration
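The "random samples (p-value)" check under Accuracy can be sketched as follows. The slide only mentions a p-value; the choice of a one-sided exact binomial test, the function names and the figures in the usage lines are assumptions. The idea: draw a random sample of extracted values, verify them by hand, and ask how likely so few correct values would be if the claimed accuracy were true.

```python
import random
from math import comb


def draw_sample(items, sample_size, seed=0):
    """Draw a reproducible random sample of extracted values to verify by hand."""
    return random.Random(seed).sample(list(items), sample_size)


def binomial_p_value(correct, sample_size, claimed_accuracy):
    """One-sided exact binomial test.

    Returns P(X <= correct) for X ~ Binomial(sample_size, claimed_accuracy),
    i.e. the probability of seeing this few correct values (or fewer) if the
    claimed accuracy were true. A small value casts doubt on the claim.
    """
    p = claimed_accuracy
    return sum(
        comb(sample_size, i) * p**i * (1 - p) ** (sample_size - i)
        for i in range(correct + 1)
    )


# Hypothetical evaluation: 50 sampled values, 44 verified as correct,
# tested against a claimed accuracy of 95%.
sample_ids = draw_sample(range(1000), 50)
p_value = binomial_p_value(44, 50, 0.95)
```

A fixed seed makes the sample reproducible, which helps when the evaluation itself has to be auditable.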
28
Ask