From documents to datasets and back: challenges and solutions

Apr 11, 2017

Jan Voskuil
Page 1: From documents to datasets and back: challenges and solutions

From documents to datasets and back: challenges and solutions

Jan Voskuil, CEO Taxonic

25 May 2016, Leiden

Page 2: From documents to datasets and back: challenges and solutions

Disclaimer

The views and opinions expressed in the following PowerPoint slides are those of the individual presenter and should not be attributed to Drug Information Association, Inc. (“DIA”), its directors, officers, employees, volunteers, members, chapters, councils, Special Interest Area Communities or affiliates, or any organisation with which the presenter is employed or affiliated.  These PowerPoint slides are the intellectual property of the individual presenter and are protected under the copyright laws of the United States of America and other countries. Used by permission. All rights reserved. Drug Information Association, DIA and DIA logo are registered trademarks or trademarks of Drug Information Association Inc. All other trademarks are the property of their respective owners.

Page 3: From documents to datasets and back: challenges and solutions

Disclosure Statement

I have no real or apparent relevant financial relationships to disclose
I am employed by a regulatory agency, and have nothing to disclose

Please note that DIA is not requesting a numerical amount to be entered for any disclosure, please indicate by marking the check box, and then providing the company name only for those disclosures you may have.

Will any of the relationships reported in the chart above impact your ability to present an unbiased presentation? No

In accordance with the ACPE requirements, if the disclosure statement is not completed or returned, participation in this activity will be refused.

Type of Financial Interest within last 12 months | Name of Commercial Interest
[ ] Grants/Research Funding |
[X] Stock Shareholder | Taxonic
[ ] Consulting Fees |
[ ] Employee |
[ ] Other (Receipt of Intellectual Property Rights/Patent Holder, Speaker’s Bureau) |

Page 4: From documents to datasets and back: challenges and solutions

Introduction

Jan Voskuil
• Co-founder and CEO of Taxonic
• Co-founder of OntoPharma (soon)
Semantic Web technologies and natural language processing
Involved in research projects

© 2014 DIA, Inc. All rights reserved.

Page 5: From documents to datasets and back: challenges and solutions

Documents, datasets and back

Currently, products are authorized based on textual documents (SmPC, Module 3, et cetera)
After 2016, the same products are to be authorized based on datasets
Change in the document == change in the dataset

IDMP-readiness includes being able to effectively manage documents, datasets and their interdependencies

Page 6: From documents to datasets and back: challenges and solutions

Research Question

How can we automatically process SmPCs and generate high-quality IDMP-compliant datasets based on this?
• Define criteria and measures
• Take stock of known approaches
• Set up experiments
• Productise results

Page 7: From documents to datasets and back: challenges and solutions

Disclaimer:

Not all attribute values can be obtained from documents
However, some 80% can
The rest is obtained from other information systems
IDMP Phase 1
• 55 attributes out of 72

Page 8: From documents to datasets and back: challenges and solutions

Entity extraction for Content Management

[Figure: a sample document filled with placeholder text, in which these entities are highlighted: Paris, President Hollande, European Parliament, Hollande, Van Rompuy, Brussels. An arrow labelled EXTRACTION leads to the Document Metadata below.]

Document Metadata
Places: Paris, Brussels
Institutions: European Parliament
People: Francois Hollande, Herman Van Rompuy
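A minimal sketch of this kind of extraction, assuming a simple lookup of known surface forms per concept scheme. The scheme contents, the function name `extract_metadata` and the sample sentence are illustrative assumptions, not the tooling shown in the talk.

```python
import re
from collections import defaultdict

# Hypothetical concept schemes: preferred label -> surface forms seen in text.
CONCEPT_SCHEMES = {
    "Places": {"Paris": ["Paris"], "Brussels": ["Brussels"]},
    "Institutions": {"European Parliament": ["European Parliament"]},
    "People": {
        "Francois Hollande": ["President Hollande", "Hollande"],
        "Herman Van Rompuy": ["Van Rompuy"],
    },
}

def extract_metadata(text):
    """Return {scheme name: sorted preferred labels found in the text}."""
    found = defaultdict(set)
    for scheme, concepts in CONCEPT_SCHEMES.items():
        for preferred, surface_forms in concepts.items():
            for form in surface_forms:
                # Whole-word match, so "Paris" does not fire inside "Parisian".
                if re.search(r"\b" + re.escape(form) + r"\b", text):
                    found[scheme].add(preferred)
    return {scheme: sorted(labels) for scheme, labels in found.items()}

# Made-up sample sentence containing the entities from the slide.
sample = ("President Hollande addressed the European Parliament in Brussels "
          "before returning to Paris, where Hollande met Van Rompuy.")
metadata = extract_metadata(sample)
```

Here the attribute ("Places", "People") follows directly from the scheme the concept belongs to, which is what makes this style of extraction comparatively easy.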

Page 9: From documents to datasets and back: challenges and solutions

Entity extraction for constructing datasets

[Figure: an arrow labelled EXTRACTION leads from a document to a Dataset.]

Page 10: From documents to datasets and back: challenges and solutions

Challenge: Recognizing attributes

Indication: headache
Adverse effect: nausea

Indication: nausea
Adverse effect: headache

Page 11: From documents to datasets and back: challenges and solutions

Challenge: Recognizing attributes

Indication: headache
Adverse effect: nausea

Indication: nausea
Adverse effect: headache

With extraction for content management, attributes are inferred from the concept’s concept scheme (“Paris is a location”)

With extraction for dataset generation, attributes are inferred by analyzing the document structure and the linguistic context in which the concept occurs
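To illustrate the difference, here is a rough sketch that assigns the attribute from the enclosing section rather than from the term itself. It assumes the usual SmPC section numbering (4.1 for therapeutic indications, 4.8 for undesirable effects); the term list, the function name and the two toy documents are hypothetical.

```python
import re

# Assumption: the section number determines which attribute a term fills.
SECTION_TO_ATTRIBUTE = {"4.1": "Indication", "4.8": "Adverse effect"}
# Hypothetical term list standing in for a controlled vocabulary.
TERMS = ["headache", "nausea"]

# A heading is a line that starts with a number such as "4.1".
HEADING = re.compile(r"^(\d+\.\d+)[ \t]+.*$", re.MULTILINE)

def extract_attributes(smpc_text):
    """Return (attribute, term) pairs, using the enclosing section as context."""
    results = []
    headings = list(HEADING.finditer(smpc_text))
    for i, match in enumerate(headings):
        attribute = SECTION_TO_ATTRIBUTE.get(match.group(1))
        if attribute is None:
            continue
        # The section body runs until the next heading or the end of the text.
        end = headings[i + 1].start() if i + 1 < len(headings) else len(smpc_text)
        body = smpc_text[match.end():end].lower()
        for term in TERMS:
            if re.search(r"\b" + re.escape(term) + r"\b", body):
                results.append((attribute, term))
    return results

doc_a = ("4.1 Therapeutic indications\nTreatment of headache.\n"
         "4.8 Undesirable effects\nNausea may occur.\n")
doc_b = ("4.1 Therapeutic indications\nTreatment of nausea.\n"
         "4.8 Undesirable effects\nHeadache may occur.\n")
```

The same two terms end up under opposite attributes in `doc_a` and `doc_b`, purely because of where they occur. Real documents would also need the linguistic analysis mentioned above, which this sketch leaves out.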

Page 12: From documents to datasets and back: challenges and solutions

Challenge: anomalies

Some medicinal product names as found in official SmPCs:

“ExampleCo Vet Care Hartmann’s Lactated Ringers Solution for infusion for cattle, horses, sheep, goats, pigs, dogs and cats. (In Spain (RMS): Lactato-RingerVet solución para perfusión para bovino, equino, ovino, caprino, porcino, perros y gatos) (In Germany: Ringer-Lactat-Lösung nach Hartmann B. Braun Vet Care, Infusionslösung für Rinder, Pferde, Schafe, Ziegen, Hunde und Katzen.)”

Page 13: From documents to datasets and back: challenges and solutions

Challenge: anomalies

Some medicinal product names as found in official SmPCs:

“Aminoplasmal ExampleCo 10% E; 5g/ l + 8,9g/l + 6,85g/l + 4,4g/l + 4,7g/l + 4,2g/l +1,6g/l + 6,2g/l + 11,5g/l + 3g/l + 10,5g/l +12g/l + 5,6g/l + 7,2g/l +5,5g/l +2.3g/l + 0.4g/l + 2.858g/l + 0.36g/l + 2.453 g/l + 0.508g/l + 3.581g/l solution for infusion INN: isoleucine; leucine; lysine-hyidrochloride; methionine; phenylalanine; threonine; tryptophan; valine; arginine; histidine; alanine; glycine; asparatic acid; glutamatic acid proline; serine; tyrozine; sodium- acetate trihydrate; sodium-hydroxide; potassium-acetate; magnezijum-chloride, hexahydrate; disodium phosphate dodecahydrate”

Page 14: From documents to datasets and back: challenges and solutions

Challenge: nested concepts

Attribute | Attribute value
Medicinal product name | EAU POUR PREPARATIONS INJECTABLES ExampleCo, solvant pour préparation parentérale en ampoule
Dose form name part | solvant pour préparation parentérale
Scientific name part | EAU POUR PREPARATIONS INJECTABLES
Invented name part | -
Company name part | ExampleCo
Strength name part | -
Container name part | en ampoule
Time/period name part | -
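The nesting can be made explicit by recording where each name part sits inside the full product name. The sketch below uses the values from the table; the `NamePart` class and `locate_parts` function are illustrative names, not part of IDMP or of any tool mentioned in the talk.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class NamePart:
    part_type: str
    value: Optional[str]  # None where the table shows "-"

FULL_NAME = ("EAU POUR PREPARATIONS INJECTABLES ExampleCo, "
             "solvant pour préparation parentérale en ampoule")

PARTS = [
    NamePart("Dose form name part", "solvant pour préparation parentérale"),
    NamePart("Scientific name part", "EAU POUR PREPARATIONS INJECTABLES"),
    NamePart("Invented name part", None),
    NamePart("Company name part", "ExampleCo"),
    NamePart("Strength name part", None),
    NamePart("Container name part", "en ampoule"),
    NamePart("Time/period name part", None),
]

def locate_parts(full_name, parts):
    """Map each present part to its (start, end) character span in the name."""
    spans = {}
    for part in parts:
        if part.value is None:
            continue
        start = full_name.find(part.value)
        if start == -1:
            raise ValueError(f"{part.part_type!r} not found in full name")
        spans[part.part_type] = (start, start + len(part.value))
    return spans

spans = locate_parts(FULL_NAME, PARTS)
```

Every extracted part is a sub-span of the medicinal product name, so an extractor has to recognise both the whole name and the concepts nested inside it.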

Page 15: From documents to datasets and back: challenges and solutions

Challenge: MedDRA codes

“Indicated for treatment of patients with locally advanced or metastatic adenocarcinoma of the pancreas”

“Indicated for treatment of patients with locally advanced adenocarcinoma of the pancreas”
“Indicated for treatment of patients with locally metastatic adenocarcinoma of the pancreas”

Pancreatic adenocarcinoma (LLT=10051971)
Pancreatic adenocarcinoma metastatic (LLT=10033599)

Solution:
• Step 1 – order concepts by relevance
• Step 2 – let user make expert judgement
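As a rough illustration of Step 1, candidate terms could be ordered by how much of each term label is covered by the indication text. The scoring rule, the tiny normalisation table and the function names are assumptions made for this sketch; the two terms and codes are the ones quoted on the slide.

```python
import re

# Candidate terms and codes exactly as quoted on the slide.
CANDIDATES = {
    "10051971": "Pancreatic adenocarcinoma",
    "10033599": "Pancreatic adenocarcinoma metastatic",
}

# Hypothetical normalisation table; a real system would use proper stemming.
NORMALISE = {"pancreas": "pancreatic"}

def tokens(text):
    """Lower-case the text and return its set of normalised words."""
    words = re.findall(r"[a-z]+", text.lower())
    return {NORMALISE.get(word, word) for word in words}

def rank_candidates(indication, candidates):
    """Return (code, label, score) tuples, highest label coverage first."""
    source = tokens(indication)
    scored = []
    for code, label in candidates.items():
        label_tokens = tokens(label)
        score = len(source & label_tokens) / len(label_tokens)
        scored.append((code, label, score))
    return sorted(scored, key=lambda item: item[2], reverse=True)

advanced_only = rank_candidates(
    "Indicated for treatment of patients with locally advanced "
    "adenocarcinoma of the pancreas", CANDIDATES)
ambiguous = rank_candidates(
    "Indicated for treatment of patients with locally advanced or "
    "metastatic adenocarcinoma of the pancreas", CANDIDATES)
```

For the original, ambiguous sentence both candidates score equally under this rule, which is where Step 2 comes in: the ranked list is presented to the user, who makes the expert judgement.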

Page 16: From documents to datasets and back: challenges and solutions

Challenge: multiple products in one SmPC

Page 17: From documents to datasets and back: challenges and solutions

Challenge: multiple products in one SmPC

Page 18: From documents to datasets and back: challenges and solutions

Challenge: multiple products in one SmPC

Page 19: From documents to datasets and back: challenges and solutions

Some results so far

Developing a framework for measuring accuracy
Statistics
Representative reference sets
Sparsity of data

Attribute | Accuracy
ATC Code | 98,2%
Therapeutic indication | 100%
Medicinal product name | 100%
Dose form name part | 81,2%
Scientific name part | 92,0%
Invented name part | 98,0%
Company name part | 100%
Strength name part | 79,6%
Container name part | -
Time/period name part | 0%
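One simple way to arrive at per-attribute figures like these is to compare extractor output with a manually verified reference set. The sketch below assumes both are keyed by document and attribute; the function name and the toy records are made up for illustration and are not the data behind the table.

```python
from collections import defaultdict

def accuracy_per_attribute(reference, extracted):
    """Return {attribute: fraction of reference values reproduced exactly}.

    Both arguments map (document_id, attribute) -> value. Attributes that
    never occur in the reference set get no score at all.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for (doc_id, attribute), expected in reference.items():
        total[attribute] += 1
        if extracted.get((doc_id, attribute)) == expected:
            correct[attribute] += 1
    return {attribute: correct[attribute] / total[attribute] for attribute in total}

# Toy, made-up records for illustration only.
reference = {
    ("doc1", "ATC Code"): "C10AA01",
    ("doc2", "ATC Code"): "N02BE01",
    ("doc1", "Strength name part"): "20 mg",
    ("doc2", "Strength name part"): "500 mg",
}
extracted = {
    ("doc1", "ATC Code"): "C10AA01",
    ("doc2", "ATC Code"): "N02BE01",
    ("doc1", "Strength name part"): "20 mg",
    ("doc2", "Strength name part"): "500",  # extractor dropped the unit
}
scores = accuracy_per_attribute(reference, extracted)
```

When an attribute occurs in only a handful of reference documents, its score rests on very few observations, which is one way the sparsity of data mentioned above shows up.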

Page 20: From documents to datasets and back: challenges and solutions

The extractor in context

Reference Data Management Platform
Controlled vocabularies (versioned), crosswalks

Identifiers for “Simvastatin”

Page 21: From documents to datasets and back: challenges and solutions

The extractor in context

Reference Data Management Platform
Controlled vocabularies (versioned), crosswalks

Identifiers for “Simvastatin”

[Diagram labels: SYSTEM A; SYSTEM B; Provisions reference data (twice); Dataflow; Translates between vocabularies]
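The "translates between vocabularies" role can be pictured as a crosswalk lookup. In this sketch every identifier is invented and the vocabulary names `vocab_a` and `vocab_b` are placeholders for whatever SYSTEM A and SYSTEM B actually use.

```python
# Hypothetical crosswalk: each row links the identifiers that different
# vocabularies use for the same concept. All identifiers are made up.
CROSSWALK = [
    {"label": "Simvastatin", "vocab_a": "A-0001", "vocab_b": "B-SIMVA"},
    {"label": "Paracetamol", "vocab_a": "A-0002", "vocab_b": "B-PARAC"},
]

def translate(identifier, source, target, crosswalk=CROSSWALK):
    """Translate an identifier from the source vocabulary to the target one."""
    for row in crosswalk:
        if row.get(source) == identifier:
            return row[target]
    raise KeyError(f"{identifier!r} not found in vocabulary {source!r}")

# SYSTEM A sends its own identifier; SYSTEM B receives the one it understands.
outgoing = "A-0001"
incoming = translate(outgoing, source="vocab_a", target="vocab_b")
```

Keeping the crosswalk in one central place means neither system needs to know the other's vocabulary.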

Page 22: From documents to datasets and back: challenges and solutions

The extractor in context

Reference Data Management Platform
Controlled vocabularies (versioned), crosswalks

Identifiers for “Simvastatin”

[Diagram labels: IDMP (three times); EMA; Referentials Management System; API; SYSTEM A; SYSTEM B; Provisions reference data (twice)]

Page 23: From documents to datasets and back: challenges and solutions

The extractor in context

Reference Data Management Platform
Controlled vocabularies (versioned), crosswalks

Identifiers for “Simvastatin”

Page 24: From documents to datasets and back: challenges and solutions

The extractor in context

Reference Data Management Platform
Controlled vocabularies (versioned), crosswalks

Identifiers for “Simvastatin”

[Diagram labels: Super-thesaurus; Extractor; Versioned extraction results; IDMP Data hub]
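One possible shape for a versioned extraction result is a record that stores, next to the value, the document version, the vocabulary version and the character offsets it came from. The field names, the helper functions and the sample product name below are assumptions for illustration only.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ExtractionResult:
    attribute: str
    value: str
    document_id: str
    document_version: int
    vocabulary_version: str
    start: int  # offset of the first character of the value in the text
    end: int    # offset just past the last character of the value

def make_result(attribute, value, document_id, document_version,
                vocabulary_version, text):
    """Build a result that remembers where in the text the value was found."""
    start = text.find(value)
    if start == -1:
        raise ValueError(f"{value!r} does not occur in the document text")
    return ExtractionResult(attribute, value, document_id, document_version,
                            vocabulary_version, start, start + len(value))

def trace(result, text):
    """Return the exact text span a result was extracted from."""
    return text[result.start:result.end]

# Made-up document text and identifiers.
text = "Simvastatin ExampleCo 20 mg film-coated tablets"
result = make_result("Strength name part", "20 mg", "smpc-001", 3,
                     "2016-05", text)
```

Because the record is immutable and carries both versions, a change in the document or in the vocabulary yields a new result rather than overwriting the old one.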

Page 25: From documents to datasets and back: challenges and solutions

The extractor in context

Reference Data Management Platform
Controlled vocabularies (versioned), crosswalks

Identifiers for “Simvastatin”

[Diagram labels: Super-thesaurus; Extractor; Versioned extraction results; IDMP Data hub; SYSTEM A; SYSTEM B; Provisions reference data (twice); Translates between vocabularies; Dataflow]

Page 26: From documents to datasets and back: challenges and solutions

Benefits of automated extraction

Quality control and feedback on SmPC
Quality control and feedback on XEVMPD datasets
Increased consistency in mapping from text to data and back
Huge time and cost savings

Page 27: From documents to datasets and back: challenges and solutions

What to look for to become IDMP-ready

When selecting extractor tools, look at:
Accuracy
• Compare verified results of manual data entry with results of automated extraction
• Evaluate random samples (p-value)
Traceability
• Trace extracted data back to the text
• Auditability
• Versioning
Ease of use
• Override extraction results manually where necessary
• Support for expert judgment where necessary
Reference data management
• Vocabularies used for extraction also to be used for data governance and enterprise data integration
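For the random-sample evaluation, one option is an exact one-sided binomial test of a claimed accuracy against the number of correct results in the sample. This is only a sketch of one possible reading of "p-value" here; the function name and the example numbers are assumptions.

```python
from math import comb

def binomial_p_value(correct, sample_size, claimed_accuracy):
    """One-sided exact binomial test.

    Returns the probability of seeing `correct` or fewer correct results in
    a random sample of `sample_size`, if the true accuracy really were
    `claimed_accuracy`. A small value suggests the tool falls short of it.
    """
    p = claimed_accuracy
    return sum(
        comb(sample_size, k) * p**k * (1 - p) ** (sample_size - k)
        for k in range(correct + 1)
    )

# Made-up example: 90 of 100 sampled values verified as correct,
# tested against a claimed accuracy of 98%.
p_value = binomial_p_value(correct=90, sample_size=100, claimed_accuracy=0.98)
```

With a small sample even a poor tool can look acceptable by chance, so the sample size matters as much as the observed fraction.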

Page 28: From documents to datasets and back: challenges and solutions

Ask