Multi-lingual Concept Extraction with Linked Data and Human-in-the-Loop Alfredo Alba, Anni Coden, Anna Lisa Gentile, Daniel Gruhl, Petar Ristoski, Steve Welch IBM Research

Jul 24, 2020

Transcript
Page 1

Multi-lingual Concept Extraction with Linked Data and Human-in-the-Loop

Alfredo Alba, Anni Coden, Anna Lisa Gentile, Daniel Gruhl, Petar Ristoski, Steve Welch

IBM Research

Page 2

Motivation

Page 3

Motivation

§ extract information from a novel corpus

§ what are the relevant concepts in the domain?

§ limited domain and language knowledge

§ IDEA: combine statistical techniques with user-in-the-loop

Page 4

Domain Learning Assistant

• Start with a small number of seeds (1)

• Get suggestions of new surface forms

• The user accepts or rejects each suggestion

Page 5

Finding concept candidates

The safety and efficacy of filgrastim are similar in adults and children receiving cytotoxic chemotherapy

La eficacia y la seguridad del filgrastim son similares en los adultos y en los niños tratados con quimioterapia citotóxica

La sicurezza e l’efficacia del filgrastim sono simili negli adulti e nei bambini sottoposti a chemioterapia citotossica

Die Wirksamkeit und Unbedenklichkeit von Filgrastim ist bei Erwachsenen und bei Kindern, die eine zytotoxische Chemotherapie erhalten, vergleichbar

Page 6

Finding concept candidates

Plasma elimination half-life of oral pravastatin is 1.5 to 2 hours.

L’emivita plasmatica di eliminazione del pravastatin orale è compresa tra un’ora e mezzo e due ore.

Page 7

Finding concept candidates

Candidates: {eggs, flour}

“mix eggs and flour” → mix <candidate> and <candidate>

mix <candidate> and <candidate> → “mix sugar and butter”

Candidates: {eggs, flour, sugar, butter}

“melt the butter” → melt the <candidate>
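
The eggs/flour walk-through above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the authors' implementation: a pattern here is a whole sentence with every known candidate replaced by a slot, and a slot matches exactly one whitespace-delimited token. The names `SLOT`, `extract_patterns` and `apply_patterns` are hypothetical.

```python
import re

SLOT = "<candidate>"  # hypothetical placeholder marking where a candidate occurred

def extract_patterns(sentences, candidates):
    """Turn every sentence that mentions a known candidate into a context pattern."""
    patterns = set()
    for sentence in sentences:
        tokens = sentence.split()
        if any(tok in candidates for tok in tokens):
            patterns.add(" ".join(SLOT if tok in candidates else tok for tok in tokens))
    return patterns

def apply_patterns(sentences, patterns):
    """Match each pattern against each sentence and collect the slot fillers."""
    found = set()
    for pattern in patterns:
        # Escape the literal words, then let every slot match one token.
        regex = re.escape(pattern).replace(re.escape(SLOT), r"(\S+)")
        for sentence in sentences:
            match = re.fullmatch(regex, sentence)
            if match:
                found.update(match.groups())
    return found

corpus = ["mix eggs and flour", "mix sugar and butter", "melt the butter"]
candidates = {"eggs", "flour"}

patterns = extract_patterns(corpus, candidates)      # learned from the seeds
candidates |= apply_patterns(corpus, patterns)       # sugar and butter are discovered
new_patterns = extract_patterns(corpus, candidates)  # butter now yields a new pattern
```

After one round the candidate set contains sugar and butter, and the enlarged set produces the additional pattern `melt the <candidate>`, mirroring the last line of the slide.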

Page 8

Finding concept candidates

Candidates: {uova, farina}

“amalgamare uova e farina” → amalgamare <candidate> e <candidate>

amalgamare <candidate> e <candidate> → “amalgamare zucchero e burro”

Candidates: {uova, farina, zucchero, burro}

“sciogliere il burro” → sciogliere il <candidate>

Page 9

Multi-lingual experiment

HYPOTHESIS: same behavior, regardless of the language

§ we start with very few seeds (one could be sufficient) for each language

§ we extract context patterns and use them to generate new candidates

§ we ask the user to accept/reject the candidates

§ we repeat for a fixed number of iterations in all languages
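
The four steps above amount to a loop, which might look like the sketch below. Everything in it is an assumption made for illustration: the context of a word is taken to be its immediate left and right neighbours, the user is replaced by an `accept` callback (here, membership in a small hand-written set), and the function names are hypothetical. The slides do not specify how contexts are defined or scored.

```python
def context(tokens, i):
    """Context of token i: its left and right neighbours (None at sentence edges)."""
    left = tokens[i - 1] if i > 0 else None
    right = tokens[i + 1] if i + 1 < len(tokens) else None
    return (left, right)

def suggest(corpus, lexicon):
    """Propose unknown words that occur in a context shared with a lexicon word."""
    patterns = {context(t, i) for t in corpus for i, w in enumerate(t) if w in lexicon}
    return {w for t in corpus for i, w in enumerate(t)
            if w not in lexicon and context(t, i) in patterns}

def grow_lexicon(corpus, seeds, accept, iterations):
    """Run a fixed number of suggest / accept-or-reject rounds."""
    lexicon, rejected, history = set(seeds), set(), []
    for _ in range(iterations):
        candidates = suggest(corpus, lexicon) - rejected  # never re-ask rejected items
        accepted = {c for c in candidates if accept(c)}
        rejected |= candidates - accepted
        lexicon |= accepted
        history.append(len(lexicon))                      # lexicon size after each round
    return lexicon, history

sentences = ["mix eggs and flour", "mix sugar and butter",
             "melt the sugar", "melt the butter", "preheat the oven"]
corpus = [s.split() for s in sentences]
ingredients = {"eggs", "flour", "sugar", "butter"}  # stands in for the user's judgement

lexicon, history = grow_lexicon(corpus, {"eggs"}, lambda w: w in ingredients, 4)
```

Starting from the single seed "eggs", each round adds one accepted term while "oven" is proposed once and rejected. The `history` list records the lexicon size per iteration, which is the kind of quantity a discovery-growth curve would plot.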

Page 10

Multi-lingual experiment: Drug Discovery

§ DATA: parallel corpus from the European Medicines Agency (EMEA)
  § documents related to medicinal products
  § translations into 22 official languages of the European Union
  § 1,500 documents for most of the languages
  § we used 4 languages (en, es, it, de)

§ TASK: build a lexicon of clinical drugs

§ user-in-the-loop simulated by constructing a Gold Standard (GS) of drug names extracted from Linked Open Data (we used DBpedia http://dbpedia.org)

Page 11

Drug Discovery: One seed

§ initial seeds: single seed
  § One drug name which appears in each corpus (e.g. “irbesartan”)

§ 20 iterations

§ learning curves for all languages are comparable

Discovery growth for glimpse for English (en), Italian (it), Spanish (es) and German (de). Average correlation amongst all languages r = 0.998.
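
One way to arrive at a single figure like the average correlation in the caption above is to compute Pearson's r for every pair of per-language curves and average the results. The sketch below assumes that reading; the slides do not say exactly how the average was formed. The curve values are made up for illustration and are not the data behind r = 0.998.

```python
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def average_pairwise_correlation(curves):
    """Mean of Pearson's r over all unordered pairs of curves."""
    pairs = list(combinations(curves, 2))  # iterating a dict yields its keys
    return sum(pearson(curves[a], curves[b]) for a, b in pairs) / len(pairs)

# Made-up cumulative lexicon sizes per iteration, NOT the values from the slides.
curves = {
    "en": [1, 4, 9, 15, 20],
    "it": [1, 3, 8, 14, 19],
    "es": [1, 5, 10, 15, 21],
    "de": [1, 3, 7, 13, 18],
}

average_r = average_pairwise_correlation(curves)
```

Because every curve is a cumulative count, the pairwise correlations tend to be high; a value close to 1 says the curves have the same shape, not that the languages yield the same number of terms.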

Page 12

Drug Discovery: Linked Data seeds

§ initial seeds: 20% of available Linked Data (DBpedia)

§ 5-fold validation (randomly selected 20%, same drugs for all languages)

§ choice of initial seeds does not impact the results

Discovery growth with 5-fold cross validation on the EMEA dataset using DBpedia as seeds. Each plot shows the discovery growth for each of the randomly generated 5 folds and reports the Pearson correlation (r) amongst them.
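
A possible way to draw the five seed sets is sketched below. It assumes the folds are disjoint 20% slices of the gold standard and that one fixed split is reused for every language. The slides only state that the 20% was randomly selected, so both points are assumptions, as are the function name `make_seed_folds` and the sample drug names.

```python
import random

def make_seed_folds(gold_standard, k=5, rng_seed=0):
    """Shuffle the gold standard once and cut it into k disjoint seed sets."""
    items = sorted(gold_standard)           # sort first so the shuffle is reproducible
    random.Random(rng_seed).shuffle(items)
    return [set(items[i::k]) for i in range(k)]

# Illustrative drug names only, not the actual DBpedia gold standard.
gold_standard = {
    "irbesartan", "filgrastim", "pravastatin", "ibuprofen", "metformin",
    "warfarin", "omeprazole", "atorvastatin", "aspirin", "insulin",
}

folds = make_seed_folds(gold_standard)
# Each fold would seed one run; the remaining 80% acts as the held-out target.
runs = [(seeds, gold_standard - seeds) for seeds in folds]
```

Fixing `rng_seed` keeps the folds identical across languages, which matches the "same drugs for all languages" condition on the slide.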

Page 13

Drug Discovery: benefit of Linked Data

§ glimpse → one manually provided seed

§ glimpseLD → Linked Data seeds

§ in 10 iterations glimpseLD can cover the same lexicon that would take more than 20 iterations with glimpse

Human-in-the-loop experiment with a subject matter expert (physician)

Page 14

Multi-lingual experiment: Colors

§ DATA: Twitter stream, 1st-14th of January 2016, lang: En, De, Es, It
  § contain at least one mention of a color

§ gold standard lists of colors from Wikidata and DBpedia

§ balanced dataset sizes in different languages

§ 155,828 tweets per language

§ TASK: expand the lexicon of colors

§ user-in-the-loop: 4 native speakers, 10 iterations

Page 15

Multi-lingual experiment: Colors

§ new color items extracted from Twitter data:
  § German: 5
  § Italian: 5
  § English: 19
  § Spanish: 22
    § azulgrana
    § rojo vivo
    § “limn” (in place of the color limón)

Page 16

Conclusions

WHAT

§ knowledge resources are never complete/exhaustive

§ construct/improve dictionaries from text corpora

HOW

§ iterative and purely statistical algorithm

§ no feature extraction required

§ comparable behavior for different languages

§ organically incorporates human feedback

Page 17

Multi-lingual Concept Extraction with Linked Data and Human-in-the-Loop

IBM Research

[email protected] @AnLiGentile

Alfredo Alba, Anni Coden, Anna Lisa Gentile, Daniel Gruhl, Petar Ristoski, Steve Welch