
Supervised Open Information Extraction - Gabriel Stanovsky

Aug 24, 2020


Open Information Extraction
Aims to extract asserted propositions from unstructured text:

“Barack Obama, a former U.S. president, was born in Hawaii.”

1. (Barack Obama; was born in; Hawaii)
2. (a former U.S. president; was born in; Hawaii)

BIO Encoding
Each tuple is encoded with respect to a single predicate, where argument labels indicate their position in the tuple.

Barack/A0-B Obama/A0-I ,/O a/A0-B former/A0-I U.S./A0-I president/A0-I ,/O was/P-B born/P-I in/P-I Hawaii/A1-B
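The labelling above can be reproduced with a few lines of Python. This is a minimal sketch rather than the authors' code: the function name `encode_bio` and the `(role, start, end)` span layout are assumptions made for illustration.

```python
def encode_bio(tokens, spans):
    """Label tokens with ROLE-B / ROLE-I inside a span and O elsewhere.

    `spans` is a list of (role, start, end) triples with `end` exclusive.
    This span layout is a hypothetical choice, not the poster's format.
    """
    labels = ["O"] * len(tokens)
    for role, start, end in spans:
        for i in range(start, end):
            # The first token of a span gets B, every later token gets I.
            labels[i] = f"{role}-B" if i == start else f"{role}-I"
    return labels


tokens = ["Barack", "Obama", ",", "a", "former", "U.S.", "president", ",",
          "was", "born", "in", "Hawaii"]
# A0 appears twice because the predicate takes part in two propositions.
spans = [("A0", 0, 2), ("A0", 3, 7), ("P", 8, 11), ("A1", 11, 12)]
labels = encode_bio(tokens, spans)

for token, label in zip(tokens, labels):
    print(f"{token}/{label}", end=" ")
print()
```

The two `A0` spans and the three-token `P` span correspond to the repeated argument label and the multi-word predicate in the example sentence.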

RNN-OIE: Bi-LSTM Sequence Tagger
Inspired by recent state of the art in Semantic Role Labelling (Zhou and Xu, 2015; He et al., 2017).

Features: Concatenated pretrained embeddings of current word and target predicate (identified by a verb POS).
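The feature construction can be sketched as follows. This is only an illustration under stated assumptions: the toy two-dimensional embedding table, the `unk` fallback vector, and the function name `build_features` are hypothetical, and the predicate position is passed in by hand instead of being found with a POS tagger.

```python
def build_features(tokens, predicate_index, embeddings, unk):
    """Return one feature vector per token: [word embedding ; predicate embedding].

    `embeddings` maps a word to a list of floats; `unk` is the vector used for
    words missing from the table. Both are illustrative assumptions.
    """
    predicate_vec = embeddings.get(tokens[predicate_index], unk)
    # List concatenation stands in for vector concatenation.
    return [embeddings.get(token, unk) + predicate_vec for token in tokens]


# Toy 2-dimensional "pretrained" embeddings (made-up numbers).
toy_embeddings = {
    "Obama": [0.1, 0.2],
    "was": [0.3, 0.4],
    "born": [0.5, 0.6],
}
unk_vector = [0.0, 0.0]

sentence = ["Obama", "was", "born", "in", "Hawaii"]
features = build_features(sentence, predicate_index=2,
                          embeddings=toy_embeddings, unk=unk_vector)
```

Every token's vector ends with the same predicate embedding, so the tagger sees which predicate the current labelling pass refers to.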

Decoding: Ignores malformed spans - if an A0-I label is not preceded by A0-I or A0-B, we treat it as O.
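The decoding rule can be sketched as below. The poster states the rule for A0-I only; applying it to every role is an assumption of this sketch, and the function names `repair_labels` and `decode_spans` are hypothetical.

```python
def repair_labels(labels):
    """Turn any X-I label that does not continue an X span into O."""
    repaired = []
    previous = "O"
    for label in labels:
        if label.endswith("-I"):
            role = label[:-2]
            if previous not in (f"{role}-B", f"{role}-I"):
                label = "O"  # malformed continuation: ignore it
        repaired.append(label)
        previous = label
    return repaired


def decode_spans(tokens, labels):
    """Group repaired labels into (role, text) spans, in sentence order."""
    spans = []
    for token, label in zip(tokens, repair_labels(labels)):
        if label == "O":
            continue
        role, position = label.rsplit("-", 1)
        if position == "B":
            spans.append((role, [token]))
        else:
            # After repair, an I label always continues the latest span.
            spans[-1][1].append(token)
    return [(role, " ".join(words)) for role, words in spans]


tokens = ["Obama", "was", "born", "in", "Hawaii"]
noisy = ["A0-I", "P-B", "P-I", "P-I", "A1-B"]  # A0-I has no preceding A0-B
decoded = decode_spans(tokens, noisy)
```

Here the stray `A0-I` on "Obama" is dropped, so the decoded output contains only the predicate and the `A1` argument.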

Confidence: Estimated for an extraction E by $\prod_{w \in E} p(w)$
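The confidence score is a product of probabilities, which can be sketched in a few lines. The poster does not spell out exactly which probabilities enter the product; this sketch assumes one tagger probability per labelled element of the extraction, and the function name `extraction_confidence` is hypothetical.

```python
import math


def extraction_confidence(label_probs):
    """Multiply the per-element probabilities of one extraction.

    `label_probs` is assumed to hold the tagger's probability for each
    labelled element of the extraction (an illustrative assumption).
    """
    return math.prod(label_probs)


# Made-up probabilities for a three-element extraction.
confidence = extraction_confidence([0.9, 0.8, 0.5])
```

Because every factor is at most 1, adding elements to an extraction can never raise its score under this scheme.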

Supervised Open Information Extraction
Gabriel Stanovsky 2,3, Julian Michael 2, Luke Zettlemoyer 2, Ido Dagan 1

github.com/gabrielStanovsky/supervised-oie

Training Data
We used the QA-SRL to Open IE conversion (OIE2016, Stanovsky and Dagan, 2016) to train our model. This consists of verbal propositions, automatically extracted from template QA-SRL annotations.

Augmenting with QAMR annotations
In addition, we converted the Question-Answer Meaning Representation (QAMR) bank (Michael et al., 2018 – Come see our poster tomorrow!), which consists of free-form question-answer pairs over a wide range of predicates. The conversion was achieved with heuristics over the QA parse tree.

Resulting Training Corpus

Evaluation
We compare RNN-OIE against top-performing Open IE systems:

RNN-OIE performs competitively across all test sets, outperforming all other systems on the larger test sets. QAMR improves performance, especially on more diverse test sets.

Run-time Analysis
RNN-OIE is able to leverage GPU architecture to achieve a roughly 10 times improvement over the previous fastest system (149.25 vs. 15.38 sentences per second).

1 Bar-Ilan University
2 University of Washington
3 Allen Institute for Artificial Intelligence

Labels repeat when a single predicate participates in multiple propositions

Multi-word predicates are allowed

Test Data
We test our model on four publicly available Open IE corpora, following Schneider et al. (2017).

Training corpus:

Dataset   Domain           #Sentences   #Tuples
OIE2016   News, Wiki       3200         5077
QAMR      Wikinews, Wiki   3300         12952

Test corpora:

Dataset   Domain       #Sentences   #Tuples
OIE2016   News, Wiki   3200         1729
WEB       News, Web    500          461
NYT       News, Wiki   222          222
PENN      Mixed        100          51

Run-time (sentences per second):

       ClausIE   PropS   Open IE4   RNN-OIE
CPU    4.07      4.59    15.38      13.51
GPU    ---       ---     ---        149.25
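The speed-up quoted in the run-time analysis can be checked directly from these figures. The short sketch below uses only the numbers in the table; the variable names are illustrative.

```python
# Sentences per second on CPU for the previously available systems.
previous_systems = {"ClausIE": 4.07, "PropS": 4.59, "Open IE4": 15.38}
rnn_oie_cpu = 13.51
rnn_oie_gpu = 149.25

fastest_name = max(previous_systems, key=previous_systems.get)
fastest_rate = previous_systems[fastest_name]

# Ratio of RNN-OIE on GPU to the fastest previous system.
gpu_speedup = rnn_oie_gpu / fastest_rate
# On CPU alone, RNN-OIE is slightly slower than that system.
cpu_ratio = rnn_oie_cpu / fastest_rate

print(f"{fastest_name}: GPU speed-up {gpu_speedup:.2f}x, CPU ratio {cpu_ratio:.2f}x")
```

The GPU figure comes out a little under 10x relative to Open IE4, while on CPU RNN-OIE runs at somewhat below Open IE4's rate, so the gain comes from the GPU.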

Error Analysis
An analysis of 100 gold propositions which were missed by all systems (i.e., recall errors) reveals that they all struggle with noun relations, sentence-level inference and long or informal sentences.

Error categories (figure labels): noun predicate, sentence-level inference, long sentences, nominalization, noisy / informal, PP attachment.

Andre Agassi did a similar thing in his hometown of Las Vegas.

(Andre Agassi; hometown; Las Vegas)

John Steinbeck also earned a lot of awards, one being the Pulitzer Prize.

(John Steinbeck; earned; Pulitzer Prize)

Figure: area under the PR curve on OIE2016, WEB, NYT and PENN for ClausIE, PropS, Open IE4, RNN-OIE-verb and RNN-OIE-QAMR. Bar labels in extraction order: 38 40 23 28 34 45 22 28 42 45 24 28 45 23 9 21 48 47 25 26.