Scaling Up BioNLP: Application of a Text Annotation Architecture to Noun Compound Bracketing Preslav Nakov, Ariel Schwartz, Brian Wolf, Marti Hearst Computer Science Division and SIMS University of California, Berkeley http://biotext.berkeley.edu Supported by NSF DBI-0317510 and a gift from Genentech


Dec 31, 2015


Cole Pollard

Page 1: Scaling Up BioNLP: Application  of a Text Annotation Architecture  to Noun Compound Bracketing

Scaling Up BioNLP: Application of a Text Annotation Architecture to Noun Compound Bracketing

Preslav Nakov, Ariel Schwartz, Brian Wolf, Marti Hearst

Computer Science Division and SIMS, University of California, Berkeley

http://biotext.berkeley.edu

Supported by NSF DBI-0317510 and a gift from Genentech

Page 2: Scaling Up BioNLP: Application  of a Text Annotation Architecture  to Noun Compound Bracketing

Plan

Overview

Noun compound (NC) bracketing

Problems with Web Counts

Layers of annotation

Applying LQL to NC bracketing

Evaluation

Page 3: Scaling Up BioNLP: Application  of a Text Annotation Architecture  to Noun Compound Bracketing

Overview

Motivation: Need to re-use the results of NLP processing
  for additional processing
  for end applications: data mining, etc.

Proposed solution: Layers of annotations over text

Illustration: Application to noun compound bracketing

Page 4: Scaling Up BioNLP: Application  of a Text Annotation Architecture  to Noun Compound Bracketing

Plan

Overview

Noun compound (NC) bracketing

Problems with Web Counts

Layers of annotation

Applying LQL to NC bracketing

Evaluation

Page 5: Scaling Up BioNLP: Application  of a Text Annotation Architecture  to Noun Compound Bracketing

Noun Compound Bracketing

(a) [ [ liver cell ] antibody ] (left bracketing)

(b) [ liver [cell line] ] (right bracketing)

In (a), the antibody targets the liver cell. In (b), the cell line is derived from the liver.

Page 6: Scaling Up BioNLP: Application  of a Text Annotation Architecture  to Noun Compound Bracketing

Related Work

Pustejovsky et al. (1993) adjacency model: Pr(w1|w2) vs. Pr(w2|w3)

Lauer (1995) dependency model: Pr(w1|w2) vs. Pr(w1|w3)

Keller & Lapata (2004): use the Web
  unigrams and bigrams

Nakov & Hearst (2005): will be presented at CoNLL!
  use the Web, chi-squared
  n-grams
  paraphrases
  surface features
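A minimal sketch of how the two n-gram models could turn counts into a bracketing decision. It assumes the usual reading of the models: adjacency compares the w1-w2 association with the w2-w3 association, while dependency compares w1-w2 with w1-w3. The counts, the function names, and the tie-breaking rule (ties go to left) are all invented for illustration and do not come from the slides.

```python
from typing import Dict, Tuple

# Hypothetical counts, invented purely for illustration (not corpus data).
UNIGRAMS: Dict[str, int] = {
    "liver": 5000, "cell": 20000, "antibody": 3000, "line": 8000,
}
BIGRAMS: Dict[Tuple[str, str], int] = {
    ("liver", "cell"): 400,
    ("cell", "antibody"): 30,
    ("liver", "antibody"): 5,
    ("cell", "line"): 2500,
    ("liver", "line"): 400,
}


def pr(modifier: str, head: str) -> float:
    """Estimate Pr(modifier | head) as #(modifier head) / #(head)."""
    head_count = UNIGRAMS.get(head, 0)
    if head_count == 0:
        return 0.0
    return BIGRAMS.get((modifier, head), 0) / head_count


def adjacency_model(w1: str, w2: str, w3: str) -> str:
    """Compare the w1-w2 association with the w2-w3 association."""
    # Assumption: ties default to left bracketing.
    return "left" if pr(w1, w2) >= pr(w2, w3) else "right"


def dependency_model(w1: str, w2: str, w3: str) -> str:
    """Compare the w1-w2 association with the w1-w3 association."""
    # Assumption: ties default to left bracketing.
    return "left" if pr(w1, w2) >= pr(w1, w3) else "right"


for compound in [("liver", "cell", "antibody"), ("liver", "cell", "line")]:
    print(compound, adjacency_model(*compound), dependency_model(*compound))
```

With these made-up counts both models bracket "liver cell antibody" to the left and "liver cell line" to the right; real counts can of course make the two models disagree.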

Page 7: Scaling Up BioNLP: Application  of a Text Annotation Architecture  to Noun Compound Bracketing

Nakov & Hearst (2005)

Web page hits: proxy for n-gram frequencies

Sample surface features
  amino-acid sequence    left
  brain stem's cell      left
  brain's stem cell      right

Majority vote to combine different models

Accuracy 89.34%
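The voting step can be sketched as follows. This is only an illustration of a simple majority vote over per-model predictions; the slide does not say how abstentions or ties were handled, so ignoring abstaining models and returning no decision on a tie are assumptions, as is the function name.

```python
from collections import Counter
from typing import Iterable, Optional


def majority_vote(votes: Iterable[Optional[str]]) -> Optional[str]:
    """Combine per-model predictions ("left", "right", or None for abstain).

    Returns the label with strictly more votes, or None when the vote is
    tied or every model abstained (assumed behaviour, not from the slides).
    """
    counts = Counter(v for v in votes if v in ("left", "right"))
    if counts["left"] > counts["right"]:
        return "left"
    if counts["right"] > counts["left"]:
        return "right"
    return None


# Three hypothetical models: two say left, one says right.
print(majority_vote(["left", "right", "left"]))
```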

Page 8: Scaling Up BioNLP: Application  of a Text Annotation Architecture  to Noun Compound Bracketing

Plan

Overview

Noun compound (NC) bracketing

Problems with Web Counts

Layers of annotation

Applying LQL to NC bracketing

Evaluation

Page 9: Scaling Up BioNLP: Application  of a Text Annotation Architecture  to Noun Compound Bracketing

Web Counts: Problems

The Web lacks linguistic annotation
  Pr(health|care) = #(“health care”) / #(care)
  “health”: returns nouns
  “care”: returns both verbs and nouns
  the two words can be adjacent by chance
  the two words can come from different sentences

Cannot find:
  stem cells VERB PREPOSITION brain
  protein synthesis' inhibition

Page hits are inaccurate

Page 10: Scaling Up BioNLP: Application  of a Text Annotation Architecture  to Noun Compound Bracketing

Plan

Overview

Noun compound (NC) bracketing

Problems with Web Counts

Layers of annotation

Applying LQL to NC bracketing

Evaluation

Page 11: Scaling Up BioNLP: Application  of a Text Annotation Architecture  to Noun Compound Bracketing

Solution: MEDLINE+LQL

MEDLINE: ~13 million abstracts

We annotated:
  1.4 million abstracts
  ~10 million sentences
  ~320 million annotations

Layered Query Language: demo at ACL!
  http://biotext.berkeley.edu/lql/

Page 12: Scaling Up BioNLP: Application  of a Text Annotation Architecture  to Noun Compound Bracketing

The System

Built on top of an RDBMS

Supports layers of annotations over text
  hierarchical, overlapping
  cannot be represented by a single-file XML

Specialized query language
  LQL (Layered Query Language)
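To make the "overlapping" point concrete, here is a small stand-off annotation sketch. It is not the system's actual schema: the `Annotation` class, the character-offset convention, and the two `*_demo` layers are hypothetical. It shows two spans that cross without nesting, which inline markup in a single XML file cannot express because XML elements must nest properly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    layer: str     # e.g. "pos" or "shallow_parse"
    tag_type: str  # e.g. "noun" or "NP"
    start: int     # character offset, inclusive
    end: int       # character offset, exclusive


TEXT = "liver cell line"

ANNOTATIONS = [
    Annotation("shallow_parse", "NP", 0, 15),
    Annotation("pos", "noun", 0, 5),
    Annotation("pos", "noun", 6, 10),
    Annotation("pos", "noun", 11, 15),
    # Two hypothetical layers whose spans cross each other:
    Annotation("term_demo", "term", 0, 10),      # "liver cell"
    Annotation("bracket_demo", "unit", 6, 15),   # "cell line"
]


def contains(outer: Annotation, inner: Annotation) -> bool:
    """True when inner lies entirely inside outer (hierarchical case)."""
    return outer.start <= inner.start and inner.end <= outer.end


def crosses(a: Annotation, b: Annotation) -> bool:
    """True when the spans overlap but neither contains the other."""
    overlap = a.start < b.end and b.start < a.end
    return overlap and not contains(a, b) and not contains(b, a)


np_chunk, term, unit = ANNOTATIONS[0], ANNOTATIONS[4], ANNOTATIONS[5]
print(contains(np_chunk, term), crosses(term, unit))
```

Storing each annotation as its own row of offsets, as a relational database naturally does, keeps both the nested and the crossing cases representable.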

Page 13: Scaling Up BioNLP: Application  of a Text Annotation Architecture  to Noun Compound Bracketing

Annotated Example

Page 14: Scaling Up BioNLP: Application  of a Text Annotation Architecture  to Noun Compound Bracketing

Plan

Overview

Noun compound (NC) bracketing

Problems with Web Counts

Layers of annotation

Applying LQL to NC bracketing

Evaluation

Page 15: Scaling Up BioNLP: Application  of a Text Annotation Architecture  to Noun Compound Bracketing

Noun Compound Extraction (1)

FROM
  [layer='shallow_parse' && tag_type='NP'
    ^ [layer='pos' && tag_type="noun"]
      [layer='pos' && tag_type="noun"]
      [layer='pos' && tag_type="noun"] $
  ] AS compound
SELECT compound.content

Callouts:
  ^  layers' beginnings should match
  $  layers' endings should match
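For readers without access to LQL, the effect of this query can be approximated in plain Python. The sketch below is an illustration, not the LQL engine: the `Span` tuple, token-index offsets, and the function name are assumptions. It keeps an NP chunk only when it is exactly three noun tokens long, which is what matching both the beginning and the ending of the layers amounts to.

```python
from typing import List, NamedTuple


class Span(NamedTuple):
    layer: str
    tag_type: str
    start: int  # token index, inclusive
    end: int    # token index, exclusive


def three_noun_nps(tokens: List[str], spans: List[Span]) -> List[str]:
    """Return the text of every NP that consists of exactly three nouns."""
    noun_positions = {
        s.start
        for s in spans
        if s.layer == "pos" and s.tag_type == "noun" and s.end == s.start + 1
    }
    found = []
    for s in spans:
        if s.layer != "shallow_parse" or s.tag_type != "NP":
            continue
        is_three_long = s.end - s.start == 3
        if is_three_long and all(i in noun_positions for i in range(s.start, s.end)):
            found.append(" ".join(tokens[s.start:s.end]))
    return found


tokens = ["liver", "cell", "line", "expresses", "the", "bone", "marrow", "cells"]
spans = [
    Span("shallow_parse", "NP", 0, 3),
    Span("shallow_parse", "NP", 4, 8),
    Span("pos", "noun", 0, 1), Span("pos", "noun", 1, 2), Span("pos", "noun", 2, 3),
    Span("pos", "verb", 3, 4), Span("pos", "DT", 4, 5),
    Span("pos", "noun", 5, 6), Span("pos", "noun", 6, 7), Span("pos", "noun", 7, 8),
]
print(three_noun_nps(tokens, spans))
```

The second NP ("the bone marrow cells") is skipped because it starts with a determiner, so its beginning does not line up with the first noun.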

Page 16: Scaling Up BioNLP: Application  of a Text Annotation Architecture  to Noun Compound Bracketing

Noun Compound Extraction (2)

SELECT LOWER(compound.content) AS lc, COUNT(*) AS freq
FROM
BEGIN_LQL
  FROM
    [layer='shallow_parse' && tag_type='NP'
      ^ [layer='pos' && tag_type="noun"]
        [layer='pos' && tag_type="noun"]
        [layer='pos' && tag_type="noun"] $
    ] AS compound
  SELECT compound.content
END_LQL
GROUP BY lc
ORDER BY freq DESC
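The outer SQL wrapper is ordinary aggregation over whatever the embedded LQL block returns. As a rough sketch, the same SELECT / GROUP BY / ORDER BY can be run with Python's built-in sqlite3 module against a stand-in table. The table name `lql_result`, the helper function, and the sample rows are hypothetical; SQLite is used only for illustration and is not the RDBMS named in the slides.

```python
import sqlite3
from typing import List, Tuple


def compound_frequencies(contents: List[str]) -> List[Tuple[str, int]]:
    """Lower-case the extracted compounds and count them, most frequent first."""
    conn = sqlite3.connect(":memory:")
    try:
        # Stand-in for the rows produced by the BEGIN_LQL ... END_LQL block.
        conn.execute("CREATE TABLE lql_result (content TEXT)")
        conn.executemany(
            "INSERT INTO lql_result (content) VALUES (?)",
            [(c,) for c in contents],
        )
        cursor = conn.execute(
            "SELECT LOWER(content) AS lc, COUNT(*) AS freq "
            "FROM lql_result GROUP BY lc ORDER BY freq DESC"
        )
        return cursor.fetchall()
    finally:
        conn.close()


print(compound_frequencies(
    ["Bone marrow cells", "bone marrow cells", "liver cell line"]
))
```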

Page 17: Scaling Up BioNLP: Application  of a Text Annotation Architecture  to Noun Compound Bracketing

Noun Compound Extraction (3)

SELECT LOWER(compound.content) AS lc, COUNT(*) AS freq
FROM
BEGIN_LQL
  FROM
    [layer='shallow_parse' && tag_type='NP'
      ^ ( { ALLOW GAPS }
          ![layer='pos' && tag_type="noun"]
          ( [layer='pos' && tag_type="noun"]
            [layer='pos' && tag_type="noun"]
            [layer='pos' && tag_type="noun"] ) $
        ) $
    ] AS compound
  SELECT compound.content
END_LQL
GROUP BY lc
ORDER BY freq DESC

Callouts:
  !        layer negation
  ( ... )  artificial range

Page 18: Scaling Up BioNLP: Application  of a Text Annotation Architecture  to Noun Compound Bracketing

Finding Bigram Counts

SELECT COUNT(*) AS freq
FROM
BEGIN_LQL
  FROM
    [layer='shallow_parse' && tag_type='NP'
      [layer='pos' && tag_type="noun" && content="immunodeficiency"] AS word1
      [layer='pos' && tag_type="noun" && (content="virus" || content="viruses")]
    ]
  SELECT word1.content
END_LQL

Page 19: Scaling Up BioNLP: Application  of a Text Annotation Architecture  to Noun Compound Bracketing

Paraphrases

Types of paraphrases (Warren, 1978):

Prepositional
  immunodeficiency virus in humans           right

Verbal
  virus causing human immunodeficiency       left
  immunodeficiency virus found in humans     right

Copula
  immunodeficiency virus that is human       right

Page 20: Scaling Up BioNLP: Application  of a Text Annotation Architecture  to Noun Compound Bracketing

Prepositional Paraphrases

SELECT LOWER(prep.content) lp, COUNT(*) AS freq
FROM
BEGIN_LQL
  FROM
    [layer='sentence'
      [layer='pos' && tag_type="noun" && content = "immunodeficiency"]
      [layer='pos' && tag_type="noun" && content IN ("virus","viruses")]
      [layer='pos' && tag_type='IN'] AS prep
      ?[layer='pos' && tag_type='DT' && content IN ("the","a","an")]
      [layer='pos' && tag_type="noun" && content IN ("human", "humans")]
    ]
  SELECT prep.content
END_LQL
GROUP BY lp
ORDER BY freq DESC

Callout:
  ?  optional layer
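The same pattern can be mimicked over POS-tagged sentences in plain Python. This is a sketch under assumptions, not the LQL implementation: sentences are lists of (word, tag) pairs, the tag names mirror those in the query, matching is on exact strings, and the function name is made up. It counts the preposition that links "immunodeficiency virus(es)" to "human(s)", with an optional determiner in between.

```python
from collections import Counter
from typing import List, Tuple

Token = Tuple[str, str]  # (word, tag)


def prepositional_paraphrases(sentences: List[List[Token]]) -> Counter:
    """Count prepositions in 'immunodeficiency virus(es) PREP [DT] human(s)'."""
    counts: Counter = Counter()
    for sent in sentences:
        # The shortest match needs four tokens: noun noun IN noun.
        for i in range(len(sent) - 3):
            (w1, t1), (w2, t2), (prep, t3) = sent[i], sent[i + 1], sent[i + 2]
            if not (t1 == "noun" and w1 == "immunodeficiency"):
                continue
            if not (t2 == "noun" and w2 in ("virus", "viruses")):
                continue
            if t3 != "IN":
                continue
            j = i + 3
            # Optional determiner, like the "?" layer in the query.
            if sent[j][1] == "DT" and sent[j][0] in ("the", "a", "an"):
                j += 1
            if j < len(sent) and sent[j][1] == "noun" and sent[j][0] in ("human", "humans"):
                counts[prep.lower()] += 1
    return counts


example = [
    [("immunodeficiency", "noun"), ("virus", "noun"), ("in", "IN"), ("humans", "noun")],
    [("immunodeficiency", "noun"), ("viruses", "noun"), ("of", "IN"),
     ("the", "DT"), ("human", "noun")],
    [("immunodeficiency", "noun"), ("virus", "noun"), ("In", "IN"), ("humans", "noun")],
]
print(prepositional_paraphrases(example).most_common())
```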

Page 21: Scaling Up BioNLP: Application  of a Text Annotation Architecture  to Noun Compound Bracketing

Plan

Overview

Noun compound (NC) bracketing

Problems with Web Counts

Layers of annotation

Applying LQL to NC bracketing

Evaluation

Page 22: Scaling Up BioNLP: Application  of a Text Annotation Architecture  to Noun Compound Bracketing

Evaluation

obtained 418,678 noun compounds (NCs)

annotated the top 232 NCs (after cleaning)
  agreement 88%
  kappa .606

baseline (left): 83.19%

n-grams: Pr, #, χ2

prepositional paraphrases

for inflections, we used UMLS
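The relation between raw agreement and kappa can be illustrated with a short computation. The slide does not say which kappa variant was used; the sketch below assumes Cohen's kappa for two annotators, kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is the agreement expected by chance. The label lists are invented, not the annotation data.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each annotator's own label distribution.
    expected = sum(freq_a[lab] * freq_b[lab] for lab in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Invented example: 10 items, the annotators disagree on one of them.
annotator_a = ["left"] * 8 + ["right"] * 2
annotator_b = ["left"] * 7 + ["right"] * 3
print(round(cohen_kappa(annotator_a, annotator_b), 3))
```

Under the same assumption, 88% agreement with kappa .606 implies a chance agreement of about 0.70, which is what a strongly left-skewed label distribution would produce.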

Page 23: Scaling Up BioNLP: Application  of a Text Annotation Architecture  to Noun Compound Bracketing

Results

Results chart (legend: correct, N/A, wrong)

Page 24: Scaling Up BioNLP: Application  of a Text Annotation Architecture  to Noun Compound Bracketing

Discussion

Semantics of bone marrow cells
  top verbal paraphrases
    cells derived from bone marrow (22 instances)
    cells isolated from bone marrow (14 instances)
  top prepositional paraphrases
    cells in bone marrow (456 instances)
    cells from bone marrow (108 instances)

Finding hard examples for NC bracketing
  w1w2w3 such that both w1w2 and w2w3 are MeSH terms
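The hard-example filter can be sketched as a simple set lookup. The term set below is a placeholder made of illustrative strings, not a verified extract of MeSH, and the function name is hypothetical; a real run would load the actual MeSH vocabulary.

```python
from typing import Iterable, List, Set, Tuple

Compound = Tuple[str, str, str]


def hard_examples(compounds: Iterable[Compound], terms: Set[str]) -> List[Compound]:
    """Keep w1 w2 w3 when both 'w1 w2' and 'w2 w3' are known terms."""
    return [
        (w1, w2, w3)
        for (w1, w2, w3) in compounds
        if f"{w1} {w2}" in terms and f"{w2} {w3}" in terms
    ]


# Placeholder term list for illustration only (not checked against MeSH).
demo_terms = {"bone marrow", "marrow cells", "cell line"}
demo_compounds = [
    ("bone", "marrow", "cells"),
    ("liver", "cell", "line"),
]
print(hard_examples(demo_compounds, demo_terms))
```

Such compounds are hard because both candidate bracketings are supported by an established two-word term, so term-lookup evidence alone cannot decide between them.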

Page 25: Scaling Up BioNLP: Application  of a Text Annotation Architecture  to Noun Compound Bracketing

The End

Thank you!