Scaling Up BioNLP: Application of a Text Annotation Architecture to Noun Compound Bracketing

Preslav Nakov, Ariel Schwartz, Brian Wolf, Marti Hearst

Computer Science Division and SIMS, University of California, Berkeley

http://biotext.berkeley.edu

Supported by NSF DBI-0317510 and a gift from Genentech

Plan

Overview

Noun compound (NC) bracketing

Problems with Web Counts

Layers of annotation

Applying LQL to NC bracketing

Evaluation

Overview

Motivation: Need to re-use results of NLP processing:

  for additional processing

  for end applications: data mining, etc.

Proposed solution: Layers of annotations over text

Illustration: Application to noun compound bracketing

Noun Compound Bracketing

(a) [ [ liver cell ] antibody ] (left bracketing)

(b) [ liver [cell line] ] (right bracketing)

In (a), the antibody targets the liver cell. In (b), the cell line is derived from the liver.

Related Work

Pustejovsky et al. (1993) adjacency model: Pr(w1|w2) vs. Pr(w2|w3)

Lauer (1995) dependency model: Pr(w1|w2) vs. Pr(w1|w3)

Keller & Lapata (2004): use the Web

  unigrams and bigrams

Nakov & Hearst (2005): will be presented at CoNLL!

  use the Web, Chi-squared

  n-grams

  paraphrases

  surface features
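The adjacency comparison listed above, Pr(w1|w2) vs. Pr(w2|w3), can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the count dictionary layout, and all counts are hypothetical and do not come from the talk.

```python
def conditional_prob(bigram_count: int, head_count: int) -> float:
    """Estimate Pr(w_i | w_j) as #(w_i w_j) / #(w_j); 0.0 if w_j is unseen."""
    return bigram_count / head_count if head_count else 0.0


def adjacency_bracketing(counts: dict, w1: str, w2: str, w3: str) -> str:
    """Compare Pr(w1|w2) with Pr(w2|w3); return 'left', 'right', or 'tie'."""
    p_left = conditional_prob(counts.get((w1, w2), 0), counts.get(w2, 0))
    p_right = conditional_prob(counts.get((w2, w3), 0), counts.get(w3, 0))
    if p_left > p_right:
        return "left"
    if p_right > p_left:
        return "right"
    return "tie"


# Hypothetical counts, invented purely for illustration.
toy_counts = {
    ("liver", "cell"): 40, "cell": 1000,
    ("cell", "antibody"): 5, "antibody": 500,
    ("cell", "line"): 300, "line": 600,
}

print(adjacency_bracketing(toy_counts, "liver", "cell", "antibody"))
print(adjacency_bracketing(toy_counts, "liver", "cell", "line"))
```

With these invented counts the first compound comes out left-bracketed and the second right-bracketed; real counts may of course behave differently.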

Nakov & Hearst (2005)

Web page hits: proxy for n-gram frequencies

Sample surface features

  amino-acid sequence (left)

  brain stem’s cell (left)

  brain’s stem cell (right)

Majority vote to combine different models

Accuracy 89.34%
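One simple way to combine several models by majority vote is sketched below. The slides do not say how ties or abstentions are handled, so the choice here (abstaining models are ignored, ties yield no decision) is an assumption, and `majority_vote` is a hypothetical name.

```python
from collections import Counter
from typing import Iterable, Optional


def majority_vote(predictions: Iterable[Optional[str]]) -> Optional[str]:
    """Combine per-model predictions ('left', 'right', or None for abstain).

    Returns the label with more votes, or None on a tie or when no model
    votes. Tie and abstention handling are assumptions of this sketch.
    """
    votes = Counter(p for p in predictions if p is not None)
    if votes["left"] > votes["right"]:
        return "left"
    if votes["right"] > votes["left"]:
        return "right"
    return None


# Three hypothetical models vote, one abstains.
print(majority_vote(["left", "right", "left", None]))
```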

Web Counts: Problems

The Web lacks linguistic annotation

  Pr(health|care) = #(“health care”) / #(care)

    “health”: returns nouns

    “care”: returns both verbs and nouns

    can be adjacent by chance

    can come from different sentences

  Cannot find:

    stem cells VERB PREPOSITION brain

    protein synthesis’ inhibition

Page hits are inaccurate
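The effect of missing annotation can be illustrated on a toy corpus: a raw string-style count of "health care" picks up a verb reading and a cross-sentence adjacency, while a count restricted to noun-noun pairs within a sentence does not. The data, the function names, and the assumption that noun tags start with "NN" (Penn Treebank style) are all illustrative.

```python
def count_raw_bigrams(sentences, first, second):
    """Count adjacent word pairs, ignoring tags and sentence boundaries."""
    flat = [word.lower() for sentence in sentences for word, _tag in sentence]
    return sum(1 for a, b in zip(flat, flat[1:]) if (a, b) == (first, second))


def count_noun_bigrams(sentences, first, second):
    """Count adjacent pairs where both words are nouns in the same sentence."""
    total = 0
    for sentence in sentences:
        for (w_a, t_a), (w_b, t_b) in zip(sentence, sentence[1:]):
            same_words = (w_a.lower(), w_b.lower()) == (first, second)
            if same_words and t_a.startswith("NN") and t_b.startswith("NN"):
                total += 1
    return total


# Invented toy corpus of (word, POS tag) pairs.
toy_sentences = [
    [("Health", "NN"), ("care", "NN"), ("costs", "NNS"), ("rose", "VBD")],
    [("Patients", "NNS"), ("in", "IN"), ("poor", "JJ"), ("health", "NN"),
     ("care", "VBP"), ("less", "RBR")],
    [("We", "PRP"), ("studied", "VBD"), ("health", "NN")],
    [("Care", "NN"), ("was", "VBD"), ("given", "VBN")],
]

raw = count_raw_bigrams(toy_sentences, "health", "care")
nouns_only = count_noun_bigrams(toy_sentences, "health", "care")
print(raw, nouns_only)
```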

Solution: MEDLINE+LQL

MEDLINE: ~13 million abstracts

We annotated:

  1.4 million abstracts

  ~10 million sentences

  ~320 million annotations

Layered Query Language: demo at ACL! http://biotext.berkeley.edu/lql/

The System

Built on top of an RDBMS

Supports layers of annotations over text

  hierarchical, overlapping

  cannot be represented by a single-file XML

Specialized query language LQL (Layered Query Language)
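To make the idea of annotation layers in a relational store concrete, here is a minimal sketch using SQLite. The table name, column names, token offsets, and the `demo_span` layer are all hypothetical; the actual schema of the system is not described in these slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE annotation (
           layer TEXT, tag_type TEXT,
           start_pos INTEGER, end_pos INTEGER, content TEXT)"""
)

# Hypothetical annotations over token offsets [start_pos, end_pos).
rows = [
    ("sentence", "S", 0, 5, "human immunodeficiency virus infects cells"),
    ("shallow_parse", "NP", 0, 3, "human immunodeficiency virus"),
    ("shallow_parse", "NP", 4, 5, "cells"),
    ("pos", "noun", 0, 1, "human"),
    ("pos", "noun", 1, 2, "immunodeficiency"),
    ("pos", "noun", 2, 3, "virus"),
    ("pos", "verb", 3, 4, "infects"),
    ("pos", "noun", 4, 5, "cells"),
    # Invented span that crosses an NP boundary; nesting XML cannot express it.
    ("demo_span", "x", 2, 4, "virus infects"),
]
conn.executemany("INSERT INTO annotation VALUES (?, ?, ?, ?, ?)", rows)

# Hierarchical case: POS tokens contained in each NP.
np_token_counts = conn.execute(
    """SELECT np.content, COUNT(*)
       FROM annotation AS np
       JOIN annotation AS tok
         ON tok.layer = 'pos'
        AND tok.start_pos >= np.start_pos
        AND tok.end_pos <= np.end_pos
       WHERE np.layer = 'shallow_parse' AND np.tag_type = 'NP'
       GROUP BY np.start_pos, np.end_pos, np.content
       ORDER BY np.start_pos"""
).fetchall()

# Overlapping case: spans that cross an NP boundary without nesting.
crossing = conn.execute(
    """SELECT a.layer, a.content
       FROM annotation AS np
       JOIN annotation AS a
         ON a.start_pos < np.end_pos AND a.end_pos > np.start_pos
        AND NOT (a.start_pos >= np.start_pos AND a.end_pos <= np.end_pos)
        AND NOT (a.start_pos <= np.start_pos AND a.end_pos >= np.end_pos)
       WHERE np.layer = 'shallow_parse'"""
).fetchall()

print(np_token_counts)
print(crossing)
```

Storing each annotation as an independent row is what lets nested and crossing spans coexist, which is the limitation of single-file XML noted above.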

Annotated Example

Noun Compound Extraction (1)

FROM
  [layer='shallow_parse' && tag_type='NP'
    ^ [layer='pos' && tag_type="noun"]
      [layer='pos' && tag_type="noun"]
      [layer='pos' && tag_type="noun"] $
  ] AS compound
SELECT compound.content

^ : layers’ beginnings should match

$ : layers’ endings should match
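The matching rule in this query, an NP whose span is covered exactly by three noun tokens with aligned beginnings and endings, can be mimicked outside LQL. The sketch below is not how LQL is implemented; the `Span` type, the token offsets, and the function name are assumptions made for illustration.

```python
from typing import List, NamedTuple


class Span(NamedTuple):
    layer: str
    tag_type: str
    start: int   # token offset, inclusive
    end: int     # token offset, exclusive
    content: str


def three_noun_compounds(spans: List[Span]) -> List[str]:
    """Return NPs that consist of exactly three noun tokens."""
    tokens = sorted((s for s in spans if s.layer == "pos"), key=lambda s: s.start)
    results = []
    for chunk in spans:
        if chunk.layer != "shallow_parse" or chunk.tag_type != "NP":
            continue
        inside = [t for t in tokens if chunk.start <= t.start and t.end <= chunk.end]
        if (
            len(inside) == 3
            and all(t.tag_type == "noun" for t in inside)
            and inside[0].start == chunk.start   # beginnings match (^)
            and inside[-1].end == chunk.end      # endings match ($)
        ):
            results.append(" ".join(t.content for t in inside))
    return results


# Invented example spans.
example_spans = [
    Span("shallow_parse", "NP", 0, 3, "human immunodeficiency virus"),
    Span("pos", "noun", 0, 1, "human"),
    Span("pos", "noun", 1, 2, "immunodeficiency"),
    Span("pos", "noun", 2, 3, "virus"),
    Span("pos", "verb", 3, 4, "infects"),
    Span("shallow_parse", "NP", 4, 7, "the liver cell"),
    Span("pos", "determiner", 4, 5, "the"),
    Span("pos", "noun", 5, 6, "liver"),
    Span("pos", "noun", 6, 7, "cell"),
]

print(three_noun_compounds(example_spans))
```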


SELECT LOWER(compound.content) AS lc, COUNT(*) AS freq
FROM
  BEGIN_LQL
    FROM
      [layer='shallow_parse' && tag_type='NP'
        ^ [layer='pos' && tag_type="noun"]
          [layer='pos' && tag_type="noun"]
          [layer='pos' && tag_type="noun"] $
      ] AS compound
    SELECT compound.content
  END_LQL
GROUP BY lc
ORDER BY freq DESC


SELECT LOWER(compound.content) AS lc, COUNT(*) AS freq
FROM
  BEGIN_LQL
    FROM
      [layer='shallow_parse' && tag_type='NP'
        ^ ( { ALLOW GAPS }
            ![layer='pos' && tag_type="noun"]
            ( [layer='pos' && tag_type="noun"]
              [layer='pos' && tag_type="noun"]
              [layer='pos' && tag_type="noun"] ) $
          ) $
      ] AS compound
    SELECT compound.content
  END_LQL
GROUP BY lc
ORDER BY freq DESC

! : layer negation

( ... ) : artificial range

Finding Bigram Counts

SELECT COUNT(*) AS freq
FROM
  BEGIN_LQL
    FROM
      [layer='shallow_parse' && tag_type='NP'
        [layer='pos' && tag_type="noun" &&
          content="immunodeficiency"] AS word1
        [layer='pos' && tag_type="noun" &&
          (content="virus" || content="viruses")]
      ]
    SELECT word1.content
  END_LQL

Paraphrases

Types of paraphrases (Warren, 1978):

  Prepositional

    immunodeficiency virus in humans (right)

  Verbal

    virus causing human immunodeficiency (left)

    immunodeficiency virus found in humans (right)

  Copula

    immunodeficiency virus that is human (right)
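The intuition behind these examples is that a paraphrase keeps two of the three words together: if w2 w3 stay adjacent and w1 is moved after them, that suggests right bracketing; if w1 w2 stay adjacent and w3 is moved before them, that suggests left bracketing. A rough sketch of that rule follows. The function name and the tiny inflection table are hypothetical (the talk uses UMLS for inflections), and real paraphrase matching is done over annotated text rather than raw token lists.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical stand-in for an inflection lexicon.
TOY_LEMMAS = {"humans": "human", "viruses": "virus"}


def paraphrase_vote(compound: Tuple[str, str, str],
                    paraphrase_tokens: Sequence[str]) -> Optional[str]:
    """Return 'left', 'right', or None for one paraphrase of w1 w2 w3."""
    w1, w2, w3 = compound
    toks = [TOY_LEMMAS.get(t.lower(), t.lower()) for t in paraphrase_tokens]

    def find_bigram(a: str, b: str) -> int:
        for i in range(len(toks) - 1):
            if toks[i] == a and toks[i + 1] == b:
                return i
        return -1

    i = find_bigram(w2, w3)
    if i >= 0 and w1 in toks[i + 2:]:
        return "right"   # [w1 [w2 w3]]: w2 w3 kept together, w1 moved after
    j = find_bigram(w1, w2)
    if j >= 0 and w3 in toks[:j]:
        return "left"    # [[w1 w2] w3]: w1 w2 kept together, w3 moved before
    return None


nc = ("human", "immunodeficiency", "virus")
print(paraphrase_vote(nc, "immunodeficiency virus in humans".split()))
print(paraphrase_vote(nc, "virus causing human immunodeficiency".split()))
```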

Prepositional Paraphrases

SELECT LOWER(prep.content) AS lp, COUNT(*) AS freq
FROM
  BEGIN_LQL
    FROM
      [layer='sentence'
        [layer='pos' && tag_type="noun" && content = "immunodeficiency"]
        [layer='pos' && tag_type="noun" && content IN ("virus","viruses")]
        [layer='pos' && tag_type='IN'] AS prep
        ?[layer='pos' && tag_type='DT' && content IN ("the","a","an")]
        [layer='pos' && tag_type="noun" && content IN ("human", "humans")]
      ]
    SELECT prep.content
  END_LQL
GROUP BY lp
ORDER BY freq DESC

? : optional layer
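For readers without access to LQL, the same pattern (noun, noun, preposition, optional determiner, noun) can be approximated over POS-tagged sentences. This is a sketch only: the function name, the Penn Treebank style tags ("NN*", "IN", "DT"), and the toy sentences are assumptions, not part of the system.

```python
from collections import Counter

DETERMINERS = {"the", "a", "an"}


def count_prepositions(sentences, first_noun, second_noun, target_noun):
    """Count prepositions in 'first second PREP [DT] target' within a sentence.

    Each *_noun argument is a set of accepted lowercase word forms.
    """
    counts = Counter()
    for sentence in sentences:
        words = [(w.lower(), t) for w, t in sentence]
        for i in range(len(words) - 3):
            (a, tag_a), (b, tag_b), (prep, tag_p) = words[i:i + 3]
            if a not in first_noun or b not in second_noun or tag_p != "IN":
                continue
            if not (tag_a.startswith("NN") and tag_b.startswith("NN")):
                continue
            k = i + 3
            if words[k][0] in DETERMINERS and words[k][1] == "DT":
                k += 1  # skip the optional determiner
            if (k < len(words) and words[k][0] in target_noun
                    and words[k][1].startswith("NN")):
                counts[prep] += 1
    return counts


# Invented tagged sentences.
tagged = [
    [("The", "DT"), ("immunodeficiency", "NN"), ("virus", "NN"),
     ("in", "IN"), ("humans", "NNS"), ("spreads", "VBZ")],
    [("immunodeficiency", "NN"), ("viruses", "NNS"), ("of", "IN"),
     ("the", "DT"), ("human", "NN")],
    [("immunodeficiency", "NN"), ("virus", "NN"), ("in", "IN"),
     ("mice", "NNS")],
]

prep_counts = count_prepositions(
    tagged, {"immunodeficiency"}, {"virus", "viruses"}, {"human", "humans"}
)
print(prep_counts.most_common())
```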

Evaluation

obtained 418,678 noun compounds (NCs)

annotated the top 232 NCs (after cleaning)

  agreement 88%

  kappa .606

baseline (left): 83.19%

n-grams: Pr, #, χ²

prepositional paraphrases

for inflections, we used UMLS
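The slides list χ² among the n-gram models without spelling out the computation. A common formulation, assumed here rather than taken from the talk, builds a 2x2 contingency table from the bigram count, the two unigram counts, and a total N, and applies the standard χ² formula. The function name and the example numbers are illustrative.

```python
def chi_squared(count_xy: int, count_x: int, count_y: int, total: int) -> float:
    """Chi-squared association for a word pair from a 2x2 contingency table.

    a = #(x, y), b = #(x, not y), c = #(not x, y), d = #(not x, not y).
    """
    a = count_xy
    b = count_x - count_xy
    c = count_y - count_xy
    d = total - (a + b + c)
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0
    return total * (a * d - b * c) ** 2 / denominator


# Invented counts: the pair co-occurs more often than chance would predict.
print(chi_squared(count_xy=10, count_x=20, count_y=30, total=100))
# Invented counts chosen so that the two words are exactly independent.
print(chi_squared(count_xy=6, count_x=20, count_y=30, total=100))
```

For bracketing, the statistic for (w1, w2) would be compared with the one for (w2, w3) in the same way as the probabilities.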

Results

[Results chart; legend: correct, N/A, wrong]

Discussion

Semantics of bone marrow cells

  top verbal paraphrases

    cells derived from bone marrow (22 instances)

    cells isolated from bone marrow (14 instances)

  top prepositional paraphrases

    cells in bone marrow (456 instances)

    cells from bone marrow (108 instances)

Finding hard examples for NC bracketing

  w1w2w3 such that both w1w2 and w2w3 are MeSH terms
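The hard-example filter amounts to a set-membership test on the two bigrams of each compound. A minimal sketch follows; the function name is hypothetical, and the small term set stands in for a real MeSH lookup and makes no claim about actual MeSH content.

```python
from typing import Iterable, List, Set, Tuple

Compound = Tuple[str, str, str]


def hard_examples(compounds: Iterable[Compound], terms: Set[str]) -> List[Compound]:
    """Keep compounds w1 w2 w3 where both 'w1 w2' and 'w2 w3' are known terms."""
    known = {t.lower() for t in terms}
    return [
        (w1, w2, w3)
        for w1, w2, w3 in compounds
        if f"{w1} {w2}".lower() in known and f"{w2} {w3}".lower() in known
    ]


# Invented term list for illustration only.
toy_terms = {"bone marrow", "marrow cells", "stem cells"}
candidates = [("bone", "marrow", "cells"), ("brain", "stem", "cells")]

print(hard_examples(candidates, toy_terms))
```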

The End

Thank you!
