DISTANTLY SUPERVISED INFORMATION EXTRACTION USING
BOOTSTRAPPED PATTERNS
A DISSERTATION
SUBMITTED TO THE DEPARTMENT OF COMPUTER SCIENCE
AND THE COMMITTEE ON GRADUATE STUDIES
OF STANFORD UNIVERSITY
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
Sonal Gupta
June 2015
http://creativecommons.org/licenses/by-nc/3.0/us/
This dissertation is online at: http://purl.stanford.edu/nt508qx3506
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Christopher Manning, Primary Adviser
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Jeffrey Heer
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Percy Liang
Approved for the Stanford University Committee on Graduate Studies.
Patricia J. Gumport, Vice Provost for Graduate Education
This signature page was generated electronically upon submission of this dissertation in electronic format. An original signed hard copy of the signature page is on file in University Archives.
Abstract
Information extraction (IE) involves extracting information such as entities, relations, and
events from unstructured text. Although most work in IE focuses on tasks that have abun-
dant training data by exploiting supervised machine learning techniques, in practice, most
IE problems do not have any supervised training data available. Learning conditional ran-
dom fields (CRFs), a state-of-the-art supervised approach, is impractical for such real world
applications because: (1) they require large and expensive labeled corpora, and (2) it is dif-
ficult to interpret them and analyze errors, an often-ignored but important feature.
This dissertation focuses on information extraction for tasks that have no labeled data
available, apart from some seed examples. Supervision using seed examples is usually eas-
ier to obtain than fully labeled sentences. In addition, for many tasks, the seed examples can
be acquired using existing resources like Wikipedia and other human curated knowledge
bases.
I present Bootstrapped Pattern Learning (BPL), an iterative pattern and entity learning
approach, as an effective and interpretable approach to entity extraction tasks with only
seed examples as supervision. I propose two new tasks: (1) extracting key aspects from
scientific articles to study the influence of sub-communities of a research community, and
(2) extracting medical entities from online web forums. For the first task, I propose three
new categories of key aspects and a new definition of influence based on the key aspects.
This dissertation is the first work to address the second task of extracting drugs & treatments
and symptoms & conditions entities from patient-authored text. Extracting these entities
can aid in studying the efficacy and side effects of drugs and home remedies at a large
scale. I show that BPL, using either dependency patterns or lexico-syntactic surface-word
patterns, is an effective approach to solve both problems. It outperforms existing tools and
CRFs.
Similar to most bootstrapped or semi-supervised systems, BPL systems developed ear-
lier either ignore the unlabeled data or make closed world assumptions about it, resulting in
less accurate classifiers. To address this problem, I propose improvements to BPL’s pattern
and entity scoring functions by evaluating the unlabeled entities using unsupervised sim-
ilarity measures, such as word embeddings and contrasting domain-specific and general
text. I improve the entity classifier of BPL by expanding the training sets using similar-
ity computed by distributed representations of entities. My systems successfully leverage
unlabeled data and significantly outperform the baselines by not making closed world as-
sumptions.
Developing any learning system usually requires a developer-in-the-loop to tune the
parameters. I utilize the interpretability of patterns to humans, a highly desirable attribute
for industrial applications, to develop a new diagnostic tool for visualization of the output
of multiple pattern-based entity learning systems. Such comparisons can help in diagnosing
errors faster, resulting in a shorter and easier development cycle. I make source code of all
tools developed in this dissertation publicly available.
To my wonderful parents, Arvind and Sudha Gupta, and my partner-in-crime, Apurva.
Acknowledgements
I consider myself lucky to have had great advisors during my graduate life. Thank you
Chris for your insightful short and long answers, for showing me the right path whenever I
was in doubt, and for encouraging me to pursue my research whenever I felt disheartened.
You gave me all the freedom to work on the research projects I was excited about. Your
honest and constructive feedback helped me learn how to critically assess ideas and their
implementations. Very often graduate students worry about not publishing enough – I did
too. A lot. Thanks so much for the persistent advice that the number of papers does not
matter; what matters is the quality of the research I do and whether I relish the process. I
think it is some of the best advice I have ever gotten.
I am also thankful to my committee members, Jeff and Percy. Jeff, I really enjoyed our
conversations and the brainstorming sessions. I admire your clear thinking and great ideas.
The research project with you and Diana opened the door for many successful projects
subsequently. I have always appreciated your encouragement and high-spiritedness. Percy,
you are so smart and yet so grounded!! Thanks so much for all the great, thoughtful feed-
back on my research and this dissertation.
I am indebted to Ray Mooney for becoming my mentor during my masters at UT
Austin. Ray, it is fair to say that this PhD would not have been possible without you.
Even though I spent only two years with you, I learned the skills for a lifetime. Some of
my best memories are from the time we spent together sightseeing after ECML in Belgium
and NAACL in Los Angeles.
The research in this dissertation has been possible because of the incredible people
around me. First and foremost: Diana, the collaboration and friendship with you has been
one of the highlights of my time at Stanford. Jason and Sanjay, it was a lot of fun to
work with both of you. Thank you Val and Angel for labeling the inter-annotator data for
studying the key aspects of scientific articles. Thank you DanJ and Chris for mapping the
ACL Anthology topics to communities. Whenever someone asked me about the validity of
the mapping, it felt great to point them to two universally-acknowledged experts in NLP!
Thanks to Eric Xing, Kriti, and Jacob for the fun and exciting collaboration during my
quarter at CMU.
The best thing about Stanford is its grad students – brilliant and yet so approachable.
It has been amazing to be a part of the NLP group at Stanford. I have thoroughly enjoyed
hanging out with the group during the almost-daily afternoon tea time (that is, procrastina-
tion time). I also have fond memories of the various group hikes and the NLP group retreat.
People in the 2A wing – you know who you are – thanks so much for being there whenever
I needed you.
Finally, I am grateful to my family and friends. Diana, Isa, Nick, Nisha, Reyes, Suyash,
Tejo: thanks for being the stress busters I sorely needed over the years. My parents, Arvind
and Sudha Gupta, are the reason I am here. Their dedication and love have always been
selfless, and I do not think I can ever repay them for their sacrifices. My sister Anshu and brother
Ankur have always been there for me. Thanks so much! I am very fortunate to have the
best parents-in-law – Sandhya and Prakash Samudra. Their love and encouragement has
been unconditional. Words are not enough to express my gratitude and love towards my
husband, Apurva. We lived apart for many years so that we could pursue our own dreams;
however, that distance never came between us. Apurva, you have always been my best
Patterns are typically created using contexts around the known entities in a text corpus. The
two types of patterns I explore are lexico-syntactic surface word patterns (Hearst, 1992)
and dependency tree patterns (Yangarber et al., 2000). They have been shown to perform
better than state-of-the-art feature-based machine learning methods on some specialized
domains, such as in Chapters 4 and 5, and by Nallapati and Manning (2008). Addition-
ally, pattern-based3 systems dominate in commercial use (Chiticariu et al., 2013), mainly
because patterns are effective, interpretable, and are easy to customize by non-experts to
cope with errors. Figure 1.2 shows the distribution of pattern or rule-based vs. machine
learning-based entity extraction systems in commercial use, in a study by Chiticariu et al.
(2013).4
Comparison with sequence classifiers
One main difference between pattern-based learning systems and sequence classifiers like
CRFs is the representation – whether a system is represented using patterns or features.
Sequence classifiers learn weights on a large number of features. The features commonly
include token-level properties, neighboring tokens and their tags, and distributional simi-
larity word classes. Note that there is a continuum between patterns and features. That is,
one can think of a pattern as a big conjunction of features and use it in a sequence classi-
fier like a CRF. Conversely, many feature-based systems use feature conjunctions. Some
feature-based systems use some quite specific hand-engineered conjunctions. For example,
the Stanford part-of-speech tagger (Toutanova and Manning, 2003) models the unknown
words with a conjunction feature of words that are capitalized and have a digit and a dash
in them.
However, in practice, the distinction between patterns and features is clearer: The
feature-based systems typically have a very large number of features, starting with single
element features (such as, word to left is ‘foo’) and then considering simple conjunctions.
Generally, all instantiations of the features and the conjunctions are generated, resulting in
3I use the terms patterns and rules interchangeably in this dissertation.
4It is not clear from their paper which category BPL, and thus my systems, fall under. My systems are
pattern-based and are machine learned.
Figure 1.2: Percentage of commercial entity extraction systems that use rule-based, machine learning-based, or hybrid systems. The study was conducted by Chiticariu et al. (2013). Pattern or rule-based systems dominate the commercial market, especially among large vendors.
a large number of features, most of which are not individually very useful. The emphasis
is on coverage and recall of features. The advantage is that the features can share statis-
tics better. There has been some work on learning useful feature templates (Martins et al.,
2011). However, the focus is on whether to include entire feature templates, which are
usually much more general than patterns. The pattern-based systems are normally built on
orders of magnitude smaller numbers of patterns. Each pattern is normally a quite specific
conjunction of several things (such as, tokens and their generalized versions, dependency
paths, and wildcards), carefully targeted to extract an information need. The emphasis is more
on the precision of the patterns.
The difference is not only in representation; the systems also differ in the typical learn-
ing methods used. Feature-based systems normally optimize weights on every feature
in a classifier using optimization methods like stochastic gradient descent and Newton’s
method. Pattern-based learning also has a loss function, but the optimization methods are
rarely used. The focus is on choosing whether to include or exclude patterns (or to weight
them more or less) based on the change in the loss function value.
Interpretability
Even though feature-based machine learning is very popular in the academic world, indus-
try is slow and reluctant to adopt it, as seen in the earlier figure. One of the reasons is that
developers, who are often not machine learning experts, do not trust black boxes. Patterns
solve this problem because: 1. patterns are understandable to humans, 2. it is easy to find
errors and fix them in a pattern-based system, and 3. patterns are generally high precision.
However, most industrial pattern-based systems are developed using manually defined pat-
terns, which requires significant human effort and expertise in the pattern language. I work
on making the task automated using machine learning to learn good patterns; the system
preserves the interpretability of patterns but does not require much manual effort. Note that
some other forms of machine learning also have much better interpretability than methods
such as feature-based classifiers or neural networks; traditionally, decision tree and decision
list classifiers have been the prototypical examples of more interpretable machine learning
classifiers (Letham et al., 2013). In a decision tree, the conjunction of features from its root
down a path is often not so different in nature from a decision list or a pattern.
1.1.3 Bootstrapped Pattern Learning
In a bootstrapped pattern-based entity learning system, seed dictionaries and/or patterns
provide distant supervision to label data. The BPL system iteratively learns new patterns
and entities belonging to a specific class from unlabeled text (Riloff, 1996; Collins and
Singer, 1999). I discuss individual components of BPL in detail in Chapter 2. A high level
overview is: BPL is an iterative algorithm, which learns a few good patterns and entities for
each entity type in each iteration. First, patterns are created around known entities. They are
scored and ranked by their ability to extract more positive entities and fewer negative entities.
Top ranked patterns are then used to extract candidate entities from text. An entity scorer
is trained to score candidate entities based on the entity features and the scores of patterns
that extracted them. High scoring candidate entities are added to the dictionaries and are
used to generate more candidate patterns around them. The power of BPL comes from two
properties. First, it identifies only a few good patterns and entities in each iteration; it takes
a cautious approach to learning new information. Cautious approaches have been shown to
be more accurate in semi-supervised settings (Abney, 2004). Second, the patterns get used
in two ways: 1. they act as a filter to suggest good candidate entities to be scored by an
entity scorer, and 2. they act as good features in the entity scorer; a highly scored pattern is
more likely to extract good entities.
1.1.4 Challenges with unlabeled data
The attractive part about distantly-supervised IE – the limited supervision – is also its main
challenge. It is very hard to learn effective classifiers with little labeled data. Starting
with unlabeled data and a small seed set, only a few tokens or examples get labeled using
the seed and learned sets of entities. Existing systems either assume unlabeled data to be
negative or just ignore them. Very often unlabeled data is subsampled to generate a negative
training set for an entity or relation classifier (Angeli et al., 2014). Assuming unlabeled data
to be negative can be counterproductive, since many examples subsampled as negative can
actually be positive. On the other hand, by ignoring the unlabeled data, a system does not
use the data to the full extent possible. In this thesis, I propose two improvements to BPL
that exploit the unlabeled data to make the pattern and entity scoring more accurate.
1.2 Contributions
I make the following contributions in this dissertation:
• I focus on low-resource information extraction problems and show that patterns, both
lexico-syntactic surface word patterns and dependency patterns, learned using BPL
are effective for distantly supervised IE. I bring the academic research in IE closer to
the problems in industry.
• I propose two new tasks and show that BPL is an effective approach for both of them.
1. Studying influence of sub-communities of a scientific community: I propose
new types of key aspects of research papers – focus or main contribution, tech-
niques used, and domain or problem. I also propose a new way of quantifying
influence of one research article on another. There has since been a surge in
interest in the study of academic dynamics, with IARPA funding a research
program called FUSE.
2. Extracting medical entities from patient-authored text: This dissertation is the
first work to extract drugs & treatments and symptoms & conditions from patient-
authored text. Such extractions can be used to study side-effects and the efficacy
of treatments and home remedies at a large scale. My systems outperformed
commonly used medical entity extractors and other machine learning-based
baselines.
• I leverage the unlabeled data to improve bootstrapped pattern learning in two ways.
My systems significantly outperform the existing pattern and entity scoring mea-
sures.
1. Improved pattern scoring: I propose predicting labels of unlabeled entities us-
ing unsupervised measures to improve pattern scoring in BPL. I present a new
pattern scoring method that uses the predicted labels of unlabeled entities. I
predict the labels using five unsupervised measures, such as distributional sim-
ilarity between labeled and unlabeled entities, and edit distances of unlabeled
entities from the labeled entities.
2. Improved entity scoring: I present an improved entity classifier by creating its
training set in a better way. I expand the positive and negative training ex-
amples by adding most similar unlabeled entities, computed using distributed
representations of words, to the training sets.
• I make the source code and some datasets publicly available. In addition, I also re-
lease a visualization and diagnostics tool to compare pattern-based learning systems
to make developing pattern-based systems more effective and efficient.
1.3 Dissertation Structure
Chapter 2 This chapter has details about the entity extraction tasks, and necessary
background information about patterns and bootstrapped pattern learning. I give a detailed
overview of individual components of BPL in this chapter. I discuss contributions made to
the system and its components by other researchers in the next chapter.
Chapter 3 I discuss work related to semi-supervised and distantly supervised IE, and
pattern learning in this chapter. I discuss related work specific to the different tasks in each
corresponding chapter.
Chapter 4 I present a new way of studying influence between sub-communities of
a research community in this chapter. I define three new key aspects to extract from a
research article: focus or main contribution, techniques used, and domains applied to. I
then describe how to use topic models to define sub-communities. I combine article-to-
community scores and key aspects of each article to compute influence of sub-communities
on each other. I present a case-study of influence of sub-communities in the computational
linguistics community, such as Speech Recognition and Machine Learning, on each other.
The content of this chapter is drawn from Gupta and Manning (2011).
Chapter 5 This chapter describes the work published in Gupta et al. (2014b). I
describe a new task of extracting drugs & treatments, and symptoms & conditions from
patient-authored text. I show that BPL using lexico-syntactic surface-word patterns is an
effective technique for extracting the information. It performs significantly better than
other approaches, including existing medical entity extraction tools like MetaMap and
Open Biomedical Annotator, on extracting the entities from posts on four forums on Med-
Help.org.
Chapter 6 I present an improved measure to compute pattern scores in BPL by leveraging
unlabeled data. I propose predicting labels of unlabeled entities extracted by patterns
using unsupervised measures and using the predicted labels in the pattern scoring function.
I describe the five unsupervised measures I use to predict the labels of unlabeled entities
and present experimental results on four forums from MedHelp.org. This work has been
published in Gupta and Manning (2014a).
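As a loose illustration only (the actual scoring function and the five measures are defined in Chapter 6), predicted labels for unlabeled entities might be folded into an RlogF-style ratio along the following lines; the function name and formula here are made up for exposition.

import math

def soft_pattern_score(pos, neg, unlab_pred):
    # Illustrative only: unlab_pred holds, for each unlabeled entity the pattern
    # extracts, an estimated probability of being positive (e.g., from
    # distributional similarity or edit distance to labeled entities).
    # This is not the exact formula used in Chapter 6.
    soft_pos = pos + sum(unlab_pred)
    total = pos + neg + len(unlab_pred)
    return 0.0 if soft_pos == 0 else soft_pos / total * math.log(1 + soft_pos)

print(soft_pattern_score(pos=3, neg=1, unlab_pred=[0.9, 0.2, 0.7]))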
Chapter 7 I present an improved entity classifier for BPL by using distributed repre-
sentations of words. I propose expanding training sets for BPL’s entity classifier, modeled
by a logistic regression, using similarity of entities computed by cosine distance between
word vectors. I present experimental results and show that expanded training sets improve
the performance significantly. This work has been published in Gupta and Manning (2015).
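A minimal sketch of the expansion idea, assuming pre-trained word vectors are available; the threshold, entity names, and function name are invented, and the actual selection criteria are those described in Chapter 7.

import numpy as np

def expand_training_set(labeled, unlabeled, vectors, threshold=0.7):
    # Add unlabeled entities whose word vector is close (cosine similarity)
    # to some labeled entity. The real system applies this to both the
    # positive and the negative training sets.
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    expanded = set(labeled)
    for u in unlabeled:
        if u not in vectors:
            continue
        sims = [cos(vectors[u], vectors[l]) for l in labeled if l in vectors]
        if sims and max(sims) >= threshold:
            expanded.add(u)
    return expanded

# Toy vectors: 'tylenol' is close to the labeled drug 'advil', 'winter' is not.
vecs = {'advil': np.array([0.9, 0.1]), 'tylenol': np.array([0.85, 0.2]),
        'winter': np.array([0.1, 0.9])}
print(expand_training_set({'advil'}, {'tylenol', 'winter'}, vecs))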
Chapter 8 I present and publicly release a visualization and diagnostics tool to com-
pare pattern-based learning systems. This work has been published in Gupta and Manning
(2014b).
Chapter 9 I conclude this dissertation and discuss avenues for future work.
I release the code for the systems described in this dissertation at http://nlp.stanford.edu/software/patternslearning.shtml. I also release a visualization tool, described in Chapter 8, that can be downloaded at http://nlp.stanford.
Table 2.1: Examples of patterns and how they match to sentences. X means one token that will be matched, tag means the part-of-speech tag restriction of the target entity, FW* means up to 2 words from {a, an, the}, SW* means up to 2 stop words, .* means zero or more characters can match, and lemma means the lemma of the token. Lemmas, FW*, and SW* are generalizations of the patterns’ context. Colors show corresponding matches between pattern elements and words in sample sentences.
As an example on how to vary the context window size and generalize, consider the
labeled sentence, ‘I take Advair::DT and Albuterol::DT for asthma::SC’, where ‘::’
indicates the label of the word. The patterns created around the DT word ‘Albuterol’
will be ‘DT and X’, ‘DT and X for SC’, ‘X for SC’, and so on, where X is the target
entity.
• Flexible Matching: One can create flexible patterns by ignoring certain types of
words, such as determiners and function words, while matching the pattern. One
can also allow stop words between the context and the term to be extracted. Some
systems (Agichtein and Gravano, 2000; Brin, 1999) have used vectors of context
words instead of contiguous tokens as patterns to increase flexibility in matching. I
use context as contiguous tokens or their generalized forms.
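Returning to the ‘Albuterol’ example above, the following is a minimal sketch of how such windowed candidate patterns can be generated; the function and the token/label representation are illustrative, not the system’s actual code.

def window_patterns(tokens, labels, i, max_window=2):
    # Candidate surface patterns around the labeled token at position i.
    # Labeled context words are replaced by their label, the target by 'X'.
    def render(j):
        return labels[j] if labels[j] else tokens[j]

    patterns = set()
    for left in range(0, max_window + 1):
        for right in range(0, max_window + 1):
            if left == 0 and right == 0:
                continue
            ctx_left = [render(j) for j in range(max(0, i - left), i)]
            ctx_right = [render(j) for j in range(i + 1, min(len(tokens), i + 1 + right))]
            patterns.add(' '.join(ctx_left + ['X'] + ctx_right))
    return patterns

# 'I take Advair and Albuterol for asthma' with Advair/Albuterol labeled DT and
# asthma labeled SC; patterns around 'Albuterol' (index 4) include 'DT and X',
# 'DT and X for SC', and 'X for SC'.
tokens = ['I', 'take', 'Advair', 'and', 'Albuterol', 'for', 'asthma']
labels = [None, None, 'DT', None, 'DT', None, 'SC']
print(window_patterns(tokens, labels, 4))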
2.2.2 Dependency Patterns
A dependency tree of a sentence is a parse tree that gives dependencies (such as direct-
object, subject) between words in the sentence. It is, in my opinion, the best way to trade-
off semantic meaning representation and ‘learnability’ using the current resources we have.
Semantic representations can be more expressive but it is hard to learn how to generate the
representation for a new sentence. The expressive representations require more manually
labeled data, which is very hard to acquire. All work in this dissertation has used the
Stanford English Dependencies (De Marneffe et al., 2006). Many researchers are currently
working on developing the Universal Dependencies2 to provide a universal collection of
categories with consistent annotations across different languages. My systems can be easily
customized to work with the Universal Dependencies.
Figure 2.1 shows the dependency tree for the sentence ‘We work on extracting informa-
tion using dependency graphs.’. Dependency patterns match dependency trees of sentences
to extract phrase sub-trees. The figure shows matching of two patterns: [using→ (direct-
object)] and [work→ (preposition on)]. The two patterns are part of seed patterns to extract
FOCUS and TECHNIQUE entities from scientific articles in Chapter 4.
A dependency tree matches a pattern [T → (d)], with a trigger word T and a de-
pendency d, if (1) it contains T , and (2) the trigger word’s node has a successor whose
dependency with its parent is d. In the rest of the dissertation, I call the subtree headed by
the successor the matched phrase-tree. The notion of a phrase in a dependency grammar
is the subtree below the head node selected by the pattern. I extract the phrase correspond-
ing to the matched phrase-tree and label it with the pattern’s category. For example, the
dependency tree in Figure 2.1 matches the FOCUS pattern [work→ (preposition on)] and
the TECHNIQUE pattern [using→ (direct-object)]. Thus, the system labels the phrase corre-
sponding to the phrase-tree headed by ‘extracting’, which is ‘extracting information using
dependency graphs’, with the category FOCUS, and similarly labels the phrase ‘dependency
graphs’ as a TECHNIQUE.
I use Stanford CoreNLP (Manning et al., 2014) to get dependency trees of sentences
and use its Semgrex tool3 to match dependency patterns to the dependency trees.4
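A minimal sketch of the matching rule for a pattern [T → (d)], using a toy dictionary representation of a dependency tree (each node holds a token index, a word, and (dependency, child) pairs). This only illustrates the rule stated above; the actual system matches patterns with CoreNLP’s Semgrex on real parses, and the ‘xcomp’ edge below is a simplification of the parse in Figure 2.1.

def subtree_words(node):
    pairs = [(node['i'], node['word'])]
    for _, child in node['children']:
        pairs.extend(subtree_words(child))
    return pairs

def subtree_phrase(node):
    # Phrase covered by the node's subtree, in sentence order.
    return ' '.join(w for _, w in sorted(subtree_words(node)))

def match_pattern(node, trigger, dep):
    # Phrases headed by a successor whose dependency with its parent is `dep`,
    # anywhere at or below a node whose word is `trigger`.
    matches = []
    if node['word'] == trigger:
        matches += [subtree_phrase(c) for d, c in node['children'] if d == dep]
    for _, child in node['children']:
        matches += match_pattern(child, trigger, dep)
    return matches

# Toy tree for 'We work on extracting information using dependency graphs'
# (collapsed 'prep_on' edge).
graphs = {'i': 7, 'word': 'graphs',
          'children': [('nn', {'i': 6, 'word': 'dependency', 'children': []})]}
using = {'i': 5, 'word': 'using', 'children': [('dobj', graphs)]}
extracting = {'i': 3, 'word': 'extracting', 'children': [
    ('dobj', {'i': 4, 'word': 'information', 'children': []}), ('xcomp', using)]}
root = {'i': 1, 'word': 'work', 'children': [
    ('nsubj', {'i': 0, 'word': 'We', 'children': []}), ('prep_on', extracting)]}

print(match_pattern(root, 'using', 'dobj'))    # ['dependency graphs']
print(match_pattern(root, 'work', 'prep_on'))  # ['extracting information using dependency graphs']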
The options to consider when creating dependency patterns are similar to those for surface word
patterns. A few other parameters to consider are: 1. the allowed and disallowed dependen-
cies, both when generating the dependency patterns and when extracting a phrase from a
matched phrase sub-tree, 2. flexible matching of dependency patterns by allowing a cer-
tain number or type of nodes to be skipped between the trigger node and the node that is
connected by the required dependency edge.
2http://universaldependencies.github.io/
3http://nlp.stanford.edu/software/tregex.shtml
4More details about the dependencies are in http://nlp.stanford.edu/software/
Figure 2.1: The dependency tree for ‘We work on extracting information using dependency graphs’. The tree is generated using the collapsed dependencies defined in the Stanford CoreNLP toolkit (the word ‘on’ is collapsed with the edge ‘preposition’). The dependency ‘nn’ means ‘noun compound modifier’. The generated dependencies are not always correct; for example, the correct dependency between ‘extracting’ and ‘using’ should have been ‘advcl’. Also shown is the matching of two patterns. More details are in Chapter 4.
2.3 Bootstrapped Pattern Learning
Bootstrapped pattern-based entity learning (BPL) generally begins with seed sets of pat-
terns and/or example dictionaries for given labels and iteratively learns new entities from
unlabeled text (Riloff, 1996; Collins and Singer, 1999). I earlier discussed the two types
of patterns our systems learned using this approach – lexico-syntactic surface word pat-
terns (Hearst, 1992) and dependency tree patterns (Yangarber et al., 2000). In each itera-
tion, BPL learns a few patterns and a few entities of each given label. Figure 2.2 shows
the flow of the system when the supervision is provided as seed entities. For ease of ex-
position, I present the approach below for learning entities for one label l. It can easily
be generalized to multiple labels. I refer to entities belonging to l as positive and entities
belonging to all other labels as negative. Patterns are scored by their ability to extract more
positive entities and fewer negative entities. Top-ranked patterns are used to extract candidate
entities from text. High scoring candidate entities are added to the dictionaries and are used
to generate more candidate patterns around them.
Figure 2.2: A flowchart of various steps in a bootstrapped pattern-based entity learning system.
The bootstrapping process involves the following steps, iteratively performed until no
more patterns or entities can be learned. For ease of understanding, I use a running
example of learning ‘animal’ entities from the following unlabeled text, starting with the
seed set of entities as {dog}.
Step 1: Data labeling
The unlabeled text is partially labeled using the label dictionaries, starting with the seed
dictionaries in the first iteration. In later iterations, the text is labeled using both the seed
and the learned dictionaries. A phrase matching a dictionary phrase is labeled with the
dictionary’s label. Often, phrases are soft matched, by using lemmas of words and/or by
matching phrases within a small edit distance. In the example below, both instances of
‘dog’ are labeled as an ‘animal’.
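A minimal sketch of this labeling step, assuming a simple token-level, lemma-based match; multi-word phrases and edit-distance matching are omitted, and the names here are illustrative.

def label_tokens(tokens, lemmas, dictionary, label):
    # Step 1 sketch: mark tokens whose lemma matches a dictionary entry.
    entries = {e.lower() for e in dictionary}
    return [label if lem.lower() in entries else None for lem in lemmas]

tokens = ['My', 'dogs', 'chased', 'the', 'cat', 'and', 'the', 'dog', 'barked']
lemmas = ['my', 'dog', 'chase', 'the', 'cat', 'and', 'the', 'dog', 'bark']
print(label_tokens(tokens, lemmas, {'dog'}, 'animal'))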
Step 2: Pattern generation
Patterns are generated using the context around the labeled entities to create candidate
patterns. I discussed various parameters to consider when generating the patterns in Section
2.2. I generate all possible patterns and learn the good ones in the next step. Figure 2.3
shows two of the many possible candidate patterns and their extractions.
Step 3: Pattern learning
This is one of the two crucial steps in a BPL system. Candidate patterns generated in
the previous step are scored using a pattern scoring measure. Top ones are added to the
list of learned patterns for l. The maximum number of patterns to be learned and the
threshold to choose a pattern are given as an input to the system by the developer. In a
supervised setting, the efficacy of patterns can be judged by their performance on a fully
labeled dataset (Califf and Mooney, 1999; Ciravegna, 2001). In a bootstrapped system,
where the data is not fully labeled, a pattern is usually judged by the number of positive,
negative, and unlabeled entities it extracts. Note that a true recall cannot be used because of
the lack of a fully labeled dataset. One of the most commonly used measures is RlogF by
Riloff (1996). It is a combination of reliability of a pattern and the frequency with which
it extracts positive entities. Let pos(p), neg(p), and unlab(p) be the number of positive,
negative, and unlabeled entities extracted by the pattern p, respectively. The RlogF score is
RlogF(p) = \frac{pos(p)}{pos(p) + neg(p) + unlab(p)} \log pos(p)    (2.4)
The first term is a very rough estimate of the precision of a pattern – it assumes unla-
beled entities to be negative. The log pos(p) term gives higher scores to patterns that extract
more positive entities. In Figure 2.3, the pattern scorer gives scores to the two candidate
patterns. The pos and unlab values for both patterns are 1 and neg is 0. Assuming that the
pattern scorer is good, that is s2 > s1, the second pattern is selected and added to the list of
learned patterns.
Figure 2.3: An example pattern learning system for the class ‘animals’ from the text. Two of the many possible candidate patterns are shown, along with the extracted entities. Text matched with the patterns is shown in italics and the extracted entities are shown in bold.
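A tiny sketch of Equation 2.4, just to make the quantities concrete; the -inf convention for patterns extracting no positives is my own choice, not part of the original definition.

import math

def rlogf(pos, neg, unlab):
    # RlogF score of a pattern (Equation 2.4).
    if pos == 0:
        return float('-inf')
    return pos / (pos + neg + unlab) * math.log(pos)

# A pattern extracting 4 known positives, 1 negative, and 3 unlabeled entities:
print(rlogf(4, 1, 3))
# Both candidate patterns in Figure 2.3 have pos = 1, neg = 0, unlab = 1, so plain
# RlogF scores them identically (log 1 = 0); richer scorers are the topic of Chapter 6.
print(rlogf(1, 0, 1))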
Step 4: Entity learning
Patterns that are learned for the label in the previous step are applied to the text to extract
candidate entities. An entity scorer ranks the candidate entities and adds the top entities to
l’s dictionary. The maximum number of entities to be learned and the threshold to choose
an entity are given as an input to the system by the developer. Some systems learn every
entity extracted by the learned patterns; however, that can lead to many noisy entities. In
Chapter 7, I discuss various entity evaluation measures. In our systems, I represent an
entity as a vector of feature values. The features are used to score the entities, either by
taking an average of their values or by training a machine learning-based classifier.
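As a rough sketch of the simpler option, averaging feature values; the feature names and values below are invented for illustration, and a trained classifier over the same features is the alternative just described.

def entity_score(features):
    # Score a candidate entity as the average of its feature values.
    return sum(features.values()) / len(features)

candidate_features = {
    'avg_score_of_extracting_patterns': 0.7,  # scores of patterns that extracted it
    'similarity_to_seed_entities': 0.6,       # e.g., distributional similarity
    'domain_specificity': 1.0,                # e.g., rare in generic web n-grams
}
print(entity_score(candidate_features))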
Iterations
Steps 1-4 are repeated for a given number of iterations. Generally, precision drops and
recall increases with every iteration. The number of iterations can also be determined by a
threshold on precision; in Chapters 6 and 7, I do not consider output of the learning systems
when their precision drops below 75% during the post-hoc analysis.
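Putting Steps 1–4 together, a minimal sketch of the outer loop follows; the step functions are supplied by the caller, all names and default values are illustrative, and details such as thresholds and multiple labels are omitted.

def top_k(scores, k):
    # Highest-scoring items; ties broken arbitrarily.
    return set(sorted(scores, key=scores.get, reverse=True)[:k])

def bpl(text, seeds, label_data, generate_patterns, pattern_score,
        extract_entities, entity_score, iterations=10, k_pat=2, k_ent=5):
    # Steps 1-4 repeated until nothing new is learned; this sketch fixes only
    # the control flow, not the step implementations.
    entities, patterns = set(seeds), set()
    for _ in range(iterations):
        labeled = label_data(text, entities)                              # Step 1
        candidates = generate_patterns(labeled)                           # Step 2
        new_p = top_k({p: pattern_score(p, labeled) for p in candidates},
                      k_pat) - patterns                                   # Step 3
        patterns |= new_p
        candidate_entities = extract_entities(patterns, text)             # Step 4
        new_e = top_k({e: entity_score(e, patterns, text)
                       for e in candidate_entities}, k_ent) - entities
        entities |= new_e
        if not new_p and not new_e:
            break
    return patterns, entities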
Parameters
Similar to any learning system, there are many parameters one can tweak to improve a
system’s performance. One set of parameters are BPL related parameters – the thresholds
for learning a pattern (entity), number of patterns (entities) to learn in each iteration, and
the total number of iterations. The second set of parameters, as discussed in the previous
section, are related to the construction of patterns: minimum and maximum window of
context, annotations (such as, part-of-speech tags and word class tags) to consider for the
context tokens and the target entity, and, in the case of dependency patterns, the depth
of ancestors/dependents to consider when matching or constructing a pattern. Some of
these parameters, such as thresholds, can be hand tuned on a development dataset. One
trick to not tune the thresholds is to start with high thresholds and reduce them when no
more patterns or entities are learned by the system. I follow this approach in Chapters 5–
7. Another advantage is that initially the systems learn only highly confident patterns and
entities, reducing the chances of semantic drift. Semantic drift is a phenomenon in which the
system learns a few false positive entities, leading to the learning of more incorrect patterns and
entities over the iterations. Other parameters, such as restrictions to consider for the target
entity, can be learned by the system; patterns with and without the restrictions are generated
and are scored by the pattern scoring function.
2.4 Classifiers and Entity Features
Brown clusters
I use Brown clustering (Brown et al., 1992) to cluster words in the MedHelp dataset in
an unsupervised way. It is a greedy bottom-up hierarchical clustering approach based on
n-gram class language models. They have been used widely for other tasks, such as for
NER (Ratinov and Roth, 2009), parsing (Koo et al., 2008), and part-of-speech tagging
(Li et al., 2012a). I used its implementation by Liang (2005). I do not use the publicly
available generic word clusters because they are not from the same domain as the datasets.
I mainly used Brown clustering instead of other clustering methods such as distributional
clustering (Clark, 2001) or word embeddings (Collobert and Weston, 2008; Mikolov et al.,
2013a) because it was fast, easy to use, and produced good clusters. Additionally, Turian
et al. (2010) reported that Brown clustering induces better representations for rare words
than the embeddings from Collobert and Weston (2008), when the latter does not receive
sufficient training updates. I use the word embeddings from a neural network model to
enhance the entity classifier in Chapter 7.
Google Ngrams
Google Ngrams5 is a resource provided by Google consisting of English phrases 1–5 words
long and their observed frequency counts on the web (considering around 1 trillion word
tokens). Only phrases with frequency greater than or equal to 40 are included. It is a great
resource for building language models or for estimating usage of a phrase on the Internet. I
5https://catalog.ldc.upenn.edu/LDC2006T13, accessed in January 2008.
use this for calculating feature values of entities, on the assumption that an entity common
on the Internet (such as, ‘youtube’) is not a useful entity to extract for a specialized domain.
Logistic Regression
I use logistic regression (LR) for entity classifiers since it is one of the most commonly used
classifiers and it worked better than SVMs and Random Forests in the pilot experiments. I
used the implementation of LR in Stanford CoreNLP (Manning et al., 2014) and used the
default settings (with L2 regularization). Note that our training datasets are noisy – auto-
matically constructed seed sets often have some noise; learned patterns and entities can be
incorrect; and the sampling to create a training set can lead to a wrongly labeled dataset.
There has been some work in modeling annotation noise to learn more robust classifiers,
such as Shift LR (Tibshirani and Manning, 2014) and Natarajan et al. (2013), that model
random labeling noise. The noise in bootstrapped systems is, however, more systematic.
The wrong labels come from the noisy dictionaries, instead of wrong annotations by human
annotators, which is presumably more random. I tried using Shift LR in our systems but it
led to poor results.
2.5 Dataset: MedHelp Patient Authored Text
For experimental evaluation in Chapters 5, 6, and 7, I use MedHelp.org forum data. Med-
Help is one of the largest online health discussion forums. Similar to other discussion
forums, there are forums under topics like ‘Asthma’, ‘ENT’, and ‘Pregnancy: Sept 2015
babies’. A MedHelp forum consists of thousands of threads; each thread is a sequence of
posts by users. The dataset includes some medical research material posted by users but has
no clinical text. In each thread, the initiator of the thread posts a paragraph or more about
a health concern or comment. The conversations are usually about health topics, but are
also sometimes about emotional support and advice (MacLean et al., 2015). We acquired
the dataset through a research agreement with MedHelp, who anonymized the data prior
to sharing. The data spans from 2007 to May 2011. Other work on this dataset includes
MacLean and Heer (2013), MacLean et al. (2015), and MacLean (2015).
Cold dry air is a common trigger, I’m also haven’t a lot of trouble keeping the asthma under control now that is it winter (only diganosed last spring).
I had actually been feeling spasms in my throat that I thought were palpitations but it ended up not being my heart.
Now I have developed a low grade fever and blisters in my throat.
Would love some feedback as I’m anxious.
No stuffed nose, no discharge.
yes i realize that i should have used ear plugs and yes i’ve learned my lesson that i will use plugs from now on.
I have chronic sinusitis, scars on both ears from past infections , and “fairly severe deviated septum, ”.
I went to the doctor and he gave me augmittin it cleared the white patches right up.
I went to the health food store and found Wally’s Ear Oil about 2 weeks ago after reading some of the posts here.
I am interested in Xanax side affect of loosing taste and smell.
It sounds like chronic non-infectious bronchitis.
I’ve had chest x-ray-normal.
Once I had my sinus surgeries my asthma improved dramatically.
Table 2.2: A few examples of sentences from the MedHelp forum. The sentences are labeled with symptoms & conditions (in italics) and drugs & treatments (in bold) labels.
There are several challenges with extracting information from the dataset. Patients use
various slang, colloquial forms of entities and home remedies that are not found in seed
sets. They are very descriptive about their symptoms and conditions. Some examples
of sentences from the Asthma and ENT forums labeled with symptoms & conditions (in
italics) and drugs & treatments (in bold) labels are shown in Table 2.2. More information
about these two labels is in Chapter 5.
In the next chapter, I discuss previous work related to bootstrapped and pattern-based
learning methods to learn good pattern and entity ranking functions. For example, I use
logistic regression to learn an entity scorer in some of my systems. Second, dependency
paths can be used as features in a classifier, a common practice for building classifier-based
entity and relation extraction systems. Boella et al. (2013) used patterns or syntactic de-
pendencies as features in a SVM for extracting semantic knowledge from legislative text.
Patterns can also be thought of as ‘feature templates’ used in classifiers. In my opinion,
pattern-based learning approaches learn good instantiations of the feature templates. Sur-
deanu et al. (2006) proposed a co-training-based algorithm that used text categorization
along with pattern extraction, starting with seed sets.
Roth and Klakow (2013) used patterns in their system on combining generative and
discriminative relation extraction approaches. Angeli et al. (2014) used learned dependency
patterns, along with a machine learning-based MIML-RE approach (Surdeanu et al., 2012),
to predict relations between two entities.
More recently, DeepDive (Niu et al., 2012) has shown promising results on distantly
supervised relation extraction (Angeli et al., 2014) by using fast inference in Markov logic
networks. Govindaraju et al. (2013) and Zhang et al. (2013) used DeepDive on the task of
extracting structured information like tables from text.
3.3.1 Open IE systems
Open IE, a popular task in recent years, is geared towards learning generic, domain-independent
extractors. KnowItAll’s entity extraction from the web (Downey et al., 2004; Etzioni et
al., 2005) used components such as list extractors, generic and domain specific pattern
learning, and subclass learning. They learned domain-specific patterns using a seed set.
The Never-Ending Language Learning (NELL) system (Carlson et al., 2010a) learned multiple
semantic types using coupled semi-supervised training from web-scale data, which is not
feasible for all datasets and entity learning tasks.
Open-IE relation extraction systems like ReVerb (Fader et al., 2011) and OLLIE (Mausam
et al., 2012) learn domain-independent generic relation extractors for web data. However,
using them for a specific domain with a moderately sized corpus leads to poor results. I
tested learning an entity extractor for a given class using ReVerb. I labeled the binary and
unary ReVerb extractions using the class seed entities and retrained its confidence function,
with poor results. Poon and Domingos (2010) found a similar result for inducing a proba-
bilistic ontology: an open information extraction system extracted low accuracy relational
triples on a small corpus.
There has been some work to map generic Open IE extractions to learn extractors for
specific relations. Soderland et al. (2013) manually wrote rules to map Open IE extrac-
tions to TAC-KBP Slot Filling relations in under 3 hours and achieved reasonable perfor-
mance. Improving pattern-based learning systems would also improve the hybrid systems
described above.
Overall, even though pattern-based approaches have been less popular in recent
years than feature-based sequence models, owing to broader trends in the field, they
have been shown to be successful at both supervised and bootstrapped entity
learning. The hybrid systems, which have become popular in the last few years, usually
have a pattern learning component. Improving pattern learning would presumably also im-
prove the performance of hybrid systems. In Chapters 4 and 5, I apply the bootstrapped
pattern-based learning approach to two new problems and domains. The results show that
they are very effective at expanding seed sets of entities in these domains. One key com-
ponent missing from the previous systems is the utilization of unlabeled data beyond the
matching of patterns to text. For example, when scoring patterns, unlabeled entities ex-
tracted by patterns are either considered negative or are ignored. In Chapters 6 and 7, I
propose improvements to bootstrapped pattern-based learning systems that leverage unla-
beled data in a better way.
Chapter 4
Studying Scientific Articles and Communities
In this chapter, I present how to study influence of sub-communities of a research com-
munity by extracting key aspects of the articles published. I examine the computational
linguistics community as a case-study. I use bootstrapped pattern learning to extract the
key aspects, starting with only a few seed dependency patterns as supervision. The content
of this chapter is drawn from Gupta and Manning (2011).
4.1 Introduction
The evolution of ideas and the dynamics of a research community can be studied using
the scientific articles published by the community. For instance, we may be interested in
how methods spread from one community to another, or the evolution of a topic from a
focus of research to a problem-solving tool. We might want to find the balance between
technique-driven and domain-driven research within a field. Such rich insight into the
development and progress of scientific research requires an understanding of more than
just “topics” of discussion or citation links between articles. As an example, to determine
whether technique-driven researchers have greater or lesser impact, we need to be able to
identify styles of work. To achieve this level of detail and to be able to connect together
how methods and ideas are being pursued, it is essential to move beyond bag-of-words
topical models. This requires an understanding of sentence and argument structure, and is
therefore a form of information extraction, if of a looser form than the relation extraction
methods that have typically been studied.
To study the application domains, the techniques used to approach the domain prob-
lems, and the focus of scientific articles in a community, I propose to extract the following
concepts from the articles:
FOCUS: an article’s main contribution
TECHNIQUE: a method or a tool used in an article, for example, expectation maxi-
mization and conditional random fields
DOMAIN: an article’s application domain, such as speech recognition and classifica-
tion of documents.
For example, if an article concentrates on regularization in support vector machines and
shows improvement in parsing accuracy, then its FOCUS and TECHNIQUE are regularization
and support vector machines and its DOMAIN is parsing. In contrast, an article that focuses
on lexical features to improve parsing accuracy, and uses support vector machines to train
the model has FOCUS as lexical features and parsing, the TECHNIQUE being lexical fea-
tures and support vector machines, and its DOMAIN is still parsing.1 In this case, even though
TECHNIQUEs and DOMAIN of both papers are very similar, the FOCUS phrases distinguish
them from each other. Note that a DOMAIN of one article can be a TECHNIQUE of another,
and vice-versa. For example, an article that shows improvements in named entity recogni-
tion (NER) has DOMAIN as NER, and an article that uses named entities as an intermediary
tool to extract relations has NER as one of its TECHNIQUEs.
I use dependency patterns to extract the above three categories of phrases from arti-
cles, which can then be used to study the influence of communities on each other. The
phrases are extracted by matching semantic (dependency) patterns in dependency trees of
sentences. The input to the extraction system is a set of seed patterns (see Table 4.1 for
examples), and it learns more patterns using a bootstrapping approach, similar to the one de-
scribed in Chapter 2.
1A community vs. a DOMAIN: a community can be as broad as computer science or statistics whereas a DOMAIN is a specific application such as Chinese word segmentation.
As a case study, I examine the computational linguistics community and consider the
influence of its sub-fields such as parsing and machine translation. For the study, I use the
document collection from the ACL Anthology Network and the ACL Anthology Reference
corpus (Bird et al., 2008; Radev et al., 2009). To get the sub-fields of the community, I use
latent Dirichlet allocation (LDA) (Blei et al., 2003) to find topics and label them by hand.2
However, our general approach can be used to study any case of the influence of academic
communities, including looking more broadly at the influence of statistics or economics
across the social sciences.
Using the approach, I study how communities influence each other in terms of tech-
niques that are reused, and show how some communities ‘mature’ so that the results they
produce get adopted as tools for solving other problems. For example, the products of the
part-of-speech tagging community have been adopted by many other communities. This is
evidenced by many papers that use part-of-speech tagging as an intermediary step to solve
other problems. Overall, our results show that speech recognition and probability theory
have been the most influential fields in the last two decades, since many communities now
use the techniques introduced by papers in those communities. Probability theory, unlike
speech recognition, is not a sub-field of computational linguistics, but it is an important
topic since many papers use and work on probabilistic approaches.
I also show the timeline of influence of communities. For example, the results show
that formal computational semantics and unification-based grammars had a lot of influence
in the late 1980s. The speech recognition and probability theory fields showed an upward
trend of influence in the mid-1990s, and even though it has decreased in recent years,3
they still have a lot of influence on recent papers mainly due to techniques like expectation
maximization and hidden Markov models.
2In this chapter, I use the terms communities, sub-communities and sub-fields interchangeably.
3Speech recognition has recently made a comeback with the advances using the deep learning approach.
Contributions I introduce a new categorization of key aspects of scientific articles,
which is (1) FOCUS: main contribution, (2) TECHNIQUE: method or tool used, and (3)
DOMAIN: application domain. I extract them by matching dependency patterns to de-
pendency trees, and learn patterns using bootstrapping. I present a new definition of in-
fluence of a research community on another, and present a case study on the computa-
tional linguistics community, both for verifying the results of our system and showing
novel results for the dynamics and the overall influence of computational linguistics sub-
fields. I introduce a dataset of abstracts labeled with the novel categories available at
http://nlp.stanford.edu/pubs/FTDDataset_v1.txt for the research com-
munity.
4.2 Related Work: Scientific Study
While there is some connection to keyphrase selection in text summarization (Radev et al.,
2002), extracting FOCUS, TECHNIQUE and DOMAIN phrases is fundamentally a form of
information extraction, and there has been a wide variety of prior work in this area. Some
work, including the seminal (Hearst, 1992) identified patterns (IS-A relations) using hand-
written patterns, while other work has learned patterns over dependency graphs (Bunescu
and Mooney, 2005). For more related work on pattern-based systems, see Chapter 3.
Topic models have been used to study the history of ideas (Hall et al., 2008) and schol-
arly impact of papers (Gerrish and Blei, 2010). However, topic models do not extract
detailed information from text as we do. Still, I use topic-to-word distributions from topic
models as a way of describing sub-fields.
Demner-Fushman and Lin (2007) used hand written knowledge extractors to extract in-
formation, such as population and intervention, in their clinical question-answering system
to improve ranking of relevant abstracts. Our categorization of key aspects is applicable
to a broader range of communities, and we learn the patterns by bootstrapping. Li et al.
(2010) used semantic metadata to create a semantic digital library for Chemistry. They
applied machine learning techniques to identify experimental paragraphs using keyword
features. Xu et al. (2006) and Ruch et al. (2007) proposed systems, in the clinical-trials
and biomedical domain, respectively, to classify sentences of abstracts corresponding to
categories such as introduction, purpose, method, results and conclusion to improve article
Table 4.1: Some examples of dependency patterns that extract information from dependency trees of sentences. A pattern is of the form T → (d), where T is the trigger word and d is the dependency that the trigger word’s node has with its successor.
retrieval by using either structured abstracts,4 or hand-labeled sentences. Some summariza-
tion systems also use machine learning approaches to find ‘key sentences’. The systems
built in these papers are complementary to ours since one can find relevant paragraphs or
sentences and then extract the key aspects from them. Note that a sentence can have multi-
ple phrases corresponding to our three categories, and thus classification of sentences will
not give similar results.
4.3 Approach
In this section, I explain how to extract phrases for each of the three categories (FOCUS,
TECHNIQUE and DOMAIN) and how to compute the influence of communities.
4.3.1 Extraction
From an article’s abstract and title, I use the dependency trees of sentences and a set of
semantic dependency extraction patterns to extract phrases in each of the three categories.
More details on the dependency patterns and trees are in Chapter 2. Figure 2.1 shows an
example of matching two dependency patterns to a dependency tree. I start with a few
4Structured abstracts, which are used by some journals, have multiple sections such as PURPOSE and METHOD.
handwritten patterns and learn more patterns using a bootstrapping approach. Table 4.1
shows some seed patterns.
To learn more patterns automatically, I run an iterative algorithm that extracts phrases
using semantic patterns, and then learns new patterns from the extracted phrases. Section
2.3 gives a detailed overview of a bootstrapped pattern-based learning approach. Here,
the seed supervision is provided in terms of seed patterns, instead of seed entities. More
specific details of each step are described below.
Extracting Phrases from Patterns
The pattern matching is the same as described in Section 2.2.2. To increase flexibility of
matching the patterns, when matching the dependency edge, I consider dependents and
granddependents up to 4 levels. I have special rules for paper titles. I label the whole title as
FOCUS if we are not able to extract a FOCUS phrase using the patterns, as authors usually
include the main contribution of the paper in the title. For titles from which we can extract a
TECHNIQUE phrase, I label the rest of the words (except for trigger words) with DOMAIN.
For example, for the title ‘Studying the history of ideas using topic models’, our system
extracts ‘topic models’ as TECHNIQUE, and then labels ‘Studying the history of ideas’ as
DOMAIN.
Learning Patterns from Phrases
After extracting phrases with patterns, we want to be able to construct and learn new pat-
terns. For each sentence whose dependency tree has a subtree corresponding to one of the
extracted phrases, I construct a pattern T → (d) by considering the ancestor (parent or
grandparent) of the subtree as the trigger word T , and the dependency between the head
of the subtree and its parent as the dependency d. For each category, I weight the patterns
depending on the categories of phrases from which they are derived. The weighting method
is as follows. For a set of phrases (P ) that extract a pattern (q), the weight of the pattern
q for the category FOCUS is $\sum_{p \in P} \frac{1}{z_p} count(p \in FOCUS)$, where $z_p$ is the total frequency
of the phrase p. Similarly, I get weights of the pattern for the other two categories. Note
that we do not need smoothing since the phrase-category ratios are aggregated over all the
phrases from which the pattern is constructed. After weighting all the patterns that have
not been selected in the previous iterations, I select the top k patterns in each category (k=2
in our experiments). Table 4.3 shows some patterns learned through the iterative method.
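A small sketch of this weighting with made-up counts; the dictionary-based representation and the example phrases are illustrative only.

def pattern_weight(extracting_phrases, category_counts, total_counts):
    # Weight of a candidate pattern for one category (e.g., FOCUS):
    # sum over the phrases that generated it of count(p in category) / z_p,
    # following the formula above.
    return sum(category_counts.get(p, 0) / total_counts[p]
               for p in extracting_phrases)

# Suppose 'topic models' occurs 10 times overall, 6 of them labeled FOCUS, etc.
total_counts = {'topic models': 10, 'machine translation': 8}
focus_counts = {'topic models': 6, 'machine translation': 2}
print(pattern_weight({'topic models', 'machine translation'},
                     focus_counts, total_counts))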
4.3.2 Communities and their Influence
I define communities as fields or sub-fields that one wishes to study. To study communities
using the articles published, one needs to know which communities each article belongs to.
The article-to-community assignment can be computed in several ways, such as by manual
assignment, using metadata, or by text categorization of papers. In our case study, I use the
topics formed by applying latent Dirichlet allocation (Blei et al., 2003) to the text of the
papers by considering each topic as one community. In recent years, topic modeling has
been widely used to get ‘concepts’ from text; it has the advantage of defining communities
and soft, probabilistic article-to-community assignment scores in an unsupervised manner.
I combine these soft assignment scores with the phrases extracted in the previous section
to score a phrase for each community and category as follows. The score of a phrase p,
which is extracted from an article a, for a community c and the category TECHNIQUE is
calculated as
\[ \mathrm{techScore}(c, p, a) = \frac{1}{z_p}\,\mathrm{count}(p \in \mathrm{TECHNIQUE} \mid a)\, P(c \mid a; \theta) \qquad (4.1) \]
where the function P (c | a, θ) gives the probability of a community (i.e., a topic) for an
article a given the topic modeling parameters θ. The normalization constant for the phrase,
zp, is the frequency of the phrase in all the abstracts. In the rest of the section, I use ai’s for
articles, ci’s for communities and y’s for years.
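Equation 4.1 is straightforward to compute once the extraction counts and the topic-model probabilities are available; a small sketch with assumed dictionary-based inputs:

def tech_score(phrase, article, community, counts, total_freq, topic_probs):
    """Equation 4.1 (a sketch; the three lookup tables are assumed inputs).
    counts[(phrase, article)]          : count(p in TECHNIQUE | a)
    total_freq[phrase]                 : z_p, frequency of the phrase in all abstracts
    topic_probs[(community, article)]  : P(c | a; theta) from the topic model"""
    return (counts.get((phrase, article), 0) / total_freq[phrase]
            * topic_probs.get((community, article), 0.0))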
I define influence such that communities receive higher scores if they use techniques earlier than other communities do, or if they produce tools that other communities adopt to solve their own problems. For example, since hidden Markov models, introduced by the speech recognition community, and part-of-speech tagging tools, built by the part-of-speech tagging community, have been widely used as techniques in other communities, these communities should receive higher scores compared to nascent or not-so-widely-used ones. Thus, I define influence of a community
based on the number of times its FOCUS, TECHNIQUE or DOMAIN phrases have been used
as a TECHNIQUE in other communities. To calculate the overall influence of one commu-
nity on another, we first need to calculate influence because of individual articles in the
community, which is calculated as follows. The influence of community c1 on another
community c2 because of a phrase p extracted from an article a1 is
Table 4.4: The precision, recall and F1 scores of each category for the different approaches. Note that the inter-annotator agreement is calculated on a smaller set.
For testing, I hand labeled 474 abstracts with the three categories to measure the pre-
cision and recall scores. For each abstract and each category, I compared the unique non-
stop-words extracted by my algorithm to the hand labeled dataset. I calculated precision and recall measures for each abstract and then averaged them to get the results for the
dataset. To compare against a non-information-extraction based baseline, I extracted all
noun phrases (and sub-trees of the noun phrase trees) from the abstracts and labeled them
with all the three categories. In addition, I labeled the titles (and their sub-trees) with the
category FOCUS. I then scored the phrases with a tf-idf inspired measure, which was the ratio of the frequency of the phrase in the abstract to the sum of the total frequencies of the individual words, and removed phrases that had a tf-idf measure less than 0.001 (best out
of many experiments). I call this approach ‘Baseline tf-idf NPs’.
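A small sketch of the baseline's scoring rule, assuming simple frequency dictionaries as inputs (and assuming "total frequency" means the frequency over all abstracts); the 0.001 threshold is the one reported above.

def tfidf_like_score(phrase_tokens, abstract_phrase_freq, total_word_freq):
    """Score used by the 'Baseline tf-idf NPs' baseline (sketch).
    abstract_phrase_freq : frequency of the phrase within the abstract
    total_word_freq[w]   : total frequency of word w over all abstracts"""
    denom = sum(total_word_freq.get(w, 1) for w in phrase_tokens)
    return abstract_phrase_freq / denom

# A candidate noun phrase is kept only if its score is at least 0.001, e.g.:
# keep = tfidf_like_score(["topic", "models"], 3, word_freqs) >= 0.001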
Table 4.4 compares precision, recall and micro-averaged F1 scores for the three cat-
egories when we use: (1) only the seed patterns, (2) the combined set of learned and
seed patterns, and (3) the baseline. I also calculated inter-annotator agreement for 30 ab-
stracts, where each abstract was labeled by 2 annotators,8 and the precision-recall scores
8 I annotated all 30 abstracts, and two other doctoral candidates in computational linguistics annotated 15 each.
Figure 4.1: The F1 scores for TECHNIQUE and DOMAIN categories after every five iterations. For reasons explained in the text, I do not learn new patterns for FOCUS.
were calculated by randomly choosing one annotation as gold and another as predicted for
each article. We can see that both precision and recall scores increase for TECHNIQUE
because of the learned patterns, though for DOMAIN, precision decreases but recall in-
creases. The recall scores for the baseline are higher as expected but the precision is very
low. Three possible reasons explain the mistakes made by our system: (1) authors some-
times use generic phrases to describe their system, which are not annotated with any of the
three categories in the test set but are extracted by the system (such as ‘We use a simple method . . . ’, ‘We propose a faster model . . . ’, ‘This paper presents a new approach to . . . ’); (2) the dependency trees of some sentences are wrong; and (3) some of the patterns
learned for TECHNIQUE and DOMAIN were low-precision but high-recall, for example,
[based → (preposition on)] was learned as a TECHNIQUE pattern. The first problem of er-
roneous extraction of generic phrases could perhaps be decreased by allowing restrictions
on the target of the dependency or by disallowing certain kinds of generic positive terms
like ‘simple’, ‘new’, ‘faster’. Figure 4.1 shows the F1 scores for TECHNIQUE and DOMAIN
after every 5 iterations.
Columns: Community; Representative words; Most Influential Phrases; Score.
Speech Recognition
expectation maximization; hidden markov; language; contextually; segment; context independent phone; snn hidden markov; n gram back off language; multiple reference speakers; cepstral; phoneme; least squares; speech recognition; intra; hi gram; bu; word dependent; tree structured; statistical decision trees
hidden markov; expectation maximization; maximum entropy; spectral clustering; statistical alignment; conditional random fields, a discriminative; statistical word alignment; string to tree; state of the art statistical machine translation system; single word; synchronous context free grammar; inversion transduction grammar; ensemble; novel reordering
support vector machines; ensemble; machine learning; gaussian mixture; expectation maximization; flat; weak classifiers; statistical machine learning; lexicalized tree adjoining grammar based features; natural language processing; standard text categorization collection; pca; semisupervised learning; standard hidden markov; supervised learning
1.12
Table 4.5: The top 5 influential communities with their most influential phrases. The second column lists the top words that describe the communities obtained by the topic model, and the third column shows the most influential phrases that have been widely used as techniques. The last column is the score of the community computed by Equation 4.5.
Columns: Community; Representative words; Most Influential Phrases; Score.
Statistical Parsing
maximum entropy; hidden markov; expectation maximization; language; linguistically structured; ihmm; cross language information retrieval; ter; factored language; billion word; hierarchical phrases; string to tree; state of the art statistical machine translation system; statistical alignment; ist inversion transduction grammar; bleu as a metric; statistical machine translation
state of the art machine learning; conditional random fields; support vector machines; machine learning; using hidden markov; maximum entropy; memory based learning; hidden markov; standard hidden markov; second stage classifiers; weak classifiers; flat; conll 2004; iob; probabilities output; high recall
conditional random fields; ensemble; maximum entropy; maximum entropy; conditional random fields, a discriminative; large margin; perceptron; hidden markov; generalized perceptron; pseudo negative examples; natural language processing; entropy; singer; latent variable; character level; named entity
0.72
Table 4.6: The next 5 influential communities with their most influential phrases. The second column lists the top words that describe the communities obtained by the topic model, and the third column shows the most influential phrases that have been widely used as techniques. The last column is the score of the community computed by Equation 4.5.
Figure 4.2: The influence scores of communities in each year.
Columns: Community; Communities that have influenced most (descending order).
Named Entity Recognition: Chunking/Memory Based Models; Discriminative Sequence Models
Statistical Parsing: Probability Theory; POS Tagging; Discriminative Sequence Models; Speech Recognition; Parsing; Syntactic Theory; Clustering + Distributional Similarity; Chunking/Memory Based Models
Word Sense Disambiguation: Clustering + Distributional Similarity; Machine Learning Classification; Dictionary Lexicons; Collocations/Compounds; Syntax; Speech Recognition; Probability Theory
Table 4.7: The community in the first column has been influenced the most by the communities in the second column. The scores are calculated using Equation 4.4.
Influence
Tables 4.5 and 4.6 show the most influential communities overall and their respective in-
fluential phrases that have been widely adopted as techniques by other communities. The
third column is the score of the community calculated using Equation 4.5. We can see that
Figure 4.3: The popularity of communities in each year. It is measured by summing up the article-to-topic scores for the articles published in that year (see Hall et al. (2008)). The scores are smoothed with weighted scores of the 2 previous and 2 next years, and L1-normalized for each year. The scores are lower for all communities in the late 2000s since the probability mass is more evenly distributed among many communities. Contrast the relative popularity of the communities with their relative influence shown in Figure 4.2.
speech recognition is the most influential community because of the techniques like hidden
Markov models and other stochastic methods it introduced in the computational linguistics
literature. This shows that its long-term seeding influence is still present despite the limited popularity around the 2000s. Probability theory also gets a high score since many papers
in the last decade have used stochastic methods. The communities part-of-speech tagging
and parsing get high scores because they adopted some techniques that are used in other
communities, and because other communities use part-of-speech tagging and parsing in the
intermediate steps of solving other problems.
Figure 4.2 shows the change in a community’s influence over time. The scores are normalized such that the scores for all communities in a year sum to one. Compare the
relative scores of communities in the figure with the relative scores in Figure 4.3, which
Figure 4.4: The influence scores of machine translation related communities. The statistical machine translation community, which is a topic from the topic model, is more phrase-based.
shows the sum of all article-to-topic scores for each community for articles published in a
given year, and is normalized the same way as before. There is a huge spike for the Speech
Recognition community for the years 1989–1994. Hall et al. (2008) note, “These years
correspond exactly to the DARPA Speech and Natural Language Workshop, held at differ-
ent locations from 1989–1994. That workshop contained a significant amount of speech
until its last year (1994), and then it was revived in 2001 as the Human Language Technol-
ogy workshop with a much smaller emphasis on speech processing.” See their paper Hall
et al. (2008) for more analysis. Note that this analysis uses just bag-of-words-based topic
models.
Comparing Figures 4.2 and 4.3, we can see influence of a community is different from
the popularity of a community in a given year. As mentioned before, we observe that although the influence score for speech recognition declined during 1997-2009, the community still has a lot of influence, even though its popularity in recent years is very low. Ma-
chine learning classification has been both popular and influential in recent years. Figures
4.4 and 4.5 compare the machine translation communities in the same way as we compare
other communities in Figures 4.2 and 4.3. We can see that statistical machine translation
(more phrase-based) community’s popularity increased steeply from late 2002 to 2009;
Figure 4.5: Popularity of machine translation communities in each year. The statistical machine translation community, which is a topic from the topic model, is more phrase-based. Contrast the relative popularity scores with the relative influence scores shown in Figure 4.4.
however, its influence has increased at a slower rate. On the other hand, the influence of
bilingual word alignment (the most influential community in 2009) has increased during
the same period, mainly because of its influence on statistical machine translation. The in-
fluence of non-statistical machine translation has been decreasing recently, though slower
than its popularity. Table 4.7 shows the communities that have the most influence on a
given community (the list is in descending order of scores by Equation 4.4).
Comparison with Supervised CRF
In this section, I present an experiment performed after Gupta and Manning (2011) was
published. To compare the BPL approach with dependency patterns against a supervised CRF, I divided the labeled examples used as the test set into two halves. One half was reserved for training a CRF model and the other was used to test both BPL and the CRF. Note that the supervision provided to BPL and the CRF is very different. BPL did not have access
to the fully labeled abstracts, which the CRF used. Instead, it used the same seed patterns
as before. Since fully labeled abstracts have each token labeled, they are of much higher
quality than seed patterns. Table 4.8 shows the scores of the two systems for the labels
5.2 Objective
The objective is to learn new SC and DT phrases from PAT without using hand written
rules or any hand-labeled sentences. I define SC as any symptom or condition mentioned
in text. The DT label refers to any treatment taken or intervention performed in order to
improve a symptom or condition. It includes pharmaceutical treatments and drugs, surg-
eries, interventions (like ‘getting rid of cat and carpet’ for Asthma patients), and alternative
treatments (like ‘acupuncture’ or ‘garlic’). Note that our system ignores negations (for ex-
ample, in the sentence ‘I don’t have Asthma’, ‘Asthma’ is labeled SC) since it is preferable
to extract all SC and DT mentions and handle the negations separately, if required. The
labels include all relevant generic terms (for example, ‘meds’, ‘disease’). Devices used to
improve a symptom or condition (like inhalers) are included in DT, but devices that are
used for monitoring or diagnosis are not. Some examples of sentences from the Asthma
and ENT forums labeled with SC (in italics) and DT (in bold) labels are shown below:
I don’t agree with my doctor’s diagnostic after research and I think I may have
a case of Sinus Mycetoma
I started using an herbal mixture especially meant for Candida with limited
success.
however , with the consistent green and occasional blood in nasal discharge
(but with minimal “stuffy” feeling), I wonder if perhaps a problem with chronic
sinusitis and or eustachian tubes
She gave me albuteral and symbicort (plus some hayfever meds and asked
me to use the peak flow meter.
My sinus infections were treated electrically, with high voltage million volt electricity, which solved the problem, but the treatment is not FDA approved
and generally unavailable, except under experimental treatment protocols.
5.3 Related Work: Medical IE
Medical term annotation is a long-standing research challenge. However, almost no prior
work focuses on automatically annotating PAT. Tools like TerMINE (Frantzi et al., 2000)
and ADEPT (MacLean and Heer, 2013) do not identify specific entity types. Other ex-
isting tools like MetaMap (Aronson, 2001), the OBA (Jonquet et al., 2009), and Apache
cTakes1 perform poorly mainly because they are designed for fine-grained entity extraction
on expert-authored text. They essentially perform dictionary matching on text based on
source ontologies (Aronson, 2001; Jonquet et al., 2009; Aronson and Lang, 2010). Despite
being the go-to tools for medical text annotation, previous studies (Pratt and Yetisgen-
Yildiz, 2003) comparing OBA and MetaMap to human annotator performance underscore
two sources of performance error, which we also notice in our results. The first is ontology
incompleteness, which results in low recall, and second is inclusion of contextually irrel-
evant terms (MacLean and Heer, 2013). For example, when restricted to the RxNORM
ontology and semantic type Antibiotic (T195), OBA will extract both Today and Penicillin
from the sentence “Today I filled my Penicillin rx”. Other approaches focusing on expert-
authored text show improvement in identifying food and drug allergies (Epstein et al., 2013)
and disease normalization (Kang et al., 2012) with the use of statistical methods. While
these statistically-based approaches tend to perform well, they require hand labeled data,
which is manually intensive to collect and does not generalize across PAT sources.
The most relevant work to ours is in building the Consumer Health Vocabularies (CHVs).
CHVs are ontologies designed to bridge the gap between patient language and the UMLS
Metathesaurus. We are aware of two CHVs: the (OAC) CHV (Zeng and Tse, 2006)2 and
the MedlinePlus CHV3. To date, most work in this area focuses on identifying candidate
terms of general medical relevance, and not specific entity types. We use the OAC CHV to
construct our seed dictionaries.
There has been some work that extracts information from PAT. In a study investigating
the feasibility of mining adverse drug events from user comments on DailyStrength (www.
Initial Labeling Using Dictionaries
As the first step, I ‘partially’ label data using matching phrases from our DT and SC dic-
tionaries. Our DT dictionary, comprising 38,684 phrases, was sourced from Wikipedia’s
list of drugs, surgeries and delivery devices; RxList5; MedlinePlus6; Medicinenet7 phrases
with semantic type ‘procedures’ from MedDRA8; and phrases with relevant semantic types
(Antibiotic, Clinical Drug, Laboratory Procedure, Medical Device, Steroid, and Therapeu-
tic or Preventive Procedure) from the NCI thesaurus.9
Our SC dictionary comprises 100,879 phrases, and was constructed using phrases from
MedlinePlus, Medicinenet, and from MedDRA (with semantic type ‘disorders’). We ex-
panded both dictionaries using the OAC Consumer Health Vocabulary10 by adding all syn-
onyms of the phrases previously added. Because the dictionaries are automatically con-
structed with no manual editing, they might have some incorrect phrases. However, the
results show that they perform effectively.
I label a phrase with the dictionary label when the sequence of non-stop-words (or their
lemmas) matches an entry in the dictionary. To match spelling mistakes and morpholog-
ical variations (like ‘tickly’), which are common in PAT, I do fuzzy matching. A token
matches a word in the dictionary if the token is longer than 6 characters and the token and
the word are edit distance one away. I ignore words ‘disease’, ‘disorder’, ‘chronic’, and
‘pre-existing’ in the dictionaries when matching phrases. I remove phrases that are very
common on the Internet by compiling a list of the 2000 most common words from Google
Ngrams, called GoogleCommonList henceforth. See Section 2.4 for more information on
Google N-grams. This helps exclude words like ‘Today’ and ‘AS’, which are also names
of medicines. Tokens that are labeled as SC by the SC dictionary are not labeled DT, to
avoid labeling ‘asthma’ as DT in the phrase ‘asthma meds’, in case ‘asthma meds’ is in the
DT dictionary.
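A minimal sketch of the fuzzy token matching described above; the edit-distance-one check and the helper names are illustrative assumptions, and the phrase-level details (skipping the ignored dictionary words, SC taking priority over DT) are only noted in comments.

def within_edit_distance_one(a, b):
    """True if a and b differ by at most one insertion, deletion, or
    substitution (a simple check, sufficient for this sketch)."""
    if a == b:
        return True
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a                              # make a the shorter string
    for i in range(len(a)):
        if a[i] != b[i]:
            if len(a) == len(b):
                return a[i + 1:] == b[i + 1:]    # one substitution
            return a[i:] == b[i + 1:]            # one insertion into a
    return True                                   # extra character at the end of b

def token_matches(token, dict_word, common_words):
    """Fuzzy token match used when labeling with the dictionaries (sketch):
    exact match, or edit distance one for tokens longer than 6 characters,
    skipping very common web words (e.g. the GoogleCommonList).
    Phrase-level rules, such as ignoring 'disease'/'disorder'/'chronic'/
    'pre-existing' inside dictionary phrases and giving SC labels priority
    over DT, are not shown here."""
    token, dict_word = token.lower(), dict_word.lower()
    if token in common_words:
        return False
    if token == dict_word:
        return True
    return len(token) > 6 and within_edit_distance_one(token, dict_word)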
5 www.rxlist.com, Accessed January 2013.
6 http://www.nlm.nih.gov/medlineplus, Accessed January 2013.
7 http://www.medicinenet.com, Accessed January 2013.
8 MedDRA stands for Medical Dictionary for Regulatory Activities. http://www.meddra.org, Accessed February 2013.
9 http://ncit.nci.nih.gov, Accessed March 2013.
10 Open Access, Collaborative Consumer Health Vocabulary Initiative. http://www.consumerhealthvocab.org, Accessed February 2013.
5.5 Inducing Lexico-Syntactic Patterns
In Chapter 2, I gave a high-level overview of the steps of a bootstrapped pattern learning system. Below is a summary of the steps; a minimal sketch of the loop follows the list.
1. Label data using dictionaries
2. Create patterns using the labeled data and choose top K patterns
3. Extract phrases using the learned patterns and choose top N words
4. Add new phrases to the dictionaries
5. Repeat 1-4 T times or until converged
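A minimal sketch of this loop, under the assumption that the labeling, pattern-learning, and phrase-learning steps are supplied as callables; the function and parameter names are illustrative, not the dissertation's actual code.

def bootstrap(sentences, dictionaries, label_fn, pattern_fn, phrase_fn, T=20):
    """Skeleton of the loop summarized above (a sketch).
    dictionaries: label -> set of seed phrases.
    label_fn(sentences, dictionaries)      -> labeled data        (step 1)
    pattern_fn(labeled, label)             -> top K patterns      (step 2)
    phrase_fn(patterns, sentences, label)  -> top N new phrases   (step 3)"""
    learned = {label: [] for label in dictionaries}
    for _ in range(T):                                    # at most T iterations (step 5)
        labeled = label_fn(sentences, dictionaries)       # step 1
        progress = False
        for label in dictionaries:
            patterns = pattern_fn(labeled, label)         # step 2
            new_phrases = phrase_fn(patterns, sentences, label)   # step 3
            learned[label].extend(patterns)
            dictionaries[label] |= set(new_phrases)       # step 4
            progress |= bool(patterns) or bool(new_phrases)
        if not progress:                                  # converged: nothing new learned
            break
    return learned, dictionaries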
I experimented with different phrase and pattern weighting schemes (for example, ap-
plying log sublinear scaling in the weighting formulations below) and parameters for our
system. I selected the ones that performed best on the Asthma forum test sentences. Below,
I explain the algorithm using DT as an example label for ease of explanation.
5.5.1 Creating Patterns
I create potential patterns by looking at two to four words before and after the labeled
tokens. I discard contexts that consist of only 2 or fewer stop words because they are too general and extract many noisy entities. Contexts consisting of 3 or more stop words are included because the longer context makes them less general; for example, ‘I am on X’ is a good pattern to
extract DTs. Words that are labeled with one of the dictionaries are generalized with the
class of the dictionary. I create flexible patterns by ignoring the words {‘a’, ‘an’, ‘the’,
‘,’, ‘.’} while matching the patterns and by allowing at most two stop words between the
context and the term to be extracted. I create two sets of the above patterns – with and
without the part-of-speech (POS) restriction of the target phrase (for example, that it only
contains nouns). Since many symptoms and drugs tend to be more than just one word, I
allow matching 1 to 2 tokens. In our experiments, matching 3 or more consecutive terms
extracted noisy phrases, mostly by patterns without the POS restriction. Table 2.1 shows
an example of two patterns and how they match to two sentences.
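A rough sketch of candidate-context generation as described above; the stop-word list, the per-token label array, and the data structures are assumptions for illustration, not the actual implementation.

STOP_WORDS = {"i", "am", "on", "the", "a", "an", "and", "be", "have", "of"}  # assumed subset

def candidate_contexts(tokens, labels, i, min_w=2, max_w=4):
    """Candidate surface-pattern contexts around the labeled token at index i.
    tokens[j] is a word; labels[j] is its dictionary label (e.g. 'DT') or None."""
    def generalize(j):
        # Context words labeled with one of the dictionaries are generalized
        # to the name of that class.
        return labels[j] if labels[j] else tokens[j].lower()

    contexts = []
    for w in range(min_w, max_w + 1):
        left = [generalize(j) for j in range(max(0, i - w), i)]
        right = [generalize(j) for j in range(i + 1, min(len(tokens), i + 1 + w))]
        for ctx in (left, right):
            # Discard contexts made up of only two or fewer stop words.
            if all(t in STOP_WORDS for t in ctx) and len(ctx) <= 2:
                continue
            for pos_restricted in (True, False):   # with / without POS restriction
                contexts.append((tuple(ctx), pos_restricted))
    return contexts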
5.5.2 Learning Patterns
I learn new patterns by weighting them using normalization measures and selecting the top
patterns. In essence, we want to trade off precision and recall of the patterns to extract the
correct phrases. The weighting scheme for a pattern i is
\[ \mathrm{pt}_i = \frac{\sum_{k=1}^{m} \sqrt{\mathrm{freq}(i, w_k)}}{\sum_{j=1}^{n} \sqrt{\mathrm{freq}(i, w_j)}} \qquad (5.1) \]
where m is the number of words with the label DT that match the pattern, n is the number of
all words that match the pattern, and freq(i, wk) is the number of times pattern i matched
the phrase wk. Sublinear scaling of the frequency prevents high frequency words from
overshadowing the contribution of low frequency words. Using the RlogF pattern scoring
function (Riloff, 1996) led to lower scores in the pilot experiments. I discard patterns that
have weight less than a threshold (=0.5 in our experiments). I also discard patterns when m
is equal to n since adding them would be of no benefit for learning new phrases. I remove
patterns that occur in the top 500 patterns for the other label. After calculating weights for
all the remaining patterns, I choose the top K (=50 in our experiments) patterns.
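A minimal sketch of Equation 5.1 and the pattern filters described above; the data structures are assumed, and the filter that removes patterns appearing in the top 500 patterns of the other label is omitted.

from math import sqrt

def pattern_weight(match_freq, dt_words):
    """Equation 5.1 for one pattern. match_freq[w] is the number of times the
    pattern matched word w; dt_words is the set of words already labeled DT."""
    num = sum(sqrt(c) for w, c in match_freq.items() if w in dt_words)
    den = sum(sqrt(c) for c in match_freq.values())
    return num / den if den else 0.0

def select_patterns(candidates, dt_words, K=50, threshold=0.5):
    """Rank candidate patterns and keep the top K.
    candidates: dict pattern -> {word: match count}."""
    scored = []
    for pattern, match_freq in candidates.items():
        if set(match_freq) <= dt_words:      # m equals n: nothing new to learn
            continue
        w = pattern_weight(match_freq, dt_words)
        if w >= threshold:                   # discard low-weight patterns
            scored.append((w, pattern))
    scored.sort(key=lambda x: x[0], reverse=True)
    return [pattern for _, pattern in scored[:K]]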
5.5.3 Learning Phrases
I apply the patterns selected by the above process to all the sentences and extract the
matched phrases. The phrase weighting scheme is a combination of TF-IDF scoring,
weight of the patterns, and relative frequency of the phrases in different dictionaries. The
latter weighting term assigns higher weight to words that are sub-phrases of phrases in the
entity’s dictionary. The weighting function for a phrase p for the label DT is
\[ \mathrm{weight}(p, \mathrm{DT}) = \left( \frac{\sum_{i=1}^{t} \mathrm{num}(p, i) \times \mathrm{pt}_i}{\log(\mathrm{freq}_p)} \right) \times \frac{1 + \mathrm{dictDTFreq}_p}{1 + \mathrm{dictSCFreq}_p} \qquad (5.2) \]
where t is the number of patterns that extract the phrase p, num(p, i) is the number of times
phrase p is extracted using pattern i, pti is the weight of the pattern i from the previous
equation, freq_p is the frequency of phrase p in the corpus, and dictDTFreq_p and dictSCFreq_p are the frequencies of phrase p in the n-grams of the phrases from the DT dictionary and the
SC dictionary, respectively. I discard phrases with weight less than a threshold (=0.2 in our
experiments). I also discard phrases that are matched by fewer than 2 patterns to improve
precision of the system – phrases extracted by multiple patterns tend to be more accurate.
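A minimal sketch of Equation 5.2, with the inputs passed in directly; the guard against taking the logarithm of 1 is an addition for the sketch, not part of the formula.

from math import log

def phrase_weight(num, pt, corpus_freq, dict_dt_freq, dict_sc_freq):
    """Equation 5.2 for one phrase (sketch).
    num[i]        : times the phrase was extracted by pattern i
    pt[i]         : weight of pattern i from Equation 5.1
    corpus_freq   : frequency of the phrase in the corpus
    dict_dt_freq, dict_sc_freq : frequency of the phrase in the n-grams of
                    the DT and SC dictionary phrases, respectively"""
    denom = log(corpus_freq) if corpus_freq > 1 else 1.0   # guard added for the sketch
    pattern_term = sum(num[i] * pt[i] for i in num) / denom
    return pattern_term * (1 + dict_dt_freq) / (1 + dict_sc_freq)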
I remove the following kinds of phrases from the set of potential phrases: (1) list of
specialists and physicians downloaded from WebMD11, (2) words in the GoogleCommonList, (3) the 5000 most frequent tokens from around 1 million tweets from Twitter, to avoid learning slang words like ‘asap’, and (4) phrases that are already in any of the dictionaries. I
then extract up to the top N (=10 in our experiments) words and label those phrases in the
sentences. I also remove body parts phrases (198 phrases that were curated from Wikipedia
and manually expanded by us) from the set of potential DT phrases.
I repeat the cycle of learning patterns and learning phrases T times (=20 in our experi-
ments) or until no more patterns and words can be extracted.
5.6 Evaluation Setup
5.6.1 Test Data
I tested our system and the baselines on two forums – Asthma and ENT. For each forum, I
randomly sampled 500 sentences, and my collaborator and I annotated 250 sentences each.
The test sentences were removed from the data used in the learning system. The labeling
guidelines for the annotators for the test sentences were to include the minimum number of
words to convey the medical information. To calculate the inter-annotator agreement, the
annotators labeled 50 sentences from the 250 sentences assigned to the other annotator; the
agreement is thus calculated on 100 sentences out of the 500 sentences. The token-level
agreement for the Asthma test sentences was 96% with Cohen’s kappa κ=0.781, and for
the ENT test sentences was 96.2% with Cohen’s kappa κ=0.801. I used the Asthma forum
as a development forum to select parameters, such as the maximum number of patterns and
phrases added in an iteration, total number of iterations, and the thresholds for learning
patterns and phrases. I discuss the effect of varying these parameters in the additional exper-
iments section below. I used ENT as a test forum; no parameters were tuned on the ENT
Table 5.1: F1 scores for labeling with Dictionaries using different types of labeling schemes. ‘-F’ means using fuzzy matching and ‘-C’ means pruning words that are in GoogleCommonList.
Our system vs. Other systems
Tables 5.2–5.5 show the scores for DT and SC labels on the Asthma and ENT forums.
The horizontal line separates systems that do not learn new phrases from the systems that
do. An asterisk denotes that our system is statistically significantly better than that system (two-tailed p-value < 0.05, computed using approximate randomization).
In most cases, our system significantly outperforms current standard tools in medical
informatics. MetaMap and OBA have lower computational time since they do not match
words fuzzily or learn new dictionary phrases, but have lower performance. All systems
extract SC terms with higher recall than DT terms because many simple SC terms (such
as ‘asthma’) occurred frequently and were present in the dictionary. The improvement in
performance of our system over the baselines is higher for DT as compared to SC, mainly
because SC terms are usually verbose and descriptive, and hence are harder to extract using
patterns. In addition, the performance is higher on Asthma than on ENT for two reasons.
First, the system was tuned on the Asthma forum. Second, the Asthma test set had many
easy to label DT and SC phrases, such as ‘asthma’ and ‘inhaler’. On the other hand, many
ENT phrases were longer and not present in seed dictionaries, such as ‘milk free diets’ and
‘smelly nasal discharge’.
One of the reasons that CRF does not perform so well, despite being very popular for ex-
tracting entities from human-labeled text data, is that the data is partially labeled using the
dictionaries. Thus, the data is noisy and lacks full supervision provided in human-labeled
data, making the word-level features not very predictive. CRF missed extracting some
common terms like ‘inhaler’ and ‘inhalers’ as DT (‘inhaler’ occurred only as a sub-phrase
in the seed dictionary), and extracted some noisy terms, such as ‘afraid’ and ‘icecream’. In
addition, CRF uses context for labeling data – we show in the Additional Experiments sec-
tion that using context in the form of patterns performs worse than dictionary matching for
labeling data. Our system, on the other hand, learned new dictionary phrases by exploiting
context but labeled data by dictionary matching. Self-training the CRF initially increased
the F1 score for DT but performed worse in the subsequent iterations. Xu et al.’s system
performed worse because of its overdependence on the seed patterns: it gave low scores to
patterns that extracted phrases that had low overlap with the phrases extracted by the seed
patterns, which resulted in lower recall.
I believe token-level evaluation is better suited than entity-level evaluation for the task.
However, for completeness, I have included entity-level evaluation results in Tables 5.6-5.9.
Scores of all systems are better when measured at the token level than at the entity level
because they get credit for extracting partial entities. The entity-level evaluation results
show a similar trend as the token-level evaluation: our system performs better than other
systems, albeit the difference is smaller for the DT label on the ENT forum.
5.7.1 Analysis
Tables 5.10–5.13 show the top 10 patterns extracted from the Asthma and the ENT forums
for the two labels. To improve readability, I have shown only the sequences of lemmas
from the patterns. X indicates the target phrase and ‘pos:’ indicates the part-of-speech
restriction. As we can see, several context tokens are generalized to their labels. Some
target entities do not have part-of-speech restriction, especially when the context is very
predictive of the label, such as ‘and be diagnose with’ for label SC.
Table 5.1 shows the top 15 phrases extracted from the Asthma and ENT forums by our
system. Figures 5.2 and 5.3 show the phrases extracted by our system from the following
three forums: Acne, Breast Cancer, and Adult Type II Diabetes. We can broadly group the
extracted phrases into 4 categories, which are described below.
New Terms
One goal of extracting medical information from PAT is to learn new treatments patients
are using or symptoms they are experiencing. Our system extracted phrases like ‘stabbing
System                      Precision   Recall   F1
Dictionary-F-C              65.57       50       56.73
Xu et al.-25                64.63       50.45    56.67
Xu et al.-50                64.78       52.03    57.71
CRF                         63.27       50.67    56.28
CRF-2                       62.01       50.22    55.49
CRF-20                      61.38       50       55.11
Our System                  62.53       55.88    59.02
Table 5.9: Entity-level Precision, Recall, and F1 scores of our system and the baselines on the ENT forum for the label SC.
i be put on (X | pos: noun)
i have be on (X | pos: noun)
use DT and (X | pos: noun)
put he on (X | pos: noun)
prescribe DT and (X | pos: noun)
mg of X
he put I on X
to give he (X | pos: noun)
i have be use X
and put I on X
Table 5.10: Top 10 patterns learned for the label DT on the Asthma forum.
(X | pos: noun) SC etc.
reduce SC (X | pos: noun)
first SC (X | pos: noun)
have history of (X | pos: noun)
develop SC (X | pos: noun)
really bad SC (X | pos: noun)
not cause SC (X | pos: noun)
symptom be (X | pos: noun)
(X | pos: noun) SC feel
and be diagnose with X
Table 5.11: Top 10 patterns learned for the label SC on the Asthma forum.
have endoscopic (X | pos: noun)
include DT (X | pos: noun)
and put I on (X | pos: noun)
(X | pos: noun) 500 mg
2 round of (X | pos: noun)
and be put on (X | pos: noun)
have put I on (X | pos: noun)
(X | pos: adj) DT and use
ent put I on X
(X | pos: noun) and nasal rinse
Table 5.12: Top 10 patterns learned for the label DT on the ENT forum.
persistent SC (X | pos: noun)
have have problem with (X | pos: noun)
diagnose I with SC (X | pos: noun)
morning with SC (X | pos: noun)
(X | pos: noun) SC cause SC
have be treat for (X | pos: noun)
year SC (X | pos: noun)
(X | pos: noun) SC even though
(X | pos: noun) SC like SC
daughter have SC X
Table 5.13: Top 10 patterns learned for the label SC on the ENT forum.
Figure 5.2: Top 50 DT phrases extracted by our system for three different forums. Erroneous phrases (as determined by us) are shown in gray. Full forms of some abbreviations are in italics. Note that Abbreviations are also New Terms but are categorized separately because of their frequency in PAT.
Figure 5.3: Top 50 SC phrases extracted by our system for three different forums. Erroneous phrases (as determined by us) are shown in gray. Full forms of some abbreviations are in italics. Note that Abbreviations are also New Terms but are categorized separately because of their frequency in PAT.
(a) Top DT phrases.
(b) Top SC phrases.
Figure 5.4: Top DT and SC phrases extracted by our system, MetaMap, and MetaMap-C for the Diabetes forum. Numbers in parentheses indicate the number of times the phrase was extracted by the system. Erroneous phrases (as determined by us) are shown in gray.
top most frequent phrases extracted from the Diabetes forum in Figure 5.4. We can see that
our system extracts more relevant phrases. The reason we do not extract ‘insulin’ is that it exists (incorrectly) in the automatically curated SC dictionary and we do not label DT
phrases that are in the SC dictionary. For our system, I concatenated all consecutive words
with the same label as one phrase, in contrast with MetaMap, which many times extracted
consecutive words as different phrases (leading to the difference in the frequency of some
phrases). For example, our system extracted ‘diabetes drug dependency’, but MetaMap
extracted it as ‘diabetes’ and ‘drug dependency’. Similarly, our system extracted ‘latent
autoimmune diabetes in adults’, whereas MetaMap extracted ‘latent’ and ‘autoimmune
diabetes’.
Below, I demonstrate a use case of the system to explore alternative treatments people
use for a symptom or condition. I manually labeled posts that mentioned new treatments
identified by our system as DTs and explored their efficacy by mining sentiment towards
them in the forum.
5.7.2 Case study: Anecdotal Efficacy
Our system can be used to explore different (possibly previously unknown) treatments peo-
ple are using for a condition. In turn, this can lead to novel insights, which can be further
explored by the medical community. For example, for Diabetes, our system extracted ‘Cin-
namon’ and ‘Vinegar’ as DTs. To study the anecdotal efficacy of ‘Cinnamon’ and ‘Vinegar’
for managing Diabetes, we manually labeled the posts that mentioned the terms as treat-
ment for Diabetes (47 out of 49 posts for ‘Cinnamon’ and 26 out of 30 posts for ‘Vinegar’)
with the sentiment towards that treatment. Both terms were extracted as DT by our system
for the Diabetes forum. ‘Strongly positive’ means the treatment helped the person. ‘Weakly
positive’ means the person is using the treatment or has heard positive effects of it. ‘Neu-
tral’ means the user is not using the treatment and did not express an opinion in the post.
‘Weakly negative’ means the person has heard that the treatment does not work. ‘Strongly
negative’ means the treatment did not work for the person. An informal analysis of the
posts reveals that ‘Cinnamon’ was generally considered helpful by the community and
‘Vinegar’ had mixed reviews (Figure 5.5). Below are more details about each label.
Figure 5.5: Study of efficacy of ‘Cinnamon’ and ‘Vinegar’, two DTs extracted by our system, for treating Type II Diabetes.
• Strongly positive: The person has explicitly mentioned that the treatment is helping
the subject of the post (many times the posts discuss health of a family member)
for Diabetes. Example: “. . . A relative with the same problem told her about taking
cinnamon gel tabs which had greatly helped her. She found a brand at the local health
store by the name of NewChapter titled Cinnamon Force. She was afraid to take it
with so many other medications and it sat in the cabinet about five months. Last week,
she got brave and took two tabs behind the two largest meals of the day. Wow! the level dropped down into the safe range and has remained there for several days. All that I can tell you about the product, is that it contains 140mg of cinnamon per gel tab. We are so thrilled that after so many years of frustration, that we see a great
change in blood sugar levels . . . ”
• Weakly positive: The subject of the post is either using the treatment or heard/read
positive effects of the treatment for Diabetes. Example: “... Some people do think
things such as vinegar help. My belief is those things are worth trying but they are
secondary to tried and true things such as weight loss, exercise and lowering carb
intake.”
• Neutral: The subject of the post is neither using the treatment nor expressed any
sentiment about it in the post. Example: “. . . I may be wrong, but I haven’t heard of
cinnamon lowering glucose levels. Please take your mother to a doctor for a checkup
asap . . . ” Posts that asked a question about using the treatment were also
labeled neutral. For example, “Does vinegar help diabetes” is labeled Neutral.
• Weakly negative: The post mentioned that the user has heard that the treatment does
not work. For example, people citing studies that showed inconclusive evidence of
the efficacy of the treatment. Example: “. . . Studies now show that cinnamon doesn’t
lower glucose levels, but has been known to regulate blood pressure. I can vouch for
the latter . . . ”
• Strongly negative: The post mentioned that the treatment is not working from per-
sonal experience of the subject of the post (for example, a family member). Example:
“I have tried the Apple Cider Vinegar and it didn’t work for me . . . ”
System                       DT: Precision  Recall  F1       SC: Precision  Recall  F1
Our system                   86.88          58.67   70.04    78.10          75.56   76.81
Pattern Matches (No Gen.)    45.26          13.73   21.07    50.75          12.33   19.85
Table 5.16: Effects of use of GoogleCommonList in OBA on the Asthma forum. Precision, Recall, and F1 scores of OBA when words in GoogleCommonList are not labeled (‘-C’ suffix), and when words in GoogleCommonList and in manually identified negative phrases are not labeled (‘-C-T5’ suffix).
Manually removing top negative words from MetaMap and OBA
I sorted all words extracted by MetaMap and OBA by their frequency and manually iden-
tified the top 5 words that I judged as incorrect (without considering the context). I ran experi-
ments in which those words were not labeled by OBA-C and MetaMap-C, that is, I added
them to the stop words list. The systems are marked as ‘OBA-C-T5’ and ‘MetaMap-C-
T5’, respectively, in Tables 5.16-5.17 and 5.18-5.19. The motivation for comparing the performance of these systems is the scenario in which a user manually identifies the top negative words and adds them to the stop words list. Removing the manually iden-
tified words generally increased precision, but reduced recall. I suspect the recall dropped
because the words might be correct when they appeared in some contexts. The reason for
the same scores for MetaMap-C and MetaMap-C-T5 for the SC label on the Asthma forum
is that the negative words were already in the GoogleCommonList.
Table 5.17: Effects of use of GoogleCommonList in OBA on the ENT forum. Precision, Recall, and F1 scores of OBA when words in GoogleCommonList are not labeled (‘-C’ suffix), and when words in GoogleCommonList and in manually identified negative phrases are not labeled (‘-C-T5’ suffix).
Table 5.18: Effects of use of GoogleCommonList in MetaMap on the Asthma forum. Precision, Recall, and F1 scores of MetaMap when words in GoogleCommonList are not labeled (‘-C’ suffix), and when words in GoogleCommonList and in manually identified negative phrases are not labeled (‘-C-T5’ suffix).
Table 5.19: Effects of use of GoogleCommonList in MetaMap on the ENT forum. Precision, Recall, and F1 scores of MetaMap when words in GoogleCommonList are not labeled (‘-C’ suffix), and when words in GoogleCommonList and in manually identified negative phrases are not labeled (‘-C-T5’ suffix).
Table 5.20: Scores when our system is run with different phrase threshold values. Increasing the threshold increases the precision but reduces recall. The value in bold was used in our final system.
Table 5.21: Scores when our system is run with different pattern threshold values. All other parameters remain unchanged. The thresholds of 0.2 and 0.5 did not make a difference because all extracted patterns had a score of more than 0.5. The threshold of 0.8 led to higher precision but lower recall. A threshold of 1.0 did not extract any patterns. The value in bold was used in our final system.
Parameter Tuning
In our experiments, I tuned the parameters, such as N , K, and T , on the Asthma forum.
In this section, we discuss the effect of varying some of the parameters (keeping others the
same as in the final system) on extracting DT phrases from the Asthma forum. We observed a similar effect when varying the parameters for extracting SC phrases from the Asthma forum.
Phrase and pattern thresholds
Tables 5.20 and 5.21 show scores of our system when different phrase and pattern thresh-
olds are used. In both cases, generally increasing the threshold resulted in higher precision
but lower recall.
Table 5.23: Scores when our system is run with different values of K. Increasing K decreases precision but improves recall. The values shown in bold were used in our final system.
Number of phrases in each iteration (N )
Our system learned a maximum of 200 phrases (with maximum number of phrases in each
iteration N=10 and maximum number of iterations T=20). Table 5.22 shows scores for
different combinations of values of N and T, keeping the total number of phrases learned
constant.
Number of patterns in each iteration (K)
Table 5.23 shows results for different values of K, that is, the maximum number of patterns learned in each iteration.
5.9 Future Work
Future improvements to performance would allow us to reap enhanced benefits from au-
tomatic medical term extraction. Improving precision, for example, would reduce manual
effort required for verifying extracted terms to do an analysis similar to the one shown in Figure 5.5. Improving recall would increase the range of terms that we extract. For example,
at present, our system still misses relevant terms, such as ‘oatmeal’ as a DT for Diabetes.
Our results open several avenues for future work on mining and analyzing PAT. Extrac-
tion of DT and SC entities allows us to investigate connections and relationships between
drug pairs, and drugs and symptoms. Prior work has successfully identified adverse drug
events in electronic medical records (Tatonetti et al., 2012); using self-report patient data
(such as that found on MedHelp), we might uncover novel information on how particular
drug combinations affect users. One such case study to identify side effects of drugs was
presented in Leaman et al. (2010). Our system can also help to analyze sentiment towards
various treatments, including home remedies and alternative treatments, for a particular
disease – manually enumerating all treatments, along with their morphological variations,
is difficult. Finally, I note that our system does not require any labeled sentences
and thus can be applied to many different types of PAT (like patient emails) and entity types
(like diagnostic tests).
5.10 Conclusion
I demonstrate a method for identifying medical entity types in patient-authored text. I in-
duce lexico-syntactic patterns using a seed dictionary of desirable terms. Annotating spe-
cific types of medical terms in PAT is difficult because of lexical and semantic mismatches
between experts’ and consumers’ description of medical terms. Previous ontology-based
tools like OBA and MetaMap are good at fine-grained concept mapping on expert-authored
text, but they have low accuracy on PAT.
I demonstrate that our method improves performance for the task of extracting two
entity types: drugs & treatments (DT) and symptoms & conditions (SC), from MedHelp’s
Asthma and ENT forums by effectively expanding dictionaries in context. Our system
extracts new entities missing from the seed dictionaries: abbreviations, relevant sub-phrases
of seed dictionary phrases, and spelling mistakes. In evaluation, in most cases, our system
significantly outperformed MetaMap, OBA, an existing system that uses word patterns for
extracting diseases, and a conditional random field classifier. I believe that the ability to
effectively extract specific entities is the key first step towards deriving novel findings from
PAT.
Pattern and entity scoring are the critical components of a bootstrapped pattern-based
learning system. The system developed in this chapter utilizes only the supervision pro-
vided by seed sets to score patterns and entities. Thus, many entities extracted by patterns
are unlabeled. During the pattern scoring phase, the unlabeled entities extracted by patterns
are considered negative. However, many of these unlabeled entities are actually positive,
resulting in lower scores for good patterns that extract many good (that is, positive) unla-
beled entities. In the next chapter, I propose improvements to the pattern scoring phase by
evaluating unlabeled entities using unsupervised measures. This leads to improved precision
and recall.
Chapter 6
Leveraging Unlabeled Data ImprovesPattern Learning
In the previous chapters, I discussed bootstrapped pattern-based learning (BPL) as an effective approach for entity extraction with minimal distantly supervised data. In this chapter,
I propose improvements to BPL by leveraging unlabeled data to enhance the pattern scoring
function. The work has been published in Gupta and Manning (2014a).
6.1 Introduction
In a pattern-based entity learning system, scoring patterns and scoring entities are the most
important steps. In the pattern-scoring phase, patterns are scored by their ability to extract
more positive entities and fewer negative entities. In a supervised setting, the efficacy of
patterns can be judged by their performance on a fully labeled dataset (Califf and Mooney,
1999; Ciravegna, 2001). In contrast, in a BPL system, seed dictionaries and/or patterns pro-
vide weak supervision. Thus, most entities extracted by candidate patterns are unlabeled,
making it harder for the system to learn good patterns.
Existing systems score patterns by making closed world assumptions about the unla-
beled entities. The problem is similar to the closed world assumption in distantly super-
vised relation extraction systems, when all propositions missing from a knowledge base
are considered false (Ritter et al., 2013; Xu et al., 2013). Consider the example discussed
in Chapter 2, also shown in Figure 6.1. Current pattern learning systems would score both
patterns, ‘own a X’ and ‘my pet X’, equally by either ignoring the unlabeled entities or
assuming them as negative. However, these scoring schemes cannot differentiate between
patterns that extract good versus bad unlabeled entities. Systems that ignore the unlabeled
entities do not leverage the unlabeled data in scoring patterns. Frequently, these systems
learn patterns that extract some positive entities but many bad unlabeled entities. Systems
that assume unlabeled entities to be negative are very conservative; in the example, they
wrongly penalize ‘Pattern 1’, which extracted the good unlabeled entity ‘cat’.
Predicting the labels of unlabeled entities can improve pattern scoring. Features like dis-
tributional similarity can predict that ‘cat’ is closer to the seed set {dog} than ‘house’, and
a pattern learning system can use that information to rank ‘Pattern 1’ higher than ‘Pattern
2’. In this chapter, I improve the scoring of patterns for an entity class by defining a pat-
tern’s score using the number of positive entities it extracts and the ratio of the number of positive entities to the expected number of negative entities it extracts. I propose five features to predict
the scores of unlabeled entities. One feature, based on Google Ngrams, exploits the specialized nature of our dataset: entities that are frequent on the web are less likely to be drug-and-treatment entities. The other four features can be used to learn entities for generic
domains as well.
My main contribution is introducing the expected number of negative entities in pat-
tern scoring – I predict probabilities of unlabeled entities belonging to the negative class.
I estimate an unlabeled entity’s negative class probability by averaging probabilities from
various unsupervised class predictors, such as distributional similarity, string edit distances
from learned entities, and TF-IDF scores. Our system performs significantly better than ex-
isting pattern scoring measures for extracting drug-and-treatment entities from four medical
forums on MedHelp.
6.2 Related Work
I discuss pattern-based systems in Chapter 3. Here, I review the pattern-scoring aspects
of previous pattern-based systems. The pioneering work by Hearst (1992) used hand writ-
ten patterns to automatically generate more rules that were manually evaluated to extract
Figure 6.1: An example pattern learning system for the class ‘animals’ from the text start-ing with the seed entity ‘dog’. The figure shows two candidate patterns, along with theirextracted entities, in the first iteration. Text matched with the patterns is shown in italicsand the extracted entities are shown in bold.
hypernym-hyponym pairs from text. Other supervised systems like SRV (Freitag, 1998),
SLIPPER (Cohen and Singer, 1999), (LP)2 (Ciravegna, 2001), and RAPIER (Califf and
Mooney, 1999) used a fully labeled corpus to either create or score patterns.
Riloff (1996) used a set of seed entities to bootstrap learning of rules for entity extrac-
tion from unlabeled text. She scored a rule by a weighted conditional probability measure,
called RlogF, estimated by counting the number of positive entities among all the entities
extracted by the pattern. Thelen and Riloff (2002) extended the above bootstrapping al-
gorithm for multi-class learning. Riloff and Jones (1999) used a pattern scoring measure similar to that of Riloff (1996) for their multi-level bootstrapping approach. Snowball (Agichtein
and Gravano, 2000) used the same scoring function for patterns as Riloff (1996). Yangar-
ber et al. (2002) and Lin et al. (2003) used a combination of accuracy and confidence of
a pattern for multiclass entity learning, where the accuracy measure ignored the unlabeled
entities and the confidence measure treated them as negative. Talukdar et al. (2006) used
seed sets to learn trigger words for entities and a pattern automaton. Their pattern scoring measure is the same as that of Lin et al. (2003). In Chapter 5, I use the ratio of scaled frequencies
of positive entities among all extracted entities. None of the above measures predict labels
of unlabeled entities to score patterns. Our system outperforms them in our experiments.
Stevenson and Greenwood (2005) used Wordnet to assess patterns, which is not feasible
for domains that have low coverage in Wordnet, such as medical data. Zhang et al. (2008)
used the HITS algorithm (Kleinberg, 1999) over patterns (authorities) and instances (hubs)
to overcome some of the problems with the above systems – unlabeled entities extracted
by patterns are either considered negative or are ignored when computing pattern scores.
However, they do not use any external unsupervised knowledge for evaluating the unlabeled
entities.
Current open entity extraction systems either ignore the unlabeled entities or consider
them as negative. KnowItAll’s entity extraction from the web (Downey et al., 2004; Etzioni
et al., 2005) used components such as list extractors, generic and domain specific pattern
learning, and subclass learning. They learned domain-specific patterns using a seed set and
scored them by ignoring unlabeled entities. One of our baselines is similar to their domain-
specific pattern learning component. Carlson et al. (2010a) learned multiple semantic types
using coupled semi-supervised training from web-scale data, which is not feasible for all
datasets and entity learning tasks. They assessed patterns by their precision, assuming un-
labeled entities to be negative; one of our baselines is similar to their pattern assessment
method. Other open information extraction systems like ReVerb (Fader et al., 2011) and
OLLIE (Mausam et al., 2012) are mainly geared towards generic, domain-independent rela-
tion extractors for web data. ReVerb used manually written patterns (called constraints)
to extract potential tuples, which were scored using a logistic regression classifier trained
on around 1000 manually labeled sentences. OLLIE ranked patterns by their frequency of
occurrence in the dataset. For more discussion on these systems, see Chapter 2.
6.3 Approach
In Chapter 2, I discussed the skeleton of a bootstrapped pattern-based learning system. In
this chapter, I use the same framework with lexico-syntactic surface word patterns. I extract
entities from unlabeled text starting with seed dictionaries of entities for multiple classes.
The success of bootstrapped pattern learning methods crucially depends on the effec-
tiveness of the pattern scorer and the entity scorer. Here I focus on improving the pattern
scoring measure.
6.3.1 Creating Patterns
Candidate patterns are created using contexts of words or their lemmas in a window of two
to four words before and after a positively labeled token. Context words that are labeled
with one of the classes are generalized with that class. The target term has a part-of-speech
(POS) restriction, which is the POS tag of the labeled token. I create flexible patterns by
ignoring the words {‘a’, ‘an’, ‘the’} and quotation marks when matching patterns to the
text. Some examples of the patterns are shown in Table 6.4.
6.3.2 Scoring Patterns
Judging the efficacy of patterns without using a fully labeled dataset can be challenging
because of two types of failures: (1) penalizing good patterns that extract good (that is, positive) unlabeled entities, and (2) giving high scores to bad patterns that extract bad (that
is, negative) unlabeled entities. Existing systems that assume unlabeled entities as negative
are too conservative in scoring patterns and suffer from the first problem. Systems that
ignore unlabeled entities can suffer from both problems. For a pattern r, let sets Pr, Nr,
and Ur denote the positive, negative, and unlabeled entities extracted by r, respectively.
One commonly used pattern scoring measure RlogF (Riloff, 1996) calculates a pattern’s score by the function $\frac{|P_r|}{|P_r| + |N_r| + |U_r|} \log(|P_r|)$. The first term is a rough measure of precision,
which assumes unlabeled entities as negative. The second term gives higher weights to
patterns that extract more positive entities. The function has been shown to be effective for
learning patterns in many systems. However, it gives lower scores to patterns that extract
many unlabeled entities – regardless of whether those entities are good or bad.
I propose to estimate the labels of unlabeled entities to more accurately score the patterns. The pattern score ps(r) is calculated as

ps(r) = ( |P_r| / (|N_r| + Σ_{e ∈ U_r} (1 − score(e))) ) · log(|P_r|)        (6.1)
where |.| denotes the size of a set. The function score(e) gives the probability of an entity
e belonging to C. If e is a common word, score(e) is 0. Otherwise, score(e) is calculated
as the average of five feature scores (explained below), each of which gives a score between
CHAPTER 6. LEVERAGING UNLABELED DATA 98
0 and 1. The feature scores are calculated using the seed dictionaries, learned entities for
all labels, Google Ngrams, and clustering of domain words using distributional similarity.
The log(|P_r|) term, inspired by RlogF, gives higher scores to patterns that extract more
positive entities. Candidate patterns are ranked by ps(r) and the top patterns are added to
the list of learned patterns.1
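As a concrete illustration of Equation 6.1, the sketch below contrasts the proposed pattern score with RlogF on a toy example; the entity sets and the score() function stand in for the quantities defined above and are not the actual system code.

```python
import math

def rlogf(pos, neg, unl):
    """RlogF (Riloff, 1996): treats unlabeled extractions as if they were negative."""
    if not pos:
        return float("-inf")
    return len(pos) / (len(pos) + len(neg) + len(unl)) * math.log(len(pos))

def ps(pos, neg, unl, score):
    """Equation 6.1: discount each unlabeled entity by its estimated probability
    score(e) in [0, 1] of belonging to the positive class."""
    if not pos:
        return float("-inf")
    denom = len(neg) + sum(1.0 - score(e) for e in unl)
    return len(pos) / max(denom, 1e-6) * math.log(len(pos))

# Toy example: a pattern extracting 3 positives and 4 unlabeled entities that look
# positive (score 0.9 each) is penalized by RlogF but not by Equation 6.1.
pos, neg, unl = {"advair", "xolair", "asmanex"}, set(), {"inhaler", "hfa", "nebulizer", "qvar"}
print(rlogf(pos, neg, unl))                 # ~0.47
print(ps(pos, neg, unl, lambda e: 0.9))     # ~8.24
```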
To calculate score(e), I use features that assess, in an unsupervised way, whether unlabeled entities are closer to the positive or to the negative entities. I motivate my choice of the five features with the following insights. First, if the dataset consists of informally written text, many unlabeled entities are spelling mistakes and morphological variations of labeled entities; I use two edit-distance-based features to predict labels for these unlabeled entities (a small illustrative sketch of these features follows the feature definitions below).
Second, some unlabeled entities are substrings of multi-word dictionary phrases but do
not necessarily belong to the dictionary’s class. For example, for learning drug names,
the positive dictionary might contain ‘asthma meds’, but ‘asthma’ is negative and might
occur in a negative dictionary as ‘asthma disease’. To predict the labels of entities that
are a substring of dictionary phrases, I use SemOdd, which I also used in Chapter 5 to
learn entities. Third, for a specialized domain, unlabeled entities that commonly occur
in generic text are more likely to be negative. I use Google Ngrams (called GN) to get
a fast, non-sparse estimate of the frequency of entities over a broad range of domains.
The above features do not consider the context in which the entities occur in text. I use
the fifth feature, DistSim, to exploit contextual information of the labeled entities using
distributional similarity. The features are defined as:
Edit distance from positive entities (EDP): This feature gives a score of 1 if e has low edit distance to the positive entities. It is computed as

max_{p ∈ P} 1( editDist(p, e) / |p| < 0.2 )

where P is the set of positive (seed and learned) entities, 1(c) returns 1 if the condition c is true and 0 otherwise, |p| is the length of p, and editDist(p, e) is the Damerau-Levenshtein string edit distance between p and e.
1 Including the |P_r| term in the denominator of Equation 6.1 resulted in comparable but a bit lower performance in some experiments.
The hard cut-off for the edit distance function gave better results in the pilot experiments than a soft scoring function.
Edit distance from negative entities (EDN): It is similar to EDP and gives a score of 1 if e has high edit distance to the negative entities. It is computed as

1 − max_{n ∈ N} 1( editDist(n, e) / |n| < 0.2 )

where N is the set of negative entities.

Semantic odds ratio (SemOdd): First, I calculate the ratio of the frequency of the entity term
in the positive entities to its frequency in the negative entities with Laplace smooth-
ing. The ratio is then normalized using a softmax function. The feature values for the
unlabeled entities extracted by all the candidate patterns are then normalized using
the min-max function to scale the values between 0 and 1. I do min-max normaliza-
tion on top of the softmax normalization because the maximum and minimum value
by softmax might not be close to 1 and 0, respectively. Treating out-of-feature-vocabulary entities the same as the worst-scored entities for the feature, that is, giving them a score of 0, performed best on the development dataset.
Google Ngrams score (GN): I calculate the ratio of the scaled frequency of e in the dataset to its frequency in Google Ngrams. The scaling factor balances the two frequencies and is computed as the ratio of the total number of phrases in the dataset to the total number of phrases in Google Ngrams. The feature values are normalized in the same way as
SemOdd.
Distributional similarity score (DistSim): Words that occur in similar contexts, such as
‘asthma’ and ‘depression’, are clustered using distributional similarity. Unlabeled
entities that get clustered with positive entities are given a higher score than the ones clustered with negative entities. To score the clusters, I learn a logistic regression classifier using cluster IDs as features, and use the learned weights as scores for all the
entities in those clusters. The dataset for logistic regression is created by considering
all positively labeled words as positive and sampling negative and unlabeled words
as negative. The scores for entities are normalized in the same way as SemOdd and
GN.
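The following sketch illustrates, under simplifying assumptions, the two edit-distance features and the softmax-plus-min-max normalization described above. The 0.2 threshold follows the definitions in this section; the restricted Damerau-Levenshtein (optimal string alignment) implementation and the helper names are illustrative stand-ins.

```python
import math

def edit_dist(a, b):
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

def edp(e, positives, thresh=0.2):
    """EDP: 1 if e is within 20% normalized edit distance of some positive entity."""
    return max(int(edit_dist(p, e) / len(p) < thresh) for p in positives)

def edn(e, negatives, thresh=0.2):
    """EDN: 1 if e is NOT close to any negative entity."""
    return 1 - max(int(edit_dist(n, e) / len(n) < thresh) for n in negatives)

def softmax_then_minmax(raw_scores):
    """Normalization used for SemOdd, GN, and DistSim: softmax over the raw values
    of all candidate entities, then min-max scaling into [0, 1]."""
    exps = [math.exp(v) for v in raw_scores]
    z = sum(exps)
    sm = [v / z for v in exps]
    lo, hi = min(sm), max(sm)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in sm]

print(edp("prednisne", {"prednisone", "advair"}))   # 1: likely a typo of 'prednisone'
print(edn("prednisne", {"asthma", "sinus"}))        # 1: far from the negative entities
print(softmax_then_minmax([2.0, 0.5, -1.0]))        # [1.0, ~0.18, 0.0]
```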
Entities outside the feature vocabulary are given a score of 0 for the features SemOdd,
GN, and DistSim. I use a simple way of combining the feature values: I give equal weights
to all features and average their scores. Features can be combined using a weighted average
by manually tuning the weights on a development set; I leave this to future work. Another
way of weighting the features is to learn the weights using machine learning. I discuss this
approach in the last section of the chapter.
6.3.3 Learning Entities
I apply the learned patterns to the text and extract candidate entities. I discard common
words, negative entities, and those containing non-alphanumeric characters from the set.
The rest are scored by averaging the scores of the DistSim, SemOdd, EDP, and EDN features
from Section 6.3.2 and the following features.
Pattern TF-IDF scoring (PTF): For an entity e, it is calculated as

(1 / log(freq_e)) · Σ_{r ∈ R} ps(r)

where R is the set of learned patterns that extract e, freq_e is the frequency of e in the corpus, and ps(r) is the pattern score calculated in Equation 6.1. Entities that are extracted by many high-weighted patterns get higher weight. To mitigate the effect of many commonly occurring entities also getting extracted by several patterns, I normalize the feature value with the log of the entity's frequency. The values are normalized in the same way as DistSim and SemOdd; a small illustrative sketch of this feature follows the feature definitions.
Domain N-grams TF-IDF (DN): This feature gives higher scores to entities that are more
prevalent in the corpus compared to the general domain. For example, to learn enti-
ties about a specific disease from a disease-related corpus, the feature favors entities
related to the disease over generic medical entities. It is calculated in the same way
as GN except the frequency is computed in the n-grams of the generic domain text.
Including GN in the phrase scoring features or including DN in the pattern scoring
features did not perform well on the development set in our pilot experiments.
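A minimal sketch of the PTF feature described above is shown below; it assumes a mapping from entities to the learned patterns that extract them and reuses the ps(r) scores from Equation 6.1 (the data structures, the guard for low-frequency entities, and the example values are illustrative).

```python
import math

def ptf(entity, corpus_freq, extracting_patterns, pattern_scores):
    """Pattern TF-IDF: sum of the scores of learned patterns that extract the entity,
    damped by the log of the entity's corpus frequency so that very common strings
    extracted by many patterns are not over-rewarded."""
    freq = corpus_freq[entity]
    total = sum(pattern_scores[r] for r in extracting_patterns[entity])
    return total / math.log(freq) if freq > 1 else total  # sketch-level guard

# Toy example with hypothetical patterns and pattern scores
corpus_freq = {"advair": 40, "time": 5000}
extracting = {"advair": ["i be put on X", "be take DT and X"],
              "time":   ["i be put on X"]}
scores = {"i be put on X": 8.2, "be take DT and X": 5.1}
print(ptf("advair", corpus_freq, extracting, scores))  # ~3.6
print(ptf("time", corpus_freq, extracting, scores))    # ~0.96
```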
6.4 Experiments
6.4.1 Dataset
I evaluate our system on extracting drug-and-treatment (DT) entities in sentences from four
forums on the MedHelp user health discussion website: 1. Acne, 2. Adult Type II Diabetes
(called Diabetes), 3. Ear Nose & Throat (called ENT), and 4. Asthma.
I used Asthma as the development forum for feature engineering and parameter tuning.
Similar to Chapter 5, a DT entity is defined as a pharmaceutical drug, or any treatment
or intervention mentioned that may help a symptom or a condition. It includes surgeries,
lifestyle changes, alternative treatments, home remedies, and components of daily care and
management of a disease, but does not include diagnostic tests and devices. Refer to Chap-
ter 2 for examples of sentences from these forums and the labeled entities. I used entities
from the following classes as negative: symptoms and conditions (SC), medical specialists,
body parts, and common temporal nouns to remove dates and dosage information.
Seed dictionaries
I used the DT and SC seed dictionaries from Chapter 5. The DT seed dictionary (36,091
phrases) and SC seed dictionary (97,211 phrases) were automatically constructed from
various sources on the Internet and expanded using the OAC Consumer Health Vocabulary,
which maps medical jargon to everyday phrases and their variants. Both dictionaries are
large because they contain many variants of entities. The dictionaries matched with 1065
phrases on the Acne forum, 1232 phrases on the Diabetes forum, 2271 phrases on the ENT
forum, and 1007 phrases on the Asthma forum. For each system, the SC dictionary was
further expanded by running the system with the SC class as positive (considering DT and
other classes as negative) and adding the top 50 words extracted by the top 300 patterns to
the SC class dictionary. This helps in adding corpus-specific SC words to the dictionary.
The lists of body parts and temporal nouns were obtained from Wordnet (Fellbaum, 1998).
The common words list was created using most common words on the web and Twitter. I
used the top 10,000 words from Google Ngrams and the most frequent 5,000 words from
Twitter.2
6.4.2 Labeling Guidelines
For evaluation, I hand labeled the learned entities pooled from all systems, to be used only
as a test set. For class DT, I labeled entities belonging to DT as positive and all others as
negative. I queried ‘word + forum name’ on Google and manually inspected the results.
Apart from the definition of the DT class above, the following instructions were followed
for each class.
Positive
The following types of variations of DT entities were allowed: spelling mistakes, abbre-
viations, and phonetically similar variations (for example, ‘brufen’ for ‘ibuprofen’). If a
word or phrase was a part of a DT entity, then it was labeled positive. For example, ‘nux’ is
considered positive because ‘nux vomica’ is sometimes used medicinally. Generic entities
that can be used as a treatment for the medical condition were included (like ‘moisturizer’
for Acne). Brand names of DT entities, like ‘Amway’, were labeled positive. Ways to
administer a medicine were included, such as ‘syrup’, ‘tabs’, and ‘inhalation’. Phrases like
‘anti-bacterial’ or ‘asthma meds’ were also considered positive.
Negative
Entities that were not labeled as positive were considered negative. If a phrase had any
non-DT word then it was considered negative, except when phrases had the name of the
disease or symptom for which the treatment is mentioned. For example, ‘sinus meds’ was
considered positive. Websites, dosages, diagnostic tests or devices, doctors, and specialists were labeled as negative.
Inter-annotator agreement between the annotator and another researcher was computed on 200 randomly sampled learned entities from each of the Asthma and ENT forums. The agreement for the entities from the Asthma forum was 96% and from the ENT forum was 92.46%. The Cohen's kappa scores were 0.91 and 0.83, respectively. Most disagreements
2 www.twitter.com, accessed from May 19 to 25, 2012.
Table 6.2: Individual feature effectiveness: Area under Precision-Recall curves when our system uses individual features during pattern scoring. Other features are still used for entity scoring.
Table 6.3: Feature ablation study: Area under Precision-Recall curves when individual features are removed from our system during pattern scoring. The feature is still used for entity scoring.
6.4.5 Results
Figures 6.2–6.5 plot the precision and recall of the systems. I do not show plots of PNOdd
and RlogF-PN to improve clarity; they performed similarly to other baselines. All systems
extract more entities for Acne and ENT because different drugs and treatments are more
prevalent in these forums. Diabetes and Asthma have more interventions and lifestyle
changes that are harder to extract. Table 6.1 shows AUC-PR scores for all systems. RlogF-PN and PNOdd have low values for Diabetes because they learned generic patterns in the initial iterations, which led them to learn incorrect entities. Overall, our system performed significantly better than the existing systems because it exploits the unlabeled data to score the patterns better: patterns that extract good unlabeled entities are ranked higher than patterns that extract bad unlabeled entities.
To compare the effectiveness of each feature in our system, Table 6.2 shows the AUC-
PR values when each feature was individually used for pattern scoring (other features were
still used to learn entities). EDP and DistSim were strong predictors of labels of unlabeled
entities because many good unlabeled entities were spelling mistakes of DT entities and occurred in contexts similar to them. Table 6.3 shows the AUC-PR values when each feature
was removed from the set of features used to score patterns (the feature was still used for
learning entities). Removing GN and DistSim reduced the AUC-PR scores for all forums.
Table 6.4 shows some examples of patterns and the entities they extracted along with
their labels when the pattern was learned. Our system learned the first pattern because
‘pinacillin’ has low edit distance from the positive entity ‘penicillin’. Similarly, it scored
the second pattern higher than the baseline because ‘desoidne’ is a typo of the positive
entity ‘desonide’. Note that the seed dictionaries are noisy – the entity ‘metro’, part of the
positive entity ‘metrogel’, was falsely considered a negative entity because it was in the
common web words list. Our system learned the third pattern for two reasons: ‘inhaler’,
‘inhalers’, and ‘hfa’ occurred frequently as sub-phrases in the DT dictionary, and they
were clustered with positive entities by distributional similarity. Since RlogF-PUN does not distinguish between unlabeled and negative entities, it does not learn the pattern. Table 6.5 shows the top 10 patterns learned for the ENT forum by our system and RlogF-PUN, the best performing baseline for that forum. Our system preferred to learn patterns with longer contexts first, which are usually higher precision.
Forum  | Pattern          | Positive entities                                                                                | Negative | Unlabeled              | Our System | Baseline
ENT    | he give I more X | antibiotics, steroid, antibiotic                                                                 |          | pinacillin             | 68         | NA (RlogF-PUN)
Acne   | topical DT ( X   | prednisone, clindamycin, differin, benzoyl peroxide, tretinoin, metrogel                         | metro    | desoidne               | 149        | 231 (RlogF-PN)
Asthma | i be put on X    | cortisone, prednisone, asmanex, advair, augmentin, bypass, nebulizer, xolair, steroids, prilosec |          | inhaler, inhalers, hfa | 8          | NA (RlogF-PUN)

Table 6.4: Example patterns and the entities extracted by them, along with the rank at which the pattern was added to the list of learned patterns. NA means that the system never learned the pattern. Baseline refers to the best performing baseline system on the forum. The patterns have been simplified to show just the sequence of lemmas. X refers to the target entity; all of them in these examples had a noun POS restriction. Terms that have already been identified as the positive class were generalized to their class DT.
Our System         | RlogF-PUN
low dose of X*     | mg of X
mg of X            | treat with X
X 10 mg            | take DT and X
she prescribe X    | be take X
X 500 mg           | she prescribe X
be take DT and X*  | put on X
ent put I on X*    | stop take X
DT ( like X:NN     | i be prescribe X
like DT and X      | have be take X
then prescribe X*  | tell I to take X

Table 6.5: Top 10 (simplified) patterns learned by our system and RlogF-PUN from the ENT forum. An asterisk denotes that the pattern was never learned by the other system. X is the target entity slot with a noun POS restriction.
6.5 Discussion and Conclusion
Our system extracted entities with higher precision and recall than other existing systems.
Since most entities extracted by patterns, especially in the crucial initial iterations, are un-
labeled, existing pattern scoring functions either unfairly penalize good patterns and/or do
not penalize bad patterns enough. Our system successfully leveraged the unlabeled data to
score patterns better – it evaluated unlabeled entities extracted by patterns in an unsuper-
vised way. However, learning entities from an informal text corpus that is partially labeled
from seed entities presents some challenges. Our system made mistakes for three main reasons. First, it sometimes extracted typos of negative entities that were not easily predictable by the edit distance measures, such as 'knowwhere'. Second, patterns that extracted many good but some bad unlabeled entities got high scores because of the good unlabeled entities; the bad unlabeled entities extracted by these highly weighted patterns were then scored high by the PTF feature during the entity scoring phase, leading to their extraction. Better features to predict negative entities and robust text normalization would help mitigate both problems. Third, we used automatically constructed seed dictionaries that were not dataset-specific, which led to incorrect labeling of some entities (for example, 'metro' as negative in Table 6.4). Reducing noise in the
dictionaries would increase precision and recall.
In our proposed system, the features are weighted equally by taking the average of the
feature scores. In pilot experiments, learning a logistic regression classifier on heuristically
labeled data did not work well for either pattern scoring or entity scoring. In the next chapter, I use logistic regression to learn an entity classifier and improve how examples are sampled to create its training set, which leads to better results with a classifier. In retrospect, this approach could also be applied successfully to the system in this chapter.
One limitation of our system and evaluation is that I learned single word entities, since
calculating some features for multi-word phrases is not straightforward. For example, word
clusters using distributional similarity were constructed for single words. Our future work
includes expanding the features to evaluate multi-word phrases. Another avenue for fu-
ture work is to use our pattern scoring method for learning other kinds of patterns, such
as dependency patterns, and in different kinds of systems, such as hybrid entity learning
systems (Etzioni et al., 2005; Carlson et al., 2010a).
In conclusion, I show that predicting the labels of unlabeled entities in the pattern scorer
of a bootstrapped entity extraction system significantly improves precision and recall of
learned entities. Our experiments demonstrate the importance of having models that con-
trast domain-specific and general domain text, and the usefulness of features that allow
spelling variations when dealing with informal texts. Our pattern scorer outperforms ex-
isting pattern scoring methods for learning drug-and-treatment entities from four medical
web forums.
Chapter 7
Distributed Word Representations to Guide Entity Classifiers
In the last chapter, I improved the pattern scoring function of a bootstrapped pattern-based learning system using unlabeled data. In this chapter, I leverage the unlabeled data to improve the entity scoring function: I model it with a logistic regression classifier and use the unlabeled data to enhance its training set. The work has been published in Gupta and
Manning (2015).
7.1 Introduction
The limited supervision provided in bootstrapped systems, though an attractive quality, is also one of their main challenges. When seed sets are small, noisy, or do not cover the label
space, the bootstrapped classifiers do not generalize well. I use a major guiding inspiration
of deep learning and earlier approaches such as LSA (Landauer et al., 1998): we can learn
a lot about syntactic and semantic similarities between words in an unsupervised fashion
and capture this information in word vectors. This distributed representation can inform an
inductive bias to generalize in a bootstrapping system.
In the previous chapter, I used averaging of feature values to predict an entity’s class in
a bootstrapped system. In this chapter, I use a logistic regression classifier to predict scores
for candidate entities. My main contribution is a simple approach of using the distributed
Figure 7.1: An example of expanding a bootstrapped entity classifier's training set using word vector similarity. The entities in blue represent known positive entities and the entities in red represent known negative entities. The entities in black are unlabeled but can be incorporated in the corresponding positive and negative sets because of their proximity to the known entities in the word vector space.
vector representations of words to expand training data for entity classifiers. To improve
the step of learning an entity classifier, I first learn a vector representation of entities using
the continuous bag of words model (Mikolov et al., 2013b). I then use kNN to expand the
training set of the classifier by adding unlabeled entities close to seed entities in the training
set. Figure 7.1 shows an example of expansion of a training set for a drugs-and-treatment
entity classifier tailored for online health forums. The unlabeled entities shown in the
figure are usually not found in seed sets that are automatically constructed using medical
ontologies. However, these entities can be incorporated into the training set because they
occur in similar contexts in the dataset. Expanding a training set not only makes it larger
but also less susceptible to false negatives, since the process of sampling the unlabeled
entities as negative is guided by the frequency and context of entities.
The key insight is to use the word vector similarity indirectly by enhancing training
data for the entity classifier. I do not directly label the unlabeled entities using the similar-
ity between word vectors, which I show extracts many noisy entities. I show that classifiers
trained with expanded sets of entities perform better on extracting drug-and-treatment en-
tities from four online health forums from MedHelp.
7.2 Related Work
In a pattern-based system, if the patterns are not very specific, they can extract noisy terms.
On the other hand, overly specific patterns can result in low recall. Many systems, such
as RAPIER (Califf and Mooney, 1999) and Kozareva and Hovy (2013), learn patterns
and extract all fillers that match the patterns. In supervised systems (e.g. RAPIER), the
patterns are scored using fully supervised data, and hence the patterns are presumably more
accurate. Learning all matched entities is a bigger problem in bootstrapped systems since
there is little labeled data to judge patterns. Kozareva and Hovy (2013) extended ontologies
using bootstrapping; they learned very specific ‘doubly-anchored’ patterns.
To mitigate the problem of extracting noisy entities, some BPL systems have an en-
tity evaluation step and they learn only the top ranked entities. There are several ways to
rank the candidate entities. Systems, such as Thelen and Riloff (2002), Lin et al. (2003),
and Agichtein and Gravano (2000), score entities using the number and scores of patterns
that extracted them. In Chapter 5, I used a similar function to rank the entities. Snow-
ball (Agichtein and Gravano, 2000) and DIPRE (Brin, 1999) also took into account how
well a pattern matched a sentence to extract an entity. All of the above systems use only
the patterns to score entities extracted by them. Surprisingly, only a few systems also use
entity-based features to score the entities. StatSnowball used Markov logic networks (MLNs) to extract entities, with token-level features and joint entity-level features. In Chapter 6, I used five features to evaluate an entity, four of which were entity-based features.
Some open IE systems like KnowItAll use the web to assess the quality of extractions.
KnowItAll's assessor queried search engines to compute PMI scores comparing occurrences of the entity by itself versus as a slot filler of the extractors. The PMI scores were used as features in a naive Bayes classifier. Downey et al. (2010) proposed a probabilistic urn model and compared
against noisy-or and PMI scoring models.
Most of the BPL systems do not use a machine learning-based classifier for the entity
scoring step. In this chapter, I model the entity scoring function using a logistic regression
classifier. To the best of my knowledge, this work is the first to improve a bootstrapped
system’s entity evaluation by expanding the classifier’s training set. I use distributed rep-
resentations of words to compute unlabeled entities that are similar to known entities. Dis-
tributed representations of words have been shown to be successful at improving general-
ization. Passos et al. (2014) proposed word embeddings that leverage lexicons and used the
embeddings to improve a CRF-based named entity recognition system.
7.3 Approach
In this section, I propose an entity classifier and its enhancement by expanding its training
set using an unsupervised word similarity measure.
I build a one-vs-all entity classifier using logistic regression. In each iteration, for
label l, the entity classifier is trained by treating l’s dictionary entities (seed and learned
in previous iterations) as positive and entities belonging to all other labels as negative. To
improve generalization, I also sample the unlabeled entities that are not function words as
negative. To train with a balanced dataset, I randomly sub-sample the negatives such that
the number of negative instances is equal to the number of positive instances.
The features for the entities are similar to the ones described in Chapter 6 from Gupta
and Manning (2014a): edit distances from positive and negative entities, relative frequency
of the entity words in the seed dictionaries, word classes computed using the Brown clus-
tering algorithm, and pattern TF-IDF score. Note that in Chapter 6, I averaged the feature
values to predict an entity’s score; one of the features was the score of the word class clus-
ter belonging to a label. First, the words were clustered using the Brown clustering method
(Brown et al., 1992). Then, each cluster was considered as an instance in a logistic regres-
sion classifier, which was trained to give a probability of whether a cluster belongs to the
given label. I then used this cluster score as a feature in the average function. Here, I simply
include the word cluster id directly as a feature in the logistic regression classifier, which
is trained to give a score of whether an entity belongs to the given label. The last feature,
pattern TF-IDF score, gives higher scores to entities that are extracted by many learned
patterns and have low frequency in the dataset. In the experiments, I call this classifier
NotExpanded.
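A rough sketch of how the balanced one-vs-all training set for this classifier might be assembled is shown below; the feature extraction itself is elided, and the data structures and helper names are illustrative assumptions rather than the actual implementation.

```python
import random

def build_training_set(label, dictionaries, unlabeled, function_words, seed=0):
    """One-vs-all training data for `label`: its dictionary entities (seed + learned)
    are positive; entities of other labels plus sampled non-function-word unlabeled
    entities are negative, sub-sampled so the two classes are balanced."""
    rng = random.Random(seed)
    positives = list(dictionaries[label])
    negatives = [e for lbl, ents in dictionaries.items() if lbl != label for e in ents]
    negatives += [e for e in unlabeled if e not in function_words]
    rng.shuffle(negatives)
    negatives = negatives[: len(positives)]  # balance the dataset
    return [(e, 1) for e in positives] + [(e, 0) for e in negatives]

dicts = {"DT": ["advair", "prednisone"], "SC": ["asthma", "wheezing"]}
print(build_training_set("DT", dicts, ["inhaler", "the", "time"], {"the"}))
```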
The lack of labeled data to train a good entity classifier is one of the challenges in
bootstrapped learning. I use distributed representations of words, in the form of word
vectors, to guide the entity classifier by expanding its training set. I expand the positive
training set by labeling the unlabeled entities that are similar to the seed entities of the label
as positive examples, and labeling the unlabeled entities that are similar to seed entities of
other labels as negative examples. I take the cautious approach of finding similar entities
only to the seed entities and not the learned entities. The algorithm can be modified to
find similar entities to learned entities as well. Cautious approaches have been shown to be
better for bootstrapped learning (Abney, 2004; Surdeanu et al., 2006).
To compute the similarity of an unlabeled entity to the positive entities, I find the k most similar positive entities, measured by cosine similarity between the word vectors, and average the
scores. Similarly, I compute similarity of the unlabeled entity to the negative entities. If the
entity’s positive similarity score is above a given threshold θ and is higher than its negative
similarity score, it is added to the training set with positive label. I expand the negative
entities similarly. I tried expanding just the positive entities and just the negative entities.
Their relative performance, though higher than the baselines, varied between the datasets.
Expanding both positives and negatives gave more stable results across the datasets. Thus,
I present results only for expanding both positives and negatives.
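The sketch below illustrates the expansion step with k = 2 and θ = 0.4 as described above; the word-vector lookup, the entity lists, and the helper names are placeholders rather than the actual implementation.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def avg_topk_sim(vec, seed_vecs, k=2):
    """Average cosine similarity to the k most similar seed vectors."""
    if not seed_vecs:
        return 0.0
    sims = sorted((cosine(vec, s) for s in seed_vecs), reverse=True)
    return float(np.mean(sims[:k]))

def expand_training_set(unlabeled, pos_seeds, neg_seeds, vectors, k=2, theta=0.4):
    """Add unlabeled entities whose similarity to one side exceeds theta and beats
    their similarity to the other side (cf. Figure 7.1)."""
    new_pos, new_neg = [], []
    pos_vecs = [vectors[p] for p in pos_seeds if p in vectors]
    neg_vecs = [vectors[n] for n in neg_seeds if n in vectors]
    for e in unlabeled:
        if e not in vectors:
            continue
        pos_sim = avg_topk_sim(vectors[e], pos_vecs, k)
        neg_sim = avg_topk_sim(vectors[e], neg_vecs, k)
        if pos_sim > theta and pos_sim > neg_sim:
            new_pos.append(e)
        elif neg_sim > theta and neg_sim > pos_sim:
            new_neg.append(e)
    return new_pos, new_neg

# Usage with made-up vectors; in practice they come from the word vectors
# whose training is described in the following paragraphs.
vecs = {e: np.random.rand(200) for e in ["advair", "xolair", "asthma", "humidifier"]}
print(expand_training_set(["humidifier"], ["advair", "xolair"], ["asthma"], vecs))
```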
An alternative to our approach is to directly label the entities using the vector simi-
larities. Our experimental results suggest that even though exploiting similarities between
word vectors is useful for guiding the classifier by expanding the training set, it is not ro-
bust enough to use for labeling entities directly. For example, for our development dataset,
when the similarity threshold θ was set as 0.4, 16 out of 41 unlabeled entities that were
expanded into the training set as positive entities were false positives. Increasing θ ex-
tracted far fewer entities. Setting θ to 0.5 extracted only 5 entities, all true positives, and
to 0.6 extracted none. Thus, labeling entities solely based on similarity scores resulted in
lower performance. A classifier, on the other hand, can use other sources of information as
features to predict an entity’s label.
I compute the distributed vector representations using the continuous bag-of-words
model (Mikolov et al., 2013b; Mikolov et al., 2013a) implemented in the word2vec toolkit.1
Table 7.1: Area under Precision-Recall curve for all the systems. Expanded is our system when word vectors are learned using the Wiki+Twit+MedHelp data, and Expanded-M is when word vectors are learned using the MedHelp data. Average is the average of feature values, similar to Gupta and Manning (2014a).
The publicly available word vectors are not tailored towards the online health forums do-
main and thus I train new vector representations. I train 200-dimensional vector representa-
tions on a combined dataset of a 2014 Wikipedia dump (1.6 billion tokens), a sample of 50
million tweets from Twitter (200 million tokens), and an in-domain dataset of all MedHelp
forums (400 million tokens). The three types of datasets have words and context of differ-
ent kinds: the Wikipedia data mainly consists of domain-independent words; the Twitter
data has many slang and colloquial words, also common on online forums; and the Med-
Help data has the in-domain content. I tried learning 500-dimensional and 50-dimensional
vectors; the 200-dimensional vectors worked best on the developmental data. I removed
words that occurred less than 20 times, resulting in a vocabulary of 89k words. I call this
dataset Wiki+Twit+MedHelp. I used the parameters suggested in Pennington et al. (2014):
negative sampling with 10 samples and a window size of 10. I ran the model for 3 itera-
tions, which were enough to get good results; more iterations would presumably result in
better vectors.
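A sketch of how such vectors could be trained with the gensim implementation of the continuous bag-of-words model is given below. The corpus iterator is a placeholder, and the hyperparameters mirror the ones reported above (200 dimensions, window 10, 10 negative samples, minimum count 20, 3 epochs); parameter names follow recent gensim versions and may differ in older releases.

```python
from gensim.models import Word2Vec

# `sentences` should iterate over tokenized sentences from the combined
# Wikipedia + Twitter + MedHelp corpus; a tiny stand-in list is used here.
sentences = [["i", "be", "put", "on", "advair"],
             ["my", "doctor", "prescribe", "prednisone"]]

model = Word2Vec(
    sentences,
    vector_size=200,  # 200-dimensional vectors
    window=10,        # context window of 10
    negative=10,      # negative sampling with 10 samples
    min_count=1,      # thesis setting: 20; lowered to 1 for this toy corpus
    sg=0,             # continuous bag-of-words model
    epochs=3,         # 3 passes over the corpus
)
# Nearest neighbors by cosine similarity, as used in the kNN expansion step:
# model.wv.most_similar("advair", topn=2)
```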
7.4 Experimental Setup
I present results on the same experimental setup, dataset, and seed lists as discussed in
Chapter 6 from Gupta and Manning (2014a). The task is to extract drug-and-treatment
(DT) entities in sentences from four forums on the MedHelp user health discussion website:
1. Asthma, 2. Acne, 3. Adult Type II Diabetes (called Diabetes), and 4. Ear Nose &
Throat (called ENT). A DT entity is defined as a pharmaceutical drug, or any treatment
Figure 7.2: Precision vs. Recall curves of our system and the baselines for the Asthma forum.
Figure 7.3: Precision vs. Recall curves of our system and the baselines for the Acne forum.
Figure 7.4: Precision vs. Recall curves of our system and the baselines for the Diabetes forum.
Figure 7.5: Precision vs. Recall curves of our system and the baselines for the ENT forum.
or intervention mentioned that may help a symptom or a condition. I judged the output
of all systems, following the guidelines in the previous chapter. I used Asthma as the
development forum for parameter and threshold tuning. I set the threshold θ to 0.4 and k (the number of nearest neighbors) to 2 when expanding the seed sets.
I evaluate systems by their precision and recall (see Chapter 2 for details). Similar to the previous chapter, I present the precision and recall curves for precision above 75% to compare systems when they extract entities with reasonably high precision. Recall is defined as the fraction of correct entities among the total unique correct entities pooled from all systems. Note that precision at lower thresholds and true recall are very hard to compute.
Our dataset is unlabeled and manually labeling all entities is expensive. Pooling is a com-
mon evaluation strategy in such situations (such as, in information retrieval (Buckley et al.,
2007) and the TAC-KBP shared task). I calculate the area under the precision-recall curves
(AUC-PR) to compare the systems.
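The sketch below makes the pooled evaluation concrete: it computes precision and recall after each rank of a system's extracted entities against the pooled set of correct entities, and integrates the curve above the 75% precision cut-off with the trapezoidal rule. The exact truncation and integration details are assumptions and may differ from the implementation used for the reported numbers.

```python
def pooled_pr_curve(ranked_entities, pooled_correct):
    """Precision/recall after each rank; the recall denominator is the pooled set
    of unique correct entities extracted by any system."""
    curve, n_correct = [], 0
    for i, e in enumerate(ranked_entities, start=1):
        if e in pooled_correct:
            n_correct += 1
        curve.append((n_correct / i, n_correct / len(pooled_correct)))
    return curve  # list of (precision, recall) points

def auc_pr(curve, min_precision=0.75):
    """Trapezoidal area under the PR curve, restricted to the reported region
    where precision is above min_precision."""
    pts = sorted((r, p) for p, r in curve if p >= min_precision)
    area = 0.0
    for (r0, p0), (r1, p1) in zip(pts, pts[1:]):
        area += (r1 - r0) * (p0 + p1) / 2.0
    return area

ranked = ["advair", "inhaler", "time", "xolair"]        # one system's ranked output
pooled = {"advair", "inhaler", "xolair", "nebulizer"}   # correct entities pooled from all systems
curve = pooled_pr_curve(ranked, pooled)
print(curve, auc_pr(curve))
```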
I call our system Expanded in the experiments. To compare the effects of word vectors
learned using different types of datasets, I also study our system when the word vectors are
learned using just the in-domain MedHelp data, called Expanded-M. I compare against two
baselines: NotExpanded, as explained in the previous section, and Average, in which I average
the feature values, similar to Gupta and Manning (2014a).
7.5 Results and Discussion
Table 7.1 shows AUC-PR of various systems and Figures 7.2–7.5 show the precision-recall
curves. Our systems Expanded and Expanded-M, which used similar entities for training,
improved the scores for all four forums. I believe the improvement for the Diabetes forum was much higher than for the other forums because the baseline's performance on the forum degraded quickly in later iterations (see Figure 7.4), and improving the classifier helped
in adding more correct entities. Additionally, Diabetes DT entities are more lifestyle-
based and hence occur frequently in web text, making the word vectors trained using the
Wiki+Twit+MedHelp dataset better suited.
In three out of four forums, word vectors trained using a large corpus perform better
than those trained using the smaller in-domain corpus. For the Acne forum, where brand
Table 7.2: Examples of unlabeled entities that were expanded into the training sets. Gray-colored entities were judged by the authors as falsely labeled.
name DT entities are more frequent, the entities expanded by MedHelp vectors had fewer
false positives than those expanded by Wiki+Twit+MedHelp.
Table 7.2 shows some examples of unlabeled entities that were included as positive/neg-
ative entities in the entity classifiers. Even though some entities were included in the train-
ing data with wrong labels, overall the classifiers benefited from the expansion.
7.6 Conclusion
I improve entity classifiers in bootstrapped entity extraction systems by enhancing the train-
ing set using unsupervised distributed representations of words. The classifiers learned us-
ing the expanded seed sets extract entities with better F1 score. This supports our hypoth-
esis that generalizing labels to entities that are similar according to unsupervised methods
of word vector learning is effective in improving entity classifiers, notwithstanding that the
label generalization is quite noisy. Using the word embedding based similarity measure to
directly label the data resulted in low scores. However, training a classifier with expanded
training sets improved the scores, underscoring its robustness to noise.
In the last three chapters, I worked on applying bootstrapped pattern-based learning to extract entities from patient-authored text (PAT), improving the scoring of both patterns and entities by exploiting unlabeled data. In the next chapter, I turn briefly to another aspect important to real-life use of pattern-based systems – their interpretability and explainability.
Chapter 8
Visualizing and Diagnosing BPL
In the previous chapters, I discussed bootstrapped pattern-based learning, along with its
improvements, as an effective practical tool for entity extraction. In this chapter, I dis-
cuss why patterns are popular in industry and present a visualization tool for developing
a pattern-based system more effectively and efficiently. The work has been published in
Gupta and Manning (2014b).
8.1 Introduction
Entity extraction using patterns dominates commercial industry, mainly because patterns
are effective, interpretable by humans, and easy to customize to cope with errors (Chiti-
cariu et al., 2013). Patterns or rules, which can be hand crafted or learned by a system,
are commonly created by looking at the context around already known entities, such as
lexico-syntactic surface word patterns and dependency patterns. Building a pattern-based
learning system is usually a repetitive process, usually performed by the system developer,
of manually examining a system’s output to identify improvements or errors introduced by
changing the entity or pattern extractor. Interpretability of patterns makes it easier for hu-
mans to identify sources of errors by inspecting patterns that extracted incorrect instances
or instances that resulted in learning of bad patterns. Parameters range from window size
of the context in surface word patterns to thresholds for learning a candidate entity. At
present, there is a lack of tools helping a system developer to understand results and to
improve results iteratively.
Visualizing diagnostic information of a system and contrasting it with another system
can make the iterative process easier and more efficient. For example, consider a user trying to decide on the context window size for surface word patterns. The user suspects that a part-of-speech (POS) restriction on context words might be required with a reduced window size to avoid extracting erroneous mentions. A shorter context size usually extracts entities with higher recall but lower precision. By comparing and contrasting the extractions of two systems with different parameters, the user can investigate the cases in which the POS restriction is required with a smaller window size, and whether the restriction causes the system to miss some correct entities. In contrast, comparing just the accuracy of two systems
does not allow inspecting finer details of extractions that increase or decrease accuracy and
to make changes accordingly.
In this chapter, I present a pattern-based entity learning and diagnostics tool, SPIED. It
consists of two components: 1. pattern-based entity learning using bootstrapping (SPIED-
Learn), and 2. visualizing the output of one or two entity learning systems (SPIED-Viz).
SPIED-Viz is independent of SPIED-Learn and can be used with any pattern-based entity
learner. For demonstration, I use the output of SPIED-Learn as an input to SPIED-Viz.
SPIED-Viz has pattern-centric and entity-centric views, which visualize learned patterns
and entities, respectively, and the explanations for learning them. SPIED-Viz can also con-
trast two systems by comparing the ranks of learned entities and patterns. As a concrete ex-
ample, I learn and visualize drug-treatment (DT) entities from unlabeled patient-generated
medical text, starting with seed dictionaries of entities for multiple classes. This is the same
task proposed and developed in Chapter 5 and 6 from Gupta et al. (2014b) and Gupta and
Manning (2014a).
My contributions are: 1. I present a novel diagnostic tool for visualization of output of
multiple pattern-based entity learning systems, and 2. I release the code of an end-to-end
pattern learning system, which learns entities using patterns in a bootstrapped system and
visualizes its diagnostic output. The pattern learning and the visualization code are avail-
able at http://nlp.stanford.edu/software/patternslearning.shtml.
SPIED-Learn is based on the system described in Chapter 6 published in Gupta and Man-
ning (2014a). The system builds upon the previous bootstrapped pattern-learning work and
proposes an improved measure to score patterns. It learns entities for given classes from
unlabeled text by bootstrapping from seed dictionaries. Patterns are learned using labeled
entities, and entities are learned based on the extractions of learned patterns. The process
is iteratively performed until no more patterns or entities can be learned.
SPIED-Learn provides an option to use any of the pattern scoring measures described
in (Riloff, 1996; Thelen and Riloff, 2002; Yangarber et al., 2002; Lin et al., 2003; Gupta
et al., 2014b). A pattern is scored based on the positive, negative, and unlabeled entities
it extracts. The positive and negative labels of entities are heuristically determined by the
system using the dictionaries and the iterative entity learning process. The oracle labels
of learned entities are not available to the learning system. Note that an entity that the
system considered positive might actually be incorrect, since the seed dictionaries can be
noisy and the system can learn incorrect entities in the previous iterations, and vice-versa.
SPIED-Learn's entity scorer can be either of the systems described in Chapters 6 and 7.
Each candidate entity is scored using weights of the patterns that extract it and other
entity scoring measures, such as TF-IDF. Thus, learning of each entity can be explained by
the learned patterns that extract it, and learning of each pattern can be explained by all the
entities it extracts.
8.3 Design Criteria
The following design criteria are considered when designing the interface.
• Quick summary: The interface should provide a quick summary of the learned en-
tities and patterns, including the percentage of correct and incorrect entities, if gold
labels are provided.
• Provenance: In a pattern-based system, the provenance of an extracted entity or pattern is much easier to trace than in a feature-based system. The visualization needs to be able to drill down from a dictionary entry (usually a learned entity) to the learned patterns that extracted it. Similarly, it should have the ability to go from a pattern to the lists of entities it extracted (divided by gold labels, if provided) and its perceived goodness.
• Individual goodness and its quick identification: By using heuristic criteria, the system should identify good and bad learned patterns and entities; this helps in quick identification and diagnosis of errors. In SPIED-Viz, an exclamation mark is shown for a pattern if more than half of the entities extracted by it are incorrect. In addition, various signs, such as a trophy (correct entity extracted by only one system) and a star (unlabeled entity extracted by only one system), are used to identify different types of entities.
• Comparison: The interface should be able to compare multiple systems, both at a
higher level and a fine-grained entity/pattern level.
• Pattern-centric and entity-centric views: These views can provide detailed informa-
tion, either from a pattern point of view or from an entity point of view.
• Easy and fast: The tool should not require any cumbersome installation and should
be fast to use. Web browser-based tools are easy to use since they do not require
installation of new software.
8.4 Visualizing Diagnostic Information
SPIED-Viz visualizes learned entities and patterns from one or two entity learning systems,
and the diagnostic information associated with them. It optionally uses the oracle labels
of learned entities to color-code them and to contrast the ranks of correct/incorrect entities when comparing two systems. The oracle labels are usually determined by manually
judging each learned entity as correct or incorrect. SPIED-Viz has two views: 1. a pattern-
centric view that visualizes patterns of one or two systems, and 2. an entity-centric view that
mainly focuses on the entities learned. Figure 8.1 shows a screenshot of the entity-centric
view of SPIED-Viz. It displays the following information:
Summary: Summary information of each system at each iteration and overall. It shows
for each system the number of iterations, the number of patterns learned, and the
number of correct and incorrect entities learned.
Learned Entities with provenance: It shows a ranked list of entities learned by each system,
along with an explanation of why the entity was learned. The details shown include
the entity’s oracle label, its rank in the other system, and the learned patterns that
extracted the entity. Such information can help the user to identify and inspect the
patterns responsible for learning an incorrect entity. The interface also provides a
link to search for the entity, along with any user-provided keywords (such as the domain of
the problem) on Google.
System Comparison: SPIED-Viz can be used to compare entities learned by two systems.
It marks entities that are learned by one system but not by the other system, by either
displaying a trophy sign (if the entity is correct), a thumbs down sign (if the entity is
incorrect), or a star sign (if the oracle label is not provided).
The second view of SPIED-Viz is pattern-centric. Figure 8.2 shows a screenshot of the
pattern-centric view. It displays the following information.
Summary: Summary information of each system including the number of iterations and
number of patterns learned at each iteration and overall.
Learned Patterns with provenance: It shows a ranked list of patterns along with the entities each pattern extracts and their labels. Note that each pattern is associated with a set of positive, negative, and unlabeled entities, which were used to determine its score.1 It also shows the percentage of unlabeled entities extracted by a pattern that were eventually learned by the system and assessed as correct by the oracle. A smaller percentage means that the pattern extracted many entities that were either never learned, or learned but labeled as incorrect by the oracle.
1 Note that the positive, negative, and unlabeled labels are different from the oracle labels, correct and incorrect, for the learned entities. The former refer to the entity labels considered by the system when learning the pattern, and they come from the seed dictionaries and the learned entities. A positive entity considered by the system can be labeled as incorrect by the human assessor, in case the system made a mistake in labeling data, and vice-versa.
Figure 8.3 shows an option in the entity-centric view: hovering over an entity opens a window on the side that shows the diagnostic information for the same entity learned by the other system. This direct comparison makes it easy to contrast how both systems learned an entity. For example, it can help the user inspect why one system learned an entity at an earlier rank than the other.
An advantage of making the entity learning component and the visualization compo-
nent independent is that a developer can use any pattern scorer or entity scorer in the system
without depending on the visualization component to provide that functionality.
I develop a list-based visualization since it is easy to navigate and it can compare learn-
ing of individual entities/patterns. Additionally, since most pattern-based systems are it-
erative, ranking the entities/patterns in the visualization by the number of iterations helps
in diagnosing errors better. Other variations, such as clustering of entities based on pat-
terns, can give higher-level insights into the learning process, however, it is more difficult
to diagnose sources of errors.
8.5 System Details
SPIED-Learn uses TokensRegex (Chang and Manning, 2014) to create and apply surface
word patterns to text. SPIED-Viz takes details of learned entities and patterns as input in a
JSON format. It uses JavaScript, AngularJS, and jQuery to visualize the information in a web
browser.
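The released code defines the exact JSON schema; purely to illustrate the kind of per-entity diagnostic record SPIED-Viz consumes (color-coded oracle labels, ranks, and the patterns that extracted the entity), a hypothetical example is shown below. The field names are invented for illustration and are not the actual format.

```python
import json

# Hypothetical record; field names are illustrative only, not the actual schema.
entity_record = {
    "entity": "asmanex",
    "label": "DT",
    "oracleLabel": "correct",        # optional human judgment used for color coding
    "iterationLearned": 3,
    "rank": 42,
    "rankInOtherSystem": 57,         # used when comparing two systems
    "extractedBy": [
        {"pattern": "i be put on X", "score": 8.2},
        {"pattern": "be take DT and X", "score": 5.1},
    ],
}
print(json.dumps(entity_record, indent=2))
```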
8.6 Related Work
Most interactive IE systems focus on annotation of text, labeling of entities, and manual
writing of rules. Some annotation and labeling tools are: MITRE’s Callisto2, Knowta-
tor3, SAPIENT (Liakata et al., 2009), brat4, Melita (Ciravegna et al., 2002), and XConc
Figure 8.3: When the user clicks on the compare icon for an entity, the explanations of theentity extraction for both systems (if available) are displayed. This allows direct compari-son of why the two systems learned the entity.
Suite (Kim et al., 2008). Akbik et al. (2013) interactively helps non-expert users to manu-
ally write patterns over dependency trees. GATE5 provides the JAPE language that recog-
nizes regular expressions over annotations. Other systems focus on reducing manual effort
for developing extractors (Brauer et al., 2011; Li et al., 2011). ICE (He and Grishman,
2015) is an interface for building entity, relation, and event extractors using dependency
patterns. Valenzuela-Escarcega et al. (2015) built an interactive web-based event extraction
tool for event grammar development via rules. In contrast, our tool focuses on visualizing
and comparing diagnostic information associated with pattern learning systems.
WizIE (Li et al., 2012b) is an integrated environment for annotating text and writing
pattern extractors for information extraction. It also generates regular expressions around
labeled mentions and suggests patterns to users. It is most similar to our tool as it displays
an explanation of the results extracted by a pattern. However, it is focused towards hand
writing and selection of rules. In addition, it cannot be used to directly compare two pattern
learning systems.
What’s Wrong With My NLP?6 is a tool for jointly visualizing various natural language
processing formats such as trees, graphs, and entities. It shares our system's focus on diagnosing errors so they can be fixed, but differs in providing no tools to drill down and find the source of errors. Since I focus on a particular task and a learning mechanism, I am able to develop a specialized tool that can provide more functionality,

5 http://gate.ac.uk
6 https://code.google.com/p/whatswrong