A Dual-Attention Network for Joint Named Entity Recognition and Sentence Classification of Adverse Drug Events

Susmitha Wunnava, Worcester Polytechnic Institute, 100 Institute Rd, Worcester, MA, [email protected]
Xiao Qin, IBM Research - Almaden, 650 Harry Road, San Jose, CA 95120, [email protected]
Tabassum Kakar, Worcester Polytechnic Institute, 100 Institute Rd, Worcester, MA, [email protected]
Xiangnan Kong, Worcester Polytechnic Institute, 100 Institute Rd, Worcester, MA, [email protected]
Elke A. Rundensteiner, Worcester Polytechnic Institute, 100 Institute Rd, Worcester, MA, [email protected]
Abstract
An adverse drug event (ADE) is an injury resulting from medical intervention related to a drug. ADE detection from text can be either fine-grained (ADE entity recognition) or coarse-grained (ADE assertive sentence classification), with limited efforts leveraging the interdependencies between these two granularities. We instead design a multi-grained joint deep network model, MGADE, to concurrently solve both ADE tasks. MGADE takes advantage of their symbiotic relationship, with a transfer of knowledge between the two levels of granularity. Our dual-attention mechanism constructs multiple distinct representations of a sentence that capture both task-specific and semantic information in the sentence, providing stronger emphasis on the key elements essential for sentence classification. Our model improves the state-of-the-art F1 scores on both tasks: (i) entity recognition of ADE words (12.5% increase) and (ii) ADE sentence classification (13.6% increase) on the MADE 1.0 benchmark of EHR notes.
1 Introduction
Background. Adverse drug events (ADEs), injuries resulting from medical intervention, are a leading cause of death in the United States and cost around $30 to $130 billion every year (Donaldson et al., 2000). Early detection of ADE incidents aids in the timely assessment, mitigation, and prevention of future occurrences of ADEs. Natural Language Processing techniques have been recognized as instrumental in identifying ADEs and related information from unstructured text fields of spontaneous reports and electronic health records (EHRs), and thus in improving drug safety monitoring and pharmacovigilance (Harpaz et al., 2014).
Fine-grained ADE detection identifies named ADE entities at the word level, while coarse-grained ADE detection (also called ADE assertive text classification) identifies complete sentences describing drug-related adverse effects. The system of Gurulingappa et al. (2011) for identifying ADE assertive sentences in medical case reports targets the important application of detecting under-reported and under-documented adverse drug effects. Lastly, multi-grained ADE detection identifies ADE information at multiple levels of granularity, namely at both the entity and the sentence level.
As an example, Figure 1 displays ADE and non-ADE sentences. The first is an ADE sentence, where the mentions of the Drugname and ADE entities have the appropriate relationship with each other. The second and third sentences show that the mention of an ADE entity by itself is not sufficient to assert a drug-related adverse side effect.
Recently, deep learning-based sequence approaches have shown some promise in extracting fine-grained ADEs and related named entities from text (Liu et al., 2019). However, the prevalence of entity-type ambiguity remains a major hurdle, such as distinguishing between Indication entities as the reason for taking a drug versus ADE entities as unintended outcomes of taking a drug. Coarse-grained sentence-level detection performs well in identifying ADE descriptive sentences, but is not equipped to detect fine-grained information such as words associated with ADE related named entities. Unfortunately, when the interaction between these two extraction tasks is ignored, we miss the opportunity to transfer knowledge between the ADE entity and sentence prediction tasks.
Figure 1: Each sentence is classified as an ADE sentence (binary yes/no). Each word is labeled as beginning of an entity (B-...) vs. inside an entity (I-...) for ADE related named entities (multiple classes); O denotes no entity tag.

Attention-based neural network models have been shown to be effective for text classification
tasks (Luong et al., 2015; Bahdanau et al., 2014), from alignment attention in translation (Liu et al., 2016) to supervising attention in binary text classification (Rei and Søgaard, 2019). Previous approaches typically apply only a single round of attention focusing on simple semantic information. In our ADE detection task, instead, key elements of the sentence can be linked to multiple categories of task-specific semantic information of the named entities (ADE, Drug, Indication, Severity, Dose, etc.). Thus, single attention is insufficient for exploring this multi-aspect information and consequently risks losing important cues.

Proposed Approach. In our work, we tackle the above shortcomings by designing a dual-attention based neural network model for multi-grained joint learning, called MGADE, that jointly identifies both ADE entities and ADE assertive sentences. The design of MGADE is inspired by multi-task Recurrent Neural Network architectures for jointly learning to label tokens and sentences in a binary classification setting (Rei and Søgaard, 2019). In addition, our model makes use of a supervised self-attention mechanism based on entity-level predictions to guide the attention function, aiding it in tackling the above entity-type ambiguity problem. We also introduce novel strategies for constructing multiple complementary sentence-level representations to enhance the performance of sentence classification.
Our key contributions include:

1. Joint Model. We jointly model ADE entity recognition as a multi-class sequence tagging problem and ADE assertive text classification as a binary classification problem. Our model leverages the mutually beneficial relationship between these two tasks; e.g., ADE sentence classification can influence ADE entity recognition by identifying clues that contribute to the ADE assertiveness of the sentence and matching them to ADE entities.

2. Dual-Attention. Our novel method for generating and pooling multiple attention mechanisms produces informative sentence-level representations. Our dual-attention mechanisms, based on word-level entity predictions, construct multiple representations of the same sentence. The dual-attention weighted sentence-level representations capture both task-specific and semantic information in a sentence, providing stronger emphasis on the key elements essential for sentence classification.

3. Label-Awareness. We introduce an augmented sentence-level representation comprised of predicted entity labels, which adds label context to the proposed dual-attention sentence-level representation to better capture the word-level label distribution and word dependencies within the sentence. This further boosts the performance of the sentence classification task.

4. Model Evaluation. We compare our joint model with state-of-the-art methods for the ADE entity recognition and ADE sentence classification tasks. Experiments on the MADE 1.0 benchmark of EHR notes demonstrate that our MGADE model drives up the F1 score for both tasks significantly: (i) entity recognition of ADE words by 12.5% and 23.5%, and (ii) ADE sentence classification by 13.6% and 23.0%, compared to state-of-the-art single-task and joint-task models, respectively.
2 Related Work
Fine-grained ADE Detection. Jagannatha and Yu (2016b) employed a bidirectional LSTM-CRF model to label named entities in electronic health records of cancer patients. Pandey et al. (2017) proposed a bidirectional recurrent neural network with attention to extract ADRs and classify the relationships between entities in Medline abstracts and EHR datasets. Wunnava et al. (2019) presented a three-layer deep learning architecture for identifying named entities from EHRs, consisting of a Bi-LSTM layer for character-level encoding, a Bi-LSTM layer for word-level encoding, and a CRF layer for structured prediction.
Coarse-grained ADE Detection. Huynh et al. (2016) apply Convolutional Neural Networks with pre-trained word embeddings to detect sentences describing ADEs. Tafti et al. (2017) utilized a feed-forward ANN to discover ADE sentences in PubMed Central data and social media. Dev et al. (2017) developed a binary document classifier using logistic regression, random forests, and LSTMs to classify an AE case as serious vs. non-serious.

Multi-grained ADE Detection. Zhang et al. (2018) developed a multi-task learning model that combines entity recognition with document classification to extract the adverse event from a case narrative and classify the case as serious or non-serious. However, their approach falls short in tackling our problem. Not only do their targeted labels not fall into the drug-related adverse side effects category, in which a causal relationship is suspected and required, but their attention model is also only simple self-attention. As a consequence, MGADE outperforms their model by 23.5% in F1 score for entity recognition and by 23.0% for assertive text classification, as shown in Section 4.
3 The Proposed Model: MGADE
3.1 Task Definition
In the ADE and medication related information detection task, the entities are ADE, Drugname, Dose, Duration, Frequency, Indication, Route, Severity, and Other Signs & Symptoms. The no-entity tag is O. Because some entities (like weight gain) can span multiple words, we work with a BIO tagging scheme to distinguish between the beginning (tag B-...) and the inside of an entity (tag I-...). The notation we use is given in Fig. 2. Given a sentence (a sequence of words), task one is the multi-class classification of ADE and medication related named entities in the text sequence, i.e., entity recognition. Task two is the binary classification of a sentence as ADE assertive text. The overall goal is to minimize the weighted sum of the entity recognition loss and the sentence classification loss.
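For concreteness, the following minimal sketch (Python; the tokens, tags, and drug name are made up for illustration and are not drawn from the corpus) shows how a multi-word entity is encoded under the BIO scheme and how a sentence-level label accompanies the word-level tags:

```python
# Hypothetical sentence with a multi-word ADE entity ("weight gain")
# and a drug mention; tags follow the BIO scheme described above.
tokens = ["Patient", "reported", "weight", "gain", "after", "taking", "drugX"]
tags   = ["O",       "O",        "B-ADE",  "I-ADE", "O",     "O",      "B-Drugname"]

# Task two assigns one binary label to the whole sentence:
# 1 if it asserts a drug-related adverse effect, else 0.
sentence_label = 1

for tok, tag in zip(tokens, tags):
    print(f"{tok:>10}  {tag}")
```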
3.2 Input Embedding Layer
The input to this layer is a sentence represented by a sequence of words $S = \langle w_1, w_2, \ldots, w_N \rangle$, where $N$ is the sentence length. The words are first broken into individual characters, and character-level representations, which capture the morphology of a word, are computed with a bidirectional LSTM over the sequence of characters in each input word. We employ pre-trained GloVe word vectors (Pennington et al., 2014) to obtain a fixed word embedding for each word. A consolidated dense embedding, comprised of the pre-trained word embedding concatenated with the learned character-level representation, is used to represent a word. The output of this layer is $X = [x_1, x_2, \ldots, x_N]$.
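A minimal sketch of this layer (PyTorch; the module and variable names are ours, and the dimensions follow Section 4.2 only loosely, with a character LSTM hidden size of 50 so that the bidirectional character encoding is 100-dimensional):

```python
import torch
import torch.nn as nn

class WordRepresentation(nn.Module):
    """Concatenate a pre-trained word embedding with a char-BiLSTM encoding."""
    def __init__(self, vocab_size=10000, char_vocab_size=100,
                 word_dim=300, char_dim=100, char_hidden=50):
        super().__init__()
        # In the paper, word embeddings are initialized from GloVe-300.
        self.word_emb = nn.Embedding(vocab_size, word_dim)
        self.char_emb = nn.Embedding(char_vocab_size, char_dim)
        # Bidirectional LSTM over the characters of each word.
        self.char_lstm = nn.LSTM(char_dim, char_hidden,
                                 bidirectional=True, batch_first=True)

    def forward(self, word_ids, char_ids):
        # word_ids: (N,)   char_ids: (N, max_word_len), one row per word
        w = self.word_emb(word_ids)                  # (N, word_dim)
        _, (h, _) = self.char_lstm(self.char_emb(char_ids))
        c = torch.cat([h[0], h[1]], dim=-1)          # (N, 2 * char_hidden)
        return torch.cat([w, c], dim=-1)             # x_t for each word
```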
3.3 Contextual Layer

LSTM is a type of recurrent neural network that effectively captures long-distance sequence information and the interactions between adjacent words (Hochreiter and Schmidhuber, 1997). The word representations $x_t$ are given as input to two separate LSTM networks (Bi-LSTM) that scan the sequence forward and backward, respectively. The hidden states learned by the forward and backward LSTMs are denoted as $\overrightarrow{h}_t$ and $\overleftarrow{h}_t$, respectively:

$\overrightarrow{h}_t = \mathrm{LSTM}(x_t, \overrightarrow{h}_{t-1})$  (1)

$\overleftarrow{h}_t = \mathrm{LSTM}(x_t, \overleftarrow{h}_{t+1})$  (2)

The output of this layer is a sequence of hidden states $H = [h_1, h_2, \ldots, h_N]$, where $h_t$ is the concatenation of $\overrightarrow{h}_t$ and $\overleftarrow{h}_t$. This way, the hidden state $h_t$ of a word encodes information about the $t$-th word and its context:

$h_t = [\overrightarrow{h}_t ; \overleftarrow{h}_t]$  (3)
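In PyTorch, a bidirectional LSTM already returns the per-word concatenation of Eq. (3); a hedged sketch (sizes are illustrative):

```python
import torch
import torch.nn as nn

N, emb_dim, hidden = 12, 400, 100    # illustrative sizes
X = torch.randn(1, N, emb_dim)       # word representations from Section 3.2

# Bi-LSTM over the sentence; each output row is h_t = [h_fwd_t ; h_bwd_t].
bilstm = nn.LSTM(emb_dim, hidden, bidirectional=True, batch_first=True)
H, _ = bilstm(X)                     # H: (1, N, 2 * hidden)
```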
3.4 Word-level (NER) Output Layer

The hidden states $h_t$ are passed through a non-linear layer and then through a softmax activation function with $k$ output nodes, where $k$ denotes the number of entity types (classes). The entity-type labels are the named entities in BIO format. Each output node belongs to one entity type and outputs a score for that entity type. The output of the softmax function is a categorical probability distribution, where the output probability of each class lies between 0 and 1 and all output probabilities sum to 1:

$a_t^{(i)} = \frac{\exp(e_t^{(i)})}{\sum_{j=1}^{k} \exp(e_t^{(j)})}$  (4)

Each word is classified into the entity type with the highest probability value:

$\hat{a}_t = \max_{i \in \{1,2,\ldots,k\}} a_t^{(i)}$  (5)
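A sketch of this output layer (PyTorch; the two-layer scorer is an assumption, as the paper only states a non-linear layer followed by softmax):

```python
import torch
import torch.nn as nn

k, hidden, N = 19, 200, 12            # 18 BIO entity tags + O (see Section 4.2)
H = torch.randn(1, N, hidden)         # Bi-LSTM states from Section 3.3

# Non-linear projection to per-class scores e_t, then softmax (Eq. 4).
scorer = nn.Sequential(nn.Linear(hidden, hidden), nn.Tanh(),
                       nn.Linear(hidden, k))
a = torch.softmax(scorer(H), dim=-1)  # (1, N, k) class probabilities
a_hat, tags = a.max(dim=-1)           # Eq. (5): top probability and its tag
```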
Figure 2: The architecture of the proposed Multi-Grained ADE Detection Network (MGADE).
3.5 Dual-Attention Layer
The purpose of the attention mechanism in the sentence classification task is to select important words in different contexts to build informative sentence representations. Different words have different importance for the ADE sentence classification task. For instance, key elements (words/phrases) in the ADE detection task are linked to multiple aspects of semantic information associated with the named entity categories (ADE, Drugname, Severity, Dose, Duration, Indication, etc.). It is therefore necessary to assign a weight to each word according to its contribution to the ADE sentence classification task.

Moreover, certain named entities are task-specific and are considered essential for ADE sentence classification. There exists a direct correspondence between such task-specific named entities and the sentence. Hence, we anticipate that there will be at least one word with the same label as the sentence-level label. For instance, a sentence that is labeled as an ADE sentence has a corresponding ADE entity word. Although other named entity words convey important information and contribute to the ADE sentence-level classification task, a stronger focus should be on the task-specific ADE words indicative of the ADE sentence's core message. A single attention distribution tends to be insufficient to explore this multi-aspect information and consequently may risk losing important cues (Wang et al., 2017).
We address this challenge by generating and using multiple attention distributions that offer additional opportunities to extract relevant semantic information. This way, we focus on different aspects of an ADE sentence to create a more informative representation. For this, we introduce a novel dual-attention mechanism which, in addition to selecting the important semantic areas in the sentence (henceforth referred to as supervised self-attention (Bahdanau et al., 2014; Yang et al., 2016; Rei and Søgaard, 2019)), also provides stronger emphasis on task-specific semantic aspect areas (henceforth referred to as task-specific attention). The task-specific attention promotes the words important to the ADE sentence classification task and reduces the noise introduced by words that are less important for the task.

Similar to (Rei and Søgaard, 2019; Yang et al., 2016), we use a self-attention mechanism where, based on softmax probabilities and normalization, attention weights are extracted from the word-level prediction scores. The difference between the two attention mechanisms is that the supervised self-attention considers the word-level prediction scores of all named entities, while the task-specific attention considers the word-level prediction scores w.r.t. only a selective named entity (the one which corresponds to the ADE sentence, ignoring the other named entities). Specifically, the weights of the supervised self-attention and the task-specific attention are calculated as follows.
The word-level prediction w.r.t. the task-specific named entity (i.e., ADE) is:

$a_t^{(\mathrm{ADE\,entity})} = \frac{\exp(e_t^{(\mathrm{ADE\,entity})})}{\sum_{j=1}^{k} \exp(e_t^{(j)})}$  (6)

The task-specific attention weight, normalized to sum up to 1 over all values in the sentence, is:

$\alpha_t = \frac{a_t^{(\mathrm{ADE\,entity})}}{\sum_{n=1}^{N} a_n^{(\mathrm{ADE\,entity})}}$  (7)

The supervised self-attention weight, normalized to sum up to 1 over all values in the sentence, is:

$\beta_t = \frac{\hat{a}_t}{\sum_{n=1}^{N} \hat{a}_n}$  (8)
Fig. 3 shows examples of the supervised self-attention and task-specific attention distributions generated by our attention layer. The color depth expresses the degree of importance of the weight in the attention vector. As depicted in Fig. 3, the task-specific attention places more emphasis on the parts relevant to the ADE sentence classification task.
Attention-based Sentence Representations. To generate informative and more accurate sentence representations, we construct two different sentence representations as weighted sums of the context-conditioned hidden states, using the task-specific attention weights $\alpha_t$ and the supervised self-attention weights $\beta_t$, respectively.

1. Task-specific attention weighted sentence representation:

$TS_S = \sum_{t=1}^{N} \alpha_t h_t$  (9)

2. Supervised self-attention weighted sentence representation:

$SS_S = \sum_{t=1}^{N} \beta_t h_t$  (10)

Attention Pooling. A combination of multiple sentence representations, obtained by focusing on different aspects, captures the overall contextual semantic information about a sentence. The two attention-based representations are concatenated to form a dual-attention contextual sentence representation:

$C_S = [TS_S ; SS_S]$  (11)
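Putting Eqs. (6)-(11) together, a minimal sketch of the dual-attention layer (PyTorch; the tensor names and the ADE class index are ours):

```python
import torch

N, k, hidden = 12, 19, 200
e = torch.randn(N, k)            # word-level prediction scores e_t (Sec. 3.4)
H = torch.randn(N, hidden)       # Bi-LSTM hidden states h_t (Sec. 3.3)
ADE = 1                          # illustrative index of the ADE entity class

a = torch.softmax(e, dim=-1)               # Eq. (4)
a_hat = a.max(dim=-1).values               # Eq. (5)

alpha = a[:, ADE] / a[:, ADE].sum()        # Eq. (7): task-specific weights
beta = a_hat / a_hat.sum()                 # Eq. (8): supervised self-attention

TS = (alpha.unsqueeze(-1) * H).sum(dim=0)  # Eq. (9)
SS = (beta.unsqueeze(-1) * H).sum(dim=0)   # Eq. (10)
C = torch.cat([TS, SS], dim=-1)            # Eq. (11): dual-attention rep.
```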
3.6 Entity Prediction Embedding Layer
ADE detection is a challenging task. Understanding the co-occurrence of named entities (labels) is essential for ADE sentence classification. Although we implicitly capture long-range label dependencies with the Bi-LSTM in the contextual layer, and build even more informative sentence-level representations with the help of the dual-attention layer, explicitly integrating information on the label distribution in a sentence further helps to understand the label co-occurrence structure and dependencies in the sentence. The idea is to further improve the performance of the ADE sentence classification task by learning from the output word-level label knowledge. To better represent the word-level label distribution and to capture potential label dependencies within each sentence, we propose the Entity Prediction Embedding (EPE), a sentence-level vector representation of the entity labels predicted at the word-level output layer (Sec. 3.4):

$\hat{l}_t = \arg\max_{i \in \{0,1,2,\ldots,k\}} a_t^{(i)}$  (12)

$L_S = [v_0, v_1, v_2, \ldots, v_k], \quad v_i \in \{0, 1\}$  (13)
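Reading Eqs. (12)-(13) as a binary presence vector over predicted labels, a hedged sketch (PyTorch; the exact construction of $L_S$ is our interpretation):

```python
import torch

N, k = 12, 19
a = torch.softmax(torch.randn(N, k), dim=-1)  # word-level probabilities (Eq. 4)

l_hat = a.argmax(dim=-1)   # Eq. (12): predicted entity label for each word
L_S = torch.zeros(k)       # Eq. (13): one binary slot per label
L_S[l_hat] = 1.0           # v_i = 1 iff label i is predicted in the sentence
```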
3.7 Sentence Encoding Layer
A final sentence representation that captures the overall contextual semantic information and the label dependencies within the sentence is constructed by combining the dual-attention weighted sentence representation and the Entity Prediction Embedding:

$S = [C_S ; L_S]$  (14)
3.8 Sentence Classification Output Layer
Finally, we apply a fully connected layer with a sigmoid activation to output the sentence prediction score:

$\hat{y}_{\mathrm{sentence}} = p(y = 1 \mid S)$  (15)
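A sketch of these final two layers (PyTorch; sizes follow Section 4.2):

```python
import torch
import torch.nn as nn

C_S = torch.randn(400)   # dual-attention sentence representation (Eq. 11)
L_S = torch.zeros(19)    # entity prediction embedding (Eq. 13)

S = torch.cat([C_S, L_S])              # Eq. (14): 419-dimensional sentence vector
classifier = nn.Linear(S.numel(), 1)   # fully connected layer
y_hat = torch.sigmoid(classifier(S))   # Eq. (15): ADE sentence probability
```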
3.9 Optimization Objective
The first objective is to minimize the mean squared error between the predicted sentence-level score $\hat{y}^{(\mathrm{sentence})}$ and the gold-standard sentence label $y^{(\mathrm{sentence})}$ across all $m$ sentences:

$L_{\mathrm{sentence}} = \sum_{m} \left(y^{(m)} - \hat{y}^{(m)}\right)^2$  (16)
Figure 3: Attention visualizations. Highlighted words indicate attended words; stronger color denotes a higher focus of attention. (a) Task-specific attention: recognizes task-specific semantic aspect areas of the sentence, with a focus on the ADE entity words essential for the ADE sentence classification task. (b) Supervised self-attention: recognizes all important areas in the sentence. (c) Distribution of the task-specific attention and supervised self-attention weights.

The second objective is to minimize the cross-entropy loss between the predicted word-level probability scores and the gold-standard word-level labels across all $N$ words in each sentence:

$L_{\mathrm{word}} = -\sum_{m} \sum_{t=1}^{N} \sum_{i=1}^{k} a_{ti}^{(m)} \log\left(\hat{a}_{ti}^{(m)}\right)$  (17)
Similar to (Rei and Søgaard, 2019), we also add another loss function joining the sentence-level and word-level objectives, which encourages the model to optimize for two conditions on an ADE sentence: (i) an ADE sentence must have at least one ADE entity word, and (ii) an ADE sentence must have at least one word that is either a non-ADE entity word or a no-entity word:

$L_{\mathrm{attn}} = \sum_{m} \left(\min_t\left(\hat{a}^{(m)}_{t,\mathrm{ADE}}\right) - 0\right)^2 + \sum_{m} \left(\max_t\left(\hat{a}^{(m)}_{t,\mathrm{ADE}}\right) - y^{(m)}\right)^2$  (18)

We combine the different objective functions using weighting parameters that allow us to control the importance of each objective. The final objective that we minimize during training is then:

$L = \lambda_{\mathrm{sent}} \cdot L_{\mathrm{sent}} + \lambda_{\mathrm{word}} \cdot L_{\mathrm{word}} + \lambda_{\mathrm{attn}} \cdot L_{\mathrm{attn}}$  (19)
By using word-level entity predictions as attention weights for composing sentence-level representations, we explicitly connect the predictions at both levels of granularity. When both objectives work in tandem, they help improve each other's performance. In our joint model, we give equal importance to both tasks and set $\lambda_{\mathrm{word}} = \lambda_{\mathrm{sentence}} = 1$.
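A hedged sketch of the combined objective for a single sentence (PyTorch; the batch reduction, the ADE class index, and setting $\lambda_{\mathrm{attn}} = 1$ are our assumptions):

```python
import torch
import torch.nn.functional as F

N, k, ADE = 12, 19, 1                        # ADE class index is illustrative
a = torch.softmax(torch.randn(N, k), dim=-1) # predicted word distributions (Eq. 4)
gold_tags = torch.randint(0, k, (N,))        # gold BIO tags
y_hat = torch.rand(1)                        # sentence score (Eq. 15)
y = torch.ones(1)                            # gold sentence label

loss_sent = F.mse_loss(y_hat, y)                   # Eq. (16)
loss_word = F.nll_loss(torch.log(a), gold_tags)    # Eq. (17)
loss_attn = (a[:, ADE].min() - 0) ** 2 \
          + (a[:, ADE].max() - y) ** 2             # Eq. (18)

lam_sent = lam_word = lam_attn = 1.0               # equal task weights (Sec. 3.9)
loss = lam_sent * loss_sent + lam_word * loss_word + lam_attn * loss_attn  # Eq. (19)
```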
4 Experimental Study
4.1 Data Set

The MADE 1.0 NLP challenge for detecting medication and ADE related information from EHRs (Jagannatha and Yu, 2016a) used 1089 de-identified EHR notes from 21 cancer patients (training: 876 notes, testing: 213 notes). The annotation statistics of the corpus are provided in (Jagannatha et al., 2019).
Named Entity Labels. The notes are annotated with several categories of medication information. Adverse Drug Event (ADE), Drugname, Indication, and Other Sign Symptom and Diseases (OtherSSD) are specified as medical events that contribute to a change in a patient's medical status. Severity, Route, Frequency, Duration, and Dosage are specified as attributes that describe important properties of the medical events. Severity denotes the severity of a disease or symptom. Route, Frequency, Duration, and Dosage, as attributes of Drugname, label the medication method, the frequency of dosage, the duration of dosage, and the dosage quantity, respectively.
Sentence Labels. In MADE 1.0, each word is manually annotated with an ADE or medication related entity type. For words that belong to the ADE entity type, an additional relation annotation denotes whether the ADE entity is an adverse side effect of the prescription of the Drugname entity. Since the MADE 1.0 dataset does not have sentence-level annotations, we use the relation annotations together with the word annotations to assign each sentence a label of ADE or nonADE. In this work, the relation labels are used only to assign the sentence labels; they are not used in the supervised learning process.
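A sketch of how such sentence labels could be derived (plain Python; the data structures and field names are illustrative, not MADE 1.0's actual annotation schema):

```python
# Hypothetical derivation of a sentence label from word-level tags and
# relation annotations; "adverse_effect" is an illustrative relation type.
def sentence_label(word_tags, relations):
    """Label a sentence ADE if it contains an ADE entity that participates
    in an adverse-effect relation with a Drugname entity."""
    has_ade_entity = any(tag.endswith("ADE") for tag in word_tags)
    has_ade_relation = any(r["type"] == "adverse_effect" for r in relations)
    return "ADE" if (has_ade_entity and has_ade_relation) else "nonADE"

print(sentence_label(["O", "B-Drugname", "O", "B-ADE", "I-ADE"],
                     [{"type": "adverse_effect"}]))   # -> ADE
```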
4.2 Hyper-parameter Settings
The model operates on tokenized sentences. Tokens were lower-cased, while the character-level component receives input with the original capitalization to learn the morphological features of each word. As input, we use the publicly available pre-trained GloVe word embeddings of size 300 (Pennington et al., 2014). The learned character-level embeddings are 100-dimensional vectors. The hidden sizes of the word-level and character-level LSTMs are 300 and 100, respectively. The combined hidden representation $h_t$ was set to size 200; the attention weight layer $e_t$ was set to size 100. The attention-weighted sentence representations $TS_S$ and $SS_S$ are 200-dimensional vectors, and therefore their combined context vector $C_S$ is 400-dimensional. The Entity Prediction Embedding (EPE) $L_S$ has one slot per entity tag in BIO format; hence the EPE is a 19-dimensional binary vector (eighteen entity tags plus the no-entity tag). The final concatenated sentence-level vector $S$ is thus of size 419. To avoid over-fitting, we apply a dropout rate (Ma and Hovy, 2016; Srivastava et al., 2014) of 0.5 in our model. All models were trained with a learning rate of 0.001 using Adam (Kingma and Ba, 2014).

Figure 4: Single vs. dual attention distribution. The color intensity corresponds to the weight given to each word; the attention weight of each word is given in parentheses. Panels: (a) single task-specific attention, (b) dual task-specific attention, (c) single supervised self-attention, (d) dual supervised self-attention, (e) distribution of attention weights, (f) sentence prediction scores. The single attention-based models (a) and (c) fail to place sufficient attention weight on the key semantic areas of the sentence. In the dual-attention based model, where the two attention distributions are combined, accurate weights are assigned, as in (b) and (d).
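For reference, the settings above collected in one place (a plain Python dictionary; the grouping is ours):

```python
# Hyper-parameter summary of Section 4.2.
CONFIG = {
    "word_embedding": "GloVe, 300-dim, pre-trained, tokens lower-cased",
    "char_embedding_dim": 100,
    "word_lstm_hidden": 300,
    "char_lstm_hidden": 100,
    "combined_hidden_h_t": 200,
    "attention_layer_e_t": 100,
    "sentence_rep_dim": 200,      # each of TS_S and SS_S; C_S is 400
    "epe_dim": 19,                # 18 BIO entity tags + no-entity tag
    "final_sentence_vector": 419,
    "dropout": 0.5,
    "optimizer": "Adam",
    "learning_rate": 0.001,
}
```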
4.3 Results
4.3.1 ADE Assertive Sentence Classification

Table 1 compares our model against two baseline individual ADE sentence classification models: (i) LAST, similar to (Dernoncourt et al., 2017), a Bi-LSTM based sentence classification model that uses the last hidden states for sentence composition; and (ii) ATTN, similar to (Yang et al., 2016), a Bi-LSTM model that uses simple attention weights for sentence composition. Our full model MGADE improves the F1 score by 13.6% over the LAST baseline in testing. We also compare with a joint-task model based on self-attention, similar to (Zhang et al., 2018); MGADE outperforms their model by 23.0% for sentence classification.

Table 1: ADE sentence classification: F1 scores.

  Model                               F1
  Baseline Individual Models
    LAST (Dernoncourt et al., 2017)   0.66
    ATTN (Yang et al., 2016)          0.63
  Baseline Joint Model
    (Zhang et al., 2018)              0.61
  MGADE                               0.75
Table 2: ADE entity recognition: F1 scores.

  Model                                    F1
  Baseline Individual Models
    Bi-LSTM (Wunnava et al., 2019)         0.56
    Bi-LSTM + CRF (Wunnava et al., 2019)   0.63
  Baseline Joint Model
    (Zhang et al., 2018)                   0.51
  MGADE                                    0.63
4.3.2 ADE Named Entity Recognition

Table 2 compares our model against the best performing models on the MADE 1.0 benchmark in the literature (Wunnava et al., 2019) for ADE entity recognition. The entity recognition component of our MGADE is similar to their Bi-LSTM model. MGADE improves the F1 score by 12.5% over their Bi-LSTM only model. Our model achieved comparable results to their Bi-LSTM + CRF combination model. Models with a CRF layer predict the label sequence jointly instead of predicting each label individually, which is helpful for predicting sequences where the label of each word depends on the label of the previous word. Adding a CRF component to our model might thus further improve the performance of the entity recognition task. We also compare with a joint-task model based on self-attention, similar to (Zhang et al., 2018); MGADE outperforms their model by 23.5% for entity recognition.

Table 3: Effect of the dual-attention layer. † denotes single-attention models: MGADE-SelfA has the task-specific attention removed, and MGADE-TaskA has the supervised self-attention removed.

                     ADE Entity Recognition    ADE Sentence Classification
  Model              P     R     F1            P     R     F1
  MGADE-SelfA †      0.58  0.52  0.55          0.84  0.55  0.67
  MGADE-TaskA †      0.62  0.50  0.55          0.82  0.64  0.72
  MGADE-DualA        0.68  0.55  0.61          0.87  0.65  0.74
  MGADE              0.70  0.57  0.63          0.86  0.67  0.75
4.3.3 Ablation Analysis

To evaluate the effect of each part of our model, we remove core sub-components and quantify the performance drop in F1 score.

Types of Attention. Table 3 studies the two types of attention we generate for composing sentence-level representations: supervised self-attention (β) and task-specific attention (α). † denotes the models with single attention. As shown in the table, the models that use only a single attention component, be it the supervised self-attention based sentence representation ($SS_S$) or the task-specific attention based sentence representation ($TS_S$), achieve the same F1 score on the entity recognition task. However, their sentence classification performance varies, demonstrating that the two attentions capture different aspects of the information in the sentence. The type of attention captured plays a critical role in composing an informative sentence representation. Both single-attention models performed better than the baseline individual sentence classification models LAST and ATTN (see Table 1). $TS_S$ achieved superior sentence classification performance over $SS_S$. Intuitively, a stronger focus should be placed on the words indicative of the sentence type, and $TS_S$, which places more emphasis on the parts relevant to the ADE sentence classification task, is more accurate in identifying ADE sentences.

Single Attention vs. Dual-Attention. Table 3 also studies the impact of the dual-attention component. As seen, the model with the dual-attention sentence representation $C_S$, which combines the two attention-weighted sentence representations, outperforms the models with single attention (denoted by †) on both the entity recognition and sentence classification tasks.

Label-Awareness. Table 3 further studies the effect of adding the label-awareness component to improve the sentence representation. Our full model MGADE, with both the dual-attention and label-aware components, further improves the performance of the sentence classification and entity recognition tasks by 1.0% and 2.0%, respectively, compared to MGADE-DualA, the model with only the dual-attention component.

Case Study. Dual-attention is not only effective in capturing multiple aspects of the semantic information in a sentence, but also in reducing the risk of capturing incorrect or insufficient attention when only one of the single attentions (either task-specific or supervised self-attention) is used. Fig. 4 shows such an example, where single attention, either task-specific or supervised self-attention, fails to place sufficient attention weight on the key semantic areas of the sentence necessary to make a correct prediction for the sentence. The incorrect distribution of attention weights assigned by the single task-specific and single supervised self-attention models (Figures 4a and 4c) is addressed by the dual-attention mechanism. The latter corrects the distribution and assigns appropriate weights to the relevant semantic words, as in Figures 4b and 4d. In Figures 4e and 4f, we demonstrate the effectiveness of the dual-attention mechanism by plotting the attention weight distributions and the sentence prediction scores when each specific type of attention is composed into the sentence representation. The bar chart depicts the ADE sentence-level classification confidence scores w.r.t. the single-attention and dual-attention models and confirms the utility of dual-attention.
5 Conclusion
We propose a dual-attention network for multi-grained ADE detection to jointly identify ADE entities and ADE assertive sentences in medical narratives. Our model effectively supports knowledge sharing between the two levels of granularity, i.e., words and sentences, improving the overall quality of prediction on both tasks. Our solution delivers significant performance improvements over state-of-the-art models on both tasks. Our MGADE architecture is pluggable, in that other sequential learning models, including BERT (Devlin et al., 2019) or other models for sequence labelling and text classification, could be substituted in place of the Bi-LSTM sequential representation learning model. We leave this enhancement of our model and its study to future work.
References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473.

Franck Dernoncourt, Ji Young Lee, and Peter Szolovits. 2017. Neural networks for joint sentence classification in medical paper abstracts. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2017, Valencia, Spain, April 3-7, 2017, Volume 2: Short Papers, pages 694-700. Association for Computational Linguistics.

Shantanu Dev, Shinan Zhang, Joseph Voyles, and Anand S Rao. 2017. Automated classification of adverse events in pharmacovigilance. In 2017 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pages 905-909. IEEE.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 4171-4186. Association for Computational Linguistics.

Molla S. Donaldson, Janet M. Corrigan, Linda T. Kohn, and Editors. 2000. To Err Is Human: Building a Safer Health System, volume 6. National Academies Press.

Harsha Gurulingappa, Juliane Fluck, Martin Hofmann-Apitius, and Luca Toldo. 2011. Identification of adverse drug event assertive sentences in medical case reports. In First International Workshop on Knowledge Discovery and Health Care Management (KD-HCM), European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD), pages 16-27.

Rave Harpaz, Alison Callahan, Suzanne Tamang, Yen Low, David Odgers, Sam Finlayson, Kenneth Jung, Paea LePendu, and Nigam H Shah. 2014. Text mining for adverse drug events: the promise, challenges, and state of the art. Drug Safety, 37(10):777-790.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9:1735-1780.

Trung Huynh, Yulan He, Alistair Willis, and Stefan Rüger. 2016. Adverse drug reaction classification with deep neural networks. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 877-887.

Abhyuday Jagannatha, Feifan Liu, Weisong Liu, and Hong Yu. 2019. Overview of the first natural language processing challenge for extracting medication, indication, and adverse drug events from electronic health record notes (MADE 1.0). Drug Safety, 42(1):99-111.

Abhyuday N Jagannatha and Hong Yu. 2016a. Bidirectional RNN for medical event detection in electronic health records. In Proceedings of the Conference. ACL. North American Chapter. Meeting, volume 2016, page 473. NIH Public Access.

Abhyuday N. Jagannatha and Hong Yu. 2016b. Structured prediction models for RNN based sequence labeling in clinical text. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, volume 2016, page 856.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Feifan Liu, Abhyuday Jagannatha, and Hong Yu. 2019. Towards drug safety surveillance and pharmacovigilance: current progress in detecting medication and adverse drug events from electronic health records.

Lemao Liu, Masao Utiyama, Andrew M. Finch, and Eiichiro Sumita. 2016. Neural machine translation with supervised attention. In COLING 2016, 26th International Conference on Computational Linguistics, Proceedings of the Conference: Technical Papers, December 11-16, 2016, Osaka, Japan, pages 3093-3102. ACL.

Minh-Thang Luong, Hieu Pham, and Christopher D Manning. 2015. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025.

Xuezhe Ma and Eduard H. Hovy. 2016. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers.

Chandra Pandey, Zina M. Ibrahim, Honghan Wu, Ehtesham Iqbal, and Richard J. B. Dobson. 2017. Improving RNN with attention and embedding for adverse drug reactions. In Proceedings of the 2017 International Conference on Digital Health, London, United Kingdom, July 2-5, 2017, pages 67-71.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532-1543.

Marek Rei and Anders Søgaard. 2019. Jointly learning to label sentences and tokens. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6916-6923.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15:1929-1958.

Ahmad P Tafti, Jonathan Badger, Eric LaRose, Ehsan Shirzadi, Andrea Mahnke, John Mayer, Zhan Ye, David Page, and Peggy Peissig. 2017. Adverse drug event discovery using biomedical literature: a big data neural network adventure. JMIR Medical Informatics, 5(4):e51.

Peng Wang, Lingqiao Liu, Chunhua Shen, Zi Huang, Anton van den Hengel, and Heng Tao Shen. 2017. Multi-attention network for one shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2721-2729.

Susmitha Wunnava, Xiao Qin, Tabassum Kakar, Cansu Sen, Elke A Rundensteiner, and Xiangnan Kong. 2019. Adverse drug event detection from electronic health records using hierarchical recurrent neural networks with dual-level embedding. Drug Safety, 42(1):113-122.

Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1480-1489.

Shinan Zhang, Shantanu Dev, Joseph Voyles, and Anand S Rao. 2018. Attention-based multi-task learning in pharmacovigilance. In 2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pages 2324-2328. IEEE.