Findings of the Association for Computational Linguistics: EMNLP 2020, pages 3414–3423, November 16–20, 2020. ©2020 Association for Computational Linguistics


A Dual-Attention Network for Joint Named Entity Recognition and Sentence Classification of Adverse Drug Events

Susmitha Wunnava
Worcester Polytechnic Institute
100 Institute Rd, Worcester, MA 01609
[email protected]

Xiao Qin
IBM Research - Almaden
650 Harry Road, San Jose, CA 95120
[email protected]

Tabassum Kakar
Worcester Polytechnic Institute
100 Institute Rd, Worcester, MA 01609
[email protected]

Xiangnan Kong
Worcester Polytechnic Institute
100 Institute Rd, Worcester, MA 01609
[email protected]

Elke A. Rundensteiner
Worcester Polytechnic Institute
100 Institute Rd, Worcester, MA 01609
[email protected]

Abstract

An adverse drug event (ADE) is an injury resulting from medical intervention related to a drug. ADE detection from text can be either fine-grained (ADE entity recognition) or coarse-grained (ADE assertive sentence classification), with limited efforts leveraging the interdependencies between these two granularities. We instead design a multi-grained joint deep network model, MGADE, to concurrently solve both ADE tasks. MGADE takes advantage of their symbiotic relationship, with a transfer of knowledge between the two levels of granularity. Our dual-attention mechanism constructs multiple distinct representations of a sentence that capture both task-specific and semantic information in the sentence, providing stronger emphasis on the key elements essential for sentence classification. Our model improves the state-of-the-art F1 scores for both tasks: (i) entity recognition of ADE words (12.5% increase) and (ii) ADE sentence classification (13.6% increase) on the MADE 1.0 benchmark of EHR notes.

1 Introduction

Background. Adverse drug events (ADEs), injuries resulting from medical intervention, are a leading cause of death in the United States and cost around $30–$130 billion every year (Donaldson et al., 2000). Early detection of ADE incidents aids in the timely assessment, mitigation and prevention of future occurrences of ADEs. Natural Language Processing techniques have been recognized as instrumental in identifying ADEs and related information from the unstructured text fields of spontaneous reports and electronic health records (EHRs), and thus in improving drug safety monitoring and pharmacovigilance (Harpaz et al., 2014).

Fine-grained ADE detection identifies named ADE entities at the word level, while coarse-grained ADE detection (also called ADE assertive text classification) identifies complete sentences describing drug-related adverse effects. Gurulingappa et al. (2011)'s system for the identification of ADE assertive sentences in medical case reports targets the important application of detecting under-reported and under-documented adverse drug effects. Lastly, multi-grained ADE detection identifies ADE information at multiple levels of granularity, namely, at both the entity and sentence level.

As an example, Figure 1 displays ADE and non-ADE sentences. The first is an ADE sentence where the mentions of the Drugname and ADE entities have the appropriate relationship with each other. The second and third sentences show that the mention of an ADE entity by itself is not sufficient to assert a drug-related adverse side effect.

Recently, deep learning-based sequence approaches have shown some promise in extracting fine-grained ADEs and related named entities from text (Liu et al., 2019). However, the prevalence of entity-type ambiguity remains a major hurdle, such as distinguishing between Indication entities as the reason for taking a drug and ADE entities as unintended outcomes of taking a drug. Coarse-grained sentence-level detection performs well in identifying ADE descriptive sentences, but is not equipped to detect fine-grained information such as the words associated with ADE related named entities. Unfortunately, when the interaction between these two extraction tasks is ignored, we miss the opportunity for a transfer of knowledge between the ADE entity and sentence prediction tasks.

Attention-based neural network models have been shown to be effective for text classification


Figure 1: Each sentence is classified as an ADE sentence (binary yes/no). Each word is labeled using beginning of an entity (B-...) vs. inside an entity (I-...) for ADE related named entities (multiple classes). O denotes the no-entity tag.

tasks (Luong et al., 2015; Bahdanau et al., 2014), from alignment attention in translation (Liu et al., 2016) to supervising attention in binary text classification (Rei and Søgaard, 2019). Previous approaches typically apply only a single round of attention focusing on simple semantic information. In our ADE detection task, instead, key elements of the sentence can be linked to multiple categories of task-specific semantic information of the named entities (ADE, Drug, Indication, Severity, Dose, etc.). Thus, single attention is insufficient for exploring this multi-aspect information and consequently risks losing important cues.

Proposed Approach. In our work, we tackle the above shortcomings by designing a dual-attention based neural network model for multi-grained joint learning, called MGADE, that jointly identifies both ADE entities and ADE assertive sentences. The design of MGADE is inspired by multi-task Recurrent Neural Network architectures for jointly learning to label tokens and sentences in a binary classification setting (Rei and Søgaard, 2019). In addition, our model makes use of a supervised self-attention mechanism based on entity-level predictions to guide the attention function, aiding it in tackling the above entity-type ambiguity problem. We also introduce novel strategies for constructing multiple complementary sentence-level representations to enhance the performance of sentence classification.

Our key contributions include:

1. Joint Model. We jointly model ADE entity recognition as a multi-class sequence tagging problem and ADE assertive text classification as binary classification. Our model leverages the mutually beneficial relationship between these two tasks; e.g., ADE sentence classification can influence ADE entity recognition by identifying clues that contribute to the ADE assertiveness of the sentence and matching them to ADE entities.

2. Dual-Attention. Our novel method for generating and pooling multiple attention mechanisms produces informative sentence-level representations. Our dual-attention mechanisms, based on word-level entity predictions, construct multiple representations of the same sentence. The dual-attention weighted sentence-level representations capture both task-specific and semantic information in a sentence, providing stronger emphasis on the key elements essential for sentence classification.

3. Label-Awareness. We introduce an augmented sentence-level representation comprised of predicted entity labels, which adds label context to the proposed dual-attention sentence-level representation for better capturing the word-level label distribution and word dependencies within the sentence. This further boosts the performance of the sentence classification task.

4. Model Evaluation. We compare our joint model with state-of-the-art methods for the ADE entity recognition and ADE sentence classification tasks. Experiments on the MADE 1.0 benchmark of EHR notes demonstrate that our MGADE model drives up the F1-score for both tasks significantly: (i) entity recognition of ADE words by 12.5% and by 23.5%, and (ii) ADE sentence classification by 13.6% and by 23.0%, compared to state-of-the-art single-task and joint-task models, respectively.

2 Related Work

Fine-grained ADE Detection. Jagannatha and Yu (2016b) have employed a bidirectional LSTM-CRF model to label named entities from electronic health records of cancer patients. Pandey et al. (2017) proposed a bidirectional recurrent neural network with attention to extract ADRs and classify the relationship between entities from Medline abstracts and EHR datasets. Wunnava et al. (2019) presented a three-layer deep learning architecture for identifying named entities from EHRs, consisting of a Bi-LSTM layer for character-level encoding, a Bi-LSTM layer for word-level encoding, and a CRF layer for structured prediction.


Coarse-grained ADE Detection. Huynh et al. (2016) apply Convolutional Neural Networks using pre-trained word embeddings to detect sentences describing ADEs. Tafti et al. (2017) utilized a feed-forward ANN to discover ADE sentences in PubMed Central data and social media. Dev et al. (2017) developed a binary document classifier using logistic regression, random forests and LSTMs to classify an AE case as serious vs. non-serious.

Multi-grained ADE Detection. Zhang et al. (2018) developed a multi-task learning model that combines entity recognition with document classification to extract the adverse event from a case narrative and classify the case as serious or non-serious. However, they fall short in tackling our problem. Not only do their targeted labels not fall into the drug-related adverse side effects category, in which a causal relationship is suspected and required, but their attention model is only simple self-attention. As a consequence, MGADE outperforms their model by 23.5% in F1 score for entity recognition and 23.0% for assertive text classification, as seen in Section 4.

3 The Proposed Model: MGADE

3.1 Task Definition

In the ADE and medication related information detection task, the entities are ADE, Drugname, Dose, Duration, Frequency, Indication, Route, Severity and Other Signs & Symptoms. The no-entity tag is O. Because some entities (like weight gain) can span multiple words, we work with a BIO tagging scheme to distinguish between the beginning (tag B-...) and the inside of an entity (tag I-...). The notation we use is given in Fig. 2. Given a sentence (a sequence of words), task one is the multi-class classification of ADE and medication related named entities in the text sequence, i.e., entity recognition. Task two is the binary classification of a sentence as ADE assertive text. The overall goal is to minimize the weighted sum of the entity recognition loss and the sentence classification loss. A concrete example of the two label granularities is sketched below.
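For concreteness, a minimal illustration of the two label granularities on a made-up sentence (not drawn from the MADE 1.0 corpus):

```python
# A made-up example (not from the MADE 1.0 corpus) of the two label granularities.
sentence = ["She", "developed", "weight", "gain", "after", "starting", "DrugX", "."]

# "weight gain" is a two-word ADE entity: B-ADE marks its beginning and
# I-ADE its continuation; O marks words outside any entity.
word_labels = ["O", "O", "B-ADE", "I-ADE", "O", "O", "B-Drugname", "O"]

# Task two: the sentence as a whole asserts a drug-related adverse effect.
sentence_label = 1  # 1 = ADE sentence, 0 = non-ADE sentence

assert len(sentence) == len(word_labels)
```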

3.2 Input Embedding Layer

The input of this layer is a sentence represented by a sequence of words $S = \langle w_1, w_2, \ldots, w_N \rangle$, where $N$ is the sentence length. The words are first broken into individual characters, and character-level representations, which capture the morphology of a word, are computed with a bidirectional LSTM over the sequence of characters in the input words. We employ pre-trained word vectors, GloVe (Pennington et al., 2014), to obtain a fixed word embedding for each word. A consolidated dense embedding, comprised of the pre-trained word embedding concatenated with the learned character-level representation, is used to represent a word. The output of this layer is $X = [x_1, x_2, \ldots, x_N]$.
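For concreteness, a minimal PyTorch sketch of this layer; the module and variable names are ours for illustration (not from our released code), and the dimensions approximate the settings in Section 4.2:

```python
import torch
import torch.nn as nn

class InputEmbedding(nn.Module):
    """Word representation = pre-trained GloVe vector (+) char-level Bi-LSTM encoding."""

    def __init__(self, word_vocab, char_vocab, word_dim=300, char_dim=50, char_hidden=50):
        super().__init__()
        # In practice this table is initialized from the pre-trained GloVe vectors.
        self.word_emb = nn.Embedding(word_vocab, word_dim)
        self.char_emb = nn.Embedding(char_vocab, char_dim)
        # Bidirectional char LSTM: 2 * char_hidden = 100-d character representation.
        self.char_lstm = nn.LSTM(char_dim, char_hidden, bidirectional=True, batch_first=True)

    def forward(self, word_ids, char_ids):
        # word_ids: (N,) for one sentence; char_ids: (N, max_word_len)
        chars = self.char_emb(char_ids)                # (N, L, char_dim)
        _, (h_n, _) = self.char_lstm(chars)            # h_n: (2, N, char_hidden)
        char_repr = torch.cat([h_n[0], h_n[1]], -1)    # (N, 2 * char_hidden)
        words = self.word_emb(word_ids)                # (N, word_dim)
        return torch.cat([words, char_repr], dim=-1)   # (N, word_dim + 2 * char_hidden)
```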

3.3 Contextual Layer

LSTM is a type of recurrent neural network that effectively captures long-distance sequence information and the interaction between adjacent words (Hochreiter and Schmidhuber, 1997). The word representations $x_t$ are given as input to two separate LSTM networks (Bi-LSTM) that scan the sequence forward and backward, respectively. The hidden states learned by the forward and backward LSTMs are denoted as $\overrightarrow{h}_t$ and $\overleftarrow{h}_t$, respectively:

$$\overrightarrow{h}_t = \mathrm{LSTM}\left(x_t, \overrightarrow{h}_{t-1}\right) \qquad (1)$$

$$\overleftarrow{h}_t = \mathrm{LSTM}\left(x_t, \overleftarrow{h}_{t+1}\right) \qquad (2)$$

The output of this layer is a sequence of hidden states $H = [h_1, h_2, \ldots, h_N]$, where $h_t$ is a concatenation of $\overrightarrow{h}_t$ and $\overleftarrow{h}_t$. This way, the hidden state $h_t$ of a word encodes information about the $t$-th word and its context:

$$h_t = \left[\overrightarrow{h}_t; \overleftarrow{h}_t\right] \qquad (3)$$
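Equations (1)–(3) correspond directly to a bidirectional LSTM; a minimal sketch, with the 200-dimensional combined hidden size from Section 4.2:

```python
import torch.nn as nn

class ContextualLayer(nn.Module):
    """Bi-LSTM over the word representations; h_t = [forward h_t ; backward h_t] (Eq. 3)."""

    def __init__(self, input_dim=400, hidden_dim=100):
        super().__init__()
        # bidirectional=True runs Eqs. (1) and (2); the concatenation (Eq. 3) is built in.
        self.bilstm = nn.LSTM(input_dim, hidden_dim, bidirectional=True, batch_first=True)

    def forward(self, X):
        # X: (batch, N, input_dim) -> H: (batch, N, 2 * hidden_dim)
        H, _ = self.bilstm(X)
        return H
```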

3.4 Word-level (NER) Output Layer

The hidden states $h_t$ are passed through a non-linear layer and then, with the softmax activation function, to $k$ output nodes, where $k$ denotes the number of entity types (classes). The entity-type labels are the named entities in the BIO format. Each output node belongs to an entity type and outputs a score for that entity type. The output of the softmax function is a categorical probability distribution, where the output probability of each class is between 0 and 1 and the output probabilities sum to 1:

$$a_t^{(i)} = \frac{\exp\left(e_t^{(i)}\right)}{\sum_{j=1}^{k} \exp\left(e_t^{(j)}\right)} \qquad (4)$$

Each word is classified into the entity type that has the highest probability value:

$$\hat{a}_t = \max_{i \in \{1,2,\ldots,k\}} a_t^{(i)} \qquad (5)$$
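A sketch of this output layer; the intermediate tanh projection is an assumption on our part (the paper fixes only the layer sizes in Section 4.2), and k = 19 counts the eighteen BIO entity tags plus O:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NerOutputLayer(nn.Module):
    """Project each hidden state h_t to k entity-type scores e_t, then softmax (Eq. 4)."""

    def __init__(self, hidden_dim=200, proj_dim=100, k=19):
        super().__init__()
        self.hidden = nn.Linear(hidden_dim, proj_dim)  # non-linear layer
        self.scores = nn.Linear(proj_dim, k)           # k output nodes

    def forward(self, H):
        e = self.scores(torch.tanh(self.hidden(H)))    # (batch, N, k) raw scores e_t
        a = F.softmax(e, dim=-1)                       # Eq. 4: per-word distribution a_t
        a_hat, labels = a.max(dim=-1)                  # Eq. 5: top probability and its class
        return e, a, a_hat, labels
```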


Figure 2: The architecture of the proposed Multi-Grained ADE Detection Network (MGADE).

3.5 Dual-Attention Layer

The purpose of the attention mechanism in the sentence classification task is to select important words in different contexts to build informative sentence representations. Different words have different importance for the ADE sentence classification task. For instance, key elements (words/phrases) in the ADE detection task are linked to multiple aspects of semantic information associated with the named entity categories (ADE, Drugname, Severity, Dose, Duration, Indication, etc.). It is necessary to assign a weight to each word according to its contribution to the ADE sentence classification task.

Moreover, certain named entities are task-specific and are considered essential for ADE sentence classification. There exists a direct correspondence between such task-specific named entities and the sentence. Hence, we anticipate that there would be at least one word with the same label as the sentence-level label. For instance, a sentence that is labeled as an ADE sentence has a corresponding ADE entity word. Although other named entity words detect important information and contribute to the ADE sentence-level classification task, a stronger focus should be on task-specific ADE words indicative of the ADE sentence's core message. A single attention distribution tends to be insufficient to explore the multi-aspect information and consequently may risk losing important cues (Wang et al., 2017).

We address this challenge by generating and using multiple attention distributions that offer additional opportunities to extract relevant semantic information. This way, we focus on different aspects of an ADE sentence to create a more informative representation. For this, we introduce a novel dual-attention mechanism which, in addition to selecting the important semantic areas in the sentence (henceforth referred to as supervised self-attention (Bahdanau et al., 2014; Yang et al., 2016; Rei and Søgaard, 2019)), also provides stronger emphasis on task-specific semantic aspect areas (henceforth referred to as task-specific attention). The task-specific attention promotes the words important to the ADE sentence classification task and reduces the noise introduced by words which are less important for the task.

Similar to (Rei and Søgaard, 2019; Yang et al., 2016), we use a self-attention mechanism where, based on softmax probabilities and normalization, attention weights are extracted from word-level prediction scores. The difference between the two attention mechanisms is that the supervised self-attention recognizes the word-level prediction scores of all named entities, while the task-specific attention recognizes word-level prediction scores w.r.t. only selected named entities (ones which correspond to the ADE sentence, ignoring other named entities). Specifically, the weights of the supervised self-attention and task-specific attention are calculated as follows.

Word-level prediction w.r.t. the task-specific named entity (i.e., ADE):

$$a_t^{(ADE\,entity)} = \frac{\exp\left(e_t^{(ADE\,entity)}\right)}{\sum_{j=1}^{k} \exp\left(e_t^{(j)}\right)} \qquad (6)$$

The task-specific attention weight, normalized to sum to 1 over all values in the sentence, is:

$$\alpha_t = \frac{a_t^{(ADE\,entity)}}{\sum_{n=1}^{N} a_n^{(ADE\,entity)}} \qquad (7)$$

The supervised self-attention weight, normalized to sum to 1 over all values in the sentence, is:

$$\beta_t = \frac{\hat{a}_t}{\sum_{n=1}^{N} \hat{a}_n} \qquad (8)$$

Fig. 3 shows examples of the supervised self-attention and task-specific attention distributions generated by our attention layer. The color depth expresses the degree of importance of the weight in the attention vector. As depicted in Fig. 3, the task-specific attention places stronger emphasis on the parts relevant to the ADE sentence classification task.
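A sketch of Equations (6)–(8); the argument `ade_ids`, indexing the B-ADE/I-ADE classes, and the summation over both ADE tags are our illustrative choices:

```python
def dual_attention_weights(a, ade_ids):
    """a: (batch, N, k) word-level class probabilities from the NER layer.
    Returns task-specific weights alpha (Eq. 7) and supervised self-attention
    weights beta (Eq. 8), each (batch, N) and summing to 1 over the sentence."""
    # Eq. 6: probability mass assigned to the ADE entity classes for each word.
    a_ade = a[..., ade_ids].sum(dim=-1)              # (batch, N)
    alpha = a_ade / a_ade.sum(dim=1, keepdim=True)   # Eq. 7: task-specific attention
    a_hat = a.max(dim=-1).values                     # per-word top probability (Eq. 5)
    beta = a_hat / a_hat.sum(dim=1, keepdim=True)    # Eq. 8: supervised self-attention
    return alpha, beta
```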

Attention-based Sentence Representations. To generate informative and more accurate sentence representations, we construct two different sentence representations as a weighted sum of the context-conditioned hidden states, using the task-specific attention weight $\alpha_t$ and the supervised self-attention weight $\beta_t$, respectively.

1. Task-specific attention weighted sentence representation:

$$T_{SS} = \sum_{t=1}^{N} \alpha_t h_t \qquad (9)$$

2. Supervised self-attention weighted sentence representation:

$$S_{SS} = \sum_{t=1}^{N} \beta_t h_t \qquad (10)$$

Attention Pooling. A combination of multiple sentence representations obtained from focusing on different aspects captures the overall contextual semantic information about a sentence. The two attention-based representations are concatenated to form a dual-attention contextual sentence representation:

$$C_S = [T_{SS}; S_{SS}] \qquad (11)$$
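Equations (9)–(11) in code form, continuing the sketch above:

```python
import torch

def attention_pooled_sentence(H, alpha, beta):
    """H: (batch, N, d) contextual hidden states; alpha, beta: (batch, N) weights.
    Returns the dual-attention contextual sentence representation C_S (Eq. 11)."""
    T_ss = (alpha.unsqueeze(-1) * H).sum(dim=1)   # Eq. 9: task-specific weighted sum
    S_ss = (beta.unsqueeze(-1) * H).sum(dim=1)    # Eq. 10: self-attention weighted sum
    return torch.cat([T_ss, S_ss], dim=-1)        # Eq. 11: concatenation -> (batch, 2d)
```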

3.6 Entity Prediction Embedding Layer

ADE detection is a challenging task. Understanding the co-occurrence of named entities (labels) is essential for ADE sentence classification. Although we implicitly capture long-range label dependencies with the Bi-LSTM in the contextual layer, and construct even more informative sentence-level representations with the help of the dual-attention layer, explicitly integrating information about the label distribution in a sentence further helps to capture the label co-occurrence structure and dependencies in the sentence. The idea is to further improve the performance of the ADE sentence classification task by learning from the output word-level label knowledge. To better represent the word-level label distribution and to capture potential label dependencies within each sentence, we propose the Entity Prediction Embedding (EPE), a sentence-level vector representation of the entity labels predicted at the word-level output layer (Sec. 3.4):

$$\hat{l}_t = \arg\max_{i \in \{0,1,2,\ldots,k\}} a_t^{(i)} \qquad (12)$$

$$L_S = [v_0, v_1, v_2, \ldots, v_k], \quad v_i \in \{0, 1\} \qquad (13)$$
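A sketch of Equations (12)–(13), under our reading that $v_i = 1$ iff label $i$ is predicted for at least one word in the sentence:

```python
import torch

def entity_prediction_embedding(a, k=19):
    """a: (batch, N, k) word-level class probabilities.
    Returns L_S: (batch, k) binary vector with v_i = 1 iff entity label i is
    predicted for at least one word in the sentence (Eqs. 12-13)."""
    l_hat = a.argmax(dim=-1)                          # Eq. 12: per-word predicted labels
    L_S = torch.zeros(a.size(0), k, device=a.device)
    L_S.scatter_(1, l_hat, 1.0)                       # mark every predicted label as 1
    return L_S
```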

3.7 Sentence Encoding Layer

A final sentence representation that captures the overall contextual semantic information and the label dependencies within the sentence is constructed by concatenating the dual-attention weighted sentence representation and the Entity Prediction Embedding:

$$S = [C_S; L_S] \qquad (14)$$

3.8 Sentence Classification Output Layer

Finally, we apply a fully connected function and use sigmoid activation to output the sentence prediction score:

$$\hat{y}_{sentence} = p\left(y^{(j=1)} \mid S\right) \qquad (15)$$
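Equations (14)–(15) amount to a small classification head; a sketch with the dimensions from Section 4.2:

```python
import torch
import torch.nn as nn

class SentenceClassifier(nn.Module):
    """Concatenate C_S with the EPE (Eq. 14) and score the sentence (Eq. 15)."""

    def __init__(self, in_dim=419):  # 400-d dual-attention vector + 19-d EPE
        super().__init__()
        self.fc = nn.Linear(in_dim, 1)

    def forward(self, C_S, L_S):
        S = torch.cat([C_S, L_S], dim=-1)              # Eq. 14: final sentence vector
        return torch.sigmoid(self.fc(S)).squeeze(-1)   # Eq. 15: P(ADE sentence)
```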

3.9 Optimization Objective

The first objective is to minimize the mean squared error between the predicted sentence-level score $\hat{y}^{(sentence)}$ and the gold-standard sentence label $y^{(sentence)}$ across all $m$ sentences:

$$L_{sentence} = \sum_{m} \left( y^{(m)} - \hat{y}^{(m)} \right)^2 \qquad (16)$$

The second objective is to minimize the cross-entropy loss between the predicted word-level probability scores and the gold-standard word-level labels across all $N$ words in the sentence:

$$L_{word} = - \sum_{m} \sum_{t=1}^{N} \sum_{i=1}^{k} \left[ a_{ti}^{(m)} \log\left( \hat{a}_{ti}^{(m)} \right) \right] \qquad (17)$$

[Figure 3: Attention Visualizations. Highlighted words indicate attended words; stronger color denotes higher focus of attention. (a) Task-specific attention: recognizes task-specific semantic aspect areas of the sentence, with focus on the ADE entity words essential for the ADE sentence classification task. (b) Supervised self-attention: recognizes all important areas in the sentence. (c) Distribution of task-specific attention and supervised self-attention weights.]

Similar to (Rei and Søgaard, 2019), we also add another loss function, joining the sentence-level and word-level objectives, that encourages the model to optimize for two conditions on an ADE sentence: (i) an ADE sentence must have at least one ADE entity word, and (ii) an ADE sentence must have at least one word that is either a non-ADE entity or a no-entity word.

$$L_{attn} = \sum_{m} \left( \min_{t}\left( \hat{a}_{t,ADE}^{(m)} \right) - 0 \right)^2 + \sum_{m} \left( \max_{t}\left( \hat{a}_{t,ADE}^{(m)} \right) - y^{(m)} \right)^2 \qquad (18)$$

We combine the different objective functions using weighting parameters that allow us to control the importance of each objective. The final objective that we minimize during training is then:

$$L = \lambda_{sent} \cdot L_{sent} + \lambda_{word} \cdot L_{word} + \lambda_{attn} \cdot L_{attn} \qquad (19)$$

By using word-level entity predictions as attention weights for composing sentence-level representations, we explicitly connect the predictions at both levels of granularity. When both objectives work in tandem, they help improve the performance of one another. In our joint model, we give equal importance to both tasks and set $\lambda_{word} = \lambda_{sentence} = 1$.
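A sketch of the combined objective (Equations (16)–(19)); tensor shapes and the helper's signature are our illustrative choices:

```python
import torch.nn.functional as F

def joint_loss(y_hat, y, e, gold_labels, a_ade, lambdas=(1.0, 1.0, 1.0)):
    """y_hat, y: (batch,) predicted/gold sentence scores; e: (batch, N, k) word
    logits; gold_labels: (batch, N) gold entity ids; a_ade: (batch, N) per-word
    ADE probability mass; lambdas: (lambda_sent, lambda_word, lambda_attn)."""
    lam_sent, lam_word, lam_attn = lambdas
    l_sent = ((y - y_hat) ** 2).sum()                    # Eq. 16: MSE over sentences
    l_word = F.cross_entropy(e.transpose(1, 2),          # Eq. 17: CE over words
                             gold_labels, reduction="sum")
    # Eq. 18: the smallest ADE attention in a sentence should be 0,
    # the largest should match the gold sentence label y.
    l_attn = (a_ade.min(dim=1).values ** 2).sum() \
           + ((a_ade.max(dim=1).values - y) ** 2).sum()
    return lam_sent * l_sent + lam_word * l_word + lam_attn * l_attn  # Eq. 19
```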

4 Experimental Study

4.1 Data Set

The MADE 1.0 NLP challenge for detecting medication and ADE related information from EHRs (Jagannatha and Yu, 2016a) used 1089 de-identified EHR notes from 21 cancer patients (Training: 876 notes, Testing: 213 notes). The annotation statistics of the corpus are provided in (Jagannatha et al., 2019).

Named Entity Labels. The notes are annotated with several categories of medication information. Adverse Drug Event (ADE), Drugname, Indication and Other Sign Symptom and Diseases (OtherSSD) are specified as medical events that contribute to a change in a patient's medical status. Severity, Route, Frequency, Duration and Dosage, specified as attributes, describe important properties of the medical events. Severity denotes the severity of a disease or symptom. Route, Frequency, Duration and Dosage, as attributes of Drugname, label the medication method, the frequency of dosage, the duration of dosage, and the dosage quantity, respectively.

Sentence Labels. In the MADE 1.0 text, each word is manually annotated with ADE or medication related entity types. For words that belong to the ADE entity type, an additional relation annotation denotes whether the ADE entity is an adverse side effect of the prescription of the Drugname entity. Since the MADE 1.0 dataset does not have sentence-level annotations, we use the relation annotation together with the word annotation to assign each sentence a label of ADE or non-ADE, as sketched below. In this work, the relation labels are used only to assign the sentence labels; they are not used in the supervised learning process.
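A sketch of this label-assignment rule, under an assumed (not the official MADE 1.0) annotation format:

```python
def sentence_label(word_labels, adverse_relations):
    """word_labels: BIO tags for one sentence; adverse_relations: the set of
    annotated (ADE span, Drugname span) adverse-effect relations touching this
    sentence (format assumed). A sentence is labeled ADE iff it contains an ADE
    entity participating in such a relation."""
    has_ade_entity = any(tag in ("B-ADE", "I-ADE") for tag in word_labels)
    return int(has_ade_entity and len(adverse_relations) > 0)
```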

4.2 Hyper-parameter Settings

The model operates on tokenized sentences. Tokens were lower-cased, while the character-level component receives input with the original capitalization to learn the morphological features of each word. As input, we use the pre-trained, publicly available GloVe word embeddings of size 300 (Pennington et al., 2014). The learned character-level embeddings are 100-dimensional vectors. The LSTM hidden layers for the word-level and character-level LSTMs are of size 300 and 100, respectively. The combined hidden representation $h_t$ was set to size 200; the attention weight layer $e_t$ was set to size 100. The attention-weighted sentence representations $T_{SS}$ and $S_{SS}$ are 200-dimensional vectors, and therefore their combined context vector $C_S$ is 400-dimensional. The Entity Prediction Embedding (EPE) $L_S$ covers the $k$ entity labels in BIO format; hence the EPE is a 19-dimensional binary vector (eighteen entity tags plus the no-entity tag). The final concatenated sentence-level vector $S$ is thus of size 419. To avoid over-fitting, we apply a dropout strategy (Ma and Hovy, 2016; Srivastava et al., 2014) of 0.5 for our model. All models were trained with a learning rate of 0.001 using Adam (Kingma and Ba, 2014).

[Figure 4: Single vs. dual attention distribution. The color intensity corresponds to the weight given to each word; the attention weight of each word is given in parentheses. (a) Single Task-specific Attention, (b) Dual Task-specific Attention, (c) Single Supervised Self-attention, (d) Dual Supervised Self-attention, (e) Distribution of attention weights, (f) Sentence prediction scores. The single attention-based models (a) and (c) fail to place sufficient attention weight on the key semantic areas of the sentence. In the dual-attention based model, where the two attention distributions are combined, accurate weights are assigned, as in (b) and (d).]
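For reference, the settings above collected into a single configuration sketch (details the paper leaves unspecified, such as dropout placement, are omitted):

```python
# Hyper-parameters from Section 4.2, gathered for reference.
CONFIG = {
    "word_embedding": "GloVe, 300-d (pre-trained, publicly available)",
    "char_representation_dim": 100,    # learned character-level embedding
    "word_lstm_hidden": 300,
    "char_lstm_hidden": 100,
    "combined_hidden_dim": 200,        # h_t
    "attention_layer_dim": 100,        # e_t
    "sentence_dims": {"T_SS": 200, "S_SS": 200, "C_S": 400, "EPE": 19, "S": 419},
    "dropout": 0.5,
    "optimizer": "Adam",
    "learning_rate": 0.001,
}
```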

4.3 Results

4.3.1 ADE Assertive Sentence Classification

Table 1 compares our model against two baseline individual ADE sentence classification models: (i) LAST, similar to (Dernoncourt et al., 2017), a Bi-LSTM based sentence classification model that uses the last hidden states for sentence composition; and (ii) ATTN, similar to (Yang et al., 2016), a Bi-LSTM model that uses simple attention weights for sentence composition. Our full model, MGADE, improves the F1 score by 13.6% over the LAST baseline in testing. We also compare with a joint-task model based on self-attention, similar to (Zhang et al., 2018). MGADE outperforms their model by 23.0% for sentence classification.

Table 1: ADE sentence classification: F1 scores.

    Model                                F1
    Baseline Individual Models
      LAST (Dernoncourt et al., 2017)    0.66
      ATTN (Yang et al., 2016)           0.63
    Baseline Joint Model
      (Zhang et al., 2018)               0.61
    MGADE                                0.75

Table 2: ADE entity recognition: F1 scores.

    Model                                     F1
    Baseline Individual Models
      Bi-LSTM (Wunnava et al., 2019)          0.56
      Bi-LSTM + CRF (Wunnava et al., 2019)    0.63
    Baseline Joint Model
      (Zhang et al., 2018)                    0.51
    MGADE                                     0.63

4.3.2 ADE Named Entity Recognition

Table 2 compares our model against the best performing models on the MADE 1.0 benchmark in the literature (Wunnava et al., 2019) for ADE entity recognition. The entity recognition component of our MGADE is similar to their Bi-LSTM model. MGADE improves the F1 score by 12.5% over their Bi-LSTM only model. Our model achieved comparable results with their Bi-LSTM + CRF combination model. Models with a CRF layer predict the label sequence jointly instead of predicting each label individually, which is helpful for sequences where the label of each word depends on the label of the previous word. Adding a CRF component to our model might further improve the performance of the entity recognition task. We also compare with a joint-task model based on self-attention, similar to (Zhang et al., 2018). MGADE outperforms their model by 23.5% for entity recognition.

Table 3: Effect of the dual-attention layer. † denotes single-attention models, i.e., with the Task-specific attention removed (leaving only Supervised Self-attention) or vice versa.

                     ADE Entity Recognition    ADE Sentence Classification
    Model            P     R     F1            P     R     F1
    MGADE-SelfA †    0.58  0.52  0.55          0.84  0.55  0.67
    MGADE-TaskA †    0.62  0.50  0.55          0.82  0.64  0.72
    MGADE-DualA      0.68  0.55  0.61          0.87  0.65  0.74
    MGADE            0.70  0.57  0.63          0.86  0.67  0.75

4.3.3 Ablation Analysis

To evaluate the effect of each part of our model, we remove core sub-components and quantify the performance drop in F1 score.

Types of Attention. Table 3 studies the two types of attention we generate, supervised self-attention (β) and task-specific attention (α), for composing sentence-level representations; † denotes the models with single attention. As shown in the table, models that use only a single attention component, be it the supervised self-attention based ($S_{SS}$) or the task-specific attention based sentence representation ($T_{SS}$), achieve the same F1 score for the entity recognition task. However, their sentence classification performance varies, demonstrating that the two attentions capture different aspects of information in the sentence. The type of attention captured plays a critical role in composing an informative sentence representation. Both single-attention models performed better than the baseline individual sentence classification models LAST and ATTN (see Table 1). $T_{SS}$ achieved superior sentence classification performance over $S_{SS}$. Intuitively, a stronger focus should be placed on the words indicative of the sentence type, and $T_{SS}$, which places more emphasis on the parts relevant to the ADE sentence classification task, is more accurate in identifying ADE sentences.

Single Attention vs. Dual-Attention. Table 3 studies the impact of the dual-attention component. As seen, the model with the dual-attention sentence representation $C_S$, which combines the two attention-weighted sentence representations, outperforms the models with single attention (denoted by †) on both the entity recognition and sentence classification tasks.

Label-Awareness. Table 3 also studies the effect of adding the label-awareness component to improve the sentence representation. Our full model MGADE, with both the dual-attention and label-aware components, further improves the performance of the sentence classification and entity recognition tasks by 1.0% and 2.0%, respectively, compared to MGADE-DualA, the model with only the dual-attention component.

Case Study. Dual-attention is effective not only in capturing multiple aspects of semantic information in the sentence, but also in reducing the risk of capturing incorrect or insufficient attention when only one of the single attentions (either task-specific or supervised self-attention) is used. Fig. 4 shows such an example where single attention, either task-specific or supervised self-attention, fails to place sufficient attention weight on the key semantic areas of the sentence necessary to make a correct prediction on the sentence. The incorrect distribution of attention weights assigned by the single task-specific and single supervised self-attention models (Figures 4a and 4c) is addressed by the dual-attention mechanism. The latter corrects the distribution and assigns appropriate weights to the relevant semantic words, as in Figures 4b and 4d. In Figures 4e and 4f, we demonstrate the effectiveness of the dual-attention mechanism by plotting the attention weight distributions and the sentence prediction scores when a specific type of attention is composed into the sentence representation. The bar chart depicts the ADE sentence-level classification confidence scores w.r.t. the single-attention and dual-attention models and confirms the utility of dual-attention.

5 Conclusion

We propose a dual-attention network for multi-grained ADE detection to jointly identify ADE entities and ADE assertive sentences from medical narratives. Our model effectively supports knowledge sharing between the two levels of granularity, i.e., words and sentences, improving the overall quality of prediction on both tasks. Our solution features significant performance improvements over state-of-the-art models on both tasks. Our MGADE architecture is pluggable, in that other sequential learning models, including BERT (Devlin et al., 2019) or other models for sequence labelling and text classification, could be substituted in place of the Bi-LSTM sequential representation learning model. We leave this enhancement of our model and its study to future work.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473.

Franck Dernoncourt, Ji Young Lee, and Peter Szolovits. 2017. Neural networks for joint sentence classification in medical paper abstracts. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2017), Volume 2: Short Papers, pages 694–700. Association for Computational Linguistics.

Shantanu Dev, Shinan Zhang, Joseph Voyles, and Anand S Rao. 2017. Automated classification of adverse events in pharmacovigilance. In 2017 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pages 905–909. IEEE.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT 2019, Volume 1 (Long and Short Papers), pages 4171–4186. Association for Computational Linguistics.

Molla S. Donaldson, Janet M. Corrigan, Linda T. Kohn, et al., editors. 2000. To Err Is Human: Building a Safer Health System, volume 6. National Academies Press.

Harsha Gurulingappa, Juliane Fluck, Martin Hofmann-Apitius, and Luca Toldo. 2011. Identification of adverse drug event assertive sentences in medical case reports. In First International Workshop on Knowledge Discovery and Health Care Management (KD-HCM), ECML PKDD, pages 16–27.

Rave Harpaz, Alison Callahan, Suzanne Tamang, Yen Low, David Odgers, Sam Finlayson, Kenneth Jung, Paea LePendu, and Nigam H Shah. 2014. Text mining for adverse drug events: the promise, challenges, and state of the art. Drug Safety, 37(10):777–790.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9:1735–1780.

Trung Huynh, Yulan He, Alistair Willis, and Stefan Rüger. 2016. Adverse drug reaction classification with deep neural networks. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 877–887.

Abhyuday Jagannatha, Feifan Liu, Weisong Liu, and Hong Yu. 2019. Overview of the first natural language processing challenge for extracting medication, indication, and adverse drug events from electronic health record notes (MADE 1.0). Drug Safety, 42(1):99–111.

Abhyuday N Jagannatha and Hong Yu. 2016a. Bidirectional RNN for medical event detection in electronic health records. In Proceedings of the Conference of the North American Chapter of the ACL, volume 2016, page 473. NIH Public Access.

Abhyuday N. Jagannatha and Hong Yu. 2016b. Structured prediction models for RNN based sequence labeling in clinical text. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, volume 2016, page 856.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Feifan Liu, Abhyuday Jagannatha, and Hong Yu. 2019. Towards drug safety surveillance and pharmacovigilance: current progress in detecting medication and adverse drug events from electronic health records.

Lemao Liu, Masao Utiyama, Andrew M. Finch, and Eiichiro Sumita. 2016. Neural machine translation with supervised attention. In COLING 2016, 26th International Conference on Computational Linguistics: Technical Papers, pages 3093–3102. ACL.

Minh-Thang Luong, Hieu Pham, and Christopher D Manning. 2015. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025.

Xuezhe Ma and Eduard H. Hovy. 2016. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016), Volume 1: Long Papers.

Chandra Pandey, Zina M. Ibrahim, Honghan Wu, Ehtesham Iqbal, and Richard J. B. Dobson. 2017. Improving RNN with attention and embedding for adverse drug reactions. In Proceedings of the 2017 International Conference on Digital Health, pages 67–71.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Marek Rei and Anders Søgaard. 2019. Jointly learning to label sentences and tokens. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6916–6923.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15:1929–1958.

Ahmad P Tafti, Jonathan Badger, Eric LaRose, Ehsan Shirzadi, Andrea Mahnke, John Mayer, Zhan Ye, David Page, and Peggy Peissig. 2017. Adverse drug event discovery using biomedical literature: a big data neural network adventure. JMIR Medical Informatics, 5(4):e51.

Peng Wang, Lingqiao Liu, Chunhua Shen, Zi Huang, Anton van den Hengel, and Heng Tao Shen. 2017. Multi-attention network for one shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2721–2729.

Susmitha Wunnava, Xiao Qin, Tabassum Kakar, Cansu Sen, Elke A Rundensteiner, and Xiangnan Kong. 2019. Adverse drug event detection from electronic health records using hierarchical recurrent neural networks with dual-level embedding. Drug Safety, 42(1):113–122.

Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification. In Proceedings of NAACL-HLT 2016, pages 1480–1489.

Shinan Zhang, Shantanu Dev, Joseph Voyles, and Anand S Rao. 2018. Attention-based multi-task learning in pharmacovigilance. In 2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pages 2324–2328. IEEE.