
IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 7, NO. 2, APRIL-JUNE 2010 1

Extracting Protein-Interactions from Text with the Unified AkaneRE Event Extraction System

Rune Sætre, Kazuhiro Yoshida, Makoto Miwa, Takuya Matsuzaki, Yoshinobu Kano, and Jun’ichi Tsujii

Abstract—Currently, relation extraction (RE) and event extraction (EE) are the two main streams of biological information extraction. In 2009, the majority of these RE and EE research efforts were centered around the BioCreative II.5 Protein-Protein Interaction (PPI) challenge and the “BioNLP event extraction shared task”. Although these challenges took somewhat different approaches, they share the same ultimate goal of extracting bio-knowledge from literature. This paper compares the two challenge task definitions, and presents a unified system that was successfully applied in both these and several other PPI extraction task settings. The AkaneRE system has three parts: a core engine for RE, a pool of modules for specific solutions, and a configuration language to adapt the system to different tasks. The core engine is based on machine learning, using either Support Vector Machines or Statistical Classifiers and features extracted from given training data. The specific modules solve tasks like sentence boundary detection, tokenization, stemming, part-of-speech tagging, parsing, named entity recognition, generation of potential relations, generation of machine learning features for each relation, and finally assignment of confidence scores and ranking of candidate relations. With these components, the AkaneRE system produces state-of-the-art results, and the system is freely available for academic purposes at http://www-tsujii.is.s.u-tokyo.ac.jp/satre/akane/.

Index Terms—Text Mining, Machine Learning, Language Parsing and Understanding, Bioinformatics (genome or protein) databases.

1 INTRODUCTION

More than 2000 new articles about molecular biology and medicine are added every day to the Medline Database, so there is a significant need for automation to help deal with the flow of new information. To address this issue, considerable research efforts have been dedicated to Natural Language Processing of Biological Texts (BioNLP) in the last decade. One important goal is to automatically and reliably extract interesting facts from publications. Earlier systems targeted primarily abstracts, but recently open-access full text journal articles are also targeted. The major factors driving progress in the field have been shared task challenges, shared data formats, publicly available systems and shared corpora.

This paper examines the differences and similarities between different Protein-Protein Interaction (PPI) approaches. Around the beginning of 2009, both the “BioNLP Event Extraction” (BioNLP-EE) and the BioCreative (BC) II.5 PPI extraction shared task challenges were arranged. The goal of both tasks was to extract information about protein-interactions from biomedical text, but the challenges took quite different approaches to this. We will compare the “BC Interaction Pair Task” (BC-IPT) and task 1 in the BioNLP-EE shared task, and see how they relate to other Relation Extraction (RE) tasks.

• The authors are with the Department of Computer Science, University of Tokyo, Hongo 7-3-1, Bunkyo-ku, Tokyo 113-0033, Japan. E-mail: {rune.saetre,kyoshida,mmiwa,matuzaki,kano,tsujii}@is.s.u-tokyo.ac.jp

• J. Tsujii is also affiliated with the National Center for Text Mining, University of Manchester, 131 Princess Street, Manchester, M1 7DN, UK.

The goal of this paper is twofold: first, to explain from an engineering point of view how to use the publicly available open source system called AkaneRE, and second, to present the underlying methods from a scientific point of view. The Akane system was initially trained and evaluated on the AIMed corpus [1] using proper 10-fold cross-validation [2]. It can now handle all the formats used in recent challenges and corpora. AkaneRE has been evaluated on all five REMerge PPI corpora [3] and in the BioNLP-EE shared task challenge [4]. The Akane system achieved competitive results in all of these settings. It has also been evaluated in several independent external studies [5], [6], [7], [8]. The PPI part of AkaneRE was evaluated in both the BC II [9] and II.5 shared task challenges (this paper). It has been pointed out that there is “an urgent need to make the best tools publicly available” [5], so the AkaneRE (including PPI) system is available as an open source package¹ for BioNLP experts. The PPI part is also available as a simple-to-use component in the U-Compare system².

Our system can be used in many other RE tasks. The Akane system is based on a flexible XML configuration language that allows users to specify the precise task settings. By using this simple configuration language, AkaneRE can be applied to any relation extraction task. The only requirement is that the related entities can be recognized within a single sentence.

We start by exploring the state-of-the-art in BioNLP and RE in Section 2, before explaining the purpose and the originality of our approach in Section 3. Section 4 explains the differences between the BC-IPT and the BioNLP-EE tasks, and Section 5 explains how AkaneRE could be successfully applied to the BC-IPT tasks, with only small additions from the BioNLP-EE task. The system performed well in both these challenges, by combining the best available BioNLP modules (tokenizer, parser, named entity recognizer) into one system. Features are extracted from each module in the pipeline, so that the machine learning classifier at the end can produce the competitive results presented in Section 6. Finally, Section 7 discusses future improvements, while Section 8 concludes this paper.

1. http://www-tsujii.is.s.u-tokyo.ac.jp/satre/akane/
2. http://u-compare.org/

2 RELATED WORKS

In this section, we briefly discuss recent developments in major areas in BioNLP and related NLP research.

A wealth of information extraction methods have been introduced in recent studies. The National Institute of Standards and Technology (NIST) has arranged several evaluation conferences since 1987. The corpora produced for the conferences in Table 1 are not directly comparable to those of BioNLP-EE, BC-IPT or REMerge, but the nature of these tasks is very similar. With minimal adaptation, these corpora could be processed by the AkaneRE system. For a description of the TREC genomics track corpora, see [10].

Methods targeting Protein-Protein Interaction (PPI) in particular include AkanePPI [2], NAViGaTOR [6], RelEx [11], PIE [12], AliBaba [13], OpenDMAP [14], and Chowdhary [15]. These methods feature varying levels of accessibility to potential users, ranging from simply a published method description of RelEx to AliBaba’s online graphical interface. The use of NLP tools and domain knowledge also varies, from combining the analyses from several parsers in AkanePPI to “search[ing] for simple co-occurrences in the same sentence” for many relations in AliBaba³.

Shared task challenges have played a major role in the development of BioNLP by focusing the efforts of the community on certain problems, providing training datasets and broadly comparable evaluations of different methods, and establishing both the state-of-the-art and the problems with current technology. Domain challenges have advanced from early basic tasks, such as information retrieval and named entity detection, to progressively more detailed resolution of entities and higher information extraction targets such as relations and events.

In 2009, two challenges were arranged to evaluate the state-of-the-art in PPI and event extraction from free text, namely the BioCreative (BC) and the BioNLP shared tasks. The aim in BC was to automatically recognize and normalize interacting gene and protein names in biomedical journal papers and to further extract the interactions between these entities [16], [17]. The BioNLP shared task took a more language-oriented approach aimed at the recognition of all specific statements of event types (binding, transcription, regulation, etc. of genes or proteins) as well as relations between such events [18].

3. http://alibaba.informatik.hu-berlin.de/methods.html

The two recent shared tasks posed quite different challenges to the participants (detailed in Section 4), which seemed to require largely separate approaches to address them. This could split the relatively small BioNLP community into two separate groups (“BC-style” and “BioNLP-style”) despite having similar goals. A single system that can perform competitively in both situations is therefore particularly valuable in demonstrating how to unify the two perspectives without “doubling the work.”

Shared corpus resources are vital to meaningful evaluation and comparison of methods, and a substantial number of annotated corpora are available for various tasks in BioNLP. The five most widely used PPI corpora are AIMed [1], BioInfer [19], HPRD50 [11], IEPA [20] and LLL [21]. They have been converted into a unified representation in the REMerge project by abstracting away some of the original differences [22]. This has made approaches such as the one in [3] and the one described in this paper more feasible. In addition to these pure PPI corpora, the GENIA event corpus was recently released in connection with the BioNLP event extraction shared task [18], [23]. The event corpus targets information closely related to the PPI corpora despite employing a richer event representation.

In addition to standardizing the format of the PPI corpora, there are also ongoing initiatives to simplify the integration of heterogeneous systems into simple-to-use packages through the use of workflow systems. General Architecture for Text Engineering (GATE⁴) and Taverna⁵ are two examples of such architectures.

3 THE AKANE SYSTEM

The Akane system was designed to provide a PPI extraction tool to BioNLP researchers. It also aims at making the tool available to Biologists and other non-NLP experts through easy accessibility provided by U-Compare⁶. The AkanePPI system [2] has been used in several shared task challenges and experiments [9], [3], [4], [25], [26], [27], [28]. The name changed from AkanePPI to AkaneRE during the BioNLP shared task challenge, since the system can now do more advanced RE in addition to simple PPI extraction.

The overall architecture of the Akane system is a (semi-parallel) pipeline, going through the steps of sentence boundary detection, tokenization, stemming, part-of-speech tagging, parsing, named entity recognition, generation of potential relations, generation of features for each relation, and finally assignment of confidence scores and ranking of candidate relations according to whether they are truly events/interactions or not. This is illustrated in Figure 1.

4. http://sourceforge.net/projects/gate/
5. http://taverna.sourceforge.net/
6. http://u-compare.org/

| Name | Years | URL |
|---|---|---|
| MUC: Message Understanding Conference | 1987-1997 | http://www-nlpir.nist.gov/related_projects/muc/ |
| ACE: Automatic Content Extraction | 1999-present | http://www.itl.nist.gov/iad/mig/tests/ace/ |
| TREC: Text REtrieval Conference | 1992-present | http://trec.nist.gov/ |
| - Question Answering Track | 1999-2007 | http://trec.nist.gov/data/qamain.html |
| - Genomics Track | 2002-2007 | http://ir.ohsu.edu/genomics/ |

TABLE 1. Conferences with corpora that could be used by the AkaneRE system

[Figure 1. Processing units shown: Input Text, Sentence Boundary Detector, Tokenizer, Part-of-Speech (POS) tagger, Stemmer/Lemmatizer, Named Entity Recognizer (NER), Syntactic / Semantic Parser, Relation Extractor, Interacting Proteins. The sample sentence “p53 is an activator of Dr. Perlak protein, named after him.” yields the candidate interaction p53 -> Dr. Perlak.]

Fig. 1. A general event/interaction prediction pipeline. On the left, a sample input sentence is transformed step by step according to the processing units in the pipeline on the right.
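The pipeline idea can be illustrated with a toy sketch in which each stage adds one annotation layer to a shared document record. This is not the AkaneRE implementation: the stage functions, the `LEXICON` set and the placeholder protein names are hypothetical, the sentence splitter is deliberately naive, and dictionary lookup stands in for a real named entity recognizer. The last stage shows why candidates are limited to entities that co-occur in one sentence.

```python
import re
from itertools import combinations

# Hypothetical entity lexicon standing in for a real NER module.
LEXICON = {"ProtA", "ProtB", "ProtC"}

def split_sentences(doc):
    # Naive splitter: break after sentence-final punctuation followed by whitespace.
    doc["sentences"] = re.split(r"(?<=[.!?])\s+", doc["text"].strip())
    return doc

def tokenize(doc):
    # Words and single punctuation marks become separate tokens.
    doc["tokens"] = [re.findall(r"\w+|[^\w\s]", s) for s in doc["sentences"]]
    return doc

def tag_entities(doc):
    # Dictionary lookup per sentence.
    doc["entities"] = [[t for t in toks if t in LEXICON] for toks in doc["tokens"]]
    return doc

def generate_candidates(doc):
    # Candidate relations are pairs of entities found in the same sentence.
    doc["candidates"] = [
        pair for ents in doc["entities"] for pair in combinations(ents, 2)
    ]
    return doc

PIPELINE = [split_sentences, tokenize, tag_entities, generate_candidates]

def run(text):
    doc = {"text": text}
    for stage in PIPELINE:
        doc = stage(doc)
    return doc

result = run("ProtA binds ProtB. ProtC is mentioned alone.")
print(result["candidates"])
```

In a full system, the candidate list would then be passed to feature extraction and to the classifier that assigns confidence scores.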

A new XML specification language is used to configure the Akane system. The configuration file describes how to evaluate specific relation or event extraction tasks. With small changes in the configuration, several different tasks can be tackled by the AkaneRE system. The configuration language is inspired by the specifications for UIMA component descriptors [29].

The configuration file consists of eight main parts, each representing important design decisions that must be made when implementing and evaluating a machine learning relation extraction system:

Akane: settings related to which mode the system should run in (e.g. learning, prediction or cross-validation).
Text-file: location of the original plain text file(s) that all annotations refer to.
Gold-file: information about the manual PPI or event annotations needed for learning, evaluation or cross-validation.
NER-file (optional): information provided by a Named Entity Recognition (NER) system, to be evaluated as a part of the event extraction pipeline.
Features: list of features to extract from the parsers’ output, and where to look for the parser output files.
SVM: information for the machine learning system (e.g. the Support Vector Machine).
Models: location of generated model files, for writing and/or reading.
Output: location and type of the output files to generate.

A sample configuration file with more detailed explanations can be found on the Akane system web page⁷, and also in this paper’s Supplemental Material.

When comparing different extraction tasks, most differences can be found in the “Gold-file” entry. This specifies the format of the training data, and effectively what kind of predictions the system learns to handle.

The “SVM” entry can be used to tune the Machine Learning system’s parameters (like C-value or threshold).
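To make the eight-part layout concrete, here is a small sketch that checks whether a configuration document contains the seven mandatory parts. The XML below is purely illustrative: the root element name, the attribute names and the values are assumptions, and only the part names are taken from the list above. The actual schema is the one documented in the sample configuration file on the Akane web page.

```python
import xml.etree.ElementTree as ET

# Part names from the text; NER-file is described as optional.
REQUIRED = ["Akane", "Text-file", "Gold-file", "Features", "SVM", "Models", "Output"]

# Hypothetical configuration; element layout and attributes are NOT the real schema.
SAMPLE = """
<config>
  <Akane mode="cross-validation"/>
  <Text-file path="corpus/text"/>
  <Gold-file path="corpus/gold"/>
  <Features path="corpus/parses"/>
  <SVM C="1.0" threshold="0.0"/>
  <Models path="models"/>
  <Output path="out"/>
</config>
"""

def missing_sections(xml_string):
    """Return the mandatory part names that are absent from the configuration."""
    root = ET.fromstring(xml_string.strip())
    present = {child.tag for child in root}
    return [name for name in REQUIRED if name not in present]

print(missing_sections(SAMPLE))
```

A check of this kind is useful before a long training run, since a missing "Gold-file" or "Models" entry would otherwise only surface after parsing has finished.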

3.1 Relation Representation and Data Format

The new and simple idea in AkaneRE, compared to other PPI systems, is that any n-ary relation/event (e.g. binding of more than two proteins into a complex) can be reduced to a set of binary relations (pairs). This makes it possible to apply a single system to seemingly different tasks, like BioNLP-EE and BC. There are several ways a complex n-ary event can be reduced to pairs, and we have implemented three of them, namely the “matrix”, “spokes” and “chain” decompositions. As an example, take the phrase:

“DNA-binding complexes that contain p50, p52, p65, and cRel”⁸.

This phrase describes the event of four proteins being bound in a single protein-complex. Table 2 shows different ways a single protein complex binding event (between proteins p50, p52, p65, and cRel) can be turned into binary pairs. The matrix contains all six possible binary relations between the entities, while the spokes model contains only three pairs: the main (“bait”) entity combined with the three other (“prey”) entities. Finally, the chain decomposition also makes three pairs, but in a linear fashion.

Decompose p50-p52-p55-cRel:

| Matrix | Spokes | Chain |
|---|---|---|
| p50 - p52 | p50 - cRel | p50 - p52 |
| p50 - p55 | p52 - cRel | p52 - p55 |
| p50 - cRel | p55 - cRel | p55 - cRel |
| p52 - p55 | | |
| p52 - cRel | | |
| p55 - cRel | | |

TABLE 2. Three different ways to decompose a four-protein complex into binary pairs

7. http://www-tsujii.is.s.u-tokyo.ac.jp/satre/akane/
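The three decompositions can be written down in a few lines. The sketch below is an illustration of the definitions given in the text, not code from the AkaneRE distribution; the function names are hypothetical, and the choice of cRel as the bait is an assumption made for the example.

```python
from itertools import combinations

def matrix_pairs(entities):
    """All unordered pairs: n*(n-1)/2 pairs for n entities."""
    return list(combinations(entities, 2))

def spokes_pairs(bait, preys):
    """The bait entity paired with every prey entity: n-1 pairs."""
    return [(bait, prey) for prey in preys]

def chain_pairs(entities):
    """Each entity paired with its successor in the list: n-1 pairs."""
    return list(zip(entities, entities[1:]))

# The four proteins named in the quoted phrase.
complex_members = ["p50", "p52", "p65", "cRel"]

matrix = matrix_pairs(complex_members)
spokes = spokes_pairs("cRel", ["p50", "p52", "p65"])
chain = chain_pairs(complex_members)

print(len(matrix), len(spokes), len(chain))
```

The matrix grows quadratically with the number of participants, while spokes and chain grow linearly, which matters when complexes with many members are decomposed.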

The Akane system can handle any event type as long as it can be decomposed into binary relations between all the involved entities. Earlier versions of the system could only handle simple pairs, but now complex, recursive structures, with multiple involved entities and events, are also handled. When the current version of the system is used with only simple binary pairs (like in BioCreative), it is sometimes desirable to exclude the possibility of predicting relations with fewer or more than two protein participants, by specifying MIN and MAX values in the configuration file. Without these values, any combination of Event, Themes and Causes that is present in the training data can be extracted by the system.

Different tasks usually require different data formats to represent entities, relations and events. For example, each entity or event can be connected to one or more specific words in the text, or they can be loosely associated with the text without specifying the exact location. The input text can also be of different types, ranging from single sentences, via short abstracts, to complete journal papers. Also, the text can be stored in one large file, or in several smaller files.

In practice, some data conversion must always take place before our system can be applied. The training data is usually distributed with one XML or a pair of stand-off files for each article/abstract to be evaluated. We try to make this conversion as simple as possible, by providing a tool called the Stand-Off Manager (SOM) [30]. We also reduce the need for data-conversion by allowing different input styles to be specified in the configuration file.

8. http://mcb.asm.org/cgi/content/abstract/14/5/2926

Many new Named Entity Recognizers (NERs) can directly produce stand-off output, but most existing NERs (like ABNER or BANNER) still provide their annotations in the BIO format⁹ or as inline XML tags (i.e., mixed with the input text). The SOM utility tool can be used to transform such inline XML into the format used by the Akane System. SOM can generate U-Compare readable text and tag files straight from XML files, and the other way around. When XML-tagged text like an AIMed corpus file or ABNER output is provided, a pair of stand-off output files is produced: one with just plain text (to be parsed) and one with the stand-off annotations (proteins, interactions, etc.).
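The inline-to-stand-off conversion described above can be sketched as a tree walk that accumulates plain text while recording character offsets for every tagged span. This is only an illustration of the principle and not the SOM tool: the tag names in the example and the `(start, end, tag)` tuple format are assumptions, and SOM's actual file formats are not reproduced here.

```python
import xml.etree.ElementTree as ET

def inline_to_standoff(xml_string):
    """Split inline-tagged XML into plain text plus (start, end, tag) annotations."""
    root = ET.fromstring(xml_string)
    text_parts = []
    annotations = []
    offset = 0

    def emit(chunk):
        # Append a piece of character data and advance the running offset.
        nonlocal offset
        if chunk:
            text_parts.append(chunk)
            offset += len(chunk)

    def walk(elem, is_root=False):
        start = offset
        emit(elem.text)
        for child in elem:
            walk(child)
            emit(child.tail)  # text after the child's closing tag belongs to the parent
        if not is_root:
            annotations.append((start, offset, elem.tag))

    walk(root, is_root=True)
    return "".join(text_parts), sorted(annotations)

# Hypothetical inline markup; the tag name "prot" is an assumption.
inline = "<sentence><prot>p53</prot> is an activator of <prot>Dr. Perlak</prot> protein.</sentence>"
plain_text, spans = inline_to_standoff(inline)
for start, end, tag in spans:
    print(tag, start, end, plain_text[start:end])
```

Because the offsets refer to the untagged text, the plain text can be sent to a parser unchanged while the annotations are kept in a separate file.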

In many RE task settings, the evaluation is done on individual text-bound sentence-based events. However, in the BC-IPT task, the evaluation is done on the article level, regardless of how many times each relation is actually mentioned in the text. In this case, the configuration file offers an alternative way of inputting the training data.
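The difference between sentence-level and article-level evaluation can be illustrated by collapsing repeated mentions into unique, unordered pairs per article. The sketch below is hypothetical: the input tuple layout and the identifiers are invented for the example, and keeping the maximum score per pair is an assumption, not a description of how AkaneRE aggregates its scores.

```python
def article_level_pairs(sentence_predictions):
    """Collapse sentence-level predictions into unique unordered pairs per article.

    sentence_predictions: iterable of (article_id, protein_a, protein_b, score).
    Keeping the maximum score per pair is an assumption made for this sketch.
    """
    best = {}
    for article_id, a, b, score in sentence_predictions:
        key = (article_id, tuple(sorted((a, b))))  # (a, b) and (b, a) coincide
        if key not in best or score > best[key]:
            best[key] = score
    # Rank the unique pairs by descending confidence.
    return sorted(best.items(), key=lambda item: -item[1])

# Hypothetical sentence-level output for one article.
predictions = [
    ("article-1", "P1", "P2", 0.55),
    ("article-1", "P2", "P1", 0.80),
    ("article-1", "P1", "P3", 0.30),
]
ranked = article_level_pairs(predictions)
print(ranked)
```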

3.2 Relation templates

The major contribution from the AkaneRE system is the automatic construction of relation templates, based only on the training data and a simple configuration file.

Even for a single event type, e.g. binding, there may be several different types of relations to extract. For example, there may be two, three or more entities in the binding event, and there may or may not be a given causal event or protein acting as a trigger. These possibilities/restrictions are often explicitly given in the task specification, but we can also assume that the training data contains at least one example of each possible relation type. So during the training phase, the training data can be used to determine which events to extract by automatically constructing event templates. In that case, the only input to the template module is a list of possible Named Entity (NE) types and the Roles they can play in a relation.
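One way to picture automatic template construction is to record, for every training event, its type together with the multiset of (role, NE type) slots it fills, and then accept only candidates whose signature was observed. This is a sketch of the idea under that assumption; the function names and the data layout are hypothetical and do not reflect AkaneRE's internal representation.

```python
from collections import Counter

def signature(event_type, args):
    """Order-independent signature: event type plus counted (role, NE type) slots."""
    slots = Counter(args)
    return (event_type, tuple(sorted(slots.items())))

def build_templates(training_events):
    """Collect one template per distinct signature seen in the training data."""
    return {signature(event_type, args) for event_type, args in training_events}

def is_licensed(templates, event_type, args):
    """A candidate is allowed only if its signature occurred during training."""
    return signature(event_type, args) in templates

# Hypothetical training events: (event type, list of (role, NE type) arguments).
training_events = [
    ("Binding", [("Theme", "Protein"), ("Theme", "Protein")]),
    ("Binding", [("Theme", "Protein")]),
    ("Binding", [("Theme", "Protein"), ("Theme", "Protein")]),
    ("Phosphorylation", [("Theme", "Protein")]),
]
templates = build_templates(training_events)
print(len(templates))
```

With these examples, a three-Theme Binding candidate would be rejected because no such event appears in the training data.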

In the original AkanePPI (trained on the AIMed corpus, which is very similar to the BC PPI task), the only NE type was Protein, and the only role was Theme (p1 and p2). All the events were pair-wise interactions (PPI), so there was no explicit event role. In other tasks, such as the BioNLP-EE task, events must be represented by a specific term in the text, just like the protein names are also text-bound (see Table 3). This means that a specific word or phrase is representing each event or entity. Taking advantage of this, we can use the “spokes” model and include the event-term itself as the main “entity” in its relation-template. The Event role is then more important than the Theme and Cause roles since it provides the type information for the whole relation. In the BioNLP-EE and other event-based tasks, the event slot must always be filled by an event “entity” (trigger word), and never by a Protein entity. Exclusion of Protein terms from this slot is necessary since they do not carry the semantics to represent special recursive event entity types, like “binding” or “phosphorylation”.

9. BIO: Each word is either the Beginning, Inside or Outside a Named Entity mention

By treating text-bound event trigger-words as entities we can create binary relations between the main trigger-word (representing the main event) and all the other involved proteins and trigger-words (events). These relations are then used to extract the (binary) features needed for the machine learner to classify event candidates as true (likely) or false (unlikely) events.

The only change needed in the configuration language to change from binary PPI prediction to complex events is to add the candidate trigger word classes to the protein class in the list of entities. For example, the nine event types (“Binding, Gene expression, Localization, Negative regulation, Phosphorylation, Positive regulation, Protein, Protein catabolism, Regulation and Transcription”) can be thought of as (event) entities. A separate module has to provide the text-bound candidate trigger words for each of these event mentions. This can be done by any re-trainable Named Entity Recognition (NER) system, as described in Section 5.2.

As an example of how events are extracted, we use a sentence from the GENIA event-corpus¹⁰ which was used in the BioNLP shared task. This sentence is taken from a PubMed abstract¹¹ with ID 7706710:

“At 10 microM, both compounds inhibited IL-2 mRNA and protein levels in the NFAT-1-linked lac-Z transfectants, and in human lymphocytes.”

In this sentence, “both compounds” are recognized as the Cause of several Negative regulation (inhibition) events. It is left to a coreference system to discover that “both compounds” actually represents TWO entities, and then TWO (different) events with the same type will be produced. It is also possible to make only one complex event instead of two simpler events. Then the real Causes can be enumerated as Cause1 and Cause2, and the breakdown to simple binary pairs will be handled inside the AkaneRE system instead. First, a new template with two causes will be created when the relation is encountered during the training phase. Later, each cause will be linked with the main event trigger-word during the feature extraction phase. Currently, only one or two neighboring sentences are being considered when extracting the grammatical path between the “entities”, e.g. between CauseX (both compounds) and Event0 (inhibition). So, if the real Causes are more than one sentence away, a small generalization is needed in AkaneRE, for example by extracting the path via the head-words in all the intervening sentences.

10. http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/
11. http://www.ncbi.nlm.nih.gov/pubmed/7706710

| Category | Feature | BioNLP | BioCreative |
|---|---|---|---|
| representation | type | event | relation |
| representation | arity | n-ary | binary |
| representation | directed | yes | no |
| representation | E/R types | 9 | 1 |
| representation | role types | 2 | 1 |
| training data | size | 800 abstracts | 740 articles |
| training data | annotation | text-bound | article-level |
| input | text | abstracts | articles |
| input | entities | given | not given |
| output | entities | (as given) | normalized |
| output | E/R scope | all events | unique pairs |
| output | E/R polarity | any | positive |
| output | E/R certainty | any | affirmative |
| output | E/R evidence | no | yes |

TABLE 3. Main features of the BioNLP Shared Task on Event Extraction (BioNLP-EE) and the BioCreative II.5 Interaction Pair Task (BC-IPT). Only the primary “PPI” tasks (Task 1 and IPT) are considered. E/R abbreviates “event/relation”, referring to the primary representation of extracted information.

Once the trigger-words (e.g. inhibition) are recognized, they can be treated just like normal entity-names by the relation extraction system. The only extra challenge is when deciding a threshold for which results to output. For example, in the phrase “Regulation of p50 Gene Transcription” there are two events:

(1) Transcription, THEME1: Protein[p50]
(2) Regulation, THEME1: Event[(1)]

When the theme or cause is another event (like in 2), the output function must recursively first check whether all such “sub-events” are above the confidence threshold. If the theme event (1) is below the output threshold, then the mother event (2) should not be output either, even though that binary relation itself (Regulation-Transcription) may have a high confidence value.
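The recursive output check can be sketched as follows, using the two events from the “Regulation of p50 Gene Transcription” example. The data classes, the confidence values and the threshold are hypothetical illustrations; treating a score equal to the threshold as acceptable is also an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Union

@dataclass
class Protein:
    name: str

@dataclass
class Event:
    event_type: str
    confidence: float
    args: List[Union["Event", Protein]] = field(default_factory=list)

def should_output(event, threshold):
    """An event is output only if it and all nested sub-events reach the threshold."""
    if event.confidence < threshold:
        return False
    return all(
        should_output(arg, threshold)
        for arg in event.args
        if isinstance(arg, Event)  # proteins carry no confidence of their own here
    )

# Event (1) scores low, so event (2) is suppressed despite its own high score.
transcription = Event("Transcription", 0.40, [Protein("p50")])
regulation = Event("Regulation", 0.90, [transcription])
print(should_output(regulation, threshold=0.5))

# With a confident sub-event, the mother event passes as well.
confident_transcription = Event("Transcription", 0.70, [Protein("p50")])
confident_regulation = Event("Regulation", 0.90, [confident_transcription])
print(should_output(confident_regulation, threshold=0.5))
```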

4 BIOCREATIVE (II.5) AND BIONLP (2009) SHARED TASKS

In this section, we discuss in detail the two highly relevant information extraction problems mentioned in Table 3: BC-IPT and BioNLP-EE. They represent two quite different views of domain information extraction goals. Most current systems and corpora focus on binary Relation Extraction, like in the BC-IPT. We will compare the primary tasks in the two challenges, highlighting the aspects that make a unified approach particularly challenging.

Of the many differences between the BioNLP-EE and BC-IPT tasks, the representation of the extracted information is perhaps the most fundamental. The BC-IPT task follows the approach taken in the majority of domain IE efforts, including the previous LLL shared task [21], in adopting a simple relation representation. In the BC-IPT task, the goal is to extract un-typed, unordered pairs of interacting proteins. By contrast, the BioNLP-EE task aims to move toward a more expressive representation of extracted information. Nine typed event structures were used instead of the simple Protein-Protein relation representation. Each event structure associates a biological event with related proteins or other events in specific roles, such as being the Theme or Cause of the event. While the BC-IPT task can be approached as a classification task on protein pairs, the BioNLP-EE task requires the output of the system to be more structured.

A second significant difference is the type of text that is used, and the way relations or events are identified in the text. The BioNLP-EE task, like most medical domain IE work, uses PubMed abstracts as its text source. The BC-IPT task is more challenging because it uses full-text articles containing much more information than just abstracts. In the BioNLP-EE task, events are associated with specific expressions in the text (“text-bound annotation”), while the BC-IPT relations are linked to full articles, without any attempt at identifying the exact phrase, sentence or paragraph where they are mentioned. These choices carry very different implications for how the provided data can be applied in training a machine learning system. For example, the generation of negative (text-bound, sentence-based) training examples is not straightforward in the BC-IPT. Sometimes even the positive examples cannot be directly identified in the text, as in this example¹² from the BC challenge: “Gsg1 (uniprotkb:Q8R1W2), TPAP (uniprotkb:Q9WVP6) and Calmegin (uniprotkb:P52194) colocalize (MI:0403) by cosedimentation (MI:0027).” The only evidence of this interaction is shown as an alignment in Figure 5B in that paper, and it is not stated explicitly anywhere in the text. Another case which cannot be handled by text-bound RE is when the information is given only in a Table, without any further mentions in the text. For example, one abstract¹³ from the BC challenge lists 27 proteins in its Table 1, and says in the text that “Their relationship with YidC is not immediately obvious...”. Since only 4 of the 27 protein names are mentioned explicitly in the text, a text-bound approach misses at least 23 of these biologically meaningful YidC-interactions. Such differences create further challenges that must be addressed to use a single approach for both tasks.

The tasks also pose very different requirements relating to the entities (proteins) associated by the target representations. The BioNLP-EE task removes the requirement to detect named entities by providing gold entities as input and requiring (in the primary task) no other entities than those provided. By contrast, the BC-IPT task requires the resolution of the relatively demanding subtask of recognizing named entities and normalizing them to their database references.

12. doi:10.1016/j.febslet.2008.01.065
13. doi:10.1016/j.febslet.2008.02.082

The approaches taken to the polarity, modality and certainty of the expressions of relations / events and the evidence provided are also different. In the primary BioNLP-EE task, all expressed events are annotated without regard for these issues. In the BC-IPT task, only affirmatively stated relations, with supporting evidence somewhere in the paper, are considered. Evidence means that biological experimental evidence must be provided somewhere in the text.

Since only proteins are given in the BioNLP-EE test data, a Named Entity Recognition (NER) system is still needed to find all the trigger words, so that the Akane system can extract relations (events) between them and the corresponding proteins.

As shown in Table 3, there are nine event types for the BioNLP-EE task, while the BC-IPT does not discern different relations. In fact, several different interaction types are present in the BC-IPT task as well, but they are all represented in the same way. The possible types of interactions are specified in the Molecular Interaction ontology¹⁴ [31], and include all physical associations, some of which are more specifically classified as direct interactions. Based on this definition of the BC-IPT events, a relation could contain more than two interacting entities. The requirement that only pairs will be evaluated was imposed by the organizers to simplify the evaluation of the different participating systems. This is also how PPI extraction tasks have traditionally been evaluated.

5 OUR BC II.5 SYSTEM

Two improvements were added to the AkaneRE system for the BC challenge: first, the ability to deal with “article-level relations” (while keeping links to their “text-bound sentence-level evidence relations”), and second, the possibility of ranking predicted interacting entities and relations to optimize the performance.

5.1 The BCMS interface and U-Compare Workflows

Figure 2 shows the overall architecture of the BC shared task. The BioCreative Meta-Server (BCMS) sends an input article to our system. The pre-processing unit extracts the plain text and applies some simple tokenization rules. The plain text is then passed to one of the five U-Compare workflows built for the BC task (Figure 3). Each workflow uses a slightly different configuration file to evaluate how the combination of different pre-processing steps influences the overall interaction detection performance.

Our system uses the Unstructured Information Management Architecture (UIMA¹⁵) through the U-Compare system (u-compare.org) which provides a large number of interoperable Natural Language Processing (NLP) UIMA components and a UIMA compliant NLP platform.

14. www.ebi.ac.uk/ontology-lookup/browse.do?ontName=MI
15. http://incubator.apache.org/uima/

[Figure 2: block diagram with the boxes Input Article, BC2.5 MetaServer, XML-RPC, BC2.5 App-Server interface, Pre-proc, U-Compare workflow (Fig. 3), Post-proc, and INT/IPT Annotation.]

Fig. 2. Overall System Architecture: The BioCreative Meta-Server sends requests to our XML-RPC interface, which simply forwards the request to our U-Compare system (see Figure 3).

[Figure 3: workflow diagram with four layers. “Words” or Tokens: Genia Sentence splitter, Re-tokenizer, STEPP Tokenizer and POS Tagger. Syntax: Mogura HPSG Parser, GDep Dependency Parser. Entities: MedTNER (Fig. 4) with the settings SwissProt Threshold, SwissProt All, and TrEMBL All. Relation Extraction and Result Ranking: AkaneRE, Re-ranker. Edges are labelled with the numbers of the workflows that use them.]

Fig. 3. Five Different Work-Flows: We evaluated five combinations of available processing modules. Work-flows 3&4 used some extra tokenization rules to remove sentence splitting errors caused by “Fig., Prof., Dr., etc.”. For Named Entity Recognition, work-flows 1&3 used only strong predictions from SwissProt, while 2&4 used all SwissProt predictions, and work-flow 5 used all predictions from TrEMBL.

UIMA is a framework for integration and exchange of software components. It is used by IBM and many other Natural Language Processing (NLP) research groups [29], [32], [33]. UIMA encourages a standard way of annotating and sharing information about free text. It also simplifies the integration of different components, such as sentence boundary detection systems, parsers, Named Entity Recognizers (NER) and interaction extraction modules, into one system. Basing the system on the UIMA framework allows us to take advantage of further developments in the particular components without requiring the constant (re-) integration of new tools as they are introduced.

U-Compare [34] is based on the UIMA framework and provides a higher level of interoperability by using carefully defined data-type definitions that cover a wide range of NLP concepts. A large number of ready-to-use, data-type-compatible UIMA components are available via U-Compare. All the components used in our workflow are U-Compare compatible UIMA components. U-Compare also provides an easy-to-use interface to create and configure workflows from all the available NLP components, and to run such workflows with visualization of the results.

The available NLP components in U-Compare have been created according to UIMA design guidelines, which encourage many small annotators to add their own specialized annotation to the original text. The components have been identified as common sub-modules in large NLP systems, and extracted to be used as building blocks in similar new systems. In the BC-IPT task we re-used the components shown in the workflows in Figure 3. All the workflows use the Enju HPSG parser¹⁶ for bio-English and the GENIA-Dependency parser (GDep¹⁷). Enju has been trained on newswire articles (Penn Treebank), but it can also compute accurate analysis of biomedical texts. This is possible through domain adaptation using the GENIA Treebank [35] to adjust the parsing model. The resulting bio-performance is an F-score of 86.9% [36]. In BC II.5, we used the Mogura version of Enju, since Mogura is much faster with only slightly lower performance. Section 6 shows how the five different workflows performed compared to each other, and to the other participating systems.

5.2 Entity and Interaction Detection

The “Medical Text, Named Entity Recognizer” (MedTNER) is a system that recognizes proteins and maps (normalizes) them to their database identifiers. MedTNER is a pipeline of 3 modules which perform dictionary lookup, filtering and disambiguation (Figure 4).

16. http://www-tsujii.is.s.u-tokyo.ac.jp/enju/
17. http://www.cs.cmu.edu/~sagae/parser/gdep/

[Figure 4: pipeline diagram. Text data is processed by NLP tools and then passes through three modules: dictionary look-up, false positive filtering, and disambiguation. Inputs shown: dictionary entries from a synonym list (UniProt, GENA, Entrez Gene); syntactic information (POS tags, etc.) and text data related to each dictionary entry, both used as features; the GENIA named entity corpus and the BioCreative II.5 training data, both used for training. Intermediate outputs: candidate protein mentions and their UniProt IDs; filtered protein mentions and IDs; protein mentions and IDs with probability scores.]

Fig. 4. Named Entity Recognition System: MedTNER.

The dictionary look-up module does case-insensitive string matching. Non-alphanumeric characters are ignored and the dictionary consists of synonym lists from UniProt [37] supplemented with information from Entrez Gene [38] and the GENA dictionary [39].
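The matching rule described above (case-insensitive, non-alphanumeric characters ignored) can be illustrated with a small sketch. It is not the MedTNER code: the function names are hypothetical, the toy synonym lists are made up for the example (only the UniProt identifiers are taken from the example in Section 4), and only whole-phrase lookup is shown, not scanning of running text.

```python
from typing import Dict, Iterable, Set


def normalize(name: str) -> str:
    """Lower-case the name and drop every non-alphanumeric character."""
    return "".join(ch for ch in name.casefold() if ch.isalnum())


def build_index(synonyms: Dict[str, Iterable[str]]) -> Dict[str, Set[str]]:
    """Map each normalized synonym to the set of identifiers that list it."""
    index: Dict[str, Set[str]] = {}
    for identifier, names in synonyms.items():
        for name in names:
            key = normalize(name)
            if key:                      # skip names with no letters or digits
                index.setdefault(key, set()).add(identifier)
    return index


def lookup(phrase: str, index: Dict[str, Set[str]]) -> Set[str]:
    """Return all candidate identifiers for a phrase (empty set if unknown)."""
    return set(index.get(normalize(phrase), set()))


# Hypothetical synonym lists, for illustration only.
TOY_SYNONYMS = {"Q8R1W2": ["Gsg1", "GSG-1"], "Q9WVP6": ["TPAP"]}
TOY_INDEX = build_index(TOY_SYNONYMS)
```

Because several identifiers can share one normalized key, the look-up returns a set of candidates; choosing among them is left to the filtering and disambiguation modules.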

The false positive filtering module does binary classification (logistic regression) and returns confidence scores that are used in the disambiguation module. The GENIA corpus was used to train this classifier. The disambiguation module also does binary classification, and the features include similarity between the target document and all the documents which are linked to each dictionary entry. The official BC training corpus was used to train this classifier.

The results from MedTNER are expanded in the following manner to improve the recall performance. If one phrase is tagged as a protein somewhere in the article, then all (untagged) identical phrases in the article are also tagged with the same protein identifiers.
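The expansion step above can be sketched as follows. The sketch assumes that “identical” means an exact, case-sensitive string match delimited by non-word characters, and that mentions are stored as character offsets; both are assumptions of this example, and `expand_mentions` is a hypothetical name.

```python
import re
from typing import Dict, List, Set, Tuple

Mention = Tuple[int, int, str]   # (start offset, end offset, protein identifier)


def expand_mentions(text: str, mentions: List[Mention]) -> List[Mention]:
    """Copy the identifiers of tagged phrases to untagged identical phrases."""
    ids_by_phrase: Dict[str, Set[str]] = {}
    for start, end, identifier in mentions:
        ids_by_phrase.setdefault(text[start:end], set()).add(identifier)

    tagged = {(start, end) for start, end, _ in mentions}
    expanded = list(mentions)
    for phrase, identifiers in ids_by_phrase.items():
        if not phrase:
            continue
        # Require word boundaries so that "Gsg1" does not match inside "Gsg10".
        pattern = r"(?<!\w)" + re.escape(phrase) + r"(?!\w)"
        for match in re.finditer(pattern, text):
            span = match.span()
            if span in tagged:           # already tagged: leave untouched
                continue
            tagged.add(span)
            expanded.extend((span[0], span[1], i) for i in sorted(identifiers))
    return sorted(expanded)
```

The expansion trades precision for recall: one wrong tag is propagated to every identical phrase, which is acceptable when a later ranking step orders the candidates.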

Once all the modules have made their annotations in the required format, the information is passed on to the AkaneRE interaction detection module. This module puts together all possible combinations of entities into events or relations, and assigns a confidence score saying whether the combination is likely or unlikely. The confidence score is calculated based on similar relations given in the AIMed corpus which was used as training data for the interaction detection module.
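A minimal sketch of candidate generation for the binary (BC-IPT style) case is given below. It enumerates all unordered pairs of normalized proteins per sentence and collects them into article-level pairs that keep links to their evidence sentences. The `score` callable stands in for the trained classifier, and taking the maximum sentence-level confidence as the article-level score is an assumption of this example, not a documented AkaneRE choice.

```python
from itertools import combinations
from typing import Callable, Dict, List, Tuple

Pair = Tuple[str, str]


def candidate_pairs(sentence_entities: List[str]) -> List[Pair]:
    """All unordered pairs of distinct protein identifiers in one sentence."""
    return list(combinations(sorted(set(sentence_entities)), 2))


def article_level_pairs(
    sentences: List[List[str]],
    score: Callable[[int, str, str], float],
) -> Dict[Pair, dict]:
    """Aggregate sentence-level candidates into unique article-level pairs."""
    best: Dict[Pair, dict] = {}
    for index, entities in enumerate(sentences):
        for first, second in candidate_pairs(entities):
            confidence = score(index, first, second)
            entry = best.setdefault(
                (first, second), {"confidence": 0.0, "evidence": []}
            )
            # Assumption: keep the highest sentence-level confidence.
            entry["confidence"] = max(entry["confidence"], confidence)
            entry["evidence"].append(index)   # link to the evidence sentence
    return best
```

Sorting the identifiers before pairing makes `(A, B)` and `(B, A)` the same key, which matches the un-typed, unordered pairs required by the BC-IPT task.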

5.3 Re-Ranking and iPR-AUC optimization

The modules above are trained on the AIMed corpus since they require sentence-level (“text-bound”) annotations (as explained towards the end of Section 3.2). To take advantage of the given BC “article-level” training data, we added a re-ranking module at the end of the AkaneRE pipeline. The iPR-AUC increases when correct answers are in high-ranked positions, so we re-rank the normalized Named Entities (NEs) and the pairs predicted by the Akane interaction detection module using the BC corpus as training data.
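To make the ranking objective concrete, the following sketch computes the area under an interpolated Precision/Recall curve for one ranked result list. It follows a common definition (precision at each correct hit, interpolated as the maximum precision at any equal or higher recall, then summed over the recall steps); the official BC evaluation script may differ in details, so this should be read as an illustration only.

```python
from typing import List, Tuple


def interpolated_pr_auc(ranked_correct: List[bool], total_positives: int) -> float:
    """Area under the interpolated Precision/Recall curve of a ranked list."""
    if total_positives <= 0:
        raise ValueError("total_positives must be positive")

    # Precision and recall measured at the rank of every correct result.
    points: List[Tuple[float, float]] = []
    hits = 0
    for rank, is_correct in enumerate(ranked_correct, start=1):
        if is_correct:
            hits += 1
            points.append((hits / total_positives, hits / rank))

    # Interpolation: best precision achievable at this recall or higher.
    interpolated: List[Tuple[float, float]] = []
    best = 0.0
    for recall, precision in reversed(points):
        best = max(best, precision)
        interpolated.append((recall, best))
    interpolated.reverse()

    area, previous_recall = 0.0, 0.0
    for recall, precision in interpolated:
        area += precision * (recall - previous_recall)
        previous_recall = recall
    return area
```

Under this definition, moving a correct result to a higher rank increases the area, while appending further incorrect results at the bottom of the list leaves it unchanged.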

The following three steps are taken to create features for the re-ranker. First, stop word lists are created to remove general phrases from the NE candidates. Second, species expression lists are created using the NCBI Entrez Taxonomy¹⁸ to tag all species mentioned in each sentence. Finally, more species information is annotated to the sentences using reference/citation information and a UniProt dictionary. If the same word (e.g. vertebrate) can be used as a synonym for multiple species (e.g. bat, turtle, salamander...) all possible occurrences are used. This gives the re-ranker more information to extract good patterns from the gold training data.

Logistic regression is used for the re-ranking, comparing all predictions pair-wise to other predictions, like in Ranking SVM [40]. The features for normalized NEs are shown in Table 4, and the features for interacting pairs are shown in Table 5. Since the iPR-AUC can only increase when more possibly correct results are added, we do not apply any cutoff at a certain threshold. Instead, we return all the (both confident and not confident) re-ranked NEs and pairs (interactions) created by the Akane interaction detection module.
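The pair-wise comparison can be turned into an ordinary binary classification problem by training on feature differences. The sketch below shows one standard way of building such training examples; the exact construction used in AkaneRE is not specified in the text, so the function and its conventions (label 1 when the first prediction should rank higher) are assumptions.

```python
from typing import List, Tuple

Example = Tuple[List[float], int]


def pairwise_examples(features: List[List[float]], labels: List[int]) -> List[Example]:
    """Build difference-vector examples from predictions of one article.

    Each pair of predictions with different gold labels yields two examples:
    the difference vector with target 1 if the first prediction is the
    correct one, and the mirrored vector with the opposite target.
    """
    examples: List[Example] = []
    for i in range(len(features)):
        for j in range(i + 1, len(features)):
            if labels[i] == labels[j]:
                continue                 # no ordering information in this pair
            difference = [a - b for a, b in zip(features[i], features[j])]
            target = 1 if labels[i] > labels[j] else 0
            examples.append((difference, target))
            examples.append(([-d for d in difference], 1 - target))
    return examples
```

A linear logistic regression model trained on these examples yields a weight vector whose dot product with a single feature vector can be used directly as a ranking score.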

In AkanePPI, Tree-Kernels [41] were used, but in the new AkaneRE we use Maximum Entropy (ME) modeling (a.k.a. Logistic Regression) with linear features. ME modeling is provided by the LIBLINEAR program [42] which can produce confidence values between 0 and 1 for all the predictions.

6 RESULTS

Here we provide our best AkaneRE results for BC-IPT, which comprised both on-line and off-line challenges, and compare them to the scores of other teams, and to the scores for the similar BioNLP-EE task. We also show our scores for the five REMerge PPI corpora.

In the BC challenge, both raw and filtered results were provided. The raw results contained all the predictions from the participants, while the filtered results were pre-processed as follows: The correct species for each paper was decided. Then homonym ortholog protein names (from different species) were mapped to the correct species. Finally, other predictions from irrelevant organisms were filtered away.

In the Interactor Normalization Task (INT), our raw results were third among 52 submitted result sets, with an AUC of 38%. Our best filtered INT result was second, with an AUC of 54%. In the Interacting Pairs Task (BC-IPT), our highest raw result was ranked as number four among 45 submitted result sets, with an AUC of 17%, and results produced for all the manually annotated articles. Our highest filtered result was one of the top three with an AUC of 29%, just two points behind the overall highest score. Our system provides high recall, so it produced 78 correct results for the 61 articles with interactions, while the highest scoring system provided only 15 pairs.

18. http://www.ncbi.nlm.nih.gov/taxonomy

Occurrence (frequency)
  NE mentions in each section (title, abstract, etc.)
  NE mentions with orphan brackets, or parentheses
  NE mentions with no alphabet
  NE mentions in the stop word list
  Mentions of NE-related species in the NE-including sentences
  Mentions of NE-related species in the article
  Mentions of NE-related species within 3 words before and after NEs
  Mentions of NE-related species within 3 words before NEs
  Mentions of NE-related species within 3 words after NEs
  Words within 3 words before NEs
  Words within 3 words after NEs
  Words within 3 words before and after NEs
  Words in NEs
Confidence (predicted)
  Maximum confidence (by MedTNER)
  Confidence (by MedTNER) of NE with strong independence assumption
Boolean
  For each species (human, mouse, etc.): is it the right species for the NE
  NE-related species are referenced in the NE-including sentences
  NE-related species are in the article references
  No NE-related species are mentioned in the NE-including sentences
  No NE-related species are mentioned in the article

TABLE 4. Features for Normalized Named Entity (NE) Re-Ranking

Occurrence (frequency)
  Pair mentions in the article
  Mentions of entity-related species (weighted by all Akane confidence values for the pair), for each entity
  Mentions of both entity-related species (weighted by all Akane confidence values for the pair)
  References including the entity-related species (weighted by all Akane confidence values for the pair), for each entity
  References including the species of both entities (weighted by all Akane confidence values for the pair)
  Mentions of entity-related species in the article for each entity (weighted by all Akane confidence values for the pair)
  Words (weighted by all Akane confidence values for the pair)
Confidence (predicted)
  Maximum confidence (by MedTNER) for each entity
  Multiplication of the maximum confidence (by MedTNER) for each entity
  Confidence (by MedTNER) of each entity with strong independence assumption
  Confidence (by MedTNER) of both entities with strong independence assumption
  Confidence of each entity by NE re-ranker
  Multiplication of the confidences of the entities by NE re-ranker
Boolean
  Entity-related species are referred in the references for each entity
  No entity-related species are mentioned in the sentences which pairs are mentioned
  No entity-related species are mentioned in the references of sentences which pairs are mentioned
  No entity-related species are mentioned in the article
  No entity-related species are mentioned in the reference
  For each species (human, mouse, etc.): is it the right species for each entity
  Both entities have the same species

TABLE 5. Features for Pairs Re-Ranking

Figures 5 and 6 show our best performance curves for the “homonym ortholog mapped and filtered” (referred to as filtered in this article) results for the BC Workshop. The red line is the actual Precision/Recall curve for all ranks across all documents. The blue line is the interpolated Precision/Recall curve. See [43] for a more detailed explanation about this. The F-score¹⁹ at the balanced Precision/Recall is also plotted.

19. $F = \frac{2 \cdot P \cdot R}{P + R}$

Table 6 shows the filtered results for our five workflows. We see that server 3 produces the highest AUC values, both for the INT and IPT tasks, and that server 1 is very close. This means that the filtering based on the thresholds provided by MedTNER (based on species disambiguation etc.) makes it easier for AkaneRE to rank the correct relations higher. This is also why server 5 is not very competitive. There are simply too many potential protein names in the complete UniProt TrEMBL database for our system to handle in an efficient manner. There are small improvements from servers 1 and 2 to servers 3 and 4, so the manually created re-tokenizer rules help a little. However, the impact on the final INT and IPT scores is less than one percentage point for the best performing servers.


        BC-INT                                BC-IPT
WF      P     R     F     AUC   On>Off        P     R     F     AUC   On>Off
1       18.7  67.1  25.1  54.0  +9.3          7.4   44.3  8.7   28.7  +10.0
2       11.7  71.8  14.6  51.5  +13.1         3.1   49.7  2.6   24.2  +12.2
3       18.7  67.1  25.1  54.4  +9.5          7.4   43.8  8.7   29.2  +10.6
4       11.3  72.3  14.5  53.4  +8.2          3.3   51.0  3.0   27.0  +15.0
5       16.3  64.6  21.4  45.2  +4.1          4.8   36.5  5.9   17.4  +7.9
T42     74.3  55.1  58.8  53.0  N/A           53.1  34.5  37.4  31.5  N/A

TABLE 6. BC-INT and BC-IPT results for the 5 different off-line workflows (see Figure 3), with the AUC improvement from the on-line setting shown. Values are macro-averaged (per document), counting only documents with at least one prediction. WF means Work-Flow and T42 is the server with the highest reported Precision and AUC values among the BC-IPT participants.

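[Figure 5: Precision/Recall plot for the INT task, with axes Precision and Recall and the balanced F-score marked as F=45.]

Fig. 5. BioCreative protein Interactor Normalization Task (INT) Results.

[Figure 6: Precision/Recall plot for the IPT task, with axes Precision and Recall and the balanced F-score marked as F=25%.]

Fig. 6. BC-IPT Results.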

                POS     NEG      P     R     F     AUC
BioCreative
  DevTest       216     0        10.6  64.2  14.3  35.2
  Workshop      236     0        0.2   34.5  0.4   14.5
  Re-DevTest    216     0        3.5   63.3  5.4   48.1
  Re-Test       236     0        7.4   43.8  8.7   29.2
  Best Re-Test  216     0        53.1  34.5  37.4  31.5
BioNLP
  Akane dev     1,809   51,963   49.7  32.0  38.9
  Akane test    3,182   53,767   53.6  28.1  36.9
  Best test     3,182   ?        58.5  46.7  52.0

TABLE 7. AkaneRE system PPI results. Precision, Recall, F- and AUC scores are given as percents (%). POS and NEG are the numbers of all positive and negative relations to classify.

Table 6 also shows that the AUC performance is around 10 percentage points higher for the off-line task setting compared to the on-line task setting. This is primarily because of strict processing time limits which caused 4-6 (10%) of the articles with many proteins to time out in the on-line setting. Some of this improvement also comes from fixing small bugs discovered in the on-line phase, before submitting the off-line results.

The upper half of Table 7 shows our BC-IPT results published at the workshop and a re-calculated version where articles without any predictions are not counted. We notice a big drop in the scores from the development test data (DevTest and Re-DevTest) to the blind test data (Workshop and Re-Test).
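The effect of not counting articles without predictions can be illustrated with a small macro-averaging sketch. The data layout and the function name are assumptions made for the example, and the numbers in the usage note are invented; they are not taken from the BC evaluation.

```python
from typing import Dict, Tuple

# document id -> (number of predictions, precision, recall)
DocumentScores = Dict[str, Tuple[int, float, float]]


def macro_average(per_document: DocumentScores,
                  skip_empty: bool = True) -> Tuple[float, float]:
    """Macro-averaged (per-document) precision and recall.

    With skip_empty=True, documents without any prediction are left out of
    the average instead of contributing zeros.
    """
    counted = [
        (precision, recall)
        for count, precision, recall in per_document.values()
        if count > 0 or not skip_empty
    ]
    if not counted:
        return 0.0, 0.0
    mean_precision = sum(p for p, _ in counted) / len(counted)
    mean_recall = sum(r for _, r in counted) / len(counted)
    return mean_precision, mean_recall
```

For three hypothetical documents with scores `(2, 0.5, 1.0)`, `(0, 0.0, 0.0)` and `(4, 0.25, 0.5)`, skipping the empty document gives averages of 0.375 and 0.75, while counting it gives 0.25 and 0.5.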

Since our goal was to optimize the AUC value, all ranked results (likely and unlikely) were returned. This leads to very low Precision and F-scores when the entire result-set is evaluated. This can be improved trivially by the end-user, since the results are ranked according to a confidence value between 0 and 1. By letting the end-user select the desired threshold, Precision and Recall can be balanced to produce a good F-value. The F-scores shown in Figures 5 and 6 were among the highest F-scores across all the articles containing interacting protein pairs.
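How an end-user might select such a threshold is sketched below, using the F-score definition from footnote 19. The function names and the example scores are hypothetical, and in practice the threshold would have to be chosen on held-out data with known gold labels.

```python
from typing import List, Tuple

Scored = List[Tuple[float, bool]]   # (confidence, is the prediction correct?)


def prf_at_threshold(scored: Scored, total_positives: int,
                     threshold: float) -> Tuple[float, float, float]:
    """Precision, recall and F-score when only confident results are kept."""
    kept = [correct for confidence, correct in scored if confidence >= threshold]
    true_positives = sum(kept)
    precision = true_positives / len(kept) if kept else 0.0
    recall = true_positives / total_positives if total_positives else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def best_threshold(scored: Scored, total_positives: int) -> float:
    """Confidence value that maximizes the F-score on the given results."""
    candidates = sorted({confidence for confidence, _ in scored})
    return max(candidates,
               key=lambda t: prf_at_threshold(scored, total_positives, t)[2])


# Hypothetical ranked output: two correct pairs among five predictions.
RANKED = [(0.9, True), (0.8, False), (0.6, True), (0.3, False), (0.1, False)]
```

For `RANKED`, returning everything gives a precision of 0.4, whereas cutting at the confidence 0.6 selected by `best_threshold(RANKED, 2)` keeps full recall with a precision of 2/3.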


The bottom of Table 7 shows the AkaneRE performance in the BioNLP-EE task²⁰. “Akane dev” is the development test-set used to tune the system. “Akane test” is our official blind test-score, which was ranked as number 6 of 24 teams, and the last line shows the performance for the best team. We do not know how many negative training samples were created by this team.

Since there was no ranking or thresholds in the BioNLP shared task, AUC values cannot be computed, so the results cannot be compared directly to the BC-IPT results. However, we notice that the drop in performance from DevTest to Test is much bigger for BC-IPT than for BioNLP-EE (not looking at the re-calculated F-scores). This means that either there is a bigger difference between the BC-IPT DevTest and Test sets, or that our system was more over-fitted to the DevTest in the BC-IPT task than in the BioNLP-EE task. The best way to check this is to apply other systems to both tasks, and measure the performance differences.

Table 8 shows 10-fold cross-validation results for the five REMerge PPI corpora. These results are an improved version of those in [3] with calculation of standard deviation values for the AUC and F-scores.

7 DISCUSSION AND FUTURE WORK

Here we will point out the strengths and weaknesses of our scientific approach and the main differences between BC-IPT and BioNLP-EE. First, the AkaneRE system uses the same linguistic features to classify any kind of relation. This is a strength, because the system can directly be moved to a different RE task. However, it can also be a weakness, since the current features may be too optimized towards PPI or BioNLP events. In the future, we would like to investigate which features are especially useful for different tasks. We would like to see if the feature selection for a new domain can be handled by only switching the NER system, or if new features must be created for the interaction detection module as well.

This paper showed that hand-made tokenization rules can improve the NER and RE performance a little, but that the major contribution comes from using the right source for the protein dictionary. The performance of the relation re-ranker module was best when only a limited, confident set of SwissProt database identifiers was passed from the NER module. We also saw that complementary approaches are needed to extract entities and relations that are only mentioned in figures and tables, but not in the text.

The main difference between BC-IPT and BioNLP-EE is that BC-IPT is not text-bound. BC-IPT is also a much more difficult challenge because it requires recognition of protein names and which species they belong to, mapping of the names to database identifiers, and making sure biological experimental evidence is given somewhere in the article for each predicted interaction.

20. http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/SharedTask/results/results-master.html

AkaneRE’s configuration file has a section for specifying different output formats. The “outputfiles” tag (see Supplemental Material) can be used to specify the location and the format to be used for outputting results and error analysis information. The following output formats are supported: “BioCreative”, “BioNLP”, “REMerge”, “AIMed”. It would be advantageous if future shared task challenges tried to re-use one of these formats.

For all the available PPI and event corpora there are several different formats. Since AIMed was the first abstract-based PPI corpus, its format was used for the earlier versions of this system. Several conversion scripts have been generated to convert the AIMed XML format into the U-Compare stand-off format. There are also conversion scripts available for the “five-corpora PPI” and for the BioNLP shared task formats.

8 CONCLUSION

In this paper, we showed how the BC-IPT, BioNLP-EE and other PPI extraction tasks are related and how a single open source system can achieve good results in all these settings. We explained the main differences between the BC-IPT and BioNLP-EE shared task challenges and proposed a general XML configuration language for biomedical information extraction experiments.

The new unified AkaneRE system can take an XML configuration file as input and has been shown to achieve the best results published so far on publicly available PPI corpora. The system was applied in the BioNLP shared task and ranked as number 6, with an F-score of 37%, among 24 participating systems. Finally, the system was recently applied in the BioCreative II.5 challenge, with state-of-the-art iPR-AUC scores of 29% in the Interacting Pairs Task (BC-IPT), and 54% in the Interactor Normalization Task (BC-INT). Furthermore, AkaneRE produced filtered interactions for 59 of the 61 test articles that contained manually annotated interactions, while the highest ranked system only found 18 of these test articles. The AkaneRE system with the XML configuration language is freely available for academic purposes from http://www-tsujii.is.s.u-tokyo.ac.jp/satre/akane/.

ACKNOWLEDGMENTS

This work was partially supported by Grant-in-Aid for Specially Promoted Research (MEXT, Japan) and Genome Network Project (MEXT, Japan). The authors would like to thank Sebastian Riedel for making the program that adjusts the BioNLP stand-off notation between two differently tokenized versions of the same text. Jin-Dong Kim and Sampo Pyysalo provided substantial input for the sections on the BioNLP shared task challenge, and together with Brian Kemper helped proof-read this manuscript. Three anonymous reviewers and the guest editor provided valuable feedback that helped improve the manuscript further. Kenji Sagae provided the GENIA Dependency Parser (GDep) and Yuichiro Matsubayashi initially created the Stand-Off Manager (SOM).

            POS    NEG    P     R     F     σF    AUC    σAUC
AIMed       1000   4834   62.7  66.6  64.2  5.3   0.891  0.030
BioInfer    2534   7119   63.6  72.8  67.6  3.0   0.861  0.044
HPRD50      163    270    66.8  75.2  69.7  10.3  0.828  0.080
IEPA        335    482    73.5  77.3  74.4  5.8   0.856  0.042
LLL         164    166    76.6  87.1  80.5  15.1  0.860  0.104

TABLE 8. REMerge corpora results. POS and NEG show the number of all positive and negative interactions to be classified. Precision, Recall and F-scores are given as percents (%). AUC is “Area Under the ROC Curve”.

REFERENCES

[1] R. Bunescu, R. Ge, R. J. Kate, E. M. Marcotte, R. J. Mooney, A. K. Ramani, and Y. W. Wong, “Comparative Experiments on Learning Information Extractors for Proteins and their Interactions,” Journal Artificial Intelligence in Medicine: Special Issue on Summarization and Information Extraction from Medical Documents, 2004. [Online]. Available: http://www.ncbi.nlm.nih.gov/pubmed/15811782

[2] R. Sætre, K. Sagae, and J. Tsujii, “Syntactic features for protein-protein interaction extraction,” in Short Paper Proceedings of the 2nd International Symposium on Languages in Biology and Medicine (LBM 2007), ser. ISSN 1613-0073, C. J. Baker and S. Jian, Eds., vol. 319. CEUR Workshop Proceedings (CEUR-WS.org), January 2008, pp. 6.1–6.14. [Online]. Available: http://sunsite.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-319/Paper6.pdf

[3] M. Miwa, R. Sætre, Y. Miyao, and J. Tsujii, “A rich feature vector for protein-protein interaction extraction from multiple corpora,” in 2009 Conference on Empirical Methods in Natural Language Processing. Singapore: Association for Computational Linguistics, August 2009, pp. 121–130. [Online]. Available: http://www.aclweb.org/anthology/D/D09/D09-1013.pdf

[4] R. Sætre, M. Miwa, K. Yoshida, and J. Tsujii, “From protein-protein interaction to molecular event extraction,” in Proceedings of Natural Language Processing in Biomedicine (BioNLP) NAACL 2009 Workshop, 2009, pp. 103–106. [Online]. Available: http://www-tsujii.is.s.u-tokyo.ac.jp/~satre/papers/bioShared2009 satre.pdf

[5] R. Kabiljo, A. Clegg, and A. Shepherd, “A realistic assessment of methods for extracting gene/protein interactions from free text,” BMC Bioinformatics, vol. 10, no. 1, pp. 233+, July 2009. [Online]. Available: http://dx.doi.org/10.1186/1471-2105-10-233

[6] Y. Niu, D. Otasek, and I. Jurisica, “Evaluation of linguistic features useful in extraction of interactions from PubMed; Application to annotating known, high-throughput and predicted interactions in I2D.” Bioinformatics (Oxford, England), vol. 26, no. 1, pp. 111–119, January 2010. [Online]. Available: http://dx.doi.org/10.1093/bioinformatics/btp602

[7] T. Fayruzov, M. De Cock, C. Cornelis, and V. Hoste, “The role of syntactic features in protein interaction extraction,” in Proceedings of the 2nd international workshop on Data and text mining in bioinformatics. ACM New York, NY, USA, 2008. [Online]. Available: http://portal.acm.org/citation.cfm?id=1458463

[8] S. Van Landeghem, Y. Saeys, B. De Baets, and Y. Van de Peer, “Extracting protein-protein interactions from text using rich feature vectors and feature selection,” in Proceedings of the Third International Symposium on Semantic Mining in Biomedicine (SMBM 2008), Turku, Finland, T. Salakoski, D. Rebholz-Schuhmann, and S. Pyysalo, Eds. Turku Centre for Computer Science (TUCS), 2008, pp. 77–84. [Online]. Available: http://mars.cs.utu.fi/smbm2008/files/smbm2008proceedings/smbmpaper 4.pdf

[9] F. Leitner, M. Krallinger, C. Rodriguez-Penagos, J. Hakenberg, C. Plake, C.-J. Kuo, C.-N. Hsu, R. T.-H. Tsai, H.-C. Hung, W. W. Lau, C. A. Johnson, R. Sætre, K. Yoshida, Y. H. Chen, S. Kim, S.-Y. Shin, B.-T. Zhang, W. A. Baumgartner, L. H. Jr., B. Haddow, M. Matthew, X. Wang, P. Ruch, F. Ehrler, A. Ozgur, G. Erkan, D. R. Radev, M. Krauthammer, T. Luong, R. Hoffmann, C. Sander, and A. Valencia, “Introducing meta-services for biomedical information extraction,” Genome Biology, vol. 9, no. S2, p. S6, 2008, Special Issue on the BioCreative Challenge Evaluation. [Online]. Available: http://genomebiology.com/2008/9/S2/S6

[10] P. Roberts, A. Cohen, and W. Hersh, “Tasks, topics and relevance judging for the TREC Genomics Track: Five years of experience evaluating biomedical text information retrieval systems,” Information Retrieval, vol. 12, no. 1, pp. 81–97, 2009. [Online]. Available: http://www.springerlink.com/content/940478r304656141/

[11] K. Fundel, R. Kuffner, and R. Zimmer, “RelEx–Relation extraction using dependency parse trees,” Bioinformatics, vol. 23, no. 3, pp. 365–371, February 2007. [Online]. Available: http://dx.doi.org/10.1093/bioinformatics/btl616

[12] S. Kim, S.-Y. Shin, I.-H. Lee, S.-J. Kim, R. Sriram, and B.-T. Zhang, “Pie: an online prediction system for protein-protein interactions from text,” Nucl. Acids Res., vol. 36, no. suppl 2, pp. W411–415, July 2008. [Online]. Available: http://dx.doi.org/10.1093/nar/gkn281

[13] P. Palaga, L. Nguyen, U. Leser, and J. Hakenberg, “High-performance information extraction with AliBaba,” in EDBT ’09: Proceedings of the 12th International Conference on Extending Database Technology. New York, NY, USA: ACM, 2009, pp. 1140–1143. [Online]. Available: http://doi.acm.org/10.1145/1516360.1516498

[14] L. Hunter, Z. Lu, J. Firby, W. Baumgartner, H. Johnson, P. Ogren, and K. B. Cohen, “OpenDMAP: An open source, ontology-driven concept analysis engine, with applications to capturing knowledge regarding protein transport, protein interactions and cell-type-specific gene expression,” BMC Bioinformatics, vol. 9, no. 1, pp. 78+, January 2008. [Online]. Available: http://dx.doi.org/10.1186/1471-2105-9-78

[15] R. Chowdhary, J. Zhang, and J. S. Liu, “Bayesian inference of protein-protein interactions from biological literature,” Bioinformatics, vol. 25, no. 12, pp. 1536–1542, June 2009. [Online]. Available: http://dx.doi.org/10.1093/bioinformatics/btp245

[16] M. Krallinger, A. Morgan, L. Smith, F. Leitner, L. Tanabe, J. Wilbur, L. Hirschman, and A. Valencia, “Evaluation of text-mining systems for biology: overview of the Second BioCreative community challenge,” Genome Biology, vol. 9, no. S2, 2008. [Online]. Available: http://dx.doi.org/10.1186/gb-2008-9-s2-s1

[17] F. Leitner and A. Valencia, “A text-mining perspective on the requirements for electronically annotated abstracts,” FEBS Letters, vol. 582, no. 8, pp. 1178–1181, April 2008. [Online]. Available: http://dx.doi.org/10.1016/j.febslet.2008.02.072

[18] J.-D. Kim, T. Ohta, S. Pyysalo, Y. Kano, and J. Tsujii, “Overview of BioNLP’09 shared task on event extraction,” in Proceedings of the BioNLP 2009 Workshop Companion Volume for Shared Task, 2009, pp. 1–9. [Online]. Available: http://www.aclweb.org/anthology/W/W09/W09-1401.pdf

[19] S. Pyysalo, F. Ginter, J. Heimonen, J. Bjorne, J. Boberg, J. Jarvinen, and T. Salakoski, “BioInfer: a corpus for information extraction in the biomedical domain,” BMC Bioinformatics, vol. 8, no. 1, pp. 50+, 2007. [Online]. Available: http://dx.doi.org/10.1186/1471-2105-8-50

[20] J. Ding, D. Berleant, D. Nettleton, and E. Wurtele, “Mining MEDLINE: abstracts, sentences, or phrases?” Pac Symp Biocomput, pp. 326–337, 2002. [Online]. Available: http://view.ncbi.nlm.nih.gov/pubmed/11928487

[21] C. Nedellec, “Learning Language in Logic – Genic Interaction Extraction Challenge,” in Proceedings of the 4th Learning Language in Logic Workshop (LLL05), J. Cussens and C. Nedellec, Eds., Bonn, August 2005, pp. 31–37. [Online]. Available: http://www.cs.york.ac.uk/aig/lll/lll05/lll05-nedellec.pdf

[22] S. Pyysalo, A. Airola, J. Heimonen, J. Bjorne, F. Ginter, and T. Salakoski, “Comparative analysis of five protein-protein interaction corpora,” BMC Bioinformatics, vol. 9, no. Suppl 3, pp. S6+, 2008. [Online]. Available: http://dx.doi.org/10.1186/1471-2105-9-S3-S6

[23] J. D. Kim, T. Ohta, and J. Tsujii, “Corpus annotation for mining biomedical events from literature,” BMC Bioinformatics, vol. 9, no. 1, pp. 10+, 2008. [Online]. Available: http://dx.doi.org/10.1186/1471-2105-9-10

[24] A. Yakushiji, “Relation information extraction using deep syntactic analysis,” Ph.D. dissertation, University of Tokyo, 2006. [Online]. Available: http://www-tsujii.is.s.u-tokyo.ac.jp/~akane/papers/dissertation yakushiji.pdf

[25] R. Sætre, K. Yoshida, A. Yakushiji, Y. Miyao, Y. Matsubayashi, and T. Ohta, “AKANE System: Protein-Protein Interaction Pairs in BioCreAtIvE2 Challenge, PPI-IPS subtask,” in Proceedings of the Second BioCreative Challenge Evaluation Workshop, L. Hirschman, M. Krallinger, and A. Valencia, Eds. Spain: CNIO, April 2007, pp. 209–212. [Online]. Available: http://www-tsujii.is.s.u-tokyo.ac.jp/~satre/papers/BC2 PPI IPS T19 BC2.pdf

[26] Y. Kano, N. Nguyen, R. Sætre, K. Yoshida, Y. Miyao, Y. Tsuruoka, Y. Matsubayashi, S. Ananiadou, and J. Tsujii, “Filling the gaps between tools and users: A tool comparator, using protein-protein interactions as an example,” in Proceedings of The Pacific Symposium on Biocomputing (PSB), no. 13, Hawaii, USA, January 2008, pp. 616–627. [Online]. Available: http://psb.stanford.edu/psb-online/proceedings/psb08/kano.pdf

[27] Y. Miyao, K. Sagae, R. Sætre, T. Matsuzaki, and J. Tsujii, “Evaluating contributions of natural language parsers to protein-protein interaction extraction,” Bioinformatics, vol. 25, no. 3, pp. 394–400, 2009. [Online]. Available: http://bioinformatics.oxfordjournals.org/cgi/content/abstract/25/3/394

[28] M. Miwa, R. Sætre, Y. Miyao, and J. Tsujii, “Protein-protein interaction extraction by leveraging multiple kernels and parsers,” International Journal of Medical Informatics, vol. 78, no. 12, pp. e39–e46, 2009, Mining of Clinical and Biomedical Text and Data Special Issue. [Online]. Available: http://www.ijmijournal.com/article/S1386-5056%2809%2900076-8/

[29] D. Ferrucci and A. Lally, “UIMA: an architectural approach to unstructured information processing in the corporate research environment,” Natural Language Engineering, vol. 10, no. 3-4, pp. 327–348, 2004. [Online]. Available: http://portal.acm.org/citation.cfm?id=1030318.1030325

[30] R. Sætre, “Akane system home page,” http://www-tsujii.is.s.u-tokyo.ac.jp/~satre/akane/, 2009.

[31] H. Hermjakob, L. Montecchi-Palazzi, G. Bader, R. Wojcik, L. Salwinski, A. Ceol, S. Moore, S. Orchard, U. Sarkans, C. von Mering, B. Roechert, S. Poux, E. Jung, H. Mersch, P. Kersey, M. Lappe, Y. Li, R. Zeng, D. Rana, M. Nikolski, H. Husi, C. Brun, K. Shanker, S. Grant, C. Sander, P. Bork, W. Zhu, A. Pandey, A. Brazma, B. Jacq, M. Vidal, D. Sherman, P. Legrain, G. Cesareni, L. Xenarios, D. Eisenberg, B. Steipe, C. Hogue, and R. Apweiler, “The HUPO PSI’s Molecular Interaction format - a community standard for the representation of protein interaction data,” Nature Biotechnology, vol. 22, no. 2, pp. 177–183, February 2004. [Online]. Available: http://www.ncbi.nlm.nih.gov/pubmed/14755292

[32] U. Hahn, E. Buyko, K. Tomanek, S. Piao, J. McNaught, Y. Tsuruoka, and S. Ananiadou, “An annotation type system for a data-driven NLP pipeline,” in Proceedings of the Linguistic Annotation Workshop. Prague, Czech Republic: Association for Computational Linguistics, June 2007, pp. 33–40. [Online]. Available: http://www.aclweb.org/anthology/W/W07/W07-1505

[33] W. A. Baumgartner, B. K. Cohen, and L. Hunter, “An open-source framework for large-scale, flexible evaluation of biomedical text mining systems,” Journal of Biomedical Discovery and Collaboration, vol. 3, pp. 1+, January 2008. [Online]. Available: http://dx.doi.org/10.1186/1747-5333-3-1

[34] Y. Kano, W. A. Baumgartner, L. McCrohon, S. Ananiadou, K. B. Cohen, L. Hunter, and J. Tsujii, “U-Compare: share and compare text mining tools with UIMA,” Bioinformatics, vol. 25, no. 15, pp. 1997–1998, August 2009. [Online]. Available: http://dx.doi.org/10.1093/bioinformatics/btp289

[35] J.-D. Kim, T. Ohta, Y. Tateishi, and J. Tsujii, “GENIA corpus - a semantically annotated corpus for bio-textmining,” Bioinformatics, vol. 19, no. suppl. 1, pp. i180–i182, 2003, ISSN 1367-4803. [Online]. Available: http://bioinformatics.oupjournals.org/cgi/content/abstract/19/suppl 1/i180

[36] T. Hara, Y. Miyao, and J. Tsujii, “Adapting a probabilistic disambiguation model of an HPSG parser to a new domain,” in IJCNLP 2005, ser. LNAI, R. Dale, K.-F. Wong, J. Su, and O. Y. Kwong, Eds., vol. 3651. Jeju Island, Korea: Springer-Verlag, October 2005, pp. 199–210, ISSN 0302-9743. [Online]. Available: http://www-tsujii.is.s.u-tokyo.ac.jp/~harasan/papers/harasan-IJCNLP2005.pdf

[37] R. Apweiler, A. Bairoch, C. H. Wu, W. C. Barker, B. Boeckmann, S. Ferro, E. Gasteiger, H. Huang, R. Lopez, M. Magrane, M. J. Martin, D. A. Natale, C. O’Donovan, N. Redaschi, and L.-S. L. Yeh, “UniProt: the Universal Protein knowledgebase,” Nucl. Acids Res., vol. 32, no. suppl 1, pp. D115–119, January 2004. [Online]. Available: http://dx.doi.org/10.1093/nar/gkh131

[38] D. Maglott, J. Ostell, K. D. Pruitt, and T. Tatusova, “Entrez Gene: gene-centered information at NCBI,” Nucl. Acids Res., vol. 33, no. suppl 1, pp. D54–58, January 2005. [Online]. Available: http://dx.doi.org/10.1093/nar/gki031

[39] A. Koike and T. Takagi, “Gene/protein/family name recognition in biomedical literature,” in Proceedings of Biolink 2004, 2004, pp. 9–16. [Online]. Available: http://www.cs.brandeis.edu/~jamesp/biolink2004/papers/pdf/BIO002.pdf

[40] T. Joachims, “Optimizing search engines using clickthrough data,” in KDD ’02: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining. New York, NY, USA: ACM, 2002, pp. 133–142. [Online]. Available: http://doi.acm.org/10.1145/775047.775067

[41] A. Moschitti, “Making tree kernels practical for natural language learning,” in EACL. The Association for Computer Linguistics, 2006. [Online]. Available: http://acl.ldc.upenn.edu/E/E06/E06-1015.pdf

[42] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, “LIBLINEAR: A library for large linear classification,” Journal of Machine Learning Research, vol. 9, pp. 1871–1874, 2008. [Online]. Available: http://www.csie.ntu.edu.tw/~cjlin/papers/liblinear.pdf

[43] L. A. Hirschman, S. A. Mardis, G. Cesareni, M. Krallinger, F. Leitner, and A. Valencia, “An Overview of BioCreative II.5,” Transactions on Computational Biology and Bioinformatics, 2009.

Rune Sætre received his M.Sc. and Ph.D. degrees in computer science in 2003 and 2006, from the Norwegian University of Science and Technology (NTNU) in Trondheim, Norway. Both his M.Sc. and Doctoral theses are based on work for the GeneTUC project. He has since been working as a research associate at the University of Tokyo (UoT), under the supervision of Prof. Jun’ichi Tsujii. Rune is interested in Natural Language Processing for Bio-Medical Texts (BioNLP) and he is collaborating with cancer researchers in order to make useful real-life applications. He is working on Relation Extraction in the PathText project and on Analyzing Gene Ranking Algorithms in the bilateral Japan-Slovenia AGRA project.

Kazuhiro Yoshida received his M.Sc. degree in information science and technology, from the University of Tokyo, Japan, in 2005. He later worked as a researcher at the University of Tokyo. He is interested in machine learning and natural language analysis, including natural language parsing.


Makoto Miwa received his M.Sc. and Ph.D. degrees, both in science, from the University of Tokyo, Japan, in 2005 and 2008, respectively. He has since been working as a researcher at the University of Tokyo, under the supervision of Prof. Jun’ichi Tsujii. His interests are Machine Learning in Natural Language Processing for Bio-Medical Texts (BioNLP) and Computer Games.

Takuya Matsuzaki received his M.Sc. and Ph.D. degrees, both in information science and technology, from the University of Tokyo, Japan, in 2004 and 2007, respectively. He has since been working as a researcher at the University of Tokyo, under the supervision of Prof. Jun’ichi Tsujii. His interest is in natural language parsing and its application.

Yoshinobu Kano received his M.Sc. degree in information science and technology, from the University of Tokyo, Japan, in 2003. He has been working as a research associate at the University of Tokyo, under the supervision of Prof. Jun’ichi Tsujii. He has been leading the U-Compare joint project as the main developer. He is also interested in the psychological plausibility of NLP parser models.

Jun’ichi Tsujii received his B.Eng., M.Eng. and Ph.D. degrees in Electrical Engineering from Kyoto University, Japan, in 1971, 1973, and 1978 respectively. He was assistant professor and associate professor at Kyoto University, before taking up the position of professor of Computational Linguistics at the University of Manchester Institute of Science and Technology (UMIST) in 1988. Since 1995, he has been Professor in the Department of Computer Science at the University of Tokyo. He has also been Professor of Text Mining at the University of Manchester (half time) and research director of the UK National Centre for Text Mining (NaCTeM) since 2004. He was President of ACL (Association for Computational Linguistics) in 2006 and has been a permanent member of ICCL (International Committee on Computational Linguistics) since 1992.