A Rule-based Approach to External Context Extraction from
Biomedical Literature: URL and Role Extraction
A dissertation submitted to The University of Manchester for the degree of
Master of Science Informatics
In the Faculty of Engineering and Physical Sciences
2010
Azad Dehghan
School of Computer Science
Table of Contents
Table of Contents .......................................................................................................................... 2
List of Tables ................................................................................................................................. 4
List of Figures ................................................................................................................................ 6
List of Abbreviations ..................................................................................................................... 7
National Centre for Biotechnology Information NCBI
National Institute of Health NIH
National Library of Medicine NLM
Natural Language Processing NLP
Object Oriented Programming OOP
PubMed Central PMC
Relational Database Management System RDBMS
Left-hand-side LHS
Role Expression RE
Separation of Concern SoC
Software Development Processes SDP
Software Requirements Engineering SRE
Software Requirements Specification SRS
Text Mining TM
Abstract
With the huge number of publications within the biomedical domain, there is an increasing number
of references to URLs, and acknowledgements of individuals and funding organisations. This
project was motivated by providing a look into the scope of the problem of URL decay, and by
exploring and uncovering facts such as the most active funding organisations, and the relationships
between funding agencies and research themes, and between scientists and research themes.
EXTernal CONtext eXtractor 2 (ExtConX2) was developed in order to aid with this aim. Rule-
based approaches were adopted in order to extract URLs and acknowledgements from PubMed
Central documents. From the entire PMC dataset of roughly 190,000 PMC documents processed,
147,133 URLs and 194,539 roles were extracted.
Using this data, we have analysed some trends in URL decay and acknowledgements. For example,
we found that URL decay can be described as a function of publication year: the older the
publication, the less likely the resources it references are to remain accessible. We also found that
most funding acknowledgements were associated with the National Institutes of Health, the
National Science Foundation, and the Wellcome Trust, in that order.
The adopted approach for URL extraction achieved a precision of 98.6% and a recall of 96%. The
role extraction task achieved a recall of 67.6% and a precision of 92.6%.
Declaration
No portion of the work referred to in the dissertation has been submitted in support of an
application for another degree or qualification of this or any other university or other institute of
learning.
Copyright Statement
i. The author of this dissertation (including any appendices and/or schedules to this
dissertation) owns any copyright in it (the "Copyright") and he has given The University of
Manchester the right to use such Copyright for any administrative, promotional, educational and/or teaching purposes.
ii. Copies of this dissertation, either in full or in extracts, may be made only in accordance with the regulations of the John Rylands University Library of Manchester. Details of these
regulations may be obtained from the Librarian. This page must form part of any such
copies made.
iii. The ownership of any patents, designs, trademarks and any and all other intellectual
property rights except for the Copyright (the "Intellectual Property Rights") and any
reproductions of copyright works, for example graphs and tables ("Reproductions"), which may be described in this dissertation, may not be owned by the author and may be owned
by third parties. Such Intellectual Property Rights and Reproductions cannot and must not
be made available for use without the prior written permission of the owner(s) of the relevant Intellectual Property Rights and/or Reproductions.
iv. Further information on the conditions under which disclosure, publication and exploitation of this dissertation, the Copyright and any Intellectual Property Rights and/or
Reproductions described in it may take place is available from the Head of School of
Computer Science.
Dedication
This project is first and foremost dedicated to Science. I hope that science and reason will continue
to prevail! The earth is round indeed!
Secondly, I would also like to dedicate this project to my family: my parents Siavash Dehghan and
Shahnaz Gharehjani, and my brother Arash for his support.
Acknowledgement
I am grateful to Dr. Goran Nenadic for helpful comments and suggestions. I would also like to acknowledge the gnTeam for providing the PubMed Central dataset.
1. Introduction
The presence of overwhelming amounts of unstructured textual information within scientific
literature has made the need for machine-supported analysis of text ever more important to aid
scientists with scientific hypothesis generation and knowledge discovery (Ananiadou & McNaught,
2006; Ananiadou et al., 2005; Uramoto et al., 2004). A specific problem domain is that of the
biological sciences, reflected by the sheer volume of academic publications. For instance, in the
previous year alone (2009), over 710,000 approved references were added to MEDLINE®/
PubMed®, with between 60,000 and 120,000 references added each month (NLM 2008; NLM 2009). The
sheer number of publications is simply not digestible by any individual scientist.
This domain in particular has made the application of text mining (TM) techniques to analyse huge
quantities of unstructured information a vital means to extend and further scientific and knowledge
discovery (Ananiadou & McNaught, 2006). The limitations of traditional knowledge discovery, or
of generating scientific hypotheses without the aid of TM techniques, should be evident.
With a huge number of publications within the biomedical domain, (1) there is an increasing
number of references to URLs or online resources (e.g., publications, software, and so on), and (2)
acknowledgements of individuals and funding organisations. The aim of this dissertation may be
described as discovery-oriented (see Fayyad et al., 1996), i.e., to uncover previously unknown facts
or knowledge with regard to relationships/patterns involving these aspects using TM techniques.
1.1. Motivation
The unprecedented growth of biomedical literature has been coupled with the increasing practice of
referencing online resources (URLs) that become inaccessible over time (i.e., URL decay). This
project is motivated by providing an analysis of the scope of this problem. While previous studies
(Wren, 2004; Wren, 2008) have confirmed the issue of URL decay, this project will extend
previous research by providing a more holistic conclusion through the analysis of a broader
dataset.
Another motivation is similarly and partly derived from the unprecedented quantities of research
and publication within the biomedical domain. As biomedical research attracts billions of pounds of
research grants and investment from governmental, commercial, and academic sources worldwide
each year, it will be interesting to explore and uncover patterns such as the most active funding
agencies or institutions, and the relationships between funding agencies and research themes, and
between scientists and research themes.[1]
1.2. Project Aims
The aim of this project is to design and implement a system to enable the analysis of trends such as
URL decay (i.e., the phenomenon of inaccessible online resources), the types of online resources most
often referenced, and the exploration of acknowledgements: of individuals and organisations and their
respective roles in relation to the research/article where acknowledged. Therefore, the system must
enable extraction of so called external context from biomedical research: (1) URLs and (2)
acknowledgments. This software system will be referred to as EXTernal CONtext eXtractor 2 or
ExtConX2 hereafter.[2]
Moreover, ExtConX2 may be described as two systems in one: (1) URL extractor and (2)
acknowledgement extractor. Description of these subsystems follows:
(1) URL Extractor
The URL Extractor must enable (1) extraction of URLs, (2) determination, for each URL extracted,
of the type of resource referenced (i.e., Document, Databank, Software, or Organisation), and
(3) determination of whether the URL is accessible or not.
(2) Acknowledgement Extractor
The Acknowledgement Extractor must enable the identification and extraction of (1) named entities
(NEs) such as persons and organisations, (2) role expressions (REs), i.e., the acknowledged role of a
given NE, and (3) relations or associations between an NE and its corresponding RE.
1.2.1. Conceptualisation of Project Specific Terminology
Various project specific terminologies are used throughout this dissertation. This section provides
conceptualisation of these terms for easy referencing:
[1] Apart from providing practical applications as described in section 1.2.1, biomedical research can at times be
controversial (e.g., stem-cell research; health risks of cigarettes); hence, uncovering patterns between funding organisations and research could be important to maintaining scientific and academic integrity.
[2] 2 indicates the number of tasks the system handles: (1) URL extraction and (2) acknowledgement extraction.
(1) Conceptualisation of Role Entities:
i. Collaborator – any NE (person or organisation), apart from the author(s), that provides any
non-financial support (e.g., editorial, conceptual, technical, and so on).
ii. Funder – any NE that provides financial support to the corresponding research.
iii. Role Expression – the literal role of a collaborator or funder.
Note that collaborator / contributor, and sponsor / funder will be used interchangeably throughout
this report.
(2) Conceptualisation of Resource Types:
i. Databank – any database or repository of information which may facilitate dynamic
information retrieval.
ii. Document – any article, report, book, or any static information resource.
iii. Organisation – any organisation or institute (literal definition).
iv. Software – any computer program or application (literal definition).
1.3. Project Objectives
This project will aim to achieve the following objectives:
1. Design and implement a relational database (Db) schema to store extracted data.
2. Design and implement a module to extract URLs from documents, determine if the given
URL is accessible or not, determine type of resource (or URL) extracted/referenced and
insert this data into a database.
3. Design and implement a module to identify and extract funders and collaborators (i.e.,
persons/organisations and their respective roles) from acknowledgements and insert this
data into a database.
4. Design and implement a GUI that will facilitate exploration of system functionalities and
which provides general statistics.
5. Evaluate the proposed methodology.
1.4. Availability
The PubMed Central dataset will be available from gnode1 (gnode1.mib.man.ac.uk) for use
within this project.
1.5. Overview of Chapters
The remainder of this dissertation is organised as follows:
Chapter 2 – Background: provides a general description of the project background, such as Text
Mining (TM) processes and concepts, and a review of related work.
Chapter 3 – Software Requirements: provides a high-level description of the main requirements
of ExtConX2, and further defines functional and non-functional requirements.
Chapter 4 – System Design and Analysis: illustrates and discusses the overall system design and
individual software components of ExtConX2.
Chapter 5 – Implementation: discusses the implementation of the system by analysing selected implementation components.
Chapter 6 – Evaluation: presents and discusses the results of the knowledge discovery stage of the dissertation and the evaluation of the adopted methods.
Chapter 7 – Conclusion: concludes the dissertation by reflecting on the project aims, the
limitations of the system, and suggestions for future work.
2. Background
2.1. Text Mining
TM generally involves the application of techniques such as Information Retrieval (IR), Natural
Language Processing (NLP), Information Extraction (IE), and Data Mining (DM) (JISC, 2006;
Uramoto et al., 2004) to unstructured text. Hearst (2003) summarises the general notion of TM as:
the discovery by computer of new, previously unknown information, by automatically [or
semi-automatically] extracting information from different written resources. A key element
is the linking together of the extracted information to form new facts or new
hypotheses to be explored further by more conventional means of experimentation.
While TM is often an iterative process, its techniques/stages are generally applied in an ordered
manner; TM, or knowledge discovery, is a process-oriented activity. Further, because TM is a
relatively new research field, the concepts used are not always consistent across the literature (see
Hotho et al., 2005; Fayyad et al., 1996). While it is not within the scope of this report to discuss
this issue further, it is important to acknowledge it. Hence, this section will briefly review the processes,
techniques, and concepts involved within TM. This ought to clarify the conceptual foundation and
aid the understanding of further description of the overall project pursued.
2.1.1. Information Retrieval
Information retrieval is a discipline concerned with the finding of
documents/information (Hotho et al., 2005). IR covers a wide variety of research areas such as
document classification and categorisation, data visualisation, filtering, modelling, and so forth
(Baeza-Yates & Ribeiro-Neto, 1999). Often-referenced IR systems are search engines such as
Yahoo[3] and Google[4], which identify documents/information according to the user's search queries
(JISC, 2006). IR systems within the biomedical domain include Entrez PubMed and PubMed
Central (PMC). PubMed® is a free resource which provides access to MEDLINE® (Medical
Literature Analysis and Retrieval System Online), the U.S. National Library of Medicine's (NLM)
database of citations and abstracts. Currently, PubMed contains over 19 million references from
approximately 5,400 biomedical journals published worldwide (NLM, 2010a). PubMed Central is
the corresponding (free) full-text digital archive developed and managed by the U.S. National Institutes
of Health's (NIH) National Centre for Biotechnology Information (NCBI).
[3] www.yahoo.co.uk
[4] www.google.co.uk
Moreover, within the context of TM or knowledge discovery process, IR refers to the process of
finding and retrieving appropriate documents relevant to some particular problem (JISC, 2006).
While IR is considered a sub-process of NLP by some researchers (e.g., Polajnar, 2006), within
this project IR will be regarded as a separate process antecedent to NLP.
2.1.2. Natural Language Processing
Natural language processing is concerned with the problem of understanding natural language (NL)
by the use of computers (JISC, 2006; Hotho et al., 2005). Due to the inherent ambiguity of NL, the
complexity of analysing NL by machine is an evident reality. Thus, NLP is commonly
divided into several layers of processing (Hahn & Wermter, 2006): the lexical, syntactic, and semantic
levels. Lexical level processing deals with how words can be recognised, analysed, and
identified to enable further processing (Hahn & Wermter, 2006). The syntactic level analysis deals
with identification of structural relationships between groups of words in sentences, and the
semantic level is concerned with the content-oriented perspective or the meaning attributed to the
various entities identified within the syntactic level (Hahn & Wermter, 2006).
(1) Lexical Level Processing
The tokenisation process, or the segmentation of text into individual meaningful elements, is the
initial stage of lexical level processing. Tokens such as words, acronyms, abbreviations, numbers,
and so on are linguistically identified (Hahn & Wermter, 2006). Other interrelated sub-processes
associated with lexical level processing include (Hahn & Wermter, 2006):
Part-Of-Speech (POS) tagging, which is considered the core of this level of processing
Morphological analysis (the association/linking of varied forms of lexical elements to their
canonical base form)
Unknown word handling
Acronym detection
Named Entity Recognition (NER)
An example of a widely used and reliable POS tagger within the biomedical domain is GENIA
Tagger v3.0 (Tsuruoka et al., 2005). Computational lexicons (e.g., BioThesaurus) are also utilised
at this stage to aid the overall lexical level processing. While lexicons often vary depending
upon domain/task, in general, and at a bare minimum, computational lexicons contain lexical
elements such as the full or canonical base forms of words together with additional linguistic
information (e.g., part-of-speech category and morphological information).
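As an illustration of the tokenisation step described above, the following is a minimal Python sketch of a naive tokeniser. The regular expression and its behaviour are illustrative assumptions only; production lexical processing (e.g., the GENIA Tagger) handles abbreviations, hyphenation, and biomedical nomenclature far more carefully.

```python
import re

# Word-like runs (optionally joined by hyphens/apostrophes), or single
# punctuation marks: a deliberately naive segmentation rule.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+(?:[-'][A-Za-z0-9]+)*|[^\sA-Za-z0-9]")

def tokenise(text):
    """Segment text into word-like tokens and individual punctuation marks."""
    return TOKEN_PATTERN.findall(text)

print(tokenise("We thank the NIH (grant R01-12345) for support."))
```

Note how the hyphenated grant identifier survives as one token while the brackets and full stop become separate tokens, which is the kind of decision a real tokeniser must make systematically.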
(2) Syntactic Level Processing
Common methods applied within the syntactic level processing are chunkers and parsers.
Chunkers partition or label sentences into phrasal units (i.e., noun, preposition, verb, or adjective
phrases) (see Hahn & Wermter, 2006, p.23 for details), and parsers identify clauses such as word
sequences containing a subject and a predicate (Hahn & Wermter, 2006, p.25). An example of a
domain-specific (i.e., biomedical) shallow parser is the GENIA Tagger. Moreover, the application of
named entity recognisers (NERs) at this level of processing has proven beneficial within biological
text mining, as most named entities are contained within noun or prepositional phrases (Hahn &
Wermter, 2006). Some examples of NER systems include ANNIE for, e.g., person and organisation
name recognition (Cunningham et al., 2010), LINNAEUS for species name recognition (Gerner et
al., 2010), and TerMine for technical term recognition.
Resources commonly utilised to aid the overall syntactic level process are grammars and
treebanks. Treebanks are text corpora with syntactic annotations at the sentence level (i.e.,
POS tags and syntactic structures), and grammars contain some subset of linguistic syntax,
commonly rules or constraints which characterise morpho-syntactic and nonterminal grammar
categories (see Hahn & Wermter, 2006, p.21). An example of a widely used treebank (within the
biomedical domain) is GENIA Treebank v1.0, which is based upon annotated PubMed abstracts
(Kim & Tsujii, 2006; Tateisi, 2004).
(3) Semantic Level Processing
The semantic level analysis consists of linking terms or concepts to form logical/knowledge
propositions (Hahn & Wermter, 2006). This level of processing is directly based upon the
combination of the lexical and syntactic level analysis. For instance, within the scope of this
project, the semantic level processing involves the linking of NEs and their respective roles.
2.2. Information Extraction
Information extraction may be described as a subsequent stage of NLP. IE is the process of
automatically or semi-automatically extracting predefined data from unstructured text (JISC, 2006)
and inserting this data into forms or templates (see McNaught & Black, 2006, p.143), which
subsequently convey the data as factual information (Hotho et al., 2005). As defined by the
Message Understanding Conferences (MUC), tasks commonly associated with IE are:
Recognition and classification of words denoting names of persons, organisations, and locations,
and numeric and temporal expressions (i.e., the named entity task).
Identifying links between references to extracted entities (i.e., the coreference task).
Extracting identifying and descriptive attributes of named entities (i.e., the template element
task).
Extracting relationships between named entities (i.e., the template relation task).
Extracting events in combination with either the template element or template relation tasks
(McNaught and Black, 2006, p.147).
Moreover, a commonly used method to aid the overall NER process is the use of gazetteers
(i.e., lists defining NEs such as persons, organisations, etc.).
Data mining refers to the process of identifying patterns in (often large) structured datasets
(such as databases). Within the TM process, DM techniques are typically applied to the facts extracted
during the IE stage with the purpose of identifying patterns and discovering new knowledge (JISC, 2006).
2.2.1. Rule-based and Statistical-based Approaches to IE
Methods which may be used for IE tasks include rule-based (e.g., Common Pattern Specification
Language; Java Annotated Pattern Engine) and statistical-based (e.g., Support Vector Machines;
Hidden Markov Models) approaches. Both types of methods have their strengths and weaknesses.
For instance, statistical-based methods tend to require more computing resources than rule-based
methods, which tend to be more lightweight (thus resulting in faster processing). On the other
hand, the rule-based or knowledge engineering approach is domain- or even task-dependent, while the
statistical or automatic training approach is relatively domain independent (Appelt & Israel, 1999).
Hence, domain portability is quite straightforward with statistical-based approaches (Appelt &
Israel, 1999). While both methods can be equally labour and time intensive, they differ
in their inherent way of designing an IE application. The rule-based approach often requires domain
knowledge and a skilled knowledge engineer to implement effective rules for the IE task. On the
other hand, the statistical-based approach requires annotator(s) with some knowledge of the domain
and task in order to annotate a training corpus that models the information sought to be extracted
(Appelt & Israel, 1999).
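To make the knowledge-engineering approach concrete, the following Python sketch mimics, in miniature, what hand-written IE rules do. The trigger phrases, the regular expressions, and the example sentences are invented for illustration; they are not the rules used in this project, and real rule languages such as CPSL or JAPE operate over annotation graphs rather than raw strings.

```python
import re

# Each hypothetical rule pairs a role label with a pattern: a trigger phrase
# (the role-expression context) followed by a crude capture of the adjacent
# capitalised named entity.
RULES = [
    ("Funder", re.compile(
        r"(?:funded|supported)\s+by\s+(?:the\s+)?"
        r"([A-Z][\w&-]*(?:\s+[A-Z][\w&-]*)*)")),
    ("Collaborator", re.compile(
        r"[Tt]hank\s+(?:Dr\.?\s+)?"
        r"([A-Z][\w-]*(?:\s+[A-Z][\w-]*)*)")),
]

def apply_rules(sentence):
    """Return (role, entity) pairs produced by every rule that fires."""
    matches = []
    for role, pattern in RULES:
        for m in pattern.finditer(sentence):
            matches.append((role, m.group(1)))
    return matches

print(apply_rules("This work was funded by the Wellcome Trust."))
print(apply_rules("We thank Dr. Goran Nenadic for helpful comments."))
```

Even this toy shows the trade-off discussed above: writing and debugging such patterns requires domain knowledge, but the rules run cheaply and their behaviour is transparent.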
2.2.2. IE Application Development Tools/Software
Many tools/software packages are available to aid scientists and developers in creating IE applications, e.g.,
CAFETIERE (see Black et al., 2005), LingPipe[5], MinorThird[6], and GATE (General Architecture
for Text Engineering)[7]. A common denominator across the latter three tools is that they provide
Java APIs for use within custom-built standalone applications.
(1) CAFETIERE (Conceptual Annotation for Facts, Events, Terms, Individual Entities, and
RElations) is a rule-based information extraction system for the various IE tasks specified within its
title. CAFETIERE provides various NLP components, such as tokenisers, POS taggers, NERs, etc., for
text pre-processing, and a customised rule-based language that may be used for semantic level
processing of text (Black et al., 2005). Further, CAFETIERE provides a graphical user interface
(GUI) (i.e., the Analyser and Annotation Editor) which supports viewing and editing annotations
(useful for the iterative development of IE rules).
(2) LingPipe may be described as a toolkit for processing text using computational linguistics; it
primarily contains Java APIs for NER, POS tagging, classification, and so on.
(3) MinorThird is another toolkit containing a collection of Java APIs for various NLP and IE
tasks. In contrast to LingPipe, MinorThird also provides a GUI for invoking APIs and debugging or
manipulating annotations.
(4) GATE may be considered the most mature of these tools, due to its extensive
documentation and user-friendly GUI. GATE is in essence an integrated development environment
providing reusable processing resources that enable the development and deployment of customised
applications to solve NLP problems/tasks (Cunningham et al., 2010). Processing resources are
individual NLP processing components, such as tokenisers, POS taggers, NERs, etc., which may be
applied to individual documents or a corpus in a customised order to create an IE application.[8]
These resources are collectively known as a Collection of REusable Objects for Language
Engineering (CREOLE). GATE may be used to create annotations over documents (for instance, to
be used with statistical-based approaches) or to create IE applications which may be used outside the
GATE interface via APIs (GATE Embedded)[9] (Cunningham et al., 2010).
2.3. NLM Journal Archiving and Publishing DTDs
Both PubMed and PubMed Central (PMC) documents are provided in XML formats (defined by
NLM Journal Archiving and Publishing DTDs) as an alternative to the common Portable Document
Format (PDF). As previously mentioned, PubMed contains citations and abstracts, and PMC is the
[7] http://www.Gate.ac.uk
[8] Java APIs from LingPipe, Google, Yahoo (and many more) for NLP/IE are provided as processing resources.
[9] The GATE API used to integrate an IE application into a Java application.
corresponding full-text digital archive. The dataset from PMC, which contains approximately
190,000 documents, will be used in this project.
While the NLM Journal Archiving and Interchange Tag Suite was created in order to provide a common
format for publishers and archives to exchange journal content (NLM, 2010b), its usefulness for TM
applications has been widely appreciated. This Tag Suite defines elements and attributes that describe
full article contents such as metadata, acknowledgements, abstract, article body, citations, URLs,
and so on. This has proven beneficial to researchers who may only be interested in particular
sections of articles, e.g., abstracts or acknowledgements. For instance, instead of using regular
expressions over a whole document to identify particular sections of interest, a researcher could use
an XML parser[10] to parse documents and extract the relevant sections. This has at least a couple of
advantages over the use of regular expressions. Provided that a tag set exists for the particular
document content of interest, the use of XML tags to extract this content is often
more accurate than using regular expressions (hence improving results). In addition, when
designing a TM application, which often processes huge amounts of documents, the
opportunity to parse documents only for specific content rather than process whole documents
can significantly improve performance (i.e., response time and use of computing resources).
Currently there exist seven different Tag Suite versions, or Document Type Definitions (DTDs),[11]
for PMC articles. However, these versions are consistent with regard to the tags used for the
content of interest to this project, namely acknowledgements and URLs.
Table 1 describes XML tags which will be used in the implementation of ExtConX2 (NLM,
2010c):
Table 1 – Relevant XML Tags
(1a) <ext-link> </ext-link> – Tag defining an external resource outside the scope of an article.
(1b) ext-link-type="uri" – Tag (1a) must contain the attribute ext-link-type with the value uri, indicating that the tag contains a URL.
(1c) xlink:href – Finally, within the tag element, this third attribute identifies the external link itself.
(2) <ack> </ack> – Tag defining the acknowledgement content/section.
Below is a simplified XML skeleton in the NLM Archiving and Interchange format. Samples of the tags
described in Table 1 may be found at lines 28 and 34 in the following example:
[10] XML parser generally refers to an API that enables one to programmatically read XML files and extract content of
interest. Common APIs used for XML parsing in Java include the Document Object Model and the Simple API for XML.
[11] Tag Suite versions include: 1.0, 1.1, 2.0, 2.1, 2.2, 2.3, and 3.0 (current).
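The XML-parsing strategy described in this section can be sketched as follows. This is an illustrative Python example (the actual system was implemented in Java), and the embedded document fragment is a hypothetical, heavily simplified stand-in for a real PMC article; only the <ext-link>, ext-link-type="uri", xlink:href, and <ack> constructs from Table 1 are taken from the source.

```python
import xml.etree.ElementTree as ET

# A hypothetical, heavily simplified PMC-like fragment; real articles follow
# the NLM Journal Archiving and Publishing DTDs and are far richer.
SAMPLE = """<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <body>
    <p>See <ext-link ext-link-type="uri"
        xlink:href="http://www.ebi.ac.uk">EBI</ext-link>.</p>
  </body>
  <back>
    <ack><p>We thank the gnTeam for the dataset.</p></ack>
  </back>
</article>"""

# ElementTree stores namespaced attributes under their full URI.
XLINK = "{http://www.w3.org/1999/xlink}href"

root = ET.fromstring(SAMPLE)
# Extract every URL carried by an <ext-link ext-link-type="uri"> element.
urls = [e.get(XLINK) for e in root.iter("ext-link")
        if e.get("ext-link-type") == "uri"]
# Extract the acknowledgement text from the <ack> section.
ack = "".join(root.find(".//ack/p").itertext())

print(urls)  # ['http://www.ebi.ac.uk']
print(ack)   # We thank the gnTeam for the dataset.
```

This illustrates the performance point above: the parser visits only the elements of interest, so whole-document regular-expression scans are unnecessary.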
2. Design and implement a module to extract URLs from PMC XML documents
Functional Requirement: [R5]. The module shall be able to identify and extract URLs
from PMC XML documents.
Risk: Low.
External Dependency: Availability of Shared Db.
Priority: Intermediate.
Pre-condition: Objective 1, and Objective 6 (A)
Post-condition: A set of extracted URLs.
Difficulty: Intermediate
Process overview:
1. Objective 6, process A (Table 13).
2. Parse document and extract URL(s).
Table 10 – Implementation Objective 3
3. Design and implement a module to determine type of resource (or URL)
extracted/referenced.
Functional Requirement: [R6]. The module shall be able to identify the type of online
resource referenced: Databank, Document, Organisation, or Software.
Risk: Low.
External Dependency: Availability of Shared Db.
Priority: High.
Pre-condition: Objective 2 (this module is in essence a sub-module of Obj. 2).
Post-condition: Return type of resource or URL referenced (i.e., Databank,
Document, Organisation, or Software).
Difficulty: Intermediate
Process overview:
1. Get URL context.
2. Determine resource type by:
a. keyword(s) within the URL string,
b. keyword(s) within the URL reference context (i.e., title of reference and/or description of reference), or
c. keyword(s) within the article body where the URL is cited.
3. Return resource type.
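The keyword-based process above can be sketched as follows. The keyword lists here are invented placeholders, not the actual ExtConX2 keyword sets, and the sketch covers only the URL-string and reference-context steps (the article-body fallback would follow the same pattern).

```python
# Hypothetical keyword tables: the report does not reproduce the real lists,
# so these entries are illustrative only.
TYPE_KEYWORDS = {
    "Databank": ["database", "repository", "bank"],
    "Document": ["pdf", "article", "paper", "report", "supplement"],
    "Software": ["software", "tool", "download", ".exe", ".jar"],
    "Organisation": ["institute", "university", "centre", "society"],
}

def classify_url(url, context=""):
    """Look for type keywords first in the URL string itself, then in the
    reference context (title/description), mirroring steps 2a and 2b."""
    for text in (url.lower(), context.lower()):
        for rtype, keywords in TYPE_KEYWORDS.items():
            if any(k in text for k in keywords):
                return rtype
    return "Unknown"

print(classify_url("http://www.expasy.org/tools/"))  # Software
print(classify_url("http://example.org/x", "a curated database of proteins"))  # Databank
```

Ordering matters in such a cascade: checking the URL string before the context encodes the assumption that the URL itself is the more reliable signal.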
Table 11 – Implementation Objective 4
4. Design and implement a module to determine URL status: active or inactive link
Functional Requirement: [R7]. The module shall be able to determine if a URL is active or
inactive (accessible or not).
Risk: Low.
External Dependency: No direct dependency, see pre-condition.
Priority: High.
Pre-condition: Objective 2 (this module is in essence a sub-module of Obj. 2).
Post-condition: Return URL status: 0/FALSE if inaccessible or 1/TRUE if
accessible.
Difficulty: Easy
Process overview:
1. Get URL to be checked (see Obj. 2).
2. Check if URL is active/inactive: if inactive return
0/FALSE, else (if active) return 1/TRUE.
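The activity check in the process overview can be sketched as follows. This is an illustrative Python sketch, not the Java implementation; it makes two simplifying assumptions that a real decay study would need to revisit: redirects count as active, and transient network failures are treated the same as permanent decay.

```python
import urllib.error
import urllib.request

def url_is_active(url, timeout=10):
    """Return True (1/TRUE) if the URL responds with a non-error status,
    False (0/FALSE) otherwise."""
    try:
        # HEAD avoids downloading the resource body just to test liveness.
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status < 400
    except (urllib.error.URLError, ValueError, OSError):
        # Malformed URLs, DNS failures, refused connections, timeouts, and
        # HTTP error statuses all end up here.
        return False

# A malformed URL or an unresolvable host is reported as inactive.
print(url_is_active("not-a-url"))
print(url_is_active("http://nonexistent.invalid/"))
```

In practice a batch checker would also rate-limit requests and retry transient failures before declaring a URL decayed.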
Table 12 – Implementation Objective 5
5. Design and implement a module to identify and extract sponsors and contributors (NEs such as persons/organisations and their respective roles) from acknowledgments
Functional Requirements: [R8]. The module shall be able to identify NEs, such as persons
and organisations/institutions.
[R9]. The module shall be able to identify REs (i.e., sponsors/funders or collaborators/contributors).
[R10]. The module shall be able to link NEs to their respective REs.
[R11]. The module shall be able to extract NEs and their respective roles from annotated documents.
Risk: High. Main reasons for risk level:
Dependent upon the use of appropriate methodology, and efficient use of tools (i.e., GATE 5.2.1).
Time constraint: approaching project deadline.
External Dependency: GATE 5.2.1 (see Section 2.2.2).
Priority: High.
Pre-condition: Objective 1, and Objective 6 (A).
Post-condition: Return NEs and corresponding REs identified.
Difficulty: Hard
Process overview:
1. Implementation objective 6, process A (see Table 13).
2. Parse document and extract acknowledgement passage.
3. Process acknowledgement passage through the text processing application designed with GATE 5.2.1 (which returns a GATE XML document with tags representing the annotated entities: NEs and corresponding REs).
4. Parse the GATE XML document.
5. Extract annotated NEs and their respective roles.
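Steps 4-5 (extracting annotated NEs and associating them with their REs) can be illustrated with the following toy sketch. The annotations are hard-coded stand-ins for what the GATE pipeline would produce, and linking each NE to the nearest preceding RE is an assumed simplification of the actual association logic.

```python
# Hypothetical annotation tuples of (character offset, kind, text), standing
# in for spans parsed out of a GATE XML document.
ANNOTATIONS = [
    (0, "RE", "funded by"),
    (10, "NE", "Wellcome Trust"),
    (40, "RE", "technical assistance"),
    (65, "NE", "J. Smith"),
]

def link_roles(annotations):
    """Pair each NE with the most recent preceding RE in offset order."""
    pairs, current_role = [], None
    for _, kind, text in sorted(annotations):
        if kind == "RE":
            current_role = text
        elif kind == "NE" and current_role is not None:
            # The role is deliberately not reset after a match, since one RE
            # can cover several NEs ("We thank A, B and C for comments").
            pairs.append((text, current_role))
    return pairs

print(link_roles(ANNOTATIONS))
# [('Wellcome Trust', 'funded by'), ('J. Smith', 'technical assistance')]
```

The design choice of not resetting the role after each pairing reflects how acknowledgement sentences typically list several entities under one role expression.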
Table 13 – Implementation Objective 6
6. Design and implement a module to handle database operations: (1) ensure synchronisation of retrieval of documents for processing and documents already processed, (2) insert extracted/processed data into the system database.
Functional Requirements:
[R12]. The module shall be able to synchronise retrieval of documents for processing (from the Shared Db) and documents already processed (in the System Db).
[R13]. The module shall be able to insert a given tuple of data into the system database.
Risk: Low.
External Dependency: -
Priority: High.
Pre-condition: Implementation objectives 2-4, or 5.
Post-condition: Relevant data is inserted into the System Db.
Difficulty: Easy
Process overview:
This module is separated into two different tasks: (A) synchronisation of processed documents (in the System Db) and of retrieval of documents (from the Shared Db) for processing, and (B) data insertion into the System Db.
A. Check the last document processed for role extraction / URL extraction:
a. if none, get the first document from the Shared Db (documents may be retrieved in ascending order, enabled by the auto-incremented keys of records in the Shared Db16),
b. else, get the auto-incremented id of the last document processed in the System Db and start the retrieval process from the Shared Db at the last document processed + 1.
B. Either get URL data (implementation objectives 2-4) or role data (implementation objective 5) and insert this data into the system database.
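Process A above can be sketched with SQLite standing in for the actual RDBMS. All table and column names here are assumptions for illustration; the dissertation's own schema is given in Figure 8.

```python
import sqlite3

# Shared Db stand-in: documents keyed by an auto-incremented id.
shared_db = sqlite3.connect(":memory:")
shared_db.execute("CREATE TABLE pmc_articles (id INTEGER PRIMARY KEY, xml TEXT)")
shared_db.executemany("INSERT INTO pmc_articles (xml) VALUES (?)",
                      [("<doc1/>",), ("<doc2/>",), ("<doc3/>",)])

# System Db stand-in: records the ids of documents already processed.
system_db = sqlite3.connect(":memory:")
system_db.execute("CREATE TABLE meta_data (doc_id INTEGER)")

def next_documents(limit):
    """Return up to `limit` unprocessed documents, resuming after the
    last processed id (or from the first document if none)."""
    last = system_db.execute("SELECT MAX(doc_id) FROM meta_data").fetchone()[0] or 0
    return shared_db.execute(
        "SELECT id, xml FROM pmc_articles WHERE id >= ? ORDER BY id LIMIT ?",
        (last + 1, limit)).fetchall()

def mark_processed(doc_id):
    """Record a document as processed in the System Db."""
    system_db.execute("INSERT INTO meta_data (doc_id) VALUES (?)", (doc_id,))
```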
Table 14 – Implementation Objective 7
7. Design and implement a GUI that will facilitate exploration of system functionalities and provide general statistics.
Functional Requirements:
[R14]. The module shall be able to display general statistics upon user request, such as: (1) number of documents processed, (2) number of URLs extracted, (2a) descriptive statistics of URL status (i.e., by year; in total), and (3) number of roles extracted.
[R15]. The module shall be able to accept user parameters for the number of documents to be processed.
Risk: Intermediate. Main reasons for risk level:
1. Time constraint: approaching project deadline.
2. Dependent on successful completion of previous modules.
External Dependency: No direct dependency, see pre-condition.
16 The implementation will take advantage of the available auto-incremented key within the Shared Db (and the corresponding foreign key in the System Db) to keep track of documents processed, or documents to be processed when a new session is initiated.
Priority: Intermediate.
Pre-condition: Implementation objectives 1-6.
Post-condition: Interactive GUI.
Difficulty: Intermediate.
Process overview: See Use Case Diagram (Figure 2).
Table 15 – (Implementation) Objective 8
8. Evaluation of the proposed methodology
Functional Requirement: N/A
Risk: Intermediate.
1. Time constraint: approaching project deadline.
2. Dependent upon successful completion of system modules.
External Dependency: No direct dependency, see pre-condition.
Priority: High.
Pre-condition: Completion of 1-4
Post-condition: -
Difficulty: Easy
Process overview:
1. Choose a random sample of results derived from previous steps and apply evaluation metrics (see Chapter 6).
3.2.3. Requirement Traceability Matrix
The Requirement Traceability Matrix (Table 16) maps User and System Functional Requirements against project objectives:
Table 16 – Requirement Traceability Matrix
Obj. 2 Obj. 3 Obj. 4
[R01] X
[R02] X
[R03] X
[R04] X
[R05] X
[R06] X
[R07] X
[R08] X
[R09] X
[R10] X
[R11] X
[R12] X X
[R13] X X
[R14] X
[R15] X
3.3. Non-Functional Requirements
In addition to functional requirements, a set of non-functional requirements has been derived from the requirement elicitation and analysis stage of the SRE process. While non-functional requirements typically include product, external, and organisational requirements (Sommerville, 2004), this dissertation solely focuses on product requirements, specifically, system properties to guide the architectural design and implementation of ExtConX2.
1. Extensibility
Within software engineering, extensibility refers to the design/implementation of a system that takes into consideration potential future extension of system functionalities (Wikipedia, 2009). Extensibility may also be described as a system architecture designed to accommodate future changes with minimal effort. For instance, a system architecture based upon modularity or compartmentalisation, in which various software functions/components are separated by concern (SoC),17 may address this requirement. Use of an Object Oriented Programming (OOP) language may also help achieve this end.
2. Maintainability
The notion of maintainability is similar to extensibility in some respects, as the approaches to accommodating these requirements may intersect. Nevertheless, the aim of this requirement is to accommodate effortless maintenance of the system, to ease the amendment of features in the implementation, and to help locate potential hidden software bugs. The use of an OOP language, SoC, and detailed documentation may be used to fulfil this requirement.
3. Reusability
The system ought to enable reusability of modules to the extent possible. This will facilitate both extensibility and maintainability, in addition to providing software components which may be used within future (unrelated) applications/research. The application of SoC at class level may be used to fulfil this requirement.
17 Separation of concern (SoC) refers to a logical separation of system functionalities. For instance, an analogy may be drawn from the Model-View-Controller (MVC) paradigm often used in web applications.
4. System Design and Analysis
This chapter is divided into two general sections:
a) Generic overview of the system architecture/design which describes high-level approaches
to extraction of external context (i.e., URLs and acknowledgements).
b) System Design and Analysis.
4.1. Generic System Architecture
A high-level overview of ExtConX2 is provided below (Figure 3; see footnote 18 for a description of the arrows). A brief description follows (Figure 3):
Figure 3 – High-Level System Architecture18
1. The Database Module is responsible for (1) synchronisation between the Shared Database
(containing PMC XML documents) and the System Database, (2) retrieval of documents
(Db Traverser) for processing, and (3) insertion of extracted/processed data (Data Inserter)
into the System Database.
2. The URL Module is responsible for (1) parsing PMC documents and extracting URLs (URL Extractor), (2) determining whether a given URL is accessible or not (URL Status), and (3) determining the type of resource referenced (Resource Type).
3. The IE Module is responsible for role extraction (IE Application). This module
encapsulates text pre-processing and IE task required to identify and extract NEs and
respective REs.
18 Solid arrows represent data flow; dashed arrows may be described as sub-module (of): the arrowheads point toward the super-module.
4. The Parser Module encapsulates the XML parser. In addition, it handles the NLM Journal Archiving and Interchange DTDs, which are needed to parse PMC documents. The DTDResolver redirects the XML System IDs to a local repository where the DTDs are stored.
The ExtConX2 architecture is guided by the design principle of SoC at the system level: the Database Module (including the Shared Db and System Db) encapsulates database operations (i.e., the Database Layer), and the URL Module and IE Module (including the Parser Module) encapsulate application logic (i.e., the Application Layer). This approach is known as a subsystems architecture, where each subsystem represents a different level of abstraction (Bennett et al., 2006).19 This could be considered an approach to fulfilling the non-functional requirements previously defined (Section 3.3).
4.2. Description of External Context Extraction
This section provides a high-level description of external context extraction based upon the generic system design (Figure 3).
4.2.1. URL Module
The URL Module (refer to Figure 3) performs three main tasks: (1) extraction of URLs from PMC documents, (2) determination of the resource type for each URL extracted, and (3) determination of whether a URL is active or inactive (i.e., whether the resource is accessible or not).
An approach to process a given sentence containing a citation to an online resource is illustrated
below (Figure 4).
19 The system is divided by SoC: the Database Layer deals solely with retrieving documents and inserting data (this includes the RDBMS), while the Application Layer is solely responsible for application logic.
Figure 4 – URL Module Overview
Given the following sentence:
1. The report was provided by World Health Organisation (http://www.who.int).
The output (Processed Data) of the given process (Figure 4) ought to be as follows (Table 17):
Table 17 – Ideal Results from URL Extraction Process
URL Type of Resource URL Status Date Inserted
(1) http://www.who.int Document Active 2010-09-01
A more detailed description follows. The following subsections describe (a) the extraction of URLs and determination of URL status, and (b) the determination of resource type (from the extracted URL context), respectively:
a) URL Extraction
As PMC documents are provided in the NLM Journal Archiving and Interchange format (XML), the unique tag provided for identifying URLs may be used to extract them. For instance, given a hypothetical example of a URL within a PMC document (disregarding any context):
Another valid value for ext-link-type is: ftp (File Transfer Protocol).
21 An external URL refers to resources/URLs outside the scope of the article. For instance, there exist other URLs within PMC documents (which may be described as internal) used for various XML-specific validation (e.g., namespace declarations, and so on); these are non-valid.
6 </ext-link>
7 ).
8 </citation>
9 </ref>
A potential solution to determine the referenced resource type is:
1. Analyse the extracted URL string for keywords that characterise specific URL classes (e.g., report could be used as a keyword indicating the Document resource type); if unable to determine the resource type, try the next process (2):
2. Get the URL context (example lines 3-7):
The report was provided by World Health Organisation (http://www.who.int/report).
3. Subsequently, analyse this context (word by word) for keywords, starting from the location of the URL within the string until the start of the sentence (see bold text in the example given above).
In this example, report could be used as a keyword to determine the resource type (Document). For each of the URL types, a list of characteristic keywords will be constructed and used.
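The two-step keyword look-up described above can be sketched as follows; the keyword-to-type table is a small illustrative sample, not the dissertation's actual lists.

```python
# Illustrative keyword-to-resource-type table (a sample, not the full lists).
KEYWORDS = {
    "report": "Document",
    "databank": "Databank",
    "software": "Software",
    "organisation": "Organisation",
}

def resource_type(sentence, url):
    # Step 1: check the URL string itself for characteristic keywords.
    for keyword, rtype in KEYWORDS.items():
        if keyword in url.lower():
            return rtype
    # Step 2: scan the context word by word, right to left, from the
    # URL's position back toward the start of the sentence.
    context = sentence[:sentence.find(url)]
    for word in reversed(context.lower().split()):
        word = word.strip("().,;")
        if word in KEYWORDS:
            return KEYWORDS[word]
    return None  # resource type could not be determined
```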
4.2.2. IE Module
The IE Module encapsulates the IE application which is responsible for role extraction.
Specifically, given an acknowledgement sentence, the IE Module must enable the identification and
extraction of NEs and their respective REs.
a) Acknowledgement Extraction
A rule-based approach in conjunction with gazetteers may be adopted for role extraction. Apart from the common TM stages previously discussed (see Section 2.1), some notable highlights are:
1. The use of gazetteers to define:
i. NEs: persons and organisations
ii. REs: collaborators and funders (Table 19)
Table 19 – Examples of REs for Collaborators and Funders
Collaborator Roles          Funder Roles
Editorial support           Financial support
Reviewing the manuscript    Grant-in-aid
Helpful comments            Grant
Helpful suggestions         Funding
2. A rule-based approach applied at semantic level processing (see Section 2): linking of NEs
and their respective REs (Role Matcher: Figure 5).
3. Subsequently, programmatically extract these sets of NEs and corresponding REs (IE) and
insert them into a predefined template/database.
The generic NLP/IE pipeline is given in Figure 5.
Figure 5 - Generic NLP/IE Pipeline
For instance, consider the following acknowledgements:
1. The authors are grateful to John Dough for reviewing the manuscript.
2. This research was funded by BBSRC.
The NLP/IE process is as follows:
a) Get NEs
i. Person NE: John Dough
ii. Organisation NE: BBSRC
b) Get REs
i. Collaborator RE: reviewing the manuscript
ii. Funder RE: funded
c) Identify the respective RE for each NE:
Patterns which indicate an association between NE and RE, identified from the above examples, are:
1. NE for RE (collaborator)
2. RE by NE (funder)
Hence, the application of rules to identify the given patterns will be sufficient at the semantic level of processing, for the given example.
d) Insert this data into a predefined template/database:
Table 20 - Results of TM Process
(1) Name Entity: John Dough
Role (enumeration): Collaborator
Role Expression: reviewing the manuscript
(2) Name Entity: BBSRC
Role (enumeration): Funder
Role Expression: funded
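The two linking patterns above (NE for RE for collaborators, RE by NE for funders) can be illustrated with regular expressions. The NE and RE sub-patterns below are simplified stand-ins for the GATE-based recognisers, so this is a sketch of the idea rather than the implemented rules.

```python
import re

# Simplified stand-ins for the NE and RE recognisers.
NE = r"(?:[A-Z][a-z]+ )+[A-Z][a-z]+|[A-Z]{2,}"   # 'John Dough' or 'BBSRC'
COLLAB = r"reviewing the manuscript|helpful comments|helpful suggestions"
FUNDER = r"funded|supported"

def extract_roles(sentence):
    """Link NEs to REs using the 'NE for RE' and 'RE by NE' patterns."""
    roles = []
    for ne, expr in re.findall(rf"({NE}) for ({COLLAB})", sentence):
        roles.append((ne, "Collaborator", expr))
    for expr, ne in re.findall(rf"({FUNDER}) by ({NE})", sentence):
        roles.append((ne, "Funder", expr))
    return roles
```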
4.3. System Architecture
System architecture is the organisation of a system in terms of its software components, including subsystems and the relationships and interactions among them, and the principles that guide the design of that software system (Bennett et al., 2006, p.340). System architecture can directly influence the non-functional features of a system (Bennett et al., 2006). For instance, a subsystems architecture is known for advantages such as maximising reusability and improving maintainability, among other things (Bennett et al., 2006). Therefore, the guidance of the non-functional requirements previously defined (Section 3.3) has been a central factor in the architectural design and implementation of ExtConX2.
4.3.1. Subsystems Architecture
The design of ExtConX2 is based on a subsystems architecture, i.e., SoC at the system level, or a subdivision into software components which share some common properties (Bennett et al., 2006). This means that a system is subdivided into different layers of abstraction, or layers of service, which are responsible for different aspects of the functionality of the system as a whole (Bennett et al., 2006, p.350). This approach has several known advantages, such as:
Maximise reusability
Aid developers to handle complexities
Improve maintainability
Aid portability
ExtConX2 has three layers of abstraction:
1. Presentation Layer
The presentation layer is the topmost layer and is responsible for human-computer interaction (HCI). This layer enables interaction between the user and system functionalities through a graphical user interface (GUI). A user is able to control/initiate system functionalities (encapsulated by layer 2, the application layer) through input parameters, and view output resulting from the processing of the application layer. The presentation layer satisfies functional user requirements 1-4 and functional system requirements 14-15 (refer to Section 3.2).
2. Application Layer
The application layer is responsible for domain logic or domain specific functionalities of
ExtConX2: the core functional requirements of the system (i.e., functional system
requirements 5-11).
3. Database Layer
The database layer encapsulates the relational database management system (RDBMS) and system-specific database operations, such as synchronisation between the Shared Db and System Db (i.e., between processed documents and PMC documents available for processing), retrieval of documents to be processed, and insertion of data into the System Db. The database layer satisfies functional system requirements 12-13.
The architecture of ExtConX2 is based on layered subsystems (see Bennett et al. 2006, p.351): any
layer N can only use the services provided by the layer immediately below it (N -1). For instance,
the presentation layer cannot directly use any services provided by the database layer (see Figure
6). This level of abstraction minimises dependencies among layers (and software components) and
facilitates extensibility and maintainability of the system (Bennett et al., 2006).
Figure 6 - ExtConX2 Layered Subsystems
4.4. System Design
This section provides a detailed description of the system design: the database, application, and presentation layers. All illustrations provided are based on class implementations. The complete system design is provided in Appendix A, Figure 14.
4.4.1. Database Layer
The database layer encapsulates system functionalities or services which are responsible for database operations. This layer provides services for the application layer directly above it (N + 1). Figure 7 illustrates the main components of the database layer.
Figure 7 - ExtConX2 Database Layer
a) Description of Database Layer
1. Db Manager - The Db Manager is responsible for maintaining synchronisation between the Shared Db (containing PMC XML documents) and the System Db. This is achieved by two methods: (1) one determines the last existing PMC document in the Shared Db, and (2) the other determines the last processed PMC document stored in the System Db.22
2. Db Traverser - The Db Traverser is responsible for retrieving data from the Shared Db. In addition, the Db Manager is utilised by the Db Traverser to ensure synchronisation.
3. Data Inserter - The Data Inserter encapsulates methods to insert processed data into the System Db.
b) Relational System Schema
Below is the Relational Database Schema used by ExtConX2; the EER Diagram may be viewed in Appendix A, Figure 13. The Shared Db (in part)23 and the System Db are both represented in Figure 8.
PMC Articles contains PMC articles in XML format, and is linked from the Shared Db.
The System Db contains four relations: Meta Data, URL, Role, and Acknowledgement.
22 Both methods rely on the auto-incremented key and foreign key in the Shared Db and System Db respectively.
23 Only the relevant relation (PMC-Articles) and attributes of the Shared Db are included in the Relational/EER diagram.
Figure 8 - Relational Database Schema24
4.4.2. Application Layer
The application layer encapsulates domain logic: functional system requirements 5-11. This layer is
further subdivided into three separate modules (see Figure 9):
URL Module, which contains classes for URL extraction and related processes.
IE Module, which contains classes for role extraction and related processes.
Parser Module, which encapsulates classes for parsing and for handling NLM Journal Archiving and Interchange DTDs.
This subdivision of the application layer into further refined SoC is another example (in addition to
the subdivision at system level) of architectural design which addresses non-functional
requirements of ExtConX2.
24 Different types of arrows are only for visibility.
Figure 9 - ExtConX2 Application Layer
a) URL Module
The URL Module is responsible for extracting URLs from PMC documents,25 checking whether each extracted URL is accessible or not, and determining the type of resource referenced. The URL Module contains the following classes:
1. URL - The URL class may be described as a super-class; its responsibility includes
extraction of URLs from PMC documents and invoking other operations (i.e., URL Status
and Resource Type). In addition, URL acts as a gateway between the database layer and
application layer (i.e., retrieving PMC documents and returning processed data).
2. URL Status - URL Status checks if a given URL is accessible or not.
3. URL Identifier - URL Identifier is responsible for syntactically validating URLs, and to
identify URL protocols if any (i.e., http:// and ftp://). The latter functionality is used by
URL Status.
25 Not including URLs which are part of the article metadata, i.e., the corresponding prepublication paper and licence (http://creativecommons.org).
4. Resource Type - Resource Type is responsible for collecting possible types of resource
referenced (i.e., Databank, Document, Organisation, or Software). Refer to Section 0 for
further description.
5. Soft Decision - Soft Decision may be described as a sub-class of Resource Type which
contains a method to determine the most likely URL resource type from a set of collected
possibilities (refer to Section 4.2.1 for description).
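The collection-then-decision split might be sketched as a weighted vote: Resource Type gathers weighted candidates and Soft Decision picks the heaviest. The weighting scheme shown is an assumption for illustration, not the dissertation's exact algorithm (described in Section 4.2.1).

```python
from collections import Counter

def soft_decision(candidates):
    """Pick the most likely resource type from (type, weight) pairs.

    Illustrative sketch: candidates collected by the Resource Type step
    are summed per type, and the heaviest type wins.
    """
    if not candidates:
        return None  # no evidence collected: type remains unidentified
    totals = Counter()
    for rtype, weight in candidates:
        totals[rtype] += weight
    return totals.most_common(1)[0][0]
```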
b) IE Module
The IE Module encapsulates the TM application which handles role extraction: specifically, pre-processing of acknowledgement text (i.e., NLP) and subsequent IE (extraction of collaborators and funders, and their respective REs).
1. IE - The IE class is the super-class within the IE Module; it extracts acknowledgement text from PMC documents, and invokes the IE Application and Role Extractor in order to complete the acknowledgement extraction sequence.
2. IE Application - The IE Application encapsulates the TM application (designed with GATE). This class handles the pre-processing of acknowledgement text (including providing annotations over NEs and their respective REs). Further description is provided in Section 4.4.2.
3. Role Extractor - The Role Extractor extracts NEs and their corresponding roles from pre-processed acknowledgement text.
c) Parser Module
The Parser Module encapsulates the parser and a class to handle NLM Journal Archiving and
Interchange DTDs.
1. Parser - The Parser encapsulates the Document Object Model (DOM) parser used to parse
PMC documents.
2. DTD Resolver - The DTD Resolver is responsible for redirecting XML System IDs26 to the local directory where the NLM Journal Archiving and Interchange DTDs are stored. This class is needed due to the variety of DTDs required for parsing PMC documents.
26 The System ID is the URI/URL pointing to a given XML document's DTD.
4.4.3. Presentation Layer
The presentation layer encapsulates methods for HCI (Figure 10). It includes the following classes:
Figure 10 - ExtConX2 Presentation Layer
1. Function Panel - This class constructs the function panel or buttons to initiate various
functionalities (e.g., initiating URL extraction and role extraction).
2. Entry Panel - This class constructs the entry panel: e.g., text fields for user input such as
parameters for number of documents to be processed etc.
3. Quitable Frame - This class is responsible for the popup dialog box that asks the user to confirm exiting the application.
4. GUI - This class constructs the GUI by invoking other classes.
5. InvokeApp - acts as a gateway to the application layer initiating application logic by user
input (see Appendix A, Figure 14).
5. Implementation
This chapter describes the implementation of the main functional requirements of ExtConX2: URL
Module and IE Module (refer to Figure 9). However, these descriptions are not comprehensive, as
only a few of the more noteworthy aspects are included. Other materials not provided in this
dissertation are available on the project website (http://gnode1.mib.man.ac.uk/projects/ExtConX2/).
5.1. Tools & Implementation Environment
Tools used to implement the various components of ExtConX2 include:
This section discusses the facts presented regarding URLs within PMC documents, and aspects of the underlying implementation of ExtConX2 which may have affected those facts. Potential suggestions and improvements are also provided.
(1) FTPs
One of the limitations of the data presented is that FTP URIs were not checked for availability. However, analysis of the extracted data showed that, out of 147,133 URLs extracted, only 791 (or 0.5%) were FTPs; hence we can conclude that the impact on the statistics presented is minimal.
(2) Resource Availability
The method adopted to check resource availability has some weaknesses. As availability is only checked once, before insertion into the database, the accuracy of the availability results may be affected. For instance, web servers do not have 100% up-time or unlimited capacity for online traffic; either of these factors may have impacted the results. A better approach to maximise the accuracy of URL availability would be to implement an additional module which crawls the database and updates the URL status appropriately. For instance, Wren's (2004, 2008) approach would be ideal: URLs were checked every day over a 4-week period, and any URL which was accessible over 90% of the time was deemed an active resource.
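Wren's decision rule can be expressed compactly; the sketch below takes the over-90% threshold from the description above and is otherwise an illustrative assumption.

```python
def is_active(checks, threshold=0.9):
    """Deem a URL active if it was accessible in over 90% of repeated
    checks (e.g., daily checks over a 4-week period).

    `checks` is a list of booleans, one per accessibility check.
    """
    if not checks:
        return False
    return sum(checks) / len(checks) > threshold
```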
In addition, due to the project time constraint, the implementation for checking URL availability had a 10-second time-out limit.35 As some web servers take longer to respond to HTTP requests, this limit may have affected the results presented.36
(3) Soft Decision and Resource Identification
Approximately 10.5% of the resources identified were incorrectly classified and 8.5% were not identified at all. A manual review of these documents (and others) shows two primary issues with the implementation. The use of keywords to identify resources failed due to (1) a lack of keywords within the citation context to indicate the type of resource, and (2) resource types not accounted for in the implementation, e.g., laboratory tools and equipment. The latter limitation may be addressed by creating a new list of keywords that characterises laboratory tools and equipment, with some minor amendments to the implementation to facilitate an additional resource type.
35 Wren's (2004, 2008) implementation had a 60-second time-out limit, which is probably a more appropriate limit. However, as the URL data was loaded into the database during the last 10 days of the dissertation, a 60-second timeout would have taken around 12 days to insert into the database (considering existing system issues: see Section 6.3).
36 Testing of the implementation for checking URL availability confirmed cases which did take more than 10 seconds to confirm accessibility of URLs.
Moreover, both the soft decision algorithm (i.e., the distributed weights applied to instances) and the method used for resource classification could be further improved. For instance, consider the following generic citation, similar to examples found in the manually analysed documents, which the soft decision failed to classify:
1. James [1] proved that the method has good performance.
This example does not include any keywords per se enabling classification of the resource referenced. However, the citation style (James [1]) indicates the Document type. Thus, the use of regular expressions to match the pattern 'NE [NUMBER]' may be applied as an additional method alongside the keyword lists.
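The suggested fallback could be a regular expression over the citation context; the exact pattern below is an illustrative guess at 'NE [NUMBER]'.

```python
import re

# A capitalised name immediately followed by a bracketed citation number
# (e.g., 'James [1]') suggests the referenced resource is a Document.
CITATION = re.compile(r"\b[A-Z][a-z]+ \[\d+\]")

def looks_like_document_citation(sentence):
    return CITATION.search(sentence) is not None
```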
6.2. Role Extraction
The adopted rule-based approach to role extraction (i.e., extraction of NEs and corresponding REs) achieved a recall of 67.6%, a precision of 92.6%, and an F-score of 77.7%. The NER achieved a recall of 69.9% and a precision of 95%, and the extraction of REs achieved a recall of 75% and a precision of 97.6%. The evaluation was based on a random sample of 50 documents. From the whole PMC dataset processed, 86,751 acknowledgements were extracted, 71,615 of which were identified as containing roles.
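The reported figures follow the standard IE definitions of precision, recall, and F-score; below is a minimal sketch of the computation, with made-up counts for illustration.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard IE evaluation metrics over true positives, false
    positives, and false negatives (see Chapter 6 for the set-up)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```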
(1) Evaluation Principles:
The evaluation was guided by the following principles:
Acknowledgements of NEs with no roles were not considered and were ignored.
Acknowledgements of entities that were not individuals or organisations (e.g., laboratory staff, teams/groups, etc.) were not considered and were ignored.
In addition, some acknowledgements, in particular of organisations, could have two valid REs. Thus, either role extracted was considered a true positive. For instance, in the following example both supported and grant are considered true positives:37
1. This work was supported by NIH grant
37 For the evaluation results of REs extracted, this example would be considered as containing 1 RE, and either one extracted would be considered a true positive.
Acknowledgements of multiple NEs with identical RE were considered as separate
acknowledgements. For instance, the following acknowledgement would be considered to contain
three separate roles (see Table 31):
2. We like to thank John Dough, Jim Baker, Zoe Zindan for reviewing the manuscript.
Table 31 – True Positives: Role Extraction
(1) Name Entity: John Dough
Role Expression: reviewing the manuscript
(2) Name Entity: Jim Baker
Role Expression: reviewing the manuscript
(3) Name Entity: Zoe Zindan
Role Expression: reviewing the manuscript
(2) Extracted Facts
Table 32 shows the most acknowledged funding organisations within PMC. As the role extraction system does not handle acronyms prior to IE (i.e., organisations and their corresponding acronyms are extracted as separate roles), additional manual analysis was needed to present this result. In addition, some organisations have identical names in different countries; for instance, a National Cancer Institute exists in both the US and Canada. This was not taken into consideration. However, the other organisations presented (Table 32) are unique, either by country or globally.
Table 32 – Most Acknowledged Funding Organisation
Name of Funding Organisation                            Total Nr. of Acknowledgements
1 National Institutes of Health 10,613
2 National Science Foundation 3,099
3 Wellcome Trust 2,287
4 European Union 1,443
5 Deutsche Forschungsgemeinschaft 1,301
6 National Cancer Institute (US and Canada) 1,114
7 Canadian Institutes of Health Research 928
8 Biotechnology and Biological Sciences Research Council (BBSRC) 829
9 European Commission 746
10 National Health and Medical Research Council (NHMRC) 663
11 National Natural Science Foundation of China 548
12 Swedish Research Council 538
13 Swiss National Science Foundation 467
6.2.1. Discussions
The overall performance of the IE task was quite poor in terms of recall. This was due to a combination of factors, the most notable being the performance of the NER. As both the RE Transducer and Role Context Transducer (refer to Section 4.4.2) rely on good performance of the NER, a domino effect led to the overall poor performance. Descriptions of the NER, RE Transducer, and Role Context Transducer follow:
(1) NER
A couple of issues with the NER processing resources include none or partial recognition of (1) non-English names and (2) multi-word organisation NEs.
NEs that did not adhere to the customary orthographical rules used in the English spelling of names (i.e., capitalised initials of NNPs) accounted for a significant number of cases. For instance, common examples included Italian names, e.g., Marco de Bartol (note the lowercase particle de), and Chinese names, which often adhere to English orthography but include two-letter NNPs, e.g., Hurng-Yi Wang (note Yi), which were not recognised by the NER.
Another issue was the non-recognition of multi-word organisations. Some examples from the data
extracted include:
i. Ministry of Health, Labour and Welfare of Japan
ii. Ministry of Education, Science, Sports and Culture of Japan
iii. Mental Illness Research, Education and Clinical Centre
A potential approach to handling this issue would be at the lexical level of processing, such as expansion of the gazetteer. While around 150 organisation names were added to the gazetteer during the development process, this was clearly inadequate.
(2) RE Transducer
Factors affecting the performance of the RE Transducer (labelling of collaborator and funder roles) include: (1) the poor performance of the NER system, and (2) limitations in the variety of rules used.
The sole pattern used for labelling collaboration roles was (see Table 33 for an explanation):38
i. [Person] [for|who|provided] [PRP]? [ROLE]
38 The given pattern is somewhat simplified, but represents the generic rule applied in the RE Transducer.
Table 33 – Description of RE Transducer Rule
Pattern Description
[Person] NE: person
[for|who|provided] Word token: for, who, or provided
[PRP]? Possessive pronoun: his, her, their, etc. (may or may not exist).
[ROLE] The role being labelled, if and only if, the preceding patterns were matched.
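The rule in Table 33 can be approximated with a regular expression; the [Person] and [ROLE] sub-patterns below are simplified assumptions standing in for the NER and gazetteer annotations.

```python
import re

# Simplified stand-ins for the annotated [Person] and [ROLE] patterns.
PERSON = r"(?:[A-Z][a-z]+ )+[A-Z][a-z]+"
ROLE = r"helpful comments|helpful suggestions|reviewing the manuscript"
RULE = re.compile(
    rf"({PERSON}) (?:for|who|provided) (?:his |her |their )?({ROLE})")

def match_collaboration(sentence):
    """Return (person, role) if the [Person] [for|who|provided] [PRP]?
    [ROLE] pattern matches, else None."""
    m = RULE.search(sentence)
    return (m.group(1), m.group(2)) if m else None
```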
Thus, roles that did not adhere to the above pattern were ignored. Below is a common example
identified during the evaluation of the system (NEs are in bold):
i. We like to thank Jim Dough, John Stew, and John Crow from Manchester University, UK,
for helping with the laboratory work.
Here, no NE immediately precedes the relevant RE (i.e., helping with the laboratory work); as a result, the processing resource fails to identify the RE. See the discussion of the Role Context Transducer for an example of the RE Transducer failing to identify a RE due to the poor performance of the NER.
(3) Role Context Transducer
The performance of the Role Context Transducer is almost entirely dependent on the preceding resources, in particular the NER and RE Transducer. The semantic level processing uses a pattern identical to that of the RE Transducer; however, in contrast, a NE or consecutive NEs which are followed by a RE (identified by the prior processing resource) are collectively labelled as Role Context. Given that the NER and RE Transducer have correctly identified the existing NEs and a RE, the following example illustrates the ideal result of applying the Role Context Transducer (see highlighted text):
i. We are indebted to Brian Boyle, Mark Andersen, and Jeffrey Dean for critically
reviewing the manuscript.
However, due to a domino effect initiated by the poor performance of the NER, the performance of the Role Context Transducer, and therefore the evaluation results, were affected. The following examples illustrate a couple of common results observed during the evaluation stage (identified NEs are in bold and the identified RE is in bold and underlined):
i. We are indebted to Michel Cusson, Pierre Fobert, Frédéric Vigneault, Brian Boyle, Mark
Andersen, and Jeffrey Dean for critically reviewing the manuscript.
ii. We are indebted to Michel Cusson, Brian Boyle, Mark Andersen, and especially Jeffrey
Dean for critically reviewing the manuscript.
In the first example given, Mark Andersen is not identified as a NE by the NER process. Therefore, as the Role Context Transducer relies on either consecutive NEs (which must be separated by commas or the word token and) or a single NE followed by a RE, only 1 out of 6 roles is identified by the Role Context Transducer.
In the second example, the NER processing has failed to identify Jeffrey Dean; hence, the RE Transducer is unable to identify any RE, and subsequently the Role Context Transducer fails to identify any roles.
This domino effect initiated by the poor performance of the NER was one of the most significant
issues of the IE application. This limitation may be addressed by expanding the gazetteer and
adding additional rules for recognition of non-English NEs.
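To make this grouping behaviour concrete, the following Java sketch (illustrative only, not the original GATE-based implementation; the bracket markers standing in for NER and RE annotations are an assumption) collects the run of consecutive person NEs that directly precedes a role expression:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Minimal sketch of the Role Context grouping. Person NEs are pre-marked
// as [P]...[/P] and the role expression as [RE]...[/RE]; the markers stand
// in for the annotations produced by the NER and RE Transducer.
public class RoleContextSketch {

    private static final Pattern NE = Pattern.compile("\\[P\\](.+?)\\[/P\\]");

    // Returns the consecutive NEs (separated only by commas and/or "and")
    // directly preceding the RE; an empty list if the chain is broken.
    public static List<String> roleContext(String sentence) {
        List<String> chain = new ArrayList<>();
        int reStart = sentence.indexOf("[RE]");
        if (reStart < 0) return chain;
        String before = sentence.substring(0, reStart);
        Matcher m = NE.matcher(before);
        int lastEnd = -1;
        while (m.find()) {
            if (lastEnd >= 0) {
                String gap = before.substring(lastEnd, m.start()).trim();
                // Anything other than a comma and/or "and" breaks the chain.
                if (!gap.matches(",|and|,\\s*and")) chain.clear();
            }
            chain.add(m.group(1));
            lastEnd = m.end();
        }
        // The last NE must directly precede the RE (whitespace allowed).
        if (lastEnd < 0 || !before.substring(lastEnd).trim().isEmpty()) {
            chain.clear();
        }
        return chain;
    }
}
```

On a fully recognised name list this returns every name; when a name in the middle of the list is missed by the NER, only the names adjacent to the RE survive, reproducing the 1-out-of-6 behaviour described above.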
6.3 System Limitations
The following environment (Table 34) was used during the development and evaluation of
ExtConX2:
Table 34 - Development and Evaluation Environment
Nr. Environment Value
1 Operating System Windows 7 Home Edition 32-bit
2 Database Server MySQL 5.0
3 Processor Intel Core2 Solo 1.4GHz
4 Memory (RAM) 2GB
5 JVM Maximum Memory 512MB
The following sections discuss a couple of specific software issues uncovered during the evaluation stage:
(1) URL Module
The current implementation of the URL availability check contains a bug inherited from the Java API used (i.e., HttpURLConnection). While the cause has not been conclusively confirmed, it appears to be caused by servers which do not allow programmatic HTTP connections. This is assumed because none of the manually checked URLs was unavailable or syntactically invalid. Furthermore, the API call freezes when trying to get a response code from the host to determine whether the URL is accessible. This issue can be solved by the use of threads: if no response is received within a certain amount of time, the thread can safely be terminated (without affecting any concurrent processes) and the URL marked for a manual check.
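A minimal sketch of such a defensive fix (illustrative, not the original code): rather than terminating a watchdog thread, explicit connect and read timeouts on HttpURLConnection bound the wait, and any failure is reported as unavailable so the URL can be queued for a manual check:

```java
import java.net.HttpURLConnection;
import java.net.URL;

// Illustrative sketch: check URL availability with explicit timeouts so
// that a non-responding server cannot freeze the calling thread.
public class UrlCheckSketch {

    // Returns true if the host answers with any HTTP status within the
    // given timeout; false on malformed URLs, timeouts, or refused
    // connections (such URLs could then be marked for a manual check).
    public static boolean isAvailable(String url, int timeoutMs) {
        try {
            HttpURLConnection conn =
                (HttpURLConnection) new URL(url).openConnection();
            conn.setConnectTimeout(timeoutMs);
            conn.setReadTimeout(timeoutMs);
            conn.setRequestMethod("HEAD");
            conn.getResponseCode();
            conn.disconnect();
            return true;
        } catch (Exception e) {
            return false;
        }
    }
}
```

Using a HEAD request avoids downloading the resource body when only availability is of interest.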
(2) IE Module
The IE module, which handles the text pre-processing, is unable to process acknowledgement paragraphs of over 200 words in the environment used: a java.lang.OutOfMemoryError: Java heap space exception is thrown because the Java Virtual Machine (JVM) heap size is insufficient. This is a known issue with the GATE API (Cunningham et al. 2010, p.35). Due to the environment used, the JVM maximum memory could not be increased. To address this issue, the Java maximum heap size needs to be set to 768MB or more (e.g., with the -Xmx768m flag).
7. Conclusion
The aim of this project was to develop a text mining system (ExtConX2) to enable:
(1) the exploration of acknowledgements of individuals and organisations, and
(2) analysis of URL decay and most often referenced online resources.
Table 35 summarises the project aims, which have all been fully met.
Table 35 – Accomplished Project Aims
Project Aims
1 Design and implement a relational database (Db) schema to store extracted data.
2 Design and implement a module to extract URLs from documents, determine whether a given URL is accessible, determine the type of resource (or URL) extracted/referenced, and insert this data into a database.
3 Design and implement a module to identify and extract funders and collaborators (i.e.,
persons/organisations and their respective roles) from acknowledgements and insert this
data into a database.
4 Design and implement a GUI that will facilitate exploration of system functionalities and which provides general statistics.
5 Evaluation of the proposed methodology.
TM techniques were used to achieve the main functional requirements of the system. In particular, lexical, syntactic, and semantic level NLP was used for acknowledgement extraction. In addition, a rule-based approach (JAPE) was used for semantic level processing to enable the IE task of role extraction. We differentiated between two classes of roles: funders and contributors. Finally, a combination of regular expressions and keyword lists was used for the extraction of URLs and the classification of these resources into four classes (i.e., Databank, Document, Organisation, and Software).
As part of the project, we have worked with a set of 190,000 full-text journal articles from PubMed Central; as the full dataset was not available in XML format, roughly 120,000-130,000 were actually processed. A subset of 50 documents was manually checked to evaluate ExtConX2's performance.
For URL extraction, the system achieved 98.6% precision and 96% recall. For URL resource
classification, the system was able to correctly classify 81.1% of URLs (recall) with precision of
88.7%. For role extraction, the system achieved 92.7% precision, 67.6% recall and an F measure of
77.7%.
Using this data, we have analysed some trends in URL decay and acknowledgements. For example, we found that URL decay can be described as a function of publication year: the older the publication, the less accessible the resources it references. We also found that most funding acknowledgements were associated with the National Institutes of Health.
While prior research has produced applications similar to ExtConX2, this project has extended the scope of that research by analysing larger datasets and adopting more sophisticated approaches. For instance, Wren's (2004, 2008) studies were solely confined to PubMed citations, while ExtConX2 has enabled the analysis of URL decay within full-text articles. This has enabled us to draw more holistic conclusions regarding the scope of URL decay within the biomedical domain. In addition, ExtConX2 is the first system to enable acknowledgement extraction within PMC.
7.1. Limitations and Future Work
The following list defines ExtConX2‘s limitations and provides suggestions for future
enhancements:
1. The URL Module is currently only able to check HTTP (i.e., http:// and https://) URLs for availability. Additional implementation is needed for the File Transfer Protocol.
2. The IE Module extracts an organisation name and its abbreviation as separate NEs, resulting in two separate roles. This could be handled by implementing an additional module for acronym detection.
3. The soft decision and keywords used for resource classification may be further studied and improved. For instance, an additional category type (laboratory tools and equipment) ought to be added.
4. Implementation of concurrent processing to speed up the check of resource availability and to handle non-responding URLs, addressing the system issues discussed.
5. Currently the implementation only analyses acknowledgements within defined acknowledgement sections; however, acknowledgements may also appear elsewhere in a document.
6. The facts presented are quite limited; with the data already extracted, other patterns/relationships may be uncovered, e.g., (1) the resource types and journals most affected by URL decay, and (2) the relationship between funding organisations and the disciplines of research most often sponsored.
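A minimal sketch of the acronym detection suggested in point 2 (hypothetical, not part of ExtConX2) could pair a parenthesised abbreviation with the organisation name that precedes it:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of acronym detection: pair a parenthesised
// abbreviation with the organisation name preceding it, so that, e.g.,
// "National Institutes of Health (NIH)" yields one entity, not two.
public class AcronymSketch {

    private static final Pattern ABBR = Pattern.compile("\\(([A-Z]{2,})\\)");

    public static Map<String, String> abbreviations(String text) {
        Map<String, String> map = new HashMap<>();
        Matcher m = ABBR.matcher(text);
        while (m.find()) {
            String abbr = m.group(1);
            String[] words = text.substring(0, m.start()).trim().split("\\s+");
            // Look back at most 2 * |abbr| words (a common heuristic).
            for (int n = abbr.length();
                 n <= Math.min(2 * abbr.length(), words.length); n++) {
                String candidate = String.join(" ",
                    Arrays.copyOfRange(words, words.length - n, words.length));
                // The abbreviation must be a subsequence of the candidate's
                // word initials (stop-words such as "of" may be skipped).
                if (Character.isUpperCase(candidate.charAt(0))
                        && isSubsequence(abbr, initials(candidate))) {
                    map.put(abbr, candidate);
                    break;
                }
            }
        }
        return map;
    }

    private static String initials(String name) {
        StringBuilder sb = new StringBuilder();
        for (String w : name.split("\\s+")) {
            sb.append(Character.toUpperCase(w.charAt(0)));
        }
        return sb.toString();
    }

    private static boolean isSubsequence(String abbr, String initials) {
        int i = 0;
        for (int j = 0; j < initials.length() && i < abbr.length(); j++) {
            if (initials.charAt(j) == abbr.charAt(i)) i++;
        }
        return i == abbr.length();
    }
}
```

A matched pair could then be merged into a single organisation NE before role labelling, avoiding the duplicated roles described above.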
In addition, other topics of interest were identified during the course of this project:
1. Document representation seems to be changing: more and more documents do not provide visible/printable URLs; instead, hyperlinks encapsulating URL strings are provided.
2. It would be interesting to analyse the types of applications referenced within PMC. For instance, what types of software are referenced and what are their uses?
References
Ananiadou, S. & McNaught, J., 2006. Text Mining for Biology and Biomedicine. Artech House: London.
Ananiadou, S. et al., 2005. The National Centre for Text Mining: Aim and Objectives. Ariadne, [online] 30 Jan., (42). Available at: http://www.ariadne.ac.uk/issue42/ananiadou/ [Accessed 13
April 2010].
Appelt, E.D. & Israel, J.D., 1999. Introduction to Information Extraction Technology: A Tutorial Prepared for IJCAI-99. [Online] Available at: http://user.phil-fak.uni-duesseldorf.de/~rumpf/SS2005/Informationsextraktion/Pub/AppIsr99.pdf [Accessed 1 May 2010].
National Institute of Standards and Technology (NIST), 2004. Automatic Content Extraction 2004 Evaluation (ACE04). [Online] Available at: http://www.itl.nist.gov/iad/mig//tests/ace/2004/ [Accessed 10 May 2010].
Baeza-Yates, R. & Ribeiro-Neto, B., 1999. Modern Information Retrieval. Pearson
Education Limited. ACM Press, New York.
Bennet, S., McRobb, S. & Farmer, R., 2006. Object-Oriented Systems Analysis and Design, 3rd ed. McGraw-Hill: London.
Berners-Lee, T., Fielding, R. & Frystyk, H., 1996. Hypertext Transfer Protocol -- HTTP/1.0.
[Online] Available at: http://www.ietf.org/rfc/rfc1945.txt [Accessed 4 September 2010].
Black, J.W. et al., 2005. CAFETIERE: Conceptual Annotation for Facts, Events, Terms, Individual Entities, and Relations. Parmenides Technical Report TR-U4.3.1. [Online] Available at:
http://ilk.uvt.nl/~kzervanou/dwn/TRU431.pdf [Accessed 4 September 2010].
Chinchor, N. & Sundheim, B., 1993. MUC-5 Evaluation Metrics. Proceedings of the 5th Conference on Message Understanding. Baltimore, Maryland, USA, 25-27 August 1993. [Online] Available at: http://www.aclweb.org/anthology-new/M/M93/M93-1007.pdf [Accessed 9 May 2010].
Cunningham, H. et al., 2010. Developing Language Processing Components with GATE Version 5
(a User Guide). [Online] Available at: http://Gate.ac.uk/sale/tao/tao.pdf [Accessed 9 May 2010].
Cunningham, H., 2006. Information Extraction, Automatic. In: Brown, K., ed. Encyclopedia of Language & Linguistics, 2nd ed. Oxford: Elsevier.
Fayyad, U., Piatetsky-Shapiro, G. & Smyth, P., 1996. Knowledge Discovery and Data Mining:
Towards a Unifying Framework. Proceedings of the Second International Conference on
Knowledge Discovery and Data Mining. Portland, Oregon, USA, 2-4 August 1996. [Online] Available at: http://www.aaai.org/Papers/KDD/1996/KDD96-014.pdf [Accessed 21 April 2010].
Frankling, S., 2010. XML Parser: DOM and SAX Put to the Test. [Online] Available at: http://www.devx.com/xml/Article/16922/1954 [Accessed 27 August 2010].
Frantzi, K., Ananiadou, S. & Mima, H., 2000. Automatic Recognition of Multi-word Terms. International Journal on Digital Libraries, 3(2), pp.117-132.
Gerner, M., Nenadic, G. & Bergman, C.M., 2010. An Exploration of Mining Gene Expression Mentions and their Anatomical Locations from Biomedical Text. Proceedings of the 2010 Workshop on Biomedical Natural Language Processing. Uppsala, Sweden, 15 July 2010. [Online] Available at: http://www.aclweb.org/anthology/W/W10/W10-1909.pdf [Accessed 4 September 2010].
Giles, C.L. & Councill, I.G., 2004. Who Gets Acknowledged: Measuring Scientific Contributions through Automatic Acknowledgment Indexing. PNAS, 101(51), pp.17599-17604.
Hahn, U. & Wermter, J., 2006. Levels of Natural Language Processing for Text Mining. In:
Ananiadou, S. & McNaught, J., ed. Text Mining for Biology and Biomedicine. Artech House:
London.
Hearst, M.A., 1999. Untangling Text Data Mining. Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics. College Park, Maryland, USA, 20-26 June 1999. [Online] Available at: http://www.ischool.berkeley.edu/~hearst/papers/acl99/acl99-tdm.html [Accessed 14 April 2010].
Hotho, A., Nurnberger, A. & Paaß, G., 2005. A Brief Survey of Text Mining. LDV-Forum, 20(1), pp.19-62.
JISC, 2006. Text Mining: Briefing Paper. [Online] Available at:
http://www.jisc.ac.uk/media/documents/publications/textminingbp.pdf [Accessed 16 April 2010].
Kim, J. & Tsujii, J., 2006. Corpora and Their Annotation. In: Ananiadou, S. & McNaught, J., ed. Text Mining for Biology and Biomedicine. Artech House: London.
Hearst, M.A., 2003. What is Text Mining? [Online] Available at: http://www.ischool.berkeley.edu/~hearst/text-mining.html [Accessed 14 April 2010].
McNaught, J. & Black, W.J., 2006. Information Extraction. In: Ananiadou, S. &
McNaught, J., ed. Text Mining for Biology and Biomedicine. Artech House: London.
National Institutes of Health (NIH), 2010. [Online] Available at: http://www.nih.gov/icd/ [Accessed 6 August 2010].
National Library of Medicine (NLM), 2010a. Fact Sheet. [Online] Available at:
http://www.nlm.nih.gov/pubs/factsheets/pubmed.html [Accessed 13 April 2010].
National Library of Medicine (NLM), 2010b. http://dtd.nlm.nih.gov/publishing/ [Accessed 25 August 2010].
National Library of Medicine (NLM), 2010c. http://www.ncbi.nlm.nih.gov/pmc/pmcdoc/tagging-guidelines/article/tags.html [Accessed 25 August 2010].
National Library of Medicine (NLM), 2009. Key MEDLINE® Indicators. [Online] Available at: http://www.nlm.nih.gov/bsd/bsd_key.html [Accessed 13 April 2010].
National Library of Medicine (NLM), 2008. Fact Sheet: MEDLINE®. [Online] Available at:
http://www.nlm.nih.gov/pubs/factsheets/medline.html [Accessed 13 April 2010].
Polajnar, T., 2006. Survey of Text Mining of Biomedical Corpora. [Online] Available at:
http://www.dcs.gla.ac.uk/~tamara/surveyoftm.pdf [Accessed 10 May 2010].
Sommerville, I., 2004. Software Engineering. 7th ed. London: Pearson.
Tateisi, Y., 2004. GENIA Corpus. [Online] Available at: http://www-tsujii.is.s.u-tokyo.ac.jp/~genia/topics/Corpus/ [Accessed 13 May 2010].
Tsuruoka, Y. et al., 2005. Developing a Robust Part-of-Speech Tagger for Biomedical Text. Advances in Informatics: 10th Panhellenic Conference on Informatics. Volos, Greece, 11-13 November 2005. [Online] Available at: http://www.springerlink.com/content/3275150j32h61345/fulltext.pdf [Accessed 14 May 2010].
Uramoto, N. et al., 2004. A Text-mining System for Knowledge Discovery from Biomedical Documents. IBM Systems Journal, 43(3), pp.516-533.
Wikipedia, 2009. Extensibility. [Online] Available at: http://en.wikipedia.org/wiki/Extensibility
[Accessed 22 August 2010].
Wikipedia, 2010. Research Funding. [Online] Available at:
http://en.wikipedia.org/wiki/Research_funding [Accessed 6 August 2010].
Wren, J.D., 2004. 404 Not Found: the Stability and Persistence of URLs Published in MEDLINE. Bioinformatics, 20(5), pp.668-672.
Zelenko, D., Aone, C. & Richardella, A., 2003. Kernel Methods for Relation Extraction. Journal of Machine Learning Research, 3, pp.1083-1106.
Zhou, G., Su, J., Zhang, J. & Zhang, M., 2005. Exploring Various Knowledge in Relation Extraction. Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics. Ann Arbor, Michigan, USA, 25-30 June 2005. [Online] Available at: http://www.aclweb.org/anthology-new/P/P05/P05-1053.pdf [Accessed 10 May 2010].