Network-based approaches for pathway level analysis
Tin Nguyen1, Cristina Mitrea2, and Sorin Draghici2,3
1Department of Computer Science and Engineering, University of Nevada, Reno
2Department of Computer Science, Wayne State University
3Department of Obstetrics and Gynecology, Wayne State University
September 26, 2017
Contents
1 Abstract
2 Introduction
3 Network-based pathway analysis
  3.1 Software availability and implementation
  3.2 Experiment input and pathway databases
  3.3 Graph models
  3.4 Pathway scoring strategies
    3.4.1 Hierarchically aggregated scoring
    3.4.2 Multivariate analysis and Bayesian network
    3.4.3 Other approaches
4 Challenges in pathway analysis
  4.1 Systematic bias of pathway analysis methods
  4.2 Querying data from knowledge bases
  4.3 Missing benchmark datasets and comparison
5 Conclusions
6 Conflict-of-Interest Statement
7 Acknowledgment
To whom correspondence should be addressed. Phone: (313)-577-6679, Fax: (313)-577-6868, Email:[email protected]
1 Abstract
Identification of impacted pathways is an important problem because it allows us to gain insights into the underlying biology beyond the detection of differentially expressed genes. In the past decade, a plethora of methods have been developed for this purpose. The latest generation of pathway analysis methods is designed to take into account various aspects of pathway topology in order to increase the accuracy of the findings. Here we cover 34 such topology-based pathway analysis methods published in the past 13 years. We compare these methods on categories related to implementation, availability, input format, graph models, and statistical approaches used to compute pathway level statistics and statistical significance. We also discuss a number of critical challenges that need to be addressed, arising both in methodology and pathway representation, including inconsistent terminology, data format, lack of meaningful benchmarks, and, more importantly, a systematic bias that is present in most existing methods.
2 Introduction
With rapid advances in high-throughput technologies, various kinds of genomic data are prevalent in most of biomedical research. Advanced techniques in sequencing (e.g., RNA-Seq, miRNA-Seq, DNA-Seq) and microarray assays (e.g., gene expression, methylation) have transformed biological research by enabling comprehensive monitoring of biological systems. Vast amounts of data of all types have accumulated in many public repositories, such as Gene Expression Omnibus (GEO) [1,2], ArrayExpress [3,4], The Cancer Genome Atlas (http://cancergenome.nih.gov/), and cBioPortal [5,6]. However, there is a large gap between the ease of data collection and our ability to extract knowledge from these data. Contributing to this gap is the fact that living organisms are complex systems whose emerging phenotypes are the results of multiple complex interactions taking place on various metabolic and signaling pathways.
Regardless of the technology being used, a typical comparative analysis (e.g., disease vs. control, treated vs. not treated, drug A vs. drug B, etc.) often yields a set of genes that are differentially expressed (DE) between the two phenotypes. Even though these lists of DE genes are important in identifying the genes that may be involved in biological changes, they fail to reveal the underlying mechanisms. In order to translate these lists of DE genes into a better understanding of biological phenomena, researchers have developed a variety of knowledge bases that map genes to functional modules. Depending on the amount of information that one wishes to include, these modules can be described as simple gene sets based on a function, process, or component (e.g., the Molecular Signatures Database MSigDB [7]), organized in a hierarchical structure that contains information about the relationship between the various modules, as found in the Gene Ontology [8], or organized into pathways that describe in detail all known interactions between the various genes that are involved in a certain phenomenon. Biological processes, in which genes are known to interact with each other, are accumulated in public databases, such as the Kyoto Encyclopedia of Genes and Genomes (KEGG) [9,10], Reactome [11], and BioCarta (www.biocarta.com).
Concurrently, statistical methods have been developed to identify the functional modules or pathways that are impacted, based on the differential expression evidence. They allow us to gain insights into the functional mechanisms of cells beyond the detection of differentially expressed genes. The earliest approaches use Over-Representation Analysis (ORA) [12,13] to identify gene sets that have more DE genes than expected by chance. The drawbacks of this type of approach include that: i) it only considers the number of DE genes and completely ignores the magnitude of the actual expression changes, resulting in information loss; ii) it assumes that genes are independent, which they are not (since the pathways are graphs that describe precisely how these genes influence each other); and iii) it ignores the interactions between various genes and modules. Functional Class Scoring (FCS) approaches, such as Gene Set Enrichment Analysis (GSEA) [14] and Gene Set Analysis (GSA) [15], have been developed to address some of the issues raised by ORA approaches. The main improvement of FCS is the observation that small but coordinated changes in expression of functionally related genes can have a significant impact on pathways.
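As a minimal sketch of the ORA idea described above, the following Python snippet computes a one-sided hypergeometric p-value from four counts. The function name `ora_p_value` and the toy numbers are illustrative assumptions and are not taken from any of the cited tools. The sketch makes the first drawback visible: the only inputs are counts, so neither the magnitude of the expression changes nor the position of the genes in the pathway can influence the result.

```python
from math import comb


def ora_p_value(n_measured, n_de, pathway_size, de_in_pathway):
    """One-sided hypergeometric p-value P(X >= de_in_pathway).

    n_measured    -- number of genes measured in the experiment
    n_de          -- number of measured genes called DE
    pathway_size  -- number of measured genes that belong to the pathway
    de_in_pathway -- number of DE genes that belong to the pathway
    """
    total = comb(n_measured, n_de)
    upper = min(n_de, pathway_size)
    # Sum the probabilities of observing de_in_pathway or more DE genes.
    tail = sum(
        comb(pathway_size, k) * comb(n_measured - pathway_size, n_de - k)
        for k in range(de_in_pathway, upper + 1)
    )
    return tail / total


# Toy numbers (hypothetical): 1000 measured genes, 50 of them DE,
# and a pathway of 20 genes that contains 5 DE genes.
p = ora_p_value(n_measured=1000, n_de=50, pathway_size=20, de_in_pathway=5)
print(f"ORA p-value: {p:.4g}")
```

Any two pathways with the same four counts receive the same p-value under this scheme, regardless of how their genes are connected.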
ORA and FCS approaches are often referred to as gene set enrichment methods. Comprehensive lists of gene set analysis approaches, as well as comparisons between them, can be found in well-developed surveys [16-20]. While useful for the purpose for which they have been developed (to analyze sets of genes), these methods completely ignore the topology and interaction between genes. Topology-based approaches, which fully exploit all the knowledge about how genes interact as described by pathways, have been developed more recently. The first such techniques were ScorePAGE [21] for metabolic pathways and the Impact Analysis [22,23] for signaling pathways. Figure 1 shows an example pathway, the ERBB signaling pathway, in which the nodes represent genes and compounds while the edges represent the known interactions between them. Gene set analysis approaches are not able to account for the topological order of genes, nor are they able to explain the signal propagation and mechanisms of the pathway. These approaches would yield exactly the same significance value for this pathway even if the graph were to be completely redesigned by future discoveries. In contrast, network-based pathway analysis can distinguish between this pathway and any other pathway with the same proportion of DE genes.
Here we provide a survey of 34 network-based methods developed for pathway analysis. Our survey of commercial tools for pathway analysis found iPathwayGuide (Advaita Corporation, http://www.advaitabio.com) and MetaCore (Thomson Reuters, http://www.thomsonreuters.com) to be topology-based. Other commercial tools, such as Ingenuity Pathway Analysis (https://www.ingenuity.com) or Genomatix (http://www.genomatix.de), do not use pathway topology information and thus are not included in this review. The existence of only two commercial tools is further evidence of the difficulty of the task undertaken by the developers of such methods, due to the lack of standards.
In this document, we categorize and compare the methods based on the following criteria: the type of input the method accepts, the graph model, and the statistical approaches used to evaluate gene and pathway changes. In section 3.2 we describe the types of input the surveyed methods use. In section 3.3 we provide details regarding the graph models used for the biological networks. In section 3.4 we discuss the statistical approaches used by the surveyed methods to assess gene and pathway changes between two phenotypes. In section 4, we discuss a number of outstanding challenges that need to be addressed in order to improve the reliability, as well as the relevance, of the next-generation pathway analysis approaches. We discuss the key elements and limitations of the surveyed methods without going into the details of each technique. For a more detailed review and technical descriptions of network-based pathway analysis methods, please refer to Mitrea et al. [24] or the referenced manuscripts (Table 1).
3 Network-based pathway analysis
The term "pathway analysis" is used in a very broad context in the literature, including biological network construction and inference. Here we focus on methods that are able to exploit biological knowledge in public repositories, rather than on network inference approaches that attempt to infer or reconstruct pathways from molecular measurements. Table 1 shows the list of 34 pathway analysis approaches, together with their availability and licensing. In this section we start by describing the availability of each method, before discussing the input format, graph model, and the underlying statistical approaches.
Figure 1: Pathways are more than gene sets. Panel A shows the graphical representation of the ERBB signaling pathway from the KEGG database, while panel B shows the set of genes on the pathway. The graph in panel A contains important information regarding gene product (protein) localization, gene, protein, or metabolite interactions and the type of these interactions (activation, repression, etc.), the direction of the signal propagation, etc. ORA and FCS approaches are unable to exploit this information. As such, for the same molecular measurements, these approaches would yield exactly the same significance value for this pathway even if the graph were to be completely redesigned by future discoveries. In contrast, network-based approaches are able to take into account the topological order of the genes and their interactions.
3.1 Software availability and implementation
We often think that the main strength of an approach lies in its novelty and algorithmic efficiency. However, a tool's implementation and availability have become increasingly important for several reasons. First, software availability and version control are crucial for reproducing the experimental results that were used to assess the performance of the approach [60]. For this reason, many journals request authors to make their software and data available before accepting methods articles. Second, if a piece of software is not ready-to-run, it is very unlikely that the intended audience (mostly life scientists) will invest the time to understand and implement complex algorithms. Practicality, user-friendliness, output format, and type of interface are all to be considered. Depending on the desired availability and intended audience, a software package may be implemented as standalone or web-based. Among the 34 approaches, there are 30 that are available either as a standalone software package or web service.

| Method | Availability | HIPAA | License | Code | Year | Reference |
|---|---|---|---|---|---|---|
| Signaling pathway analysis using gene expression | | | | | | |
| MetaCore* | Web (http://www.genego.com/metacore.php) | No | Thomson Reuters | Java | 2004 | N/A |
| Pathway-Express | Standalone (Bioconductor), Web (http://vortex.cs.wayne.edu/); superseded by ROntoTools | No | free** | Java, R | 2005 | [22,25] |
| PathOlogist | Standalone (ftp://ftp1.nci.nih.gov/pub/pathologist/) | No | free | MATLAB | 2007 | [26,27] |
| iPathwayGuide* | Web (http://www.advaitabio.com/products.html) | Yes | Advaita Corp. | Java, R | 2009 | N/A |
| SPIA | Standalone (Bioconductor) | No | GPL (2) | R | 2009 | [23] |
| NetGSA | Standalone (http://www.biostat.washington.edu/~ashojaie/software/) | No | GPL-2 | R | 2009 | [28,29] |
| PWEA | Standalone (https://zlab.bu.edu/PWEA/) | No | free** | C++ | 2010 | [30] |
| TopoGSA | Web (http://www.infobiotics.net/topogsa) | No | free** | PHP, R | 2010 | [31] |
| TopologyGSA | Standalone (CRAN) | No | AGPL-3 | R | 2010 | [32] |
| DEGraph | Standalone (Bioconductor) | No | GPL-3 | R | 2010 | [33] |
| GGEA | Standalone (Bioconductor) | No | Artistic-2.0 | R | 2011 | [34] |
| BPA | Standalone (http://bumil.boun.edu.tr/bpa) | No | free** | MATLAB | 2011 | [35] |
| GANPA | Standalone (CRAN) | No | GPL-2 | R | 2011 | [36] |
| ROntoTools** | Standalone (Bioconductor) | No | CC BY-NC-ND 4 | R | 2012 | [37] |
| BAPA-IGGFD | No implementation available | No | N/A | R | 2012 | [38] |
| CePa | Standalone (CRAN), Web (http://mcube.nju.edu.cn/cgi-bin/cepa/main.pl) | No | GPL (2) | R | 2012 | [39] |
| THINK-Back-DS | Standalone, Web (http://eecs.umich.edu/db/think/software.html) | No | free** | Java | 2012 | [40] |
| TBScore | No implementation available | No | N/A | N/A | 2012 | [41] |
| ACST | Standalone (available as article supplemental) | No | N/A | R | 2012 | [42] |
| EnrichNet | Web (http://www.enrichnet.org/) | No | free** | PHP | 2012 | [43] |
| clipper | Standalone (http://romualdi.bio.unipd.it/software) | No | AGPL-3 | R | 2013 | [44] |
| DEAP | Standalone (available as article supplemental) | No | GNU Lesser GPL | Python | 2013 | [45] |
| DRAGEN | Standalone (http://bioinfo.au.tsinghua.edu.cn/dragen/) | No | N/A | C++ | 2014 | [46] |
| ToPASeq*** | Standalone (Bioconductor) | No | AGPL-3 | R | 2016 | [47] |
| pDis | Standalone (Bioconductor) | No | free** | R | 2016 | [48] |
| SPATIAL | No implementation available | No | N/A | N/A | 2016 | [49] |
| BLMA**** | Standalone (Bioconductor) | No | GPL (2) | R | 2017 | [50,51] |
| Signaling pathway analysis using multiple types of data | | | | | | |
| PARADIGM | Standalone (http://sbenz.github.com/Paradigm) | No | UCSC-CGB, free** | C | 2010 | [52] |
| microGraphite | Standalone (http://romualdi.bio.unipd.it/software) | No | AGPL-3 | R | 2014 | [53] |
| mirIntegrator | Standalone (Bioconductor) | No | GPL-3 | R | 2016 | [54,55] |
| Metabolic pathway analysis | | | | | | |
| ScorePAGE | No implementation available | No | N/A | N/A | 2004 | [21] |
| TAPPA | No implementation available | No | N/A | N/A | 2007 | [56] |
| MetPA | Web (http://metpa.metabolomics.ca) | No | GPL (2) | PHP, R | 2010 | [57] |
| MetaboAnalyst | Web (http://metpa.metabolomics.ca) | No | GPL (2) | R | 2011 | [58,59] |

Table 1: Pathway analysis tools. Availability is a criterion that describes the implementation of the method as standalone or web-based. HIPAA provides information about HIPAA compliance. License provides information about the type of the software license. GPL is an abbreviation for the GNU General Public License; AGPL is an abbreviation for the GNU Affero General Public License; CC BY-NC-ND 4 is an abbreviation of Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International Public License. Code shows the programming language used for the method implementation. * commercial methods; ** free for academic and non-commercial use; UCSC-CGB is the University of California Santa Cruz Cancer Genome Browser; *** ToPASeq provides an R package that runs TopologyGSA, DEGraph, clipper, SPIA, TBScore, PWEA, TAPPA. **** BLMA provides an R package for bi-level meta-analysis that runs SPIA, ORA, GSA, and PADOG using one or multiple expression datasets.
Typically, standalone tools need to be installed on local machines or servers, which often requires some administrative skills. Most standalone tools depend on full or partial copies of public pathway databases, stored locally, which need to be updated periodically. Advantages of standalone tools include: i) instant availability that does not require internet access, and ii) the security and privacy of the experiment data. Web-based tools, on the other hand, run the analyses on a remote server that provides computational power and a graphical interface. The major advantage of web-based tools is that they are user-friendly and do not require a separate local installation. From the accessibility perspective, web-based tools have the advantage of being available from any location as long as there is an internet connection and a browser available. Also, updates are almost seamless to the client. This makes the user's task easy and enables collaboration, since users all over the world can utilize the same method without the burden of installing it or keeping it up-to-date. Some methods provide both web-based and standalone implementations.
HIPAA compliance may also be a factor in certain applications that involve data coming from patients or data linked to other clinical data or clinical records. Currently, iPathwayGuide is the only available tool that can do topological pathway analysis and is HIPAA compliant.
The programming language and style used for implementation also play an important role in the acceptance of a method. Software tools that are neatly implemented and packaged are more appealing compared to those that do not have ready-to-use implementations. Many of the methods are implemented in the R programming language and are available as software packages from Bioconductor, CRAN, or the authors' website. Their popularity among biologists and bioinformaticians is due to the fact that many dedicated bioinformatics packages are available in R. In addition, the rigorous review procedure provided by Bioconductor and CRAN makes the software packages more standardized and reliable.
3.2 Experiment input and pathway databases
A pathway analysis approach typically requires two types of input: experiment data and known biological networks. Experiment data is usually collected from high-throughput experiments that compare a condition phenotype to a control phenotype, where a condition phenotype can be a disease, a drug treatment, or the knock-out (KO) of a gene, while a control phenotype can be healthy, a different disease or drug treatment, or the wild-type (non-KO) samples. Experiment data can be obtained from multiple technologies that produce different types of data: gene expression, protein abundance, metabolite concentration, miRNA expression, etc. Biological networks or pathways are often represented in the form of graphs that capture our current knowledge about the interactions of genes, proteins, metabolites, or compounds in an organism. The pathway data is accumulated, updated, and refined by amassing knowledge from scientific literature describing individual interactions or high-throughput experiment results.
Table 2 shows the summary of input format and mathematical modeling of the surveyed approaches. Most pathway analysis methods analyze data from high-throughput experiments, such as microarrays, next-generation sequencing, or proteomics. They accept either a list of gene IDs or a list of such gene IDs associated with measured changes. These changes could be measured with different technologies and therefore can serve as proxies for different biochemical entities. For instance, one could use gene expression changes measured with microarrays, or protein levels measured with a proteomic approach, etc.
Different analysis methods use different input formats. Many methods accept a list of all genes considered in the experiment together with their expression values. Some analysis methods select a subset of genes, considered to be differentially expressed (DE), based on a predefined cut-off. The cut-off is typically applied on fold-change, p-value, or both. These methods use the list of DE genes and their corresponding statistics (fold-change, p-value) as input. Other methods use only the list of DE genes, without corresponding expression values, because their scoring methods are based only on the relative positions of the genes in the graph.
Methods that use cut-offs are sensitive to the chosen threshold value, because a small change in the cut-off may drastically change the number of selected genes [61]. In addition, they typically use the most significant genes and discard the rest, whose weaker but coordinated changes may also have a significant impact on pathways. Genes with moderate differential expression may be lost, even though they might be important players in the impacted pathways [62]. Furthermore, the genes included in the set of DE genes can vary dramatically if the selection methods are changed. Hence, the results of pathway analyses based on DE genes may be vastly different depending on both the selection method and the threshold value [63]. Moreover, for the same disease, independent studies or measurements often produce different sets of DE genes [64-66]. This makes approaches that use DE genes as input appear even more unreliable.
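The threshold sensitivity discussed above can be illustrated with a small sketch. The gene names, fold-changes, p-values, and the helper `select_de` are all hypothetical and were chosen only so that several genes sit close to the cut-offs; real data and real selection procedures will differ.

```python
# Hypothetical per-gene statistics: (gene, log2 fold-change, p-value).
gene_stats = [
    ("G1", 2.10, 0.001),
    ("G2", 1.05, 0.010),
    ("G3", 1.02, 0.030),
    ("G4", 0.98, 0.020),
    ("G5", 0.95, 0.048),
    ("G6", -1.00, 0.052),
    ("G7", 0.40, 0.200),
]


def select_de(stats, min_abs_lfc, max_p):
    """Return the genes that pass both the fold-change and p-value cut-offs."""
    return [
        gene
        for gene, lfc, p_value in stats
        if abs(lfc) >= min_abs_lfc and p_value <= max_p
    ]


# A common choice of cut-offs versus a slightly relaxed one.
strict = select_de(gene_stats, min_abs_lfc=1.0, max_p=0.05)
relaxed = select_de(gene_stats, min_abs_lfc=0.95, max_p=0.055)

print("strict :", strict)
print("relaxed:", relaxed)
```

In this toy example a marginal relaxation of both thresholds doubles the size of the DE list, so any downstream pathway score that counts DE genes would change accordingly.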
Usually pathways are sets of genes and/or gene products that interact with each other in a coordinated way to accomplish a given biological function. A typical signaling pathway (in KEGG for instance) uses nodes to represent genes or gene products and edges to represent signals, such as activation or repression, that go from one gene to another. A typical metabolic pathway uses nodes to represent biochemical compounds and edges to represent reactions that transform one or more compounds into one or more other compounds. These reactions are usually carried out or controlled by enzymes, which are in turn coded by genes. Hence, in a metabolic pathway, genes or gene products are associated with edges rather than with nodes, as they are in a signaling pathway. The immediate consequence of this difference is that many techniques cannot be applied directly on all available pathways. This is why the analysis of metabolic pathways is still generally and arguably under-developed (only 128 Google Scholar citations to date for the original ScorePAGE paper [21] and not many other methods available for metabolic pathway analysis). However, the topology-based analysis of signaling pathways has been very successful (over 1200 citations to date for the two impact analysis papers mentioned above [22,23] and over 30 other methods developed since).
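To make the node-versus-edge distinction concrete, here is a small sketch of the two representations as plain Python data. The edge lists are tiny hand-written fragments used only for illustration, not data parsed from KEGG, and the helper names and fold-change values are assumptions.

```python
# Signaling pathway: genes are nodes; each edge carries a signal type.
signaling_edges = [
    ("EGFR", "GRB2", "activation"),
    ("GRB2", "SOS1", "activation"),
]

# Metabolic pathway: compounds are nodes; each edge is a reaction
# annotated with the gene(s) coding for the catalyzing enzyme.
metabolic_edges = [
    ("glucose", "glucose-6-phosphate", {"HK1"}),
    ("glucose-6-phosphate", "fructose-6-phosphate", {"GPI"}),
]


def node_labels(edges):
    """Collect the labels of all nodes that appear in an edge list."""
    return {node for source, target, _ in edges for node in (source, target)}


def edge_genes(edges):
    """Collect the genes attached to the edges of a metabolic pathway."""
    return set().union(*(genes for _, _, genes in edges))


# Hypothetical measured log fold-changes.
measured = {"EGFR": 1.8, "SOS1": -0.4, "HK1": 2.2}

# Node-based lookup works for the signaling pathway ...
node_hits = measured.keys() & node_labels(signaling_edges)
# ... but finds nothing in the metabolic pathway, whose nodes are compounds.
naive_hits = measured.keys() & node_labels(metabolic_edges)
# The genes of the metabolic pathway must be looked up on the edges instead.
edge_hits = measured.keys() & edge_genes(metabolic_edges)

print(node_hits, naive_hits, edge_hits)
```

A method that places gene scores on nodes therefore needs an explicit conversion step before it can be applied to a metabolic pathway.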
There are other types of biological networks that incorporate genome-wide interactions between genes or proteins, such as protein-protein interaction (PPI) networks. These networks are not restricted to specific biological functions. The main caveat related to PPI data is that most such data are obtained from bait-prey laboratory assays, rather than from in vivo or in vitro studies. The fact that two proteins stick to each other in an assay performed in an artificial environment can be misleading, since the two proteins may never be present at the same time in the same tissue or the same part of the cell.
Publicly available curated pathway databases used by methods listed in Table 2 are KEGG [67], NCI-PID [68], BioCarta [69], WikiPathways [70], PANTHER [71], and Reactome [72]. These knowledge bases are built by manually curating experiments performed in different cell types under different conditions. These curated knowledge bases are more reliable than protein interaction networks but do not include all known genes and their interactions. As an example, despite being continuously updated, KEGG included only about 5,000 human genes in signaling pathways, while the number of protein-coding genes is estimated to be between 19,000 and 20,000 [73].

| Method | Experiment input | Pathway database | Graph model | Pathway scoring |
|---|---|---|---|---|
| Signaling pathway analysis using gene expression | | | | |
| MetaCore | DE genes | proprietary canonical pathway, genome-scale network | Single-type, directed | Hierarchically aggregated |
| Pathway-Express | DE genes change, or measured genes change* | KEGG signaling | Single-type, directed | Hierarchically aggregated |
| PathOlogist | Measured genes expression | KEGG | Multi-type, directed | Hierarchically aggregated |
| iPathwayGuide | DE genes change, or measured genes change | KEGG signaling, Reactome, NCI, BioCarta | Single-type, directed | Hierarchically aggregated |
| SPIA | DE genes change | KEGG signaling | Single-type, directed | Hierarchically aggregated |
| NetGSA | Measured genes expression | KEGG signaling | Single-type, directed | Multivariate analysis |
| PWEA | Measured genes expression | YeastNet | Single-type, undirected | Hierarchically aggregated |
| TopoGSA | DE genes | PPI network, KEGG | Single-type, undirected | Hierarchically aggregated |
| TopologyGSA | Measured genes expression | NCI-PID | Single-type, undirected | Multivariate analysis |
| DEGraph | Measured genes expression | KEGG | Single-type, undirected | Multivariate analysis |
| GGEA | Measured genes expression | KEGG | Single-type, directed | Aggregate fuzzy similarity |
| BPA | Measured genes expression, with cut-off | NCI-PID | Single-type, DAG | Bayesian network |
| GANPA | DE genes change, or measured genes expression | PPI network, KEGG, Reactome, NCI-PID, HumanCyc | Single-type, undirected | Hierarchically aggregated |
| ROntoTools | DE genes change, or measured gene expression | KEGG signaling | Single-type, directed | Hierarchically aggregated |
| BAPA-IGGFD | Measured genes expression, with cut-off | Literature-based interaction database, KEGG, WikiPathways, Reactome, MSigDB, GO BP, PANTHER; constructed gene association network from PPIs; co-annotation in GO Biological Process (BP); and co-expression in microarray data | Single-type, DAG | Bayesian network |
| CePa | DE genes expression, or measured genes expression | NCI-PID | Single-type, directed | Hierarchically aggregated |
| THINK-Back-DS | DE genes change, measured genes expression | KEGG, PANTHER, BioCarta, Reactome, GenMAPP | Single-type, directed | Hierarchically aggregated |
| TBScore | DE genes change | KEGG signaling | Single-type, directed | Hierarchically aggregated |
| ACST | Measured genes expression | KEGG signaling | Single-type, directed | Hierarchically aggregated |
| EnrichNet | DE genes list | PPI network, KEGG, BioCarta, WikiPathways, Reactome, NCI-PID, InterPro, GO with STRING 9.0 | Single-type, undirected | Hierarchically aggregated |
| clipper | Measured genes expression | BioCarta, KEGG, NCI-PID, Reactome | Single-type, directed | Multivariate analysis |
| DEAP | Measured genes expression | KEGG, Reactome | Single-type, directed | Hierarchically aggregated |
| DRAGEN | Measured genes expression | RegulonDB, M3D, HTRIdb, ENCODE, MSigDB | Single-type, directed | Linear regression |
| ToPASeq*** | Measured genes expression | KEGG | Single-type, directed | Hierarchical & multivariate |
| pDis | DE genes change, or measured genes change* | KEGG signaling | Single-type, directed | Hierarchically aggregated |
| SPATIAL | DE genes change | KEGG signaling | Single-type, directed | Hierarchically aggregated |
| BLMA**** | Measured genes expression | KEGG signaling | Single-type, directed | Hierarchically aggregated |
| Signaling pathway analysis using multiple types of data | | | | |
| PARADIGM | Measured genes expression, copy number, and protein levels | Constructed PPI networks from MIPS, DIP, BIND, HPRD, IntAct, and BioGRID | Multi-type, directed | Hierarchically aggregated |
| microGraphite | Measured gene expression, and miRNA expression | BioCarta, KEGG, NCI-PID, Reactome | Single-type, directed | Multivariate analysis |
| mirIntegrator | DE genes change, and DE miRNA change | KEGG signaling, miRTarBase | Single-type, directed | Hierarchically aggregated |
| Metabolic pathway analysis | | | | |
| ScorePAGE | Measured genes expression | KEGG metabolic | Single-type, undirected | Hierarchically aggregated |
| TAPPA | Measured genes expression | KEGG metabolic | Single-type, undirected | Hierarchically aggregated |
| MetPA | DE metabolites change | KEGG metabolic | Single-type, directed | Hierarchically aggregated |
| MetaboAnalyst | DE genes change and DE metabolites change | KEGG metabolic | Single-type, directed | Hierarchically aggregated |

Table 2: A summary of the experiment data input format and biological network databases used by the surveyed methods is presented. Experiment input shows the experiment data input format for each method. DE means differentially expressed. Change means fold-change value or t-statistics when comparing gene/metabolite values between two phenotypes. Measured means the list of all the genes/metabolites measured in the experiment; List represents a list of gene/metabolite identifiers (e.g., symbols). With cut-off shows methods that take as input the list of all measured genes and mark the DE genes in the analysis. Pathway database shows the name of the knowledge source for biological interactions. Graph model is a characteristic that shows whether the graph has one or multiple types of nodes, as well as whether it is directed or undirected. * the package ROntoTools (Bioconductor) allows for analysis using expression values of all genes. *** ToPASeq provides an R package that runs TopologyGSA, DEGraph, clipper, SPIA, TBScore, PWEA, TAPPA. **** BLMA provides an R package for bi-level meta-analysis that runs SPIA, ORA, GSA, and PADOG using one or multiple expression datasets.
The implementation of analysis methods constrains the software to accept a specific input pathway data format, while the underlying graph models in the methods are independent of the input format. Regardless of the pathway format, this information must be parsed into a computer-readable graph data structure before being processed. The implementation may incorporate a parser, or this may be up to the user. For instance, SPIA accepts any signaling pathway or network if it can be transformed into an adjacency matrix representing a directed graph where all nodes are components and all edges are interactions. NetGSA is similarly flexible with regard to signaling and metabolic pathways. SPIA provides KEGG signaling pathways as a set of pre-parsed adjacency matrices. The methods described in this chapter may be restricted to only one pathway database, or may accept several.
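The parsing step mentioned above can be sketched as follows: a list of directed interactions is turned into an adjacency matrix. This is a generic illustration, not SPIA's actual data format. The function name, the toy node names, and the orientation (row = source, column = target) are assumptions, and a real tool may use a different orientation or keep one matrix per interaction type.

```python
def adjacency_matrix(nodes, edges):
    """Build a 0/1 adjacency matrix for a directed graph.

    Assumed convention: matrix[i][j] == 1 means there is an edge
    from nodes[i] to nodes[j].
    """
    index = {node: i for i, node in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for source, target in edges:
        matrix[index[source]][index[target]] = 1
    return matrix


# Toy directed pathway (hypothetical): A -> B -> C and A -> C.
nodes = ["A", "B", "C"]
edges = [("A", "B"), ("B", "C"), ("A", "C")]
adj = adjacency_matrix(nodes, edges)

for node, row in zip(nodes, adj):
    print(node, row)
```

Because the matrix is not symmetric, the direction of each interaction is preserved, which is what methods based on signal propagation rely on.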
To the best of our knowledge, there has been no comprehensive approach to compare the pros and cons of existing knowledge bases. It is completely up to researchers to choose the tools that are able to work with the database they trust. Intuitively, methods that are able to exploit the complementary information available from different databases have an edge over methods that work with a single database.
3.3 Graph models
Pathway analysis approaches use two major graph models to represent biological networks obtained from knowledge bases: i) single-type and ii) multiple-type. The first model allows only one type of node, i.e., a gene or protein, with edges representing molecular interactions between the nodes (KEGG signaling in Figure 2). Models that contain directed graphs are more suitable for analyses that include signal propagation of gene perturbation. The second graph model allows multiple types of nodes, such as components and interactions (NCI-PID in Figure 2). Multi-type graph models are more complex than single-type models, but they are expected to capture more pathway characteristics. For example, single-type models are limited when trying to describe all relations between multiple components that are involved in the same interaction. Bipartite graphs, which contain two types of nodes and allow connections only between nodes of different types, are a particular case of multi-type graph models.
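The difference between the two graph models can be sketched with a toy bipartite graph that has component and process nodes. The node names, the helper functions, and the projection rule (connect every input of a process to every output) are illustrative assumptions, not the procedure used by any specific surveyed method.

```python
# Toy multi-type graph (hypothetical): components A and B jointly
# take part in process r1, which produces component C.
node_type = {"A": "component", "B": "component", "C": "component",
             "r1": "process"}
edges = [("A", "r1"), ("B", "r1"), ("r1", "C")]


def is_bipartite(node_type, edges):
    """True if every edge connects two nodes of different types."""
    return all(node_type[u] != node_type[v] for u, v in edges)


def project_to_single_type(node_type, edges):
    """Collapse process nodes into direct component-to-component edges.

    Assumes the graph is bipartite (component <-> process edges only).
    """
    inputs, outputs = {}, {}
    for u, v in edges:
        if node_type[v] == "process":
            inputs.setdefault(v, []).append(u)
        elif node_type[u] == "process":
            outputs.setdefault(u, []).append(v)
    projected = set()
    for process, sources in inputs.items():
        for source in sources:
            for target in outputs.get(process, []):
                projected.add((source, target))
    return sorted(projected)


print(is_bipartite(node_type, edges))
print(project_to_single_type(node_type, edges))
```

After the projection, the single-type graph contains only pairwise edges into C. The fact that A and B act together in one interaction is no longer represented, which is the limitation of single-type models noted above.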
The majority of analysis methods surveyed here use a single-type graph model. Some apply the analysis on a directed or undirected single-type network built using the input pathway, while others transform the pathways into graphs with specific characteristics. An example of the latter is TopologyGSA, which transforms the directed input pathway into an undirected decomposable graph, which has the advantage of being easily broken down into separate modules [74]. In this method, decomposable graphs are used to find important submodules, those which drive the changes across the whole pathway. For each pathway, TopologyGSA creates an undirected moral graph1 from the
¹ The moral graph of a DAG is the undirected graph created by
adding an (undirected) edge between all parents of the same node
(sometimes called marrying), and then replacing all directed edges
by undirected edges. The name stems from the fact that, in a moral
graph, two nodes that have a common child are required to be
married by sharing an edge.
Figure 2: Five biological network graph models used by public
databases are displayed. The KEGG database contains both signaling
and metabolic networks. Signaling networks have genes/gene products
as nodes and regulatory signals as edges. Types of regulatory signals
include activation, inhibition, phosphorylation, and many others.
Metabolic networks have biochemical compounds as nodes and
chemical reactions as edges. Enzymes (specialized proteins/gene
products) catalyze biochemical reactions, therefore genes are
linked to the edges in these networks. The Reactome database is a
collection of biochemical reactions that are grouped in
functionally related sets to form a pathway. There are two types of
nodes: biochemical compounds and reactions. Reaction nodes link
biochemical compounds as reactants and products. NCI-PID signaling
networks also have two types of nodes: component nodes and process
nodes. Component nodes are usually biomolecular components. Process
nodes are usually biochemical reactions or biological processes. In
these networks, process nodes link two or more component nodes
through directed edges. Process nodes are assigned one of the
following states: positive regulation, negative regulation, or
involved in. Protein-protein interaction (PPI) networks have
proteins as nodes and physical binding as edges or interactions.
Two-hybrid assays are typically used to determine protein-protein
interactions. PPIs can be directed in the bait-prey orientation,
when the bait-prey relation is considered (top), or undirected,
when it is not (bottom). The Biological Pathway Exchange (BioPAX)
format has physical entities as nodes and conversions as edges. The
advantage of this representation is that it is generic and provides
a lot of flexibility to accommodate various types of interactions.
In addition, it provides a machine-readable standard that can be
used by all databases to provide their data in a unified format.
Network nodes can be genes, gene products, complexes, as well as
non-coding RNA. Network edges can be assembly of a complex,
disassembly of a complex, or biochemical reactions, among others.
underlying directed acyclic graph (DAG) by connecting the
parents of each child and removing the edge direction. The moral
graph is then used to test the hypothesis that the underlying
network is changed significantly between the two phenotypes. If the
research hypothesis is rejected, a decomposable/triangulated
graph is generated from the moral graph by adding new edges. This
graph is broken into the maximal possible submodules and the
hypothesis is re-tested on each of them.
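The moralization step described above can be sketched in a few lines of Python. This is a minimal illustration and not the TopologyGSA implementation: the function name `moral_graph`, the child-list input format, and the toy DAG are assumptions made for the example.

```python
from itertools import combinations


def moral_graph(dag):
    """Return the moral graph of a DAG as a set of undirected edges.

    `dag` maps each node to the list of its children (hypothetical format).
    Each undirected edge is stored as a frozenset of its two endpoints.
    """
    parents = {}
    edges = set()
    for parent, children in dag.items():
        for child in children:
            parents.setdefault(child, set()).add(parent)
            # Replace every directed edge by an undirected one.
            edges.add(frozenset((parent, child)))
    # "Marry" every pair of parents that share a child.
    for shared_parents in parents.values():
        for a, b in combinations(sorted(shared_parents), 2):
            edges.add(frozenset((a, b)))
    return edges


# Toy DAG (assumed for illustration): A -> C, B -> C, C -> D.
toy_dag = {"A": ["C"], "B": ["C"], "C": ["D"], "D": []}
moral = moral_graph(toy_dag)
```

In the toy DAG, A and B share the child C, so the only edge added by marrying is A-B; the three original edges are kept without direction.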
PathOlogist and PARADIGM are the two surveyed methods that use
multi-type graph models. PathOlogist uses a bipartite graph model
with component and interaction nodes. PARADIGM, conceptually
motivated by the central dogma of molecular biology, takes a
pathway graph as input and converts it into a more detailed graph,
where each component node is replaced by several more specific
nodes: biological entity nodes, interaction nodes, and nodes
containing observed experiment data. The observed experiment nodes
could in principle contain gene expression and copy number
information. Biological entity nodes are DNA, mRNA, protein,
and active protein. The interaction nodes are transcription,
translation, or protein activation, among others. Biological entity
and interaction node values are derived from these data and specify
the probability of the node being active. These are the hidden
states of the model. From the mathematical model perspective, models
that allow multiple types of nodes (component nodes and
interaction nodes) are more flexible and
are able to model both AND and OR gates, which are very common
when describing cellular processes.
3.4 Pathway scoring strategies
The goal of the scoring method is to compute a score for each
pathway based on the expression change and graph model, resulting in
a ranked list of pathways or sub-pathways. There are a variety of
approaches to quantify the changes in a pathway. Some of the
analysis methods use a hierarchically aggregated scoring
algorithm, where on the first level, a score is calculated and
assigned to each node or pair of nodes (component and/or
interaction). On the second level, these scores are aggregated to
compute the score of the pathway. On the last level, the
statistical significance of the pathway score is assessed using
univariate hypothesis testing. Another approach assigns a random
variable to each node and calculates a multivariate probability
distribution for each pathway. The output score can be
calculated in two ways. One way is to use multivariate hypothesis
testing to assess the statistical significance of changes in the
pathway distribution between the two phenotypes. The other way is
to estimate the distribution parameters based on the Bayesian
network model and use this distribution to compute a probabilistic
score to measure the changes. In this section, we provide details
regarding the scoring algorithms of the surveyed methods. See
Figure 3 for the categories of scoring algorithms.
3.4.1 Hierarchically aggregated scoring
The workflow of the hierarchically aggregated scoring strategy
has three levels: node statistic computation, pathway statistic
computation, and the evaluation of the significance of the
pathway statistic.
Node level scoring. Most of the surveyed methods incorporate
pathway topology information in the node scores. The node level
scoring can be divided into four categories: i) graph measures
(centrality), ii) similarity measures, iii) probabilistic
graphical models, and iv) normalized node value (NNV). Approaches in
the first category use centrality measures or a variation of these
measures to score nodes in a given pathway. Centrality
measures represent the importance of a node relative to all other
nodes in a network. Several centrality measures can be applied
to networks of genes and their interactions, namely
degree centrality, closeness, betweenness, and eigenvector
centrality. Degree centrality accounts for the number of directed
edges that enter and leave each node. Closeness sums the shortest
distance from each node to all other nodes in the network. Node
betweenness measures the importance of a node according to the
number of shortest paths that pass through it. Eigenvector
centrality uses the adjacency matrix of a graph to determine
a dominant eigenvector; each element of this vector is a score for
the corresponding node. Thus, each score is influenced by the scores
of neighboring nodes. In the case of directed graphs, a node that
has many downstream genes has more influence and receives a higher
score. Methods in this category include MetaCore, SPIA,
iPathwayGuide, MetPA, SPATIAL, TopoGSA, DEAP, pDis, mirIntegrator,
Pathway-Express, and BLMA.
Approaches in the second category use similarity measures in
their node level scoring. Similarity measures estimate the
co-expression, behavioral similarity, or co-regulation of pairs of
components. Their values can be correlation coefficients,
covariances, or dot products of the gene expression profiles across
time or samples. In these methods, the pathways with clusters of
highly correlated
Figure 3: A summary of the statistical models used by
the surveyed methods. Most methods use the hierarchically aggregated
scoring strategy, in which the score is computed at the node level
before being aggregated at the pathway level for significance
assessment. In the left panel, the rows show the node-level
statistical models while the columns show the aggregating strategies
(linear, non-linear, or weighted gene set). Approaches in the
multivariate analysis and Bayesian network categories use
multivariate modeling and Bayesian networks, respectively, to find
pathways that are most likely to be impacted. Both BLMA and
ToPASeq provide R packages that run multiple algorithms. DRAGEN
follows a linear regression strategy while GGEA applies fuzzy
modeling to rank the pathways.
genes are considered more significant. At the node level, a
score is assigned to each pair of nodes in the network; this score
is the ratio of a similarity measure to the shortest path distance
between these nodes. Thus, the topology information is captured in
the node score by incorporating the shortest path distance of the
pair. Methods in this category include ScorePAGE and PWEA.
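The pair score described above (similarity divided by shortest path distance) can be sketched as follows. This is a simplified illustration under stated assumptions: Pearson correlation is used as the similarity measure, the pathway is treated as an undirected graph, and the gene names and expression values are invented for the example. It is not the exact statistic of ScorePAGE or PWEA.

```python
from collections import deque
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def shortest_path_length(adj, source, target):
    """Breadth-first search distance; None if `target` is unreachable."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return dist[node]
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return None


def pair_score(expr, adj, gene_a, gene_b):
    """Similarity of two genes divided by their distance in the pathway."""
    distance = shortest_path_length(adj, gene_a, gene_b)
    if not distance:  # unreachable pair or identical genes
        return 0.0
    return pearson(expr[gene_a], expr[gene_b]) / distance


# Hypothetical pathway g1 - g2 - g3 and made-up expression profiles.
adj = {"g1": ["g2"], "g2": ["g1", "g3"], "g3": ["g2"]}
expr = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.0],
    "g3": [4.0, 3.0, 2.0, 1.0],
}
score_adjacent = pair_score(expr, adj, "g1", "g2")
score_distant = pair_score(expr, adj, "g1", "g3")
```

Adjacent, co-expressed genes keep their full correlation, whereas the contribution of the more distant pair g1 and g3 is halved because two edges separate them.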
Approaches in the third category incorporate the topology in the
node level scoring using a probabilistic graphical model. In this
model, nodes are random variables, and edges define the conditional
dependency of the nodes they link. For example, PARADIGM takes
observed experiment data and calculates scores for all component
nodes, in both observed and hidden states, from the detailed network
created by the method based on the input pathway. For each node
score, a positive or negative value denotes how likely it is for the
node to be active or inactive, respectively. The scores are
calculated to maximize the probability of the observed values. A
p-value is associated with each score of each sample such that each
node can be tagged as significantly active, significantly inactive,
or not significant. For each network, a matrix of p-values is
output, in which columns are samples and rows are component nodes.
Methods in this category include PARADIGM and PathOlogist.
Approaches in the fourth category simply compute the score for
each node using the information obtained from the experiment input.
For example, TAPPA calculates the score of each node as the square
root of the normalized log gene expression (node value), while ACST
and TBScore calculate the node level score using a sign statistic
and log fold-change.
Pathway level scoring. There are three different ways to compute
the pathway score: i) linear, ii) non-linear, and iii) weighted gene
set. Most methods aggregate node level statistics to pathway level
statistics using linear functions such as averaging or summation.
For example, iPathwayGuide computes the score of a pathway as the sum
of the scores of all its genes, while TBScore weights the pathway DE
genes based on their log fold change and the number of distinct DE
genes directly downstream of them, using a depth-first search
algorithm.
Approaches in the second group (non-linear) use a non-linear
function to compute the pathway scores. For example, TAPPA computes
the pathway score for each sample as a weighted sum of the product
of all node pair scores in the pathway. The weight coefficient is 0
when there is no edge between a pair. For any connected node pair
the weight is a sign function, which represents joint up- or
down-regulation of the pair. Another example is EnrichNet, in which
pathway scores measure the difference of the node score distribution
for a pathway and a background network/gene set which consists of
all pathways. At the node level, the distance of all DE genes to
the pathway is measured and summarized as a distance distribution.
The method assumes that the most relevant pathway is the one with
the greatest difference between the pathway node score distribution
and the background score distribution. The difference between the
two distributions is measured by the weighted averaging of the
difference between the two discretized and normalized
distributions. The averaging method down-weights the higher
distances and emphasizes the lower distance nodes.
Methods in the third group (weighted gene set) design scoring
techniques that incorporate existing gene set analysis methods, such
as GSEA [14], GSA [15], or LRPath [75]. Pathway-level scores can be
calculated using node scores, which represent the topology
characteristics of the pathway, as weight adjustments to a gene set
analysis method. PWEA, GANPA, THINK-Back-DS, and CePa use this
approach and we refer to them as weighted gene set analysis
methods.
Pathway significance assessment. Pathway scores are intended to
provide information regarding the amount of change incurred in the
pathway between two phenotypes. However, the amount of change is not
meaningful by itself since any amount of change can take place just
by chance. An assessment of the significance of the measured changes
is thus required.
TopoGSA, MetPA, and EnrichNet output scores without any
significance assessment, leaving it up to the user to interpret the
results. This is problematic because the user does not have any
instrument to help distinguish between changes due to noise or
random causes and meaningful changes, which are unlikely to occur
just by chance and are therefore possibly related to the phenotype.
The rest of the analysis methods perform a hypothesis test for each
pathway. The null hypothesis is that the value of the observed
statistic is due to random noise or chance alone. The research
hypothesis is that the observed values are substantial enough that
they are potentially related to the phenotype. A p-value for the
calculated score is then computed and a user-defined threshold on
the p-value is used to decide whether the null hypothesis can be
rejected or not for each pathway. Finally, a correction for multiple
comparisons should be performed.
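The text does not prescribe a particular correction for multiple comparisons. As one common choice, the sketch below applies the Benjamini-Hochberg false discovery rate procedure to a list of pathway p-values. The function name and the example p-values are assumptions made for illustration.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values
    # remain monotone in the rank of the raw p-values.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Made-up p-values for four pathways.
raw_pvalues = [0.01, 0.04, 0.03, 0.20]
adjusted_pvalues = benjamini_hochberg(raw_pvalues)
```

A pathway is then reported as significant when its adjusted p-value falls below the user-defined threshold.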
Typically, pathway analysis methods compute one score per
pathway. The distribution of this score under the null hypothesis
can be constructed and compared to the observed value. However,
there are often too few samples to calculate this distribution, so
it is assumed that the distribution is
known. For example, in MetaCore and many other techniques, when
the pathway score is the number of DE nodes that fall on the
pathway, the distribution is assumed to be hypergeometric. However,
the hypergeometric distribution assumes that the variables (genes
in this case) are independent, which is incorrect, as witnessed by
the fact that the pathway graph structure itself is designed to
reflect the specific ways in which the genes influence each other.
Another approach to identify the distribution is to use statistical
techniques such as the bootstrap [76]. Bootstrapping can be done
either at the sample level, by permuting the sample labels, or at
the gene set level, by permuting the values assigned to the genes in
the set.
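To make the hypergeometric assumption concrete, the sketch below computes the upper-tail probability of observing at least a given number of DE genes on a pathway. It is a generic over-representation calculation, not the code of MetaCore or any other tool; the function name, the argument names, and the example counts are assumptions.

```python
from math import comb


def hypergeom_pvalue(total_genes, de_genes, pathway_size, de_in_pathway):
    """P(X >= de_in_pathway) under the hypergeometric model.

    The model draws `pathway_size` genes without replacement from
    `total_genes` genes, of which `de_genes` are differentially expressed.
    """
    denominator = comb(total_genes, pathway_size)
    upper = min(de_genes, pathway_size)
    tail = sum(
        comb(de_genes, k) * comb(total_genes - de_genes, pathway_size - k)
        for k in range(de_in_pathway, upper + 1)
    )
    return tail / denominator


# Made-up example: 20 measured genes, 5 of them DE, a pathway of 4 genes,
# all 4 of which are DE.
p_all_de = hypergeom_pvalue(20, 5, 4, 4)
```

Note that this calculation treats genes as independent draws, which is exactly the assumption criticized above.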
3.4.2 Multivariate analysis and Bayesian network
Multivariate analysis methods mostly use multivariate
probability distributions to compute pathway-level statistics and
these can be grouped into two subcategories. Methods in the first
category use multivariate hypothesis testing, while methods in the
second category are based on Bayesian networks.
NetGSA, TopologyGSA, DEGraph, clipper, and microGraphite are
methods based on multivariate hypothesis testing. These analysis
methods assume the vectors of gene expression values in each
(sub)pathway are random vectors with multivariate normal
distributions. The network topology information is stored in the
covariance matrix of the corresponding distribution. For a
network, if the two distributions of the gene expression vectors
corresponding to the two phenotypes are significantly different, the
network is assumed to be significantly impacted when comparing the
two phenotypes. The significance assessment is done by a
multivariate hypothesis test. The definition of the null hypothesis
for the statistical tests and the techniques to calculate the
parameters of the distributions are the main differences between
these analysis methods.
BPA and BAPA-IGGFD are two methods based on Bayesian networks.
In a Bayesian network, which is a special case of probabilistic
graphical models, a random variable is assigned to each node of a
directed acyclic graph (DAG). The edges in the graph represent the
conditional probabilities between nodes, so that the children are
independent from each other and the rest of the graph when
conditioned on the parents. In BPA, the value of the Bayesian
random variable assigned to each node captures the state of a gene
(DE or not). In contrast, in BAPA-IGGFD each random variable
assigned to an edge is the probability that up or down
regulation of the genes at both ends of an interaction is
concordant with the type of interaction, which can be activation or
inhibition. In both BPA and BAPA-IGGFD, each random variable is
assumed to follow a binomial distribution whose probability of
success follows a beta distribution. However, these two methods use
different approaches in representing the multivariate distribution
of the corresponding random vector. BPA assumes that the random
vector has a multinomial distribution, which is the generalization
of the binomial distribution. In this case, the vector of
success probabilities follows the Dirichlet distribution, which is
the multivariate extension of the beta distribution. In contrast,
BAPA-IGGFD assumes the random variables are independent, therefore
the multivariate distributions are calculated by multiplying the
distributions of the random variables in the vector. It is worth
mentioning that the assumption of independence in BAPA-IGGFD
is contradicted by evidence, specifically in the case of edges that
share nodes.
3.4.3 Other approaches
The four methods DRAGEN, GGEA, ToPASeq, and BLMA follow
strategies that are very different from those of the other 30
methods. BLMA implements a bi-level meta-analysis approach
that can be applied in conjunction with any of four statistical
approaches: SPIA [23], GSA [15], ORA [13], and PADOG [77]. The
package allows users to perform pathway analysis with one dataset or
with multiple datasets. Similarly, ToPASeq provides an R package
that runs TopologyGSA, DEGraph, clipper, SPIA, TBScore, PWEA, and
TAPPA. This method models the input (biological networks and
experiment data) in such a way that it can be used for any of the 7
different analyses.
DRAGEN is an analysis method that scores the interactions rather
than the genes and uses a regression model to detect differential
regulation. DRAGEN fits each edge of a pathway into two linear
models (for case and control) and then computes a p-value that
represents the difference between the two models. For each pathway,
a summary statistic is computed by combining the p-values of the
edges using a weighted Fisher's method [78]. GGEA, on the other
hand, uses a Petri net [79] to model the pathway. The summary
statistic, named consistency, is computed from the fuzzy similarity
between the observed gene expression and the Petri net with fuzzy
logic (PNFL) [80]. For both DRAGEN and GGEA, the p-value of the
pathway is calculated by comparing the observed summary statistic
against the null distribution that is constructed by permutation.
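For readers unfamiliar with p-value combination, the sketch below shows the classical, unweighted Fisher's method. DRAGEN uses a weighted variant [78] and derives the pathway p-value by permutation, so this is only an illustration of the underlying idea; the function name and example values are assumptions.

```python
from math import exp, log


def fisher_combined_pvalue(pvalues):
    """Combine independent p-values with the unweighted Fisher's method.

    The statistic -2 * sum(ln p) follows a chi-square distribution with
    2k degrees of freedom under the null hypothesis, where k is the
    number of p-values. All p-values must be strictly positive.
    """
    k = len(pvalues)
    statistic = -2.0 * sum(log(p) for p in pvalues)
    half = statistic / 2.0
    # Closed-form chi-square survival function for even degrees of
    # freedom: exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return statistic, exp(-half) * total


# Two made-up edge p-values.
statistic, combined = fisher_combined_pvalue([0.05, 0.05])
```

With a single p-value the combined value equals the input, which is a quick sanity check of the closed-form tail probability.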
4 Challenges in pathway analysis
Pathway analysis has become the first choice for gaining
insights into the underlying biology due to its explanatory power.
However, there are outstanding annotation and methodological
limitations that have not been addressed [19, 81]. There are three
main limitations of current knowledge bases. First, existing
knowledge bases are unable to keep up with the information
available in data obtained from recent technologies. For example,
RNA-Seq data allows us to identify transcripts that are active under
certain conditions. Alternatively spliced transcripts, even if they
originate from the same gene, may have distinct or even opposite
functions [82]. However, most knowledge bases provide pathway
annotation only at the gene level. Second, there is a lack of
condition- and cell-specific information, i.e. information about
cell type, conditions, and time points. Finally, current pathway
annotations are neither complete nor perfectly accurate [19, 83].
For example, the number of genes in KEGG has remained around 5,000
despite continuous updates over the past 10 years, while there
are approximately 19,000 human genes annotated with at least one GO
term. The number of protein-coding genes is estimated to be between
19,000 and 20,000 [73], most of which are included in DNA microarray
assays, such as Affymetrix HG U133 plus 2.0.
Another challenge is the oversimplification that characterizes
many of the models provided by pathway databases. In principle, each
type of tissue might have different mechanisms, so generic,
organism-level pathways present a somewhat simplistic
description of the phenomena. Furthermore, signaling and metabolic
processes can also differ from one condition to another, or
even from one patient to another. Understanding the specific
pathways that are impacted in a given phenotype or sub-group of
patients should be another goal for the next generation of pathway
analysis tools. See Khatri et al. [19] for a more detailed
discussion of the annotation limitations of existing knowledge
bases.
Here we focus on the challenges of pathway analysis from
a computational perspective. We demonstrate that there is a
systematic bias in pathway analysis [84]. This leads to
unreliability of most, if not all, pathway analysis approaches. We
also discuss the lack of benchmark datasets and pipelines to
assess the performance of existing approaches.
4.1 Systematic bias of pathway analysis methods
Pathway analysis approaches often rely on hypothesis testing to
identify the pathways that are impacted under the effects of the
diseases. Null distributions are used to model populations so that
statistical tests can determine whether an observation is
unlikely to occur by chance. In principle, the p-values produced by
a sound statistical test must be uniformly distributed under the
null hypothesis [85-88]. For example, the p-values that result from
comparing two groups using a t-test should be distributed uniformly
if the data are normally distributed [87]. When the assumptions of
statistical models do not hold, the resulting p-values are not
uniformly distributed under the null hypothesis. This makes
classical methods, such as the t-test, inaccurate since gene
expression values do not necessarily follow their assumptions. Here
we show that the problem also extends to pathway analysis, as the
pathway p-values obtained from statistical approaches are not
uniformly distributed under the null hypothesis. This might lead to
severe bias towards well-studied diseases, such as cancer, and thus
make the results unreliable [84].
Consider three pathway analysis methods that represent three
different classes of methods for pathway analysis: Gene Set Analysis
(GSA) [15] is a Functional Class Scoring method [14, 15, 77, 89],
Down-weighting of Overlapping Genes (PADOG) [77] is an enrichment
method [12, 90, 91], and Signaling Pathway Impact Analysis (SPIA)
[92] is a topology-aware method [22, 92]. To simulate the null
distribution, we download and process the data from 9 public
datasets: GSE14924 CD4, GSE14924 CD8, GSE17054, GSE12662, GSE57194,
GSE33223, GSE42140, GSE8023, and GSE15061. Using 140 control samples
from the 9 datasets, we simulate 40,000 datasets as follows. We
randomly label 70 samples as control samples and the remaining 70
samples as disease samples. We repeat this procedure 10,000 times to
generate different groups of 70 control and 70 disease samples. To
make the simulation more general, we also create 10,000 datasets
consisting of 10 control and 10 disease samples, 10,000 datasets
consisting of 10 control and 20 disease samples, and 10,000
datasets consisting of 20 control and 10 disease samples. We
then calculate the p-values of the
KEGG human signaling pathways using each of the three methods.
The effect of combining control (i.e. healthy) samples from
different experiments is to uniformly distribute all sources of bias
among the random groups of samples. If we compare groups of control
samples based on experiments, there could be true
differences due to batch effects. By pooling them together, we form
a population which is considered the reference population. This
approach is similar to selecting from a large group of people that
may contain different sub-groups (e.g. different ethnicities,
gender, race, life style, or living conditions). When we randomly
select samples (for the two random groups to be compared) from the
reference population, we expect all bias (e.g. ethnic subgroups) to
be represented equally in both random groups and therefore, we
should see no difference between these random groups, no matter how
many distinct ethnic subgroups were present in the population at
large. Therefore, the p-values of a test for difference between the
two randomly selected groups should be equally probable between zero
and one.
Figure 4 displays the empirical null distributions of p-values
using PADOG, GSA, and SPIA. The horizontal axes represent p-values
while the vertical axes represent p-value densities. Green panels
(A0-A6) show p-value distributions from PADOG, while blue (B0-B6)
and purple (C0-C6) panels show p-value distributions from GSA and
SPIA, respectively. For each method, the larger panel (A0, B0, and
C0) shows the cumulative p-values from all KEGG signaling pathways.
The small panels, 6 per method, display extreme examples of
non-uniform p-value distributions for specific pathways. For each
method, we show three distributions severely biased towards zero
(e.g. A1-A3), and three distributions severely biased towards one
(e.g. A4-A6).
These results show that, contrary to generally accepted beliefs,
the p-values are not uniformly
[Figure 4 panels; horizontal axes: p-values, vertical axes: density.
PADOG: (A0) all pathways; (A1) Pathways in cancer; (A2) Renal cell
carcinoma; (A3) Prostate cancer; (A4) Prion diseases; (A5) Pertussis;
(A6) African trypanosomiasis.
GSA: (B0) all pathways; (B1) Renal cell carcinoma; (B2) Endometrial
cancer; (B3) ErbB signaling pathway; (B4) Pertussis; (B5) Prion
diseases; (B6) African trypanosomiasis.
SPIA: (C0) all pathways; (C1) Rheumatoid arthritis; (C2) Amoebiasis;
(C3) Phagosome; (C4) Wnt signaling pathway; (C5) Basal cell
carcinoma; (C6) Hippo signaling pathway.]
Figure 4: The empirical distributions of p-values using
Down-weighting of Overlapping Genes (PADOG) - top, Gene Set Analysis
(GSA) - middle, and Signaling Pathway Impact Analysis (SPIA) -
bottom. The distributions are generated by re-sampling from 140
control samples obtained from 9 AML datasets. The horizontal axes
display the p-values while the vertical axes display the p-value
densities. Panels A0-A6 (green) show the distributions of p-values
from PADOG; panels B0-B6 (blue) show the distributions of
p-values from GSA; panels C0-C6 (purple) show the distributions of
p-values from SPIA. The large panels on the left, A0, B0, and C0,
display the distributions of p-values cumulated from all KEGG
signaling pathways. The smaller panels on the right display the
p-value distributions of selected individual pathways, which are
extreme cases. For each method, the upper three distributions, for
example A1-A3, are biased towards zero and the lower three
distributions, for example A4-A6, are biased towards one. Since none
of these p-value distributions are uniform, there will be
systematic bias in identifying significant pathways using any one of
the methods. Pathways that have p-values biased towards zero will
often be falsely identified as significant (false positives).
Likewise, pathways that have p-values biased towards one are more
likely to be among the false negative results even if they may be
implicated in the given phenotype.
distributed for the three methods considered. Therefore one should
expect a very strong and systematic bias in identifying significant
pathways for each of these methods. Pathways that have p-values
biased towards zero will often be falsely identified as
significant (false positives). Likewise, pathways that have p-values
biased towards one are likely to rarely meet the significance
requirements, even when they are truly implicated in the given
phenotype (false negatives). Systematic bias, due to non-uniformity
of p-value distributions, results in failure of the statistical
methods to correctly identify the biological pathways implicated in
the condition, and also leads to inconsistent and incorrect
results.
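One simple way to quantify how far an empirical null distribution of p-values departs from uniformity is the Kolmogorov-Smirnov distance to the Uniform(0, 1) distribution. The analysis above does not state that this measure was used; the sketch below is offered only as an assumed illustration, with made-up p-values.

```python
def ks_distance_to_uniform(pvalues):
    """Largest gap between the empirical CDF of the p-values and the
    CDF of the Uniform(0, 1) distribution, F(x) = x."""
    xs = sorted(pvalues)
    n = len(xs)
    return max(
        max((i + 1) / n - x, x - i / n)
        for i, x in enumerate(xs)
    )


# Evenly spread p-values (close to uniform) versus p-values piled up
# near zero, as in a pathway biased towards false positives.
near_uniform = [(i + 0.5) / 100 for i in range(100)]
biased_to_zero = [0.01] * 100

d_uniform = ks_distance_to_uniform(near_uniform)
d_biased = ks_distance_to_uniform(biased_to_zero)
```

A distance near zero indicates an approximately uniform null distribution, while values approaching one correspond to the extreme panels in Figure 4.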
4.2 Querying data from knowledge bases
Independent research groups have tried different strategies to
model complex bio-molecular phenomena. These independent efforts
have led to variation among pathway databases, complicating the
task of developing pathway analysis methods. Depending on the
database, there may be differences in information sources,
experiment interpretation, models of molecular interactions, or
boundaries of the pathways. Therefore, it is possible that
pathways with the same designation and aiming to describe the same
phenomena may have different topologies in different databases. As
an example, one could compare the insulin signaling pathways of
KEGG and BioCarta. BioCarta includes fewer nodes and emphasizes the
effect of insulin on transcription, while KEGG includes
transcription regulation as well as apoptosis and other
biological processes. Differences in graph models for molecular
interactions are particularly apparent when comparing the signaling
pathways in KEGG and NCI-PID. While KEGG represents the interaction
information using the directed edges themselves, NCI-PID introduces
process nodes to model interactions (see Figure 2). Developers are
facing the challenge of modifying methods to accept novel pathway
databases or modifying the actual pathway graphs to conform to the
method.
Pathway databases not only differ in the way that interactions
are modeled, but their data are provided in different formats as
well [16]. Common formats are Pathway Interaction Database
eXtensible Markup Language (PID XML), KEGG Markup Language (KGML),
Biological Pathway Exchange (BioPAX) Level 2 and Level 3, Systems
Biology Markup Language (SBML), and the Biological Connection Markup
Language (BCML) [93]. The NCI provides a unified assembly of
BioCarta and Reactome, as well as their in-house NCI-Nature
curated pathways, in NCI-PID format [68]. In order to unify pathway
databases, pathway information should be provided in a commonly
accepted format.
Another challenge is that the same biological pathways are
represented differently from one pathway database to another. None
of the tools is compatible with all database formats, so either
the pathway input or the underlying algorithm must be modified
in order to accommodate the differences. As
an example, a study by Vaske et al. [52] attempted to compare
SPIA [23] with their tool PARADIGM by re-implementing SPIA
in C and forcing its application on NCI-PID pathways.
Implementation errors are present in the C version of SPIA,
invalidating the comparison. One solution to this challenge
could be the development of a unified, globally accepted pathway
format. Another possible solution is to build conversion software
tools that can translate between pathway formats. Some attempts
exist to use BioPAX [94] as the lingua franca for this domain.
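As a rough illustration of what such a conversion tool involves, the following Python sketch reads a small KGML-like fragment and emits a format-neutral edge list. The fragment is hand-written and heavily simplified (it is an assumption for illustration, not an actual KEGG record), and the function name `kgml_to_edges` is hypothetical. Real KGML files contain further element types and attributes that a complete converter would have to handle.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# Simplified, hand-written KGML-like fragment (illustrative assumption).
KGML_SNIPPET = """
<pathway name="toy">
  <entry id="1" name="geneA" type="gene"/>
  <entry id="2" name="geneB" type="gene"/>
  <relation entry1="1" entry2="2" type="PPrel">
    <subtype name="activation" value="--&gt;"/>
  </relation>
</pathway>
"""


def kgml_to_edges(xml_text: str) -> List[Tuple[str, str, str]]:
    """Translate relation elements into (source, target, interaction) triples."""
    root = ET.fromstring(xml_text.strip())
    # Map the numeric entry ids used by relations to the entry names.
    names = {entry.get("id"): entry.get("name") for entry in root.iter("entry")}
    edges: List[Tuple[str, str, str]] = []
    for relation in root.iter("relation"):
        # A relation may list several subtypes; fall back to a placeholder.
        kinds = [s.get("name") for s in relation.iter("subtype")] or ["unspecified"]
        for kind in kinds:
            edges.append(
                (names[relation.get("entry1")], names[relation.get("entry2")], kind)
            )
    return edges


edges = kgml_to_edges(KGML_SNIPPET)
```

An edge list of this kind is one possible intermediate form. Translating it into another format still requires deciding how that format models interactions, which is where the differences discussed above come into play.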
4.3 Missing benchmark datasets and comparison
Newly developed approaches are typically assessed using simulated
data or well-studied biological datasets [95, 96]. The advantage
of using simulation is that the ground truth is known and can be
used to compare the sensitivity and specificity of different methods.
However, simulation is often biased and does not fully reflect the
complexity of living organisms. On the other hand, when using real
biological data, the biology is never fully known. In addition,
many papers presenting new pathway analysis methods include results
obtained on only a couple of datasets, and researchers are often
influenced by the observer-expectancy effect [97]. Thus, such
results are not objective, and many times they cannot be
reproduced.
A better evaluation approach has been proposed by Tarca et al.
[98] using 42 real datasets. This approach uses a target pathway,
which is the pathway describing the condition under study.
For instance, in an experiment comparing colon cancer with healthy
samples, the target pathway would be the colon cancer pathway. The
datasets are chosen so that there is a target pathway associated with
each of the datasets. The datasets are also all public, so other
methods can be compared on the very same data in a reproducible and
objective way. The lower the rank and the p-value of the
target pathway in the method output, the better the method. This
approach has several important advantages, including the fact that it
is reproducible, completely objective, and relies on more than
just a couple of data sets that are assessed by the authors using the
literature. This approach was also used by Bayerlova et al. [96]
(using a different set of 36 real datasets) as well as by other, more
recent papers.
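The scoring idea behind this benchmark can be sketched in a few lines of Python. This is a minimal illustration, not the evaluation code of Tarca et al. or Bayerlova et al.: the function names, the use of the median as a summary, and the toy p-values are all assumptions made for the example.

```python
from statistics import median
from typing import Dict, List, Tuple


def target_rank(pvalues: Dict[str, float], target: str) -> Tuple[int, float]:
    """Return the 1-based rank and the p-value of the target pathway.

    Pathways are ranked by increasing p-value; ties keep input order.
    """
    ordered = sorted(pvalues, key=lambda name: pvalues[name])
    return ordered.index(target) + 1, pvalues[target]


def summarize(results: List[Tuple[Dict[str, float], str]]) -> Tuple[float, float]:
    """Median rank and median p-value of the target pathway across datasets."""
    ranks, pvals = zip(*(target_rank(p, t) for p, t in results))
    return median(ranks), median(pvals)


# Toy method output for two hypothetical datasets (made-up numbers).
results = [
    ({"Colorectal cancer": 0.001, "Cell cycle": 0.02, "Apoptosis": 0.30},
     "Colorectal cancer"),
    ({"Alzheimer's disease": 0.04, "Cell cycle": 0.01, "Apoptosis": 0.50},
     "Alzheimer's disease"),
]
median_rank, median_pvalue = summarize(results)
```

Under this scheme, a method with a lower median rank and a lower median p-value for the target pathways would be considered better.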
However, in spite of its advantages and clear superiority
over the usual practice of analyzing only a couple of data sets,
even this benchmarking has important limitations. First, not all
conditions have a namesake pathway in existing databases or
in the literature. Second, complex diseases are often
associated not with a single target pathway but with many biological
processes. By its nature, this assessment approach will
ignore other pathways and their ranking, even though they may be
true positives. More importantly, these approaches fail to take into
consideration the systematic bias of pathway analysis
approaches. In these review papers, most of the datasets are related
to cancer: 28 out of 42 datasets used in Tarca et al.
[98] and 26 out of 36 datasets used in Bayerlova et al. [96] are
cancer datasets. In those cases, methods that are biased towards
cancer are very likely to identify cancer pathways, which are also
the target pathways, as significant. For this reason, the
comparisons obtained from these reviews are likely to be biased and
not reliable for assessing the performance of existing
approaches.
5 Conclusions
Pathway analysis is a core strategy of many basic research,
clinical research, and translational medicine programs. Emerging
applications range from targeting and modeling disease networks to
screening chemical or ligand libraries, to identification of drug
target interactions for improved efficacy and safety. The
integration of molecular interaction information into pathway
analysis represents a major advance in the development of
mathematical techniques aimed at the evaluation of systems
perturbations in biological entities.
This document discussed and categorized 34 existing
network-based pathway analysis approaches from different
perspectives, including experiment input, graphical representation
of pathways in knowledge bases, and statistical approaches to assess
pathway significance. Although the use of DE genes as input is
widespread, it makes the software sensitive to cutoff parameters.
Regarding the graph model, approaches using multiple types of nodes
and bipartite graphs are more flexible and are able to model both AND
and OR gates, which are very common when describing cellular
processes. In addition, tools that are able to work with multiple
knowledge bases are expected to perform better due to complementary
and independent information that cannot be obtained from individual
databases. We also pointed out that there has been no reliable
benchmark to assess and rank existing approaches. Some initial
efforts to assess pathway analysis methods in an objective and
reproducible way do exist, but they still fail to take into account
the statistical bias of existing approaches.
Despite tremendous efforts in the field, there are outstanding
challenges that need to be addressed. First, current pathway
databases are unable to provide transcript-level activity or
information related to new types of data, such as SNP, mutation,
or methylation. Second, incomplete annotation and the lack of
condition- and cell-specific information hinder the accuracy of
downstream pathway analysis. Third, variation among pathway
databases and the lack of a standard format in which the pathway
data are provided pose a real challenge for implementation.
Developers are facing the challenge of modifying methods to accept
novel pathway databases or modifying the actual pathway graphs to
conform to their method. Fourth, there is a systematic bias due
to the fact that certain conditions, such as cancer, are much more
studied than others. Using control data obtained from 9 mRNA
datasets, we showed that the p-values obtained by three pathway
analysis approaches that represent three mainstream strategies in
pathway analysis (FCS, enrichment, and network-based) are not
uniformly distributed under the null. Pathways that have p-values
biased towards zero will often be falsely identified as significant
(false positives). Likewise, pathways that have p-values biased
towards one are likely to rarely meet the significance
requirements, even when they are truly implicated in the given
phenotype (false negatives). Systematic bias, due to non-uniformity
of p-value distributions, results in failure of the statistical
methods to correctly identify the biological pathways implicated in
the condition, and also leads to inconsistent and incorrect
results.
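One simple way to quantify how far a pathway's null p-values depart from uniformity is the one-sample Kolmogorov-Smirnov distance to the Uniform(0, 1) distribution. The Python sketch below is an illustration only and is not the procedure used in this chapter: the function name and the two synthetic p-value sets are assumptions chosen to show the contrast between well-behaved and zero-biased p-values.

```python
from typing import Sequence


def ks_uniform_statistic(pvalues: Sequence[float]) -> float:
    """Kolmogorov-Smirnov distance between p-values and Uniform(0, 1)."""
    ordered = sorted(pvalues)
    n = len(ordered)
    if n == 0:
        raise ValueError("need at least one p-value")
    # The empirical CDF jumps from i/n to (i+1)/n at ordered[i];
    # the CDF of Uniform(0, 1) at p is p itself.
    return max(
        max((i + 1) / n - p, p - i / n) for i, p in enumerate(ordered)
    )


# Synthetic examples: evenly spread p-values versus p-values pushed to zero.
evenly_spread = [(i + 0.5) / 10 for i in range(10)]
biased_to_zero = [p ** 3 for p in evenly_spread]

d_uniform = ks_uniform_statistic(evenly_spread)
d_biased = ks_uniform_statistic(biased_to_zero)
```

A distance close to zero is compatible with uniform null p-values, whereas a large distance, as for the second set, indicates the kind of bias that produces false positives when the mass is shifted towards zero.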
Finally, there is a lack of benchmarks to assess the performance
of each mathematical approach, as well as to validate existing
knowledge bases and their graphical representation of pathways.
There have been some efforts to provide benchmarks for
this purpose, but such methods are not fully reliable yet.
6 Conflict-of-Interest Statement
Sorin Draghici is the founder and CEO of Advaita Corporation, a
Wayne State University spin-off that commercializes iPathwayGuide,
one of the pathway analysis tools mentioned in this chapter.
7 Acknowledgment
This work has been partially supported by the following grants:
National Institutes of Health (R01 DK089167, R42 GM087013), National
Science Foundation (DBI-0965741), and the Robert J. Sokol, MD
Endowed Chair in Systems Biology. Any opinions, findings, and
conclusions or recommendations expressed in this material are those
of the authors and do not necessarily reflect the views of any of
the funding agencies.
References
[1] Ron Edgar, Michael Domrachev, and Alex E Lash. Gene
Expression Omnibus: NCBI gene expression and hybridization array
data repository. Nucleic Acids Research, 30(1):207-210, 2002.
[2] Tanya Barrett, Stephen E Wilhite, Pierre Ledoux, Carlos
Evangelista, Irene F Kim, Maxim Tomashevsky, Kimberly A Marshall,
Katherine H Phillippy, Patti M Sherman, Michelle Holko, Andrey
Yefanov, Hyeseung Lee, Naigong Zhang, Cynthia L Robertson, Nadezhda
Serova, Sean Davis, and Alexandra Soboleva. NCBI GEO: archive for
functional genomics data sets - update. Nucleic Acids Research,
41(D1):D991-D995, 2013.
[3] Alvis Brazma, Helen Parkinson, Ugis Sarkans, Mohammadreza
Shojatalab, Jaak Vilo, Niran Abeygunawardena, Ele Holloway, Misha
Kapushesky, Patrick Kemmeren, Gonzalo Garcia Lara, Ahmet Oezcimen,
Philippe Rocca-Serra, and Susanna-Assunta Sansone. ArrayExpress - a
public repository for microarray gene expression data at the EBI.
Nucleic Acids Research, 31(1):68-71, 2003.
[4] Gabriella Rustici, Nikolay Kolesnikov, Marco Brandizi, Tony
Burdett, Miroslaw Dylag, Ibrahim Emam, Anna Farne, Emma Hastings,
Jon Ison, Maria Keays, Natalja Kurbatova, James Malone, Roby Mani,
Annalisa Mupo, Rui Pedro Pereira, Ekaterina Pilicheva, Johan Rung,
Anjan Sharma, Y. Amy Tang, Tobias Ternent, Andrew Tikhonov, Danielle
Welter, Eleanor Williams, Alvis Brazma, Helen Parkinson, and Ugis
Sarkans. ArrayExpress update - trends in database growth and links to
data analysis tools. Nucleic Acids Research, 41(D1):D987-D990,
2013.
[5] Jianjiong Gao, Bulent Arman Aksoy, Ugur Dogrusoz, Gideon
Dresdner, Benjamin Gross, S Onur Sumer, Yichao Sun, Anders Jacobsen,
Rileen Sinha, Erik Larsson, Ethan Cerami, Chris Sander, and Nikolaus
Schultz. Integrative analysis of complex cancer genomics and
clinical profiles using the cBioPortal. Science Signaling,
6(269):pl1, 2013.
[6] Ethan Cerami, Jianjiong Gao, Ugur Dogrusoz, Benjamin E
Gross, Selcuk Onur Sumer, Bulent Arman Aksoy, Anders Jacobsen,
Caitlin J Byrne, Michael L Heuer, Erik Larsson, et al. The cBio
cancer genomics portal: an open platform for exploring
multidimensional cancer genomics data. Cancer Discovery,
2(5):401-404, 2012.
[7] Arthur Liberzon, Aravind Subramanian, Reid Pinchback, Helga
Thorvaldsdottir, Pablo Tamayo, and Jill P Mesirov. Molecular
signatures database (MSigDB) 3.0. Bioinformatics, 27(12):1739-1740,
2011.
[8] Michael Ashburner, Catherine A. Ball, Judith A. Blake, David
Botstein, Heather Butler, J. Michael Cherry, Allan P. Davis, Kara
Dolinski, Selina S. Dwight, Janan T. Eppig, Midori A. Harris,
David P. Hill, Laurie Issel-Tarver, Andrew Kasarskis, Suzanna
Lewis, John C. Matese, Joel E. Richardson, Martin Ringwald, Gerald
M. Rubin, and Gavin Sherlock. Gene Ontology: tool for the
unification of biology. Nature Genetics, 25:25-29, 2000.
[9] Minoru Kanehisa and Susumu Goto. KEGG: Kyoto Encyclopedia of
Genes and Genomes. Nucleic Acids Research, 28(1):27-30, 2000.
[10] Minoru Kanehisa, Miho Furumichi, Mao Tanabe, Yoko Sato, and
Kanae Morishima. KEGG: new perspectives on genomes, pathways,
diseases and drugs. Nucleic Acids Research, 45(D1):D353-D361,
2017.
[11] David Croft, Antonio Fabregat Mundo, Robin Haw, Marija
Milacic, Joel Weiser, Guanming Wu, Michael Caudy, Phani Garapati,
Marc Gillespie, Maulik R Kamdar, Bijay Jassal, Steven Jupe, Lisa
Matthews, Bruce May, Stanislav Palatnik, Karen Rothfels, Veronica
Shamovsky, Heeyeon Song, Mark Williams, Ewan Birney, Henning
Hermjakob, Lincoln Stein, and Peter D'Eustachio. The Reactome pathway
knowledgebase. Nucleic Acids Research, 42(D1):D472-D477, 2014.
[12] Sorin Draghici, Purvesh Khatri, Rui P Martins, G Charles
Ostermeier, and Stephen A Krawetz. Global functional profiling of
gene expression. Genomics, 81(2):98-104, 2003.
[13] Saeed Tavazoie, Jason D. Hughes, Michael J. Campbell,
Raymond J. Cho, and George M. Church. Systematic determination of
genetic network architecture. Nature Genetics, 22:281-285, 1999.
[14] Aravind Subramanian, Pablo Tamayo, Vamsi K. Mootha, Sayan
Mukherjee, Benjamin L. Ebert, Michael A. Gillette, Amanda Paulovich,
Scott L. Pomeroy, Todd R. Golub, Eric S. Lander, and Jill P.
Mesirov. Gene set enrichment analysis: a knowledge-based approach
for interpreting genome-wide expression profiles. Proceedings of the
National Academy of Sciences of the United States of America,
102(43):15545-15550, 2005.
[15] Bradley Efron and Robert Tibshirani. On testing the
significance of sets of genes. The Annals of Applied Statistics,
1(1):107-129, 2007.
[16] Han-Yu Chuang, Matan Hofree, and Trey Ideker. A Decade of
Systems Biology. Annual Review of Cell and Developmental Biology,
26:721-744, 2010.
[17] Frank Emmert-Streib and Galina V. Glazko. Pathway Analysis
of Expression Data: Deciphering Functional Building Blocks of
Complex Diseases. PLoS Computational Biology, 7(5):e1002053,
2011.
[18] Thomas Kelder, Bruce R Conklin, Chris T Evelo, and
Alexander R Pico. Finding the right questions: exploratory pathway
analysis to enhance biological discovery in large datasets. PLoS
Biology, 8(8):e1000472, 2010.
[19] Purvesh Khatri, Marina Sirota, and Atul J Butte. Ten years
of pathway analysis: current approaches and outstanding challenges.
PLoS Computational Biology, 8(2):e1002375, 2012.
[20] Muhammad Faiz Misman, Safaai Deris, Suhairul ZM Hashim, R
Jumali, and Mohd Saberi Mohamad. Pathway-based microarray analysis
for defining statistical significant phenotype-related pathways: a
review of common approaches. In Information Management and
Engineering, 2009. ICIME'09. International Conference on, pages
496-500. IEEE, 2009.
[21] Jorg Rahnenfuhrer, Francisco S. Domingues, Jochen Maydt,
and Thomas Lengauer. Calculating the Statistical Significance of
Changes in Pathway Activity From Gene Expression Data. Statistical
Applications in Genetics and Molecular Biology, 3(1), 2004.
[22] Sorin Draghici, Purvesh Khatri, Adi L Tarca, Kashyap Amin,
Arina Done, Calin Voichita, Constantin Georgescu, and Roberto
Romero. A systems biology approach for pathway level analysis.
Genome Research, 17(10):1537-1545, 2007.
[23] Adi L Tarca, Sorin Draghici, Purvesh Khatri, Sonia S
Hassan, Pooja Mittal, Jung-Sun Kim, Chong J Kim, Juan P Kusanovic,
and Roberto Romero. A novel signaling pathway impact analysis
(SPIA). Bioinformatics, 25(1):75-82, 2009.
[24] Cristina Mitrea, Zeinab Taghavi, Behzad Bokanizad, Samer
Hanoudi, Rebecca Tagett, Michele Donato, Calin Voichita, and Sorin
Draghici. Methods and approaches in the topology-based analysis of
biological pathways. Frontiers in Physiology, 4:278, 2013.
[25] Purvesh Khatri, Calin Voichita, Khalid Kattan, Nadeem
Ansari, Avani Khatri, Constantin Georgescu, Adi L Tarca, and Sorin
Draghici. Onto-Tools: new additions and improvements in 2006.
Nucleic Acids Research, 35(Web Server issue):W206-W211, 2007.
[26] Sol Efroni, Carl F Schaefer, and Kenneth H Buetow.
Identification of Key Processes Underlying Cancer Phenotypes Using
Biologic Pathway Analysis. PLoS ONE, 2(5):e425, 2007.
[27] S. Greenblum, S. Efroni, C. Schaefer, and K. Buetow. The
PathOlogist: an automated tool for pathway-centric analysis. BMC
Bioinformatics, 12(1):133, 2011.
[28] Ali Shojaie and George Michailidis. Analysis of Gene Sets
Based on the Underlying Regulatory Network. Journal of Computational
Biology, 16(3):407-426, 2009.
[29] Ali Shojaie and George Michailidis. Network Enrichment
Analysis in Complex Experiments. Statistical Applications in
Genetics and Molecular Biology, 9(1), 2010.
[30] Jui-Hung Hung, Troy W Whitfield, Tun-Hsiang Yang, Zhenjun
Hu, Zhiping Weng, and Charles DeLisi. Identification of functional
modules that correlate with phenotypic difference: the influence of
network topology. Genome Biology, 11(2):R23, 2010.
[31] Enrico Glaab, Anaïs Baudot, Natalio Krasnogor, and Alfonso
Valencia. TopoGSA: network topological gene set analysis.
Bioinformatics, 26(9):1271-1272, 2010.
[32] Maria S Massa, Monica Chiogna, and Chiara Romualdi. Gene
set analysis exploiting the topology of a pathway. BMC Systems
Biology, 4(1):121, 2010.
[33] Laurent Jacob, Pierre Neuvial, and Sandrine Dudoit. Gains
in power from structured two-sample tests of means on graphs. arXiv
preprint arXiv:1009.5173, 2010.
[34] Ludwig Geistlinger, Gergely Csaba, Robert Kuffner, Nicola
Mulder, and Ralf Zimmer. From sets to graphs: towards a realistic
enrichment analysis of transcriptomic systems. Bioinformatics,
27(13):i366-i373, 2011.
[35] Senol Isci, Cengizhan Ozturk, Jon Jones, and Hasan H Otu.
Pathway analysis of high-throughput biological data within a
Bayesian network framework. Bioinformatics, 27(12):1667-1674,
2011.
[36] Zhaoyuan Fang, Weidong Tian, and Hongbin Ji. A
network-based gene-weighting approach for pathway analysis. Cell
Research, 22(3):565-580, 2011.
[37] Calin Voichita, Michele Donato, and Sorin Draghici.
Incorporating gene significance in the impact analysis of signaling
pathways. In Machine Learning and Applications (ICMLA), 2012 11th
International Conference on, volume 1, pages 126-131, Boca Raton,
FL, USA, 12-15 December 2012. IEEE.
[38] Yifang Zhao, Ming-Hui Chen, Baikang Pei, David Rowe,
Dong-Guk Shin, Wangang Xie, Fang Yu, and Lynn Kuo. A Bayesian
Approach to Pathway Analysis by Integrating Gene-Gene Functional
Directions and Microarray Data. Statistics in Biosciences,
4(1):105-131, 2012.
[39] Zuguang Gu, Jialin Liu, Kunming Cao, Junfeng Zhang, and Jin
Wang. Centrality-based pathway enrichment: a systematic approach for
finding significant pathways dominated by key genes. BMC Systems
Biology, 6(1):56, 2012.
[40] Fernando Farfan, Jun Ma, Maureen A Sartor, George
Michailidis, and Hosagrahar V Jagadish. THINK Back: knowledge-based
interpretation of high throughput data. BMC Bioinformatics, 13(Suppl
2):S4, 2012.
[41] Maysson Al-Haj Ibrahim, Sabah Jassim, Michael Anthony
Cawthorne, and Kenneth Langlands. A Topology-Based Score for
Pathway Enrichment. Journal of Computational Biology, 19(5):563-573,
2012.
[42] Jakub Mieczkowski, Karolina Swiatek-Machado, and Bozena
Kaminska. Identification of Pathway Deregulation - Gene Expression
Based Analysis of Consistent Signal Transduction. PLoS ONE,
7(7):e41541, 2012.
[43] Enrico Glaab, Anaïs Baudot, Natalio Krasnogor, Reinhard
Schneider, and Alfonso Valencia. EnrichNet: network-based gene set
enrichment analysis. Bioinformatics, 28(18):i451-i457, 2012.
[44] Paolo Martini, Gabriele Sales, M Sofia Massa, Monica
Chiogna, and Chiara Romualdi. Along signal paths: an empirical gene
set approach exploiting pathway topology. Nucleic Acids Research,
41(1):e19, 2013.
[45] Winston A Haynes, Roger Higdon, Larissa Stanberry, Dwayne
Collins, and Eugene Kolker. Differential expression analysis for
pathways. PLoS Computational Biology, 9(3):e1002967, 2013.
[46] Shining Ma, Tao Jiang, and Rui Jiang. Differential
regulation enrichment analysis via the integration of
transcriptional regulatory network and gene expression data.
Bioinformatics, 31(4):563-571, 2014.
[47] Ivana Ihnatova and Eva Budinska. ToPASeq: an R package for
topology-based pathway analysis of microarray and RNA-Seq data.
BMC Bioinformatics, 16(1):350, 2015.
[48] Sahar Ansari, Calin Voichita, Michele Donato, Rebecca
Tagett, and Sorin Draghici. A novel pathway analysis approach based
on the unexplained disregulation of genes. Proceedings of the IEEE,
105(3):482-495, 2017.
[49] Behzad Bokanizad, Rebecca Tagett, Sahar Ansari, B Hoda
Helmi, and Sorin Draghici. SPATIAL: A System-level PAThway Impact
AnaLysis approach. Nucleic Acids Research, 44(11):5034-5044,
2016.
[50] Tin Nguyen and Sorin Draghici. BLMA: A package for bi-level
meta-analysis. R package version 1.
[51] Tin Nguyen, Rebecca Tagett, Michele Donato, Cristina
Mitrea, and Sorin Draghici. A novel bi-level meta-analysis
approach-applied to biological pathway analysis. Bioinformatics,
32(3):409-416, 2016.
[52] Charles J Vaske, Stephen C Benz, J Zachary Sanborn, Dent
Earl, Christopher Szeto, Jingchun Zhu, David Haussler, and Joshua M
Stuart. Inference of patient-specific pathway activities from
multi-dimensional cancer genomics data using PARADIGM.
Bioinformatics, 26(12):i237-i245, 2010.
[53] Enrica Calura, Paolo Martini, Gabriele Sales, Luca
Beltrame, Giovanna Chiorino, Maurizio D'Incalci, Sergio Marchini, and
Chiara Romualdi. Wiring miRNAs to pathways: a topological approach
to integrate miRNA and mRNA expression profiles. Nucleic Acids
Research, 42(11):e96, 2014.
[54] Diana Diaz, Michele Donato, Tin Nguyen, and Sorin Draghici.
MicroRNA-augmented pathways (mirAP) and their applications to
pathway analysis and disease subtyping. In Pacific Symposium on
Biocomputing. Pacific Symposium on Biocomputing, volume 22, page
390. NIH Public Access, 2016.
[55] Tin Nguyen, Diana Diaz, Rebecca Tagett, and Sorin Draghici.
Overcoming the matched-sample bottleneck: an orthogonal approach to
integrate omic data. Nature Scientific Reports, 6:29251, 2016.
[56] Shouguo Gao and Xujing Wang. TAPPA: topological analysis of
pathway phenotype association. Bioinformatics, 23(22):3100-3102,
2007.
[57] Jianguo Xia and David S Wishart. MetPA: a web-based
metabolomics tool for pathway analysis and visualization.
Bioinformatics, 26(18):2342-2344, 2010.
[58] Jianguo Xia and David S Wishart. Web-based inference of
biological patterns, functions and pathways from metabolomic data
using MetaboAnalyst. Nature Protocols, 6(6):743, 2011.
[59] Jianguo Xia, Igor V Sinelnikov, Beomsoo Han, and David S
Wishart. MetaboAnalyst 3.0 - making metabolomics more meaningful.
Nucleic Acids Research, 43(W1):W251-W257, 2015.
[60] Geir Kjetil Sandve, Anton Nekrutenko, James Taylor, and
Eivind Hovig. Ten simple rules for reproducible computational
research. PLoS Computational Biology, 9(10):e1003285, 2013.
[61] Dougu Nam and Seon-Young Kim. Gene-set approach for
expression pattern analysis. Briefings in Bioinformatics,
9(3):189-197, 2008.
[62] Yoram Ben-Shaul, Hagai Bergman, and Hermona Soreq.
Identifying subtle interrelated changes in functional gene
categories using continuous measures of gene expression.
Bioinformatics, 21(7):1129-1137, 2005.
[63] Kuang-Hung Pan, Chih-Jian Lih, and Stanley N Cohen. Effects
of threshold choice on biological conclusions reached during
analysis of gene expression by DNA microarrays. Proceedings of the
National Academy of Sciences of the United States of America,
102(25):8961-8965, 2005.
[64] Paul K Tan, Thomas J Downey, Edward L Spitznagel Jr, Pin
Xu, Dadin Fu, Dimiter S Dimitrov, Richard A Lempicki, Bruce M Raaka,
and Margaret C Cam. Evaluation of gene expression measurements from
commercial microarray platforms. Nucleic Acids Research,
31(19):5676-5684, 2003.
[65] Liat Ein-Dor, Itai Kela, Gad Getz, David Givol, and Eytan
Domany. Outcome signature genes in breast cancer: is there a unique
set? Bioinformatics, 21(2):171-178, 2005.
[66] Liat Ein-Dor, Or Zuk, and Eytan Domany. Thousands of
samples are needed to generate a robust gene list for predicting
outcome in cancer. Proceedings of the National Academy of
Sciences, 103(15):5923-5928, 2006.
[67] Hiroyuki Ogata, Susumu Goto, Kazushige Sato, Wataru
Fujibuchi, Hidemasa Bono, and Minoru Kanehisa. KEGG: Kyoto
Encyclopedia of Genes and Genomes. Nucleic Acids Research,
27(1):29-34, 1999.
[68] Carl F Schaefer, Kira Anthony, Shiva Krupa, Jeffrey
Buchoff, Matthew Day, Timo Hannay, and Kenneth H Buetow. PID: the
pathway interaction database. Nucleic Acids Research, 37(Suppl
1):D674-D679, 2009.
[69] BioCarta. BioCarta - Charting Pathways of Life.
http://www.biocarta.com.
[70] Alexander R Pico, Thomas Kelder, Martijn P van Iersel,
Kristina Hanspers, Bruce R Conklin, and Chris Evelo. WikiPathways:
pathway editing for the people. PLoS Biology, 6(7):e184, 2008.
[71] Huaiyu Mi, Betty Lazareva-Ulitsky, Rozina Loo, Anish
Kejariwal, Jody Vandergriff, Steven Rabkin, Nan Guo, Anushya
Muruganujan, Olivier Doremieux, Michael J Campbell, Hiroaki Kitano,
and Paul D Thomas. The PANTHER database of protein families,
subfamilies, functions and pathways. Nucleic Acids Research,
33(Suppl 1):D284-D288, 2005.
[72] G Joshi-Tope, Marc Gillespie, Imre Vastrik, Peter
D'Eustachio, Esther Schmidt, Bernard de