Modelling and Visualizing Selected Molecular Communication ...
University of Nebraska - Lincoln
DigitalCommons@University of Nebraska - Lincoln
Computer Science and Engineering: Theses, Dissertations, and Student Research
Computer Science and Engineering, Department of
4-2018
Modelling and Visualizing Selected Molecular Communication Processes in Biological Organisms: A Multi-Layer Perspective
Aditya Immaneni
aditya.immaneni@gmail.com
Follow this and additional works at: https://digitalcommons.unl.edu/computerscidiss
Part of the Computer Engineering Commons, and the Computer Sciences Commons
This Article is brought to you for free and open access by the Computer Science and Engineering, Department of at DigitalCommons@University of Nebraska - Lincoln. It has been accepted for inclusion in Computer Science and Engineering: Theses, Dissertations, and Student Research by an authorized administrator of DigitalCommons@University of Nebraska - Lincoln.
Immaneni, Aditya, "Modelling and Visualizing Selected Molecular Communication Processes in Biological Organisms: A Multi-Layer Perspective" (2018). Computer Science and Engineering: Theses, Dissertations, and Student Research. 152. https://digitalcommons.unl.edu/computerscidiss/152
MODELING AND VISUALIZING SELECTED MOLECULAR COMMUNICATION
PROCESSES IN BIOLOGICAL ORGANISMS: A MULTI-LAYER PERSPECTIVE
by
Aditya Immaneni
A THESIS
Presented to the Faculty of
The Graduate College at the University of Nebraska
In Partial Fulfilment of Requirements
For the Degree of Master of Science
Major: Computer Science
Under the Supervision of Professor Massimiliano Pierobon
Lincoln, Nebraska
April, 2018
MODELING AND VISUALIZING SELECTED MOLECULAR COMMUNICATION
PROCESSES IN BIOLOGICAL ORGANISMS: A MULTI-LAYER PERSPECTIVE
Aditya Immaneni, M.S.
University of Nebraska, 2018
Adviser: Massimiliano Pierobon
The future pervasive communication and computing devices are envisioned to be tightly
integrated with biological systems, i.e., the Internet of Bio-Nano Things. In particular,
the study and exploitation of existing processes for the biochemical information exchange
and elaboration in biological systems are currently at the forefront of this research direction.
Molecular Communication (MC), which studies biochemical information systems with theory
and tools from computer communication engineering, has been recently proposed to model
and characterize the aforementioned processes. Combined with the rapidly growing field
of bio-informatics, which creates a rich profusion of biological data and tools to mine the
underlying information, this investigation direction is set to produce interesting results and
methodologies not only for systems engineering but also for novel scientific discovery. The
multidisciplinary nature of this work presents an interesting challenge in terms of creating a
structured approach to combine the aforementioned disciplines for the study of information
propagation processes in biological organisms, and their relationship with information for
their control, optimization, and exploitation. In this thesis, we study a selection of these
processes, through different and independent contributions, at the system layer, cellular
layer and pathway layer. First, we model the overall functionality of a multicellular metabolic
system, the human digestion, in terms of energy production from major nutrients in the food.
Second, we analyze metabolic processes in a single cell and their adaptability to incoming
nutrient availability information from the environment. Third, we model and characterize
the processes that enable information to propagate from the external environment and be
processed by the cell. Numerical results are presented to provide a first proof-of-concept
characterization of all these processes in terms of communication theory. While it may be
possible to connect each of these layers in future work, this goes beyond the scope of the
work reported in this thesis.
DEDICATION
This thesis is dedicated to my parents Sudhaker and Vani Immaneni
ACKNOWLEDGMENTS
I thank my adviser Dr. Massimiliano Pierobon for his continued support and guidance. His
enthusiasm has always motivated me to work harder. I would also like to thank my committee
members, Dr. Juan Cui and Dr. Tomas Helikar, for their effort in reviewing my thesis
and their interest in my work.
Moreover, I want to thank all the members of the MBiTe Lab, including Zahmeeth Sakaff
who collaborated with me in studying the cell metabolism and the JAK-STAT pathways. I
also want to thank Ravi, Natalie, Karthik, Colton and Francesca for supporting and helping
me when I needed it the most. I acknowledge Natasha’s help with accessing the computing
resources of the Holland Computing Center and the Open Science Grid. Further, I acknowledge
Jiang’s help with accessing the TCGA data and with getting started with the Open
Science Grid.
Finally, I thank all my family members and friends for their support and inspiration: my
parents, who drive me to be the best version of myself, and my sister Maithreyi, for proofreading
my thesis and providing timely feedback.
Table of Contents
1 Introduction
2 Bio-System Layer
    2.1 Simulation of Glucose Metabolism in Cells
    2.2 Controlling the Direction of ODE Equations
    2.3 Result: Glucose and ATP Graphs
    2.4 Computational Tools for Bio-System Layer
3 Cell Layer
    3.1 Metabolic Pathway
    3.2 Available Biological Data
    3.3 Result: Visualization of Large Scale Chemical Data
    3.4 Computational Tools for Cell Layer Simulation
4 Pathway Layer
    4.1 JAK-STAT Pathway
    4.2 ODE and Gillespie Simulation
    4.3 Mutual Information Calculation
    4.4 Result: Mutual Information Flow Graph
    4.5 MAPK Pathway
    4.6 Result: Mutual Information Change Due to Mutation
    4.7 Computational Tools: Pathway Layer
5 Future Work
    5.1 Design
    5.2 Simulation
6 Conclusion
Bibliography
Chapter 1
Introduction
Computer engineering research is experiencing an increasing interest in studying the commu-
nication features of biological processes and organisms, motivated by the recently proposed
Internet of Bio-Nano Things paradigm [1]. At the basis of this research, Molecular Commu-
nication (MC) theory is an emerging branch of communication engineering that studies the
propagation of information through molecule exchange and chemical reactions. Future MC-
enabled devices are envisioned to perform functions such as the detection of heart disease [2],
regulation of insulin levels in the body and to serve as an interface for external devices [3].
The principles of MC can be applied to the design of systems that interact with biochemical
processes, such as a diagnostic drug delivery system [4], or to code information [5]. They can also
be applied to develop communication systems that interconnect nano-scale devices (nano-machines)
with inorganic machines enabled by nano-technology [6] or with organic machines
enabled by synthetic biology [7]. In this thesis, we look at MC in biological organisms through
three different studies.
Among the biological processes that can be studied through MC, metabolism is of particular
interest given its fundamental role of processing nutrients to extract energy in biological
organisms [8]. The metabolic system in the human body is responsible for the production
of energy for various functions performed by the cell. Within the cell, a complex system of
reactions, the metabolic network, is responsible for this energy production as well as cell
growth and reproduction. This network can be classified into chains of pathways that utilize
the compounds present in the environment to produce energy, biomass and other cellular
compounds. Moreover, metabolic processes are continuously regulated and optimized on the
basis of information on nutrient availability, chemical, and physical condition of the envi-
ronment inside and outside biological organisms. This tight connection of metabolism and
its regulation with information processing and communication makes it a suitable ground to
apply MC theory for its modeling, characterization and model-driven discovery of emergent
behaviors [9], and for its future utilization and engineering into the pervasive computation
and communication substrate of the aforementioned future Internet of Bio-Nano Things.
For this thesis, three different layers are independently studied, and in each a selection of
processes is characterized in terms of its MC performance. In the bio-system layer we look at how
complex multicellular bio-systems react to new information. At the cell layer, we look at how the
availability of new information modifies the processes in a single-cell organism. In the pathway
layer we look at how biochemical processes can propagate information within the cell.
Recently developed techniques in image processing combined with the increasing availability
of food data provide tools to estimate the nutrients provided by different kinds of food [10].
When this food is ingested, it is processed by the digestive system to provide glucose that
is then transferred by the blood stream to various cells in the body [11]. The glucose in the
blood is then taken by individual cells for the production of energy [12].
In the bio-system layer, a computational model is built for the analysis of the human digestive
system by using the available information on food intake, the circulation of glucose through the
blood, and the metabolism of glucose by the cell. Metabolism on a large scale is observed in the
digestive system of multicellular organisms, termed in this thesis the bio-system layer of
metabolism.
At the cell layer, drawing inspiration from [13], the changes in the processes internal to the
cell of single-celled microbes are studied. By applying Flux Balance Analysis (FBA), the
steady-state fluxes of the metabolic processes in the human gut microbes Bacteroides
thetaiotaomicron (B.theta) and Methanobrevibacter smithii (M.smithii) are obtained. For these
microbes, we consider seven of the major nutrients in the external environment of the cell as
inputs to the network. The activation and deactivation of these processes is then visualized.
Human gut microbes were studied since they are simple and more tractable.
Next, we look at the signal transduction network, which is a complex communication
channel due to the presence of non-linear behavior, stochasticity, and feedback and feed-forward
loops. By studying the signal transduction pathway we understand how information from
the external environment of the cell propagates to the internal structure of the cell to control
cell responses such as metabolism. In the pathway layer, we characterize the performance
of the signal transduction pathway in propagating information external to the cell to the
nucleus of the cell, and we also quantify the information handled by each process of the cell.
We then propose a computational method which utilizes the stochastic simulation of chemical
reactions and the estimation of Mutual Information (MI) from multidimensional data.
The rest of this thesis describes the work done at each of the layers and discusses the results
and computational tools applied. The system layer, cell layer and pathway layer are studied
in Chapters 2, 3 and 4 respectively. As we probe deeper into the layers, the complexity
increases due to the increase in available information. This wealth of information provides an
excellent opportunity to apply various tools and techniques of visualization. The visualization
tools Gephi, D3.js and Jhive are studied. Further, hive plot visualization is applied in order
to present a complex network in a more compact and informative fashion. Also, as a proof of
concept, we characterize the changes in the signal transduction pathways due to mutation.
Through the comparison of the MI between a normal and a mutated signal transduction
pathway, the information loss is quantified. Finally, a proof of concept for a self-visualizing
bacterial colony is described.
Figure 1.1: Summary of study in bio-system layer
Figure 1.2: Summary of study in cell layer
Figure 1.3: Summary of study in pathway layer
Motivation
In my first semester at UNL I took a class in Genetic Engineering. As I learned about the
various mechanisms that govern biological systems, I was fascinated by their similarity to
electronic circuits. The possibility of treating biological systems as a computational and
communication system drew me further into the field. While it was a daunting task, a
layered approach helped me understand the required concepts better.
Contribution
Bio-System Layer: A computational model to estimate the production of glucose and
Adenosine TriPhosphate was developed. The model was then ported to the Open Science
Grid to obtain higher throughput.
Cell Layer: For a cell's metabolism, Mutual Information is calculated with lab data to
assess the cell's real-world performance as a communication system. The large-scale network
of reactions in the cell is visualized with hive plots.
Figure 1.4: Overview of contributions in each individual layer
Pathway Layer: For the JAK-STAT pathway, the Mutual Information flow is visualized as
a network. The change in Mutual Information due to mutations in the MAPK pathway is
calculated as well.
Chapter 2
Bio-System Layer
When studying a Bio-system layer process, complex sub-processes are treated as black boxes.
This means that we start by understanding the relationship between input and output instead
of the processes that control them. This level of abstraction reduces the complexities of the
underlying system making it easier to study the system as a whole.
In this chapter we study the digestive system. According to the CDC, there are 30.3 million
cases of diabetes in the US (9.4% of the US population). Adenosine TriPhosphate (ATP) is the
energy currency of the cell, and it is obtained through the processing of glucose by the
metabolic pathway of the cell. Since diabetes affects the processing of glucose by the body,
ATP production is affected by diabetes [14, 15]. With the increased
availability of food data, including applications that determine the major nutrients in food,
it becomes easier to estimate the nutrients in the food intake. In order to study diabetes better,
we built a kinetic model of the metabolic system. By studying the processes involved in the
digestive system, a better understanding of diabetes can be obtained. The end goal of the
study here is to integrate the application developed in [10] with a glucose model [8] to give
the ATP produced by the cell. The application in [10] takes an image of a plate of food and
estimates the amount of calories in it. Further, the nutrition information, viz.
the amount of carbohydrates, proteins and fats, is provided. This information is utilized by
our model to provide the ATP produced by the cell for the intake of food.
Figure 2.1: Overview of the System, background source: [16]
Figure 2.1 gives the overview of the system under study. At the input end, food intake can
be broken down into various nutrients according to its composition. The digestive system
processes nutrients in the system and provides cells with glucose via the blood stream. The
cells uptake glucose available in the blood stream and process it to form ATP. Here, food
intake can be considered as input to the system. Depending on what is measured, cell glucose
or ATP can be considered as the output. Since a large part of the blood glucose comes from
the carbohydrates [17], carbohydrates in the food are the primary focus in the food input. As
the output, the concentrations of glucose and ATP produced are obtained. These can be visualized
as a line graph showing the concentration over time.
2.1 Simulation of Glucose Metabolism in Cells
This section summarizes the contribution by Rad in [8]. The process of cell glucose uptake
and its conversion to ATP is simulated by this model. The model simulates the glucose
metabolism by taking into account the reactions that take place in the glucose metabolism
pathway. Reactions in the glucose metabolism pathway are modeled as Ordinary Differential
Equations (ODEs). A first-order Taylor series expansion is used to find the concentrations of
all the species in the reactions over time. This method was chosen to give a good initial
approximation of the system.
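As an illustration of the first-order Taylor update described above, the following is a minimal Python sketch and not the model of [8] (which, as noted later in this chapter, was developed in Java). The single reaction, the rate constant k, the species names and the step size are all hypothetical and chosen only to show the update rule c(t + dt) = c(t) + dt * dc/dt.

```python
def euler_step(conc, rates, dt):
    """Advance all concentrations by one first-order Taylor (forward Euler) step.

    conc  : dict mapping species name -> concentration
    rates : function returning d(conc)/dt as a dict with the same keys
    dt    : step size
    """
    derivs = rates(conc)
    return {name: value + dt * derivs[name] for name, value in conc.items()}


def toy_rates(conc, k=0.1):
    # Hypothetical one-reaction pathway: glucose -> atp with rate k * [glucose].
    flux = k * conc["glucose"]
    return {"glucose": -flux, "atp": flux}


def simulate(conc, rates, dt, steps):
    """Return the list of states visited, starting with the initial one."""
    trajectory = [conc]
    for _ in range(steps):
        conc = euler_step(conc, rates, dt)
        trajectory.append(conc)
    return trajectory


trajectory = simulate({"glucose": 1.0, "atp": 0.0}, toy_rates, dt=0.1, steps=100)
final = trajectory[-1]
```

Because the update is only first-order accurate, the step size has to be small relative to the fastest reaction for the approximation to remain reasonable.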
2.2 Controlling the Direction of ODE Equations
In order to connect the input (food intake) to the measured output (ATP), the relationship of cell
glucose uptake with food intake has to be understood. As described in [11], insulin levels in
the blood follow the blood glucose levels closely. Figure 2.2 shows this relationship. The
parameters for the graph are obtained by fitting the model described in [11] to the human
blood glucose and insulin levels described in [18].
At the initial step of the simulation, a food input is simulated by setting the ingested glucose.
At each step of the simulation the blood glucose and insulin levels are calculated as described
in [11]. From the blood glucose levels the cell glucose levels can be determined by applying
a Hill function. The minimum and maximum uptake values can be obtained as described
in [12, 19]. At each step the cell ATP levels are calculated by the ODE model.
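The Hill-function step can be sketched as follows. This is a generic Hill function written in Python under stated assumptions: the parameter names (v_min, v_max, k_half, n) and the values in the usage line are hypothetical and are not the fitted values from [12, 19].

```python
def hill_uptake(blood_glucose, v_min, v_max, k_half, n):
    """Hill function mapping a blood glucose level to a cell uptake value.

    Returns v_min at zero glucose and approaches v_max at saturation.
    k_half is the glucose level giving the midpoint, n the Hill coefficient.
    """
    if blood_glucose <= 0:
        return v_min
    fraction = blood_glucose ** n / (k_half ** n + blood_glucose ** n)
    return v_min + (v_max - v_min) * fraction


# Hypothetical parameters, for illustration only.
midpoint = hill_uptake(100.0, v_min=0.0, v_max=10.0, k_half=100.0, n=2)
```

At blood_glucose equal to k_half the function returns the value halfway between v_min and v_max, which is a convenient sanity check when fitting.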
Figure 2.2: Glucose Insulin Graph (x-axis: time in seconds, 0 to 350; y-axes: blood glucose in mg/dl and blood insulin in pmol/l; legend: Glucose, Insulin)
Figure 2.3: Model Overview
Figure 2.3 gives the overview of the model. SBML (Systems Biology Markup Language) [20]
is an XML-based file standard for specifying biochemical reaction data. In our model we take all
the reactions and constants for the metabolic pathway from an SBML file. JSBML (Java SBML) [21]
is a Java library to parse SBML files. Since our model was developed in Java, we use JSBML.
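To show what kind of information is read from an SBML file, the sketch below parses a tiny SBML fragment with Python's standard XML parser. It does not use JSBML and is not the thesis code; the fragment itself (model id, species, reaction) is invented for illustration, and only the species and reaction lists are read.

```python
import xml.etree.ElementTree as ET

# Hypothetical, minimal SBML Level 2 fragment; not taken from the thesis model.
SBML_TEXT = """<sbml xmlns="http://www.sbml.org/sbml/level2/version4" level="2" version="4">
  <model id="toy_glycolysis">
    <listOfSpecies>
      <species id="glucose" compartment="cell" initialConcentration="5.0"/>
      <species id="g6p" compartment="cell" initialConcentration="0.0"/>
    </listOfSpecies>
    <listOfReactions>
      <reaction id="hexokinase" reversible="false">
        <listOfReactants><speciesReference species="glucose"/></listOfReactants>
        <listOfProducts><speciesReference species="g6p"/></listOfProducts>
      </reaction>
    </listOfReactions>
  </model>
</sbml>
"""

NS = {"sbml": "http://www.sbml.org/sbml/level2/version4"}


def read_sbml(text):
    """Return (species -> initial concentration, reaction -> (reactants, products))."""
    root = ET.fromstring(text)
    species = {
        s.get("id"): float(s.get("initialConcentration", "0"))
        for s in root.findall(".//sbml:species", NS)
    }
    reactions = {}
    for r in root.findall(".//sbml:reaction", NS):
        reactants = [x.get("species")
                     for x in r.findall("sbml:listOfReactants/sbml:speciesReference", NS)]
        products = [x.get("species")
                    for x in r.findall("sbml:listOfProducts/sbml:speciesReference", NS)]
        reactions[r.get("id")] = (reactants, products)
    return species, reactions


species, reactions = read_sbml(SBML_TEXT)
```

A dedicated library such as JSBML additionally handles kinetic laws, units and validation, which this sketch ignores.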
2.3 Result: Glucose and ATP Graphs
Figure 2.4: Glucose and ATP Production
Figure 2.4 shows the production of glucose and ATP in nMol/L for a simulation
of 10 hours. For this simulation we provided a glucose input at two time steps, t=0h and
t=6.5h. These times were chosen to make sure the system reaches a steady state. We see
that the ATP follows the levels of cell glucose. Once glucose levels reach a stable state its
response is seen to be lower than the previous glucose input.
2.4 Computational Tools for Bio-System Layer
Many computational tools are available to simplify the process of studying system level inter-
actions. The software MATLAB [22] provides a number of toolboxes to make mathematical
computation easier. In the study described in this chapter, the curve fitting toolbox was
used to fit the data based on the equations in [11].
The Open Science Grid (OSG) [23, 24] is a great tool for running a large number of simulations.
Each of the simulations is run on its own node, so they all essentially run at
the same time without waiting for the previous one to finish. The challenge here is to make
the simulation portable enough to work independently on any platform. In order to test our
model for various initial concentrations we utilize the OSG. On the OSG, for 100 different
initial concentrations, we obtained a runtime of 1.2 minutes. In contrast, the runtime was 80
minutes on a system with an i7 5700HQ (2.7 GHz) and 12 GB of RAM, i.e., a speedup of roughly
67 times. From the simulations we obtain a large dataset, for which data processing and graph
generation become a tedious task.
The macros option in MS Excel was used to process the simulation data and to automate
the generation of graphs.
Chapter 3
Cell Layer
Moving on to the next layer, we focus on the processes that happen in the cell. Within a cell
there are a lot of reactions that occur at any given time. Hence, we treat the cell as a bag
of reactions that are constantly processing information in the form of external compounds.
Once this information is processed by the cell, the output can be seen as the compounds
released by the cell and sometimes by the physical changes to the cell. At this layer, to
maintain simplicity, the interaction between one cell with another is not studied. Only the
processes and interactions that occur within the cell are considered. Studying the behavior
of a single cell helps with developing a simple model that can be further developed to include
the interactions with other cells.
In this chapter we look at metabolic pathways in human gut microbes. Since these microbes
are unicellular they are well suited for studying the behavior of a single cell. In particular,
the metabolic pathways of Bacteroides thetaiotaomicron (B.theta) and Methanobrevibacter
smithii (M.smithii) are studied. These were chosen due to the availability of lab data.
The role of the metabolic pathway is to convert the Input (Growth Medium) to Output
(Biomass). The growth medium is analogous to food intake in the previous chapter as the
growth medium consists of nutrients vital to the growth of the bacteria. We start by looking
at all the processes that go on in the pathway. Once the pathway is well defined, we see the
challenges faced in processing and visualizing the data.
3.1 Metabolic Pathway
This section describes the work done by Sakaff in [25].
Figure 3.1: Abstraction of the Metabolic Communication System (figure labels: Transmitter, Channel, Receiver; Environment Chemical Composition, Enzyme Expression Regulation, Metabolic Network State; transmitted signal c1 ... cN, received signal r1 ... rM; J1 ... JN)
The metabolic pathway contributes to the physical growth of the cell. The input to the
metabolic pathway comes from external compounds present around the cell. Therefore, the
environment that the cell grows in is vital to the metabolic pathway. As discussed earlier,
we treat the metabolic pathway as a channel that transmits the input (Growth Medium) to the
output (Biomass and other compounds released by the cell). In order to characterize the
system we need to observe the output for various values of the input. In the case of B.theta
we have 7 compounds that comprise the input which control the reactions that occur within
the cell. Biomass is estimated by taking into account the reactions that contribute to its
formation.
Once this input-output relationship is characterized, the variation of the output for each
value of the input is calculated. Mutual Information quantifies the statistical dependence
between two random variables. Hence, the predictability of the output with respect to an input
can be obtained by calculating the Mutual Information of the system. The calculation of Mutual
Information is discussed in a later chapter. The value of Mutual Information can be used to
determine the ideal environmental conditions for the growth of a cell, with a focus on the cell
providing more information about its environmental conditions. This means that we can determine
the optimal environment in which to use the cell as a transmitter.
For B.theta the Mutual Information upper bound for the lab data was found to be 3.3068
bits. Similarly, for M.smithii the Mutual Information upper bound was found to be 4.5222
bits.
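For intuition on how a Mutual Information value in bits arises from input-output observations, the sketch below computes the standard plug-in estimate I(X;Y) = sum over (x, y) of p(x,y) log2( p(x,y) / (p(x) p(y)) ) for discrete samples. It is not the estimator used for the values above (that calculation is discussed in a later chapter), and the sample data are invented.

```python
from collections import Counter
from math import log2


def mutual_information(pairs):
    """Plug-in estimate of I(X;Y) in bits from a list of (x, y) samples."""
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    mi = 0.0
    for (x, y), count in joint.items():
        # p(x,y) / (p(x) p(y)) simplifies to count * n / (count_x * count_y).
        mi += (count / n) * log2(count * n / (px[x] * py[y]))
    return mi


# Hypothetical data: a binary input that fully determines a binary output.
deterministic = [(0, "low"), (1, "high")] * 4
# Hypothetical data: the output is independent of the input.
independent = [(0, 0), (0, 1), (1, 0), (1, 1)]

mi_deterministic = mutual_information(deterministic)
mi_independent = mutual_information(independent)
```

In the first data set the output reveals the input exactly, so the estimate equals the 1 bit of input entropy; in the second the output carries no information about the input.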
3.2 Available Biological Data
There are a lot of processes that take place in the cell. These processes are classified into
pathways based on their functions. In order to simplify the calculations we only consider
the metabolic pathway. The other pathways cannot be ignored, however, as they interact with the
metabolic pathway, controlling the reactions that occur in it. To account for these
interactions we use Flux Balance Analysis (FBA). FBA estimates the reaction fluxes based
on the input. These flux values can be used to simulate the metabolic pathway independently
of the signaling pathways that control it. FBA is run for various growth media. The growth
media represent the various input combinations possible.
Once the FBA is run, the reaction fluxes for each of the growth media are obtained. These
can be used to simulate the pathway as an ODE model. As described in [26], when FBA
is used, the differential equations can be modeled using Michaelis-Menten rate equations.
This essentially fixes the direction of reversible reactions by treating the enzymes as always
present. From the FBA we obtain the chemical reactions that take part in the metabolic
network, and the data is extracted for each of the growth media. It is now possible to obtain the
biomass values for different growth media, enabling the calculation of Mutual Information.
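The Michaelis-Menten rate law mentioned above has the form v = v_max [S] / (k_m + [S]). The sketch below applies it to a single irreversible conversion S -> P; the parameter values are hypothetical, and the way [26] derives the rate parameters from FBA fluxes is not reproduced here.

```python
def michaelis_menten_rate(substrate, v_max, k_m):
    """Rate v = v_max * [S] / (k_m + [S]) of an irreversible enzymatic step."""
    return v_max * substrate / (k_m + substrate)


def simulate_conversion(substrate, product, v_max, k_m, dt, steps):
    """Forward-Euler integration of S -> P with a Michaelis-Menten rate."""
    for _ in range(steps):
        rate = michaelis_menten_rate(substrate, v_max, k_m)
        amount = min(rate * dt, substrate)  # never consume more than is left
        substrate -= amount
        product += amount
    return substrate, product


# Hypothetical values, for illustration only.
s_end, p_end = simulate_conversion(5.0, 0.0, v_max=1.0, k_m=0.5, dt=0.01, steps=1000)
```

Because the rate is written for one direction only, the reaction can never run backwards, which matches the remark above that the direction of reversible reactions is fixed.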
3.3 Result: Visualization of Large Scale Chemical Data
Figure 3.2: Network representing active chemical reactions in metabolic pathway
Figure 3.2 shows the chemical reactions that occur in the cell represented as a network. There
are approximately 1000 compounds with 900 reactions happening between them. It is seen that
the network representation is complex and not easy to follow. The nodes are arranged using the
Force Atlas 2 layout [27], which arranges
the nodes so that they do not overlap and are placed close to the nodes that they share edges
with. This results in the nodes with higher degree (both in and out degree) being placed
in the center. The size of each node has been made proportional to its degree, making it
easier to identify the compounds which are the most active.
A cleaner way to represent this data is to use hive plots [28]. By fixing the location of
each node in the network in advance, it becomes easier to study each node. This consistency of
location provides familiarity so that a comparative study can be easily performed. In hive plots,
nodes are placed along predetermined axes. The hierarchy of the nodes on each axis is kept
constant, so that the node positions do not vary across different networks.
Figure 3.3 shows the hive plot for a growth medium labeled 'Group F'. Group F represents
a configuration of the input. In order to classify the nodes into reactants, reactions and
products, 3 axes are used. The compounds external to the cell are given a higher position in
the hierarchy. This makes it easier to identify the external compounds in the network. Although
the number of nodes and edges has increased, hive plots simplify the network visualization in the
following way. Consider H2O, a compound external to the cell. Assign it the highest level
in the hierarchy, so that it is always at the top of its axis in the products and reactants.
When comparing two growth media it becomes easy to observe the changes to H2O.
Further, differential hive plots can be generated to compare different graphs easily.
Figure 3.3: Hive Plots
Figure 3.4 shows two differential hive plots comparing the growth medium 'Group F' with
'Group G' and 'Group Z' respectively. Group F and Group G have the same output value
in terms of biomass and produce the highest biomass of all the media available. In contrast,
Group Z produces the least biomass. In the hive plots we see that Group F and G differ by
a few reactions, while Group F and Z have a larger variation.
Figure 3.4: Differential hive plots (panels: Media F vs G, Media F vs Z)
3.4 Computational Tools for Cell Layer Simulation
In this section we look at the tools that can be used for processing the data in the cell layer
simulation. Python was used to process the input file obtained after FBA. Python allows for
developing quick prototypes, and the libraries available for Python enable easy development of
applications. Python's JSON library was used to extract the data from the input file. The
files required for visualization are generated using Python as well. The following subsections
discuss the visualization tools used.
Gephi
Gephi [29] is a GUI application for visualizing networks. The nodes can be created manually
or imported from a file. For this project we generated the network as a .csv file. Once
the network is visualized, a better understanding of the system is obtained. This helps in
deciding the direction of further processing and visualization of the data.
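A network file of this kind can be produced with a few lines of Python. The sketch below is an assumption about the layout rather than the script used in this work: it writes an edge table with Source and Target columns, which Gephi's spreadsheet import recognizes, and links every reactant of a reaction to every product. The reaction data are invented.

```python
import csv
import io


def reactions_to_edge_csv(reactions):
    """Return CSV text with one 'Source,Target' row per network edge.

    reactions maps a reaction name to (reactants, products); each reaction
    links every one of its reactants to every one of its products.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Source", "Target"])
    for reactants, products in reactions.values():
        for reactant in reactants:
            for product in products:
                writer.writerow([reactant, product])
    return buffer.getvalue()


# Hypothetical reaction, for illustration only.
toy_reactions = {"R1": (["glc"], ["g6p", "adp"])}
edge_csv = reactions_to_edge_csv(toy_reactions)
```

Writing the returned text to a .csv file gives a table that can be loaded through Gephi's import dialog as an edges table.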
Jhive
Jhive [30] is a Java-based application used to create hive plots. The software takes files in
the dot format and visualizes them as hive plots. The dot format contains the list of nodes
and the list of edges. The algorithm HivePlotGenerator takes a JSON file and converts it to a
dot file for Jhive. The advantage of using Jhive is the built-in option to use the differential
hive plot generator.
Algorithm 1: HivePlotGenerator(File)
1   Extract all the compounds (nodes) from the input file
2   The for loop in lines 3-6 sets the nodes for all compounds in different media
3   for each compound extracted do
4       set the position of the compound in the hierarchy
5       set the node labels as compound names and set the relevant data
6   end
7   Create a template graph file with all the node data set
8   for each growth media do
9       for each reaction in the growth media do
10          Set each reaction as a node; set the edges from reactants to the reaction node;
            set the edges from the reaction node to the products
11      end
12      Create the Hive plot file from the Nodes template and the reactions data extracted
13  end
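The steps of Algorithm 1 can be sketched in Python as follows. This is an illustrative reading of the pseudocode, not the original script: the JSON layout ("media", "reactants", "products"), the node attributes rank_in_axis and kind, and the sample data are all hypothetical, and the attributes that Jhive actually expects in its dot input are not reproduced here.

```python
import json


def hive_plot_dot(data, external):
    """Return {medium_name: dot_text} for a hypothetical FBA export.

    data     : {"media": {medium: {reaction: {"reactants": [...], "products": [...]}}}}
    external : set of compound names that are external to the cell
    """
    # Lines 1-6: collect every compound once and fix its place in the hierarchy.
    compounds = sorted(
        {c for medium in data["media"].values()
           for rxn in medium.values()
           for c in rxn["reactants"] + rxn["products"]},
        key=lambda c: (c not in external, c),  # external compounds come first
    )
    # Line 7: node template shared by all media, so positions stay comparable.
    template = [
        f'  "{name}" [label="{name}", rank_in_axis={rank}];'
        for rank, name in enumerate(compounds)
    ]
    plots = {}
    for medium, reactions in data["media"].items():        # line 8
        lines = ["digraph hive {"] + template
        for rxn_name, rxn in reactions.items():             # lines 9-11
            lines.append(f'  "{rxn_name}" [label="{rxn_name}", kind="reaction"];')
            lines += [f'  "{r}" -> "{rxn_name}";' for r in rxn["reactants"]]
            lines += [f'  "{rxn_name}" -> "{p}";' for p in rxn["products"]]
        lines.append("}")
        plots[medium] = "\n".join(lines)                     # line 12
    return plots


# Hypothetical input, for illustration only.
SAMPLE = json.loads("""
{"media": {
  "Group F": {"R1": {"reactants": ["glc", "h2o"], "products": ["g6p"]}},
  "Group G": {"R2": {"reactants": ["g6p"], "products": ["biomass", "h2o"]}}
}}
""")
plots = hive_plot_dot(SAMPLE, external={"h2o"})
```

Because the node template is built once from all media, a compound such as h2o receives the same rank in every generated file, which is the property that makes the plots of different growth media directly comparable.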
D3.js
D3.js [31] is a JavaScript library for most types of visualization. It is a very powerful tool for
visualization; however, it requires a good understanding of JavaScript and other web technologies.
[32] is a good example of a hive plot implementation in D3.js. The advantage here is
that the hive chart can be made more interactive. Figure 3.5 shows a bar chart developed
using D3.js. Some of the starter code is borrowed from the tutorials [cite links]. The chart
shows the mutual information values obtained for different growth media.
Figure 3.5: Chart showing the mutual information values for various input combinations
Chapter 4
Pathway Layer
In the previous chapter we looked at the cell as a whole. In this chapter we start looking at
the inner workings of the cell, specifically the signaling pathways which control the reactions
that occur in the cell. As described in the previous chapter, the signal transduction pathways
control the fluxes of the metabolic pathways. While FBA does give us the optimal values
of the fluxes, being able to obtain real-time fluxes of the reactions in the metabolic pathway
helps us study the cell better.
In this chapter we perform a study of the JAK-STAT pathway. Again, the interactions
between the compounds are modeled as a network. In addition to calculating the Mutual
Information (MI) between the input and the output, the MI value is calculated between the input
and all other nodes of the network. We also look at ODE and Gillespie simulations for
studying the JAK-STAT pathway. Next, we look at characterizing the changes due to mutation
in the signal transduction pathway. In particular, we look at the MAPK pathway and
the gene expression data available in TCGA.
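As background for the Gillespie simulations mentioned above, the following is a minimal sketch of the Gillespie direct method in Python. It is not the JAK-STAT model of this thesis: the two-reaction phosphorylation cycle, the species names, the rate constants and the random seed are hypothetical and serve only to show how reaction events and waiting times are sampled.

```python
import random


def gillespie(state, reactions, t_end, rng):
    """Gillespie direct method.

    state     : dict species -> molecule count (modified in place)
    reactions : list of (propensity_function, change_dict)
    Returns a list of (time, state_copy), one entry per reaction event.
    """
    t = 0.0
    history = [(t, dict(state))]
    while True:
        props = [prop(state) for prop, _ in reactions]
        total = sum(props)
        if total == 0:
            break  # no reaction can fire any more
        t += rng.expovariate(total)  # exponential waiting time to the next event
        if t > t_end:
            break
        # Pick reaction j with probability props[j] / total.
        threshold = rng.random() * total
        cumulative = 0.0
        for p, (_, change) in zip(props, reactions):
            cumulative += p
            if threshold < cumulative:
                for species, delta in change.items():
                    state[species] += delta
                break
        history.append((t, dict(state)))
    return history


# Hypothetical phosphorylation cycle with invented rate constants.
cycle = [
    (lambda s: 0.5 * s["STAT"], {"STAT": -1, "STATp": +1}),   # phosphorylation
    (lambda s: 0.1 * s["STATp"], {"STAT": +1, "STATp": -1}),  # dephosphorylation
]
history = gillespie({"STAT": 100, "STATp": 0}, cycle, t_end=20.0, rng=random.Random(1))
```

Unlike an ODE model, every run with a different seed gives a different trajectory, and it is this run-to-run variability that makes an information-theoretic analysis of the pathway meaningful.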
Signal Transduction Pathways
Signal transduction pathways are series of chained biochemical processes where molecules
interact with each other to propagate physical or chemical signals through biological cells [9].
Figure 4.1: Overview of a signal Network
In particular, with reference to Fig. 4.1a, they most commonly propagate extracellular sig-
nals (embedded in physical or chemical parameters in the extracellular environment) into the
cell, where the information they carry is utilized to accordingly regulate major cellular func-
tionalities, such as the cell growth rate and cell division (proliferation), cell differentiation,
cell death (apoptosis, anti-apoptosis), and cell physiological stability (homeostasis). This
propagation is most commonly initiated at the cell membrane by special proteins (biological
macromolecules with specific functions), called receptors, which are sensitive to extracellu-
lar signals by binding to information-bearing molecules from the extracellular environment.
Upon this binding reaction, the receptors undergo a conformational change in the intracellular
space, and initiate cascades of chemical reactions, i.e., protein-to-protein interactions,
where specific proteins, the kinases, get activated through the addition of a phosphate group
(phosphorylation), and subsequently, possibly after binding to other proteins into complexes,
activate other proteins downstream of the cascade. Other specific proteins, the phosphatases,
“reset” the activated proteins along the cascade by removing the aforementioned phosphate
group (dephosphorylation). These cascaded reactions triggered by the initial extracellular
signal result in the overall propagation of the information through reaction chains, which
ultimately results in the activation of transcription factors, which are other proteins that,
when active, are able to regulate the aforementioned cellular functionalities by increasing
(induced) or decreasing (repressed) the expression of one or more downstream DNA genes
inside the cell nucleus [33]. Through these biochemical processes, the initial information con-
tained in the concentration of extracellular molecules is transduced into the concentration of
bound receptors, which is in turn transduced into the concentration of activated kinases and
protein complexes along the reaction cascade, and finally into the concentration of activated
transcription factors.
Molecular Information Abstraction
In this chapter, we abstract the aforementioned biochemical processes underlying cell sig-
nal transduction pathways as communication channels that propagate the input information
from extracellular signals to each protein of the pathway, which ultimately relay this infor-
mation as output information to the transcription factors in the cell nucleus, as sketched
in Fig. 4.1b. Our aim is to provide a quantitative characterization of this information as
it flows through the signal transduction pathway, and we rely on the following assumptions
commonly accepted in computational biology literature:
• The concentrations of all the aforementioned molecular species are considered homoge-
neous at any time instant outside the cell membrane (information-bearing molecules), at
the cell membrane (receptors), inside the cell membrane (phosphorylating proteins), and
inside the cell nucleus (possibly other phosphorylating proteins, transcription factors), re-
spectively. This assumption corresponds to a compartmentalized well-stirred system in
chemical system modeling [34].
• Each chemical reaction in the pathway, expressed in general as
$A + B \underset{k_r}{\overset{k_f}{\rightleftharpoons}} C + D$, where $A, B$ are the reactant
molecule species, $C, D$ are the product molecule species, and $k_f$ and $k_r$ are the forward
and reverse reaction rates, respectively (for irreversible reactions $k_r = 0$ and the backward
arrow is omitted, and $B$ and/or $D$ can be omitted depending on the reaction), is modeled
mathematically through mass action kinetics as follows [34]:
$\frac{d[C](t)}{dt} = k_f [A](t)[B](t) - k_r [C](t)[D](t)$, where $[\cdot](t)$ denotes the
concentration of the molecule species as a function of the time $t$. The same expression is
valid by substituting $[D](t)$ in place of $[C](t)$. In the pathway picture of Fig. 4.1b, each
circle represents a reactant or product molecule species, and each arrow corresponds to the
participation of a molecule species in a chemical reaction. These molecule species might be
subject to degradation reactions, expressed as $A \xrightarrow{k_d} \emptyset$, where $k_d$ is
the degradation rate, with mass action kinetics formulation
$\frac{d[A](t)}{dt} = -k_d [A](t)$. Chemical reactions are affected by noise according to the
Chemical Master Equation (CME) [34], which can be computationally implemented through
Gillespie’s Stochastic Simulation Algorithm (SSA) [35].
• The input concentration of information-bearing molecules in the extracellular environment
$X_s(t)$, where $s$ is a molecular species out of $S$ extracellular signals, is the result of a
molecule source in the extracellular environment (another cell or a dose provided to the cell
culture during an experiment), which consequently varies the concentration $X_s(t_0)$, where
$t_0$ corresponds to an initial state of the system, by an amount $X_s$, whose value corresponds
to the input information, which is kept constant during the propagation of this information
through the pathway. This models the situation where in a lab experiment a chemical
reagent is added to a cell culture in a determinate quantity [36]. The output concentration
of transcription factors $Y_{out_k}(t)$, as well as the concentration of all the proteins involved
in the aforementioned cascaded reactions of the pathway $Y_j(t)$, are in general functions
of the time $t$. We define $T$ as the time interval necessary for all these concentrations to
reach a steady-state regime (constant or periodic).
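As a minimal sketch of the mass action formulation in the second assumption above, the following Python snippet integrates d[C]/dt = kf [A][B] - kr [C][D] for a single reversible reaction with an explicit Euler scheme. The rate constants, initial concentrations, step size, and function names are illustrative assumptions and are not taken from the JAK-STAT model.

```python
def mass_action_step(conc, kf, kr, dt):
    """One explicit-Euler step of A + B <-> C + D under mass action kinetics."""
    a, b, c, d = conc
    net_rate = kf * a * b - kr * c * d  # d[C]/dt = d[D]/dt = -d[A]/dt = -d[B]/dt
    return (a - net_rate * dt, b - net_rate * dt,
            c + net_rate * dt, d + net_rate * dt)


def simulate(conc0, kf, kr, dt=1e-3, steps=20000):
    """Integrate the reaction from conc0 = ([A], [B], [C], [D]) for steps * dt time units."""
    conc = conc0
    for _ in range(steps):
        conc = mass_action_step(conc, kf, kr, dt)
    return conc


# Hypothetical values, chosen only for illustration.
a, b, c, d = simulate((1.0, 1.0, 0.0, 0.0), kf=2.0, kr=1.0)
print(a, b, c, d)
```

Once the concentrations stop changing, the net rate vanishes, so kf [A][B] equals kr [C][D], while [A] + [C] stays constant throughout the run.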
In agreement with the aforementioned assumptions, Fig. 4.1b captures the abstraction of the
information flow in a typical cell signaling pathway, as we propose in this paper. In particular,
the Input Information is carried by a change at time $t_0$ in the extracellular concentrations
of information-bearing molecules at the input of the signal transduction pathway, quantified
through the entropy expression $H(\{X_s\}_{s=1}^{S})$. This information is propagated through the
signal transduction pathway by the modulation of the interactions between the pathway
proteins, which results in a time evolution of the concentration of each of these proteins
within the aforementioned time interval $T$. Biological noise and other effects [37] tend to
decrease the information content in the protein interaction modulation by randomization
or equivocation [38] during its propagation in the signaling pathway, resulting in a residual
information at each pathway protein, quantified through the Mutual Information (MI)
$I_j = I(\{X_s\}_{s=1}^{S}; \{Y_j(t),\, t_0 \le t \le t_0 + T\})$ at protein $j$. Finally, the
protein-protein interaction modulation through the pathway is transduced into the modulation
of the concentration of each downstream transcription factor $k$, $k = 1, \ldots, K$, which is
the Output Information of the pathway, quantified through the MI
$I_k = I(\{X_s\}_{s=1}^{S}; \{Y_{out_k}(t),\, t_0 \le t \le t_0 + T\})$. In
Fig. 4.1b, and in the rest of the paper, this information flow is graphically depicted for each
pathway protein as a circle with area proportional to the corresponding MI.
4.1 JAK-STAT Pathway
The JAK-STAT pathway is a signal transduction pathway that controls various functions
of the cell, including proliferation. Changes in the JAK-STAT pathway play a significant role
in diabetes [39–41] and play a part in cancer [42]. In the JAK-STAT pathway, the receptors
form a complex with external compounds such as cytokines. JAK then forms a complex with the
receptor-cytokine complex. This complex then recruits STAT, which acts as a transcription
factor that controls gene regulation. Figure 4.2 shows the JAK-STAT pathway as a
network of chemical reactions. This network is later used to represent the flow of information.
In order to perform a study on the JAK-STAT pathway, a robust model of the pathway is
required. [43] summarizes the available models. Since we are studying the information flow
of the pathway, we require data on the chemical reactions that take place in the JAK-STAT
pathway. Therefore we look for kinetic models. The model by Yamada et al. [44] was
chosen because it is a well-studied ODE model of the JAK-STAT pathway. The final model
is obtained from the Biomodels database [45], as this guarantees that the model is as close as
possible to the work done in [44]. It is better to use the readily available model rather
than reinvent the wheel. In the previous chapter we looked at applying flux balance
analysis (FBA) to the metabolic pathway. However, a similar approach with signal transduction
pathways is not possible. For the metabolic pathway it can be assumed that when
an enzyme is available it is always present in high enough quantities to constantly enable the
reaction. Hence Michaelis-Menten kinetics can be applied to the metabolic pathways. However,
in the case of signal transduction pathways the concentration of enzymes may or may
not be enough to sustain the reactions. In this case the rate equations need to be modeled
using mass action kinetics. As a consequence of not being able to use Michaelis-Menten
kinetics, FBA cannot be applied to the signal transduction pathways.
4.2 ODE and Gillespie Simulation
ODE
The JAK-STAT model obtained from the Biomodels website is used to run the ODE simulation.
In the ODE model, the chemical reactions are available as differential equations.
The concentration of each species can then be found by forming the differential equations
that describe the change in species concentration. The model from Biomodels is available as
an SBML file. Three tools (MATLAB, Python and COPASI) were tested to run an ODE
simulation. In the end MATLAB was chosen, for reasons discussed later when we look at
Gillespie simulation. Utilizing ODE simulation, the input sensitivity of the JAK-STAT model
is determined. At this point I would like to acknowledge Zahmeeth Sakaff’s contribution in
determining the input sensitivity. This work was done as a collaboration for [46]. Figure
4.3 visualizes the input sensitivity of the model. Here the input is the concentration of the
cytokine interferon, and the output is the value of the activated Stat1n dimer (since it controls
DNA transcription). It was found that the pathway is most sensitive to interferon (IFN) values
of 0-20 nMol/litre.
Figure 4.2: JAK-STAT pathway visualized as a network
Figure 4.3: Sensitivity of JAK-STAT to input Interferon (x-axis: Time (seconds); y-axis: [STAT1n*-STAT1n*] (nM/L))
Gillespie Simulation
In an ODE simulation, it is assumed that all possible reactions occur at
each time step. Therefore, the order in which the reactions are processed is not important.
However, when we consider the way reactions actually occur, a stochastic order of the
reactions is more realistic. Since chemical reactions are essentially interactions between
molecules, the order of the chemical reactions is better described as a spatiotemporal
function. The Gillespie algorithm accounts for this by treating the order of reactions as a
random event.
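The idea of drawing the next reaction and its waiting time at random can be illustrated with a minimal sketch of Gillespie's direct method for a single reversible reaction A + B <-> C + D. This is not the MATLAB setup used in this work: the molecule counts, the rate constants (treated here directly as stochastic propensity constants, which is an assumption), and all names are hypothetical.

```python
import math
import random


def gillespie_reversible(n_a, n_b, n_c, n_d, kf, kr, t_end, rng):
    """Direct-method SSA for A + B <-> C + D, tracking integer molecule counts."""
    t = 0.0
    trajectory = [(t, n_a, n_b, n_c, n_d)]
    while True:
        a_fwd = kf * n_a * n_b  # propensity of the forward reaction
        a_rev = kr * n_c * n_d  # propensity of the reverse reaction
        a_tot = a_fwd + a_rev
        if a_tot == 0.0:
            break  # no reaction can fire any more
        # Waiting time to the next reaction is exponentially distributed.
        t += -math.log(1.0 - rng.random()) / a_tot
        if t > t_end:
            break
        # Choose which reaction fires, with probability proportional to its propensity.
        if rng.random() * a_tot < a_fwd:
            n_a, n_b, n_c, n_d = n_a - 1, n_b - 1, n_c + 1, n_d + 1
        else:
            n_a, n_b, n_c, n_d = n_a + 1, n_b + 1, n_c - 1, n_d - 1
        trajectory.append((t, n_a, n_b, n_c, n_d))
    return trajectory


rng = random.Random(1)  # fixed seed so that the run is repeatable
traj = gillespie_reversible(50, 40, 0, 0, kf=0.01, kr=0.02, t_end=10.0, rng=rng)
print(len(traj), traj[-1])
```

Because a propensity is zero whenever one of its reactants is exhausted, the counts never become negative, and each run with a different seed gives a different, non-smooth trajectory.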
Both COPASI and MATLAB have algorithms for Gillespie simulation. MATLAB provides
stochastic simulation under the SimBiology toolbox. The SBML file was imported into
MATLAB and the custom rate equations were modified into mass action law equations.
This allowed for Gillespie simulation in MATLAB. Once a valid setup for Gillespie simulation
was obtained, a large number of simulations was required within the input range of IFN
previously calculated. Over the input range of 0-20 nMol/litre with a step of 0.4, we
simulate each concentration 100 times. Although each simulation is not computationally
intensive, running all of the simulations in a sequential for loop is inefficient. MATLAB has
a built-in loop construct called parfor that executes the loop iterations in parallel. This
significantly reduces the computation time, as we obtain better resource utilization. The
algorithm ParallelStochasticSimulation(File) shows the Gillespie simulation using the parallel
for. For each of the simulations we store the values of concentration for all time steps.
Algorithm 2: ParallelStochasticSimulation(File)
load the SimBiology model from the file
set the simulation options: time steps, solver, etc.
for each concentration between 0 and 20 in steps of 0.4 do
    set the initial concentration value
    parfor each of the 100 runs do
        simulate the model using SimBiology
        generate the plots
        save the data
    end
end
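The loop structure of Algorithm 2 can be sketched in Python as follows. This is only an analogue of the MATLAB parfor loop, under stated assumptions: `simulate_once` is a hypothetical stand-in for one stochastic run, and a thread pool is used for brevity (for CPU-bound simulations in CPython a process pool would be closer in spirit to parfor).

```python
from concurrent.futures import ThreadPoolExecutor


def simulate_once(ifn_conc, run_index):
    """Hypothetical stand-in for one stochastic run; returns a labelled record."""
    return {"ifn": ifn_conc, "run": run_index}


def parallel_sweep(n_steps=50, step=0.4, runs=100, workers=4):
    """Outer loop over input concentrations, inner loop distributed over workers."""
    concentrations = [round(i * step, 10) for i in range(n_steps + 1)]  # 0.0 ... 20.0
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for conc in concentrations:
            # The inner map plays the role of the parfor over the repeated runs.
            results[conc] = list(
                pool.map(lambda r, c=conc: simulate_once(c, r), range(runs))
            )
    return results


results = parallel_sweep()
print(len(results), "concentrations,", len(results[0.0]), "runs each")
```

The runs for one concentration are independent of each other, which is what makes the inner loop safe to distribute.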
Figure 4.4: Plot from Gillespie Simulation
Figure 4.4 shows the plot from the Gillespie simulation. Compared to the ODE simulation,
the curves in the plot are noticeably less smooth. This is expected for a stochastic
simulation.
4.3 Mutual Information Calculation
The calculation of mutual information helps quantify the loss of information in the JAK-STAT
pathway. By calculating the mutual information between the input (interferon for the
JAK-STAT pathway) and each of the nodes present in the pathway, we observe the flow of
information through the pathway. In this section we look at calculating mutual information
for multi-dimensional data.
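The relationship between the various entropies can be seen as:
\[ I(\bar{X}_{source}; \bar{X}_{dest}) = H(\bar{X}_{source}) - H(\bar{X}_{source} | \bar{X}_{dest}) = H(\bar{X}_{dest}) - H(\bar{X}_{dest} | \bar{X}_{source}) \]
\[ H(\bar{X}_{source}) = -\int_{x^{min}_{source}}^{x^{max}_{source}} P_{\bar{X}_{source}}(x_{source}) \log_2\left(P_{\bar{X}_{source}}(x_{source})\right) dx_{source} \]
\[ H(\bar{X}_{source} | \bar{X}_{dest}) = -\int_{x^{min}_{dest}}^{x^{max}_{dest}} P_{\bar{X}_{dest}}(x_{dest}) \int_{x^{min}_{source}}^{x^{max}_{source}} P_{\bar{X}_{source} | \bar{X}_{dest}}(x_{source} | x_{dest}) \log_2\left(P_{\bar{X}_{source} | \bar{X}_{dest}}(x_{source} | x_{dest})\right) dx_{source}\, dx_{dest} \]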
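The symmetry of the first relation can be checked numerically on a discrete analogue, where the integrals become sums over a joint probability table. The sketch below is illustrative only: the function names and the joint table are made up and do not come from the pathway data.

```python
import math


def entropy(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def mutual_information_two_ways(joint):
    """Return (H(X) - H(X|Y), H(Y) - H(Y|X)) for joint[i][j] = P(X = i, Y = j)."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    h_x_given_y = 0.0
    h_y_given_x = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0:
                h_x_given_y -= pxy * math.log2(pxy / py[j])
                h_y_given_x -= pxy * math.log2(pxy / px[i])
    return entropy(px) - h_x_given_y, entropy(py) - h_y_given_x


# Made-up joint distribution of a noisy binary channel.
mi_a, mi_b = mutual_information_two_ways([[0.4, 0.1], [0.1, 0.4]])
print(mi_a, mi_b)
```

Both expressions give the same value, and the value drops to zero when the source and destination are independent.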
Calculation
As per [46], we detail a methodology to estimate the aforementioned molecular information
flow parameters starting from the knowledge of the chemical reactions of the pathway, and
their kinetic rates, as expressed in Sec. 4. For this, we take into account that in general
the signaling pathway communication channels as defined above are characterized by the
non-linearity of chemical reactions, and the effect of feed-forward and feedback loops in the
pathway reaction cascade [47], which, together with the aforementioned CME noise [34], do
not allow for a closed-form analytical expression of the MI parameters. As a consequence,
in this paper we devise a computational approach based on the stochastic simulation of
chemical reaction kinetics through the aforementioned SSA [35]. Based on this simulation
methodology, we estimate the MI by collecting and analyzing data, inspired by the procedure
in [36,47] with the following three main differences: i) we rely on computational simulation
rather than expensive wet lab experiments, which does not pose stringent constraints
on the size of the data set that can be collected; ii) we estimate the MI taking into account
the complete time evolution of the output, instead of only accounting for a single value of
the output in a dose-response characterization, as is often done in experimental studies, such
as in [36]; iii) we perform the MI estimation not only at the pathway output, but also at each
protein and protein complex.
For simplicity of notation, in the following we will consider a pathway having only one
species of information-bearing molecules at the input ($S = 1$), and only one type of output
transcription factors ($K = 1$). All the following expressions can be generalized to signal
transduction pathways with multiple inputs and outputs.
Computational Approach
Goal
The final goal of our computational approach is the estimation of the MI $\tilde{I}_j$ at each
pathway protein $j$, expressed as
\[ \tilde{I}_j = \tilde{H}(X) - \tilde{H}(X | \{Y_j(t),\, t_0 \le t \le t_0 + T\}) , \tag{4.1} \]
where $\tilde{H}(\cdot)$ and $\tilde{H}(\cdot|\cdot)$ denote the estimated entropy and conditional
entropy, respectively, $X$ is the input concentration of information-bearing molecules, and
$\{Y_j(t),\, t_0 \le t \le t_0 + T\}$ is the time evolution of the concentration of the pathway
protein $Y_j(t)$ within a time interval $T$ from $t_0$. The estimation of the output MI
$\tilde{I}_{out}$ is expressed as in (4.1) and in the subsequent equations by substituting $out$
in place of $j$.
Details
The necessary data for the MI estimations is obtained through SSA simulations of the chemical
reactions of the pathway [35]. In particular, for each value $x_i$, $i = 0, \ldots, I$, of the input
concentration $X$ sampled from the range between $x_{min}$ and $x_{max}$, defined here as the value
below which the concentrations of any pathway protein do not significantly change, and the
value above which the same concentrations do not show noticeable changes in their time
evolution, we run a total of $R$ simulations. Each SSA simulation is run independently, and
starts in the same steady state that the system reaches with an input concentration value
$X = 0$.
The estimated input entropy $\tilde{H}(X)$ is computed through the histogram approach [48] as
\[ \tilde{H}(X) = -\sum_{i=1}^{I} p_X(x_i) \log_2\left(\frac{p_X(x_i)}{w_X}\right) , \tag{4.2} \]
where $p_X(x_i) = 1/I$, according to the simplifying assumption of having a uniformly distributed
input, in agreement with [36], and $w_X$ is the sampling interval $(x_{max} - x_{min})/I$.
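As a quick worked check of Equation (4.2), the short Python sketch below evaluates the sum for a uniform input. With the values reported in Sec. 4.4 (51 input concentrations and a sampling interval of 0.4 nmol/litre), each term equals (1/51) log2((1/51)/0.4), so the sum reduces to log2(51 x 0.4) = log2(20.4). The function name is an illustrative assumption.

```python
import math


def uniform_input_entropy(num_inputs, bin_width):
    """Equation (4.2) with p_X(x_i) = 1 / num_inputs for every input value."""
    p = 1.0 / num_inputs
    return -sum(p * math.log2(p / bin_width) for _ in range(num_inputs))


h_input = uniform_input_entropy(51, 0.4)
print(round(h_input, 2))  # 4.35
```

This agrees with the estimated input entropy of 4.35 bits reported with the results in Sec. 4.4.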
The estimated conditional entropy $\tilde{H}(X | \{Y_j(t),\, t_0 \le t \le t_0 + T\})$ of the input
concentration $X$ given the time evolution of the concentration of the pathway protein $j$ is
computed as
\[ \tilde{H}(X | \{Y_j(t),\, t_0 \le t \le t_0 + T\}) = -\sum^{N_{j,t_0}} \sum^{N_{j,t_1}} \cdots \sum^{N_{j,t_N}} p_{Y_j}\left(\{y_{j,t_n}\}_{n=0}^{N}\right) \sum_{s=1}^{S_{\{y_{j,t_n}\}_{n=0}^{N}}} p_{X|\{y_{j,t_n}\}_{n=0}^{N}}(x_s) \log_2\left(\frac{p_{X|\{y_{j,t_n}\}_{n=0}^{N}}(x_s)}{w_{X,\{y_{j,t_n}\}_{n=0}^{N}}}\right) , \tag{4.3} \]
where $t_N = t_0 + T$, $N$ being the number of time samples considered when discretizing $Y_j(t)$
within the interval $T$ (for computational processing), $\{y_{j,t_n}\}_{n=0}^{N}$ is a set of values of the
protein concentration $Y_j(t)$ at time instants $t_0, t_1, \ldots, t_N$, $N_{j,t_n}$ is the number of histogram
bins considered for the protein concentration value $Y_j(t_n)$ to compute the multidimensional
histogram $p_{Y_j}$, $S_{\{y_{j,t_n}\}_{n=0}^{N}}$ and $w_{X,\{y_{j,t_n}\}_{n=0}^{N}}$ are the number and the size of histogram bins
considered for the input concentration $X$ to compute the histogram $p_{X|\{y_{j,t_n}\}_{n=0}^{N}}(x_s)$, where
$w_{X,\{y_{j,t_n}\}_{n=0}^{N}} = (x_{max} - x_{min})/S_{\{y_{j,t_n}\}_{n=0}^{N}}$ and $x_s$ is a value from the concentration input
$\{x_i\}_{i=0}^{I}$ sampled according to the histogram. The numbers of histogram bins $N_{j,t_n}$ are
computed from the aforementioned simulation data according to Doane’s formula [48] as
follows:
\[ N_{j,t_n} = 1 + \log_2(C) + \log_2\left(1 + \frac{g_{Y_j(t_n)}}{\sigma_{g_{Y_j(t_n)}}}\right) , \tag{4.4} \]
where $C = I \cdot R$ is the total number of simulation runs, $g_{Y_j(t_n)}$ is the estimated 3rd-moment
skewness of the distribution $p_{Y_j(t_n)}$ from the simulation data, and
$\sigma_{g_{Y_j(t_n)}} = \sqrt{\frac{6(C-2)}{(C+1)(C+3)}}$.
The number of histogram bins $S_{\{y_{j,t_n}\}_{n=0}^{N}}$ is computed with a similar expression as in (4.4) by
substituting $Y_j(t_n)$ (and $C$) with the set of $x_i$ values (number of $x_i$ values) that resulted in a
concentration evolution for protein $j$ equal to $\{y_{j,t_n}\}_{n=0}^{N}$. Finally, the probabilities $p_{Y_j}$, for
all the $J$ pathway proteins, and $p_{X|\{y_{j,t_n}\}_{n=0}^{N}}$, for all the combinations of values $y_{j,t_n}$ at each
time instant $t_n$ of each of the $J$ pathway proteins, are computed as histogram distributions of
the aforementioned data according to Algorithm 3. In Fig. 4.5 we show a graphical example
of the computation of $(\{Z_{i,r}\}_{t_n}, b_{t_n})$ as per Algorithm 3 for a protein in the case study
pathway detailed in Sec. 4.1, where we consider the results of multiple simulation runs for
different input concentrations, and overlay at $t_n$ the $N_{j,t_n}$ equally-spaced bins between min
and max values.
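The bin-count rule in Equation (4.4) can be sketched in a few lines of Python. The function name is hypothetical, the skewness is estimated from the sample central moments, and the absolute value of the skewness is taken, as in the usual statement of Doane's formula; that last point is an assumption with respect to the expression as printed above. The result is returned as a real number, and rounding it to an integer bin count is left to the caller.

```python
import math


def doane_bins(values):
    """Number of histogram bins suggested by Doane's formula (needs len(values) >= 3)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    skewness = m3 / m2 ** 1.5 if m2 > 0 else 0.0
    sigma_g = math.sqrt(6.0 * (n - 2) / ((n + 1) * (n + 3)))
    return 1 + math.log2(n) + math.log2(1 + abs(skewness) / sigma_g)


print(doane_bins([1, 2, 3, 4, 5, 6, 7, 8]))   # symmetric sample
print(doane_bins([0, 0, 0, 0, 0, 0, 0, 10]))  # strongly skewed sample
```

For a symmetric sample the skewness term vanishes and the rule reduces to 1 + log2(n); a skewed sample of the same size receives more bins.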
Figure 4.5: Graphical sketch of the computation of Steps 1-4 of Algorithm 3 for the phosphorylated
and dimerized output transcription factor STAT1n*-STAT1n* of the JAK-STAT pathway
(x-axis: Time (seconds); y-axis: [STAT1n*-STAT1n*] (nM/L)).
Algorithm 3: Probability Histograms for Equation (4.3)
Data: $R$ simulation runs for each of $I$ input concentrations, containing values for all
$N$ simulation steps
Result: For each protein $j$, $p_{Y_j}$ and $p_{X|\{y_{j,t_n}\}_{n=0}^{N}}$
1 for each simulation time step $t_n$ do
2     Create $\{Z_{i,r}\}_{t_n}$ by extracting the protein $j$ concentration for each simulation run $r$
      and input concentration $i$
3     Map each value of $\{Z_{i,r}\}_{t_n}$ into $N_{j,t_n}$ equally-spaced bins (with index $b_{t_n}$) between
      min and max values, expressed as $(\{Z_{i,r}\}_{t_n}, b_{t_n})$
4 end
5 Obtain matrix $M$ of size $C$ by $N$ by combining all the mapped bin indices $b_{t_n}$ for each
  simulation run $(i, r)$ and each time step $t_n$
6 Compute the multidimensional histogram considering each row of $M$ as a datapoint:
  $p_{Y_j}(\{y_{j,t_n}\}_{n=0}^{N})$
7 for each bin in the multidimensional histogram do
8     Take all the input values corresponding to the values $\{y_{j,t_n}\}_{n=0}^{N}$ that define the
      current multidimensional bin
9     Compute the histogram $p_{X|\{y_{j,t_n}\}_{n=0}^{N}}$ by mapping the input values found at Step 8
      into $S_{\{y_{j,t_n}\}_{n=0}^{N}}$ equally spaced bins between min and max values
10    If no input value from Step 8, set $p_{X|\{y_{j,t_n}\}_{n=0}^{N}} = 0$
11 end
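The steps of Algorithm 3, combined with Equation (4.3), can be sketched in Python as below. This is a simplified illustration rather than the implementation used in this work: the numbers of bins are passed in as fixed arguments instead of being derived from Doane's formula for each time step and each multidimensional bin, and all function and variable names are hypothetical.

```python
import math
from collections import defaultdict


def bin_index(value, lo, hi, n_bins):
    """Index of the equal-width bin of [lo, hi] that contains value."""
    if hi == lo:
        return 0
    return min(int((value - lo) / (hi - lo) * n_bins), n_bins - 1)


def conditional_entropy(inputs, trajectories, n_bins_y, n_bins_x, x_min, x_max):
    """Estimate H(X | {Y(t_n)}) from runs; trajectories[c][n] is Y(t_n) in run c."""
    n_runs = len(inputs)
    n_times = len(trajectories[0])
    # Steps 1-4: min and max of the protein concentration at each time step.
    lows = [min(tr[n] for tr in trajectories) for n in range(n_times)]
    highs = [max(tr[n] for tr in trajectories) for n in range(n_times)]
    # Steps 5-6: each run becomes a row of bin indices, i.e. one multidimensional bin.
    groups = defaultdict(list)
    for x, tr in zip(inputs, trajectories):
        key = tuple(bin_index(tr[n], lows[n], highs[n], n_bins_y)
                    for n in range(n_times))
        groups[key].append(x)
    # Steps 7-11: histogram of the inputs that fall in each multidimensional bin.
    w_x = (x_max - x_min) / n_bins_x
    h = 0.0
    for xs in groups.values():
        p_y = len(xs) / n_runs
        counts = defaultdict(int)
        for x in xs:
            counts[bin_index(x, x_min, x_max, n_bins_x)] += 1
        for count in counts.values():
            p_x = count / len(xs)
            h -= p_y * p_x * math.log2(p_x / w_x)
    return h


# Made-up data: the trajectory either reveals the input exactly or not at all.
inputs = [0.5, 1.5, 2.5, 3.5] * 3
informative = [[x, 2 * x] for x in inputs]
flat = [[1.0, 1.0] for _ in inputs]
print(conditional_entropy(inputs, informative, 4, 4, 0.0, 4.0))
print(conditional_entropy(inputs, flat, 4, 4, 0.0, 4.0))
```

When the trajectory identifies the input bin, the conditional entropy vanishes; when the trajectory is the same for every input, it equals the entropy of the input histogram.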
4.4 Result: Mutual Information Flow Graph
As shown in Fig. 4.6, the JAK-STAT kinetic model that we utilize to compute the numerical
results of this paper consists of $J = 34$ chemical species (proteins) and 46 reactions, and
its complete description and parameter values can be found in [44] and in the Biomodels
database [45], respectively. In this model, the input is the concentration of a small signaling
protein called interferon gamma (IFN-γ/IFN, green node) while the output is the phosphorylated
transcription factor STAT1n*-STAT1n* (blue node). In Fig. 4.6 we show the complete
interconnections between different protein species, and proteins at different phosphorylation
(denoted with a * when phosphorylated) or binding (dashed or denoted by their initials)
states involved in reactions.
To obtain the data necessary for our computational approach, we utilized the implementation
of the SSA algorithm in MATLAB SimBiology. Through these simulations, the values
of $x_{min}$ and $x_{max}$, defined in Sec. 4.3, were found to be 0 and 20 nmol/litre, respectively.
For simplicity, we considered a number $I = 51$ of different input concentrations, resulting in
a sampling interval $w_X = 0.4$ nmol/litre. For each input concentration, we arbitrarily run
$R = 100$ independent simulations for a time interval $T = 10,000$ seconds, estimated as
defined in Sec. 4. The time step of each simulation is set to $t_n - t_{n-1} = 1$ second ($N$ =
10,000). In Fig. 4.5 we show the simulation results for the phosphorylated and dimerized
output transcription factor STAT1n*-STAT1n* at each time step for only one of the $R$ runs
for a restricted number of input concentrations out of $I$.
The MI values for each pathway protein, estimated from the simulation data through the
computational approach in Sec. 4.3, are reported in Fig. 4.6, and graphically shown as a
corresponding proportional size of each graph node (protein). As expected, the value of MI
decreases as the information propagates through the reaction cascades, accumulating
chemical noise at each reaction (data processing inequality), from an estimated input entropy
$\tilde{H}(X) = 4.35$ bits to an estimated output MI $\tilde{I}_{out} = 1.65$ bits.
In Fig. 4.7 we show a comparison bar chart between the MI of Fig. 4.6 estimated by taking
into account the time evolution $\{Y_j(t),\, t_0 \le t \le t_0 + T\}$ of each protein concentration,
and an MI similarly estimated, but only taking into account the maximum value
$\max_{t_0 \le t \le t_0 + T} Y_j(t)$.
As expected, the latter generally underestimates the MI of the pathway proteins.
Figure 4.6: Estimated MI of the JAK-STAT pathway (node size proportional to MI value in [bits]).
Node labels with estimated MI in bits:
R (3.69)
JAK (0.56)
Receptor-JAK (4.34)
IFN-Receptor-JAK (4.35)
IFNRJ2 (4.35)
IFN-R-J2* (4.03)
STAT1c (3.92)
IFNRJ2*-STAT1c (3.86)
STAT1c* (3.98)
IFNRJ2*-STAT1c* (0.61)
STAT1c*-STAT1c* (3.82)
SHP2 (3.9)
IFNRJ2*-SHP2 (2.03)
PPX (3.88)
STAT1c*-PPX (3.86)
STAT1c-STAT1c* (0.73)
STAT1n*-STAT1n* (1.65)
STAT1n* (1.31)
PPN (1.3)
STAT1n*-PPN (1.06)
STAT1n (0.57)
STAT1n-STAT1n* (0.56)
mRNAn (0.56)
mRNAc (0.56)
SOCS1 (0.56)
IFNRJ2*-SOCS1 (0.56)
IFNRJ2*-SHP2-SOCS1-STAT1c (0.56)
STAT1c*-STAT1c*-PPX (3.45)
STAT1n*-STAT1n*-PPN (1)
IFNRJ2*-SOCS1-STAT1c (0.56)
IFN (4.35)
IFNRJ2*-SHP2-STAT1c (3.84)
IFNRJ2*-SHP2-SOCS1 (0.56)
IFNR (4.35)
Figure 4.7: Comparison of MI with time evolution vs. MI with max values (y-axis: Mutual
Information (in bits); series: Max_Value_Dependent_MI, Time_Evolution_Dependent_MI).
4.5 MAPK Pathway
The MAPK (Mitogen-Activated Protein Kinase) pathway is a signal transduction pathway that has
been shown to be significant in cancer research [49–51]. Mutations in the genes of the
MAPK pathway may lead to a cancerous state in the cell. By computing the mutual information
for the MAPK pathway as a network, we hope to understand how the mutations modify the flow
of information. As in the JAK-STAT pathway, the signal enters through a receptor, which in the
MAPK pathway is EGFR (Epidermal Growth Factor Receptor). The transcription factor c-Fos is
chosen as the output of the pathway. In this section we look at applying gene expression data
from the TCGA database to obtain the mutual information of the MAPK network.
TCGA Data
TCGA (The Cancer Genome Atlas) [52] is a database of genomic data. The database
contains gene data from tumor and non-tumor cells for 33 cancer types. For this study we
consider the mutations in the MAPK pathway that lead to colon cancer. Access to cancer
data was provided by the local copy of the TCGA database hosted by the UNL Systems
Biology and Biomedical Informatics, managed by Dr. Cui. From the database we utilized the
gene expression data. The method utilized to obtain the genomic data is RNA-Seq, which
gives the value of mRNA expression for each gene. The expression is obtained in Fragments
Per Kilobase of transcript per Million mapped reads (FPKM). Since mRNA is responsible
for generating information in the biological cell network, utilizing the gene expression data
should provide us with an estimate of the information flow.
4.6 Result: Mutual Information Change Due to Mutation
In Fig. 4.8 we show data points extracted from TCGA colon cancer data. In the following,
we demonstrate our approach for calculating the communication performance of a signaling
pathway from high-throughput genomic data, and compare them in healthy and cancer cells.
For each sample, we extracted the expression of EGFR and the corresponding expression of
the immediate early gene c-Fos [53]. Although cell signaling pathways are dynamic, and their
input-output behavior is in general a function of the time, we can consider these data points
as steady-state average samples from an ensemble of non-synchronized signaling pathway
states.
Figure 4.8: EGFR vs. c-Fos mRNA expression in FPKM (x-axis: EGFR [FPKM]; y-axis: C-FOS [FPKM];
series: Tumor and Healthy, each with a polynomial trend line).
By observing the polynomial trend lines [54] of the healthy samples vs. the tumor samples, it
is clear that in healthy cells the expression of c-Fos can be controlled more precisely and over
a larger range than in tumor cells, where c-Fos is mostly around a stable (and low, as also
confirmed in [55]) value for almost the whole range of observed EGFR expression. Inspired by
the methods in [13], we can quantify these two cases of controllability in terms of communication
performance of the pathway, i.e., Mutual Information (MI) from information theory [38],
which is expressed as follows:
\[ I(EGFR; cFos) = H(EGFR) - H(EGFR | cFos) , \]
where $EGFR$ and $cFos$ represent values of the expression of EGFR and c-Fos observed in
the aforementioned data. The input entropy
\[ H(EGFR) = -\sum_{k=1}^{K-1} P(A_k \le EGFR < A_{k+1}) \log_2 P(A_k \le EGFR < A_{k+1}) \]
is computed from the probability histogram of EGFR values in the data [56] with bin number
$K$ and size $A_{k+1} - A_k$ chosen according to Doane’s formula [48]. The conditional
entropy of the input given the output $H(EGFR | cFos)$ is computed as follows:
\[ H(EGFR | cFos) = -\sum_{n=1}^{N-1} P(B_n \le cFos < B_{n+1}) \sum_{m=1}^{M-1} P(C_m \le EGFR < C_{m+1} | B_n \le cFos < B_{n+1}) \log_2 P(C_m \le EGFR < C_{m+1} | B_n \le cFos < B_{n+1}) , \]
where $P(B_n \le cFos < B_{n+1})$ and $P(C_m \le EGFR < C_{m+1} | B_n \le cFos < B_{n+1})$ are the
probability histograms of cFos values, and of EGFR values corresponding to cFos within a
determinate histogram bin, computed from values in the data [56]
as mentioned above.
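The histogram-based estimate described above can be sketched for paired samples as follows. The sample values below are synthetic and serve only to exercise the code; they are not TCGA data. The numbers of bins are passed in as fixed arguments rather than chosen with Doane's formula, the same bins are used for the marginal and the conditional histograms of the input, and all names are illustrative assumptions.

```python
import math
from collections import Counter


def to_bin(value, lo, hi, n_bins):
    """Index of the equal-width bin of [lo, hi] that contains value."""
    if hi == lo:
        return 0
    return min(int((value - lo) / (hi - lo) * n_bins), n_bins - 1)


def histogram_mi(xs, ys, n_bins_x, n_bins_y):
    """Estimate I(X; Y) = H(X) - H(X | Y) in bits from paired samples."""
    n = len(xs)
    bx = [to_bin(x, min(xs), max(xs), n_bins_x) for x in xs]
    by = [to_bin(y, min(ys), max(ys), n_bins_y) for y in ys]
    count_x = Counter(bx)
    count_y = Counter(by)
    count_xy = Counter(zip(bx, by))
    h_x = -sum((c / n) * math.log2(c / n) for c in count_x.values())
    # P(x-bin | y-bin) = joint count / count of the y-bin
    h_x_given_y = -sum((c / n) * math.log2(c / count_y[j])
                       for (_, j), c in count_xy.items())
    return h_x - h_x_given_y


# Synthetic samples: an output that follows the input, and one that ignores it.
inputs = [0.0, 1.0, 2.0, 3.0] * 5
print(histogram_mi(inputs, [2 * x for x in inputs], 4, 4))
print(histogram_mi(inputs, [7.0] * len(inputs), 4, 4))
```

An output that tracks the input recovers the full input entropy, while an output that stays at one value carries no information about the input, which mirrors the controllability interpretation used in this section.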
We compute two mutual information values, namely, $I_{healthy}(EGFR; cFos) \approx 0.73$ bits
(41 samples) and $I_{tumor}(EGFR; cFos) \approx 0.27$ bits (470 samples), considering in each case
only the data from the healthy samples and tumor samples, respectively. In communication
theory, mutual information expresses how much information at the input of the channel
is contained at its output, or in other words, the controllability of the channel output by
modulating the input. Similarly, we have quantified the controllability of the c-Fos expression
by modulating the input of the pathway, represented by the EGFR expression. Since
$I_{tumor}(EGFR; cFos) < I_{healthy}(EGFR; cFos)$, as expected, their difference of 0.46 bits
quantifies the information loss suffered by tumor tissue cells in colorectal cancer through
the mechanism of regulation of the c-Fos gene by the EGF signal. This result demonstrates
that it is possible to infer the communication performance of a signaling pathway from
high-throughput genomic data, and at the same time quantitatively characterize a cancer by a
decrease in this performance. In this project, we will extend this approach to quantitatively
characterize the communication performance at each reaction in the pathway, and the
performance loss associated with each single genetic change leading to disease. The MI value
for healthy cells was found to be 0.73 bits while for the tumor case it was found to be 0.27
bits.
4.7 Computational Tools: Pathway Layer
Some of the tools used in the previous layers are used in the Pathway Layer as well. As
described earlier, MATLAB was used for Gillespie simulations. Additionally, COPASI (COmplex
PAthway SImulator) [57] provides a convenient way of creating, editing, and simulating
pathway models. SBML files can be imported into and exported from COPASI, allowing for
portability of models.
In section 4.3 we saw the mutual information calculation. For this calculation, based on the
number of simulations and time duration of simulation, we generate a 5000 by 10000 matrix
for each of the 33 compounds. On a system with an i7 5700HQ (2.7 GHz) and 12 GB RAM,
a run time of 14 hours was required for the computation. Furthermore, the calculation is I/O
intensive, as all the data has to be loaded from the disk and cannot be kept in memory
due to its large size. In order to speed up the calculations we used the Holland Computing
Center (HCC) at UNL. By dividing the task load between 8 nodes of the HCC we were able to
reduce the computation time to 4 hours.
Chapter 5
Future Work
For future work we hope to add another layer to this layered perspective. Abstracting the
system even further, we consider the interaction between two or more organisms, or more
specifically a colony of single-celled organisms that interact with each other. We propose to
control these interactions so as to obtain a system that can visualize itself or its surroundings.
We hope to provide a way to effectively communicate the state of our system such that it can
be perceived directly instead of being interpreted. We take the example of a bacterial colony
that can sense the toxicity of water. The toxicity sensed by the colony can be effectively
communicated through self-visualization. Once toxicity is sensed, it would be converted to a
meaningful number which can be displayed by the colony. Further, we divide the tasks into
sensing and visualization by introducing a second colony. The introduction of the second
colony provides modularity and simplicity, and standardizes the system. The second colony
could display the result of any function performed by the first colony so long as the output of
the first colony matches the input to the second colony. A membrane can be used to separate
the colonies, such that only the input signal of the second colony can pass through the
membrane.
Figure 5.1: System Block Diagram
5.1 Design
We base our design on a simple seven-segment display. A seven-segment display utilizes 3
horizontal and 4 vertical segments to form a display. A light emitting diode illuminates each
segment; whether a segment is illuminated depends on the number being displayed. The
light emitting diodes would be replaced by bacteria carrying Green Fluorescent Protein (GFP),
whose production of GFP is based on the number being displayed and their location with
respect to the segment markers. GFP is a protein that fluoresces green when exposed to
ultraviolet light. Therefore the bacteria glow under ultraviolet light when GFP is present.
Figure 5.2: Segment Markers
Figure 5.2 shows the layout of the segments labeled from one to seven. In order to display
the number 3, segments one and six would be off and segments two, three, four, five
and seven would be on. We can utilize the truth table below to form the necessary equations
for each of the segments.
Table 5.1: Truth Table for 7 segment display
Number  D C B A   1 2 3 4 5 6 7
0       0 0 0 0   1 1 1 1 1 1 0
1       0 0 0 1   0 0 1 1 0 0 0
2       0 0 1 0   0 1 1 0 1 1 1
3       0 0 1 1   0 1 1 1 1 0 1
4       0 1 0 0   1 0 1 1 1 0 1
5       0 1 0 1   1 1 0 1 1 0 1
6       0 1 1 0   1 1 0 1 1 1 1
7       0 1 1 1   0 1 1 1 0 0 0
8       1 0 0 0   1 1 1 1 1 1 1
9       1 0 0 1   1 1 1 1 0 0 1
By using Karnaugh maps we obtain the following equations:
one = D + CB’ + B’A’ + CA’
two = D + B + CA + C’A’
three = D + C’ + B’A’ + BA
four = D + C + B’ + A
five = D + C’B + C’A’ + BA’ + CB’A
six = C’A’ + BA’
seven = D + CB’ + BA’ + C’B
where A’ represents the complement of A, + denotes logical OR, and juxtaposition denotes
logical AND.
The values derived from the above equations are Boolean values which can be used to control
the production of GFP.
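The segment equations can be evaluated directly for a given digit. The following is a minimal sketch in Python; the function name segment_states and the choice of D as the most significant bit are assumptions made for illustration, and the code only transcribes the seven equations above.

```python
def segment_states(digit):
    """Return the on/off state (1/0) of segments one..seven for a digit 0-9."""
    if not 0 <= digit <= 9:
        raise ValueError("digit must be between 0 and 9")
    # Assumption: D is the most significant bit and A the least significant.
    D, C, B, A = (bool((digit >> shift) & 1) for shift in (3, 2, 1, 0))
    nA, nB, nC = not A, not B, not C  # complements A', B', C'

    one = D or (C and nB) or (nB and nA) or (C and nA)
    two = D or B or (C and A) or (nC and nA)
    three = D or nC or (nB and nA) or (B and A)
    four = D or C or nB or A
    five = D or (nC and B) or (nC and nA) or (B and nA) or (C and nB and A)
    six = (nC and nA) or (B and nA)
    seven = D or (C and nB) or (B and nA) or (nC and B)

    return [int(s) for s in (one, two, three, four, five, six, seven)]


if __name__ == "__main__":
    for d in range(10):
        print(d, segment_states(d))
```

For the digit 3 the function reports segments one and six as off and the remaining five segments as on, in agreement with the example given for Figure 5.2.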
5.2 Simulation
Gro is used to simulate bacterial colonies. It helps prototype the logic for a genetically
engineered cell. It was developed by the Klavins Lab for Synthetic Biology at the University
of Washington [58].
We simulate the system by hard-coding the number to display in binary format. Using 4
Boolean variables, bit0 to bit3, we specify the numbers from 0 to 9. The bacterial colony is
then allowed to grow for 26 seconds of simulation time, after which the growth rate is set to
zero. This prevents the bacteria from moving. When the simulation time reaches 30 seconds,
we enable the display function, which makes the bacteria produce GFP based on the equations
for the segments. If a bacterium senses the chemical signal for a segment and the segment
equation is true, the bacterium produces GFP.
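The control logic of the simulation can be summarized in a few lines. The sketch below is written in Python rather than in the Gro language and is not the simulation code itself; the function names and the nominal growth rate are hypothetical, while the two time thresholds are taken from the description above.

```python
GROWTH_STOP_TIME = 26.0    # growth rate is set to zero at this time (from the text)
DISPLAY_START_TIME = 30.0  # display function is enabled at this time (from the text)


def digit_to_bits(digit):
    """Encode a digit 0-9 as four Booleans, least significant bit first."""
    if not 0 <= digit <= 9:
        raise ValueError("digit must be between 0 and 9")
    return [bool((digit >> i) & 1) for i in range(4)]


def growth_rate(t, nominal_rate=1.0):
    """Growth rate at simulation time t; nominal_rate is a placeholder value."""
    return nominal_rate if t < GROWTH_STOP_TIME else 0.0


def produces_gfp(t, senses_segment_signal, segment_equation_true):
    """A cell produces GFP only once the display is enabled, and only if it
    senses the signal of a segment whose equation evaluates to true."""
    return bool(t >= DISPLAY_START_TIME
                and senses_segment_signal
                and segment_equation_true)


if __name__ == "__main__":
    print(digit_to_bits(6))
    print(growth_rate(10.0), growth_rate(28.0))
    print(produces_gfp(31.0, True, True))
```

Keeping the time thresholds as named constants separates the timing of the experiment from the per-cell decision, which depends only on the sensed segment signal and the value of the segment equation.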
Figure 5.3: GRO Screenshot
Figure 5.3 shows the number 6 displayed using the GRO simulator.
Chapter 6
Conclusion
We observed the various types of data available at each layer of the system. Further, some
of the tools available to a bioinformatician were studied. With more and more discoveries in
the field of biology, it will become essential to keep track of all the resources available. For
data sources we have TCGA for genomic and cancer data and the BioModels repository
for pathway data. For computation we have:
• Python for data processing
• MATLAB for mathematical studies
• HCC and OSG for large scale computation
For visualization we demonstrated the capabilities of Excel, D3.js, Gephi and Jhive. In the
context of communication we showed how mutual information calculations can be applied at
the cell and pathway layers to quantify the communication that happens within the system.
We also briefly looked at building biological systems that can communicate their internal state.
Bibliography
[1] I. F. Akyildiz, M. Pierobon, S. Balasubramaniam, and Y. Koucheryavy, “The internet
of bio-nano things,” IEEE Communications Magazine, vol. 53, no. 3, pp. 32–40, March
2015.
[2] Y. Chahibi and I. Balasingham, “An intra-body molecular communication networks
framework for continuous health monitoring and diagnosis,” in Engineering in Medicine
and Biology Society (EMBC), 2015 37th Annual International Conference of the IEEE.
IEEE, 2015, pp. 4077–4080.
[3] N. A. Abbasi and O. B. Akan, “An information theoretical analysis of human insulin-
glucose system towards the internet of bio-nano things,” IEEE transactions on nanobio-
science, 2017.
[4] Y. Chahibi, M. Pierobon, S. O. Song, and I. F. Akyildiz, “A molecular communication
system model for particulate drug delivery systems,” IEEE Transactions on Biomedical
Engineering, vol. 60, no. 12, pp. 3468–3483, 2013.
[5] A. Marcone, M. Pierobon, and M. Magarini, “A parity check analog decoder for molec-
ular communication based on biological circuits,” in INFOCOM 2017-IEEE Conference
on Computer Communications, IEEE. IEEE, 2017, pp. 1–9.
[6] M. Pierobon, “A molecular communication system model based on biological circuits,”
in Proceedings of ACM The First Annual International Conference on Nanoscale Com-
puting and Communication. ACM, 2014, p. 2.
[7] C. Harper, M. Pierobon, and M. Magarini, “Estimating information exchange perfor-
mance of engineered cell-to-cell molecular communications: a computational approach,”
in IEEE International Conference on Computer Communications (INFOCOM).
[8] M. G. Rad, A. Immaneni, M. McCabe, M. Pierobon, and J. Cui, “A simulation
model of glucose-insulin metabolism and implementation on osg,” in Bioinformatics
and Biomedicine (BIBM), 2017 IEEE International Conference on. IEEE, 2017, pp.
1832–1839.
[9] D. L. Nelson and M. M. Cox, Lehninger Principles of Biochemistry. W. H. Freeman,
2005, ch. 12.2, pp. 425–429.
[10] J. C. B. V. R. e Silva and M. G. Rad, “A mobile-based diet monitoring system for
obesity management.”
[11] M. Lombarte, M. Lupo, G. Campetelli, M. Basualdo, and A. Rigalli, “Mathematical
model of glucose–insulin homeostasis in healthy rats,” Mathematical biosciences, vol.
245, no. 2, pp. 269–277, 2013.
[12] R. Basu, A. Basu, C. M. Johnson, W. F. Schwenk, and R. A. Rizza, “Insulin dose-
response curves for stimulation of splanchnic glucose uptake and suppression of endoge-
nous glucose production differ in nondiabetic humans and are abnormal in people with
type 2 diabetes,” Diabetes, vol. 53, no. 8, pp. 2042–2050, 2004.
[13] A. Rhee, R. Cheong, and A. Levchenko, “The application of information theory to
biochemical signaling systems,” Phys Biol., vol. 9, no. 4, p. 045011, August 2012.
[14] H. Karakelides, Y. W. Asmann, M. L. Bigelow, K. R. Short, K. Dhatariya, J. Coenen-
Schimke, J. Kahl, D. Mukhopadhyay, and K. S. Nair, “Effect of insulin deprivation
on muscle mitochondrial atp production and gene transcript levels in type 1 diabetic
subjects,” Diabetes, vol. 56, no. 11, pp. 2683–2689, 2007.
[15] J. C. Koster, M. A. Permutt, and C. G. Nichols, “Diabetes and insulin secretion: the
atp-sensitive k+ channel (katp) connection,” Diabetes, vol. 54, no. 11, pp. 3065–3072,
2005.
[16] M. R. Villarreal, “Digestive system diagram,”
https://commons.wikimedia.org/wiki/File:Digestive_system_diagram_en.svg, 2006.
[17] M. J. Franz, “Protein: metabolism and effect on blood glucose levels,” The diabetes
educator, vol. 23, no. 6, pp. 643–651, 1997.
[18] N. J. Aparicio, F. E. Puchulu, J. J. Gagliardino, M. Ruiz, J. M. Llorens, J. Ruiz,
A. Lamas, and R. De Miguel, “Circadian variation of the blood glucose, plasma in-
sulin and human growth hormone levels in response to an oral glucose load in normal
subjects,” Diabetes, vol. 23, no. 2, pp. 132–137, 1974.
[19] V. Sarabia, T. Ramlal, and A. Klip, “Glucose uptake in human and animal muscle cells
in culture,” Biochemistry and Cell Biology, vol. 68, no. 2, pp. 536–542, 1990.
[20] M. Hucka, A. Finney, H. M. Sauro, H. Bolouri, J. C. Doyle, H. Kitano, A. P. Arkin, B. J.
Bornstein, D. Bray, A. Cornish-Bowden et al., “The systems biology markup language
(sbml): a medium for representation and exchange of biochemical network models,”
Bioinformatics, vol. 19, no. 4, pp. 524–531, 2003.
[21] A. Drager, N. Rodriguez, M. Dumousseau, A. Dorr, C. Wrzodek, N. Le Novere, A. Zell,
and M. Hucka, “Jsbml: a flexible java library for working with sbml,” Bioinformatics,
vol. 27, no. 15, pp. 2167–2168, 2011.
[22] Mathworks, “SimBiology Release 2017b Documentation,”
https://www.mathworks.com/help/simbio/index.html, 2017, [Online].
[23] R. Pordes, D. Petravick, B. Kramer, D. Olson, M. Livny, A. Roy, P. Avery, K. Black-
burn, T. Wenaus, F. Wurthwein et al., “The open science grid,” in Journal of Physics:
Conference Series, vol. 78, no. 1. IOP Publishing, 2007, p. 012057.
[24] I. Sfiligoi, D. C. Bradley, B. Holzman, P. Mhashilkar, S. Padhi, and F. Wurthwein, “The
pilot way to grid resources using glideinwms,” in Computer Science and Information
Engineering, 2009 WRI World Congress on, vol. 2. IEEE, 2009, pp. 428–432.
[25] Z. Sakkaff, J. L. Catlett, M. Cashman, M. Pierobon, N. R. Buan, M. B. Cohen, and
C. A. Kelley, “End-to-end molecular communication channels in cell metabolism: an
information theoretic study,” in Proceedings of the 4th ACM International Conference
on Nanoscale Computing and Communication. ACM, 2017, p. 21.
[26] E. Klipp and W. Liebermeister, “Mathematical modeling of intracellular signaling path-
ways,” BMC neuroscience, vol. 7, no. 1, p. S10, 2006.
[27] M. Jacomy, T. Venturini, S. Heymann, and M. Bastian, “Forceatlas2, a continuous graph
layout algorithm for handy network visualization designed for the gephi software,” PloS
one, vol. 9, no. 6, p. e98679, 2014.
[28] M. Krzywinski, I. Birol, S. J. Jones, and M. A. Marra, “Hive plots: rational approach to
visualizing networks,” Briefings in bioinformatics, vol. 13, no. 5, pp. 627–644, 2011.
[29] M. Bastian, S. Heymann, and M. Jacomy, “Gephi: An open source soft-
ware for exploring and manipulating networks,” 2009. [Online]. Available:
http://www.aaai.org/ocs/index.php/ICWSM/09/paper/view/154
[30] M. Krzywinski, “Jhive,” http://www.bcgsc.ca/wiki/display/jhive/home, 2006.
[31] M. Bostock, “D3.js: data-driven documents,” https://d3js.org, 2016.
[32] K. Kashin, “Hive Plot of Network of Individuals at Risk of HIV,”
http://www.konstantinkashin.com/data/hiv-hiveplot.html.
[33] E. Goncalves, J. Bucher, A. Ryll, J. Niklas, K. Mauch, S. Klamt, M. Rocha, and J. Saez-
Rodriguez, “Bridging the layers: towards integration of signal transduction, regulation
and metabolism into mathematical models,” Molecular BioSystems, vol. 9, no. 7, pp.
1576–1583, 2013.
[34] C. J. Myers, Engineering Genetic Circuits. Chapman & Hall, 2009.
[35] D. T. Gillespie, “Stochastic simulation of chemical kinetics,” Annual Review of Physical
Chemistry, vol. 58, pp. 35–55, May 2007.
[36] R. Suderman, J. A. Bachman, A. Smith, P. K. Sorger, and E. J. Deeds, “Fundamental
trade-offs between information flow in single cells and cellular populations,” PNAS, vol.
114, no. 22, pp. 73–85, May 2017.
[37] J. E. Ladbury and S. T. Arold, “Noise in cellular signaling pathways: causes and effects,”
Trends Biochem Sci., vol. 37, no. 5, pp. 173–178, May 2012.
[38] T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd Edition. Wiley,
2006.
[39] M. B. Marrero, A. K. Banes-Berceli, D. M. Stern, and D. C. Eaton, “Role of the
jak/stat signaling pathway in diabetic nephropathy,” American Journal of Physiology-
Renal Physiology, vol. 290, no. 4, pp. F762–F768, 2006.
[40] A. K. Banes, S. Shaw, J. Jenkins, H. Redd, F. Amiri, D. M. Pollock, and M. B. Mar-
rero, “Angiotensin ii blockade prevents hyperglycemia-induced activation of jak and
stat proteins in diabetic rat kidney glomeruli,” American Journal of Physiology-Renal
Physiology, vol. 286, no. 4, pp. F653–F659, 2004.
[41] S. Thomas, J. Snowden, M. Zeidler, and S. Danson, “The role of jak/stat signalling in
the pathogenesis, prognosis and treatment of solid tumours,” British journal of cancer,
vol. 113, no. 3, p. 365, 2015.
[42] V. Boudny and J. Kovarik, “Jak/stat signaling pathways and cancer. janus ki-
nases/signal transducers and activators of transcription.” Neoplasma, vol. 49, no. 6,
pp. 349–355, 2002.
[43] A. Gambin, A. Charzynska, A. Ellert-Miklaszewska, and M. Rybinski, “Computational
models of the jak1/2-stat1 signaling,” Jak-Stat, vol. 2, no. 3, p. e24672, 2013.
[44] S. Yamada, S. Shiono, A. Joo, and A. Yoshimura, “Control mechanism of jak/stat signal
transduction pathway,” FEBS letters, vol. 534, no. 1-3, pp. 190–196, 2003.
[45] C. Li, M. Donizelli, N. Rodriguez, H. Dharuri, L. Endler, V. Chelliah, L. Li, E. He,
A. Henry, M. I. Stefan et al., “Biomodels database: An enhanced, curated and annotated
resource for published quantitative kinetic models,” BMC systems biology, vol. 4, no. 1,
p. 92, 2010.
[46] Z. Sakkaff, A. Immaneni, and M. Pierobon, “Estimating the molecular information
through cell signal transduction pathways,” under review, in 2018 IEEE 19th International
Workshop on Signal Processing Advances in Wireless Communications (SPAWC).
IEEE.
[47] R. Cheong, A. Rhee, C. Wang, I. Nemenman, and A. Levchenko, “Information transduc-
tion capacity of noisy biochemical signaling networks,” Science, vol. 334, pp. 354–358,
2011.
[48] A. Papoulis and S. U. Pillai, Probability, Random Variables and Stochastic Processes,
4th ed. McGraw-Hill, 2002.
[49] E. F. Wagner and A. R. Nebreda, “Signal integration by jnk and p38 mapk pathways
in cancer development,” Nature Reviews Cancer, vol. 9, no. 8, p. 537, 2009.
[50] J. Y. Fang and B. C. Richardson, “The mapk signalling pathways and colorectal cancer,”
The lancet oncology, vol. 6, no. 5, pp. 322–327, 2005.
[51] A. Carracedo, L. Ma, J. Teruya-Feldstein, F. Rojo, L. Salmena, A. Alimonti, A. Egia,
A. T. Sasaki, G. Thomas, S. C. Kozma et al., “Inhibition of mtorc1 leads to mapk
pathway activation through a pi3k-dependent feedback loop in human cancer,” The
Journal of clinical investigation, vol. 118, no. 9, pp. 3065–3074, 2008.
[52] R. L. Grossman, A. P. Heath, V. Ferretti, H. E. Varmus, D. R. Lowy, W. A. Kibbe, and
L. M. Staudt, “Toward a shared vision for cancer genomic data,” New England Journal
of Medicine, vol. 375, no. 12, pp. 1109–1112, 2016.
[53] M. K. Pandey, G. Liu, T. K. Cooper, and K. M. Mulder, “Knockdown of c-fos suppresses
the growth of human colon carcinoma cells in athymic mice,” Int J Cancer, vol. 130,
no. 1, pp. 213–222, January 2012.
[54] J. Fan, “Local polynomial modelling and its applications: From linear regression to
nonlinear regression,” Monographs on Statistics and Applied Probability. Chapman &
Hall/CRC, pp. 1269–1275, 1996.
[55] S. Mahner, C. Baasch, J. Schwarz, S. Hein, L. Wölber, F. Jänicke, and K. Milde-
Langosch, “C-fos expression is a molecular predictor of progression and survival in
epithelial ovarian carcinoma,” Br. J. Cancer, vol. 99, no. 8, pp. 1269–1275, October
2008.
[56] “The Cancer Genome Atlas,” https://cancergenome.nih.gov/, accessed: 2017-02-06.
[57] S. Hoops, S. Sahle, R. Gauges, C. Lee, J. Pahle, N. Simus, M. Singhal, L. Xu, P. Mendes,
and U. Kummer, “COPASI: a COmplex PAthway SImulator,” Bioinformatics, vol. 22, no. 24,
pp. 3067–3074, 2006.
[58] K. Lab, “Gro: The cell programming language,” University of Washington.