Investigation of Pathway Analysis Tools for mapping omics data to pathways - Focus on lipidomics and genomics data
Undersökning av analysverktyg för att kartlägga omik data till relationsvägar – Fokus på data av typen lipidomik och genomik
Author: Attila Konrád
Degree: Bachelor of Computer Science 180hp
Major: Information Systems
Programme: Information Systems
Supervisor(s): Céline Fernandez, Annabella Loconsole
Examiner: Bengt J. Nilsson
Date of exam: 2012-09-20
Technology and society
Computer Science
An education isn't how much you have committed to memory, or even how much you know. It's being able to differentiate between what you know and what you don't. /Anatole France

Acknowledgements
I would like to say thank you to everyone who helped me with my thesis. To my supervisors, I thank you for your patience, guidance and all the good feedback.
Abstract
This thesis examines pathway analysis tools (PATs) from a multidisciplinary view. Many PATs exist today, each analyzing a specific type of omics data, so we investigate them and what they can do. By defining specific requirements, such as how many omics data types a PAT can handle and how accurate it is, the most suitable PAT for mapping omics data to pathways can be identified. Software testing shows that no PAT found today fulfills the specific set of requirements or the main goal. The Ingenuity PAT comes closest to fulfilling the requirements. At the request of the end user, two PATs were tested in combination to see whether together they could fulfill the end user's requirements. The Uniprot batch converter was tested with FEvER, and the result was not successful, since the combination of the two PATs is no better than the Ingenuity PAT. Focus then turned to an alternative combination: NCBI, a homepage whose search engine is connected to several free PATs and which thereby fulfills the requirements. Through the search engine, “omics” data can be combined and more than one input can be taken at a time. Since technology is rapidly moving forward, the need for new tools for data interpretation also grows. This means that in the near future we may be able to find a single PAT that fulfills the requirements of the end users.

Keywords: Biochemistry, Cardiovascular disease, Database, Genomics, Lipids, Lipidomics, Metabolomics, PAT, Technology
Sammanfattning (Swedish summary, translated)
This thesis reviews analysis tools from an interdisciplinary perspective. There are many different analysis tools today that analyze specific types of omics data, and we therefore investigate how many there are and what they can do. By defining a number of specific requirements, such as how many types of omics data a tool can handle and the accuracy of its analysis, one can see which analysis tools are most suitable for mapping omics data. The results, obtained by testing the software, show that no analysis tool today fulfills the specified requirements or the main purpose. The Ingenuity analysis tool is the closest we can come to the requirements we are looking for. At the request of the end user, two analysis tools were tested to see whether a combination of them could fulfill the end user's requirements. The Uniprot batch converter was tested with FEvER, but the result was not successful, since the combination of these tools is no better than the Ingenuity analysis tool. Focus then turned to an alternative combination, a homepage called NCBI. The homepage has a search engine connected to several different analysis tools that are free to use. Through the search engine, “omics” data can be combined and more than one input value can be handled at a time. Since technology is moving forward quickly, new analysis tools are needed for data handling, and in the near future we may have an analysis tool that fulfills the requirements of the end users.
Abstract
Sammanfattning
1. Introduction
  1.1. Purpose
  1.2. Problem definitions and Aims
  1.3. Problem discussion
  1.4. Related work with PAT
2. Methods
  2.1. Model in use
    2.1.1. Requirement collection, documentation and validation
    2.1.2. Requirement processing and test case creation
    2.1.3. Objective
    2.1.4. Underlying objectives
  2.2. Alternative research methods
3. Biomedical background
  3.1. Genetics
    3.1.1. Gene
    3.1.2. SNP
  3.2. Biochemistry of Lipids
    3.2.1. Lipid definition
    3.2.2. Classes of Lipids
    3.2.3. Enzymes involved in the synthesis of lipids
    3.2.4. Lipoproteins
4. Computer Science background
  4.1. Databases, Data mining and Knowledge discovery
  4.2. PAT
5. Requirements and Test elicitation
  5.1. Requirements
  5.2. Testing
  5.3. Test cases
6. Result
  6.1. Finding the PATs
  6.2. Sorting the PATs
  6.3. Testing the PATs
  6.4. Evaluating the PATs
  6.5. Final evaluation of the PATs
  6.6. The best PAT from the ranked list
  6.7. Combining PATs
  6.8. Functionalities
  6.9. Quality
7. Discussion
  7.1. Is it possible to find a PAT that processes metabolomics and lipidomics raw data as input and combine them with genetic information?
  7.2. What are the functionalities offered by the available analysis tools?
  7.3. What are the qualities of these tools and how to evaluate them?
  7.4. Why not Ingenuity and why Uniprot with FEvER?
8. Future Value
9. References
Appendix 1 – Test Cases
Appendix 2 – Lipid, MI SNP and Metabo SNP data sheet
Appendix 3 – Requirements Matrixes
Appendix 4 – Response Times
1. Introduction
A vast amount of research is done in lipidomics and genomics, which has made computers, the Internet and various analysis tools, in both simple and advanced forms, very common today. As an example, a simple calculation can be performed on one computer and transferred or copied to another if needed. More advanced operations sometimes require a software tool that performs a certain task on a given set of data in order to give a certain result. That result is usually not logically ordered, so a visual presentation is needed. This is where a pathway analysis tool (PAT) comes in. A PAT is an advanced tool that processes given data, compares it with data stored in a database and presents the results visually. A company tends to hire a programmer to develop a PAT in order to integrate it within the organization [34]. One of the main groups of scientific users is researchers in the fields of bioinformatics, genetics, genomics and metabolomics. Researchers depend on these tools in their scientific work. In some scientific fields, such as genomics and metabolomics, there are too many PATs, performing all kinds of different tasks. Too many PATs in a specific field can confuse researchers who do not have enough knowledge of technology [5]. This makes it difficult to decide which PATs are suited for certain data and within which scientific field. Since technology is also moving forward extremely fast, people with multidisciplinary knowledge are needed more and more [20]. For researchers who work within the biomedical fields of metabolomics and genomics there are specific analysis tools. The purpose of these PATs is to help users in their work by letting them visualize data in ways that may lead to new scientific discoveries. Technology and information sharing have taken a big step forward and have helped substantially in different areas around the world, such as health care and medicine.
1.1. Purpose
Céline Fernandez from the Clinical Research Center (CRC) in Malmö (referred to as the end user from now on) has requested reliable pathway analysis tools (referred to as PATs from now on) that can do all the necessary data computations and present the results visually. CRC works on discovering new medicines, diagnostic tools and improved treatments in order to improve health worldwide.
1.2. Problem definitions and Aims
Since many PATs are available, each with a lot of information, the following research questions are defined in this thesis:
  Is it possible to find a PAT that processes metabolomics and lipidomics raw data as input and combines them with genetic information?
  What are the functionalities offered by the available analysis tools?
  What are the qualities of these tools and how should they be evaluated?
The objectives are defined in order to help answer the three research questions. The main aim of this thesis is the following:
  To find a PAT that can process a combination of data inputs of the “omics” type, i.e. lipidomics/metabolomics and genomics data.
In order to reach the main aim, several underlying objectives are needed. These are the following:
1) Find PATs that are able to map pathways for the following types of data:
  a) Overall metabolomics data
  b) Lipidomics data
  c) Genomics data
2) Evaluate the selected PATs and their functions. Test the current accuracy of the existing PATs in order to determine whether the output from these tools shows the “correct” results.
3) Evaluate the selected PATs according to specific requirements given by the end user; see section 1.3 for the specific requirements.
After the evaluation of the PATs according to the requirements, two options are possible:
  Option 1: One or more PATs pass steps 2 and 3 and are delivered to the end user.
  Option 2: No PAT fulfilling the requirements is found. Alternative solutions are then to see whether it is possible to adapt any of the evaluated analysis tools, to combine more than one of them, or to develop a PAT in-house that meets the requirements of the end user.
1.3. Problem discussion
In order to solve the problem we must consider what PATs are, how complex they are and what they can do. The functionalities of a PAT need to be tested [28] to see if they fulfill the specific requirements (see Table 1).

Table 1. The eight specific requirements that need to be fulfilled by a PAT.
Requirement ID | Requirement description
1 | The user is able to see and select on the PAT what type of data it must process (whether the input field is for metabolomic, lipidomic or genomic data)
2 | The user must be able to check whether the result obtained from the PAT is valid according to literature, the Internet or laboratory results
3 | The user must receive results from the PAT within a certain time
4 | The user can navigate between the start of the search (input data) and the end of the search (results obtained)
5 | The user can get a visual presentation of metabolomics, lipidomics and genomics data from the PAT
6 | The user can zoom in and out, expanding the view to neighboring possible results, to see pathways connected to the results received from the PAT
7 | The user can input a specific type of data into the PAT (metabolomic, lipidomic or genomic)
8 | The user can input combined omics data and then map them to pathways
Acquiring knowledge from literature gives us information about the complexity of a PAT [27]. The functionalities of a PAT can be determined with the help of software testing of data inputs [9], and in this way we can check whether the PAT satisfies the requirements of the potential users. Quality is a broader concept and harder to define, since quality has different meanings to different people [36]. The quality of a PAT is considered acceptable if it fulfills all the requirements [36] in a set of requirement specifications. We use the requirement specifications in Table 1. Homepages associated with PATs also need to be quality checked, and five selected aspects are used: Accuracy and Correctness (how trustworthy is the information provided on the homepages), Completeness (are the homepages complete or under construction), Relevance (how relevant is the content or information on a homepage to the PAT), Time and Punctuality (how fast can a homepage be found when searching), Traceability (is the information provided on the homepages traceable to its original source).
1.4. Related work with PAT
Most PATs today are made specifically with a focus on metabolomics and genomics. This is due to the research work in metabolic engineering, cellular metabolism and toxicogenomics [16, 25]. Companies spend vast amounts of money developing PATs while trying to compete with each other [8, 15]. This competition involves building, adapting and evaluating each other's PATs, and arguing why their own PAT is better than the others [8, 15, 33]. Since each PAT is developed specifically for one biomedical field [41], no full-scale analysis of the entire set of PATs exists yet. Our study is a first attempt at such an analysis of a complete set of PATs.
2. Methods
This section describes the scientific methods used to evaluate the different PATs. It starts with the selected method in use and how the information is gathered, followed by details on the objective and the underlying objectives.

2.1. Model in use
The main purpose of the project (to find a PAT that can process a combination of data inputs of the “omics” type, i.e. lipidomics/metabolomics and genomics data) was divided into four underlying objectives, each with its specific goal. The methods are based on an empirical model: a study of PATs in which each PAT and its homepage is tested and analyzed. Test cases are designed based on the requirements collected from the end user at the first interview. The requirements are rechecked with the end user a few weeks later in a second interview. Once they are confirmed, software testing begins with the requirements and test cases, in order to see whether the input data matches the output data of the PAT. Input data is a gene name (e.g. NPPA), a reference SNP accession ID (an rs number such as rs5068) or a lipid class name (such as lipoproteins). The output data from the PAT is verified for relevance by comparing the received results with information found in literature. A ranked list is made with the best PAT first, based on how many requirements are met. If no PAT meets all the requirements, the ranked list is discarded and, at the end user's request, two specifically selected tools that the end user is already familiar with are adapted or combined.
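
The ranking step described above could be sketched roughly as follows. This is only an illustration, assuming that the outcome of testing each PAT is recorded as the set of requirement IDs (1 to 8, as in Table 1) that it met. The tool names and outcomes are hypothetical placeholders, not results from this study, and the alphabetical tie-break is an added assumption.

```python
# Hypothetical sketch: rank PATs by the number of requirements they meet.
REQUIREMENT_IDS = set(range(1, 9))  # requirement IDs 1-8 from Table 1


def rank_pats(met_by_pat):
    """Return (name, met_count) pairs, best PAT first.

    met_by_pat maps a PAT name to the set of requirement IDs it met.
    Ties are broken alphabetically (an assumption) so the order is stable.
    """
    scores = {name: len(met & REQUIREMENT_IDS) for name, met in met_by_pat.items()}
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


def fulfils_all(met):
    """True when a PAT meets every requirement in Table 1."""
    return REQUIREMENT_IDS <= met


# Placeholder outcomes for illustration only.
example = {
    "Tool A": {1, 2, 3, 4, 5, 7},
    "Tool B": {1, 2, 3},
    "Tool C": {1, 2, 3, 4, 5, 6, 7},
}
ranking = rank_pats(example)
# If no PAT meets all requirements, the ranked list is set aside and the
# combination of two selected tools is tested instead.
needs_combination = not any(fulfils_all(met) for met in example.values())
```

Counting met requirements keeps the ranking easy to explain to the end user; weighting requirements differently would be a separate design decision that the text does not describe.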
2.1.1. Requirement collection, documentation and validation
Five meetings are booked at the Clinical Research Center (CRC) in order to hold interviews. All participants (researchers, including the end user) discuss the problem that needs to be solved. The discussion focuses on PATs in general and on the specific functions the researchers want a PAT to have. Requirements connected to these functions are written and a new meeting is booked. During each meeting everything is written down and documented. After each meeting, the requirements are collected, sorted and processed in order to make test cases. Later, a follow-up takes place at the same location to see if everything is on the right track.
2.1.2. Requirement processing and test case creation
The requirements are processed and formulated. They are also cut down from 15 to eight, keeping the most important things that a PAT must do. Each requirement is given an identification number. Test case templates are sought, and one template is selected, downloaded and then customized (Fig 1). Specific test cases are designed to suit the requirements and are linked to their respective requirement (see Table 2). A specific test case is designed by adding the goal of the test along with the events needed to achieve the goal. Lastly the expected response is written, describing what results we should expect from following the events. The whole process starts by carefully checking a requirement from the list to see if it can be covered by a single test case. If that is not possible, several test cases are needed. The first requirement in Table 1 above involves three different data types that need to be tested. We therefore split the requirement into more than one test case, since not every PAT may be able to process all three data types. We take the first data type, metabolomic input data, and select an input that we know should give a response and present some results. From this we can write down the events in the test case: entering an input and then getting a response. We can then also state the expected response, which in this case is that the PAT returns information related to the metabolomic input we made. The next two test cases are similar, differing only in the input data type. The same approach is applied to the rest of the test cases: each requirement is checked and evaluated to see whether it can be covered by one test case or must be split into several, and then the events and the expected response are written.
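
As an illustration of the splitting step, the sketch below generates one test case per data type for a single requirement. The fields (goal, events, expected response) follow the description above, but the class name, function name and the exact wording of each field are hypothetical and are not taken from the template in Figure 1.

```python
from dataclasses import dataclass

DATA_TYPES = ("metabolomic", "lipidomic", "genomic")


@dataclass
class PatTestCase:
    """One test case: a goal, the events to perform and the expected response."""
    test_case_id: int
    requirement_id: int
    goal: str
    events: list
    expected_response: str


def split_requirement_by_data_type(requirement_id, first_test_case_id):
    """Create one test case per data type for a single requirement."""
    cases = []
    for offset, data_type in enumerate(DATA_TYPES):
        cases.append(
            PatTestCase(
                test_case_id=first_test_case_id + offset,
                requirement_id=requirement_id,
                # Illustrative wording only; the real test cases are in Appendix 1.
                goal=f"Check that the PAT accepts {data_type} input data",
                events=[
                    f"Select the input field for {data_type} data",
                    f"Enter a {data_type} input that is known to give a response",
                    "Submit the input and wait for the response",
                ],
                expected_response=(
                    f"The PAT returns information related to the {data_type} input"
                ),
            )
        )
    return cases


# Requirement 1 is split into test cases 1, 2 and 3.
cases = split_requirement_by_data_type(requirement_id=1, first_test_case_id=1)
```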
Figure 1. A test case template used in this study.
Table 2. Requirement IDs with descriptions, linked to specific test case IDs.
ID | Requirement description | Type | Linked with Test Case ID
1 | User is able to see and select on the PAT what type of data it must process (if the input field is for metabolomic, lipidomic or genomic) | Functional | 1, 2 and 3
2 | User must be able to check if the results obtained from the PAT are valid according to literature or laboratory results | Non-functional | 4
3 | The user must receive results from the PAT within a certain time | Non-functional | 5
4 | The user should navigate between start of search (input data) and the end of search (results obtained) | Non-functional | 6
5 | The user should get a visual presentation of metabolomics, lipidomics and genomics data from the PAT | Non-functional | 7
6 | The user must be able to zoom in and out, expanding the view to neighboring possible results, to see connected pathways on the received results from the PAT | Functional | 8
7 | The user must input a specific type of data into the PAT (metabolomic, lipidomic or genomic) | Functional | 9
8 | The user must be able to input combined omics data and then map them to pathways | Functional | 10
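
The links in Table 2 can also be kept in a small lookup structure, which makes it easy to trace a test case back to its requirement. The sketch below is a hedged illustration: the dictionary is transcribed from Table 2, while the function names and the rule that a requirement counts as met only when all of its linked test cases pass are assumptions made for this example.

```python
# Requirement ID -> (type, linked test case IDs), transcribed from Table 2.
TRACEABILITY = {
    1: ("functional", [1, 2, 3]),
    2: ("non-functional", [4]),
    3: ("non-functional", [5]),
    4: ("non-functional", [6]),
    5: ("non-functional", [7]),
    6: ("functional", [8]),
    7: ("functional", [9]),
    8: ("functional", [10]),
}


def requirement_for_test_case(test_case_id):
    """Trace a test case ID back to the requirement it verifies."""
    for requirement_id, (_kind, test_case_ids) in TRACEABILITY.items():
        if test_case_id in test_case_ids:
            return requirement_id
    raise KeyError(f"no requirement linked to test case {test_case_id}")


def requirements_met(passed_test_case_ids):
    """Return the requirement IDs whose linked test cases all passed.

    Assumption: a requirement is met only if every linked test case passes.
    """
    passed = set(passed_test_case_ids)
    return {
        requirement_id
        for requirement_id, (_kind, test_case_ids) in TRACEABILITY.items()
        if passed.issuperset(test_case_ids)
    }
```

For example, `requirement_for_test_case(2)` traces test case 2 back to requirement 1, and a PAT that passes only test cases 1, 2 and 4 meets requirement 2 but not requirement 1.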
2.1.3. Objective
The main purpose is achieved by acquiring knowledge from literature such as books and articles and by doing software testing. The results obtained from the tests are then compared with the requirements made by the potential users of the PAT.
2.1.4. Underlying objectives
Objective 1:
Gather information by searching books and articles, find as many PATs as possible and determine what data each of them can process. Download the PATs if possible in order to analyze them.
Objective 2:
Evaluate the selected PATs with their functions and methods by going through each tool, clicking around and inputting data. Test cases are designed from the given requirements. Tests on the PATs are based upon:
  a) Metabolomics, lipidomics and genetic pathways and correlations known from the literature
  b) Comparison between results obtained from the literature and from the PAT
  c) Comparison between existing laboratory results and the PAT
  d) How long it takes the analysis tool to process data
Correct results are considered to be those that come from scientific articles, books or laboratory results verified by scientists. Pathways and correlations involving metabolomics, lipidomics and genetics are tested against results known from the literature. Results obtained from a PAT are compared first against results from articles and books, and afterwards against the existing laboratory results. The accuracy of a PAT is determined from its output data: the results either accurately match all the data or they do not. A simple timer is used to record the processing time of a PAT. Finally, a list of PATs shows which PATs passed our examination, which failed and why they failed.
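
The timing and accuracy checks in Objective 2 were done by hand, but they could be sketched in code as below. Everything here is a hypothetical stand-in: `query_pat` represents whatever call or manual step returns results from a PAT, the reference results would come from literature or the laboratory, and the time limit is an assumed parameter, since the requirement only says "within a certain time". The pathway names are placeholders, not real results.

```python
import time


def timed_query(query_pat, input_value):
    """Run one PAT query and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = query_pat(input_value)
    return result, time.perf_counter() - start


def matches_reference(pat_result, reference_result):
    """All-or-nothing accuracy check: the results match all data or they do not."""
    return set(pat_result) == set(reference_result)


def evaluate(query_pat, input_value, reference_result, time_limit_s):
    """Combine the accuracy check and the timing check for one input."""
    result, elapsed = timed_query(query_pat, input_value)
    return {
        "accurate": matches_reference(result, reference_result),
        "within_time": elapsed <= time_limit_s,
        "elapsed_s": elapsed,
    }


def fake_pat(input_value):
    """Stand-in for a real PAT; returns placeholder pathway names."""
    return ["pathway-1", "pathway-2"]


# rs5068 is one of the example inputs named in section 2.1; the reference
# list and the 5 second limit are placeholders for illustration.
outcome = evaluate(fake_pat, "rs5068", ["pathway-2", "pathway-1"], time_limit_s=5.0)
```

Comparing sets ignores the order in which a PAT lists its results, which seems reasonable for an all-or-nothing match but is a choice of this sketch rather than something the method prescribes.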
Objective 3:
In order to have a satisfied end user, a specific set of requirements is needed that must be fulfilled in a final evaluation. The requirements are collected at an early stage through interviews with researchers and the end user, who also represents other potential users. The most desired and important requirements were discussed and identified to be the following:
The selected analysis tool must be able to:
  a) Navigate between data and results
  b) Make a visual presentation of obtained metabolomics, lipidomics or genomics data
  c) Zoom in and out, expanding the view to neighboring possible results connected to pathways in the results obtained
  d) Process more than one type of data (metabolomic, lipidomic or genomic)
  e) Combine omics data and then map their pathways
Navigation is tested by going from the output data (results obtained) back to the input data (where the data was inserted). Data is inserted in the required fields, and traceability or clickable tracking views are sought when obtaining results. Any visual presentation of the obtained results is accepted, but a detailed view of pathway combinations and correlations is preferred. On the output data, zoom functions are sought, typically a small magnifying glass with a plus or minus sign in the PAT. To test how many types of data (metabolomic, lipidomic or genomic) the PAT can process, one input of each data type is selected. All three data types together (metabolomic, lipidomic and genomic) are tested first, then pairs of data types (metabolomic with lipidomic, metabolomic with genomic, lipidomic with genomic), and lastly each data type on its own (metabolomic, lipidomic, genomic). If a PAT passes all aims after evaluation, all results and test material are intended to be turned over to the end user. Further support will be provided in the form of answering questions on the specific PAT. If no PAT is found that fulfills the requirements, tests are done on the PATs Uniprot and FEvER.
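
The order in which the data type combinations are tested (all three together, then pairs, then single types) can be enumerated as in the sketch below. It is a small illustration using Python's standard `itertools.combinations`; the function name is hypothetical.

```python
from itertools import combinations

DATA_TYPES = ("metabolomic", "lipidomic", "genomic")


def combination_test_order(data_types=DATA_TYPES):
    """List data type combinations from the largest group down to single types."""
    order = []
    for size in range(len(data_types), 0, -1):
        order.extend(combinations(data_types, size))
    return order


test_order = combination_test_order()
for combo in test_order:
    print(" + ".join(combo))
```

This yields seven combinations in total: one triple, three pairs and three single data types.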
2.2. Alternative research methods
There are alternative methods for conducting this study, but they would involve working in a biochemistry laboratory to observe, interview and obtain results from experiments, and afterwards designing and building a complete PAT. Another method is to make a homepage and connect it to a PAT that is being used in the laboratory. The method selected in section 3.1 and described further in section 4 was chosen because it gives good quality results, saves time and is efficient.
3. Biomedical background
This section contains the background information needed to understand the biomedical part. The information gathered is about genetics, lipids and their biochemistry, metabolomics, genomics and cardiovascular disease.

3.1. Genetics
Genetics is the study of genes, with their structures and sequences, and their role in heredity. It tries to explain what genes are, how they work and what they can do [32]. Genetics involves scientific studies of genes and their effects, which lead to variation in living organisms [32], that is, how certain traits or conditions are passed down from one generation to the next. It also covers how genes are units of heredity that carry instructions for making proteins that direct activities in cells and functions of our bodies. An example is inherited disorders leading to diseases [32]. Such disorders have been detected thanks to the large amount of laboratory experiments and technological advances. The data stored from this work makes PATs useful, since they provide functions to search for genes and match them with each other.
3.1.1. Gene
Genes are small molecular units that carry the heredity of living
organisms. A gene holds the information needed to build and maintain an
organism. Eukaryotic cells have a nucleus, which contains tightly packed and
well-protected DNA [5]. The main building blocks of a gene are covalently
linked nucleotides carrying the nitrogen bases A, T, C and G. The structure
is then strengthened by hydrogen bonds between the bases. This gives a
sequence that in the end forms a long, double-helix DNA chain. The DNA chain
is tightly packed together with histones, which are proteins, to form an
organized structure called a chromosome [11]. All the chromosomes are well
protected within the nucleus (Fig 2). The DNA chain in turn codes for many
functions of living organisms [5]. Genetic information and traits are also
passed on to the offspring through reproduction.
In our genome there are structural genes which, when read, tell us what
materials are needed to build a cell or an organism. This is our genotype.
Which structural genes are actually used is determined in combination with
the environment, and this is called our phenotype. The phenotype is also
affected by the environment of earlier generations, which is called
epigenetics [5]. Examples of phenotypes are eye color and blood type. The
genotypes of all human individuals are identical up to about 99 percent. The
remaining 1 percent varies from person to person and creates the features
that make us all unique. Tiny differences in the genome sequences
distinguish one individual from another [5]. These single-base differences
arise when two individuals reproduce and create an offspring, and through
Single Nucleotide Polymorphisms (SNPs), which are described further in the
text below. Some of these tiny genetic variations are important because they
affect susceptibility to certain diseases (such as asthma, diabetes,
sclerosis and cancer), but keeping track of them is hard unless you have an
analysis tool at your disposal [5].
Figure 2. A schematic presentation of human DNA assembled into a
chromosome.
3.1.2. SNP
SNP is short for Single Nucleotide Polymorphism, a sequence variation in
DNA. It means that one nitrogen base in a gene sequence differs in one
individual while the rest of the sequence is the same as in another
individual [5]. For example, the gene sequence ATAGGC is almost the same as
the gene sequence ATCGGC; however, the second A has been replaced by a C.
Such changes of one nucleotide in the sequence of our genes occur throughout
the whole genome [3]. SNP variations occur in all species, lead to genetic
variation and may result in a different phenotype of the organism. The
research results in [4] show how such differentiation has occurred. The
genetic changes are driven by natural selection towards the most favorable
adaptation of the genes [3]. Some SNPs are even specific to one ethnic group
while missing in another. According to [32], both the coding and the
non-coding regions of the DNA can be affected. SNPs are also involved in
susceptibility to diseases, as mentioned at the end of section 3.1.1.
The following scenario describes why SNPs are important [32]. A couple
registers for a health check and gives blood to be analyzed in order to
detect how healthy they are. The blood is processed so that only short
sequences of nucleotides are left. The sequence of one individual is the
following:
“GCCAGTATTGTCGATTTCACAAGTGCCTTTCTGTCGGGATGTCACACAACGG”.
The other person has the following sequence:
“GCCAGTATTGTCGATTTCACAAGTGCGTTTCTGTCGGGATGTCACACAACGG”.
The sequences from both individuals code for a protein that governs the
uptake of fat and sugar in the human body. The two sequences differ at a
single base: where the first individual has a C, the second has a G. One of
them has a high risk of getting diabetes. With the help of today’s
technology, SNP analyses are used to determine disease susceptibility [32].
The analysis reveals terrible news for the couple: the individual with the
single base changed to G will have to start injecting insulin with a syringe
unless their food habits change within a year or two. Scenarios like the one
described above are very common in health care today, and health care is not
the only field that exploits genetic variations. In forensic science,
genetic variations are exploited during DNA fingerprinting [32].
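The comparison in the scenario above can be sketched in a few lines of Python. This is only an illustration of a position-by-position comparison, not the method used by any particular PAT or SNP analysis tool; the function name find_single_base_differences and the variable names are hypothetical, and the sketch assumes both sequences are already aligned and of equal length.

```python
def find_single_base_differences(seq_a: str, seq_b: str) -> list:
    """Return (1-based position, base in seq_a, base in seq_b) for every mismatch.

    Assumption: the two sequences are already aligned and equally long,
    so the same index refers to the same position in both.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have equal length for this comparison")
    return [
        (position, base_a, base_b)
        for position, (base_a, base_b) in enumerate(zip(seq_a, seq_b), start=1)
        if base_a != base_b
    ]


# Short example from the text: the second A is replaced by a C.
short_example = find_single_base_differences("ATAGGC", "ATCGGC")

# The two sequences from the health-check scenario.
person_1 = "GCCAGTATTGTCGATTTCACAAGTGCCTTTCTGTCGGGATGTCACACAACGG"
person_2 = "GCCAGTATTGTCGATTTCACAAGTGCGTTTCTGTCGGGATGTCACACAACGG"
scenario_example = find_single_base_differences(person_1, person_2)

print(short_example)
print(scenario_example)
```

A real analysis tool would additionally have to align the sequences and look each variant up in a reference database before drawing any conclusion about disease susceptibility; the sketch only locates the differing base.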
3.2. Biochemistry of Lipids
Biochemistry, also called biological chemistry, is the study of chemical
processes in living organisms. Biochemistry regulates and governs all living
processes within all living organisms [5]. This occurs through biochemical
signaling. Signaling is a kind of information flow, like sending a message
from one place to another. Signals flow through every part of an organism and
regulate the metabolism. Metabolism is the set of processes by which living
organisms sustain life and reproduce themselves. One important part of
biochemistry is the lipids. Lipids are important components of a cell: they
form the cell membrane and vital tissues, and they serve as an energy source
for the organism [1]. Lipids are stored as energy reserves within the organism
and used when needed. Lipids help maintain the electrochemical balance of a
cell and take part in cell signaling and in trafficking, that is, in
controlling what goes into or out of the cell [11]. Lipids usually consist of a
polar head and a hydrophobic tail. Lipids bind to each other because the
hydrophobic part tends to stay in contact with other hydrophobic molecules
[3]. The balance between the hydrophobic and polar parts of the lipids directs
the 3-dimensional structure of the molecules [7]: with a relatively large polar
part the lipids form micelles, while a more equal distribution leads to the
formation of double layers known as membranes (Fig 3).
Figure 3. Picture of lipids with hydrophobic tails bound together and with
other components forming the membrane. (Modified picture taken from
Human Cell Biology ref. [43]).
3.2.1. Lipid definition
Chemists, biochemists and other analysts who work with lipids have a firm
understanding of the term lipid, according to [19]. However, there is no
widely accepted definition today, and lipids are described as a group of
naturally occurring compounds. [44] and [53] state that thousands of
different forms of lipid molecules can be found in an organism and that
lipids can be categorized into six main categories (Fig 4). They all have a
low solubility in water and a high solubility in organic solvents.
3.2.2. Classes of Lipids
Recently a new nomenclature system was proposed by [26] because of the
diversity of lipids in human plasma. It separates lipids into eight classes,
or categories, of which six are considered main classes. Each class can be
further divided into subclasses and individual molecular species (Fig 3).
The first category is the fatty acyls, also called fatty acids. They
occur in three forms: fatty acids, octadecanoids and eicosanoids. They
are the most common building blocks for structurally more complex
lipids and can be saturated or unsaturated. Cells use these lipids to
form the various membranes found in a cell, to store energy and to
adjust the membrane fluidity. [43, 53]
The second category is the Glycerolipids, which occur in three forms:
mono-, di- and triacylglycerolipids. They function mainly as energy
storage and are accumulated as fat in animal tissue. [43, 53]
The third category is the Glycerophospholipids, usually called
phospholipids. The main forms are Phosphatidylcholine (PC),
Phosphatidylethanolamine (PE) and Phosphatidic acid (PA). The
glycerophospholipids are the only class that has a phosphate bond,
and they are the key component for forming bilayers. [43, 53]
The fourth category consists of the Sphingolipids. The main forms are
Sphingomyelin and Ceramides. Sphingolipids have a polar head and two
nonpolar tails. Sphingomyelin acts as protection by forming the
myelin sheath around nerves. [43, 53]
The fifth category is the Sterol lipids, which occur in various alcohol
forms. Sterol lipids are an important component in biological roles:
sterols act as regulating hormones and as signaling molecules. [43,
53]
The last category is the Prenols, which form terpenes and act as
precursor molecules of vitamins such as vitamin A, E and K. [43, 53]
3.2.3. Enzymes involved in the synthesis of lipids
This section gives a deeper insight into lipids and their synthesis, in
order to give a better understanding of the amount of information a PAT
must be able to process, from the data input (a lipid name connected to
glycerolipids) to the results obtained.
Some lipid chains are very long or complex while others are short. It would
take a long time to synthesize the lipids by purely chemical means; with the
help of enzymes it is much faster, as [37] presents. Numerous forms of lipids
occur and several enzymes are needed. In [46] a systems biology view
presents the enzymes needed, by use of a PAT. For example, the synthesis of
fatty acids occurs in the cytoplasm, and the key enzymes involved are
acetyl-CoA carboxylase (ACC) and malonyl-CoA carboxylase (MCC), as stated in
[51]. Another enzyme, acyl-CoA:cholesterol acyltransferase (ACAT), acts on
cholesterol [51]. This is supported by [45], which gives a clear view in
pictures. There are many fatty acids, and they can be saturated or
unsaturated; for this reason designated symbols are used [31] to keep track
of the carbon atoms and their bonds. A symbol consists of two numbers
separated by a colon (:) [31]. The first number gives the carbon length of
the fatty acid and the second number gives the state of saturation, that is,
the number of double bonds. A fatty acid with several unsaturated bonds has a
higher second number (see Table 3). Synthesis of fatty acids longer than 16
carbons goes through a two-carbon elongation process carried out, according
to [31], by enzymes in the endoplasmic reticulum (ER). Besides elongation,
desaturation also takes place in the ER, using four enzymes named desaturase
delta four, delta five, delta six and delta nine. The number in each delta
name gives the position in the fatty acid carbon chain at which the
desaturation occurs [31]. The main desaturase is delta nine, which is called
Stearoyl-CoA desaturase-1. The desaturation requires oxygen (O2), the
coenzyme reduced nicotinamide adenine dinucleotide (NADH) and an
electron-transporting hemoprotein called Cytochrome b5 [47]. In fatty acid
desaturation two hydrogen atoms are removed from the fatty acid, which
oxidizes both the fatty acid and NADH. This creates a double bond between
carbons in the fatty acid chain.
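The two-number symbol and the two ER processes described above can be illustrated with a small Python sketch. It is a simplified model under stated assumptions, not how a PAT stores lipids: the class FattyAcid and the functions parse_shorthand, elongate and desaturate are hypothetical names, the second number is read as the number of double bonds, and the position of the double bond (the delta number) is deliberately not modelled. The fatty acid names palmitic (16:0), stearic (18:0) and oleic (18:1) are standard textbook examples added here for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FattyAcid:
    carbons: int       # first number of the symbol: length of the carbon chain
    double_bonds: int  # second number of the symbol: number of double bonds

    @property
    def is_saturated(self) -> bool:
        return self.double_bonds == 0

    def shorthand(self) -> str:
        return f"{self.carbons}:{self.double_bonds}"


def parse_shorthand(symbol: str) -> FattyAcid:
    """Parse a symbol such as '18:1' into chain length and double-bond count."""
    carbons_text, double_bonds_text = symbol.strip().split(":")
    carbons, double_bonds = int(carbons_text), int(double_bonds_text)
    if carbons < 2 or double_bonds < 0 or double_bonds >= carbons:
        raise ValueError(f"implausible fatty acid symbol: {symbol!r}")
    return FattyAcid(carbons, double_bonds)


def elongate(fatty_acid: FattyAcid) -> FattyAcid:
    """Model one elongation step in the ER: the chain grows by two carbons."""
    return FattyAcid(fatty_acid.carbons + 2, fatty_acid.double_bonds)


def desaturate(fatty_acid: FattyAcid) -> FattyAcid:
    """Model one desaturation step: two hydrogens removed, one double bond added."""
    return FattyAcid(fatty_acid.carbons, fatty_acid.double_bonds + 1)


palmitic = parse_shorthand("16:0")  # saturated, 16 carbons
stearic = elongate(palmitic)        # two-carbon elongation
oleic = desaturate(stearic)         # one double bond introduced

print(palmitic.shorthand(), stearic.shorthand(), oleic.shorthand())
```

Keeping the record immutable (frozen=True) means each enzymatic step returns a new fatty acid rather than modifying the old one, which mirrors the substrate and product view of a pathway.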
Table 3. The main fatty acids in organisms (Modified table taken from Cyberlipid
Center ref. [31] and Virginia web education ref. [10])