OCTOBER 2015 | VOL 1 | ISSUE 1
ALFALFA: explained! By Muniba Faiza
Computer and Drugs: What you need to know
“Gene expression signatures are commonly used to build methods for cancer prognosis and diagnosis. A gene expression signature is a group of genes in a cell whose combined expression pattern is uniquely characteristic of a biological phenotype or medical condition.”
The October 2015 issue of Bioinformatics Review, with articles on the latest bioinformatics trends, tools, and topics. The latest issues are available at www.bioinformaticsreview.com
CONTENTS
ALFALFA: explained 06
Tumor progression prediction by variability based expression signatures 08
BioMiner & Personalized Medicine: A new perspective 12
Meta-analysis of biological literature: Explained 15
How to check the accuracy of new peptides in Proteogenomics 17
The basic concepts of genome assembly 19
Computer and Drugs: What you need to know 21
Basic Concept of Multiple Sequence Alignment 23
Basics of Mathematical Modelling 25
MUSCLE: Tool for Multiple Sequence Alignment 27
Introduction to mathematical modelling - Part 2 29
DNA test for paternity: This is how you can fail! 32
EDITORIAL
SECTION EDITORS
ALTAF ABDUL KALAM, MANISH KUMAR MISHRA, SANJAY KUMAR, PRAKASH JHA, NABAJIT DAS
REPRINTS AND PERMISSIONS
You must have permission before reproducing any material from Bioinformatics Review. Send e-mail requests to editor@bioinformaticsreview.com. Please include your contact details in your message.
BACK ISSUES
Bioinformatics Review back issues can be downloaded in digital format from bioinformaticsreview.com at $5 per issue. Back issues in print format cost $2 for delivery within India and $11 for international delivery, subject to availability. Pre-payment is required.
CONTACT
PHONE +91. 991 1942-428 / 852 7572-667
MAIL Editorial: 101 FF Main Road Zakir Nagar, Okhla New Delhi IN 110025
STAFF ADDRESS To contact any member of the Bioinformatics Review staff, simply format the address as firstname@bioinformaticsreview.com
PUBLICATION INFORMATION
Volume 1, Number 1. Bioinformatics Review™ is published monthly for one year (12 issues) by the Social and Educational Welfare Association (SEWA) Trust (registered under the Trust Act, 1882). Copyright 2015 SEWA Trust. All rights reserved. Bioinformatics Review is a trademark of Idea Quotient Labs, used under licence by SEWA Trust.
Published in India
EXECUTIVE EDITOR: FOZAIL AHMAD
FOUNDING EDITOR: MUNIBA FAIZA
Bioinformatics Review – The Road Ahead
Bioinformatics, one of the most promising fields in terms of future prospects, lacks one thing: a news source. Although many journals publish a great deal of quality research on topics such as genome analysis, algorithms, and sequence analysis, that work rarely gets any notice in the popular press.

One reason behind this rather disturbing trend is that very few people can read a research paper and turn it into news. Moreover, the bioinformatics community has not yet been introduced to research reporting. These factors are common to every relatively new (and rising) discipline such as bioinformatics.

Although there are a number of science reporting websites and portals, very few accept entries from their audience, who can be expected to have expertise in one field or another.

Bioinformatics Review has been conceptualized to address all of these concerns. We will provide insight into bioinformatics both as an industry and as a research discipline, and we will post new developments and the latest research.

We will also accept entries from our audience and, where possible, reward them. To create an ecosystem of bioinformatics research reporting, we will engage everyone involved in bioinformatics: students, professors, instructors, and industry. We will also provide a free job listing service for anyone who can benefit from it.
“ALFALFA is a new tool for mapping sequencing reads. It is extremely fast and accurate at mapping long reads (>500 bp), while still being competitive for moderately sized reads (>100 bp). Both end-to-end (i.e., global) and local read alignment are supported, and several strategies for paired-end mapping can efficiently handle large variations in insert size (i.e., the length of the sequenced DNA fragment).”
META ANALYSIS

Meta-analysis of biological literature: Explained

“Meta-analysis is an analysis of already published data, carried out by simply rearranging it, sorting it, and trying to find hidden patterns in the published literature.”
It’s a fine Monday morning, and the new intern finds his way to the laboratory of biological data-mining procedures. His brief interview with the scientist concerned has given him only a very limited understanding of the subject. Upon his arrival he is greeted with a humongous corpus of mixed articles, say some 4,000, and he is required to assemble specific information out of the data set by diligently scrutinizing the components of each article.

The situation could be frightening to a purely wet-lab biologist, but anyone who has had some exposure to the real power of file handling in a programming language will know how to let a few simple lines of code do the bidding.
So what is meta-analysis about? The new cool word in the biological realm, “meta-analysis”, is best understood through the first half of the term: META, meaning data about data. Meta-analysis is thus an analysis of already published data, carried out by simply rearranging it, sorting it, and trying to find hidden patterns in the published literature.
At its most rudimentary, meta-analysis is achieved by reading a corpus of research and review articles on a particular topic, which may be as wide as a whole eukaryotic genome or narrowed down to a phylum, group, or species, or to a specific disease or even a particular gene. When narrowing down to a disease or a gene, one must also bear in mind that biological systems are among the most complex known, and present-day computer simulations cannot yet rival that complexity with equal efficiency. Any analysis narrowed down to a gene must therefore allow for the fact that the gene may well occur in multiple organisms, which can produce a considerable number of results irrelevant to the study.
A rigorous manual inspection of the program-sorted data is required to filter out such entries. Since meta-analysis relies heavily on statistical study of the data, researchers tend to rely on statistical languages and environments such as Stata and R to write their own analysis code. R, unlike Stata, is free, produces publication-quality output, and provides a plethora of packages, a few of which (PDF-mining and PubMed-mining packages, for example) are used for accessing the PubMed database. These packages contain code to access the database and extract information from it through a command-based interface, handling huge data sets at once and cutting down the manual effort and time needed to achieve the task.
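For the curious, here is a minimal sketch of the same idea in Python, using Biopython's Entrez module rather than the R packages mentioned above; the e-mail address and search term are placeholders:

```python
# Minimal sketch: querying PubMed programmatically with Biopython's Entrez
# module. The e-mail address and search term below are placeholders.
from Bio import Entrez

Entrez.email = "your.name@example.com"  # NCBI requires a contact address

# Search PubMed for articles matching a query and collect the matching IDs
handle = Entrez.esearch(db="pubmed", term="p53 AND apoptosis", retmax=20)
record = Entrez.read(handle)
handle.close()

id_list = record["IdList"]
print(f"{record['Count']} articles found; first {len(id_list)} IDs: {id_list}")

# Fetch the abstracts of those hits as plain text
handle = Entrez.efetch(db="pubmed", id=id_list, rettype="abstract", retmode="text")
print(handle.read())
handle.close()
```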
All praises sent, the method has its own fair share of drawbacks and issues. The current query systems of NCBI and its sister organizations fail to acknowledge synonymous terms: they treat them as individual entities, linked to nothing except the query terms supplied alongside them. A more robust query system is needed to improve the results and make the whole approach more efficient.
The need of the hour is to put more resources into developing well-structured and somewhat intelligent query systems that can truly recognize gene names and their abbreviations, the scientific and common English names of organisms, and the various ways in which the names of techniques are written.
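Until such query systems exist, one pragmatic workaround is to expand synonyms on the client side before submitting the query. A small sketch (the synonym list is illustrative only; a real study would draw synonyms from a curated resource such as NCBI Gene):

```python
# Sketch: client-side synonym expansion for a PubMed-style boolean query.
def expand_query(synonyms, extra_terms=()):
    """Join synonymous names with OR, then AND the remaining terms."""
    synonym_clause = "(" + " OR ".join(f'"{s}"' for s in synonyms) + ")"
    return " AND ".join([synonym_clause, *extra_terms])

# Illustrative synonym list for a single gene
tp53_names = ["TP53", "p53", "tumor protein p53"]
print(expand_query(tp53_names, ["apoptosis", "Homo sapiens[Organism]"]))
# -> ("TP53" OR "p53" OR "tumor protein p53") AND apoptosis AND Homo sapiens[Organism]
```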
PROTEOGENOMICS

How to check the accuracy of new peptides in Proteogenomics

Muniba Faiza
Image Credit: Google Images
“During the discovery of novel genes, there is a high chance of obtaining false positives, i.e., peptides that the algorithm reports but that are not actually present.”
Proteogenomics is an emerging area at the interface of proteomics and genomics. This intersection employs genomic and transcriptomic information to identify novel peptides using mass spectrometry-based techniques. The proteomic data can then be used to identify the fingerprints of genic regions in a particular genome, which may result in the modification of gene models and can also improve gene annotations. So we can say that proteogenomics has been well accepted as a tool for discovering novel proteins and genes. "But during the discovery of novel genes, there is a high chance of obtaining false positives, i.e., peptides that the algorithm reports but that are not actually present."
Therefore, to avoid, or more accurately, to minimize the chance of false positives, a false discovery rate (FDR) is used. The FDR is estimated as the ratio of the number of decoy hits to the number of target hits:

FDR = (number of decoy hits) / (number of target hits)
In most conventional proteogenomic studies, a global false discovery rate (i.e., the identifications of annotated peptides and novel peptides are subjected to FDR estimation together) is used to filter out false positives when identifying credible novel peptides. However, it has been found that the actual level of false positives among novel peptides is often out of control and behaves differently for different genomes.
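To make the distinction concrete, here is a toy sketch in Python of global versus subgroup FDR estimation under the target-decoy approach; all of the counts are invented for illustration:

```python
# Toy sketch of global vs. subgroup FDR in a target-decoy search.
# Each identification is a (peptide_class, is_decoy) pair; counts are made up.
ids = ([("annotated", False)] * 950 + [("annotated", True)] * 5
       + [("novel", False)] * 50 + [("novel", True)] * 5)

def fdr(identifications):
    """FDR estimate = number of decoy hits / number of target hits."""
    decoys = sum(1 for _, is_decoy in identifications if is_decoy)
    targets = sum(1 for _, is_decoy in identifications if not is_decoy)
    return decoys / targets

print(f"global FDR:    {fdr(ids):.3f}")                                      # 0.010
print(f"annotated FDR: {fdr([i for i in ids if i[0] == 'annotated']):.3f}")  # 0.005
print(f"novel FDR:     {fdr([i for i in ids if i[0] == 'novel']):.3f}")      # 0.100
```

A global FDR of 1% can thus coexist with a 10% error rate inside the small novel-peptide subgroup, which is exactly the problem described above.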
It has been observed previously that, under a fixed FDR, the inflated database generated by, for example, six-open-reading-frame (6-ORF) translation of a whole genome significantly decreases the sensitivity of peptide identification.
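For reference, the 6-ORF translation itself is straightforward to script; a minimal sketch with Biopython (the input sequence is a short placeholder):

```python
# Sketch: six-open-reading-frame (6-ORF) translation of a DNA sequence
# with Biopython. The sequence below is a short placeholder.
from Bio.Seq import Seq

dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")

for strand_name, strand in (("+", dna), ("-", dna.reverse_complement())):
    for frame in range(3):
        # Trim so the length is a multiple of three before translating
        sub = strand[frame : frame + (len(strand) - frame) // 3 * 3]
        print(f"strand {strand_name}, frame {frame}: {sub.translate()}")
```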
Recently, Krug suggested that the identification accuracy of novel peptides is greatly affected by the completeness of the genome annotation, i.e., the more completely a genome is annotated, the higher the chance of identifying accurate novel peptides.
In the recent paper referenced below, the authors followed the same framework as in Fu's earlier work to quantitatively investigate the subgroup FDRs of annotated and novel peptides identified by a 6-ORF translation search.
In this article, they reveal that the genome annotation completeness ratio is the dominant factor influencing the identification accuracy of novel peptides found by a 6-ORF translation search when a global FDR is used for quality assessment. However, with stringent FDR control (e.g., 1%), many low-scoring but true peptide identifications may be excluded along with the false positives.
To increase the sensitivity and specificity of novel gene discovery, one should reduce the size of the searched database as much as possible. For example, when transcriptome information (especially from strand-specific cDNA-seq data) is available, it is clearly more favorable to search against the transcriptome as well, rather than against the genome alone. If transcriptome information is unavailable, it also helps to reduce the 6-ORF translation database by removing sequences that are unlikely to be real proteins.
Reference:
Zhang K, Fu Y, Zeng W-F, He K, Chi H, Liu C, Li Y-C, Gao Y, Xu P, He S-M. A note on the false discovery rate of novel peptides in proteogenomics.
GENOMICS

The basic concepts of genome assembly

Muniba Faiza
Image Credit: Google Images
“The genome, as we all know, is the complete set of DNA in an organism, including all of its genes. It carries all the heritable information, including regions that are never expressed.”
The genome, as we all know, is the complete set of DNA in an organism, including all of its genes. It carries all the heritable information, including regions that are never expressed. Although almost 98% of the human genome has been sequenced by the Human Genome Project, only 1 to 2% of it is understood; much of the human genome, whether in terms of genes or proteins, remains to be explored. Many sequencing strategies and algorithms have been proposed for genome assembly. Here I want to discuss the basic strategy involved in genome assembly, which sounds quite difficult but is not really complex if understood well.

The basic strategy behind uncovering new genomic information is explained in the following steps:
1. First of all, the whole genome of an organism is sequenced, which yields hundreds or thousands of unknown fragments, each starting and ending at arbitrary positions.

2. Since we do not know what the sequence is, or which fragment should be placed next to which, the concept of 'contigs' is employed. Contigs are contiguous stretches of sequence built from overlapping reads: fragments are joined to one another only where the overlapping regions of their sequences match, so that consecutive fragments merge into a contig. Many such contigs are formed during the joining process (see the sketch after this list).
3. The question that now arises is: how do we know that a fragment, which may be a repeat, has been placed correctly, given that a genome may have many repeated regions? To overcome this, paired ends are used. Paired ends are the two ends of the same sequence fragment, and they are linked together, so that if one end of the fragment aligns in, say, contig 1, then the other end, being part of the same fragment, should also align in the same contig, as it is the consecutive part of the sequence (the sketch after this list includes this consistency check). Various software tools allow us to define different lengths for the paired ends.
4. After that, the contigs are combined into scaffolds, sometimes called metacontigs or supercontigs, which are then processed further until the genome sequence is complete.
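To make steps 2 and 3 concrete, here is a toy sketch in Python: a greedy merger that joins fragments by their longest exact suffix-prefix overlap, followed by a trivial paired-end consistency check. This is an illustration only; real assemblers build overlap or de Bruijn graphs rather than merging greedily, and all reads and placements below are invented.

```python
# Toy sketch of overlap-based contig building: repeatedly merge the pair of
# fragments with the longest exact suffix-prefix overlap.
from itertools import permutations

def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(frags):
    """Repeatedly merge the pair of fragments with the longest overlap."""
    frags = list(frags)
    while len(frags) > 1:
        a, b = max(permutations(frags, 2), key=lambda p: overlap(*p))
        k = overlap(a, b)
        if k == 0:               # nothing overlaps any more: done
            break
        frags.remove(a)
        frags.remove(b)
        frags.append(a + b[k:])  # join, keeping the overlap once
    return frags

reads = ["ATTAGACCTG", "AGACCTGCCG", "CCTGCCGGAA", "GCCGGAATAC"]
print(greedy_assemble(reads))    # -> ['ATTAGACCTGCCGGAATAC']

# Step 3, in miniature: both ends of a paired fragment should land in the
# same contig; a split pair hints at a misplaced repeat.
mate_contigs = {"frag1": ("contig1", "contig1"), "frag2": ("contig1", "contig3")}
for frag, (end1, end2) in mate_contigs.items():
    print(frag, "consistent" if end1 == end2 else "check placement")
```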
All of this is done by different assembly algorithms; among the most widely used is Velvet, and one of the more recent is SPAdes. In my experience, the more efficient algorithms are those that can give us a large amount of information in one go.
Just imagine that we were handed a thread of sequence made of unknown base pairs: what would we do with that thread, and how would we identify and extract the useful information from it?

Thank you for reading. Don't forget to share this article if you liked it.
CADD

Computer and Drugs: What you need to know
Altaf Abdul Kalam
Image Credit: Google Images
“Computers are being used to design drugs, and it is being done on a humongous scale by almost every multinational pharma company.”
Would you chance your life to a lowly piece of hardware called the computer? Would you let it fabricate and determine drugs for life-threatening diseases like hepatitis, cancer, and even AIDS? Well, actually, your answer (or your opinion) doesn't seem to matter, because the world has already moved over to the other side. Computers are being used to design drugs, and it is being done on a humongous scale by almost every multinational pharma company, the names of which you will undoubtedly find on the backs of the prescription medicines at home. So what's with all this computer stuff? Have we parted with our perspicacity, our intuition, our ready willingness to tackle any challenge head-on? We have always found solutions to mankind's biggest problems all by ourselves. As Matthew McConaughey's character in Interstellar says, "..or perhaps we've forgotten we are still pioneers?"
Philosophical mumbo-jumbo aside, it's not as simple as it sounds. Of course, most of you reading this already have some background in the topic and have already understood what I am talking about. But for those of you who haven't the slightest clue, don't worry: this write-up is for you. Throughout this series of articles, I am going to try to break it down to the basics. Let's say that by the end you will see a car not for what it is, with all its complexity and slickness, but for what made it the way it is: the nuts and bolts and rubber and... whatever, you get the point!
So where do we start? Money! Yes, the thing that runs the world. Contrary to what all the losers who never made a dime say, money simply is everything. Even Bill Gates was forced to acknowledge the fact and declare, "Money isn't everything in this world, but you gotta have a lot of it before you say such rubbish." So that settles it then. Now let's come back.
The basic modus operandi of designing a drug is that you first find a suitable target which you believe will be key to challenging the disease. This is usually a protein or enzyme that can metabolize a particular drug, or in some cases even a disease-causing gene from the pathogen itself. Finding this target is not easy, but it is not that hard either. We have documentation, intensive studies, and databases dedicated to listing, characterizing, and studying the drug-metabolizing genes and proteins in the body. Different classes of metabolizers act on different types of chemicals (or drugs, if you like). A class of metabolizers called the CYP enzymes metabolizes over sixty percent of the known drugs and medicines that humans consume. This includes drugs (the real ones: LSD, cocaine, heroin... get it?) and even poisons and sedatives. The metabolizers, of course, don't know which is which. If a compound suits them, they metabolize it; otherwise it passes out of your system.
Now, under the assumption that we have a drug target, the next step is finding the suitable drug candidate itself. This step is what you call finding a needle in a haystack. There are literally millions of drug candidates out there, and if that is not enough, you can go design your own and get it synthesized. In a drug target (we will simply call it the 'protein' henceforth) there are multiple points of action where multiple drugs can act. So, for example, in a protein made of 200 amino acids, we might find 50 actionable amino acids. For these fifty amino acids we may find thousands upon thousands of drug candidates, all capable of bringing about some change or other in the protein.
So how do we find the One? If you had asked that question about fifteen years ago, the answer would have been to slog it out: match every drug candidate you have against the protein and check the effects in vivo. Countless factors come into play when a drug and a protein interact: the global energy minimum, energy minimization, binding affinity, hydrogen bonding intensity, and what not. We shall learn about them in more detail in upcoming articles.
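To give a feel for what that slog looks like once automated, here is a toy ranking loop in Python. The scoring function is entirely hypothetical: a real docking program estimates binding affinity from the 3-D structures of the protein and the ligand, while this placeholder just produces repeatable made-up numbers.

```python
# Toy sketch of a virtual-screening loop: score every candidate against a
# target and rank them. The score function is a made-up stand-in for a real
# docking score (e.g. an estimated binding affinity in kcal/mol).
import hashlib

def mock_docking_score(candidate: str, target: str) -> float:
    """Hypothetical placeholder: a real tool would dock 3-D structures."""
    digest = hashlib.md5(f"{candidate}|{target}".encode()).digest()
    return -2.0 - 10.0 * digest[0] / 255.0   # pretend kcal/mol in [-12, -2]

candidates = ["ligand_A", "ligand_B", "ligand_C", "ligand_D"]  # invented names
target = "CYP3A4"

ranked = sorted(candidates, key=lambda c: mock_docking_score(c, target))
for c in ranked:
    print(f"{c}: {mock_docking_score(c, target):.2f} kcal/mol (lower is better)")
```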
So, to put it simply, scientists spent their whole sorry lives pitting different drug candidates against the same protein over and over again until they found something worthwhile to hang on to. Even if all the above-mentioned factors blended in wonderfully, they might sadly discover at the end that the drug actually caused more harm than good. So the candidate gets discarded and they start the process all over again! Sometimes you got lucky and found the right drug within a few combinations. But mostly it took years to even zero in on a drug that could be moved further into the drug discovery pipeline, which in itself is another torturous process!
So, coming back to the money factor: you don't need to be a Harvard Business School graduate to see that this tiresome task costs money, a lot of money! Money in the form of manpower, reagents, biological material like tissues, test animals and plants, instrumentation, electricity, and what not. Another thing it costs is something none of us care much about: time. Picture designing a drug for some novel disease that is killing thousands of people each year, and picture having to do this same procedure and coming out with a drug after 10 to 15 years. The cost of such a life-saving drug will also be high, because the company or lab that made it will want to recover all the time and money spent on it in the first place. Not exactly feasible and effective, I would say.
So here comes computer-aided drug design, which, brace yourself, can shave years off the drug discovery pipeline. It can get you into the clinical trials phase within, say, 2 to 3 years, as opposed to the earlier average of 7 to 8 years. Less money spent, less time spent, faster availability of a potential cure and, who knows, even less expensive medicines.
So how does it work? How does the entry of a man-made machine change everything for the better so drastically? What does a computer do that humans could not? Can you trust results obtained in silico over something that happens in vivo? Are computers finally so evolved that they can simulate life forms inside their motherboards and processors? We will hopefully see those questions answered in the next few posts!
SEQUENCE ANALYSIS

Basic Concept of Multiple Sequence Alignment
Muniba Faiza
Image Credit: Google Images
“The major goal of the pairwise alignments within an MSA is to identify the alignment that maximizes protein sequence similarity.”
Multiple Sequence Alignment (MSA) is a very basic step in the phylogenetic analysis of organisms. In an MSA, all the sequences under study are aligned together pairwise on the basis of the similar regions within them. The major goal of the pairwise alignments within an MSA is to identify the alignment that maximizes protein sequence similarity.
This is done by seeking an alignment that "maximizes the sum of similarities for all pairs of sequences", called the 'sum-of-pairs' or SP score. The SP score is the basis of many alignment algorithms.
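As a minimal sketch, the SP score of an alignment can be computed column by column; here a toy match/mismatch/gap scheme stands in for a real substitution matrix such as BLOSUM62:

```python
# Sketch: sum-of-pairs (SP) score of a multiple alignment, column by column.
# A toy scoring scheme stands in for a real substitution matrix.
from itertools import combinations

def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch

def sp_score(alignment):
    """Sum pair_score over every pair of rows in every column."""
    return sum(pair_score(a, b)
               for column in zip(*alignment)
               for a, b in combinations(column, 2))

msa = ["MKV-LS",
       "MKVALS",
       "M-VALS"]
print(sp_score(msa))   # -> 6 with the toy scheme above
```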
The most widely used approach for constructing an MSA is "progressive alignment", where a set of n proteins is aligned by performing n-1 pairwise alignments of pairs of proteins, or pairs of intermediate alignments, guided by a phylogenetic tree connecting the sequences. A methodology that has been successfully used to improve SP-based progressive alignment is "consistency-based scoring", where an alignment is evaluated for consistency with the previously obtained alignments. For example, given three sequences A, B, and C, the pairwise alignments A-B and B-C imply an alignment of A and C, which may differ from the directly computed A-C alignment.
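That example can be made concrete by treating each pairwise alignment as a map between residue positions and composing the maps; in this sketch the position maps are invented:

```python
# Sketch: consistency of pairwise alignments, treating each alignment as a
# map between residue positions (0-based). The maps below are invented.
ab = {0: 0, 1: 1, 2: 3}          # alignment A-B: A[0]~B[0], A[1]~B[1], A[2]~B[3]
bc = {0: 0, 1: 2, 3: 4}          # alignment B-C
ac_direct = {0: 0, 1: 2, 2: 3}   # directly computed alignment A-C

# Compose A->B and B->C to get the alignment of A and C implied by B
ac_implied = {i: bc[j] for i, j in ab.items() if j in bc}

print("implied A-C:", ac_implied)   # {0: 0, 1: 2, 2: 4}
print("direct  A-C:", ac_direct)
print("consistent pairs:", {i for i in ac_implied
                            if ac_direct.get(i) == ac_implied[i]})
```

Here the implied and direct A-C alignments agree on two positions and disagree on the third, which is exactly the discrepancy consistency-based scoring penalizes.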
Now the question arises: how much can we rely on the obtained MSA, and how is an MSA validated? Validation of an MSA program typically uses a benchmark data set of reference alignments: an MSA produced by the program is compared with the corresponding reference alignment, which yields an accuracy score.
Before 2004, the standard benchmark was BAliBASE (Benchmark Alignment dataBASE), a database of manually refined MSAs consisting of high-quality documented alignments, designed to identify the strong and weak points of the numerous alignment programs available. "Recently, several new benchmarks have been made available, namely OXBENCH, PREFAB, SABmark, IRMBASE, and a new extended version of BAliBASE."
Another parameter considered basic in most alignment programs is the fM score. It is used to assess the specificity of an alignment tool: it measures the proportion of predicted matched residues that also appear in the reference alignment. It is often found that some regions of the sequences are alignable and some are not; however, there are usually also intermediate cases, where sequence and structure have diverged to a point at which homology is no longer reliably detectable. In such cases the fM score provides, at best, a noisy assessment of alignment tool specificity, one that becomes increasingly less reliable as one considers sequences of greater structural divergence.
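Given the test and reference alignments as sets of matched residue pairs, the fM score itself is a one-liner; a sketch with invented pair sets:

```python
# Sketch: fM score = fraction of predicted residue matches that also occur
# in the reference alignment. Pairs are (position_in_seq1, position_in_seq2);
# the sets below are invented for illustration.
predicted = {(0, 0), (1, 1), (2, 3), (3, 4), (4, 5)}
reference = {(0, 0), (1, 1), (3, 4), (5, 6)}

fM = len(predicted & reference) / len(predicted)
print(f"fM = {fM:.2f}")   # 3 of 5 predicted matches are in the reference -> 0.60
```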
Even after reference alignments are taken into account, however, the accuracy of the results remains open to question, because the reference alignments themselves are of varying quality.
REFERENCES:
Edgar RC, Batzoglou S. Multiple sequence alignment.
Thompson JD, Plewniak F, Poch O. BAliBASE: a benchmark alignment database for the evaluation of multiple alignment programs.
Basics of Mathematical Modelling - Part 1
Fozail Ahmad
Image Credit: Google Images
“Mathematical modelling covers a broad domain of cellular processes such as metabolic regulation, gene-gene interaction, and gene-protein interaction. This has built a bridge between experimental and expected outcomes.”
“In order to obtain the parameter values for analysing the kinetic behaviour of biochemical processes, we simply investigate expression data (gene/protein) at different transcriptional and translational levels, which enables us to frame out the comprehensive structure of a biochemical pathway.”
“DNA contains the genetic information of an individual. The whole set of genes in an organism is known as the genome. 99.9% of the genome is exactly the same in every human individual, but 0.1% differs. That 0.1% is unique to each person, making it possible to identify every individual.”