Bioinformatics Review - October 2015
OCTOBER 2015 | VOL 1 ISSUE 1

ALFALFA: explained! By Muniba Faiza
Computer and Drugs: What you need to know

“Gene expression signatures are commonly used to create cancer prognosis and diagnosis methods. Gene expression signature is a group of genes in a cell whose combined expression pattern is uniquely characteristic of a biological phenotype or a medical condition.”
- Charles Wins


Public Service Ad sponsored by IQLBioinformatics

Contents
October 2015

Topics: Genomics | Proteomics | Systems Biology | Data Analysis | Software | CADD

Editorial .... 5
ALFALFA: explained .... 6
Tumor progression prediction by variability based expression signatures .... 8
BioMiner & Personalized Medicine: A new perspective .... 12
Meta-analysis of biological literature: Explained .... 15
How to check new peptides accuracy in Proteogenomics .... 17
The basic concepts of genome assembly .... 19
Computer and Drugs: What you need to know .... 21
Basic Concept of Multiple Sequence Alignment .... 23
Basics of Mathematical Modelling .... 25
MUSCLE: Tool for Multiple Sequence Alignment .... 27
Introduction to mathematical modelling - Part 2 .... 29
DNA test for paternity: This is how you can fail! .... 32

EDITORIAL

SECTION EDITORS
Altaf Abdul Kalam, Manish Kumar Mishra, Sanjay Kumar, Prakash Jha, Nabajit Das

EXECUTIVE EDITOR: Fozail Ahmad
FOUNDING EDITOR: Muniba Faiza

REPRINTS AND PERMISSIONS
You must have permission before reproducing any material from Bioinformatics Review. Send e-mail requests to [email protected] and please include your contact details in the message.

BACK ISSUES
Bioinformatics Review back issues can be downloaded in digital format from bioinformaticsreview.com at $5 per issue. Back issues in print format cost $2 for India delivery and $11 for international delivery, subject to availability. Pre-payment is required.

CONTACT
PHONE: +91 991 1942-428 / 852 7572-667
MAIL: Editorial: 101 FF Main Road, Zakir Nagar, Okhla, New Delhi, IN 110025
STAFF ADDRESS: To contact any of the Bioinformatics Review staff members, simply format the address as [email protected]

PUBLICATION INFORMATION
Volume 1, Number 1. Bioinformatics Review™ is published monthly for one year (12 issues) by the Social and Educational Welfare Association (SEWA) Trust (registered under the Trust Act 1882). Copyright 2015 SEWA Trust. All rights reserved. Bioinformatics Review is a trademark of Idea Quotient Labs and used under licence by SEWA Trust. Published in India.

EDITORIAL

Bioinformatics Review – The Road Ahead

Bioinformatics, one of the most promising fields in terms of future prospects, lacks one thing: a news source. Although many journals publish a large amount of quality research on a variety of topics such as genome analysis, algorithms and sequence analysis, this work rarely gets any notice in the popular press.

One reason behind this rather disturbing trend is that there are very few people who can read a research paper and turn it into news. In addition, the bioinformatics community has not yet been introduced to research reporting. These factors are common to every relatively new (and rising) discipline such as bioinformatics.

Although there are a number of science reporting websites and portals, very few accept entries from their audience, which is expected to have expertise in one field or another.

Bioinformatics Review has been conceptualized to address all these concerns. We will provide an insight into bioinformatics, both as an industry and as a research discipline, and we will report new developments and the latest research in bioinformatics.

We will also accept entries from our audience and, if possible, we will reward them. To create an ecosystem of bioinformatics research reporting, we will engage all kinds of people involved in bioinformatics: students, professors, instructors and industry. We will also provide a free job listing service for anyone who can benefit from it.

Tariq Abdullah
Founder
Letters and responses: [email protected]

ALFALFA: explained
Muniba Faiza
SOFTWARE
Image Credit: Google Images

“ALFALFA is a new long-read mapping tool. It is extremely fast and accurate at mapping long reads (>500 bp), while still being competitive for moderately sized reads (>100 bp). Both end-to-end (i.e., global) and local read alignment are supported, and several strategies for paired-end mapping can efficiently handle large variations in insert size.”

High-throughput sequencing has revolutionized bioinformatics research. Since the Human Genome Project, in which the human genome was sequenced, millions of species have been sequenced. Sequencing is a very important aspect of bioinformatics, so faster and better sequencing techniques are needed, and new sequencing platforms produce biological sequence fragments faster and cheaper.

An ideal read mapper should offer:

- Maximal speed
- Minimal memory
- Maximal accuracy
- The ability to shoot at a moving target (fast-evolving technologies differ in read-length distribution and sequencing errors)

Recent advances in next-generation sequencing technologies have led to increased read lengths, higher error rates and error models showing more and longer indels (insertions and deletions). A preprocessing step that indexes the reference genome and/or the sequencing reads must guarantee fast substring matching. The overall search space is pruned to candidate genomic regions by searching for matching segments (called seeds) between the reads and the reference genome. These candidate regions are then further investigated to look for acceptable alignments that reach a particular score, and the best alignments are reported.

ALFALFA is a new read mapper that is extremely fast and accurate at mapping long reads (>500 bp), while still being competitive for moderately sized reads (>100 bp). Both end-to-end (i.e., global) and local read alignment are supported, and several strategies for paired-end mapping can efficiently handle large variations in insert size (the distance between the two reads of a pair). The name is an acronym for “A Long Fragment Aligner/A Long Fragment Aligner”; it is repeated twice as a pun on the repetitive and overlapping fragments observed in genome sequences, which heavily distort read mapping and genome assembly.

The most fascinating feature of ALFALFA is that it uses enhanced sparse suffix arrays to index the reference genome. An index is a data structure that allows quick location of all occurrences of patterns starting at positions of interest. A sparse suffix array uses longest common prefix (LCP) information to reduce the solution space and build a suffix-tree-like structure efficiently, and ALFALFA combines it with a chaining algorithm to speed up dynamic programming extension of candidate regions. This data structure facilitates fast calculation of maximal and super-maximal exact matches, and the speed-memory trade-off is tuned by setting the sparseness value of the index.

ALFALFA follows a canonical seed-and-extend workflow for mapping reads onto a reference genome:

Root system: the reference genome, indexed by an enhanced sparse suffix array.

Seed: a super-maximal exact match between the reference genome and a read (the index enables quick retrieval of variable-length seeds, called super-maximal exact matches, between a read and the reference genome).

Flower bud: clusters of seeds form candidate genomic regions (seeds are grouped into non-overlapping clusters that mark candidate genomic regions for read alignment).

Flower: gaps between seeds are filled by dynamic programming (handling of candidate regions is prioritized by the aggregate base-pair coverage of their seeds; the final extend phase samples seeds from candidate regions to form collinear chains that are bridged using dynamic programming).

Features of ALFALFA:

ALFALFA exploits the technological evolution towards longer reads by using maximal exact matches (MEMs) and super-maximal exact matches (SMEMs) as seeds. Since MEMs between a read and a reference genome may overlap, super-maximal exact matches are defined as MEMs that are not contained in another MEM in the read. These seeds are extensively filtered, and the order in which candidate regions are aligned is then decided, allowing more accurate prioritization. To reduce the number of expensive dynamic programming computations needed, ALFALFA chains seeds together to form a gapped alignment. As a result, the extension phase (aligning the matches) is limited to filling the gaps between chains while evaluating alignment quality.

The sparseness value s of the sparse suffix array (controlled by the option -s) provides an easily tunable trade-off between performance and memory footprint. In theory, sparse suffix arrays take up 9/s + 1 bytes of memory per indexed base. A sparse suffix array with sparseness factor 12 thus indexes the entire human genome with a memory footprint of 5.8 GB. This shows that ALFALFA can map reads at maximal speed while using minimal memory.
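As a rough illustration of that trade-off, the sketch below evaluates the quoted 9/s + 1 bytes-per-base formula for a few sparseness values. The genome length used here (about 3.2 Gbp for the full human reference) is an assumption, so the resulting numbers are ballpark figures rather than ALFALFA's exact measurements.

```python
def index_footprint_gb(genome_len_bp, sparseness):
    """Approximate sparse suffix array size using the 9/s + 1 bytes-per-base rule."""
    bytes_per_base = 9.0 / sparseness + 1.0
    return genome_len_bp * bytes_per_base / 1e9

HUMAN_GENOME_BP = 3.2e9  # assumed approximate length of the human reference genome

for s in (1, 4, 12):
    print(f"s = {s:2d}: ~{index_footprint_gb(HUMAN_GENOME_BP, s):.1f} GB")
# s = 12 lands in the same ballpark as the 5.8 GB figure quoted above.
```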

ALFALFA tries to balance the number and the quality of seeds using a combination of maximal and super-maximal exact matches. The intervals [i..i+l-1] and [j..j+l-1] correspond to a maximal exact match between a read and a reference genome if there is a perfect match between the two subsequences of length l starting at position i in the read and at position j in the reference genome, with mismatches occurring at positions (i-1, j-1) and (i+l, j+l), just before and just after the matching subsequence.

A combination of neighboring seeds increases the evidence that some region in the reference genome is a potential mapping location. ALFALFA therefore sorts seeds according to their starting position in the reference genome and bins them into non-overlapping clusters, using the locally longest seeds as anchors around which regions are built. This results in a list of candidate regions along the reference genome. To limit the number of candidate regions requiring further examination, only SMEMs and rare MEMs are used for candidate region identification. Candidate regions are then ranked by their coverage of read bases, calculated from the seeds that make up the clusters. Sequential processing of these prioritized candidate regions halts when a sufficient number of feasible alignments has been found, when a series of consecutive candidate regions fails to produce an acceptable alignment, or when read coverage drops below a certain threshold.
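A simplified sketch of this clustering-and-ranking idea is given below. It is not ALFALFA's actual implementation: the seed representation, the fixed binning window and the coverage measure are all illustrative assumptions, chosen only to make the "sort seeds, bin them into regions, rank regions by read-base coverage" recipe concrete.

```python
from dataclasses import dataclass

@dataclass
class Seed:
    read_pos: int   # start of the exact match in the read
    ref_pos: int    # start of the exact match in the reference
    length: int

def candidate_regions(seeds, window=1000):
    """Group seeds into non-overlapping reference windows and rank the windows
    by how many read bases their seeds cover (a crude stand-in for coverage-based
    prioritization of candidate regions)."""
    regions = {}
    for s in sorted(seeds, key=lambda s: s.ref_pos):
        bin_id = s.ref_pos // window                  # non-overlapping bins along the reference
        regions.setdefault(bin_id, []).append(s)

    def read_coverage(cluster):
        covered = set()
        for s in cluster:
            covered.update(range(s.read_pos, s.read_pos + s.length))
        return len(covered)

    return sorted(regions.values(), key=read_coverage, reverse=True)

# Toy example: two seeds support the region near position 50,000, one supports 120,000.
seeds = [Seed(0, 50_000, 40), Seed(60, 50_070, 35), Seed(10, 120_000, 25)]
for cluster in candidate_regions(seeds):
    print([(s.ref_pos, s.length) for s in cluster])
```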

The dimensions of a dynamic programming matrix correspond to the bounds of a candidate region, but computations are often restricted to a band around the main diagonal of the matrix. The width of this band depends on the minimal alignment score required. ALFALFA further reduces the dimensions of the matrix by forming a collinear chain from a subset of the seeds that make up a candidate region. Dynamic programming can then be restricted to filling the gaps between consecutive non-overlapping seeds. The chaining algorithm starts from an anchor seed and greedily adds new seeds that do not introduce a high skew into the chain. The skew is defined as the difference between the distances separating two seeds on the read sequence and on the reference genome. The amount of skew allowed is decided automatically, based on the gap between the seeds and the parameters that influence the feasibility of an alignment. ALFALFA allows multiple chains per candidate region, based on the available anchor seeds. Anchor selection is based on seed length, and seeds contained in chains can no longer be used as anchors in subsequent chain construction.

Overall, Bowtie 2 has the highest sensitivity, reaching 100%. However, Bowtie 2 is also less able to distinguish between good and bad alignments. CUSHAW3, BWA-MEM and ALFALFA exhibit the best trade-off between true positives and false positives; mapping quality is assessed using ROC (receiver operating characteristic) curves. The benchmark results demonstrate that ALFALFA is extremely fast at mapping long reads, while still being competitive for moderately sized reads. Together with BWA-SW and BWA-MEM, it is one of the few mappers that scale well for read lengths up to several kilobases.

Reference:
Michaël Vyverman ([email protected]), Bernard De Baets ([email protected]), Veerle Fack ([email protected]) and Peter Dawyndt. BMC Bioinformatics 16:59. doi:10.1186/s12859-015-0533-0

Tumor progression prediction by variability based expression signatures
Muniba Faiza
SYSTEMS BIOLOGY
Image Credit: Stock Photos

“Gene expression signatures are commonly used to create cancer prognosis and diagnosis methods. A gene expression signature is a group of genes in a cell whose combined expression pattern is uniquely characteristic of a biological phenotype or a medical condition.”

Cancer has become a very common disease nowadays, but its main cause is still unknown. Various reasons have been proposed, and recent research suggests that improper sleeping patterns may also contribute to cancer. Just as the cause of cancer is difficult to pin down, its progression and prognosis are also very difficult to predict. Despite many advances in cancer treatment, early detection is still very difficult. Many early cancer screening techniques exist, but they are often not realistic because they lack cost-effectiveness or require invasive procedures. Genomic screening techniques are a promising approach in this area.

Gene expression signatures are commonly used to create cancer prognosis and diagnosis methods. A gene expression signature is a group of genes in a cell whose combined expression pattern is uniquely characteristic of a biological phenotype or a medical condition. However, only a few of these signatures have been used successfully in the clinic, and many have failed to perform. Because such signatures attempt to model the highly variable and unstable genomic behavior of cancer, they are often unable to predict it. Instead, the degree of deviation in gene expression from normal tissue, i.e., the hyper-variability seen across cancer types, can be used as a measure of the risk of relapse or death. This gives rise to the concept of gene expression anti-profiles. Anti-profiles are used to develop cancer genomic signatures that specifically take advantage of gene expression heterogeneity. They explicitly model increased gene expression variability in cancer to define robust and reproducible gene expression signatures capable of accurately distinguishing tumor samples from healthy controls.

Differentially variable genes = anti-profile genes.

Experiments with cancer anti-profiles indicate that the anti-profile approach can serve as a more robust and stable indicator of tumor malignancy than traditional classification approaches.

The researchers' hypothesis is that the degree of hyper-variability (with respect to normal samples) is directly proportional to tumor progression, i.e., the degree of hyper-variability measured with respect to normal samples increases with tumor progression.

Corrada Bravo et al. found a way to derive a colon-cancer anti-profile for screening colon tumors by measuring deviation from normal colon samples. To create an anti-profile, they used a set of normal samples and a set of tumor samples. Probe sets are ranked in descending order by the quantity σj,tumor / σj,normal (where σj,tumor and σj,normal are the standard deviations among the tumor samples and the normal samples, respectively, for probe set j), and a certain number of probe sets (typically 100) with the highest values are selected. The normal region of each selected probe set is then calculated, and the number of probe sets whose expression lies outside the normal region is counted to obtain the anti-profile score of a sample.
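A minimal sketch of that scoring recipe is shown below, assuming expression matrices with probe sets as rows and samples as columns. The choice of 100 probe sets follows the text, but the definition of the "normal region" as median plus or minus k standard deviations of the normal samples is an illustrative assumption, not necessarily the exact definition used by Corrada Bravo et al.

```python
import numpy as np

def antiprofile_score(normal, tumor, new_sample, n_probes=100, k=5.0):
    """Toy anti-profile scoring.

    normal, tumor : 2-D arrays (probe sets x samples) of expression values.
    new_sample    : 1-D array with one value per probe set.
    Returns the number of selected hyper-variable probe sets for which the new
    sample falls outside an assumed normal region of median +/- k * SD(normal).
    """
    sd_normal = normal.std(axis=1)
    sd_tumor = tumor.std(axis=1)
    ratio = sd_tumor / sd_normal                    # sigma_tumor / sigma_normal per probe set
    top = np.argsort(ratio)[::-1][:n_probes]        # most hyper-variable probe sets

    center = np.median(normal[top], axis=1)
    spread = k * sd_normal[top]
    outside = np.abs(new_sample[top] - center) > spread
    return int(outside.sum())

# Toy data: 500 probe sets, 20 normal and 20 tumor samples; tumors are more variable.
rng = np.random.default_rng(0)
normal = rng.normal(0.0, 1.0, size=(500, 20))
tumor = rng.normal(0.0, 3.0, size=(500, 20))
print(antiprofile_score(normal, tumor, tumor[:, 0]))   # high score expected
print(antiprofile_score(normal, tumor, normal[:, 0]))  # low score expected
```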

To test their hypothesis, they obtained two publicly available microarray datasets with normal, adenoma and cancer colon samples. Using these datasets, they plotted, for the selected probe sets, the distribution of the ratio of the variance of cancer/adenoma samples to the variance of normal samples (on a log2 scale) on the other dataset (Fig. 1A and B).

Fig. 1: Among probes that exhibit higher variability among cancers than among normals, the degree of hypervariability observed is related to the level of progression. (A) Distribution of the variance ratio statistic, log2(σ²tumor / σ²normal), for the colon dataset of Gyorffy et al. (GSE4183), from an anti-profile computed using another colon dataset (Skrzypczak et al.; GSE20916). (B) Distribution of the variance ratio statistic for the Skrzypczak et al. colon dataset, from an anti-profile computed using the Gyorffy et al. colon dataset. (C) Distribution of the variance ratio statistic for adrenocortical data (Giordano et al.; GSE10927) for universal anti-profile probe sets.

Both adenoma and cancer samples show higher variability than normals (the region to the right of x = 0), and cancer samples show higher hypervariability than adenomas. This suggests that hypervariability is a stable marker across experimental datasets, that hypervariable genes can be selected specifically across cancer types, and that the anti-profile method can be extended to model tumor progression. These studies showed that gene expression anti-profiles capture tumor progression.

DNA methylation is one of the primary epigenetic mechanisms of gene regulation, and it is believed to play a particularly important role in cancer. High levels of methylation in promoters are usually associated with low transcription. Cancer shows a sharp loss of methylation, which is associated with increased hypervariability in gene expression across multiple tumor types. The authors applied the anti-profile scoring method to DNA methylation data from thyroid and colon samples, where normal, adenoma and cancer samples were available for each tissue type. Figure 2 shows the distribution of adenoma and carcinoma samples against normal samples on a principal component plot, revealing the hypervariability pattern in the methylation data: the normal samples cluster tightly, while the adenomas show some dispersion and the carcinomas show even greater dispersion. Since this behavior is present for both colon and thyroid data, it again reinforces the notion that the anti-profile approach has wide application to classification in cancer.

Fig. 2: Anti-profiles applied to methylation data: first two principal components of (A) thyroid methylation data and (B) colon methylation data.

Conclusion:
The anti-profile approach is well suited to cancer prognosis. It can robustly predict tumor progression and prognosis based on the variability in gene expression. The results presented above also confirm that gene expression signatures based on hyper-variability can be highly valuable.

Reference:
Wikum Dinalankara and Héctor Corrada Bravo, Center for Bioinformatics and Computational Biology, Department of Computer Science and UMIACS, University of Maryland, College Park, MD, USA.

BioMiner & Personalized Medicine: A new perspective
Muniba Faiza
TOOLS
Image Credit: Google Images

“BioMiner is a web-based tool which provides various tools for statistical analysis and a deep insight into transcriptomics, proteomics and metabolomics with a cross-omics concept.”

Personalized medicines have become a very important part of the medical world nowadays. They are also known as 'individualized medicines'. Personalized medicines allow a doctor to prescribe more specific and efficient medicines to a particular patient. This concept has created many new opportunities and perspectives in medicine. Personalized medicine is accomplished by obtaining high-throughput data sets from genomics, transcriptomics, proteomics and metabolomics, but more specifically it requires 'cross-omics', i.e., linkage between transcriptomics, proteomics and metabolomics.

Currently, most simple web-based tools do not allow much access to high-throughput omics datasets. A novel web-based tool, BioMiner, has recently been launched which provides access to a wide variety of high-throughput datasets. The tool was developed within the scope of an international and interdisciplinary project (SYSTHER). BioMiner provides users with convenient tools that help them analyze high-throughput datasets and gives deep insight into complex cross-omics datasets with enhanced visualization capabilities. Since BioMiner was developed under the SYSTHER (System Biology Tools Development for Cell Therapy and Drug Development – www.systher.eu) project, its main focus is on cancer.

Public data repositories such as Gene Expression Omnibus (GEO) and ArrayExpress for microarray data, PRoteomics IDEntifications (PRIDE) for proteomics data, or the Sequence Read Archive (SRA) of NCBI for next-generation sequencing are used to store biological high-throughput datasets. The limitation of these repositories is that each stores biological data of a single dedicated omics type and does not support cross-omics.

A database named SystherDB has been developed in which the stored data is well presented and easily accessible, and whose data is mined and analyzed by the BioMiner tools. A public instance of BioMiner is freely available online. It currently contains 18 different studies, with almost 4,000 microarrays and more than 187 million measured values of genes, proteins or metabolites. Since BioMiner was developed in the SYSTHER project, most of the studies focus on glioblastoma multiforme (GBM).

Fig. 1: Workflow of BioMiner.

FEATURES:

1. BioMiner uses the Google Web Toolkit (GWT) for its graphical user interface (GUI).
2. A separate, manually curated MySQL database is used to store the experimental data from genomics, proteomics and metabolomics.
3. Data import has to be performed by a dedicated specialist to ensure data consistency.
4. Response times are within just a few seconds; special indexing methods are implemented for this purpose.
5. Metabolite data are annotated using three different identifier systems: the Golm Metabolome Database, the Human Metabolome Database (HMDB) and the Kyoto Encyclopedia of Genes and Genomes (KEGG).
6. Predefined cross-omics relationships (e.g., a mapping of metabolites onto genes, or vice versa) link the biological datasets.
7. Pathway and functional information is taken from Reactome, KEGG and WikiPathways.
8. Gene Ontology is also supported.
9. Correlation analyses (statistical analysis of any two variables) are based on Pearson correlation coefficients.
10. Correlations are calculated for high-variance genes (by default the top 500 genes); a sketch of this calculation is given after the list.
11. BioMiner complies with public data management standards such as Minimum Information About a Microarray Experiment (MIAME), Minimum Information About a Proteomics Experiment (MIAPE) and Minimum Information About a Metabolomics Experiment (MIAMET).
12. The ENSEMBL database is used for cross-mapping between genes and proteins.
13. For cross-mapping between genes and metabolites, the combined information of ConsensusPathDB and HMDB is used.
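As an illustration of features 9 and 10, the sketch below selects the highest-variance genes from an expression matrix and computes Pearson correlation coefficients between them. It is only a schematic stand-in for what BioMiner computes server-side: the matrix shape is invented, and only the top-500 cutoff and the use of Pearson correlation are taken from the text.

```python
import numpy as np

def top_variance_correlations(expr, n_genes=500):
    """expr: 2-D array (genes x samples) of expression values.
    Returns the indices of the n_genes most variable genes and the
    Pearson correlation matrix between those genes."""
    variances = expr.var(axis=1)
    top = np.argsort(variances)[::-1][:n_genes]   # high-variance genes, as in BioMiner's default
    corr = np.corrcoef(expr[top])                 # Pearson correlation between gene profiles
    return top, corr

# Toy expression matrix: 2,000 genes measured in 40 samples.
rng = np.random.default_rng(1)
expr = rng.normal(size=(2000, 40))
genes, corr = top_variance_correlations(expr, n_genes=500)
print(genes[:5], corr.shape)                      # correlation matrix is 500 x 500
```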

Fig. 2: Data mining with BioMiner. Screenshots of different results from data mining with BioMiner, including: (A) study overview, (B) detection of differentially expressed genes, (C) correlation of gene expression and survival time, (D) identification of significantly enriched pathways, (E) visual pathway inspection based on predefined layouts, and (F) biomolecule comparison of gene and protein expression. Results are typically presented in synchronized, parallel views composed of a table and a plot.

Fig. 3: Pathway visualization. Interactive visualization of the cell cycle pathway from the WikiPathways repository.

BioMiner is a web-based tool which provides various tools for statistical analysis and a deep insight into transcriptomics, proteomics and metabolomics with a cross-omics concept. Results are presented in two parallel views composed of a table and a plot. Both views are interactive, and user-defined selections can be synchronized between them. Pathway visualization is achieved by extending the PathVisio library. BioMiner also provides clinicians and physicians with a platform that integrates high-throughput data with clinical parameters, thereby leading to better personalized medicine.

Reference:
Chris Bauer, Karol Stec, Alexander Glintschert, Kristina Gruden, Christian Schichor, Michal Or-Guil, Joachim Selbig and Johannes Schuchhardt.

Meta-analysis of biological literature: Explained
Manish Kumar Mishra
META ANALYSIS
Image Credit: Google Images

“Meta-analysis is an analysis of already published data, done by simply rearranging it, sorting it, and trying to find hidden patterns in the published literature.”

It's a fine Monday morning, and the new intern finds his way to the laboratory for biological data mining. His brief interview with the concerned scientist has given him only a very limited understanding of the subject. Upon his arrival he is greeted with a humongous corpus of mixed articles, say some 4,000, and he is required to assemble specific information out of the data set by diligently scrutinizing the components of each article.

The situation could be frightening to a purely wet-lab biologist, but anyone who has had exposure to the real power of file handling with a programming language will know how to let a few simple lines of code do the work.

So what is meta-analysis about? The new cool word in the biological realm, "meta-analysis", can be better understood through the first half of the term: META, meaning data about data. Meta-analysis is thus an analysis of already published data, done by rearranging it, sorting it, and trying to find hidden patterns in the published literature.

By the most rudimentary means, meta-analysis can be achieved by reading a corpus of research and review articles concerning a particular topic, which may be as wide as a whole eukaryotic genome or may be narrowed down to phyla, groups or species, or even to a specific disease or a particular gene. While on the one hand we try to narrow down to a disease or gene, one must also realize that biological systems are among the most complex known, and present-day computer simulations fail to rival that complexity with equal efficiency; any analysis narrowed down to a gene must also consider that the gene may very well be found in multiple organisms and may thus return a considerable number of results irrelevant to the study.

A rigorous manual inspection of the program-sorted data is required to weed out such entries. Since meta-analysis relies heavily on statistical analysis of data, researchers tend to rely on languages such as Stata and R to write their own analysis code. R, unlike Stata, is free, produces publication-quality output and provides a plethora of packages, a few of which (PDF miner, PubMed miner, etc.) are used for accessing the PubMed database. These packages contain code to access the database and extract information from it through a command-based interface for huge data sets at once, cutting down the manual effort and the time taken to achieve the task.
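To give a concrete flavor of this kind of scripted literature retrieval, here is a minimal sketch using Biopython's Entrez module, a Python alternative to the R packages mentioned above. The query string and the e-mail address are placeholders (NCBI requires you to supply a real contact address), and a genuine meta-analysis would of course do far more with the retrieved records than print them.

```python
from Bio import Entrez  # Biopython

Entrez.email = "[email protected]"  # placeholder: use your own contact address

# Search PubMed for a topic and fetch the abstracts of the first few hits.
handle = Entrez.esearch(db="pubmed", term="colon cancer gene expression signature", retmax=5)
record = Entrez.read(handle)
handle.close()

ids = record["IdList"]
handle = Entrez.efetch(db="pubmed", id=",".join(ids), rettype="abstract", retmode="text")
abstracts = handle.read()
handle.close()

print(f"Fetched {len(ids)} records")
print(abstracts[:500])  # downstream meta-analysis would parse and mine this text
```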

All praise aside, the method has its own fair share of drawbacks and issues. The current query systems of NCBI and its sister organizations fail to acknowledge synonymous terms and treat them as individual entities that are not linked to one another, except in association with the length of the query terms supplied. A more robust query system is needed to improve the results and make the whole concept more efficient.

The need of the hour is to invest more resources in developing well-structured and somewhat intelligent query systems which can truly recognize gene names and their abbreviations, the scientific and common English names of organisms, and the different ways in which the names of the techniques involved are written.

How to check new peptides accuracy in Proteogenomics
Muniba Faiza
PROTEOGENOMICS
Image Credit: Google Images

“During the discovery of novel genes, there is a large chance of getting false positives, i.e., the algorithm may report peptides that are not actually there.”

Proteogenomics is an emerging area at the interface of proteomics and genomics. This intersection employs genomic and transcriptomic information to identify novel peptides using mass-spectrometry-based techniques. The proteomic data can then be used to identify the fingerprints of genic regions in the genome in question, which may result in the modification of gene models and can also improve gene annotations. So we can say that proteogenomics has been well accepted as a tool to discover novel proteins and genes. But during the discovery of novel genes there is a large chance of getting false positives, i.e., the algorithm may report peptides that are not actually there. Therefore, to avoid, or more accurately to minimize, the chance of false positives, a false discovery rate (FDR) is used. The FDR is the ratio of the number of decoy hits to the number of target hits:

FDR = decoy hits / target hits
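A minimal sketch of this target-decoy estimate is given below, assuming each peptide-spectrum match carries a score and a flag marking whether it came from the decoy database; the 1% cutoff mirrors the stringent threshold mentioned later in the article, and the scores themselves are invented toy values.

```python
def fdr_at_threshold(psms, threshold):
    """psms: list of (score, is_decoy) tuples.
    FDR = decoy hits / target hits among matches scoring at or above the threshold."""
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    return decoys / targets if targets else 0.0

def strictest_threshold(psms, max_fdr=0.01):
    """Lowest score cutoff whose estimated FDR stays at or below max_fdr (e.g. 1%)."""
    for threshold in sorted({score for score, _ in psms}):
        if fdr_at_threshold(psms, threshold) <= max_fdr:
            return threshold
    return None

# Toy example: a few peptide-spectrum matches with scores.
psms = [(95, False), (90, False), (88, True), (80, False), (75, True), (70, False)]
print(fdr_at_threshold(psms, 70))       # global FDR over all listed matches
print(strictest_threshold(psms, 0.01))  # score needed to reach a 1% FDR
```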

In most conventional proteogenomic studies, a global false discovery rate (i.e., the identifications of annotated peptides and novel peptides are subjected to FDR estimation in combination) is used to filter out false positives when identifying credible novel peptides. However, it has been found that the actual level of false positives among novel peptides is often out of control and behaves differently for different genomes.

It has been observed previously that, under a fixed FDR, the inflated database generated by, for example, six-open-reading-frame (6-ORF) translation of a whole genome significantly decreases the sensitivity of peptide identification.

Recently, Krug suggested that the identification accuracy of novel peptides is greatly affected by the completeness of the genome annotation, i.e., the more completely a genome is annotated, the higher the chance of identifying accurate novel peptides.

In this recent paper, the authors followed the same framework as in Fu's work to quantitatively investigate the subgroup FDRs of annotated and novel peptides identified by a 6-ORF translation search.

They reveal that the genome annotation completeness ratio is the dominant factor influencing the identification accuracy of novel peptides identified by a 6-ORF translation search when a global FDR is used for quality assessment. However, with stringent FDR control (e.g., 1%), many low-scoring but true peptide identifications may be excluded along with the false positives.

To increase the sensitivity and specificity of novel gene discovery, one should reduce the size of the searched database as much as possible. For example, when transcriptome information (especially from strand-specific cDNA-seq data) is available, it is apparently more favorable to search against the transcriptome as well rather than against the genome alone. If transcriptome information is unavailable, it is also helpful to reduce the 6-ORF translation database by removing sequences that are predicted to be very unlikely to encode real proteins.

Reference:
A note on the false discovery rate of novel peptides in proteogenomics. Kun Zhang, Yan Fu, Wen-Feng Zeng, Kun He, Hao Chi, Chao Liu, Yan-Chang Li, Yuan Gao, Ping Xu and Si-Min He.

The basic concepts of genome assembly
Muniba Faiza
GENOMICS
Image Credit: Google Images

”The genome, as we all know, is the complete set of DNA in an organism, including all of its genes. It consists of all the heritable information, and also some regions which are not even expressed.”

The genome, as we all know, is the complete set of DNA in an organism, including all of its genes. It consists of all the heritable information, and also some regions which are not even expressed. Although almost 98% of the human genome has been sequenced by the Human Genome Project, only 1 to 2% of it is understood; much of the human genome remains to be characterized, whether in terms of genes or proteins. Many sequencing strategies and algorithms have been proposed for genome assembly. Here I want to discuss the basic strategy involved in genome assembly, which sounds quite difficult but is not really complex if understood well.

The basic strategy behind discovering new genome information is explained in the following steps:

1. First of all, the whole genome of an organism is sequenced, which results in hundreds or thousands of different unknown fragments starting from anywhere and ending anywhere.

2. Since we don't know what the sequence is, or which fragment should be placed next to which, the concept of 'contigs' is employed. Contigs are built from overlapping reads: the broken fragments are joined to each other by matching the overlapping regions of their sequences. In other words, many consecutive fragments are joined to form a contig, and many such contigs are formed during the joining process.

3. The question that then arises is: how do we know that a fragment which may be a repeat has been placed correctly, given that a genome may have many repeated regions? To overcome this, paired ends are used. Paired ends are the two ends of the same sequence fragment, linked together, so that if one end of the fragment is aligned in, say, contig 1, then the other end, being part of the same fragment, should also be aligned in the same contig, as it is the consecutive part of the sequence. Various software tools let us define different lengths for the paired ends.

4. After that, the contigs are combined into scaffolds, sometimes called metacontigs or supercontigs, which are then further processed to complete the assembled genome sequence.

All of this is done by different assembly algorithms; among the most widely used is Velvet, and the latest is SPAdes. In my experience, the more efficient algorithms are the ones that can provide us with a large amount of information in one go. A toy illustration of the overlap idea behind contig building is sketched below.
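The sketch below is a deliberately naive, greedy overlap-and-merge routine on toy reads. Real assemblers such as Velvet and SPAdes use de Bruijn graphs and handle sequencing errors and repeats far more carefully, so this is only meant to make the "join fragments by their overlaps" idea concrete.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that matches a prefix of b (at least min_len)."""
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0

def greedy_contig(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest overlap into a growing contig."""
    reads = list(reads)
    while len(reads) > 1:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    olen = overlap(a, b, min_len)
                    if olen > best[0]:
                        best = (olen, i, j)
        olen, i, j = best
        if olen == 0:            # no overlaps left: the remaining fragments stay separate
            break
        merged = reads[i] + reads[j][olen:]
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]
    return reads

# Toy reads sampled from the sequence ATGCGTACGTTAG
print(greedy_contig(["ATGCGTAC", "GTACGTT", "CGTTAG"]))   # -> ['ATGCGTACGTTAG']
```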

Just imagine that we are handed a thread of sequence with unknown base pairs: what would we do with that thread, and how would we identify and extract useful information from it?

Thank you for reading. Don't forget to share this article if you liked it.

Computer and Drugs: What you need to know
Altaf Abdul Kalam
CADD
Image Credit: Google Images

”Computers are being used to design drugs, and it is being done on a humongous level, by almost every multinational pharma company.”

Would you chance your life to a lowly piece of hardware called the computer? Would you let it fabricate and determine drugs for life-threatening diseases like hepatitis, cancers and even AIDS? Well, actually, your answer (or your opinion) doesn't seem to matter, because the world has moved over to the other side. Computers are being used to design drugs, and it is being done on a humongous level, by almost every multinational pharma company, the names of which you will undoubtedly find on the back of your prescription medicines at home. So what's with all this computer stuff? Have we parted with our perspicacity, our intuition, our ready willingness to tackle any challenge head-on? We have always found solutions to mankind's biggest problems all by ourselves. As Matthew McConaughey's character in Interstellar says, "..or perhaps we've forgotten we are still pioneers?"

Philosophical mumbo-jumbo aside, it's not as simple as it sounds. Of course, most of you reading this already have some background in this topic and have already understood what I am talking about. But for those of you who haven't the slightest clue, don't worry, this write-up is for you. Throughout this series of articles on this particular issue, I am going to try and break it down to the basics. Let's say that by the end you will see a car not for what it is, with all its complexity and slickness, but for what made it the way it is: the nuts and bolts and rubber and.. whatever, you get the point!

So where do we start? Money! Yes, the thing that runs the world. Contrary to what all the losers who never made a dime say, money simply is everything. Even Bill Gates was forced to acknowledge the fact and declare, "Money isn't everything in this world, but you gotta have a lot of it before you say such rubbish." So that settles it then. Now let's come back.

The basic modus operandi of designing a drug is that you first find a suitable target which you believe will be key to challenging the disease. This is mostly a protein or enzyme that can metabolise a particular drug, or in some cases even a disease-causing gene from the pathogen itself. Finding this target is not easy, but it is not that hard either. We have documentation, intensive studies and databases dedicated to listing, characterizing and studying the drug-metabolizing genes and proteins in the body. Different classes of metabolizers act on different types of chemicals (or drugs, if you like). A class of metabolizers called the CYP

enzymes metabolize over sixty percent of the known drugs and medicines that humans consume. This includes drugs (the real ones: LSD, cocaine, heroin.. get it?) and even poisons and sedatives. The metabolizers, of course, don't know which is which. If it suits them they metabolize it; otherwise it passes out of your system.

Now, under the assumption that we have a drug target, the next step is finding the suitable drug candidate itself. This step is what you call finding a needle in a haystack. There are literally millions of drugs out there, and if that is not enough you can go design your own and get it synthesized. In a drug target (we will call it simply the 'protein' henceforth) there are multiple points of action where multiple drugs can act. So, for example, in a protein made of 200 amino acids we might find 50 actionable amino acids. For these fifty amino acids we may find thousands and thousands of drug candidates, all capable of bringing about some change or other in the protein.

So how do we find the One? If you had asked that question about fifteen years ago, the answer would have been to slog it out: match every drug candidate you have against the protein and check the effects in vivo. Countless factors come into play when a drug and a protein interact: global minima, energy minimization, binding affinity, hydrogen bonding intensity and what not. We shall learn about them in more detail in upcoming articles.

To put it simply, scientists spent their whole sorry lives pitting different drug candidates against the same protein over and over again until they found something worthwhile to hang on to. Even if all the above-mentioned factors blended in wonderfully, they might sadly discover at the end that the drug actually caused more harm than good. So the candidate gets discarded and they start the process all over again! Sometimes you got lucky and found the right drug after a few combinations. But mostly it took years to even zero in on a drug that could be moved further into the drug discovery pipeline, which in itself is another torturous process!

So, coming back to the money factor. You don't need to be a Harvard Business School graduate to see that this tiresome task costs money, a lot of money: money in the form of manpower, reagents, biological matter like tissues and test animals and plants, instrumentation, electricity and what not. Another thing it costs is something none of us care much about: time. Picture designing a drug for some novel disease which is killing thousands of people each year, and picture having to do this same procedure and coming out with a drug after 10 to 15 years. The cost of such a life-saving drug will also be high, because the company or lab that made it would want to recover all the time and money they spent on it in the first place. Not exactly feasible and effective, I would say.

So here comes computer-aided drug design, which, brace yourself, can shave years off the drug discovery pipeline. It can get you into the clinical trials phase within, say, 2-3 years, as opposed to the earlier average of 7-8 years. Less money spent, less time spent, faster availability of a potential cure and, who knows, even less expensive medicines.

So how does it work? How does the entry of a man-made machine change everything for the better so drastically? What does a computer do that humans could not? Can you trust results obtained in silico over something that happens in vivo? Is a computer finally so evolved that it can simulate life forms inside its mother box and processors? We will hopefully see those questions answered in the next few posts!

Basic Concept of Multiple Sequence Alignment
Muniba Faiza
SEQUENCE ANALYSIS
Image Credit: Google Images

“The major goal of the pairwise alignments underlying an MSA is to identify the alignment that maximizes protein sequence similarity.”

Multiple Sequence Alignment (MSA) is a very basic step in the phylogenetic analysis of organisms. In MSA, all the sequences under study are aligned together pairwise on the basis of the similar regions within them. The major goal of these pairwise alignments is to identify the alignment that maximizes protein sequence similarity. This is done by seeking an alignment that maximizes the sum of similarities for all pairs of sequences, which is called the 'sum-of-pairs' or SP score. The SP score is the basis of many alignment algorithms.
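A minimal sketch of an SP-score computation over the columns of an alignment is shown below, assuming a toy scoring scheme (+1 match, -1 mismatch, -2 when a residue is paired with a gap, 0 for gap against gap). Real tools use substitution matrices such as BLOSUM and affine gap penalties instead, so the numbers only illustrate the bookkeeping.

```python
from itertools import combinations

def pair_score(a, b):
    """Toy scoring: +1 match, -1 mismatch, -2 residue vs gap, 0 gap vs gap."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return -2
    return 1 if a == b else -1

def sp_score(alignment):
    """Sum-of-pairs score: add pair_score over every pair of sequences in every column."""
    total = 0
    for column in zip(*alignment):                # iterate over alignment columns
        for a, b in combinations(column, 2):      # all sequence pairs within the column
            total += pair_score(a, b)
    return total

msa = ["MK-LV",
       "MKALV",
       "MK-IV"]
print(sp_score(msa))
```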

The most widely used approach for constructing an MSA is progressive alignment, where a set of n proteins is aligned by performing n-1 pairwise alignments of pairs of proteins, or pairs of intermediate alignments, guided by a phylogenetic tree connecting the sequences. A methodology that has been used successfully to improve progressive alignment based on the SP score is consistency-based scoring, where the alignment is kept consistent with previously obtained alignments. For example, given three sequences A, B and C, the pairwise alignments A-B and B-C imply an alignment of A and C which may be different from the directly computed A-C alignment.

Now the question arises: how much can we rely on the obtained MSA, and how is an MSA validated? The validation of an MSA program typically uses a benchmark data set of reference alignments. An MSA produced by the program is compared with the corresponding reference alignment, which gives an accuracy score.

Before 2004, the standard benchmark was BAliBASE (Benchmark Alignment dataBASE), a database of manually refined MSAs consisting of high-quality, documented alignments designed to identify the strong and weak points of the numerous alignment programs now available. Recently, several new benchmarks have been made available, namely OXBENCH, PREFAB, SABmark, IRMBASE and a new extended version of BAliBASE.

Another parameter considered basic in most alignment programs is the fM score. It is used to assess the specificity of an alignment tool: it is the proportion of predicted matched residues that also appear in the reference alignment. It is often the case that some regions of the sequences are alignable and some are not; however, there are usually also intermediate cases, where sequence and structure have diverged to a point at which homology is no longer reliably detectable. In such cases the fM score provides, at best, a noisy assessment of alignment-tool specificity, and it becomes increasingly less reliable as one considers sequences of increasing structural divergence.

Even after considering the reference alignments, however, the accuracy of the results remains questionable, as the reference alignments themselves are of varying quality.

REFERENCES:
Multiple sequence alignment. Robert C. Edgar and Serafim Batzoglou.
BAliBASE: a benchmark alignment database for the evaluation of multiple alignment programs. Julie D. Thompson, Frédéric Plewniak and Olivier Poch.

Basics of Mathematical Modelling - Part 1
Fozail Ahmad
SYSTEMS BIOLOGY
Image Credit: Google Images

“Mathematical modelling covers a broad domain of cellular processes such as metabolic regulation, gene-gene interaction and gene-protein interaction. It has built a bridge between experimental and expected outcomes.”

Biochemical processes are simply complex, and their apparent features do not easily allow us to investigate what the system really means. Moreover, most biochemical processes obey nonlinear reaction kinetics: the amount of a reactant (protein/RNA/DNA) is not directly proportional to the amount of its product. This further increases the complexity of the molecular mechanism and creates biological noise, such as randomness (stochasticity) of biomolecules, perturbation of cell signalling, difficulty in quantifying cell products, and even unexpected responses of the entire system. Here the development and use of a mathematical model comes in: it takes multiple factors/parameters into consideration and provides the researcher with a visual understanding of how a complex biological system functions and responds to external signals (hormone/drug/cation/anion), internal signals (protein/enzyme/cation/anion) or adverse environmental conditions, such as a deficiency of Fe2+ ions during the formation of Vitamin D. Basically, mathematical modelling covers a broad domain of cellular processes such as metabolic regulation, gene-gene interaction and gene-protein interaction. This has built a bridge between experimental and expected outcomes; in case of discrepancies between the two, the parameters taken into consideration need to be refined.

The general approach of modelling gives us the following benefits:

1. Discrepancies between a mathematical model and actual experimental results point to components that are still missing from the hypothetically developed model, so one can build a more comprehensive scenario of the system's behaviour. On the other hand, a well-developed model assists in designing experiments and clarifying additional issues in an ongoing experiment.

2. With the help of a mathematical model, a researcher can modify experimental parameters (e.g., by introducing a modified protein associated with Mg2+ uptake into the cell) and run computer simulations.

3. Most importantly, mathematical models are not limited by environmental or experimental constraints. They can be quickly changed for multiple conditions/parameters, and the most suitable simulation can be used to develop a reliable experimental design.

4. A mathematical model may help to investigate sub-systems that sometimes regulate special biochemical processes (though not all biological reactions can be treated in the same way), which provides substantial information about the behaviour of large systems.

Fig. 1: Schematic representation of the biological modelling process, knowledge generation and experimental design.

By doing multiple simulations and changing parameter values, we are able to represent real biochemical/molecular phenomena which otherwise seem difficult to treat.
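As a toy example of "multiple simulations with changing parameter values", the sketch below integrates a deliberately simple two-species model (mRNA produced at a constant rate and translated into protein, with both species degrading linearly) for a few parameter settings. The model and the rate constants are illustrative assumptions, not taken from this article.

```python
import numpy as np
from scipy.integrate import odeint

def gene_expression(y, t, k_tx, k_tl, d_m, d_p):
    """dM/dt = k_tx - d_m*M ; dP/dt = k_tl*M - d_p*P (simple linear kinetics)."""
    m, p = y
    return [k_tx - d_m * m, k_tl * m - d_p * p]

t = np.linspace(0.0, 100.0, 200)
for k_tx in (0.5, 1.0, 2.0):                       # vary the transcription rate parameter
    sol = odeint(gene_expression, y0=[0.0, 0.0], t=t, args=(k_tx, 0.2, 0.1, 0.05))
    print(f"k_tx = {k_tx}: protein level at t=100 is about {sol[-1, 1]:.1f}")
# Comparing such simulated trajectories with measured levels, and refining the
# parameters when they disagree, is the model-vs-experiment loop described above.
```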

→ (To be continued…)

MUSCLE: Tool for Multiple Sequence Alignment
Muniba Faiza
TOOLS
Image Credit: Google Images

“MUSCLE is one of the software tools known for its speed and accuracy on each of the four benchmark test sets (BAliBASE, SABmark, SMART and PREFAB).”

In my last article I discussed Multiple Sequence Alignment and how it is created. In this article, I am going to explain the workflow of one of the MSA tools, MUSCLE. MUSCLE is a software tool used to create an MSA of the sequences of interest. It is known for its speed and accuracy on each of the four benchmark test sets (BAliBASE, SABmark, SMART and PREFAB), and it is comparable to T-COFFEE and MAFFT (these tools will be explained in upcoming articles).

MUSCLE algorithm:

Two distance measures are used by MUSCLE for a pair of sequences: a k-mer distance (for an unaligned pair) and the Kimura distance (for an aligned pair). A k-mer is a contiguous subsequence of length k, also known as a word or k-tuple; k decides how many letters of the sequences are searched and aligned together. The Kimura distance is a measure based on the fact that multiple substitutions can occur at a single site. For an aligned pair of sequences, MUSCLE computes the pairwise percent identity, i.e., what percentage of the sequences is aligned/matched, and converts it to a distance by applying the Kimura correction.

The main steps are:

1. k-mer distances are computed for all pairs of unaligned sequences and compiled into a distance matrix, which is clustered using the UPGMA method (a phylogenetic tree construction method based on the assumption that mutations occur at a constant rate), giving TREE1; a progressive alignment guided by TREE1 produces MSA1.

2. Pairwise percent identities are computed from MSA1 and converted into a Kimura distance matrix.

3. UPGMA is applied again, giving TREE2, which guides another progressive alignment and produces MSA2.

4. From the last tree obtained, an edge is deleted, which splits the tree into two subtrees; the profile (sub-alignment) of each subtree is computed, and the two profiles are re-aligned to give a new MSA, for which the SP score is calculated (explained in the previous article, "Basic Concept of MSA").

5. If the SP score improves, the new MSA is kept (as MSA3); otherwise it is discarded.

6. Steps 4 and 5 are repeated, and MUSCLE finally outputs the refined MSA.
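To make the first of those distance measures concrete, here is a small sketch of a k-mer distance between two unaligned sequences, computed as one minus the fraction of shared k-mers. MUSCLE's actual definition differs in detail (it uses a compressed amino-acid alphabet and its own normalization), so treat this as an illustration of the idea rather than the exact formula.

```python
from collections import Counter

def kmer_distance(a, b, k=3):
    """1 - (shared k-mer count / maximum possible number of shared k-mers).
    k-mers are counted with multiplicity; 0.0 means identical k-mer content."""
    ka = Counter(a[i:i + k] for i in range(len(a) - k + 1))
    kb = Counter(b[i:i + k] for i in range(len(b) - k + 1))
    shared = sum((ka & kb).values())               # multiset intersection
    denom = min(len(a), len(b)) - k + 1
    return 1.0 - shared / denom

print(kmer_distance("MKVLITGATGFIG", "MKVLVTGATGFLG"))  # similar sequences, small distance
print(kmer_distance("MKVLITGATGFIG", "PQRSTWYHNDECA"))  # unrelated sequences, distance near 1
```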

Fig. 1: The workflow of MUSCLE.

This is how MUSCLE works. MUSCLE alignment is also used in the MEGA6 tool, which is used for phylogenetic tree construction. Every software tool has its own benefits, depending on the needs under consideration. There are various other tools available for MSA, such as T-COFFEE and MAFFT, which also offer high accuracy and speed; they will be explained in upcoming articles.

Reference:
MUSCLE: multiple sequence alignment with high accuracy and high throughput. Robert C. Edgar.

Introduction to mathematical modelling - Part 2
Fozail Ahmad
SYSTEMS BIOLOGY
Image Credit: Google Images

“In order to obtain the parameter values for analysing the kinetic behaviour of biochemical processes, we simply investigate expression data (gene/protein) at different transcriptional and translational levels, which enables us to frame out the comprehensive structural biochemical pathway.”

Gathering of dynamic/kinetic information

In the previous part you might have noticed that modelling a biochemical process requires a calibrated set of fine parameters which fit into and across the set of chemical/reactant species (gene/protein/molecule) involved in the process. The question arises: where do we collect the data from, and what are the standard criteria for determining the parameters? Basically, a researcher first needs to know the source of the data, and then how to manipulate it to obtain information relevant for modelling. The source of the data can be chosen depending on the requirements of the experimental design. For modelling, data can be taken in the form of gene-gene interactions, gene expression (microarrays) and gene-protein interactions. Interaction and expression data do not by themselves reveal the dynamic/kinetic values of the system, and therefore need to be manipulated before they can be used.

In order to obtain the parameter values for analysing the kinetic behaviour of biochemical processes, we investigate expression data (gene/protein) at different transcriptional and translational levels, which enables us to frame out a comprehensive structural biochemical pathway. This can be done using the following methods:

1. Genome (complete set of genes) analysis at the transcription level through DNA sequencing and genotyping
2. Transcriptome (all mRNA) analysis at the translation level using microarrays
3. Proteome (entire protein complement) analysis at the cellular level (reactions between proteins and other molecules in the cell) using mass spectrometry and 2D gel electrophoresis
4. Metabolome (total metabolites and their intermediates) analysis at the cell level, covering interactions between metabolites and regulators, using 13C labelling and NMR techniques
5. Interactome (all interacting molecules) analysis by yeast two-hybrid screens and TAP techniques

(TAP: tandem affinity purification; NMR: nuclear magnetic resonance.)

SYSTEMS BIOLOGY

G

Page 30: Bioinformatics Review - October 2015

Bioinformatics Review | 30

Fig. 2: Omics data are used both for developing the structural pathway and for setting the parameter values. A mathematical model, in the form of differential equations derived from the reaction channels, is then solved using a suitable algorithm. The resulting simulation shows the dynamic behaviour of the system, and the parameter values can be adjusted until the simulated behaviour closely matches the experimental data.
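The sketch below illustrates the simulation step summarised in Fig. 2: a small reaction channel is written as differential equations and solved numerically. The two-species model (mRNA and protein with production and degradation) and all parameter values are illustrative assumptions, not taken from any particular pathway or data set.

```python
# Minimal sketch of the model-and-simulate step: a toy transcription/translation
# channel written as ODEs and solved numerically. All parameters are arbitrary.

import numpy as np
from scipy.integrate import odeint

def model(y, t, k_tx, k_tl, d_m, d_p):
    """dm/dt = k_tx - d_m*m ;  dp/dt = k_tl*m - d_p*p"""
    m, p = y
    dm_dt = k_tx - d_m * m
    dp_dt = k_tl * m - d_p * p
    return [dm_dt, dp_dt]

params = (2.0, 1.5, 0.8, 0.3)     # k_tx, k_tl, d_m, d_p (illustrative values)
t = np.linspace(0, 20, 200)       # simulated time (arbitrary units)
y0 = [0.0, 0.0]                   # start with no mRNA and no protein

trajectory = odeint(model, y0, t, args=params)
m, p = trajectory.T

# the parameter tuple can now be varied and the simulated curves compared
# against experimental time-course data until they agree closely
print(f"steady state: mRNA = {m[-1]:.2f}, protein = {p[-1]:.2f}")
```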

The above-mentioned techniques are collectively referred to as omics. They provide structural and dynamic data that are used to generate the mathematical formulae representing the observable reactions, followed by the development of a mathematical model and a comprehensive pathway of the biological system. These tentative models allow us (as mentioned in Part 1) to observe the effect of a stimulus on a specific signalling pathway, perturbations in cellular activities, changes in gene expression levels, and so on.

Omics approaches are characterized by a number of features. First, they allow the researcher to perform analyses at different molecular levels, such as the gene, protein and metabolite levels. These levels sometimes show asynchronous behaviour: a metabolite such as glucose may be abundant in a cell while the corresponding enzymes that catalyse its reactions are scarce, or vice versa. Asynchronous behaviour is an indication of a complex regulatory mechanism; it is therefore crucially important to evaluate the degree of synchronization across all cellular levels.

Second, omics methods are highly parallelized. This means that all genes/mRNAs (the read-outs in a sample) can be studied simultaneously rather than having to perform separate experiments focusing on individual genes. This parallelization also allows the researcher to compare expression results for the same gene across samples and to relate them to interactions between the resulting proteins. Third, omics methods are highly standardized and therefore need highly automated computing, providing the scientist with a large number of samples at a time. After the huge data set has been collected, the most relevant information is picked out and processed further for the final analyses.

Taken together, the techniques of omics are important in the sense that they generate the numerical data from which we are able to develop a structural pathway mimicking the real picture of the biological system and then represent it in the form of a mathematical model.

(→ Continue to Part 3)


DNA test for paternity: This is how you can fail!
Tariq Abdullah
Image Credit: Google Images

“DNA contains the genetic information of an individual. The whole set of genes in an organism is known as the genome. 99.9% of the genome is exactly the same in every human individual, but 0.1% differs, and that 0.1% is unique to each person, making it possible to identify every individual.”

DNA testing, also called DNA fingerprinting, is done to verify paternity, to establish criminal involvement, and in forensic science, archaeology and other scientific fields. It is widely held that the DNA fingerprinting test is foolproof, and it has its merits in court cases too: it is considered credible evidence of criminal involvement and paternity. The case of N.D. Tiwari also received wide media attention in recent times.

BUT the chances are that you can fail a DNA test even with your real father or mother! To understand how this is possible, let us look at how DNA fingerprinting is done.

The technique of DNA fingerprinting (genetic fingerprinting), or simply the DNA test, was developed by Dr. Alec Jeffreys in 1984. DNA contains the genetic information of an individual, and the whole set of genes in an organism is known as the genome. 99.9% of the genome is exactly the same in every human individual, but 0.1% differs, and that 0.1% is unique to each person, making it possible to identify every individual.

To identify a person by differences in DNA sequence, the sequences are simply compared to each other. To speed up this comparison, rather than comparing each nucleotide, a molecular biologist compares regions of high variation in the DNA sequence, called minisatellites. The location and sequence of minisatellites in the genome vary in every individual, and the chance of two people sharing the same minisatellite pattern is very low (about 1 in a billion). Hence the pattern can be treated as unique to every individual, just like a fingerprint.

To perform a DNA test, the DNA is first broken into smaller pieces by a restriction endonuclease such as EcoRI, which cuts the sequence at the distinct locations where the sequence GAATTC (or its complement) occurs. The locations of this repeated recognition sequence vary in every individual. The fragments are then sorted according to their molecular weight (size) by a technique called gel electrophoresis.

The fragments are then compared to each other. If the fragments generated by the restriction enzyme are of the same sizes, it is more likely that both samples originated from the same individual.
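As a rough in-silico illustration of this fragment-pattern idea, the sketch below cuts two sequences at the EcoRI recognition site (GAATTC), collects the fragment lengths and compares the resulting band patterns. The sequences are made up for this example; a real test works on genomic DNA and compares band sizes on a gel, not strings.

```python
# Minimal illustration of RFLP-style comparison: cut at EcoRI sites (GAATTC),
# collect fragment lengths ("bands") and compare two samples. Toy sequences only.

def ecor1_fragments(seq):
    """Return sorted fragment lengths after cutting at every GAATTC site
    (EcoRI cuts between G and AATTC, i.e. one base into the site)."""
    cut_positions, start = [], 0
    while True:
        i = seq.find("GAATTC", start)
        if i == -1:
            break
        cut_positions.append(i + 1)   # cut just after the leading G
        start = i + 1
    pieces, prev = [], 0
    for pos in cut_positions:
        pieces.append(pos - prev)
        prev = pos
    pieces.append(len(seq) - prev)    # trailing fragment
    return sorted(pieces)             # a gel sorts fragments by size

sample_1 = "ATTGAATTCGGCTAGAATTCTTACGGA"
sample_2 = "ATTGAATTCGGCTAGAATTCTTACGGA"   # identical -> same band pattern

bands_1 = ecor1_fragments(sample_1)
bands_2 = ecor1_fragments(sample_2)
print(bands_1, bands_2, "match" if bands_1 == bands_2 else "differ")
```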

So how can DNA fingerprinting fail?


For a DNA test to fail, a person must carry two different sets of DNA (genomes) in the body. This is possible in the following ways; here is a concise list for quick reference.

1. Since the human body is a complex and dynamic system, environmental conditions in different parts of the body may lead to changes in DNA; this comparatively new idea is studied in epigenetics. Although changes do occur in this way, they are unlikely to alter the entire DNA or the locations of the minisatellites.
2. Transposable elements may also cause the location of some sequences to change. However, transposable elements are not widespread enough to change the locations of all minisatellites, so this explanation is not satisfactory either.
3. The occurrence of more than one genome in the cells of a single human, i.e. a human chimera, was recently seen in a US man who failed a paternity test with his biological child.

So what is a human chimera? To be simple and precise, a human chimera is an individual who carries cells with two completely different sets of genes. It is a very rare condition and may go unnoticed. Sometimes, when one of a pair of twins dies during early pregnancy, the remaining cells may be taken up and absorbed by the surviving embryo. The surviving embryo will then carry two kinds of genome in different parts of the body, depending on how differentiation proceeds.

Thus, if the cells that form sperm in your body have a different genome from the rest of your cells, you may fail a DNA test! This is a rare condition, and the chances are low that you would get away with a crime because of it. If you liked this article or found it worth reading, please do not forget to share it. Who knows, there might be cases of human chimera around you. :)


Subscribe to the Bioinformatics Review newsletter to get the latest posts in your mailbox and never miss out on any of your favorite topics. Log on to www.bioinformaticsreview.com
