Chapter 1: Introductory Review

1.1 Introduction

The knowledge of the complete human genome sequence has unfolded the mysteries of human genome variation, which in turn has allowed a mechanism-based approach to understanding the relationship of genotype with disease. This understanding is considered an essential precursor to the development of personalized medicine. With rapid advances in high-throughput genotyping and next-generation sequencing technologies, a large amount of genetic variation has been discovered, and it takes many forms. The simplest type of variant results from a single base mutation which substitutes one nucleotide for another; this is the most common form of variation and is referred to as a single nucleotide polymorphism (SNP). Many other forms of variation result from the insertion or deletion of one or more nucleotides, the so-called insertion/deletion (INDEL) polymorphisms. The most common insertion/deletion events occur in repetitive sequence elements, consisting of variable-length sequence motifs that are repeated in tandem in a variable copy number, the so-called variable number tandem repeat polymorphisms (VNTRs). VNTRs can be further divided on the basis of the size of the tandem repeat unit into microsatellites (or simple sequence repeats, SSRs) and minisatellites. Microsatellites consist of repeat motifs of one to six bases. Direct tandem repeat sequences with motifs of 10-30 base pairs are called minisatellites (Jeffreys et al., 1985). The rarest insertion/deletion events involve deletions or duplications of regions that can range from a few kilobases to several megabases. A few other types of repeats have also been observed in genomes. These include palindromic sequences, inverted repeats, and mirror repeats (Cox and Mirkin, 1997).

This huge quantity of genetic variation in the human genome has led many to question the origin and maintenance of such a genetic load in human populations. Kimura (1983) formulated the neutral theory of evolution, which proposes that most sequence variation has no significant phenotypic consequence and so is not subject to natural selection, rendering the majority of mutations likely to be phenotypically neutral. However, a certain number of as yet undefined alleles can either cause disease directly (referred to as mutations) or increase susceptibility to disease (polymorphisms). Bioinformatics analysis of human sequence provides an opportunity to identify the most common form of genetic variation, SNPs, by comparing two sequences, viz., coding DNAs, expressed sequence tags (ESTs) or genomic sequences. The discovery of SNPs that affect biological function has become increasingly important, and the databases available for SNPs are discussed at some length in this chapter.

1.2 Single Nucleotide Polymorphisms (SNPs)

SNPs are nucleotide changes in DNA that account for approximately 90% of the genetic variation among individuals in a population (Collins et al., 1998). A SNP is a nucleotide change that is prevalent in at least 1% of the population (Figure 1.1). There are two categories of SNPs, as given below:

1. Linked SNPs: SNPs which do not reside within genes and do not affect protein function. These are also referred to as indicative SNPs, which are associated with the response to drugs or with the risk of getting a certain disease.

2. Causative SNPs: SNPs which affect protein structure and/or function and cause disease. These can be divided into two categories:

   (a) Coding SNPs: SNPs found in the coding regions of genes, which can affect protein function. Coding SNPs can in turn be divided into two types:

       (i) Non-synonymous SNPs (nsSNPs): the changed nucleotide leads to a change in amino acid.

       (ii) Synonymous SNPs (sSNPs): the changed nucleotide does not lead to a change in amino acid.

   (b) Non-coding SNPs: the nucleotide change is located within the regulatory parts of genes and is correlated with changes in the expression of the corresponding mRNA.
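The distinction between synonymous and non-synonymous coding SNPs can be illustrated with a minimal Python sketch. It assumes the standard genetic code and a single-base change within one codon; the function name `classify_coding_snp` is hypothetical and is not taken from any of the tools discussed in this chapter.

```python
# Standard genetic code, laid out in the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def classify_coding_snp(ref_codon: str, alt_codon: str) -> str:
    """Label a single-base codon change as synonymous or non-synonymous."""
    ref_codon, alt_codon = ref_codon.upper(), alt_codon.upper()
    if len(ref_codon) != 3 or len(alt_codon) != 3:
        raise ValueError("codons must be three bases long")
    # A SNP changes exactly one base of the codon.
    if sum(a != b for a, b in zip(ref_codon, alt_codon)) != 1:
        raise ValueError("codons must differ at exactly one base")
    # Same encoded residue (or stop signal, '*') means the change is silent.
    if CODON_TABLE[ref_codon] == CODON_TABLE[alt_codon]:
        return "synonymous"
    return "non-synonymous"


if __name__ == "__main__":
    print(classify_coding_snp("GAA", "GAG"))  # both codons encode Glu
    print(classify_coding_snp("GAG", "GTG"))  # Glu replaced by Val
```

Changes to or from a stop codon are simply reported here as non-synonymous; a fuller treatment would single them out as a separate class.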


Figure 1.1: A single nucleotide polymorphism (SNP), in which a single nucleotide (A, C, T or G) in the DNA sequence is altered. Here, C is changed to T.

The non-synonymous SNPs (nsSNPs) that lead to amino acid changes in protein products are likely to affect protein structure and function, depending on the type of amino acid change as well as the site of the change (Cargill et al., 1999; Stenson et al., 2003; Thorisson and Stein, 2003; Ng and Henikoff, 2006). Some amino acid changes are tolerated by proteins with no concomitant phenotypic effect, and the corresponding nsSNPs are referred to as benign or neutral nsSNPs. Those leading to amino acid changes that are not tolerated by protein structure and function, and which consequently lead to disease phenotypes, are referred to as pathogenic or disease mutants (Saunders and Baker, 2002; Bao and Cui, 2005; Yue and Moult, 2006).

Although comparative genetic analyses of healthy and diseased individuals have led to the discovery of a number of mis-sense mutations/nsSNPs associated with diseases, the list may be far from complete, as the uncharacterized mutations/nsSNPs discovered through the human genome project outnumber the characterized ones. In this post-genomic era, classification of nsSNPs as disease-causing or neutral has therefore been perceived as the first step before any further study, such as pharmacogenomics, is attempted, and a variety of computational methods have been devised for this purpose (Mooney, 2005; Ng and Henikoff, 2006; Thusberg and Vihinen, 2009). Before turning to these methods, I give details of the databases hosting information on mutations/SNPs. In addition to serving as information resources, these databases have also provided datasets for benchmark studies of the computational methods developed for the prediction of pathogenic mutations.

1.3 Databases

The databases include dbSNP (Sherry et al., 1999), the Human Genome Variation Database (HGVbase) (Fredman et al., 2004), On-line Mendelian Inheritance in Man (OMIM) (Hamosh et al., 2005) and the Human Gene Mutation Database (HGMD) (Stenson et al., 2003), among others. The details of these databases are given below:

1.3.1 The Single Nucleotide Polymorphism Database (dbSNP)

dbSNP is a free, public-domain archive for a broad collection of simple genetic polymorphisms across different organisms. The database has been developed and is maintained by the National Center for Biotechnology Information (NCBI) in collaboration with the National Human Genome Research Institute (NHGRI) and is available at http://www.ncbi.nlm.nih.gov/SNP/. It was created in 1998 (Sherry et al., 1999) to supplement GenBank, NCBI’s public collection of protein and nucleotide sequences. In addition to SNPs, dbSNP contains a range of other molecular variation: (1) deletion and insertion polymorphisms (DIPs/indels) and (2) microsatellite repeat variations or short tandem repeats (STRs). Each dbSNP entry includes the sequence context of the polymorphism (i.e., the surrounding sequence), the occurrence frequency of the polymorphism (by population or individual), and the experimental method(s), protocols, and conditions used to assay the variation. dbSNP can be searched using the Entrez SNP tool with queries such as a refSNP ID number, a gene name, an allele or a build number, and returns summarized information for the SNPs found. It has been reported that this database contains some false-positive entries due to genotyping and base-calling errors (Reich et al., 2003; Mitchell et al., 2004; Musumeci et al., 2010).

1.3.2 Human Genome Variation Database (HGVbase)

The Human Genome Variation database, HGVbase, previously known as HGbase (http://hgvbase.cgb.ki.se/; Fredman et al., 2004), is a highly curated and non-redundant database of available genomic variation data of all types, though it mostly comprises single nucleotide polymorphisms (SNPs). HGVbase is supported by a European consortium comprising teams at the Karolinska Institute, Sweden, the European Bioinformatics Institute, United Kingdom (UK), and the European Molecular Biology Laboratory, Germany.

This database can be regarded as a manually curated extension of dbSNP, in which the HGVbase curators provide a more extensively validated SNP data set by filtering out SNPs in repeat and low-complexity regions and by identifying SNPs for which a genotyping assay can successfully be designed. HGVbase includes polymorphisms, variations with rare or single-occurrence alleles, and disease-related and disease-causing clinical mutations.

1.3.3 The Human Gene Mutation Database (HGMD)

The Human Gene Mutation Database (HGMD) constitutes a comprehensive core collection of data on germ-line mutations responsible for human inherited disease (http://www.hgmd.org/; Stenson et al., 2003). HGMD was first made publicly available in April 1996 and has been offered to users on a commercial basis since the collaboration between HGMD and BIOBASE GmbH in 2006. The scope of HGMD is limited to mutations, which include single base-pair substitutions in coding, regulatory and splicing-relevant regions, insertions/deletions (indels), duplications and triplet repeat expansions.

1.3.4 On-line Mendelian Inheritance in Man (OMIM)

OMIM is an on-line database (http://www.omim.org; Hamosh et al., 2000) that catalogues human genes and their associated mutations. It is based on the long-running catalogue Mendelian Inheritance in Man (MIM), started in 1967 by Victor A. McKusick at Johns Hopkins. The database became available on the NCBI web site in 1995. OMIM is an excellent resource for background information about the biology of genes and their related diseases.

1.3.5 The UniProt/SwissProt Database

UniProtKB/Swiss-Prot (http://expasy.org/; Bairoch and Apweiler, 1996) is a highly curated, manually annotated and non-redundant protein sequence database. The database was created in 1986 by Amos Bairoch at the Swiss Institute of Bioinformatics and is maintained collaboratively by the Department of Medical Biochemistry of the University of Geneva and the EMBL Data Library. The objective of UniProtKB/Swiss-Prot is to provide all known relevant information about a particular protein. For each protein sequence entry, information about variants is listed as disease or polymorphism. An additional benefit of UniProtKB/Swiss-Prot is that it is well integrated with OMIM, dbSNP and the NCBI database family, so that new variants added to those databases also become available in UniProtKB/Swiss-Prot.

1.4 Computational analysis of effects of nsSNPs

As mentioned earlier, there are a large number of mis-sense mutations whose phenotypic effects have not been discovered. Hence, methods to accurately predict the effect of mis-sense mutations have always been in demand. Several methods have been developed and briefly reviewed (Mooney, 2005; Ng and Henikoff, 2006; Thusberg and Vihinen, 2009). The basic approach adopted by all these methods involves the use of sequence information, structural information, or both, for the proteins harboring the nsSNPs, with the underlying idea that mis-sense mutations that alter protein structure and function are likely to be pathogenic and those that do not are likely to be neutral/benign (Figure 1.2). In other words, the phenotypic effect of a mis-sense mutation is judged by its effect at the protein level. In order to predict whether a given mis-sense mutation is pathogenic or neutral, various features at the mutation site are considered, including evolutionary conservation (Miller et al., 2001), solvent accessibility and secondary structure (Sunyaev et al., 2000). In addition, the effect of the mutation on protein stability is also considered by some studies (Wang and Moult, 2001).

There have been studies that map mis-sense mutations onto their respective proteins and examine their protein sequence and structural contexts (Sunyaev et al., 2000; Burke et al., 2007; Yue et al., 2006; Adzhubei et al., 2010). Wang and Moult (2001) showed that 83% of disease-causing mutations affect protein stability. Using both structure and sequence information, Sunyaev et al. (2000) showed that 70% of disease-causing mutations affect structurally and functionally important sites such as buried sites, active sites or sites involved in disulphide bonds. Gong and Blundell (2010) examined the distribution of amino acid variants by mapping them onto 3D structures, where available, and reported that disease-related variants occur much more frequently than polymorphic variants in solvent-inaccessible regions and at amino acid residues involved in hydrogen bond formation.

However, the coverage of prediction methods using protein structure is only 14% (Yue and Moult, 2005), compared with 81% for sequence-based methods (Ramensky et al., 2002). For sequence-based prediction methods, the first step is to select the homologous sequences, manually or automatically. Since the amino acids occurring in the alignments form the basis of any sequence-based prediction method, the alignments and the number of sequences used are central to the prediction of pathogenic mutations. Some prediction methods (e.g., SNPs&GO; Calabrese et al., 2009) also incorporate annotations or gene ontology to increase prediction accuracy. In the following sections I discuss some of the widely used methods (Table 1.1).

Figure 1.2: The basic approach for amino acid substitution prediction using either a sequence- or a structure-based method. [Flowchart: a protein sequence or structure and an amino acid substitution are taken as input. Structure branch: structural features such as crystallographic B-factor, solvent accessibility, ligand binding sites, 3D structural environment, etc. Sequence branch: sequence-based features including conservation score, position-specific evolutionary score derived from the MSA, physicochemical properties and amino acid substitution matrix. Scoring rules are then applied to predict PATHOGENIC or BENIGN.]

Table 1.1: Available amino acid substitution prediction methods

Method        | Algorithm used                             | Conservation analysis                       | Structural features
--------------+--------------------------------------------+---------------------------------------------+---------------------------------------
SIFT          | Scores calculated using Dirichlet mixtures | Sequence homology                           | -
PolyPhen      | Empirical rules                            | Position-Specific Independent Counts (PSIC) | Predicted features / homology modeling
PANTHER       | Alignment scores                           | PANTHER library, HMMs                       | -
SNAP          | Neural network                             | PSIC profiles, Pfam, PSI-BLAST              | Predicted features
SNPs3D        | Support vector machine                     | Shannon-entropy based                       | Predicted features
PhD-SNP       | Support vector machine                     | Sequence profiles, sequence environment     | -
MutPred       | Random forest                              | SIFT, Pfam, PSI-BLAST                       | Predicted features
nsSNPAnalyzer | Random forest                              | Normalized probability                      | Homologous structures
PMUT          | Neural network                             | Physicochemical features                    | Predicted features
PAREPRO       | Support vector machine                     | psap score, residue difference              | -
MAPP          | MSAs                                       | Physicochemical features                    | -
SAAP          | Known PDB                                  | -                                           | Structural analysis
SNPs&GO       | Support vector machine                     | Sequence profiles, ontology                 | -
TopoSNP       | MSAs                                       | Relative entropy, Pfam                      | 3D structural locations
PolyPhen-2    | Bayesian classification                    | PSIC profiles                               | Predicted features / homology modeling

1.4.1 SIFT

SIFT (Sorting Intolerant From Tolerant) constructs a multiple sequence alignment (MSA) for a query sequence and creates a score matrix, based on a 13-component Dirichlet mixture, for each position in the alignment (http://sift.jcvi.org/ and http://sift-dna.org; Ng and Henikoff, 2001). Based on the amino acids appearing at each position in the MSA, SIFT calculates a score for each amino acid substitution, which is converted into a normalized probability that the substitution would be evolutionarily tolerated. Substitutions at a position with normalized probabilities less than a chosen cutoff value (0.05) are predicted to be pathogenic, and those greater than or equal to the cutoff value are predicted to be tolerated. SIFT is available both as an online server and as standalone software which can be downloaded and run on a local system.
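The cutoff rule described above can be sketched in a few lines of Python. This is a deliberately simplified illustration and not the SIFT algorithm itself: the normalized probability is approximated here by the raw frequency of a residue in one alignment column divided by the frequency of the most common residue, without the Dirichlet mixture pseudocounts that SIFT uses. The function names and the toy column are assumptions made for the example.

```python
from collections import Counter

SIFT_CUTOFF = 0.05  # cutoff value quoted in the text


def normalized_probabilities(column: str) -> dict:
    """Residue frequencies in one MSA column, scaled so the top residue scores 1.0.

    Simplification: raw counts only, gaps ('-') ignored, no pseudocounts.
    """
    counts = Counter(r for r in column.upper() if r != "-")
    if not counts:
        raise ValueError("column contains no residues")
    top = max(counts.values())
    return {residue: n / top for residue, n in counts.items()}


def predict(column: str, substitution: str, cutoff: float = SIFT_CUTOFF) -> str:
    """Apply the cutoff: below it 'pathogenic', at or above it 'tolerated'."""
    score = normalized_probabilities(column).get(substitution.upper(), 0.0)
    return "pathogenic" if score < cutoff else "tolerated"


if __name__ == "__main__":
    # Toy column from a hypothetical alignment of 36 homologous sequences.
    column = "L" * 30 + "I" * 5 + "V"
    for residue in "IVD":
        print(residue, predict(column, residue))
```

Because unseen residues receive a score of zero in this sketch, it is harsher on them than a method with pseudocounts would be; the example is meant only to show how the threshold separates the two classes.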

1.4.2 PolyPhen

PolyPhen (Polymorphism Phenotyping), like SIFT, takes an evolutionary approach to distinguishing pathogenic nsSNPs from functionally neutral ones and is available as an online server (http://genetics.bwh.harvard.edu/pph; Ramensky et al., 2002). PolyPhen uses a rule-based cutoff system based on sequence, phylogenetic, and structural information to classify variants. The sequence-based characterization includes SWALL database annotation for sequence features (Johnson and Todd, 2001), the SignalP program to predict signal peptide regions (Nielsen et al., 1997), the Coils2 program to predict coiled-coil regions (Lupas et al., 1991), TMHMM to predict transmembrane regions (Krogh et al., 2001), and the PHAT (Predicted Hydrophobic And Transmembrane) matrix substitution score (Ng et al., 2000). The phylogenetic prediction is based on the position-specific independent counts (PSIC) score (Sunyaev et al., 1999) derived from multiple sequence alignments (MSAs). For the structural characterization, PolyPhen utilizes three-dimensional protein structure databases, such as PDB (Protein Data Bank) or PQS (Protein Quaternary Structure), and uses the DSSP (Dictionary of Secondary Structure in Proteins) software (Kabsch and Sander, 1983) to determine whether a variant may have an effect on the protein's secondary structure, solvent-accessible surface and phi-psi dihedral angles. In addition, PolyPhen calculates the normalized B-factor (temperature factor), the region of the phi-psi map (Ramachandran map), the change in residue side-chain volume, the normalized accessible surface area and the change in accessible surface propensity resulting from the amino acid substitution. PolyPhen also checks whether the amino acid substitution site is in spatial contact with ligands, other protein subunits (interchain contacts), functional sites or binding sites. After characterizing the variant, PolyPhen uses empirically derived rules to predict it as “probably damaging” to protein function, “possibly damaging”, “benign” or “unknown”.


1.4.3 PolyPhen-2

PolyPhen-2 is an improved version of PolyPhen that uses a selected combination of 11 sequence- and structure-based features to characterize an amino acid substitution, and is available both as an online server and in batch form (http://genetics.bwh.harvard.edu/pph2; Adzhubei et al., 2010). The sequence-based attributes include the PSIC score of the wild-type residue and the difference between the wild-type and mutant PSIC scores; MSA properties, including the number of residues observed in the MSA; sequence identity (SI) with the closest homologue; and the position of the mutation in relation to domain boundaries as defined by Pfam (Finn et al., 2010). The structure-based attributes are the change in accessible surface area propensity for buried residues, the crystallographic B-factor and solvent accessibility. PolyPhen-2 uses a naïve Bayesian classifier to predict whether a variant is “probably damaging”, “possibly damaging”, “benign” or “unknown”.

1.4.4 SNAP

SNAP (Screening for Nonacceptable Polymorphisms) is an online neural-network-based method for predicting the effect of a mis-sense mutation (http://rostlab.org/services/snap/; Bromberg and Rost, 2007). The method utilizes the local sequence environment of a residue; biochemical properties, including substitution by a charged amino acid at a buried position, introduction of proline as a structure disruptor in alpha-helices, replacement of a hydrophilic side chain by a hydrophobic one or vice versa, and overpacking of a cavity or the core through replacement by a larger residue; transition frequencies of the mutations; evolutionary information encoded as combinations of weighted amino acid frequencies and position-specific scoring matrix vectors from PSI-BLAST (Altschul et al., 1997); profiles generated by PSIC (Sunyaev et al., 1999); structure-based sequence features, including secondary structure predicted by PROFsec (Rost and Sander, 1993; Rost, 1996), solvent accessibility predicted by PROFacc (Rost and Sander, 1994; Rost, 2005) and chain flexibility predicted by PROFbval (Schlessinger et al., 2006); protein family/domain-related evolutionary information from Pfam (Finn et al., 2010); and SWISS-PROT annotations (Bairoch and Apweiler, 2000). From these features SNAP predicts a score for a particular variant. The score indicates the predicted effect and is accompanied by a reliability index (RI).

1.4.5 PANTHER

PANTHER (Protein ANalysis THrough Evolutionary Relationships), an online server, estimates the likelihood that a mis-sense variant has a functional impact on the protein (http://www.pantherdb.org/tools/csnpScoreForm.jsp; Thomas et al., 2003). It calculates substitution position-specific evolutionary conservation (subPSEC) scores based on an alignment of evolutionarily related proteins. The alignments are derived from the PANTHER library of protein families/subfamilies based on Hidden Markov Models (HMMs). First, the likelihood of a particular amino acid at a particular position, the aaPSEC score, is calculated; the subPSEC score is then obtained from the difference between the aaPSEC scores of the amino acid residues involved in the substitution. The subPSEC score ranges from 0 (neutral) to about -10 (most likely pathogenic). The authors suggested -3 as the best cut-off value for the prediction of pathogenic mutations.
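The relationship between the two scores can be sketched as follows. The sketch assumes that the aaPSEC score is the log of the probability of an amino acid at a position relative to the most probable amino acid there, and that the subPSEC score is the negative absolute difference of the two aaPSEC scores; these formulas are my reading of the description above rather than a verified transcription of the PANTHER implementation, and the probabilities in the example are invented.

```python
import math

SUBPSEC_CUTOFF = -3.0  # cut-off value quoted in the text


def aa_psec(p_residue: float, p_max: float) -> float:
    """Assumed aaPSEC: log-probability relative to the most probable residue."""
    return math.log(p_residue / p_max)


def sub_psec(p_wild: float, p_mutant: float, p_max: float) -> float:
    """Assumed subPSEC: negative absolute difference of the two aaPSEC scores."""
    return -abs(aa_psec(p_wild, p_max) - aa_psec(p_mutant, p_max))


def is_predicted_pathogenic(score: float, cutoff: float = SUBPSEC_CUTOFF) -> bool:
    """Scores at or below the cut-off are treated as pathogenic in this sketch."""
    return score <= cutoff


if __name__ == "__main__":
    # Hypothetical HMM emission probabilities at one alignment position.
    conservative = sub_psec(p_wild=0.5, p_mutant=0.3, p_max=0.5)
    radical = sub_psec(p_wild=0.8, p_mutant=0.01, p_max=0.8)
    print(round(conservative, 2), is_predicted_pathogenic(conservative))
    print(round(radical, 2), is_predicted_pathogenic(radical))
```

Under these assumptions a substitution between two equally probable residues scores 0, and the score becomes more negative as the mutant residue becomes rarer relative to the wild type.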

1.4.6 PhD-SNP

PhD-SNP (Predictor of human Pathogenic Single Nucleotide Polymorphisms), a web-based support vector machine classifier, uses a combination of single-sequence (SVM-Sequence) and sequence-profile (SVM-Profile) classifiers to classify a mis-sense variant (http://gpcr.biocomp.unibo.it/~emidio/PhD-SNP/PhD-SNP.htm; Capriotti et al., 2006). SVM-Sequence classifies a mutation as disease-related or benign based on the nature of the specific mutation and its local sequence environment. SVM-Profile is built on sequence profile information from MSAs and classifies the variant according to the ratio between the frequencies of the wild-type and mutated residues. The prediction for a mutation at a particular position is made with a decision-tree algorithm, in which the user can choose either sequence-based information alone or sequence- and profile-based information together (Hybrid Method).

1.4.7 MutPred

MutPred is a web application developed to classify an amino acid substitution as disease-associated or neutral (http://mutpred.mutdb.org/; Li et al., 2009). It is a Random Forest based classifier that integrates protein structure, sequence and evolutionary information. MutPred utilizes the SIFT method (Ng and Henikoff, 2003) along with PSI-BLAST, transition frequencies (Bromberg and Rost, 2007) and Pfam profiles (Finn et al., 2010). Structural attributes include secondary structure and solvent accessibility prediction by the PHD method (Rost, 1996), coiled-coil structure prediction by MARCOIL (Delorenzi and Speed, 2002), disorder prediction by DisProt (Peng et al., 2006), stability prediction by MuPro (Cheng et al., 2006), transmembrane helix prediction by TMHMM (Krogh et al., 2001) and B-factor prediction (Radivojac et al., 2004). Functional site predictions include DNA-binding residues (Ahmad et al., 2004), catalytic residues, methylation (Daily et al., 2005), calmodulin-binding targets (Radivojac et al., 2006), ubiquitination (Radivojac et al., 2009) and glycosylation. MutPred reports a probability score indicating whether a mis-sense variant is likely to be pathogenic or neutral, and estimates its functional effects using p-values for the significance of each particular phenotypic effect.

1.4.8 nsSNPAnalyzer

nsSNPAnalyzer is an online random forest based classifier that uses structural and evolutionary information to classify mis-sense variants (http://snpanalyzer.utmem.edu; Bao et al., 2005). After a protein sequence is submitted as a query, nsSNPAnalyzer searches the ASTRAL database (Chandonia et al., 2004) for homologous protein structures, from which it extracts structural features including solvent accessibility, environmental polarity and secondary structure. The evolutionary attributes include the normalized probability of the amino acid substitution derived from the MSA, as well as the similarity and dissimilarity between the wild-type and mutated amino acids.

1.4.9 SNPs&GO

SNPs&GO is a web server based on a support vector machine classifier that predicts disease-related mutations from the protein sequence (http://snps-and-go.biocomp.unibo.it/snps-and-go/; Calabrese et al., 2009). It is based on the type of mutation and sequence environment information, sequence profiles generated from MSAs, and PANTHER predictions (Thomas et al., 2003). A novel feature, a log-odds score derived from Gene Ontology (GO) terms (Ashburner et al., 2000), is crucial in increasing the performance of SNPs&GO in predicting pathogenic mutations. The SNPs&GO output consists of a table listing the position of the mutation in the protein sequence, the wild-type residue, the new residue and whether the mutation is predicted as disease-related or neutral.

1.4.10 SNPs3D

SNPs3D is an SVM-based classifier which assigns molecular functional effects of nsSNPs based on structure and sequence analysis and is available as an online server (http://www.snps3d.org/; Yue et al., 2006). The sequence-based features include the probability of the substitution at a particular position, the Shannon entropy at each position, and the mean and standard deviation of the entropy over all positions. Structural attributes include a set of 15 stability factors which are used to assess the impact of each mutant on protein stability. These fall into several classes: electrostatic interactions, viz., reduction of polar-polar, polar-charge and charge-charge interactions; solvation effects, viz., burying of charged or polar groups; disulphide bond breakage and reduction in the non-polar area buried on folding; structural rigidity, viz., crystallographic B-factor, Z score and standard deviation; and steric strain, representing backbone strain and overpacking. The sequence features contribute to the svm-profile score and the structural features to the svm-structure score. For both scores, a positive value indicates a variant classified as neutral and a negative value indicates a pathogenic case. For variants that act by affecting protein function rather than stability, the stability model is expected to return a positive (non-pathogenic) svm-structure score and the profile model a negative (pathogenic) score.
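The entropy-based sequence features mentioned above can be computed from an alignment as in the following sketch. It uses the textbook Shannon entropy in bits over the residues of each column and the population standard deviation across columns; whether SNPs3D uses exactly these conventions (log base, gap handling, sequence weighting) is not stated in the text, so they are assumptions, as is the toy alignment.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def shannon_entropy(column: str) -> float:
    """Shannon entropy (bits) of the residue distribution in one MSA column."""
    counts = Counter(column)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def entropy_profile(alignment: list) -> tuple:
    """Per-position entropies plus their mean and standard deviation."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("aligned sequences must have equal length")
    columns = ["".join(seq[i] for seq in alignment) for i in range(length)]
    entropies = [shannon_entropy(col) for col in columns]
    return entropies, mean(entropies), pstdev(entropies)


if __name__ == "__main__":
    # Hypothetical four-sequence alignment: position 1 is fully conserved.
    toy_alignment = ["ACD", "ACE", "AGD", "AGE"]
    per_position, average, spread = entropy_profile(toy_alignment)
    print(per_position, round(average, 3), round(spread, 3))
```

A fully conserved column has zero entropy, so a substitution there is more suspicious than one at a position whose entropy lies well above the mean for the protein.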

1.4.11 PMUT

PMUT is a neural network (NN) based method available online as a web server (http://mmb2.pcb.ub.es:8080/PMut/; Ferrer-Costa et al., 2005). It uses both sequence and structural features. The structural parameters include solvent accessibility and secondary structure predicted by the PHD software (Rost and Sander, 1993), observed secondary structure assigned by the SSTRUC implementation of the Kabsch and Sander method (Rost and Sander, 1993), and observed solvent accessibility calculated by NACCESS (Hubbard and Thornton, 1993). The sequence-derived parameters include substitution matrices such as BLOSUM62 and PAM40, and physicochemical properties including hydrophobicity indices from water/octanol free energy measurements and volume changes derived from Van der Waals volumes as well as volumes of buried residues. The differences in physicochemical properties between the wild type and the mutant are also incorporated. Secondary structure propensities are obtained from the standard Chou and Fasman analysis (Chou and Fasman, 1974) as well as from the analysis of Swindells et al. (1994). After submission to PMUT, the output is shown as a pathogenicity index ranging from 0 to 1 (indexes >0.5 signal pathological mutations) and a confidence index ranging from 0 (low) to 9 (high).

1.4.12 MAPP

MAPP (Multivariate Analysis of Protein Polymorphism) is a Java-based web tool that considers the physicochemical variation in each column of an MSA and, on the basis of this variation, calculates the deviation of each candidate amino acid from it, and can thus predict the impact of all possible amino acid substitutions on the function of the protein (http://mendel.stanford.edu/SidowLab/downloads/MAPP/index.html; Stone and Sidow, 2005). MAPP considers six physicochemical properties for the evaluation of mis-sense variants: hydropathy (Kyte and Doolittle, 1982), polarity (Stryer, 1995), charge (Stryer, 1995), side-chain volume (Zamyatin, 1972), free energy in alpha-helical conformation (Muñoz and Serrano, 1994) and free energy in beta-sheet conformation (Muñoz and Serrano, 1994). MAPP calculates the impact score for each possible amino acid variant by taking the difference of each property from the MSA column mean and dividing by the square root of the variance. A high impact score identifies a potentially pathogenic substitution, whereas a low impact score indicates a negligible effect, i.e., a neutral substitution.
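The deviation calculation described above can be sketched for a single property. The example uses only the Kyte and Doolittle hydropathy scale and an unweighted column, whereas MAPP combines six properties and, as far as I am aware, weights sequences by their evolutionary relationships; the function name `impact_score` and the toy column are assumptions for illustration only.

```python
from statistics import mean, pvariance

# Kyte and Doolittle (1982) hydropathy index.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def impact_score(column: str, candidate: str) -> float:
    """Deviation of a candidate residue from one MSA column, in standard deviations.

    Single-property simplification: (value - column mean) / sqrt(column variance).
    """
    values = [KYTE_DOOLITTLE[r] for r in column.upper() if r in KYTE_DOOLITTLE]
    if len(values) < 2:
        raise ValueError("need at least two residues in the column")
    variance = pvariance(values)
    if variance == 0:
        raise ValueError("column shows no variation in this property")
    return abs(KYTE_DOOLITTLE[candidate.upper()] - mean(values)) / variance ** 0.5


if __name__ == "__main__":
    # Hypothetical column of hydrophobic residues from six homologues.
    column = "IIVLLV"
    for residue in ("I", "A", "D"):
        print(residue, round(impact_score(column, residue), 2))
```

In this toy column the hydrophobic residues already observed score low, while a charged residue such as aspartate lies many standard deviations from the column mean and would be flagged as potentially pathogenic.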

1.4.13 PAREPRO

PAREPRO (Prediction of amino acid replacement probability) is yet another SVM-based method for the prediction of pathogenic mutations, built on evolutionary information as well as 50 selected properties from the AAindex (Kawashima et al., 1999; Kawashima and Kanehisa, 2000) (http://www.mobioinfor.cn/parepro/; Tian et al., 2007). The method is available both as an online server and as standalone software. First, the position-specific amino acid probability (psap) score is calculated from the MSA, followed by residue differences (RD), mutation position information (MI) and information about the environment surrounding the mutation position (IE); these features are then combined as input to an SVM to build a model. PAREPRO thus appears to use more specific evolutionary information.

1.4.14 TopoSNP

TopoSNP is an online server for analyzing non-synonymous SNPs (nsSNPs) that can be mapped onto known 3D structures of proteins (http://gila.bioengr.uic.edu/snp/toposnp/; Stitziel et al., 2004). The tool provides an interactive structural visualization of nsSNPs as well as a classification of nsSNP sites into three categories based on their geometric location in the protein structure: sites in a surface pocket or an interior void, sites in a convex region or a shallow depressed region, and sites completely buried inside the protein structure. A conservation feature, viz., the relative entropy of SNP sites calculated from MSAs obtained from the Pfam database, is also incorporated into TopoSNP. Thus, by selecting an nsSNP site, TopoSNP can be used to visualize its assigned geometric class and relative entropy score.

1.4.15 SNPeffect

SNPeffect is an online resource for mapping the phenotypic effects of allelic variation
in human nsSNPs (http://snpeffect.vib.be/; Reumers et al., 2005). SNPeffect uses several
functional properties to predict the effect of nsSNPs on the molecular phenotype of
proteins. These properties fall into three groups: those affecting protein folding and
stability, those affecting functional and binding sites, and those affecting the cellular
processing of a protein. For folding and stability, SNPeffect evaluates the change in free
energy upon mutation as calculated by FoldX (Guerois et al., 2002) and the change in
protein aggregation and amyloidosis as predicted by TANGO (Fernandez-Escamilla et al.,
2004) and AmyScan (Lopez and Serrano, 2004; Lopez et al., 2002). For active and catalytic
sites, SNPeffect uses the Catalytic Site Atlas (CSA) (Porter et al., 2004), a database
documenting enzyme active sites and catalytic residues in enzymes for which
three-dimensional structures are available. PA-SUB (Lu et al.,


2004) and PSORT II (Horton and Nakai, 1997) are used in SNPeffect to predict the
cellular localization of proteins carrying nsSNPs.

1.4.16 SAAP

SAAP (Single Amino Acid Polymorphisms) (http://www.bioinf.org.uk/saap/db/; Cavallo
and Martin, 2005) first collects relevant data from dbSNP and HGVbase and then maps
the data onto the translated regions of the gene to determine whether the mutation lies
in a part of the gene that is translated to protein. If the mutation is in the coding part
(exon), SAAP checks whether it is non-synonymous. Data from Online Mendelian
Inheritance in Man (OMIM) and from locus-specific mutation databases (LSMDs) are also
incorporated into SAAP. Once a mutation has been mapped to a protein sequence, and if
a PDB structure of the corresponding protein is known, the mutant is mapped onto the
protein structure and a structural analysis is performed. The structural analysis checks
whether the mutation involves hydrogen-bonding residues, introduces steric clashes, lies
at an interface, or is directly involved in binding interactions with a ligand or partner
protein. In this way, SAAP provides a fully automatic implementation of nsSNP mapping
onto the known structure of the protein. SAAP is also available for download and local use.
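The coding-region and synonymous/non-synonymous checks described above can be illustrated with a minimal sketch. It is not SAAP's code: the function name `classify_coding_snp`, the 0-based position convention and the toy coding sequence are assumptions, and the input is assumed to be an already spliced coding sequence in the DNA alphabet (T, C, A, G).

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (third base fastest).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def classify_coding_snp(cds, position, alt_base):
    """Classify a single-base change at 0-based `position` of a coding sequence."""
    cds = cds.upper()
    alt_base = alt_base.upper()
    # Positions beyond the last complete codon are not translated.
    if not 0 <= position < len(cds) - len(cds) % 3:
        return "outside coding region"
    start = position - position % 3
    offset = position % 3
    ref_codon = cds[start:start + 3]
    alt_codon = ref_codon[:offset] + alt_base + ref_codon[offset + 1:]
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"
    # Report the change as <reference residue><codon number><new residue>.
    return f"non-synonymous ({ref_aa}{start // 3 + 1}{alt_aa})"

cds = "ATGGTGCACCTGACTCCTGAGGAG"  # toy coding sequence (8 codons)
print(classify_coding_snp(cds, 19, "T"))   # GAG -> GTG in codon 7
print(classify_coding_snp(cds, 11, "A"))   # CTG -> CTA in codon 4
print(classify_coding_snp(cds, 100, "A"))  # beyond the sequence
```

Only substitutions reported as non-synonymous would go on to the structural mapping step.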

1.4.17 Align-GVGD

Align-GVGD is a web-based program that combines the MSA and the biophysical
characteristics of amino acids to predict pathogenic mutations


(http://agvgd.iarc.fr/agvgd_input.php; Mathe et al., 2006; Tavtigian et al., 2006). For
each substitution, this software calculates two conservation scores based on the MSA of
the query protein, viz., Grantham variation (GV) and Grantham deviation (GD). GV
measures the degree of biochemical variation among the amino acids found at a given
position in the MSA, while GD reflects the biochemical distance of the mis-sense variant
from the amino acids observed at that position. Align-GVGD extends the original
Grantham difference, which is based on composition (C), polarity (P) and volume (V), to
MSAs, and combines the GV and GD scores to predict the disease-causing activity of each
mis-sense substitution. Each amino acid can be plotted on a 3D graph with C, P and V
as the three axes, with different weights applied to each axis. The amino acids observed
at a given position in the MSA then form a cloud of points that can be enclosed within a
box whose coordinates are defined by the minimum and maximum values of C, P and V
for the observed amino acids. If the substituted amino acid lies within the box, then
GD = 0; otherwise GD > 0. Thus, Align-GVGD measures the biochemical difference
between the mis-sense substitution and the amino acid variation observed at that
position in the MSA.
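The box construction can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the Align-GVGD implementation: the property values in `TOY_PROPERTIES` are placeholders rather than Grantham's published table, the axis weights are set to 1, GV is taken as the length of the box diagonal and GD as the distance from the substituted residue to the box.

```python
import math

# Hypothetical (composition, polarity, volume) values for a few residues.
# Placeholders for illustration only, not Grantham's published table.
TOY_PROPERTIES = {
    "A": (0.0, 8.0, 30.0),
    "G": (0.7, 9.0, 5.0),
    "S": (1.4, 9.0, 30.0),
    "W": (0.1, 5.0, 170.0),
}
# Hypothetical per-axis weights; the actual weights are not given in the text.
WEIGHTS = (1.0, 1.0, 1.0)

def bounding_box(observed):
    """Minimum and maximum of C, P and V over the residues seen in a column."""
    points = [TOY_PROPERTIES[aa] for aa in observed]
    lows = tuple(min(p[k] for p in points) for k in range(3))
    highs = tuple(max(p[k] for p in points) for k in range(3))
    return lows, highs

def gv(observed):
    """Variation at the position: weighted length of the box diagonal."""
    lows, highs = bounding_box(observed)
    return math.sqrt(sum(w * (hi - lo) ** 2
                         for w, lo, hi in zip(WEIGHTS, lows, highs)))

def gd(observed, substitution):
    """Deviation: weighted distance from the substituted residue to the box."""
    lows, highs = bounding_box(observed)
    total = 0.0
    for w, x, lo, hi in zip(WEIGHTS, TOY_PROPERTIES[substitution], lows, highs):
        outside = max(lo - x, 0.0, x - hi)  # 0 when x lies within [lo, hi]
        total += w * outside ** 2
    return math.sqrt(total)

column = "AGS"  # residues observed at one MSA position (toy example)
print("GV:", round(gv(column), 2))
print("GD for S:", gd(column, "S"))            # inside the box
print("GD for W:", round(gd(column, "W"), 2))  # outside the box
```

A substitution to a residue already observed in the column falls inside the box and has GD = 0, whereas a residue far from the observed range receives a large GD.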

As discussed already, a mis-sense mutation can be classified as pathogenic (disease) or
neutral (benign) based on the distribution pattern of several sequence- and
structure-based features of the mutation. Classifying mis-sense mutations as pathogenic
or neutral is therefore essentially a binary classification problem in a multi-dimensional
feature space, and hence it is not surprising to find several well


known methods adopting machine learning classifiers. In fact, there has been a growing
trend among computational biologists to adopt machine learning classifiers in an attempt
to improve the accuracy of their prediction methods. Excellent reviews and books
describing the underlying theory are available (McCulloch and Pitts, 1943; Haykin, 1994;
Michalski and Tecuci, 1994; Mitchell, 1997; Vapnik, 1998; Strasser and Weber, 1999;
Breiman, 2001; Cruz and Wishart, 2006). However, for the purpose of continuity, I give
in this chapter some essential aspects of machine learning classifiers, with special
emphasis on the support vector machine (SVM) used in the present work.

1.5 What are Machine Learning Classifiers?

Machine learning classifiers form a branch of artificial intelligence and incorporate a
variety of probabilistic, statistical and optimization techniques that allow the computer
to first “train” on past examples and then use that training to classify new data or
identify new patterns in large, noisy or complex data sets. Several classifiers are
available for solving classification problems:
(a) Support Vector Machine, (b) Random Forest, (c) Neural Network, (d) Decision Tree
using recursive partitioning, (e) Conditional Inference Trees, (f) Naïve Bayesian Classifier,
(g) Bootstrap Aggregating (bagging) and (h) Ensemble of Random Forest & Bagging.


1.5.1 Support Vector Machine (SVM)

Support Vector Machine (SVM) is a supervised machine-learning method whose
mathematical framework was first developed by Vapnik (1995). It is based on the
concept of decision hyperplanes that define decision boundaries. Its primary task is to
classify objects into two classes, and it is extensively used for classification and
regression problems (Larranaga et al., 2006; Vapnik, 1995; Yang, 2004).

To classify objects, SVM plots the given values as points (known as vectors) in space and
differentiates between members and non-members of a defined class by drawing a
maximum-margin hyperplane between them. The vectors nearest the hyperplane are the
support vectors (Figure 1.3). The objective of SVM modeling is to find the optimal
hyperplane in a multidimensional space that separates clusters of vectors into two
classes. The point corresponding to a query object is plotted in the same space and,
depending on its position relative to the hyperplane, is assigned one of the two class
labels. The larger the margin (the distance of the nearest points from the hyperplane),
the better the generalization of the SVM classifier.
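For the linear case, the geometry described above can be sketched as follows. The hyperplane parameters `w` and `b` and the example points are hypothetical values chosen for illustration; they are not the result of training an SVM.

```python
import math

def signed_distance(w, b, x):
    """Signed geometric distance of point x from the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm

def classify(w, b, x):
    """+1 (member) on the positive side of the hyperplane, -1 otherwise."""
    return 1 if signed_distance(w, b, x) >= 0 else -1

def margin_width(w):
    """Distance between the planes w.x + b = +1 and w.x + b = -1."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

w, b = (1.0, 1.0), -3.0  # hypothetical hyperplane x1 + x2 - 3 = 0
stars = [(3.0, 2.0), (4.0, 3.0)]
crosses = [(1.0, 0.0), (0.0, 1.0)]
for point in stars + crosses:
    print(point, classify(w, b, point), round(signed_distance(w, b, point), 3))
print("margin width:", round(margin_width(w), 3))
```

Because the margin width equals 2/||w||, maximizing the margin is equivalent to minimizing the norm of w.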

Figure 1.3: Simple representation of SVM where the decision hyperplane separates the two
given objects: stars from crosses. The black ones in both crosses and stars represent
support vectors.

Additionally, SVM can deal with errors in the training dataset by adding a
“soft margin” (Cortes & Vapnik, 1995) in order to avoid misclassification of unknown
datasets. This soft margin is introduced by allowing some data points to push their way
through the margin of the separating hyperplane without affecting the final result.
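In the standard C-SVM formulation, the soft margin is commonly written with slack variables. The notation here (weight vector w, offset b, slack variables ξ_i, feature map φ and n training examples) is the conventional textbook one and is an assumption as far as this chapter is concerned:

$$\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{n} \xi_i \quad \text{subject to} \quad y_i \left( w^{T} \phi(x_i) + b \right) \ge 1 - \xi_i, \quad \xi_i \ge 0$$

A point with ξ_i = 0 lies on or outside the margin, a point with 0 < ξ_i ≤ 1 lies inside the margin but on the correct side of the hyperplane, and a point with ξ_i > 1 is misclassified. The cost parameter C controls how strongly such violations are penalized.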

For a query vector x, the prediction is based on the formula:

$$f(x) = \operatorname{sign}\left( \sum_{i} y_i \alpha_i K(x_i, x) + b \right) \tag{1.1}$$

where f(x) is the decision function, K(., .) is the kernel function, α_i is the weight
assigned to the training feature vector x_i and y_i is the corresponding label (+1: member,
−1: non-member). In equation 1.1, b is chosen so that y_i f(x_i) = 1 (evaluated before
taking the sign) for any i with 0 < α_i < C, where C is a cost parameter.
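Equation 1.1 can be evaluated directly once the support vectors, their labels and their weights are known. The sketch below uses a plain dot product as the kernel and a hand-made two-point model; the support vectors, the α values and b are hypothetical and were not obtained from any SVM package.

```python
def dot_kernel(x, y):
    """Simplest possible kernel: the inner product of the two vectors."""
    return sum(a * b for a, b in zip(x, y))

def decision_value(x, support_vectors, labels, alphas, b, kernel=dot_kernel):
    """sum_i y_i * alpha_i * K(x_i, x) + b, the argument of sign() in Eq. 1.1."""
    return sum(y * a * kernel(sv, x)
               for sv, y, a in zip(support_vectors, labels, alphas)) + b

def predict(x, support_vectors, labels, alphas, b, kernel=dot_kernel):
    """Class label: +1 (member) or -1 (non-member)."""
    value = decision_value(x, support_vectors, labels, alphas, b, kernel)
    return 1 if value >= 0 else -1

# Hypothetical model with one support vector per class.
support_vectors = [(1.0, 1.0), (-1.0, -1.0)]
labels = [1, -1]
alphas = [0.25, 0.25]
b = 0.0

for query in [(2.0, 0.5), (-3.0, 1.0)]:
    print(query, predict(query, support_vectors, labels, alphas, b))
```

In this toy model each support vector x_i gives y_i times the decision value equal to 1, which matches the condition used to choose b.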

Various popular kernel functions are available; they are as follows:

i) Linear kernel: This kernel can be represented as

$$K(x, y) = x^{T} y \tag{1.2}$$

ii) Polynomial:

$$K(x, y) = \langle x, y \rangle^{d} \tag{1.3}$$

where ⟨x, y⟩ = x^T y is the inner product, d is the degree of the polynomial and K is
the kernel function.

iii) Sigmoid: This kernel is expressed as

$$K(x, y) = \tanh\left( \kappa \langle x, y \rangle + \theta \right) \tag{1.4}$$

where κ and θ are parameters called gain and threshold, respectively,
with κ > 0 and θ < 0.


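iv) Inhomogeneous Polynomial:

$$K(x, y) = \left( \langle x, y \rangle + c \right)^{d} \tag{1.5}$$

where d is the degree of the polynomial and c is a constant offset.

v) Radial Basis Function (RBF):

$$K(x, y) = \exp\left( -\frac{\lVert x - y \rVert^{2}}{2\sigma^{2}} \right) \tag{1.6}$$

where σ is the width parameter and σ > 0.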
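The five kernels of equations 1.2 to 1.6 translate directly into code. The following is a minimal sketch in plain Python; the function names and the default parameter values are arbitrary choices made for illustration.

```python
import math

def dot(x, y):
    """Inner product <x, y> of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))

def linear_kernel(x, y):
    return dot(x, y)                               # Eq. 1.2

def polynomial_kernel(x, y, d=3):
    return dot(x, y) ** d                          # Eq. 1.3

def sigmoid_kernel(x, y, kappa=1.0, theta=-1.0):
    return math.tanh(kappa * dot(x, y) + theta)    # Eq. 1.4

def inhomogeneous_polynomial_kernel(x, y, d=3, c=1.0):
    return (dot(x, y) + c) ** d                    # Eq. 1.5

def rbf_kernel(x, y, sigma=1.0):
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))  # Eq. 1.6

x, y = (1.0, 2.0), (3.0, 4.0)
print("linear:", linear_kernel(x, y))
print("polynomial (d=2):", polynomial_kernel(x, y, d=2))
print("inhomogeneous polynomial (d=2, c=1):",
      inhomogeneous_polynomial_kernel(x, y, d=2, c=1.0))
print("sigmoid:", sigmoid_kernel(x, y))
print("RBF (sigma=2):", rbf_kernel(x, y, sigma=2.0))
```

LIBSVM writes the RBF kernel as exp(−γ‖x − y‖²), so its gamma parameter corresponds to 1/(2σ²) in the notation of equation 1.6.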

Among the different kernels, the RBF kernel is the most commonly used in biological
problems (Hsu et al., 2009), for the following reasons:

i) The RBF kernel nonlinearly maps data points into a higher-dimensional space
and therefore, unlike the linear kernel, can handle cases where there
is a non-linear relationship between class labels and attributes.
A Gaussian kernel defined on a domain of infinite cardinality
produces a feature space of infinite dimension and allows
smooth and simple estimates. Further, it has been shown
that the linear kernel is a special case of the RBF kernel (Keerthi and Lin,
2003).


ii) A simple linear transformation in the output units can be fully
optimized using linear modeling in RBF, which allows the data to be trained
more quickly than with other kernels.

iii) RBF has a gamma parameter, which makes optimization in SVM easier
than with other kernels.

iv) The number of hyper-parameters, which influences the complexity of
model selection, is larger for the polynomial kernel than for the RBF kernel.

1.5.2 SVM Applications

Support vector machines have been successfully applied to a number of biological
problems. In contrast to the heuristic approach of many other machine learning
classifiers, SVMs evolved from sound theory to implementation and experiments. The
SVM solution is global and unique, and does not depend on the dimensionality of the
input space. SVMs do not need to control model complexity by keeping the number of
features small and, most importantly, are less prone to overfitting than other machine
learning algorithms. The effectiveness of SVM in overcoming these shortcomings and
its superior generalization performance make the method very promising (Vert, 2002;
Noble, 2004; Yang, 2004). SVM has been shown to work well for many biological
analyses, including prediction of pathogenic mutations (Krishnan and Westhead, 2003;
Bao and Cui, 2005; Yue et al., 2005; Yue and Moult, 2006; Tian et al., 2007), prediction


of protein subcellular locations (Park and Kanehisa, 2003), classification of proteins and
their functions (Cai et al., 2003), prediction of membrane protein types (Cai et al.,
2004), classification of G-protein-coupled receptors (GPCRs) (Karchin et al., 2002),
classification of nuclear receptors (Bhasin and Raghava, 2004), protein fold-recognition
(Ding and Dubchak, 2001; Shamim et al., 2007), prediction of RNA binding proteins (Han
et al., 2004), prediction of rRNA-, RNA- and DNA-binding proteins from sequence (Cai et
al., 2003), prediction of phosphorylation sites (Kim et al., 2004), prediction of T-cell
epitopes (Bhasin and Raghava, 2004; Zhao et al., 2003), prediction of regulatory
networks (Qian et al., 2003), analysis of microarray gene expression data (Brown et al.,
2000), protein-protein interaction prediction (Bock and Gough, 2001; Koike and Takagi,
2004; Yellaboina et al., 2007), etc.

1.5.3 SVM Software

Many SVM software packages are available, viz., SVMlight (Joachims, 1999), SVMstruct
(Joachims, 1999), Sequential Minimal Optimization (SMO) (Platt, 1999) and LIBSVM
(Library for Support Vector Machines) (Chang and Lin, 2001). However, some of these
packages are either quite complicated or not suitable for large problems. SVMlight
allows the user to choose the size n of the working set in the range 2 to 100, so a
sub-problem has to be optimized in each iteration step, which makes the task
complicated. Sequential Minimal Optimization (SMO), proposed by Platt (1998), uses a
two-loop heuristic to select the working set and restricts the


working-set size n to 2. Therefore, no separate optimization step is needed in SMO.
However, SMO as originally proposed still has the limitation that it might not work for
very large problems (Keerthi et al., 2001). LIBSVM was proposed to address this: it
combines both approaches by restricting the working-set size to 2, as in SMO, while
following the strategy used in SVMlight. Chang and Lin (2001) reported better
performance and stability of LIBSVM on benchmark datasets compared with the two
other SVM packages.

LIBSVM is available at http://www.csie.ntu.tw/~cjlin/libsvm and can be downloaded
to the local system as the libsvm-2.88 package and run. LIBSVM has many features
that users can easily incorporate into their own programs. Different formulations are
implemented in LIBSVM, viz., C-support vector classification (C-SVC), ν-support vector
classification (ν-SVC), one-class SVM, ε-support vector regression (ε-SVR) and
ν-support vector regression (ν-SVR). In addition to the class label, LIBSVM also provides
decision/probability values to the users. LIBSVM has the additional advantages of
providing cross-validation for model selection as well as weighted SVM for unbalanced
data. For these reasons, I have used LIBSVM as one of the software packages in this
thesis.
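LIBSVM's command-line tools read training data in a sparse text format with one example per line, written as a label followed by `index:value` pairs with 1-based feature indices in ascending order. The helper below sketches how feature vectors could be written in that format; the function name `to_libsvm_line` and the feature values are hypothetical and are not taken from this thesis.

```python
def to_libsvm_line(label, features):
    """Format one example as '<label> <index>:<value> ...' (1-based indices).

    Zero-valued features are omitted, which the sparse format permits.
    """
    parts = [f"{label:+d}"]
    for index, value in enumerate(features, start=1):
        if value != 0:
            parts.append(f"{index}:{value:g}")
    return " ".join(parts)

# Toy data: +1 for a pathogenic mutation, -1 for a neutral one (made-up values).
examples = [
    (1, [0.5, 0.0, 2.0]),
    (-1, [0.0, 0.0, 1.25]),
]
lines = [to_libsvm_line(label, features) for label, features in examples]
print("\n".join(lines))
```

A file of such lines can be given to the `svm-train` program. The LIBSVM README documents options such as `-t` (kernel type), `-c` (cost), `-g` (gamma), `-v` (n-fold cross-validation), `-b` (probability estimates) and `-wi` (per-class weight), which correspond to the features mentioned above.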

1.6 R package

As part of the GNU project, the freely available R environment is widely used for
statistical computing and graphics. The R package mechanism is highly flexible:
developers can submit packages for specific functions or interests, which are then easily

available to users. R also produces high-quality graphs. R is available at
www.r-project.org and can be downloaded to the local system and run. R provides a
number of tools for installing and using packages; on Linux, for example, R CMD INSTALL
installs a package of interest. I have used different machine learning classifiers
available as R packages. For SVM prediction of pathogenic mutations, I have also used
the SVM implementation in R provided by the e1071 package.

1.7 The Scope and aim of present study

The review presented in this chapter highlighted the importance of nsSNPs in altering
protein function and their disease-causing status. Methods are available to classify
human nsSNPs/mutations as benign or pathogenic; they use attributes ranging from
sequence-based and evolutionary-based features to combinations of structural and
evolutionary information, together with a variety of machine learning techniques
including linear logistic regression, decision trees, random forest, neural networks,
neuro-fuzzy classifiers, Bayesian classifiers, ridge partial least squares and support
vector machines. Despite their availability, it is still considered highly important to
develop new methods with higher prediction accuracy. This thesis presents the details of
our investigations aimed at developing a new method with higher accuracy than the
existing methods. As will be seen in the following chapters, I have identified new
discriminatory features and used them in an SVM-based

discrimination of mutations as disease or benign. The newly developed SVM-based
method outperforms the other methods in its prediction accuracies.

1.8 Summary

In this chapter, a brief review of nsSNPs has been presented, including a
comprehensive account of the organization of the databases and the different methods
used for discrimination of nsSNPs. Further, an overview of the existing methods used for
predicting the pathogenic effect of mis-sense mutations was presented. The chapter also
briefly describes one of the most extensively used machine learning classifiers (SVM),
which is used in the prediction of pathogenic mutations. The next chapter describes the
investigation of sequence-structure features and their distributions, for use in the
machine learning classifier, viz., SVM, for the prediction of pathogenic mutations.
