Chapter 1 Introductory Review
1.1 Introduction
The knowledge of the complete human genome sequence has unfolded the
mysteries of human genome variation, which in turn has allowed a
mechanism-based approach to understanding the relationship of
genotype with disease. This understanding is considered an essential precursor to
the development of personalized medicine. With rapid advances in high-throughput
genotyping and next-generation sequencing technologies, a large amount of genetic
variation has been discovered, and it takes many forms. The simplest type of
variant results from a single-base mutation that substitutes one nucleotide for another;
this is the most common form of variation and is referred to as a single
nucleotide polymorphism (SNP). Many other forms of variation result from the
insertion or deletion of one or more nucleotides, so-called insertion/deletion (INDEL)
polymorphisms. The most common insertion/deletion events occur in repetitive
sequence elements, consisting of variable-length sequence motifs that are repeated in
tandem in a variable copy number, so-called variable number tandem repeat
polymorphisms (VNTRs). VNTRs can be further divided on the basis of the size of the
tandem repeat unit into microsatellites (or simple sequence repeats, SSRs) and
minisatellites. Microsatellites consist of repeat motifs of one to six bases, whereas direct
tandem repeats with motifs of 10-30 base pairs are called minisatellites (Jeffreys et
al., 1985). The rarest insertion/deletion events involve deletions or duplications of
regions that can range from a few kilobases to several megabases. A few other types of
repeats have also been observed in genomes. These include palindromic sequences,
inverted repeats, and mirror repeats (Cox and Mirkin, 1997).
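The definition of a microsatellite given above (a motif of one to six bases repeated in tandem) can be illustrated with a short Python sketch. This is only a minimal illustration and not a method used in this chapter: the function name find_microsatellites and the min_copies threshold are hypothetical choices, and dedicated repeat-finding tools handle imperfect repeats that this exact-match search ignores.

```python
import re

def find_microsatellites(seq, min_copies=3):
    """Return (start, motif, copies) for perfect tandem repeats of a 1-6 base motif.

    min_copies is a hypothetical cut-off chosen for illustration only.
    """
    # Lazy {1,6}? makes the regex report the shortest repeating unit,
    # so "CACACACA" is reported as motif "CA", not "CACA".
    pattern = re.compile(r"([ACGT]{1,6}?)\1{%d,}" % (min_copies - 1))
    hits = []
    for match in pattern.finditer(seq.upper()):
        motif = match.group(1)
        copies = len(match.group(0)) // len(motif)
        hits.append((match.start(), motif, copies))
    return hits

# A dinucleotide (CA) repeat embedded in flanking sequence.
print(find_microsatellites("GGATCACACACACATTG"))
```

Because the search only accepts exact copies of the motif, an interrupted repeat is reported as separate shorter hits or not at all.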
This huge quantity and variety of genetic variation in the human genome led many
to question the origin and maintenance of such a genetic load in the human population.
Kimura (1983) formulated the neutral theory of evolution, which proposes that
most sequence variation has no significant phenotypic
consequence and so is not subject to natural selection; the
majority of mutations are therefore likely to be phenotypically neutral. However, a certain
number of as yet undefined alleles either cause disease directly (referred to as mutations) or
increase susceptibility to disease (polymorphisms). Bioinformatics analysis of human
sequences provides an opportunity to identify the most common form of genetic
variation, SNPs, by comparing two sequences, viz. coding DNAs, expressed sequence
tags (ESTs) or genomic sequences. The discovery of SNPs that affect biological function has
become increasingly important, and the databases available for SNPs are
discussed in this chapter.
1.2 Single Nucleotide Polymorphisms (SNPs)
SNPs are single-nucleotide changes in DNA that account for approximately
90% of the genetic variation among individuals in a population (Collins et al., 1998). A SNP
is a nucleotide change that is prevalent in at least 1% of the population (Figure 1.1).
There are two categories of SNPs, as given below:
1. Linked SNPs: SNPs that do not reside within genes and do not affect protein
function. These are also referred to as indicative SNPs, which are associated with response
to drugs or with the risk of developing a certain disease.
2. Causative SNPs: SNPs that affect protein structure and/or function and
cause disease. These can be divided into two categories:
(a) Coding SNPs: SNPs that are found in the coding regions of genes and affect
protein function. Coding SNPs can in turn be divided into two types:
(i) Non-synonymous SNPs (nsSNPs): the changed nucleotide
leads to a change in the amino acid.
(ii) Synonymous SNPs (sSNPs): the changed nucleotide does
not lead to a change in the amino acid.
(b) Non-coding SNPs: the nucleotide change is located within the regulatory
parts of genes and is correlated with changes in the corresponding mRNA
expression.
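The distinction between synonymous and non-synonymous coding SNPs can be made concrete with a small Python sketch. It assumes the standard genetic code and a single-base change within one codon; the function name classify_coding_snp and its arguments are hypothetical and introduced only for illustration. A change to a stop codon is reported here as non-synonymous, which is a simplification.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ("*" marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def classify_coding_snp(codon, pos, alt):
    """Classify a single-base change at index pos (0, 1 or 2) of a codon.

    Returns (kind, reference amino acid, variant amino acid).
    """
    codon = codon.upper()
    mutant = codon[:pos] + alt.upper() + codon[pos + 1:]
    ref_aa, alt_aa = CODON_TABLE[codon], CODON_TABLE[mutant]
    kind = "synonymous" if ref_aa == alt_aa else "non-synonymous"
    return kind, ref_aa, alt_aa

# A C-to-T change at the third codon position leaves alanine unchanged ...
print(classify_coding_snp("GCC", 2, "T"))
# ... whereas a C-to-T change at the first position replaces proline with serine.
print(classify_coding_snp("CCG", 0, "T"))
```

The same nucleotide substitution (C to T) is synonymous or non-synonymous depending only on its position within the codon, which is why the classification needs the reading frame as well as the changed base.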
Figure 1.1: A single nucleotide polymorphism (SNP), in which a single nucleotide (A, C, T
or G) in the DNA sequence is altered. Here, C is changed to T.
The non-synonymous SNPs (nsSNPs) that lead to amino acid changes in protein products
are likely to affect protein structure and function, depending on the type of amino
acid change as well as its site (Cargill et al., 1999; Stenson et al., 2003;
Thorisson and Stein, 2003; Ng and Henikoff, 2006). Some amino acid changes are
tolerated by proteins with no concomitant phenotypic effect, and the corresponding
nsSNPs are referred to as benign or neutral nsSNPs. Those leading to amino acid
changes that are not tolerated by protein structure and function, and that in turn lead to
disease phenotypes, are referred to as pathogenic or disease mutants (Saunders and
Baker, 2002; Bao and Cui, 2005; Yue and Moult, 2006).
Although comparative genetic analyses of healthy and diseased individuals have led to
the discovery of a number of mis-sense mutations/nsSNPs associated with diseases, the
list may be far from complete, as the uncharacterized mutations/nsSNPs
discovered by the Human Genome Project outnumber the characterized
ones. In this post-genomic era, classification of nsSNPs as disease-causing or
neutral has therefore been perceived as the first step before any further study, such as
pharmacogenomics, is attempted, and a variety of computational methods have been devised
for this purpose (Mooney, 2005; Ng and Henikoff, 2006; Thusberg and Vihinen, 2009).
Before turning to these methods, the databases hosting information on mutations/SNPs are described.
In addition to serving as information resources, these databases
have also provided datasets for benchmark studies of the computational methods
developed for the prediction of pathogenic mutations.
1.3 Databases
The databases include dbSNP (Sherry et al., 1999), the Human Genome Variation
Database (HGVbase) (Fredman et al., 2004), On-line Mendelian Inheritance in Man
(OMIM) (Hamosh et al., 2005) and the Human Gene Mutation Database (HGMD) (Stenson et
al., 2003), among others. The details of these databases are given below:
1.3.1 The Single Nucleotide Polymorphism Database (dbSNP)
dbSNP is a free, public-domain archive for a broad collection of simple genetic polymorphisms
across different organisms. This database has been developed and is maintained by
the National Center for Biotechnology Information (NCBI) in collaboration with the
National Human Genome Research Institute (NHGRI) and is available at
http://www.ncbi.nlm.nih.gov/SNP/. It was created in 1998 (Sherry et al.,
1999) to supplement GenBank, NCBI’s public collection of
protein and nucleotide sequences. In addition to SNPs, dbSNP contains a range of
other molecular variation: (1) deletion and insertion polymorphisms (DIPs/indels) and
(2) microsatellite repeat variations or short tandem repeats (STRs). Each dbSNP entry
includes the sequence context of the polymorphism (i.e., the surrounding sequence),
the occurrence frequency of the polymorphism (by population or individual), and the
experimental method(s), protocols, and conditions used to assay the variation.
dbSNP can be searched using the Entrez SNP tool with queries such as a refSNP number ID,
a gene name, an allele or a build number; the search returns summarized information for the
matching SNPs. It has been reported that this database contains some false-positive
entries due to genotyping and base-calling errors (Reich et al., 2003; Mitchell et al.,
2004; Musumeci et al., 2010).
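As a rough illustration of a programmatic dbSNP query, the Python sketch below only builds a search URL and does not contact the server. It assumes that NCBI's Entrez E-utilities ESearch endpoint is used with the SNP database selected through db=snp; this route is not described in the sources cited above, and the function name build_dbsnp_search_url, the retmax default and the example identifier are hypothetical choices made for illustration.

```python
from urllib.parse import urlencode

# Assumed endpoint: the ESearch service of NCBI's Entrez E-utilities.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_dbsnp_search_url(term, retmax=20):
    """Build (but do not send) an ESearch URL that searches dbSNP for `term`.

    `term` is passed through as free text, e.g. a refSNP ID or a gene name.
    """
    params = {"db": "snp", "term": term, "retmax": retmax, "retmode": "json"}
    return EUTILS_ESEARCH + "?" + urlencode(params)

# Example with a refSNP identifier as the query term.
print(build_dbsnp_search_url("rs334"))
```

The returned URL can be opened in a browser or fetched with any HTTP client; the response lists the identifiers of matching records, which then have to be retrieved in a second request.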
1.3.2 Human Genome Variation Database (HGVbase)
The Human Genome Variation database, HGVbase, previously known as HGbase
(http://hgvbase.cgb.ki.se/; Fredman et al., 2004), is a highly curated and non-redundant
database of available genomic variation data of all types, mostly comprising single
nucleotide polymorphisms (SNPs). HGVbase is supported by a
European consortium comprising teams at the Karolinska Institute, Sweden, the
European Bioinformatics Institute, United Kingdom (UK), and the European Molecular
Biology Laboratory, Germany.
This database can be regarded as a manually curated extension of dbSNP, in which the
HGVbase curators provide a more extensively validated SNP data set by filtering out
SNPs in repeat and low-complexity regions and by identifying SNPs for which a
genotyping assay can successfully be designed. HGVbase includes polymorphisms as
well as variations with rare or single-occurrence alleles, and disease-related and
disease-causing clinical mutations.
1.3.3 The Human Gene Mutation Database (HGMD)
The Human Gene Mutation Database (HGMD) constitutes a comprehensive core
collection of data on germ-line mutations responsible for human inherited disease
(http://www.hgmd.org/; Stenson et al., 2003). HGMD was first made publicly
available in April 1996 and has been available to users on a commercial basis since the
collaboration between HGMD and BIOBASE GmbH in 2006. The scope of HGMD is
limited to mutations, which include single base-pair substitutions in coding,
regulatory and splicing-relevant regions, insertions/deletions (indels), duplications and
triplet repeat expansions.
1.3.4 On-line Mendelian Inheritance in Man (OMIM)
OMIM is an on-line database (http://www.omim.org; Hamosh et al., 2000) that
catalogues all human genes and their associated mutations, based on the long-running
catalogue Mendelian Inheritance in Man (MIM), started in 1967 by Victor A.
McKusick at Johns Hopkins. The database became available on the NCBI web site in 1995.
OMIM is an excellent resource for background information about the biology of
genes and their related diseases.
1.3.5 The UniProt/SwissProt Database
UniProtKB/Swiss-Prot (http://expasy.org/; Bairoch and Apweiler, 1996) is a highly
curated, manually annotated, non-redundant protein sequence database. This
database was created in 1986 by Amos Bairoch at the Swiss Institute of Bioinformatics and is
maintained collaboratively by the Department of Medical Biochemistry of the University
of Geneva and the EMBL Data Library. The objective of UniProtKB/Swiss-Prot is to
provide all known relevant information about a particular protein. Information
about variants is listed as disease/polymorphism for each protein sequence
entry. An additional advantage of UniProt/SwissProt is that it is well integrated with the
OMIM, dbSNP and NCBI database family: whenever new variants are added to
those databases, they also become available in the UniProt/SwissProt database.
1.4 Computational analysis of effects of nsSNPs
As mentioned earlier, there are a large number of mis-sense mutations whose
phenotypic effects have not been discovered. Hence, methods to accurately predict the
effect of mis-sense mutations have always been in demand. Several methods have been
developed and have been reviewed elsewhere (Mooney, 2005; Ng and Henikoff, 2006;
Thusberg and Vihinen, 2009). The basic approach adopted by all these methods involves the
use of sequence information, structural information or both for the proteins harboring the
nsSNPs, with the underlying idea that mis-sense mutations that alter protein structure
and function are likely to be pathogenic and those that do not are likely to be
neutral/benign (Figure 1.2). In other words, the phenotypic effect of a mis-sense
mutation is judged by its effect at the protein level. In order to predict whether a given
mis-sense mutation is pathogenic or neutral, various features at the mutation site are
considered, including evolutionary conservation (Miller et al., 2001), solvent
accessibility and secondary structure (Sunyaev et al., 2000). In addition, the effect of the
mutation on protein stability is also considered in some studies (Wang and Moult,
2001).
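Evolutionary conservation at a mutation site can be quantified in several ways. The Python sketch below uses one common and simple choice, a Shannon-entropy score computed over a single column of a multiple sequence alignment. It is an assumption made for illustration and is not the measure used by Miller et al. (2001) or by any specific method discussed here; the function name column_conservation is hypothetical.

```python
import math
from collections import Counter

def column_conservation(column):
    """Return 1 - normalized Shannon entropy for one alignment column.

    `column` is a string holding the residue of each aligned sequence at one
    position; gap characters ("-" and ".") are ignored. The score is 1.0 for
    an invariant position and approaches 0.0 when all 20 amino acids are
    equally frequent.
    """
    residues = [r for r in column.upper() if r not in "-."]
    if not residues:
        return 0.0  # no information available at an all-gap column
    counts = Counter(residues)
    n = len(residues)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return 1.0 - entropy / math.log2(20)

print(column_conservation("LLLLLL"))   # invariant position
print(column_conservation("LIVMFA"))   # variable position
```

A substitution at a position with a score near 1.0 would be treated as more likely to be deleterious than one at a variable position, in line with the reasoning described above.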
Several studies have mapped mis-sense mutations onto their respective proteins and
examined their protein sequence and structural contexts (Sunyaev et al., 2000; Burke et
al., 2007; Yue et al., 2006; Adzhubei et al., 2010). Wang and Moult (2001) showed that
83% of disease-causing mutations affect protein stability. Using both structure
and sequence information, Sunyaev et al. (2000) showed that 70% of disease-causing
mutations affect structurally and functionally important sites, such as
buried sites, active sites or sites involved in disulphide bonds. Gong and Blundell (2010)
examined the distribution of amino acid variants by mapping them onto 3D structures, where
available, and reported that disease-related variants occur much more frequently than
polymorphic variants at solvent-inaccessible regions and at amino acid residues involved
in hydrogen bond formation.
However, the coverage of prediction methods that use protein structure is only 14% (Yue
and Moult, 2005), compared with 81% for sequence-based methods
(Ramensky et al., 2002). For sequence-based prediction methods, the first step is to select
the homologous sequences, manually or automatically. Since the amino acids occurring
in the alignments form the fundamental basis of a sequence-based prediction method,
the alignments and the number of sequences used are the central part in the prediction
Figure 1.2: The basic approach for amino acid substitution prediction using either a
sequence-based or a structure-based method.
[Flowchart. Input: a protein sequence or structure, together with the amino acid.
Structure branch: structural features such as crystallographic B factor, solvent
accessibility, ligand binding site and 3D structure environment.
Sequence branch: sequence-based features, including conservation score, position-specific
evolutionary score derived from the MSA, physicochemical properties and the amino acid
substitution matrix.
Both branches: apply scoring rules for prediction. Output: PATHOGENIC or BENIGN.]
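The flow in Figure 1.2 (collect features at the mutation site, apply scoring rules, output a label) can be sketched in Python as follows. Everything specific in this sketch is a hypothetical choice made for illustration and does not reproduce any published predictor: the coarse physicochemical grouping, the three rules, the cut-offs of 0.8 and 0.2, and the names SiteFeatures and predict_effect.

```python
from dataclasses import dataclass

# Coarse physicochemical classes; this grouping is an illustrative assumption.
_GROUPS = {
    "hydrophobic": "AVLIMFWC",
    "polar": "STNQYGP",
    "positive": "KRH",
    "negative": "DE",
}
AA_CLASS = {aa: name for name, residues in _GROUPS.items() for aa in residues}

@dataclass
class SiteFeatures:
    conservation: float       # 0 (variable) to 1 (invariant), e.g. from an MSA
    rel_accessibility: float  # 0 (buried) to 1 (exposed), from a 3D structure

def predict_effect(ref_aa, alt_aa, site, threshold=2):
    """Apply three simple rules and label the substitution.

    Each satisfied rule adds one point; cut-offs and threshold are hypothetical.
    """
    score = 0
    if site.conservation >= 0.8:             # highly conserved position
        score += 1
    if site.rel_accessibility <= 0.2:        # buried position
        score += 1
    if AA_CLASS[ref_aa] != AA_CLASS[alt_aa]: # change of physicochemical class
        score += 1
    return "PATHOGENIC" if score >= threshold else "BENIGN"

print(predict_effect("E", "V", SiteFeatures(conservation=0.95, rel_accessibility=0.1)))
print(predict_effect("L", "I", SiteFeatures(conservation=0.30, rel_accessibility=0.9)))
```

Real predictors replace these hand-set rules with calibrated scores or trained classifiers, but the overall structure of features in and a pathogenic/benign label out is the same.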
Table 1.1: Available amino acid substitution prediction methods
Methods | Algorithm used | Conservation analysis | Structural features
SIFT | Scores calculated using Dirichlet mixtures | Sequence homology | -----
PolyPhen | Empirical rules | Position Specific Independent