Bioinformatics and its a - NISCAIR


MOLECULAR BIOLOGY

Bioinformatics and its applications

Vidya Kothekar Director

Dr. D.Y. Patil Biotechnology & Bioinformatics Institute Akurdi

Pune – 411044 7-Feb-2006 (Revised 27-Feb-2007)

CONTENTS

Introduction
Scope of bioinformatics
Biological databases
Sequence analysis
Protein primary sequence analysis
DNA sequence analysis
Sequence comparison methods
Scoring matrices: Dot matrix, PAM and BLOSUM
Pairwise and multiple sequence analysis methods
Database analysis using BLAST and FASTA
Structural bioinformatics
Small molecular structural database CCSD
Protein structural database PDB
Introduction to molecular modeling
Modeling small molecules
Modeling biopolymers
Modeling DNA
Modelling proteins
Comparative modeling of proteins
Geometry optimization using molecular mechanics and dynamics
Application of bioinformatics to drug designing and chemo-informatics
Other databases

Key words Bioinformatics, Biological data, Databases, Data analysis, Structural bioinformatics, Molecular modeling, Drug designing, Chemoinformatics

Introduction

The rapid expansion of the Internet, its use for depositing and accessing large amounts of biological data, and the explosion in genomic information have created an unprecedented need for scientists with a deep working knowledge of both the biological sciences and computational methods. Bioinformatics has provided a strong tool for advancing research and development in Biotechnology, which is a multibillion dollar industry. It is important for reducing the cost and time of developing new products such as drugs, vaccines, plants with specific properties and resistance to pests and diseases, new protein molecules, biological materials and diagnostics, as well as new processes. This matters to the Biotech industry because, after the WTO act, both products and processes are protected by Intellectual Property Rights (IPR). As full genome sequences, data from microarrays and proteomics, and specific data at the taxonomic level become available, integration of these databases becomes a necessity. It can be done only with sophisticated Bioinformatics tools. Organizing these data into suitable databases and developing software tools are major challenges in which India can participate. India has an advantage over other countries in this respect because of its large trained manpower with strengths in Chemistry, Physics, Mathematics, Computer Science, Software development, Health care and Biological sciences. Although the current market for Bioinformatics is limited to a few million US$ (20-100), it is expected to grow fast. The Government of India has undertaken a number of initiatives. An extensive network has been established by the Department of Biotechnology, and human resource development at graduate, Masters, Ph.D. and post-doctoral levels is under way.

Considering the major share of the Pharma sector in the 500 million US$ Biotech market, many companies such as Ranbaxy, Reddy’s, Nicholas Piramal, Shanta Biotech and Reliance Industries are entering the market with their research and development (R & D) and other business ventures. Clinical trials, vaccine development, drug development, contract research outsourcing (CRO) and the creation of mirror sites of well-established data banks are some of the new areas coming into focus.

Scope of bioinformatics

Bioinformatics tools are required to acquire, store, organize, archive, analyze and visualize biological data. Genomics and genome analysis, protein informatics, microarray analysis, structural and functional proteomics, lead development, development of protein markers, target identification and validation, molecular modeling, chemo-informatics, drug development and discovery, analog-based drug design and traditional drug design constitute the different areas of bioinformatics. The methods can also be applied to problems in agricultural science, food preservation, and the chemical, detergent, pesticide, paint and cosmetic industries. Manpower is needed for the development of i) algorithms, using tools from pure mathematics, statistics, numerical analysis, optimization procedures, artificial intelligence etc., and ii) software, using programming languages and technologies such as C++, RDBMS (ORACLE), Java/BioJava, PERL/BioPERL and XML/BioXML. Skilled manpower is also needed for designing databases; managing Gene Banks, Protein Banks and Banks for Biochemical reactions; analyzing biochemical and 3D structural databases; creating mirror sites and project-oriented links; and studying genomics, proteomics, functional proteomics, motifs, profiles, phylogenesis, human SNP (single nucleotide polymorphism) databases and maps, protein modelling and drug designing.


Genetic information

A living organism is like a mini computer in which the total (genetic) information is stored in the sequence of nucleotide bases of the DNA molecule: the purines adenine (A) and guanine (G), and the pyrimidines thymine (T) and cytosine (C). DNA is a long polymer (10^6 to 10^7 Å), located inside the cell nucleus in the form of folded bundles of strings (chromosomes). The bases are connected to the five-membered sugar (deoxyribose) phosphate backbone by glycosidic bonds, from the C1' atom of the sugar to N9 of purines or N1 of pyrimidines (Fig. 1a). The DNA molecule normally has two strands that are wound around each other, as in a right-handed screw, in an antiparallel sense (the C5' to C3' direction of the two strands is opposite). Complementary bases (A-T and G-C) on the two strands face each other and are connected by hydrogen bonds, two for an A-T pair and three for a G-C pair (Fig. 1b) (Watson and Crick, 1953). The base pairs stack over each other through van der Waals forces. During cell division, this information is ‘conserved’ through the process of ‘replication’, in which a ‘complementary strand’ is synthesized over the existing or ‘template’ strand.

Figure 1: (a) Sugar phosphate backbone of DNA, (b) Watson & Crick’s DNA structure
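The base-pairing rule described above (A with T, G with C, strands running antiparallel) can be illustrated with a minimal Python sketch. The function name `complementary_strand` is a hypothetical helper chosen for this illustration; it assumes an unambiguous sequence made only of the letters A, C, G and T.

```python
# Watson-Crick pairing as stated in the text: A pairs with T, G pairs with C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complementary_strand(template: str) -> str:
    """Return the strand complementary to `template`, read 5'->3'.

    Because the two strands are antiparallel, the template is read in
    reverse while each base is swapped for its pairing partner.
    Hypothetical helper: raises KeyError on letters other than A, C, G, T.
    """
    return "".join(COMPLEMENT[base] for base in reversed(template.upper()))


if __name__ == "__main__":
    template = "ATGCGT"
    print(template, "->", complementary_strand(template))
```

Applying the function twice returns the original template, which mirrors the way replication conserves the information in both strands.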

The information contained in the ‘genome’ (the total DNA content of all the chromosomes) is ‘replicated’ during the cell replication cycle and ‘transferred’ for the synthesis of strings of amino acids (the protein primary structure) as a three-letter code (each amino acid is coded by three nucleotides), with the help of a battery of enzymes, through messenger and transfer RNA molecules (mRNA and tRNA). An RNA molecule differs from DNA in having uracil (U) instead of thymine (T) and ribose in place of deoxyribose. The nucleotide sequence in DNA thus translates into the amino acid sequence of proteins. Regulatory proteins control DNA replication, transcription and translation.

Four levels of protein structure

Proteins have a hierarchical structure. The ‘primary’ structure is simply the amino acid sequence. Stretches of the amino acid chain, depending on the physico-chemical properties of the constituent amino acids, can organize into different ‘secondary structures’ (α-helix, β-sheet, c-coil and t-turn), characterized by a hydrogen-bonded network amongst the carboxylic and amino groups of the backbone atoms. Secondary structure is energy driven and stabilized by H-bonds. Subsequently, owing to interactions amongst side chains, backbone and environment, amino acid chains fold into a specific 3-D structure (the ‘tertiary’ structure of the protein ‘monomer’). A number of such protein ‘monomers’ can aggregate to give rise to the ‘quaternary structure’ of the protein. Both monomeric and multimeric protein/enzyme structures are characterized by specific ‘active’ and/or ‘catalytic’ sites. The majority of the body’s vital processes can be controlled by the interaction of a specific ‘activator’ or ‘inhibitor’ (effector) with the typical ‘architecture’ and ‘chemical properties of the residues’ at the active site. It is therefore easy to understand why a small modification in an ‘active site residue’ of the protein, or in a ‘functional group’ of an ‘effector’, can dramatically alter the activity of an enzyme, while many other mutations may go unnoticed or affect only the ‘kinetics’ of some reactions. Enzyme-substrate and protein-protein interactions depend, basically, on the change in ‘free energy’ of interaction between the two molecules. In a nutshell, all of life’s important processes depend on the proper folding of proteins/enzymes and on their interactions with different metabolites (inhibitors or activators). For further information, one can refer to any textbook of Biochemistry, such as Biochemistry by Stryer (1999).

Figure 2: Gα subunit from transducin with bound GTP (based on the structure by Noel et al (1993), PDB code 1GOT, as depicted in figure 13.6 of the book “Introduction to protein structure” by Branden & Tooze (1991))

The Gα subunit of the enzyme transducin, with a bound GTP molecule, is depicted in Fig. 2. The helical region is connected to the GTPase domain by two regions: one which follows helix α1 and a second which precedes β2 in the GTPase domain. Any small modification in either the conformation or the chemical nature of these two regions can alter the interaction. Thus, for the rational design of any biochemical product or process, one needs knowledge of the nucleotide sequences in the genomes of different organisms; the organization of genomic information; comparisons between the genomes of different organisms; the rules governing ‘replication’, ‘transcription’ and ‘translation’; the rules for post-translational modifications; information on the total protein content (proteome) of the cell and on enzyme networks; the rules for folding of polypeptide chains; the rules for protein-protein interaction; and the rules for interaction of enzymes with different metabolites. The flow of biological information is shown in Fig. 3.


Fig. 3: Transfer of genetic information from Genome to Proteome to Metabolon

DNA sequence data

Genes are particular stretches of DNA which, when perturbed, result in observable consequences such as a change in protein abundance or composition, in traits, or a disease. However, there are a few dozen genes which code for RNAs and do not yield proteins. Similarly, there are sequences that directly influence RNA production (transcription) or protein production (translation), although they themselves do not encode any protein. Genes are encoded in DNA sequences (shown as a line in Fig. 4). RNA is transcribed from the gene; the coding segments (‘exons’) are retained, while the intervening sequences (‘introns’) are spliced out (for further details refer to any book on Bioinformatics, such as Attwood & Parry-Smith, 2001, or Baxevanis & Ouellette, 2001). The mature RNA is translated into an amino acid sequence. A portion of the mRNA (shown in striped boxes) is not translated and does not encode any protein.

Fig. 4: Gene expression
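The two steps summarized in Fig. 4, removing introns and producing the RNA copy, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the exon coordinates are supplied by hand (0-based, end-exclusive), the input is the coding strand, and the names `splice` and `transcribe` are hypothetical helpers rather than functions of any library.

```python
from typing import List, Tuple


def splice(gene: str, exons: List[Tuple[int, int]]) -> str:
    """Join the exon intervals of `gene` in order, dropping the introns.

    `exons` holds (start, end) pairs, 0-based and end-exclusive
    (an assumption made for this sketch).
    """
    return "".join(gene[start:end] for start, end in sorted(exons))


def transcribe(coding_dna: str) -> str:
    """Write the mRNA corresponding to the coding strand: T becomes U."""
    return coding_dna.upper().replace("T", "U")


if __name__ == "__main__":
    # Toy gene: exon (6 bases) + intron (12 bases) + exon (6 bases).
    gene = "ATGGCC" + "GTAAGTTTTCAG" + "AAATAA"
    exons = [(0, 6), (18, 24)]
    mature = splice(gene, exons)
    print(mature)              # spliced coding sequence
    print(transcribe(mature))  # the same sequence written as mRNA
```

Real splicing is guided by sequence signals at the exon-intron boundaries; here the boundaries are simply given, which keeps the sketch short.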

The process of moving genetic information from DNA to protein is quite complex, and the signals are not well defined. There is an initiation codon (ATG, coding for methionine) and a termination codon (TGA, TAA or TAG). Upwards of 70% of promoter regions contain a TATA box. After end modification, a polyA tail may be present or absent, and the sequence may contain the signal AATAAA. The entire set of genes together with the surrounding non-coding sequences is called the genome. The genome can be considered as the total information content of an organism. With a few exceptions, the complete genome is contained in each cell of the organism, e.g. a neuron, a muscle cell or a skin cell. This is why an udder cell of a mature sheep could generate an entire new lamb (Dolly, reported in 1997). However, owing to differences in the fidelity of the copying process in different cells, the genomes in different cells may not be identical.
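The initiation and termination codons named above are enough to sketch a naive search for open reading frames (ORFs). The following Python example is a simplified illustration, not a gene finder: it scans only the three forward reading frames, ignores promoters, introns and polyA signals, and uses the standard genetic code. The names `find_orfs` and `translate` and the `min_codons` threshold are assumptions introduced for this sketch.

```python
from typing import List, Tuple

# Standard genetic code, built from the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
START = "ATG"
STOPS = {"TGA", "TAA", "TAG"}


def find_orfs(dna: str, min_codons: int = 2) -> List[Tuple[int, int]]:
    """Return (start, end) of ATG...stop stretches in the 3 forward frames.

    Coordinates are 0-based and end-exclusive; the stop codon is included.
    `min_codons` counts the codons before the stop (hypothetical threshold).
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == START:
                start = i
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)


def translate(cds: str) -> str:
    """Translate a coding sequence codon by codon, stopping at a stop codon."""
    protein = []
    for i in range(0, len(cds) - 2, 3):
        residue = CODON_TABLE[cds[i:i + 3].upper()]
        if residue == "*":
            break
        protein.append(residue)
    return "".join(protein)


if __name__ == "__main__":
    dna = "CCATGGCCAAATAAGG"
    for start, end in find_orfs(dna):
        print(start, end, translate(dna[start:end]))
```

A fuller treatment would also scan the complementary strand, since genes can lie on either strand of the DNA.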


The abundance of a particular gene can be judged using hybridization kinetic studies. In this technique, a complementary DNA (cDNA) library is first created using the genes expressed in a particular tissue at a particular time. The cDNA can then be spliced into a circular DNA molecule that is propagated in bacteria, so that the cDNA is easily cloned. The strands of the cDNA molecules in the library are made to come apart and rejoin (hybridize). The rate of rejoining depends on the abundance of a particular gene. The method can be used to study the DNA sequences of ‘expressed genes’. However, within a single gene, the ‘intron’ or ‘noncoding’ sequences occupy from tens to thousands or more base pairs. A very tiny fraction of this sequence has been shown to regulate the coding region. The remaining DNA is ‘silent’ or has no known specific function. Genes occupy only about 3% of the total genome content. The rest is ‘surplus’ DNA, which many people also term ‘garbage’ or ‘junk’. There are ‘dull’, ubiquitous, short stretches of about 300 base pairs of DNA known as ‘Alu’ sequences. There are also monotonous repetitive sequences such as ‘CA’. Many repetitive sequences are flanked by virus-like sequences called ‘retrotransposons’. Craig Venter attempted to find the gene responsible for Huntington’s disease on a 60,000 base pair long stretch of a chromosome. In this unsuccessful attempt, he and his colleagues partially sequenced 609 randomly chosen cDNAs derived from three brain tissue samples. The sequences were long enough to identify the cloned genes from which they were derived. Such sequences are known as ESTs (expressed sequence tags).

Whole genome sequencing projects

There are essentially two ways to sequence a genome. The BAC-to-BAC method, also referred to as the ‘map-based method’, evolved from procedures developed during the late 1980s and 90s by a group of researchers from the Human Genome Project (HGP), headed initially by Nobel Laureate James Watson and later by Francis Collins, Director, National Institutes of Health (NIH) Washington, U.S.A. The method continues to develop and change. It is slow but sure. The ‘shotgun’ method, also known as the ‘whole genome sequencing method’, was developed by GNN president Craig Venter in 1996, when he was at the Institute for Genomic Research (TIGR). It brings speed into the picture, enabling researchers to do the job in months to a year. In the BAC-to-BAC approach, one first creates a crude physical map of the whole genome before sequencing the DNA. Constructing a ‘map’ requires cutting the chromosomes into large pieces and figuring out the order of these big chunks of DNA. In contrast, in the ‘shotgun’ method one goes straight to the job of decoding, bypassing the need for a physical map.

In the BAC-to-BAC method, several copies of the genome are randomly cut into large pieces about 150,000 base pairs (bp) long. Each of these 150,000 bp fragments is inserted into a BAC, a ‘bacterial artificial chromosome’, which is a man-made piece of DNA that can replicate inside a bacterial cell. The whole collection of BACs containing the entire human genome is called a BAC library. The pieces are fingerprinted to give each one a unique identification tag, which is used to determine the order of the fragments. Fingerprinting involves cutting each BAC fragment with a single enzyme and finding common sequence landmarks in overlapping fragments, which determine the location of each BAC along the chromosome. Overlapping BACs with markers every 100,000 bp then form a ‘map’ of each chromosome.
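The idea of ordering BACs by shared fingerprint landmarks can be sketched as follows. This is a toy illustration under strong assumptions: each fingerprint is represented as a set of restriction fragment sizes, two clones are taken to overlap when they share at least `min_shared` sizes, and the overlaps are assumed to form one simple linear chain. The names `overlap_graph` and `order_clones`, the threshold, and the data are all hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Set


def overlap_graph(fingerprints: Dict[str, Set[int]],
                  min_shared: int = 2) -> Dict[str, Set[str]]:
    """Link two clones when their fingerprints share enough fragment sizes."""
    graph = {name: set() for name in fingerprints}
    for a, b in combinations(fingerprints, 2):
        if len(fingerprints[a] & fingerprints[b]) >= min_shared:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def order_clones(graph: Dict[str, Set[str]]) -> List[str]:
    """Walk a simple linear chain of overlapping clones from one end."""
    ends = sorted(name for name, nbrs in graph.items() if len(nbrs) == 1)
    if not ends:
        raise ValueError("overlaps do not form a simple linear chain")
    order = [ends[0]]
    visited = {ends[0]}
    while True:
        candidates = sorted(n for n in graph[order[-1]] if n not in visited)
        if not candidates:
            break
        order.append(candidates[0])
        visited.add(candidates[0])
    return order


if __name__ == "__main__":
    # Made-up fragment sizes (bp) for three clones.
    fingerprints = {
        "BAC1": {1200, 3400, 5100, 800},
        "BAC2": {5100, 800, 2300, 4100},
        "BAC3": {2300, 4100, 950, 6700},
    }
    print(order_clones(overlap_graph(fingerprints)))
```

Real fingerprint maps must tolerate measurement error in fragment sizes and chance matches, so practical tools use statistical scores rather than exact set intersection.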
Each BAC is then broken randomly into 1,500 bp pieces, which are placed in another artificial piece of DNA (an M13 library). All the M13 libraries are sequenced: 500 bp from one end of each fragment are sequenced, generating millions of sequences. A computer algorithm (PHRAP) assembles the millions of sequenced fragments into a continuous stretch resembling each chromosome. PHRAP looks for common sequences that join two fragments together.

In the ‘shotgun’ method, multiple copies of the genome are randomly shredded into pieces 2,000 base pairs (bp) long by squeezing the DNA through a pressurized syringe. This is done a second time to generate pieces 10,000 bp long. Each 2,000 and 10,000 bp fragment is inserted into a plasmid, which is a piece of DNA that can replicate in bacteria. The two collections of plasmids, containing 2,000 and 10,000 bp chunks of human DNA, are known as plasmid libraries. Both plasmid libraries are sequenced: 500 bp from each end of each fragment are decoded, generating millions of sequences. Sequencing both ends of each insert is critical for assembling the entire chromosome. Computer algorithms assemble the millions of sequenced fragments into a continuous stretch resembling each chromosome.

Protein sequence and structural data

Proteomics refers to the science which simultaneously studies the entire protein content of a cell. Knowing when and at what levels genes are expressed is only the first level of understanding how the genome determines phenotypes. Although the mRNA level determines the protein concentration in the cell, the protein is subject to post-translational modifications that cannot be detected by hybridization. A number of experimental tools are required to assess protein concentration in the cell. Although it is beyond the scope of this chapter to cover these methods, they need to be enumerated here in order to assess their data quality and problems. Gels have long been used by molecular biologists to separate different components according to their masses, since different components migrate through a gel matrix at different speeds. In a gel map, each spot represents a different protein. The analysis is similar to the analysis of ‘DNA microarrays’. The resolution and position of each spot are analyzed first. Next, the spot is identified and the sequence information is used to connect the spot with a gene sequence. The immobilized protein can be sequenced directly, or the spot can be removed and analyzed using mass spectrometry methods such as Electrospray Ionization Mass Spectrometry (ESI-MS) or Matrix Assisted Laser Desorption Ionization Mass Spectrometry (MALDI). The method depends on identifying peptide fragments by their masses. Information on the 3-dimensional (3D) structures of proteins, nucleotides and their complexes with ligands is of vital importance for understanding and designing effectors (activators or inhibitors of biochemical interactions). X-ray crystallography is the main technique employed. The availability of a pure sample, crystallization methods, and the availability of a high-intensity X-ray beam, high-speed computers and software are a few of the considerations which slow down the availability of 3D data.
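The mass-based identification of peptide fragments mentioned above can be illustrated with a small Python sketch that cuts a protein into peptides and computes their masses. Several assumptions are made that are not in the text: the protein is digested with trypsin (cleavage after K or R, except before P), masses are monoisotopic, peptides are neutral and unmodified, and the names `tryptic_digest` and `peptide_mass` are hypothetical helpers.

```python
from typing import List

# Standard monoisotopic residue masses in daltons (rounded to 5 decimals).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water is added per free peptide (H at N-term, OH at C-term)


def tryptic_digest(protein: str) -> List[str]:
    """Cleave after K or R unless the next residue is P (assumed trypsin rule)."""
    peptides, start = [], 0
    for i, residue in enumerate(protein):
        is_last = i == len(protein) - 1
        if residue in "KR" and not is_last and protein[i + 1] != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if protein[start:]:
        peptides.append(protein[start:])
    return peptides


def peptide_mass(peptide: str) -> float:
    """Monoisotopic mass of a neutral, unmodified peptide."""
    return sum(RESIDUE_MASS[residue] for residue in peptide) + WATER


if __name__ == "__main__":
    for pep in tryptic_digest("MAKGRPLKE"):
        print(pep, round(peptide_mass(pep), 4))
```

In practice the observed list of peptide masses is compared with lists computed in this way for every protein in a sequence database, and the best-matching entry identifies the spot.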
Another method for 3D structure determination is Nuclear Magnetic Resonance (NMR) spectroscopy, which can be applied to a limited number of compounds in the solid and liquid states. The application of NMR spectroscopy is restricted by the high cost of isotope labeling and by difficulties in interpretation. A third alternative to experimental 3D structure determination is to use theoretical, computer-assisted molecular modeling methods based on physico-chemical principles, energy optimization techniques and evolutionary relationships amongst molecules. This forms an important area of Bioinformatics. However, the technique should be applied only when no other alternative is available. These methods are discussed later.


Biological databases

A fundamental characteristic of chain molecules such as DNA and proteins is that they can be cast in the form of digital symbols: the nucleotides adenine (A), guanine (G), thymine (T) and cytosine (C), or amino acids such as tyrosine (Y), glycine (G), histidine (H), lysine (K) etc., which are distinct, although sometimes chemically modified. Therefore, experimentally determined sequences can, in principle, be obtained with complete certainty. There is no lower limit of uncertainty associated with the efficiency of measurement. Provided we have enough economic resources, the nucleotide sequence of genomic DNA and the associated amino acid sequences can be revealed completely. However, in genome projects carrying out large-scale sequencing, the purpose, relevance, ethics and economics decide the quality of the data.

Data quality and data texture

Although sequence data can be determined accurately, they are not available to a researcher without additional noise, because of incorrect interpretation of experiments and incorrect handling and storage in public databases. There are a number of reasons for this: i) public databases are curated by highly diverse people, ii) they are annotated by highly diverse people, iii) the error rates of subsequent handling are higher still, iv) there is some amount of experimental error, and v) error is also associated with the way the data are stored. In sequence databases the result is garbage and random noise, which blurs the distinction between ‘coding’ and ‘non-coding’ regions. In proteins, functional sites and post-translational modifications would appear to be randomly distributed. Database redundancy is another important problem. Many entries in DNA and protein databases represent members of the same gene family, or versions of homologous genes found in different organisms. Several groups might have submitted the same sequence. The annotation of these very similar sequences would be close to identical, but there may be significant differences between species and tissues. Redundancy also results from the different experimental approaches themselves. Many cDNA sequences come from spliced forms of pre-mRNA, and genes may undergo alternative splicing. A given piece of DNA may therefore be associated with several cDNAs whose sequences are not contiguous with the genomic DNA. As a result, there are different ways of joining coding and non-coding regions. The sequences of the genes spotted on glass plates or synthesized on a DNA chip are typically based on the sequences or clusters deposited in databases. A microarray may therefore carry more sequences than there are genes. Biological data may also represent protein sequences. A protein sequence may not correlate directly with the genomic wild-type sequence, owing to modifications, the requirements of crystallization etc.

A redundant data set may result in biased statistical analysis, and a correlation between different positions may be an artifact of biased sampling. Proteins undergo glycosylation and phosphorylation. There are many other types of post-translational modification, such as the addition of fatty acids and the cleavage of signal peptides at the N-terminus of secretory proteins. Prediction of post-translational modifications in proteins is an important area of bioinformatics. Statistical analysis has played a vital role in studying the evolution of protein sequences (for further information on these aspects please refer to the book Biochemistry by Stryer (1999)).


Function is determined mainly by local sequence characteristics and does not depend critically on structure maintained in long-range order. Functionally similar proteins are usually evolutionarily related (homologous) and have a similar fold. A typical example is that of the serine proteases (Greer 1981). However, it is possible for function to be conserved when the sequences are not evolutionarily related, e.g. transaldolase, fructose-1,6-bisphosphate aldolase, the urease catalytic domain and phosphotriesterase. Another example is that of enterotoxin and cholera toxin, which are close homologs (80% sequence homology, structural similarity 98% at resolution 0.6 Å). Their fold is similar to that of TSS toxin, which is a remote homolog (8.8% homology, structural similarity 35%, resolution 2.4 Å). However, why aminoacyl tRNA synthetase has the same fold, with sequence homology of 4.4% and structural similarity of 41% at 2.2 Å, is not understood (Russell et al 1997).

Genome and DNA information resources

Genome databases define four data types: Sequence, Physical, Genetic and Bibliographic.

1. Physical data include: sequence-tagged sites, coding regions, noncoding regions, control regions, telomeres, centromeres, repeats and metaphase chromosome bands.
2. Genetic data include seven data fields: locus name, location, recombination distance, polymorphisms, breakpoints, rearrangements and disease association.
3. Genome databases are classified into four categories: molecular, genetic, organism and bibliographic.
4. Bibliographic databases include the name of the author, laboratory, journal, page, issue, year etc.

GenBank (http://www.ncbi.nlm.nih.gov/Genbank/GenbankSearch.html)

GenBank is maintained by the National Center for Biotechnology Information (NCBI), USA, at NIH Washington. It incorporates data from publicly available sources, direct submissions from authors and large-scale sequencing projects (Benson et al 1998). Because of the diversity and size of the database, it is split into subgroups or divisions (17 as of today). The resource exchanges data with both DDBJ and EMBL, which are discussed later.

Table 1: Three letter code for 17 divisions of GenBank

Code   Division
PRI    Primate
ROD    Rodent
MAM    Other Mammalian
VRT    Other Vertebrate
INV    Invertebrate
PLN    Plant, Fungal, Algal
BCT    Bacterial
RNA    Structural RNA
VRL    Viral
PHG    Bacteriophage
SYN    Synthetic
UNA    Unannotated
EST    Expressed Sequence Tag
PAT    Patent
STS    Sequence Tagged Site
GSS    Genome Survey Sequence
HTG    High Throughput Genomic Sequence

EMBL (http://www.embl-heidelberg.de/)

The European Molecular Biology Laboratory (EMBL) Nucleotide Sequence Data Library is the nucleotide sequence database maintained at the European Bioinformatics Institute (Stoesser et al 1998). It draws on direct author submissions, genome sequencing groups, the scientific literature and patent applications. The database is produced in collaboration with DDBJ and GenBank, and updated entries are exchanged on a daily basis.

DDBJ (http://www.ddbj.nig.ac.jp/)

The DNA Database of Japan (DDBJ) is produced and maintained by the National Institute of Genetics, Japan (Tateno et al 1998). It has web-based data submission tools.

Another important database is the Human Genome Database (GDB), http://gdbwww.gdb.org. A number of applications are available with the Online Mendelian Inheritance in Man (OMIM) database, http://www3.ncbi.nlm.nih.gov/Omim/.

Proteome and protein information resources

Biochemical pathway databases

Once the genes are expressed and translated into proteins, their products participate in complicated biochemical interactions called ‘pathways’. The best known metabolic pathway resources are ‘What Is There’ (WIT, http://wit.mcs.anl.gov/WIT2/) and the Kyoto Encyclopedia of Genes and Genomes (KEGG, http://www.genome.ad.jp/kegg/).

PIR (http://pir.georgetown.edu)

The Protein Information Resource (PIR) database at the National Biomedical Research Foundation (NBRF) was started by Margaret Dayhoff as early as 1960. It is maintained by an association of macromolecular sequence data collection centres: the Protein Information Resource at NBRF, the Japan International Protein Information Database (JIPID) and the Martinsried Institute for Protein Sequences (MIPS). PIR1 to PIR4 differ in the quality of data and the level of annotation: PIR-1 holds fully annotated entries, PIR-2 preliminary entries, PIR-3 unverified entries, and PIR-4 conceptual translations of artefactual sequences, of sequences that are not transcribed and translated, and of genetically engineered sequences.

SWISS-PROT (http://expasy.hcuge.ch/sprot/sprot-top.html)

This database was started in 1986 by the Department of Medical Biochemistry, Geneva. After 1994 it moved to EMBL’s UK outstation, and after 1998 to the Swiss Institute of Bioinformatics (SIB). It has a high level of annotation, including descriptions of protein function, domain structure, post-translational modifications, variants etc. It is a minimally redundant database and also the most popular one.

MIPS (http://www.mips.biochem.mpg.de/)

The Martinsried Institute for Protein Sequences (MIPS) takes part in a tripartite international protein sequence database project. The database is distributed with PATCHX and is accessible through a Web server. Results of FASTA similarity searches within all proteins at PIR are stored dynamically in this database.

TrEMBL

TrEMBL is a computer-annotated supplement to SWISS-PROT. The database benefits from the SWISS-PROT format. It contains translations of all coding sequences (CDS) in EMBL. SP-TrEMBL holds entries that will eventually be incorporated into SWISS-PROT. REM-TrEMBL includes immunoglobulins, T-cell receptors, small fragments shorter than 8 amino acids (aa), synthetic peptides, and patented and codon-translated sequences.

NRL-3D (http://www.nbrfa.georgetown.edu/pir/nrl3d.html)

The NRL-3D database is produced by PIR from the PDB (Protein Data Bank). Titles and biological source information conform to PIR standards. Bibliographic references and cross-references to MEDLINE are also given. The database makes the sequence information in the PDB available for similarity searches, and it may be searched using the ATLAS retrieval system.

In summary, NRL-3D is the least comprehensive of these databases but is very useful because it relates directly to structural information. PIR1-4 is the most comprehensive, but the quality of annotation, even in PIR-1, is poor. SWISS-PROT is highly structured but its sequence coverage is poor.

Composite databases

One solution to the proliferation of primary databases is to create a composite database. The choice of sources and the handling of redundancy are important in this approach. A few examples of composite databases are listed here.

NRDB (http://www.ncbi.nlm.nih.gov)

The non-redundant database (NRDB) is built locally at the NCBI. It is a composite of GenPept (automatic CDS translations of GenBank), PDB, SWISS-PROT, SPupdate (weekly updates of SWISS-PROT), PIR and GenPeptupdate. It is up-to-date, but it is not truly non-redundant: it is non-identical, in that only identical copies are removed. This rather simplistic approach leads to problems, because multiple copies of the same protein are retained in the database as a result of polymorphisms or minor sequencing errors.

OWL (http://www.biochem.ucl.ac.uk/bsm/dbbrowser/OWL)

OWL is a non-redundant database built at the University of Leeds with the Daresbury Laboratory, Warrington. It is a composite of SWISS-PROT, PIR1-4, GenBank (CDS translations) and NRL-3D. The sources are assigned priorities according to their level of annotation. The process removes identical copies and sequences differing by single point mutations (Bleasby et al 1994). The main defect of OWL is that it is released only every 6-8 weeks and so is not up-to-date. Some sequencing and translational errors are retained. BLAST services for OWL are available from the UK EMBnet national node, SEQNET, and from UCL.

MIPSX

MIPSX is a merged database produced at the Max-Planck Institute in Martinsried (Mewes et al 1998). The database contains information from the following resources: PIR 1-4; MIPS preliminary entries, MIPSOwn; MIPS/PIR preliminary entries, PRIMOD; MIPS preliminary translations, MIPSTrn; MIPS yeast entries, MIPSH; NRL-3D; SWISS-PROT; EMTrans, an automated translation of EMBL; GBTrans, translated GenBank entries; Kabat; and PseqIP. The sources are assigned different priorities, and sequences that are identical within or between them are removed, so that only unique copies are retained. In addition, all subsequences are removed.
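The merging strategies described above (prioritized sources, removal of identical copies, and removal of subsequences as in MIPSX) can be sketched in Python. This is a toy illustration, not the procedure of any of the named databases: sources are plain dictionaries of identifier to sequence, listed in decreasing priority, and the names `build_composite` and `drop_subsequences` as well as the example entries are hypothetical.

```python
from typing import Dict, List, Tuple

Record = Tuple[str, str, str]  # (source name, entry identifier, sequence)


def build_composite(sources: List[Tuple[str, Dict[str, str]]]) -> List[Record]:
    """Merge sources given in decreasing priority, keeping one copy per sequence.

    When the same sequence occurs more than once, the copy from the
    highest-priority source is the one retained.
    """
    kept: List[Record] = []
    seen = set()
    for source_name, entries in sources:
        for entry_id, sequence in entries.items():
            if sequence in seen:
                continue  # identical copy already taken from a better source
            seen.add(sequence)
            kept.append((source_name, entry_id, sequence))
    return kept


def drop_subsequences(records: List[Record]) -> List[Record]:
    """Remove entries whose sequence is fully contained in a longer entry."""
    return [
        rec for rec in records
        if not any(rec[2] != other[2] and rec[2] in other[2] for other in records)
    ]


if __name__ == "__main__":
    sources = [
        ("SWISS-PROT", {"P1": "MKTAYIAK"}),
        ("PIR", {"A1": "MKTAYIAK", "A2": "TAYI", "A3": "MKTAYLAK"}),
    ]
    merged = build_composite(sources)
    for record in drop_subsequences(merged):
        print(record)
```

In this example the entry differing by a single residue is still retained, which illustrates why removing only identical copies yields a non-identical rather than a truly non-redundant database.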
Secondary databases

In addition to the primary databases, there are several secondary databases, which contain the results of analysing the sequences held in the primary sources. Because there are several primary databases and a variety of ways of analysing protein sequences, the information housed in each of the secondary resources is different. Designing tools that search the different databases and assess the biological significance of the results is a difficult task. Some of the major ‘pattern databases’ are listed in


Table 2. In each, the primary source is noted, together with the type of pattern stored. ‘PRINTS’ is currently the only database derived from a composite database.

Table 2: Secondary databases

Secondary database   Primary database   Stored information

PROSITE              SWISS-PROT         Regular expressions (patterns)
Profiles             SWISS-PROT         Weighted matrices (profiles)
PRINTS               OWL                Aligned motifs (fingerprints)
Pfam                 SWISS-PROT         Hidden Markov models (HMMs)
BLOCKS               PROSITE/PRINTS     Aligned motifs (blocks)
IDENTITY             PROSITE/PRINTS     Fuzzy regular expressions
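To make the idea of a pattern stored as a regular expression concrete, the sketch below converts a PROSITE-style pattern into a Python regular expression and scans a sequence with it. It assumes the usual PROSITE conventions (elements separated by `-`, `x` for any residue, `[...]` for allowed residues, `{...}` for forbidden residues, `(n)` or `(n,m)` for repeats); the function name `prosite_to_regex` and the test sequence are hypothetical, and the pattern shown is the commonly quoted N-glycosylation site motif.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    pattern = pattern.rstrip(".")          # patterns are written with a final dot
    parts = []
    for element in pattern.split("-"):
        element = element.replace("{", "[^").replace("}", "]")  # forbidden residues
        element = element.replace("(", "{").replace(")", "}")   # repeat counts
        element = element.replace("x", ".")                     # any residue
        element = element.replace("<", "^").replace(">", "$")   # sequence termini
        parts.append(element)
    return "".join(parts)


regex = prosite_to_regex("N-{P}-[ST]-{P}.")
sequence = "MKNGTAPNPSL"                     # made-up sequence for illustration
hits = [(m.start(), m.group()) for m in re.finditer(regex, sequence)]
print(regex, hits)
```

The second asparagine in the toy sequence is followed by proline and is therefore not reported, which shows how a single forbidden position sharpens the pattern.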

Web addresses of the secondary databases

• PROSITE http://expasy.hcuge.ch/sprot/prosite.html

• Profile http://www.isrec.isb-sib.ch/profile/profile.html

• PRINTS http://www.biochem.ucl.ac.uk/bsm/dbbrowser/PRINTS

• Pfam http://www.sanger.ac.uk/Software/Pfam

• BLOCKS http://www.blocks.fhcrc.org

• IDENTITY http://dna.Stanford.EDU/identity

Composite protein pattern databases

The curators of PROSITE, Profiles, PRINTS and Pfam have created a unified database of protein families, with a single central family annotation resource in Geneva.

SCOP (http://scop.mrc-lmb.cam.ac.uk/scop)
The Structural Classification of Proteins (SCOP) database, maintained at the MRC Laboratory of Molecular Biology and Centre for Protein Engineering, describes the structural and evolutionary relationships amongst proteins of known structure. SCOP is constructed using manual inspection combined with automated methods. Its classification is based on the following levels:

• Family: proteins are clustered into families on the basis of a clear-cut evolutionary relationship.

• Superfamily: proteins are placed in the same superfamily when their sequence identity is low but their structural and functional relationship suggests a common evolutionary origin.

• Fold: proteins have the same major secondary structures in the same arrangement.

CATH (http://www.biochem.ucl.ac.uk/bsm/cath)
CATH (Class, Architecture, Topology and Homology), maintained at UCL, is a classification based on automatic and manual methods. Its levels are:

• Class: gross secondary structure content.

• Architecture: gross arrangement of the secondary structures.

• Topology: overall shape and connectivity.

• Homology: similarity by sequence.


PDBsum (http://www.biochem.ucl.ac.uk/bsm/pdbsum)
PDBsum is a web-based compendium of structural information maintained at UCL. For each PDB entry it gives the resolution and R-factor, the number of protein chains, ligands and metal ions, the secondary structure, fold cartoons and ligand interactions. It is a single resource for information at the 1D (sequence), 2D (motif) and 3D (structure) levels.

Gene and protein analysis databases and software can be found at http://www.humgen.nl/programs.html.

Sequence analysis

Protein primary sequence analysis

The physicochemical properties of the 20 amino acids are well understood. A number of useful computational tools are available at the ExPASy Molecular Biology Server (http://www.expasy.ch/) of the Swiss Institute of Bioinformatics. The main objective of these tools is to assist the analysis and identification of unknown proteins isolated by 2-D (two-dimensional) gel electrophoresis, and to predict the basic physical properties of an unknown protein. Some popular programs are listed here.

AACompIdent and AACompSim (Wilkins et al 1997) (http://www.expasy.ch/tools)
AACompIdent and AACompSim use the amino acid (aa) composition of an unknown protein to find known proteins with the same composition. The programs use the aa composition, isoelectric point (pI) and molecular weight (MW). The algorithm computes the difference in these properties between the query and each database sequence, both for proteins from a specified taxonomic class and for all proteins regardless of taxonomic class, with and without inclusion of pI. AACompIdent uses experimentally determined values, while AACompSim uses properties derived from the aa sequence.

PROPSEARCH (http://swift.cmbi.ru.nl/prive/past/propsearch)
PROPSEARCH works along the same lines as AACompSim and uses protein composition to detect relationships amongst proteins. It can detect even weak relationships (Hobohm and Sander 1995) and is a more robust technique. The analysis uses 144 different physical properties, which together form the ‘query vector’. The only input required is the query sequence.

MOWSE (http://bioweb.pasteur.fr/docs/EMBOSS/emowse.html)
The Molecular Weight Search (MOWSE) algorithm (Pappin et al 1993) is used to analyse mass spectrometric data. The program uses the molecular weight of the intact protein and the molecular weights of the fragments produced by digesting it with specific proteases to identify the unknown protein.

ProtParam (www.expasy.ch/tools/protparam.html)
ProtParam computes the pI and MW of a protein (Bjellqvist et al 1993). The pI is calculated from pK values determined in studies of protein migration under denaturing conditions at neutral and acidic pH. The sequence can be supplied in FASTA format or as a SWISS-PROT identifier.

TGREASE (http://ftp.virginia.edu/pub/fasta)


TGREASE calculates hydrophobicity along the length of a protein (Kyte & Doolittle 1982). Each of the 20 amino acids has an inherent hydrophobicity, that is, a relative propensity to bury itself away from water. Hydrophobicity, along with steric and other considerations, influences how the protein actually folds. TGREASE can be used to find putative transmembrane helices. It belongs to the FASTA suite of programs from the University of Virginia and can also be run in ‘stand-alone’ mode.

SAPS (http://www.isrec.isb-sib.ch/software/SAPS_form.html)
Statistical Analysis of Protein Sequences (SAPS) (Brendel et al 1992) gives extensive statistical information on protein sequences.

MOTIFS and PATTERNS
Often a direct sequence comparison by BLAST (discussed later) does not yield any interesting result. However, there may be weak sequence determinants that allow the family to be identified. Alternatively, a family of sequences can be used to identify new, distantly related members of the same family. This can be done using the program PSI-BLAST.

ProfileScan (http://www.isrec.isb-sib.ch/software/PFRAMESCAN)
ProfileScan searches a nucleic acid or protein sequence for similarity to entries in the PROFILE database (Gribskov et al 1988). It can also search three collections: PROSITE, Pfam and the Gribskov collection.

DNA sequence analysis

The most sensitive comparisons are made at the protein level, where the redundancy of the genetic code (64 codons) is reduced to 20 amino acids. However, this involves a loss of information that relates more directly to the evolutionary process, since proteins are a functional abstraction of the genetic events that occur in DNA. The study of DNA sequences covers the detection of phylogenetic relationships, intron/exon prediction, the inference of protein coding sequences and the determination of the ‘open reading frame’ (ORF).

Phylogenetic analysis reveals family relationships amongst species. The analysis is carried out on small sections of aligned DNA taken from the same gene in different organisms. DNA is used because the pattern of mutations, insertions and deletions at the gene level is definitive; silent mutations, which do not change the encoded amino acids, are automatically included. Phylogenetic relationships are often represented graphically.

Finding the correct open reading frame is not an easy task. The ORF is normally taken to be the longest reading frame uninterrupted by stop codons (TGA, TAA or TAG). The initial codon is generally that of methionine (ATG). However, methionine also occurs within proteins, so an ATG does not necessarily mark the start of the coding sequence. Additional techniques are needed to detect the end of the 5’ untranslated sequence, and recognition of flanking sequences is useful.

In eukaryotic systems, ‘exons’ are the parts of the transcribed sequence that are retained in the mature mRNA. ‘Introns’ are transcribed but are not part of the coding sequence (CDS). In a gene made up of two introns and three exons, for example, the exons are delimited by intron–exon boundaries. The pattern of codon usage, which varies with the species, is also quite important and can be used to identify the 5’ and 3’ ends of an ORF. One can carry out ‘forward translation’ in three frames (0, +1, +2) and ‘backward translation’ of the complementary strand in three frames (0, -1, -2), that is, a six-frame translation, and so identify the correct gene.
There are several packages that translate DNA sequence to protein. These are listed below:

• Translate ( http://www.expasy.org/tools/dna.html ) - Translates a nucleotide sequence to a protein sequence


• Transeq (http://www.ebi.ac.uk/emboss/transeq ) - Nucleotide to protein translation from the EMBOSS package

• Genewise (http://www.sanger.ac.uk/software/Wise2/genewiseform.shtml ) - Compares a protein sequence to a genomic DNA sequence, allowing for introns and frameshifting errors

• Graphic Codon Usage Analyser (http://gcus.schoedl.de) - Displays the codon bias in a graphical manner

• BCM search launcher (http://searchlauncher.bcm.tmc.edu/seq-util/Options/sixframe.html ) - Six frame translation of nucleotide sequence(s)

• Backtranslation (http://www.entelechon.com/eng/backtranslation.html ) - Translates a protein sequence back to a nucleotide sequence
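The six-frame translation and the "longest frame uninterrupted by a stop codon" rule discussed in this section can be illustrated with a short sketch. It assumes the standard genetic code and ignores codon usage, splicing and flanking signals, so it is not a gene finder; the function names and the test sequence are hypothetical. Frames are labelled +0, +1, +2 for the given strand and -0, -1, -2 for the reverse complement.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna, frame=0):
    """Translate one frame; unknown codons become 'X', stops become '*'."""
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translation(dna):
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for f in range(3):
        frames["+%d" % f] = translate(dna, f)
        frames["-%d" % f] = translate(rc, f)
    return frames


def longest_orf(dna):
    """Return (frame_label, protein) for the longest Met...stop stretch."""
    best = ("", "")
    for label, protein in six_frame_translation(dna).items():
        for segment in protein.split("*")[:-1]:   # keep segments closed by a stop
            start = segment.find("M")
            if start != -1 and len(segment) - start > len(best[1]):
                best = (label, segment[start:])
    return best


print(six_frame_translation("CCATGGCTAAATGACC"))
print(longest_orf("CCATGGCTAAATGACC"))
```

Taking the first methionine of each stop-delimited segment is a simplification; as noted above, an internal ATG cannot be distinguished from the true start without additional evidence.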

Sequence comparison methods

Scoring matrices: Dot matrix, PAM and BLOSUM

One can compare either two sequences at a time (pairwise sequence alignment) or a number of sequences simultaneously (multiple sequence alignment). The main questions are how to compare two nucleotides or amino acids, and how to handle ‘gaps’. In pairwise alignment, one uses either a ‘dot matrix’, which indicates the presence or absence of a matching residue at each pair of positions, or a scoring matrix. Scoring matrices such as PAM (Point Accepted Mutation), based on Dayhoff’s mutational data (Dayhoff et al. 1978), and BLOSUM (Blocks Substitution Matrix), based on observed amino acid substitutions (Henikoff and Henikoff 1992), are very popular. BLOSUM matrices are calculated from a large data set of approximately 2000 conserved amino acid patterns called ‘BLOCKS’. These BLOCKS were found in a database of protein sequences representing over 500 groups of related proteins.

Pair wise and multiple sequence analysis methods

One of the most popular computer programs for pairwise sequence comparison is BLAST (Basic Local Alignment Search Tool) by Altschul et al (1990) and Altschul (1991). Given two sequences, it looks for ‘segment pairs’: pairs of subsequences of the same length that form an ungapped alignment (no gaps are allowed). The strategy concentrates on finding regions of high local similarity, called ‘hot spots’, in alignments without gaps. The ‘hot spots’ are then extended into the surrounding regions, and chaining together several locally similar regions can create alignments with some gaps. The original BLAST could produce only ungapped alignments. Gapped BLAST (Altschul et al 1997) seeks only one, rather than all, of the ungapped alignments that give a significant match, and thus speeds up the search.

Several variants of the program are available:

• BLASTP compares a protein query sequence against a protein sequence database.

• tBLASTn compares a protein query sequence against a nucleotide sequence database dynamically translated in all six reading frames.

• BLASTN compares a nucleotide query sequence against a nucleotide database.

• BLASTX compares a six-frame translated nucleotide query sequence against a protein database.

• tBLASTx compares a translated nucleotide query sequence against a translated nucleotide database.

• PSI-BLAST is position-specific iterative BLAST. It is very useful for detecting weak ‘homologs’.
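The notion of an ungapped segment pair can be made concrete with a small sketch that scans every diagonal of the dot matrix of two sequences and reports the highest-scoring ungapped stretch. This is an exhaustive scan for illustration only, not BLAST's word-seeded heuristic; the simple match/mismatch scores stand in for a PAM or BLOSUM matrix, and the function name and sequences are hypothetical.

```python
def best_ungapped_segment(a, b, match=1, mismatch=-1):
    """Return (score, start_a, start_b, length) of the best ungapped segment pair.

    Each diagonal of the dot matrix (constant offset j - i) is scanned with a
    running score that is reset to zero whenever it becomes non-positive.
    """
    best = (0, 0, 0, 0)
    for offset in range(-(len(a) - 1), len(b)):
        i, j = (-offset, 0) if offset < 0 else (0, offset)
        score, run_start = 0, (i, j)
        while i < len(a) and j < len(b):
            if score == 0:
                run_start = (i, j)            # a new candidate segment starts here
            score += match if a[i] == b[j] else mismatch
            if score <= 0:
                score = 0
            elif score > best[0]:
                best = (score, run_start[0], run_start[1], i - run_start[0] + 1)
            i += 1
            j += 1
    return best


score, start_a, start_b, length = best_ungapped_segment("GATTACA", "TTGATTAC")
print(score, "GATTACA"[start_a:start_a + length], "TTGATTAC"[start_b:start_b + length])
```

For real database searches this quadratic scan is far too slow, which is why BLAST and FASTA first look for short exact word matches and extend only from those.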


The FASTA program, developed by Lipman and Pearson (1985), was the first program to search sequence databases for gapped local alignments. It was developed several years before BLAST. The program produces local alignments for each pair of compared sequences, and the best-scoring local region is given as output. FASTA is slower than BLAST. In FASTA, one first identifies ‘short words’ or ‘k-tuples’ common to both sequences. The word length is 2 for proteins and 4-6 for DNA. Comparing these k-tuples and their relative ‘offsets’ can be viewed as searching along the diagonals of a dot matrix. The program uses a ‘heuristic approach’ and ‘dynamic programming’ to join the k-tuples.

Comparing a number of sequences simultaneously is like creating a matrix. It is extremely useful for comparing similar proteins from different organisms or species. One can extend the pairwise sequence comparison method to a number of sequences and use a dynamic programming algorithm. It is also possible to use a ‘genetic algorithm’ or a ‘tree hierarchical method’. Instead of full scoring matrices, one can use chemical similarity tables.

Needleman and Wunsch (1970) developed an algorithm for sequences that show similarity across most of their lengths. However, this is not usually the case. The Smith and Waterman (1981) algorithm uses dynamic programming to compute the most sensitive pairwise local alignments, and can be used to identify remote homologs. CLUSTAL uses the progressive alignment method of Feng and Doolittle (1987) rather than a full multi-dimensional dynamic programming matrix. The method is based on the fact that similar sequences tend to be evolutionarily related. Once the scores of the pairwise alignments have been obtained, they are used to cluster the sequences into a family tree. The sequences are then aligned in pairs, following the branching order of the tree: similar sequences are aligned first, and distantly related sequences later. The program is also useful for producing a ‘phylogenetic tree’.

Database analysis using BLAST and FASTA
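Before turning to the database search examples, the Smith and Waterman dynamic programming scheme mentioned in the preceding section can be sketched as follows. This is a minimal illustration with a linear gap penalty and simple match/mismatch scores chosen as assumptions; production tools use substitution matrices such as BLOSUM and affine gap penalties. The function name and the test sequences are hypothetical.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]      # H[i][j]: best score ending at a[i-1], b[j-1]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        step = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + step:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(smith_waterman("TTTGATTACATTT", "CCCGATTACACCC"))
```

Resetting negative scores to zero is what makes the alignment local: the unrelated flanking regions in the example do not drag down the score of the shared core.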

BLAST example

Suppose we have a sequence and wish to identify the gene it comes from, but no clues are available. We can proceed as follows:

Step 1: Copy the query sequence (e.g. the sequence given below).

AAAAGAAAAGGTTAGAAAGATGAGAGATGATAAAGGGTCCATTTGAGGTTAGGTAA
TATGGTTTGGTATCCCTGTAGTTAAAAGTTTTTGTCTTATTTTAGAATACTGTGAT
CTATTTCTTTAGTATTAATTTTTCCTTCTGTTTTCCTCATCTAGGGAACCCCAAGA
GCATCCAATAGAAGCTGTGCAATTATGTAAAATTTTCAACTGTCTTCCTCAAAATA
AAGAAGTATGGTAATCTTTACCTGTATACAGTGCAGAGCCTTCTCAGAAGCACAGA
ATATTTTTATATTTCCTTTATGTGAATTTTTAAGCTGCAAATCTGATGGCCTTAAT
TTCCTTTTTGACACTGAAAGTTTTGTAAAAGAAATCATGTCCATACACTTTGTTGC
AAGATGTGAATTATTGACACTGAACTTAATAACTGTGTACTGTTCGGAAGGGGTTC
CTCAAATTTTTTGACTTTTTTTGTATGTGTGTTTTTTCTTTTTTTTTAAGTTCTTA
TGAGGAGGGGAGGGTAAATAAACCACTGTGCGTCTTGGTGTAATTTGAAGATTGCC
CCATCTAGACTAGCAATCTCTTCATTATTCTCTGCTATATATAAAACGGTGCTGTG
AGGGAGGGGAAAAGCATTTTTCAATATATTGAACTTTTGTACTGAATTTTTTTGTA
ATAAGCAATCAAGGTTATAATTTTTTTTAAAATAGAAATTTTGTAAGAAGGCAATA
TTAACCTAATCACCATGTAAGCACTCTGGATGATGGATTCCACAAAACTTGGTTTT
ATGGTTACTTCTTCTCTTAGATTCTTAATTCATGAGGAGGGTGGGGGAGGGAGGTG
GAGGGAGGGAAGGGTTTCTCTATTAAAATGCATTCGTTGTGTTTTTTAAGATAGTG
TAACTTGCTTAAATTTCTTATGTGACATTAACAAATAAAAAAGCTCTTTTAATATT
AGATAA


Step 2: Go to the ExPASy (EMBnet) server or any other BLAST server.

Step 3: Select the program: BLASTN. This is the BLAST program that compares a nucleotide query sequence against a nucleotide database.

Step 4: Select the database: EMBL without ESTs (DNA). This is the main EMBL nucleotide database.

Step 5: Ignore the matrix option, as it is not used by BLASTN.

Step 6: Select the sequence input format: Plain Text.

Step 7: Select the following options:

Gapped Alignment: ‘ON’

BLAST filter: ‘ON’

Graphic Output: ‘ON’

Note: These are all ‘ON’ by default.

Step 8: Paste the query sequence into the specified area.

Step 9: Hit the button: Run BLAST.

Step 10: Wait while the server processes your query.

Step 11: Examine the E-value in the output. This value is the number of times you would expect to see such a match (or a better one) merely by chance. The closer the value is to zero, the less likely it is that the match arose by chance; the smaller the E-value, the more significant the match.

Using FASTA

One of the most commonly used servers for FASTA is http://www.ebi.ac.uk. As in the case of BLAST, one pastes a query sequence in FASTA format: a header line that starts with the character ‘>’, followed by the sequence as a continuous string of one-letter codes. The program is also available at a number of other sites. It can be used for both nucleic acids and proteins by choosing suitable options. It usually compares a protein sequence against the SWISS-PROT protein sequence database.
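As an illustration of the FASTA format required by these servers, the sketch below reads one or more records from a block of text: each record is a header line beginning with ‘>’ followed by sequence lines, which are joined into a single string. The function name `parse_fasta` and the example records are hypothetical.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)           # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>query1 example protein
MKTAYIAKQR
QISFVKSHFS
>query2
ACGT
"""
for header, sequence in parse_fasta(example):
    print(header, len(sequence))
```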


Structural bioinformatics

Small molecular structural database CCSD (http://www.ccdc.cam.ac.uk/)

Small organic and inorganic molecules are of great significance, not only in the chemical industry but also in biotechnology, for example in the design of drugs, enzyme inhibitors and markers for DNA and proteins. Developments in affinity chromatography and in the detergent, pesticide and cosmetic industries also need information on the properties of small molecules. There are today over 300 chemical databases, and picking the right molecule for a desired purpose is not an easy task. This makes chemoinformatics an important discipline within bioinformatics.

The Cambridge Structural Database (CSD) is one of the oldest and most comprehensive databases. It includes chemical and crystallographic data for inorganic and organic molecules in 1D (one dimension), 2D (two dimensions) and 3D (three dimensions). There are four components in the information content of the CSD:

• 1D information (text and simple numerical items): bibliographic items (compound name, journal reference, etc.); chemical text strings (molecular formula, amino acid sequences, etc.); simple numerical items (unit-cell parameters, R-factor, etc.); and text comments describing disorder, errors, etc.

• 2D information (chemical structural diagram as a connection table): atom properties (atom sequence number, element symbol, number of connected non-H atoms, number of terminal H atoms, net charge and 2D display coordinates) and bond properties (the pair of atom sequence numbers and the formal chemical bond type: 1 = single, 2 = double, 3 = triple, 4 = quadruple, 5 = aromatic, 6 = catena bond, 7 = delocalized double, 9 = pi-bond; positive bond types are acyclic while negative ones are cyclic).

• 3D information (molecular structure): Cartesian coordinates and a graphical representation of the 3D structure.

• 3D crystal structure description: space group and symmetry operators; 3D atomic coordinates (fractional) for the crystal chemical unit; crystallographic connectivity established using covalent radii; and matching information that maps the 'crystallographic' atoms to the 'chemical' atoms in the 2D connection table.

Protein structural database PDB (http://www.rcsb.org)

The Protein Data Bank (PDB) was established at Brookhaven National Laboratory (BNL) in 1971 as an archive for ‘biological macromolecular crystal structures’ (Bernstein et al 1977). In the beginning the archive held seven structures, and with each year a handful more were deposited. In the 1980s the number of deposited structures began to increase dramatically. This was due to improved technology for all aspects of the crystallographic process and to the addition of structures determined by nuclear magnetic resonance (NMR) methods. By the early 1990s the majority of journals required a PDB accession code, and at least one funding agency (the National Institute of General Medical Sciences) had adopted the guidelines published by the International Union of Crystallography (IUCr) requiring data deposition for all structures.

The mode of access to PDB data has changed over the years as a result of improved technology, notably the availability of the WWW, which replaced distribution solely via magnetic media. Further, the need to analyze diverse data sets has required the development of modern data management systems.


In October 1998, the management of the PDB became the responsibility of the Research Collaboratory for Structural Bioinformatics (RCSB) (http://www.rcsb.org). In general terms, the vision of the RCSB is to create a resource, based on the most modern technology, that facilitates the use and analysis of structural data and thus enables biological research. The PDB archive contains macromolecular structure data on proteins, nucleic acids, protein-nucleic acid complexes and viruses. The files in its holdings are deposited by the international user community and maintained by PDB staff. Approximately 50-100 new structures are deposited each week. They are annotated by the RCSB and released according to the depositor's specifications. PDB data are freely available worldwide. The resource is described by Berman et al (2000).

Introduction to molecular modeling

Modeling small molecules

Modeling is a technique for obtaining structural information about an object on a purely theoretical basis, using a set of mathematical rules. The method can be applied to build 3D structures of small molecules of biological relevance, typically enzyme inhibitors, activators and drugs, when these are not available in any database. Biopolymers (DNA, RNA, proteins and carbohydrates) can also be built. The first step in molecular modeling is visualizing a biological molecule. For this it is necessary to define a coordinate frame of reference. Generally the ‘Cartesian coordinate system’ is used. It has three mutually perpendicular axes (OX, OY and OZ) (Figure 5) passing through a point O (the origin). To get the X, Y, Z coordinates of a point P in space, a perpendicular (PM) is first drawn from P onto the XY plane. From point M, perpendiculars (ML and MN) are drawn onto the axes OX and OY. The distances OL, LM and PM are, respectively, the Cartesian coordinates (X, Y and Z) of the atom located at P.

Fig. 5: Cartesian coordinate system Fig. 6: Cylindrical polar coordinate system

The Cartesian coordinates of a molecule cannot be fed directly to the computer for display on the screen; graphics visualization software is needed. Many such programs are freely available on the Internet, e.g.:

RasMol (http://www.umass.edu/microbio/rasmol/), MolMol (http://www.tucows.com/preview/9805), Swiss-PDB viewer (http://www.expasy.ch/spdbv/).


One can download these programs and install them on a personal computer (PC) or a graphics workstation. Each graphics package uses a specific format for supplying Cartesian coordinates; the most popular is the PDB format. Often the coordinates of a molecule are given in a ‘crystal internal frame of reference’ attached to the unit cell of the crystal, as ‘internal coordinates’ (bond lengths, bond angles and torsional angles, defined relative to the three preceding atoms), or as ‘cylindrical polar coordinates’ (r, θ, z), as these are more relevant from a crystallographic or chemical point of view. Coordinates for DNA are usually given in the ‘cylindrical polar coordinate system’ (Fig. 6) because of its helical symmetry. In such cases, the coordinates must first be transformed into the Cartesian coordinate system (X, Y, Z); only then can molecular graphics software be used to visualize the molecule. Another piece of information needed for molecular visualization is the chemical connectivity table. The majority of graphics packages compute the connectivity or derive it from the chemical formula.
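The transformation from cylindrical polar to Cartesian coordinates mentioned above is simply X = r·cos θ, Y = r·sin θ, Z = z. A minimal sketch follows, assuming θ is given in degrees; the function names and the sample atom position are hypothetical.

```python
import math


def cylindrical_to_cartesian(r, theta_deg, z):
    """Convert (r, theta in degrees, z) to Cartesian (X, Y, Z)."""
    theta = math.radians(theta_deg)
    return (r * math.cos(theta), r * math.sin(theta), z)


def cartesian_to_cylindrical(x, y, z):
    """Inverse transformation; theta is returned in degrees in (-180, 180]."""
    return (math.hypot(x, y), math.degrees(math.atan2(y, x)), z)


# A made-up atom 8.9 units from the helix axis, at 90 degrees and height 2.0.
x, y, z = cylindrical_to_cartesian(8.9, 90.0, 2.0)
print(round(x, 6), round(y, 6), z)
print(cartesian_to_cylindrical(x, y, z))
```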

Small molecules can be built using internal coordinates based on chemical parameters. For any atom j (Fig. 7) these are defined in terms of coordinates of the three preceding atoms: j' to which atom j is connected, j" to which atom j' is connected and j"' to which atom j'' is connected.

Fig. 7: Internal coordinate system

The bond length Rj is defined as the distance between the atoms j and j'. The bond angle αj is defined as the angle between the atoms j-j'-j". The torsional angle βj is the angle between the projections of the bonds j-j' and j"-j"' on the plane perpendicular to the bond j'-j". A clockwise rotation of the bond j-j' about the bond j'-j" that brings j-j' into the direction of j"-j"' is taken as positive. The definition of bond length, bond angle and torsional angle for the first three atoms is problematic. To resolve this, by definition atom 1 is taken as the origin; atom 2 is taken to lie on the negative X axis and is defined by the bond length R2; atom 3 is taken to lie in the XY plane, in the first or fourth quadrant, and is defined by the bond length R3 and bond angle α3 (Thompson, 1967). For any atom j, Bj is defined as a 4x4 matrix in terms of its internal coordinates Rj, αj and βj. Another 4x4 matrix Aj is computed from the matrix Bj and Aj' (the A matrix of the atom to which j is connected in the chain). The A matrix for the first atom is taken to be the unit matrix. For every other atom, the A matrix is obtained by multiplying the A matrix of the connected atom (chain atom) by its own B matrix. The elements A1,4, A2,4 and A3,4 represent the X, Y and Z coordinates of the atom. The matrix elements of Bj are given below:


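\[
B_j =
\begin{pmatrix}
-\cos\alpha_j & -\sin\alpha_j & 0 & -R_j\cos\alpha_j \\
\sin\alpha_j\cos\beta_j & -\cos\alpha_j\cos\beta_j & -\sin\beta_j & R_j\sin\alpha_j\cos\beta_j \\
\sin\alpha_j\sin\beta_j & -\cos\alpha_j\sin\beta_j & \cos\beta_j & R_j\sin\alpha_j\sin\beta_j \\
0 & 0 & 0 & 1
\end{pmatrix}
\]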
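The chain of matrix products can be illustrated with a short sketch. It assumes that Bj has the form usually quoted for this method (first row −cos α, −sin α, 0, −R·cos α; signs should be checked against Thompson's paper) and that the A matrix of an atom is the A matrix of its chain atom multiplied on the right by its own B matrix. The function names and the four-atom test chain are hypothetical.

```python
import math


def b_matrix(R, alpha_deg, beta_deg):
    """4x4 B matrix from bond length R, bond angle alpha and torsion beta (degrees)."""
    a, b = math.radians(alpha_deg), math.radians(beta_deg)
    ca, sa, cb, sb = math.cos(a), math.sin(a), math.cos(b), math.sin(b)
    return [[-ca,     -sa,       0.0, -R * ca],
            [sa * cb, -ca * cb, -sb,   R * sa * cb],
            [sa * sb, -ca * sb,  cb,   R * sa * sb],
            [0.0,      0.0,      0.0,  1.0]]


def matmul(P, Q):
    return [[sum(P[i][k] * Q[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def build(atoms):
    """atoms: list of (parent_index, R, alpha, beta); the first atom has parent None.

    Returns Cartesian coordinates taken from elements (1,4), (2,4), (3,4) of each A.
    """
    identity = [[float(i == j) for j in range(4)] for i in range(4)]
    A, coords = [], []
    for parent, R, alpha, beta in atoms:
        Aj = identity if parent is None else matmul(A[parent], b_matrix(R, alpha, beta))
        A.append(Aj)
        coords.append((Aj[0][3], Aj[1][3], Aj[2][3]))
    return coords


# Four atoms with unit bonds, 90 degree bond angles and a +90 degree torsion.
# Angles that are undefined for atoms 2 and 3 are set to zero (an assumption).
chain = [(None, 0.0, 0.0, 0.0), (0, 1.0, 0.0, 0.0), (1, 1.0, 90.0, 0.0), (2, 1.0, 90.0, 90.0)]
coords = build(chain)
for xyz in coords:
    print(tuple(round(c, 6) for c in xyz))
```

With this form of B, atom 2 falls on the negative X axis and atom 3 in the XY plane, in line with the conventions stated above for the first atoms.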

Modeling Biopolymers

Modeling DNA

A DNA molecule can adopt different backbone conformations, such as A, B and Z. The most common, the B form, is realized only for a random nucleotide sequence in a highly hydrated state. It has about 10 base pairs per turn of the helix, and the distance between consecutive base pairs is 3.4 Å. This gives a pitch (the distance along the helical axis between bases with identical orientation) of 34 Å, or 3.4 nanometers (nm). The bases are perpendicular to the helix axis. Each base pair is rotated by approximately 36° about the helical axis with respect to the preceding base pair. The helix has an overall width of 20 Å. The IUPAC-IUB commission nomenclature (Saenger, 1984) and the rotational angles for the backbone atoms of DNA are depicted in Figure 8. The angles are defined in terms of four consecutive atoms in the backbone of a nucleotide unit, taken from P to P. The bases have somewhat restricted movement about the glycosyl bond, C1’-N9 (χ) in purines or C1’-N1 (χ) in pyrimidines. The DNA molecule preserves its overall helical symmetry in spite of local short-range and long-range structural disorder.

Simulation of DNA can be done using either ‘Cartesian’ or ‘cylindrical polar’ coordinate building blocks. To generate standard DNA structures with helical symmetry, only two parameters are needed in addition to the building blocks: the base pair rise tr, which is the distance between successive base pairs along the helical axis, and the twist tw, which is the angle of rotation of one base pair with respect to the preceding one. If one of the Cartesian axes (in general the Z axis) coincides with the helical axis of the molecule, the DNA polymer can be generated using the set of rules given below.

Fig. 8: IUPAC-IUB nomenclature for DNA


Suppose xi, yi and zi are the coordinates of the ith atom in a building block, the coordinates of the same atom in the nth residue can be obtained by:

• Rotating coordinates of all the atoms in a block by angle (n-1).tw by a rotational transformation,

• Followed by translation of the unit along the helix axis by an amount (n-1).tr.

If the building blocks are in the Cylindrical polar coordinate system (ri, θi and zi designating coordinates of the ith atom ) (Figure 6) the task is easier. In this case, radius rin of ith atom in the n th residue does not change with the residue number. Angle θin becomes θi + (n-1). tw, and Zin the z coordinates of the ith atom in the nth residue becomes Zin = zi + (n-1).tr. Thus Xin, Yin and Zin coordinates of ith atom of the nth residues are:

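\[
\begin{aligned}
X_{in} &= r_i \cdot \cos\left(\theta_i + (n-1)\cdot tw\right) \\
Y_{in} &= r_i \cdot \sin\left(\theta_i + (n-1)\cdot tw\right) \\
Z_{in} &= z_i + (n-1)\cdot tr
\end{aligned}
\]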
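The relations above translate directly into code. The following sketch generates one strand from a building block given in cylindrical polar coordinates; the function name and the single-atom building block (one atom at a made-up radius of 8.9 Å) are hypothetical, while the twist of 36° and the rise of 3.4 Å are the B-DNA values quoted earlier.

```python
import math


def helix_coordinates(block, n_residues, tw_deg, tr):
    """Generate Cartesian coordinates for n_residues copies of a building block.

    block: list of (atom_name, r, theta_deg, z) for one residue.
    Returns a list of (residue_number, atom_name, X, Y, Z).
    """
    coords = []
    for n in range(1, n_residues + 1):
        for name, r, theta_deg, z in block:
            theta = math.radians(theta_deg + (n - 1) * tw_deg)   # theta_i + (n-1).tw
            coords.append((n, name,
                           r * math.cos(theta),
                           r * math.sin(theta),
                           z + (n - 1) * tr))                    # z_i + (n-1).tr
    return coords


block = [("P", 8.9, 0.0, 0.0)]          # illustrative building block, not real data
strand = helix_coordinates(block, 11, 36.0, 3.4)
for n, name, x, y, z in strand:
    print(n, name, round(x, 3), round(y, 3), round(z, 3))
```

Residue 11 ends up directly above residue 1 at a height of 34 Å, which reproduces the pitch of ten base pairs per turn.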

Generation of the opposite strand, can be done by using symmetry information in DNA. In the case of A and B forms of DNA, the dyad axis is along the X-axis (Arnott et al 1976). The projection of the sugar-phosphate backbone on the base plane has a mirror symmetry along this axis. As a result, the coordinates of the corresponding atom on the opposite strand (Xin', Yin', Zin' ) are given by:

\[
\begin{aligned}
X'_{in} &= X_{in} \\
Y'_{in} &= -Y_{in} \\
Z'_{in} &= Z_{in} - 2 z_i
\end{aligned}
\]

where zi is the z coordinate of the same atom in the building block and Xin, Yin and Zin are the coordinates of the corresponding atom on the generating strand. If the opposite strand is to be generated using cylindrical polar coordinates, the following relationship can be used:

\[
\begin{aligned}
\theta'_{in} &= -\theta_i + (n-1)\cdot tw \\
Z'_{in} &= -z_i + (n-1)\cdot tr
\end{aligned}
\]

These can be then converted to the Cartesian coordinates using following relation:

\[
\begin{aligned}
X'_{in} &= r_i \cdot \cos\theta'_{in} \\
Y'_{in} &= r_i \cdot \sin\theta'_{in}
\end{aligned}
\]

The advantage of cylindrical polar coordinates is that, for a molecule with cylindrical symmetry, successive residues can be added by increasing Zi by the rise per residue tr and θi by 360°/N, where N is the number of residues per turn (the axial symmetry).

Z-DNA has a dinucleotide repeat. The nth neighbouring dinucleotide unit can be generated using the following matrix transformation (Wang et al 1981):


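\[
\begin{pmatrix} X_{in} \\ Y_{in} \\ Z_{in} \end{pmatrix}
=
\begin{pmatrix}
\cos(-60^\circ \cdot n) & -\sin(-60^\circ \cdot n) & 0 \\
\sin(-60^\circ \cdot n) & \cos(-60^\circ \cdot n) & 0 \\
0 & 0 & 1
\end{pmatrix}
\times
\begin{pmatrix} X_i \\ Y_i \\ Z_i \end{pmatrix}
+ n \times
\begin{pmatrix} 0 \\ 0 \\ 7.43 \end{pmatrix}
\]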

While the opposite strand can be generated using transformation

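\[
\begin{pmatrix} X'_{in} \\ Y'_{in} \\ Z'_{in} \end{pmatrix}
=
\begin{pmatrix}
1 & 0 & 0 \\
0 & -1 & 0 \\
0 & 0 & -1
\end{pmatrix}
\times
\begin{pmatrix} X_{in} \\ Y_{in} \\ Z_{in} \end{pmatrix}
\]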

Modelling proteins

In contrast to DNA, proteins have a wide range of structures, which primarily depend upon their amino acid sequences, though not necessarily. Predicting 3D-structure of a protein, with several hundred amino acids, purely on theoretical basis, with no structural or chemical information available, is still a dream. Computer modeling can be used to simulate and energy minimize small peptidic fragments. There are several algorithms to predict secondary structural elements in proteins using statistical, chemical, homology based and ‘threading the sequence through structural motifs’ techniques. Several such tools are available at ExPASy Proteomics server http://www.expasy.ch/tools/#secondary . The starting point in protein structure simulation is the structural information on 20 commonly occurring amino acids. Since, there are various possible side chains and secondary structures, it is difficult to generate a polypeptide chain using ‘Cylindrical polar’ or ‘Cartesian coordinates’ building blocks. It is easier to use ‘Internal coordinate building blocks’ for each amino acid and assign torsional angles to all the backbone atoms, on the basis of its secondary structure (α, β, turn (T)and coil(C)) given in Table 3.

Table 3: Torsional angles for different secondary structures of proteins

Conformation             Residue   φ        ψ

α helix                  1         -57°     -47°
3₁₀ helix                1         -60°     -30°
β sheet, anti-parallel   1         -139°    135°
β sheet, parallel        1         -119°    113°
Type-I turn              2         -60°     -30°
                         3         -90°     0°
Type-II turn             2         -60°     120°
                         3         80°      0°
Type-III turn            2         -60°     -30°
                         3         -60°     -30°


Comparative modeling of proteins

Comparative protein modeling is based on the simple assumption that homologous proteins should have similar structures (Johnson et al 1994). In this approach, structurally conserved regions (SCRs) are overlapped using sequence alignment methods (Pascarella & Argos 1992). The procedure requires at least one protein of known 3D structure whose sequence is significantly similar to the target sequence (preferably with sequence identity above 60%). This is used as a ‘template’. The first and most crucial step in comparative modeling is the search for a suitable template. To determine whether a modeling request can be carried out, one compares the target sequence with a database of sequences derived from the Protein Data Bank (PDB), using programs such as FASTA and BLAST. Next, one identifies the structurally conserved regions of the protein. Sequence alignment is an important phase, as it determines the correspondence between residues in the ‘reference’ sequence and in the ‘model’ protein. This can be achieved by using the best-scoring diagonals obtained by SIM, available at SWISS-MODEL (http://swissmodel.expasy.org//SWISS-MODEL.html). SIM selects all templates with sequence identities above 25% and a projected model size larger than 20 residues. Residues located in non-conserved loops are ignored at this stage of the modeling exercise. Coordinates can be assigned to the SCRs using basic building blocks and torsional angles. Loops, or structurally variable regions (SVRs), can be built either on the basis of existing loops in the Brookhaven Protein Data Bank or by de novo generation (Orengo et al 1992, Luthy et al 1991). Since loop building adds only Cα atoms, the backbone carbonyl and nitrogen atoms must then be completed in these regions. This step can be performed using a library of pentapeptide backbone fragments derived from PDB entries determined at a resolution better than 2.0 Å.
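The template selection criterion quoted above (sequence identity above 25% and a model larger than 20 residues) can be sketched as a simple filter over pairwise alignments. This is an illustration only, not the SIM or SWISS-MODEL implementation: identity is computed over aligned, non-gap columns, the number of aligned columns is used as a stand-in for the projected model size, and all names and sequences are hypothetical.

```python
def percent_identity(aligned_target, aligned_template):
    """Percentage of identical residues over aligned (non-gap) columns."""
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have equal length")
    aligned = identical = 0
    for x, y in zip(aligned_target, aligned_template):
        if x == "-" or y == "-":
            continue                       # gap columns do not count
        aligned += 1
        identical += (x == y)
    return 100.0 * identical / aligned if aligned else 0.0


def select_templates(alignments, min_identity=25.0, min_length=20):
    """alignments: {template_id: (aligned_target, aligned_template)}.

    Returns (template_id, identity) pairs passing both thresholds, best first.
    """
    chosen = []
    for template_id, (target, template) in alignments.items():
        length = sum(1 for x, y in zip(target, template) if x != "-" and y != "-")
        identity = percent_identity(target, template)
        if identity > min_identity and length > min_length:
            chosen.append((template_id, round(identity, 1)))
    return sorted(chosen, key=lambda item: -item[1])


target = "MKTAYIAKQRQISFVKSHFSRQLE"        # made-up 24-residue target
alignments = {
    "tmplA": (target, target),             # full-length, identical
    "tmplB": (target[:10], target[:10]),   # identical but too short
    "tmplC": (target, "G" * len(target)),  # long enough but unrelated
}
print(select_templates(alignments))
print(round(percent_identity("ACDE-FG", "ACDQWFG"), 1))
```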
The next step is the construction of a framework, which is computed by averaging the position of each atom in the ‘target’ sequence, based on the location of the corresponding atoms in the ‘templates’ when more than one template exists. A conformational search of side chains is important, as these may not fit properly in the new environment. Information on the packing of molecules is very helpful for side chain organization. Close packing of helices and sheets has been extensively studied by Chothia (1984). The number of side chains that need to be built is dictated by the degree of sequence identity between the target and the template sequences. An algorithm for homology based modeling of proteins has been presented by Sali et al (1990). The program MODELLER (http://www.salilab.org/modeller/modeller.html) is available on the web. A number of packages, such as MOE (Molecular Operating Environment) and BIOSYM (Accelrys), have a program called HOMOLOGY. One can also use SWISS-MODEL (http://swissmodel.expasy.org//SWISS-MODEL.html). Structural refinement can be done using different energy minimization (EM), molecular dynamics (MD) and simulated annealing molecular dynamics (SAMD) simulation methods, discussed in the next section. The most popular programs are CHARMM, AMBER and GROMOS. The quality of a structure can be assessed from its backbone torsional angles using programs such as PROCHECK (http://www.biochem.ucl.ac.uk/~roman/procheck/procheck.html) and WHATCHECK (http://swift.cmbi.ru.nl/whatif).

Geometry optimization using molecular mechanics and dynamics

Any technique that computes the energy of a molecule as a function of its geometry can be used for energy-based optimization of the structure of a molecule. The complete mathematical description of a molecule, including both quantum mechanical and relativistic effects, is a formidable task. For molecules of up to a few atoms, belonging to the first or second row elements, quantum chemical techniques have been found quite useful for geometry optimization. For larger molecules and their complexes, an empirical, atom-atom based, pair-wise additive potential is conventionally used. It is known as a ‘Force Field’, and the approach is mechanistic. The general form of the potential energy Utot is given as:

$$U_{tot} = U_{non} + U_{tor} + U_{el} + U_{hbon} + U_{bon} + U_{ang} + U_{imp}$$
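As a toy illustration of energy-based geometry optimization with such a potential, the Python sketch below keeps only a torsional term in a single variable and locates its minimum in two ways: a coarse scan over a grid of angles and a simple gradient (steepest descent) minimizer. The functional form and the barrier heights `V1` and `V3` are hypothetical values chosen for illustration; they are not parameters of any particular force field.

```python
import math

# Hypothetical barrier heights in arbitrary energy units (assumption).
V1, V3 = 1.0, 3.0


def torsion_energy(phi):
    """Toy torsional energy U_tor(phi), phi in radians."""
    return 0.5 * V1 * (1 + math.cos(phi)) + 0.5 * V3 * (1 + math.cos(3 * phi))


def torsion_gradient(phi):
    """Analytical first derivative dU_tor/dphi."""
    return -0.5 * V1 * math.sin(phi) - 1.5 * V3 * math.sin(3 * phi)


def grid_search(step_deg=5):
    """Scan the angle on a regular grid; return the lowest-energy angle in degrees."""
    return min(range(0, 360, step_deg),
               key=lambda deg: torsion_energy(math.radians(deg)))


def steepest_descent(phi, step=0.01, tol=1e-8, max_iter=10000):
    """Move downhill along the negative gradient until it (nearly) vanishes."""
    for _ in range(max_iter):
        gradient = torsion_gradient(phi)
        if abs(gradient) < tol:
            break
        phi -= step * gradient
    return phi


global_deg = grid_search()
from_150 = math.degrees(steepest_descent(math.radians(150)))
from_70 = math.degrees(steepest_descent(math.radians(70)))
print(global_deg, round(from_150, 3), round(from_70, 3))
```

Starting at 150° the gradient minimizer reaches the same minimum as the grid scan, while starting at 70° it stops in a neighbouring, higher-energy well. This shows why derivative-based minimizers find the nearest local minimum rather than the global one.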

The terms on the right-hand side of the expression for Utot give the energy contributions from non-bonded, torsional, electrostatic, hydrogen bonding, bond length distortion, bond angle distortion and improper angle distortion interactions. These can be computed for a given geometry of a molecule using a set of empirical parameters. Geometry optimization is done by minimizing the total energy with respect to the different geometrical variables. If only two parameters, such as two torsional angles, are varied, isoenergy contours relative to the lowest energy can be plotted in two dimensions for different combinations of these two variables (Ramachandran plot). One obtains both the lowest energy structure and the other allowed conformations. The approach is known as the ‘grid search’ or ‘systematic search’ (SS) method. When the number of variables is large, the SS method cannot be used. Instead, there is a series of techniques based on computation of the first and second derivatives of the potential energy function and its Taylor series expansion. Examples of such methods are the steepest descent (SD), conjugate gradient (CG), Newton-Raphson (NR) and Powell minimizers. The approach is known as the ‘molecular mechanics’ (MM) approach. One can also use the molecular dynamics (MD) approach, which uses the classical Newtonian equations of motion and integrates acceleration and velocity to compute temperature and space coordinates. Simulated annealing molecular dynamics (SAMD), random sampling (Monte Carlo or MC) and the Genetic Algorithm (GA) are some of the newer optimization methods. Details can be found elsewhere (Kothekar, 2005).

Application of bioinformatics to drug designing and chemo-informatics

Diabetes, cancer, hypertension and infectious diseases caused by a variety of parasitic pathogens are major health hazards. Owing to rapidly developing resistance to existing drugs and high antigenic variability, many diseases still remain uncured. This necessitates the use of novel drug designing techniques. The basic principle of ‘rational drug designing’ is quite simple. It is based on the ‘Lock & Key’ hypothesis of Emil Fischer and Paul Ehrlich for enzymatic activity. According to this hypothesis, every drug interacts with its specific ‘target’ molecule. A drug target can be an enzyme, receptor, circulating messenger, storage site, ion channel or a membrane bound molecule. It can also be a DNA or an RNA molecule. There are basically two approaches to designing drugs: ‘analog based’ and ‘target structure based’ (or rational) drug designing.

Analog based drug designing

In the analog based method, a lead molecule that shows some activity against a diseased condition is optimized without taking the structure of the target molecule into consideration. The lead is usually found by chance. The simplest way is first to identify compounds with the desired activity, then to select a series of parameters based on their physico-chemical properties and set up a series of ‘simultaneous equations’ with unknown coefficients. Unknown activity can be found from the solution of these equations. Different statistical methods are usually used. The oldest method is Quantitative Structure Activity Relationship (QSAR). It is a multivariate mathematical relationship between a set of 1D/2D/3D physicochemical properties (descriptors) and any other experimentally determined property of interest, such as biological activity, association constant, dissociation constant or product yield. The QSAR relationship is expressed as a mathematical equation. Analysis of the statistical relationships between molecular structure and various properties helps in understanding how chemical structure and biological activity relate to each other. QSAR addresses two basic questions: what features of the molecule affect the activity, and what features of the molecule can be modified to enhance the activity. Basically, it is a mathematical model to account for the activity. Corwin Hansch proposed in 1958 the hypothesis that substituents on the parent molecule alter biological activity. For this purpose he made use of physical organic chemistry to compute biological activity. He used Hammett’s coefficient ‘σ’ to correlate partition relationships. The idea was to develop parameters that quantify the electronic effects of substituents on chemical reactions and ionic equilibria. Hammett and Taft correlated the chemical activity of a compound with the variation of substituents at different positions. They used the relation:

$$\log\left(K_X / K_H\right) = \rho\,\sigma_x$$

Here K_X is the rate constant of the derivative and K_H is the rate constant of the parent compound. ρ is the reaction constant, which measures how strongly a given reaction responds to electron release or withdrawal by the substituents, and σ_x is the electronic effect of the substituent relative to hydrogen. It was found that the relative lipophilic character of the substituents was an important determinant of the activity. The idea was born that, for any particular receptor, some optimum value of log(P), where P is the partition coefficient, would correspond to the maximum probability of reaching the receptor in a given time. The simplest way to express such an idea is that log(1/C), where C is the concentration needed to produce a given biological response, is parabolically dependent on log(P) (Richon & Young, 1997). Thus, the extremely useful, general Hansch equation, given below, was proposed.

$$\log(1/C) = k_1 \log(P) - k_2 \left(\log(P)\right)^2 + k_3\,\sigma + k_4$$
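In practice the coefficients of a Hansch-type relation are obtained by regression over a series of analogs. The Python sketch below fits log(1/C) = k1·log(P) − k2·(log P)² + k3·σ + k4 by ordinary least squares, solving the normal equations with a small Gaussian elimination routine. The data are synthetic and noise-free, generated from hypothetical coefficients purely to show the mechanics; all function and variable names are illustrative.

```python
def solve_linear(a, b):
    """Solve a.x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):                     # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_hansch(log_p, sigma, log_inv_c):
    """Least-squares estimate of (k1, k2, k3, k4) in
    log(1/C) = k1*logP - k2*(logP)**2 + k3*sigma + k4."""
    rows = [[p, -p * p, s, 1.0] for p, s in zip(log_p, sigma)]
    size = 4
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(size)]
           for i in range(size)]
    xty = [sum(r[i] * y for r, y in zip(rows, log_inv_c)) for i in range(size)]
    return solve_linear(xtx, xty)


# Synthetic analog series built from hypothetical coefficients.
true_k = (1.2, 0.3, 0.8, 2.0)
log_p = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
sigma = [-0.2, 0.1, 0.0, 0.3, -0.1, 0.2, 0.5]
activity = [true_k[0] * p - true_k[1] * p * p + true_k[2] * s + true_k[3]
            for p, s in zip(log_p, sigma)]

k1, k2, k3, k4 = fit_hansch(log_p, sigma, activity)
optimum_log_p = k1 / (2 * k2)   # vertex of the parabola in log(P)
print(round(k1, 3), round(k2, 3), round(k3, 3), round(k4, 3),
      round(optimum_log_p, 3))
```

The vertex of the fitted parabola, k1/(2·k2), is the optimum log(P) mentioned in the text. With real measurements the fit would not be exact, and the statistical quality of the regression would need to be checked.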

Another approach is to overlap the 3-dimensional (3D) structures of active analogs on each other, using software such as FLEX-ALIGN in MOE (Molecular Operating Environment by Chemical Computing Group, Canada) or FITIT (Holtje and Jendretzki, 1995), and then to identify the salient features, or ‘pharmacophore’, necessary for activity. This information is later used as a ‘scaffold’ to compute the activity of different compounds available in chemical databases, to yield the active analog. The success of classical QSAR, together with its limitations, prompted the development of 3D-QSAR, which makes use of the 3D structures of molecules. This method is today considered one of the most powerful methods of drug designing. Several computer programs based on this hypothesis, such as APEX-3D (Biosym), CoMFA (Tripos), RSA (Cerius 2) and CATALYST (Biocad), are popular in the industry. However, the majority of them are supplied by software vendors and are expensive.
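The idea of using a pharmacophore to screen a chemical database can be sketched in a few lines of Python. The sketch assumes a deliberately simplified representation: each molecule carries at most one 3D point per feature type, and the pharmacophore is the set of pairwise distances between features of a known active. Because only distances are compared, no explicit superposition is needed. The feature names, coordinates and the 0.5 Å tolerance are hypothetical, and this is not the algorithm used by any of the programs named above.

```python
import math
from itertools import combinations


def derive_pharmacophore(features):
    """Pairwise feature distances (in angstroms) of a known active molecule.

    features: {feature_name: (x, y, z)}
    """
    return {(a, b): math.dist(features[a], features[b])
            for a, b in combinations(sorted(features), 2)}


def matches_pharmacophore(features, pharmacophore, tolerance=0.5):
    """True if every required feature pair is present at the required distance."""
    for (a, b), required in pharmacophore.items():
        if a not in features or b not in features:
            return False
        if abs(math.dist(features[a], features[b]) - required) > tolerance:
            return False
    return True


def screen(database, pharmacophore, tolerance=0.5):
    """Names of database molecules that satisfy the pharmacophore."""
    return [name for name, features in database.items()
            if matches_pharmacophore(features, pharmacophore, tolerance)]


# Hypothetical active analog and candidate database.
active = {"donor": (0.0, 0.0, 0.0),
          "acceptor": (3.0, 0.0, 0.0),
          "aromatic": (0.0, 4.0, 0.0)}
database = {
    # same geometry as the active, but translated and rotated
    "cand_a": {"donor": (1.0, 1.0, 1.0), "acceptor": (1.0, 4.0, 1.0),
               "aromatic": (5.0, 1.0, 1.0)},
    # acceptor too far from the donor
    "cand_b": {"donor": (1.0, 1.0, 1.0), "acceptor": (1.0, 7.0, 1.0),
               "aromatic": (5.0, 1.0, 1.0)},
    # aromatic feature missing
    "cand_c": {"donor": (0.0, 0.0, 0.0), "acceptor": (3.0, 0.0, 0.0)},
}

pharmacophore = derive_pharmacophore(active)
print(screen(database, pharmacophore))
```

Only `cand_a` reproduces all three inter-feature distances. A realistic screen would also have to handle several features of the same type and multiple conformers per molecule.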


Chemoinformatics

The main hurdle in popularizing 3D-QSAR is the limited availability of 3D parameters in chemical databases. Chemoinformatics deals with this area. It is based on the laws of physics and chemistry, and covers coordinate systems and transformations, building of molecules and polymers, conformational studies and quantum chemical calculations on small molecules, and molecular mechanics (MM) and molecular dynamics (MD) calculations. Many software packages are available for these tasks. It is one of the most important areas in bioinformatics.

Target structure based drug designing

The design of a new, potent and selective ligand for a given protein or receptor is one of the most important applications in contemporary drug design. The field is often known as ‘structure-based’, ‘rational’ or ‘de novo’ drug design, and it uses information on the 3D structure of the target molecule. Basically, there are three conceptually different routes to construct a new molecule. The first is to do an ‘active site analysis’. These methods are used to determine which kinds of atoms and functional groups are best suited to interact with the ‘active site’. The second approach consists of searching for a ‘template’ structure and subsequently modifying it to satisfy the functional requirements of the protein binding site. One may obtain such a template candidate from a search in a 3D database. It is also possible to modify existing structures to optimize a ligand using the 3D-QSAR approach. The third and most popular approach is ‘sticking together pieces’ (growers and builders). These pieces may be either fragments (functional groups, rings, etc.) or isolated single atoms. The approach is known as ‘connection methods’ or ‘de novo design’. The database search method requires preparation or cleaning of the protein structural data file for energy or score calculations, visualization of the macromolecular structure using graphics tools, active site detection, contact potential or scoring methods, and ligand docking methods. These usually search for a ligand that is complementary to the active site of the target and shows maximum interaction (the lowest value of free energy of interaction). Interactions can be steric, hydrophobic, electrostatic or hydrogen bonding (H-bonding). Computation of the ‘free energy’ of interaction of a ligand with the receptor in the presence of water is a computationally intensive job.
As a result, many packages do not allow conformational flexibility of the ligand or of the active site residues of the receptor, and also do not compute the energy explicitly. A typical example is DOCK, developed by Kuntz (DesJarlais et al., 1988, Kuntz, 1992). Some methods, such as LUDI by Böhm (1993), resort to purely geometric considerations based on certain rules. The ligand is docked in the active site using programs such as DOCK, AFFINITY, GOLD and FLEXX. A list of docking software can be found at: http://www.imb-jena.de/~rake/Bioinformatics_WEB/dd_tools.html

In pharmacogenomics, one first isolates a gene related to the diseased condition, using DNA microarrays. The sequence is then translated into the corresponding protein sequence. The next step is to find a suitable ‘template’ for sequence comparison and to use ‘homology’ based protein modeling to fold the protein in the right way. Next, one analyzes the ‘active cavity’ and designs a suitable molecule to inhibit or activate the target molecule. It is now possible to develop personalized medicine. The approach would enable mankind to enjoy the fruits of the Human Genome Project (HGP).
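The translation step of this pipeline, from a gene sequence to the corresponding protein sequence, can be sketched in Python as follows. The sketch assumes the standard genetic code, a single forward reading frame that starts at the first ATG, and a coding sequence without introns; the function name and the example sequence are hypothetical.

```python
BASES = "TCAG"
# Amino acids of the standard genetic code, ordered by codon with
# bases taken in the order T, C, A, G ("*" marks a stop codon).
AMINO_ACIDS = ("FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG")
CODON_TABLE = {
    first + second + third: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, first in enumerate(BASES)
    for j, second in enumerate(BASES)
    for k, third in enumerate(BASES)
}


def translate(dna):
    """Translate from the first ATG up to (not including) the first stop codon.

    Unknown codons (for example those containing N) are written as 'X'.
    Returns an empty string if no start codon is found.
    """
    dna = dna.upper().replace("U", "T")   # also accept RNA input
    start = dna.find("ATG")
    if start < 0:
        return ""
    protein = []
    for pos in range(start, len(dna) - 2, 3):
        residue = CODON_TABLE.get(dna[pos:pos + 3], "X")
        if residue == "*":
            break
        protein.append(residue)
    return "".join(protein)


print(translate("ccATGGCCAAATGGTAAgg"))   # hypothetical gene fragment
```

The resulting one-letter protein sequence is what would then be compared against template sequences in the homology modeling step.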


Other databases

Bioinformatics also deals with databases for crops, organisms, microbes, patents, chemicals, bibliographic information, etc. Management of drug trials and their data is a new, upcoming area in bioinformatics. Medical bioinformatics (Medical Informatics) deals with information on diseases, their treatment, geographical spread, management, manpower, etc.

Summary

Bioinformatics is basically the use of computational tools to acquire data on DNA and protein sequences and on structures, to fold proteins, to design effectors, to maintain, curate and analyze databases, and to give guidelines for the design and development of new biotech products. It substantially reduces the cost, time and manpower needed for product development.

Suggested Reading

1. Altschul, S.F.; Gish, W.; Miller, W.; Myers, E.W. and Lipman, D.J. (1990) J. Mol. Biol. 215: 403.
2. Altschul, S.F.; Madden, T.L.; Schaffer, A.A.; Zhang, J.; Zhang, Z.; Miller, W. and Lipman, D.J. (1997) Nucleic Acids Research 25: 2289.
3. Altschul, S.F. (1991) J. Mol. Biol. 219: 555.
4. Arnott, S.; Campbell Smith, F.J. and Chandrasekaran, R. (1976) “CRC Handbook Biochemistry and Molecular Biology”, G.D. Fasman ed., CRC, Cleveland, Ohio, vol. 2, Nucleic acids, 411.
5. Attwood, T.K. and Parry-Smith, D.J. (2001) “Introduction to Bioinformatics”, Pearson Education, Asia.
6. Baxevanis, A.D. and Ouellette, B.F.F. (2001) “Bioinformatics, a Practical Guide to the Analysis of Genes and Proteins”, Wiley Interscience.
7. Benson, D.A.; Boguski, M.S.; Lipman, D.J.; Ostell, J. and Ouellette, B.F.F. (1998) Nucleic Acids Research 26: 1.
8. Berman, H.M.; Westbrook, J.; Feng, Z.; Gilliland, G.; Bhat, T.N.; Weissig, H.; Shindyalov, I.N. and Bourne, P.E. (2000) Nucleic Acids Research 28: 235.
9. Bernstein, F.C.; Koetzle, T.F.; Williams, G.J.; Meyer, E.E.; Brice, M.D.; Rodgers, J.R.; Kennard, O.; Shimanouchi, T. and Tasumi, M. (1977) J. Mol. Biol. 112: 535.
10. Bjellqvist, B.; Hughes, G.; Pasquali, C.; Paquet, N.; Ravier, F.; Sanchez, J.-C.; Frutiger, S. and Hochstrasser, D.F. (1993) Electrophoresis 14: 1023.
11. Bleasby, A.J.; Akrigg, D. and Attwood, T.K. (1994) Nucleic Acids Research 22: 3574.
12. Böhm, H.J. (1993c) J. Comput. Aided Mol. Design 6: 593.
13. Branden, C. and Tooze, J. (1991) “Introduction to Protein Structure”, Garland Science.
14. Brendel, V.; Bucher, P.; Nourbakhsh, I.; Blaisdell, B.E. and Karlin, S. (1992) Proc. Natl. Acad. Sci. USA 89: 2002.
15. Chothia, C. (1984) Annual Rev. Biochem. 53: 537.
16. Dayhoff, M.O.; Schwartz, R. and Orcutt, B.C. (1978) “Atlas of protein sequence and structure. Vol. 5, Suppl. 3, Ed”.
17. DesJarlais, R.L.; Sheridan, R.P.; Seibel, G.L.; Dixon, J.S.; Kuntz, I.D. and Venkataraghavan, R. (1988) J. Med. Chem. 31: 722.
18. Feng, D.F. and Doolittle, R.F. (1987) J. Mol. Evol. 25: 351.
19. Greer, J. (1981) J. Mol. Biol. 153: 1027.
20. Gribskov, M.; Homyak, M.; Edenfield, J. and Eisenberg, D. (1988) Comput. Appl. Biosci. 4: 61.
21. Henikoff, S. and Henikoff, J.G. (1992) Proc. Natl. Acad. Sci. USA 89: 10915.
22. Hobohm, U. and Sander, C. (1995) J. Mol. Biol. 251: 390.
23. Holtje, H.D. and Jendretzki, U.K. (1995) Arch. Pharm. - Med. Chem. 328: 577.
24. http://www.netsci.org/Science/Compchem/feature19.html
25. Johnson, M.S.; Srinivasan, N.; Sowdhamini, R. and Blundell, T.L. (1994) Crit. Rev. Biochem. Mol. Biol. 29: 1.
26. Kothekar, V. (2005) “Essentials of Drug Designing”, Dhruv Publications.
27. Kuntz, I.D. (1992) Science 257: 1078.
28. Kyte, J. and Doolittle, R.F. (1982) J. Mol. Biol. 157: 105.
29. Lipman, D.J. and Pearson, W.R. (1985) Science 227: 1433.


30. Luthy, R.; McLachlan, A.D. and Eisenberg, D. (1991) Proteins 10: 220.
31. Mewes, H.W.; Hani, J.; Pfeiffer, F. and Frishman, D. (1998) Nucleic Acids Research 26: 33.
32. Needleman, S.B. and Wunsch, C.D. (1970) J. Mol. Biol. 48: 443.
33. Noel, J.P.; Hamm, H.E. and Sigler, P.B. (1993) Nature 366: 654.
34. Orengo, C.A.; Brown, N.P. and Taylor, W.R. (1992) Proteins 14: 128.
35. Pappin, D.J.C.; Hojrup, P. and Bleasby, A.J. (1993) Current Biol. 3: 327.
36. Pascarella, S. and Argos, P. (1992b) Prot. Eng. 5: 121.
37. Richon, A.B. and Young, S.S. (1997) Introduction to QSAR methodology.
38. Russell, R.B.; Saqi, M.A.S.; Sayle, R.A.; Bates, P.A. and Sternberg, M.J.E. (1997) J. Mol. Biol. 269: 423.
39. Sali, A. and Blundell, T.L. (1990) J. Mol. Biol. 212: 403.
40. Sali, A.; Overington, J.P.; Johnson, M.S. and Blundell, T.L. (1990) TIBS 15: 235.
41. Sanger, W.F. (1984) “Nucleic acid structure and function”, Academic Press, New York.
42. Smith, T.F. and Waterman, M.S. (1981) J. Mol. Biol. 147: 443.
43. Stoesser, G.; Moseley, M.A.; Sleep, J.; McGowran, M.; Garcia-Pastor, M. and Sterk, P. (1998) Nucleic Acids Research 26: 21.
44. Stryer, L. (1999) “Biochemistry”, W.H. Freeman & Company, New York.
45. Tateno, Y.; Fukami-Kobayashi, K.; Miyazaki, S.; Sugawara, H. and Gojobori, T. (1998) Nucleic Acids Research 26: 16.
46. Thompson, H.B. (1967) J. Chem. Phys. 47: 3407.
47. Wang, A.H.J.; Quigley, G.J.; Kolpak, F.J.; Crawford, J.L.; van Boom, J.H.; van der Marel, G. and Rich, A. (1979) Nature 282: 680.
48. Watson, J.D. and Crick, F.H.C. (1953) Nature 171: 737.
49. Wilkins, M.R.; Lindskog, I.; Gasteiger, E.; Bairoch, A.; Sanchez, J.C.; Hochstrasser, D.F. and Appel, R.D. (1997) Electrophoresis 18: 403.

Biochemistry
50. Nelson, D.L. and Cox, M.M. (2004) “Lehninger Principles of Biochemistry”, W.H. Freeman.
51. Stryer, L. (1999) “Biochemistry”, W.H. Freeman & Company, New York.

Biophysics
52. Hoppe, W.; Lohmann, W.; Mark, H. and Ziegler, H.M. (1983) “Biophysics”, Springer Verlag, Heidelberg.
53. Sanger, W.F. (1984) “Nucleic acid structure and function”, Academic Press.

Bioinformatics Basics
54. Lesk, A.M. (2002) “Introduction to Bioinformatics”, Oxford University Press.
55. Gibas, C. and Jambeck, P. (2001) “Developing Bioinformatics Computer Skills”, O'Reilly Media.
56. Kothekar, V. (2004) “Introduction to Bioinformatics”, Dhruv Publications.

For sequence databases and analysis
57. Attwood, T.K. and Parry-Smith, D.J. (2001) “Introduction to Bioinformatics”, Pearson Education, Asia.
58. Baxevanis, A.D. and Ouellette, B.F.F. (2001) “Bioinformatics, a Practical Guide to the Analysis of Genes and Proteins”, Wiley Interscience.

Molecular Modeling
59. Andrew, V.A. and Gardner, M. (1994) “Molecular modelling and drug design”, CRC Press, Boca Raton.
60. Holtje, H.D. and Folkers, G. (1996) “Molecular Modeling Principles and applications”, VCH.

Protein Modeling
61. Tsigelny, I.F. (2002) “Protein Structure Prediction: Bioinformatics Approach”, International University Line.
62. Elber, R. (2007) “Protein Modeling”, Springer.

Drug designing
63. Kothekar, V. (2005) “Essentials of Drug Designing”, Dhruv Publications.
64. Kubinyi, H. (1993) “3D QSAR in Drug Design: Theory, Methods and Applications”, ESCOM, Leiden.

Chemoinformatics
65. Gasteiger, J. and Engel, T. (Editors) “Chemoinformatics: A Textbook” (Hardcover).
