Bioinformatics
Handling and analysis of data obtained from current biomedical / gene technology methods

Interdisciplinary science
• Biology
• Mathematics
• Computer science

Medical genomics and bioinformatics
Biological sequences: DNA -> mRNA -> protein
Information resources in biomedicine
Sequence analysis
Sequence alignments
Database searches for sequence similarity
Finding genes in genomes
Finding disease genes
Linkage analysis
Microarray data analysis: gene expression - mRNA abundance
Molecular genetic and cytogenetic analysis in the clinic
RNA bioinformatics: microRNAs and prediction of target mRNAs
Proteomics: large scale analysis of protein content
Molecular phylogeny
Sequences in virology and microbiology

Introduction to bioinformatics • Information resources
Tore Samuelsson, Nov 2009
Medical genomics and bioinformatics · bio.biomedicine.gu.se/courses/ht09/genomics/db_lecture1... · 2011. 8. 8.
Two major types of DNA / nucleotide / base sequences are found in databases such as GenBank and EMBL:
* Genomic sequences, arising from sequencing of DNA material isolated from cells
* ESTs, arising from projects to determine which mRNAs are produced in a certain organism or in a certain type of cell within a multicellular organism
DNA -> mRNA -> EST (Expressed Sequence Tag)
Expressed Sequence Tags (ESTs) correspond to partial mRNA sequences; they are sequences of cDNA that has been reverse-transcribed from mRNA
Short sequences (~500-1000 bases); each is the result of a single sequencing experiment -> high frequency of errors
Applications:
1) Used to answer questions like: Which genes are expressed in a specific cell or tissue?
2) Identification of coding regions in genomic sequences
3) Discovery of new genes
Redundancy at GenBank => RefSeq
Many sequences are represented more than once in GenBank; RefSeq provides a non-redundant set of reference sequences
Why are genome sequences and comparative genomics useful?
• Many non-human organisms are important model systems
• Comparative genomics useful in gene identification, identification of regulatory elements etc.
• Evolution of genes, proteins and organisms
Variation between individuals
Individual genomes sequenced:
* 2007: Craig Venter
* 2008: James Watson; cancer patient (normal and cancer tissue); Yoruba (Ibadan, Nigeria); Han Chinese
Variation found per individual genome:
* SNPs: ~3 x 10^6
* Insertion/deletion polymorphisms: 10^5 - 10^6
* Structural variants / copy number variation: 10^3 - 10^4
Flow of genetic information
DNA
RNA transcript
splicing
mature mRNA
protein
protein structure -> biological function
56,000 protein structures
8,000,000 protein sequences
100 x 10^6 sequences corresponding to partial mRNAs
~ 250 x 10^9 nt
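The flow DNA -> mRNA -> protein listed above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function names `transcribe` and `translate` are hypothetical, the input is assumed to be the coding (sense) strand, and splicing is ignored. The codon table is the standard genetic code.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def transcribe(coding_dna: str) -> str:
    """Return the mRNA for a coding-strand DNA sequence (T replaced by U)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate an mRNA from its first base until a stop codon or the end."""
    dna = mrna.upper().replace("U", "T")
    protein = []
    for pos in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[pos:pos + 3]]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


mrna = transcribe("ATGGCCTAA")
print(mrna, translate(mrna))
```

Real genes additionally require finding the correct reading frame and removing introns, which this sketch leaves out.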
The SWISS-PROT Protein Sequence Data Bank (www.ebi.ac.uk) is a database of protein sequences produced collaboratively by Amos Bairoch (University of Geneva) and the EBI. It contains high-quality annotation, is non-redundant, and is cross-referenced to many other databases.
SWISS-PROT is accompanied by TrEMBL, a computer-annotated supplement to SWISS-PROT. TrEMBL contains the translations of all coding sequences (CDS) present in the EMBL Nucleotide Sequence Database not yet integrated into SWISS-PROT.
UniProt = Swiss-Prot + TrEMBL
Sequence entries in Feb 2009:
* UniProt: 7,568,118
* Swiss-Prot: 410,518
* TrEMBL: 7,157,600
* GenBank / NCBI protein db: 24,133,189
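As a quick arithmetic check of the Feb 2009 figures above, the following sketch confirms that the Swiss-Prot and TrEMBL counts add up to the UniProt total and computes the share of entries that are manually annotated. The variable names are illustrative only.

```python
# Entry counts quoted above (Feb 2009).
swissprot = 410_518
trembl = 7_157_600
uniprot = 7_568_118

# UniProt is the union of the two sections, so the counts should add up.
total = swissprot + trembl

# Share of UniProt entries with manual (Swiss-Prot) annotation.
curated_percent = 100 * swissprot / uniprot

print(total == uniprot, f"{curated_percent:.1f}% manually annotated")
```

Only about one entry in twenty carried curated annotation at that time; the rest were computer-annotated translations.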
Protein sequence databases
ID   PRIO_HUMAN     STANDARD;      PRT;   253 AA.
AC   P04156;
DT   01-NOV-1986 (REL. 03, CREATED)
DT   01-NOV-1986 (REL. 03, LAST SEQUENCE UPDATE)
DT   01-NOV-1997 (REL. 35, LAST ANNOTATION UPDATE)
DE   MAJOR PRION PROTEIN PRECURSOR (PRP) (PRP27-30) (PRP33-35C) (ASCR).
GN   PRNP.
OS   HOMO SAPIENS (HUMAN).
OC   EUKARYOTA; METAZOA; CHORDATA; VERTEBRATA; TETRAPODA; MAMMALIA;
OC   EUTHERIA; PRIMATES.
RN   [1]
RP   SEQUENCE FROM N.A.
RX   MEDLINE; 86300093.
RA   KRETZSCHMAR H.A., STOWRING L.E., WESTAWAY D., STUBBLEBINE W.H.,
RA   PRUSINER S.B., DEARMOND S.J.;
RL   DNA 5:315-324(1986).
RN   [2]
RP   SEQUENCE OF 8-253 FROM N.A.
RX   MEDLINE; 86261778.
RA   LIAO Y.-C.J., LEBO R.V., CLAWSON G.A., SMUCKLER E.A.;
RL   SCIENCE 233:364-367(1986).
RN   [3]
RP   VARIANT AMYLOID GSS, SEQUENCE OF 58-85 AND 111-150.
RX   MEDLINE; 91160504.
RA   TAGLIAVINI F., PRELLI F., GHISO J., BUGIANI O., SERBAN D.,
RA   PRUSINER S.B., FARLOW M.R., GHETTI B., FRANGIONE B.;
RL   EMBO J. 10:513-519(1991).
RN   [4]
RP   REVIEW ON VARIANTS.
RX   MEDLINE; 93372867.
RA   PALMER M.S., COLLINGE J.;
RL   HUM. MUTAT. 2:168-173(1993).
CC   -!- FUNCTION: THE FUNCTION OF PRP IS NOT KNOWN. PRP IS ENCODED IN THE
CC       HOST GENOME AND IS EXPRESSED BOTH IN NORMAL AND INFECTED CELLS.
CC   -!- SUBUNIT: PRP HAS A TENDENCY TO AGGREGATE YIELDING POLYMERS CALLED
CC       "RODS".
CC   -!- SUBCELLULAR LOCATION: ATTACHED TO THE MEMBRANE BY A GPI-ANCHOR.
CC   -!- DISEASE: PRP IS FOUND IN HIGH QUANTITY IN THE BRAIN OF HUMANS AND
CC       ANIMALS INFECTED WITH NEURODEGENERATIVE DISEASES KNOWN AS
CC       TRANSMISSIBLE SPONGIFORM ENCEPHALOPATHIES OR PRION DISEASES, LIKE:
CC       CREUTZFELDT-JACOB DISEASE (CJD), GERSTMANN-STRAUSSLER SYNDROME
CC       (GSS), FATAL FAMILIAL INSOMNIA (FFI) AND KURU IN HUMANS; SCRAPIE
CC       IN SHEEP AND GOAT; BOVINE SPONGIFORM ENCEPHALOPATHY (BSE) IN
CC       CATTLE; TRANSMISSIBLE MINK ENCEPHALOPATHY (TME); CHRONIC WASTING
CC       DISEASE (CWD) OF MULE DEER AND ELK; FELINE SPONGIFORM
CC       ENCEPHALOPATHY (FSE) IN CATS AND EXOTIC UNGULATE ENCEPHALOPATHY
CC       (EUE) IN NYALA AND GREATER KUDU. THE PRION DISEASES ILLUSTRATE
CC       THREE MANIFESTATIONS OF CNS DEGENERATION: (1) INFECTIOUS (2)
CC       SPORADIC AND (3) DOMINANTLY INHERITED FORMS. TME, CWD, BSE, FSE,
CC       EUE ARE ALL THOUGHT TO OCCUR AFTER CONSUMPTION OF PRION-INFECTED
CC       FOODSTUFFS.
CC   -!- DISEASE: CJD OCCURS PRIMARILY AS A SPORADIC DISORDER (1 PER
CC       MILLION), WHILE 10-15% ARE FAMILIAL. ACCIDENTAL TRANSMISSION OF
CC       CJD TO HUMANS APPEARS TO BE IATROGENIC (CONTAMINATED HUMAN GROWTH
CC       HORMONE (HGH), CORNEAL TRANSPLANTATION, ELECTROENCEPHALOGRAPHIC
CC       ELECTRODE IMPLANTATION. . .). EPIDEMIOLOGIC STUDIES HAVE FAILED TO
CC       IMPLICATE THE INGESTION OF INFECTED ANNIMAL MEAT IN THE
CC       PATHOGENESIS OF CJD IN HUMAN. THE TRIAD OF MICROSCOPIC FEATURES
CC       THAT CHARACTERIZE THE PRION DISEASES CONSISTS OF (1) SPONGIFORM
CC       DEGENERATION OF NEURONS, (2) SEVERE ASTROCYTIC GLIOSIS THAT OFTEN
CC       APPEARS TO BE OUT OF PROPORTION TO THE DEGREE OF NERF CELL LOSS,
CC       AND (3) AMYLOID PLAQUE FORMATION. CJD IS CHARACTERIZED BY
CC       PROGRESSIVE DEMENTIA AND MYOCLONIC SEIZURES, AFFECTING ADULTS IN
CC       MID-LIFE. SOME PATIENTS PRESENT SLEEP DISORDERS, ABNORMALITIES OF
CC       HIGH CORTICAL FUNCTION, CEREBELLAR AND CORTICOSPINAL DISTURBANCES.
CC       THE DISEASE ENDS IN DEATH AFTER A 3-12 MONTHS ILLNESS.
CC   -!- DISEASE: GSS IS A HETEROGENEOUS DISORDER AND WAS DEFINED AS A
CC       "SPINOCEREBELLAR ATAXIA WITH DEMENTIA AND PLAQUELIKE DEPOSITS".
CC       GSS INCIDENCE IS LESS THAN 2 PER 100 MILLION.
CC   -!- DISEASE: KURU IS TRANSMITTED DURING RITUALISTIC CANNIBALISM, AMONG
CC       NATIVES OF THE NEW GUINEA HIGHLANDS. PATIENTS EXHIBIT VARIOUS
CC       MOVEMENT DISORDERS LIKE CEREBELLAR ABNORMALITIES, RIGIDITY OF THE
CC       LIMBS, AND CLONUS. EMOTIONNAL LABILITY IS PRESENT, AND DEMENTIA IS
CC       CONSPICUOUSLY ABSENT. DEATH USUALLY OCCURS FROM 3 TO 12 MONTH
CC       AFTER ONSET.
CC   -!- SIMILARITY: TO OTHER PRP.
CC   -!- DATABASE: NAME=HotMolecBase; NOTE=PrP entry;
CC       WWW="http://bioinformatics.weizmann.ac.il/hotmolecbase/entries/prp.htm".
FT   SIGNAL        1     22
FT   CHAIN        23    230       MAJOR PRION PROTEIN.
FT   PROPEP      231    253       REMOVED IN MATURE FORM (BY SIMILARITY).
FT   LIPID       230    230       GPI-ANCHOR (BY SIMILARITY).
FT   CARBOHYD    181    181       PROBABLE.
FT   CARBOHYD    197    197       PROBABLE.
FT   DISULFID    179    214       BY SIMILARITY.
FT   DOMAIN       51     91       5 X 8 AA TANDEM REPEATS OF P-H-G-G-G-W-G-
FT                                Q.
FT   REPEAT       51     59       1.
FT   REPEAT       60     67       2.
FT   REPEAT       68     75       3.
FT   REPEAT       76     83       4.
FT   REPEAT       84     91       5.
FT   VARIANT     102    102       P -> L (IN GSS).
FT   VARIANT     105    105       P -> L (IN GSS).
FT   VARIANT     117    117       A -> V (LINKED TO DEVELOPMENT OF
FT                                DEMENTING GSS).
FT   VARIANT     129    129       M -> V (DETERMINES THE DISEASE PHENOTYPE
FT                                IN PATIENTS WHO HAVE A PRP MUTATION AT
FT                                CODON 178: PATIENTS WITH MET DEVELOP FFI,
FT                                THOSE WITH VAL DEVELOP CJD).
FT   VARIANT     178    178       D -> N (IN FFI AND CJD).
FT   VARIANT     180    180       V -> I (IN CJD).
FT   VARIANT     198    198       F -> S (IN A ATYPICAL FORM OF GSS WITH
FT                                NEUROFIBRILLARY TANGLES).
FT   VARIANT     200    200       E -> K (IN CJD).
FT   VARIANT     210    210       V -> I (IN CJD).
FT   VARIANT     217    217       Q -> R (IN GSS WITH NEUROFIBRILLARY
FT                                TANGLES).
FT   VARIANT     232    232       M -> R (IN CJD).
FT   CONFLICT    118    118       MISSING (IN REF. 2).
SQ   SEQUENCE   253 AA;  27661 MW;  FD5373AD CRC32;
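Because every line of such a record starts with a two-letter line code (ID, AC, CC, FT, ...), the record is easy to process with a short script. The sketch below pulls the disease-associated amino acid variants out of the FT lines. It is an illustration only: the function name `parse_variants` is hypothetical, the parsing is whitespace-based rather than following the exact column layout of the format, and the sample is a small excerpt of the PRIO_HUMAN entry above.

```python
SAMPLE = """\
FT   CHAIN        23    230       MAJOR PRION PROTEIN.
FT   VARIANT     102    102       P -> L (IN GSS).
FT   VARIANT     117    117       A -> V (LINKED TO DEVELOPMENT OF
FT                                DEMENTING GSS).
FT   VARIANT     178    178       D -> N (IN FFI AND CJD).
"""


def parse_variants(record: str) -> list:
    """Return (position, original residue, variant residue) for each FT VARIANT line."""
    variants = []
    for line in record.splitlines():
        # Fields: line code, feature key, start, end, description.
        parts = line.split(None, 4)
        if len(parts) < 5 or parts[0] != "FT" or parts[1] != "VARIANT":
            continue  # other line codes, other features, continuation lines
        if not parts[2].isdigit():
            continue
        # The description starts with e.g. "P -> L", followed by a comment in parentheses.
        change = parts[4].split("(")[0]
        original, _, variant = change.partition("->")
        variants.append((int(parts[2]), original.strip(), variant.strip()))
    return variants


for position, original, variant in parse_variants(SAMPLE):
    print(f"{original}{position}{variant}")
```

Continuation lines of a long description (such as "DEMENTING GSS).") are skipped here; a fuller parser would append them to the preceding feature.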
Protein sequence databases can be accessed through:
• Uniprot (www.ebi.uniprot.org/)
• Entrez
UniProt - record
Three-dimensional structures of proteins, DNA and RNA are collected in the Protein Data Bank (PDB)
They are all the result of experimental work:
* X-ray crystallography
* NMR
Example of PDB entry
HEADER    HORMONE                                 30-OCT-92   1BPH      1BPH   2
COMPND    INSULIN (CUBIC) IN 0.1M SODIUM SALT SOLUTION AT PH9           1BPH   3
SOURCE    BOVINE (BOS $TAURUS) PANCREAS                                 1BPH   4
AUTHOR    O.GURSKY,J.BADGER,Y.LI,D.L.D.CASPAR                           1BPH   5
REVDAT   2   31-OCT-93 1BPHA   1       REMARK HET    FORMUL             1BPHA  1
REVDAT   1   15-JAN-93 1BPH    0                                        1BPH   6
JRNL        AUTH   O.GURSKY,J.BADGER,Y.LI,D.L.D.CASPAR                  1BPH   7
JRNL        TITL   CONFORMATIONAL CHANGES IN CUBIC INSULIN CRYSTALS     1BPH   8
JRNL        TITL 2 IN THE PH RANGE 7-11                                 1BPH   9
JRNL        REF    BIOPHYS.J.                    V.  63  1210 1992      1BPH  10
JRNL        REFN   ASTM BIOJAU  US ISSN 0006-3495                   030 1BPH  11
REMARK   1                                                              1BPH  12
REMARK   1 REFERENCE 1
ATOM      1  N   GLY A   1      13.994  47.196  31.798  1.00 35.87      1BPH 129
ATOM      2  CA  GLY A   1      14.277  46.226  30.708  1.00 38.67      1BPH 130
ATOM      3  C   GLY A   1      15.574  45.507  31.085  1.00 31.18      1BPH 131
ATOM      4  O   GLY A   1      16.078  45.660  32.217  1.00 22.60      1BPH 132
ATOM      5  N   ILE A   2      16.088  44.766  30.126  1.00 28.39      1BPH 133
ATOM      6  CA  ILE A   2      17.342  44.034  30.404  1.00 23.76      1BPH 134
ATOM      7  C   ILE A   2      18.526  44.939  30.686  1.00 25.29      1BPH 135
ATOM      8  O   ILE A   2      19.425  44.457  31.392  1.00 18.74      1BPH 136
ATOM      9  CB  ILE A   2      17.571  43.072  29.158  1.00 27.36      1BPH 137
ATOM     10  CG1 ILE A   2      18.638  42.049  29.605  1.00 18.03      1BPH 138
ATOM     11  CG2 ILE A   2      17.859  43.936  27.903  1.00 25.54      1BPH 139
ATOM     12  CD1 ILE A   2      18.914  40.930  28.590  1.00 17.07      1BPH 140
ATOM     13  N   VAL A   3      18.619  46.195  30.192  1.00 24.42      1BPH 141
ATOM     14  CA  VAL A   3      19.774  47.080  30.436  1.00 30.26      1BPH 142
ATOM     15  C   VAL A   3      19.952  47.453  31.895  1.00 19.08      1BPH 143
ATOM     16  O   VAL A   3      21.018  47.421  32.561  1.00 28.15      1BPH 144
ATOM     17  CB  VAL A   3      19.719  48.274  29.462  1.00 33.87      1BPH 145
ATOM     18  CG1 VAL A   3      20.847  49.225  29.754  1.00 30.40      1BPH 146
ATOM     19  CG2 VAL A   3      19.868  47.724  28.044  1.00 24.51
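Each ATOM record gives the atom number, atom name, residue name, chain, residue number, the x, y, z coordinates (in Å), the occupancy and the temperature factor. As a minimal sketch of how such records can be used, the code below parses the first atoms of the 1BPH entry and computes the distance between the N and CA atoms of Gly A1. The function names are hypothetical, and the whitespace-based splitting is an assumption made for brevity: the PDB format is fixed-column, so production code should slice by column instead.

```python
import math

# First three ATOM records of the entry above (trailing entry IDs omitted).
ATOM_LINES = """\
ATOM      1  N   GLY A   1      13.994  47.196  31.798  1.00 35.87
ATOM      2  CA  GLY A   1      14.277  46.226  30.708  1.00 38.67
ATOM      3  C   GLY A   1      15.574  45.507  31.085  1.00 31.18
"""


def parse_atoms(text: str) -> list:
    """Parse ATOM records into dictionaries (whitespace-based, see assumption above)."""
    atoms = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 11 or fields[0] != "ATOM":
            continue
        atoms.append({
            "serial": int(fields[1]),
            "name": fields[2],
            "residue": fields[3],
            "chain": fields[4],
            "residue_number": int(fields[5]),
            "xyz": tuple(float(value) for value in fields[6:9]),
            "b_factor": float(fields[10]),
        })
    return atoms


def distance(atom_a: dict, atom_b: dict) -> float:
    """Euclidean distance between two atoms, in the units of the file."""
    return math.dist(atom_a["xyz"], atom_b["xyz"])


atoms = parse_atoms(ATOM_LINES)
n_ca = distance(atoms[0], atoms[1])
print(f"N-CA distance in Gly A1: {n_ca:.2f}")
```

The result is close to 1.5 Å, in the range expected for a covalent N-CA bond.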
3D viewers
Several free programs are available for viewing 3D structures of proteins and nucleic acids.
ss DNA
* Enzymatically synthesize its complementary strand
* Detect fluorescence of one nucleotide at a time
* Remove the blocking group (reversible terminator)
* Polymerization of another nucleotide
[Figure: template strand GCAGCTATTACGGCTATCTGACCGTCGATAAT, extended one base per cycle with terminator dNTPs]
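The cycle described above (incorporate one blocked, labelled nucleotide; detect it; remove the block; repeat) can be mimicked with a toy simulation. This is a sketch under simplifying assumptions, not the actual chemistry or any instrument software: the function name `sequence_by_synthesis` is hypothetical, strand orientation is ignored, and every cycle is assumed to succeed without error.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def sequence_by_synthesis(template: str, cycles=None) -> str:
    """Return the bases detected while copying the template, one base per cycle."""
    detected = []
    for template_base in template[:cycles]:
        # 1. The polymerase adds the complementary nucleotide; its terminator
        #    group blocks further extension within this cycle.
        incorporated = COMPLEMENT[template_base]
        # 2. The fluorescent label identifies which nucleotide was added.
        detected.append(incorporated)
        # 3. The blocking group is removed, so the next cycle can add one more base.
    return "".join(detected)


template = "GCAGCTATTACGGCTATCTGACCGTCGATAAT"  # template strand from the slide
print(sequence_by_synthesis(template, cycles=10))
```

The read is the complement of the template, built up one position per cycle, which is why read length is set by the number of cycles run.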
Sequencing by ligation (SOLiD - Applied Biosystems)
The method:
It is based on sequential ligation of dye-labeled oligonucleotide probes, whereby each probe queries two base positions at a time
DNA ligase is used rather than polymerase
The system uses 4 fluorescent dyes to encode the 16 possible two-base combinations
Multiple cycles of probe hybridization, ligation, imaging and analysis are performed
The resulting product is then removed
The process is repeated for 5 more extension rounds with primers hybridized to positions n-1, n-2, etc. in the adaptor.
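Four dyes can cover 16 two-base combinations because each dye stands for a group of four dinucleotides. The sketch below uses the commonly described SOLiD "color space" scheme, in which the color is the XOR of 2-bit codes for the two bases (A=0, C=1, G=2, T=3). That particular mapping is an assumption here, since the slides do not spell it out, and the function names are hypothetical.

```python
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_BASE = {code: base for base, code in BASE_CODE.items()}


def encode_colors(sequence: str) -> list:
    """Encode each overlapping pair of bases as one of 4 colors (0-3)."""
    return [BASE_CODE[a] ^ BASE_CODE[b] for a, b in zip(sequence, sequence[1:])]


def decode_colors(first_base: str, colors: list) -> str:
    """Recover the base sequence from a known first base and the color calls."""
    bases = [first_base]
    for color in colors:
        # XOR is its own inverse: next base code = previous base code XOR color.
        bases.append(CODE_BASE[BASE_CODE[bases[-1]] ^ color])
    return "".join(bases)


colors = encode_colors("ATGGC")
print(colors, decode_colors("A", colors))
```

Because every base is covered by two overlapping color calls, a single wrong color call changes all downstream bases on decoding, which makes measurement errors distinguishable from true single-base variants (these alter two adjacent colors).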