Statistical modeling and classification in Biological Sequence Space April 26, 04; 9.520 Gene Yeo Poggio,

Jan 18, 2018

Silvester Cole

Transcript

Statistical modeling and classification in Biological Sequence Space
April 26, 04; Gene Yeo Poggio,

Framework/Issues
  Build models around known biology
  In the process, extend knowledge about known biology
  Predict new examples
  Validate predictions by
    prediction accuracy
    experimental validation
    higher-level traits of predictions
    conservation in other genomes

Biological sequences
  DNA, RNA and proteins: macromolecules built up from smaller units.
  DNA: units are the nucleotide residues A, C, G and T
  RNA: units are the nucleotide residues A, C, G and U
  Proteins: units are the amino acid residues A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W and Y.
  To a considerable extent, the chemical properties of DNA, RNA and protein molecules are encoded in the linear sequence of these basic units: their primary structure.

Statistical models can be descriptive and/or predictive.
  Given known biological signal -> describe the signal with statistical modeling & find unknown examples of the same signal
    Gene-finding (protein-coding genes)
    Noncoding RNA genes
    Protein domains
  Warning: although successful, models are not to be taken literally.
  Most important: biological confirmation of predictions is almost always necessary.

Sequences are full of signals!
  ACGTAGCTAGCATGCATGCATGACTACGATCGACTACGATCAACGATGCATGCATCGACTACGATCAGCTACGATCAGCATCGACTAGCATCGATCAGCATCGATCAGCATCGACTAGCTACGACTAGCGCTAC
  Promoter region, first exon, internal exon, polyadenylation, last exon, intergenic region
  How do we model/describe these motifs?
  Intergenic regions, 5' UTR, 3' UTR -> microRNAs, transcription factor binding sites

Different models
  [Figure: objects placed by complexity across DNA, RNA and protein]
  Protein structure (a variety of methods)
  Splice site motif (WMM, MM, SVM, NN)
  Protein gene (HMM, NN)
  RNA gene (covariation, SCFG, NN, SVM)

Object / Model / Assumptions
  Weight Matrix Model (WMM): independence (easy)
  Hidden Markov Model (HMM): local dependence (medium)
  Stochastic Context-Free Grammar (SCFG): non-local pairwise dependence (hard)

Modeling dependencies in biological sequence motifs
  With so many genomes being sequenced, it remains important to be able to identify genes and the signals within and around genes computationally.

A case study in computational biology: modeling signals in genes

What is a (protein-coding) gene?
  DNA -> (transcription) -> mRNA -> (translation) -> Protein
  DNA:     CCTGAGCCAACTATTGATGAA
  mRNA:    CCUGAGCCAACUAUUGAUGAA
  Protein: PEPTIDEPEPTIDE

What is a gene, ctd?
  In general the transcribed sequence is longer than the translated portion: parts called introns (intervening sequence) are removed, leaving exons (expressed sequence), and yet other regions remain untranslated.
  The translated sequence comes in triples called codons, beginning with a unique start codon (ATG) and ending with one of three stop codons (TAA, TAG, TGA).
  There are also characteristic intron-exon boundaries called splice donor and acceptor sites, and a variety of other motifs: promoters, transcription start sites, polyA sites, branching sites, and so on.
  All of the foregoing have statistical characterizations.

Some facts about human genes
  Comprise about 3% of the genome
  Average gene length: ~8,000 bp
  Average of 5-6 exons/gene
  Average exon length: ~200 bp
  Average intron length: ~2,000 bp
  ~8% of genes have a single exon

The idea behind an HMM genefinder
  States represent standard gene features: intergenic region, exon, intron, perhaps more (promoter, 5' UTR, 3' UTR, poly-A, ...).
  Observations embody state-dependent statistics, such as base composition, dependence, and signal features.
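The state/observation idea above can be sketched with a toy example. This is a minimal illustration, not the model of any real genefinder: it assumes only two states ("intergenic" and "exon"), and every probability below is a made-up, hypothetical value chosen so that exons look GC-rich. The function name viterbi and all parameter tables are assumptions introduced for this sketch. It decodes the most probable state path with the Viterbi algorithm in log space.

```python
import math

# Hypothetical toy parameters (NOT taken from the slides or from any real genefinder).
STATES = ("intergenic", "exon")
START = {"intergenic": 0.9, "exon": 0.1}
TRANS = {
    "intergenic": {"intergenic": 0.9, "exon": 0.1},
    "exon": {"intergenic": 0.1, "exon": 0.9},
}
# State-dependent base composition: the "observations" of the HMM.
EMIT = {
    "intergenic": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},  # AT-rich
    "exon": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},        # GC-rich
}


def viterbi(seq):
    """Return the most probable state path for seq under the toy HMM."""
    # score[s] = best log-probability of any path that ends in state s
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # back[t][s] = best predecessor of state s at position t + 1
    for base in seq[1:]:
        prev = score
        score, pointers = {}, {}
        for s in STATES:
            best = max(STATES, key=lambda r: prev[r] + math.log(TRANS[r][s]))
            pointers[s] = best
            score[s] = prev[best] + math.log(TRANS[best][s]) + math.log(EMIT[s][base])
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


if __name__ == "__main__":
    seq = "ATATATAT" + "GCGCGCGCGCGC" + "ATATATAT"
    labels = "".join("E" if s == "exon" else "-" for s in viterbi(seq))
    print(seq)
    print(labels)
```

A real genefinder would use many more states (introns in three phases, UTRs, promoter, poly-A, both strands) and richer emissions than single-base composition; the decoding step stays conceptually the same.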
[Figure: genefinder HMM state diagram. State labels: E0, E1, E2, Ei, Et, Es (exons); I0, I1, I2 (introns); 5' UTR, 3' UTR, promoter, poly-A, intergenic region; shown for the forward (+) and reverse (-) strands.]

AGGACAGGTA CGGCTGTCAT CACTTAGACC TCACCCTGTG GAGCCACACC CTAGGGTTGG
CCAATCTACT CCCAGGAGCA GGGAGGGCAG GAGCCAGGGC TGGGCATAAA AGTCAGGGCA
GAGCCATCTA TTGCTTACAT TTGCTTCTGA CACAACTGTG TTCACTAGCA ACCTCAAACA
GACACCATGG TGCACCTGAC TCCTGAGGAG AAGTCTGCCG TTACTGCCCT GTGGGGCAAG
GTGAACGTGG ATGAAGTTGG TGGTGAGGCC CTGGGCAGGT TGGTATCAAG GTTACAAGAC
AGGTTTAAGG AGACCAATAG AAACTGGGCA TGTGGAGACA GAGAAGACTC TTGGGTTTCT
GATAGGCACT GACTCTCTCT GCCTATTGGT CTATTTTCCC ACCCTTAGGC TGCTGGTGGT
CTACCCTTGG ACCCAGAGGT TCTTTGAGTC CTTTGGGGAT CTGTCCACTC CTGATGCTGT
TATGGGCAAC CCTAAGGTGA AGGCTCATGG CAAGAAAGTG CTCGGTGCCT TTAGTGATGG
CCTGGCTCAC CTGGACAACC TCAAGGGCAC CTTTGCCACA CTGAGTGAGC TGCACTGTGA
CAAGCTGCAC GTGGATCCTG AGAACTTCAG GGTGAGTCTA TGGGACCCTT GATGTTTTCT
TTCCCCTTCT TTTCTATGGT TAAGTTCATG TCATAGGAAG GGGAGAAGTA ACAGGGTACA
GTTTAGAATG GGAAACAGAC GAATGATTGC ATCAGTGTGG AAGTCTCAGG ATCGTTTTAG
TTTCTTTTAT TTGCTGTTCA TAACAATTGT TTTCTTTTGT TTAATTCTTG CTTTCTTTTT
TTTTCTTCTC CGCAATTTTT ACTATTATAC TTAATGCCTT AACATTGTGT ATAACAAAAG
GAAATATCTC TGAGATACAT TAAGTAACTT AAAAAAAAAC TTTACACAGT CTGCCTAGTA
CATTACTATT TGGAATATAT GTGTGCTTAT TTGCATATTC ATAATCTCCC TACTTTATTT
TCTTTTATTT TTAATTGATA CATAATCATT ATACATATTT ATGGGTTAAA GTGTAATGTT
TTAATATGTG TACACATATT GACCAAATCA GGGTAATTTT GCATTTGTAA TTTTAAAAAA
TGCTTTCTTC TTTTAATATA CTTTTTTGTT TATCTTATTT CTAATACTTT CCCTAATCTC
TTTCTTTCAG GGCAATAATG ATACAATGTA TCATGCCTCT TTGCACCATT CTAAAGAATA
ACAGTGATAA TTTCTGGGTT AAGGCAATAG CAATATTTCT GCATATAAAT ATTTCTGCAT
ATAAATTGTA ACTGATGTAA GAGGTTTCAT ATTGCTAATA GCAGCTACAA TCCAGCTACC
ATTCTGCTTT TATTTTATGG TTGGGATAAG GCTGGATTAT TCTGAGTCCA AGCTAGGCCC
TTTTGCTAAT CATGTTCATA CCTCTTATCT TCCTCCCACA GCTCCTGGGC AACGTGCTGG
TCTGTGTGCT GGCCCATCAC TTTGGCAAAG AATTCACCCC ACCAGTGCAG GCTGCCTATC
AGAAAGTGGT GGCTGGTGTG GCTAATGCCC TGGCCCACAA GTATCACTAA GCTCGCTTTC
TTGCTGTCCA ATTTCTATTA AAGGTTCCTT TGTTCCCTAA GTCCAACTAC
TAAACTGGGG GATATTATGA AGGGCCTTGA GCATCTGGAT TCTGCCTAAT AAAAAACATT
TATTTTCATT GCAATGATGT

GENSCAN (Burge & Karlin): a simple genefinder

Splice sites can be an important signal
  Regular expressions can be limiting
  [Figure: consensus patterns for the 5' splice junction in eukaryotes and for the 3' splice junction]
  Most protein binding sites are characterized by some degree of sequence specificity, but seeking a consensus sequence is often an inadequate way to recognize sites.
  Position-specific distributions came to represent the variability in motif composition.

Position-specific scoring matrix (PSSM)
  Sequence:
  \[ S = S_1 S_2 S_3 S_4 S_5 S_6 S_7 S_8 S_9 \]
  Odds ratio:
  \[ R = \frac{P(S \mid +)}{P(S \mid -)} = \frac{P_{-3}(S_1)\, P_{-2}(S_2)\, P_{-1}(S_3) \cdots P_{5}(S_8)\, P_{6}(S_9)}{P_{bg}(S_1)\, P_{bg}(S_2)\, P_{bg}(S_3) \cdots P_{bg}(S_8)\, P_{bg}(S_9)} \]
  Score:
  \[ s = \log_2 R \]
  [Table: per-position probabilities with rows T, G, C, A and one column per position (Pos); values not recoverable]
  Molecular biology (transcription, splicing) signals are modeled as states (HMM) or separately, i.e. PSSMs.

OK, so we've got the genes
  Here's another catch: there isn't just one version of each gene, but sometimes several.
  E.g. alternative splicing: CD44 (Zhu et al., Science 2003), human chromosome 11p
  Alternative splicing is a major determinant of protein diversity (Lander 2001, Zavolan 2003)
  30-50% of human diseases involve alt. splicing

Defining constitutive and alternative exons
  Constitutive exon
  Skipped exon
  3' alternative exon
  5' alternative exon
  Intron retention
  Mutually exclusive exons

Fragile X Related Gene, FXR1
  Conserved alternative, skipped exon - FXR1

Myotonic Dystrophy-containing WD Repeat, DMWD
  Another example of genes containing CSE: DMWD

Predicting new alternatively spliced exons
  1. The problem is ill-posed
  2. High-dimensional space
  3. Must not overfit the data
  4. Simple feature selection
  5. Unbalanced data set sizes
  6. Labels are more flexible

Eg. of experimentally validated

Biological sequence space: challenges
  Models that represent as much of the biology as possible.
  Biologically motivated features are important
  Validating attributes:
    Conservation of events is key in computational biology
    Higher-level consistency with known biology
    Experimental validation of predictions is essential

Framework/Issues
  Build models around known biology
  In the process, extend knowledge about known biology
  Predict new examples
  Validate predictions by
    prediction accuracy
    experimental validation
    higher-level traits of predictions
    conservation in other genomes

If time permits
Modeling higher order interactions

Yeast Phe tRNA
  [Figure: secondary structure and tertiary structure]

The Hammerhead Ribozyme
  [Figure: secondary structure and tertiary structure]

Method of Covariation / Compensatory changes
  Seq1: A C G A A A G U
  Seq2: U A G U A A U A
  Seq3: A G G U G A C U
  Seq4: C G G C A A U G
  Seq5: G U G G G A A C
  One example of how to model and predict RNA 2° structure

Covariation (using comparative genomics)
  Mutual information statistic for a pair of columns i, j in a multiple alignment:
  \[ M_{ij} = \sum_{x,y} f_{x,y}^{(i,j)} \log_2 \frac{f_{x,y}^{(i,j)}}{f_x^{(i)}\, f_y^{(j)}} \]
  f_{x,y}^{(i,j)} = fraction of seqs w/ nt. x in col. i, nt. y in col. j
  f_x^{(i)} = fraction of seqs w/ nt. x in col. i
  The sum is over x, y = A, C, G, U.
  M_{ij} is maximal (2 bits) if x and y individually appear at random (A, C, G, U equally likely), but are perfectly correlated (e.g., always complementary).

Inferring 2° Structure from Covariation

Stochastic Context-Free Grammars (SCFGs)
  A generalized model which is capable of handling non-local dependencies between words in a language (or bases in an RNA)
  Ref: Durbin et al., Biological Sequence Analysis, 1998

An SCFG Model of RNA 2° Structure
  Production rules:
    P -> aWb (pair)
    L -> aW  (left bulge/loop)
    R -> Wa  (right bulge/loop)
    B -> SS  (bifurcation)
    S -> W   (start)
    E -> ε   (end)

Last page
  Some of the slides were obtained from various places: slides available online (primarily from lectures by Terry Speed), and slides from Chris Burge and Dirk Holste.
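The mutual information statistic from the covariation slides can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function name mutual_information is hypothetical, columns are indexed from 0, and frequencies are plain observed fractions with no pseudocounts or gap handling. The alignment is the five example sequences (Seq1 to Seq5) from the covariation slide.

```python
import math
from collections import Counter


def mutual_information(alignment, i, j):
    """Mutual information in bits between columns i and j (0-based) of an alignment.

    alignment is a list of equal-length strings. Frequencies are plain
    observed fractions (an assumption of this sketch: no pseudocounts).
    """
    n = len(alignment)
    pair_counts = Counter((seq[i], seq[j]) for seq in alignment)
    i_counts = Counter(seq[i] for seq in alignment)
    j_counts = Counter(seq[j] for seq in alignment)
    mi = 0.0
    for (x, y), count in pair_counts.items():
        f_xy = count / n          # fraction of seqs with x in col. i and y in col. j
        f_x = i_counts[x] / n     # fraction of seqs with x in col. i
        f_y = j_counts[y] / n     # fraction of seqs with y in col. j
        mi += f_xy * math.log2(f_xy / (f_x * f_y))
    return mi


if __name__ == "__main__":
    # The five sequences from the covariation slide, spaces removed.
    alignment = ["ACGAAAGU", "UAGUAAUA", "AGGUGACU", "CGGCAAUG", "GUGGGAAC"]
    length = len(alignment[0])
    for i in range(length):
        for j in range(i + 1, length):
            print(i + 1, j + 1, round(mutual_information(alignment, i, j), 3))
```

In this alignment the first and last columns always hold complementary bases (A-U, U-A, C-G, G-C), so that pair scores high, while any pair involving the constant third column scores zero. With only five sequences the estimates are noisy, so high scores suggest candidate base pairs rather than establish them.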