Gene Finding in Eukaryotes
Jan-Jaap Wesselink [email protected]
Computational and Structural Biology Group, Centro Nacional de Investigaciones Oncológicas
Madrid, July 2008
Different Types of Information Can be Used:
- Signals: search for signals of transcription, splicing, translation. Typically, these signals are assigned a score, and the highest-scoring signals are combined.
- Content: here, one tries to discriminate the protein-coding from the non-coding regions. Statistical models of nucleotide frequencies and dependencies in codons are used here.
- Homology: significant sequence similarity of a genomic DNA sequence to a known gene implies that it is likely to share its function. This information may be used in the gene prediction process.
Example (patterns)
- Strings: P = GCCACCTAGG
- Consensus sequences: substitutions occur at certain positions
- Regular expressions: describe the set of strings generated by a regular language
- Decision trees
- Position Weight Matrices
Consensus sequence codes: Y = pyrimidine (C or T), W = A or T, R = purine (A or G)
Regular expression (Prosite pattern):
P = G-[GN]-[SGA]-G-x-R-x-[SGA]-C-x(2)-[IV]
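Both notations above can be turned into ordinary regular expressions. The following is a minimal Python sketch under stated assumptions: the function names `consensus_to_regex` and `prosite_to_regex` are hypothetical, the consensus `TATAWAWR` and the test sequence are illustrative only and do not come from the slides, and the Prosite translation covers only the elements that occur in the pattern P (single residues, bracketed alternatives, `x`, and a repeat count such as `x(2)`).

```python
import re

# IUPAC codes from the slide (Y, W, R), plus the four bases and N for "any base".
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "Y": "[CT]", "W": "[AT]", "R": "[AG]", "N": "[ACGT]"}


def consensus_to_regex(consensus):
    """Translate an IUPAC consensus string into a regular expression."""
    return "".join(IUPAC[code] for code in consensus)


def prosite_to_regex(pattern):
    """Translate a simple Prosite pattern into a regular expression.

    Assumption: only 'A', '[ABC]', 'x' and an optional '(n)' repeat are handled.
    """
    parts = []
    for element in pattern.replace(" ", "").split("-"):
        m = re.fullmatch(r"(x|[A-Z]|\[[A-Z]+\])(?:\((\d+)\))?", element)
        if m is None:
            raise ValueError(f"unsupported element: {element}")
        residue, count = m.groups()
        residue = "." if residue == "x" else residue   # x matches any residue
        parts.append(residue + (f"{{{count}}}" if count else ""))
    return "".join(parts)


# Illustrative consensus (not from the slides), searched in a toy sequence.
dna_regex = consensus_to_regex("TATAWAWR")
hit = re.search(dna_regex, "GGCTATAAAAGGC")

# The Prosite pattern P from the slide.
protein_regex = prosite_to_regex("G-[GN]-[SGA]-G-x-R-x-[SGA]-C-x(2)-[IV]")
```

A plain string is the special case in which every position allows exactly one symbol; the consensus and Prosite forms relax individual positions, but still give a yes/no match rather than a score.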
Construction of PWM for splice sites (...GT.... or ...AG....)
1. align sequences for true splice sites
2. calculate relative frequencies
3. do the same for false donor sites
4. calculate the log-odds ratio of true to false frequencies:

   M_{i,j} = \ln\left( \frac{f^{\mathrm{true}}_{i,j}}{f^{\mathrm{false}}_{i,j}} \right)

5. the score of a given sequence is the sum of the matrix coefficients
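The five steps can be sketched in a few lines of Python. This is an illustration under assumptions, not the implementation used by any particular gene finder: the function names are hypothetical, the aligned sites are made-up toy data, and a pseudocount (not mentioned on the slide) is added so that no frequency is zero and the logarithm is always defined.

```python
import math

BASES = "ACGT"


def position_frequencies(sites, pseudocount=1.0):
    """Steps 1-2: relative base frequencies per column of pre-aligned sites."""
    freqs = []
    for i in range(len(sites[0])):
        counts = {b: pseudocount for b in BASES}   # assumption: add-one smoothing
        for site in sites:
            counts[site[i]] += 1
        total = sum(counts.values())
        freqs.append({b: counts[b] / total for b in BASES})
    return freqs


def build_pwm(true_sites, false_sites, pseudocount=1.0):
    """Steps 3-4: M[i][b] = ln(f_true[i][b] / f_false[i][b])."""
    f_true = position_frequencies(true_sites, pseudocount)
    f_false = position_frequencies(false_sites, pseudocount)
    return [{b: math.log(ft[b] / ff[b]) for b in BASES}
            for ft, ff in zip(f_true, f_false)]


def score(pwm, seq):
    """Step 5: sum the matrix coefficients selected by the sequence."""
    return sum(column[base] for column, base in zip(pwm, seq))


# Toy data (invented): every site has GT at positions 3-4, as a donor site would.
true_sites = ["AGGTAAGT", "AGGTGAGT", "CGGTAAGT"]
false_sites = ["AGGTCCCA", "TTGTACGA", "CAGTTTAC"]
pwm = build_pwm(true_sites, false_sites)
```

Because GT occurs in every true and every false site, those two columns contribute a log-odds of zero: all of the discriminating power comes from the flanking positions, which is why false sites are collected from sequences that also contain the dinucleotide.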
Coding Statistics
- there are 64 codons for the 20 standard amino acids (61 sense codons plus 3 stop codons)
- codon (triplet) frequencies differ between coding and non-coding regions
- compute codon usage from coding DNA → frequencies
- for a sequence, S, the higher the number of codons in S that occur frequently in coding sequences, the higher the probability that S is coding for a protein
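A codon-usage statistic of this kind can be sketched as follows. The sketch rests on several assumptions that are not in the slides: the function names are hypothetical, the training sequences are invented toy data, a pseudocount is added to every codon, and the non-coding background is taken to be uniform (each triplet has probability 1/64). Real gene finders estimate the background from non-coding DNA and usually score all three reading frames.

```python
import math
from itertools import product

CODONS = ["".join(triplet) for triplet in product("ACGT", repeat=3)]   # 64 codons


def split_codons(seq):
    """Cut a sequence into in-frame triplets, ignoring a trailing partial codon."""
    return [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]


def codon_frequencies(coding_seqs, pseudocount=1.0):
    """Estimate codon usage from known coding sequences (assumed to be in frame)."""
    counts = {codon: pseudocount for codon in CODONS}
    for seq in coding_seqs:
        for codon in split_codons(seq):
            counts[codon] += 1
    total = sum(counts.values())
    return {codon: counts[codon] / total for codon in CODONS}


def coding_score(seq, freqs):
    """Average log-odds per codon: coding usage versus a uniform 1/64 background."""
    codons = split_codons(seq)
    return sum(math.log(freqs[c] * 64) for c in codons) / len(codons)


# Invented training set; a real one would contain many annotated coding sequences.
training = ["ATGGCTGCTGAAGCTTAA", "ATGGAAGCTGCTGAATAA"]
freqs = codon_frequencies(training)
```

A positive score means the codons in S are, on average, more frequent in the training set than under the background, which is the slide's statement that frequent codons raise the probability that S is coding.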
Various Uses of Homology in Gene Prediction
- one can compare a genomic DNA sequence with a database of ESTs (using e.g. blastn)
- genomic DNA sequences can be compared to a database of protein sequences (using blastx) to identify coding regions
- comparison of predicted peptides with a protein sequence database can be used to assign putative functions
- the genome of one species can be compared to the genome of another, closely related species: conserved regions often correspond to conserved functions (e.g. exons, parts of promoters)