Post on 18-Dec-2015
Computational Molecular Biology, MPI for Molecular Genetics
DNA sequence analysis: Gene prediction
Gene prediction methods
Gene indices
Mapping cDNA on genomic DNA
Applications
DNA sequence analysis: Gene prediction
[Figure: gene structure: promoter, 5' UTR, exon 1, exon 2, ..., exon n-1, exon n, 3' UTR; the exons carry the protein-coding sequence]
Gene prediction: Strategies for detecting ORFs / exons
Distribution of Stop-codons
Codon usage
Hexamer frequencies
Prediction of the coding frame
Splice site recognition (eukaryotes only)
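The first strategy in this list, the distribution of stop codons, can be illustrated with a small sketch. It scans the three forward reading frames and reports stretches that contain no stop codon. The function name, the `min_codons` parameter and the restriction to the forward strand are assumptions made for illustration, not part of the lecture.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def stop_free_stretches(seq, min_codons=3):
    """Return (frame, start, end) for stop-free stretches on the forward strand.

    frame is 0, 1 or 2; start and end are 0-based, end is exclusive.
    """
    seq = seq.upper()
    stretches = []
    for frame in range(3):
        run_start = frame
        for pos in range(frame, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                # A stop codon closes the current open stretch.
                if (pos - run_start) // 3 >= min_codons:
                    stretches.append((frame, run_start, pos))
                run_start = pos + 3
        # Close the last stretch at the final complete codon of this frame.
        end = frame + ((len(seq) - frame) // 3) * 3
        if (end - run_start) // 3 >= min_codons:
            stretches.append((frame, run_start, end))
    return stretches
```

Long stop-free stretches are only candidates for coding regions; the other criteria on this slide are needed to decide which frame, if any, is actually coding.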
Gene prediction: Codon usage (single exon)
[Figure: plot for reading frames 1, 2 and 3; legend: coding, non-coding]
Gene prediction: Codon usage (single exon)
[Figure: plot for reading frames 1, 2 and 3 (legend: coding, non-coding), with the correct start and the coding sequence marked]
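How a codon usage statistic separates the coding frame from the other two frames can be sketched as follows. This is a minimal illustration, assuming a log-odds score against a uniform background of 1/64 per codon and add-one smoothing; the function names and the scoring scheme are assumptions and not necessarily what produced the plots on these slides.

```python
import math
from collections import Counter

def codon_frequencies(coding_seqs):
    """Estimate codon frequencies from known coding sequences (read in frame)."""
    counts = Counter()
    for s in coding_seqs:
        for i in range(0, len(s) - 2, 3):
            counts[s[i:i + 3]] += 1
    total = sum(counts.values())
    # Add-one smoothing over the 64 possible codons (an assumption of this sketch).
    return lambda codon: (counts[codon] + 1) / (total + 64)

def frame_scores(seq, freq):
    """Average log-odds score per codon for each of the three forward frames."""
    background = 1 / 64
    scores = []
    for frame in range(3):
        codons = [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
        score = sum(math.log(freq(c) / background) for c in codons)
        scores.append(score / max(len(codons), 1))
    return scores
```

The frame with the highest score is the most plausible coding frame; in practice the score would be computed in a sliding window along the sequence to obtain one curve per frame.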
Gene prediction: Codon usage (multiple exons)
[Figure: plot for reading frames 1, 2 and 3 (legend: coding, non-coding), with splice sites marked]
Exons: 208..295, 1029..1349, 1500..1688, 2686..2934, 3326..3444, 3573..3680, 4135..4309, 4708..4846, 4993..5096, 7301..7389, 7860..8013, 8124..8405, 8553..8713, 9089..9225, 13841..14244
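An exon list of this kind can be turned into a spliced sequence, and the intron ends can be checked for the canonical GT...AG dinucleotides. The sketch below assumes 1-based inclusive coordinates as in the list above; the function names are hypothetical, and the genomic sequence behind the slide is not available here, so the coordinates are only stored, not applied.

```python
# Exon coordinates from the slide (1-based, inclusive).
SLIDE_EXONS = [
    (208, 295), (1029, 1349), (1500, 1688), (2686, 2934), (3326, 3444),
    (3573, 3680), (4135, 4309), (4708, 4846), (4993, 5096), (7301, 7389),
    (7860, 8013), (8124, 8405), (8553, 8713), (9089, 9225), (13841, 14244),
]

def splice(genomic, exons):
    """Join the exons, given as 1-based inclusive (start, end) pairs."""
    return "".join(genomic[start - 1:end] for start, end in exons)

def intron_boundaries(genomic, exons):
    """Return the (donor, acceptor) dinucleotides of every intron."""
    result = []
    for (_, end1), (start2, _) in zip(exons, exons[1:]):
        intron = genomic[end1:start2 - 1]  # bases between the two exons
        result.append((intron[:2], intron[-2:]))
    return result
```

For a canonical intron `intron_boundaries` yields `("GT", "AG")`; other values point to a non-canonical splice site or to slightly misplaced exon boundaries.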
Gene prediction: Additional criteria
Detection of start codons
Detection of potential promoter elements
Detection of repetitive sequences (mostly untranslated)
Homology to known genes of related organisms
Gene prediction: Software
GENSCAN (C. Burge & S. Karlin)
GRAIL (neural network; Uberbacher et al.)
MZEF (M. Zhang, 1997)
FGeneH, Hexon (V. Solovyev et al., 1994)
Genie, etc.
All of these programs use dynamic programming to find the optimal solution
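The role of dynamic programming can be illustrated with a generic example: given scored candidate exons, find the highest-scoring set of non-overlapping candidates. This is a textbook weighted-interval recursion, shown only to convey the idea; it is not the algorithm of any program listed above, and the candidate format and function name are assumptions.

```python
from bisect import bisect_right

def best_exon_chain(candidates):
    """Pick the best-scoring chain of non-overlapping candidate exons.

    candidates: list of (start, end, score) with integer, inclusive coordinates.
    Returns (total_score, chain).
    """
    cands = sorted(candidates, key=lambda c: c[1])
    ends = [c[1] for c in cands]
    # best[i] holds the optimal (score, chain) using only the first i candidates.
    best = [(0.0, [])]
    for i, (start, end, score) in enumerate(cands):
        # j = number of earlier candidates that end before this one starts.
        j = bisect_right(ends, start - 1, 0, i)
        take_score = best[j][0] + score
        if take_score > best[i][0]:
            best.append((take_score, best[j][1] + [(start, end, score)]))
        else:
            best.append(best[i])
    return best[-1]
```

Each candidate is either taken (and combined with the best solution that ends before it) or skipped, so the optimum over all combinations is found without enumerating them.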
DNA sequences in public databases
Human: ~4 million ESTs + 130 000 RNAs
Mouse: ~2.7 million ESTs + 30 000 RNAs
Expressed sequence tags (EST)
[Figure: mRNA with poly(A) tail (AAAAAA...) and annealed oligo(dT) primer (TTTTTT...); cDNA]
cDNA synthesis is usually primed with oligo(dT) or with random primers
Reverse transcriptase stops 'randomly'
Several cDNAs may be generated for the same mRNA
Expressed sequence tags (EST)
[Figure: clone (= mRNA fragment, average 1500 bp) in a vector of known sequence; the deciphered sequence (EST), read from the 3' primer, covers < 700 bp]
Expressed sequence tags (EST)
Isolation of mRNAs from tissue(s)
Generation of cDNAs reflecting parts of the RNAs
Cloning of cDNAs into a vector (often random orientation)
End sequencing of the clones
Generation of ESTs: basecalling problems
[Figure: basecalling examples close to the 3' end of an EST and close to the 5' end of an EST; annotation: missing bases]
Coverage of an mRNA by ESTs
[Figure: putative mRNA (5' UTR, exon 1, exon 2, 3' UTR, poly(A) tail AAAAAA...) covered by expressed sequence tags (ESTs)]
Characteristics of ESTs
Highly redundant
Low sequence quality
(Cheap)
Reflect expressed genes
May be tissue/stage specific
Gene indices
UniGene (NCBI)
TIGR Gene Indices
STACK (SANBI)
GeneNest (DKFZ,MPI)
Clustering of EST and mRNA sequences of an organism to reduce redundancy in the sequence data.
Goal: Each cluster represents one gene or mRNA
Gene indices: GeneNest workflow
Input: EMBL database, UniGene database
Quality clipping (for each input)
BLAST/QUASAR search, clustering
Assembly, consensus sequences
Visualization
Gene indices: Quality clipping
In order to cluster based on gene-specific sequence data, the following steps have to be performed:
Removal of vector sequence
Masking of repetitive sequences (e.g. Alu)
Removal of terminal sequences of low quality
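The three clipping steps can be sketched in a few lines. This is a deliberately simplified illustration: vector and repeats are found by exact string matching here, whereas production pipelines use alignment-based tools for both. The function name, the parameters and the quality threshold of 20 are assumptions.

```python
def quality_clip(seq, quals, vector_prefix="", repeats=(), min_q=20):
    """Apply vector removal, repeat masking and low-quality end trimming.

    seq: read sequence; quals: per-base quality values of the same length.
    """
    # 1. Remove vector sequence (exact prefix match only, a simplification).
    if vector_prefix and seq.startswith(vector_prefix):
        seq = seq[len(vector_prefix):]
        quals = quals[len(vector_prefix):]
    # 2. Mask repetitive sequences with N so they cannot seed cluster joins.
    for rep in repeats:
        seq = seq.replace(rep, "N" * len(rep))
    # 3. Trim terminal bases whose quality is below the threshold.
    start, end = 0, len(seq)
    while start < end and quals[start] < min_q:
        start += 1
    while end > start and quals[end - 1] < min_q:
        end -= 1
    return seq[start:end]
```

Masking rather than deleting the repeat keeps the coordinates of the remaining sequence intact, which matters for the later assembly step.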
Gene indices: Clustering
Sequences are usually clustered if the matching part between two sequences fulfills several (empirical) criteria:
Minimal % identity (e.g. > 95%)
Minimal length of match (e.g. > 40 bp)
No internal matches (TIGR Gene Indices)
Same origin of tissue (only STACK)
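Given pairwise matches from a similarity search, clustering by the first two criteria amounts to single-linkage grouping, which a union-find structure handles compactly. The sketch assumes matches are supplied as `(id_a, id_b, percent_identity, match_length)` tuples; this input format and the function name are hypothetical, and the thresholds are the example values from the slide.

```python
def cluster_sequences(seq_ids, matches, min_identity=95.0, min_length=40):
    """Single-linkage clustering of sequences from pairwise matches."""
    parent = {s: s for s in seq_ids}

    def find(x):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, identity, length in matches:
        if identity > min_identity and length > min_length:
            parent[find(a)] = find(b)  # merge the two clusters

    clusters = {}
    for s in seq_ids:
        clusters.setdefault(find(s), []).append(s)
    return sorted(sorted(members) for members in clusters.values())
```

Single linkage means one qualifying match is enough to join two clusters, which is why masking repeats and removing vector beforehand is essential: a shared Alu element would otherwise chain unrelated genes together.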
Gene indices: Assembly
Sequences in a cluster are assembled to group those sequences which are globally similar, resulting in:
Contigs, reflecting parts of different transcripts
One consensus sequence per contig
A relative order of the sequences (alignment)
Gene indices: Consensus sequences
Reduced error rate
Consensus often longer than any single sequence contributing
Efficient database search
Detection of exon/intron boundaries and alternative splice variants
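Why a consensus has a reduced error rate and can be longer than any contributing sequence is easiest to see from a column-wise majority vote over an alignment. The sketch assumes the reads are already aligned into equal-length strings, with `-` for a gap and a space for positions a read does not cover; this representation and the function name are assumptions, and real assemblers also weigh base qualities.

```python
from collections import Counter

def majority_consensus(aligned_reads):
    """Column-wise majority vote over pre-aligned reads of equal length."""
    length = len(aligned_reads[0])
    consensus = []
    for col in range(length):
        # Only reads that cover this column take part in the vote.
        counts = Counter(r[col] for r in aligned_reads if r[col] != " ")
        if not counts:
            continue
        base, _ = counts.most_common(1)[0]
        if base != "-":  # a winning gap contributes no base
            consensus.append(base)
    return "".join(consensus)
```

A single-read error is outvoted wherever at least two other reads cover the same column, and overlapping reads extend the consensus beyond the length of each individual read.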
Gene indices: Alignment software
Phrap (Phil Green)
CAP3 (X. Huang)
TIGR assembler
GAP4 (R. Staden)
GeneNest visualization (http://genenest.molgen.mpg.de)
TIGR Gene Indices (http://www.tigr.org/)
Alignment scheme
UniGene (http://www.ncbi.nih.nlm.gov/UniGene)
Mapping of consensus sequences on genomic DNA
[Figure: consensus sequence (mRNA) aligned to the genomic sequence; labels: exons, missing intron]
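Mapping an mRNA-derived consensus onto genomic DNA means finding consecutive genomic blocks (exons) that together spell the consensus, with introns in between. The following greedy sketch uses exact matching only, so it illustrates the principle but tolerates no sequencing errors; real spliced aligners allow mismatches and refine the boundaries using splice-site signals. The function name and the `min_block` parameter are assumptions.

```python
def map_exact_blocks(mrna, genomic, min_block=8):
    """Greedily map an mRNA onto genomic DNA as a series of exact blocks.

    Returns a list of genomic (start, end) intervals, 0-based, end exclusive.
    Raises ValueError if some part of the mRNA cannot be placed.
    """
    blocks = []
    m_pos, g_pos = 0, 0
    while m_pos < len(mrna):
        placed = None
        # Try the longest remaining piece first, then shorten it.
        for end in range(len(mrna), m_pos + min_block - 1, -1):
            hit = genomic.find(mrna[m_pos:end], g_pos)
            if hit != -1:
                placed = (hit, end)
                break
        if placed is None:
            raise ValueError(f"no exact block found at mRNA position {m_pos}")
        hit, end = placed
        blocks.append((hit, hit + (end - m_pos)))
        g_pos = hit + (end - m_pos)  # later blocks must lie downstream
        m_pos = end
    return blocks
```

The gaps between consecutive blocks are the inferred introns; a consensus region that maps as one block across an annotated intron would indicate an unspliced or intron-retaining transcript.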
Gene indices: Applications
Detection of exon/intron boundaries
Detection of alternative splicing
Detection of Single Nucleotide Polymorphisms
Genome annotation
Analysis of gene expression
Genome-genome comparison
Alternative splicing
[Figure: two mRNAs derived from one hnRNA]
hnRNA: 5' UTR, exon 1, exon 2, exon 3
mRNA 1: 5' UTR, exon 1, exon 3
mRNA 2: 5' UTR, exon 1, exon 2
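The difference between the two mRNAs in this figure can be expressed as a comparison of exon chains. The sketch below distinguishes exons skipped internally from exons missing at an end of the transcript; the function name and the use of exon labels as plain strings are assumptions for illustration.

```python
def classify_missing_exons(gene_exons, transcript_exons):
    """Compare a transcript's exon chain with the ordered exons of the gene.

    Returns a dict with internally skipped exons and exons missing at the ends.
    """
    used = set(transcript_exons)
    present = [i for i, exon in enumerate(gene_exons) if exon in used]
    if not present:
        return {"skipped": [], "terminal": list(gene_exons)}
    first, last = present[0], present[-1]
    skipped = [exon for i, exon in enumerate(gene_exons)
               if first < i < last and exon not in used]
    terminal = [exon for i, exon in enumerate(gene_exons)
                if i < first or i > last]
    return {"skipped": skipped, "terminal": terminal}
```

For the figure, mRNA 1 (exon 1, exon 3) has exon 2 skipped, while mRNA 2 (exon 1, exon 2) lacks exon 3 at its end, which with EST data may also simply reflect incomplete coverage.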
Alignment of EST consensus sequences and genomic target
[Figure label: genomic sequence]
Detection of the appropriate genomic target sequence
Local similarity of EST consensus and genomic DNA: > 96% identity
[Figure label: genomic sequence]
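The identity criterion can be made concrete with a small helper that computes percent identity from a gapped pairwise alignment and applies the 96% threshold. How the local alignment itself is obtained is outside this sketch; the function names and the convention of counting gap columns as mismatches are assumptions.

```python
def percent_identity(aln_a, aln_b):
    """Percent identity of two aligned strings of equal length ('-' = gap)."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have the same length")
    # Ignore columns where both sequences have a gap.
    columns = [(a, b) for a, b in zip(aln_a, aln_b) if not (a == "-" and b == "-")]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)

def is_genomic_target(aln_consensus, aln_genomic, threshold=96.0):
    """True if the local alignment exceeds the identity threshold."""
    return percent_identity(aln_consensus, aln_genomic) > threshold
```

A threshold this high is meant to accept the gene's own locus despite sequencing errors while rejecting paralogs and pseudogenes, which typically align with lower identity.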
Cutting out genomic target sequence
[Figure label: genomic sequence]
Alternative splicing (mapping on genomic DNA)
[Figure: consensus sequences (mRNA) aligned to the genomic sequence; labels: exons, splice variant]
SpliceNest (http://SpliceNest.molgen.mpg.de)
[Figure: SpliceNest view; labels: genomic sequence, putative exons, aligned GeneNest consensus, alternative exon]
Alternative splicing (additional exon)
Splice variants of adenylosuccinate lyase
[Figure labels: skipped exon; gene prediction errors?; unspliced?]
Alternative Splicing
Splice variants of APECED gene
[Figure labels: number of sequences, genomic sequence, alternative variants]
Alternative Splicing (alternative donor site)
Single Nucleotide Polymorphisms (SNP)
SNPs are single-base differences within one species
Several million SNPs have been detected in human
SNPs may be related to diseases
Single Nucleotide Polymorphisms (SNP)
SNP or basecalling error?
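One simple way to separate candidate SNPs from basecalling errors is to require that each allele is seen in several independent reads, since a random error is unlikely to recur at the same column. The sketch assumes pre-aligned reads of equal length; the function name and the `min_support` threshold of 2 are assumptions, and real pipelines additionally use base quality values.

```python
from collections import Counter

def snp_candidates(aligned_reads, min_support=2):
    """Return (column, alleles) where >= 2 bases each have enough read support."""
    length = len(aligned_reads[0])
    candidates = []
    for col in range(length):
        # Count only real bases; gaps and uncovered positions are ignored.
        counts = Counter(r[col] for r in aligned_reads if r[col] in "ACGT")
        supported = sorted(b for b, n in counts.items() if n >= min_support)
        if len(supported) >= 2:
            candidates.append((col, supported))
    return candidates
```

A difference seen in only one read is treated as a likely basecalling error; a difference supported by two or more reads remains a candidate that still needs experimental confirmation.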
Genome Annotation / Ensembl (http://www.ensembl.org)
Analysis of gene expression: tissue specificity
Counting the frequency of ESTs derived from a specific tissue within one sequence cluster
Searching for clusters/contigs which are tissue specific (e.g. tumor)
Searching for alternative splice variants which are potentially tissue specific
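Counting ESTs per tissue within a cluster, and then filtering for clusters dominated by one tissue, can be sketched as below. The input format (a mapping from cluster ID to the tissue labels of its ESTs), the function names and both thresholds are assumptions; raw counts also ignore differences in library size, which a careful analysis would have to normalize for.

```python
from collections import Counter

def tissue_profile(est_tissues):
    """Fraction of a cluster's ESTs that derive from each tissue."""
    counts = Counter(est_tissues)
    total = sum(counts.values())
    return {tissue: n / total for tissue, n in counts.items()}

def tissue_specific_clusters(clusters, tissue, min_fraction=0.8, min_ests=5):
    """IDs of clusters whose ESTs come predominantly from one tissue."""
    hits = []
    for cluster_id, tissues in clusters.items():
        # Very small clusters carry too little evidence to call specificity.
        if len(tissues) < min_ests:
            continue
        if tissues.count(tissue) / len(tissues) >= min_fraction:
            hits.append(cluster_id)
    return sorted(hits)
```

The minimum cluster size guards against calling a gene tissue specific on the basis of two or three ESTs that happen to come from the same library.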
Analysis of gene expression: PDZ-domain containing protein PDZK1 (Hs.15456)
[Figure labels: liver tumor, kidney]
Analysis of gene expression: small muscular protein, SMPX (Hs.88492)
[Figure labels: heart, muscle]
Analysis of gene expression: hypothetical protein (Hs.32343)
[Figure labels: thyroid tumor, heart, ovary]
Analysis of gene expression: non-redundant gene set
Selection of 'optimal' clones
Generation of gene-specific PCR products
Analysis of gene expression: 'optimal' clones
clone availability
type of clone library
length of the clone
relative position to the consensus sequence
homology to other genes
existence of repetitive elements
Analysis of gene expression: gene-specific PCR products
[Figure: putative gene consensus sequence (exon A, exon B, exon C) with a repetitive sequence and a region of similarity to another gene marked; the two remaining stretches are potential gene-specific fragments]
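Selecting potential gene-specific fragments amounts to taking the consensus sequence and subtracting all regions that are repetitive or similar to another gene. The sketch below does this on coordinate intervals; the function name, the half-open interval convention and the minimum fragment length of 100 bp are assumptions for illustration.

```python
def gene_specific_fragments(length, excluded, min_len=100):
    """Return the stretches of a consensus not covered by excluded regions.

    length: consensus length; excluded: (start, end) intervals, 0-based,
    end exclusive, e.g. repeats or regions similar to other genes.
    """
    fragments = []
    pos = 0  # start of the current uncovered stretch
    for start, end in sorted(excluded):
        if start - pos >= min_len:
            fragments.append((pos, start))
        pos = max(pos, end)  # also handles overlapping excluded regions
    if length - pos >= min_len:
        fragments.append((pos, length))
    return fragments
```

PCR primers would then be designed inside one of the returned intervals, so that the product hybridizes to this gene only.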