Finding Genes in the Rice Genome
Hao Bailin
T-Life Research Center, Fudan University
Beijing Genomics Institute, Academia Sinica
Institute of Theoretical Physics, Academia Sinica
(www.itp.ac.cn/~hao/)
On-going work by a team of 10-12 people since August 2001: Zheng Weimou, Xie Huimin, Liu Jinsong, Xu Zhao, Fang Lin, Li Heng, Gao Lei, Jin Jiao, et al. Nothing written yet.
The difference was described in Xu Shen’s Chinese dictionary Shuowen Jiezi (许慎《说文解字》) of the Eastern Han Dynasty (~2nd century AD).
J. H. Zhang et al., Rice cultivation of Jianhu Remains in Henan Province, Science J. (《科学》杂志), 53(4), 2002, 3 (in Chinese)
• “Ab initio” or “de novo” algorithms: GeneMark, GenScan, FgeneSH, Genie, …, based on gene-structure models and training data. (Our on-going project: BGF, the BGI Gene Finder)
• Homology methods based on sequence alignment with known genes in databases
• Mixed approach using both strategies: TwinScan
Different Stages of Gene-Finding
• Use all possible existing programs and services on the web with a public-domain or home-made genome viewer
• Write your own gene-finder, trained for the specific organism
• A dream for the time being: design a self-training and self-developing program “for any species” which would improve itself iteratively starting from a few available reads, cDNAs, and ESTs
Performance of Gene-Finders in Eukaryote Genomes
• M. Q. Zhang, Nature Reviews Genetics, 3 (2002) 698-710 (mostly for the human genome):
Nucleotide level: 80%; exon level: 45%; whole gene structure: 20%
• FgeneSH and BGF for rice (our tests on 128 cDNA-confirm
Each strand carries the same amount of information, but a different set of genes.
• The two strands are equivalent in information content.
• The two strands are not equivalent in gene content.
• Biological processing (duplication, transcription) goes from 5’ to 3’.
• Genes can be found on one strand at a time or on both strands at the same time: one-pass or two-pass programs.
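A minimal sketch of how a finder that reads only one strand could still cover both: run it on the sequence and on its reverse complement, then map the second set of hits back to the original coordinates. The callable `find_on_strand` and the toy `find_atg` are hypothetical names used for illustration; they are not part of BGF or any program named in these slides.

```python
# Complement table for the four bases plus N (unknown base).
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def scan_both_strands(seq, find_on_strand):
    """Run a single-strand finder twice and report hits per strand.

    find_on_strand is a hypothetical callable that returns 0-based,
    half-open (start, end) pairs in the coordinates of the strand it
    was given.
    """
    n = len(seq)
    hits = [("+", s, e) for s, e in find_on_strand(seq)]
    # Map minus-strand coordinates back onto the plus strand.
    hits += [("-", n - e, n - s)
             for s, e in find_on_strand(reverse_complement(seq))]
    return hits


def find_atg(s):
    """Toy finder (illustration only): positions of the start codon ATG."""
    return [(i, i + 3) for i in range(len(s) - 2) if s[i:i + 3] == "ATG"]


print(scan_both_strands("ATGAAACAT", find_atg))
```

In the toy sequence the minus-strand ATG shows up as CAT at plus-strand positions 6 to 9, which is what the coordinate mapping reports.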
[Figure: from gene to protein. Genomic DNA (5’ to 3’; 5’-UTR, start, stop, 3’-UTR) → transcribe (RNA Pol II + …) → pre-mRNA → splice (spliceosome: U1, U2, U4, U5, U6 RNPs) → mRNA → translate (ribosome; initiation and elongation factors; termination) → AA seq (protein primary seq) → fold (chaperonin) → protein fold]
Three Scales of Search
• Local: signals with minimal signature (start, stop, splicing); movable signals (caps, promoters, polyAs, branching points, some very weak) --- clustering, discriminant analysis, various statistical models
• Hidden Markov Model: geometric distribution of intron lengths
• Semi-Hidden Markov Model: needs sequence-generating models and length probability for each node
• Language theory approach
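The remark about geometric intron lengths can be made concrete with a short sketch. In an ordinary HMM a state is left with probability 1 - p at every step, so the time spent in it is geometrically distributed; a semi-HMM replaces this with an explicit length distribution per state. The function names and the toy length table below are illustrative assumptions, not values taken from any trained gene finder.

```python
def geometric_length_pmf(p_stay: float, d: int) -> float:
    """P(length = d) for an HMM state with self-transition probability p_stay."""
    if d < 1:
        return 0.0
    return (1.0 - p_stay) * p_stay ** (d - 1)


def mean_geometric_length(p_stay: float) -> float:
    """Expected number of steps spent in the state."""
    return 1.0 / (1.0 - p_stay)


# Geometric lengths always peak at d = 1 and decay from there.
pmf = [geometric_length_pmf(0.5, d) for d in range(1, 6)]
print(pmf)

# A semi-HMM state instead carries its own length table, which may peak
# anywhere (toy numbers, for illustration only).
explicit_length_pmf = {1: 0.05, 2: 0.15, 3: 0.60, 4: 0.15, 5: 0.05}
print(max(explicit_length_pmf, key=explicit_length_pmf.get))
```

The design trade-off is that the explicit table needs a length probability for every node, as the slide notes, while the geometric form needs only one parameter per state.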
Flow Chart of GenScan
Chris Burge (1996): a 27-state semi-HMM
A simpler model: 19-state
A model taking UTR introns into account: 35-state
Figure: N, intergenic region; P, promoter; F, 5’UTR; $E_{\mathrm{sngl}}$, single-exon gene; $E_{\mathrm{init}}$, initial exon; $E_k$ $(0 \le k \le 2)$, phase $k$ internal exon; $E_{\mathrm{term}}$, terminal exon; T, 3’UTR; A, polyadenylation signal; and $I_k$ $(0 \le k \le 2)$, phase $k$ intron.
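The state counts can be checked by enumerating the states named in the figure legend: one intergenic state plus one copy of every other state for each strand. The 27-state total follows directly. The reduced list is an assumption made here for illustration: dropping P, F, T and A happens to give 19 states, but the slides do not say which states the simpler model omits, and the composition of the 35-state model is not reconstructed.

```python
# States that exist once per strand, named after the figure legend.
PER_STRAND = ["P", "F", "Esngl", "Einit", "E0", "E1", "E2",
              "Eterm", "T", "A", "I0", "I1", "I2"]


def build_states(per_strand):
    """Intergenic state N plus a forward and a reverse copy of each state."""
    states = ["N"]
    for strand in "+-":
        states += [f"{name}{strand}" for name in per_strand]
    return states


full = build_states(PER_STRAND)

# ASSUMPTION (not stated in the source): the simpler model drops the
# promoter, UTR and polyadenylation states.
reduced = build_states([s for s in PER_STRAND if s not in {"P", "F", "T", "A"}])

print(len(full), len(reduced))
```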
Problems: Minor and Major
• Ambiguity symbols (N, W, S, R, …)
• (1-p) at flanking D-type nodes
• Indels and frame-shifts
• Gradient effects in gene structure
• Introns in 5’-UTRs and 3’-UTRs: leading to 35-state Markov Models
• Alternative splicing and sub-optimal paths
• Limit of probabilistic models
• Deterministic approaches
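For the first item in the list, ambiguity symbols, the IUPAC nucleotide codes map each symbol to the set of bases it may stand for. The sketch below shows one possible treatment, assumed here for illustration: score an ambiguous symbol by averaging a model's base probabilities over the compatible bases. The function `expected_emission` is a hypothetical helper, not how BGF or GenScan handle such symbols.

```python
# IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def expected_emission(symbol: str, base_probs: dict) -> float:
    """Average emission probability over the bases the symbol may represent.

    Assumes each compatible base is equally likely (illustrative choice).
    """
    bases = IUPAC[symbol.upper()]
    return sum(base_probs[b] for b in bases) / len(bases)


probs = {"A": 0.1, "C": 0.2, "G": 0.3, "T": 0.4}
for sym in "ARWSN":
    print(sym, expected_emission(sym, probs))
```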
Dyck language: A language of nested parentheses
• Many types of parentheses
• Finite depth of nesting
• Context-free language
Our case:
• Only 3 types of parentheses
• Shallow nesting
• Conjecture: may be regular language
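A short sketch of why a depth bound matters. With unlimited nesting, recognizing a Dyck language needs a stack; once the depth is capped, the stack can hold only finitely many configurations, so a finite automaton is enough and the bounded language is regular. The three bracket pairs below are placeholders; the slides do not say which gene features play the role of the three parenthesis types.

```python
# Three placeholder parenthesis types (closing -> opening).
PAIRS = {")": "(", "]": "[", "}": "{"}
OPENERS = set(PAIRS.values())


def is_bounded_dyck(word: str, max_depth: int) -> bool:
    """Accept well-nested words whose nesting depth never exceeds max_depth."""
    stack = []
    for ch in word:
        if ch in OPENERS:
            if len(stack) == max_depth:
                return False          # nesting too deep
            stack.append(ch)
        elif ch in PAIRS:
            if not stack or stack.pop() != PAIRS[ch]:
                return False          # unmatched or mismatched closer
        else:
            return False              # symbol outside the alphabet
    return not stack                  # everything opened was closed


def count_stack_configurations(n_types: int, max_depth: int) -> int:
    """Number of possible stack contents, i.e. automaton states needed
    besides a single reject state."""
    return sum(n_types ** k for k in range(max_depth + 1))


print(is_bounded_dyck("([]){}", 2), is_bounded_dyck("([{}])", 2))
print(count_stack_configurations(3, 2))
```

With 3 types and depth 2 there are 1 + 3 + 9 = 13 stack configurations, so the recognizer is a small finite automaton.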
Two Test Datasets for Rice Gene-Finders
• The 28469 japonica full-length cDNAs (Kikuchi et al., Science 301 (18 July 2003))
• Select a high-quality subset without overlaps with publicly available cDNAs
• A single-gene set: 500 sequences with one gene in each
• A multi-gene set: 46 sequences with 199 genes in total (at least 4 genes in a sequence)
Assessment of Gene-Finders
Test done between 22 July and 2 August 2003
• FgeneSH (trained on monocotyledons)
• GeneMark.hmm
• RiceHMM
• GlimmerR
• GenScan (trained on maize)
• BGF
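Assessments of this kind are commonly reported as sensitivity and specificity at the nucleotide and exon levels. The sketch below assumes those conventional definitions (an exon counts as correct only if both boundaries match; specificity is the fraction of predictions that are correct). The slides do not state the exact measures used in this test, and the coordinates are made up for illustration.

```python
def exon_level_accuracy(annotated, predicted):
    """Sensitivity and specificity over exons with exactly matching boundaries."""
    annotated, predicted = set(annotated), set(predicted)
    correct = len(annotated & predicted)
    sn = correct / len(annotated) if annotated else 0.0
    sp = correct / len(predicted) if predicted else 0.0
    return sn, sp


def covered_positions(exons):
    """Set of base positions covered by half-open (start, end) exons."""
    return {pos for start, end in exons for pos in range(start, end)}


def nucleotide_level_accuracy(annotated, predicted):
    """Sensitivity and specificity over individual coding bases."""
    true_bases = covered_positions(annotated)
    pred_bases = covered_positions(predicted)
    tp = len(true_bases & pred_bases)
    sn = tp / len(true_bases) if true_bases else 0.0
    sp = tp / len(pred_bases) if pred_bases else 0.0
    return sn, sp


# Made-up coordinates: one exact exon, one with a shifted end, one spurious.
annotated = [(100, 200), (300, 400), (500, 600)]
predicted = [(100, 200), (300, 410), (700, 800)]
print(exon_level_accuracy(annotated, predicted))
print(nucleotide_level_accuracy(annotated, predicted))
```

The example shows why exon-level figures fall below nucleotide-level ones: a prediction that misses a boundary by a few bases still earns most of its nucleotide credit but no exon credit.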
Our Ultimate Goal
• An iterative, self-training, self-improving gene-finder “for any species”, starting from a small number of reads with or without EST or cDNA support
• Annotation and re-annotation of the rice genomes
• Plant comparative genomics, especially that of Gramineae and Crucifers
tRNA features
• tRNA gene → pre-tRNA → mature tRNA
• Mature tRNA: 75 – 95 bases
• Cloverleaf-like structure
• Five arms: acceptor arm, D arm, anticodon arm, V loop (extra arm), TΨC arm
How many tRNA genes are present in an organism?
• Codon → tRNA → amino acid
• 61 coding codons
• 20 amino acids
• Are there 61 species of tRNA with all possible anticodons ?
• Met (M) has one codon but two tRNAs
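The counts in this list can be reproduced from the standard genetic code. The sketch below builds the 64-codon table in the usual TCAG order and counts sense codons, amino acids, and the codons for Met; the variable names are illustrative.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order; '*' marks a stop.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO)}

sense_codons = {c: aa for c, aa in CODON_TABLE.items() if aa != "*"}
stop_codons = sorted(c for c, aa in CODON_TABLE.items() if aa == "*")
met_codons = [c for c, aa in sense_codons.items() if aa == "M"]

print(len(sense_codons), len(set(sense_codons.values())))
print(stop_codons, met_codons)
```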
Wobble hypothesis (Crick, 1966)
• Many tRNAs recognize more than one codon
• Through non-Watson-Crick base pairings
• Fewer than 61 tRNAs are needed
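A minimal sketch of Crick's original wobble rules, showing how one anticodon can read several codons. The first (5’) anticodon base pairs with the third codon base, and only that position is allowed to wobble; I denotes inosine. The function name is illustrative, and the modified rules of Guthrie & Abelson are not modelled here.

```python
# Watson-Crick pairs for codon positions 1 and 2 (RNA alphabet).
WATSON_CRICK = {"A": "U", "U": "A", "G": "C", "C": "G"}

# Crick (1966): anticodon 5' base -> codon 3' bases it can pair with.
WOBBLE = {"C": "G", "A": "U", "U": "AG", "G": "CU", "I": "UCA"}


def codons_read(anticodon: str):
    """Codons (5'->3') recognized by an anticodon written 5'->3'."""
    first, second, third = anticodon
    # Pairing is antiparallel: anticodon base 3 pairs with codon base 1.
    stem = WATSON_CRICK[third] + WATSON_CRICK[second]
    return [stem + base for base in WOBBLE[first]]


print(codons_read("GAA"))  # Phe anticodon
print(codons_read("IGC"))  # Ala anticodon with inosine
print(codons_read("CAU"))  # Met anticodon
```

Because G, U and I at the wobble position each cover two or three codons, a full set of 61 distinct anticodons is unnecessary.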
The Modified Wobble Hypothesis (Guthrie & Abelson 1982)
• In eukaryotes, 46 different tRNA species would be enough.
• The modified wobble hypothesis holds almost perfectly in H. sapiens, S. cerevisiae, A. thaliana, and C. elegans, whose complete sets of tRNAs are now known.
aa | codon | A | C | H | anti (column headings, repeated across four side-by-side blocks)