Analyses of ORFans in microbial and viral genomes
Journal club presentation on Mar. 14
Albert Yu
Jan 12, 2016
ORFan
Definition: an ORF with no detectable sequence similarity to other ORFs in the database considered
Nearly all genomes have ORFans (in differing percentages)
The more genomes are sequenced, the more ORFans are found
Most are annotated as hypothetical proteins of unknown function (no experimental evidence)
ORFan (continued)
More data…
real, functional proteins
3D structure
conserved in closely related species (Ka/Ks)
Origin of ORFans?
Viral genome → Microbial genome?
Viral laterally transferred genes (especially phages)
Viral genome → Microbial genome
Question: the origin of ORFans
Test hypothesis: ORFans have been acquired through lateral gene transfer from viruses
Approach: search for homologs of these microbial ORFans in the viral sequence database
Genome-wide quantitative study
• BLASTP
• 277 microbial genomes
• 1456 viral genomes
• H(g): the number of genomes having at least one homolog of ORFan g
• U(g): uniqueness: the genomic distance between the genomes with ORFan g
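The H(g) count defined above can be sketched as follows. This is a minimal illustration, not the presenter's pipeline: the input format (pairs of query and subject ORF identifiers that already passed a BLASTP significance cutoff), the `genome_of` mapping, and the function name `h_values` are all assumptions. Each ORF is counted as a hit to itself, so an ORF with no other homolog gets H = 1 and one hit, matching the classification slide. U(g) is not computed here because the slides do not say how the genomic distance is measured.

```python
from collections import defaultdict


def h_values(hits, genome_of):
    """Return {orf: (H, n_hits)} from significant BLASTP hit pairs.

    hits      -- iterable of (query_orf, subject_orf) pairs (assumed format)
    genome_of -- dict mapping every ORF id to its genome id (assumed format)
    """
    genomes = defaultdict(set)   # genomes containing a homolog of each ORF
    partners = defaultdict(set)  # distinct ORFs hit by each ORF

    # Every ORF trivially matches itself in its own genome.
    for orf, genome in genome_of.items():
        genomes[orf].add(genome)
        partners[orf].add(orf)

    for query, subject in hits:
        genomes[query].add(genome_of[subject])
        partners[query].add(subject)

    return {orf: (len(genomes[orf]), len(partners[orf])) for orf in genome_of}


if __name__ == "__main__":
    genome_of = {"a1": "A", "a2": "A", "b1": "B", "c1": "C", "d1": "D"}
    hits = [("a1", "a2"), ("a2", "a1"), ("b1", "c1"), ("c1", "b1")]
    for orf, (h, n) in sorted(h_values(hits, genome_of).items()):
        print(orf, "H =", h, "hits =", n)
```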
Classification of ORFans
• Singleton: no homolog anywhere
H = 1, BLASTP = 1
• Paralogous: homologs only in the same genome
H = 1, BLASTP > 1
• Orthologous: homologs only in very closely related microbial genomes
H > 1, U ≤ 0.1 (threshold chosen from observation)
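The three classes above translate directly into a small decision rule. The sketch below is only an illustration: the function name `classify_orf` and the "non-ORFan" label for everything else are assumptions, and `n_hits` is taken to be the number of BLASTP hits including the self-hit.

```python
def classify_orf(h, n_hits, u=None, u_max=0.1):
    """Classify an ORF from its H value, BLASTP hit count, and U value.

    h      -- number of genomes with at least one homolog (own genome included)
    n_hits -- number of BLASTP hits, self-hit included (assumption)
    u      -- uniqueness value; only needed when h > 1
    u_max  -- threshold for "very closely related" genomes (0.1 on the slide)
    """
    if h == 1:
        # Homologs, if any, are confined to the ORF's own genome.
        return "singleton" if n_hits == 1 else "paralogous"
    if u is not None and u <= u_max:
        return "orthologous"
    return "non-ORFan"


if __name__ == "__main__":
    print(classify_orf(1, 1))          # singleton
    print(classify_orf(1, 3))          # paralogous
    print(classify_orf(4, 6, u=0.05))  # orthologous
    print(classify_orf(4, 6, u=0.4))   # non-ORFan
```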
The U-value for all ORFs in prokaryote genomes
In total:
ORFs: 818906
ORFans: 110186
S (singleton): 64324 (7.8%)
P (paralogous): 10419 (1.3%)
O (orthologous): 35443 (4.3%)
[Figure labels: 0.64; S or P; O]
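As a quick check of the totals above, the class counts can be summed and expressed as shares of all ORFs. The numbers come from the slide; the variable names are illustrative. The three classes add up to the ORFan total, and the singleton share works out to about 7.85%, which the slide lists as 7.8%.

```python
total_orfs = 818906
counts = {"S": 64324, "P": 10419, "O": 35443}  # singleton, paralogous, orthologous

orfan_total = sum(counts.values())
shares = {name: 100.0 * n / total_orfs for name, n in counts.items()}

if __name__ == "__main__":
    print("ORFans:", orfan_total)
    for name, pct in shares.items():
        print(f"{name}: {pct:.2f}% of all ORFs")
    print(f"All ORFans: {100.0 * orfan_total / total_orfs:.2f}% of all ORFs")
```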
• ORFans-VH%(OVH): % of ORFans having homologs in viruses (0% ~ 63.8%)
• Non-ORFans-VH%(NOVH): % of non-ORFans having homologs in viruses (4.1% ~ 18.2%)
• The strength of the hypothesis = the difference between these two VH% values
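The two percentages above can be computed per genome along these lines. This is a sketch under assumptions: each ORF is represented as a pair of booleans (is it an ORFan, does it have a viral homolog), the function name `vh_percentages` is hypothetical, and reading the "strength" as OVH minus NOVH is an interpretation of the slide.

```python
def vh_percentages(orfs):
    """Return (OVH, NOVH) in percent for one genome.

    orfs -- iterable of (is_orfan, has_viral_homolog) boolean pairs (assumed format)
    A value is None when the genome has no ORFs in that class.
    """
    orfan_total = orfan_vh = other_total = other_vh = 0
    for is_orfan, has_viral_homolog in orfs:
        if is_orfan:
            orfan_total += 1
            orfan_vh += has_viral_homolog
        else:
            other_total += 1
            other_vh += has_viral_homolog
    ovh = 100.0 * orfan_vh / orfan_total if orfan_total else None
    novh = 100.0 * other_vh / other_total if other_total else None
    return ovh, novh


if __name__ == "__main__":
    genome = [(True, True), (True, False),
              (False, True), (False, False), (False, False), (False, False)]
    ovh, novh = vh_percentages(genome)
    print("OVH:", ovh, "NOVH:", novh, "difference:", ovh - novh)
```

An OVH clearly above NOVH would favor viral origin of ORFans; an OVH below NOVH would not.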
Percentages of microbial ORFs with homologs in viruses
Red: OVH
Blue: NOVH
24 phylogenetic clades
Bacteria
Archaea
Firmicutes
Gamma-proteobacteria
The average % of OVH and NOVH in various groups
Genome counts shown: 148, 66, 63
10% vs 9%
8.5% vs 2.7%
6.6% vs 0.8%
Conclusion
• Most OVH << NOVH: current evidence supporting the hypothesis is weak
• Firmicutes and Gamma-proteobacteria have the highest number of homologs in viruses (the viral database is biased toward their phages)
Viral database bias
1456 viruses
280 phages (hosts: 109 Gamma-proteobacteria; 102 Firmicutes; 69 others)
Sampling?
Viral genome → Microbial genome
• 277 microbial genomes
• 1456 viruses
All-virus-DB: 43566 ORFs
• 280 phages (20%)
Phage-DB: 18368 ORFs (42%)
ORFans:
all-virus: 13078 (30%) (vs. all-virus-DB); 8200 (vs. all nr, env-nr)
all-phage: 6765 (vs. all-virus-DB); 7047 (vs. phage-DB)
Some characteristics of ORFans
• Bacterial ORFans are shorter than non-ORFans on average
• Bacterial ORFans have significantly lower GC3 content than non-ORFans
The length of Viral ORFans and non-ORFans
Length: Non-ORFans > ORFans
Length: ORFans < non-ORFans
GC3%: ORFans < non-ORFans
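GC3 is the G+C fraction at third codon positions. A minimal sketch of how it could be computed for one coding sequence is shown below. The function name `gc3_percent` is hypothetical, the sequence is assumed to be in frame starting at its first base, and the stop codon is counted if it is part of the input; the slides do not say how the presenter handled these details.

```python
def gc3_percent(cds):
    """Return the percentage of G or C at third codon positions, or None.

    cds -- coding sequence as a string, assumed in frame from position 0
    Trailing bases that do not form a complete codon are ignored.
    """
    seq = cds.upper()
    usable = len(seq) - len(seq) % 3   # drop an incomplete final codon
    third = seq[2:usable:3]            # bases at positions 3, 6, 9, ...
    if not third:
        return None
    gc = sum(base in "GC" for base in third)
    return 100.0 * gc / len(third)


if __name__ == "__main__":
    # Codons ATG GCC AAA TAA have third bases G, C, A, A.
    print(gc3_percent("ATGGCCAAATAA"))
```

Comparing the distribution of this value between ORFans and non-ORFans gives the GC3 comparison reported on the slide.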
The number of ORFs per genome in 1456 viruses
Focusing on phage: higher %
Growth in the number of phage ORFans (consistent)
Drop to 0?
Keep increasing
38.4%
• Each microbial species is a host for at least 10 phage species, so phage diversity is at least 10 times higher than microbial diversity
• Only 280 phage genomes in the database (low phage sampling)
Fewer than 5 phages
Virus sampling bias between and within groups
The H-value percentages for all phage ORFs and prokaryotic ORFs
Prokaryotes: 9.1% ORFans; 11.3% ortho
Phages: 38.4% ORFans; 32.4% ortho
The H-value percentages of phage ORFs
• 4397 (61.5%) / 7150 (63.8%) / 11212 (prophage / prokaryotic homologs / phage non-ORFans)
• 589 (44.7%) / 1317 (18.7%) / 7047 (prophage / prokaryotic homologs / phage ORFans)
• 4987 (58.9%) / 8467 (46.4%) / 18248 (prophage / prokaryotic homologs / phage ORFs)
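The percentages in these three rows appear to be nested fractions: the first is the prophage count as a share of the ORFs with prokaryotic homologs, and the second is the prokaryotic-homolog count as a share of the row total. That reading is inferred from the arithmetic rather than stated on the slide. The sketch below reproduces the figures under that assumption; the names `rows` and `nested_percentages` are illustrative.

```python
# (prophage homologs, prokaryotic homologs, total phage ORFs in the class)
rows = {
    "non-ORFans": (4397, 7150, 11212),
    "ORFans": (589, 1317, 7047),
    "all ORFs": (4987, 8467, 18248),
}


def nested_percentages(prophage, prokaryotic, total):
    """Return (prophage share of prokaryotic homologs, prokaryotic share of total)."""
    return (round(100.0 * prophage / prokaryotic, 1),
            round(100.0 * prokaryotic / total, 1))


if __name__ == "__main__":
    for name, counts in rows.items():
        prophage_pct, prokaryotic_pct = nested_percentages(*counts)
        print(f"{name}: {prophage_pct}% / {prokaryotic_pct}%")
```

On this reading, only 18.7% of phage ORFans have any prokaryotic homolog, compared with 63.8% of phage non-ORFans.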