Reconsidering the significance of genomic word frequency

arXiv:q-bio/0609022v1 [q-bio.GN] 14 Sep 2006


Miklós Csűrös∗ Laurent Noé† Gregory Kucherov†

February 5, 2008

Determining what constitutes unusually frequent and rare in a genome is a fundamental and ongoing issue in genomics [6]. Sequence motifs may be frequent because they appear in mobile, structural or regulatory elements. It has been suggested that some recurrent sequence motifs indicate hitherto unknown or poorly understood biological phenomena [17]. We propose that the distribution of DNA words in genomic sequences can be primarily characterized by a double Pareto-lognormal distribution, which explains lognormal and power-law features found across all known genomes. Such a distribution may be the result of completely random sequence evolution by duplication processes. The parametrization of genomic word frequencies allows for an assessment of significance for frequent or rare sequence motifs.

The simplest type of sequence motif is a DNA word of a fixed length ℓ, called an ℓ-mer. The number of occurrences of a word w is denoted by N(w), and the distribution of N(w) across all ℓ-mers is called the word frequency distribution or spectrum. The statistical significance of a word’s unusual abundance (or rarity) is assessed by referring to a null model of random sequences. Standard null models include sequences of independent and identically distributed letters (Bernoulli model) and low-order Markov models [16]. In such random text models, over- and underrepresentation of short words (typically, 8 ≤ ℓ ≤ 16) are evaluated by using Poisson or Gaussian approximations, implying a rapidly decreasing tail in the spectrum. It is also customary to use localized random shuffling of the studied genome sequence in order to preserve large-scale compositional heterogeneity. Empirical word frequency distributions in shuffled sequences also have a light tail.

∗ Department of Computer Science and Operations Research, Université de Montréal, CP 6128, succ. Centre-Ville, Montréal, Québec H3C 3J7, Canada.

† Laboratoire d’Informatique Fondamentale de Lille, Bât. M3, 59655 Villeneuve d’Ascq Cedex, France.

Figure 1: 13-mer frequencies in repeat-masked human chromosome 5, and in random sequences of the same length. (Axes: occurrence (horizontal), words (vertical). Series: real, Bernoulli, Markov, shuffled, random duplications with m=18.)

In reality, genomic word frequency distributions have a prominent heavy tail, which is not captured by random text models (Fig. 1). Depending on the genome length L and its relative size with respect to the number of ℓ-mers 4^ℓ, the spectrum may show a power-law decrease on the left or right, or have a lognormal shape. The heavy tail on the right cannot be entirely attributed to mobile elements, as it is present even in repeat-masked vertebrate and many smaller genomes (Figures 2 and 3). Consequently, random text models and shuffling tend to underestimate the probability of frequent words in long sequences.
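The definitions of N(w) and of the spectrum can be made concrete with a minimal sketch. It counts ℓ-mers on a single strand, tabulates how many words occur exactly n times, and computes the light-tailed spectrum expected from a Poisson approximation. The function names are hypothetical, and the Poisson reference assumes a Bernoulli model with uniform letter probabilities, which is a simplification and not necessarily the null model used for Figure 1.

```python
import math
from collections import Counter

def word_counts(seq, ell):
    """Count N(w) for each ell-mer w, skipping windows with ambiguous letters."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - ell + 1):
        w = seq[i:i + ell]
        if set(w) <= set("ACGT"):
            counts[w] += 1
    return counts

def spectrum(counts, ell):
    """Return W as a dict: W[n] is the number of ell-mers occurring exactly n times."""
    W = Counter(counts.values())
    W[0] = 4 ** ell - len(counts)  # words that never occur
    return dict(W)

def poisson_spectrum(num_positions, ell, n_max):
    """Expected W(n), n = 0..n_max, under a uniform Bernoulli model (assumption),
    using the Poisson approximation with mean num_positions / 4^ell."""
    lam = num_positions / 4 ** ell
    return [4 ** ell * math.exp(-lam + n * math.log(lam) - math.lgamma(n + 1))
            for n in range(n_max + 1)]
```

Plotting the output of `spectrum` against `poisson_spectrum` on logarithmic axes is one way to see how quickly the random-text tail falls off compared with an observed spectrum.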

Figure 2: Spectra in smaller genomes. For each full genome (concatenated chromosome sequences if necessary), the ℓ-mer frequency distribution is shown by dots, and the fitted DPL distribution by a solid line. Notice the power law in the lower tail of the B. subtilis genome spectrum. (Axes: occurrence (horizontal), words (vertical). Series: S. cerevisiae, 11-mers; A. pernix, 9-mers; N. equitans, 8-mers; B. subtilis, 8-mers; M. genitalium, 9-mers.)

To this day, there has been no exact characterization of genomic spectra, aside from the observation of power-law behavior for certain word sizes [8, 9, 10] in the right-hand tail. This note aims to point out that a parametric distribution describes word frequencies extremely well in prokaryotic and eukaryotic genome sequences. The distribution in question is the so-called double Pareto-lognormal (DPL) distribution [15]. It fits many real-life size distributions, including that of wealth in society, human settlement sizes, and file sizes on the Internet. The distribution has four parameters: α > 0, β > 0, ν and τ > 0; it has a power-law (Pareto) tail to the left and to the right, with slopes characterized by the parameters β and α, respectively; in the middle, its shape is dominated by a lognormal distribution with parameters ν and τ.
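One way to build intuition for the roles of the four parameters is to draw samples from a DPL distribution. The sketch below assumes the representation usually attributed to Reed and Jorgensen [15], in which ln X is the sum of a normal variable with mean ν and standard deviation τ and the difference of two independent exponential variables scaled by 1/α and 1/β. That representation is not stated in this note, so it should be read as an assumption; the function name is hypothetical.

```python
import math
import random

def sample_dpl(alpha, beta, nu, tau, rng):
    """Draw one DPL variate, assuming ln X = nu + tau*Z + E1/alpha - E2/beta
    with Z standard normal and E1, E2 standard exponential (assumed representation)."""
    z = rng.gauss(0.0, 1.0)
    e1 = rng.expovariate(1.0)  # controls the upper (right) power-law tail
    e2 = rng.expovariate(1.0)  # controls the lower (left) power-law tail
    return math.exp(nu + tau * z + e1 / alpha - e2 / beta)

rng = random.Random(2006)
samples = [sample_dpl(2.0, 3.0, 1.0, 0.5, rng) for _ in range(20000)]
```

Under this assumption, a smaller α makes the right-hand tail heavier, a smaller β makes the left-hand tail heavier, and ν and τ shift and widen the lognormal body.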

Figure 2 illustrates that a single DPL distribution’s four parameters can be adjusted to describe spectra across hundreds or thousands of word frequencies in non-vertebrate genomes. (More examples for different spectra are shown on the web site http://www.iro.umontreal.ca/~csuros/spectrum/.) For very short words with respect to the genome length L (about 8 ≤ ℓ ≤ (log_4 L) − 4), the spectrum has mainly a lognormal shape. As ℓ increases, the mode of the lognormal component shifts downwards and the lower power-law tail becomes more and more discernible, followed by the appearance of the upper power-law tail. For long words (from around ℓ > (log_4 L) + 2), the upper power-law tail dominates the spectrum. Given that compact genomes are mostly composed of coding sequences, the upper power-law tail is another manifestation of the power law for protein domain occurrences [8].
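The two approximate thresholds above can be turned into a small helper that reports which shape is expected to dominate for a given word length and genome length. This is only a sketch of the rule of thumb stated in the text: the function name and the label "intermediate" are hypothetical, the cut-offs are approximate by the authors' own wording, and the lower bound of about ℓ = 8 is not checked.

```python
import math

def spectrum_regime(ell, genome_length):
    """Classify the expected spectrum shape from the approximate thresholds
    ell <= log4(L) - 4 (lognormal) and ell > log4(L) + 2 (upper power-law tail)."""
    log4_L = math.log(genome_length, 4)
    if ell <= log4_L - 4:
        return "lognormal"
    if ell > log4_L + 2:
        return "upper power-law"
    return "intermediate"  # both tails may be visible alongside the lognormal body
```

For a hypothetical sequence of L = 10^8 nucleotides, log_4 L is roughly 13.3, so words of length up to 9 fall in the lognormal regime and words of length 16 or more in the regime dominated by the upper power-law tail.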

Figure 3: Spectra of CpG-free ℓ-mers on repeat-masked human chromosome 12. (Axes: occurrence (horizontal), words (vertical). Series: 13-mers, 12-mers, 11-mers, 10-mers, 9-mers, 8-mers.)

In organisms with strong dinucleotide bias, such as for CpG in vertebrates or for ApT in honey bee [2], the spectrum can be decomposed into multiple DPL distributions by dinucleotide content (Figure 4). Spectra for words without CpG dinucleotides in repeat-masked vertebrate sequences have a marked DPL shape (Fig. 3). We illustrate the analysis of a large genomic spectrum with the example of human chromosome 12, which is typical of the human genome with respect to repeat element distribution and cytosine-guanine content [19]. Abundant repeat elements may cause deviations from the DPL distribution, which may be the basis of their identification using a DPL null model, but often they are absorbed in the fundamental DPL curve (Figure 5). Table 1 analyzes the composition of the spectrum’s tail in human chromosome 12. The contribution of non-repeat sequences in the tail decreases when moving toward higher word frequencies, but it levels off at about 25%.

Figure 4: 9-mer frequencies in non-repeat annotated regions of human chromosome 12. The spectrum can be decomposed by CpG content into a few DPL distributions, shown by solid lines. (Axes: occurrence (horizontal), words (vertical). Series: all, CpG=0, CpG=1, CpG=2, CpG=3.)

n     12-mers (a)   sequence (b)   non-repeat (c)   SINE (d)   LINE (e)   LTR (f)   other (g)
12    19.61         69.6           47.2             16.7       21.4       8.4       6.3
25     5.34         39.7           41.5             21.3       22.6       7.6       7.0
30     3.67         33.9           39.6             23.1       22.9       7.3       7.2
50     1.30         22.6           34.0             28.8       23.2       6.5       7.5
60     0.91         19.9           32.1             31.0       23.1       6.2       7.5
100    0.37         14.7           27.9             36.7       22.6       5.4       7.4
120    0.28         13.4           26.8             38.5       22.3       5.1       7.3
200    0.14         10.6           24.6             43.0       20.7       4.7       7.0
240    0.10          9.7           24.2             44.9       19.4       4.6       6.9

Table 1: Composition of the 12-mer spectrum’s tail in human chromosome 12. Column (a) gives the fraction of words that occur at least n times; column (b) lists the fraction of the genome sequence covered by such words. Columns (c–g) list the fraction of those word occurrences within non-repeat regions, short interspersed elements, long interspersed elements, long terminal repeats, and other repeat elements (including DNA transposons, simple repeats, low-complexity and tandem repeats), respectively. Fractions are expressed as percentages.

Figure 5: Decomposing the contribution of different repeat families in the spectrum of CpG-free 12-mers along human chromosome 12. Most of the deviation from DPL in the right-hand tail is caused by Alu and L1 families and simple repeats. (Axes: occurrence (horizontal), words (vertical). Series: all, non-repeat, Alu, L1, simple.)

The power-law tail of gene family size distributions [8] can be explained by birth and death processes [5, 14]. Similar arguments apply to genomic spectra. Consider the occurrences of a particular word along the genome as a “population.” The population size is affected by mutational events, including duplications, insertions, deletions and point mutations. The population can increase by any copying mechanism, including segmental duplication and retrotranscription. The population decreases if an occurrence is destroyed by some mutation. Point mutations can create new populations, but so can insertion (at the insertion boundaries) and deletion (by bringing two halves of a word together). It is thus conceivable that a birth and death model appropriately models the spectrum’s evolution. In order to illustrate this point, we carried out a simulation experiment in which a DNA sequence evolved solely by copying. We iteratively expanded an initial random DNA sequence by selecting a contiguous piece of a fixed length m in each iteration, and copying it back into the sequence at a random position. The resulting sequence exhibits the same heavy tail as the real-life sequences do (Figure 1). The parameters of the fitted DPL distribution are affected by the copy size m, and further vary with the introduction of a point mutation process. It is thus likely that the DPL distribution of genomic spectra is yet another sign of ceaseless “tinkering” [4] within the genome.
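The copy-only simulation described above can be sketched in a few lines. The text specifies only a random initial sequence, a fixed piece length m, and a random insertion position; the initial length, the uniform choice of source and target positions, and the stopping rule below are assumptions made for illustration, and the function name is hypothetical.

```python
import random

def evolve_by_copying(initial_length, final_length, m, rng):
    """Grow a random DNA sequence by repeatedly copying a contiguous piece of
    length m and inserting the copy at a random position (uniform choices assumed)."""
    if initial_length < m:
        raise ValueError("initial sequence must be at least m letters long")
    seq = [rng.choice("ACGT") for _ in range(initial_length)]
    while len(seq) < final_length:
        start = rng.randrange(len(seq) - m + 1)  # source of the copied piece
        piece = seq[start:start + m]
        pos = rng.randrange(len(seq) + 1)        # insertion point, ends included
        seq[pos:pos] = piece
    return "".join(seq)

# Small demonstration with m = 18, the copy size shown in Figure 1.
demo = evolve_by_copying(100, 1000, 18, random.Random(1))
```

List insertion makes each step linear in the sequence length, so this version is only suitable for short sequences; a simulation at the scale of a chromosome would need a more efficient data structure.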

Heavy-tail distributions are signs of self-similarity, or long-range autocorrelation, which has been observed before in DNA sequences [7, 12], and studied thoroughly in telecommunications engineering [18]. Long-range autocorrelation at the single nucleotide level can result from so-called expansion-randomization processes [11], which model sequence evolution by deletions, mutations and duplications.

The birth and death model implies that some words occur often simply by chance, and not because of their functionality. Words that are abundant at an early point of evolution tend to preserve their relative abundance in the course of random copy events, in accordance with the principle of “rich get richer” underlying power-law distributions. Therefore, even the high frequency of a particular word across many related species does not imply functionality on its own, as the word may have been frequent by chance in a common ancestor already. Association with genes is not necessarily a sign of functionality either, since transcribed regions that harbor a frequent word can help its propagation throughout the genome by retrotranscription.

Our investigations show that the distribution of word frequencies can be well approximated by a parametric model, the double Pareto-lognormal distribution. Such a distribution may result from a long history of evolutionary tinkering: copying, rearranging, deleting, and changing different parts of the genome. The heavy tail of word frequency distributions means that findings of frequent motifs need to be assessed with extreme care, especially if their overrepresentation is related to word occurrences in random texts.


Methods

Words were counted only on one strand of the DNA sequences (the ‘plus’ strand of the sequence file), with the exception of the 16-mers in the human genome, where both strands were scanned. We counted the occurrence of a word w if it appeared in a given sequence at some position i..i + ℓ − 1, without ambiguous nucleotides. The DPL distribution was fitted using its cumulative distribution function (cdf), which is

$$
F(x) = \Phi\!\left(\frac{\ln x - \nu}{\tau}\right)
+ \frac{\alpha}{\alpha+\beta}\, x^{\beta} e^{-\beta\nu + \beta^{2}\tau^{2}/2}\, \Phi\!\left(-\frac{\ln x - \nu + \beta\tau^{2}}{\tau}\right)
- \frac{\beta}{\alpha+\beta}\, x^{-\alpha} e^{\alpha\nu + \alpha^{2}\tau^{2}/2}\, \Phi\!\left(\frac{\ln x - \nu - \alpha\tau^{2}}{\tau}\right)
$$

for x > 0 and F(x) = 0 for x ≤ 0, where Φ(·) denotes the cdf of the standard normal distribution. The spectrum consists of the numbers W(n) of ℓ-mers occurring exactly n times for all n = 0, 1, 2, . . . In order to fit the distribution’s parameters, the spectrum (W(n) : n = 0, 1, . . .) was considered as a set of binned values for independently drawn samples from a continuous DPL distribution: W(n) was compared to the predicted value 4^ℓ (F(n + 1/2) − F(n − 1/2)). We used custom-made programs to carry out the parameter fitting, using the Levenberg-Marquardt algorithm [13], a nonlinear least-squares method, for which the starting parameter values were set by likelihood maximization [15].
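The cdf and the binned prediction 4^ℓ (F(n + 1/2) − F(n − 1/2)) can be evaluated with the standard library alone, as in the following sketch. It is not the authors' custom-made program: the function names are hypothetical, the factors x^β e^{−βν} and x^{−α} e^{αν} are computed in the exponent for convenience, and no safeguards are included against overflow for extreme parameter values.

```python
import math

def std_normal_cdf(z):
    """Phi(z), the standard normal cdf, via the complementary error function."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def dpl_cdf(x, alpha, beta, nu, tau):
    """DPL cdf F(x) as given in the Methods; F(x) = 0 for x <= 0."""
    if x <= 0:
        return 0.0
    y = math.log(x)
    body = std_normal_cdf((y - nu) / tau)
    lower = (alpha / (alpha + beta)
             * math.exp(beta * (y - nu) + 0.5 * beta ** 2 * tau ** 2)
             * std_normal_cdf(-(y - nu + beta * tau ** 2) / tau))
    upper = (beta / (alpha + beta)
             * math.exp(-alpha * (y - nu) + 0.5 * alpha ** 2 * tau ** 2)
             * std_normal_cdf((y - nu - alpha * tau ** 2) / tau))
    return body + lower - upper

def predicted_spectrum(ell, n_max, alpha, beta, nu, tau):
    """Predicted W(n) for n = 0..n_max: 4^ell * (F(n + 1/2) - F(n - 1/2))."""
    total = 4 ** ell
    return [total * (dpl_cdf(n + 0.5, alpha, beta, nu, tau)
                     - dpl_cdf(n - 0.5, alpha, beta, nu, tau))
            for n in range(n_max + 1)]
```

The differences between observed and predicted W(n) could then be passed as residuals to any Levenberg-Marquardt implementation, for example `scipy.optimize.least_squares` with `method="lm"`; whether to weight or log-transform the residuals is a modeling choice that the text does not specify.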

We defined CpG content of a word w (Fig. 4) as the number of non-overlapping CG and GC dinucleotides in w. The contribution of different annotations (Fig. 5) was computed by multiplying each W(n) value in the spectrum by the fraction of occurrences within the annotated regions for words appearing n times in the entire sequence. For the random shuffling of Fig. 1, we partitioned the sequence into contiguous segments containing exactly 1000 non-ambiguous nucleotides. Non-ambiguous nucleotides were shuffled within each segment by generating a uniform random permutation.
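The localized shuffling step can be sketched as follows. The sketch treats the upper-case letters A, C, G and T as non-ambiguous, leaves every other character in place, and lets the final segment hold fewer nucleotides than the others; these three choices, and the function name, are assumptions and not details given in the text.

```python
import random

def shuffle_in_segments(seq, segment_size, rng):
    """Permute non-ambiguous nucleotides uniformly at random within consecutive
    segments that each contain segment_size non-ambiguous nucleotides."""
    out = list(seq)
    positions = [i for i, c in enumerate(seq) if c in "ACGT"]
    for start in range(0, len(positions), segment_size):
        block = positions[start:start + segment_size]  # last block may be shorter
        letters = [seq[i] for i in block]
        rng.shuffle(letters)                           # uniform random permutation
        for i, c in zip(block, letters):
            out[i] = c
    return "".join(out)

example = "ACGTNNACGTACGTNACGT" * 10
shuffled = shuffle_in_segments(example, 50, random.Random(7))
```

Because letters only move within their own segment, the base composition of each segment is preserved, which is what keeps large-scale compositional heterogeneity intact.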

Human sequences (original and repeat-masked) and repeat annotations (Figures 1, 3, 4 and 5) were obtained from the UCSC genome browser [3] gateway’s FTP server (ftp://hgdownload.cse.ucsc.edu/), for version hg18 (NCBI Build 36.1). (The repeat annotations were generated by the programs RepeatMasker [20] and Tandem Repeats Finder [1].) Other sequences (Fig. 2) were downloaded from the NCBI FTP server (ftp://ftp.ncbi.nlm.nih.gov/genomes/).


References

[1] G. Benson. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Research, 27(2):573–580, 1999.

[2] The Honey Bee Genome Sequencing Consortium. The genome of a highly social species, the honey bee Apis mellifera. 2006. Under revision.

[3] A. S. Hinrichs, D. Karolchik, R. Baertsch, G. P. Barber, G. Bejerano, H. Clawson, M. Diekhans, T. S. Furey, R. A. Harte, F. Hsu, J. Hillman-Jackson, R. M. Kuhn, J. S. Pedersen, A. Pohl, B. J. Raney, K. R. Rosenbloom, A. Siepel, K. E. Smith, C. W. Sugnet, A. Sultan-Qurraie, D. J. Thomas, H. Trumbower, R. J. Weber, M. Weirauch, A. S. Zweig, D. Haussler, and W. J. Kent. The UCSC genome browser database: update 2006. Nucleic Acids Research, 34:D590–598, 2006.

[4] F. Jacob. Evolution and tinkering. Science, 196(4295):1161–1166, 1977.

[5] G. P. Karev, Y. I. Wolf, A. Y. Rzhetsky, F. S. Berezovskaya, and E. V. Koonin. Birth and death of protein domains: a simple model of evolution explains power law behavior. BMC Evolutionary Biology, 2:18, 2002.

[6] S. Karlin. Statistical signals in bioinformatics. Proceedings of the National Academy of Sciences of the USA, 102(38):13355–13362, 2005.

[7] W. Li and K. Kaneko. Long-range correlation and partial 1/f^α spectrum in a noncoding DNA sequence. Europhysics Letters, 17:655–660, 1992.

[8] N. M. Luscombe, J. Qian, Z. Zhang, T. Johnson, and M. Gerstein. The dominance of the population by a selected few: power-law behavior applies to a wide variety of genomic properties. Genome Biology, 3(8):research0040.1–0040.7, 2002.

[9] R. N. Mantegna, S. V. Buldyrev, A. L. Goldberger, S. Havlin, C.-K. Peng, M. Simons, and H. E. Stanley. Systematic analysis of coding and noncoding DNA sequences using methods of statistical linguistics. Physical Review E, 52(3):2939–2950, 1995.

[10] C. Martindale and A. K. Konopka. Oligonucleotide frequencies in DNA follow a Yule distribution. Computers & Chemistry, 20(1):35–38, 1996.

[11] P. W. Messer, M. Lässig, and P. F. Arndt. Universality of long-range correlations in expansion-randomization systems. Journal of Statistical Mechanics, 2005. P10004.

[12] C. K. Peng, S. V. Buldyrev, A. L. Goldberger, S. Havlin, F. Sciortino, M. Simons, and H. E. Stanley. Long-range correlations in nucleotide sequences. Nature, 356:168–170, 1992.

[13] W. H. Press, S. A. Teukolsky, W. V. Vetterling, and B. P. Flannery. Numerical Recipes in C: The Art of Scientific Computing. Cambridge University Press, second edition, 1997.

[14] W. J. Reed and B. D. Hughes. A model explaining the size distribution of gene families. Mathematical Biosciences, 189:97–102, 2004.

[15] W. J. Reed and M. Jorgensen. The double Pareto-lognormal distribution: a new parametric model for size distributions. Communications in Statistics: Theory and Methods, 33(8):1733–1753, 2004.

[16] G. Reinert, S. Schbath, and M. S. Waterman. Probabilistic and statistical properties of words: An overview. Journal of Computational Biology, 7(1/2):1–46, 2000.

[17] I. Rigoutsos, T. Huynh, K. Miranda, A. Tsirigos, A. McHardy, and D. Platt. Short blocks from the noncoding parts of the human genome have instances within nearly all known genes and relate to biological processes. Proceedings of the National Academy of Sciences of the USA, 103(17):6605–6610, 2006.

[18] Z. Sahinoglu and S. Tekinay. On multimedia networks: self-similar traffic and network performance. IEEE Communications Magazine, 37(1):48–52, 1999.

[19] S. E. Scherer et al. The finished DNA sequence of human chromosome 12. Nature, 440(7082):346–351, 2006.

[20] A. F. A. Smit, R. Hubley, and P. Green. RepeatMasker Open-3.0, 1996–2004. http://www.repeatmasker.org.

Correspondence  Correspondence should be addressed to M.Cs. (www: http://www.iro.umontreal.ca/~csuros/).