Some Probabilistic Results on the Non-randomness of Simple Sequence Repeats in DNA Sequences 2007 DIMACS Workshop on Mathematical Modeling of Infectious Diseases in Africa, Stellenbosch, South Africa, June 25-27 Asamoah Nkwanta, Morgan State University

Dec 20, 2015

Page 1

Some Probabilistic Results on the Non-randomness of Simple Sequence Repeats in DNA Sequences

2007 DIMACS Workshop on Mathematical Modeling of Infectious Diseases in Africa, Stellenbosch, South Africa, June 25-27

Asamoah Nkwanta, Morgan State University

Joint work with Wilfred Ndifon & Dwayne Hill

Page 2

Nonrandomness of Microsatellites

“Numerous lines of evidence have demonstrated that genomic distribution of simple sequence repeats (SSRs) is nonrandom because of their effects on chromatin organization, regulation of gene activity, recombination, DNA replication, cell cycle, …”

You-Chun Li et al., Microsatellites Within Genes: Structure, Function, and Evolution, Molecular Biology and Evolution 21 (2004)

Page 3

TOPICS

Introduction DNA/RNA
Preliminaries on SSRs
Counting SSRs
Probability & Expectations of SSRs
Results & Conclusion

Page 4

Introduction: Molecular Biology

Proteins are the building blocks of living organisms.

The information necessary for producing proteins is encoded in Deoxyribonucleic acid (DNA).

DNA is considered as a set of words defined over the genetic alphabet consisting of the letters A (Adenine), T (Thymine), C (Cytosine), & G (Guanine).

Page 5

Introduction: Molecular Biology (Cont.)

DNA has a double-helix structure.

Page 6

Introduction: Molecular Biology (Cont.)

Ribonucleic acid (RNA) mediates the translation of DNA into proteins.

An RNA molecule consists of a sequence of ribonucleotides. Each ribonucleotide contains one of four bases: A, C, G, and U (Uracil). (Note: in RNA, Uracil takes the place of the Thymine found in DNA.)
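As a small illustration of the T-to-U substitution, here is a minimal Python sketch. The function name `transcribe` is hypothetical, and the sketch assumes the common coding-strand convention, in which the RNA is written with the same base order as the DNA strand and only T is replaced by U.

```python
def transcribe(dna: str) -> str:
    """Write a DNA string as RNA: same base order, with U wherever DNA has T.

    Assumption: coding-strand convention (no complementing, no reversal).
    """
    return dna.upper().replace("T", "U")


if __name__ == "__main__":
    print(transcribe("TACCCAGCAGGCCTATATATA"))  # UACCCAGCAGGCCUAUAUAUA
```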

Page 7

Introduction: Molecular Biology (Cont.)

Central Dogma of Molecular Biology

DNA → RNA → Protein

(DNA → RNA: Transcription; RNA → Protein: Translation)

Page 8

Introduction: Molecular Biology (Cont.)

Page 9

Introduction: Molecular Biology (Cont.)

Revised Central Dogma

Page 10

DNA Microsatellites: Agents of Evolution? By Moxon and Wills. Scientific American, January 1999, 6 pages.

A human's genetic code consists of roughly three billion bases of DNA, the familiar "letters" of the DNA alphabet. But a mere 10 to 15 percent of those bases make up genes, the blueprints cells use to build proteins. Some of the remaining base sequences in humans-and in many other organisms-perform crucial functions, such as helping to turn genes "on" and "off" and holding chromosomes together. Much of the DNA, however, seems to have no obvious purpose at all, leading some to refer to it as "junk."

Part of this "junk DNA" includes strange regions known as DNA satellites. These are repetitive sequences made up of various combinations of the four DNA bases-adenine (A), cytosine (C), guanine (G) and thymine (T)-repeated over and over, like a genetic stutter. In the past several years, researchers have begun to find that so-called microsatellites, those containing the shortest repeat sequences, have a significance disproportionately great for their size and perform a variety of remarkable functions.

Page 11

Preliminaries: What are SSRs?

Simple Sequence Repeats (SSRs), or microsatellites, are regions (motifs) of genomic DNA in which a short sequence of nucleotides, 1 to 6 base pairs in length, is repeated in tandem.

The repeat units most often used are di-, tri-, and tetranucleotides.

Page 12

Preliminaries (cont.): What are SSRs?

Example: TACCCAGCAGGCCTATATATA.

This is a DNA sequence of length 21 that contains stretches of dimers (TA), trimers (CAG), and tetramers (TATA).

The CAG repeats are contiguous; the TA repeats are non-contiguous.
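The example can be checked mechanically. The following is a minimal Python sketch, not the authors' software: the function name `find_tandem_runs` is hypothetical, positions are reported 1-based, and runs are found greedily from left to right, which is an assumption that matters only for motifs that overlap themselves (such as AA).

```python
def find_tandem_runs(seq: str, motif: str) -> list[tuple[int, int]]:
    """Return (start, copies) for each maximal run of back-to-back motif copies.

    start is the 1-based position of the run; copies is the number of
    tandem repeats of motif in that run (1 means a single isolated copy).
    """
    k = len(motif)
    runs = []
    i = 0
    while i <= len(seq) - k:
        if seq.startswith(motif, i):
            j = i
            copies = 0
            # Extend the run while another copy follows immediately.
            while seq.startswith(motif, j):
                copies += 1
                j += k
            runs.append((i + 1, copies))
            i = j  # continue scanning after the run
        else:
            i += 1
    return runs


if __name__ == "__main__":
    x = "TACCCAGCAGGCCTATATATA"
    print(find_tandem_runs(x, "CAG"))  # [(5, 2)]: two contiguous copies
    print(find_tandem_runs(x, "TA"))   # [(1, 1), (14, 4)]: separated runs
```

The two TA runs are separated by other bases, which is the sense in which TA is non-contiguous in this sequence.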

Page 13

Preliminaries (Cont.)

Table 1. Total Lengths (a) of Simple Sequence Repeats by Repeated Unit Length

Columns 1 to 6: length of repeated motif (bp)

Taxonomic group            1     2     3     4     5     6   Total
Primates                3429  1643   477  1368   898   341    8156
Human chromosome 22     5141  1511   604  1906  1097   419   10678
Rodentia                1839  5461  1196  2942  1417  1034   13889
Mammalia                1397  2312   532   915   774   693    6623
Vertebrata              1418  2449  1069  1279   709   220    7144
Arthropoda               985  1403   956   439   732   875    5390
C. elegans               428   556   337   144   225   449    2139
Embryophyta             1245  1067   880   184   491   272    4139
S. cerevisiae           1075   580   646    93   204   406    3004
Fungi                    905   272   485   194   395   426    2677

(a) Base pairs (bp) per megabase of DNA.

SSRs are relatively abundant.

Page 14

Preliminaries (Cont.)

Some Characteristics of SSRs: They are

Highly mutable
Good molecular markers
Involved in gene regulation
Involved in the development of immune system cells
Associated with at least 20 human diseases, including Huntington’s disease and some cancers

Page 15

Preliminaries (Cont.)

Real World Applications:

Linkage analysis (related to inheritance)
DNA fingerprinting
Genome sequencing (e.g., genome of the apple plant)
Diagnosis of genetic disorders
Paternity tests
Forensic studies
Population & ecological genetic studies

Page 16

Ecological Genetics of Parasitic Sea Lice

Typical epidermal lesions caused by adult female Lepeophtheirus salmonis in the region of the anal fin of a wild-caught salmon

Marine Ecology Research Group: www.st-andrews.ac.uk/~merg/sea%20lice.htm

Page 17

Ecological Genetics of Parasitic Sea Lice (cont.)

Recent research has focused on the development and screening of L. salmonis-specific microsatellites. Microsatellites such as CACACACACACA are dispersed throughout the genome.

Chromas file depicting base sequence of L. salmonis repeat region [CA19-AA-CA4] [bases 117 to 164] and flanking regions.

Primers designed to anneal to the DNA sequences flanking the microsatellite region allow the indirect measurement of the number of repeat units. The variability in repeat number is often high, and the construction of multilocus genotypes may allow analyses at both the population and individual levels.

Page 18

Population Genetic Studies

African American Lives, an unprecedented four-part PBS series, shows how DNA analysis is used to trace lineage through American history and back to Africa. Microsatellites play an important role in lineage analysis.

Page 19

Population Genetic Studies (cont.)

LINEAGE AND ADMIXTURE: THE TESTS

LEARNING FROM DNA

Migration of Populations Around the World

Page 20

Diagnosis of Genetic Disorders

Rethinking genotype and phenotype correlations in polyglutamine expansion disorders. Susan E. Andrew, Y. Paul Goldberg, and Michael R. Hayden.

Page 21

Counting Non-contiguous SSRs

Definition 1: A DNA sequence X of length n is denoted by the random sequence

$X = x_1 x_2 \cdots x_n$,

where each $x_i$ is defined over the 4-letter nucleotide alphabet {A, C, G, T}.

Note: randomness here refers to the non-uniform Bernoulli model (meaning all bases of X have independent and possibly unequal probabilities).

For instance, X = TACCCAGCAGGCCTATATATA is a DNA sequence of length n = 21. TA, CAG, and CAGCAG are SSRs of lengths 2, 3, and 6, respectively.

Page 22

Counting Non-contiguous SSRs (cont.)

Definition 2: A k-mer Y is a subsequence $Y = y_1 y_2 \cdots y_k$ of a DNA sequence X of length n, where $1 \le k \le 6$.

For instance, for X = TACCCAGCAGGCCTATATATA, TA is a 2-mer (dimer) and CAG is a 3-mer (trimer).

Page 23

Counting Non-contiguous SSRs (cont.)

Definition 3: A kt-linked SSR of a k-mer Y is a subsequence of X of length kt that consists of t tandem copies of Y.

For instance, for X = TACCCAGCAGGCCTATATATA, CAGCAG is a 6-linked SSR of the trimer Y = CAG, and TATATATA is an 8-linked SSR of the dimer Y = TA.

Page 24

Counting Non-contiguous SSRs (cont.)

We simply count the number of ways of distributing l occurrences of a kt-linked SSR of Y into (n-klt+1) possible positions in a DNA sequence X of length n by the binomial coefficient:

$$\binom{n - klt + 1}{l}$$

See the following example.

Page 25

Counting Non-contiguous SSRs (cont.)

Example 1: How many arrangements of the 3 non-contiguous occurrences of the 4-linked SSR of Y = GA are in

X = GAGA*GAGA*GAGA*,

where * denotes an arbitrary base?

Using the above binomial coefficient with n = 15, k = 2, t = 2, and l = 3. Thus,

$$\binom{n - klt + 1}{l} = \binom{15 - 12 + 1}{3} = \binom{4}{3} = 4.$$

Page 26

Counting Non-contiguous SSRs (cont.)

Example 1 (cont.): The 4 arrangements of the 3 non-contiguous occurrences of the 4-linked SSR of Y = GA are

GAGA*GAGA*GAGA*
GAGA*GAGA**GAGA
GAGA**GAGA*GAGA
*GAGA*GAGA*GAGA

where * denotes an arbitrary base.
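The count in Example 1 can be reproduced by brute force. The sketch below is an illustration under stated assumptions rather than the authors' method: "non-contiguous" is taken to mean that at least one arbitrary base separates any two SSR occurrences, the placeholder character `*` for an arbitrary base is a choice made for this sketch, and the function name is hypothetical. With n - klt free bases there are n - klt + 1 slots around them, and each slot holds at most one SSR block.

```python
from itertools import combinations
from math import comb


def noncontiguous_arrangements(n: int, k: int, t: int, l: int,
                               motif: str, filler: str = "*") -> list[str]:
    """List every way to place l separated kt-linked SSRs of motif in length n."""
    assert len(motif) == k, "motif must be a k-mer"
    block = motif * t          # one kt-linked SSR
    free = n - k * t * l       # bases not covered by any SSR block
    arrangements = []
    # Choose which of the free + 1 slots around the free bases get a block.
    for slots in combinations(range(free + 1), l):
        parts = []
        for gap in range(free + 1):
            if gap in slots:
                parts.append(block)
            if gap < free:
                parts.append(filler)
        arrangements.append("".join(parts))
    return arrangements


if __name__ == "__main__":
    found = noncontiguous_arrangements(15, 2, 2, 3, "GA")
    for s in found:
        print(s)
    # The brute-force count agrees with the binomial coefficient C(n - klt + 1, l).
    print(len(found), comb(15 - 2 * 2 * 3 + 1, 3))  # 4 4
```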

Page 27

Counting Non-contiguous SSRs (cont.)

Lemma 1: The number of non-contiguous arrangements of l occurrences of a kt-linked SSR of a k-mer Y in a DNA sequence X of length n is given by the nth coefficient of the following generating function:

$$G(z) = \frac{z^{l(kt+1)-1}}{(1-z)^{l+1}} = \sum_{n \ge klt + l - 1} \binom{n - klt + 1}{l} z^n.$$
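The lemma can be sanity-checked numerically. The generating function of the counts C(n - klt + 1, l) has the closed form z^(l(kt+1)-1) / (1-z)^(l+1), which follows from the standard identity that the sum over m of C(m, l) z^m equals z^l / (1-z)^(l+1). The Python sketch below expands that closed form as a power series and compares the coefficients with the binomial counts; the function name `gf_coefficients` is hypothetical.

```python
from itertools import accumulate
from math import comb


def gf_coefficients(k: int, t: int, l: int, max_n: int) -> list[int]:
    """Coefficients of z^0..z^max_n in z^(l(kt+1)-1) / (1-z)^(l+1)."""
    shift = l * (k * t + 1) - 1
    coeffs = [0] * (max_n + 1)
    if shift <= max_n:
        coeffs[shift] = 1  # the numerator z^shift
    # Dividing a power series by (1 - z) replaces it with its prefix sums.
    for _ in range(l + 1):
        coeffs = list(accumulate(coeffs))
    return coeffs


if __name__ == "__main__":
    k, t, l = 2, 2, 3
    coeffs = gf_coefficients(k, t, l, 20)
    for n in range(13, 19):
        m = n - k * t * l + 1
        print(n, coeffs[n], comb(m, l) if m >= 0 else 0)
```

For n = 15 both columns give 4, matching Example 1.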

Page 28

Probability of SSRs

An urn model approach is used to compute the probability of SSRs of Y at a position i in a DNA sequence X of length n.

Page 29

Probability of SSRs (cont.)

Definition 4: Let $N_Y$ denote the number of occurrences of a k-mer Y in a DNA sequence X. Then

$$N_Y = \sum_{i=1}^{j} t_i,$$

where $t_i$ is the number of tandem copies found in the ith SSR of Y.

Note: $N_Y$ is used to compute statistics on the occurrence of SSRs of Y.
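As an illustration of Definition 4, the sketch below extracts the tandem copy counts of each SSR of Y and sums them to get the total number of occurrences of Y. It rests on two assumptions not stated in the slides: SSRs are taken as maximal, non-overlapping runs found left to right, and an isolated single copy counts as an SSR with one tandem copy. The function name is hypothetical.

```python
import re


def tandem_copy_counts(x: str, y: str) -> list[int]:
    """Return the number of tandem copies in each maximal run of y within x."""
    pattern = re.compile(f"(?:{re.escape(y)})+")  # one or more back-to-back copies
    return [len(m.group(0)) // len(y) for m in pattern.finditer(x)]


if __name__ == "__main__":
    x = "TACCCAGCAGGCCTATATATA"
    for y in ("TA", "CAG"):
        t = tandem_copy_counts(x, y)
        print(y, t, sum(t))  # TA [1, 4] 5, then CAG [2] 2
```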

Page 30

Probability of SSRs (cont.)

Theorem 1: Let $U_i$ denote a random variable representing the number of tandem copies of a k-mer Y occurring at position i in a DNA sequence X. Then,

$$P(U_i = t) = \frac{N_Y^{\,t}\,(n - kN_Y + 1)}{\left(n - N_Y(k-1) + 1\right)^{t+1}}$$

is the probability that an occurrence of a kt-linked SSR of Y starts at position i in X.

Page 31

Probability of SSRs (cont.)

Corollary 1: The variance $\sigma^2$ of frequencies of SSRs of a k-mer Y in a DNA sequence of length n is given by a closed-form expression in $N_Y$, n, and k.

[Formula not legible in this transcript.]

Page 32

Probability of SSRs (cont.)

Theorem 2: The expected number of non-contiguous occurrences of a kt-linked SSR of a k-mer Y in a DNA sequence X is given by

$$E_Y = P(U_i = t)\,(n - kN_Y + 1).$$

Page 33

Index of Nonrandomness

Metric:

$$I_Y = \frac{\left(1 - O_Y / E_Y\right)^2}{O_Y / E_Y}, \quad E_Y \neq 0,$$

where $O_Y$ is the observed number of SSRs, $E_Y$ is the expected number of SSRs, and $R_Y = O_Y / E_Y$ is the representation of SSRs of Y in X.
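A minimal Python sketch of the metric follows. It assumes the index reads I_Y = (1 - R_Y)^2 / R_Y with R_Y = O_Y / E_Y, which is a reading of the slide rather than a confirmed formula; the function name and the choice to return infinity when nothing is observed are also assumptions of this sketch.

```python
import math


def index_of_nonrandomness(observed: float, expected: float) -> float:
    """Return (1 - R)^2 / R, where R = observed / expected (assumed form)."""
    if expected <= 0:
        raise ValueError("expected count must be positive")
    r = observed / expected  # representation R_Y
    if r == 0:
        return math.inf      # assumption: no observed SSRs is maximally nonrandom
    return (1 - r) ** 2 / r


if __name__ == "__main__":
    print(index_of_nonrandomness(10, 10))  # 0.0: observed matches expectation
    print(index_of_nonrandomness(20, 10))  # 0.5: twofold over-representation
    print(index_of_nonrandomness(5, 10))   # 0.5: twofold under-representation
```

Under this reading the index equals R + 1/R - 2, so over- and under-representation by the same factor score identically, and the value is zero only when observed and expected counts agree.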

Page 34

Results of Index of Nonrandomness

The index of nonrandomness provides an approach to identifying genomic loci in which SSR occurrences exhibit significant deviations from random expectations

No simulations are needed to compute deviations from random expectations

Closed-form expression for the variance of SSRs (non-uniform Bernoulli model)

Higher index implies more nonrandomness in microsatellite DNA

The trimer CCG exhibited a high degree of nonrandomness, which was unexpected

Page 35

Results of Index of Nonrandomness (cont.)

Page 36

Potential Biological Applications

Screening organismal genomes for putative disease-associated genes

Identifying loci of interest for future genomic studies

Computing the exclusion probability of SSR-based genetic markers used in paternity tests

Establishing relationships between SSR nonrandomness and the incidence of particular infectious diseases (dynamics)

Page 37

Other Applications of Microsatellite DNA and Index of Nonrandomness

Alzheimer’s Disease (2007)

Prostate Cancer (In Progress)

Sickle Cell Disease (TBD)

Malaria (TBD)

Tuberculosis, HIV, and E. coli (???)

Page 38

Related Sources

Identifying nonrandom occurrences of simple sequence repeats in genomic DNA sequences (with W. Ndifon and D. Hill), Ethnicity and Disease, Proceedings From RCMI 9th Intl. Symposium on Health Disparities 15 (2005) S5-67 – S5-70.

Some probabilistic results on the nonrandomness of simple sequence repeats in DNA sequences (with W. Ndifon and D. Hill), Bulletin of Mathematical Biology 68 (2006) 1747 – 1759.

Differential enrichment of simple sequence repeats in selected Alzheimer-associated genes (with W. Ndifon and D. Hill), Cellular and Molecular Biology (Noisy-le-grand) 1553 (2007) 23 – 31.

Page 39

Acknowledgments

National Science Foundation, DIMACS, SACEMA, University of Stellenbosch, AIMS

Office of Faculty Development, SCMNS & Departments of Mathematics and Chemistry at Morgan State University

Collaborators: Wilfred Ndifon, Princeton University and Dwayne Hill, Morgan State University