
Beyond Genome Annotation - Characterizing Chromosome Features

Terry Clark

Assistant Professor, Electrical Engineering and Computer Science

The University of Kansas

2005 ITTC Research Review

April 7, 2005

ABSTRACT

Genome sequence data and their annotations are routinely used for determining genetic variation, assessing gene products, designing primers for various experiments, designing microarrays, and other laboratory and computational applications. Well-known methods for genome analysis include sequence alignment, motif-based systems, and stochastic models, among others. Genome sequences also reflect a dynamic chemical and physical interplay among proteins and DNA in the eukaryotic nucleus, involving chromatin and various proteins. This organization of nuclear DNA is critical to the function and specialization of cells through the regulation of genes. Toward understanding genome structure, our laboratory develops and applies methods ranging from computational linguistics to molecular modeling. One such method is an unsupervised, alignment-free approach that naturally tolerates the re-organizations and insertions common to genome evolution; because it is unsupervised, it also permits de novo determination of features and feature associations. In this presentation I develop, in this unsupervised, alignment-free context, a notion that we call a lexicon: an inductively generated set of nucleotide “words” of varying length devised to represent a given sequence optimally. The resulting lexicon and parse provide points of departure for sequence analyses that use lexicon content, the sequence representation, and sequence information content. The insights gained from bioinformatics are rationalized by, and also steer, molecular modeling studies. A representative application will be presented in this talk. (Selected slides from the presentation follow.)

DNA sequencing: a basic tool for genome study

DNA sequence

GCTGAGGGAAGTGAGAGACTGAGGTGGGGNCTGGAGGAGCCTGAAAAGCAGAAGTAGGAGGAAGCAGAGCTGCTCGGAACAGATCCAGAAACAGCATGTACTCACCCATCCCCCAGAGCGGCTCTCCGTTCCCACCGACCGTGAAGCTCCCTGGCCTGCACATATGGAGGGTGGAGAAGCTGAAGCCAGTGCCTGTGGCCCCTGAGAACTACGGCATTTTCTTCTCGGGAGACTCCTACCTGGTGCTGCACAATGGCCCGGAAGAGCTCTCCCACCTGCACCTGTGGATCGGCCAGCAGTCGTCCCGGGACGAGCAGGGGGGCTGCGCCATATTGGCCGTGCACCTCAACACCCTGCTCGGAGAGCGGCCTGTGCAGCACCGAGAGTCACAGGGCAATGAGTCCGACCTCTTCATGAGCTACTTCCCCCACGGCCTCAAGTACCAGGAAGGCGGCGTGGAGTCGGCGTTTCACAAGACCTCCCCAGGAACCGCCCCAGCTGCCATCAAGAAACTCTACCAGGTGAAGGGCAAGAAGAACATTCGTGCCACTGAGCGGGTGCTGAGCTGGGACAGTTTCAACACAGGGGACTGCTTCATCCTGGATCTGGGCCAGAACATCTTTGCCTGGTGTGGTGCGAAGTCCAACATATTGGAGCGGAACAAGGCACGGGACCTGGCACTGGCCATCCGGGACAGCGAGCGGCAGGGCAAGGCCCACGTGGAGATCGTCACCGATGGGGAGGAGCCTGCCGACATGATACAGGTCTTGGGTCCCAAGCCCTCTCTGAAGGAGGGTAACCCTGAGGAAGACCTCACAGCTGACCGGACAAACGCACAGGCCGCGGCTCTGTATAAGGTCTCTGACGCCACTGGACAGATGAACCTGACCAAGCTGGCTGATTCCAGCCCCTTCGCCCTCGAGCTGCTGATACCCGATGACTGCTTTGTGTTGGACAACGGACTCTGCGGCAAGATCTACATCTGGAAGGGGCGCAAAGCTAATGAGAAGGAGAGGCAGGCGGCCCTCCAAGTGGCGGAGGACTTTATCACCCGCATGCGGTATGCCCCAAACACTCAGGTGGAGATTCTGCCCCAGGGCCGCGAGAGTGCCATCTTCAAGCAATTCTTCAAGGACTGGAAGTGAGGGTGGGCATCTCCCTGCCCCTACCTCCTACCCACTTGCTCCTCC

Human Chromosome 12

The Model: DNA as a Sequence of Features

[Figure: a DNA sequence drawn as a series of features, labeled gene, gene, binding site, transposon, and LTR.]

To detect features in a nucleotide sequence without prior knowledge, based solely on nucleotide occurrence patterns, we apply an unsupervised algorithm originally developed for modeling speech acquisition.

The text (the DNA sequence), which we also call a corpus, is presented to the algorithm as an unbroken sequence of characters over the nucleotide alphabet. The task is to find the vocabulary for that text.
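As a minimal sketch of the starting point, the corpus can be read as a plain string and given a trivial first vocabulary. The function name `initial_lexicon` is hypothetical, and starting from single characters weighted by relative frequency is an assumption made here for illustration; the slides do not say how the lexicon is initialized.

```python
from collections import Counter


def initial_lexicon(corpus: str) -> dict[str, float]:
    """Return one-character words with relative-frequency probabilities.

    Assumption: the search for a vocabulary starts from the alphabet itself.
    """
    counts = Counter(corpus.upper())
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()}


# Opening characters of the sequence shown on the DNA sequence slide.
corpus = "GCTGAGGGAAGTGAGAGACTGAGGTGGGGNCTGGAGG"
lexicon = initial_lexicon(corpus)

for word, prob in sorted(lexicon.items(), key=lambda kv: -kv[1]):
    print(f"{word} {prob:.4f}")
```

Ambiguous base calls such as N simply become one more character of the alphabet in this sketch.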

A chromosome may be thought of as a collection of different languages. This analogy intuitively follows from the inhomogeneities in nucleotide compositions arising from the various functions that DNA performs.

A central computation in this approach is the probability of a parameter (a word) in the representation of the sequence (corpus). For this we use the well-known forward-backward algorithm, which takes into account all paths through a lattice of representations, where a representation of a sequence is a concatenation of words.

[Figure: a lattice of paths over the sequence, with word w spanning positions a and b; forward and backward probabilities are marked at a and at b.]

The figure represents two positions in a sequence, a and b. The arcs into these locations are all possible paths, each using some combination of words from the current lexicon. Roughly, α(a) is the sum of the probabilities of the paths in the model from the front of the sequence to location a, whereas β(b) is the same from the end of the sequence back to location b. The parameter under consideration, word w, spans the sequence between locations a and b.

With the forward and backward probabilities, and the probability of the parameter under consideration, w, the probability of w spanning the region from a to b in sequence s is given by:

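$$p(a \xrightarrow{w} b \mid s) = \frac{\alpha(a)\, p(w)\, \beta(b)}{p(s)}$$

where α(a) is the forward probability at a, β(b) is the backward probability at b, p(w) is the probability of word w, and p(s) is the total probability of sequence s over all paths.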
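The span probability can be sketched in Python as follows. This is an illustration under stated assumptions, not the presenter's implementation: positions a and b are treated as 0-based boundaries so that w = seq[a:b], the lexicon is a plain dictionary of word probabilities, and the toy lexicon `toy` is hypothetical.

```python
def forward(seq: str, lexicon: dict[str, float]) -> list[float]:
    """alpha[i]: summed probability of every parse of seq[:i]."""
    max_len = max(map(len, lexicon))
    alpha = [0.0] * (len(seq) + 1)
    alpha[0] = 1.0
    for i in range(1, len(seq) + 1):
        for k in range(1, min(max_len, i) + 1):
            alpha[i] += alpha[i - k] * lexicon.get(seq[i - k:i], 0.0)
    return alpha


def backward(seq: str, lexicon: dict[str, float]) -> list[float]:
    """beta[i]: summed probability of every parse of seq[i:]."""
    n = len(seq)
    max_len = max(map(len, lexicon))
    beta = [0.0] * (n + 1)
    beta[n] = 1.0
    for i in range(n - 1, -1, -1):
        for k in range(1, min(max_len, n - i) + 1):
            beta[i] += lexicon.get(seq[i:i + k], 0.0) * beta[i + k]
    return beta


def span_posterior(seq: str, lexicon: dict[str, float], a: int, b: int) -> float:
    """Probability that the word seq[a:b] spans a..b, given the whole sequence."""
    alpha, beta = forward(seq, lexicon), backward(seq, lexicon)
    p_s = alpha[len(seq)]  # total probability of the sequence over all paths
    return alpha[a] * lexicon.get(seq[a:b], 0.0) * beta[b] / p_s


# Hypothetical toy lexicon; the four parses of "ACAC" sum to p(s) = 0.1764.
toy = {"A": 0.4, "C": 0.3, "AC": 0.3}
posterior = span_posterior("ACAC", toy, 0, 2)
print(posterior)  # about 0.714: "AC" opens the parse in 5/7 of the probability mass
```

For chromosome-length input the raw products underflow, so a practical version would work in log space or rescale alpha and beta as it goes.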

Applying this equation over all representations and all pairs of points a and b gives the expected count of parameter w. Such counts are the basis of the expectation step in the EM optimization algorithm; the maximization step then adjusts the probabilities in the model to maximize the expectation of the evidence under the model.
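One EM iteration over a fixed lexicon might look like the sketch below. It assumes the simplest possible maximization step, re-estimating each word probability as its expected count divided by the total expected count; the name `em_step` and the toy lexicon are hypothetical, and the slides do not specify these details.

```python
from collections import defaultdict


def em_step(seq: str, lexicon: dict[str, float]) -> tuple[dict[str, float], float]:
    """Run one E-step and M-step; return (new lexicon, likelihood before the update)."""
    n = len(seq)
    max_len = max(map(len, lexicon))

    # Forward and backward sums over all parses.
    alpha = [1.0] + [0.0] * n
    for i in range(1, n + 1):
        alpha[i] = sum(alpha[i - k] * lexicon.get(seq[i - k:i], 0.0)
                       for k in range(1, min(max_len, i) + 1))
    beta = [0.0] * n + [1.0]
    for i in range(n - 1, -1, -1):
        beta[i] = sum(lexicon.get(seq[i:i + k], 0.0) * beta[i + k]
                      for k in range(1, min(max_len, n - i) + 1))

    # E-step: expected count of each word, summed over every span a..b.
    counts: dict[str, float] = defaultdict(float)
    for a in range(n):
        for b in range(a + 1, min(a + max_len, n) + 1):
            w = seq[a:b]
            if w in lexicon:
                counts[w] += alpha[a] * lexicon[w] * beta[b] / alpha[n]

    # M-step: renormalize expected counts; words that never match are dropped.
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}, alpha[n]


lexicon = {"A": 0.4, "C": 0.3, "AC": 0.3}  # hypothetical toy lexicon
new_lexicon, likelihood = em_step("ACAC", lexicon)
print(new_lexicon, likelihood)
```

On this toy input the update shifts probability toward the two-letter word, and the likelihood of the sequence rises on the next iteration.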

Parameters are added to the lexicon by combining existing parameters, and deleted from it, based on the evidence and on the estimated cost or benefit of the change to the description length.
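The cost/benefit test can be illustrated with a deliberately simplified description length. The slides do not give the cost function, so everything here is an assumption: each lexicon entry is charged a flat 2 bits per nucleotide, the parse is charged the usual code length of -log2 of each word's relative frequency, and the two count tables are hypothetical.

```python
import math

BITS_PER_NUCLEOTIDE = 2.0  # assumption: flat cost of spelling out a lexicon entry


def description_length(counts: dict[str, int]) -> float:
    """Bits to store the lexicon entries plus bits to encode the parse with them."""
    total = sum(counts.values())
    lexicon_bits = sum(BITS_PER_NUCLEOTIDE * len(w) for w in counts)
    parse_bits = -sum(c * math.log2(c / total) for c in counts.values())
    return lexicon_bits + parse_bits


# Hypothetical counts for the same stretch of sequence, parsed two ways:
# letter by letter, or with the candidate word "ACT" absorbing 20 occurrences.
without_word = {"A": 40, "C": 40, "T": 20}
with_word = {"A": 20, "C": 20, "ACT": 20}

benefit = description_length(without_word) - description_length(with_word)
print(f"benefit of adding ACT: {benefit:.1f} bits")
```

A positive benefit means the longer lexicon pays for itself through a shorter parse. A fuller treatment would encode new words in terms of existing ones rather than letter by letter, which is closer to the combining of parameters described above.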

A 1363 0.312471
T 664 0.152224
C 624 0.143054
G 465 0.106602
...
CCTTA 9 0.00206327
AAACCCTAAT 9 0.00206327
GTTTT 9 0.00206327
TCCTAAACCCT 9 0.00206327
CAAACC 8 0.00183402
CCAT 8 0.00183402
AACCCTAAACC 8 0.00183402
ACTCCA 8 0.00183402
CCTTAAACCCTAAACC 8 0.00183402
CTAAACCCTAA 8 0.00183402
CTTTAAAACCTAAATCCTA 8 0.00183402
CTAG 8 0.00183402
ATCCTACTTTAGCTTC 8 0.00183402
TTCGTATGATTTTTGGTTTTC 7 0.00160477
GGATT 7 0.00160477
ACCCTAAACATTAAAACCTAAACCC 7 0.00160477
ATCTTCCAACAAGGAAAGAACACTTTA 7 0.00160477
ATCTAGTCATATTTGAC 7 0.00160477
AAAGTATATTTGGTC 7 0.00160477
CTTCTA 7 0.00160477
GTTGCGGTTCTAGTTCTTATACTCAATC 7 0.00160477

A portion of a lexicon from a chunk containing satellites
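The frequency column appears to be each word's count divided by the total number of words in the representation. That total is not stated on the slide; the short check below infers it from a few of the listed rows, so treat the resulting value as an inference rather than a reported figure.

```python
# (word, count in representation, frequency) rows copied from the lexicon excerpt.
rows = [
    ("A", 1363, 0.312471),
    ("T", 664, 0.152224),
    ("C", 624, 0.143054),
    ("G", 465, 0.106602),
    ("CCTTA", 9, 0.00206327),
    ("CTAG", 8, 0.00183402),
    ("GGATT", 7, 0.00160477),
]

# If frequency = count / total, every row should imply the same total.
totals = {round(count / freq) for _, count, freq in rows}
print(totals)

total = totals.pop() if len(totals) == 1 else None
for word, count, freq in rows:
    if total:
        print(f"{word:6s} {count / total:.6g} (slide: {freq})")
```

All seven rows imply the same total, which supports reading the column as a relative frequency within the parse.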

% wc -l chr4range007[789]_Lexicon_Frequency.txt
201 chr4range0077_Lexicon_Frequency.txt
117 chr4range0078_Lexicon_Frequency.txt
215 chr4range0079_Lexicon_Frequency.txt

Number of words contained in lexicons around this region

Lexicon columns: word, count in representation, frequency

>1KX5:A HISTONE H3
ARTKQTARKSTGGKAPRKQLATKAARKSAPATGGVKKPHRYRPGTVALREIRRYQKSTEL
LIRKLPFQRLVREIAQDFKTDLRFQSSAVMALQEASEAYLVALFEDTNLCAIHAKRVTIM
PKDIQLARRIRGERA

>1KX5:B HISTONE H4
SGRGKGGKGLGKGGAKRHRKVLRDNIQGITKPAIRRLARRGGVKRISGLIYEETRGVLKV
FLENVIRDAVTYTEHAKRKTVTAMDVVYALKRQGRTLYGFGG

. . .

>1KX5:H HISTONE H2B.2
PEPAKSAPAPKKGSKKAVTKTQKKDGKKRRKTRKESYAIYVYKVLKQVHPDTGISSKAMS
IMNSFVNDVFERIAGEASRLAHYNKRSTITSREIQTAVRLLLPGELAKHAVSEGTKAVTK
YTSAK

>1KX5:I DNA
ATCAATATCCACCTGCAGATACTACCAAAAGTGTATTTGGAAACTGCTCCATCAAAAGGC
ATGTTCAGCTGGAATCCAGCTGAACATGCCTTTTGATGGAGCAGTTTCCAAATACACTTT
TGGTAGTATCTGCAGGTGGATATTGAT

>1KX5:J DNA
ATCAATATCCACCTGCAGATACTACCAAAAGTGTATTTGGAAACTGCTCCATCAAAAGGC
ATGTTCAGCTGGATTCCAGCTGAACATGCCTTTTGATGGAGCAGTTTCCAAATACACTTT
TGGTAGTATCTGCAGGTGGATATTGAT

Protein and DNA Sequences: 8 Histones and 2 DNA Strands

ITTC High Performance Computing Infrastructure

• 128-processor cluster (64 nodes)
  – 3.2 GHz processors (Xeon based)
  – 4 GB RAM / node
  – 146 GB SCSI disk / node
• 8 dual-processor server nodes
• 25-terabyte file server
• Tape robot system (LTO3 Ultrium)
• High-performance network

Compute nodes and server cluster components. System housed in newly expanded and remodeled machine room 218, Nichols Hall.
