Genomic Sequence Analysis using Electron-Ion Interaction Potential

Masumi Kobayashi Performance Evaluation Laboratory

University of Aizu

Purpose

To find gene regions by using the Lindley equation and the Electron-Ion Interaction Potential (EIIP).

To judge the similarity of two DNA sequences with a shorter processing time, by using the Lindley equation and the Electron-Ion Interaction Potential (EIIP).

A DNA sequence consists of four nucleotide letters: A (adenine), T (thymine), G (guanine), and C (cytosine).

Base A is always paired with base T, and base C is always paired with base G; DNA forms a double helix.

DNA Sequence and Amino Acid Sequence

A DNA sequence consists of a row of four kinds of nucleotides, and each nucleotide triplet is called a codon. Each codon corresponds to an amino acid.

DNA Sequence        | ・・・ |ATG|CGA|TAT|AAA|GCT|TTC| ・・・ |

Amino Acid Sequence | ・・・ | M | R | Y | K | A | F | ・・・ |

Codon

61 codons are translated into amino acids. For example, both TTT and TTC code for phenylalanine (F). The remaining 3 codons, TAA, TAG, and TGA, are called stop codons.

Codon AminoAcid   Codon AminoAcid   Codon AminoAcid   Codon AminoAcid
TTT   F           TCT   S           TAT   Y           TGT   C
TTC   F           TCC   S           TAC   Y           TGC   C
TTA   L           TCA   S           TAA   STOP        TGA   STOP
TTG   L           TCG   S           TAG   STOP        TGG   W
CTT   L           CCT   P           CAT   H           CGT   R
CTC   L           CCC   P           CAC   H           CGC   R
CTA   L           CCA   P           CAA   Q           CGA   R
CTG   L           CCG   P           CAG   Q           CGG   R
ATT   I           ACT   T           AAT   N           AGT   S
ATC   I           ACC   T           AAC   N           AGC   S
ATA   I           ACA   T           AAA   K           AGA   R
ATG   M           ACG   T           AAG   K           AGG   R
GTT   V           GCT   A           GAT   D           GGT   G
GTC   V           GCC   A           GAC   D           GGC   G
GTA   V           GCA   A           GAA   E           GGA   G
GTG   V           GCG   A           GAG   E           GGG   G
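The codon-to-amino-acid mapping can be sketched in a few lines of Python. This is only an illustration, not code from the presentation: the names `CODON_TABLE` and `translate` are hypothetical, `*` is used here to mark a stop codon, and the reading frame is assumed to start at the first letter.

```python
from itertools import product

# Standard genetic code listed in TCAG order; '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a DNA string codon by codon, starting at index 0."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


if __name__ == "__main__":
    print(translate("ATGCGATATAAAGCTTTC"))  # codons ATG CGA TAT AAA GCT TTC
```

Of the 64 entries in the table, 61 map to an amino acid and 3 map to the stop marker.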

The waiting time of a customer in queueing theory and a DNA sequence

In order to use the Lindley equation, we need to describe the relation between the waiting time of a customer in queueing theory and a DNA sequence.

A score is given for the similarity of the amino acids of the two target gene sequences, and the sum of the scores is made to correspond to the waiting time in queueing theory.

Lindley Equation

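$S_n$: The score of the n-th letter.

$W_n$: The sum of the scores up to the n-th letter.

Amino Acid Sequence: F L I ……… M V S T, with scores $S_1, S_2, S_3, \ldots, S_{n-1}, S_n$

$W_n = \max\{W_{n-1} + S_n,\ 0\}$

With $W_0 = 0$, this is equivalent to

$W_n = \max\{0,\ \max_{1 \le i \le n} \sum_{k=i}^{n} S_k\}$

When $W_{n-1} + S_n$ is negative, $W_n$ is reset to 0.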
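The recursion W_n = max{W_{n-1} + S_n, 0} can be illustrated with a short Python sketch. It assumes W_0 = 0, and the function names `lindley` and `lindley_closed_form` are hypothetical. The second function evaluates the maximum over partial sums directly and is included only to check the recursion on small inputs.

```python
def lindley(scores):
    """Return [W_1, ..., W_n] with W_0 = 0 and W_n = max(W_{n-1} + S_n, 0)."""
    w = 0.0
    out = []
    for s in scores:
        w = max(w + s, 0.0)  # a negative running sum is reset to zero
        out.append(w)
    return out


def lindley_closed_form(scores):
    """Same values via the maximum over sums S_i + ... + S_n (quadratic time)."""
    out = []
    for n in range(1, len(scores) + 1):
        best = max(sum(scores[i:n]) for i in range(n))
        out.append(max(best, 0.0))
    return out


if __name__ == "__main__":
    example = [1, -2, 3, 1, -5, 2]
    print(lindley(example))
    print(lindley_closed_form(example))
```

The recursion needs a single pass over the sequence, which is what makes it attractive for long genomic sequences.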

Electron-Ion Interaction Potential (EIIP)

Prof. Toyoizumi and Tuchiya showed a technique to find gene coding regions by using the Lindley equation. However, the scores required for the Lindley equation were determined artificially.

In this research, we determine the scores theoretically by using the Electron-Ion Interaction Potential. Each amino acid is represented by its EIIP value, which describes the average energy state of all valence electrons in that amino acid.

AminoAcid  EIIP
Leu(L)     0
Ile(I)     0
Asn(N)     0.0036
Gly(G)     0.005
Val(V)     0.0057
Glu(E)     0.0058
Pro(P)     0.0198
His(H)     0.0242
Lys(K)     0.0371
Ala(A)     0.0373
Tyr(Y)     0.0516
Trp(W)     0.0548
Gln(Q)     0.0761
Met(M)     0.0823
Ser(S)     0.0829
Cys(C)     0.0829
Thr(T)     0.0941
Phe(F)     0.0946
Arg(R)     0.0959
Asp(D)     0.1263

Gene Finding Experiment

The target sequence of this experiment is the genome data of Escherichia coli O157:H7 Sakai.

Escherichia coli O157:H7 Sakai is a major food-borne pathogen that causes diarrhea, colitis, and hemolytic uremic syndrome.

We calculate $W_n$ using the Lindley equation and EIIP.

Example of Amino Acid Scores and the Stop Codon Score (1)

Score = EIIP - 0.0885

Stop Codon Score = -2 × 0.0885 = -0.177

AminoAcid  Score
Leu(L)     -0.0885
Ile(I)     -0.0885
Asn(N)     -0.0849
Gly(G)     -0.0835
Val(V)     -0.0828
Glu(E)     -0.0827
Pro(P)     -0.0687
His(H)     -0.0643
Lys(K)     -0.0514
Ala(A)     -0.0512
Tyr(Y)     -0.0369
Trp(W)     -0.0337
Gln(Q)     -0.0124
Met(M)     -0.0062
Ser(S)     -0.0056
Cys(C)     -0.0056
Thr(T)      0.0056
Phe(F)      0.0061
Arg(R)      0.0074
Asp(D)      0.0378
StopCodon  -0.177

Negative Score: Leu(L) through Cys(C) and the stop codon.

Positive Score: Thr(T) through Asp(D).
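The rule "Score = EIIP - offset, stop codon score = -2 × offset" can be reproduced with a small Python sketch. The names `EIIP`, `make_scores`, and `stop_factor` are hypothetical, and `*` stands for a stop codon. Tyr is taken as 0.0516, the value implied by the score tables (0.0516 - 0.0885 = -0.0369).

```python
# EIIP values per one-letter amino acid code (Tyr = 0.0516 as implied by the scores).
EIIP = {
    "L": 0.0, "I": 0.0, "N": 0.0036, "G": 0.0050, "V": 0.0057,
    "E": 0.0058, "P": 0.0198, "H": 0.0242, "K": 0.0371, "A": 0.0373,
    "Y": 0.0516, "W": 0.0548, "Q": 0.0761, "M": 0.0823, "S": 0.0829,
    "C": 0.0829, "T": 0.0941, "F": 0.0946, "R": 0.0959, "D": 0.1263,
}


def make_scores(offset, stop_factor=2.0):
    """Score table: EIIP - offset per amino acid, -stop_factor * offset for '*'."""
    scores = {aa: round(value - offset, 4) for aa, value in EIIP.items()}
    scores["*"] = round(-stop_factor * offset, 4)
    return scores


if __name__ == "__main__":
    table = make_scores(0.0885)
    for aa, score in table.items():
        print(aa, score)
```

A larger offset makes more amino acids score negatively, so W_n returns to zero more often; a smaller offset does the opposite.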

Example of Amino Acid Scores and the Stop Codon Score (2-1)

Score = EIIP - 0.0445

Stop Codon Score = -2 × 0.0445 = -0.089

AminoAcid  Score
Leu(L)     -0.0445
Ile(I)     -0.0445
Asn(N)     -0.0409
Gly(G)     -0.0395
Val(V)     -0.0388
Glu(E)     -0.0387
Pro(P)     -0.0247
His(H)     -0.0203
Lys(K)     -0.0074
Ala(A)     -0.0072
Tyr(Y)      0.0071
Trp(W)      0.0103
Gln(Q)      0.0316
Met(M)      0.0378
Ser(S)      0.0384
Cys(C)      0.0384
Thr(T)      0.0496
Phe(F)      0.0501
Arg(R)      0.0514
Asp(D)      0.0818
StopCodon  -0.089

Negative Score: Leu(L) through Ala(A) and the stop codon.

Positive Score: Tyr(Y) through Asp(D).

Example of Amino Acid Scores and the Stop Codon Score (2-2)

The amino acid scores are the same as in example (2-1): Score = EIIP - 0.0445.

Only the stop codon score is changed: -0.089 → -0.178 (-4 × 0.0445).
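The effect of the stop codon score on W_n can be seen in a small Python sketch that walks along one reading frame. This is an illustration under stated assumptions, not the code used in the experiment: the names `CODON_TABLE`, `EIIP`, `w_along_frame`, and `stop_factor` are hypothetical, the frame starts at index 0, and the test DNA string is made up.

```python
from itertools import product

# Standard genetic code in TCAG order; '*' marks a stop codon.
CODON_TABLE = dict(zip(
    ("".join(c) for c in product("TCAG", repeat=3)),
    "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
))

# EIIP values (Tyr = 0.0516 as implied by the score tables).
EIIP = {
    "L": 0.0, "I": 0.0, "N": 0.0036, "G": 0.0050, "V": 0.0057,
    "E": 0.0058, "P": 0.0198, "H": 0.0242, "K": 0.0371, "A": 0.0373,
    "Y": 0.0516, "W": 0.0548, "Q": 0.0761, "M": 0.0823, "S": 0.0829,
    "C": 0.0829, "T": 0.0941, "F": 0.0946, "R": 0.0959, "D": 0.1263,
}


def w_along_frame(dna, offset=0.0445, stop_factor=2.0):
    """W_n after each codon of one reading frame (W_0 = 0)."""
    w, out = 0.0, []
    for i in range(0, len(dna) - len(dna) % 3, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        score = -stop_factor * offset if aa == "*" else EIIP[aa] - offset
        w = max(w + score, 0.0)
        out.append(w)
    return out


if __name__ == "__main__":
    dna = "GATCGTACTTAAGAT"  # made-up example: D R T STOP D
    print(w_along_frame(dna, stop_factor=2.0))
    print(w_along_frame(dna, stop_factor=4.0))
```

For this made-up string, W_n reaches about 0.1828 before the stop codon. A stop score of -0.089 leaves about 0.0938, whereas -0.178 leaves about 0.0048, so the stronger penalty nearly resets W_n at the stop codon.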

Threshold of Amino Acid Sequence

$W_n$ may become high by chance in a region that is meaningless as an amino acid sequence.

A threshold is used in order to distinguish gene regions from such meaningless regions.

The score sequence of an amino acid sequence is assumed to be independent and identically distributed.

Then $W_n$ can be considered to be the waiting time of a GI/GI/1 queueing system.

Threshold and the Probability of Exceeding the Threshold by Chance

The waiting time of a GI/GI/1 queueing system satisfies the following inequality:

$P[W_n \ge x] \le e^{-\theta x}$, where $\theta = \sup\{s > 0 : E[e^{s S_n}] \le 1\}$

For any $0 < p < 1$, if $w_0 = -\log p / \theta$, then

$P[W_n \ge w_0] \le p$

The probability that $W_n$ exceeds $w_0$ (the threshold) by chance is set to $p = 0.05$.

$p$ is the probability that a meaningless sequence is judged to be a meaningful sequence.
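The threshold can be computed numerically once a distribution for the score S_n is fixed. The Python sketch below finds the largest s > 0 with E[exp(s S_n)] <= 1 by bisection and then takes w_0 = -log(p) / s. The function names `decay_rate` and `threshold` are hypothetical. The score distribution passed in is an assumption of the caller, since the slides do not state which distribution was used. A toy distribution with a known answer is used here instead of the amino acid scores.

```python
import math


def decay_rate(dist, tol=1e-12):
    """Largest s > 0 with E[exp(s*S)] <= 1, found by bisection.

    dist is a list of (score, probability) pairs. A positive root exists
    when the mean score is negative and some score is positive.
    """
    def mgf(s):
        return sum(p * math.exp(s * x) for x, p in dist)

    mean = sum(p * x for x, p in dist)
    if mean >= 0 or all(x <= 0 for x, _ in dist):
        raise ValueError("need a negative mean and at least one positive score")

    lo, hi = 0.0, 1.0
    while mgf(hi) <= 1.0:  # grow the bracket until the MGF exceeds 1
        lo, hi = hi, 2.0 * hi
    while hi - lo > tol * hi:
        mid = 0.5 * (lo + hi)
        if mgf(mid) <= 1.0:
            lo = mid
        else:
            hi = mid
    return lo


def threshold(dist, p=0.05):
    """w_0 = -log(p) / theta, so that P[W_n >= w_0] <= p."""
    return -math.log(p) / decay_rate(dist)


if __name__ == "__main__":
    toy = [(1.0, 1.0 / 3.0), (-1.0, 2.0 / 3.0)]  # toy scores, not EIIP scores
    print(decay_rate(toy))  # analytically ln 2 for this toy distribution
    print(threshold(toy))   # analytically log2(20) for p = 0.05
```

To apply this to amino acid scores, `dist` would list each score with its assumed probability under the meaningless-sequence model.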

Distinction of gene coding regions and junk regions by Threshold

Similarity Comparison Experiment

The target sequences of this experiment are the genome data of human α- and β-hemoglobins.

Hemoglobin is contained in erythrocytes and consists of a “heme” containing iron and a “globin”, which is a protein; it has the important role of carrying oxygen inside the body.

We calculate $W_n$ using the Lindley equation and EIIP.

Sequences of Human α- and β-Hemoglobins

The genome data that we use are the gene coding regions of human α- and β-hemoglobins.

A gene coding region of Human β-Hemoglobin

VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH

A gene coding region of Human α-Hemoglobin

VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR

Amino Acid and the Stop Codon Scores

Score = EIIP - 0.0532

Stop Codon Score = -2 × 0.0532 = -0.1064

AminoAcid  Score      AminoAcid  Score
Leu(L)     -0.0532    Tyr(Y)     -0.0016
Ile(I)     -0.0532    Trp(W)      0.0016
Asn(N)     -0.0496    Gln(Q)      0.0229
Gly(G)     -0.0482    Met(M)      0.0291
Val(V)     -0.0475    Ser(S)      0.0297
Glu(E)     -0.0474    Cys(C)      0.0297
Pro(P)     -0.0334    Thr(T)      0.0409
His(H)     -0.029     Phe(F)      0.0414
Lys(K)     -0.0161    Arg(R)      0.0427
Ala(A)     -0.0159    Asp(D)      0.0731

Calculation Results of $W_n$ in α-Hemoglobin and β-Hemoglobin

[Figure: two panels plotting $W_n$, one for α-Hemoglobin and one for β-Hemoglobin]

The difference (absolute value) of the calculation results of $W_n$ in α-Hemoglobin and β-Hemoglobin

[Figure: two plots of the absolute difference in $W_n$; horizontal axis ticks run from 1 to 105]

The difference between α- and β-Hemoglobins: the average is 0.03874.

The difference between α-Hemoglobin and random-sequence β-Hemoglobin: the average is 0.061567.
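The comparison can be sketched in Python as follows. This is a minimal illustration, not the procedure behind the reported averages. It assumes that the two W_n profiles are compared position by position over the length of the shorter sequence, which the slides do not state, so its output is not expected to reproduce 0.03874. The names `w_profile` and `mean_abs_difference` are hypothetical, Tyr is taken as 0.0516, and the sequence beginning VHLTPEEK is the β chain.

```python
# EIIP values (Tyr = 0.0516 as implied by the score tables).
EIIP = {
    "L": 0.0, "I": 0.0, "N": 0.0036, "G": 0.0050, "V": 0.0057,
    "E": 0.0058, "P": 0.0198, "H": 0.0242, "K": 0.0371, "A": 0.0373,
    "Y": 0.0516, "W": 0.0548, "Q": 0.0761, "M": 0.0823, "S": 0.0829,
    "C": 0.0829, "T": 0.0941, "F": 0.0946, "R": 0.0959, "D": 0.1263,
}
OFFSET = 0.0532  # Score = EIIP - 0.0532

BETA = (
    "VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLST"
    "PDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDP"
    "ENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH"
)
ALPHA = (
    "VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSH"
    "GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKL"
    "LSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR"
)


def w_profile(protein):
    """W_n along an amino acid sequence, with W_0 = 0."""
    w, out = 0.0, []
    for aa in protein:
        w = max(w + EIIP[aa] - OFFSET, 0.0)
        out.append(w)
    return out


def mean_abs_difference(seq_a, seq_b):
    """Average |W_n(a) - W_n(b)| over the shorter length (an assumption)."""
    wa, wb = w_profile(seq_a), w_profile(seq_b)
    n = min(len(wa), len(wb))
    return sum(abs(x - y) for x, y in zip(wa[:n], wb[:n])) / n


if __name__ == "__main__":
    print(mean_abs_difference(ALPHA, BETA))
```

Each profile takes one pass over its sequence, so the comparison is linear in sequence length. This is consistent with the stated aim of shortening the processing time.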

Conclusion

We were able to find gene regions in a DNA sequence by using the Lindley equation and EIIP.

We were able to show a similarity comparison technique that shortens the processing time by using the Lindley equation and EIIP.
