Genomic Sequence Analysis using Electron-Ion Interaction Potential

Post on 19-Jan-2016


Masumi Kobayashi Performance Evaluation Laboratory

University of Aizu

Purpose

To find gene regions by using the Lindley equation and the Electron-Ion Interaction Potential (EIIP).

To judge the similarity of two DNA sequences in a shorter processing time by using the Lindley equation and the Electron-Ion Interaction Potential (EIIP).

DNA

A DNA sequence consists of four nucleotide letters: A (adenine), T (thymine), G (guanine), and C (cytosine).

Base A is always paired with base T, and base C is always paired with base G. DNA forms a double helix.

DNA Sequence and Amino Acid Sequence

A DNA sequence consists of a row of four nucleotides, and each nucleotide triplet is called a codon. Each codon corresponds to an amino acid.

DNA Sequence | ・・・ |ATG|CGA|TAT|AAA|GCT|TTC| ・・・ |

Amino Acid Sequence | ・・・ | M | R | Y | K | A | F | ・・・ |

Codon

61 codons are translated into amino acids. For example, both TTT and TTC code for phenylalanine (F). The remaining 3 codons, TAA, TAG, and TGA, are called stop codons.

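| Codon | Amino Acid | Codon | Amino Acid | Codon | Amino Acid | Codon | Amino Acid |
| TTT | F | TCT | S | TAT | Y | TGT | C |
| TTC | F | TCC | S | TAC | Y | TGC | C |
| TTA | L | TCA | S | TAA | STOP | TGA | STOP |
| TTG | L | TCG | S | TAG | STOP | TGG | W |
| CTT | L | CCT | P | CAT | H | CGT | R |
| CTC | L | CCC | P | CAC | H | CGC | R |
| CTA | L | CCA | P | CAA | Q | CGA | R |
| CTG | L | CCG | P | CAG | Q | CGG | R |
| ATT | I | ACT | T | AAT | N | AGT | S |
| ATC | I | ACC | T | AAC | N | AGC | S |
| ATA | I | ACA | T | AAA | K | AGA | R |
| ATG | M | ACG | T | AAG | K | AGG | R |
| GTT | V | GCT | A | GAT | D | GGT | G |
| GTC | V | GCC | A | GAC | D | GGC | G |
| GTA | V | GCA | A | GAA | E | GGA | G |
| GTG | V | GCG | A | GAG | E | GGG | G |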
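The codon table can be applied mechanically. The following is a minimal Python sketch, not part of the original slides: the names `CODON_TABLE` and `translate` are illustrative, and the 64-letter string encodes the standard genetic code with codons enumerated in T, C, A, G order (`*` marks a stop codon).

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in T, C, A, G order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each of the 64 codons (e.g. "ATG") to its one-letter amino acid code.
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna):
    """Translate a DNA string codon by codon, ignoring a trailing partial codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))

print(translate("ATGCGATATAAAGCTTTC"))  # MRYKAF
```

Counting the entries of `CODON_TABLE` that are not `*` gives the 61 coding codons mentioned above; the remaining three are TAA, TAG, and TGA.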

Customer Waiting Time in Queuing Theory and a DNA Sequence

To use the Lindley equation, we need to describe the relation between a customer's waiting time in queuing theory and a DNA sequence.

A score is assigned for the similarity of the amino acids of two target gene sequences, and the sum of the scores is made to correspond to the waiting time in queuing theory.

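Lindley Equation

$S_n$: the score of the n-th letter.

$W_n$: the sum of the scores up to the n-th letter.

$$W_n = \max\{W_{n-1} + S_n,\ 0\}$$

$$W_n = \max_{1 \le i \le n+1} \sum_{k=i}^{n} S_k$$

Here $W_0 = 0$, and the empty sum (the case $i = n+1$) is taken as 0.

[Figure: an amino acid sequence F L I ... M V S T with scores $S_1, S_2, S_3, \ldots$ and sums $W_1, W_2, W_3, \ldots$. When $W_{n-1} + S_n$ is a negative value, $W_n = 0$.]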
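As a small illustration of the Lindley equation, here is a Python sketch (the function names and the score values are made up for this example). It computes $W_n$ with the recursion $W_n = \max\{W_{n-1} + S_n, 0\}$, starting from $W_0 = 0$, and checks the result against the equivalent form that takes the largest sum over a trailing run of scores.

```python
def lindley(scores):
    """Return [W_1, ..., W_n] with W_0 = 0 and W_n = max(W_{n-1} + S_n, 0)."""
    w, out = 0, []
    for s in scores:
        w = max(w + s, 0)
        out.append(w)
    return out

def lindley_closed_form(scores):
    """Same values, computed as the largest sum of a trailing run (empty run = 0)."""
    out = []
    for n in range(1, len(scores) + 1):
        best = 0  # empty sum, i.e. the case i = n + 1
        for i in range(n):
            best = max(best, sum(scores[i:n]))
        out.append(best)
    return out

scores = [2, -3, 1, 1, -1, 4, -10, 3]   # made-up scores
print(lindley(scores))                   # [2, 0, 1, 2, 1, 5, 0, 3]
print(lindley(scores) == lindley_closed_form(scores))  # True
```

The recursion needs one pass over the sequence, while the closed form is only useful as a check; both reset to zero whenever the running sum would become negative.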

Electron-Ion Interaction Potential (EIIP)

Prof. Toyoizumi and Tuchiya showed a technique for finding gene coding regions by using the Lindley equation. However, there is a problem: the scores required by the Lindley equation are determined artificially.

In this research, we determine the scores theoretically by using the Electron-Ion Interaction Potential. Each amino acid is represented by its EIIP value, which describes the average energy state of all valence electrons in that amino acid.

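| Amino Acid | EIIP |
| Leu (L) | 0 |
| Ile (I) | 0 |
| Asn (N) | 0.0036 |
| Gly (G) | 0.005 |
| Val (V) | 0.0057 |
| Glu (E) | 0.0058 |
| Pro (P) | 0.0198 |
| His (H) | 0.0242 |
| Lys (K) | 0.0371 |
| Ala (A) | 0.0373 |
| Tyr (Y) | 0.0516 |
| Trp (W) | 0.0548 |
| Gln (Q) | 0.0761 |
| Met (M) | 0.0823 |
| Ser (S) | 0.0829 |
| Cys (C) | 0.0829 |
| Thr (T) | 0.0941 |
| Phe (F) | 0.0946 |
| Arg (R) | 0.0959 |
| Asp (D) | 0.1263 |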

Gene Finding Experiment

The target sequence of this experiment is the genome data of Escherichia coli O157:H7 Sakai.

Escherichia coli O157:H7 Sakai is a major food-borne pathogen that causes diarrhea, colitis, and hemolytic uremic syndrome.

We calculate $W_n$ using the Lindley equation and EIIP.

Example of Amino Acid Scores and the Stop Codon Score (1)

Score = EIIP - 0.0885

Stop Codon Score = -2 × 0.0885 = -0.177

| Amino Acid | Score |
| Leu (L) | -0.0885 |
| Ile (I) | -0.0885 |
| Asn (N) | -0.0849 |
| Gly (G) | -0.0835 |
| Val (V) | -0.0828 |
| Glu (E) | -0.0827 |
| Pro (P) | -0.0687 |
| His (H) | -0.0643 |
| Lys (K) | -0.0514 |
| Ala (A) | -0.0512 |
| Tyr (Y) | -0.0369 |
| Trp (W) | -0.0337 |
| Gln (Q) | -0.0124 |
| Met (M) | -0.0062 |
| Ser (S) | -0.0056 |
| Cys (C) | -0.0056 |
| Thr (T) | 0.0056 |
| Phe (F) | 0.0061 |
| Arg (R) | 0.0074 |
| Asp (D) | 0.0378 |
| Stop codon | -0.177 |

Negative scores: Leu through Cys. Positive scores: Thr through Asp.
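The score tables in these examples follow one rule: subtract a constant from each EIIP value, and give the stop codon a negative multiple of that constant. A hedged Python sketch of that rule follows. The names `EIIP`, `score_table`, `offset`, and `stop_factor` are illustrative, and the Tyr value is taken as 0.0516, which is the value implied by the Tyr entries of the score tables.

```python
# EIIP values per amino acid (one-letter codes); Tyr = 0.0516 is an assumption
# inferred from the score tables (e.g. -0.0369 = 0.0516 - 0.0885).
EIIP = {
    "L": 0.0000, "I": 0.0000, "N": 0.0036, "G": 0.0050, "V": 0.0057,
    "E": 0.0058, "P": 0.0198, "H": 0.0242, "K": 0.0371, "A": 0.0373,
    "Y": 0.0516, "W": 0.0548, "Q": 0.0761, "M": 0.0823, "S": 0.0829,
    "C": 0.0829, "T": 0.0941, "F": 0.0946, "R": 0.0959, "D": 0.1263,
}

def score_table(offset, stop_factor=2):
    """Score = EIIP - offset for amino acids; stop codon ('*') = -stop_factor * offset."""
    table = {aa: round(value - offset, 4) for aa, value in EIIP.items()}
    table["*"] = round(-stop_factor * offset, 4)
    return table

scores = score_table(0.0885)
print(sorted(aa for aa, s in scores.items() if s > 0))  # ['D', 'F', 'R', 'T']
print(scores["*"])                                      # -0.177
```

With `offset=0.0885` only Thr, Phe, Arg, and Asp score positively, so $W_n$ grows only in regions rich in those four amino acids. A smaller offset such as 0.0445 makes more amino acids positive.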

Example of Amino Acid Scores and the Stop Codon Score (2-1)

Score = EIIP - 0.0445

Stop Codon Score = -2 × 0.0445 = -0.089

| Amino Acid | Score |
| Leu (L) | -0.0445 |
| Ile (I) | -0.0445 |
| Asn (N) | -0.0409 |
| Gly (G) | -0.0395 |
| Val (V) | -0.0388 |
| Glu (E) | -0.0387 |
| Pro (P) | -0.0247 |
| His (H) | -0.0203 |
| Lys (K) | -0.0074 |
| Ala (A) | -0.0072 |
| Tyr (Y) | 0.0071 |
| Trp (W) | 0.0103 |
| Gln (Q) | 0.0316 |
| Met (M) | 0.0378 |
| Ser (S) | 0.0384 |
| Cys (C) | 0.0384 |
| Thr (T) | 0.0496 |
| Phe (F) | 0.0501 |
| Arg (R) | 0.0514 |
| Asp (D) | 0.0818 |
| Stop codon | -0.089 |

Negative scores: Leu through Ala. Positive scores: Tyr through Asp.

Example of Amino Acid Scores and the Stop Codon Score (2-2)

The amino acid scores are the same as in example (2-1), from Leu (L) at -0.0445 to Asp (D) at 0.0818. Only the stop codon score changes.

Stop Codon Score: -0.089 → -0.178 (-4 × 0.0445)

Threshold of Amino Acid Sequence

$W_n$ may become high by chance in a region that is meaningless as an amino acid sequence.

The threshold is used to distinguish meaningful regions from such meaningless regions.

The score sequence $S_n$ of an amino acid sequence is assumed to be independent and identically distributed.

$W_n$ can then be regarded as the waiting time of a GI/GI/1 queuing system.

Threshold and the Probability of Exceeding the Threshold by Chance

The probability $p$ that $W_n$ exceeds $w_0$ (the threshold) by chance is 0.05.

The waiting time $W_n$ of a GI/GI/1 queuing system satisfies the following inequality:

$$P[W_n \ge x] \le e^{-\theta x}, \qquad \theta = \sup\{s > 0 : E[e^{s S_n}] \le 1\}$$

If $w_0 = -\log p / \theta$ for any $0 < p < 1$, then

$$P[W_n \ge w_0] \le p$$

$p$ is the probability that a meaningless sequence is judged to be a meaningful sequence.
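The threshold can be computed numerically once a distribution for the scores $S_n$ is fixed. The sketch below is an illustration under stated assumptions, not the procedure used in the slides. It writes the exponent in the bound as `theta` (the largest $s > 0$ with $E[e^{s S_n}] \le 1$), finds it by bisection, and returns $w_0 = -\log p / \theta$. The two-point score distribution `toy` is hypothetical and chosen because its exact answer is $\theta = \log 3$.

```python
import math

def decay_rate(dist, iterations=200):
    """Largest s > 0 with E[exp(s * S)] <= 1, found by bisection.

    dist is a list of (score, probability) pairs. Such an s exists only if
    the mean score is negative and some score is positive.
    """
    mean = sum(x * p for x, p in dist)
    if mean >= 0 or all(x <= 0 for x, _ in dist):
        raise ValueError("need a negative mean score and at least one positive score")

    def mgf_minus_one(s):
        return sum(p * math.exp(s * x) for x, p in dist) - 1.0

    lo, hi = 0.0, 1.0
    while mgf_minus_one(hi) <= 0:      # grow the bracket until E[exp(s*S)] > 1
        hi *= 2
    for _ in range(iterations):        # E[exp(s*S)] <= 1 exactly on (0, theta]
        mid = (lo + hi) / 2
        if mgf_minus_one(mid) <= 0:
            lo = mid
        else:
            hi = mid
    return lo

def threshold(dist, p=0.05):
    """w_0 = -log(p) / theta, so that exp(-theta * w_0) = p."""
    return -math.log(p) / decay_rate(dist)

toy = [(1.0, 0.25), (-1.0, 0.75)]      # hypothetical score distribution
print(decay_rate(toy), threshold(toy))
```

To use this with amino acid scores, `dist` would pair each score with the assumed frequency of that amino acid (or stop codon) in a meaningless region. Those frequencies are not given in the slides.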

Distinction of gene coding regions and junk regions by Threshold
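One simple way to apply a threshold is to report the stretches where $W_n$ is at or above $w_0$. The following Python sketch shows this with made-up scores. The helper name `regions_above` is illustrative, and whether the slides report exactly these stretches or a wider region around them cannot be recovered from the transcript.

```python
def lindley(scores):
    """Return [W_1, ..., W_n] with W_0 = 0 and W_n = max(W_{n-1} + S_n, 0)."""
    w, out = 0, []
    for s in scores:
        w = max(w + s, 0)
        out.append(w)
    return out

def regions_above(w, w0):
    """Return (start, end) index pairs (0-based, inclusive) of runs where W_n >= w0."""
    regions, start = [], None
    for n, value in enumerate(w):
        if value >= w0 and start is None:
            start = n                       # a run begins
        elif value < w0 and start is not None:
            regions.append((start, n - 1))  # the run ended at the previous position
            start = None
    if start is not None:                   # a run reaches the end of the sequence
        regions.append((start, len(w) - 1))
    return regions

scores = [1, 1, 1, -5, 1, -1, 2, 2, 2]      # made-up scores
w = lindley(scores)
print(w)                     # [1, 2, 3, 0, 1, 0, 2, 4, 6]
print(regions_above(w, 3))   # [(2, 2), (7, 8)]
```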

Similarity Comparison Experiment

The target sequence of this experiment is the genome data of human α- and β-hemoglobins.

Hemoglobin is contained in erythrocytes and consists of heme, which contains iron, and globin, which is a protein. It has the important role of carrying oxygen through the body.

We calculate $W_n$ using the Lindley equation and EIIP.

Sequences of Human α- and β-Hemoglobins

The data that we use are the gene coding regions of human α- and β-hemoglobins, written as amino acid sequences.

A gene coding region of human β-hemoglobin

VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH

A gene coding region of human α-hemoglobin

VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR

Amino Acid and the Stop Codon Scores

Score = EIIP - 0.0532

Stop Codon Score = -2 × 0.0532 = -0.1064

| Amino Acid | Score | Amino Acid | Score |
| Leu (L) | -0.0532 | Tyr (Y) | -0.0016 |
| Ile (I) | -0.0532 | Trp (W) | 0.0016 |
| Asn (N) | -0.0496 | Gln (Q) | 0.0229 |
| Gly (G) | -0.0482 | Met (M) | 0.0291 |
| Val (V) | -0.0475 | Ser (S) | 0.0297 |
| Glu (E) | -0.0474 | Cys (C) | 0.0297 |
| Pro (P) | -0.0334 | Thr (T) | 0.0409 |
| His (H) | -0.029 | Phe (F) | 0.0414 |
| Lys (K) | -0.0161 | Arg (R) | 0.0427 |
| Ala (A) | -0.0159 | Asp (D) | 0.0731 |

Calculation Results of $W_n$ in α-Hemoglobin and β-Hemoglobin

[Figure: two panels, α-Hemoglobin and β-Hemoglobin, each showing $W_n$.]

The Difference (Absolute Value) of the Calculation Results of $W_n$ in α-Hemoglobin and β-Hemoglobin

[Figure: absolute difference of $W_n$ plotted against position n (1 to 137), vertical axis from 0 to 0.3. Two series: the difference between α- and β-hemoglobins, and the difference between α-hemoglobin and a random-sequence β-hemoglobin. The reported averages are 0.03874 and 0.061567.]
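The comparison can be sketched in Python as follows. This is an illustration under several assumptions and is not claimed to reproduce the averages reported on the slide: the two profiles are compared position by position with the longer one truncated, Tyr is taken as EIIP 0.0516 (the value implied by the score table), and the sequence beginning VLSP is labelled α while the one beginning VHLT is labelled β. The names `lindley_profile` and `mean_abs_difference` are illustrative.

```python
EIIP = {
    "L": 0.0000, "I": 0.0000, "N": 0.0036, "G": 0.0050, "V": 0.0057,
    "E": 0.0058, "P": 0.0198, "H": 0.0242, "K": 0.0371, "A": 0.0373,
    "Y": 0.0516, "W": 0.0548, "Q": 0.0761, "M": 0.0823, "S": 0.0829,
    "C": 0.0829, "T": 0.0941, "F": 0.0946, "R": 0.0959, "D": 0.1263,
}
OFFSET = 0.0532  # Score = EIIP - 0.0532, as in the score table

ALPHA = ("VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHV"
         "DDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR")
BETA = ("VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDG"
        "LAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH")

def lindley_profile(protein, offset=OFFSET):
    """W_n along a protein: W_n = max(W_{n-1} + (EIIP - offset), 0), W_0 = 0."""
    w, out = 0.0, []
    for aa in protein:
        w = max(w + EIIP[aa] - offset, 0.0)
        out.append(w)
    return out

def mean_abs_difference(seq_a, seq_b):
    """Mean of |W_n(a) - W_n(b)| over the positions both sequences share."""
    wa, wb = lindley_profile(seq_a), lindley_profile(seq_b)
    diffs = [abs(a - b) for a, b in zip(wa, wb)]  # zip truncates to the shorter profile
    return sum(diffs) / len(diffs), diffs

mean, diffs = mean_abs_difference(ALPHA, BETA)
print(len(diffs), mean)
```

Each profile is computed in a single pass over its sequence, so the cost of the comparison grows linearly with the sequence length and no alignment step is involved.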

Conclusion

We were able to find gene regions in a DNA sequence by using the Lindley equation and EIIP.

We presented a similarity comparison technique that shortens the processing time by using the Lindley equation and EIIP.
