Genomic Sequence Analysis using Electron-Ion Interaction Potential
Post on 19-Jan-2016
Masumi Kobayashi
Performance Evaluation Laboratory, University of Aizu
Purpose
To find gene regions using the Lindley equation and the Electron-Ion Interaction Potential (EIIP).
To judge the similarity of two DNA sequences with a shorter processing time, using the Lindley equation and EIIP.
DNA
A DNA sequence consists of four nucleotide letters: A (adenine), T (thymine), G (guanine), and C (cytosine).
Base A is always paired with base T, and base C is always paired with base G. DNA forms a double helix.
DNA Sequence and Amino Acid Sequence
A DNA sequence is a row of the four nucleotides. Each nucleotide triplet is called a codon, and each codon corresponds to an amino acid.
DNA Sequence        | ・・・ |ATG|CGA|TAT|AAA|GCT|TTC| ・・・ |
Amino Acid Sequence | ・・・ | M | R | Y | K | A | F | ・・・ |
Codon
61 codons are translated into amino acids. For example, both TTT and TTC code for phenylalanine (F). The 3 remaining codons, TAA, TAG, and TGA, are called stop codons.
Codon AminoAcid   Codon AminoAcid   Codon AminoAcid   Codon AminoAcid
TTT   F           TCT   S           TAT   Y           TGT   C
TTC   F           TCC   S           TAC   Y           TGC   C
TTA   L           TCA   S           TAA   STOP        TGA   STOP
TTG   L           TCG   S           TAG   STOP        TGG   W
CTT   L           CCT   P           CAT   H           CGT   R
CTC   L           CCC   P           CAC   H           CGC   R
CTA   L           CCA   P           CAA   Q           CGA   R
CTG   L           CCG   P           CAG   Q           CGG   R
ATT   I           ACT   T           AAT   N           AGT   S
ATC   I           ACC   T           AAC   N           AGC   S
ATA   I           ACA   T           AAA   K           AGA   R
ATG   M           ACG   T           AAG   K           AGG   R
GTT   V           GCT   A           GAT   D           GGT   G
GTC   V           GCC   A           GAC   D           GGC   G
GTA   V           GCA   A           GAA   E           GGA   G
GTG   V           GCG   A           GAG   E           GGG   G
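The codon-to-amino-acid mapping can be sketched in a few lines of Python. This is only an illustration of the standard genetic code, not code from the presentation; the names `CODON_TABLE` and `translate` are hypothetical, and a stop codon is written as `*`.

```python
from itertools import product

BASES = "TCAG"
# Amino acids of the standard genetic code, listed with the first, second
# and third base each running over T, C, A, G; "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a DNA string codon by codon; trailing bases are ignored."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3)
    )
```

For instance, `translate("ATGAAATTC")` reads the codons ATG, AAA, and TTC and returns `"MKF"`.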
Customer Waiting Time in Queuing Theory and a DNA Sequence
To use the Lindley equation, we need to describe the relation between the waiting time of a customer in queuing theory and a DNA sequence.
A score is assigned to each amino acid of a target gene sequence, and the running sum of the scores is made to correspond to the waiting time in queuing theory.
Lindley Equation
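$S_n$: The score of the n-th letter.
$W_n$: The sum of the scores up to the n-th letter.

$$W_n = \max\{ W_{n-1} + S_n,\ 0 \}, \qquad W_0 = 0$$

$$W_n = \max_{1 \le i \le n+1} \sum_{k=i}^{n} S_k \qquad \text{(the empty sum for } i = n+1 \text{ is } 0\text{)}$$

Amino acid sequence: F L I ... M V S T, with scores $S_1, S_2, S_3, \ldots$ and sums $W_1, W_2, W_3, \ldots$
If $W_{n-1} + S_n$ is a negative value, then $W_n = 0$.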
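A minimal sketch of the Lindley recursion, assuming the sum starts from $W_0 = 0$. The function names are hypothetical. The second function evaluates the same quantity as a maximum over partial sums (with the empty sum counted as 0) and is included only to check the recursion on small inputs.

```python
def lindley(scores):
    """Return [W_1, ..., W_n] with W_0 = 0 and W_n = max(W_{n-1} + S_n, 0)."""
    w = 0.0
    out = []
    for s in scores:
        w = max(w + s, 0.0)  # reset to 0 when the running sum turns negative
        out.append(w)
    return out


def lindley_closed_form(scores):
    """Same values as lindley(), computed as a maximum over partial sums.

    W_n is the largest sum S_i + ... + S_n over all start positions i,
    where the empty sum (value 0) is also allowed. Quadratic time.
    """
    out = []
    for n in range(1, len(scores) + 1):
        best = 0.0  # the empty sum
        for i in range(n):
            best = max(best, sum(scores[i:n]))
        out.append(best)
    return out
```

With the scores `[1, -2, 3, -1, -5, 2]`, both functions give `[1, 0, 3, 2, 0, 2]`: the sum drops to 0 at the second and fifth letters and starts accumulating again afterwards.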
Electron-Ion Interaction Potential (EIIP)
Prof. Toyoizumi and Tuchiya showed a technique for finding gene coding regions using the Lindley equation. However, the scores required by the Lindley equation were determined artificially.
In this research, we determine the scores theoretically using the Electron-Ion Interaction Potential. Each amino acid is represented by its EIIP value, which describes the average energy state of all valence electrons in that amino acid.
AminoAcid   EIIP
Leu(L)      0
Ile(I)      0
Asn(N)      0.0036
Gly(G)      0.005
Val(V)      0.0057
Glu(E)      0.0058
Pro(P)      0.0198
His(H)      0.0242
Lys(K)      0.0371
Ala(A)      0.0373
Tyr(Y)      0.0516
Trp(W)      0.0548
Gln(Q)      0.0761
Met(M)      0.0823
Ser(S)      0.0829
Cys(C)      0.0829
Thr(T)      0.0941
Phe(F)      0.0946
Arg(R)      0.0959
Asp(D)      0.1263
Gene Finding Experiment
The target sequence of this experiment is the genome data of Escherichia coli O157:H7 Sakai.
Escherichia coli O157:H7 Sakai is a major food-borne pathogen that causes diarrhea, colitis, and hemolytic uremic syndrome.
We calculate $W_n$ using the Lindley equation and EIIP.
Example of Amino Acid Scores and the Stop Codon Score (1)
Score = EIIP - 0.0885
Stop Codon Score = -2 × 0.0885 = -0.177
Negative Score: Leu through Cys. Positive Score: Thr through Asp.

AminoAcid   Score
Leu(L)      -0.0885
Ile(I)      -0.0885
Asn(N)      -0.0849
Gly(G)      -0.0835
Val(V)      -0.0828
Glu(E)      -0.0827
Pro(P)      -0.0687
His(H)      -0.0643
Lys(K)      -0.0514
Ala(A)      -0.0512
Tyr(Y)      -0.0369
Trp(W)      -0.0337
Gln(Q)      -0.0124
Met(M)      -0.0062
Ser(S)      -0.0056
Cys(C)      -0.0056
Thr(T)       0.0056
Phe(F)       0.0061
Arg(R)       0.0074
Asp(D)       0.0378
StopCodon   -0.177
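The score tables in these examples all follow one rule: subtract a fixed offset from each EIIP value, and give the stop codon a negative multiple of that offset. A minimal sketch of that rule follows. The names `score_table`, `stop_factor`, and `stop_symbol` are hypothetical, and the Tyr entry uses 0.0516, the EIIP value consistent with the Tyr scores listed in the tables.

```python
# EIIP values per amino acid (one-letter codes), as used in the score tables.
EIIP = {
    "L": 0.0000, "I": 0.0000, "N": 0.0036, "G": 0.0050, "V": 0.0057,
    "E": 0.0058, "P": 0.0198, "H": 0.0242, "K": 0.0371, "A": 0.0373,
    "Y": 0.0516, "W": 0.0548, "Q": 0.0761, "M": 0.0823, "S": 0.0829,
    "C": 0.0829, "T": 0.0941, "F": 0.0946, "R": 0.0959, "D": 0.1263,
}


def score_table(offset, stop_factor=2.0, stop_symbol="*"):
    """Return {symbol: score} with score = EIIP - offset for amino acids
    and score = -stop_factor * offset for the stop codon."""
    table = {aa: round(value - offset, 4) for aa, value in EIIP.items()}
    table[stop_symbol] = round(-stop_factor * offset, 4)
    return table
```

Calling `score_table(0.0885)` gives the scores of example (1). A larger offset makes more amino acids score negatively, so the running sum $W_n$ decays faster in regions that are not coding.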
Example of Amino Acid Scores and the Stop Codon Score (2-1)
Score = EIIP - 0.0445
Stop Codon Score = -2 × 0.0445 = -0.089
Negative Score: Leu through Ala. Positive Score: Tyr through Asp.

AminoAcid   Score
Leu(L)      -0.0445
Ile(I)      -0.0445
Asn(N)      -0.0409
Gly(G)      -0.0395
Val(V)      -0.0388
Glu(E)      -0.0387
Pro(P)      -0.0247
His(H)      -0.0203
Lys(K)      -0.0074
Ala(A)      -0.0072
Tyr(Y)       0.0071
Trp(W)       0.0103
Gln(Q)       0.0316
Met(M)       0.0378
Ser(S)       0.0384
Cys(C)       0.0384
Thr(T)       0.0496
Phe(F)       0.0501
Arg(R)       0.0514
Asp(D)       0.0818
StopCodon   -0.089
Example of Amino Acid Scores and the Stop Codon Score (2-2)
Amino acid scores: unchanged from example (2-1).
Stop Codon Score: changed from -0.089 to -0.178 (-4 × 0.0445).
Threshold of Amino Acid Sequence
$W_n$ may become high by chance in a region that is meaningless as an amino acid sequence.
The threshold is used to distinguish such meaningless regions from meaningful ones.
The score sequence $S_n$ of an amino acid sequence is assumed to be independent and identically distributed.
Under this assumption, $W_n$ can be considered to be the waiting time of a GI/GI/1 queuing system.
Threshold and the Probability of Exceeding the Threshold by Chance
The probability $p$ that $W_n$ exceeds $w_0$ (the threshold) by chance is set to 0.05.
The waiting time of a GI/GI/1 queuing system satisfies the following inequality. Let

$$\theta = \sup\{ s > 0 : E[e^{s S_n}] \le 1 \}.$$

Then, for any $x \ge 0$,

$$P[W_n \ge x] \le e^{-\theta x}.$$

Setting $w_0 = -\log p / \theta$ with $0 < p < 1$ gives

$$P[W_n \ge w_0] \le p.$$

$p$ is the probability that a meaningless sequence is judged to be a meaningful sequence.
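The threshold can be computed numerically once a score distribution is fixed. The sketch below assumes a discrete distribution given as score values with their probabilities, which the slides do not specify. It finds the positive root of $E[e^{s S_n}] = 1$ by bisection and then applies the relation between the threshold $w_0$ and $p$. The function names `theta` and `threshold` are hypothetical.

```python
import math


def theta(scores, probs, hi=1.0, tol=1e-10):
    """Positive root of E[exp(s * S)] = 1, found by bisection.

    Assumes a negative mean score and at least one positive score, so that
    the moment generating function dips below 1 and then rises above it.
    """
    def mgf(s):
        return sum(p * math.exp(s * x) for x, p in zip(scores, probs))

    mean = sum(p * x for x, p in zip(scores, probs))
    if mean >= 0 or max(scores) <= 0:
        raise ValueError("need a negative mean score and a positive score")

    while mgf(hi) < 1.0:  # expand the bracket until it contains the root
        hi *= 2.0
    lo = 0.0
    while hi - lo > tol * hi:
        mid = 0.5 * (lo + hi)
        if mgf(mid) <= 1.0:
            lo = mid
        else:
            hi = mid
    return lo


def threshold(scores, probs, p=0.05):
    """Threshold w_0 = -log(p) / theta for false-positive probability p."""
    return -math.log(p) / theta(scores, probs)
```

As a check on a toy distribution, a score of +1 with probability 0.25 and -1 with probability 0.75 has the root $\log 3$, so `theta([1, -1], [0.25, 0.75])` is close to 1.0986.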
Distinction of gene coding regions and junk regions by Threshold
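One simple way to apply the threshold is to report the runs of positions where $W_n$ is above $w_0$. This is only a sketch of that idea under the assumption that a candidate region is a maximal run above the threshold; the slides do not state the exact rule used, and `regions_above` is a hypothetical name.

```python
def regions_above(w, w0):
    """Return 1-based inclusive (start, end) runs where W_n > w0.

    `w` is the list [W_1, ..., W_n]; each run is a maximal stretch of
    consecutive positions whose value exceeds the threshold w0.
    """
    regions = []
    start = None
    for n, value in enumerate(w, start=1):
        if value > w0 and start is None:
            start = n  # a new run begins here
        elif value <= w0 and start is not None:
            regions.append((start, n - 1))  # the run ended at n - 1
            start = None
    if start is not None:
        regions.append((start, len(w)))  # run reaches the end of the list
    return regions
```

For example, `regions_above([0, 1, 3, 4, 2, 0, 5, 6], 2.5)` returns `[(3, 4), (7, 8)]`.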
Similarity Comparison Experiment
The target sequences of this experiment are the genome data of human α- and β-hemoglobins.
Hemoglobin is contained in erythrocytes and consists of a “heme”, which contains iron, and a “globin”, which is a protein. It has the important role of carrying oxygen inside the body.
We calculate $W_n$ using the Lindley equation and EIIP.
Sequences of Human α- and β-Hemoglobins
The genome data that we use are the gene coding regions of human α- and β-hemoglobins.
A gene coding region of Human β-Hemoglobin
VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH
A gene coding region of Human α-Hemoglobin
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
Amino Acid and the Stop Codon Scores
Score = EIIP - 0.0532
Stop Codon Score = -2 × 0.0532 = -0.1064

AminoAcid   Score      AminoAcid   Score
Leu(L)      -0.0532    Tyr(Y)      -0.0016
Ile(I)      -0.0532    Trp(W)       0.0016
Asn(N)      -0.0496    Gln(Q)       0.0229
Gly(G)      -0.0482    Met(M)       0.0291
Val(V)      -0.0475    Ser(S)       0.0297
Glu(E)      -0.0474    Cys(C)       0.0297
Pro(P)      -0.0334    Thr(T)       0.0409
His(H)      -0.029     Phe(F)       0.0414
Lys(K)      -0.0161    Arg(R)       0.0427
Ala(A)      -0.0159    Asp(D)       0.0731
Calculation Results of $W_n$ in α-Hemoglobin and β-Hemoglobin
[Figure: two plots of $W_n$, one for α-Hemoglobin and one for β-Hemoglobin.]
The difference (absolute value) of the calculation results of $W_n$ in α-Hemoglobin and β-Hemoglobin
[Figure: absolute difference of $W_n$ at positions 1 to 137, vertical axis from 0 to 0.3.]
The difference of α- and β-Hemoglobins: the average is 0.03874.
The difference of α-Hemoglobin and a random sequence: the average is 0.061567.
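The comparison can be sketched as follows: compute $W_n$ for each amino acid sequence with Score = EIIP - 0.0532, then average the absolute difference position by position. How the slides align sequences of different lengths is not stated, so this sketch compares only the common prefix. That choice and the function names are assumptions, and the Tyr EIIP of 0.0516 is the value consistent with the listed Tyr score.

```python
EIIP = {
    "L": 0.0000, "I": 0.0000, "N": 0.0036, "G": 0.0050, "V": 0.0057,
    "E": 0.0058, "P": 0.0198, "H": 0.0242, "K": 0.0371, "A": 0.0373,
    "Y": 0.0516, "W": 0.0548, "Q": 0.0761, "M": 0.0823, "S": 0.0829,
    "C": 0.0829, "T": 0.0941, "F": 0.0946, "R": 0.0959, "D": 0.1263,
}
OFFSET = 0.0532  # Score = EIIP - 0.0532, as in the score table above


def waiting_times(protein, offset=OFFSET):
    """Return [W_1, ..., W_n] for an amino acid string, starting from W_0 = 0."""
    w, out = 0.0, []
    for aa in protein:
        w = max(w + EIIP[aa] - offset, 0.0)  # Lindley recursion
        out.append(w)
    return out


def mean_abs_difference(seq_a, seq_b, offset=OFFSET):
    """Average of |W_n(a) - W_n(b)| over the positions both sequences share."""
    wa = waiting_times(seq_a, offset)
    wb = waiting_times(seq_b, offset)
    n = min(len(wa), len(wb))  # assumption: compare the common prefix only
    if n == 0:
        raise ValueError("both sequences must be non-empty")
    return sum(abs(x - y) for x, y in zip(wa, wb)) / n
```

Passing the two hemoglobin sequences to `mean_abs_difference` gives a single similarity figure in one pass over each sequence, which is where the saving in processing time comes from. A smaller value indicates more similar $W_n$ profiles.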
Conclusion
We were able to find gene regions in a DNA sequence using the Lindley equation and EIIP.
We showed a similarity comparison technique that shortens the processing time by using the Lindley equation and EIIP.