
Jan 11, 2016


Protein Secondary Structure Prediction Using BLAST and Exhaustive RT-RICO (Relaxed Threshold Rule Induction from Coverings)

Leong Lee, Ph.D., Department of Computer Science, Austin Peay State University, Clarksville, Tennessee, USA

Jennifer L. Leopold, Ph.D., Department of Computer Science,
Ronald L. Frank, Ph.D., Department of Biological Sciences,
Missouri University of Science and Technology, Rolla, Missouri, USA


Introduction

• Central Dogma of Biology
• Protein Structure Prediction: A Brief Introduction
• Protein Secondary Structure Prediction Problem
• Related Work
• BLAST-ERT-RICO
• Exhaustive RT-RICO Rule Generation Algorithm
• References


What is life made of?
What are living organisms made of?


Molecular Biology: A Brief Introduction

• What is life made of?
• Organisms are made of cells
• A great diversity of cells exists in nature, but they have some common features (Jones and Pevzner, 2004)
– Born, eat, replicate, and die
– A cell would be roughly analogous to a car factory


Molecular Biology: A Brief Introduction

• All life on this planet depends mainly on three types of molecules: DNA, RNA, and proteins

• A cell’s DNA holds a library describing how the cell works

• RNA transfers short pieces of information to different places in the cell; these smaller volumes of information are used as templates to synthesize proteins

• Proteins perform biochemical reactions, send signals to other cells, form the body’s components, and do the actual work of the cell (Jones and Pevzner, 2004)


Central Dogma of Biology

• DNA --> transcription --> RNA --> translation --> protein
• This is referred to as the central dogma in molecular biology (Jones and Pevzner, 2004)
• DNA sequence determines protein sequence
• Protein sequence determines protein structure
• Protein structure determines protein function
• Regulatory mechanisms deliver the right amount of the right function to the right place at the right time (Lesk, 2008)


Molecular Biology: A Brief Introduction

• DNA: the four genomic letters code for all living organisms; double helix structure; can replicate

• Adenine, Guanine, Thymine, and Cytosine, which pair A-T and C-G on complementary strands (chemically attached) (Jones and Pevzner, 2004)


Molecular Biology: A Brief Introduction

• Cell information: instruction book of life
• DNA/RNA: strings written in a four-letter nucleotide alphabet (A C G T/U)
• Protein: strings written in a 20-letter amino acid alphabet
• Example: the transcription of DNA into RNA, and the translation of RNA into a protein (Jones and Pevzner, 2004)

DNA: TAC CGC GGC TAT TAC TGC CAG GAA GGA ACT

RNA: AUG GCG CCG AUA AUG ACG GUC CUU CCU UGA

Protein: Met Ala Pro Ile Met Thr Val Leu Pro Stop
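
The example above can be reproduced in a few lines. The following is a minimal sketch, not part of the original slides. It assumes that the DNA line is the template strand, so that transcription is a base-by-base complement in the order shown, and it uses the standard genetic code written as a compact 64-letter string. The helper names `transcribe` and `translate` are hypothetical.

```python
# Standard genetic code, codons ordered with bases U, C, A, G ('*' marks a stop codon).
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# Assumption: the DNA shown is the template strand, so each base is simply complemented.
DNA_TO_RNA = {"A": "U", "T": "A", "C": "G", "G": "C"}


def transcribe(template_dna: str) -> str:
    """Complement every base of the template strand to obtain the mRNA."""
    return "".join(DNA_TO_RNA[base] for base in template_dna.replace(" ", ""))


def translate(mrna: str) -> str:
    """Read the mRNA codon by codon and stop after the first stop codon."""
    protein = []
    for pos in range(0, len(mrna) - 2, 3):
        residue = CODON_TABLE[mrna[pos:pos + 3]]
        protein.append(residue)
        if residue == "*":
            break
    return "".join(protein)


dna = "TAC CGC GGC TAT TAC TGC CAG GAA GGA ACT"
mrna = transcribe(dna)
print(mrna)             # AUGGCGCCGAUAAUGACGGUCCUUCCUUGA
print(translate(mrna))  # MAPIMTVLP*
```

In one-letter codes, `MAPIMTVLP*` corresponds to Met Ala Pro Ile Met Thr Val Leu Pro Stop.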


Molecular Biology: A Brief Introduction

• Genetic code, from the perspective of mRNA. AUG also acts as a “start” codon

Image courtesy of Griffiths et al.


Protein Structure Prediction: A Brief Introduction

• 3D structure of pepsin (PDB ID: 1PSN)

>1PSN:A|PDBID|CHAIN|SEQUENCE
VDEQPLENYLDMEYFGTIGIGTPAQDFTVVFDTGSSNLWVPSVYCSSLACTNHNRFNPEDSSTYQSTSETVSITYGTGSMTGILGYDTVQVGGISDTNQIFGLSETEPGSFLYYAPFDGILGLAYPSISSSGATPVFDNIWNQGLVSQDLFSVYLSADDQSGSVVIFGGIDSSYYTGSLNWVPVTVEGYWQITVDSITMNGEAIACAEGCQAIVDTGTSLLTGPTSPIANIQSDIGASENSDGDMVVSCSAISSLPDIVFTINGVQYPVPPSAYILQSEGSCISGFQGMNLPTESGELWILGDVFIRQYFTVFDRANNQVGLAPVA

Image courtesy of RCSB Protein Data Bank (http://www.pdb.org)


Protein Structure Prediction: A Brief Introduction

• Genomic projects provide us with the linear amino acid sequence of hundreds of thousands of proteins
• If only we could learn how each and every one of these folds in 3D…
• Malfunctioning of proteins is the most common cause of endogenous diseases
• Most life-saving drugs act by interfering with the action of a foreign protein
• So far, most drugs have been discovered by trial and error
• Because we lack understanding of the complex interplay of proteins, drugs might not be aimed at the best target, hence side effects (Tramontano, 2006)


Protein Structure Prediction: A Brief Introduction

• Experimental methods can provide us the precise arrangement of every atom of a protein.
– X-ray crystallography and NMR spectroscopy

• X-ray crystallography requires protein or complex to form a reasonably well ordered crystal, a feature that is not universally shared by proteins.

• NMR spectroscopy needs proteins to be soluble and there is a limit to the size of protein that can be studied.

• Both are time-consuming techniques; we cannot hope to use them to solve the structures of all proteins in the universe in the near future.

• Problem: How to relate the amino acid sequence of a protein to its 3D structure.

It is estimated that the human body may contain over two million proteins, coded for by only 20,000 - 25,000 genes. The total number of proteins found in terran biological organisms is likely to exceed ten million, but nobody knows for sure. Data is available on just over a million proteins. …wisegeek.com


Background – Protein Primary Structure

• Protein primary structures are chains of amino acids
• 20 amino acids {A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y}
– 1san:A
– MTYTRYQTLELEKEFHFNRYLTRRRRIEIAHALSLTERQIKIWFQNRRMKWKKENKTKGEPG

Image courtesy of National Human Genome Research Institute (NHGRI)


Background - Protein Secondary Structure

• Secondary structure is normally defined by hydrogen bonding patterns

• Amino acids vary in ability to form various secondary structure elements

• 8 types of secondary structure defined: {G, H, I, T, E, B, S, -}

>1SAN:A:sequence
MTYTRYQTLELEKEFHFNRYLTRRRRIEIAHALSLTERQIKIWFQNRRMKWKKENKTKGEPG
>1SAN:A:secstr
----HHHHHHHHHHHHH-SS--HHHHHHHHHHHT--SHHHHHHHHHHHHTTTTTS-TT-S--

Image courtesy of Carl Fürstenberg

Alpha helices are shown in color and random coil in white; there are no beta sheets shown


Protein Secondary Structure Prediction - Motivation

• Important research problem in bioinformatics / biochemistry
• High importance for design of drugs and novel enzymes
• Determination of protein structures by experimental methods is lagging far behind discovery of protein sequences
• Predicting protein tertiary structure is an extremely challenging problem, but it becomes more tractable when simpler secondary structure definitions are used; this is the focus of current research (the tertiary structure of a protein is its three-dimensional structure, as defined by the atomic coordinates)


Protein Secondary Structure Prediction Problem Description

• Input (Baldi et al., 2000)
– Amino acid sequence, A = a_1, a_2, …, a_N
– Data for comparison, D = d_1, d_2, …, d_N
– a_i is an element of a set of 20 amino acids, {A,R,N…V}
– d_i is an element of a set of secondary structures, {H,E,C}, which represents helix H, sheet E, and coil C.
• Output
– Prediction result: X = x_1, x_2, …, x_N
– x_i is an element of a set of secondary structures, {H,E,C}
• 3-Class Prediction (Zhang and Zhang, 2003)
– Multi-class prediction problem with 3 classes {H,E,C} in which one obtains a 3 x 3 confusion matrix Z = (z_ij)


Protein Secondary Structure Prediction Problem Description

• 3 x 3 matrix (3 classes)

                   Prediction
                 H       E       C
Reality    H    Z_11    Z_12    Z_13
           E    Z_21    Z_22    Z_23
           C    Z_31    Z_32    Z_33

Z_ij: input predicted to be in class j while in reality belonging to class i

Q_total = 100 · Σ_i Z_ii / N (percentage)


Q3 Score

• Q3 = Wαα + Wββ + Wcc

Wαα = % of all residues that are helix and correctly predicted as helix

Wββ = % of all residues that are sheet and correctly predicted as sheet

Wcc = % of all residues that are coil and correctly predicted as coil

• Example of Q3 calculation

Protein: 10% helices, 10% sheets, 80% coils
Prediction: 100% coils

Q3 = 0% + 0% + 80% = 80% (0.80)


Q3 Score

• Q3 = Wαα + Wββ + Wcc

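Wαα = % of all residues that are helix and correctly predicted as helix

Wββ = % of all residues that are sheet and correctly predicted as sheet

Wcc = % of all residues that are coil and correctly predicted as coil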

• Example of Q3 calculation, length 10

Amino acid (primary structure) sequence (A): MTYTRYQTLE

(Secondary structure) data for comparison (D): HHHEEECCCC

(Secondary structure) prediction (X): HHEEECCCCC

Q3 = 2/10 + 2/10 + 4/10 = 0.80 (80%)
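
The confusion matrix and the Q3 score can be computed directly from the two strings of the length-10 example. The sketch below is only illustrative; the function names `confusion_matrix` and `q3` are hypothetical and are not taken from the authors' programs.

```python
CLASSES = "HEC"


def confusion_matrix(observed: str, predicted: str) -> list:
    """Z[i][j] counts residues that belong to class i but were predicted as class j."""
    if len(observed) != len(predicted):
        raise ValueError("sequences must have the same length")
    z = [[0] * len(CLASSES) for _ in CLASSES]
    for real, guess in zip(observed, predicted):
        z[CLASSES.index(real)][CLASSES.index(guess)] += 1
    return z


def q3(observed: str, predicted: str) -> float:
    """Percentage of residues on the diagonal of the confusion matrix."""
    z = confusion_matrix(observed, predicted)
    correct = sum(z[i][i] for i in range(len(CLASSES)))
    return 100.0 * correct / len(observed)


observed = "HHHEEECCCC"    # data for comparison
predicted = "HHEEECCCCC"   # prediction
print(confusion_matrix(observed, predicted))  # [[2, 1, 0], [0, 2, 1], [0, 0, 4]]
print(q3(observed, predicted))                # 80.0
```

The same function reproduces the earlier all-coil example: `q3("HECCCCCCCC", "CCCCCCCCCC")` also gives 80.0, which shows that a high Q3 alone does not guarantee that helices and sheets are being found.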


Related Work

• It is not easy to evaluate the performance of a protein secondary structure prediction method (e.g., different datasets are used for training and testing)

• Rost and Sander (1993a) selected a list of 126 protein domains (RS126), which now constitutes a comparative standard

• Cuff and Barton (1999) described the development of a non-redundant test set of 396 protein domains (CB396)

• PHD, one of the first methods surpassing the 70% accuracy threshold, uses multiple sequence alignments as input to a neural network (Rost and Sander, 1993b)


Related Work

• PHD effectively utilizes evolutionary information by exploiting the well-known fact that homologous proteins have similar 3D structures

• Random mutations in DNA sequence can lead to different amino acids in the protein sequences

• Mutations resulting in a structural change are not likely to retain protein function; thus, structure is more conserved than sequence (Rost, 2003)

• Rost (2003) has also stated that a value of around 88% will likely be the operational upper limit for prediction accuracy

In evolutionary biology, homology refers to any similarity between characteristics of organisms that is due to their shared ancestry. Homology among proteins and DNA is often concluded on the basis of sequence similarity, especially in bioinformatics. For example, in general, if two or more genes have highly similar DNA sequences, it is likely that they are homologous. But sequence similarity may also arise without common ancestry:


Q3 Scores of Secondary Structure Prediction Methods

Methods / RS126 Test Dataset / CB396 Test Dataset / Other Test Datasets

PHD 73.5% 71.9%

DSC 71.1% 68.4%

PREDATOR 70.3% 68.6%

NNSSP 72.7% 71.4%

CONSENSUS 74.8% 72.9%

Fadime, 2-stage 74.1%

PSIPRED 78.3%

Hu, SVM 78.8%

Kim, SVMpsi 76.1% 78.5%

Nguyen, 2-stage SVM 78.0% 76.3%


History of Prediction Accuracy

• The secondary structure prediction problem was first defined in the 1960s

• Before the 1990s, the prediction accuracy was only around 60% for most methods

• Recently, some methods have reached or even surpassed 80% accuracy (Q3 score), by utilizing evolutionary information of proteins, large databases, and various machine learning approaches such as artificial neural networks and support vector machines.

• How did we reach/surpass this 80% threshold?


Rost’s Neural Network (Rost and Sander 1993a)

Image courtesy of Rost and Sander


Rost’s Neural Network (Rost and Sander 1993a)

PHD uses multiple sequence alignments as input to a neural network (Rost and Sander, 1993b)


BLAST-ERT-RICO

• Given input protein A (amino acid sequence, A = a1, a2, … aN), protein BLAST search (Web-based) performed using A as query sequence

• BLAST returns a list of proteins with significant sequence alignments

• Suitable proteins chosen to form training dataset for A
• RT-RICO algorithm generates rules from the training dataset; rules used to predict the secondary structure for protein A
• Output is predicted secondary structure sequence X


BLAST-ERT-RICO Step 1
Online BLAST and PDB Data Match

• BLAST search (Web crawler program) performed using A as query sequence
Say A = APAFSVSPASGASDGQSVSVSVAAAGETYYI…

• Returns list of proteins with significant sequence alignments and corresponding BLAST scores; proteins with score ≤ 30 removed from list (test protein A also removed)

• Some of these proteins may have corresponding secondary structure records in PDB (Berman et al., 2000)

• The retrieved records become inputs to the next step, data preparation

• If a protein from the list does not have known secondary structure record in PDB, we will require data from offline preprocessing
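
The filtering described in Step 1 can be sketched as follows. This is a minimal illustration under stated assumptions: the `Hit` record, the function `select_training_hits`, and the protein identifiers are all hypothetical placeholders, and no real BLAST or PDB call is made.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    protein_id: str   # hypothetical identifier of an aligned protein
    score: float      # BLAST score reported for the alignment


def select_training_hits(hits, test_id, known_structures, min_score=30.0):
    """Drop hits with score <= min_score and the test protein, then split on PDB availability."""
    kept = [h for h in hits if h.score > min_score and h.protein_id != test_id]
    with_structure = [h for h in kept if h.protein_id in known_structures]
    without_structure = [h for h in kept if h.protein_id not in known_structures]
    return with_structure, without_structure


# Made-up data, for illustration only.
hits = [
    Hit("TEST:A", 250.0),   # the query itself, always removed
    Hit("HIT1:A", 120.5),
    Hit("HIT2:B", 45.0),
    Hit("HIT3:A", 28.0),    # score <= 30, removed
]
known = {"HIT1:A"}          # IDs assumed to have a secondary structure record in PDB
online, offline = select_training_hits(hits, "TEST:A", known)
print([h.protein_id for h in online])    # ['HIT1:A']
print([h.protein_id for h in offline])   # ['HIT2:B']
```

Hits in the first list would go on to data preparation; hits in the second list are the ones that would need data from offline preprocessing.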


BLAST-ERT-RICO

Step 1 Online BLAST and PDB Data Match


BLAST-ERT-RICO Step 2
Data Preparation (Math content, skip)

• For test protein A, there is a set of protein primary structure sequences B_i and a set of corresponding secondary structure sequences C_i, where B_i ∈ {B_1, B_2, B_3, B_4, …, B_y} and C_i ∈ {C_1, C_2, C_3, C_4, …, C_y}

• Primary structure sequence is B_i = b_{i,1}, b_{i,2}, b_{i,3}, …, b_{i,w_i}

• Corresponding secondary structure sequence is C_i = c_{i,1}, c_{i,2}, c_{i,3}, …, c_{i,w_i}

• B_1 to B_y are not necessarily of the same length, because they represent different proteins

• Each b_{i,j} is an element of a set of 20 amino acids, {A,R,N…V}
• c_{i,j} is an element of a set of 8-state secondary structures, {H, G, I, E, B, T, S, -} (PDB); converted to an element of a set of 4-state secondary structures, {H, E, C, -}
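
The slides do not give the exact table used for the 8-state to 4-state conversion. The sketch below therefore assumes one commonly used grouping (H, G, I to H; E, B to E; T, S to C; "-" kept as its own symbol); both the mapping and the name `reduce_states` are assumptions made for illustration.

```python
# Assumed reduction; the original conversion table is not shown in the slides.
EIGHT_TO_FOUR = {
    "H": "H", "G": "H", "I": "H",   # helix types grouped as H
    "E": "E", "B": "E",             # strand and bridge grouped as E
    "T": "C", "S": "C",             # turn and bend grouped as C
    "-": "-",                       # unassigned residues keep their own symbol
}


def reduce_states(secstr: str) -> str:
    """Map an 8-state secondary structure string to the 4-state alphabet."""
    return "".join(EIGHT_TO_FOUR[state] for state in secstr)


print(reduce_states("GHITEBS-"))  # HHHCEEC-
```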


BLAST-ERT-RICO Step 2
Data Preparation (Math content, skip)

• If B_i is the primary structure sequence, C_i is the secondary structure sequence, and the length of the sequence(s) is w_i, then each n-residue segment is of the form: b_{i,j-floor(n/2)}, …, b_{i,j-1}, b_{i,j}, b_{i,j+1}, …, b_{i,j+floor(n/2)}, c_{i,j}; and j takes values from ceiling(n/2) to (w_i – floor(n/2))

• This data preparation step performed for all Bi and Ci pairs, where i is from 1 to y

• These n-residue segments are main inputs to ERT-RICO rule generation algorithm
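
The segment construction amounts to a sliding window over each sequence pair. The following is a minimal sketch, assuming an odd n so that the window has a centre residue; `make_segments` is a hypothetical name, and the short input is the start of the 1SAN:A example used earlier.

```python
def make_segments(primary: str, secondary: str, n: int = 9):
    """Return (window, centre_state) pairs for every full n-residue window."""
    if len(primary) != len(secondary):
        raise ValueError("primary and secondary sequences must have the same length")
    if n % 2 == 0:
        raise ValueError("n is assumed to be odd so that the window has a centre residue")
    half = n // 2
    segments = []
    # 0-based centre index j runs from half to len - half - 1, which is the
    # 1-based range ceiling(n/2) .. w - floor(n/2) given on the slide.
    for j in range(half, len(primary) - half):
        segments.append((primary[j - half:j + half + 1], secondary[j]))
    return segments


for window, state in make_segments("MTYTRYQTLELE", "----HHHHHHHH", n=9):
    print(window, state)
# MTYTRYQTL H
# TYTRYQTLE H
# YTRYQTLEL H
# TRYQTLELE H
```

A sequence of length w yields w - n + 1 segments, so the first and last floor(n/2) residues never appear as a window centre.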


BLAST-ERT-RICO Step 2
Data Preparation

• Protein primary structure n-residue segments and related secondary structure elements representation (n=9)


BLAST-ERT-RICO

Step 2 Data Preparation


BLAST-ERT-RICO Step 3
Rule Generation

• Sample rules generated by ERT-RICO (n=9, m=1708)

+,+,+,L,+,+,+,+,S,E,84.21,19,16,0.93676815
+,+,+,T,V,+,+,+,+,E,76.47,51,39,2.28337237
Q,A,+,+,+,+,+,+,G,E,100.00,7,7,0.40983607
……

(3,L)(8,S) -> (9,E), 84.21%, occurrences of ((3,L)(8,S)) = 19, occurrences of ((3,L)(8,S) -> (9,E)) = 16, Support % = 0.93676815
(3,T)(4,V) -> (9,E), 76.47%, occurrences of ((3,T)(4,V)) = 51, occurrences of ((3,T)(4,V) -> (9,E)) = 39, Support % = 2.28337237
(0,Q)(1,A)(8,G) -> (9,E), 100.00%, occurrences of ((0,Q)(1,A)(8,G)) = 7, occurrences of ((0,Q)(1,A)(8,G) -> (9,E)) = 7, Support % = 0.40983607
……
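
The sample lines can be parsed mechanically, and the numbers are consistent with confidence = rule occurrences / antecedent occurrences and support = rule occurrences / m. The sketch below checks this for the first sample rule; `parse_rule` is a hypothetical helper, and the field layout is inferred from the sample lines rather than from a documented file format.

```python
def parse_rule(line: str, n: int = 9):
    """Split one comma-separated rule line into its parts ('+' marks an unused position)."""
    fields = line.split(",")
    conditions = [(pos, aa) for pos, aa in enumerate(fields[:n]) if aa != "+"]
    decision = (n, fields[n])
    confidence = float(fields[n + 1])      # percent
    antecedent_count = int(fields[n + 2])  # occurrences of the left-hand side
    rule_count = int(fields[n + 3])        # occurrences of the whole rule
    support = float(fields[n + 4])         # percent of the m segments
    return conditions, decision, confidence, antecedent_count, rule_count, support


m = 1708  # number of segments, as stated with the sample rules
conds, decision, conf, n_antecedent, n_rule, support = parse_rule(
    "+,+,+,L,+,+,+,+,S,E,84.21,19,16,0.93676815")
print(conds, decision)                         # [(3, 'L'), (8, 'S')] (9, 'E')
print(round(100 * n_rule / n_antecedent, 2))   # 84.21
print(round(100 * n_rule / m, 8))              # 0.93676815
```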


BLAST-ERT-RICO

Step 3 Rule Generation


BLAST-ERT-RICO Step 4 Prediction

• Protein primary structure n-residue segments and related secondary structure elements prediction (n=9)

• Here xi is an element of the set {H,E,C,-}. It is then converted to an element of the set {H, E, C}.


BLAST-ERT-RICO Step 4 Prediction (may skip)

• The prediction algorithm is also dependent on the selection of the threshold value

• Suppose that a threshold value t = 0.8 (80%) is chosen
• The algorithm first searches for matching rules with 100% confidence value. The secondary structure element with the highest total support value (among 100% confidence value rules) is selected

• If no matching rule exists among 100% confidence value rules, the algorithm then searches for other matching rules (with confidence values greater than or equal to 90%)

• If no matching rule exists among those with confidence value greater than or equal to 90%, the algorithm searches for matching rules with confidence values greater than or equal to 80%


BLAST-ERT-RICO Step 4 Prediction (may skip)

• This lowering of the confidence level (in steps of 10 percentage points) stops at the threshold value t, in this case 80% (threshold = 0.8); it can go lower if a smaller t is chosen

• The secondary structure element with the highest total support value among these rules is selected as the predicted secondary structure element for that specific position

• If no matching rule is found for the segment at all, the secondary structure of the previous position is used as the predicted secondary structure
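
The prediction procedure for a single position can be sketched as below. This is a minimal illustration under stated assumptions: rules are held as plain dictionaries, the function name `predict_state` is hypothetical, and the value passed as `previous` for the very first position is an arbitrary choice, since the slides do not say what is used there. The three rules are the sample rules shown in Step 3.

```python
def predict_state(segment, rules, t=80.0, previous="C", step=10.0):
    """Pick the state with the highest total support among the best-confidence matching rules."""
    matching = [r for r in rules
                if all(segment[pos] == aa for pos, aa in r["conditions"])]
    level = 100.0
    while True:
        level = max(level, t)          # never search below the chosen threshold t
        totals = {}
        for r in matching:
            if r["confidence"] >= level:
                totals[r["state"]] = totals.get(r["state"], 0.0) + r["support"]
        if totals:
            return max(totals, key=totals.get)
        if level <= t:
            return previous            # nothing usable: reuse the previous position
        level -= step


rules = [
    {"conditions": [(3, "L"), (8, "S")], "state": "E",
     "confidence": 84.21, "support": 0.93676815},
    {"conditions": [(3, "T"), (4, "V")], "state": "E",
     "confidence": 76.47, "support": 2.28337237},
    {"conditions": [(0, "Q"), (1, "A"), (8, "G")], "state": "E",
     "confidence": 100.00, "support": 0.40983607},
]
print(predict_state("QAKLMNPRG", rules, previous="H"))  # E (a 100% rule matches)
print(predict_state("AAALAAAAS", rules, previous="H"))  # E (found once the level drops to 80)
print(predict_state("AAATVAAAA", rules, previous="H"))  # H (only a 76.47% rule matches)
```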


BLAST-ERT-RICO

Step 4 Prediction


BLAST-ERT-RICO, Offline Preprocessing
(future work needed here)

• If no protein with significant sequence alignments has corresponding known secondary structure sequence from PDB (answer is “no” in Fig. 1.), prediction for test protein needs to be handled slightly differently

• All proteins and corresponding secondary structure sequences from PDB downloaded to form initial dataset; test datasets (RS126 or CB396) removed; protein domains from different protein families selected to form training datasets

• Now we have set of protein primary structure sequence Bi and corresponding secondary structure sequence Ci; same data preparation, rule generation, and prediction steps applied


BLAST-ERT-RICO

Offline Preprocessing


Exhaustive RT-RICO (ERT-RICO)
Rule Generation Algorithm

• Most computationally intensive
• Previously, this research team presented a prediction method, BLAST-RT-RICO
• Some areas of the algorithm were in need of improvement; most importantly, the time complexity for the rule generation step needed to be reduced

• RT-RICO has a time complexity of O(m^2 · 2^n), where m is the number of all entities (the number of rows of n-residue segments), and n = |S| (the number of attributes). m^2 dominates the time complexity because n is a small value (9 for this case)


Exhaustive RT-RICO (ERT-RICO)
Rule Generation Algorithm

• Sometimes a very large m can cause running time issues
• When we ran datasets with different n value and t (threshold) value combinations to find the optimal segment length and threshold value, we faced the challenge of running several datasets in a reasonable period of time

• We developed the Exhaustive RT-RICO algorithm (ERT-RICO), which is a modified version of the old RT-RICO algorithm, and has an improved time complexity of O(m log(m) · 2^n). m log(m) dominates the time complexity

• ERT-RICO has a space complexity of O((2^n − 1)(20^n)(4)); in practice the space required is much smaller than that, because different segments generate a large number of duplicate rules
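
To get a feel for the difference between the two growth rates, the step counts can be compared for the sample size quoted with the rules (m = 1708, n = 9). The sketch below is only an order-of-magnitude illustration: it ignores constant factors and assumes a base-2 logarithm, neither of which is specified in the slides.

```python
import math


def rt_rico_steps(m: int, n: int) -> float:
    """Growth term m^2 * 2^n of the older RT-RICO algorithm."""
    return m ** 2 * 2 ** n


def ert_rico_steps(m: int, n: int) -> float:
    """Growth term m * log2(m) * 2^n of ERT-RICO (log base is an assumption)."""
    return m * math.log2(m) * 2 ** n


m, n = 1708, 9
ratio = rt_rico_steps(m, n) / ert_rico_steps(m, n)
print(round(ratio))  # the ratio equals m / log2(m)
```

Because the 2^n factor cancels, the ratio is simply m / log2(m), so the gap widens as the number of segments m grows.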


ERT-RICO, Number of All Possible Rules


ERT-RICO Rule Generation Algorithm

• Space complexity could be an issue; for n = 9, we would need (2^9 − 1)(20^9)(4), around 1.04653 × 10^15, counters; we made data structure adjustments (different segments generate lots of duplicate rules)

• We know all possible values for each position in a segment (hence all possible rules)

• For an m×(n+1) matrix, each row (segment) is of length n+1
• The first n elements are made up of letters from a set of 20 amino acid residues, {A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y}, and the last element is a letter from a set of four secondary structure states {H, E, C, -}

• Convert a rule to a numeric value, and convert the number back to the original rule
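
The counter bound can be checked with integer arithmetic, and a reversible rule-to-number mapping is easy to build. The encoding below is one possible scheme and is an assumption, not the authors' conversion: it treats each of the n positions as a base-21 digit (the 20 amino acids plus "+" for an unused position) and appends the secondary structure state as a base-4 digit. The names `rule_to_number` and `number_to_rule` are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
SYMBOLS = "+" + AMINO_ACIDS   # '+' means "position not used by the rule"
STATES = "HEC-"


def rule_to_number(pattern: str, state: str) -> int:
    """Encode a rule pattern such as '+++L++++S' and its decision state as one integer."""
    key = 0
    for symbol in pattern:
        key = key * len(SYMBOLS) + SYMBOLS.index(symbol)
    return key * len(STATES) + STATES.index(state)


def number_to_rule(key: int, n: int = 9):
    """Decode the integer back into (pattern, state)."""
    key, state_index = divmod(key, len(STATES))
    symbols = []
    for _ in range(n):
        key, digit = divmod(key, len(SYMBOLS))
        symbols.append(SYMBOLS[digit])
    return "".join(reversed(symbols)), STATES[state_index]


n = 9
print((2 ** n - 1) * 20 ** n * 4)   # 1046528000000000, about 1.04653 x 10^15
key = rule_to_number("+++L++++S", "E")
print(number_to_rule(key))          # ('+++L++++S', 'E')
```

Such an integer can serve directly as a hash key, so only rules that actually occur in the data need a counter.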


ERT-RICO, Converting A Rule to A Number


ERT-RICO Rule Generation Algorithm

• The ERT-RICO rule generation algorithm finds the set C of all relaxed coverings of R in S (and the related rules), with threshold probability t (0 < t ≤ 1), where S is the set of all attributes, and R is the set of all decisions.

• The input to ERT-RICO is in the form of an m×(n+1) matrix, where m is the number of all entities (the number of segments, each made of n residues plus one secondary structure element), and n = |S| (the number of attributes).

Algorithm 2: ERT-RICO
begin
  for each segment (each row of matrix)
    for each of the 2^n − 1 rules that can be generated from segment
      generate unique hash key which is a numeric index
      if hash index does not exist in the hash table then
        add hash index and hash value (1) to the hash table
        (hash value = number of occurrences of each rule)
      else
        update hash value in the hash table
        (hash value = hash value + 1)
      end-if
    end-for
  end-for
  for each key in the hash table
    generate rule from key (in amino acid and secondary structure letters)
    calculate confidence and support using hash value and related keys
    if confidence > t then
      add rule, confidence, and support to output file
    end-if
  end-for
end-algorithm.
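
The counting idea in the pseudocode can be sketched in Python. This is an illustrative reading, not the authors' Perl implementation: it uses two `Counter` tables keyed by tuples instead of a single table with numeric keys, and it keeps rules with confidence greater than or equal to t (the pseudocode writes a strict ">", while the prediction step speaks of "greater than or equal to"). The function name `ert_rico` and the toy segments are hypothetical.

```python
from collections import Counter
from itertools import combinations


def ert_rico(segments, t=0.8):
    """segments: (window, state) pairs; returns rules whose confidence is at least t."""
    segments = list(segments)
    rule_counts = Counter()        # (conditions, state) -> occurrences of the whole rule
    antecedent_counts = Counter()  # conditions -> occurrences of the left-hand side
    for window, state in segments:
        positions = range(len(window))
        for size in range(1, len(window) + 1):
            for chosen in combinations(positions, size):   # 2^n - 1 non-empty subsets
                conditions = tuple((p, window[p]) for p in chosen)
                rule_counts[(conditions, state)] += 1
                antecedent_counts[conditions] += 1
    m = len(segments)
    rules = []
    for (conditions, state), count in rule_counts.items():
        confidence = count / antecedent_counts[conditions]
        if confidence >= t:        # assumption: ">=" rather than the slide's ">"
            rules.append((conditions, state, 100 * confidence, 100 * count / m))
    return rules


# Toy input with n = 3, made up for illustration.
toy = [("ACD", "H"), ("ACE", "H"), ("ACD", "E")]
for conditions, state, confidence, support in ert_rico(toy, t=0.8):
    print(conditions, "->", state, confidence, round(support, 2))
```

With this toy input only the four rules that involve residue E at position 2 reach the threshold, each with 100% confidence; every other pattern is split between H and E. Each segment is visited once and contributes 2^n − 1 table updates.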


Conclusion

• ERT-RICO has an improved time complexity of O(m log(m) · 2^n)
• This improvement over RT-RICO’s O(m^2 · 2^n) has enabled the research team to run much larger test datasets with different choices of segment length and threshold value

• Preliminary test results showed that BLAST-ERT-RICO achieved a Q3 score of 92.19% on the standard test dataset RS126

• Current optimal segment length: n = 9
• Current optimal threshold: t = 0.8
• The adoption of the ERT-RICO algorithm also resolves the space complexity issues of our earlier implementations (hash table design eliminates the need for counters for all entries and individual counters for duplicate entries; maximum hash table size is around 47 million entries => fits in RAM)


Conclusion

• The test programs (rule generation and prediction for the RS126 set, n=9) were written in Perl and executed on a computer with an Intel dual-core processor, 32 GB of RAM, and Windows 7 OS

• The total program running time was approximately 21 days (which definitely can be improved in the future)

• Even with the use of standard test datasets, it is still difficult to compare the accuracies of prediction methods

• RS126 set is a very representative test dataset; all test proteins can generate a number of significant alignments through BLAST


Future Work – 71,000 proteins

• It is still difficult to compare the accuracies of prediction methods

• In early 2011, there were around 71,000 proteins (unique PDB IDs) with known secondary structure in the Protein Data Bank (PDB) database

• Most test datasets use only around 100 to 500 protein domains

• If all these 71,000 proteins can be used to evaluate a particular method, the resulting Q3 score should be highly representative


Future Work – homologous protein selection

• So far (last few years), I used all proteins with significant sequence alignments (certain BLAST scores) to generate rules
• Result? Long rule generation time
• This could be improved by developing a different algorithm for selecting proteins for rule generation
• Some work (algorithm design and programming) has already been done in this area


Future Work – offline processing

• Current offline processing algorithm normally produces lower Q3 scores

• After acquiring some good parameters for ERT-RICO, we can use these values to develop a better offline processing algorithm

• Some work (such as algorithm design) has been done
• Other test datasets like the CB396 set, which requires offline processing for a number of proteins, can be used to test the future offline processing algorithm


Future Work – 71,000 proteins

• Will further improve the homologous protein selection process
• Develop a better offline processing algorithm
• I can run the Perl computer programs to predict the secondary structure of around 71,000 proteins


Future Work – External Funding

• To prepare an NSF grant proposal for the construction of a Web-based server, to make the prediction capability available to the bioinformatics community

• And/or to run more test datasets to improve the algorithm

• Target: Advances in Biological Informatics (ABI) program, by July, 2013 (Full Proposal Deadline)


::: Thank You :::

Leong Lee, Ph.D., Department of Computer Science, Austin Peay State University, Clarksville, Tennessee, USA

Jennifer L. Leopold, Ph.D., Department of Computer Science, and Ronald L. Frank, Ph.D., Department of Biological Sciences, Missouri University of Science and Technology, Rolla, Missouri, USA


ReferencesAltschul, S. F., Madden, T. L., Schäffer, A. A., Zhang, J., Zhang, Z., Miller, W. and Lipman, D. J. (1997) ‘Gapped BLAST and PSI-BLAST: a new

generation of protein database search programs’, Nucleic Acids Res., Vol. 25, No. 17, pp.3389-402. Andreeva, A., Howorth, D., Chandonia, J. M., Brenner, S. E., Hubbard, T. J., Chothia, C. and Murzin, A. G. (2008) ‘Data growth and its impact on the

SCOP database: new developments’, Nucleic Acids Res, Vol. 36 (Database issue), D419-25. Baldi, P., Brunak, S., Chauvin, Y., Andersen, C. A. F. and Nielsen, H. (2000) ‘Assessing the accuracy of prediction algorithms for classification: an

overview’, Bioinformatics, Vol. 16, No. 5, pp.412-24. Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N., Weissig, H., Shindyalov, I. N. and Bourne, P. E. (2000) ‘The Protein Data Bank’,

Nucleic Acids Res., Vol. 28, No. 1, pp.235-42. BLAST (2009). BLAST: Basic Local Alignment Search Tool. Obtained through the Internet: http://blast.ncbi.nlm.nih.gov/, [accessed 30/11/2009] Bryson, K., McGuffin, L. J., Marsden, R. L., Ward, J. J., Sodhi, J. S. and Jones, D. T. (2005) ‘Protein structure prediction servers at University College

London’, Nucleic Acids Res., Vol. 33(Web Server issue), W36-8. Cuff, J. A. and Barton, G. (1999) ‘Evaluation and improvement of multiple sequence methods for protein secondary structure prediction’, Proteins,

Vol. 34, pp.508–519.

Cuff, J. A. and Barton, G. (2000) ‘Application of multiple sequence alignment profiles to improve protein secondary structure prediction’, Proteins, Vol. 40, No. 3, pp.502-11.

Fadime, U. Y., O¨zlem, Y. and Metin, T. (2008) ‘Prediction of secondary structures of proteinsnext term using a two-stage method’, Computers & Chemical Engineering, Vol. 32, No. 1-2, pp.78-88.

Page 56: Protein Secondary Structure Prediction Using BLAST and Exhaustive RT-RICO (Relaxed Threshold Rule Induction from Coverings) Leong Lee, Ph.D., Department.

56

ReferencesFrishman, D. and Argos, P. (1997) ‘Seventy-five percent accuracy in protein secondary structure prediction’, Proteins, Vol. 27, pp.329–335. Grzymala-Busse, J. W. (1991) ‘Ch.3. Knowledge Acquisition’, Managing Uncertanity in Expert System, (pp.43-76), Boston: Kluwer Academic. Han, J. and Kamber, M. (2001) Data Mining: Concepts and Techniques, (pp.155-157) Morgan Kaufmann. Hu, H., Pan, Y., Harrison, R. and Tai, P. (2004) ‘Improved protein secondary structure prediction using support vector machine and a newencoding scheme and an advanced tertiary classifier’, IEEE Trans. NanoBiosci., Vol. 3, pp.265–271. Jones, D. T. (1999) ‘Protein secondary structure prediction based on position-specific scoring matrices’, J. Mol. Biol., Vol. 292, No. 2, pp.195-

202.

Jones, N. C. And Pevzner, P. A. (2004) An Introduction to Bioinformatics Algorithms, MIT Press.

Kabsh, W. and Sander, C. (1983) ‘How good are predictions of protein secondary structure?’, FEBS Letters, Vol. 155, pp.179-182. Kim, H. and Park, H., (2003) ‘Protein secondary structure prediction based on an improved support vector machines approach’, Protein Eng.,

Vol. 16, pp.553-60. King, R. D. and Sternberg, M. J. E. (1996) ‘Identification and application of the concepts important for accurate and reliable protein

secondary structure prediction’, Protein. Sci., Vol. 5, pp.2298–2310.

Klepeis, J. L. and Floudas, C. A. (2002) ‘Ab initio prediction of helical segments in polypeptides’, J Comput. Chem, Vol. 23, No. 2, pp.245-66.

Page 57: Protein Secondary Structure Prediction Using BLAST and Exhaustive RT-RICO (Relaxed Threshold Rule Induction from Coverings) Leong Lee, Ph.D., Department.

57

ReferencesLeopold, J. L., Maglia, A. M., Thakur, M., Patel, B. and Ercal, F. (2007) ‘Identifying Character Non-Independence in Phylogenetic Data Using Parallelized

Rule Induction From Coverings’, Data Mining VIII: Data, Text, and Web Mining and Their Business Applications, WIT Transactions on Information and Communication Technologies, Vol. 38, pp. 45-54.

Levitt, M. and Chothia, C. (1976) ‘Structural patterns in globular proteins’, Nature, Vol. 261, No. 5561, pp.552-8.

Lee, L., Leopold, J. L., Frank, R. L., and Maglia, A. M. (2009) ‘Protein Secondary Structure Prediction Using Rule Induction from Coverings,’ Proceedings of IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology 2009 , Nashville, Tennessee, USA, pp. 79-86.

Lee, L., Kandoth, C., Leopold, J. L., and Frank, R. L. (2010a) ‘Protein Secondary Structure Prediction Using Parallelized Rule Induction from Coverings ,’ International Journal of Medicine and Medical Sciences, Vol. 1, No. 2, pp. 99-105.

Lee, L., Leopold, J. L., Kandoth, C., and Frank, R. L. (2010b) ‘Protein secondary structure prediction using RT-RICO: a rule-based approach,’ The Open

Bioinformatics Journal, Vol. 4, pp. 17-30.. Lee, L., Leopold, J. L., Edgett, P. G., and Frank, R. L. (2010c) ‘Rule Visualization of Protein Motif Sequence Data for Secondary Structure Prediction,’

Proceedings of ANNIE 2010 conference, St. Louis, Missouri, USA.

Lee, L., Leopold, J. L., and Frank, R. L. (2011) ‘Protein secondary structure prediction using BLAST and Relaxed Threshold Rule Induction from Coverings ,’ Proceedings of IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology 2011 , Paris, France, accepted for publication.

Lesk, A. M. (2008) Introduction to Bioinformatics, 3rd Edition, Oxford.

Maglia, A. M., Leopold, J. L. and Ghatti, V. R. (2004) ‘Identifying Character Non-Independence in Phylogenetic Data Using Data Mining Techniques’, Proc. Second Asia-Pacific Bioinformatics Conference Dunedin, New Zealand.

Page 58: Protein Secondary Structure Prediction Using BLAST and Exhaustive RT-RICO (Relaxed Threshold Rule Induction from Coverings) Leong Lee, Ph.D., Department.

58

ReferencesMurzin, A. G., Brenner, S. E., Hubbard, T. and Chothia, C. (1995) ‘SCOP: a structural classification of proteins database for the investigation of sequences

and structures’, J Mol. Biol, Vol. 247, No. 4, pp.536-40. Nguyen, N. and Rajapakse, J. C. (2007) ‘Two stage support vector machines for protein secondary structure prediction’, Intl. J. Data Mining &

Bioinformatics, Vol. 1, pp.248-269. Pawlak, Z. (1984) ‘Rough Classification’, Int. J. Man-Machine Studies, Vol. 20, pp.469-483. Rost, B. and Sander, C. (1993a) ‘Prediction of protein secondary structure at better than 70% accuracy’, J. Mol. Biol.,Vol. 232, pp.584-599. Rost, B. and Sander, C. (1993b) ‘Improved prediction of protein secondary structure by use of sequence profiles and neural networks’, Proc. Natl. Acad.

Sci. USA, Vol. 90, pp.7558–7562. Rost, B. (2003) ‘Rising accuracy of protein secondary structure prediction’, In: Chasman, D. (Ed.), Protein structure determination, analysis, and modeling

for drug discovery, (pp.207–249), New York: Dekker. Salamov, A. A. and Solovyev, V. V. (1995) ‘Prediction of protein secondary structure by combining nearest-neighbor algorithms and multiple sequence

alignments’, J Mol. Biol., Vol. 247, pp.11–15.

Tramontano, A. (2006) Protein Structure Prediction, Wiley-vch.

Wong, P. C., Whitney, P. and Thomas, J. (1999) ‘Visualizing Association Rules for Text Mining’ Proceedings of the 1999 IEEE Symposium on Information Visualization, pp. 120-123, 152.

Zhang, C. T. and Zhang, R. (2003) ‘Q9, a content-balancing accuracy index to evaluate algorithms of protein secondary structure prediction’, Int J

Biochem Cell Biol., Vol. 35, No. 8, pp.1256-62.