Rule Visualization of Protein Motif Sequence Data for Secondary Structure Prediction - An Overview Leong Lee, Ph.D. University of Missouri (MS&T) Visiting Assistant Professor, Dept of Computer Science University of North Carolina at Greensboro

Dec 16, 2015


Deven Whitehead

Rule Visualization of Protein Motif Sequence Data for Secondary Structure Prediction - An Overview

Leong Lee, Ph.D. University of Missouri (MS&T)
Visiting Assistant Professor, Dept of Computer Science
University of North Carolina at Greensboro

Introduction

• Molecular Biology: A Brief Introduction
• Central Dogma of Biology
• Protein Structure Prediction: A Brief Introduction
• Protein Secondary Structure Prediction Problem
• Related Work
• Rule-Based RT-RICO
• BLAST-RT-RICO
• RT-RICO Rule Generation Algorithm
• Rule Visualization of Protein Motif Sequence Data
• Conclusion
• References, More Related Work, Detailed RT-RICO

What is life made of?
What are living organisms made of?

Molecular Biology: A Brief Introduction

• What is life made of?
• Organisms are made of cells
• A great diversity of cells exists in nature, but they have some common features (Jones and Pevzner, 2004)
– They are born, eat, replicate, and die
– A cell would be roughly analogous to a car factory

Molecular Biology: A Brief Introduction

• All life on this planet depends mainly on three types of molecules: DNA, RNA, and proteins

• A cell’s DNA holds a library describing how the cell works

• RNA transfers short pieces of information to different places in the cell; these smaller volumes of information are used as templates to synthesize proteins

• Proteins perform biochemical reactions, send signals to other cells, form the body’s components, and do the actual work of the cell (Jones and Pevzner, 2004)

Central Dogma of Biology

• DNA --> transcription --> RNA --> translation --> protein
• Is referred to as the central dogma in molecular biology (Jones and Pevzner, 2004)
• DNA sequence determines protein sequence
• Protein sequence determines protein structure
• Protein structure determines protein function
• Regulatory mechanisms deliver the right amount of the right function to the right place at the right time (Lesk, 2008)

Molecular Biology: A Brief Introduction

• Cell Information: instruction book of life
• DNA/RNA: strings written in a four-letter nucleotide alphabet (A, C, G, T/U)
• Protein: strings written in a 20-letter amino acid alphabet
• Example: the transcription of DNA into RNA, and the translation of RNA into a protein (Jones and Pevzner, 2004)

DNA: TAC CGC GGC TAT TAC TGC CAG GAA GGA ACT

RNA: AUG GCG CCG AUA AUG ACG GUC CUU CCU UGA

Protein: Met Ala Pro Ile Met Thr Val Leu Pro Stop
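
The example above can be reproduced with a short Python sketch. It assumes the DNA line is the template strand (so each base is paired with its RNA complement), and it uses a partial codon table that covers only the codons in this example; the names `transcribe`, `translate`, and `CODON_TABLE` are illustrative, not taken from the slides.

```python
# Partial codon table: only the codons used in the slide's example
# (an assumption made for brevity; the full genetic code has 64 codons).
CODON_TABLE = {
    "AUG": "Met", "GCG": "Ala", "CCG": "Pro", "AUA": "Ile",
    "ACG": "Thr", "GUC": "Val", "CUU": "Leu", "CCU": "Pro",
    "UGA": "Stop",
}
# Template-strand DNA base -> complementary RNA base.
COMPLEMENT = {"A": "U", "T": "A", "C": "G", "G": "C"}


def transcribe(template_dna: str) -> str:
    """Pair each template-strand nucleotide with its RNA complement."""
    return "".join(COMPLEMENT[base] for base in template_dna)


def translate(mrna: str) -> list:
    """Read the mRNA three bases at a time and look up each codon."""
    codons = [mrna[i:i + 3] for i in range(0, len(mrna) - 2, 3)]
    return [CODON_TABLE[codon] for codon in codons]


dna = "TACCGCGGCTATTACTGCCAGGAAGGAACT"
mrna = transcribe(dna)
protein = translate(mrna)
print(mrna)
print(" ".join(protein))
```

The sketch reports the stop codon as the last element rather than halting translation, which mirrors how the slide lists "Stop" at the end of the protein line.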

Molecular Biology: A Brief Introduction

• Genetic code, from the perspective of mRNA. AUG also acts as a “start” codon

Protein Structure Prediction: A Brief Introduction

• 3D structure of pepsin (PDB ID: 1PSN)

>1PSN:A|PDBID|CHAIN|SEQUENCE
VDEQPLENYLDMEYFGTIGIGTPAQDFTVVFDTGSSNLWVPSVYCSSLACTNHNRFNPEDSSTYQSTSETVSITYGTGSMTGILGYDTVQVGGISDTNQIFGLSETEPGSFLYYAPFDGILGLAYPSISSSGATPVFDNIWNQGLVSQDLFSVYLSADDQSGSVVIFGGIDSSYYTGSLNWVPVTVEGYWQITVDSITMNGEAIACAEGCQAIVDTGTSLLTGPTSPIANIQSDIGASENSDGDMVVSCSAISSLPDIVFTINGVQYPVPPSAYILQSEGSCISGFQGMNLPTESGELWILGDVFIRQYFTVFDRANNQVGLAPVA

Protein Structure Prediction: A Brief Introduction

• Genomic projects provide us with the linear amino acid sequence of hundreds of thousands of proteins
• If only we could learn how each and every one of these folds in 3D…
• Malfunctioning of proteins is the most common cause of endogenous diseases
• Most life-saving drugs act by interfering with the action of foreign proteins
• So far, most drugs have been discovered by trial-and-error
• Because we lack understanding of the complex interplay of proteins, drugs might not be aimed at the best target and can have side effects (Tramontano, 2006)

Protein Structure Prediction: A Brief Introduction

• Experimental methods can provide us the precise arrangement of every atom of a protein
– X-ray crystallography and NMR spectroscopy
• X-ray crystallography requires the protein or complex to form a reasonably well ordered crystal, a feature that is not universally shared by proteins
• NMR spectroscopy needs proteins to be soluble, and there is a limit to the size of protein that can be studied
• Both are time-consuming techniques; we cannot hope to use them to solve the structures of all proteins in the universe in the near future
• Problem: How to relate the amino acid sequence of a protein to its 3D structure

Background – Protein Primary Structure

• Protein primary structures are chains of amino acids
• 20 amino acids {A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y}
– 1san:A
– MTYTRYQTLELEKEFHFNRYLTRRRRIEIAHALSLTERQIKIWFQNRRMKWKKENKTKGEPG

Image Author: National Human Genome Research Institute (NHGRI)

Background - Protein Secondary Structure

• Secondary structure is normally defined by hydrogen bonding patterns

• Amino acids vary in ability to form various secondary structure elements

• 8 types of secondary structure defined: {G, H, I, T, E, B, S, -}

>1SAN:A:sequence
MTYTRYQTLELEKEFHFNRYLTRRRRIEIAHALSLTERQIKIWFQNRRMKWKKENKTKGEPG
>1SAN:A:secstr
----HHHHHHHHHHHHH-SS--HHHHHHHHHHHT--SHHHHHHHHHHHHTTTTTS-TT-S--

Image Author: Carl Fürstenberg

Alpha helices are shown in colour, and random coil in white; there are no beta sheets shown.

Protein Secondary Structure Prediction - Motivation

• Important research problem in bioinformatics / biochemistry
• Of high importance for design of drugs and novel enzymes
• Determination of protein structures by experimental methods is lagging far behind discovery of protein sequences
• Predicting protein tertiary structure is an even more challenging problem, but it becomes more tractable when simpler secondary structure definitions are used; this is the focus of current research (the tertiary structure of a protein is its three-dimensional structure, as defined by the atomic coordinates)

Protein Secondary Structure Prediction Problem Description

• Input (Baldi et al., 2000)
– Amino acid sequence, A = a1, a2, …, aN
– Data for comparison, D = d1, d2, …, dN
– ai is an element of a set of 20 amino acids, {A,R,N…V}
– di is an element of a set of secondary structures, {H,E,C}, which represents helix H, sheet E, and coil C
• Output
– Prediction result: M = m1, m2, …, mN
– mi is an element of a set of secondary structures, {H,E,C}
• 3-Class Prediction (Zhang and Zhang, 2003)
– Multi-class prediction problem with 3 classes {H,E,C} in which one obtains a 3 x 3 confusion matrix Z = (zij)

Protein Secondary Structure Prediction Problem Description

• 3 x 3 matrix (3 classes)

               Prediction
               H      E      C
Reality   H    Z11
          E           Z22
          C                  Z33

Zij: input predicted to be in class j while in reality belonging to class i

Qtotal = 100 × (∑i Zii) / N (percentage)

Q3 Score

• Q3 = Wαα + Wββ + Wcc

Wαα = % of residues correctly predicted as helix

Wββ = % of residues correctly predicted as sheet

Wcc = % of residues correctly predicted as coil

• Example of Q3 calculation

Protein: 10% helices, 10% sheets, 80% coils
Prediction: 100% coils

Q3 = 0% + 0% + 80% = 80%

Q3 Score

• Q3 = Wαα + Wββ + Wcc

Wαα = % of residues correctly predicted as helix

Wββ = % of residues correctly predicted as sheet

Wcc = % of residues correctly predicted as coil

• Example of Q3 calculation, length 10

Amino acid (primary structure) sequence (A): MTYTRYQTLE

(Secondary structure) data for comparison (D): HHHEEECCCC

(Secondary structure) prediction (M): HHEEECCCCC

Q3 = 2/10 + 2/10 + 4/10 = 0.80
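
The length-10 example can be checked with a minimal Python sketch that builds the 3 x 3 confusion matrix and sums its diagonal. The function names `confusion_matrix` and `q3` are illustrative assumptions, and the score is returned as a fraction rather than a percentage.

```python
CLASSES = "HEC"  # helix, sheet, coil


def confusion_matrix(observed: str, predicted: str) -> list:
    """z[i][j] counts residues observed in class i and predicted as class j."""
    z = [[0] * 3 for _ in range(3)]
    for obs, pred in zip(observed, predicted):
        z[CLASSES.index(obs)][CLASSES.index(pred)] += 1
    return z


def q3(observed: str, predicted: str) -> float:
    """Fraction of residues that lie on the diagonal of the confusion matrix."""
    z = confusion_matrix(observed, predicted)
    return sum(z[i][i] for i in range(3)) / len(observed)


d = "HHHEEECCCC"  # data for comparison (D)
m = "HHEEECCCCC"  # prediction (M)
z = confusion_matrix(d, m)
score = q3(d, m)
print(z, score)
```

The diagonal entries are 2 (helix), 2 (sheet), and 4 (coil), which matches the 2/10 + 2/10 + 4/10 breakdown above. The same function applied to the earlier all-coil prediction also gives 0.80, which shows why a Q3 score alone can hide a predictor that never finds helices or sheets.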

Related Work

• Rost (2003) classifies protein secondary structure prediction methods into 3 generations

• First generation methods depend on single residue statistics to perform prediction

• Second generation methods depend on segment statistics

• Third generation methods use evolutionary information to predict secondary structure; e.g., PHD (Rost and Sander, 1993a)

• One of the best secondary structure predictors is the PSIPRED Protein Structure Prediction Server (Jones, 1999); it uses a two-stage neural network, based on position-specific scoring matrices

• A recent trend is to use support vector machines (SVMs) to predict protein secondary structures

Related Work

• Levitt and Chothia (1976) proposed to classify proteins as 4 basic types according to their α-helix and β-sheet content
– “All-α” class proteins consist almost entirely (at least 90%) of α-helices
– “All-β” class proteins are composed mostly of β-sheets (at least 90%)
– “α/β” class proteins have alternating, mainly parallel segments of α-helices and β-sheets
– “α+β” class proteins have a mixture of all-α and all-β regions, mostly in sequential order

Related Work

• Fadime, Özlem, and Metin (2008) used a different 2-stage method; Q3 74.1% (different test dataset)

• First stage determines class of unknown proteins with 100% accuracy

• Second stage uses probabilistic approach

• Simplifies problem: given a protein amino acid sequence, if it can be determined which one of the 4 classes the protein belongs to, other approaches can be applied to predict the secondary structure elements within the 4 classes

• Shows there are statistical relationships between a secondary structure element and its neighboring amino acid residues

Related Work

• Not easy to evaluate performance of a protein secondary structure prediction method (e.g., different datasets used for training and testing)

• Rost and Sander (1993a) selected a list of 126 protein domains (RS126); now constitutes comparative standard

• Cuff and Barton (1999) described development of non-redundant test set of 396 protein domains (CB396)

• PHD, one of the first methods surpassing the 70% accuracy threshold, uses multiple sequence alignments as input to a neural network (Rost and Sander, 1993b)

Related Work

• PHD effectively utilizes evolutionary information by exploiting the well-known fact that homologous proteins have similar 3D structures

• Random mutations in DNA sequence can lead to different amino acids in the protein sequences

• Mutations resulting in a structural change are not likely to retain protein function; thus, structure more conserved than sequence (Rost, 2003)

• Rost (2003) also has stated that a value of around 88% likely will be the operational upper limit for prediction accuracy

In evolutionary biology, homology refers to any similarity between characteristics of organisms that is due to their shared ancestry. Homology among proteins and DNA is often concluded on the basis of sequence similarity, especially in bioinformatics. For example, in general, if two or more genes have highly similar DNA sequences, it is likely that they are homologous. But sequence similarity may also arise without common ancestry:

Q3 Scores of Secondary Structure Prediction Methods

Methods | RS126 Test Dataset | CB396 Test Dataset | Other Test Datasets

PHD 73.5% 71.9%

DSC 71.1% 68.4%

PREDATOR 70.3% 68.6%

NNSSP 72.7% 71.4%

CONSENSUS 74.8% 72.9%

Fadime, 2-stage 74.1%

PSIPRED 78.3%

Hu, SVM 78.8%

Kim, SVMpsi 76.1% 78.5%

Nguyen, 2-stage SVM 78.0% 76.3%

Q3 Scores of Secondary Structure Prediction Methods

• Due to differences in approaches, data availability, and test design strategies, it is difficult to directly compare different methods’ prediction results

• Q3 scores comparison should be used as general guide, not strict percentile comparison

• Q3 scores under “Other Test Datasets” column should NOT be directly compared (uses different test datasets)

Background - RT-RICO

• We developed a rule-based secondary structure prediction method called RT-RICO

• Paper 1: Rule-based RT-RICO: improvements to the prediction algorithm; RS126 Q3 score 81.75%, CB396 Q3 score 79.19% (Lee, Leopold, Kandoth and Frank, 2010b)

• Paper 2: BLAST-RT-RICO: modified method BLAST-RT-RICO; RS126 Q3 score 89.93%, CB396 Q3 score 87.71% (Lee, Leopold and Frank, 2011)

• Paper 3: Rule Visualization: modifications to an existing visualization technique are proposed in order to visualize and analyze the RT-RICO and BLAST-RT-RICO association rules (Lee, Leopold, Edgett and Frank, 2010d)

Rule-Based RT-RICO (Paper 1)

RT-RICO Step 1

• All protein names and corresponding folding types of each protein retrieved from the SCOP database (Andreeva et al., 2008)

• All available corresponding protein sequences and secondary structure sequences obtained from PDB database (Berman et al., 2000)

• 5 databases of protein domains (with their amino acid sequences and secondary structure sequences) of different protein domain types (e.g., “all-α”, “all-β”, “α/β”, “α+β” and “others”) built

• Proteins from test datasets (RS126 or CB396) first removed; Protein domains from different protein families selected to form training datasets

Rule-Based RT-RICO (Paper 1), Step 1

Rule-Based RT-RICO (Paper 1) Step 1
Data Preparation

RT-RICO Step 1

• Protein secondary structure sequences from PDB formed from 8 states of secondary structure, {H, G, I, E, B, T, S, -}

• 8 states are converted to 4 states to facilitate rule generation (final Q3 calculation uses 3 states):
(G, H, I) => Helix H; (E, B) => Sheet E; (T, S) => Coil C; (-) => “-”

• Klepeis and Floudas (2002): use of overlapping segments of 5 residues effective in predicting the helical segments of proteins
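
A minimal Python sketch of this data preparation step follows. The state mapping is the one given above; the windowing function is an assumption about how overlapping 5-residue segments might be paired with a decision value (here, the reduced state of the window's centre residue), since the slides do not spell out which position supplies the decision. The names `reduce_states` and `segments` are hypothetical.

```python
# Mapping from the slide: (G, H, I) => H; (E, B) => E; (T, S) => C; "-" stays "-".
EIGHT_TO_FOUR = {
    "G": "H", "H": "H", "I": "H",
    "E": "E", "B": "E",
    "T": "C", "S": "C",
    "-": "-",
}


def reduce_states(secstr: str) -> str:
    """Convert an 8-state secondary structure string to the 4-state alphabet."""
    return "".join(EIGHT_TO_FOUR[state] for state in secstr)


def segments(sequence: str, secstr: str, width: int = 5):
    """Yield (amino-acid window, state of the window's centre residue).

    Using the centre residue as the decision is an assumption of this sketch.
    """
    half = width // 2
    for start in range(len(sequence) - width + 1):
        yield sequence[start:start + width], secstr[start + half]


reduced = reduce_states("GHITEBS-")
windows = list(segments("MTYTRYQ", reduce_states("----HHH")))
print(reduced)
print(windows)
```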

Rule-Based RT-RICO (Paper 1) Step 1
Data Preparation

Rule-Based RT-RICO (Paper 1) Step 1
Data Preparation

Rule-Based RT-RICO (Paper 1) Step 2
Rule Generation

• RT-RICO generates rules

Rule-Based RT-RICO (Paper 1) Step 2
Rule Generation

Rule-Based RT-RICO (Paper 1) Step 3
Prediction

• Loads protein primary structures from test dataset
• Predicts secondary structure elements

Rule-Based RT-RICO (Paper 1) Step 3
Prediction

• Each of these segments compared with generated rules; first searched for matching rules with 100% confidence value

• If no matching rule existed among 100% confidence value rules, searched for other matching rules (with confidence values ≥ 90%, but < 100%)

• Secondary structure element with highest total support value selected as predicted secondary structure element for the specific position

• If no matching rule found for the segment at all, secondary structure of the previous position used as predicted secondary structure
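
The prediction steps above can be sketched in Python as follows. This is a sketch under stated assumptions, not the authors' implementation: the rule representation (a dictionary with `antecedent`, `decision`, `confidence`, and `support` keys) and the function names are hypothetical, and the sample rules used afterwards are made up for illustration.

```python
from collections import defaultdict


def matches(rule, segment):
    """A rule matches when every (position, amino acid) pair agrees with the segment."""
    return all(segment[pos] == aa for pos, aa in rule["antecedent"].items())


def predict_position(segment, rules, previous):
    """Pick the state with the highest total support among matching rules."""
    # Tier 1: rules with 100% confidence; tier 2: confidence in [90%, 100%).
    tiers = (
        [r for r in rules if r["confidence"] == 1.0],
        [r for r in rules if 0.9 <= r["confidence"] < 1.0],
    )
    for tier in tiers:
        support = defaultdict(int)
        for rule in tier:
            if matches(rule, segment):
                support[rule["decision"]] += rule["support"]
        if support:
            return max(support, key=support.get)
    # No matching rule at all: reuse the previous position's prediction.
    return previous


# Made-up rules for illustration only.
sample_rules = [
    {"antecedent": {0: "M", 2: "Y"}, "decision": "H", "confidence": 1.0, "support": 3},
    {"antecedent": {1: "T"}, "decision": "C", "confidence": 0.95, "support": 10},
    {"antecedent": {4: "Q"}, "decision": "E", "confidence": 0.92, "support": 4},
    {"antecedent": {3: "R"}, "decision": "E", "confidence": 0.91, "support": 7},
]
print(predict_position("MTYTR", sample_rules, "-"))
print(predict_position("GTGRQ", sample_rules, "-"))
print(predict_position("GGGGG", sample_rules, "C"))
```

In the second call no 100% rule matches, so the 90% tier decides: sheet collects a total support of 11 against 10 for coil. What to pass as `previous` for the very first segment is left to the caller, because the slides do not say.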

Rule-Based RT-RICO (Paper 1) Step 3
Prediction

RT-RICO Rule Generation Algorithm (4 new definitions and 2 new algorithms)

• Algorithm RT-RICO (Relaxed Threshold Rule Induction From Coverings) finds the set C of all relaxed coverings of R in S (and the related rules), with threshold probability t (0 < t ≤ 1), where S is the set of all attributes, and R is the set of all decisions.

• The set of all subsets of the same cardinality k of the set S is denoted Pk = {{xi1, xi2, … , xik} | xi1, xi2, … , xik ∈ S}

Algorithm 2: RT-RICO
begin
    for each attribute x in S do
        compute [x]*;
    compute partition R*;
    k := 1;
    while k ≤ |S| do
        for each set P in Pk do
            if (∏x∈P [x]* ≤r,t R*) then
            begin
                find values of attributes from the entities that are in the region (B ∩ B’) such that (|B ∩ B’| / |B|) ≥ t;
                add rule to output file;
            end
        k := k+1
    end-while;
end-algorithm.
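
The following Python sketch illustrates the core idea of the loop in a simplified form, and should not be read as the authors' algorithm. For every attribute subset P of size k it groups entities into blocks B that agree on P, then emits a rule for each decision block B’ where |B ∩ B’| / |B| ≥ t. It omits the test of whether P as a whole is a relaxed covering, and it treats |B ∩ B’| as the rule's support; both are assumptions of the sketch, as is the name `generate_rules`.

```python
from collections import Counter, defaultdict
from itertools import combinations


def generate_rules(entities, decisions, t=0.9, max_k=None):
    """Return (antecedent, decision, confidence, support) tuples.

    entities  -- equal-length attribute sequences (e.g. 5-residue segments)
    decisions -- one decision value per entity
    t         -- threshold probability, 0 < t <= 1
    """
    n_attrs = len(entities[0])
    rules = []
    for k in range(1, (max_k or n_attrs) + 1):
        for positions in combinations(range(n_attrs), k):  # each set P in Pk
            # Blocks B: entities that agree on every attribute in P.
            blocks = defaultdict(list)
            for values, decision in zip(entities, decisions):
                key = tuple(values[p] for p in positions)
                blocks[key].append(decision)
            for key, block_decisions in blocks.items():
                # Counter gives |B ∩ B'| for each decision block B'.
                for decision, hits in Counter(block_decisions).items():
                    confidence = hits / len(block_decisions)
                    if confidence >= t:
                        antecedent = dict(zip(positions, key))
                        rules.append((antecedent, decision, confidence, hits))
    return rules


# Toy data: 2-attribute entities with made-up decisions.
toy_entities = ["AB", "AB", "AC", "DC"]
toy_decisions = ["H", "H", "E", "E"]
strict_rules = generate_rules(toy_entities, toy_decisions, t=1.0)
relaxed_rules = generate_rules(toy_entities, toy_decisions, t=0.6)
print(len(strict_rules), len(relaxed_rules))
```

Lowering t from 1.0 to 0.6 admits one extra rule on the toy data (attribute 0 = "A" implies "H" with confidence 2/3), which is the effect the relaxed threshold is meant to have.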

BLAST-RT-RICO (Paper 2)

• After Rule-Based RT-RICO (Paper 1), can we do better?
• Given input protein A (amino acid sequence, A = a1, a2, … aN), protein BLAST search (Web-based) performed using A as query sequence
• BLAST returns list of proteins with significant sequence alignments
• Suitable proteins chosen to form training dataset for A
• RT-RICO algorithm generates rules from the training dataset; rules used to predict the secondary structure for protein A
• Output is predicted secondary structure sequence M
• BLAST-RT-RICO is accepted for publication

BLAST-RT-RICO (Paper 2)

BLAST-RT-RICO (Paper 2)

Results (more tests needed)

Rule Visualization (Paper 3)

• Association rule is implication of the form X → Y where X is set of antecedent items, and Y is consequent item (Wong et al., 1999)

• Wong’s technique designed to handle only Boolean association rules (Han and Kamber, 2001), rules concerning only the presence or absence of attributes

• Our rules for secondary structure are multi-valued (considered quantitative)

• We generate numerous rules (e.g., 572,531 from “all-α” class training set)

Rule Visualization (Paper 3)

• Rules sorted by confidence value, then by support value

• Sorted this way due to prediction steps
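
As a small illustration of this ordering, the sketch below sorts a few made-up rules by confidence and then by support. Descending order and the dictionary layout are assumptions; the slides only state the two sort keys.

```python
# Hypothetical rules; the identifiers and values are made up for illustration.
rules = [
    {"id": "r1", "confidence": 0.95, "support": 12},
    {"id": "r2", "confidence": 1.00, "support": 3},
    {"id": "r3", "confidence": 1.00, "support": 8},
    {"id": "r4", "confidence": 0.95, "support": 20},
]

# Primary key: confidence; secondary key: support (both descending, by assumption).
ordered = sorted(rules, key=lambda r: (r["confidence"], r["support"]), reverse=True)
order_ids = [r["id"] for r in ordered]
print(order_ids)
```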

Rule Visualization (Paper 3)

• Can be visualized by modified version of Wong’s technique

• Different colors will represent different amino acids and different secondary structure elements

Rule Visualization (Paper 3)

• Interesting observations

• Only 15 different amino acids (instead of 20) appear

• All decision attribute values at position 5 are “H/Helix”

• Motivated to compare color patterns!

Rule Visualization (Paper 3)

• Positions 0 to 4 are antecedent items and position 5 is only consequent item

• Can change amino acids’ colors (or any attribute’s color) in 3D diagrams to represent different properties

• In Fig. 5 amino acid colors chosen according to different amino acid types (e.g., acidic, basic, nonpolar, and polar uncharged)

• Colors can be changed to distinguish amino acids of different sizes, or other relevant chemical properties

Rule Visualization (Paper 3)

As shown in Table V, amino acids belonging to same type use similar color shades (acidic: orange; basic: teal; nonpolar: green; polar uncharged: pink)

Rule Visualization (Paper 3)

• Colors can be changed to distinguish amino acids of different sizes (Fig. 10)

• Python programming language, matplotlib plotting library: zooming, rotating about any axis, and saving as image file
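
A rough sketch of how such a diagram might be drawn with matplotlib is shown below. It is not the authors' code. The type colors follow the slide (acidic: orange; basic: teal; nonpolar: green; polar uncharged: pink), while the grouping of individual amino acids into types, the secondary structure colors, and the function names are assumptions of this sketch. Each rule becomes one row of 3D bars, with positions 0 to 4 as antecedent items and position 5 as the consequent.

```python
# Grouping of amino acids into types is a common textbook choice, assumed here.
AMINO_ACID_TYPES = {
    "acidic": "DE",
    "basic": "KRH",
    "nonpolar": "AVLIMFWPG",
    "polar uncharged": "STCYNQ",
}
TYPE_OF = {aa: t for t, letters in AMINO_ACID_TYPES.items() for aa in letters}
TYPE_COLOR = {"acidic": "orange", "basic": "teal",
              "nonpolar": "green", "polar uncharged": "pink"}
STRUCT_COLOR = {"H": "red", "E": "blue", "C": "gray"}  # hypothetical palette


def rule_cells(rules):
    """Flatten rules into (rule index, position, color) triples."""
    cells = []
    for row, (antecedent, decision) in enumerate(rules):
        for pos, aa in sorted(antecedent.items()):
            cells.append((row, pos, TYPE_COLOR[TYPE_OF[aa]]))
        cells.append((row, 5, STRUCT_COLOR[decision]))  # consequent item
    return cells


def plot_rules(rules, path="rules.png"):
    """Draw one colored 3D bar per rule item and save the figure."""
    import matplotlib.pyplot as plt  # imported lazily; requires matplotlib

    fig = plt.figure()
    ax = fig.add_subplot(projection="3d")
    for row, pos, color in rule_cells(rules):
        ax.bar3d(pos, row, 0, 0.8, 0.8, 1.0, color=color)
    ax.set_xlabel("position (0-4 antecedent, 5 consequent)")
    ax.set_ylabel("rule")
    fig.savefig(path)


cells = rule_cells([({0: "K", 2: "L"}, "H")])
print(cells)
```

Calling `plt.show()` instead of `fig.savefig(path)` opens matplotlib's interactive window, where the 3D axes can be rotated and zoomed. Swapping `TYPE_COLOR` for a size-based palette would give the alternative coloring mentioned above.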

Rule Visualization (Paper 3)
Different Classes

• Rule sequences between Fig. 5 and Fig. 7 are clearly different

Rule Visualization (Paper 3)
Different Classes

• In the graph for "all-α" by amino acid type (Fig. 5), acidic and basic amino acids occur at frequency expected for number of amino acids in those groups

• Conversely, significant preponderance of nonpolar amino acids and a paucity of polar uncharged

• Although basic amino acids occur with expected frequency, overall concentrated in middle position, 2, with fewer at edge positions, 0 and 4

• Nonpolar amino acids not equally distributed by position; inverse of trend for basic amino acids (i.e. concentrated at edge positions, 0 and 4, fewer in middle position, 2)

Rule Visualization (Paper 3)
Different Test Proteins

• BLAST-RT-RICO uses BLAST search to find list of proteins with significant sequence alignments (for each test protein)
• Rules are generated from these proteins
• Using visualization technique, can more readily get sense of information that rules convey, and can compare rule sets for test proteins

• Proteins with significant sequence alignments may carry important evolutionary information!

Rule Visualization (Paper 3)
Different Test Proteins

• Fig. 12 and Fig. 13 help us visualize the concept that different sets of amino acids are responsible for the two rule sets.

Rule Visualization (Paper 3)
Different Test Proteins

• May lead to other future research topics related to protein secondary structure; e.g., encourages researcher to ask questions such as:

(1) how different rules (or groups of rules) affect the functions of an individual protein or a protein family,

(2) why certain rules only exist in one protein class, but not in another, and

(3) why some test proteins produce common rules although the proteins have different structure

Rule Visualization (Paper 3)

• Will help researchers discern patterns of residue association in protein structure as other more complex properties of those amino acids are applied to the visualization

• For brevity, figures each show only about 30 rules; on a 21” monitor, thousands of rules can be displayed and analyzed

• Implementation supports zooming, rotating, etc., allowing users to have “big picture” of a particular set of rules

Conclusion

• Novel rule-based method that generates rules for predicting protein secondary structure

• Rule-based RT-RICO (paper 1): Q3 accuracy scores of 81.75% for RS126 and 79.19% for CB396

• BLAST-RT-RICO approach (paper 2): Q3 scores of 89.93% for RS126 and 87.71% for CB396 – promising, but more tests needed for test proteins with “no known homologous template structures in the PDB database”.

Conclusion

• Rule Visualization (paper 3): technique to visualize those rules, compare rule sets between different protein classes, and compare rule sets of different test proteins

• In future, useful to construct BLAST-RT-RICO prediction server with functions to analyze training datasets and prediction results

• Also consider other properties of proteins and sequences of length > 5

• Conduct more tests

Questions?

• Robbins, R.J. (1992). Challenges in human genome project. IEEE Engineering in Medicine and Biology, 11, 25-34.

• “… Consider the 3.2 gigabytes of human genome as equivalent to 3.2 GB of files on the mass-storage device of some computer system of unknown design. … Reverse engineering that unknown computer system (both the hardware and the 3.2 GB of software) all the way back to the full set of design and maintain specifications. …. resulting image of the mass-storage device will not be a file-by-file copy, but rather a streaming dump of bytes… files are known to be fragmented… erased files… garbage… only a partial, and sometimes incorrect understanding of the CPU… 3.2 GB are the binary specifications… millions of maintenance revisions… spaghetti-coding… hackers… self-modifying code… and relying upon undocumented system quirks.”

Teaching Interests: Web Application Development (AmphibAnat.org)

• NSF funded ($1,116K)
• Web interface design: (different design templates)
• Client-side programming: JavaScript, CSS, HTML
• Server-side programming: C#.net
• Relational database design/admin: Microsoft SQL Server
• Server setup/admin: Microsoft IIS web server and Microsoft Windows server

Teaching Interests: Web Application Development (RDBOM Ontology Sys.)

• NSF funded

• Ontology theory / Automata / Algorithm Design

• Web interface design (different design templates)

• Client-side programming: JavaScript, CSS, HTML

• Server-side programming: C#.NET

• Relational database design/admin: Microsoft SQL Server

• Server setup/admin: Microsoft IIS web server and Microsoft Windows server


Teaching Interests: Web Application Development (leeleong.com)

• Web interface design (different design templates)

• Client-side programming: JavaScript, CSS, HTML

• Server-side programming: PHP

• Relational database design/admin: MySQL Server

• Server setup/admin: Apache web server

• Web graphics / photography

• Personal hobby


Teaching Interests: Web Design (web building projects)

• Common Call Campus Ministry

• RollaShootingClub.org


Teaching Interests: Skills

• Programming: MS ASP.NET, C#, PHP, MATLAB, Perl, C, C++, Java, JavaScript, Pascal, Flash ActionScript, Director Lingo, HTML, SMIL, XML

• Database: MySQL Database, MS SQL Server

• Server Administration: MS Win Server, MS IIS, Apache Web Server, Real/Helix Streaming Server

• Web/Multimedia: Adobe Dreamweaver, Fireworks, Flash, Director, Freehand, Photoshop, Premiere

• Streaming System: RealPlayer, Helix Producer, Helix Server, SMIL


Teaching Interests: New Course Development

• Qualified to teach any core computer science course at the undergraduate level as well as specialized graduate courses

• I would be most interested in developing (new courses)

– Advanced Bioinformatics

– Bioinformatics

– Data Mining

– Neural Networks & Applications

– Theory of Computation Courses

– Web Multimedia Development Courses (web application development, web game programming)

– Basic Web Design (basic design theories, web aesthetics, web interface design)


::: Thank You :::

Leong Lee, Ph.D. University of Missouri (MS&T)

Visiting Assistant Professor, Dept of Computer Science

University of North Carolina at Greensboro


References

Altschul, S. F., Madden, T. L., Schäffer, A. A., Zhang, J., Zhang, Z., Miller, W. and Lipman, D. J. (1997) ‘Gapped BLAST and PSI-BLAST: a new generation of protein database search programs’, Nucleic Acids Res., Vol. 25, No. 17, pp.3389-402.

Andreeva, A., Howorth, D., Chandonia, J. M., Brenner, S. E., Hubbard, T. J., Chothia, C. and Murzin, A. G. (2008) ‘Data growth and its impact on the SCOP database: new developments’, Nucleic Acids Res., Vol. 36 (Database issue), D419-25.

Baldi, P., Brunak, S., Chauvin, Y., Andersen, C. A. F. and Nielsen, H. (2000) ‘Assessing the accuracy of prediction algorithms for classification: an overview’, Bioinformatics, Vol. 16, No. 5, pp.412-24.

Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N., Weissig, H., Shindyalov, I. N. and Bourne, P. E. (2000) ‘The Protein Data Bank’, Nucleic Acids Res., Vol. 28, No. 1, pp.235-42.

BLAST (2009). BLAST: Basic Local Alignment Search Tool. Obtained through the Internet: http://blast.ncbi.nlm.nih.gov/, [accessed 30/11/2009]

Bryson, K., McGuffin, L. J., Marsden, R. L., Ward, J. J., Sodhi, J. S. and Jones, D. T. (2005) ‘Protein structure prediction servers at University College London’, Nucleic Acids Res., Vol. 33 (Web Server issue), W36-8.

Cuff, J. A. and Barton, G. (1999) ‘Evaluation and improvement of multiple sequence methods for protein secondary structure prediction’, Proteins, Vol. 34, pp.508-519.

Cuff, J. A. and Barton, G. (2000) ‘Application of multiple sequence alignment profiles to improve protein secondary structure prediction’, Proteins, Vol. 40, No. 3, pp.502-11.

Fadime, U. Y., Özlem, Y. and Metin, T. (2008) ‘Prediction of secondary structures of proteins using a two-stage method’, Computers & Chemical Engineering, Vol. 32, No. 1-2, pp.78-88.


References

Frishman, D. and Argos, P. (1997) ‘Seventy-five percent accuracy in protein secondary structure prediction’, Proteins, Vol. 27, pp.329-335.

Grzymala-Busse, J. W. (1991) ‘Ch.3. Knowledge Acquisition’, Managing Uncertainty in Expert Systems, (pp.43-76), Boston: Kluwer Academic.

Han, J. and Kamber, M. (2001) Data Mining: Concepts and Techniques, (pp.155-157) Morgan Kaufmann.

Hu, H., Pan, Y., Harrison, R. and Tai, P. (2004) ‘Improved protein secondary structure prediction using support vector machine and a new encoding scheme and an advanced tertiary classifier’, IEEE Trans. NanoBiosci., Vol. 3, pp.265-271.

Jones, D. T. (1999) ‘Protein secondary structure prediction based on position-specific scoring matrices’, J. Mol. Biol., Vol. 292, No. 2, pp.195-202.

Jones, N. C. and Pevzner, P. A. (2004) An Introduction to Bioinformatics Algorithms, MIT Press.

Kabsch, W. and Sander, C. (1983) ‘How good are predictions of protein secondary structure?’, FEBS Letters, Vol. 155, pp.179-182.

Kim, H. and Park, H. (2003) ‘Protein secondary structure prediction based on an improved support vector machines approach’, Protein Eng., Vol. 16, pp.553-60.

King, R. D. and Sternberg, M. J. E. (1996) ‘Identification and application of the concepts important for accurate and reliable protein secondary structure prediction’, Protein Sci., Vol. 5, pp.2298-2310.

Klepeis, J. L. and Floudas, C. A. (2002) ‘Ab initio prediction of helical segments in polypeptides’, J. Comput. Chem., Vol. 23, No. 2, pp.245-66.


References

Leopold, J. L., Maglia, A. M., Thakur, M., Patel, B. and Ercal, F. (2007) ‘Identifying Character Non-Independence in Phylogenetic Data Using Parallelized Rule Induction From Coverings’, Data Mining VIII: Data, Text, and Web Mining and Their Business Applications, WIT Transactions on Information and Communication Technologies, Vol. 38, pp.45-54.

Levitt, M. and Chothia, C. (1976) ‘Structural patterns in globular proteins’, Nature, Vol. 261, No. 5561, pp.552-8.

Lee, L., Leopold, J. L., Frank, R. L. and Maglia, A. M. (2009) ‘Protein Secondary Structure Prediction Using Rule Induction from Coverings’, Proceedings of IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology 2009, Nashville, Tennessee, USA, pp.79-86.

Lee, L., Kandoth, C., Leopold, J. L. and Frank, R. L. (2010a) ‘Protein Secondary Structure Prediction Using Parallelized Rule Induction from Coverings’, International Journal of Medicine and Medical Sciences, Vol. 1, No. 2, pp.99-105.

Lee, L., Leopold, J. L., Kandoth, C. and Frank, R. L. (2010b) ‘Protein secondary structure prediction using RT-RICO: a rule-based approach’, The Open Bioinformatics Journal, Vol. 4, pp.17-30.

Lee, L., Leopold, J. L., Edgett, P. G. and Frank, R. L. (2010c) ‘Rule Visualization of Protein Motif Sequence Data for Secondary Structure Prediction’, Proceedings of ANNIE 2010 conference, St. Louis, Missouri, USA.

Lee, L., Leopold, J. L. and Frank, R. L. (2011) ‘Protein secondary structure prediction using BLAST and Relaxed Threshold Rule Induction from Coverings’, Proceedings of IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology 2011, Paris, France, accepted for publication.

Lesk, A. M. (2008) Introduction to Bioinformatics, 3rd Edition, Oxford.

Maglia, A. M., Leopold, J. L. and Ghatti, V. R. (2004) ‘Identifying Character Non-Independence in Phylogenetic Data Using Data Mining Techniques’, Proc. Second Asia-Pacific Bioinformatics Conference, Dunedin, New Zealand.


References

Murzin, A. G., Brenner, S. E., Hubbard, T. and Chothia, C. (1995) ‘SCOP: a structural classification of proteins database for the investigation of sequences and structures’, J. Mol. Biol., Vol. 247, No. 4, pp.536-40.

Nguyen, N. and Rajapakse, J. C. (2007) ‘Two stage support vector machines for protein secondary structure prediction’, Intl. J. Data Mining & Bioinformatics, Vol. 1, pp.248-269.

Pawlak, Z. (1984) ‘Rough Classification’, Int. J. Man-Machine Studies, Vol. 20, pp.469-483.

Rost, B. and Sander, C. (1993a) ‘Prediction of protein secondary structure at better than 70% accuracy’, J. Mol. Biol., Vol. 232, pp.584-599.

Rost, B. and Sander, C. (1993b) ‘Improved prediction of protein secondary structure by use of sequence profiles and neural networks’, Proc. Natl. Acad. Sci. USA, Vol. 90, pp.7558-7562.

Rost, B. (2003) ‘Rising accuracy of protein secondary structure prediction’, In: Chasman, D. (Ed.), Protein structure determination, analysis, and modeling for drug discovery, (pp.207-249), New York: Dekker.

Salamov, A. A. and Solovyev, V. V. (1995) ‘Prediction of protein secondary structure by combining nearest-neighbor algorithms and multiple sequence alignments’, J. Mol. Biol., Vol. 247, pp.11-15.

Tramontano, A. (2006) Protein Structure Prediction, Wiley-VCH.

Wong, P. C., Whitney, P. and Thomas, J. (1999) ‘Visualizing Association Rules for Text Mining’, Proceedings of the 1999 IEEE Symposium on Information Visualization, pp.120-123, 152.

Zhang, C. T. and Zhang, R. (2003) ‘Q9, a content-balancing accuracy index to evaluate algorithms of protein secondary structure prediction’, Int. J. Biochem. Cell Biol., Vol. 35, No. 8, pp.1256-62.