Protein Local 3D Structure Prediction by Super Granule Support Vector Machines (Super GSVM)

Dr. Bernard Chen, Assistant Professor

Department of Computer Science, University of Central Arkansas

Fall 2009

Goal of the Dissertation

The main purpose is to obtain and extract protein sequence motif information that is universally conserved across protein family boundaries.

This information is then used for protein local 3D structure prediction.

Research Flow

Part 1: Bioinformatics Knowledge and Dataset Collection

Part 2: Discovering Protein Sequence Motifs

Part 3: Motif Information Extraction

Part 4: Protein Local Tertiary Structure Prediction

Data set

HSSP matrix: 1b25

Representation of Segment

Sliding window size: 9. Each window corresponds to a sequence segment, which is represented by a 9 × 20 matrix plus the nine corresponding secondary structure labels obtained from DSSP.

More than 560,000 segments (413 MB) are generated by this method.
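The segment representation above can be sketched as follows. This is a minimal illustration, not the author's implementation: the function name `make_segments` and the input layout (one row of 20 frequencies per residue, one secondary structure letter per residue) are assumptions made for the example.

```python
WINDOW = 9  # sliding window size stated in the slides

def make_segments(profile, sec_struct, window=WINDOW):
    """Slide a window over one protein and return (matrix, labels) pairs.

    profile    : list of rows, each row holding 20 values (hypothetical layout
                 for an HSSP-style frequency profile)
    sec_struct : string with one DSSP secondary structure letter per residue
    """
    if len(profile) != len(sec_struct):
        raise ValueError("profile and secondary structure must have equal length")
    segments = []
    for start in range(len(profile) - window + 1):
        matrix = profile[start:start + window]     # window x 20 matrix
        labels = sec_struct[start:start + window]  # window structure labels
        segments.append((matrix, labels))
    return segments

# Toy protein of 12 residues with a uniform profile (illustrative data only).
profile = [[0.05] * 20 for _ in range(12)]
sec = "CCHHHHHHEECC"
segs = make_segments(profile, sec)
print(len(segs))  # 12 - 9 + 1 = 4 segments
```

A protein of length n yields n - 8 overlapping segments, which is why a few thousand proteins produce several hundred thousand segments.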

DSSP: Obtaining Secondary Structure Information

Granular Computing Model

Original dataset

Fuzzy C-Means Clustering

Information Granule 1 ... Information Granule M

New Improved or Greedy K-means Clustering (one run per information granule)

Join Information

Final Sequence Motifs Information
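The first stage of the model, splitting the dataset into information granules by fuzzy membership, can be sketched as below. This shows only the standard fuzzy c-means membership formula for fixed centers, not the full iterative algorithm. The membership `threshold`, the fuzzifier `m`, and the function names are assumptions for illustration and are not taken from the slides.

```python
import math

def fcm_memberships(point, centers, m=2.0):
    """Standard fuzzy c-means membership of one point in each center (m > 1)."""
    dists = [math.dist(point, c) for c in centers]
    # A point lying exactly on a center belongs fully to that center.
    if any(d == 0.0 for d in dists):
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_k) ** exponent for d_k in dists) for d_i in dists]

def split_into_granules(points, centers, threshold=0.3, m=2.0):
    """Put each point into every granule where its membership >= threshold.

    The threshold value is a hypothetical choice for this sketch.
    """
    granules = [[] for _ in centers]
    for idx, point in enumerate(points):
        for g, u in enumerate(fcm_memberships(point, centers, m)):
            if u >= threshold:
                granules[g].append(idx)
    return granules

# Toy 2-D data: two points near one center each, one point halfway between.
centers = [(0.0, 0.0), (10.0, 0.0)]
points = [(1.0, 0.0), (9.0, 0.0), (5.0, 0.0)]
granules = split_into_granules(points, centers)
print(granules)  # [[0, 2], [1, 2]]: the midpoint lands in both granules
```

Each granule would then be clustered separately with K-means, and the resulting clusters joined; that part is omitted here.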

Reducing Time Complexity

Wei’s method: 1285968 sec (15 days) * 6 = 7715568 sec (90 days)

Granular Model: 154899 sec (FCM execution time) + 231720 sec (2.7 days) * 6 = 1545219 sec (18 days)
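The granular model total can be checked with a few lines of arithmetic. The variable names are illustrative, and the total for Wei's method is taken as quoted on the slide rather than recomputed.

```python
SECONDS_PER_DAY = 86400

fcm_time = 154899         # FCM execution time in seconds, from the slide
per_run_time = 231720     # time of one clustering run in seconds (about 2.7 days)
repeats = 6               # the factor of 6 used on the slide

granular_total = fcm_time + per_run_time * repeats
wei_total = 7715568       # total for Wei's method as quoted on the slide

print(granular_total)                                  # 1545219
print(round(granular_total / SECONDS_PER_DAY, 1))      # 17.9 days
print(round(wei_total / granular_total, 2))            # roughly a 5x reduction
```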

Comparison of Quality Measures

Different Methods     >60%      S.D.    >70%      S.D.    H-B Measure
Traditional           25.82%    0.93    10.44%    0.61    0.2543
Zhong-60-1020         31.46%    0.26    10.42%    0.59    0.2871
Zhong-61-985          31.71%    0.81    10.84%    0.07    0.2784
Zhong-62-900          31.04%    0.19    10.29%    0.64    0.2768
FCM-K-means           37.14%    1.46    12.99%    0.74    0.3589

FIK Model
FIK Model 0           40.15%    1.09    13.44%    0.49    0.3730
FIK Model 800         40.23%    0.45    13.37%    0.58    0.3717
FIK Model 1000        39.15%    0.39    13.27%    0.29    0.3665
FIK Model 1200        38.90%    0.43    12.89%    0.77    0.3697
FIK Model 1400        37.80%    0.80    12.59%    0.44    0.3655

FGK Model
FGK Model 200         42.45%    0.06    14.14%    0.02    0.3393
FGK Model 250         42.77%    0.07    14.06%    0.07    0.3443
FGK Model 300         41.08%    0.14    13.89%    0.02    0.3311
FGK Model 350         37.47%    0.51    13.49%    0.14    0.3489
FGK Model 400         37.62%    1.56    13.86%    1.29    0.3676

Best Selection        44.18%    0       15.02%    0       0.3664

Super GSVM-FE Motivation

First, the information we try to generate is about sequence motifs, but the original input data are derived from whole protein sequences by a sliding window technique.

Second, fuzzy c-means clustering can assign one segment to more than one information granule.

Super GSVM-FE

Original dataset

Greedy K-means Clustering

For each cluster: Ranking SVM Feature Elimination

Collect survived segments

Five iterations of traditional K-means

Join Information

Final Sequence Motifs Information

Extracted Motif Information

(The per-cluster steps, from Ranking SVM Feature Elimination through the five iterations of traditional K-means, are the additional portion introduced by Super GSVM-FE.)
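The elimination step inside each cluster can be sketched as below. The 20% fraction comes from the Super GSVM training step described later in this deck. The scores are hypothetical stand-ins for the output of a trained Ranking SVM, which is not implemented here, and the function name is illustrative.

```python
def eliminate_lowest(members, scores, fraction=0.2):
    """Return the members that survive after dropping the lowest-scored fraction.

    members : cluster members (for example segment identifiers)
    scores  : one ranking score per member; higher means a better fit
              (hypothetical stand-in for a Ranking SVM output)
    """
    n_drop = int(len(members) * fraction)
    # Indices ordered from lowest to highest score.
    order = sorted(range(len(members)), key=lambda i: scores[i])
    dropped = set(order[:n_drop])
    return [m for i, m in enumerate(members) if i not in dropped]

# Toy cluster with ten segments (illustrative data only).
cluster = ["seg%d" % i for i in range(10)]
scores = [0.9, 0.1, 0.5, 0.7, 0.3, 0.8, 0.2, 0.6, 0.4, 1.0]
survivors = eliminate_lowest(cluster, scores)
print(survivors)  # seg1 and seg6, the two lowest-scored segments, are removed
```

The survived segments of all clusters would then be collected and refined with the five iterations of traditional K-means.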

3D Information

3D information is generated from the PDB (Protein Data Bank). An example is the 1a3c PDB file.

Testing Data

The latest release of PISCES includes 4345 PDB files. Compared with the dataset in our experiment, 2419 of these PDB files are not in our dataset.

Therefore, we regard our 2710 protein files as the training dataset and these 2419 protein files as the independent testing dataset.
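The selection of the independent testing set amounts to a set difference between the latest PISCES list and the training list. A minimal sketch, with made-up identifiers and an illustrative function name:

```python
def independent_test_ids(latest_release, training_ids):
    """Return the entries of the latest release that are absent from training."""
    return sorted(set(latest_release) - set(training_ids))

# Toy identifiers, not real dataset contents.
training_ids = ["aaaa", "bbbb", "cccc"]
latest_release = ["bbbb", "cccc", "dddd", "eeee"]
test_ids = independent_test_ids(latest_release, training_ids)
print(test_ids)  # ['dddd', 'eeee']
```

Applied to the real lists, this kind of difference would give the 2419 files used for independent testing.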

Testing Data

We convert the testing dataset into segments with the same sliding window approach introduced earlier. More than 490,000 segments are generated as the testing dataset.

Super GSVM

Training: Training dataset

For each cluster: train a Ranking SVM and then eliminate the 20% lower-ranked members

Five iterations of traditional K-means

Collect all extracted clusters and Ranking SVMs (all sequence clusters, all Ranking SVMs)

Testing: Independent testing dataset

Find the closest cluster within a given distance threshold

Feed the segment to the Ranking SVM belonging to that cluster

If the rank belongs to the cluster, predict the local 3D structure

If not, find the next closest cluster
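The testing procedure above can be sketched as a loop over clusters ordered by distance. This is an illustration under stated assumptions: each cluster is a dictionary with a `center`, an `accepts` callable standing in for the Ranking SVM test "the rank belongs to the cluster", and a `structure` label. All of these names, and the Euclidean distance, are hypothetical choices for the example.

```python
import math

def predict_structure(segment, clusters, max_distance):
    """Try clusters from closest to farthest; return a structure or None.

    None means no cluster within max_distance accepted the segment,
    so no prediction is made for it.
    """
    by_distance = sorted(clusters, key=lambda c: math.dist(segment, c["center"]))
    for cluster in by_distance:
        if math.dist(segment, cluster["center"]) > max_distance:
            break  # all remaining clusters are even farther away
        if cluster["accepts"](segment):  # stand-in for the Ranking SVM check
            return cluster["structure"]
    return None

# Toy clusters in 2-D (illustrative data only).
clusters = [
    {"center": (0.0, 0.0), "accepts": lambda s: s[0] < 0.5, "structure": "A"},
    {"center": (2.0, 0.0), "accepts": lambda s: True, "structure": "B"},
]
print(predict_structure((0.2, 0.0), clusters, 3.0))   # A: closest cluster accepts
print(predict_structure((0.8, 0.0), clusters, 3.0))   # B: closest rejects, next accepts
print(predict_structure((10.0, 0.0), clusters, 3.0))  # None: nothing within threshold
```

Segments for which the function returns None receive no prediction, which is one plausible reason to report prediction coverage alongside prediction accuracy.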

Prediction Accuracy

Prediction Coverage

Future Works

Incorporate Chou-Fasman parameters into SVM training.

Future Works

For each cluster, build a decision tree (DT) instead of an SVM model.

Training: Training dataset

For each cluster: build a decision tree

Collect all extracted clusters and Ranking-SVMs (all sequence clusters)

Test by DT: Independent testing dataset

Find the closest cluster within a given distance threshold

Feed the segment to the DT belonging to that cluster

If the rank belongs to the cluster, predict the local 3D structure

If not, find the next closest cluster
