Protein Local 3D Structure Prediction by Super Granule Support Vector Machines (Super GSVM)
Dr. Bernard Chen, Assistant Professor
Department of Computer Science, University of Central Arkansas
Fall 2009
Jan 01, 2016
Goal of the Dissertation
The main purpose is to obtain and extract protein sequence motif information that is universally conserved across protein family boundaries,
and then to use this information for protein local 3D structure prediction.
Research Flow
Part 1: Bioinformatics Knowledge and Dataset Collection
Part 2: Discovering Protein Sequence Motifs
Part 3: Motif Information Extraction
Part 4: Protein Local Tertiary Structure Prediction
Data set
HSSP matrix: 1b25
Representation of Segment
Sliding window size: 9
Each window corresponds to a sequence segment, which is represented by a 9 × 20 matrix plus nine corresponding secondary structure labels obtained from DSSP.
More than 560,000 segments (413 MB) are generated by this method.
DSSP: used to obtain secondary structure information
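The segment representation can be sketched as follows. This is a minimal illustration, not the code used in the study: the function name `sliding_segments`, the toy profile values, and the step of one residue between windows are assumptions. Only the window size of 9 and the 20 columns per position come from the slide.

```python
WINDOW = 9  # sliding window size stated on the slide


def sliding_segments(profile, structure, window=WINDOW):
    """Yield (rows, labels) for each window of consecutive residues.

    profile   -- list of per-residue rows with 20 values each (HSSP-like)
    structure -- string of per-residue secondary structure labels (DSSP-like)
    A step of one residue between windows is assumed here.
    """
    if len(profile) != len(structure):
        raise ValueError("profile and structure must have the same length")
    for start in range(len(profile) - window + 1):
        end = start + window
        yield profile[start:end], structure[start:end]


# Toy input: 12 residues with placeholder profile values (not real HSSP data).
toy_profile = [[5] * 20 for _ in range(12)]
toy_structure = "HHHHEEEECCCC"

segments = list(sliding_segments(toy_profile, toy_structure))
print(len(segments), "segments, each", len(segments[0][0]), "x", len(segments[0][0][0]))
```

A protein of length n yields n - 8 segments under these assumptions, which is how a few thousand proteins can produce several hundred thousand segments.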
Granular Computing Model (flow diagram)
Original dataset
Fuzzy C-Means Clustering
Information Granule 1 ... Information Granule M
New Improved or Greedy K-means Clustering (applied to each granule)
Join Information
Final Sequence Motifs Information
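The Fuzzy C-Means step that splits the dataset into information granules can be illustrated with the standard membership formula. This is a sketch under assumptions: the Euclidean distance, the fuzzifier `m=2.0`, the membership `threshold=0.3`, and the function names are all hypothetical, and the centers are toy values rather than trained ones.

```python
import math


def fcm_memberships(point, centers, m=2.0):
    """Standard fuzzy c-means membership of one point in each cluster."""
    dists = [math.dist(point, c) for c in centers]
    # A point lying exactly on a center belongs fully to that center.
    if any(d == 0.0 for d in dists):
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_k) ** exponent for d_k in dists) for d_i in dists]


def assign_to_granules(point, centers, threshold=0.3, m=2.0):
    """Return every granule whose membership reaches the (assumed) threshold."""
    memberships = fcm_memberships(point, centers, m)
    return [i for i, u in enumerate(memberships) if u >= threshold]


# Toy 2-D centers; real segments would be flattened 9 x 20 matrices.
centers = [(0.0, 0.0), (4.0, 0.0), (0.0, 10.0)]
near_first = assign_to_granules((0.5, 0.0), centers)
between = assign_to_granules((2.0, 0.0), centers)
print(near_first, between)
```

A point close to one center lands in a single granule, while a point midway between two centers lands in both, so the granules may overlap.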
Reduce Time Complexity
Wei's method: 1285968 sec (15 days) * 6 = 7715568 sec (90 days)
Granular Model: 154899 sec (FCM execution time) + 231720 sec (2.7 days) * 6 = 1545219 sec (18 days)
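The Granular Model total can be checked with a few lines of arithmetic. The variable names are illustrative, the meaning of the factor 6 is not stated on the slide (it is treated here as a repeat count), and the total for Wei's method is taken as quoted.

```python
SECONDS_PER_DAY = 24 * 60 * 60

fcm_seconds = 154899        # FCM execution time quoted on the slide
per_run_seconds = 231720    # per-run time quoted on the slide (about 2.7 days)
repeats = 6                 # factor quoted on the slide; its meaning is assumed

granular_total = fcm_seconds + per_run_seconds * repeats
granular_days = granular_total / SECONDS_PER_DAY

stated_wei_total = 7715568  # total quoted on the slide for Wei's method
ratio = stated_wei_total / granular_total

print(granular_total, round(granular_days, 1), round(ratio, 2))
```

With the quoted figures, the granular model needs roughly one fifth of the time quoted for Wei's method.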
Comparison of Quality Measures
Different Methods   >60%    S.D.  >70%    S.D.  H-B Measure
Traditional         25.82%  0.93  10.44%  0.61  0.2543
Zhong-60-1020       31.46%  0.26  10.42%  0.59  0.2871
Zhong-61-985        31.71%  0.81  10.84%  0.07  0.2784
Zhong-62-900        31.04%  0.19  10.29%  0.64  0.2768
FCM-K-means         37.14%  1.46  12.99%  0.74  0.3589
FIK Model
FIK Model 0         40.15%  1.09  13.44%  0.49  0.3730
FIK Model 800       40.23%  0.45  13.37%  0.58  0.3717
FIK Model 1000      39.15%  0.39  13.27%  0.29  0.3665
FIK Model 1200      38.90%  0.43  12.89%  0.77  0.3697
FIK Model 1400      37.80%  0.80  12.59%  0.44  0.3655
FGK Model
FGK Model 200       42.45%  0.06  14.14%  0.02  0.3393
FGK Model 250       42.77%  0.07  14.06%  0.07  0.3443
FGK Model 300       41.08%  0.14  13.89%  0.02  0.3311
FGK Model 350       37.47%  0.51  13.49%  0.14  0.3489
FGK Model 400       37.62%  1.56  13.86%  1.29  0.3676
Best Selection      44.18%  0     15.02%  0     0.3664
Super GSVM-FE Motivation
First, the information we try to generate concerns sequence motifs, but the original input data are derived from whole protein sequences by a sliding window technique.
Second, fuzzy c-means clustering can assign one segment to more than one information granule.
Super GSVM-FE (flow diagram)
Original dataset
Fuzzy C-Means Clustering
Information Granule 1 ... Information Granule M
Greedy K-means Clustering (applied to each granule)
Additional portion (Super GSVM-FE), for each cluster:
Ranking SVM Feature Elimination
Collect Survived Segments
Five iterations of traditional K-means
Join Information
Final Sequence Motifs Information
Extracted Motif Information
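The elimination and "collect survived segments" steps can be sketched as a simple filter over ranking scores. This is an assumption-laden illustration: the scores would come from a trained Ranking SVM, which is not reproduced here, and the function name `drop_lowest_ranked` is hypothetical. The default `drop_fraction=0.2` mirrors the 20% figure quoted for Super GSVM elsewhere in this deck.

```python
def drop_lowest_ranked(members, scores, drop_fraction=0.2):
    """Return the members that survive after removing the lowest-scored ones.

    members       -- segment identifiers belonging to one cluster
    scores        -- one ranking score per member (higher means better rank)
    drop_fraction -- share of members to eliminate (assumed parameter)
    """
    if len(members) != len(scores):
        raise ValueError("members and scores must have the same length")
    n_drop = int(len(members) * drop_fraction)
    ascending = sorted(range(len(members)), key=lambda i: scores[i])
    dropped = set(ascending[:n_drop])
    # Preserve the original order of the surviving members.
    return [m for i, m in enumerate(members) if i not in dropped]


# Toy cluster with made-up scores standing in for Ranking SVM output.
cluster = ["seg_a", "seg_b", "seg_c", "seg_d", "seg_e"]
scores = [0.9, 0.1, 0.5, 0.7, 0.3]
survivors = drop_lowest_ranked(cluster, scores)
print(survivors)
```

The surviving segments of each cluster would then be passed to the K-means refinement step.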
3D information
3D information is generated from PDB (Protein Data Bank) files; the example shown is the 1a3c PDB file.
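Reading coordinates from a PDB file can be sketched as below. The slide does not say which quantities are derived from the coordinates, so this example only extracts alpha-carbon positions from fixed-column ATOM records. The helper `format_atom_line` and the toy coordinates are hypothetical and are not taken from 1a3c.

```python
import math


def format_atom_line(serial, name, res_name, chain, res_seq, x, y, z):
    """Build a column-exact ATOM record for the toy example (hypothetical helper)."""
    return (
        f"{'ATOM':<6}{serial:>5} {name:<4} {res_name:>3} {chain}{res_seq:>4}    "
        f"{x:8.3f}{y:8.3f}{z:8.3f}"
    )


def alpha_carbon_coords(lines):
    """Collect (x, y, z) for every CA atom, using the PDB fixed columns."""
    coords = []
    for line in lines:
        if line[0:6] == "ATOM  " and line[12:16].strip() == "CA":
            coords.append(
                (float(line[30:38]), float(line[38:46]), float(line[46:54]))
            )
    return coords


# Toy records with made-up coordinates.
toy_lines = [
    format_atom_line(1, " N  ", "MET", "A", 1, -1.2, 0.5, 0.0),
    format_atom_line(2, " CA ", "MET", "A", 1, 0.0, 0.0, 0.0),
    format_atom_line(3, " CA ", "LYS", "A", 2, 3.8, 0.0, 0.0),
]
ca = alpha_carbon_coords(toy_lines)
ca_distance = math.dist(ca[0], ca[1])
print(ca, ca_distance)
```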
Testing Data
The latest release of PISCES includes 4345 PDB files. Compared with the dataset used in our experiment, 2419 of these PDB files are not in our dataset.
Therefore, we regard our 2710 protein files as the training dataset and the 2419 protein files as the independent testing dataset.
Testing Data
We convert the testing dataset using the approach introduced earlier; more than 490,000 segments are generated as the testing dataset.
Super GSVM (flow diagram)
Training:
Training dataset
Fuzzy C-Means Clustering
Information Granule 1 ... Information Granule M
Greedy K-means Clustering (applied to each granule)
For each cluster: train a Ranking SVM, then eliminate the 20% lowest-ranked members
For each cluster: five iterations of traditional K-means
Collect all extracted clusters and Ranking SVMs (all sequence clusters, all Ranking SVMs)
Testing:
Independent testing dataset
Find the closest cluster within a given distance threshold
Feed the segment to the SVM belonging to that cluster
If the rank belongs to the cluster, predict the local 3D structure
If not, find the next closest cluster
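The testing procedure in the diagram can be sketched as a cascade over clusters ordered by distance. Everything named here is an assumption for illustration: the dictionary layout, the `accepts` callable standing in for a cluster's Ranking SVM decision, the Euclidean distance, and the choice to return `None` when no cluster accepts the segment.

```python
import math


def predict_local_structure(segment, clusters, threshold):
    """Try clusters from closest to farthest until one accepts the segment.

    clusters  -- list of dicts with 'center', 'accepts', and 'structure'
    threshold -- maximum distance at which a cluster is still considered
    Returns the accepting cluster's structure, or None if there is none.
    """
    by_distance = sorted(
        ((math.dist(segment, c["center"]), c) for c in clusters),
        key=lambda pair: pair[0],
    )
    for distance, cluster in by_distance:
        if distance > threshold:
            break  # every remaining cluster is farther away still
        if cluster["accepts"](segment):
            return cluster["structure"]
    return None


# Toy clusters; the lambdas stand in for trained Ranking SVM decisions.
toy_clusters = [
    {"center": (0.0, 0.0), "accepts": lambda s: False, "structure": "helix-like"},
    {"center": (1.0, 0.0), "accepts": lambda s: True, "structure": "strand-like"},
    {"center": (9.0, 9.0), "accepts": lambda s: True, "structure": "coil-like"},
]
wide = predict_local_structure((0.2, 0.0), toy_clusters, threshold=2.0)
narrow = predict_local_structure((0.2, 0.0), toy_clusters, threshold=0.5)
print(wide, narrow)
```

Under this reading, a tighter distance threshold leaves more segments without a prediction, which is one way a trade-off between accuracy and coverage could arise.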
Prediction Accuracy
Prediction Coverage
Future Works
Incorporate Chou-Fasman parameters for SVM training
Future Works
For each cluster, build a Decision Tree instead of an SVM model
Decision Tree variant (flow diagram)
Training:
Training dataset
Fuzzy C-Means Clustering
Information Granule 1 ... Information Granule M
Greedy K-means Clustering (applied to each granule)
For each cluster: build a Decision Tree
For each cluster: five iterations of traditional K-means
Collect all extracted clusters and Ranking-SVMs (all sequence clusters)
Test by DT:
Independent testing dataset
Find the closest cluster within a given distance threshold
Feed the segment to the DT belonging to that cluster
If the rank belongs to the cluster, predict the local 3D structure
If not, find the next closest cluster