An algorithm to guide selection of specific biomolecules to be studied by wet-lab experiments Jessica Wehner and Madhavi Ganapathiraju Department of Biomedical Informatics University of Pittsburgh School of Medicine Pittsburgh PA USA Presented by Thahir P. Mohamed Advancing Practice, Instruction & Innovation through Informatics October 19-23, 2008
An algorithm to guide selection of specific biomolecules to be studied by wet-lab experiments
Jessica Wehner and Madhavi Ganapathiraju
Department of Biomedical Informatics
University of Pittsburgh School of Medicine
Pittsburgh PA USA
Presented by Thahir P. Mohamed
Advancing Practice, Instruction & Innovation through Informatics
October 19-23, 2008
2
Protein Structure
Primary Structure: Chain of amino acids
Secondary Structure: Sub-structures such as helices and strands
Tertiary Structure: Atomic resolution of protein structure
Knowing a protein's structure is essential for the successful design of drugs
3
Challenges in Protein Structure Prediction
• X-ray crystallography and NMR spectroscopy are wet-lab methods used to determine structure.
• Very expensive
• Very time consuming
• Computational techniques are applied to predict protein structure
4
Computational Protein Structure Prediction
• Machine Learning techniques applied to predict structure
• Experimentally determined structures are used to learn to predict new structures
• When there is not enough data to learn from:
• Active learning is applied to select the next protein to be studied experimentally
5
Active Learning
[Diagram, built up step by step over slides 5 to 9. Components: Unlabeled Proteins; Cluster Unlabeled Proteins; Clustered Proteins; Selection Algorithm; Labeled Proteins; Prediction; Possible Labels]
Active learning guides selection of data points for which you ask for labels
10
Membrane Protein Structure Prediction
Membrane Protein importance and challenges
• Membrane proteins:
– 30% of genes
– Cell regulation and signaling pathways
– 60% of drug targets
• Yet:
– Difficult to study experimentally
– 1% of known protein structures
Active learning can be used as a tool to address the limited number of known membrane protein (MP) structures, in contrast to the large number of known MP sequences
11
‘Features’ Representation
Data reduction is performed by SVD, resulting in a final 4 features per window.
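The SVD reduction step can be illustrated with a minimal sketch. It assumes the raw features form a matrix with one row per window; the function name `reduce_with_svd`, the raw feature count used in the usage note, and the absence of any centering or scaling are assumptions, since the slide only states that SVD yields 4 features per window.

```python
import numpy as np

def reduce_with_svd(X, k=4):
    """Project each row of X (one window per row) onto the top-k singular directions.

    X is assumed to be an (n_windows, n_raw_features) array. No centering or
    scaling is applied here; the slides do not say whether any was used.
    """
    # Thin SVD: X = U @ diag(s) @ Vt, singular values sorted in descending order.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # Coordinates of each window in the k leading directions (same as X @ Vt[:k].T).
    return U[:, :k] * s[:k]
```

For example, calling `reduce_with_svd` on a hypothetical 50 x 12 feature matrix returns a 50 x 4 array, one 4-feature vector per window.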
• Find the most dense cluster
– Choose N points closest to its centroid
– Find labels for these points (TM or NTM)
– Find the majority label, say L
– Assign L to all points in the cluster
• Repeat for next dense cluster
Clusters with no known structures are marked for study by experiments
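The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: cluster assignments are taken as given from any clustering method, "most dense" is approximated by member count, and `oracle` is a hypothetical callable that returns "TM", "NTM", or `None` when no experimental structure covers a point.

```python
from collections import Counter
from math import dist

def density_based_labeling(points, cluster_ids, oracle, n_queries):
    """Assign each cluster the majority label of the N points nearest its centroid.

    points      : list of equal-length feature tuples, one per window
    cluster_ids : cluster index of each point (from any clustering method)
    oracle      : hypothetical callable, oracle(i) -> "TM", "NTM", or None
    n_queries   : N, the number of points queried per cluster
    """
    members = {}
    for i, cid in enumerate(cluster_ids):
        members.setdefault(cid, []).append(i)

    labels = {}            # point index -> propagated label
    needs_experiment = []  # clusters with no known structure

    # Assumption: density is approximated by the number of members.
    for cid in sorted(members, key=lambda c: len(members[c]), reverse=True):
        idx = members[cid]
        dims = len(points[idx[0]])
        centroid = [sum(points[i][d] for i in idx) / len(idx) for d in range(dims)]
        # The N points closest to the centroid are the ones we ask labels for.
        nearest = sorted(idx, key=lambda i: dist(points[i], centroid))[:n_queries]
        answers = [a for a in (oracle(i) for i in nearest) if a is not None]
        if not answers:
            needs_experiment.append(cid)
            continue
        majority = Counter(answers).most_common(1)[0][0]
        for i in idx:
            labels[i] = majority
    return labels, needs_experiment
```

Clusters returned in `needs_experiment` correspond to those the slide marks for study by experiments.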
14
Design 1 Results
• Increase the number of data points for which we ask for the structure
• Compare how accuracy varies between guided selection (via active learning) versus random selection.
[Figure: precision and F-score (percent, 0 to 90) versus number of labels per node (1 to 40), for density-based and random selection]
A total of only 10 labels per node (~1% of the data)
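For reference, the two metrics named in the plot can be computed as in the sketch below. It assumes "TM" is treated as the positive class and that the F-score is the balanced F1 measure; neither choice is stated on the slide, and the function name is hypothetical.

```python
def precision_recall_fscore(true_labels, predicted_labels, positive="TM"):
    """Precision, recall and F1 for one positive class (assumed here to be "TM")."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    total = precision + recall
    fscore = 2 * precision * recall / total if total else 0.0
    return precision, recall, fscore
```

Running the same function on labels produced by guided selection and by random selection, at each number of labels per node, gives the kind of comparison the figure describes.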
15
Design 2: Protein-based Selection
• Pick a random protein
• Find labels for all windows in this protein
• For each node containing labels, find the mode L of all labels it contains
• Assign L to remaining data in node
• Repeat and update for each new protein, until half of the proteins have been selected
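A minimal sketch of this protein-based design is shown below, under several assumptions: each window carries the name of its protein and the node (cluster) it falls in, `oracle` is a hypothetical callable returning the experimentally determined label of a window, and the per-node mode is computed once after all proteins are selected, which gives the same final assignment as updating after each protein.

```python
import random
from collections import Counter, defaultdict

def protein_based_selection(window_protein, window_node, oracle, seed=0):
    """Label half of the proteins at random, then spread node modes to the rest.

    window_protein : protein name for each window
    window_node    : node (cluster) index for each window
    oracle         : hypothetical callable, oracle(w) -> label of window w
    """
    proteins = sorted(set(window_protein))
    random.Random(seed).shuffle(proteins)  # random selection order

    known = {}  # window index -> label obtained from the oracle
    for protein in proteins[: len(proteins) // 2]:
        for w, p in enumerate(window_protein):
            if p == protein:
                known[w] = oracle(w)

    # Count labels per node, then take the mode L of each node.
    votes = defaultdict(Counter)
    for w, label in known.items():
        votes[window_node[w]][label] += 1

    predicted = dict(known)
    for w, node in enumerate(window_node):
        if w not in known and node in votes:
            predicted[w] = votes[node].most_common(1)[0][0]
    return predicted
```

Windows in nodes that received no labels are left out of the result. Varying `seed` gives different permutations of the protein selection order.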
16
Protein-based results
Repeated for different permutations of protein selection order, and observed several metrics.
[Figure: plot with a percent axis; details not recoverable from the transcript]
17
Conclusions
• We developed a framework that allows us to select a few proteins or fragments of proteins which, when annotated with experimental methods, may be used to label remaining protein sequences.
• We have shown that it is possible to achieve higher accuracy values with guided selection of data compared to random selection of data.
Acknowledgements
Madhavi Ganapathiraju
Jessica Wehner
JW funded through NIH-NSF Bioengineering & Bioinformatics Summer Institute
Visit us at
Department of Biomedical Informatics University of Pittsburgh