An algorithm to guide selection of specific biomolecules to be studied by wet-lab experiments Jessica Wehner and Madhavi Ganapathiraju Department of Biomedical.

An algorithm to guide selection of specific biomolecules to be studied by wet-lab experiments

Jessica Wehner and Madhavi Ganapathiraju

Department of Biomedical Informatics
University of Pittsburgh School of Medicine

Pittsburgh, PA, USA

Presented by Thahir P. Mohamed

Advancing Practice, Instruction & Innovation through Informatics
October 19-23, 2008


Protein Structure

Primary Structure: Chain of amino acids

Secondary Structure: Sub-structures such as helices and strands

Tertiary Structure: Three-dimensional structure of the protein at atomic resolution

Protein structure is essential for successful design of drugs


Challenges in Protein Structure Prediction

• X-ray crystallography and NMR spectroscopy are wet-lab methods for determining structure.

• Very expensive

• Very time consuming

• Computational techniques are applied to predict protein structure


Computational Protein Structure Prediction

• Machine Learning techniques applied to predict structure

• Experimentally determined structures are used to learn to predict new structures

• When there is not enough data to learn from:

• Active learning is applied to select the next protein to be studied experimentally

Active Learning

[Figure, built up over five slides: Unlabeled Proteins → Cluster Unlabeled Proteins → Clustered Proteins → Selection Algorithm → Prediction → Labeled Proteins. A legend of possible labels accompanies each step.]

Active learning guides the selection of data points for which labels are requested


Membrane Protein Structure Prediction

Membrane Protein importance and challenges

Membrane Proteins:
• 30% of genes
• Cell regulation and signaling pathways
• 60% of drug targets

Yet:
• Difficult to study experimentally
• 1% of known protein structures

Active learning can be used to cope with the small number of known membrane protein (MP) structures, which contrasts with the large number of known MP sequences


‘Features’ Representation

Data reduction is performed by singular value decomposition (SVD), leaving 4 features per window.

Position:  1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7
Residue:   A L H W R A A G A A T V L L V I V E R G A P G A Q L I
Topology:  - - - - - M M M M M M M M M M M M - - - - - - - - - -
Charge:    - - p - p - - - - - - - - - - - - n p - - - - - - - -
E-Prop:    D d . . A D D . D D a d d d d d d D A . D D . D a d d

Properties

• Charge
• Size
• Polarity
• Aromaticity
• Electronic Properties
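The windowing and SVD step can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' pipeline: it uses the charge property alone, a window of 5 residues, a one-hot encoding, and a plain truncated SVD without centering. The names `charge_string`, `window_matrix` and `reduce_with_svd` are hypothetical. The mappings H, R → p and E → n are read off the slide; K and D are added here as an assumption.

```python
import numpy as np

# Charge alphabet as on the slide: p = positive, n = negative, - = neutral.
# H, R -> p and E -> n come from the slide; K and D are assumptions.
CHARGE = {"H": "p", "R": "p", "K": "p", "E": "n", "D": "n"}
SYMBOLS = ["-", "p", "n"]

def charge_string(sequence):
    """Map each residue to its charge symbol."""
    return "".join(CHARGE.get(res, "-") for res in sequence)

def window_matrix(sequence, window=5):
    """One row per sliding window; each position is one-hot encoded over SYMBOLS."""
    charges = charge_string(sequence)
    rows = []
    for start in range(len(charges) - window + 1):
        row = np.zeros(window * len(SYMBOLS))
        for offset, symbol in enumerate(charges[start:start + window]):
            row[offset * len(SYMBOLS) + SYMBOLS.index(symbol)] = 1.0
        rows.append(row)
    return np.array(rows)

def reduce_with_svd(matrix, n_features=4):
    """Project each row onto the top n_features right singular vectors."""
    _, _, vt = np.linalg.svd(matrix, full_matrices=False)
    return matrix @ vt[:n_features].T

sequence = "ALHWRAAGAATVLLVIVERGAPGAQLI"  # residue row from the slide
features = reduce_with_svd(window_matrix(sequence))
print(features.shape)  # (23, 4)
```

With 27 residues and a window of 5 there are 23 windows, each reduced to 4 features.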

Clustering the Data

[Figure: three-dimensional scatter plot of the data with axes Dim 1, Dim 2 and Dim 3]

Neural Network Self-Organizing Map (SOM)

• Finds centroids of clusters in the data
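To make the clustering step concrete, here is a minimal self-organizing map written from scratch in NumPy. It is a sketch only: the grid size, learning-rate schedule, neighbourhood width and the synthetic data are all assumptions chosen for illustration, and `train_som` and `assign_nodes` are hypothetical names, not the implementation used in this work.

```python
import numpy as np

def train_som(data, grid=(3, 3), epochs=20, lr0=0.5, sigma0=1.0, seed=0):
    """Train a small rectangular SOM; returns node weights, shape (n_nodes, n_features)."""
    rng = np.random.default_rng(seed)
    n_nodes = grid[0] * grid[1]
    # Grid coordinates of each node, used by the neighbourhood function.
    coords = np.array([(r, c) for r in range(grid[0]) for c in range(grid[1])], dtype=float)
    # Initialise node weights from randomly chosen data points.
    weights = data[rng.choice(len(data), size=n_nodes, replace=True)].astype(float)
    n_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for idx in rng.permutation(len(data)):
            frac = step / n_steps
            lr = lr0 * (1.0 - frac)               # linearly decaying learning rate
            sigma = sigma0 * (1.0 - frac) + 1e-3  # shrinking neighbourhood width
            x = data[idx]
            bmu = np.argmin(((weights - x) ** 2).sum(axis=1))  # best-matching unit
            grid_dist2 = ((coords - coords[bmu]) ** 2).sum(axis=1)
            influence = np.exp(-grid_dist2 / (2.0 * sigma ** 2))
            # Pull the best-matching unit and its grid neighbours towards x.
            weights += lr * influence[:, None] * (x - weights)
            step += 1
    return weights

def assign_nodes(data, weights):
    """Index of the nearest node (cluster centroid) for every data point."""
    dist2 = ((data[:, None, :] - weights[None, :, :]) ** 2).sum(axis=2)
    return dist2.argmin(axis=1)

# Synthetic 4-feature data standing in for the SVD-reduced windows.
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(loc=c, scale=0.1, size=(50, 4)) for c in (0.0, 1.0, 2.0)])
weights = train_som(data)
nodes = assign_nodes(data, weights)
print(weights.shape, np.bincount(nodes, minlength=len(weights)))
```

Each row of `weights` plays the role of a cluster centroid, and `nodes` records which node every window falls into.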


Design 1: Density-based Selection

• Find the densest cluster

– Choose N points closest to its centroid

– Find labels for these points: TM (transmembrane) or NTM (non-transmembrane)

– Find the majority label, say L

– Assign L to all points in the cluster

• Repeat for the next densest cluster

Clusters with no known structures are marked for study by experiments
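The steps of Design 1 can be sketched as below. This is an illustrative reading of the slide, not the authors' code: "density" is taken to mean the number of points in a cluster, and `oracle` is a hypothetical stand-in for looking up an experimentally known label (it returns None when no structure is known). The function name and the toy data are likewise assumptions.

```python
import numpy as np
from collections import Counter

def density_based_selection(data, nodes, centroids, oracle, n_query=10):
    """Propagate labels cluster by cluster, most populated cluster first.

    oracle(i) returns "TM", "NTM", or None if point i has no known label.
    """
    predicted = np.full(len(data), None, dtype=object)
    needs_experiment = []
    sizes = np.bincount(nodes, minlength=len(centroids))
    for node in np.argsort(-sizes):
        members = np.flatnonzero(nodes == node)
        if len(members) == 0:
            continue
        # The N members closest to the cluster centroid.
        dist = np.linalg.norm(data[members] - centroids[node], axis=1)
        queried = members[np.argsort(dist)[:n_query]]
        labels = [lab for lab in (oracle(i) for i in queried) if lab is not None]
        if not labels:
            needs_experiment.append(int(node))  # no known structure in this cluster
            continue
        majority = Counter(labels).most_common(1)[0][0]
        predicted[members] = majority           # assign L to the whole cluster
    return predicted, needs_experiment

# Toy data: three clusters on a line; point 5 has no known label.
data = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [9.0]])
nodes = np.array([0, 0, 0, 1, 1, 2])
centroids = np.array([[0.1], [5.05], [9.0]])
known = {0: "TM", 1: "TM", 2: "NTM", 3: "NTM", 4: "NTM"}
predicted, todo = density_based_selection(data, nodes, centroids, known.get, n_query=3)
print(list(predicted), todo)
```

In the toy run, cluster 0 receives the majority label TM, cluster 1 receives NTM, and cluster 2 is returned in `todo` as a candidate for experimental study.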


Design 1 Results

• Increase the number of data points for which we ask for the structure

• Compare how accuracy varies between guided selection (via active learning) and random selection

[Figure: Percent (y-axis, 0 to 90) against number of labels per node (x-axis, 1 to 40) for four series: Density based PRECISION, Density based FSCORE, Random based PRECISION, Random based FSCORE]

A total of only 10 labels per node is about 1% of the data
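For reference, precision and F-score as plotted above could be computed along these lines. The slides do not define the metrics, so this sketch assumes TM is the positive class and that FSCORE means the balanced F1 score; the function name and the example labels are made up for illustration.

```python
def precision_recall_fscore(true_labels, predicted_labels, positive="TM"):
    """Precision, recall and F1 for one positive class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        fscore = 2 * precision * recall / (precision + recall)
    else:
        fscore = 0.0
    return precision, recall, fscore

# Made-up labels: 2 true positives, 1 false positive, 1 false negative.
true_labels = ["TM", "TM", "TM", "NTM", "NTM", "NTM"]
predicted_labels = ["TM", "TM", "NTM", "TM", "NTM", "NTM"]
precision, recall, fscore = precision_recall_fscore(true_labels, predicted_labels)
print(precision, recall, fscore)
```

Running the same function on labels propagated by guided selection and by random selection gives the two pairs of curves being compared.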


Design 2: Protein-based Selection

• Pick a random protein

• Find labels for all windows in this protein

• For each node containing labels, find the mode L of all labels it contains

• Assign L to remaining data in node

• Repeat and update for each new protein, until half of the proteins have been selected
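A minimal sketch of the protein-based procedure follows, assuming each window is described by three parallel lists: the protein it comes from, the SOM node it falls into, and its true label. The function name, the argument names and the toy data are hypothetical. For brevity the node modes are computed once after all selected proteins have been added, which gives the same final labels as updating after every protein.

```python
import random
from collections import Counter, defaultdict

def protein_based_selection(window_protein, window_node, window_label, seed=0):
    """Label all windows of randomly chosen proteins, then spread node modes.

    Returns (predicted labels per window, list of selected proteins).
    A window stays None if its node never received a label.
    """
    proteins = sorted(set(window_protein))
    rng = random.Random(seed)
    rng.shuffle(proteins)                      # random selection order
    selected = proteins[: len(proteins) // 2]  # stop once half have been selected

    node_labels = defaultdict(list)            # labels collected in each node
    labeled = set()
    for protein in selected:
        for i, owner in enumerate(window_protein):
            if owner == protein:
                node_labels[window_node[i]].append(window_label[i])
                labeled.add(i)

    # Mode L of the labels each node contains.
    node_mode = {node: Counter(labels).most_common(1)[0][0]
                 for node, labels in node_labels.items()}
    predicted = []
    for i in range(len(window_protein)):
        if i in labeled:
            predicted.append(window_label[i])             # experimentally labeled
        else:
            predicted.append(node_mode.get(window_node[i]))  # inherited from node
    return predicted, selected

# Toy data: four proteins with two windows each, falling into two nodes.
window_protein = ["P1", "P1", "P2", "P2", "P3", "P3", "P4", "P4"]
window_node = [0, 1, 0, 1, 0, 1, 0, 1]
window_label = ["TM", "NTM", "TM", "NTM", "TM", "NTM", "TM", "NTM"]
predicted, selected = protein_based_selection(window_protein, window_node, window_label)
print(selected, predicted)
```

Varying `seed` changes the selection order, which is one way to repeat the experiment over different permutations of the proteins.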


Protein-based results

The procedure was repeated for different permutations of the protein selection order, and several metrics were observed.

[Figure: results chart with y-axis Percent]

Conclusions

• We developed a framework that allows us to select a few proteins or fragments of proteins which, when annotated with experimental methods, may be used to label remaining protein sequences.

• We have shown that it is possible to achieve higher accuracy values with guided selection of data compared to random selection of data.

Acknowledgements

Madhavi Ganapathiraju
Jessica Wehner

JW funded through the NIH-NSF Bioengineering & Bioinformatics Summer Institute

Visit us at

Department of Biomedical Informatics University of Pittsburgh

Thank you!

Cathedral of Learning, University of Pittsburgh

www.dbmi.pitt.edu/madhavi
