COT 6930 HPC and Bioinformatics Protein Structure Prediction Xingquan Zhu Dept. of Computer Science and Engineering.

Jan 12, 2016

Page 1: COT 6930 HPC and Bioinformatics Protein Structure Prediction Xingquan Zhu Dept. of Computer Science and Engineering.

COT 6930: HPC and Bioinformatics

Protein Structure Prediction

Xingquan Zhu, Dept. of Computer Science and Engineering

Page 2

[Figure: information flow DNA → (transcription) → RNA → (translation) → protein → phenotype, annotated with the associated resources: genomic DNA databases; cDNA, ESTs, UniGene; gene expression database; protein sequence databases; protein structure databases]

Page 3

Outline
Protein Structure
  Why structure
How to Predict Protein Structure
  Experimental methods
  Computational methods (predictive methods)
Protein Structure Prediction
  Secondary structure prediction (2D)
    Machine learning methods for protein secondary structure prediction
  Tertiary structure prediction (3D)
    Ab initio
    Homology modeling

Page 4

Proteins

Proteins play a crucial role in virtually all biological processes with a broad range of functions.

The activity of an enzyme or the function of a protein is governed by its three-dimensional structure.

Page 5

Protein Structure is Hierarchical

Protein Structure Video

http://www.youtube.com/watch?v=lijQ3a8yUYQ

Page 6

Primary Structure: Sequence

The primary structure of a protein is the amino acid sequence

Page 7

Protein Structure Prediction Problem

Protein structure prediction
  Predict protein 3D structure from (amino acid) sequence
  One step closer to useful biological knowledge
  Sequence → secondary structure → 3D structure → function

Page 8

Outline
Protein Structure
  Why structure
How to Predict Protein Structure
  Experimental methods
  Computational methods (predictive methods)
Protein Structure Prediction
  Secondary structure prediction (2D)
    Machine learning methods for protein secondary structure prediction
  Tertiary structure prediction (3D)
    Ab initio
    Homology modeling

Page 9

Why Predict Structure?

Structure determines function
  Molecular function

Structure is more conserved than sequence

Goals:

1. Predict structure from sequence

2. Predict function based on structure

3. Predict function based on sequence

Page 10

Why predict structure: Structure is more conserved than sequence

28% sequence identity

Page 11

Why predict structure: Can Label Proteins by Dominant Structure

SCOP: Structural Classification Of Proteins

Page 12

Why predict structure: Large number of proteins vs. relatively small number of folds

Small number of unique folds found in practice
  90% of proteins fall into < 1000 folds; estimated ~4000 total folds

http://www.rcsb.org/pdb/home/home.do
As of 02/05/2008: 48,878 structures

Page 13

Examples of Fold Classes

Page 14

How to Predict Protein Structure

A related biological question: what are the factors that determine a structure?
  Energy
  Kinematics

How can we determine structure?
  Experimental methods
    X-ray crystallography or NMR (nuclear magnetic resonance) spectroscopy
    Limitations: protein size; X-ray crystallography requires crystallized proteins
  Computational methods (predictive methods)
    2-D structure (secondary structure)
    3-D structure (tertiary structure)

Page 15

Geometry of Protein Structure

[Figure: protein backbone geometry, with two bonds labeled "rotatable"]

Page 16

Inter-atomic Forces

Covalent bond (short range, very strong)
  Binds atoms into molecules / macromolecules
Hydrogen bond (short range, strong)
  Binds two polar groups (hydrogen + electronegative atom)
Disulfide bond / bridge (short range, very strong)
  Covalent bond between sulfhydryl (sulfur + hydrogen) groups
Hydrophobic / hydrophilic interaction (weak)
  Hydrogen bonding w/ H2O in solution
Van der Waals interaction (very weak)
  Nonspecific electrostatic attractive force

Page 17

Types of Inter-atomic Forces

Page 18

Quick Overview of Energy

Bond                         Strength (kcal/mole)
H-bonds                      3-7
Ionic bonds                  10
Hydrophobic interactions     1-2
Van der Waals interactions   1
Disulfide bridge             51

Page 19

Protein Folding Animation

http://www.youtube.com/watch?v=fvBO3TqJ6FE
http://www.youtube.com/watch?v=swEc_sUVz5I

Page 20

Two Related Problems in Structure Prediction

Directly predicting protein structure from the amino acid sequence has proved elusive

Two sub-problems
  Secondary Structure Prediction
  Tertiary Structure Prediction

Page 21

Secondary Structure Prediction (2D)

For each residue in a protein structure, there are three possible states: α (α-helix), β (β-strand), and t (others).

[Figure: an amino acid sequence shown with its secondary structure sequence]

As of 2000, the accuracy of secondary structure prediction methods was nearly 80%.

Secondary structure prediction can provide useful information to improve other sequence and structure analysis methods, such as sequence alignment and 3-D modeling.

http://bioinf.cs.ucl.ac.uk/psipred/psiform.html

Page 22

Outline
Protein Structure
  Why structure
How to Predict Protein Structure
  Experimental methods
  Computational methods (predictive methods)
Protein Structure Prediction
  Secondary structure prediction (2D)
    Machine learning methods for protein secondary structure prediction
  Tertiary structure prediction (3D)
    Ab initio
    Homology modeling

Page 23

PSSP: Protein Secondary Structure Prediction

Three Generations
• Based on statistical information of single amino acids
• Based on local amino acid interactions (segments). Typically a segment contains 11-21 amino acids
• Based on evolutionary information from homologous sequences

Page 24

Secondary Structure preferences for Amino Acids

The normalized frequencies for each conformation were calculated from the fraction of residues of each amino acid that occurred in that conformation, divided by this fraction for all residues.

Random occurrence of a particular amino acid in a conformation would give a value of unity. A value greater than unity indicates a preference for a particular type of secondary structure.
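The normalized frequency described on this slide can be sketched in a few lines. This is a minimal illustration, not the procedure behind the slide's table: the function name `propensities`, the one-letter state codes `H`, `E` and `t`, and the toy sequence are all assumptions made for the example. For amino acid a and conformation s it computes (n_as / n_a) / (n_s / N), where n_as counts residues of type a in state s, n_a counts residues of type a, n_s counts residues in state s, and N is the total number of residues.

```python
from collections import Counter

def propensities(sequence, states):
    """Normalized frequency of each amino acid in each conformation.

    For amino acid a and conformation s:
        P(a, s) = (n_as / n_a) / (n_s / N)
    A value of 1 means no preference; above 1 means a is enriched in s.
    """
    if len(sequence) != len(states):
        raise ValueError("sequence and states must have equal length")
    total = len(sequence)
    aa_counts = Counter(sequence)                  # n_a
    state_counts = Counter(states)                 # n_s
    pair_counts = Counter(zip(sequence, states))   # n_as
    table = {}
    for (aa, state), n_as in pair_counts.items():
        frac_for_aa = n_as / aa_counts[aa]         # fraction of a found in s
        frac_for_all = state_counts[state] / total # fraction of all residues in s
        table[(aa, state)] = frac_for_aa / frac_for_all
    return table

# Toy data, made up for illustration (H = helix, E = strand, t = other).
seq    = "AAEELLVVGG"
states = "HHHHHHEEtt"
table = propensities(seq, states)
for key in sorted(table):
    print(key, round(table[key], 2))
```

Pairs that never occur together are simply absent from the returned dictionary; a real table would be estimated from a large set of proteins with known structure.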

Page 25

Outline
Protein Structure
  Why structure
How to Predict Protein Structure
  Experimental methods
  Computational methods (predictive methods)
Protein Structure Prediction
  Secondary structure prediction (2D)
    Machine learning methods for protein secondary structure prediction
  Tertiary structure prediction (3D)
    Ab initio
    Homology modeling

Page 26

Machine learning methods for Protein Secondary Structure Prediction

Introduction to classification
Generalize protein secondary structure prediction as a machine learning problem
Introduction to Neural Network

Page 27

Classification and Classifiers

Given a database table DB with a set of attribute values and a special attribute C, called the class label.

Example:

A1  A2  A3  A4  C
1   1   m   g   Tumor
0   1   v   g   Normal
1   0   m   b   Normal

Page 28

Classification and Classifiers

An algorithm is called a classification algorithm if it uses the data to build a set of patterns
  Decision rules or decision trees, etc.
  Those patterns are structured in such a way that we can use them to classify unknown sets of objects (unknown records).

For that reason (because of the goal), a classification algorithm is often simply called a classifier.

Classifier Example

Page 29

Classification and Classifiers

Building a classifier consists of two phases: training and testing.
  In both phases we use data (a training data set and a disjoint test data set) for which the class labels are known for ALL of the records.
  The training data set is used to create patterns (rules, trees, or to train a neural network).
  The created patterns are evaluated on the test data, whose classification is known.
  The measure of a trained classifier's accuracy is called predictive accuracy.

Page 30

Predictive Accuracy Evaluation

The main methods of predictive accuracy evaluations are:

• Re-substitution (N ; N)
• Holdout (2N/3 ; N/3)
• x-fold cross-validation (N-N/x ; N/x)
• Leave-one-out (N-1 ; 1)
where N is the number of instances in the dataset and each pair gives the (training ; test) set sizes

The process of building and evaluating a classifier is also called supervised learning or, more recently, when dealing with large databases, a classification method in data mining.
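The (training ; test) sizes listed for each evaluation method can be checked with a small index-splitting sketch. The helper names `holdout_split`, `k_fold_splits` and `leave_one_out_splits` are hypothetical, and the sketch splits indices in their given order; shuffling the data first is the usual practice but is left out to keep the example short.

```python
def holdout_split(n):
    """Holdout: first 2N/3 instances for training, the remaining N/3 for testing."""
    cut = (2 * n) // 3
    indices = list(range(n))
    return indices[:cut], indices[cut:]

def k_fold_splits(n, k):
    """x-fold cross-validation: yield k (train, test) index pairs.

    Each instance appears in exactly one test fold.
    """
    indices = list(range(n))
    for fold in range(k):
        test = indices[fold::k]          # every k-th index, offset by the fold number
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test

def leave_one_out_splits(n):
    """Leave-one-out is the special case of k-fold with k = N."""
    return k_fold_splits(n, n)

train, test = holdout_split(12)
print(len(train), len(test))
for train, test in k_fold_splits(12, 4):
    print(len(train), len(test))
```

Re-substitution needs no helper: the same N instances are used for both training and testing, which is why it tends to give an optimistic accuracy estimate.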

Page 31

Classification Models: Different Classifiers

Typical classification models
  Decision Trees (ID3, C4.5)
  Nearest Neighbors
  Support Vector Machines
  Neural Networks

Most of the best classifiers for PSSP are based on the neural network model

Demonstration

Page 32

Machine learning methods for Protein Secondary Structure Prediction

Introduction to classification
Generalize protein secondary structure prediction as a machine learning problem
Introduction to Neural Network

Page 33

How to generalize protein secondary structure prediction as a machine learning problem?
  Use a sliding window that moves along the amino acid sequence
  Each window denotes an instance
  Each amino acid inside the window denotes an attribute
  The known secondary structure of the central amino acid is the class label
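The sliding-window construction can be sketched as follows. The function name `window_instances`, the padding character `-` for positions that fall off either end of the chain, and the toy sequence are assumptions made for illustration; the slides do not say how chain ends are handled.

```python
def window_instances(sequence, structure, window=5, pad="-"):
    """Turn a labeled protein into (attributes, class_label) instances.

    Each instance is the window of residues centered on one position;
    the label is the known secondary structure of the central residue.
    """
    if window % 2 == 0:
        raise ValueError("window size must be odd so it has a central residue")
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have equal length")
    half = window // 2
    padded = pad * half + sequence + pad * half    # assumed end handling
    instances = []
    for i, label in enumerate(structure):
        attributes = tuple(padded[i:i + window])   # one attribute per window slot
        instances.append((attributes, label))
    return instances

# Toy example with a window of 3 (H = helix, E = strand, t = other).
for attributes, label in window_instances("MKVLA", "tHHEt", window=3):
    print(attributes, label)
```

Before such instances are fed to a neural network, each residue symbol is normally converted to numbers, for example with a one-of-20 encoding; that step is omitted here.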

Page 34

How to generalize protein secondary structure prediction as a machine learning problem?

A set of “examples” is generated from sequences with known secondary structures
  Examples form a training set
  Build a neural network classifier
  Apply the classifier to a sequence with unknown secondary structure

Page 35

Machine learning methods for Protein Secondary Structure Prediction

Introduction to classification
Generalize protein secondary structure prediction as a machine learning problem
Introduction to Neural Network

Page 36

Introduction to Neural Network

What is an artificial neural network?
  An extremely simplified model of the brain
  Essentially a function approximator
    Transforms inputs into outputs to the best of its ability

Page 37

Introduction to Neural Network

Composed of many “neurons” that co-operate to perform the desired function

Page 38

How Do Neural Networks Work?
  A neuron (perceptron) is a single-layer NN
  The output of a neuron is a function of the weighted sum of the inputs plus a bias

Page 39

Activation Function

Binary activation function
  $f(x) = 1$ if $x \ge 0$, $f(x) = 0$ otherwise

The most common sigmoid function used is the logistic function
  $f(x) = 1/(1 + e^{-x})$
  The calculation of derivatives is important for neural networks, and the logistic function has a very nice derivative
  $f'(x) = f(x)(1 - f(x))$
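A short sketch of the two activation functions, together with a neuron that applies one of them to the weighted sum of its inputs plus a bias. The function names and the sample weights are made up for illustration. The last lines compare the closed-form derivative with a finite-difference estimate as a sanity check.

```python
import math

def binary_step(x):
    """Binary activation: 1 if x >= 0, otherwise 0."""
    return 1 if x >= 0 else 0

def logistic(x):
    """Logistic (sigmoid) activation 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def logistic_derivative(x):
    """Derivative of the logistic function, f'(x) = f(x) * (1 - f(x))."""
    fx = logistic(x)
    return fx * (1.0 - fx)

def neuron_output(inputs, weights, bias, activation=logistic):
    """Output of one neuron: activation(weighted sum of inputs + bias)."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(weighted_sum)

# Compare the closed-form derivative with a central finite difference.
x, h = 0.7, 1e-6
numeric = (logistic(x + h) - logistic(x - h)) / (2 * h)
print(logistic_derivative(x), numeric)
print(neuron_output([1, 0, 1], [0.5, -0.5, 0.5], bias=-1.0, activation=binary_step))
```

The derivative needs only the value of f(x) that the forward computation already produced, which is part of why the logistic function is convenient for training.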

Page 40

Where Do The Weights Come From?

The weights in a neural network are the most important factor in determining its function

Training is the act of presenting the network with some sample data and modifying the weights to better approximate the desired function

Supervised Training
  Supplies the neural network with inputs and the desired outputs
  Response of the network to the inputs is measured
  The weights are modified to reduce the difference between the actual and desired outputs

Page 41

Perceptron Example
Simplest neural network with the ability to learn
Made up of only input neurons and output neurons
Output neurons use a simple threshold activation function
In basic form, can only solve linear problems
  Limited applications

Page 42

Perceptron Example

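Perceptron weight updating
  If the output is not correct, the weights are adjusted according to the formula:
  $w_{\text{new}} = w_{\text{old}} + \alpha \cdot (\text{desired} - \text{output}) \cdot \text{input}$, where $\alpha$ is the learning rate

Assume the given instance {(1,0,1), 0}, i.e. input (1,0,1) with desired output 0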
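One update step on the instance {(1,0,1), 0} can be traced in code. The initial weights (0.5, 0.2, 0.8) and the learning rate 0.5 are hypothetical values chosen for this sketch, since the slide's figure with the actual numbers is not part of the transcript; the sketch also assumes a learning-rate factor multiplies (desired - output) and uses the threshold-at-zero activation without a bias term.

```python
def perceptron_update(weights, inputs, desired, rate):
    """Apply one perceptron learning step and return (new_weights, output).

    Output is 1 when the weighted sum is >= 0, else 0 (threshold activation).
    Weights change only when the output differs from the desired value.
    """
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    output = 1 if weighted_sum >= 0 else 0
    if output == desired:
        return list(weights), output
    new_weights = [w + rate * (desired - output) * x
                   for w, x in zip(weights, inputs)]
    return new_weights, output

# Hypothetical starting point; the instance itself comes from the slide.
weights = [0.5, 0.2, 0.8]
new_weights, output = perceptron_update(weights, inputs=(1, 0, 1), desired=0, rate=0.5)
print(output, new_weights)
```

Only the weights attached to active inputs (value 1) move; the weight on the zero input is left unchanged, because the correction is scaled by the input value.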

Page 43

Multi-Layer Feedforward NN

An extension of the perceptron
Multiple layers
  The addition of one or more “hidden” layers in between the input and output layers
Activation function is not simply a threshold
  Usually a sigmoid function
A general function approximator
  Not limited to linear problems
Information flows in one direction
  The outputs of one layer act as inputs to the next layer

Page 44

Multi-Layer Feedforward NN Example

XOR problem
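XOR is the standard example of a problem a single perceptron cannot solve but a network with one hidden layer can. The sketch below wires such a network by hand; the weights are one possible choice made up for this illustration (the slide's own figure is not in the transcript), and threshold units are used instead of sigmoids to keep the arithmetic easy to follow.

```python
def step(x):
    """Threshold activation: 1 if x >= 0, otherwise 0."""
    return 1 if x >= 0 else 0

def xor_network(x1, x2):
    """Two-layer feedforward network computing XOR with hand-picked weights."""
    # Hidden layer: one unit behaves like OR, the other like AND.
    h_or = step(1.0 * x1 + 1.0 * x2 - 0.5)
    h_and = step(1.0 * x1 + 1.0 * x2 - 1.5)
    # Output layer: fires when OR is on and AND is off.
    return step(1.0 * h_or - 1.0 * h_and - 0.5)

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, xor_network(x1, x2))
```

The hidden layer re-represents the inputs so that the output unit faces a linearly separable problem, which is the point of adding hidden layers.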

Page 45

Back-propagation

Searches for weight values that minimize the total error of the network over the set of training examples
  Forward pass: compute the outputs of all units in the network, and the error of the output layer.
  Backward pass: the network error is used for updating the weights (credit assignment problem).
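The forward and backward passes can be sketched for a tiny network with two inputs, two logistic hidden units and one logistic output. Everything specific here is an assumption made for illustration: the squared-error measure E = ½(target − output)², the per-example update, the fixed starting weights in `make_net`, the learning rate, and the number of epochs. No particular final error is claimed; convergence on XOR depends on the starting weights.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(net, x):
    """Forward pass: return hidden activations and the network output."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x)) + b)
              for ws, b in zip(net["w_hidden"], net["b_hidden"])]
    out = sigmoid(sum(w * h for w, h in zip(net["w_out"], hidden)) + net["b_out"])
    return hidden, out

def total_error(net, data):
    """Sum of squared errors, 0.5 * (target - output)^2, over the data."""
    return sum(0.5 * (target - forward(net, x)[1]) ** 2 for x, target in data)

def train_step(net, x, target, rate):
    """One forward pass followed by one backward pass (gradient step)."""
    hidden, out = forward(net, x)
    # Backward pass: error signal at the output unit ...
    delta_out = (out - target) * out * (1.0 - out)
    # ... propagated to each hidden unit through its outgoing weight.
    delta_hidden = [delta_out * w * h * (1.0 - h)
                    for w, h in zip(net["w_out"], hidden)]
    for j, h in enumerate(hidden):
        net["w_out"][j] -= rate * delta_out * h
    net["b_out"] -= rate * delta_out
    for j, d in enumerate(delta_hidden):
        for i, xi in enumerate(x):
            net["w_hidden"][j][i] -= rate * d * xi
        net["b_hidden"][j] -= rate * d

def make_net():
    """Hypothetical fixed starting weights (normally chosen at random)."""
    return {"w_hidden": [[0.1, -0.2], [0.3, 0.4]], "b_hidden": [0.0, 0.0],
            "w_out": [0.5, -0.5], "b_out": 0.0}

xor_data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
net = make_net()
error_before = total_error(net, xor_data)
for _ in range(5000):
    for x, target in xor_data:
        train_step(net, x, target, rate=0.5)
print(error_before, total_error(net, xor_data))
```

The hidden-unit error signals are computed before any weight is changed, so every update in a step uses gradients from the same forward pass.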

Page 46

NN for Protein Secondary Structure Prediction

Page 47

Outline
Protein Structure
  Why structure
How to Predict Protein Structure
  Experimental methods
  Computational methods (predictive methods)
Protein Structure Prediction
  Secondary structure prediction (2D)
    Machine learning methods for protein secondary structure prediction
  Tertiary structure prediction (3D)
    Ab initio
    Homology modeling

Page 48

Ab initio Prediction

Sampling the global conformation space
  Lattice models / discrete-state models
  Molecular dynamics
Picking native conformations with an energy function
  Solvation model: how the protein interacts with water
  Pair interactions between amino acids

Page 49

Lattice String Folding
HP model: main modeled force is hydrophobic attraction
  Amino acids are classified into two types: hydrophobic (H) or polar (P)
NP-hard in both the 2-D square and 3-D cubic lattices
Constant-factor approximation algorithms exist
Not so relevant biologically
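Scoring a single conformation in the HP model is easy; the hard part is searching over all conformations. The sketch below scores a fold on the 2-D square lattice using the common convention, assumed here, of -1 for every pair of H residues that are lattice neighbors but not consecutive in the chain. The function name `hp_energy` and the toy folds are made up for illustration.

```python
def hp_energy(sequence, coords):
    """Energy of an HP chain placed on the 2-D square lattice.

    sequence: string of 'H' and 'P'; coords: one (x, y) lattice point per residue.
    Each non-consecutive H-H contact contributes -1 (assumed convention).
    """
    if len(sequence) != len(coords):
        raise ValueError("need one coordinate per residue")
    if len(set(coords)) != len(coords):
        raise ValueError("conformation is not self-avoiding")
    for (x1, y1), (x2, y2) in zip(coords, coords[1:]):
        if abs(x1 - x2) + abs(y1 - y2) != 1:
            raise ValueError("consecutive residues must be lattice neighbors")
    index_at = {point: i for i, point in enumerate(coords)}
    energy = 0
    for i, (x, y) in enumerate(coords):
        if sequence[i] != "H":
            continue
        # Look only right and up so each contact is counted once.
        for neighbor in ((x + 1, y), (x, y + 1)):
            j = index_at.get(neighbor)
            if j is not None and sequence[j] == "H" and abs(i - j) > 1:
                energy -= 1
    return energy

square_fold = [(0, 0), (1, 0), (1, 1), (0, 1)]   # chain bent into a square
straight    = [(0, 0), (1, 0), (2, 0), (3, 0)]   # fully extended chain
print(hp_energy("HPPH", square_fold), hp_energy("HPPH", straight))
```

Folding the chain brings its two H residues into contact and lowers the energy, which is how the model expresses hydrophobic attraction.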

Page 50

Lattice String Folding

Page 51

Energy Minimization
Many forces act on a protein
  Hydrophobic: the inside of the protein wants to avoid water
    Hydrophobic molecules associate with each other in a water solvent as if the water molecules repelled them. It is like oil/water separation.
  Packing: atoms can't be too close, nor too far away
    van der Waals interactions
  Bond angle/length constraints
  Long distance, e.g.
    Electrostatics & hydrogen bonds
    Disulphide bonds
    Salt bridges
Can calculate all of these forces, and minimize
  Intractable in general case, but can be useful

Page 52

Molecular Dynamics (MD)

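In molecular dynamics simulation, we simulate the motions of atoms as a function of time according to Newton's equation of motion. The equations for a system consisting of N atoms can be written as

$$m_i \frac{d^2 \mathbf{r}_i(t)}{dt^2} = \mathbf{F}_i(t), \quad i = 1, 2, \ldots, N. \qquad (1)$$

Here, $\mathbf{r}_i$ and $m_i$ represent the position and mass of atom $i$, and $\mathbf{F}_i(t)$ is the force on atom $i$ at time $t$. $\mathbf{F}_i(t)$ is given by

$$\mathbf{F}_i = -\nabla_i V(\mathbf{r}_1, \mathbf{r}_2, \ldots, \mathbf{r}_N), \qquad (2)$$

where $V(\mathbf{r}_1, \mathbf{r}_2, \ldots, \mathbf{r}_N)$ is the potential energy of the system, which depends on the positions of the N atoms in the system. $\nabla_i$ is

$$\nabla_i = \mathbf{i}\,\frac{\partial}{\partial x_i} + \mathbf{j}\,\frac{\partial}{\partial y_i} + \mathbf{k}\,\frac{\partial}{\partial z_i}, \qquad (3)$$

with $(x_i, y_i, z_i)$ the Cartesian coordinates of atom $i$ and $\mathbf{i}, \mathbf{j}, \mathbf{k}$ the unit vectors along the axes.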
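To make the time-stepping concrete, here is a minimal sketch that integrates Newton's equation for a single particle in one dimension. The velocity Verlet scheme is used because it is a common choice in MD codes; the slides do not name an integrator, so this is an assumption. The harmonic potential V(x) = ½·k·x², the parameter values, and the function name `velocity_verlet` are likewise made up for illustration.

```python
import math

def velocity_verlet(x, v, force, mass, dt, steps):
    """Integrate m * d2x/dt2 = F(x) and return the list of (x, v) states."""
    trajectory = [(x, v)]
    f = force(x)
    for _ in range(steps):
        v_half = v + 0.5 * dt * f / mass     # half-step velocity update
        x = x + dt * v_half                  # full-step position update
        f = force(x)                         # force at the new position
        v = v_half + 0.5 * dt * f / mass     # second half-step velocity update
        trajectory.append((x, v))
    return trajectory

# Toy system: harmonic potential V(x) = 0.5 * k * x^2, so F(x) = -dV/dx = -k * x.
k, mass = 1.0, 1.0
traj = velocity_verlet(x=1.0, v=0.0, force=lambda x: -k * x,
                       mass=mass, dt=0.01, steps=1000)
x_end, v_end = traj[-1]
energy_end = 0.5 * mass * v_end ** 2 + 0.5 * k * x_end ** 2
print(x_end, math.cos(10.0), energy_end)
```

In a real simulation x and v become 3-D vectors for each of the N atoms and the force comes from the gradient of the full potential energy, but the update pattern is the same. Total energy staying close to its initial value is a quick check that the time step is small enough.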

Page 53

Energy Functions Used in Molecular Simulation

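$$V_{\text{total}} = \sum_{\text{bonds}} K_b (r - r_0)^2 + \sum_{\text{angles}} K_\Theta (\Theta - \Theta_0)^2 + \sum_{\text{dihedrals}} K_\Phi \left[1 + \cos(n\Phi - \gamma)\right] + \sum_{\text{H-bonds}} \left[\frac{C_{ij}}{r_{ij}^{12}} - \frac{D_{ij}}{r_{ij}^{10}}\right] + \sum_{\text{van der Waals pairs } i,j} \left[\frac{A_{ij}}{r_{ij}^{12}} - \frac{B_{ij}}{r_{ij}^{6}}\right] + \sum_{\text{electrostatic pairs } i,j} \frac{q_i q_j}{\varepsilon\, r_{ij}}$$

Terms, in order: bond stretching term, angle bending term, dihedral term, H-bonding term, van der Waals term, electrostatic term.

The sums over atom pairs (the non-bonded terms) are the most time-demanding part.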
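The cost of the pairwise terms is easy to see in code: every pair of atoms is visited, so the work grows roughly with N². The sketch below evaluates a van der Waals term and an electrostatic term over all pairs, assuming the usual 12-6 form A/r¹² − B/r⁶ and a Coulomb form qᵢqⱼ/(ε·r). The function name `nonbonded_energy`, the single A and B shared by every pair, and all numeric values are hypothetical simplifications; real force fields use pair-specific parameters and skip directly bonded atoms.

```python
import math
from itertools import combinations

def nonbonded_energy(coords, charges, a, b, epsilon=1.0):
    """Return (van_der_waals, electrostatic) energy summed over all atom pairs.

    coords: list of (x, y, z) positions; charges: one partial charge per atom.
    a, b: 12-6 coefficients, taken to be the same for every pair (simplification).
    """
    e_vdw = 0.0
    e_elec = 0.0
    for i, j in combinations(range(len(coords)), 2):   # N * (N - 1) / 2 pairs
        r = math.dist(coords[i], coords[j])
        e_vdw += a / r ** 12 - b / r ** 6
        e_elec += charges[i] * charges[j] / (epsilon * r)
    return e_vdw, e_elec

# Two atoms one distance unit apart, with made-up parameters.
coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
charges = [1.0, -1.0]
print(nonbonded_energy(coords, charges, a=1.0, b=2.0))
```

Production MD codes reduce this cost with distance cutoffs and neighbor lists, which is one reason such simulations are a natural fit for high-performance computing.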

Page 54

Outline
Protein Structure
  Why structure
How to Predict Protein Structure
  Experimental methods
  Computational methods (predictive methods)
Protein Structure Prediction
  Secondary structure prediction (2D)
    Machine learning methods for protein secondary structure prediction
  Tertiary structure prediction (3D)
    Ab initio
    Homology modeling

Page 55

Homology-based Prediction

Align query sequence with sequences of known structure, usually >30% similar

Superimpose the aligned sequence onto the structure template, according to the computed sequence alignment

Perform local refinement of the resulting structure in 3D

90% of new structures submitted to PDB in the past three years have similar folds in PDB

The number of unique structural folds is small (possibly a few thousand)
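The first step, deciding whether a known structure is close enough to serve as a template, usually starts from a sequence alignment. The sketch below computes percent identity from two already-aligned sequences as a simple stand-in for the ">30% similar" criterion. The function name `percent_identity`, the gap character `-`, the choice to divide by the number of gap-free columns, and the toy alignment are all assumptions; identity is only one of several similarity measures in use, and denominators vary between tools.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of gap-free alignment columns in which the residues match."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    matches = 0
    compared = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue                     # skip columns that contain a gap
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0

# Made-up alignment of a query against a template of known structure.
query    = "MKV-LA"
template = "MRVALA"
identity = percent_identity(query, template)
print(identity, identity > 30.0)
```

A template that passes such a threshold would then be used for the superposition and refinement steps listed above.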

Page 56

Homology-based Prediction

Raw model

Loop modeling

Side chain placement

Refinement

Page 57

Homology-based Prediction

Page 58

Outline
Protein Structure
  Why structure
How to Predict Protein Structure
  Experimental methods
  Computational methods (predictive methods)
Protein Structure Prediction
  Secondary structure prediction (2D)
    Machine learning methods for protein secondary structure prediction
  Tertiary structure prediction (3D)
    Ab initio
    Homology modeling