CSCE555 Bioinformatics Lecture 18 Protein Bioinformatics and Protein Secondary Structure Prediction Meeting: MW 4:00PM-5:15PM SWGN2A21 Instructor: Dr. Jianjun Hu Course page: http://www.scigen.org/csce555 University of South Carolina Department of Computer Science and Engineering 2008 www.cse.sc.edu

Jan 11, 2016

Transcript
Page 1

CSCE555 Bioinformatics
Lecture 18: Protein Bioinformatics and Protein Secondary Structure Prediction

Meeting: MW 4:00PM-5:15PM SWGN2A21

Instructor: Dr. Jianjun Hu

Course page: http://www.scigen.org/csce555

University of South Carolina
Department of Computer Science and Engineering
2008

www.cse.sc.edu

Page 2

Outline

• Understanding Protein Structures
• Protein bioinformatics: what and why?
• Protein Secondary Structure Prediction: problem & algorithm
• Summary

Page 3

Proteins

• Large organic compounds made of amino acids
• Proteins play a crucial role in virtually all biological processes, with a broad range of functions.
• The activity of an enzyme or the function of a protein is governed by its three-dimensional structure.

Page 4

How Proteins Are Generated

[Figure: how proteins are generated; label: folding]

Page 5

Protein Bioinformatics

• Analysis and prediction of protein structures (Structural Bioinformatics)
◦ Protein Design: design a sequence that will fold into a designated structure
• Assist experimental biology in assigning functions or suggesting functional hypotheses for all known proteins.

Page 6

Protein Bioinformatics

[Figure: DNA → (transcription) → RNA → (translation) → protein → phenotype. Data resources shown: genomic DNA databases; cDNA, ESTs, UniGene; gene expression database; protein sequence databases; protein structure databases]

Page 7

TOP 10 Most Wanted solutions in protein bioinformatics

1. Protein sequence alignment
2. Predicting protein features from sequence
3. Function prediction
4. Protein structure prediction
5. Membrane proteins
6. Functional site identification
7. Protein-protein interaction
8. Protein-small molecule interaction (Docking)
9. Protein design
10. Protein engineering

Page 8

Why Protein Bioinformatics?

• Function = interactions
• Disease mechanism, gene regulation, drug design…

Page 9

Relevance of Protein Structure in the Post-Genome Era

[Figure: sequence → structure → function → medicine]

Page 10

Protein Structure Example

[Figure: example protein structure; labels: Beta Sheet, Helix, Loop, 2 chains]

Page 11

Protein Structure is Hierarchical

• Single peptide chain: sequence, local folding, long-range folding
• Multiple peptide chains: multimeric organization

Page 12

How to Obtain Protein Structures

• Experimental methods (>50,000)
◦ X-ray crystallography or NMR (nuclear magnetic resonance) spectroscopy
◦ Limitations: protein size; X-ray crystallography requires crystallized proteins
◦ Membrane proteins are difficult to crystallize
• Computational methods (predictive methods)
◦ 2-D structure (secondary structure)
◦ 3-D structure (tertiary structure)
◦ CASP competition: Critical Assessment of Techniques for Protein Structure Prediction

Page 13

Protein Structure Prediction Problem

• Given the amino acid sequence of a protein, what is its shape in three-dimensional space?
◦ Sequence → secondary structure → 3D structure → function

Page 14

Why Is Prediction Needed?

• The function of a protein is determined by its structure.
• Experimental methods to determine protein structure are time-consuming and expensive.
• There is a big gap between the number of available protein sequences and the number of available structures.

Page 15

Growth of Protein Sequences and Structures

[Figure: growth of protein sequences and structures over time; annotations: 50,000 as of 2008; 30000*X species]

Data from http://www.dna.affrc.go.jp

Page 16

What determines structures: Inter-atomic Forces

• Covalent bond (short range, very strong)
◦ Binds atoms into molecules / macromolecules
• Hydrogen bond (short range, strong)
◦ Binds two polar groups (hydrogen + electronegative atom)
• Disulfide bond / bridge (short range, very strong)
◦ Covalent bond between sulfhydryl (sulfur + hydrogen) groups
• Hydrophobic / hydrophilic interaction (weak)
◦ Hydrogen bonding w/ H2O in solution
• Van der Waals interaction (very weak)
◦ Nonspecific electrostatic attractive force
• Electrostatic forces
◦ Oppositely charged side chains form salt bridges

Page 17

Secondary Structure Prediction (2D)

• For each residue in a protein structure, there are three possible states: α (α-helix), β (β-strand), t (others).

[Figure: amino acid sequence → secondary structure sequence]

• The accuracy of secondary structure prediction methods is currently about 80-82% (2006). The theoretical upper limit is about 90%, because secondary structure assignment in real proteins is uncertain to about 10%.
• Secondary structure prediction can provide useful information to improve other sequence and structure analysis methods, such as sequence alignment and 3-D modeling.

http://bioinf.cs.ucl.ac.uk/psipred/psiform.html

Page 18

PSSP: Protein Secondary Structure Prediction

Three Generations

• Based on statistical information of single amino acids
• Based on local amino acid interaction (segments). Typically a segment contains 11-21 amino acids
• Based on evolutionary information of the homologous sequences

Page 19

Formulate PSSP as a machine learning classification problem

• Use a sliding window that moves along the amino acid sequence
◦ Each window denotes an instance
◦ Each amino acid inside the window denotes an attribute
◦ The known secondary structure of the central amino acid is the class label
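The sliding-window formulation can be sketched in a few lines of Python. This is only an illustration: the function name `sliding_windows`, the default width of 13 (chosen from the 11-21 range mentioned for segments), and the use of `-` to pad the ends of the sequence are assumptions, not something the slides specify.

```python
def sliding_windows(sequence, structure, width=13, pad="-"):
    """Turn one labelled protein into (window, label) training instances.

    Each window is centred on one residue; the label is the known
    secondary structure of that central residue. Padding the ends with
    `pad` is an assumption made here so that every residue gets a window.
    """
    if width % 2 == 0:
        raise ValueError("width must be odd so the window has a centre")
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have equal length")
    half = width // 2
    padded = pad * half + sequence + pad * half
    return [(padded[i:i + width], structure[i]) for i in range(len(sequence))]


if __name__ == "__main__":
    for window, label in sliding_windows("ALHEASG", "hhhhhoo", width=5):
        print(window, label)
```

Each returned pair is one instance: the characters of the window are the attributes and the second element is the class label.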

Page 20

How to formulate protein secondary structure prediction as a machine learning problem?

• A set of "examples" is generated from sequences with known secondary structures
• The examples form a training set
• Build a neural network classifier
• Apply the classifier to a sequence with unknown secondary structure

Page 21

Introduction to Neural Networks

• What is an Artificial Neural Network?
◦ An extremely simplified model of the brain
◦ Essentially a function approximator
◦ Transforms inputs into outputs to the best of its ability

Page 22

How Do Neural Networks Work?

• A neuron (perceptron) is a single-layer NN
• The output of a neuron is a function of the weighted sum of the inputs plus a bias
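As a small illustration of "a function of the weighted sum of the inputs plus a bias", the sketch below computes one neuron's output. The names `neuron_output` and `step`, and the example weights that make the neuron behave like a logical AND, are chosen here for illustration only.

```python
def step(x):
    """Binary threshold activation: 1 if x >= 0, else 0."""
    return 1 if x >= 0 else 0


def neuron_output(inputs, weights, bias, activation):
    """Apply `activation` to the weighted sum of the inputs plus the bias."""
    net = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(net)


if __name__ == "__main__":
    # With weights (1, 1) and bias -1.5 the neuron fires only for input (1, 1).
    for a in (0, 1):
        for b in (0, 1):
            print(a, b, neuron_output([a, b], [1, 1], -1.5, step))
```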

Page 23

Activation Function

• Binary activation function
◦ f(x) = 1 if x >= 0
◦ f(x) = 0 otherwise
• The most common sigmoid function used is the logistic function
◦ f(x) = 1/(1 + e^{-x})
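Both activation functions are short enough to write out directly. The sketch below is one possible Python rendering; the function names are hypothetical, and the two-branch form of the logistic function is an implementation choice made here to avoid overflow for large negative inputs, not something taken from the slides.

```python
import math


def binary_step(x):
    """Binary activation: f(x) = 1 if x >= 0, otherwise 0."""
    return 1 if x >= 0 else 0


def logistic(x):
    """Logistic sigmoid f(x) = 1 / (1 + e^(-x)).

    The negative branch uses the algebraically equal form e^x / (1 + e^x)
    so that math.exp never receives a large positive argument.
    """
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


if __name__ == "__main__":
    for x in (-5, -1, 0, 1, 5):
        print(x, binary_step(x), round(logistic(x), 4))
```

Unlike the binary function, the logistic function is differentiable everywhere, which is what gradient-based training relies on.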

Page 24

Multi-Layer Feedforward NN Example

XOR problem (capable of nonlinear classification)
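One way to see why a hidden layer makes XOR solvable is to wire a tiny network by hand. The weights below are an illustrative choice (one hidden unit acting as OR, one as AND), not the weights from the slide's figure, and the function names are hypothetical.

```python
def step(x):
    """Binary threshold activation: 1 if x >= 0, else 0."""
    return 1 if x >= 0 else 0


def xor_network(x1, x2):
    """Two-layer feedforward network with hand-picked weights for XOR."""
    h_or = step(x1 + x2 - 0.5)    # hidden unit 1: fires if at least one input is 1
    h_and = step(x1 + x2 - 1.5)   # hidden unit 2: fires only if both inputs are 1
    return step(h_or - h_and - 0.5)  # output: OR but not AND


if __name__ == "__main__":
    for a in (0, 1):
        for b in (0, 1):
            print(a, b, xor_network(a, b))
```

No single neuron of this kind can produce the XOR outputs, because the two classes are not separable by one line in the input plane; the hidden layer supplies the two lines that are needed.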

Page 25

Where Do The Weights Come From?

• The weights in a neural network are the most important factor in determining its function
• Training is the act of presenting the network with some sample data and modifying the weights to better approximate the desired function (class labels)
◦ Supervised training
  - Supplies the neural network with inputs and the desired outputs
  - Response of the network to the inputs is measured
  - The weights are modified to reduce the difference between the actual and desired outputs

Page 26

Training in Perceptron Neural Net

Training a perceptron: find the weights W that minimize the error function

$$E = \sum_{i=1}^{P} \left( F(W \cdot X_i) - t(X_i) \right)^2$$

• P: number of training data
• X_i: training vectors
• F(W · X_i): output of the perceptron
• t(X_i): target value for X_i

Use steepest descent:

- compute gradient:

$$\nabla E = \left( \frac{\partial E}{\partial w_1}, \frac{\partial E}{\partial w_2}, \frac{\partial E}{\partial w_3}, \ldots, \frac{\partial E}{\partial w_N} \right)$$

- update weight vector:

$$W_{new} = W_{old} - \epsilon \nabla E$$

(ε: learning rate)

- iterate
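The steepest-descent loop can be sketched as follows. Several details are assumptions made for the example rather than taken from the slide: F is taken to be the logistic function, the bias is folded into W as its first component, the learning rate and iteration count are arbitrary, and the training data is the logical AND function.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def perceptron_output(w, x):
    """F(W . X) with w[0] acting as the bias weight (assumed convention)."""
    return sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))


def error(w, data):
    """E = sum over training pairs of (F(W . X_i) - t(X_i))^2."""
    return sum((perceptron_output(w, x) - t) ** 2 for x, t in data)


def gradient(w, data):
    """Gradient of E with respect to every weight, for a logistic F."""
    grad = [0.0] * len(w)
    for x, t in data:
        f = perceptron_output(w, x)
        common = 2.0 * (f - t) * f * (1.0 - f)  # dE/dnet for this example
        grad[0] += common                        # bias input is the constant 1
        for k, xk in enumerate(x):
            grad[k + 1] += common * xk
    return grad


def train(data, rate=0.5, iterations=2000):
    """Iterate W_new = W_old - rate * grad E, starting from zero weights."""
    w = [0.0] * (len(data[0][0]) + 1)
    for _ in range(iterations):
        g = gradient(w, data)
        w = [wi - rate * gi for wi, gi in zip(w, g)]
    return w


if __name__ == "__main__":
    and_data = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
    weights = train(and_data)
    print("error before:", error([0.0, 0.0, 0.0], and_data))
    print("error after: ", error(weights, and_data))
```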

Page 27

Back-propagation algorithm

• For a multi-layer NN, the errors of the hidden layers are not known
• Searches for weight values that minimize the total error of the network over the set of training examples
◦ Forward pass: compute the outputs of all units in the network, and the error of the output layer.
◦ Backward pass: the network error is backpropagated to update the weights (credit assignment problem).

Page 28

04/21/23 Copyright G. A. Tagliarini, PhD 28

Feedforward Network Training by Backpropagation: Process Summary

• Select an architecture
• Randomly initialize weights
• While error is too large
◦ Select training pattern and feedforward to find actual network output
◦ Calculate errors and backpropagate error signals
◦ Adjust weights
• Evaluate performance using the test set
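The process summary maps onto a short training loop. The sketch below is a minimal illustration, not the implementation used in the course: it assumes one hidden layer of logistic units, a single logistic output, a half-squared-error measure, pattern-by-pattern updates, and arbitrary values for the learning rate, hidden layer size, and stopping tolerance. The toy XOR data has no separate test set, so the final evaluation step is only indicated by printing the error.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def init_weights(n_in, n_hidden, seed=0):
    """Randomly initialize weights; index 0 of each list is the bias weight."""
    rng = random.Random(seed)
    w_hidden = [[rng.uniform(-1, 1) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    w_out = [rng.uniform(-1, 1) for _ in range(n_hidden + 1)]
    return w_hidden, w_out


def forward(x, w_hidden, w_out):
    hidden = [sigmoid(r[0] + sum(w * v for w, v in zip(r[1:], x))) for r in w_hidden]
    out = sigmoid(w_out[0] + sum(w * h for w, h in zip(w_out[1:], hidden)))
    return hidden, out


def total_error(data, w_hidden, w_out):
    return sum(0.5 * (t - forward(x, w_hidden, w_out)[1]) ** 2 for x, t in data)


def train(data, w_hidden, w_out, rate=0.5, max_epochs=5000, tolerance=0.01):
    """Update the weights in place until the error is small or epochs run out."""
    for _ in range(max_epochs):
        if total_error(data, w_hidden, w_out) <= tolerance:  # "while error is too large"
            break
        for x, t in data:
            hidden, out = forward(x, w_hidden, w_out)         # feedforward
            d_out = (t - out) * out * (1 - out)               # output error signal
            d_hidden = [h * (1 - h) * d_out * w_out[j + 1]    # backpropagated signals
                        for j, h in enumerate(hidden)]
            w_out[0] += rate * d_out                          # adjust weights
            for j, h in enumerate(hidden):
                w_out[j + 1] += rate * d_out * h
            for j, row in enumerate(w_hidden):
                row[0] += rate * d_hidden[j]
                for k, v in enumerate(x):
                    row[k + 1] += rate * d_hidden[j] * v


if __name__ == "__main__":
    xor_data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]
    w_h, w_o = init_weights(2, 4)
    print("error before training:", total_error(xor_data, w_h, w_o))
    train(xor_data, w_h, w_o)
    print("error after training: ", total_error(xor_data, w_h, w_o))
```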

Page 29

NN for Protein Secondary Structure Prediction

[Figure: neural network for protein secondary structure prediction]

Page 30

How to Encode Each Amino Acid?

20-bit binary sequence

10000000000000000000 ----- A
01000000000000000000 ----- R
00100000000000000000 ----- N
…
00000000000000000001 ----- V
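A possible Python rendering of this encoding is shown below. The slide lists only A, R, N and V explicitly; the full residue order `ARNDCQEGHILKMFPSTWYV` used here is an assumption (the conventional ordering that starts with A, R, N and ends with V), and the function names are hypothetical.

```python
# Assumed residue order: only A, R, N (first three) and V (last) come from the slide.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def one_hot(residue):
    """Return the 20-bit encoding of one amino acid as a list of 0/1 values."""
    index = AMINO_ACIDS.index(residue)  # raises ValueError for unknown letters
    return [1 if i == index else 0 for i in range(len(AMINO_ACIDS))]


def encode_window(window):
    """Concatenate the encodings of all residues in a window into one input vector."""
    bits = []
    for residue in window:
        bits.extend(one_hot(residue))
    return bits


if __name__ == "__main__":
    for residue in "ARNV":
        print("".join(str(b) for b in one_hot(residue)), "-----", residue)
```

With this scheme a window of w residues becomes a network input of 20 × w bits.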

Page 31

Evaluation of Performance: Accuracy (Q3)

Amino acid sequence:            ALHEASGPSVILFGSDVTVPPASNAEQAK
Actual secondary structure:     hhhhhooooeeeeoooeeeooooohhhhh
Predicted secondary structure:  ohhhooooeeeeoooooeeeooohhhhhh

Q3 = 22/29 = 76%

• Q3 for random prediction is 33%
• Secondary structure assignment in real proteins is uncertain to about 10%; therefore, a "perfect" prediction would have Q3 = 90%.
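Q3 is simply the fraction of residues whose predicted state matches the actual state. The sketch below recomputes the slide's example; the function name `q3` is chosen here, and the second state string is assumed to be the actual structure and the third the prediction (the score is the same either way).

```python
def q3(actual, predicted):
    """Fraction of residues whose predicted state equals the actual state."""
    if len(actual) != len(predicted):
        raise ValueError("state strings must have the same length")
    if not actual:
        raise ValueError("state strings must not be empty")
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


if __name__ == "__main__":
    actual = "hhhhhooooeeeeoooeeeooooohhhhh"
    predicted = "ohhhooooeeeeoooooeeeooohhhhhh"
    print(f"Q3 = {q3(actual, predicted):.0%}")
```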

Page 32

Performances (CASP)

CASP    Year   # of Targets   <Q3>   Group
CASP1   1994   6              63%    Rost and Sander
CASP2   1996   24             70%    Rost
CASP3   1998   18             75%    Jones
CASP4   2000   28             80%    Jones

Page 33

Summary

• Protein bioinformatics is a very important area with many interesting problems
• Computational methods can have a big impact in medicine and molecular biology
• Protein secondary structure prediction algorithms have become quite accurate, with Q3 reaching about 80%

Page 34

Slides Acknowledgements

• Jinbo Xu, University of Waterloo
• Xingquan Zhu

Page 35

Why predict structure: Can Label Proteins by Dominant Structure

Protein classification, Structural Blasting

Page 36

Amino Acids

[Figure: amino acid structure; label: Side chain]

Each amino acid is identified by its side chain, which determines the properties of this amino acid.

Page 37

Side Chain Properties

Hydrophobic             V, L, I, M, F
Hydrophilic             N, E, Q, H, K, R, D
In-between              G, A, S, T, Y, W, C, P
Positively charged      R, H, K
Negatively charged      D, E
Polar but not charged   N, Q, S, T
Nonpolar                A, G, I, L, M, P, V
Aromatic                F, W, Y

• Hydrophobic amino acids tend to stay in the interior of a protein, while hydrophilic ones tend to stay on the exterior.
• Oppositely charged amino acids can form salt bridges.
• Polar amino acids can participate in hydrogen bonding.

Page 38

Alpha Helix Examples

Page 39

Beta Sheet Examples

[Figure labels: Parallel beta sheet; Anti-parallel beta sheet]

Page 40

Calculate Outputs For Each Neuron Based On The Pattern

The output from neuron j for pattern p is O_pj, where

$$O_{pj}(net_j) = \frac{1}{1 + e^{-net_j}}$$

and

$$net_j = bias \cdot W_{bias} + \sum_{k} O_{pk} W_{jk}$$

k ranges over the input indices and W_jk is the weight on the connection from input k to neuron j.

[Figure: feedforward network, from inputs to outputs]

Page 41

Calculate The Error Signal For Each Output Neuron

The output neuron error signal δ_pj is given by

$$\delta_{pj} = (T_{pj} - O_{pj}) \, O_{pj} \, (1 - O_{pj})$$

• T_pj is the target value of output neuron j for pattern p
• O_pj is the actual output value of output neuron j for pattern p

Page 42

Calculate The Error Signal For Each Hidden Neuron

The hidden neuron error signal δ_pj is given by

$$\delta_{pj} = O_{pj} (1 - O_{pj}) \sum_{k} \delta_{pk} W_{kj}$$

where δ_pk is the error signal of a post-synaptic neuron k and W_kj is the weight of the connection from hidden neuron j to the post-synaptic neuron k.