CSCE555 Bioinformatics, Lecture 18: Protein Bioinformatics and Protein Secondary Structure Prediction


Meeting: MW 4:00PM-5:15PM SWGN2A21

Instructor: Dr. Jianjun Hu

Course page: http://www.scigen.org/csce555
University of South Carolina
Department of Computer Science and Engineering
2008
www.cse.sc.edu

Outline
• Understanding protein structures
• Protein bioinformatics: what and why?
• Protein secondary structure prediction: problem & algorithm
• Summary

Proteins
• Large organic compounds made of amino acids
• Proteins play a crucial role in virtually all biological processes, with a broad range of functions.
• The activity of an enzyme or the function of a protein is governed by its three-dimensional structure.

How Proteins Are Generated
[Figure: protein folding]

Protein Bioinformatics
• Analysis and prediction of protein structures (structural bioinformatics)
  ◦ Protein design: design a sequence that will fold into a designated structure
• Assist experimental biology in assigning functions or suggesting functional hypotheses for all known proteins.

Protein Bioinformatics
[Figure: DNA → (transcription) → RNA → (translation) → protein → phenotype. Associated resources shown: genomic DNA databases; cDNA, ESTs, UniGene; gene expression database; protein sequence databases; protein structure databases.]

TOP 10 Most Wanted Solutions in Protein Bioinformatics
1. Protein sequence alignment
2. Predicting protein features from sequence
3. Function prediction
4. Protein structure prediction
5. Membrane proteins
6. Functional site identification
7. Protein-protein interaction
8. Protein-small molecule interaction (docking)
9. Protein design
10. Protein engineering

Why Protein Bioinformatics?
• Function = interactions
• Disease mechanism, gene regulation, drug design…

Relevance of Protein Structure in the Post-Genome Era
[Figure: sequence → structure → function → medicine]

Protein Structure Example
[Figure: example protein structure with 2 chains; labeled elements: beta sheet, helix, loop]

Protein Structure Is Hierarchical
[Figure: levels of protein structure. Single peptide chain: sequence, local folding, long-range folding. Multiple peptide chains: multimeric organization.]

How to Obtain Protein Structures
• Experimental methods (>50,000 structures)
  ◦ X-ray crystallography or NMR (nuclear magnetic resonance) spectroscopy
  ◦ Limitations: protein size; crystallography requires crystallized proteins
  ◦ Membrane proteins are difficult to crystallize
• Computational methods (predictive methods)
  ◦ 2-D structure (secondary structure)
  ◦ 3-D structure (tertiary structure)
  ◦ CASP competition: Critical Assessment of Techniques for Protein Structure Prediction

Protein Structure Prediction Problem
• Given the amino acid sequence of a protein, what is its shape in three-dimensional space?
  ◦ Sequence → secondary structure → 3D structure → function

Why Is Prediction Needed?
• The function of a protein is determined by its structure.
• Experimental methods to determine protein structure are time-consuming and expensive.
• There is a big gap between the number of available protein sequences and the number of available structures.

Growth of Protein Sequences and Structures
[Figure: growth of protein sequences and structures over time. Data from http://www.dna.affrc.go.jp. Annotations: "50,000 as of 2008" and "30000*X species".]

What Determines Structures: Inter-atomic Forces
• Covalent bond (short range, very strong)
  ◦ Binds atoms into molecules / macromolecules
• Hydrogen bond (short range, strong)
  ◦ Binds two polar groups (hydrogen + electronegative atom)
• Disulfide bond / bridge (short range, very strong)
  ◦ Covalent bond between sulfhydryl (sulfur + hydrogen) groups
• Hydrophobic / hydrophilic interaction (weak)
  ◦ Hydrogen bonding with H2O in solution
• Van der Waals interaction (very weak)
  ◦ Nonspecific electrostatic attractive force
• Electrostatic forces
  ◦ Oppositely charged side chains form salt bridges

Secondary Structure Prediction (2D)
• For each residue in a protein structure, there are three possible states: α (α-helix), β (β-strand), and t (others).
• [Figure: an amino acid sequence aligned with its secondary structure sequence]
• The accuracy of secondary structure prediction methods is currently about 80-82% (2006). The theoretical upper limit is about 90%, because secondary structure assignment in real proteins is uncertain to about 10%.
• Secondary structure prediction can provide useful information to improve other sequence and structure analysis methods, such as sequence alignment and 3-D modeling.
• PSIPRED prediction server: http://bioinf.cs.ucl.ac.uk/psipred/psiform.html

PSSP: Protein Secondary Structure Prediction
Three generations:
• Based on statistical information of single amino acids
• Based on local amino acid interactions (segments). Typically a segment contains 11-21 amino acids.
• Based on evolutionary information from homologous sequences

Formulate PSSP as a Machine Learning Classification Problem
• Use a sliding window that moves along the amino acid sequence
  ◦ Each window denotes an instance
  ◦ Each amino acid inside the window denotes an attribute
  ◦ The known secondary structure of the central amino acid is the class label
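The sliding-window formulation can be sketched in a few lines of Python. This is a minimal illustration, not the course's implementation: the function name `sliding_windows`, the padding symbol `-` for positions beyond the sequence ends, and the window size of 5 in the demo are all assumptions made for the example.

```python
from typing import List, Tuple

PAD = "-"  # hypothetical padding symbol for positions beyond the sequence ends


def sliding_windows(sequence: str, labels: str, window: int = 13) -> List[Tuple[str, str]]:
    """Return one (window of residues, label of central residue) pair per residue."""
    if len(sequence) != len(labels):
        raise ValueError("sequence and labels must have the same length")
    if window % 2 == 0:
        raise ValueError("window size must be odd so that a central residue exists")
    half = window // 2
    padded = PAD * half + sequence + PAD * half
    examples = []
    for i, label in enumerate(labels):
        # padded[i:i + window] is the window centered on sequence[i]
        examples.append((padded[i:i + window], label))
    return examples


# Demo on the first nine residues of the sequence used in the Q3 slide.
examples = sliding_windows("ALHEASGPS", "hhhhhoooo", window=5)
for residues, label in examples[:3]:
    print(residues, "->", label)
```

Each tuple is one training instance: the residues in the window are the attributes and the label of the central residue is the class.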

How to Generalize Protein Secondary Structure Prediction as a Machine Learning Problem?
• A set of "examples" is generated from sequences with known secondary structures
• The examples form a training set
• Build a neural network classifier
• Apply the classifier to a sequence with unknown secondary structure

Introduction to Neural Networks
• What is an artificial neural network?
  ◦ An extremely simplified model of the brain
  ◦ Essentially a function approximator
  ◦ Transforms inputs into outputs to the best of its ability

How Do Neural Networks Work?
• A neuron (perceptron) is a single-layer NN
• The output of a neuron is a function of the weighted sum of the inputs plus a bias

Activation Function
• Binary activation function
  ◦ f(x) = 1 if x >= 0
  ◦ f(x) = 0 otherwise
• The most common sigmoid function used is the logistic function
  ◦ f(x) = 1/(1 + e^(-x))
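As a small illustration of the two activation functions and of a neuron's output as an activation applied to the weighted sum of inputs plus a bias, here is a hedged Python sketch. The function names and the example weights (which make the neuron behave like a logical AND) are assumptions chosen for the demo.

```python
import math
from typing import Callable, Sequence


def binary_step(x: float) -> float:
    """Binary activation: 1 if x >= 0, otherwise 0."""
    return 1.0 if x >= 0 else 0.0


def logistic(x: float) -> float:
    """Logistic sigmoid: 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def neuron_output(
    inputs: Sequence[float],
    weights: Sequence[float],
    bias: float,
    activation: Callable[[float], float] = logistic,
) -> float:
    """Apply the activation to the weighted sum of the inputs plus the bias."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(weighted_sum)


# Illustrative weights: with the step activation this neuron computes logical AND.
for pattern in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(pattern, neuron_output(pattern, weights=(1.0, 1.0), bias=-1.5, activation=binary_step))
```

The logistic function is smooth and differentiable, which is what gradient-based training relies on; the binary step is not.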

Multi-Layer Feedforward NN Example
XOR problem (capable of nonlinear classification)
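To make the XOR example concrete, the sketch below wires up a two-layer network by hand. The weights are illustrative assumptions (one hidden unit acts as OR, the other as AND), not the values from the slide's figure; a single neuron cannot compute XOR because the two classes are not linearly separable.

```python
def step(x: float) -> int:
    """Binary activation: 1 if x >= 0, otherwise 0."""
    return 1 if x >= 0 else 0


def xor_network(x1: int, x2: int) -> int:
    """Two hidden units (OR and AND) feeding one output unit."""
    h_or = step(1.0 * x1 + 1.0 * x2 - 0.5)   # fires if at least one input is 1
    h_and = step(1.0 * x1 + 1.0 * x2 - 1.5)  # fires only if both inputs are 1
    # Output fires when OR is on and AND is off, which is XOR.
    return step(1.0 * h_or - 1.0 * h_and - 0.5)


for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", xor_network(a, b))
```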

Where Do the Weights Come From?
• The weights in a neural network are the most important factor in determining its function
• Training is the act of presenting the network with some sample data and modifying the weights to better approximate the desired function (class labels)
  ◦ Supervised training
    - Supplies the neural network with inputs and the desired outputs
    - Response of the network to the inputs is measured
    - The weights are modified to reduce the difference between the actual and desired outputs

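Training in Perceptron Neural Net

Training a perceptron: find the weights W that minimize the error function

$$E = \sum_{i=1}^{P} \left( F(W \cdot X_i) - t(X_i) \right)^2$$

• P: number of training data
• X_i: training vectors
• F(W·X_i): output of the perceptron
• t(X_i): target value for X_i

Use steepest descent:

- compute gradient:

$$\nabla E = \left( \frac{\partial E}{\partial w_1}, \frac{\partial E}{\partial w_2}, \frac{\partial E}{\partial w_3}, \ldots, \frac{\partial E}{\partial w_N} \right)$$

- update weight vector:

$$W_{new} = W_{old} - \varepsilon \nabla E \quad (\varepsilon\text{: learning rate})$$

- iterate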
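The steepest-descent recipe can be sketched as follows. This is a minimal example under stated assumptions: F is taken to be the logistic function, the bias is folded into W as the weight of a constant input of 1, and the training data (the logical AND function), learning rate, and epoch count are illustrative choices rather than anything prescribed by the slide.

```python
import math
from typing import List, Sequence, Tuple

Example = Tuple[Sequence[float], float]  # (training vector X_i, target t(X_i))


def logistic(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def perceptron_output(w: Sequence[float], x: Sequence[float]) -> float:
    """F(W.X): logistic function of the dot product of weights and inputs."""
    return logistic(sum(wk * xk for wk, xk in zip(w, x)))


def error(w: Sequence[float], data: Sequence[Example]) -> float:
    """E = sum over training data of (F(W.X_i) - t(X_i))^2."""
    return sum((perceptron_output(w, x) - t) ** 2 for x, t in data)


def gradient(w: Sequence[float], data: Sequence[Example]) -> List[float]:
    """Partial derivatives dE/dw_k, using F' = F * (1 - F) for the logistic function."""
    grad = [0.0] * len(w)
    for x, t in data:
        out = perceptron_output(w, x)
        common = 2.0 * (out - t) * out * (1.0 - out)
        for k, xk in enumerate(x):
            grad[k] += common * xk
    return grad


def train(w: Sequence[float], data: Sequence[Example],
          learning_rate: float = 0.1, epochs: int = 5000) -> List[float]:
    """Iterate W_new = W_old - learning_rate * gradient(E)."""
    w = list(w)
    for _ in range(epochs):
        grad = gradient(w, data)
        w = [wk - learning_rate * gk for wk, gk in zip(w, grad)]
    return w


# Logical AND; the last input component is a constant 1 acting as the bias input.
data = [((0, 0, 1), 0), ((0, 1, 1), 0), ((1, 0, 1), 0), ((1, 1, 1), 1)]
w0 = [0.0, 0.0, 0.0]
w = train(w0, data)
print("error before:", error(w0, data), "error after:", error(w, data))
```

With all weights at zero every output is 0.5, so the starting error is 4 × 0.25 = 1.0; each descent step then lowers E as long as the learning rate is small enough.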

Back-propagation Algorithm
• For a multi-layer NN, the errors of the hidden layers are not known directly.
• Back-propagation searches for weight values that minimize the total error of the network over the set of training examples.
  ◦ Forward pass: compute the outputs of all units in the network and the error of the output layer.
  ◦ Backward pass: the network error is back-propagated to update the weights (credit assignment problem).

04/21/23 Copyright G. A. Tagliarini, PhD

Feedforward Network Training by Backpropagation: Process Summary
• Select an architecture
• Randomly initialize weights
• While error is too large
  ◦ Select training pattern and feed forward to find actual network output
  ◦ Calculate errors and backpropagate error signals
  ◦ Adjust weights
• Evaluate performance using the test set
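The process summary can be sketched as a small training loop. Everything specific here is an assumption for illustration: a 2-3-1 architecture with logistic units, weights drawn uniformly from [-0.5, 0.5], a learning rate of 0.5, the XOR patterns as training data, and the output and hidden error signals taken from the error-signal slides later in this deck. Because XOR has only four patterns, the sketch reports training error rather than using a separate test set.

```python
import math
import random
from typing import List, Sequence, Tuple

Network = Tuple[List[List[float]], List[float]]  # (hidden weight rows, output weights)
Example = Tuple[Sequence[float], float]


def logistic(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def init_network(n_inputs: int, n_hidden: int, rng: random.Random) -> Network:
    """Randomly initialize weights; the last weight of each unit is its bias weight."""
    hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs + 1)] for _ in range(n_hidden)]
    output = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
    return hidden, output


def forward(net: Network, x: Sequence[float]) -> Tuple[List[float], float]:
    """Feed a pattern forward; return the hidden outputs and the network output."""
    hidden, output = net
    xb = list(x) + [1.0]  # the constant 1.0 is the bias input
    h = [logistic(sum(w * v for w, v in zip(row, xb))) for row in hidden]
    o = logistic(sum(w * v for w, v in zip(output, h + [1.0])))
    return h, o


def backprop_step(net: Network, x: Sequence[float], target: float, rate: float) -> None:
    """Compute error signals for one pattern and adjust the weights in place."""
    hidden, output = net
    h, o = forward(net, x)
    delta_out = (target - o) * o * (1.0 - o)
    # Hidden error signals use the output weights from before this update.
    delta_hidden = [hj * (1.0 - hj) * delta_out * output[j] for j, hj in enumerate(h)]
    for j, value in enumerate(h + [1.0]):
        output[j] += rate * delta_out * value
    xb = list(x) + [1.0]
    for j, row in enumerate(hidden):
        for k, value in enumerate(xb):
            row[k] += rate * delta_hidden[j] * value


def total_error(net: Network, data: Sequence[Example]) -> float:
    return sum((t - forward(net, x)[1]) ** 2 for x, t in data)


def train(net: Network, data: Sequence[Example], rate: float = 0.5,
          max_epochs: int = 10000, tolerance: float = 0.01) -> float:
    """Loop over the patterns while the error is too large; return the final error."""
    for _ in range(max_epochs):
        if total_error(net, data) <= tolerance:
            break
        for x, t in data:
            backprop_step(net, x, t, rate)
    return total_error(net, data)


xor_data = [((0.0, 0.0), 0.0), ((0.0, 1.0), 1.0), ((1.0, 0.0), 1.0), ((1.0, 1.0), 0.0)]
net = init_network(n_inputs=2, n_hidden=3, rng=random.Random(0))
error_before = total_error(net, xor_data)
error_after = train(net, xor_data)
print(f"total error: {error_before:.3f} -> {error_after:.3f}")
```

Training can stall in a poor local minimum for some initializations, so in practice one would try several random seeds and judge the result on held-out data.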

NN for Protein Secondary Structure Prediction

How to Encode Each Amino Acid?
20-bit binary sequence:
10000000000000000000 ----- A
01000000000000000000 ----- R
00100000000000000000 ----- N
…
00000000000000000001 ----- V
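A short sketch of this 20-bit (one-hot) encoding is given below. The slide shows only A, R, N and V; the full residue order used here, `ARNDCQEGHILKMFPSTWYV` (alphabetical by three-letter code), is an assumption that is consistent with those four entries. The function names are likewise hypothetical.

```python
from typing import List

# Assumed ordering: consistent with A, R, N first and V last, as on the slide.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"
INDEX = {residue: i for i, residue in enumerate(AMINO_ACIDS)}


def encode_residue(residue: str) -> List[int]:
    """Return a 20-element vector with a single 1 at the residue's position."""
    if residue not in INDEX:
        raise ValueError(f"unknown amino acid: {residue!r}")
    bits = [0] * len(AMINO_ACIDS)
    bits[INDEX[residue]] = 1
    return bits


def encode_window(window: str) -> List[int]:
    """Concatenate the encodings of all residues in a window into one input vector."""
    encoded: List[int] = []
    for residue in window:
        encoded.extend(encode_residue(residue))
    return encoded


print("".join(str(b) for b in encode_residue("R")), "----- R")
print(len(encode_window("ALHEA")), "inputs for a window of 5 residues")
```

A window of n residues therefore yields 20 × n network inputs. Windows that hang over the ends of the sequence need an extra convention, such as an additional input or an all-zero vector, which this sketch does not cover.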

Evaluation of Performance: Accuracy (Q3)

Q3 is the fraction of residues whose three-state secondary structure is predicted correctly.

Amino acid sequence:            ALHEASGPSVILFGSDVTVPPASNAEQAK
Actual secondary structure:     hhhhhooooeeeeoooeeeooooohhhhh
Predicted secondary structure:  ohhhooooeeeeoooooeeeooohhhhhh

Q3 = 22/29 = 76%

• Q3 for random prediction is 33%.
• Secondary structure assignment in real proteins is uncertain to about 10%; therefore, a "perfect" prediction would have Q3 = 90%.
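The Q3 figure on this slide can be reproduced with a few lines of Python. This is a minimal sketch; the function name `q3` is an assumption, and the second structure string is treated as the actual assignment and the third as the prediction (Q3 is symmetric, so the result does not depend on that choice).

```python
from typing import Tuple


def q3(actual: str, predicted: str) -> Tuple[int, float]:
    """Return (number of matching positions, fraction of matching positions)."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("structure strings must be non-empty and of equal length")
    matches = sum(a == p for a, p in zip(actual, predicted))
    return matches, matches / len(actual)


actual = "hhhhhooooeeeeoooeeeooooohhhhh"
predicted = "ohhhooooeeeeoooooeeeooohhhhhh"
matches, accuracy = q3(actual, predicted)
print(f"Q3 = {matches}/{len(actual)} = {accuracy:.0%}")  # prints "Q3 = 22/29 = 76%"
```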

Performances (CASP)

CASP     Year   # of targets   <Q3>   Group
CASP1    1994   6              63%    Rost and Sander
CASP2    1996   24             70%    Rost
CASP3    1998   18             75%    Jones
CASP4    2000   28             80%    Jones

Summary
• Protein bioinformatics is a very important area with many interesting problems
• Computational methods can have a big impact in medicine and molecular biology
• Protein secondary structure prediction algorithms are now quite accurate (Q3 of about 80%)

Slides Acknowledgements
• Jinbo Xu, University of Waterloo
• Xingquan Zhu

Why Predict Structure: Can Label Proteins by Dominant Structure
Protein classification, structural BLASTing

Amino Acids
[Figure: amino acid structure with the side chain labeled]

Each amino acid is identified by its side chain, which determines the properties of this amino acid.

Side Chain Properties

Property                Amino acids
Hydrophobic             V, L, I, M, F
Hydrophilic             N, E, Q, H, K, R, D
In-between              G, A, S, T, Y, W, C, P
Positively charged      R, H, K
Negatively charged      D, E
Polar but not charged   N, Q, S, T
Nonpolar                A, G, I, L, M, P, V
Aromatic                F, W, Y

• Hydrophobic amino acids stay inside of a protein, while hydrophilic ones tend to stay in the exterior of a protein.
• Oppositely charged amino acids can form salt bridges.
• Polar amino acids can participate in hydrogen bonding.

Alpha Helix Examples

Beta Sheet Examples
[Figure: parallel beta sheet; anti-parallel beta sheet]


Calculate Outputs for Each Neuron Based on the Pattern (Feedforward)

The output from neuron j for pattern p is O_pj, where

$$O_{pj}(net_j) = \frac{1}{1 + e^{-net_j}}$$

and

$$net_j = bias \cdot W_{bias} + \sum_{k} O_{pk} W_{jk}$$

k ranges over the input indices and W_jk is the weight on the connection from input k to neuron j.

[Figure: feedforward network, from inputs to outputs]


Calculate the Error Signal for Each Output Neuron

The output neuron error signal δ_pj is given by

$$\delta_{pj} = (T_{pj} - O_{pj}) \, O_{pj} \, (1 - O_{pj})$$

• T_pj is the target value of output neuron j for pattern p
• O_pj is the actual output value of output neuron j for pattern p


Calculate the Error Signal for Each Hidden Neuron

The hidden neuron error signal δ_pj is given by

$$\delta_{pj} = O_{pj} \, (1 - O_{pj}) \sum_{k} \delta_{pk} W_{kj}$$

where δ_pk is the error signal of a post-synaptic neuron k and W_kj is the weight of the connection from hidden neuron j to the post-synaptic neuron k.
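The factor O_pj (1 - O_pj) in both error signals comes from differentiating the logistic activation. The short derivation below sketches this under two assumptions that the slides do not state explicitly: the per-pattern error is taken as E_p = ½ Σ_j (T_pj - O_pj)², and the error signal is defined as δ_pj = -∂E_p/∂net_j. The learning rate symbol η in the last line is also an assumption.

$$
O_{pj} = \frac{1}{1 + e^{-net_j}}
\quad\Rightarrow\quad
\frac{dO_{pj}}{d\,net_j} = O_{pj}\,(1 - O_{pj})
$$

$$
\text{Output neuron:}\quad
\delta_{pj} = -\frac{\partial E_p}{\partial O_{pj}} \cdot \frac{dO_{pj}}{d\,net_j}
= (T_{pj} - O_{pj})\, O_{pj}\,(1 - O_{pj})
$$

$$
\text{Hidden neuron:}\quad
\delta_{pj} = \left( \sum_{k} -\frac{\partial E_p}{\partial net_k} \cdot \frac{\partial net_k}{\partial O_{pj}} \right) \frac{dO_{pj}}{d\,net_j}
= O_{pj}\,(1 - O_{pj}) \sum_{k} \delta_{pk} W_{kj}
$$

$$
\text{Weight adjustment:}\quad
\Delta W_{jk} = \eta \, \delta_{pj} \, O_{pk}
$$

The hidden-neuron case is the chain rule applied through every post-synaptic neuron k, which is why each hidden error signal is a weighted sum of the error signals in the next layer.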
