CSCE555 Bioinformatics
Lecture 18: Protein Bioinformatics and Protein Secondary Structure Prediction
Meeting: MW 4:00PM-5:15PM SWGN2A21
Instructor: Dr. Jianjun Hu
Course page: http://www.scigen.org/csce555
University of South Carolina, Department of Computer Science and Engineering, 2008
www.cse.sc.edu
Outline
• Understanding protein structures
• Protein bioinformatics: what and why?
• Protein secondary structure prediction: problem & algorithm
• Summary
Proteins
• Large organic compounds made of amino acids
• Proteins play a crucial role in virtually all biological processes, with a broad range of functions.
• The activity of an enzyme or the function of a protein is governed by its three-dimensional structure.
How Proteins Are Generated
[Figure: how proteins are generated; labeled step: folding]
Protein Bioinformatics
• Analysis and prediction of protein structures (Structural Bioinformatics)
◦ Protein design: design a sequence that will fold into a designated structure
• Assist experimental biology in assigning functions or suggesting functional hypotheses for all known proteins.
Protein Bioinformatics
[Figure: information flow DNA → RNA → protein → phenotype (transcription, then translation), annotated with the data resources at each level: genomic DNA databases; cDNA, ESTs, UniGene; gene expression database; protein sequence databases; protein structure databases]
Top 10 Most Wanted Solutions in Protein Bioinformatics
1. Protein sequence alignment
2. Predicting protein features from sequence
3. Function prediction
4. Protein structure prediction
5. Membrane proteins
6. Functional site identification
7. Protein-protein interaction
8. Protein-small molecule interaction (docking)
9. Protein design
10. Protein engineering
Why Protein Bioinformatics?
• Function = interactions
• Disease mechanism, gene regulation, drug design…
Relevance of Protein Structure in the Post-Genome Era
[Figure: sequence → structure → function → medicine]
Protein Structure Example
[Figure: example protein structure with 2 chains; labeled elements: beta sheet, helix, loop]
Protein Structure Is Hierarchical
[Figure: levels of protein structure. Single peptide chain: sequence, local folding, long-range folding. Multiple peptide chains: multimeric organization.]
How to Obtain Protein Structures
• Experimental methods (>50,000 structures)
◦ X-ray crystallography or NMR (nuclear magnetic resonance) spectroscopy
◦ Limitations: protein size; crystallography requires crystallized proteins
◦ Membrane proteins are difficult to crystallize
• For each residue in a protein structure, there are three possible states: α (α-helix), β (β-strand), t (others).
[Figure: amino acid sequence → secondary structure sequence]
• Currently the accuracy of secondary structure prediction methods is about 80-82% (2006). The theoretical upper limit is about 90%, because secondary structure assignment in real proteins is itself uncertain to about 10%.
• Secondary structure prediction can provide useful information to improve other sequence and structure analysis methods, such as sequence alignment and 3-D modeling.
http://bioinf.cs.ucl.ac.uk/psipred/psiform.html
PSSP: Protein Secondary Structure Prediction
Three generations
• Based on statistical information of single amino acids
• Based on local amino acid interactions (segments). Typically a segment contains 11-21 amino acids
• Based on evolutionary information from homologous sequences
Formulate PSSP as a Machine Learning Classification Problem
• Use a sliding window that moves along the amino acid sequence
◦ Each window denotes an instance
◦ Each amino acid inside the window denotes an attribute
◦ The known secondary structure of the central amino acid is the class label
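The sliding-window formulation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `sliding_windows`, the `-` padding used at the sequence ends, and the short window of 5 (chosen for readability; the slides mention segments of 11-21 residues) are not from the lecture.

```python
def sliding_windows(sequence, labels, window=5, pad="-"):
    """Return one (window, class_label) instance per residue.

    The class label is the known secondary structure of the central residue.
    Padding the ends with `pad` is an assumption made for this sketch so that
    the first and last residues also get a full-width window.
    """
    if len(sequence) != len(labels):
        raise ValueError("sequence and labels must have the same length")
    if window % 2 == 0:
        raise ValueError("window must be odd so that it has a central residue")
    half = window // 2
    padded = pad * half + sequence + pad * half
    return [(padded[i:i + window], labels[i]) for i in range(len(sequence))]


# Toy example: 7 residues with hypothetical labels (h = helix, o = other).
instances = sliding_windows("ALHEASG", "hhhhhoo", window=5)
for attributes, label in instances:
    print(attributes, label)
```

Each printed row is one training instance: the five characters are the attributes and the trailing letter is the class label.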
How to Generalize Protein Secondary Structure Prediction as a Machine Learning Problem?
• A set of "examples" is generated from sequences with known secondary structures
• The examples form a training set
• Build a neural network classifier
• Apply the classifier to a sequence with unknown secondary structure
Introduction to Neural Networks
• What is an artificial neural network?
◦ An extremely simplified model of the brain
◦ Essentially a function approximator
◦ Transforms inputs into outputs to the best of its ability
How Do Neural Networks Work?
• A neuron (perceptron) is a single-layer NN
• The output of a neuron is a function of the weighted sum of the inputs plus a bias
Activation Function
• Binary activation function
◦ f(x) = 1 if x >= 0
◦ f(x) = 0 otherwise
• The most common sigmoid function is the logistic function
◦ f(x) = 1/(1 + e^(-x))
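As a minimal sketch, a single neuron with either activation function could be written as follows. The function names and the example weights are illustrative assumptions, not part of the lecture.

```python
import math


def step(x):
    """Binary activation: 1 if x >= 0, otherwise 0."""
    return 1.0 if x >= 0 else 0.0


def logistic(x):
    """Logistic sigmoid: 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def neuron_output(inputs, weights, bias, activation=logistic):
    """Apply the activation to the weighted sum of the inputs plus a bias."""
    net = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(net)


# Hypothetical weights: net = 0.5*1.0 + (-0.5)*0.0 - 0.5 = 0
print(neuron_output([1.0, 0.0], [0.5, -0.5], -0.5))        # logistic(0) = 0.5
print(neuron_output([1.0, 0.0], [0.5, -0.5], -0.5, step))  # step(0) = 1.0
```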
Multi-Layer Feedforward NN Example
• XOR problem (a multi-layer network is capable of nonlinear classification)
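To make the XOR point concrete, here is a small sketch of a two-layer network with hand-picked weights and the binary activation function. The weights are one possible choice made for this illustration (hidden units acting as OR and AND); they are not taken from the slide's figure.

```python
def step(x):
    """Binary activation: 1 if x >= 0, otherwise 0."""
    return 1 if x >= 0 else 0


def xor_network(x1, x2):
    """Two hidden units feeding one output unit, all with hand-set weights."""
    h_or = step(x1 + x2 - 0.5)         # fires if at least one input is 1
    h_and = step(x1 + x2 - 1.5)        # fires only if both inputs are 1
    return step(h_or - h_and - 0.5)    # OR but not AND, which is XOR


for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", xor_network(x1, x2))
```

No single neuron can produce this mapping, because the XOR classes are not linearly separable; the hidden layer is what makes it possible.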
Where Do the Weights Come From?
• The weights in a neural network are the most important factor in determining its function
• Training is the act of presenting the network with some sample data and modifying the weights to better approximate the desired function (class labels)
◦ Supervised training
- Supplies the neural network with inputs and the desired outputs
- Response of the network to the inputs is measured
- The weights are modified to reduce the difference between the actual and desired outputs
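Training in Perceptron Neural Net
Training a perceptron: find the weights W that minimize the error function

$$E = \sum_{i=1}^{P} \left( F(W \cdot X_i) - t(X_i) \right)^2$$

P: number of training data
X_i: training vectors
F(W·X_i): output of the perceptron
t(X_i): target value for X_i

Use steepest descent:
- compute gradient: $\nabla E = \left( \frac{\partial E}{\partial w_1}, \frac{\partial E}{\partial w_2}, \frac{\partial E}{\partial w_3}, \ldots, \frac{\partial E}{\partial w_N} \right)$
- update weight vector: $W_{new} = W_{old} - \epsilon \nabla E$ (ε: learning rate)
- iterate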
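The steepest-descent loop can be sketched as below. Several choices here are assumptions made for the example rather than details from the slide: F is taken to be the logistic function, the bias is handled by appending a constant input of 1 to every training vector, and the training data (logical AND), learning rate, and iteration count are arbitrary.

```python
import math


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def perceptron_output(weights, xs):
    """F(W . X): logistic function of the weighted sum (an assumed choice of F)."""
    return logistic(sum(w * x for w, x in zip(weights, xs)))


def total_error(weights, data):
    """E = sum over training vectors of (F(W . X_i) - t(X_i))^2."""
    return sum((perceptron_output(weights, xs) - t) ** 2 for xs, t in data)


def gradient(weights, data):
    """Gradient of E; by the chain rule dE/dw_k = sum_i 2 (F - t) F (1 - F) x_k."""
    grad = [0.0] * len(weights)
    for xs, t in data:
        out = perceptron_output(weights, xs)
        common = 2.0 * (out - t) * out * (1.0 - out)
        for k, x in enumerate(xs):
            grad[k] += common * x
    return grad


def train(data, learning_rate=0.5, iterations=2000):
    """Repeat W_new = W_old - learning_rate * grad(E), starting from zero weights."""
    weights = [0.0] * len(data[0][0])
    for _ in range(iterations):
        grad = gradient(weights, data)
        weights = [w - learning_rate * g for w, g in zip(weights, grad)]
    return weights


# Logical AND; the last input is fixed at 1 so that its weight acts as the bias.
and_data = [((0, 0, 1), 0), ((0, 1, 1), 0), ((1, 0, 1), 0), ((1, 1, 1), 1)]
initial_error = total_error([0.0, 0.0, 0.0], and_data)
trained = train(and_data)
final_error = total_error(trained, and_data)
print("error before:", initial_error, "after:", final_error)
```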
Back-propagation Algorithm
• For a multi-layer NN, the errors of the hidden layers are not known
• Searches for weight values that minimize the total error of the network over the set of training examples
◦ Forward pass: compute the outputs of all units in the network, and the error of the output layer.
◦ Backward pass: the network error is backpropagated to update the weights (credit assignment problem).
04/21/23 Copyright G. A. Tagliarini, PhD 28
Feedforward Network Training by Backpropagation: Process Summary
• Select an architecture
• Randomly initialize weights
• While error is too large
◦ Select training pattern and feed forward to find actual network output
◦ Calculate errors and backpropagate error signals
◦ Adjust weights
• Evaluate performance using the test set
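The process summary maps onto code roughly as follows. This is a sketch, not the lecture's implementation: it assumes logistic units, one hidden layer, a single output, per-pattern weight updates, and a fixed random seed, and the architecture, learning rate, and stopping tolerance are arbitrary. The toy XOR data has no separate test set, so the final evaluation step is only approximated by measuring the training error.

```python
import math
import random


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(x, hidden_w, output_w):
    """Feed one pattern forward; the last weight in every row is the bias weight."""
    hidden = [logistic(sum(w * v for w, v in zip(row, x + [1.0]))) for row in hidden_w]
    output = logistic(sum(w * v for w, v in zip(output_w, hidden + [1.0])))
    return hidden, output


def total_error(patterns, hidden_w, output_w):
    return sum((t - forward(x, hidden_w, output_w)[1]) ** 2 for x, t in patterns)


def train(patterns, n_hidden=3, rate=0.5, max_epochs=5000, tolerance=0.05, seed=0):
    rng = random.Random(seed)
    n_in = len(patterns[0][0])
    # Select an architecture and randomly initialize the weights.
    hidden_w = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    output_w = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
    initial_error = total_error(patterns, hidden_w, output_w)
    for _ in range(max_epochs):
        # While the error is too large...
        if total_error(patterns, hidden_w, output_w) <= tolerance:
            break
        for x, t in patterns:
            # Feed forward to find the actual network output.
            hidden, out = forward(x, hidden_w, output_w)
            # Calculate the output error signal and backpropagate it.
            delta_out = (t - out) * out * (1.0 - out)
            delta_hidden = [h * (1.0 - h) * delta_out * output_w[j]
                            for j, h in enumerate(hidden)]
            # Adjust the weights.
            for j, v in enumerate(hidden + [1.0]):
                output_w[j] += rate * delta_out * v
            for j, row in enumerate(hidden_w):
                for k, v in enumerate(x + [1.0]):
                    row[k] += rate * delta_hidden[j] * v
    return hidden_w, output_w, initial_error


xor_patterns = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 0.0)]
hidden_w, output_w, initial_error = train(xor_patterns)
final_error = total_error(xor_patterns, hidden_w, output_w)
print("error before:", initial_error, "after:", final_error)
```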
NN for Protein Secondary Structure Prediction
How to Encode Each Amino Acid?
20-bit binary sequence
10000000000000000000 ----- A
01000000000000000000 ----- R
00100000000000000000 ----- N
…
00000000000000000001 ----- V
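A minimal sketch of this 20-bit encoding is shown below. The slide lists only A, R, N and V; the full residue order used here (alphabetical by three-letter name) is an assumption that is consistent with those four entries, and the function names are illustrative.

```python
# Assumed residue order: matches the slide for A, R, N (first three) and V (last).
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def encode_residue(residue):
    """Return a 20-element list with a single 1 at the residue's position."""
    bits = [0] * len(AMINO_ACIDS)
    bits[AMINO_ACIDS.index(residue)] = 1
    return bits


def encode_window(window):
    """Concatenate the encodings of all residues in a window into one input vector."""
    return [bit for residue in window for bit in encode_residue(residue)]


for residue in "ARNV":
    print("".join(str(b) for b in encode_residue(residue)), "-----", residue)
print(len(encode_window("ALHEA")))  # 5 residues x 20 bits
```

With this encoding a window of w residues becomes a network input of 20 × w bits.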
Evaluation of Performance: Accuracy (Q3)
Q3 is the fraction of residues whose predicted state (h, e, or o) matches the actual state.

Amino acid sequence:            ALHEASGPSVILFGSDVTVPPASNAEQAK
Actual secondary structure:     hhhhhooooeeeeoooeeeooooohhhhh
Predicted secondary structure:  ohhhooooeeeeoooooeeeooohhhhhh

Q3 = 22/29 = 76%
• Q3 for random prediction is 33%
• Secondary structure assignment in real proteins is uncertain to about 10%; therefore, a "perfect" prediction would have Q3 = 90%.
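The Q3 figure for this example can be checked with a few lines of Python. The sketch assumes the first state string is the actual structure and the second is the prediction; the function name `q3` is illustrative.

```python
def q3(actual, predicted):
    """Return (number of matching residues, total residues)."""
    if len(actual) != len(predicted):
        raise ValueError("state strings must have the same length")
    matches = sum(a == p for a, p in zip(actual, predicted))
    return matches, len(actual)


actual = "hhhhhooooeeeeoooeeeooooohhhhh"
predicted = "ohhhooooeeeeoooooeeeooohhhhhh"
matches, total = q3(actual, predicted)
print(f"Q3 = {matches}/{total} = {matches / total:.0%}")  # Q3 = 22/29 = 76%
```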
Performances (CASP)

CASP    Year   # of targets   Mean Q3   Group
CASP1   1994   6              63%       Rost and Sander
CASP2   1996   24             70%       Rost
CASP3   1998   18             75%       Jones
CASP4   2000   28             80%       Jones
Summary
• Protein bioinformatics is a very important area with many interesting problems
• Computational methods can have a big impact in medicine and molecular biology
• Protein secondary structure prediction algorithms are strong, reaching about 80% Q3 accuracy
Slides Acknowledgements
• Jinbo Xu, University of Waterloo
• Xingquan Zhu
Why Predict Structure: Can Label Proteins by Dominant Structure
• Protein classification, Structural Blasting
Amino Acids
[Figure: amino acid structure with the side chain labeled]
• Each amino acid is identified by its side chain, which determines the properties of this amino acid.
Side Chain Properties

Hydrophobic:            V, L, I, M, F
Hydrophilic:            N, E, Q, H, K, R, D
In-between:             G, A, S, T, Y, W, C, P
Positively charged:     R, H, K
Negatively charged:     D, E
Polar but not charged:  N, Q, S, T
Nonpolar:               A, G, I, L, M, P, V
Aromatic:               F, W, Y

• Hydrophobic amino acids tend to stay inside a protein, while hydrophilic ones tend to stay on the exterior.
• Oppositely charged amino acids can form salt bridges.
• Polar amino acids can participate in hydrogen bonding.
Alpha Helix Examples
[Figure: alpha helix examples]
Beta Sheet Examples
[Figure: parallel beta sheet and anti-parallel beta sheet]
Calculate Outputs for Each Neuron Based on the Pattern
• The output from neuron j for pattern p is $O_{pj}$, where

$$O_{pj}(net_j) = \frac{1}{1 + e^{-net_j}}$$

and

$$net_j = bias \cdot W_{bias} + \sum_k O_{pk} W_{jk}$$

• k ranges over the input indices and $W_{jk}$ is the weight on the connection from input k to neuron j
[Figure: feedforward network, from inputs to outputs]
Calculate the Error Signal for Each Output Neuron
• The output neuron error signal $\delta_{pj}$ is given by $\delta_{pj} = (T_{pj} - O_{pj})\, O_{pj}\, (1 - O_{pj})$
• $T_{pj}$ is the target value of output neuron j for pattern p
• $O_{pj}$ is the actual output value of output neuron j for pattern p
Calculate the Error Signal for Each Hidden Neuron
• The hidden neuron error signal $\delta_{pj}$ is given by

$$\delta_{pj} = O_{pj}\, (1 - O_{pj}) \sum_k \delta_{pk} W_{kj}$$

where $\delta_{pk}$ is the error signal of a post-synaptic neuron k and $W_{kj}$ is the weight of the connection from hidden neuron j to the post-synaptic neuron k
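The two error signals can be computed as in the sketch below. It assumes logistic units and uses the standard backpropagation form for the hidden signal, δ_pj = O_pj (1 - O_pj) Σ_k δ_pk W_kj; the function names and the numbers in the example are hypothetical.

```python
def output_error_signal(target, output):
    """delta_pj = (T_pj - O_pj) * O_pj * (1 - O_pj) for an output neuron."""
    return (target - output) * output * (1.0 - output)


def hidden_error_signal(output, downstream_deltas, downstream_weights):
    """delta_pj = O_pj * (1 - O_pj) * sum_k delta_pk * W_kj for a hidden neuron.

    downstream_deltas[k] is the error signal of post-synaptic neuron k and
    downstream_weights[k] is the weight from this hidden neuron to neuron k.
    """
    weighted = sum(d * w for d, w in zip(downstream_deltas, downstream_weights))
    return output * (1.0 - output) * weighted


# One output neuron with target 1.0 and actual output 0.5.
delta_out = output_error_signal(target=1.0, output=0.5)      # 0.5 * 0.5 * 0.5 = 0.125
# One hidden neuron with output 0.5, connected to that output neuron by weight 2.0.
delta_hid = hidden_error_signal(0.5, [delta_out], [2.0])      # 0.25 * 0.25 = 0.0625
print(delta_out, delta_hid)
```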