Protein Structure Analysis - II Liangjiang (LJ) Wang ljwang@ksu.edu April 10, 2005 PLPTH 890 Introduction to Genomic Bioinformatics Lecture 23

Dec 20, 2015

Page 1: Protein Structure Analysis - II Liangjiang (LJ) Wang ljwang@ksu.edu April 10, 2005 PLPTH 890 Introduction to Genomic Bioinformatics Lecture 23.

Protein Structure Analysis - II

Liangjiang (LJ) Wang

ljwang@ksu.edu

April 10, 2005

PLPTH 890 Introduction to Genomic Bioinformatics

Lecture 23


Outline

• Protein structure alignment (DALI and VAST).

• Protein secondary structure prediction (PHDsec, PSIPRED, etc).

• Prediction of 3-D protein structures:

– Homology modeling.

– Threading.

– Ab initio prediction.

• Protein structural genomics.


Protein Structure Comparison

• Why is structure comparison important?

– To understand structure-function relationships.

– To study the evolution of many key proteins (structure is more conserved than sequence).

• Comparing 3-D structures is much more difficult than sequence comparison.

• Protein structure classification:

– SCOP: Structural Classification Of Proteins.

– CATH: Class, Architecture, Topology and Homologous superfamily.

• Protein structure alignment: DALI and VAST.


Protein Structure Alignment

• Positions of atoms in two or more 3-D protein structures are compared.

• Must first determine which atoms to align. At least two sets of three common reference points should be identified.

• Atoms in structures are matched to minimize the average deviation.

• Computers are NOT good at comparing 3-D objects (an NP-hard problem).

(Baxevanis and Ouellette, 2005)
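The deviation between matched atoms is commonly summarized as a root mean square deviation (RMSD). The following is a minimal sketch, assuming the atoms have already been paired and the two structures already superimposed; the function name and the coordinates are hypothetical and only illustrate the arithmetic.

```python
import math

def rmsd(coords_a, coords_b):
    """RMSD between two equal-length lists of (x, y, z) points.

    Assumption: atom i in coords_a is matched to atom i in coords_b,
    and the structures are already superimposed.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))

# Hypothetical coordinates (in angstroms) for three matched atoms.
structure_1 = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
structure_2 = [(0.0, 1.0, 0.0), (1.5, 1.0, 0.0), (3.0, 1.0, 0.0)]
print(rmsd(structure_1, structure_2))
```

Finding the pairing and the superposition that minimize this value is the hard part of structure alignment; the sketch covers only the final scoring step.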


How to Compare Structures?

[Flow diagram: Structure 1 and Structure 2 → (feature extraction) → Description 1 and Description 2 → (comparison) → Scores → (statistical analysis) → Similarity, classification]


DALI

• DALI stands for Distance matrix ALIgnment.

• Each structure is represented as a two-dimensional array (matrix) of distances between all pairs of Cα atoms.

– Remember what a Cα atom is?

• Assume that similar 3-D structures have similar inter-residue distances.

• DALI uses distance matrices to align protein structures.

• DALI is available at http://www.ebi.ac.uk/dali/.
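The distance-matrix representation can be sketched as follows. This is an illustration only, not DALI's algorithm: the coordinates are hypothetical, and the helper `mean_abs_difference` is an assumed, very crude way of comparing two matrices of the same size. It does show why the representation is convenient: moving a structure in space leaves its distance matrix unchanged.

```python
import math

def distance_matrix(ca_coords):
    """All-against-all distances between C-alpha coordinates."""
    n = len(ca_coords)
    return [[math.dist(ca_coords[i], ca_coords[j]) for j in range(n)]
            for i in range(n)]

def mean_abs_difference(m1, m2):
    """Hypothetical helper: average absolute difference of two equal-size matrices."""
    n = len(m1)
    return sum(abs(m1[i][j] - m2[i][j])
               for i in range(n) for j in range(n)) / (n * n)

# Hypothetical C-alpha positions (angstroms) and a translated copy.
ca = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0)]
shifted = [(x + 10.0, y - 5.0, z + 2.0) for x, y, z in ca]

m_original = distance_matrix(ca)
m_shifted = distance_matrix(shifted)
print(m_original[0][1])
print(mean_abs_difference(m_original, m_shifted))
```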


VAST

• VAST stands for Vector Alignment Search Tool.

• Each structure is represented as a set of secondary structure elements (SSEs).

– SSEs: helices or strands.

• VAST scores pairs of SSEs based on their type, orientation and connectivity.

• The SSE matches of statistical significance are then extended (similar to BLAST).

• Structures in MMDB have been pre-computed, and organized as structure neighbors in Entrez.

• VAST can be accessed at http://www.ncbi.nlm.nih.gov/Structure/VAST/vast.shtml.
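As a rough illustration of treating SSEs as vectors, the sketch below builds a vector from the start and end points of each element and measures the angle between two of them. The function names and coordinates are hypothetical, and VAST's actual scoring scheme is not reproduced here.

```python
import math

def sse_vector(start, end):
    """Vector pointing from the start to the end of a secondary structure element."""
    return tuple(e - s for s, e in zip(start, end))

def angle_degrees(v1, v2):
    """Angle between two 3-D vectors, in degrees."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.hypot(*v1) * math.hypot(*v2)
    cosine = max(-1.0, min(1.0, dot / norm))  # guard against rounding error
    return math.degrees(math.acos(cosine))

# Hypothetical end points: a helix running along x and a strand running along y.
helix = sse_vector((0.0, 0.0, 0.0), (15.0, 0.0, 0.0))
strand = sse_vector((2.0, 1.0, 0.0), (2.0, 11.0, 0.0))
print(angle_degrees(helix, strand))
```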


Secondary Structure Prediction

• Given the sequence of a polypeptide, secondary structures are predicted.

• Assume that secondary structures are fully determined by local interactions among neighboring residues.

• Early analyses were based on the frequencies of amino acids found in different types of secondary structures.

– For example, proline is common at turns, but rare within helices.

• Modern approaches use machine learning techniques and multiple sequence alignments.
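The frequency-based idea behind the early methods can be sketched as a propensity: how often a residue appears in a given state relative to how often it appears overall. The function below is a minimal sketch of that ratio, not any published method, and the single labeled sequence is far too little data for real use.

```python
from collections import Counter

def propensities(sequences, labels, state):
    """Propensity of each amino acid for one secondary-structure state.

    propensity = (fraction of `state` residues that are this amino acid)
               / (fraction of all residues that are this amino acid)
    """
    aa_total = Counter()
    aa_in_state = Counter()
    for seq, lab in zip(sequences, labels):
        for aa, s in zip(seq, lab):
            aa_total[aa] += 1
            if s == state:
                aa_in_state[aa] += 1
    n_total = sum(aa_total.values())
    n_state = sum(aa_in_state.values())
    if n_state == 0:
        raise ValueError("state not present in the labels")
    return {aa: (aa_in_state[aa] / n_state) / (aa_total[aa] / n_total)
            for aa in aa_total}

# Toy labeled data: H = helix, E = sheet, L = loop.
seqs = ["QEALDAAGDKLVVVDF"]
labs = ["HHHHHHLLLLEEEEEE"]
helix_propensity = propensities(seqs, labs, "H")
print(helix_propensity["A"], helix_propensity["V"])
```

A value above 1 means the residue is over-represented in that state in the training data; a value of 0 means it was never observed there.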


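Machine Learning Approach

QEALDAAGDKLVVVDF

HHHHHHLLLLEEEEEE

H – Helix

E – Sheet

L – Loop

[Flow diagram: the Training Dataset is used for training a Classifier (Model); the Test Dataset is used for testing it; a "Performance?" decision follows, with "Yes" leading to Prediction and a "No" branch]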


PHDsec

• For a given protein sequence:

– Search for homologous sequences.

– Produce a multiple sequence alignment.

– Generate a profile (evolutionary information).

• PHDsec uses a feed-forward artificial neural network to predict the secondary structures.

[Figure: feed-forward neural network with an input layer (residues of the sequence), a hidden layer, and an output layer (H, E, L)]

(PHDsec can be accessed at http://www.predictprotein.org/)
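The profile step can be illustrated by counting residue frequencies in each column of a multiple sequence alignment. This is a minimal sketch under simple assumptions (gaps are ignored, no sequence weighting or pseudocounts); the alignment is hypothetical, and the exact profile encoding used by PHDsec is not given in these slides.

```python
from collections import Counter

def column_profile(alignment):
    """Per-column residue frequencies for equal-length aligned sequences."""
    length = len(alignment[0])
    profile = []
    for i in range(length):
        # Skip gap characters when counting residues in this column.
        column = [seq[i] for seq in alignment if seq[i] != "-"]
        counts = Counter(column)
        total = sum(counts.values())
        profile.append({aa: c / total for aa, c in counts.items()} if total else {})
    return profile

# Hypothetical alignment of four homologous fragments.
msa = ["QEALD", "QDALD", "KEAID", "QE-LD"]
profile = column_profile(msa)
print(profile[0])
print(profile[2])
```

Each position of the query is then described by a vector of frequencies rather than by a single letter, which is the evolutionary information the network receives as input.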


PSIPRED

• For a given protein sequence:

– Perform a PSI-BLAST search.

– Create a profile that conveys the evolutionary information at each position.

– Feed the profile into a system of neural networks (or support vector machines).

• PSIPRED can be accessed at http://bioinf.cs.ucl.ac.uk/psipred/.


How to Evaluate the Performance?

• EVA: an independent server for evaluation of protein structure prediction methods.

• The best tool for three-state per-residue secondary structure prediction now reaches an accuracy of about 78%.

(http://cubic.bioc.columbia.edu/eva/)
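Three-state per-residue accuracy (often called Q3) is the percentage of residues whose predicted state matches the observed one. The sketch below shows the calculation; the predicted string is hypothetical and the function name is an assumption.

```python
def q3_accuracy(predicted, observed):
    """Percentage of residues with the correct H/E/L state."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("need two non-empty strings of equal length")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)

observed = "HHHHHHLLLLEEEEEE"
predicted = "HHHHHLLLLLEEEEHH"  # hypothetical prediction with 3 errors
print(q3_accuracy(predicted, observed))
```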


Prediction of 3-D Protein Structures

• There are about 30,000 structures in PDB, but more than 1.8 million non-redundant protein sequences in UniProt (Swiss-Prot + TrEMBL).

• Computational structure prediction may provide valuable information for most of the protein sequences derived from genome sequencing projects.

• Three predictive methods:

– Homology (or comparative) modeling.

– Threading (or fold recognition).

– Ab initio structure prediction.


Sequence - Structure Relationship

• In cells, protein folding is determined by the amino acid sequence. However, protein structures can also be affected by post-translational modifications and the cellular environment.

• Proteins with ≥ 30% sequence identity tend to have similar structures. However, exceptions do exist …

(Viral capsid protein, 1PIV:1) (Glycosyltransferase, 1HMP:A)

80-residue stretch (yellow) with 40% sequence identity

(Bourne, 2004)


Homology Modeling

• Probably the most accurate method for protein structure prediction.

• Five different steps:

– Find a known structure related to the query sequence by sequence comparison.

– Align the query sequence with the known structure (template).

– Build a model by modifying the backbone and side chains of the template.

– Refine the model using energy minimization.

– Validate the model using visual inspection or software tools.


Homology Modeling (Cont’d)

• Accuracy of structure prediction depends on the percent amino acid sequence identity shared between the query and template.

• For >50% sequence identity, RMSD (Root Mean Square Deviation) is only about 1 Å for main-chain atoms, which is comparable to the accuracy of a medium-resolution NMR structure or a low-resolution X-ray structure.

• Homology modeling is generally not reliable for predicting protein structures if the sequence identity is less than 30%.
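Percent identity between an aligned query and template can be computed as in the sketch below. The denominator (columns where neither sequence has a gap) is one of several conventions and is an assumption here, as are the function names, the wording of the middle category, and the example sequences; the 50% and 30% thresholds are the ones quoted above.

```python
def percent_identity(aligned_query, aligned_template):
    """Percent identity over alignment columns without gaps."""
    pairs = [(a, b) for a, b in zip(aligned_query, aligned_template)
             if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no aligned residue pairs")
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)

def modeling_outlook(identity):
    """Rough reading of the thresholds quoted in the slides."""
    if identity > 50.0:
        return "main-chain RMSD of about 1 angstrom can be expected"
    if identity >= 30.0:
        return "homology modeling is feasible, with lower accuracy"
    return "below 30%: homology modeling is generally not reliable"

# Hypothetical pairwise alignment.
query = "MKTAYIAKQR"
template = "MKTAHIA-QR"
identity = percent_identity(query, template)
print(round(identity, 1), modeling_outlook(identity))
```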


Homology Modeling Servers

• SWISS-MODEL (http://swissmodel.expasy.org/): A popular site for structure homology modeling.

• SDSC1 (http://cl.sdsc.edu/hm.html): the #1 ranked server for homology modeling on the EVA site.

[Screenshot: EVA results highlighting SDSC1 (http://cubic.bioc.columbia.edu/eva/)]


Threading

[Figure from Baxevanis and Ouellette, 2005]


Threading (Cont’d)

• Threading takes a query sequence and passes (threads) it through the 3-D structure of each protein in a fold database (known structures).

• As a sequence is threaded, the fit of the sequence in the fold is evaluated using some functions of energy or packing efficiency.

• Threading may find a common fold for proteins with essentially no detectable sequence similarity.

• Structures predicted from threading techniques often are not of high quality (RMSD > 3 Å).

• Based on EVA results, 3D-PSSM is the best threading server (http://www.sbg.bio.ic.ac.uk/~3dpssm/).


Ab Initio Structure Prediction

• Ab initio prediction can be used when a protein sequence has no detectable homologues in PDB.

• Protein folding is modeled based on global free-energy minimization.

• Since the protein folding problem has not yet been solved, the ab initio prediction methods are still experimental and can be quite unreliable.

• One of the top ab initio prediction methods is called Rosetta, which was found to be able to successfully predict 61% of structures (80 of 131) within 6.0 Å RMSD (Bonneau et al., 2002).

• The HMMSTR/Rosetta Server can be accessed at http://www.bioinfo.rpi.edu/~bystrc/hmmstr/server.php.


Comparing Structure Prediction Methods

A – C: homology modeling with 60% (A), 40% (B) and 30% (C) sequence identity.

D and E: ab initio protein structure prediction.

Predicted structures are in red, and actual structures are in blue.

(Baker and Sali, 2000)


Example: Cysteine-Rich Peptides

[Figure: cysteine-rich peptides with conserved cysteine (C) positions; the signal helix and cleavage site are marked]

NCR: Nodule-specific Cysteine Rich genes in legumes.

Avr9: fungal avirulence protein from Cladosporium fulvum.

Defensin: antimicrobial peptides.

Proteinase inhibitor: Serine proteinase inhibitors.

SCR6: from the S-locus of Brassica; involved in self-incompatibility (SI); interacts with SRK6.


Ab Initio Prediction of Cys Rich Peptides

[Figure panels: LSG-TC51151; PsENOD3; Defensin (AAG40321, M. sativa); Avr9 (Cladosporium fulvum)]


Protein Structural Genomics

• A worldwide initiative aimed at determining a large number of protein structures in a high-throughput mode.

• In the US, nine structural genomics centers have been funded by the National Institutes of Health (NIH).

• More information may be found at http://www.rcsb.org/pdb/strucgen.html.

• TargetDB (http://targetdb.pdb.org/): a centralized registration database for target sequences from the worldwide structural genomics projects.


A Target Selection Pipeline from JCSG

[Flow diagram: target-selection filters and methods, including TMHMM, protein size (7 - 80 kDa), low complexity, redundancy, and BLAST against PDB sequences]
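One of the filters in such a pipeline, the protein size window, can be sketched as follows. This is not the JCSG implementation: the mass is estimated with the rule-of-thumb average of about 110 Da per residue, and the function names and test sequences are hypothetical. The other filters (transmembrane prediction, low complexity, redundancy, BLAST against PDB) would rely on external tools and are not shown.

```python
AVERAGE_RESIDUE_MASS_DA = 110.0  # rule-of-thumb approximation, not an exact mass

def estimated_mass_kda(sequence):
    """Approximate molecular mass of a protein from its length."""
    return len(sequence) * AVERAGE_RESIDUE_MASS_DA / 1000.0

def passes_size_filter(sequence, min_kda=7.0, max_kda=80.0):
    """True if the estimated mass lies within the 7 - 80 kDa window."""
    return min_kda <= estimated_mass_kda(sequence) <= max_kda

# Hypothetical candidates: a 100-residue protein and a 30-residue peptide.
candidate_a = "M" + "A" * 99
candidate_b = "M" + "A" * 29
print(estimated_mass_kda(candidate_a), passes_size_filter(candidate_a))
print(estimated_mass_kda(candidate_b), passes_size_filter(candidate_b))
```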


Summary

• Fast and accurate structure alignment remains a very hard problem.

• Machine learning techniques are widely used in protein secondary structure prediction.

• Homology modeling is probably the most reliable method for structure prediction.

• The protein folding problem has not yet been solved.


Prediction of Solvent Accessibility

• Solvent accessibility: the relative area of a residue’s surface that is exposed to the surrounding solvent.

• The solvent-accessible residues may be part of an active site or a binding site, while the buried residues may play an important role in stabilizing the protein structure.

• PHDacc (http://www.predictprotein.org/): a neural network-based method (similar to PHDsec).

• Jpred (http://www.compbio.dundee.ac.uk/~www-jpred/): a neural network system that predicts both secondary structure and solvent accessibility.


Predicting Transmembrane Segments

• Transmembrane segments share common biophysical features (e.g., hydrophobicity).

• PHDhtm (http://www.predictprotein.org/):

– Part of the PredictProtein services.

– Transmembrane helices are predicted using a neural network system.

• TMHMM (http://www.cbs.dtu.dk/services/TMHMM/):

– A set of known transmembrane segments is represented as HMMs.

– A query sequence is matched to a known transmembrane pattern.
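The hydrophobicity signal itself can be illustrated with a sliding-window hydropathy average. This is a much simpler technique than either PHDhtm or TMHMM and is shown only to make the feature concrete. It uses the Kyte-Doolittle hydropathy scale; the window of 19 residues and the cutoff of 1.6 are commonly quoted heuristics and should be read as assumptions, as should the function names and test sequences.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def window_hydropathy(sequence, window=19):
    """Average hydropathy of every window of the given length."""
    scores = []
    for i in range(len(sequence) - window + 1):
        segment = sequence[i:i + window]
        scores.append(sum(KYTE_DOOLITTLE[aa] for aa in segment) / window)
    return scores

def candidate_tm_starts(sequence, window=19, threshold=1.6):
    """Start positions (0-based) of windows at or above the heuristic cutoff."""
    return [i for i, score in enumerate(window_hydropathy(sequence, window))
            if score >= threshold]

# Hypothetical sequences: a hydrophobic stretch flanked by charged residues,
# and a purely hydrophilic control.
with_segment = "DDKKDD" + "L" * 19 + "DDKKDD"
hydrophilic = "DK" * 12 + "D"
print(candidate_tm_starts(with_segment))
print(candidate_tm_starts(hydrophilic))
```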


Signal Peptide Prediction

• Extracellular proteins or proteins targeted to subcellular compartments contain short signal peptides (often at the N-terminus).

• PSORT (http://psort.ims.u-tokyo.ac.jp/): A rule-based expert system for predicting subcellular localization of proteins from their amino acid sequences. The k-nearest neighbors algorithm is used for reasoning.

• SignalP (http://www.cbs.dtu.dk/services/SignalP/): predicts the presence and location of signal peptide cleavage sites using a combination of neural networks and HMMs.