Bioinformatics and Protein Structural Analysis
Surabhi Agarwal
The molecular structures of proteins are complex and can be defined at various levels. These structures can also be predicted from their amino-acid sequences. Protein structure prediction is one of the most widespread fields of research in bioinformatics.
Master Layout (Part 1)
This animation consists of 2 parts:
Part 1: Protein Structural Databases
Part 2: Uses of Structural Databases
Different types of data and the organization of data in a Structural Database
Search the Database for Protein Structures
Definitions of the components: Part 1 - Protein structural databases
1. Query Peptide: The unknown protein or peptide whose sequence is determined first and with which further analysis is performed. This protein sequence is compared with other known protein sequences in existing databases.
2. Protein sequence: The linear chain of amino acids that forms the structural unit of a protein is known as the protein sequence. This sequence is characteristic of each protein and is also known as the primary structure of the protein.
3. Sequence similarity: The process by which the amino acid sequences of two proteins are aligned linearly to evaluate their similarity.
4. 3-D structural alignment: The three-dimensional structural alignment is the process of superimposing two given protein structures. This can be achieved with suitable software by entering the protein identifiers or their atomic coordinates.
5. Geometry of Protein Structure: The geometry of a protein structure refers to the three-dimensional coordinates of its atoms and the angles between their bonds. These are essential to simulate the protein structure on computers.
6. Biology of Protein Structure: Information regarding the biological source of the protein and its metabolic roles within the cell and organism is referred to as the biology of protein structure.
7. SCOP classification: SCOP stands for “Structural Classification of Proteins” and aims to provide a detailed description of the structural and evolutionary relationships between all proteins that have been structurally characterized. SCOP classification is done at four levels: Class, Fold, Superfamily and Family.
8. CATH classification: CATH stands for “Class Architecture Topology and Homologous Superfamily” and provides a semi-automatic, hierarchical classification of protein domains. The levels of CATH classification are Class, Architecture, Topology and Homologous Superfamily.
Step 1: Protein Structure Database: Search
Protein Structural Database
Enter Protein ID or text query: Capsid
Optional Inputs
Structure Features: Macromolecule type, Number of Chains, Number of models, Molecular Weight, Secondary Structure Content, Secondary Structure Length, SCOP classification, CATH classification
Biology: Source Organism, Expression Organism, Enzyme Classification, Biological Process, Cellular component
Experiment: Experimental method, Resolution, Crystal Properties, Detectors used, Experimental Data Available
Sequence Features: Sequence, Translated Nucleotide Sequence, Sequence Length, Sequence Motif
Example inputs shown: Number of Chains: 10; Source Organism: Retro Transcribing Viruses; Experimental method: X-RAY CRYSTALLOGRAPHY; Sequence Length: < 500
Search
http://www.pdb.org/pdb/search/advSearch.do
Description of the action: Follow the steps as shown in the animations. First show the basic layout of the database. Then input the text “Capsid” in the text box at the top of the page. For each of the 4 categories, when the drop-down link is clicked, announce the options as the mouse hovers over them. The drop-down link in the animation should look like the ones on web pages. Re-create all images.
Audio Narration: The protein structural databases contain a basic search box that takes an identifier of the protein as input. This identifier can be the protein name, a keyword, an ID, an author name, etc. In this example, we take the case of viral capsid proteins. These databases also have advanced search features, which are optional but help make the query very specific. The general options fall into 4 broad classes: Structural Features, Biology, Sequence Data and Experimental Details.
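How the optional inputs narrow a text query can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Entry` record, its field names, the `search` function and both sample records are hypothetical and do not reflect the real PDB schema or search interface.

```python
from dataclasses import dataclass

@dataclass
class Entry:
    """Hypothetical, simplified database record (not the real PDB schema)."""
    title: str
    source_organism: str
    method: str
    sequence_length: int

def search(entries, text, source_organism=None, method=None, max_length=None):
    """Return entries whose title contains `text` and that pass every
    optional filter that was supplied (None means 'not specified')."""
    hits = []
    for e in entries:
        if text.lower() not in e.title.lower():
            continue
        if source_organism is not None and e.source_organism != source_organism:
            continue
        if method is not None and e.method != method:
            continue
        # Mirrors the "Sequence Length < 500" style of filter.
        if max_length is not None and e.sequence_length >= max_length:
            continue
        hits.append(e)
    return hits

# Made-up records, for illustration only.
entries = [
    Entry("HIV CAPSID C-TERMINAL DOMAIN", "Retro Transcribing Viruses",
          "X-RAY CRYSTALLOGRAPHY", 70),
    Entry("EXAMPLE CAPSID PROTEIN", "Example organism", "SOLUTION NMR", 620),
]

hits = search(entries, "Capsid", method="X-RAY CRYSTALLOGRAPHY", max_length=500)
for hit in hits:
    print(hit.title)
```

Leaving every optional argument at `None` reproduces the basic search box: only the text query is applied.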
Step 2.a: Protein Structure database: Output
Protein Structural Database
Number of Hits: 67
1. HIV CAPSID C-TERMINAL DOMAIN (CAC146)
2. X-RAY CRYSTAL STRUCTURE OF EQUINE INFECTIOUS ANEMIA VIRUS (EIAV) CAPSID PROTEIN P26
Description of the action: Follow the steps as shown in the animations. Re-create all images. Show the display of “67” in front of the tab titled “Number of Hits”. Then show the figure under the 2nd horizontal line. Show a clicking effect on the 1st point. This slide and the 8 that follow it are part of the same animated web page.
Audio Narration: The search for the query protein returned 67 structures in the database that match the criteria given by the user in the search options. The first page of the results shows the titles of all the hits. The user then selects the protein structure of interest to study in detail. Here we select the structure titled “HIV CAPSID C-TERMINAL DOMAIN (CAC146)” for further study.
Tabs: Summary, Sequence data, Sequence similarity, 3D similarity, Biology, Methods, Geometry, Derived data
1. 1AUM
2. Molecule: HIV CAPSID; Structure Weight: 7970.16; Type: polypeptide(L); Chains: A; Length: 70; Classification: Viral Protein
4. “Structure of the carboxyl-terminal dimerization domain of the HIV-1 capsid protein”, Science, 1997
Description of the action: Follow the steps as shown in the animations. Re-create all images. This slide and the 7 slides that follow it are part of the same web page. The mouse pointer should be shown clicking on each of the 8 tabs one by one, and the text below changes accordingly. Always highlight the active tab with a different color, as done on websites. As each of the four headings is narrated in the audio narration, that particular text must be highlighted in the animation.
Audio Narration: The summary page shows the general information pertaining to the basic features of the protein. This includes:
1. Protein identifier
2. Molecule name, structure weight, polymer type, number of chains, length of the molecule and its classification
3. Source organism and expression organism
4. Journal, paper and author name
Step 2.c - Protein Structure database: Output
Description of the action: Follow the steps as shown in the animations. Re-create all images. This is a follow-up slide to slide #8, as described there.
Audio Narration: The sequence data tab contains the information related to the amino acid sequence of the protein under consideration:
1. FASTA sequence for all chains in the polypeptide
2. Type of chain, such as polypeptide, glycopeptide, lipopeptide, etc.
3. Diagrammatic representation of the classification and secondary structure of the chain, assigning each residue to helix, sheet or turn
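Since the sequence data tab delivers the chains in FASTA format, a minimal sketch of reading such text may help. The function name `parse_fasta` and the example record are assumptions made for illustration; the sequence shown is a placeholder, not the real 1AUM sequence.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict mapping header -> sequence.

    A header line starts with '>'; the lines that follow, up to the next
    header, are joined into one sequence string.
    """
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {h: "".join(parts) for h, parts in records.items()}

# Placeholder record for illustration only (hypothetical header and residues).
example = """>EXAMPLE:A|hypothetical chain A
MKTAYIAKQR
QISFVKSHFS
"""

records = parse_fasta(example)
for header, sequence in records.items():
    print(header, len(sequence))
```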
Description of the action: Follow the steps as shown in the animations. Re-create all images. This is a follow-up slide to slide #8, as described there.
Audio Narration: The sequence similarity tab shows information related to comparative studies of sequences.
1. Option to perform a BLAST search.
2. A list of clusters of proteins is produced. These clusters are formed and ranked based on the resolution of the structures within them: the better the quality (resolution) of a cluster, the higher it is ranked. When the user clicks on a particular cluster, the component proteins within the cluster are displayed along with supporting information.
Description of the action: Follow the steps as shown in the animations. Re-create all images. This is a follow-up slide to slide #8, as described there.
Audio Narration: The structural similarity tab shows information related to comparative studies of two structures. It establishes equivalences based on the 3D conformations of both proteins. The default visualization tool for PDB is Jmol. Structural alignment is covered in more detail in the second part of this animation.
Description of the action: All tables have to be re-drawn by the animator. Follow the steps as shown in the animations. This is a follow-up slide to slide #8, as described there.
Audio Narration: The methods tab provides details of the methodology used in conducting the experiments. This includes:
1. Crystallization methods, pH, temperature, and other details of the experiment
2. Crystal data (space group, unit cell dimensions)
3. Diffraction source, diffraction protocol and diffraction detectors
4. Data related to resolution and refinement details
5. Software, programs and computing utilized
A brief summary of this result is shown in this animation. For details visit
Description of the action: All tables have to be re-drawn by the animator. Follow the steps as shown in the animations. This is a follow-up slide to slide #8, as described there.
Audio Narration: The Geometry tab contains the spatial information about the molecule, so that it can be simulated in a virtual environment. This includes:
Bond lengths: Number of occurrences and their positions in the chains
Bond angles: Number of occurrences and their positions in the chains
Dihedral angles: Number of occurrences and their positions in the chains
Ramachandran plot, Fold Deviation Scores and other structural details
http://www.pdb.org/pdb/explore/geometryDisplay.do?structureId=1AUM
On-screen captions:
Bond lengths: The position, total number and range of the covalent bond lengths between two adjacent atoms in a protein molecule
Bond angles: The angle formed by 3 consecutive atoms in the native conformation of a protein, and their statistics
Dihedral angles: The angle between the 2 planes defined by 4 consecutively bonded atoms, with their occurrence, positions and other statistics
Ramachandran map: Shows the residues that lie in the favored region (outlined in dark blue) and the permitted region (outlined in light blue)
67/68 residues lie in the favored region and none of the residues lie in the
Fold Deviation Score (FDS) values: For a specific reference value, FDS is a multiple of the standard deviation
Fold Deviation Score plot: The x-axis shows the residue positions and the y-axis shows the FDS values
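The dihedral angle described above can be computed directly from the coordinates of 4 consecutively bonded atoms. The following is a minimal sketch using only the standard library; the function name `dihedral_degrees` and the test coordinates are illustrative assumptions, and this is not presented as the method the database itself uses.

```python
import math

def _sub(p, q):
    return tuple(a - b for a, b in zip(p, q))

def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def _cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])

def dihedral_degrees(p0, p1, p2, p3):
    """Signed dihedral angle (degrees) for four consecutively bonded atoms.

    It is the angle between the plane through p0, p1, p2 and the plane
    through p1, p2, p3, measured about the central bond p1-p2.
    """
    b1 = _sub(p1, p0)
    b2 = _sub(p2, p1)
    b3 = _sub(p3, p2)
    n1 = _cross(b1, b2)  # normal of the first plane
    n2 = _cross(b2, b3)  # normal of the second plane
    x = _dot(n1, n2)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    return math.degrees(math.atan2(y, x))

# Illustrative coordinates: the central bond lies along the z-axis.
cis = dihedral_degrees((1.0, 0.0, 0.0), (0.0, 0.0, 0.0),
                       (0.0, 0.0, 1.0), (1.0, 0.0, 1.0))
trans = dihedral_degrees((1.0, 0.0, 0.0), (0.0, 0.0, 0.0),
                         (0.0, 0.0, 1.0), (-1.0, 0.0, 1.0))
print(round(cis, 1), round(abs(trans), 1))
```

The backbone angles plotted on a Ramachandran map (phi and psi) are dihedral angles of this kind, taken over specific backbone atoms of each residue.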
Description of the action: Follow the steps as shown in the animations. Re-create all images. This is a follow-up slide to slide #8, as described there.
Audio Narration: The biology tab contains information about the significance of the molecule at the biological and cellular level. This includes:
1. Molecule type
2. Formula weight
3. Monomers and linkages
4. Source method
5. Ligands and prosthetic groups
6. Gene detail and genome information
7. Keywords
Description: HIV CAPSID
Fragment: C-TERMINAL DOMAIN, RESIDUES 146 - 231
Nonstandard Linkage: no
Nonstandard Monomers: no
Polymer Type: polypeptide(L)
Description of the action: Follow the steps as shown in the animations. Re-create all images. This is a follow-up slide to slide #8, as described there.
Audio Narration: Data for the same protein from other resources, such as SCOP, CATH and Pfam classification details, are provided in the derived data tab. For more detailed analysis visit http://www.pdb.org/pdb/explore/derivedData.do?structureId=1AUM
This animation consists of 2 parts:
Part 1: Protein Structural Databases
Part 2: Uses of Structural Databases
Protein Structural Alignment
Secondary Structure Prediction
Functional Annotation
Definitions of the components: Part 2 - Uses of structural databases
1. Protein Structural Alignment: The geometry of two given protein structures can be compared by means of software tools that analyse their three-dimensional similarity to each other.
2. Protein Structure Prediction: The prospective secondary structures of peptides or proteins can be predicted from a given stretch of amino acid residues by using machine learning algorithms.
3. Machine Learning Algorithms: These are computer algorithms that are trained on a given classified dataset. During training, they adjust their parameters in such a way that they can then classify new data. The machine learning algorithms most widely used in bioinformatics include Artificial Neural Networks, Hidden Markov Models, Support Vector Machines, etc.
4. Functional Annotation: For novel proteins that are yet to be characterized, the potential functions can be predicted by techniques such as homology modelling, which provide an initial insight into the protein’s properties.
5. Gene Ontology: Gene Ontology terms, also known as GO terms, are identifiers that represent a gene’s functional properties, categorized into three domains: “cellular component”, “molecular function” and “biological process”.
6. Root Mean Square Deviation (RMSD): Quantification of the average distance between the atoms of the superimposed proteins. The higher the RMSD value, the lower the similarity.
7. Protein Structural Alignment Server: Web-based servers that help determine the structural similarity of two given proteins by superimposing them and calculating various comparative parameters. A large number of web-based servers are currently available for this task. A few examples are DALI (Distance Matrix Alignment), MAMMOTH (Matching Molecular Models Obtained from Theory), CE/CE-MC (Combinatorial Extension -- Monte Carlo), SSAP (Sequential Structure Alignment Program), ProFit (Protein least-squares Fitting), etc.
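The RMSD definition above translates into a short calculation. The sketch below assumes the two structures have already been superimposed and that their atoms are listed in matching order; it does not perform the superposition itself, and the function name `rmsd` and the coordinates are illustrative.

```python
import math

def rmsd(coords_a, coords_b):
    """Root mean square deviation between two equal-length lists of
    (x, y, z) coordinates whose atoms are already paired and superimposed."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))

# Illustrative coordinates: the second set is the first shifted by 1 unit in x.
structure_a = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0)]
structure_b = [(x + 1.0, y, z) for x, y, z in structure_a]

print(rmsd(structure_a, structure_a))
print(rmsd(structure_a, structure_b))
```

Identical structures give an RMSD of 0, and the value grows as the paired atoms move apart, which is why a lower RMSD indicates higher similarity.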
Step 1: Structure Alignment - Input
Protein Structural Alignment Server (DALI)
Enter the first PDB ID and Chain (or Upload a Protein Structure)
Enter the second PDB ID and Chain (or Upload a Protein Structure)
Description of the action: Follow the steps as shown in the animations. Re-create all images. Enter the 2 IDs in the text boxes, followed by a clicking effect on the “Submit” button. Show the action-in-progress effect as shown in the slide. Then show the two simple structures being superimposed and highlight the non-aligned areas. Follow this with the actual output in the next slide.
Audio Narration: Two given proteins can be structurally aligned to evaluate the similarity between them. The server requires two protein structures or their IDs as input, which are then aligned based on their 3D coordinates, bond angles and dihedral angles. A few of the servers available for this are DALI, MAMMOTH, CE/CE-MC, SSAP and ProFit.
Description of the action: Follow the steps as shown in the animations. Mention the definitions of the results in the audio narration as well as in written format. Re-create all images.
Audio Narration: The results are:
1. P-value: A probability measure of the significance of the structural similarity. A P-value < 0.05 indicates significant similarity.
2. Raw score: Used to compare this match with other similarity matches for the same proteins
3. RMSD: Measure of the average distance between the atoms of the superimposed proteins
4. Percentage sequence identity in the alignment
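Of the results listed above, the percentage sequence identity is the simplest to reproduce by hand. The sketch below counts identical residues over the aligned (gap-free) columns of two already aligned sequences. Servers differ in which denominator they use, so this choice, like the function name `percent_identity` and the example strings, is an assumption made for illustration.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percentage of identical residues over the columns where neither
    aligned sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if a != gap and b != gap]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)

# Illustrative alignment: 5 gap-free columns, 4 of them identical.
print(percent_identity("ACDE-G", "ACDFHG"))
```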
Description of the action: Follow the steps as shown in the animations. Re-create all images.
Audio Narration: Once the amino acid sequence of the protein is known, its secondary and tertiary structures can be predicted using prediction algorithms, which utilize information from previously structurally characterized sequences. In secondary structure prediction:
1. “h” represents Alpha Helix
2. “e” represents Beta Sheets
3. “c” represents Coils
Since not all known proteins have been structurally characterized yet, this provides a useful bioinformatics analysis tool for researchers. Servers for structure prediction include GOR, HNN, PredictProtein, NNPredict and SSpro.
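A prediction in the h/e/c notation is simply a string with one letter per residue, so its overall secondary structure content can be summarized with a few lines of Python. The function name `composition` and the example prediction string are made up for illustration; no particular server's output format is implied.

```python
from collections import Counter

# One-letter states as used in the narration above.
STATES = {"h": "alpha helix", "e": "beta sheet", "c": "coil"}

def composition(prediction):
    """Return the fraction of residues predicted in each state."""
    prediction = prediction.lower()
    if not prediction:
        raise ValueError("empty prediction string")
    unknown = set(prediction) - set(STATES)
    if unknown:
        raise ValueError(f"unexpected state letters: {sorted(unknown)}")
    counts = Counter(prediction)
    return {name: counts[letter] / len(prediction)
            for letter, name in STATES.items()}

# Made-up prediction for a 10-residue peptide.
result = composition("hhhheecccc")
for name, fraction in result.items():
    print(f"{name}: {fraction:.0%}")
```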
Input field: Enter the sequence of amino acids (primary structure of protein)
Description of the action: Follow the steps as shown in the animations. Re-create all images.
Audio Narration: Given a particular amino acid sequence, the cellular components, molecular functions and biological processes associated with the sequence can be predicted using functional annotation servers. These are represented by a unique set of identifiers called “Gene Ontology Terms” or “GO Terms”. A GO term can be a word or an alphanumeric identifier, and it includes a definition with cited sources and a namespace indicating the domain to which it belongs. Servers for this include DbAli Annolite, PFP, ProteomeAnalyst, GOPET, SpearMint and ProKnow.
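The structure of a GO term as described in the narration (identifier, name, namespace, definition) can be sketched as a small record type. The class `GOTerm` and its validation rules are illustrative assumptions rather than an official schema; the three example terms are the root terms of the three GO domains.

```python
import re
from dataclasses import dataclass

NAMESPACES = ("cellular_component", "molecular_function", "biological_process")
GO_ID_PATTERN = re.compile(r"^GO:\d{7}$")  # "GO:" followed by seven digits

@dataclass(frozen=True)
class GOTerm:
    """Illustrative record for a GO term (not an official schema)."""
    go_id: str
    name: str
    namespace: str
    definition: str = ""

    def __post_init__(self):
        if not GO_ID_PATTERN.match(self.go_id):
            raise ValueError(f"malformed GO identifier: {self.go_id!r}")
        if self.namespace not in NAMESPACES:
            raise ValueError(f"unknown namespace: {self.namespace!r}")

# Root terms of the three domains.
roots = [
    GOTerm("GO:0005575", "cellular_component", "cellular_component"),
    GOTerm("GO:0003674", "molecular_function", "molecular_function"),
    GOTerm("GO:0008150", "biological_process", "biological_process"),
]

for term in roots:
    print(term.go_id, term.namespace)
```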
Interactivity option 1: Predict the 3-dimensional structure of Human Serum Albumin and cross-validate
Interactivity type: Arrange the steps in the order to be performed. Remove the step number from the bottom of the tab.
Options (shown here in the correct order):
1. Input the term “human serum albumin” in a structural database
2. Click on the hit that matches your query
3. Go to the “sequence details” tab and retrieve the FASTA sequence of the protein
4. Go to the 3D structure details and save the actual coordinates and the 3D structure of the protein, derived from experimental details
5. Predict the tertiary structure from the amino-acid sequence and save the predicted structure coordinates
6. Select a structural alignment tool and superimpose the predicted structure on the actual structure derived from the database
7. Check the quality of the alignment. If the RMSD value is low, the structural alignment is good, which indicates that the structure prediction was accurate
Boundary/limits: Remove the step numbers shown on the tabs in yellow. Show all the steps in mixed order. The user must click on the tabs in order. If the user clicks a tab that is not next in the right order, flash a message saying “try again”.
Results: All the tabs must be arranged in the right order.
Interactivity option 2.a - True/False - Questions
Interactivity type: True or False
Options: Flash the questions one at a time. The user needs to press either the green tab marked “TRUE” or the red tab marked “FALSE”. If the answer is correct, flash a tick. If the answer is incorrect, flash a cross. For all questions whose answer is “False”, also show the correct statement.
Questions:
1. GO stands for “Genetic Oncology”
2. DALI is a server for Protein Structural Alignment
3. SCOP is a classification scheme for Nucleic Acids
4. p-value is one of the results from Structural Alignment
5. In protein secondary structure, “e” stands for coil
6. RMSD stands for “Root Mean Square Distance”
The questions are followed by their correct answers:
1. FALSE. GO stands for “Gene Ontology”
2. TRUE
3. FALSE. SCOP is a classification scheme for Proteins
4. TRUE
5. FALSE. In protein secondary structure, “e” stands for beta sheets
6. FALSE. RMSD stands for “Root Mean Square Deviation”
Interactivity option 2.c - True/False - Example
Interactivity type: True or False
Options: Flash the questions one at a time. The user needs to press either the green tab marked “TRUE” or the red tab marked “FALSE”. If the answer is correct, flash a tick. If the answer is incorrect, flash a cross and the correct answer.
This is an example slide to show the various cases of answers.
GO stands for “Genetic Oncology” (TRUE / FALSE). The correct answer is “False”. GO stands for “Gene Ontology”.
DALI is a server for Protein Structural Alignment
SCOP is a classification scheme for Nucleic Acids. SCOP is a classification scheme for Proteins.
Questionnaire
1. Which is the server for Protein Structure Prediction?
Answers: a) ProtParam b) PeptideMass c) nnPREDICT d) DALI
2. Which is the server for Functional annotation of Proteins?
Answers: a) DALI b) GOR c) SSAP d) ProteomeAnalyst
3. Which amongst these is NOT an output of Functional annotation?
Answers: a) GO Term b) Source Organism c) Probability of annotation d) Description of Function
4. By default, PDB structures appear in which visualization tool?
Answers: a) VMD b) NAMD c) Jmol d) None of the above
5. PDB is primarily which kind of database?
Answers: a) Protein b) Nucleotide c) Gene d) None of the above