Protein Sequencing Algorithms – A Survey
Muhammad Usman (Author)
School of Science and Technology
University of Management and Technology
Lahore, Pakistan

Abstract: Protein sequencing is used in many fields. In this technique, the sequence of amino acids in a protein is determined by using an algorithm. This requires a good understanding of the structures as well as the functions of proteins in a living organism. In this paper, different algorithms for protein sequencing are discussed. Ten main algorithms are presented with their formulas and steps and are illustrated with graphs. In the comparative analysis these algorithms are compared, and the paper concludes with the best available algorithm for protein sequencing.

Keywords: Adenine, Guanine, Thymine, Markov, Oligonucleotides, Nucleotides, RNA, DNA

I. INTRODUCTION
The word protein is derived from the Greek word “proteios”, which means primary. Proteins are one of the vital building blocks of a living organism. They are composed of chains of amino acids of various types; the 20 that are commonly used are referred to as the standard amino acids. Scientists have been studying their structure, function and use for over 200 years. Still, many questions in this domain remain unanswered, such as how a basic linear primary structure (a chain of amino acids) folds into a useful 3D assembly. At its core this is a biological problem, but it is rooted in multiple domains. In computer science it can be mapped to an NP-hard problem, and in both physics and geometry the same problem can be classified as a “self-avoiding walk”. Solving such problems requires integrating several domains, which makes them very interesting to address. Accurate prediction may benefit many fields in the coming era. Proteins are of various types, such as functional, structural and hormonal.
Proteins are composed of a unique pattern of amino acids (essential and non-essential). Essential amino acids are those that our body does not produce, so we obtain them from outside. DNA is the structural part of a gene and has a double-helix form. DNA is built from nucleotides, each consisting of one of four bases, one phosphate group and one sugar group. The four bases are adenine, guanine, thymine and cytosine. Three consecutive bases form a codon, and one amino acid is coded by one codon. For instance, the bases adenine, thymine and guanine in the order ATG code for the amino acid methionine. Two processes are followed to create a protein from DNA. The first, called transcription, produces messenger RNA (ribonucleic acid) from DNA. In the second, called translation, the messenger RNA is translated into a chain of amino acids, that is, a protein. The overall procedure is shown in Figure 1. There are various algorithms designed for translation and transcription. Here we present a comparison of transcription and translation algorithms for different types of proteins.

II. RELATED WORK
One of the common and traditional ways to predict the structure (folding and formation) of proteins is the GA (Genetic Algorithm). It has good computational power for predicting protein structure, but when it comes to multiple optima (multiple proteins), GA is not considered to be that efficient. Michael Scott Brown, Tommy Bennet and James A. Coker [3] came up with NGA (Niche Genetic Algorithm), which was later described as an extension of GA that can address problems related to multiple optima.
They also compared NGA with DSGA (Dynamic Radius Species Conserving Genetic Algorithm) and found promising results. Alexander S. Krylov and Renad I. Zhdanov [5] worked on binding proteins. They experimented with short-chain oligonucleotides and a microarray of hydrogel cells (a biochip). First, they studied how a protein can recognize the shortest single-stranded oligonucleotide, which they did by binding oligonucleotides of 2 to 12 bases. They tried different numbers of bases in this range and constructed a microarray that
Figure 1: Proteins Formation (labels: DNA, Messenger RNA, Amino Acids, Proteins; steps: Transcription, Translation)
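The DNA to messenger RNA to protein flow of Figure 1 can be illustrated with a minimal Python sketch. It assumes the input is the DNA coding strand (so transcription reduces to replacing T with U) and uses only a small, partial codon table; the names `CODON_TABLE`, `transcribe` and `translate` are hypothetical and do not come from any of the surveyed papers.

```python
# Minimal sketch of the DNA -> mRNA -> protein flow.
# Assumption: the input is the coding strand, so mRNA = DNA with T replaced by U.
# Only a few of the 64 codons are listed here for illustration.
CODON_TABLE = {
    "AUG": "Met",  # start codon, methionine
    "UUU": "Phe",
    "GGC": "Gly",
    "GCU": "Ala",
    "UGG": "Trp",
    "UAA": "Stop",
    "UAG": "Stop",
    "UGA": "Stop",
}


def transcribe(coding_strand):
    """Return the mRNA string for a DNA coding strand."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna):
    """Read the mRNA three bases at a time from the first AUG until a stop codon."""
    start = mrna.find("AUG")
    if start == -1:
        return []
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        # "Xaa" marks a codon that is missing from the partial table above.
        residue = CODON_TABLE.get(mrna[i:i + 3], "Xaa")
        if residue == "Stop":
            break
        protein.append(residue)
    return protein


dna = "ATGTTTGGCGCTTGGTAA"  # hypothetical example sequence
mrna = transcribe(dna)
peptide = translate(mrna)
print(mrna)
print("-".join(peptide))
```

A complete implementation would use the full 64-codon table and would also handle the template strand and alternative reading frames.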
mineral oil. Temperature and timing conditions were also
controlled by the authors for better results. A PCR gel
extraction kit was then used to purify PGLYRP1. After
amplification, the product was cloned into a pMD 19-T Simple
vector. To sequence these clones, the authors used the BigDye
Terminator v3.1 cycle sequencing ready reaction kit. The
sequences were then assembled using the DNASTAR software to
obtain the complete coding sequence of PGLYRP1.
After amplification, cloning and sequencing of the protein
gene, the data were analysed. The sequence obtained was first
checked using the Chromas 1.45 software, and any necessary
corrections were made before further processing. The authors
examined many parameters and many sites in the MegAlign
program for their analysis. They used the previously discussed
molecular evolutionary genetic analysis for their own study and
took different values for different ratios. In the results, the
authors used trees and tables to compare the values of the
ratios and then discussed these values in detail according to
the type of species.
IgNARs were discussed by Oleg V. Kovalenko, Andrea
Olland, Nicole Piché-Nicholas, Adarsh Godbole, Daniel King,
Kristine Svenson, Valerie Calabro, Mischa R. Müller, Caroline
J. Barelle, William Somers, Davinder S. Gill, Lidia Mosyak
and Lioudmila Tchistiakova [11]. They described this recognition
process in nine major steps. The first step was designing and
cloning the variants of humanized V-NAR. In this step, E06
variants were codon-optimized for expression in mammalian
cells and synthesized by GeneArt AG. Control elements such as
the murine CMV promoter were considered during cloning. The
second step was the expression and purification of V-NAR
proteins. The authors used COS-1 cells to express the fusion
protein V-NAR-hFc. Following the manufacturer's
recommendation, the cells were transfected using TransIT
reagent. Monomeric V-NARs were likewise expressed in COS-1
cells and purified by chromatography. Different chemicals were
used in the chromatography, such as sodium phosphate, NaCl
and imidazole. Protein concentration was then determined from
the OD at 280 nm. For cells grown in serum-free conditions,
FreeStyle293 expression was used. The third step was the
isolation of E06 proteins. In this step, E06 was applied to
Ni2+-NTA Superflow resin. The bound material was then washed
with PBS supplemented with imidazole. Dialyzing E06 against
PBS removed the excess imidazole and prepared it for the next
step. For the removal of oligomeric species, PBS containing
lipid-free HSA was used. Incubation was then carried out for
one hour, and Superdex 200 was applied to remove excess E06.
Finally, the remaining fractions were pooled and prepared for
crystallization. The fourth step was ELISA. Serum albumin
proteins were used in the binding experiments. Both direct and
indirect ELISA were performed. In the direct ELISA, V-NAR
binding was detected with Costar assay plates coated with PBS.
Fusion proteins such as V-NAR-hFc were diluted in assay buffer,
and for the sandwich ELISA an anti-hFc pAb coating on the
plates was used. The fifth step was
crystallization. In this step, the major consideration was
fixing the temperature. E06 crystals were obtained by keeping
the temperature at 18 degrees Celsius for hanging drops.
Different quantities of solutions were used with different
components, such as the protein complex, NaCl and sodium
acetate. At the end of this step, diamond-shaped crystals were
obtained overnight and continued growing for approximately one
week. The sixth step was data collection and processing. Data
were collected at APS beamline 22-ID on a MAR-300 detector.
The program Xia2 was used for scaling and integration of
intensities; another program, autoProc, was used for the same
purpose. The seventh step was phasing, model building and
refinement of E06. For this, PHASER was used for molecular
replacement of the E06 complex with HSA. The search model was
apo HSA (PDB ID: 1AO6). Finally, Phenix was used for
refinement. Different programs and models were used for
different types of proteins in this step. The eighth step was
the measurement of E06 kinetics. Kinetic constants of E06 were
collected by surface plasmon resonance (Biacore T100, GE Life
Sciences). The last step was assigning accession numbers.
Structure factors and coordinates were deposited with the
Worldwide Protein Data Bank under PDB ID: 4HGK (E06) and
PDB ID: 4HGM (huE06 v1.1).
IV. COMPARATIVE ANALYSIS
One of the techniques discussed in this paper was the Niche
Genetic Algorithm of Michael Scott Brown [3]. These algorithms
were better for protein recognition, but they reduce the
dimension: proteins are 3D, but this algorithm first converts
them into 2D and then processes them further. By doing so, the
search space is also reduced.
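The effect of working in 2D can be made concrete with a small sketch. It assumes, purely for illustration, the widely used 2D HP lattice model, in which each residue is hydrophobic (H) or polar (P) and a fold is a self-avoiding walk on a square grid; whether [3] uses exactly this model is not stated in this survey. On a square lattice each step has 4 possible directions, compared with 6 on a cubic lattice, which is one way the search space shrinks. The names `fold_coordinates` and `hp_energy` are hypothetical.

```python
# Sketch of scoring one candidate fold in a 2D HP lattice model (an assumption,
# not necessarily the exact model used in [3]).
MOVES = {"U": (0, 1), "D": (0, -1), "L": (-1, 0), "R": (1, 0)}


def fold_coordinates(moves):
    """Place residues on the grid; return None if the walk visits a site twice."""
    x, y = 0, 0
    coords = [(x, y)]
    for move in moves:
        dx, dy = MOVES[move]
        x, y = x + dx, y + dy
        if (x, y) in coords:
            return None  # not a self-avoiding walk
        coords.append((x, y))
    return coords


def hp_energy(sequence, moves):
    """Return -1 per pair of H residues that touch on the grid but are not chain neighbours."""
    if len(moves) != len(sequence) - 1:
        raise ValueError("a chain of n residues needs n - 1 moves")
    coords = fold_coordinates(moves)
    if coords is None:
        return None
    energy = 0
    for i in range(len(sequence)):
        for j in range(i + 2, len(sequence)):
            if sequence[i] == "H" and sequence[j] == "H":
                (xi, yi), (xj, yj) = coords[i], coords[j]
                if abs(xi - xj) + abs(yi - yj) == 1:
                    energy -= 1
    return energy


print(hp_energy("HPPH", "RUL"))  # folded into a square
print(hp_energy("HPPH", "RRR"))  # fully extended chain
```

A genetic algorithm would treat the move string as the chromosome and use the negative energy as the fitness; a niche variant would additionally keep several distinct low-energy folds in the population.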
Another technique was the use of Markov chains for sequencing
of proteins. The authors Małgorzata Grabinska and Paweł Błazej
[2] compared their work with the previously presented PMC
algorithm. Supervised learning was used to train on the data,
and the original data were then tested. The Gene Mark algorithm
was proposed by Paweł Mackiewicz [2]. They also used Markov
chains, but they treated every protein sequence as having three
unique Markov chains. They also compared their approach with
the PMC algorithm, and ROC curves were used to evaluate
efficiency. The true positive rate of these algorithms showed
little variation.
Figure 09: PMC & Three chained Algorithm Comparison
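The idea of describing a sequence with three Markov chains, one per codon position, can be sketched as follows. This is an illustrative first-order model with add-one smoothing, not the algorithm of [2]; the function names, the smoothing choice and the toy training sequences are all assumptions. Setting `period=1` gives an ordinary single-chain model for comparison.

```python
import math

BASES = "ACGT"


def train_periodic_chain(sequences, period=3):
    """Estimate first-order transition probabilities separately for each position modulo `period`."""
    # Start every count at 1.0 (add-one smoothing) so no transition has probability zero.
    counts = [{a: {b: 1.0 for b in BASES} for a in BASES} for _ in range(period)]
    for seq in sequences:
        for i in range(1, len(seq)):
            counts[i % period][seq[i - 1]][seq[i]] += 1
    probs = []
    for phase in counts:
        table = {}
        for prev, row in phase.items():
            total = sum(row.values())
            table[prev] = {nxt: c / total for nxt, c in row.items()}
        probs.append(table)
    return probs


def log_likelihood(seq, probs):
    """Sum of log transition probabilities of `seq` under the trained chains."""
    period = len(probs)
    return sum(
        math.log(probs[i % period][seq[i - 1]][seq[i]])
        for i in range(1, len(seq))
    )


# Hypothetical toy training set of coding sequences (A, C, G, T only).
coding = ["ATGGCTGCTGCTTAA", "ATGGCAGCTGCATAA"]
three_chains = train_periodic_chain(coding, period=3)
single_chain = train_periodic_chain(coding, period=1)

query = "ATGGCTGCATAA"
# A positive difference means the three-chain model explains the query better.
print(log_likelihood(query, three_chains) - log_likelihood(query, single_chain))
```

In a gene finder, such a score would be compared against a threshold, and sweeping that threshold is what produces a ROC curve like the one referred to above.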
Protein sequencing was discussed by Elena N. Kitova [8] using
direct ESI-MS measurements. Initially the algorithm detects
and quantifies free and ligand-bound proteins, and then the
authors used different formulas for linear and non-linear data.
Comparatively, most of the authors used Markov chains for
sequencing of proteins, because Markov chains can be used for
data of any dimension. The deficiency of this technique was
the different computational cost for linear and non-linear data.
The least expensive technique was the one used by Gaelle
LENGLET and Sabine DEPAUW [6]. Chromatography is widely used
as well as inexpensive. It also gives better results, but each
stage of the process used a different kind of technique.
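For the direct ESI-MS measurements mentioned above, the step of quantifying free and ligand-bound protein can be illustrated with a short sketch. It assumes a simple 1:1 binding model P + L = PL and that the ratio R of bound to free protein ion abundances equals the concentration ratio in solution, so that Ka = R / ([L]0 - R/(1 + R) [P]0). The function name and the numerical values are hypothetical and are not taken from [8].

```python
def association_constant(ab_free, ab_bound, p0, l0):
    """Estimate Ka (in 1/M) for 1:1 binding from ion abundances.

    ab_free, ab_bound: summed ion abundances of free and ligand-bound protein.
    p0, l0: initial protein and ligand concentrations in mol/L.
    Assumes the abundance ratio equals the solution concentration ratio.
    """
    r = ab_bound / ab_free
    bound_conc = r / (1.0 + r) * p0  # concentration of the complex PL
    free_ligand = l0 - bound_conc    # ligand that is not bound
    if free_ligand <= 0:
        raise ValueError("inconsistent inputs: no free ligand left")
    return r / free_ligand


# Hypothetical example: equal abundances, 10 uM protein, 15 uM ligand.
ka = association_constant(ab_free=1.0, ab_bound=1.0, p0=10e-6, l0=15e-6)
print(ka)
```

Real measurements need corrections (for example for nonspecific binding during the electrospray process) that this sketch leaves out.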
Another protein recognition technique was presented by Ilda
D’Annessa [4], who worked on the role of flexibility in
protein-DNA-drug recognition. The authors used specially
designed software to process the data. They ran many
experiments using the "LabVIEW virtual instrument interface"
and showed that the results are better compared to other
algorithms.
Glyceraldehyde-3-phosphate dehydrogenase (GAPDH) was studied by
Gaelle LENGLET and Sabine DEPAUW [6] for protein recognition.
They used chromatographic techniques, electrophoresis and MS
analysis for different types of data: chromatographic
techniques for linear data, and electrophoresis for proper and
chained data, which was then refined by MS analysis. They used
the EMSA (electrophoretic mobility shift assay) protocol to
process the extracted data.
V. CONCLUSIONS
The major purpose of this paper was the search and study of
protein sequencing and recognition, and the creative use of
this knowledge to develop a novel approach to forecast
protein-protein complexes. The foundation of this study is a
Niche Genetic Algorithm function that was derived from a
previously prepared dataset of the Genetic Algorithm. On the
basis of its results, it was used for computational scanning
to calculate changes in the binding of protein complexes.
Computed and tentative values showed good correlations and,
thus, a PMS algorithm was introduced to improve the predictive
power. Based on these findings, the PMS algorithm was
developed, which allows identifying scums in protein and
performing. The results have shown that the PMS algorithm is
state of the art not only with respect to predictive power but
also in terms of computational speed. Markov chains were also
productively appraised by re-scoring six different datasets
that include bound and unbound protein predictions.
Furthermore, the chained algorithm is useful if it is applied
as an objective function in combination with different Markov
chains to predict 3D structures of protein-protein complexes.
For this, model-based learned algorithms were used to test
protein sequencing. The direct ESI-MS measurements approach
showed average results for bound and restrained protein
complex predictions. Not many factors were found to influence
the success of the sequencing approach, such as the range of
probable conformational changes of a protein. Finally, a
large-scale validation study on peptidoglycan-recognition
protein was performed. The results thereby obtained allow
identifying those protein-protein interfaces that are best
suited for molecular docking approaches.
[Figure: ROC Curve (x-axis: 1 - specificity; y-axis: sensitivity)]
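As a reminder of how the two axes of a ROC curve are obtained, the sketch below computes (1 - specificity, sensitivity) pairs from classifier scores by sweeping a threshold, and then the area under the curve by the trapezoidal rule. The scores and labels are made up for illustration and do not come from the surveyed papers; `roc_points` and `auc` are hypothetical names.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs for each distinct threshold.

    labels: 1 for a true coding sequence, 0 otherwise.
    The false positive rate is 1 - specificity; the true positive rate is the sensitivity.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (x2 - x1) * (y1 + y2) / 2.0
        for (x1, y1), (x2, y2) in zip(points, points[1:])
    )


# Hypothetical scores: the two true positives are ranked above the two negatives.
scores = [0.9, 0.8, 0.7, 0.6]
labels = [1, 1, 0, 0]
curve = roc_points(scores, labels)
print(curve)
print(auc(curve))
```

An area close to 1 indicates that the classifier ranks true coding sequences above the rest, while 0.5 corresponds to random ranking.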
Figure 12: Linear Data Analysis (y-axis: Samples; labels: Proteins Formation; Linear data, Chromatographic Technique; trend line: y = 2554.x + 36508)
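A trend line of the form y = m x + b, like the one printed in Figure 12, is typically obtained by ordinary least squares. The following sketch shows that computation on made-up data points; it does not reproduce the data behind the figure, which are not given in the paper, and the function names are hypothetical.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(xs, ys, slope, intercept):
    """Coefficient of determination of the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Hypothetical measurements that lie exactly on y = 2x + 5.
xs = [0, 100, 200, 300]
ys = [5, 205, 405, 605]
slope, intercept = linear_fit(xs, ys)
print(slope, intercept, r_squared(xs, ys, slope, intercept))
```

An R-squared value close to 1 indicates that a straight line describes the data well; a clearly lower value suggests that a non-linear model may be more appropriate.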
Figure 13: Non Linear Data Analysis (y-axis: Samples; labels: Concentration; Non Linear Data, Electrophoresis Technique; trend line: y = -4.516x^2 + 4342x - 29016)
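A second-order trend line such as the one in Figure 13 can be fitted by solving the least-squares normal equations for y = a x^2 + b x + c. The sketch below does this in plain Python on made-up data; it is not the procedure used to produce the figure, and `solve` and `quadratic_fit` are hypothetical helper names.

```python
def solve(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    a = [row[:] + [b] for row, b in zip(matrix, rhs)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n + 1):
                a[r][c] -= factor * a[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (a[r][n] - sum(a[r][c] * x[c] for c in range(r + 1, n))) / a[r][r]
    return x


def quadratic_fit(xs, ys):
    """Least-squares coefficients (a, b, c) of y = a*x^2 + b*x + c."""
    s = [sum(x ** k for x in xs) for k in range(5)]               # sums of x^0 .. x^4
    t = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]  # sums of y*x^0 .. y*x^2
    matrix = [
        [s[4], s[3], s[2]],
        [s[3], s[2], s[1]],
        [s[2], s[1], s[0]],
    ]
    return tuple(solve(matrix, [t[2], t[1], t[0]]))


# Hypothetical points that lie exactly on y = -2x^2 + 3x + 1.
xs = [0, 1, 2, 3, 4]
ys = [1, 2, -1, -8, -19]
a, b, c = quadratic_fit(xs, ys)
print(a, b, c)
```

For x values as large as those on the figure's axis, centring or rescaling x before fitting would keep the normal equations better conditioned.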
REFERENCES
[1] Hoon Choi, Seungsoo Han, Donghyuk Shin, Sangho Lee. (2012), Polyubiquitin recognition by AtSAP5, an A20-type zinc finger containing protein from Arabidopsis thaliana.
[2] Małgorzata Grabinska, Paweł Błazej and Paweł Mackiewicz (Wrocław). (2013), Two Algorithms based on Markov Chains and their application to Recognition of Protein coding genes in Prokaryotic Genomes.
[3] Michael Scott Brown and James Coker. (2014), Niche Genetic Algorithms are better than traditional Genetic Algorithms for de novo Protein Folding.
[4] Ilda D’Annessa, Cinzia Tesauro, Paola Fiorani, Giovanni Chillemi, Silvia Castelli, Oscar Vassallo, Giovanni Capranico, and Alessandro Desideri. (2012), Role of Flexibility in Protein-DNA-Drug Recognition: The Case of Asp677Gly-Val703Ile Topoisomerase Mutant Hypersensitive to Camptothecin.
[5] Alexander S. Krylov and Renad I. Zhdanov. (2012), Nucleic acid – protein fingerprints. Novel protein classification based on nucleic acid – protein recognition.
[6] Gaëlle LENGLET, Sabine DEPAUW, Denise MENDY and Marie-Hélène DAVID-CORDONNIER. (2013), Protein recognition of the S23906-1–DNA adduct by nuclear proteins: direct involvement of glyceraldehyde-3 phosphate dehydrogenase (GAPDH).
[7] Alfred V. Aho. (2012), Algorithms for finding patterns in Strings.
[8] Elena N. Kitova, Amr El-Hawiet, Paul D. Schnier, John S. Klassen. (2012), Reliable Determinations of Protein–Ligand Interactions by Direct ESI-MS Measurements. Are We There Yet?
[9] Parwiz Abrahimi, William G. Chang, Martin S. Kluger, Yibing Qyang, George Tellides, W. Mark Saltzman, Jordan S. Pober. (2015), Efficient Gene Disruption in Cultured Primary Human Endothelial Cells by CRISPR/Cas9.
[10] W. Liu, Y.F. Yao, L. Zhou, Q.Y. Ni and H.L. Xu. (2013), Evolutionary analysis of the short-type peptidoglycan-recognition protein gene (PGLYRP1) in primates.
[11] Oleg V. Kovalenko, Andrea Olland, Nicole Piché-Nicholas, Adarsh Godbole, Daniel King, Kristine Svenson, Valerie Calabro, Mischa R. Müller, Caroline J. Barelle, William Somers, Davinder S. Gill, Lidia Mosyak and Lioudmila Tchistiakova. (2013), Atypical Antigen Recognition Mode of a Shark IgNAR Variable Domain Characterized by Humanization and Structural Analysis.
[12] Quentin R. Johnson, Richard J. Lindsay, Loukas Petridis and Tongye Shen. (2015), Investigation of Carbohydrate Recognition via Computer Simulation.
[13] Jiansheng Jiang, Bing-Rui Zhou, Rodolfo Ghirlando and Tsan Xiao. (2013), A conserved mechanism for centromeric nucleosome recognition by centromere protein CENP-C.
[14] Wei-Lun Hsu. (2013), Mechanisms of binding diversity in Protein Disorder: Molecular Recognition features mediating protein interaction Networks.
[15] Wells, J. A.; McClendon, C. L., Reaching for high-hanging fruit in drug discovery at protein-protein interfaces. Nature 2007, 450, (7172), 1001-9.
[16] Mulder, G. J., Ueber die Zusammensetzung einiger thierischen Substanzen. Journal für praktische Chemie 1839, 16, 129-151.
[17] Campbell, N. A., Biologie. Spektrum Akademischer Verlag: Heidelberg, Berlin, Oxford, 1997; p 80.
[18] Crick, F. H., The genetic code--yesterday, today, and tomorrow. Cold Spring Harb Symp Quant Biol 1966, 31, 1-9.
[19] Atkins, J. F.; Gesteland, R., Biochemistry. The 22nd amino acid. Science 2002, 296, (5572), 1409-10.
[20] Xu, X. M.; Carlson, B. A.; Mix, H.; Zhang, Y.; Saira, K.; Glass, R. S.; Berry, M. J.; Gladyshev, V. N.; Hatfield, D. L., Biosynthesis of selenocysteine on its tRNA in eukaryotes. PLoS Biol 2007, 5, (1), e4.