Bioinformatics
Protein and RNA structures
LU, 2008, Juris Vīksna
Today:
Proteins: what we mean by protein structure, and how structures are represented
Problems related to protein structures
RNA: what we mean by RNA structure
Problems related to RNA structures
Methods for comparing protein structures
Protein structure databases
Tools for comparing and visualising protein structures
Protein structure classifications
RNA structure prediction
Proteins
[Adapted from R.Shamir]
Protein sequence:
...VPEKNRPLKGRINLVLSRELKEPPQGAHFLSRSLDDALKLTEQPELANK...
Protein structure
[Adapted from R.Shamir]
We will be interested mostly in secondary and tertiary structure.
Protein structure determination - crystallography
[Adapted from G.Lee]
The basics:
Purify the protein and grow a crystal
Shoot an X-ray beam through the rotating crystal
Collect the diffraction data in one of many ways
Interpret the data
Protein structure determination - crystallography
[Adapted from G.Lee]
Problems:
Crystal setup takes... forever (almost)
Interpreting the data is no easy task, and every collection method produces a large mass of data
Expensive ($$$)
Protein structure determination - crystallography
[Adapted from G.Lee]
In the end, biologists want the best results possible, and X-ray crystallography provides this right now.
It gets the job done. No other method does the job better.
Protein structure determination - NMR
[Adapted from V.Arcus]
NMR - nuclear magnetic resonance
[Figure: NMR spectrometer, with labels: magnet, radio-frequency amplifiers, samples]
Protein structure determination - NMR
[Adapted from V.Arcus]
Protein NMR requires large amounts of very pure protein.
Extraction from the natural source:
a major disadvantage is the very low level of protein in tissues
for example, one might start with 10 l of blood and get 1 mg of protein!
this also requires a large number of purification steps
the main advantage is that post-translational modifications are maintained
NMR or crystallography?
[Adapted from V.Arcus]
Both techniques are used to determine protein structures
NMR uses protein in solution
X-ray crystallography uses protein crystals
Both techniques require large amounts of pure protein
Both techniques require expensive equipment!
Advantages of NMR
[Adapted from V.Arcus]
Protein in solution!
Can look at the dynamic properties of the protein structure
Can look at the interactions between the protein and ligands, substrates or other proteins
Can look at protein folding
Sample is not damaged in any way
No "phase problem"
Can "characterise" your protein using NMR
Disadvantages of NMR
[Adapted from V.Arcus]
Size limit! The maximum size of a protein for NMR structure determination is ~30 kDa. This eliminates ~50% of all proteins
High solubility is a requirement
Comparatively low resolution
Advantages of crystallography
[Adapted from V.Arcus]
No size limit, as long as you can crystallise it
Solubility requirement is less stringent
Simple definition of resolution
Direct calculation from data to electron density and back again
Disadvantages of crystallography
[Adapted from V.Arcus]
Crystallisation! This is a process bottleneck
Binary (all or nothing)
Phase problem
If the cell contains two electrons (each with the same scattering power) and their positional relationship is such that the distance between them is exactly one-half the distance between reflecting planes, then they will cancel out each other's contribution to diffraction.
Protein structure file
PDB file format
HEADER HYDROLASE 03-NOV-00 1G65
TITLE CRYSTAL STRUCTURE OF EPOXOMICIN:20S PROTEASOME REVEALS A
TITLE 2 MOLECULAR BASIS FOR SELECTIVITY OF ALPHA,BETA-EPOXYKETONE
TITLE 3 PROTEASOME INHIBITORS
COMPND MOL_ID: 1;
...
ATOM 115 CD PRO A 17 44.162 -73.549 30.303 1.00 34.52 C
ATOM 116 N SER A 18 47.730 -73.191 28.777 1.00 37.54 N
ATOM 117 CA SER A 18 49.119 -72.807 28.499 1.00 40.24 C
ATOM 118 C SER A 18 50.025 -74.009 28.289 1.00 42.29 C
ATOM 119 O SER A 18 51.252 -73.870 28.152 1.00 42.34 O
ATOM 120 CB SER A 18 49.661 -71.974 29.653 1.00 42.60 C
ATOM 121 OG SER A 18 49.219 -72.500 30.895 1.00 45.61 O
ATOM 122 N GLY A 19 49.411 -75.189 28.300 1.00 42.88 N
ATOM 123 CA GLY A 19 50.145 -76.427 28.117 1.00 44.73 C
ATOM 124 C GLY A 19 50.743 -76.999 29.391 1.00 43.86 C
ATOM 125 O GLY A 19 51.585 -77.900 29.352 1.00 45.98 O
ATOM 126 N LYS A 20 50.315 -76.498 30.532 1.00 42.31 N
Protein structure - atomic coordinates
[Adapted from M.Gerstein and I.Eidhammer, I.Jonassen]
Structure is described by the 3D coordinates (X, Y, Z) of all Cα atoms
Protein structure - folds
"Fold" representation of 7timA0
Hydrogen bonding patterns for four helices. Structures are represented in a diagrammatic way to simplify counting the atoms in each H-bonded loop.
2₇ ribbon, 3₁₀ helix, 3.6₁₃ helix (α-helix), π-helix
Protein structure - helices
[Adapted from S.Rafferty]
Protein structure - sheets
Composed of β strands
Adjacent strands may be parallel or antiparallel
Strands are flat: think of a β strand as a helix with two residues per turn
Parallel / antiparallel
[Adapted from S.Rafferty]
Protein folds - sandwich (β)
Protein folds - barrels (β)
Protein folds - horseshoe (α-β)
Protein folds - helix "bundles" (α)
Protein folds - interactions
Transcription factors - homeodomain proteins
Protein folds - some numbers
Protein structures - other representations
Different representations of the myoglobin molecule
Contact map (graph-based) representation of protein structure
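A contact map can be sketched in a few lines. This is only an illustration: the slides do not say which atoms or which distance cutoff are used, so the choice of Cα atoms and the 8 Å cutoff below are assumptions, and the coordinates are made up.

```python
import math

def contact_map(coords, cutoff=8.0):
    """Return an n x n 0/1 matrix; 1 means the two residues are within `cutoff` (Å).

    `cutoff` is an assumed threshold, not a value taken from the slides.
    """
    n = len(coords)
    cmap = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) <= cutoff:
                cmap[i][j] = cmap[j][i] = 1  # the map is symmetric
    return cmap

# Hypothetical Cα coordinates (Å) for four residues.
ca = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (20.0, 0.0, 0.0)]
for row in contact_map(ca):
    print(row)
```

The matrix can be read as the adjacency matrix of a graph whose vertices are residues, which is what makes the representation "graph-based".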
Problems related to protein structures
Determination (not really a bioinformatics problem)
Prediction (the protein folding problem; one of the Holy Grails of bioinformatics...)
Comparison (not quite trivial, but there are methods that work well enough in practice)
Representations
Surface modelling
Modelling/prediction of protein interactions
Visualisation
Protein folding
The folded state is a low-energy state under physiological conditions: H2O, pH ~ 7.0, NaCl
[Figure: Gibbs free energy G along the folding pathway, showing the unfolded (U), intermediate (I) and folded (F) states and the free energy difference ΔG(U–F)]
What influences protein folding:
• Hydrophobic forces (the "squeezing out" of water)
• Hydrogen bonds
• Electrostatic forces
• Disulfide bonds
• Chaperones
Protein folding
Chaperones
Chaperone proteins were first identified as "heat-shock proteins" (hsp60 and hsp70)
Hsp70 recognizes exposed, unfolded regions of new protein chains, especially hydrophobic regions
It binds to these regions, apparently protecting them until productive folding reactions can occur
This occurs while the chain is still being translated
CASP
http://predictioncenter.gc.ucdavis.edu/casp7/Casp7.html
CAFASP
http://www.cs.bgu.ac.il/~dfischer/CAFASP4/
Prions
Prion - proteinaceous infectious particle
PrPc - the normal version
Hypothetical structure of PrPsc
Prions
Spontaneously (rare): the normal fold is overwhelmingly the favored conformation
Inherited: a mutation in the PRNP gene destabilizes the normal conformation
Transmitted: ingestion of PrPsc from diet, surgical instruments, blood, or blood-derived products
Molecular surfaces
Key-and-lock principle:
RNA structure
RNA sequence:
...AGGCUAUGGCCA...
Single-stranded, but
A tends to pair with U
G tends to pair with C
RNA secondary structure
[Figure: a stem-loop (hairpin) with the 5' and 3' ends at the top, a stem of G-C, G-C, C-G, U-A and G-C base pairs, an unpaired A bulging out of the stem, and a loop of unpaired A residues]
[Adapted from C.Staben]
RNA secondary structure
[Adapted from K.Selesniemi]
Pseudo-knot
RNA tertiary structure
[Adapted from K.Selesniemi]
RNA structure determination - physical methods
[Adapted from P. De Rijk]
The experimental method giving the highest resolution is single-crystal X-ray diffraction. X-ray diffraction reveals secondary, tertiary and three-dimensional structures. Unfortunately, it is very difficult to obtain crystals of RNA molecules suitable for X-ray diffraction. The structures of tRNAs have been solved using this technique.
RNA structure determination - physical methods
[Adapted from P. De Rijk]
NMR can provide details about local conformation, and can be used to determine secondary, tertiary and, in theory, three-dimensional structures. The size of RNA molecules that can be studied using NMR is currently rather limited. Oligonucleotides used in NMR studies are designed to adopt structures found in larger RNA molecules.
RNA structure determination - physical methods
[Adapted from P. De Rijk]
Direct observation of partially denatured RNA molecules is possible using electron microscopy. However, the choice of denaturing conditions is crucial, and the resolution of electron microscopy is usually too limited to see fine details.
RNA structure determination - chemical methods
[Adapted from P. De Rijk]
RNA structure has been probed by testing the accessibility of nucleotides to chemical and enzymatic modification. The RNA molecules are exposed to chemical reagents or enzymes with a specific affinity for either single-stranded or double-stranded RNA. This method is only applicable to short RNAs because of the limited resolution of gel electrophoresis. For larger RNAs, reverse transcriptase is used to synthesize DNA complementary to the RNA, starting from a radioactively labeled primer. Modified residues cause the reverse transcriptase to stop, and separation of the synthesised DNAs by gel electrophoresis can then be used to determine the positions of modification.
RNA structure determination - mutation analysis
[Adapted from P. De Rijk]
RNA structure or protein-RNA interactions can also be studied by introducing specific mutations into the RNA sequence. The effect of the mutations can be assayed by measuring the ability of the mutated sequence to bind a protein which specifically recognizes the normal RNA, or by testing the change in some function. Caution is required when interpreting mutation analysis results: loss of protein binding or of other functions is not necessarily caused by a change in RNA secondary structure.
Problems related to RNA structures
Determination (not really a bioinformatics problem)
Prediction (comparatively easy, unlike for proteins, but for the secondary rather than the tertiary structure)
Comparison (the goals are somewhat different from those for proteins)
Interaction (we have probably not got that far yet)
Structure comparison
Translation
Rotation
Translation and rotation
[Figure: a structure with atoms at (x1, y1, z1), (x2, y2, z2), (x3, y3, z3) translated by d along the x axis to (x1 + d, y1, z1), (x2 + d, y2, z2), (x3 + d, y3, z3)]
[Adapted from T.Hanekamp]
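Translation and rotation are rigid motions: they move a structure without changing any interatomic distance, which is why structures may be freely translated and rotated before being compared. A minimal sketch, with made-up coordinates and with rotation restricted to the z axis for brevity (both are assumptions for illustration):

```python
import math

def translate(points, d):
    """Shift every point by the vector d = (dx, dy, dz)."""
    dx, dy, dz = d
    return [(x + dx, y + dy, z + dz) for x, y, z in points]

def rotate_z(points, angle):
    """Rotate every point by `angle` radians around the z axis."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in points]

def pairwise_distances(points):
    """All interatomic distances, in a fixed order."""
    return [math.dist(p, q) for i, p in enumerate(points) for q in points[i + 1:]]

# Hypothetical atom coordinates.
atoms = [(1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 3.0)]
moved = translate(rotate_z(atoms, math.pi / 2), (5.0, 0.0, 0.0))

# Interatomic distances are unchanged (up to floating-point rounding).
unchanged = all(
    math.isclose(a, b)
    for a, b in zip(pairwise_distances(atoms), pairwise_distances(moved))
)
print(unchanged)
```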
How to estimate comparison "quality"?
Root Mean Square Deviation (RMSD)
$$\mathrm{RMSD} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} d_i^2}$$
n = number of atoms
d_i = distance between the corresponding atoms i in the two structures
[Adapted from T.Hanekamp]
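The RMSD definition translates directly into code. The sketch below assumes the two structures are given as equally long lists of already paired and already superimposed atom coordinates; the example coordinates are made up.

```python
import math

def rmsd(a, b):
    """Root mean square deviation between paired atoms of two structures."""
    if len(a) != len(b) or not a:
        raise ValueError("structures must be non-empty and of equal length")
    # d_i is the distance between the i-th atoms of the two structures.
    total = sum(math.dist(p, q) ** 2 for p, q in zip(a, b))
    return math.sqrt(total / len(a))

# Hypothetical coordinates (Å): the first atoms coincide, the second are 2 Å apart.
s1 = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
s2 = [(0.0, 0.0, 0.0), (1.0, 2.0, 0.0)]
print(rmsd(s1, s2))  # sqrt((0 + 4) / 2) = sqrt(2)
```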
Structure comparison - RMSD
RMSD units => e.g. Ångströms
- identical structures => RMSD = 0
- similar structures => RMSD is small (1 – 3 Å)
- distant structures => RMSD > 3 Å
[Adapted from T.Hanekamp]
Coordinate RMSD
[Adapted from I.Eidhammer, I.Jonassen]
Distance RMSD
[Adapted from I.Eidhammer, I.Jonassen]
Experimentally it has been shown that these two measures are approximately linearly related:
$$\mathrm{RMSD}_D \approx 0.75\,\mathrm{RMSD}_C + 0.2$$
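The exact definition of distance RMSD on the slide did not survive extraction, so the sketch below assumes a common form: compare the intra-structure distance matrices of the two structures, RMSD_D = sqrt((1/n²) Σ_i Σ_j (d^A_ij − d^B_ij)²). Unlike coordinate RMSD, this needs no superposition, because internal distances do not change under translation and rotation.

```python
import math

def distance_rmsd(a, b):
    """Distance RMSD under the assumed definition: RMS difference of
    corresponding intra-structure distances d^A_ij and d^B_ij."""
    if len(a) != len(b) or not a:
        raise ValueError("structures must be non-empty and of equal length")
    n = len(a)
    total = 0.0
    for i in range(n):
        for j in range(n):
            total += (math.dist(a[i], a[j]) - math.dist(b[i], b[j])) ** 2
    return math.sqrt(total / (n * n))

# Hypothetical structure and a translated copy: distance RMSD is (numerically) zero.
struct = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 2.0, 0.0)]
shifted = [(x + 10.0, y - 4.0, z + 2.5) for x, y, z in struct]
print(distance_rmsd(struct, shifted))
```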
RMSD methods
[Adapted from I.Eidhammer, I.Jonassen]
RMSD - finding the optimal transformation
Given two sets of 3D points P = {p_i}, Q = {q_i}, i = 1, …, n, find a 3D rotation R_0 and translation T_0 such that
$$\min_{R,T} \sum_i |R p_i + T - q_i|^2 = \sum_i |R_0 p_i + T_0 - q_i|^2 .$$
It can be done in time O(n).
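To show why the optimal transformation can be found in time linear in n, here is a sketch of the 2D analogue of the problem, where the least-squares rotation has a closed form. This is a simplification for illustration, not the 3D procedure from the slides (in 3D the rotation is usually obtained from a 3×3 covariance matrix, e.g. with the Kabsch algorithm). After centring both point sets, the optimal angle is atan2(Σ(p_x q_y − p_y q_x), Σ(p_x q_x + p_y q_y)), and T_0 maps the rotated centroid of P onto the centroid of Q.

```python
import math

def superpose_2d(P, Q):
    """Least-squares rotation angle and translation mapping P onto Q (2D only)."""
    n = len(P)
    pcx, pcy = sum(x for x, _ in P) / n, sum(y for _, y in P) / n
    qcx, qcy = sum(x for x, _ in Q) / n, sum(y for _, y in Q) / n
    a = b = 0.0
    for (px, py), (qx, qy) in zip(P, Q):  # one pass over the pairs: O(n)
        px, py, qx, qy = px - pcx, py - pcy, qx - qcx, qy - qcy
        a += px * qx + py * qy   # coefficient of cos(theta)
        b += px * qy - py * qx   # coefficient of sin(theta)
    theta = math.atan2(b, a)
    c, s = math.cos(theta), math.sin(theta)
    # Translation that moves the rotated centroid of P onto the centroid of Q.
    t = (qcx - (c * pcx - s * pcy), qcy - (s * pcx + c * pcy))
    return theta, t

# Q is P rotated by 90 degrees and then shifted by (3, -1).
P = [(0.0, 0.0), (2.0, 0.0), (2.0, 1.0)]
Q = [(3.0, -1.0), (3.0, 1.0), (2.0, 1.0)]
theta, t = superpose_2d(P, Q)
print(math.degrees(theta), t)
```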
RMSD - finding structural similarity
So:
For k given atom pairs it is not hard to find the transformation that minimises RMSD
But:
The number of possible sets of atom pairs is exponential (in the protein "length" n and/or the number of pairs k)
Finding the optimal set of pairs is believed (?) to be an NP-complete problem...
In practice, the so-called double dynamic programming heuristic is commonly used.
RMSD - some further aspects
Sequence order dependent alignment
The order of the atom pairs included in the RMSD matches, in both structures, the order in the amino acid sequences
Sequence order independent alignment
The order of the atom pairs included in the RMSD is unrelated to the order of atoms in the amino acid sequences
It is not entirely clear which of the approaches is "better"
The most popular structure comparison programs probably do take the order of atoms in the amino acid sequences into account
RMSD - some further aspects
Until now we have assumed that protein structures are rigid.
In fact, structures can also be flexible: they can change slightly depending on "external conditions".
There are modifications of the algorithms that take structural flexibility into account. For example, we can first look for small, rigid, similar structure fragments, and then check whether they can be included in both structures in the same order.
RMSD - some further aspects
For sequences we started with pairwise comparison, and then claimed that it is often more interesting to compare more than two sequences simultaneously. What about structures?
In principle there are programs that compare more than two structures simultaneously (using, for example, something similar to the progressive heuristic), but the multiple alignment problem is less pressing for structures:
structural similarity among homologs is preserved much better than sequence similarity
the "evolution model" is different, and for distant structures a multiple alignment will usually not show well-conserved structure fragments
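RMSD - the basic DDP procedure
[Adapted from I.Eidhammer, I.Jonassen]
RMSD - we start with a similarity matrix
  A B C N Y R Q C L C R P M
A 1 . . . . . . . . . . . .
Y . . . . 1 . . . . . . . .
C . . 1 . . . . 1 . 1 . . .
Y . . . . 1 . . . . . . . .
N . . . 1 . . . . . . . . .
R . . . . . 1 . . . . 1 . .
C . . 1 . . . . 1 . 1 . . .
K . . . . . . . . . . . . .
C . . 1 . . . . 1 . 1 . . .
R . . . . . 1 . . . . 1 . .
B . 1 . . . . . . . . . . .
P . . . . . . . . . . . 1 .
[Adapted from M.Gerstein]
The initial matrix can be constructed from amino acid similarity, although other criteria are often used as well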
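The initial matrix and one dynamic-programming pass over it can be sketched as follows. The matrix uses plain residue identity (1 for equal letters, 0 otherwise), as in the table above. The pass is an ordinary single dynamic program with no gap penalty, i.e. a stand-in for illustration and not the double dynamic programming heuristic itself; the function names are hypothetical.

```python
def identity_matrix(rows, cols):
    """Similarity matrix with 1 where the two residues are the same letter."""
    return [[1 if r == c else 0 for c in cols] for r in rows]

def best_path_score(sim):
    """Best total similarity along a monotone (order-preserving) path.

    With a 0/1 matrix and no gap penalty this equals the length of the
    longest common subsequence of the two sequences.
    """
    n, m = len(sim), len(sim[0])
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sim[i - 1][j - 1],  # pair residue i with j
                score[i - 1][j],                          # leave residue i unpaired
                score[i][j - 1],                          # leave residue j unpaired
            )
    return score[n][m]

# The two sequences labelling the rows and columns of the table.
rows = "AYCYNRCKCRBP"
cols = "ABCNYRQCLCRPM"
sim = identity_matrix(rows, cols)
print(best_path_score(sim))
```

In a structure comparison the 0/1 entries would later be replaced by scores derived from the 3D coordinates, and the same pass repeated on the updated matrix.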
RMSD - similarity matrices
In structural alignment the similarity S(i, j) depends on the 3D coordinates of residues i and j
d = distance between i and j
$$M(i,j) = \frac{100}{5 + d^2}$$
$$d^2 = (x_i - x_j)^2 + (y_i - y_j)^2 + (z_i - z_j)^2$$
[Adapted from M.Gerstein]
The similarity of each atom pair is then recomputed: the smaller the distance after the RMSD-minimising transformation, the more "similar" the pair
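The distance-based similarity M(i, j) = 100 / (5 + d²) can be computed as below. The sketch assumes the two structures have already been superimposed and are given as lists of residue coordinates; the coordinates are made up.

```python
def similarity_matrix(a, b):
    """M[i][j] = 100 / (5 + d^2), where d is the distance between
    residue i of structure a and residue j of structure b."""
    matrix = []
    for ra in a:
        row = []
        for rb in b:
            d2 = sum((p - q) ** 2 for p, q in zip(ra, rb))  # squared distance
            row.append(100.0 / (5.0 + d2))
        matrix.append(row)
    return matrix

# Hypothetical coordinates: the first pair coincides (d^2 = 0), the second has d^2 = 5.
m = similarity_matrix([(0.0, 0.0, 0.0)], [(0.0, 0.0, 0.0), (1.0, 2.0, 0.0)])
print(m)  # [[20.0, 10.0]]
```

The score is largest (20) for coinciding residues and falls off quickly with distance, so only nearby pairs contribute noticeably to the next alignment pass.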
RMSD - similarity matrices
[Adapted from R.B.Altman]
RMSD - similarity matrices
[Adapted from I.Eidhammer, I.Jonassen]
Disadvantages of RMSD
all atoms are treated as equal (but residues on the surface usually have greater freedom of movement than residues inside the structure)
the best alignment does not necessarily mean the best RMSD
RMSD values depend on the size of the molecules
[Adapted from T.Hanekamp]
Alternatives to RMSD
aRMSD = best root-mean-square deviation calculated over all aligned alpha-carbon atoms
bRMSD = the RMSD over the highest scoring residue pairs
wRMSD = weighted RMSD
[Adapted from T.Hanekamp]
Example - comparison of 3znf and 4znf
Lys30
30 CA atoms: RMS = 0.70 Å
248 atoms: RMS = 1.42 Å
[Adapted from T.Hanekamp]
How easy is it to spot structural similarity?
Easy: globins, 125 res., ~1.5 Å
Tricky: Ig C & V, 85 res., ~3 Å
Very subtle: G3P-dehydrogenase, C-term. domain, >5 Å
[Adapted from M.Gerstein]
Structural similarity and computer vision
[Adapted from M.Shatsky]
A simple heuristic algorithm
For each pair of point triples (one from each molecule) that form "almost equal" triangles, find an affine transformation that maps one onto the other.
Count the point pairs that are "almost superimposed" by this transformation, and rank the hypotheses by this count.
For the best hypotheses, improve the transformation by minimising RMSD.
Complexity (assuming there are n points in each molecule): O(n^7), since there are O(n^3) triples in each molecule and each hypothesis takes O(n) to check.
[Adapted from M.Shatsky]
If n = 100, then n^7 = 10^14 :(
Reference point triples
[Figure: three points p1, p2, p3 defining a reference frame]
[Adapted from M.Shatsky]
Reference frame - a triple of orthogonal unit vectors originating from a single point
Every (non-degenerate) triple of 3D points can be unambiguously assigned such a reference frame
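One possible way to assign a reference frame to a point triple is sketched below. The slides do not fix a particular construction, so this convention is an assumption: the origin is p1, the first axis points from p1 to p2, the third axis is normal to the plane of the triangle, and the second axis completes a right-handed frame. A collinear (degenerate) triple has no well-defined normal and is rejected.

```python
import math

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def unit(v):
    length = math.sqrt(dot(v, v))
    return tuple(x / length for x in v)

def reference_frame(p1, p2, p3):
    """Origin and three orthogonal unit vectors defined by a point triple."""
    normal = cross(sub(p2, p1), sub(p3, p1))
    if math.sqrt(dot(normal, normal)) < 1e-9:
        raise ValueError("degenerate (collinear) point triple")
    e1 = unit(sub(p2, p1))
    e3 = unit(normal)
    e2 = cross(e3, e1)  # already unit length, since e3 and e1 are orthonormal
    return p1, (e1, e2, e3)

def to_local(point, frame):
    """Coordinates of `point` expressed in the given reference frame."""
    origin, axes = frame
    v = sub(point, origin)
    return tuple(dot(v, e) for e in axes)

frame = reference_frame((0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 3.0, 0.0))
print(to_local((1.0, 1.0, 1.0), frame))
```

Coordinates expressed in such a frame do not change when the whole molecule is translated or rotated, which is the property geometric hashing relies on.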
Geometric hashing - the idea
Choose a reference frame
Find the point coordinates in this reference frame
Use these coordinates as "hash" addresses and place the points in the hash table
Repeat this step for each reference frame.
[Adapted from M.Shatsky]
Geometric hashing - the idea
[Adapted from M.Shatsky]
We choose a universal reference frame, and for each triple we hash the transformation to its local reference frame (time O(n^4))
Geometric hashing - recognition
For the target protein:
Choose a reference frame
Find the coordinates of the other points in this reference frame
Use the coordinates to select points from the hash table
Find RMSD transformations for the best hypotheses
Repeat for each reference frame
Select the best alignments
O(n^4 + n^4 * BinSize) ~ O(n^5). If n = 100, then n^5 = 10^10
[Adapted from M.Shatsky]
Geometric hashing - 2D example
[Adapted from I.Eidhammer, I.Jonassen]
(a) (0,0) (6,2) (8,0) (9,4) (6,10) (3,8) (-1,6)
(b) (1,8) (2,2) (0,0) (4,-2) (10,0) (8,3) (8,7)
(c) (0,0) (3,-2) (8,0) (6,2) (10,4) (3,8) (0,6)
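The preprocessing and recognition steps can be sketched for the 2D case. Several choices below are assumptions made for illustration and are not taken from the slides: a reference frame is an ordered pair of points (origin at the first, x axis towards the second), only rotation and translation are considered (no scaling), the bin size is 0.5, and each hypothesis is scored by a simple vote count. The model is point set (a) from the example; the target is a copy of it rotated by 90 degrees and shifted.

```python
import math
from collections import defaultdict
from itertools import permutations

def local_coords(points, i, j):
    """Coordinates of all other points in the frame with origin points[i]
    and x axis pointing towards points[j]."""
    ox, oy = points[i]
    dx, dy = points[j][0] - ox, points[j][1] - oy
    length = math.sqrt(dx * dx + dy * dy)
    ux, uy = dx / length, dy / length
    coords = []
    for k, (x, y) in enumerate(points):
        if k in (i, j):
            continue
        vx, vy = x - ox, y - oy
        coords.append((vx * ux + vy * uy, ux * vy - uy * vx))
    return coords

def key(coord, bin_size):
    """Quantise local coordinates into a hash-table address."""
    return (round(coord[0] / bin_size), round(coord[1] / bin_size))

def build_table(models, bin_size=0.5):
    """Preprocessing: hash every point of every model in every reference frame."""
    table = defaultdict(list)
    for name, pts in models.items():
        for i, j in permutations(range(len(pts)), 2):
            for c in local_coords(pts, i, j):
                table[key(c, bin_size)].append((name, i, j))
    return table

def recognise(table, target, bin_size=0.5):
    """Recognition: return (votes, (model, i, j)) for the best-supported frame."""
    best = (0, None)
    for i, j in permutations(range(len(target)), 2):
        votes = defaultdict(int)
        for c in local_coords(target, i, j):
            for entry in table.get(key(c, bin_size), ()):
                votes[entry] += 1
        for entry, count in votes.items():
            if count > best[0]:
                best = (count, entry)
    return best

a = [(0, 0), (6, 2), (8, 0), (9, 4), (6, 10), (3, 8), (-1, 6)]
target = [(-y + 3, x - 2) for x, y in a]  # (a) rotated by 90 degrees, then shifted
votes, match = recognise(build_table({"a": a}), target)
print(votes, match)
```

In a full implementation the best-voted hypotheses would then be refined by an RMSD-minimising transformation, as listed in the recognition steps.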
Reference secondary structure elements
[Adapted from I.Eidhammer, I.Jonassen]
[Figure: two secondary structure elements with the midpoint distance and the line distance between them marked]
A base fingerprint is a 5D vector composed of:
SSE types of the two elements: helix, strand
Line distance
Midpoint distance
Angle
Geometric hashing - advantages
Independent of the sequences
Can be used for partially disconnected structures
Makes it possible to find interesting "patterns"
Comparatively fast
Can also be applied to the docking problem
Can be easily parallelized
[Adapted from M.Shatsky]
Protein structure databases - PDB
http://www.pdb.org
Fragment of a PDB file
ATOM 1575 C ASP E 211 -4.659 29.609 1.843 1.00 0.03 1ENT1729
ATOM 1576 O ASP E 211 -5.333 29.668 2.876 1.00 0.06 1ENT1730
ATOM 1577 CB ASP E 211 -6.058 31.009 0.311 1.00 0.12 1ENT1731
ATOM 1578 CG ASP E 211 -5.117 32.197 0.534 1.00 0.08 1ENT1732
ATOM 1579 OD1 ASP E 211 -4.841 32.534 1.691 1.00 0.30 1ENT1733
ATOM 1580 OD2 ASP E 211 -4.650 32.810 -0.429 1.00 0.62 1ENT1734
ATOM 1581 N GLY E 212 -3.346 29.481 1.866 1.00 0.02 1ENT1735
ATOM 1582 CA GLY E 212 -2.634 29.404 3.141 1.00 0.03 1ENT1736
ATOM 1583 C GLY E 212 -1.251 29.989 3.025 1.00 0.08 1ENT1737
ATOM 1584 O GLY E 212 -0.818 30.413 1.957 1.00 0.04 1ENT1738
ATOM 1585 N ILE E 213 -0.533 30.029 4.146 1.00 0.00 1ENT1739
ATOM 1586 CA ILE E 213 0.817 30.575 4.112 1.00 0.03 1ENT1740
ATOM 1587 C ILE E 213 1.843 29.530 4.545 1.00 0.04 1ENT1741
Format: 80 characters per line, with fixed positions for each attribute
Fields, left to right: atom number, atom, amino acid, chain, amino acid number, X, Y, Z, occupancy, temperature factor
Only 5 digits are available for the atom serial number, but some structures have already been received with more than 99,999 atoms...
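Because every attribute sits at a fixed position, an ATOM record is best read by column slices rather than by splitting on whitespace (adjacent fields may touch when values are wide). The sketch below uses the column layout of the standard PDB ATOM record: serial in columns 7-11, atom name 13-16, residue name 18-20, chain 22, residue number 23-26, X/Y/Z in 31-54, occupancy 55-60, temperature factor 61-66. The helper `format_atom` is hypothetical and exists only to rebuild a correctly aligned line, since the column alignment of the fragments above was lost.

```python
def format_atom(serial, name, res_name, chain, res_seq, x, y, z, occ, temp):
    """Build a column-aligned ATOM record (hypothetical helper for this example)."""
    return ("{:<6}{:>5} {:<4}{:1}{:>3} {:1}{:>4}{:1}   "
            "{:>8.3f}{:>8.3f}{:>8.3f}{:>6.2f}{:>6.2f}").format(
        "ATOM", serial, name, "", res_name, chain, res_seq, "",
        x, y, z, occ, temp)

def parse_atom(line):
    """Read the fields of an ATOM record from their fixed column positions."""
    return {
        "serial": int(line[6:11]),       # columns 7-11: only 5 digits available
        "name": line[12:16].strip(),     # columns 13-16
        "res_name": line[17:20].strip(), # columns 18-20
        "chain": line[21],               # column 22
        "res_seq": int(line[22:26]),     # columns 23-26
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "occupancy": float(line[54:60]),
        "temp_factor": float(line[60:66]),
    }

# Atom 117 of entry 1G65, re-aligned into fixed columns.
line = format_atom(117, " CA ", "SER", "A", 18, 49.119, -72.807, 28.499, 1.00, 40.24)
atom = parse_atom(line)
print(atom)
```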
Protein structure databases - MMDB
http://www.ncbi.nlm.nih.gov/Structure/MMDB/mmdb.shtml
Structure visualisation
1) Rasmol and Protein Explorer
http://www.umass.edu/microbio/rasmol/
2) Cn3D
http://www.ncbi.nlm.nih.gov/Structure/CN3D/cn3d.shtml
3) DeepView Swiss-PDB Viewer (quite powerful modeling program)
Also calculates various RMSDs
http://www.expasy.org/spdbv/
Structure comparison - DaliLite
http://www.ebi.ac.uk/DaliLite/
Structure comparison - SSAP
http://www.cathdb.info/cgi-bin/cath/SsapServer.pl
Structure comparison - VAST
http://www.ncbi.nlm.nih.gov/Structure/VAST/vastsearch.html
Structure comparison - CE and CL
http://www.ncbi.nlm.nih.gov/Structure/VAST/vastsearch.html
Protein structure classifications - SCOP
http://cl.sdsc.edu/cl.html
Protein structure classifications - CATH
http://www.cathdb.info
Protein structure classifications - CATH
CATH - hierarchical classification of protein domain structures
[C.Orengo, J.Thornton et al; UCL]
CATH number - 3.30.70.330
Class (C), Architecture (A), Topology (T), Homologous superfamily (H)
Class
1 - mainly alpha
2 - mainly beta
3 - alpha-beta
4 - low secondary structure content
Assigned automatically
Architecture
overall shape of the domain structure according to the orientations of secondary structures
Assigned manually
Topology
shape and connectivity of secondary structures
Assigned automatically by the SSAP algorithm
Homologous superfamily
proteins that share a common ancestor
Assigned automatically by sequence comparisons and SSAP
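A CATH number can be split into its four levels with a few lines of code. The sketch below only illustrates the hierarchy; the type and function names are hypothetical, and the class names are the four listed above.

```python
from typing import NamedTuple

class CathNumber(NamedTuple):
    cls: int           # C - class
    architecture: int  # A - architecture
    topology: int      # T - topology
    superfamily: int   # H - homologous superfamily

CLASS_NAMES = {
    1: "mainly alpha",
    2: "mainly beta",
    3: "alpha-beta",
    4: "low secondary structure content",
}

def parse_cath(text):
    """Parse a CATH number such as '3.30.70.330' (spaces after dots are tolerated)."""
    parts = [p.strip() for p in text.split(".")]
    if len(parts) != 4:
        raise ValueError("a CATH number has exactly four levels: C.A.T.H")
    return CathNumber(*(int(p) for p in parts))

cath = parse_cath("3. 30. 70. 330")
print(cath, "-", CLASS_NAMES[cath.cls])
```

Two domains share a level of the hierarchy exactly when their numbers agree on a prefix, e.g. the same architecture means equal `cls` and `architecture` fields.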
Protein structure classifications - DALI
http://www.ebi.ac.uk/dali/
http://ekhidna.biocenter.helsinki.fi/dali/start
RNA structure prediction?
RNA sequence:
...AGGCUAUGGCCA...
Fortunately, here we can do better...
RNA structure
[Adapted from R.B.Altman]
RNA structure - pseudoknots
[Adapted from R.B.Altman]
Energy minimisation
[Adapted from R.B.Altman]
RNA structure prediction - DP algorithm
[Adapted from R.B.Altman]
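The details of the DP algorithm on these slides were in figures that did not survive, so the sketch below uses the classic Nussinov recurrence as a stand-in. It maximises the number of base pairs rather than minimising free energy, and it cannot produce pseudoknots. Only A-U and G-C pairs are allowed, as stated earlier; the minimum hairpin loop length of 3 is an assumption.

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}

def nussinov(seq, min_loop=3):
    """Fill the table N, where N[i][j] is the maximum number of nested base
    pairs in seq[i..j]; hairpin loops must contain at least `min_loop` bases."""
    n = len(seq)
    N = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = N[i][j - 1]                       # case 1: j is unpaired
            for k in range(i, j - min_loop):         # case 2: j pairs with k
                if (seq[k], seq[j]) in PAIRS:
                    left = N[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + N[k + 1][j - 1])
            N[i][j] = best
    return N

def traceback(seq, N, min_loop=3):
    """Recover one optimal structure in dot-bracket notation."""
    structure = ["."] * len(seq)
    stack = [(0, len(seq) - 1)]
    while stack:
        i, j = stack.pop()
        if j - i <= min_loop:
            continue
        if N[i][j] == N[i][j - 1]:
            stack.append((i, j - 1))
            continue
        for k in range(i, j - min_loop):
            if (seq[k], seq[j]) in PAIRS:
                left = N[i][k - 1] if k > i else 0
                if left + 1 + N[k + 1][j - 1] == N[i][j]:
                    structure[k], structure[j] = "(", ")"
                    if k > i:
                        stack.append((i, k - 1))
                    stack.append((k + 1, j - 1))
                    break
    return "".join(structure)

hairpin = "GGGAAACCC"
table = nussinov(hairpin)
print(table[0][-1], traceback(hairpin, table))  # 3 (((...)))

# The fragment from the slides; the result depends on the assumed min_loop.
fragment = "AGGCUAUGGCCA"
print(traceback(fragment, nussinov(fragment)))
```

The table has O(n²) entries and each takes O(n) to fill, so the running time is O(n³), which is what makes secondary structure prediction tractable compared with protein folding.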