Bioinformatics: Protein and RNA Structures. LU, 2008, Juris Vīksna.


Today:

- Proteins: what we mean by protein structure, structure representations
- Problems related to protein structures
- RNA: what we mean by RNA structure
- Problems related to RNA structures
- Methods for comparing protein structures
- Protein structure databases
- Tools for protein structure comparison and visualization
- Protein structure classifications
- RNA structure prediction

Proteins

[Adapted from R.Shamir]

Protein sequence:

...VPEKNRPLKGRINLVLSRELKEPPQGAHFLSRSLDDALKLTEQPELANK...

Protein structure



We will be interested mostly in secondary and tertiary structure

Protein structure determination - crystallography

[Adapted from G.Lee]

The basics:

Purify the protein and grow a crystal

Shoot an X-ray through the rotating crystal

Collect Data in one of many ways

Interpret data



Problems:

Crystal setup takes... forever (almost)

Interpreting the data is no easy task

But all methods create this mass of data

Expensive ($$$)



In the end, biologists want the best results possible, and X-ray crystallography provides this right now

It gets the job done

No other method does the job better



Protein structure determination - NMR

[Adapted from V.Arcus]

NMR - Nuclear magnetic resonance

[Figure: NMR spectrometer; labels: magnet, radio frequency amplifiers, samples]




Protein NMR requires large amounts of very pure protein.


Extraction from the natural source

a major disadvantage here is the very low levels of protein in tissues

for example, one might start with 10 l of blood and get 1 mg of protein!

this also requires a large number of purification steps

the main advantage is the maintenance of post-translational modifications

NMR or crystallography?


Both are techniques for determining protein structures

NMR uses protein in solution

X-ray crystallography uses protein crystals

Both techniques require large amounts of pure protein

Both techniques require expensive equipment!

Advantages of NMR

- Protein in solution!
- Can look at the dynamic properties of the protein structure
- Can look at the interactions between the protein and ligands, substrates or other proteins
- Can look at protein folding
- Sample is not damaged in any way
- No "phase problem"
- Can "characterise" your protein using NMR

Disadvantages of NMR

- Size limit! The maximum size of a protein for NMR structure determination is ~30 kDa. This eliminates ~50% of all proteins
- High solubility is a requirement
- Comparatively low resolution

Advantages of crystallography

- No size limit, as long as you can crystallise it
- Solubility requirement is less stringent
- Simple definition of resolution
- Direct calculation from data to electron density and back again

Disadvantages of crystallography

- Crystallisation! This is a process bottleneck
- Binary (all or nothing)
- Phase problem

If the cell contains two electrons (each with the same scattering power) and their positional relationship is such that the distance between them is exactly one-half the distance between reflecting planes, then they will cancel out each other's contribution to diffraction.

Protein structure file

HEADER    HYDROLASE                               03-NOV-00   1G65
TITLE     CRYSTAL STRUCTURE OF EPOXOMICIN:20S PROTEASOME REVEALS A
TITLE    2 MOLECULAR BASIS FOR SELECTIVITY OF ALPHA,BETA-EPOXYKETONE
TITLE    3 PROTEASOME INHIBITORS
COMPND    MOL_ID: 1;

.............................................................................

.............................................................................

ATOM    115  CD  PRO A  17      44.162 -73.549  30.303  1.00 34.52           C
ATOM    116  N   SER A  18      47.730 -73.191  28.777  1.00 37.54           N
ATOM    117  CA  SER A  18      49.119 -72.807  28.499  1.00 40.24           C
ATOM    118  C   SER A  18      50.025 -74.009  28.289  1.00 42.29           C
ATOM    119  O   SER A  18      51.252 -73.870  28.152  1.00 42.34           O
ATOM    120  CB  SER A  18      49.661 -71.974  29.653  1.00 42.60           C
ATOM    121  OG  SER A  18      49.219 -72.500  30.895  1.00 45.61           O
ATOM    122  N   GLY A  19      49.411 -75.189  28.300  1.00 42.88           N
ATOM    123  CA  GLY A  19      50.145 -76.427  28.117  1.00 44.73           C
ATOM    124  C   GLY A  19      50.743 -76.999  29.391  1.00 43.86           C
ATOM    125  O   GLY A  19      51.585 -77.900  29.352  1.00 45.98           O
ATOM    126  N   LYS A  20      50.315 -76.498  30.532  1.00 42.31           N

PDB file format

Protein structure - atomic coordinates

[Adapted from M.Gerstein and I.Eidhammer, I.Jonassen]

Structure is described by 3D coordinates (X,Y,Z) of all Cα atoms

Protein structure - folds

"Fold" representation of 7timA0

Protein structure - helices

Hydrogen bonding patterns for four helices. Structures are represented in a diagrammatic way to simplify counting the atoms in each H-bonded loop.

[Figure labels: 2_7 ribbon, 3_10 helix, 3.6_13 helix (α-helix), π-helix]

[Adapted from S.Rafferty]

Protein structure - sheets

- Composed of strands
- Adjacent strands may be parallel or antiparallel
- Strands are flat: think of a beta sheet as a helix with two residues per turn

[Figure labels: parallel, antiparallel]

[Adapted from S.Rafferty]

Protein folds - sandwich

Protein folds - barrels

Protein folds - horseshoe

Protein folds - helix "bundles"

Protein folds - interactions

Transcription factors - homeodomain proteins

Protein folds - some numbers

Protein structures - other representations

Different representations of myoglobin molecule

Contact map (graph-based) representation of protein structure
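A contact map can be sketched in a few lines once the Cα coordinates are known. The sketch below is only an illustration: the 8 Å cutoff, the function names and the toy coordinates are assumptions, not values taken from the slides.

```python
import math


def contact_map(coords, cutoff=8.0):
    """Return an n x n 0/1 matrix: 1 if two residues are within `cutoff`.

    `coords` is a list of (x, y, z) tuples, one per residue (e.g. C-alpha
    atoms). The cutoff of 8.0 is an assumed, commonly used value.
    """
    n = len(coords)
    return [
        [1 if i != j and math.dist(coords[i], coords[j]) <= cutoff else 0
         for j in range(n)]
        for i in range(n)
    ]


def contact_edges(cmap):
    """Graph view of the same map: list of (i, j) pairs with i < j."""
    n = len(cmap)
    return [(i, j) for i in range(n) for j in range(i + 1, n) if cmap[i][j]]


# Toy chain (hypothetical coordinates, not from a real structure).
ca_atoms = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (20.0, 0.0, 0.0)]
cmap = contact_map(ca_atoms)
edges = contact_edges(cmap)
```

The matrix form is convenient for plotting, while the edge list is the graph-based form mentioned on the slide.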

Problems related to protein structures

- Determination (not quite a bioinformatics problem)
- Prediction (the protein folding problem; one of the Holy Grails of bioinformatics...)
- Comparison (not exactly trivial, but there are methods that work well enough in practice)
- Representations
- Surface modelling
- Modelling/prediction of protein interactions
- Visualization

Protein folding

The folded state is a low energy state under physiological conditions: H2O, pH ~ 7.0, NaCl

[Figure: Gibbs free energy G of the unfolded (U), intermediate (I) and folded (F) states; ΔG(U–F) is the free energy difference between U and F]

What influences protein folding:

• Hydrophobic forces ("squeezing out" of water)
• Hydrogen bonds
• Electrostatic forces
• Disulfide bonds
• Chaperones

Chaperones

Chaperone proteins were first identified as "heat-shock proteins" (hsp60 and hsp70)

Hsp70 recognizes exposed, unfolded regions of new protein chains - especially hydrophobic regions

It binds to these regions, apparently protecting them until productive folding reactions can occur

Occurs while the chain is still being translated

CASP

http://predictioncenter.gc.ucdavis.edu/casp7/Casp7.html

CAFASP

http://www.cs.bgu.ac.il/~dfischer/CAFASP4/

Prions

Prion - proteinaceous infectious particle

PrPc - the normal version

Hypothetical structure of PrPsc

Prions

Spontaneously (rare): the normal fold is overwhelmingly the favored conformation

Inherited: a mutation in the PRNP gene destabilizes the normal conformation

Transmitted: ingestion of PrPsc from diet, surgical instruments, blood, or blood-derived products

Molecular surfaces

Lock-and-key principle:

RNA structure

RNA sequence:

...AGGCUAUGGCCA...

Single-stranded, but

A tends to pair with U

G tends to pair with C

RNA secondary structure

[Figure: stem-loop diagram, 5' to 3', with base pairs G--C, G--C, C--G, U--A, G--C and unpaired A residues]

[Adapted from C.Staben]

RNA secondary structure

[Adapted from K.Selesniemi]

Pseudo-knot

RNA tertiary structure

[Adapted from K.Selesniemi]

RNA structure determination - physical methods

[Adapted from P. De Rijk]

The experimental method giving the highest resolution is single crystal X-ray diffraction. X-ray diffraction reveals secondary, tertiary and three-dimensional structures. Unfortunately, it is very difficult to obtain crystals of RNA molecules suitable for X-ray diffraction. The structures of tRNAs have been solved using this technique.



NMR can provide details about local conformation, and can be used to determine secondary, tertiary and, in theory, three-dimensional structures. The size of RNA molecules that can be studied using NMR is currently rather limited. Oligonucleotides used in NMR studies are designed to adopt structures found in larger RNA molecules.



Direct observation of partially denatured RNA molecules is possible using electron microscopy. However, the choice of denaturing conditions is crucial, and the resolution of electron microscopy is usually too limited to see fine details.

RNA structure determination - chemical methods


RNA structure has been probed by testing the accessibility of nucleotides to chemical and enzymatic modification. The RNA molecules are exposed to chemical reagents or enzymes with a specific affinity for either single-stranded or double-stranded RNA. This method is only applicable to short RNAs because of the limited resolution of gel electrophoresis. For larger RNAs, reverse transcriptase is used to synthesize DNA complementary to the RNA, starting from a radioactively labeled primer. Modified residues cause the reverse transcriptase to stop, and separation of the synthesised DNAs by gel electrophoresis can then be used to determine the positions of modification.

RNA structure determination - mutation analysis


RNA structure or protein-RNA interactions can also be studied by introducing specific mutations into the RNA sequence. The effect of the mutations can be assayed by measuring the ability of the mutated sequence to bind a protein which specifically recognizes the normal RNA, or by testing the change in some function. Caution is required when interpreting mutation analysis results. Loss of protein binding or other functions is not necessarily caused by a change in RNA secondary structure.

Problems related to RNA structures

- Determination (not quite a bioinformatics problem)
- Prediction (unlike for proteins, comparatively easy, but for secondary rather than tertiary structure)
- Comparison (the goals are somewhat different than for proteins)
- Interaction (this far, possibly, we have not got yet)

Structure comparison

Translation

Rotation

Translation and rotation

Example of a translation by d along the x axis:

(x1, y1, z1) -> (x1 + d, y1, z1)
(x2, y2, z2) -> (x2 + d, y2, z2)
(x3, y3, z3) -> (x3 + d, y3, z3)

[Adapted from T.Hanekamp]

How to estimate comparison "quality"?

Structure comparison - RMSD

Root Mean Square Deviation (RMSD)

$$\mathrm{RMSD} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} d_i^2}$$

n = number of atoms

d_i = distance between the corresponding atoms in the two structures

RMSD units => e.g. Ångstroms

- identical structures => RMSD = “0”

- similar structures => RMSD is small (1 – 3 Å)

- distant structures => RMSD > 3 Å
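As a minimal sketch of the RMSD definition above, assuming the two structures are already superimposed and their atoms are listed in corresponding order (the function name and the toy coordinates are illustrative only):

```python
import math


def rmsd(structure_a, structure_b):
    """Root mean square deviation between corresponding atoms.

    Both arguments are equal-length lists of (x, y, z) tuples; atom i of
    one structure is assumed to correspond to atom i of the other.
    """
    if len(structure_a) != len(structure_b) or not structure_a:
        raise ValueError("structures must be non-empty and of equal length")
    squared = sum(
        sum((p - q) ** 2 for p, q in zip(atom_a, atom_b))  # d_i squared
        for atom_a, atom_b in zip(structure_a, structure_b)
    )
    return math.sqrt(squared / len(structure_a))


# Toy data: the second structure is the first one shifted by 3 along x.
original = [(0.0, 0.0, 0.0), (1.0, 2.0, 0.0), (4.0, 1.0, 2.0)]
shifted = [(x + 3.0, y, z) for x, y, z in original]
same = rmsd(original, original)
moved = rmsd(original, shifted)
```

Note that the shifted copy has a non-zero RMSD although its shape is identical, which is why a superposition step is needed before the value becomes meaningful.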


RMSD methods

Coordinate RMSD

[Adapted from I.Eidhammer, I.Jonassen]

Distance RMSD


Experimentally it has been shown that these two measures are linearly related:

$$\mathrm{RMSD}_D \approx 0.75\,\mathrm{RMSD}_C + 0.2$$
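The formula for distance RMSD did not survive in the slide text. The sketch below assumes one common definition: compare all intra-structure atom-atom distances and average over the n(n-1)/2 atom pairs. The normalisation, function name and toy data are assumptions.

```python
import math
from itertools import combinations


def distance_rmsd(structure_a, structure_b):
    """Distance RMSD: compares internal distance matrices, so no
    superposition is needed (assumed definition, see lead-in)."""
    if len(structure_a) != len(structure_b) or len(structure_a) < 2:
        raise ValueError("need two structures with the same number (>= 2) of atoms")
    pairs = list(combinations(range(len(structure_a)), 2))
    total = sum(
        (math.dist(structure_a[i], structure_a[j])
         - math.dist(structure_b[i], structure_b[j])) ** 2
        for i, j in pairs
    )
    return math.sqrt(total / len(pairs))


# Toy data: a structure and its mirror image (x negated).
shape = [(0.0, 0.0, 0.0), (1.0, 2.0, 0.0), (4.0, 1.0, 2.0), (2.0, 5.0, 1.0)]
mirror = [(-x, y, z) for x, y, z in shape]
d_mirror = distance_rmsd(shape, mirror)
```

Because only internal distances are used, this measure is unchanged by rotation and translation, but it also cannot distinguish a structure from its mirror image.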


RMSD - finding the optimal transformation

Given two 3D sets of points P = {p_i}, Q = {q_i}, i = 1,…,n, find a 3D rotation R_0 and translation T_0 such that

$$\min_{R,T} \sum_i |R p_i + T - q_i|^2 = \sum_i |R_0 p_i + T_0 - q_i|^2 .$$

It can be done in time O(n).
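The 3D solution normally relies on a small matrix decomposition. As a simpler illustration of the same least-squares idea, the sketch below solves the 2D analogue, where the optimal rotation angle has a closed form. The restriction to 2D, the function names and the test data are assumptions made for this example.

```python
import math


def superpose_2d(P, Q):
    """Least-squares rotation angle and translation mapping P onto Q (2D).

    P and Q are equal-length lists of (x, y) with P[i] paired to Q[i].
    Runs in O(n): one pass for centroids, one for the two sums.
    """
    n = len(P)
    cpx, cpy = sum(x for x, _ in P) / n, sum(y for _, y in P) / n
    cqx, cqy = sum(x for x, _ in Q) / n, sum(y for _, y in Q) / n
    a = b = 0.0
    for (px, py), (qx, qy) in zip(P, Q):
        px, py, qx, qy = px - cpx, py - cpy, qx - cqx, qy - cqy
        a += px * qx + py * qy  # coefficient of cos(theta)
        b += px * qy - py * qx  # coefficient of sin(theta)
    theta = math.atan2(b, a)
    c, s = math.cos(theta), math.sin(theta)
    # Translation moves the rotated centroid of P onto the centroid of Q.
    t = (cqx - (c * cpx - s * cpy), cqy - (s * cpx + c * cpy))
    return theta, t


def transform(points, theta, t):
    """Apply rotation by theta followed by translation t."""
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y + t[0], s * x + c * y + t[1]) for x, y in points]


P = [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (1.0, 5.0)]
Q = transform(P, 0.5, (2.0, -1.0))  # known rotation and shift
theta, t = superpose_2d(P, Q)
residual = max(math.dist(p, q) for p, q in zip(transform(P, theta, t), Q))
```

In 3D the centring step is identical; only the rotation step needs a more involved computation.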

RMSD - finding structural similarity

So:

Given k atom pairs, it is not hard to find the transformation that minimizes RMSD

But:

The number of possible sets of atom pairs is exponential (in the protein "length" n and/or the number of pairs k)

Finding the optimal set of pairs is considered (?) to be an NP-complete problem...

In practice, the so-called double dynamic programming heuristic is commonly used.

RMSD - some further aspects

Sequence order dependent alignment

The order of the atom pairs included in the RMSD corresponds, in both structures, to the order in the amino acid sequences

Sequence order independent alignment

The order of the atom pairs included in the RMSD is not related to the order of the atoms in the amino acid sequences

It is not unambiguously clear which of the approaches is "better". The most popular structure comparison programs probably do take into account the order of atoms in the amino acid sequences


So far we have assumed that protein structures are rigid.

In principle, structures can also be flexible: they may change slightly depending on "external conditions".

There are modifications of the algorithms that take structural flexibility into account. For example, we can first look for small, rigid, similar structure fragments, and then check whether we can include them in both structures in the same order.


For sequences we started with pairwise comparison, and then claimed that it is often more interesting to compare more than two sequences simultaneously. What about structures?

In principle there are programs that compare more than two structures simultaneously (using, e.g., something similar to the progressive heuristic), but the multiple alignment problem is less relevant for structures:

- structural similarity between homologs is preserved much better than sequence similarity

- there is a different "evolutionary model", and for distant structures a multiple alignment will usually not reveal well-conserved structural fragments

RMSD - the basic DDP procedure


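RMSD - we start with a similarity matrix

    A B C N Y R Q C L C R P M
A   1 . . . . . . . . . . . .
Y   . . . . 1 . . . . . . . .
C   . . 1 . . . . 1 . 1 . . .
Y   . . . . 1 . . . . . . . .
N   . . . 1 . . . . . . . . .
R   . . . . . 1 . . . . 1 . .
C   . . 1 . . . . 1 . 1 . . .
K   . . . . . . . . . . . . .
C   . . 1 . . . . 1 . 1 . . .
R   . . . . . 1 . . . . 1 . .
B   . 1 . . . . . . . . . . .
P   . . . . . . . . . . . 1 .

[Adapted from M.Gerstein]

The initial matrix can be constructed based on amino acid similarity, although other criteria are often used as well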
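The matrix on this slide has a 1 wherever the row and column letters are identical. A minimal sketch of that construction, using the two letter strings from the slide (the function name is illustrative, and identity scoring is only the simplest possible similarity criterion):

```python
def identity_matrix(row_seq, col_seq):
    """1 where the two residues are the same letter, 0 elsewhere."""
    return [[1 if r == c else 0 for c in col_seq] for r in row_seq]


rows = "AYCYNRCKCRBP"   # row labels of the slide's matrix
cols = "ABCNYRQCLCRPM"  # column labels of the slide's matrix
matrix = identity_matrix(rows, cols)

# Print in the same layout as the slide, with "." for empty cells.
print("    " + " ".join(cols))
for label, row in zip(rows, matrix):
    print(label + "   " + " ".join("1" if v else "." for v in row))
```

Replacing the equality test with a lookup in a substitution matrix would give the amino acid similarity variant mentioned above.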

RMSD - similarity matrices

Structural alignment: the similarity S(i,j) depends on the 3D coordinates of residues i and j

Distance between i and j:

$$d = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2 + (z_i - z_j)^2}$$

$$M(i,j) = \frac{100}{5 + d^2}$$

Afterwards the similarity of each atom pair is recomputed: the smaller the distance after the RMSD-minimizing transformation, the more "similar" the pair
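A hedged sketch of the scoring formula M(i,j) = 100 / (5 + d^2), assuming both structures are already expressed in a common coordinate frame (the function name and coordinates are illustrative only):

```python
def structural_similarity(structure_a, structure_b):
    """Matrix M with M[i][j] = 100 / (5 + d^2), where d is the distance
    between residue i of structure_a and residue j of structure_b."""
    return [
        [100.0 / (5.0 + sum((p - q) ** 2 for p, q in zip(res_a, res_b)))
         for res_b in structure_b]
        for res_a in structure_a
    ]


# Toy coordinates: one residue against two candidates.
a = [(0.0, 0.0, 0.0)]
b = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)]  # distances 0 and 5
M = structural_similarity(a, b)
```

The score is at most 20 (when d = 0) and decays quickly with distance, so only nearby residue pairs contribute noticeably to the alignment score.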


[Adapted from R.B.Altman]



Disadvantages of RMSD

all atoms are treated as equal

(but residues on the surface usually have greater freedom of movement than residues inside the structure)

the best alignment does not necessarily mean the best RMSD

RMSD performance depends on the size of the molecules


RMSD alternatives

aRMSD = best root-mean-square deviation calculated over all aligned alpha-carbon atoms

bRMSD = the RMSD over the highest scoring residue pairs

wRMSD = weighted RMSD


Example - comparison of 3znf and 4znf

[Figure label: Lys30]

30 CA atoms: RMS = 0.70 Å

248 atoms: RMS = 1.42 Å


How easy is it to notice structural similarity?

- Easy: globins, 125 res., ~1.5 Å
- Tricky: Ig C & V, 85 res., ~3 Å
- Very subtle: G3P-dehydrogenase, C-term. domain, >5 Å


Structural similarity and computer vision

[Adapted from M.Shatsky]

A simple heuristic algorithm

- For each pair of point triples (one from each molecule) that form "almost equal" triangles, find an affine transformation that maps one onto the other.
- Count the point pairs that are "almost superimposed" by this transformation, and rank the results by this count.
- For the best hypotheses, improve the transformation by using RMSD.

Complexity (assuming there are n points in each molecule): O(n^7).


If n = 100, then n^7 = 10^14 :(

Reference point triples

[Figure labels: p1, p2, p3]


Reference frame - a triple of orthogonal unit vectors emanating from a single point

Each (non-degenerate) triple of 3D points can be unambiguously assigned such a reference frame
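One possible way to assign a frame to a point triple is sketched below. The specific convention (origin at p1, first axis towards p2, third axis normal to the triangle) is an assumption, since the figure did not survive; all names are illustrative.

```python
import math


def _sub(u, v):
    return (u[0] - v[0], u[1] - v[1], u[2] - v[2])


def _dot(u, v):
    return u[0] * v[0] + u[1] * v[1] + u[2] * v[2]


def _cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])


def _unit(u):
    length = math.sqrt(_dot(u, u))
    if length < 1e-9:
        raise ValueError("degenerate triple: coincident or collinear points")
    return (u[0] / length, u[1] / length, u[2] / length)


def reference_frame(p1, p2, p3):
    """Origin p1 and three orthogonal unit vectors derived from the triple."""
    e1 = _unit(_sub(p2, p1))               # along p1 -> p2
    e3 = _unit(_cross(e1, _sub(p3, p1)))   # normal to the triangle plane
    e2 = _cross(e3, e1)                    # completes a right-handed frame
    return p1, (e1, e2, e3)


def local_coords(point, origin, axes):
    """Coordinates of `point` expressed in the given reference frame."""
    d = _sub(point, origin)
    return tuple(_dot(d, e) for e in axes)


origin, axes = reference_frame((0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (1.0, 1.0, 0.0))
inside = local_coords((1.0, 2.0, 3.0), origin, axes)
```

Coordinates computed this way do not change when the whole molecule is rotated or translated, which is what makes them usable as hash keys.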

Geometric hashing - the idea

- Choose a reference frame
- Find the point coordinates in this reference frame
- Use these coordinates as "hash" addresses and place these points in a hash table
- Repeat this step for each reference frame

We choose a universal reference frame, and for each triple we hash the transformation to the local reference frame (time O(n^4))

Geometric hashing - recognition

For the target protein:

- Choose a reference frame
- Find the coordinates of the other points in this reference frame
- Use the coordinates to select points from the hash table
- Find RMSD transformations for the best hypotheses
- Repeat for each reference frame
- Select the best alignments

O(n^4 + n^4 * BinSize) ~ O(n^5). If n = 100, then n^5 = 10^10


Geometric hashing - 2D example

[Adapted from I.Eidhammer, I.Jonassen]

(a) (0,0) (6,2) (8,0) (9,4) (6,10) (3,8) (-1,6)

(b) (1,8) (2,2) (0,0) (4,-2) (10,0) (8,3) (8,7)

(c) (0,0) (3,-2) (8,0) (6,2) (10,4) (3,8) (0,6)
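The preprocessing and recognition steps can be sketched in 2D, where an ordered pair of points is enough to define a reference frame. The sketch uses point sets (a) and (c) from the slide; the rigid (non-scaling) frame, the bin size of 1.0, the voting scheme and all function names are assumptions of this example, not details recovered from the original figure.

```python
import math
from collections import defaultdict


def frame_coords(points, a, b):
    """Coordinates of all other points in the frame with origin points[a]
    and x axis pointing from points[a] to points[b]."""
    ox, oy = points[a]
    dx, dy = points[b][0] - ox, points[b][1] - oy
    length = math.hypot(dx, dy)
    ux, uy = dx / length, dy / length
    coords = []
    for k, (x, y) in enumerate(points):
        if k not in (a, b):
            rx, ry = x - ox, y - oy
            coords.append((rx * ux + ry * uy, -rx * uy + ry * ux))
    return coords


def hash_key(coord, bin_size):
    """Quantise a coordinate pair into a hash-table address."""
    return (round(coord[0] / bin_size), round(coord[1] / bin_size))


def build_table(model, bin_size=1.0):
    """Preprocessing: store every model point under every model frame."""
    table = defaultdict(set)
    for a in range(len(model)):
        for b in range(len(model)):
            if a != b:
                for c in frame_coords(model, a, b):
                    table[hash_key(c, bin_size)].add((a, b))
    return table


def recognise(table, target, bin_size=1.0):
    """Recognition: return (votes, model frame, target frame) of the best hit."""
    best = (0, None, None)
    for a in range(len(target)):
        for b in range(len(target)):
            if a == b:
                continue
            votes = defaultdict(int)
            for c in frame_coords(target, a, b):
                for model_frame in table.get(hash_key(c, bin_size), ()):
                    votes[model_frame] += 1
            for model_frame, count in votes.items():
                if count > best[0]:
                    best = (count, model_frame, (a, b))
    return best


set_a = [(0, 0), (6, 2), (8, 0), (9, 4), (6, 10), (3, 8), (-1, 6)]
set_c = [(0, 0), (3, -2), (8, 0), (6, 2), (10, 4), (3, 8), (0, 6)]

# A rotated and shifted copy of (a) should be recognised completely.
cos_t, sin_t = math.cos(0.7), math.sin(0.7)
copy_a = [(cos_t * x - sin_t * y + 5.0, sin_t * x + cos_t * y - 3.0) for x, y in set_a]

table = build_table(set_a)
votes_copy, model_frame, target_frame = recognise(table, copy_a)
votes_c = recognise(table, set_c)[0]
```

Each vote count is the number of non-frame points that land in a matching bin; in a full pipeline the top-scoring frame pairs would then be refined with an RMSD-minimizing transformation, as listed in the recognition steps.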



Reference secondary structure elements

[Figure labels: midpoint distance, line distance]

A base fingerprint is a 5D vector composed of:

- SSE types: helix, strand
- Line distance
- Midpoint distance
- Angle

Geometric hashing - advantages

- Independence from sequences
- Can be used for partially disconnected structures
- Allows finding interesting "patterns"
- Comparatively fast
- Can also be applied to the docking problem
- Can be easily parallelized


Protein structure databases - PDB

http://www.pdb.org

Fragment of a PDB file

ATOM   1575  C   ASP E 211      -4.659  29.609   1.843  1.00  0.03      1ENT1729
ATOM   1576  O   ASP E 211      -5.333  29.668   2.876  1.00  0.06      1ENT1730
ATOM   1577  CB  ASP E 211      -6.058  31.009   0.311  1.00  0.12      1ENT1731
ATOM   1578  CG  ASP E 211      -5.117  32.197   0.534  1.00  0.08      1ENT1732
ATOM   1579  OD1 ASP E 211      -4.841  32.534   1.691  1.00  0.30      1ENT1733
ATOM   1580  OD2 ASP E 211      -4.650  32.810  -0.429  1.00  0.62      1ENT1734
ATOM   1581  N   GLY E 212      -3.346  29.481   1.866  1.00  0.02      1ENT1735
ATOM   1582  CA  GLY E 212      -2.634  29.404   3.141  1.00  0.03      1ENT1736
ATOM   1583  C   GLY E 212      -1.251  29.989   3.025  1.00  0.08      1ENT1737
ATOM   1584  O   GLY E 212      -0.818  30.413   1.957  1.00  0.04      1ENT1738
ATOM   1585  N   ILE E 213      -0.533  30.029   4.146  1.00  0.00      1ENT1739
ATOM   1586  CA  ILE E 213       0.817  30.575   4.112  1.00  0.03      1ENT1740
ATOM   1587  C   ILE E 213       1.843  29.530   4.545  1.00  0.04      1ENT1741

Format: 80 characters per line, with fixed positions for each attribute

Fields, left to right: atom number, atom, amino acid, chain, amino acid number, X, Y, Z, occupancy, temperature factor

Only 5 digits are available for the atom serial number, but some structures have already been received with more than 99,999 atoms...
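Because every attribute sits at a fixed position, an ATOM record is best read by column slices rather than by splitting on whitespace. The sketch below uses the standard PDB column ranges for ATOM records; the helper names and the dictionary layout are illustrative choices, and the sample values are taken from atom 117 in the file excerpt earlier in these slides.

```python
def parse_atom_record(line):
    """Split one fixed-column ATOM record into named fields.

    Slices are 0-based Python equivalents of the 1-based PDB columns,
    e.g. the serial number occupies columns 7-11, i.e. line[6:11].
    """
    return {
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "occupancy": float(line[54:60]),
        "temp_factor": float(line[60:66]),
    }


def format_atom_record(serial, name, res_name, chain, res_seq, x, y, z, occ, temp):
    """Write the same fields back into their fixed columns (first 66 only)."""
    return (f"ATOM  {serial:5d} {name:<4} {res_name:>3} {chain}{res_seq:4d}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}{occ:6.2f}{temp:6.2f}")


record = format_atom_record(117, " CA ", "SER", "A", 18,
                            49.119, -72.807, 28.499, 1.00, 40.24)
atom = parse_atom_record(record)
```

The 5-character serial field is exactly the limitation mentioned above: `{serial:5d}` would overflow its columns for serial numbers above 99,999.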

Protein structure databases - MMDB

http://www.ncbi.nlm.nih.gov/Structure/MMDB/mmdb.shtml

Structure visualization

1) Rasmol and Protein Explorer

http://www.umass.edu/microbio/rasmol/

2) Cn3D

http://www.ncbi.nlm.nih.gov/Structure/CN3D/cn3d.shtml

3) DeepView Swiss-PDB Viewer (quite powerful modeling program)

Also calculates various RMSDs

http://www.expasy.org/spdbv/

Structure comparison - DaliLite

http://www.ebi.ac.uk/DaliLite/

Structure comparison - SSAP

http://www.cathdb.info/cgi-bin/cath/SsapServer.pl

Structure comparison - VAST

http://www.ncbi.nlm.nih.gov/Structure/VAST/vastsearch.html

Structure comparison - CE and CL

http://www.ncbi.nlm.nih.gov/Structure/VAST/vastsearch.html

Protein structure classifications - SCOP

http://cl.sdsc.edu/cl.html

Protein structure classifications - CATH

http://www.cathdb.info


CATH - hierarchical classification of protein domain structures

[C.Orengo, J.Thornton et al; UCL]

CATH number - 3.30.70.330

Class (C), Architecture (A), Topology (T), Homologous superfamily (H)

Class

- 1 - mainly alpha
- 2 - mainly beta
- 3 - alpha-beta
- 4 - low secondary structure content

Assigned automatically

Architecture

overall shape of the domain structure according to orientations of secondary structures

Assigned manually

Topology

shape and connectivity of secondary structures

Assigned automatically by SSAP algorithm

Homologous superfamily

proteins that share a common ancestor

Assigned automatically by sequence comparisons and SSAP
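As a small illustration of how the four levels nest, the sketch below splits a CATH number such as 3.30.70.330 into its C, A, T and H parts. The function name and the returned dictionary layout are hypothetical; only the class names come from the slide.

```python
CATH_CLASSES = {
    1: "mainly alpha",
    2: "mainly beta",
    3: "alpha-beta",
    4: "low secondary structure content",
}


def parse_cath_number(cath_number):
    """Split 'C.A.T.H' into the identifier of each hierarchy level."""
    c, a, t, h = (int(part) for part in cath_number.split("."))
    return {
        "class": str(c),
        "class_name": CATH_CLASSES.get(c, "unknown"),
        "architecture": f"{c}.{a}",
        "topology": f"{c}.{a}.{t}",
        "homologous_superfamily": f"{c}.{a}.{t}.{h}",
    }


levels = parse_cath_number("3.30.70.330")
```

Each level's identifier includes all levels above it, so two domains share an architecture exactly when their `architecture` strings are equal.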

Protein structure classifications - DALI

http://www.ebi.ac.uk/dali/

http://ekhidna.biocenter.helsinki.fi/dali/start

RNA structure prediction?

RNA sequence:

...AGGCUAUGGCCA...

Fortunately here we can do better...

RNA structure

RNA structure - pseudoknots

Energy minimization


RNA structure prediction - DP algorithm
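The algorithm itself did not survive in the slide text. As a hedged illustration of what a dynamic programming approach to RNA secondary structure can look like, the sketch below maximises the number of A-U and G-C pairs (the classical Nussinov-style recurrence) rather than minimising energy. The choice of this particular recurrence, the minimum loop length of 3 and the function name are all assumptions.

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}


def max_base_pairs(seq, min_loop=3):
    """Maximum number of nested base pairs, with at least `min_loop`
    unpaired bases between the two partners of any pair."""
    n = len(seq)
    if n == 0:
        return 0
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = max(best[i + 1][j], best[i][j - 1])  # i or j unpaired
            if (seq[i], seq[j]) in PAIRS:                 # i pairs with j
                score = max(score, best[i + 1][j - 1] + 1)
            for k in range(i + 1, j):                     # two independent parts
                score = max(score, best[i][k] + best[k + 1][j])
            best[i][j] = score
    return best[0][n - 1]


hairpin = max_base_pairs("GGGAAACCC")
slide_fragment = max_base_pairs("AGGCUAUGGCCA")  # fragment shown on the slides
```

The recurrence only considers nested pairs, so pseudoknots are excluded by construction; the table has O(n^2) cells and each is filled in O(n) time.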

