Structural Analysis and Modelling of Human Spliceosomal Proteins

Iga Korneta

Doctoral dissertation carried out at the Laboratory of Bioinformatics and Protein Engineering, International Institute of Molecular and Cell Biology in Warsaw

Supervisor: Prof. Janusz M. Bujnicki

Warsaw, 2012

Table of contents

Summary of the dissertation
Introduction
Research project
Structural analysis of the ordered regions of human spliceosomal proteins and creation of a library of structural models
Structural analysis of the disordered regions of human spliceosomal proteins
Comparison of the proteins and structural regions of the human and G. lamblia spliceosomal proteins
Publication of the data
Summary of the project results
Bibliography
Publications
"Structural Bioinformatics of the Human Spliceosomal Proteome"
"Intrinsic Disorder in the Human Spliceosomal Proteome"
Declarations
Information from the supervisor
The doctoral candidate's contribution to the publications
Declaration, Janusz M. Bujnicki
Declaration, Marcin Magnus
Declaration, Iga Korneta
Summary in English

Summary of the dissertation


Introduction

The spliceosome

The spliceosome is a macromolecular machine that carries out splicing in eukaryotic cells: the removal of introns (non-coding sequences) from precursor mRNA (pre-mRNA) and the joining of exons (coding sequences). The human spliceosome exists in two forms: the "major spliceosome", which performs >99% of splicing reactions in humans, and the "minor spliceosome", which performs the remaining <1% [1].

The two forms of the spliceosome found in humans are related and have a similar structure. Neither spliceosome is a stable or uniform complex; on the contrary, each consists of many elements that associate and dissociate in the course of splicing. The major spliceosome consists mainly of four protein-RNA subunits named after the (sn)RNA molecules they contain: U1 snRNP, U2 snRNP (with the SF3A and SF3B subcomplexes), U4/U6 di-snRNP and U5 snRNP. In the minor spliceosome, the U1 and U2 subunits are replaced by a single U11/U12 subunit, and the U4/U6 subunit is replaced by the U4atac/U6atac subunit. The four subunits of the human major spliceosome comprise a total of 45 unique proteins; seven of them (the "Sm proteins") occur in four copies, one in each subunit, where they form a platform that supports the RNA of that subunit. In the U4/U6 subunit, the Sm proteins are associated with U4 snRNA, whereas U6 snRNA is associated with a platform formed by seven Lsm ("like-Sm") proteins, which are evolutionarily related to the Sm proteins (for a review of the proteins, see [2]).

Besides the proteins of the spliceosomal subunits, about 70-80 additional proteins are abundant in the generalized spliceosomal complex (without distinguishing between the minor and the major spliceosome), and up to more than 100 further proteins are present at low abundance. The set of abundant proteins, together with the proteins and RNAs of the spliceosomal subunits, can be regarded as an experimental approximation of the set of proteins required for spliceosome function in humans [3]. The additional low-abundance spliceosomal proteins may take part in its function only under particular conditions, or may mediate between intron removal and other mRNA-related processes, such as mRNA transcription, capping of the mRNA 5' end, polyadenylation of the mRNA 3' end, mRNA export, localization and degradation, and the formation of C/D-family snoRNP complexes [4]. The additional proteins may be part of protein complexes or act as independent splicing factors. Protein complexes functionally associated with the spliceosome include the hPrp19/CDC5L complex and the EJC ("exon-junction complex"), CBP ("cap-binding proteins"), TREX ("transcription-export") and RES ("retention and splicing") complexes. With the exception of TREX, these complexes consist of proteins that are abundant in the spliceosome (and are therefore themselves abundant) [3].

The splicing process (review: [5])

Splicing has three main stages, which is reflected in the dynamic behaviour of the spliceosomal subunits:

The first stage is the definition of the boundaries between the intron to be excised and the flanking exons. Alternative intron boundaries are possible here, which gives rise to alternative splicing. In the human major spliceosome, the intron boundaries are defined by the U1 and U2 subunits, by the independent protein SF1, and by the two-protein complex U2AF65/U2AF35, which scan the pre-mRNA for the functional sites of the intron [the 5' and 3' ends of the intron and the intron branch point site (BPS)]. This stage has two steps:
o the U1 subunit binds to the 5' end of the intron, the U2AF65/U2AF35 complex to the 3' end of the intron, and SF1 to the BPS [complex E ("early")];
o SF1 is replaced by the U2 subunit (pre-splicing complex, complex A).


In the minor spliceosome, the role of the U1 and U2 subunits is taken over by the U11/U12 subunit.

The second stage is the splicing process proper, which includes the catalytic splicing reaction. In the human major spliceosome, this process begins when the U4/U6.U5 tri-snRNP, formed by the association of the U4/U6 and U5 subunits, joins complex A. It also has several steps:
o the U4/U6.U5 tri-snRNP binds to the pre-mRNA carrying the U1 and U2 subunits (complex A), giving the pre-catalytic complex, complex B;
o the U1 and U4 subunits dissociate (activated complex B, complex B*);
o the first catalytic step of splicing takes place at the interface of the U2, U5 and U6 snRNAs: a nucleophilic attack of the BPS on the 5' end of the intron (giving the catalytic complex, complex C);
o the second catalytic step of splicing takes place at the interface of the U2, U5 and U6 snRNAs: an attack of the released 5' exon on the 3' end of the intron (giving the post-splicing complex).

In the minor spliceosome, the role of the U4/U6 subunit is taken over by the U4atac/U6atac subunit.

The third stage is the recycling of the subunits and the regeneration of the subunits present at the start of the first stage. The post-reaction complex separates into the U2, U5 and U6 subunits. The U6 subunit associates with the U4 subunit to form the relatively stable U4/U6 di-snRNP, which then associates with the U5 subunit to regenerate the U4/U6.U5 tri-snRNP.

Each step of the splicing reaction (complexes A, B, B*, C) is associated with its own set of additional splicing proteins and protein complexes [3].

Because splicing proceeds in multiple stages, the basic function of the spliceosome, namely the catalytic excision of introns and ligation of exons, depends on the correct operation of many additional functions of the spliceosomal machinery, such as recognition of the 5' and 3' ends of the intron (intron and exon definition), mutual recognition of the spliceosomal subunits, correct assembly of the subunits, and the dynamics and regulation of the active spliceosome. Unlike the splicing reaction itself, the early phases of splicing (recognition) do not require catalytic reactions; instead, they rely on many weak contacts between the participating components. The later catalytic transitions between different arrangements of RNA-RNA pairing between the spliceosomal snRNAs and the intron pre-mRNA are, in turn, assisted by proteins such as the RNA helicases DDX23 and hBrr2 of the U5 subunit. Isolated cases have also been reported in which the dynamics of splicing are controlled by elements based on post-translational modifications of spliceosomal proteins [6], including ubiquitination and the domains associated with it [7], and by structurally disordered elements (see below) [8].

The spliceosome also continuously checks the correctness of the intermediates and of the final product of the process (that is, the spliced mRNA). This proofreading is likewise carried out by RNA helicases, in this case ones that are not stably associated with any complex, such as hPrp5, hPrp16 and hPrp22 [9].

Structural studies of the spliceosome

At the time of writing, the experimental structural data available for the spliceosome comprise cryo-electron microscopy maps of the complete (human) spliceosome [10] and higher-resolution (atomic) models of various fragments, such as the nearly complete U1 subunit of the human spliceosome [11] or U4 snRNA bound to the Sm protein platform [12]. However, for many regions and whole spliceosomal proteins no experimental model exists. At the same time, many of the available experimental models cover the same protein fragments.


A thorough understanding of spliceosome function requires knowledge of its structure. The first task of my research project was to build a library of experimental models of human spliceosomal proteins and to generate structural models for protein regions without experimentally solved structures.

The motivation for creating such a library was largely the prospect of building a model of the complete spliceosome in atomic representation. By combining high-resolution structural models of individual regions of the complex with the results of experimental analyses such as mass spectrometry and cryo-electron microscopy, it would be possible to obtain an atomic-scale structural model of the spliceosome, which could then be used in further analyses [13]. Such efforts have succeeded before, for example in modelling the RNA polymerase of the bacterium Escherichia coli [14].

Structural disorder in the spliceosome

Analyses of the ontologies of biological processes and protein functions have shown that splicing is one of the processes whose annotation terms are strongly associated with protein structural disorder [15]. Structural disorder in proteins means the absence of a stable tertiary structure in a given protein fragment in solution, although elements of secondary structure may be present and/or the region may acquire structure under certain conditions (for example, when the protein is bound in a complex). Disordered regions serve a variety of functions in proteins: as linkers between ordered domains, as sites of post-translational modification, and as sites of protein-protein and protein-RNA contacts. One place where structural disorder plays a particularly significant role is the ribosome. Many ribosomal proteins consist of ordered fragments with long disordered "tails" attached. In the mature ribosome, these "tails" penetrate deep into the complex, forming a "glue" that holds the rRNA structure together [16]. The high capacity of disordered proteins for making contacts also means that they often function in the assembly of larger protein complexes. Proteins that are central hubs of protein networks are often partly or entirely disordered [15].

Despite the strong terminological association between splicing and structural disorder, structural disorder in the spliceosome had not been analysed systematically before I began my research project. The literature data on disorder were therefore scattered across publications, sometimes on entirely different topics, and I first had to locate them myself. For this reason I cannot refer here to a single review source. Below are synthetic descriptions of the disordered regions of the human spliceosome that I used in the project:

"RS domains" and disordered regions similar to RS domains: regions rich in arginine and serine residues, first found in splicing factors of the SR protein group ("RS domains") and later in other spliceosomal proteins ("RS-like regions"). RS domains have been shown to be structurally disordered. They mediate various types of intermolecular contacts, including those between SR proteins and pre-mRNA, between different SR proteins, and between SR proteins and other proteins. In particular, RS domains can help define intron boundaries by stabilizing cross-intron interactions between the U1 snRNP at the 5' end of the intron and U2AF65 at the 3' end of the intron. RS domains and RS-like regions can be phosphorylated on serine residues; phosphorylation promotes intermolecular contacts, whereas dephosphorylation promotes the transition to the catalytic stage of splicing. Among the RS-like regions found in other proteins, phosphorylation of such a region in the DDX23 protein of the U5 subunit has been shown to promote stable association of this protein with the U4/U6.U5 tri-snRNP and incorporation of the U4/U6.U5 tri-snRNP into the spliceosome (main sources: [17][18]);


regions rich in polyproline and polyglutamine motifs: found, for example, in the SmB/B' protein of the Sm platform; able to form polyproline (polyglutamine) helices; may contain motifs that bind the ordered GYF and WW protein domains; saturating the polyproline motifs with a GYF domain has been shown to inhibit splicing at the level of complex A; the structural disorder of these regions has been demonstrated for proteins unrelated to splicing (main source: [19]);

regions rich in glycine (and arginine): these contain RGG triplets and related triplets (e.g. YGG, RAG). They can be divided into long (~100 amino acid residues) and short regions. The former occur in splicing factors of the hnRNP group, whereas the latter have been found in other splicing proteins, such as the SmB/B' protein of the Sm platform, the SR protein SF2/ASF, and the U1-70K protein of the U1 subunit. Unlike the regions in the previous two groups, these regions are predicted to be highly buried (that is, to lack contact with the solvent, which points to a kind of local "compactness" of the region); like the previous two groups, however, they show no typical secondary structure predictions. The arginine residues can be methylated. The glycine-rich region of hnRNP A1 has been shown to bind in vitro both itself and other hnRNP proteins. This region is also required for the binding of hnRNP A1 to the U2 and U4 subunits, and it silences splicing. Arginine methylation in the glycine- and arginine-rich region of the yeast homologue of U1-70K reduces the binding of this protein by Npl3 (which itself contains RGG triplets but is regarded as a homologue of the SR proteins) (main sources: [20][21]);

ULMs ("UHM ligand motifs"): short protein motifs (~20 amino acids) that are predicted to be disordered but are found in experimental models of spliceosomal proteins as ligands bound by UHM domains, that is, RRM domains whose structure is altered so that they bind protein rather than RNA. They are listed in the ELM database of protein motifs as record LIG_ULM_U2AF65_1, with the pattern [KR]{1,4}[KR].[KR]W. They occur, for example, in U2AF65 and U2AF35, which bind the 3' end of the intron: the UHM of U2AF35 binds the ULM of U2AF65 (main source: [22]).
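A motif pattern such as the one quoted for the ULM record can be applied to a protein sequence as an ordinary regular expression. The sketch below is only an illustration: the pattern string is transcribed from the text above and may have lost quantifiers in typesetting, so the ELM record itself should be treated as authoritative; the function name and the toy sequence are hypothetical.

```python
import re

# Pattern transcribed from the text above (assumption: the authoritative
# form should be taken from the ELM record LIG_ULM_U2AF65_1 itself).
ULM_PATTERN = re.compile(r"[KR]{1,4}[KR].[KR]W.")


def find_ulm_candidates(sequence):
    """Return (start, end, match) tuples for non-overlapping pattern hits.

    Positions are 1-based and inclusive, as is usual for protein sequences.
    """
    return [
        (m.start() + 1, m.end(), m.group())
        for m in ULM_PATTERN.finditer(sequence)
    ]


if __name__ == "__main__":
    # Toy sequence, not a real protein.
    toy_sequence = "AAAKRKSRWGAAA"
    for start, end, hit in find_ulm_candidates(toy_sequence):
        print(start, end, hit)
```

A hit from such a scan is only a candidate: as the description above notes, the motif also has to lie in a region predicted to be disordered.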

Fig. 1: Experimental models of the human spliceosome. A: Cryo-EM (cryo-electron microscopy) map of the human spliceosome at 22 Å resolution (EMD ID EMD-1294) [10]. B: Crystal structure of the U1 subunit at 5.5 Å resolution (PDB ID 3CW1) [11]. For the protein components of the subunit, only the Cα atoms are shown.



Research project

My research project focused on the structural analysis and structure modelling of 252 human spliceosomal proteins, including all proteins of the major spliceosome subunits and the abundant additional proteins. The project was initiated by Prof. Janusz M. Bujnicki, who is also the supervisor of the doctoral dissertation. The work was funded by grant LSHG-CT-2005-518238 of the EU Sixth Framework Programme, "EURASNET". Some of the computations were performed at the Interdisciplinary Centre for Mathematical and Computational Modelling of the University of Warsaw under computational grant G27-4.

My work on the project can be divided into four parts:

Structural analysis of the ordered regions of human spliceosomal proteins, together with a survey of the experimentally solved structures of human spliceosomal proteins that have models in atomic representation, and the generation of structural models for protein regions without experimentally solved structures. This part belonged to the original scope of the project, and its main motivation was the prospect of building a model of the complete spliceosome in atomic representation. Work on a model of the complete spliceosome is being continued by other members of Prof. Bujnicki's research group.

The results of the analysis of the ordered regions of human spliceosomal proteins are described in the attached publication "Structural Bioinformatics of the Human Spliceosomal Proteome" (Korneta I., Magnus M., Bujnicki JM., 2012, doi: 10.1093/nar/gks347, PMID: 22573172).

Analysis of structural disorder in spliceosomal proteins. The study of structural disorder was not part of the original project plan. The need for it arose only when a preliminary analysis showed me that more than one third of the total length of the proteins of the human major spliceosome subunits, and more than half of the total length of all spliceosomal proteins, is predicted to be structurally disordered.

The results of the analysis of structural disorder in human spliceosomal proteins are described in the attached publication "Intrinsic Disorder in the Human Spliceosomal Proteome" (Korneta I., Bujnicki JM., 2012, doi: 10.1371/journal.pcbi.1002641, PMID: 22912569).

Comparison of the repertoire of proteins and protein domains of the human spliceosomal proteome with the known repertoire of proteins and domains of the spliceosomal proteome of the protozoan Giardia lamblia. This protozoan is characterized by genomic minimalism, including a minimal number of introns in its genome [23]. The result of the analysis can help set priorities in modelling the structure of the human spliceosome, because regions present in both the human and the G. lamblia spliceosome should most probably be given priority during structure modelling.

The results of the comparative analysis of the human and G. lamblia spliceosomal proteomes are described in the attached publication "Structural Bioinformatics of the Human Spliceosomal Proteome" (Korneta I., Magnus M., Bujnicki JM., 2012, doi: 10.1093/nar/gks347, PMID: 22573172).

Publication of the data on a website. The data and structural models that I produced in the project are available online at http://iimcb.genesilico.pl/SpliProt3D. The website was programmed by Marcin Magnus, MSc; I took part in its creation as the designer.

The data service is described in the attached publication "Structural Bioinformatics of the Human Spliceosomal Proteome" (Korneta I., Magnus M., Bujnicki JM., 2012, doi: 10.1093/nar/gks347, PMID: 22573172). It is the only result in that publication that I did not author.

All results of the project are described in the attached publications "Structural Bioinformatics of the Human Spliceosomal Proteome" (Korneta I., Magnus M., Bujnicki JM., 2012, doi: 10.1093/nar/gks347, PMID: 22573172) and "Intrinsic Disorder in the Human Spliceosomal Proteome" (Korneta I., Bujnicki JM., 2012, doi: 10.1371/journal.pcbi.1002641, PMID: 22912569), which together constitute the doctoral dissertation.


Structural analysis of the ordered regions of human spliceosomal proteins and creation of a library of structural models

Methodology:

Domain detection: I detected ordered structural domains using software, mainly the GeneSilico metaserver (https://genesilico.pl/meta2/) [24], and then corrected the domain boundaries manually. Where possible, I assigned the domains identifiers in the SCOP structural classification [25] and in the PFAM classification of evolutionarily conserved domains [26].

Analysis of the occurrence of ordered domains in the human spliceosome: I compared the list of domains with the list of human spliceosomal proteins divided into groups according to proteomic data.

Building the model library: I assigned models to structural regions according to the following procedure: 1. if an experimental model (crystallographic or NMR) existed for the region, I assigned that model to the region; 2. if there was no experimental model but a comparative model could be built, I built a comparative model on the template identified during domain detection; 3. if a comparative model could not be built but the region was short (up to ca. 100 amino acids), I built a de novo model; 4. if a de novo model could not be built, I created a pro forma construct, in which only the primary and secondary structure were reliably reproduced. Most of the pro forma constructs I created represented disordered regions. The quality of all models (including the experimental ones) was assessed with the MetaMQAPII software [27] and on the QMEAN server [28].
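The four-step assignment procedure above amounts to a simple decision cascade, which can be sketched as follows. This is a minimal illustration, not the pipeline used in the project: the `Region` fields, the function name and the hard 100-residue cut-off (the text says "ca. 100") are assumptions made for the example.

```python
from dataclasses import dataclass

# Assumption: the text gives "ca. 100 amino acids"; a hard limit is used here
# only to keep the example concrete.
DE_NOVO_MAX_LENGTH = 100


@dataclass
class Region:
    """Hypothetical description of one structural region of a protein."""
    name: str
    length: int
    has_experimental_model: bool
    has_template: bool
    de_novo_feasible: bool = True


def assign_model_type(region):
    """Pick the model type for a region following steps 1-4 in order."""
    if region.has_experimental_model:
        return "experimental"      # step 1: crystallographic or NMR model
    if region.has_template:
        return "comparative"       # step 2: template from domain detection
    if region.length <= DE_NOVO_MAX_LENGTH and region.de_novo_feasible:
        return "de novo"           # step 3: short region, no template
    return "pro forma"             # step 4: fallback construct


if __name__ == "__main__":
    examples = [
        Region("region_a", 95, True, True),
        Region("region_b", 250, False, True),
        Region("region_c", 80, False, False),
        Region("region_d", 300, False, False),
    ]
    for region in examples:
        print(region.name, assign_model_type(region))
```

The order of the checks matters: an experimental model always takes precedence, and the pro forma construct is reached only when every other option has been ruled out.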

Key results:

In the 252 human spliceosomal proteins I detected 465 ordered structural domains, including 80 domains in the proteins of the major spliceosome subunits. Ordered structural domains make up ~90% of the ordered part of human spliceosomal proteins and ~50% of their full length (about half of the protein length is predicted to be structurally disordered). I also found 25 regions whose properties suggest that they may be ordered structural domains but which cannot be assigned to any known group of domains. Finally, while reviewing the experimental models of complexes of spliceosomal proteins, I found 9 regions that can be called "disordered domains that acquire structure": they have a confirmed independent function and are structured in the experimental models, but are predicted to be disordered in isolation (see below).

The main types of ordered structural domains in human spliceosomal proteins are:
o small RNA-binding domains (e.g. RRM, PWI);
o small domains that bind structurally disordered protein regions (e.g. GYF, WW);
o protein-binding domains composed of structural repeats (e.g. TPR, WD40);
o domains associated with ubiquitin and ubiquitination (e.g. zf-UBP, U-box);
o domains associated with heat shock (e.g. HSP20);
o proline isomerase domains (Pro_isomerase);
o domains that form part of the stable architectures of RNA helicases (e.g. DEAD);
o small domains that act as ligands bound by larger domains (e.g. PRP4);
o LSM domains, which form the basis of the structural plan of the Sm/Lsm proteins.

A novel finding is my detection of a considerable number of "ubiquitination-related domains", that is, domains that usually occur in proteins functioning in ubiquitination. The importance of the ubiquitination of spliceosomal proteins for the control of splicing has been demonstrated experimentally only in isolated cases. In spliceosomal proteins, ubiquitination-related domains occur mainly in proteins associated with the second stage of splicing (complexes B-C). Since in the described cases the reversible process of ubiquitination regulates the activity of the spliceosome, one can hypothesize that the spliceosomal machinery contains a regulatory "subsystem" based on (de)ubiquitination. If so, the fact that proteins containing ubiquitination-related domains occur mostly at the late stages of splicing could reflect the need for more precise control at those stages than at the early (recognition) stages.

I built a library of 104 experimental models (43 crystallographic, 61 NMR), 297 reliable computational models (255 comparative, 43 de novo) and more than 500 pro forma constructs. The models I built (excluding the pro forma constructs) and the available experimentally solved structures cover more than 90% of the total length of protein sequence predicted to be ordered (~50% of the total protein sequence). The models I built (again excluding the pro forma constructs) have parameters adequate for use in further studies of the spliceosome, including combining them with cryo-electron microscopy results for the spliceosomal subunits in order to determine the structure of the whole complex.

The domains that I found in human spliceosomal proteins, and that had previously been neither detected by automatic annotation services nor described in the literature, include domains from proteins of the spliceosomal subunits and from important abundant proteins [e.g. a degenerate C2H2 zinc finger domain in the SF3a120 protein of the U2 subunit, a BLUF domain in the hPrp3 protein of the U4/U6 di-snRNP (annotated as a DUF1115 domain), and PWI domains in the hBrr2 protein of the U5 subunit and in the RNA helicases hPrp2 and hPrp22].

Most interesting result:

Detecting the domains and building the structural models was the most difficult part of the analysis for me, for several reasons. First, the task was extremely labour-intensive and took up the lion's share of the project time. Second, it required more scientific "craft" than intellectual creativity. Third, the ultimate test of the value of the models will be their use in practice. Nevertheless, there were exciting moments: the most satisfying result of this part of the analysis was, of course, finding new structural domains in some of the most important spliceosomal proteins, which had been analysed many times before (e.g. hBrr2). It is satisfying to find something that others have overlooked.

Fig. 2: Models of domains of human spliceosomal proteins. A: BLUF (DUF1115) domain of the U4/U6-90K protein (hPrp3) (amino acids 540–683). The position of the conserved residue W604 is marked. Predicted RMSD 3.7 Å, QMEAN Z-score -3.06. B: Conserved core of the PRO8NT domain of hPrp8. De novo model. Predicted RMSD 2.4 Å, QMEAN Z-score -1.93. C-E: PWI domains. C: PWI domain of the helicase hPrp22 (DHX8; residues 1–120 are shown, but the domain may end at amino acid 92). Predicted RMSD 2.4 Å, QMEAN Z-score -2.76. D: PWI domain of the helicase hPrp2 (DHX16; residues 1–95). Predicted RMSD 5.8 Å, QMEAN Z-score -2.19. E: PWI domain of the helicase U5-200K (hBrr2; residues 259–338). Predicted RMSD 3.8 Å, QMEAN Z-score -0.79.



Structural analysis of the disordered regions of human spliceosomal proteins

Methodology:

Detecting the boundaries of predicted disordered regions: I detected the boundaries of disordered regions in human spliceosomal proteins using software, mainly the GeneSilico metaserver (https://genesilico.pl/meta2/), and then corrected them manually.

Division of disordered regions into types: I divided the disordered regions into the following types: 1. disordered regions with predicted secondary structure elements (with a subtype showing coiled-coil predictions); 2. longer disordered regions (≥ amino acid residues) with a strong compositional bias; 3. other regions. Among the regions with a strong compositional bias I distinguished three subtypes corresponding to the synthetic descriptions given in the Introduction of this summary (the "RS-like" type, the "polyproline/polyglutamine-rich" type and the "glycine (and arginine)-rich" type) and added two complementary types ("charged" and "uncharged").

Analysis of the occurrence of disordered regions in the human spliceosome: I compared the list of locations of the different types of disordered regions with the list of human spliceosomal proteins divided into groups according to proteomic data.

Analysis of post-translational modifications of disordered sites: I retrieved the list of positions of post-translational modification sites in spliceosomal proteins from the UniProt protein sequence database [29]. I then compared the locations of post-translational modifications in human spliceosomal proteins with the list of locations of the different types of disordered regions.
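The comparison of modification sites with disordered regions can be pictured as an interval lookup: each modified position is assigned to the disordered region that contains it, if any, and the hits are tallied per region type. The sketch below is a hedged illustration of that idea only; the function name, the tuple layout and the example coordinates are hypothetical and do not come from the project data.

```python
from collections import Counter

# Label for sites that fall in none of the listed regions (assumption:
# everything outside the annotated disordered regions is pooled together).
OUTSIDE = "outside disordered regions"


def count_ptms_by_region_type(ptm_positions, regions):
    """Tally modification sites per disordered-region type.

    ptm_positions: iterable of 1-based residue positions.
    regions: list of (start, end, region_type) tuples, 1-based inclusive,
             assumed not to overlap.
    """
    counts = Counter()
    for position in ptm_positions:
        label = OUTSIDE
        for start, end, region_type in regions:
            if start <= position <= end:
                label = region_type
                break
        counts[label] += 1
    return counts


if __name__ == "__main__":
    # Made-up coordinates for a single hypothetical protein.
    regions = [(1, 50, "RS-like"), (120, 180, "glycine-rich")]
    ptm_positions = [5, 30, 75, 150]
    for label, count in count_ptms_by_region_type(ptm_positions, regions).items():
        print(label, count)
```

For a meaningful comparison across region types, such raw counts would normally be divided by the total length of each region type, since longer regions collect more sites by chance.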

Analysis of predicted disordered regions found in experimental models and prediction of additional regions from these classes of "domains": I compared the list of "disordered domains that acquire structure" found while building the library of protein models (that is, predicted disordered regions that are present in experimental models and have an independent function) with the list of human spliceosomal proteins divided into groups according to proteomic data, in order to find out in which groups of proteins these domains occur most often. Using pattern recognition methods (for motifs of <30 amino acid residues) and domain detection methods, I predicted additional potential locations of these "domains" in the proteins.

Prediction and analysis of additional regions that potentially acquire structure: By comparing the list of evolutionarily conserved PFAM domains with the list of disordered regions, I identified additional potential "disordered domains". From the list of the most disordered proteins containing potential "disordered domains", I then selected as proteins of interest those that are both evolutionarily conserved and abundant in the human spliceosome.

Analysis of the relative age of disordered and ordered regions in human spliceosomal proteins: [Note: Because spliceosomal proteins are strongly conserved (especially the abundant ones) [30], the evolution of the whole spliceosomal proteome can be inferred from the conserved domains present in the human proteins.] I compared the list of evolutionarily conserved PFAM domains found in human spliceosomal proteins with the list of domains predicted to have been present in the last eukaryotic common ancestor (LECA), and checked whether these domains are currently widespread in bacteria and/or Archaea. I then compared the relative age and prevalence of the PFAM domains that fall within the ordered and the disordered regions of the proteins.

Comparative analysis of structural disorder in the subunits of the human spliceosome and in the subunits of the human and Escherichia coli ribosomes: I divided the proteins of the human and E. coli ribosomes into ordered regions, disordered regions with predicted secondary structure, and disordered regions without predicted secondary structure. I then compared the parameters of structural disorder in these two ribosomes with those of the human spliceosome. Finally, for the E. coli ribosome I checked what fraction of the predicted disordered regions is found in the experimental model of the ribosome. For the human ribosome such an analysis was not possible, because no experimental model of that ribosome existed.
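The three-way division used here can be illustrated on a per-residue basis. The sketch below is a simplification under stated assumptions: it takes two hypothetical prediction strings of equal length (disorder as "D"/"O", secondary structure as "H", "E" or "-") and classifies single residues, whereas the actual analysis worked with regions and with the output of dedicated predictors.

```python
CLASSES = ("ordered", "disordered_with_ss", "disordered_without_ss")


def classify_residues(disorder, secondary):
    """Assign each residue to one of the three classes.

    disorder:  string of "D" (disordered) or "O" (ordered), one per residue
    secondary: string of "H" (helix), "E" (strand) or "-" (none), one per residue
    """
    if len(disorder) != len(secondary):
        raise ValueError("prediction strings must have equal length")
    labels = []
    for d, s in zip(disorder, secondary):
        if d != "D":
            labels.append("ordered")
        elif s in ("H", "E"):
            labels.append("disordered_with_ss")
        else:
            labels.append("disordered_without_ss")
    return labels


def class_fractions(labels):
    """Return the fraction of residues falling into each class."""
    total = len(labels)
    if total == 0:
        return {name: 0.0 for name in CLASSES}
    return {name: labels.count(name) / total for name in CLASSES}


# Hypothetical predictions for a 10-residue fragment.
disorder = "OOOODDDDDD"
secondary = "HHHH-HHH--"
fractions = class_fractions(classify_residues(disorder, secondary))
print(fractions)
```

Summing such fractions over the proteins of a subunit gives disorder parameters that can be compared between complexes.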


Human spliceosomal proteins are highly disordered. More than 30% of the length of the proteins of the major spliceosome subunits, more than 40% of the length of the 122 most important proteins of the major spliceosome (119 abundant proteins + 3 additional ones selected to complete protein complexes), and more than 50% of the length of all human spliceosomal proteins is predicted to be structurally disordered.

Among the spliceosomal subunits, the proteins of U1 snRNP, the U2 SF3A subcomplex and U11/U12 di-snRNP, the U2-related proteins and the proteins specific to the U4/U6.U5 tri-snRNP complex are more disordered than the proteins of the U2 SF3B subcomplex, U4/U6 di-snRNP and U5 snRNP, and the Sm and Lsm proteins. This means that, apart from the proteins specific to the U4/U6.U5 tri-snRNP complex, the “early” spliceosomal subunit proteins (those present at the recognition stage) are more disordered than the “late” proteins (those present at the catalysis stage). The same holds for proteins that are independent splicing factors: those characteristic of complex A are more disordered than those characteristic of complexes B-C.

Early spliceosomal proteins contain more structural disorder that lacks predicted secondary structure but has a strongly biased amino acid composition than late proteins do, whereas late proteins (including those of the U4/U6.U5 tri-snRNP subunit) contain more structural disorder with predicted secondary structure.

Structural disorder with predicted secondary structure is usually found in experimental models of spliceosomal complexes, whereas long disordered regions without predicted secondary structure and with a strongly biased amino acid composition are not. This means that most of the structural disorder of early splicing proteins may not acquire any structure during the splicing process, whereas the greater part of the structural disorder of late proteins can potentially acquire structure during splicing.

For the U5 subunit, in whose proteins only ~20% of residues are predicted to be disordered, more than half of the predicted disordered residues have predicted secondary structure. This means that this subunit may be almost entirely ordered.

Among the different types of disorder with a strongly biased amino acid composition, all three types that I defined by synthesizing previously published information [“RS-like”, “polyproline/polyglutamine-rich” and “glycine (and arginine)-rich”] are common in early proteins, whereas only the RS-like type is common in late proteins. The prevalence of these regions in early proteins, combined with experimental results on their role, suggests that they are an important element of the first stage of splicing (the stage at which intron boundaries are defined). The occurrence of RS-like regions in late proteins, likewise combined with experimental results, suggests that they may also be responsible for controlling the dynamics of the splicing process.


In addition, given that RS-like regions and glycine/arginine-rich regions often co-occur in the same proteins and occur in proteins that interact with each other (mutually inhibiting one another's activity; in particular SR proteins and the hnRNP A1 protein), it is possible that these two types of structural disorder interact with each other, both in these and in other proteins.

Non-abundant human spliceosomal proteins, besides containing on average more structural disorder than abundant proteins, also contain more disorder with a strongly biased amino acid composition. These proteins contain all three types of disordered regions defined on the basis of earlier literature descriptions.

Among post-translational modifications, serine phosphorylation is systematically associated with RS-like disordered regions of human spliceosomal proteins, and arginine methylation with disordered regions rich in glycine and arginine. N-acetylation of lysine and acetylation of the N-terminal residues of human spliceosomal proteins do not depend on the degree of order.

Disordered domains that can be found in experimental models of the structure of spliceosomal protein complexes can be divided into two types: ULMs and others. ULMs occur in many copies in the human spliceosomal proteome, among early proteins (the U2 subunit, proteins associated with the U2 subunit, and complex A proteins). Using pattern recognition methods, I found several additional potential ULM sites, again mainly in early proteins. The prevalence of ULMs in early proteins suggests that they, too, are an important element of the first stage of splicing (the stage at which intron boundaries are defined). Domains other than ULMs, by contrast, occur in fewer copies in the spliceosomal proteome and usually bind a specific partner.

Overall, the human spliceosomal proteome contains 51 evolutionarily conserved PFAM domains that cover disordered protein regions (46 distinct domain types). These PFAM domains may indicate the location of “disordered domains”, including “disordered domains that can acquire structure”. In particular, several highly disordered proteins for which these PFAM domains are the only conserved part of the protein are strongly conserved in evolution and abundant in the human spliceosomal proteome. These conserved, highly disordered proteins tend to occur at the late stage of splicing (they are two of the three proteins of the U4/U6.U5 tri-snRNP subunit and several independent splicing factors) and may be hub proteins of the spliceosomal protein network.

Most of the conserved PFAM domains found in human spliceosomal proteins, both disordered and ordered, were present in the last eukaryotic common ancestor. However, almost none of the disordered domains is currently widespread outside Eukaryota, whereas about one third of the ordered domains are. In particular, a group of proteins centered on the U4/U6 di-snRNP and U5 subunits (including the Sm/Lsm proteins and the C-terminal domains of the RNA helicases hPrp2/22/16/43) either has bacterial homologs or consists of domains that are widespread in all three superkingdoms of organisms. This group of proteins may be the oldest part of the spliceosome and constitute its core, onto which disordered regions, among other elements, were later added.

Structural disorder in the human spliceosome differs considerably from that in the human and E. coli ribosomes. Disordered regions in ribosomes are much shorter, and most of them have predicted secondary structure. In the E. coli ribosome, most of the amino acid residues predicted to be disordered in the isolated protein can be found in the structure of the complex. The subunits of both ribosomes also show less variation in the degree of predicted structural disorder of their proteins than the subunits of the human spliceosome. The reason for this smaller variation is probably that most of the structural disorder of ribosomal proteins has a similar function: it forms a “glue” that supports the structure of rRNA.

On the basis of bioinformatics analyses, I could not predict whether the RNA “glue” function is common in the spliceosome. In experimental models of spliceosomal proteins I found only one predicted disordered fragment that binds snRNA: the fragment at the N-terminus of the U1-70K protein. There is, however, an important reason why the “glue” function may be less common in the spliceosome than in the ribosome. Ribosomal RNA is much longer than spliceosomal RNA (e.g. human 28S rRNA has 5070 nucleotides, whereas the longest human snRNA, U2 snRNA, has 188). This means that an snRNA can (probably) fold on its own much more easily, without the help of “glue”.

On the basis of the above analyses, I created a conceptual model that divides the human spliceosome into three “layers”:

o the “inner” layer (the “hard core”): highly ordered proteins (domains) that directly support the catalysis carried out by snRNA; precise mechanisms of action; in the major spliceosome, mainly the proteins of the U2 snRNP SF3B, U4/U6 di-snRNP and U5 snRNP subunits, the Sm/Lsm proteins and the ordered C-terminal domains of the RNA helicases hPrp2/22/16/43; potentially the evolutionarily oldest regions of the spliceosome;

o the “intermediate” layer (the “mantle”): mainly structural disorder that can adopt structure under some conditions (mainly disorder with predicted secondary structure), including the conserved disordered PFAM domains; functionally associated mainly with spliceosome dynamics; in the major spliceosome, mainly proteins characteristic of the U4/U6.U5 tri-snRNP subunit and independent splicing factors characteristic of complexes B-C; also RS domains and possibly ubiquitination-related domains;

o the “outer” layer (the “atmosphere”): mainly regions of structural disorder that do not adopt structure under any conditions, especially long regions of structural disorder with a strongly biased amino acid composition [including “RS-like”, “polyproline/polyglutamine-rich” and “glycine (and arginine)-rich” regions], and ULMs; these regions may serve as “sensors” or “protrusions” that contact one another, the pre-mRNA and the small ordered domains also present in this layer (e.g. GYF, WW, UHM). Other small ordered domains (e.g. RRM, PWI) may also bind the pre-mRNA. The main function of this layer is the recognition and definition of intron boundaries (and, consequently, the regulation of alternative splicing). It consists mainly of early spliceosomal proteins (those of complex A, of the U1, U2 SF3A and U11/U12 di-snRNP subunits, and the proteins related to the U2 subunit; among non-abundant proteins, the SR proteins, hnRNPs, SRm160/300 and the RES complex). Its function may be regulated by post-translational modifications: phosphorylation of serines in RS-like regions and methylation of arginines in regions rich in glycine (and arginine).

Most interesting result: This part of the project was considerably more interesting to me than the first, because by combining data from the different analyses I was able, towards its end, to build a single coherent model (conceptual, not structural) of a phenomenon that had not previously been described systematically: the occurrence and function of structural disorder across the entire human spliceosome. My model can serve as a reference point for further studies of disorder in the spliceosome. At the same time, the detailed results of my analyses concerning specific protein regions, which I published together with the general model, can be verified experimentally.

Fig. 3 (next page): The model of the three “layers” of the (human) spliceosome.



Comparison of the proteins and structural regions of proteins of the human and G. lamblia spliceosomes

Methodology: Using the BLASTP and PSI-BLAST variants of the BLAST tool available from the NCBI website (http://blast.ncbi.nlm.nih.gov/Blast.cgi), I identified G. lamblia homologs of human spliceosomal proteins in the Protein sequence database. I then detected ordered structural domains in the G. lamblia proteins and divided the proteins into regions using the methodology described earlier for the human proteins. Finally, I compared the human and G. lamblia repertoires of proteins and domains.
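The final comparison step can be thought of as a set comparison: which reference proteins have no homolog in the other organism, and which domains are missing from the homologs that do exist. The sketch below illustrates only this idea; the protein and domain names are placeholders and the dictionaries stand in for the curated tables used in the project.

```python
# Placeholder repertoires (hypothetical names): protein -> set of domains.
human = {
    "protein_A": {"domain_1", "domain_2"},
    "protein_B": {"domain_3"},
    "protein_C": {"domain_4"},
}
giardia = {
    "protein_A": {"domain_1"},
    "protein_B": {"domain_3"},
}


def compare_repertoires(reference, other):
    """Return (proteins without a homolog in `other`,
    domains missing from the homologs that are present)."""
    missing_proteins = sorted(set(reference) - set(other))
    missing_domains = {}
    for name in sorted(set(reference) & set(other)):
        lost = reference[name] - other[name]
        if lost:
            missing_domains[name] = sorted(lost)
    return missing_proteins, missing_domains


missing_proteins, missing_domains = compare_repertoires(human, giardia)
print("No homolog found:", missing_proteins)
print("Domains missing from homologs:", missing_domains)
```

In practice each homolog assignment comes from BLASTP or PSI-BLAST hits that have been inspected manually, so the dictionaries would be built from that curated output rather than written by hand.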


The known spliceosomal proteome of G. lamblia ATCC50803 contains homologs of 30 human spliceosomal proteins; the known proteome of G. lamblia P15 contains homologs of two additional proteins. For comparison, the known proteome of the yeast S. cerevisiae contains homologs of 61 of the 119 abundant proteins of the human spliceosome.

The proteins that have homologs in G. lamblia are mainly the Sm and Lsm proteins, proteins of the U2 and U5 spliceosomal subunits, and spliceosomal RNA helicases. In other words, they are mainly “hard core” proteins that directly support the catalytic splicing process carried out by spliceosomal RNA.

G. lamblia spliceosomal proteins are usually shorter than their human counterparts. The missing regions fall into three main types:

o ubiquitination-related domains;

o disordered regions with a predicted independent role (i.e. not interdomain linkers);

o short protein fragments that serve as ligands binding protein domains (usually fragments predicted to be disordered in isolation but structured in a complex), together with their partners.

In other words, assuming the hypotheses presented earlier in this work about the role of ordered ubiquitination-related domains and of structural disorder in spliceosomal proteins, the regions that are degenerate or missing among the proteins and domains of the known G. lamblia spliceosomal proteome are those responsible for functions such as the initial recognition of intron boundaries, the mutual recognition of spliceosomal subunits and the control of spliceosome dynamics. By contrast, the proteins that directly support the basic activity of the spliceosome, i.e. the catalytic splicing process, are almost unchanged; these are the proteins that are potentially the evolutionarily oldest elements of the spliceosome, borrowed from bacterial systems.

Because the lists of proteins and structural regions present in and absent from G. lamblia relative to human give a consistent picture, the analysis can be assumed to be meaningful. This means that the list of structural regions present in G. lamblia is a good starting point for modeling the spliceosome: it is a list of regions that should be included in a model of the spliceosome (from any organism).

Most interesting result:

This entire analysis! The idea came from the fact that the model organism in spliceosome research is the yeast S. cerevisiae. The yeast spliceosome is simpler than the human one, but not by much; on the other hand, Saccharomycetes yeasts have their own spliceosomal proteins that have no homologs even in other fungi. While searching for homologs of human proteins in other organisms, I noticed that the organism with a “minimal” set of spliceosomal proteins is G. lamblia. There is a good reason for this: this protozoan is characterized by general genomic minimalism and a small number of introns in its genome, which means that it should not need a spliceosome with a complicated regulatory mechanism. I therefore decided that, to determine the “minimal” set of protein sequences in the spliceosomal proteome, I should compare the human and G. lamblia proteomes; the yeast proteome is still too complex.

Because G. lamblia is a parasite, there is always the possibility that it obtains most of its spliceosomal proteins from the host, so it cannot be claimed (at least until experimental studies are carried out, if they ever are) that G. lamblia has a “minimal” spliceosome. Nevertheless, the fact that both the list of proteins and regions present in the G. lamblia spliceosomal proteome and the list of those absent from it form functionally coherent sets indicates that this may indeed be the case. In any event, the list of regions shared by G. lamblia and human is a good initial set of regions for modeling the spliceosome.

The second interesting result of this analysis is the discovery that the strongly conserved protein Prp8 has a different domain at its C-terminus in G. lamblia than in almost all other organisms: instead of a potentially ubiquitin-binding domain, it has a domain with the ubiquitin fold (this is the only ubiquitin-related domain present in the set of G. lamblia spliceosomal proteins). Prp8 lies almost at the very heart of the spliceosome: it binds, permanently or transiently, to most of the catalytic snRNAs of the spliceosome and plays a central role in its correct functioning. Analyzing the causes of such a drastic change in the structure of this protein, and its effect on the overall structure of the spliceosome, may yield interesting results.


Publication of the data

Methodology: The data were published in a web service created by Marcin Magnus, MSc, at http://iimcb.genesilico.pl/SpliProt3D. The service is the only result of the research project of which I am not the main author. I did, however, take part in its creation as a designer, co-author of the description, etc.


The service contains all the structural models of human spliceosomal proteins that make up the library described in the section “Structural analysis of the ordered regions…”, including experimental, comparative and de novo models, as well as constructs of regions that are disordered or cannot be modeled with the above methods. The models can be searched, downloaded to a local computer, etc.

In addition, the service contains sequence alignments of homologs of the human proteins from representative eukaryotic species, annotated with prediction results and structural data for the human protein. The annotations included are: predictions of structural disorder, secondary structure, disorder with protein-binding potential, burial and coiled coils, as well as data on post-translational modification sites retrieved from the UniProt database.

A complete description of the service (in English) is available at http://iimcb.genesilico.pl/SpliProt3D/home/. The full archive of downloadable files takes up about 250 MB.

Most interesting result:

For me, the most interesting challenge in designing the service was the need to visualize clearly the combination of a sequence alignment of homologs of the human proteins with a description of the properties of the human protein. Existing tools that aggregate different types of data on protein properties (e.g. the GeneSilico metaserver, https://genesilico.pl/meta2/) are powerful, but are usually oriented towards including as much data as possible at the expense of the aesthetics of presentation, and are consequently unsuitable for visualization. For the purposes of the service and of publishing the results in articles, however, the data from alignments and predictions had to be integrated in a compact form. I think that the final result of my work, for which I used the Jalview program [31], meets this requirement well.

Fig. 4 (next page): Visualization of the sequence alignment and structural predictions for the SNRPD3 protein in the SpliProt3D database (produced with Jalview).






Summary of the project results

Within the project, I managed to:

systematically analyze the ordered regions of the proteins of the human spliceosomal proteome; identify regions that, from a structural point of view, are trivial, interesting, or currently impossible to model;

create a library of experimental and computer-generated models that can be used in further research;

establish the existence of the interesting phenomenon of structural disorder in spliceosomal proteins, previously almost entirely undescribed in the literature; collect the scattered information on the subject and then systematically analyze human spliceosomal proteins with respect to various aspects of structural disorder; and finally, build a coherent model that describes this phenomenon and can serve as a foundation for further research;

at the same time, present specific predictions about particular protein fragments that are predicted to be disordered;

by comparing the human spliceosomal proteome with that of G. lamblia, create a list of protein regions that should be included in a structural model of the spliceosome.

Perhaps the most important discovery I made in the project is the simplest one: that human spliceosomal proteins are highly disordered. Moreover, according to the predictions, a considerable part of the structural disorder of spliceosomal proteins will not acquire structure even after binding in protein complexes. For protein regions that are structurally disordered, it is not possible to build single reliable high-quality structural models (either experimental or, above all, computational). This means that building a model of the structure of the spliceosome (either experimental or, above all, computational) may be considerably more difficult. The structural characterization of the disordered fragments of human spliceosomal proteins will certainly require special experimental and computational methods aimed at studying disordered proteins, such as EOM (the “Ensemble Optimization Method”, a method for optimizing the results of SAXS/SANS experiments) [32].


Bibliography

1. Tarn WY, Steitz JA (1996) A novel spliceosome containing U11, U12, and U5 snRNPs excises a

minor class (AT-AC) intron in vitro. Cell 84: 801–811. PMID: 8625417.

2. Valadkhan S, Jaladat Y (2010) The spliceosomal proteome: at the heart of the largest cellular

ribonucleoprotein machine. Proteomics 10: 4128–4141. PMID: 21080498.

3. Agafonov DE, Deckert J, Wolf E, Odenwalder P, Bessonov S, et al. (2011) Semiquantitative

proteomic analysis of the human spliceosome via a novel two-dimensional gel electrophoresis method.

Mol Cell Biol 31: 2667–2682. PMID: 21536652.

4. Collins LJ, Penny D (2009) The RNA infrastructure: dark matter of the eukaryotic cell? Trends Genet 25: 120–128. PMID: 19171405.

5. Wahl MC, Will CL, Luhrmann R (2009) The spliceosome: design principles of a dynamic RNP

machine. Cell 136: 701–718. PMID: 19239890.

6. McKay SL, Johnson TL (2010) A bird’s-eye view of post-translational modifications in the

spliceosome and their roles in spliceosome dynamics. Mol Biosyst 6: 2093–2102. PMID: 20672149.

7. Bellare P, Small EC, Huang X, Wohlschlegel JA, Staley JP, et al. (2008) A role for ubiquitin in the

spliceosome assembly pathway. Nat Struct Mol Biol 15: 444–451. PMID: 18425143.

8. Mathew R, Hartmuth K, Mohlmann S, Urlaub H, Ficner R, et al. (2008) Phosphorylation of human

PRP28 by SRPK2 is required for integration of the U4/U6-U5 tri-snRNP into the spliceosome. Nat

Struct Mol Biol 15: 435–443. PMID: 18425142.

9. Cordin O, Hahn D, Beggs JD (2012) Structure, function and regulation of spliceosomal RNA helicases. Curr Opin Cell Biol 24: 431–438. PMID: 22464735.

10. Azubel M, Wolf SG, Sperling J, Sperling R (2004) Three-dimensional structure of the native spliceosome by cryo-electron microscopy. Mol Cell 15: 833–839. PMID: 15350226.

11. Pomeranz Krummel DA, Oubridge C, Leung AK, Li J, Nagai K (2009) Crystal structure of human spliceosomal U1 snRNP at 5.5 Å resolution. Nature 458: 475–480. PMID: 19325628.

12. Leung AK, Nagai K, Li J (2011) Structure of the spliceosomal U4 snRNP core domain and its implication for snRNP biogenesis. Nature 473: 536–539. PMID: 21516107.

13. Jurica MS (2008) Detailed close-ups and the big picture of spliceosomes. Curr Opin Struct Biol 18: 315–320. PMID: 18550358.

14. Opalka N, Brown J, Lane WJ, Twist KA, Landick R, Asturias FJ, Darst SA (2010) Complete structural model of Escherichia coli RNA polymerase from a hybrid approach. PLoS Biol 8: e1000483. PMID: 20856905.

15. Tompa P (2009) Structure and Function of Intrinsically Disordered Proteins. Chapman & Hall.

16. Wimberly BT, Brodersen DE, Clemons WM, Jr., Morgan-Warren RJ, Carter AP, et al. (2000)

Structure of the 30S ribosomal subunit. Nature 407: 327–339. PMID: 11014182.

17. Haynes C, Iakoucheva LM (2006) Serine/arginine-rich splicing factors belong to a class of

intrinsically disordered proteins. Nucleic Acids Res 34: 305–312. PMID: 16407336.


















18. Long JC, Caceres JF (2009) The SR protein family of splicing factors: master regulators of gene

expression. Biochem J 417: 15–27. PMID: 19061484.

19. Kofler M, Schuemann M, Merz C, Kosslick D, Schlundt A, et al. (2009) Proline-rich sequence

recognition: I. Marking GYF and WW domain assembly sites in early spliceosomal complexes. Mol

Cell Proteomics 8: 2461–2473. PMID: 19483244.

20. Han SP, Tang YH, Smith R (2010) Functional diversity of the hnRNPs: past, present and

perspectives. Biochem J 430: 379–392. PMID: 20795951.

21. Bedford MT, Richard S (2005) Arginine methylation - an emerging regulator of protein function.

Mol Cell 18: 263–272. PMID: 15866169.

22. Kielkopf CL, Rodionova NA, Green MR, Burley SK (2001) A novel peptide recognition mode

revealed by the X-ray structure of a core U2AF35/U2AF65 heterodimer. Cell 106: 595–605. PMID:

11551507.

23. Morrison HG, McArthur AG, Gillin FD, Aley SB, Adam RD, Olsen GJ, Best AA, Cande WZ, Chen F, Cipriano MJ, et al. (2007) Genomic minimalism in the early diverging intestinal parasite Giardia lamblia. Science 317: 1921–1926. PMID: 17901334.

24. Kurowski MA, Bujnicki JM (2003) GeneSilico protein structure prediction meta-server. Nucleic

Acids Res 31: 3305–3307. PMID: 12824313.

25. Murzin AG, Brenner SE, Hubbard T, Chothia C (1995) SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 247: 536–540. PMID: 7723011.

26. Finn RD, Mistry J, Tate J, Coggill P, Heger A, et al. (2010) The Pfam protein families database.

Nucleic Acids Res 38: D211–222. PMID: 22127870.

27. Pawlowski M, Gajda MJ, Matlak R, Bujnicki JM (2008) MetaMQAP: a meta-server for the quality assessment of protein models. BMC Bioinformatics 9: 403. PMID: 18823532.

28. Benkert P, Kunzli M, Schwede T (2009) QMEAN server for protein model quality estimation. Nucleic Acids Res 37: W510–W514. PMID: 19429685.

29. Magrane M, Consortium U (2011) UniProt Knowledgebase: a hub of integrated protein data.

Database 2011: bar009. PMID: 21447597.

30. Collins L, Penny D (2005) Complex spliceosomal organization ancestral to extant eukaryotes. Mol

Biol Evol 22: 1053–1066. PMID: 15659557.

31. Waterhouse AM, Procter JB, Martin DM, Clamp M, Barton GJ (2009) Jalview Version 2–a multiple sequence alignment editor and analysis workbench. Bioinformatics 25: 1189–1191. PMID: 19151095.

32. Bernado P, Mylonas E, Petoukhov MV, Blackledge M, Svergun DI (2007) Structural

characterization of flexible proteins using small-angle X-ray scattering. J Am Chem Soc 129: 5656–

5664. PMID: 17411046.
















Publications

Structural bioinformatics of the human spliceosomal proteome

Iga Korneta1, Marcin Magnus1 and Janusz M. Bujnicki1,2,*

1Laboratory of Bioinformatics and Protein Engineering, International Institute of Molecular and Cell Biology, Warsaw PL-02-109 and 2Bioinformatics Laboratory, Institute of Molecular Biology and Biotechnology, Faculty of Biology, Adam Mickiewicz University, Poznan PL-61-614, Poland

Received January 18, 2012; Revised March 27, 2012; Accepted March 30, 2012

ABSTRACT

In this work, we describe the results of a comprehensive structural bioinformatics analysis of the spliceosomal proteome. We used fold recognition analysis to complement prior data on the ordered domains of 252 human splicing proteins. Examples of newly identified domains include a PWI domain in the U5 snRNP protein 200K (hBrr2, residues 258–338), while examples of previously known domains with a newly determined fold include the DUF1115 domain of the U4/U6 di-snRNP protein 90K (hPrp3, residues 540–683). We also established a non-redundant set of experimental models of spliceosomal proteins, as well as constructed in silico models for regions without an experimental structure. The combined set of structural models is available for download. Altogether, over 90% of the ordered regions of the spliceosomal proteome can be represented structurally with a high degree of confidence. We analyzed the reduced spliceosomal proteome of the intron-poor organism Giardia lamblia, and as a result, we proposed a candidate set of ordered structural regions necessary for a functional spliceosome. The results of this work will aid experimental and structural analyses of the spliceosomal proteins and complexes, and can serve as a starting point for multiscale modeling of the structure of the entire spliceosome.

INTRODUCTION

The spliceosome is a eukaryotic macromolecular ribonucleoprotein (RNP) complex that performs the excision of introns (non-coding sequences) from pre-mRNAs following transcription. In humans, two forms of the spliceosome exist. The major spliceosome, which excises >99% of human introns, is composed primarily out of four stable small nuclear ribonucleoprotein (snRNP) particles (subunits), named after their small nuclear RNA (snRNA) components: U1, U2, U4/U6 and U5. The minor spliceosome, which is absent in many species and which in human excises the remaining <1% introns, contains a U5 snRNP identical to the one from the major spliceosome, as well as two other snRNPs: U11/U12, and U4atac/U6atac. The U11/U12, and U4atac/U6atac di-snRNPs are distinct from, but structurally and functionally analogous to, the U1 and U2, and U4/U6 di-snRNP, respectively (1). The major human spliceosome contains 45 distinct proteins in its snRNP subunits in addition to around 80 abundant non-snRNP proteins (2). These proteins, together with the snRNAs, may be considered to be an experimental approximation of the ‘core’ of the spliceosome, that is the set of structural elements necessary for the procession of the splicing reaction. Proteomics analyses of spliceosomal proteomes from various species yield also up to over 100 non-abundant splicing proteins (2–8), which may be active e.g. in certain instances of splicing. Out of the 45 distinct snRNP proteins, only seven, the so-called Sm proteins, are present in more than one copy. The Sm proteins form heteroheptamers with a toric shape, one per each of the U1, U2, U4 and U5 snRNPs. In each snRNP, the Sm heteroheptamer forms a platform that supports the respective snRNA. A similar platform associated with the U6 snRNA is composed of a set of seven related ‘like-Sm’ proteins (9).

Splicing-related proteins may also participate in other cellular events, including mRNA transcription (10,11), 5′ capping, 3′ cleavage and polyadenylation, as well as mRNA export, localization and decay (12,13) and box C/D snoRNP formation (14). While the majority of non-snRNP proteins are independent factors, some associate into non-snRNP protein complexes, which include the hPrp19/CDC5L (NTC) complex (15), the exon-junction complex (EJC) (16), the cap-binding complex (CBP) (17), the retention-and-splicing complex (RES) (18), and the transport-and-exchange complex (TREX) (19). These complexes may also have non-splicing functions (16,20).

A characteristic feature of the spliceosome is its extraordinary dynamism, as the snRNP composition of

*To whom correspondence should be addressed. Tel: +48 22 597 0750; Fax: +48 22 597 0715; Email: [email protected]

Nucleic Acids Research, 2012, 1–20doi:10.1093/nar/gks347

� The Author(s) 2012. Published by Oxford University Press.This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Nucleic Acids Research Advance Access published May 9, 2012 by guest on A

ugust 13, 2012http://nar.oxfordjournals.org/

Dow

nloaded from

http://nar.oxfordjournals.org/

a spliceosome entity bound to the substrate pre-mRNAchanges depending on the stage of the splicing reaction.For the major spliceosome, an E (entry) complexspliceosome contains U1 snRNP, an A complex containsU1 and U2 snRNP, a B complex contains U1 and U2snRNP in addition to a tri-snRNP entity composed ofthe U4/U6 and U5 snRNPs, called U4/U6.U5, while theactivated B (B-act) and catalytic (C) complexes containU2, U5 and U6 snRNPs. After the splicing catalysisoccurs and the mRNA is released, the initial configurationof the snRNPs (U1, U2 and U4/U6 and U5 separately) isrecycled (21). Each stage-specific configuration of thesnRNP subunits is also associated with a differentnon-snRNP protein complement. As a result, just likethe snRNP composition, the non-snRNP composition ofa given instance of the spliceosome also varies (2). Inrecent years, evidence has surfaced that ubiquitin-based(22–24) and intrinsic disorder-based (25) systems maycontribute to the regulation of splicing assembly anddynamics.To further the studies of the spliceosome and the asso-

ciation between splicing and other cellular processes, it isuseful to determine the domain architecture and thethree-dimensional structures of spliceosomal proteins.Detailed knowledge of protein structure can help determinehow molecules perform their biological functions.Structure can also aid in understanding the effects of vari-ations, resulting, e.g. from SNPs or from alternativesplicing, which may have implications for disease.Besides, identification of structural similarities can revealdistant evolutionary relationships between proteins thatcannot be detected from a comparison of their sequencesalone (26). Of particular importance is the structuralanalysis of components of larger systems and complexesthat have eluded high-resolution structural characteriza-tion. For instance, it has been suggested that high-resolution models of individual snRNP components maybe fit into molecular envelopes created by low-resolutioncryo-electron microscopy (cryo-EM) maps (27) to con-struct structures of the spliceosome at different stages ofits action (28). Thereby, structural characterization of indi-vidual components of the spliceosome can bring us closerto modeling the structure and function of the entire system.There are two main potential gaps in our understanding

of the structure of the protein components of thespliceosome. The first one lies in recognizing the proteinarchitecture at the primary level, e.g. the detection ofconserved/structured domains and disordered regions.Most structural domains of splicing proteins areannotated by automated inferences in protein sequencedatabases such as UniProt (29). Many domains, especiallythose of the ‘core’ splicing proteins, have also beencharacterized in literature. However, automated annota-tions are limited in that they can only either spread infor-mation that is already available in the system (such asthrough homology inferences) or information thatconforms to tight preset standards (such as in the detec-tion of domains that conform to PFAM domain profiles)(30). Hence, at times, elements of protein architectureremain undetected throughout automated annotation,

and can only be determined through additional analysesand human interpretation of other data.

The second gap lies in the lack of structural representation. Partial or complete structures have been determined for many splicing-related proteins and their complexes. These include a nearly complete U1 snRNP (31), U4 snRNP core with the Sm ring (32), several complexes associated with the spliceosome such as the human EJC (33) or the human CBP (34) and various protein–protein and protein–RNA complexes, such as the human U2 snRNP protein p14 (SF3b14a) bound to a region of SF3b155 (35). In total, as of December 2011, data from the Protein Data Bank (PDB) (36) show that at least 340 structures have been determined by X-ray crystallography and NMR for human spliceosomal proteins or their domains, either alone or in various complexes. Many of these structural models are redundant because they represent the same regions of the same proteins. However, for many regions, no three-dimensional models are available.

As an essential step towards enhancing our current understanding of the spliceosome, we have carried out a systematic structural bioinformatics analysis of the proteins of the human spliceosomal proteome, with a dual focus on characterizing their ordered parts and modeling their structures. In an effort to help set the priorities for future modeling of the entire spliceosome, we also compared the human spliceosomal proteome with the proteome of the parasitic diplomonad Giardia lamblia, known for its genomic minimalism. We put forward the set of structural regions common for human and G. lamblia as an attractive target for future studies. This analysis complements a parallel study of the unstructured part of the proteins of the spliceosome (I.K. and J.M.B., submitted for publication), and runs alongside efforts of many research groups to characterize the structure of spliceosomal RNAs and map out the interactions between the spliceosomal components.

MATERIALS AND METHODS

Collection and classification of spliceosome proteins

A total of 244 proteins found in the proteomics analyses of the major human spliceosome [sourced from one or more of the following references (2,4,8,37–41)], and 8 proteins specific to the U11/U12 di-snRNP subunit of the minor spliceosome (Supplementary Table S1) (42), were downloaded from the NCBI Protein (nr) database. Proteins were classified as ‘abundant’ and ‘non-abundant’ according to (2), and they were assigned into groups based mainly on (2), followed by references (4,38–40). Proteins classified here as ‘miscellaneous’ were classified in primary sources, variably, as ‘miscellaneous proteins’, ‘miscellaneous splicing factors’, ‘additional proteins’, ‘proteins not reproducibly detected’ and ‘proteins not previously detected’. We disclaim any responsibility for the factual accuracy of the association of proteins with the relevant groups beyond the point of following the primary sources.




Sequence searches, alignments and clustering

Searches of protein homologs in the NCBI Protein (nr) database were carried out at the NCBI using BLASTP/PSI-BLAST (43) with default parameter settings. Putative homology was validated by reciprocal BLASTP searches against the Protein database with ‘human’ (NCBI taxon id: 9606) as a taxon search delimiter. Sequence alignments were calculated using the MAFFT server using the Auto strategy (http://mafft.cbrc.jp/alignment/server/) (44). Clustering analysis of helicase sequences was performed with CLANS (45).
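The reciprocal-search validation described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the procedure as actually run: it assumes that the top hits of the forward and reverse BLASTP searches have already been parsed into plain dictionaries, that a "best hit" criterion is used to accept a pair, and the function name `reciprocal_best_hits` and all identifiers in the example are hypothetical.

```python
def reciprocal_best_hits(forward_top_hit, reverse_top_hit):
    """Return query/hit pairs that are confirmed by a reciprocal search.

    forward_top_hit: maps each human query ID to its best hit in another
                     proteome (hypothetical, pre-parsed BLASTP output).
    reverse_top_hit: maps each of those hits to its best hit when searched
                     back against human sequences (NCBI taxon id 9606).
    """
    confirmed = {}
    for query, hit in forward_top_hit.items():
        # Keep the pair only if the reverse search returns the original query.
        if reverse_top_hit.get(hit) == query:
            confirmed[query] = hit
    return confirmed


# Hypothetical identifiers, for illustration only.
forward = {"human_A": "giardia_1", "human_B": "giardia_2"}
reverse = {"giardia_1": "human_A", "giardia_2": "human_C"}
validated = reciprocal_best_hits(forward, reverse)
print(validated)
```

In this toy input only the first pair survives, because the reverse search for the second hit returns a different human protein.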

Identification and description of structural regions of proteins

Identification of intrinsically ordered and disordered regions of proteins, prediction of protein secondary structure and domain boundaries, as well as fold-recognition (FR) analyses, were carried out via the GeneSilico MetaServer gateway (for references to the original methods, see https://genesilico.pl/meta2) (46). In non-trivial cases (usually when putative modeling templates returned by FR scored low and/or various methods disagreed on the best template), FR alignments to the top-scoring templates from the PDB were compared, evaluated and ranked by the PCONS server (47), and the PCONS result was used to identify region boundaries. Additional searches were performed on the HHPRED server (48).

SCOP database (49) IDs used for the purposes of structural domain identification were extracted from the Protein Data Bank or from the SCOP parseable files on the SCOP website (http://scop.mrc-lmb.cam.ac.uk/scop/parse/index.html), or assigned using the fastSCOP server (http://fastscop.life.nctu.edu.tw/) (50). PFAM domain names were assigned on the PFAM website (http://pfam.sanger.ac.uk/). SCOP v. 1.75 and PFAM v. 25.0 were used. Structural similarity was compared using the DALI server (51).

Assignment of models to structural regions of proteins

In assigning structural models to regions, we followed a four-step procedure (Figure 1). Whenever a high-resolution experimental structural model (either X-ray or NMR structure) was available, we assigned it to the corresponding sequence region. If a structural similarity to a protein of known structure was predicted for a given region by fold-recognition algorithms (see below for details), we constructed a model for this region by a comparative (template-based) modeling technique, using the detected experimental structures as templates. In the absence of confidently predicted templates, we used de novo folding methods for relatively small fragments likely to form globular domains. For the remaining regions (those without experimentally solved structures and for which the current modeling methodology cannot provide confident predictions of the 3D structure), we generated pro forma models, in which only the primary and (predicted) secondary structure was represented explicitly, while the tertiary arrangement was arbitrary. Pro forma models are not supposed to be reliable at the tertiary level and were constructed for the sake of further analyses (e.g. to initialize protein folding analyses that require some kind of a structural representation as an input).

Figure 1. Rules for selecting and producing structural representations of protein regions. From left to right, structural representations decrease in the average confidence.

For regions with multiple solved structures in the Protein Data Bank, the following criteria of preference were used: (i) structures of the region in complex with other proteins and/or nucleic acids (i.e. in a potentially ‘active’ or ‘functionally relevant’ state) were given priority over structures of the region in isolation, (ii) crystallographic structures were given priority over NMR structures, (iii) higher-resolution crystallographic structures were given priority over lower-resolution structures and (iv) more complete structures were given priority over less complete structures. The following experimental artifacts were removed from experimental structure files or corrected by standard modeling procedures: non-native sequences added to aid in the protein expression and structure determination process (e.g. affinity tags), non-standard amino acids (e.g. selenomethionine was replaced by methionine), and gaps in sequences (e.g. short disordered loop fragments were added). Single chains only were retained if the original PDB file contained multiple chains of the same protein.

Comparative models were constructed by default with MODELLER (52) based on templates identified in the fold-recognition process. Selected challenging models were constructed using the I-TASSER server (53). Selected models were also adjusted with ROSETTA 3.0/3.1 using the loop modeling mode (54). De novo models were produced with the ROSETTA 3.0/3.1 AbInitioRelax application and clustered with the Rosetta 3.0/3.1 Cluster Application, following the protocols set out in the ROSETTA User Guide for version 3.1 (http://www.rosettacommons.org/manual_guide) (54). De novo folding was attempted if the following conditions were fulfilled: the region was ≤125 residues in length, predicted to be completely ordered and predicted to contain secondary structure elements. These conditions correspond to the current practical limit of utility of this type of methods (55). Artificial pro forma spatial representations of protein chains of unknown/uncertain structure or predicted to lack a stable structure were built with UCSF Chimera (v.1.4/1.5) using the Tools > Structure Editing > Build Structure command (56). Pro forma constructs reflect only the known primary and predicted secondary structure of the corresponding regions, while their tertiary structure should be regarded as unassigned (and remains to be modeled in the future). Miscellaneous manipulations of structures and models of molecules during this stage were performed in UCSF Chimera (56) and Swiss-PdbViewer v. 4.0.1 (57).
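The four-step assignment procedure and the preference criteria (i) to (iv) for redundant experimental structures can be summarized as a small decision routine. The sketch below is illustrative only. The class and function names are hypothetical, applying the four criteria in strict lexicographic order is an assumption, the 125-residue limit is read as inclusive, and the PDB identifiers in the example are placeholders rather than real entries.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExperimentalStructure:
    pdb_id: str
    in_complex: bool             # criterion (i)
    method: str                  # "X-ray" or "NMR", criterion (ii)
    resolution: Optional[float]  # angstroms, None for NMR, criterion (iii)
    coverage: float              # fraction of the region resolved, criterion (iv)


def preference_key(s):
    """Sort key in which smaller tuples are preferred (assumed strict ordering)."""
    return (
        not s.in_complex,       # complexes before isolated regions
        s.method != "X-ray",    # crystal structures before NMR structures
        s.resolution if s.resolution is not None else float("inf"),
        -s.coverage,            # more complete structures first
    )


def choose_representation(structures, has_confident_template, length,
                          fully_ordered, has_secondary_structure):
    """Return (tier, chosen experimental structure or None) for one region."""
    if structures:
        return "experimental", min(structures, key=preference_key)
    if has_confident_template:
        return "comparative", None
    if length <= 125 and fully_ordered and has_secondary_structure:
        return "de novo", None
    return "pro forma", None


# Placeholder entries, not real PDB records.
candidates = [
    ExperimentalStructure("AAAA", False, "X-ray", 1.5, 1.0),
    ExperimentalStructure("BBBB", True, "NMR", None, 0.9),
    ExperimentalStructure("CCCC", True, "X-ray", 2.4, 0.8),
]
tier, best = choose_representation(candidates, False, 110, True, True)
print(tier, best.pdb_id)
```

Under these assumptions the crystal structure of the complex is chosen even though an isolated structure has better resolution, which mirrors the stated priority of functional context over plain accuracy.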

Protein model quality assessment

Assessment of model quality was performed with MetaMQAPII [https://genesilico.pl/toolkit/unimod?method=MetaMQAPII, an updated version of a method described in (58)] and QMEAN [http://swissmodel.expasy.org/qmean/ (59)].

MetaMQAP predicts the deviation of the query model from the (unknown) native structure and expresses it as the predicted global root mean square deviation (RMSD) and the predicted global distance test total score (GDT_TS) (60). The lower the predicted RMSD and the higher the predicted GDT_TS score, the better the model.

QMEAN first calculates an internal score, and then the QMEAN Z-score indicates by how many standard deviations the QMEAN score of the model differs from expected values for experimental structures that have a similar length to the model. High quality models are expected to have positive QMEAN Z-scores, and good models are expected to have a QMEAN Z-score above −2.0. Indicators of accuracy of individual residues were generated by MetaMQAPII and are supplied as B-factor values inside the model files available from the SpliProt3D database website (see below). They can be visualized with the UCSF Chimera command Render By Attribute > (attributes of residues: average B-factor) or with equivalent commands in other molecular visualization programs. Mean values and standard deviations of the QMEAN Z-scores for the six QMEAN contributing factors are provided with this publication (Supplementary Table S4) and the values for all models are provided with the model files. Models of low quality are expected to have a strongly negative QMEAN Z-score, but also strongly negative Z-scores for most of the contributing terms.

As MetaMQAPII is not capable of evaluating multimeric models, for models of protein complexes (11 X-ray models and 2 NMR models) only the quality of the longest chain was evaluated by MetaMQAPII.
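Because the per-residue accuracy indicators are stored in the B-factor field of the model files, they can also be read without a molecular viewer. The following sketch assumes the files are in standard PDB format, where the temperature factor occupies columns 61–66 of each ATOM record, and averages that field over the atoms of each residue. The function names are hypothetical, `atom_line` exists only to generate well-formed demonstration records, and the coordinates and values are invented.

```python
def per_residue_bfactor(pdb_lines):
    """Average the B-factor column over the atoms of each residue.

    Returns a dict keyed by (chain ID, residue number, insertion code).
    """
    totals = {}
    for line in pdb_lines:
        if not line.startswith("ATOM"):
            continue
        key = (line[21], int(line[22:26]), line[26].strip())
        b_factor = float(line[60:66])  # columns 61-66 of the PDB format
        running_sum, count = totals.get(key, (0.0, 0))
        totals[key] = (running_sum + b_factor, count + 1)
    return {key: s / n for key, (s, n) in totals.items()}


def atom_line(serial, name, res_name, chain, res_seq, x, y, z, b_factor):
    """Hypothetical helper: write one fixed-width ATOM record for the demo."""
    return (f"ATOM  {serial:5d} {name:<4s} {res_name:>3s} {chain}{res_seq:4d}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}{1.0:6.2f}{b_factor:6.2f}")


# Invented coordinates and scores, for illustration only.
demo = [
    atom_line(1, "N", "MET", "A", 1, 11.104, 13.207, 2.100, 2.0),
    atom_line(2, "CA", "MET", "A", 1, 12.560, 13.329, 2.250, 4.0),
    atom_line(3, "N", "SER", "A", 2, 13.000, 14.500, 3.100, 6.5),
]
scores = per_residue_bfactor(demo)
print(scores)
```

Averaging over atoms corresponds to the "average B-factor" residue attribute mentioned above; if every atom of a residue carries the same value, the average simply returns it.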

Website/database of models

Models and additional data, including alignments of representative sequences annotated with predictions of order/disorder, secondary structure, binding disorder, solvent accessibility and coiled coils, as well as annotations of sites of post-translational modification from UniProt (29), are available via the SpliProt3D web server at http://iimcb.genesilico.pl/spliprot3D. The entire archive of files available for download is approximately 250 MB in size.

Visualization of sequence alignments and molecular structures

Sequence alignments were visualized with Jalview v. 2.6.1 (61), while molecular structure graphics were produced with UCSF Chimera (56).

RESULTS AND DISCUSSION

Identification of structural domains of splicing proteins

Our main priorities in identifying structural domains of splicing proteins were to check and correct previously reported domain boundaries and to identify and characterize domains that were not available in UniProt and other databases. We focused on 252 proteins of the human spliceosome, including 244 proteins found in the results of proteomics analyses of the major human spliceosome and 8 proteins specific to the U11/U12 subunits of the minor spliceosome (see ‘Materials and Methods’ section for references to protein sources and Supplementary Table S1 for protein GIs). We did not find any references to U4atac/U6atac-specific proteins either in literature or in the Gene Ontology (GO) database [http://geneontology.org (62)]. A total of 118 proteins were classified as ‘abundant’ as in (2); other proteins were classified as ‘non-abundant’. ‘Abundant’ proteins are suggested to be the most important for the correct action of the spliceosome (2).

Using a combination of protein fold-recognition and sequence conservation-based domain identification methods, we identified 465 ordered structural domains in the 252 proteins, including 80 domains in the snRNP proteins of the major human spliceosome (Table 1 and Supplementary Table S2). Ordered structural domains cover >80% of the ordered regions of the proteins, and ~50% of all residues in the splicing proteins. Correspondingly, close to a half of the human spliceosomal proteome is predicted to be intrinsically disordered. The analysis of various structural and functional types of intrinsic disorder in the spliceosome brought about a quantity of data whose presentation is beyond the scope of this article and that has been consequently made the subject of an independent article (I.K. and J.M.B., submitted for publication).

Table 1. Statistics of structural domains detected in the human spliceosomal proteome

Feature | Major spliceosome snRNP | All proteins
Number of proteins | 45 | 252
Number of residues | 20 390 | 133 040
Number of ordered residues | 13 427 | 63 242
Number of ordered structural domains | 80 | 465
Number of suspected ordered structural domains | 7 | 25
Number of domains predicted to be disordered, but found to be ordered in experimentally determined structures | 3 | 9
Fraction of ordered residues covered by ordered structural domains (%) | 89.6 | 90.3
Fraction of total number of residues covered by ordered and disordered structural domains (%) | 61.0 | 43.4
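The statement that close to half of the proteome is predicted to be disordered can be checked against the residue counts reported in Table 1. The short calculation below is a worked example only; the dictionary and function names are hypothetical, and the counts are copied from the table.

```python
# Residue counts copied from Table 1.
total_residues = {"snRNP": 20390, "all": 133040}
ordered_residues = {"snRNP": 13427, "all": 63242}


def ordered_percentage(group):
    """Percentage of residues predicted to be ordered in one protein group."""
    return 100.0 * ordered_residues[group] / total_residues[group]


percentages = {group: ordered_percentage(group) for group in total_residues}
for group, pct in percentages.items():
    print(f"{group}: {pct:.1f}% ordered, {100.0 - pct:.1f}% disordered")
```

For the full set of 252 proteins this gives roughly 47.5% ordered residues, so slightly more than half of all residues fall in predicted disordered regions, whereas the snRNP proteins of the major spliceosome are noticeably more ordered.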

Based on the predicted order/disorder boundaries and the presence/absence of predicted secondary structure elements, we also detected 25 regions that we termed ‘suspected domains’. This category included two groups of regions. The first group were domain-length (>40 residues) regions without a recognized fold that were the only ordered regions of otherwise highly intrinsically disordered proteins (≥70% residues predicted to be disordered). The second group were present in proteins with low-to-middle intrinsic disorder content (<70% residues predicted to be disordered) that contained other ordered structural domains. The ‘suspected domains’ in these proteins were ordered regions that had clear order/disorder boundaries and contained predicted secondary structure elements, but lacked a PFAM domain assignment (30) and showed no clear relationship to any known folds according to protein fold-recognition analyses.
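The two groups of ‘suspected domains’ can be expressed as a simple rule. The sketch below is a simplified reading of the criteria above and not the procedure used to produce the 25 regions: the function and parameter names are hypothetical, the 70% disorder cutoff is treated as inclusive for highly disordered proteins (an assumption), and the requirement of clear order/disorder boundaries is assumed to have been checked upstream.

```python
def suspected_domain_group(length, recognized_fold, pfam_hit, predicted_sse,
                           protein_disorder, other_ordered_domains):
    """Return 1 or 2 for the two groups of 'suspected domains', else None.

    length: length of the ordered region in residues
    recognized_fold: True if fold recognition found a confident match
    pfam_hit: True if the region has a PFAM domain assignment
    predicted_sse: True if secondary structure elements are predicted
    protein_disorder: fraction of the whole protein predicted disordered
    other_ordered_domains: number of other ordered domains in the protein
    """
    if recognized_fold:
        return None
    if protein_disorder >= 0.70:
        # Group 1: the only ordered, domain-length region of a highly
        # disordered protein.
        if length > 40 and other_ordered_domains == 0:
            return 1
        return None
    # Group 2: unassigned ordered region in a protein that has other domains.
    if other_ordered_domains > 0 and predicted_sse and not pfam_hit:
        return 2
    return None


# Invented example regions.
print(suspected_domain_group(60, False, False, True, 0.85, 0))
print(suspected_domain_group(80, False, False, True, 0.30, 2))
```

A region with a recognized fold or a PFAM assignment is never returned, since it would already count as an ordinary ordered structural domain.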

Ordered domains of splicing proteins classified in the SCOP (49) catalogue belong to classes a–e and g, with an over-representation of class d, which contains superfamily d.58.7 [RNA-binding domain, RRM (RBD)], usually corresponding to PFAM domain PF00076, RRM_1 (Table 2). RRM is present in the 252 proteins in as many as 117 copies. This means that roughly every fourth to fifth domain in the spliceosomal proteome is an RRM. As RRM is a small domain that usually binds single-stranded RNA (63,64), this reflects the key role of protein–RNA interactions in the splicing process.

Other common types of ordered protein regions found in the human spliceosomal proteome include other small RNA-binding domains, large α- and β-repeat-based protein-binding domains, small protein disorder-binding domains, ubiquitin-related domains and stable multidomain RNA helicase architectures (Table 3). Repeat-based domains are often found as building blocks of protein complexes, while some of the ubiquitin-related domains have been shown to be part of a putative ubiquitin-based system of controlling spliceosome assembly and dynamics (22,65).

In addition to ordered domains, we found nine regions with an expected independent function that were predicted to be disordered, but that were either found in experimental structures or could be confidently modeled due to strong sequence matches to known domains. We considered these nine regions to be putative disordered domains that undergo a transition to order upon entering a complex. We discuss the features of these domains in an independent article that focuses specifically on intrinsic disorder in the spliceosomal proteome (I.K. and J.M.B., submitted for publication). Here, we will only note that, in general, the identification of disordered structural domains is currently a non-trivial task in comparison with the identification of ordered structural domains, as fewer experimentally validated examples of disorder exist in databases and the properties of disorder make automated identification and propagation more difficult.

Table 3. Common types of ordered structural domains in the human spliceosomal proteome

Domain type | Example PFAM domains | Number of copies | Examples of proteins
Small RNA-binding domains | RRM_1 (a), PWI, KH_1, S1, KOW, dsrm, G-patch, Surp (b), SAP, zf-CCCH, zf-U1 (c), zf-met (c), zf-C2H2_jaz (c), zf-U11-48K, zf-CCHC, FYVE | ~201 | U1-A, U1-70K, U1-C
Small protein disorder-binding domains | WW, FHA, FF, GYF, SMN, SH3_1 | ~24 | FBP11, U5-52K (CD2BP2)
Repeat-based protein-binding domains | Arm, TPR/HAT, HEAT, LRR_4, WD40 repeats | ~28 | U4/U6-60K (hPrp4), U5-102K (hPrp6), SF3b155, U2-A’
Ubiquitin-related domains | Ubiquitin, U-box, zf-UBP, UCH, Rtf2, zf-C3HC4, ZZ, DWNN, RWD, JAB+PROCT | ~19 | SF3a120, U4/U6.U5-65K, RNF113A
Heat shock-related | DnaJ, HSP70, HSP20, CS | ~6 | CCAP1
Proline isomerase | Pro_isomerase | 8 | U4/U6-20K (PPIH)
Stable helicase architectures | DEAD+Helicase_C, DEAD+Helicase_C+HA2+OB_NTP_bind, (DEAD+Helicase_C+Sec63) × 2, Upf1p-like | ~19 | hPrp43 (DHX15), U5-200K (hBrr2), KIAA0560 (AQR)
Small domains that act as ligands | U1snRNP70_N, SF3b1, PRP4, SF3a60_bindingd | ~6 | SF3b155, U4/U6-60K (hPrp4)
Sm/Lsm domains | LSM | 14 | Sm, Lsm proteins

(a) Some RRM domains bind peptide ligands (66).
(b) The Surp domain is predicted to bind RNA. However, in the only structure of a Surp domain in complex (PDB ID: 2DT7), the Surp domain binds a peptide ligand.
(c) Some zf-C2H2 domains mediate protein binding.

Table 2. Statistics of ordered structural domains of the human spliceosome according to the SCOP classification

SCOP ID | Description | Number of domains
a | All α | 79
b | All β | 83
c | α and β (α/β) | 53
d | α and β (α+β) | 159
e | Multi-domain (α and β) | 1
g | Small | 49




ownloaded from


Non-redundant set of experimental and theoretical structural models

Following the identification of domains, we constructed a non-redundant set of experimental and theoretical structural models of regions in splicing proteins. As the utility and credibility of models, both experimental and theoretical, depends on their accuracy, we set some simple heuristic rules of preference to increase the chance that we chose the models with the best quality. We preferred experimental models over theoretical models, X-ray experimental models over NMR experimental models and comparative theoretical models over de novo theoretical models (Figure 1). The lowest tier in the hierarchy was pro forma constructs, in which only the primary and secondary structure were represented explicitly, while the tertiary arrangement was arbitrary. As a result, we mapped 104 non-redundant experimental models to the sequences of the spliceosomal proteins, and created 255 comparative and 43 de novo models (Table 4 and Supplementary Table S3), as well as over 500 pro forma constructs. The 104 non-redundant experimental models include 23 models of (nucleo)protein complexes, of which 13 complexes have residues from more than one spliceosome-associated protein. While models of complexes tend to have lower accuracy than models of isolated chains, we considered them to be more informative about protein function than models of isolated chains. This was the only instance where we favored the availability of additional information over plain accuracy of the structure.

Over 90% of ordered regions of splicing proteins can be associated with experimental structural information or with comparative and de novo models (Figure 2). This value is similar for the proteins of the snRNP subunits of the major spliceosome and other proteins associated with the human spliceosome. Between different types of structural representations, experimentally determined structural models cover 20.6% of all ordered residues, the comparative models we generated cover 67.4% of all ordered residues, and the de novo models cover 4.8% of all ordered residues. Hence, our theoretical models cover more than three times the length of ordered protein sequence covered by experimental models.

X-ray crystallography is useful for the structure determination of large proteins (>30 kDa) and protein complexes, while NMR is well-suited for the structure determination of relatively small proteins. Not surprisingly, the ratio of the number of ordered residues in proteins from snRNP subunit structures solved by X-ray crystallography versus NMR is ~3:1 (15.7%:4.7%), while this ratio for all splicing proteins is ~1.77:1 (13.4%:7.2%). The main reason for this is that small domains are statistically more common in the general set of splicing proteins than in the snRNP subunits. Conversely, most structures of protein–protein complexes available for splicing proteins include regions from snRNP proteins. Since the resolution (and hence accuracy) of experimentally determined structures is typically inversely correlated with the molecule or complex size, X-ray models of snRNP proteins have on average a slightly worse resolution (mean 2.20 Å) than X-ray models of all spliceosomal proteins (mean 2.08 Å).

For predicted disordered regions, confident structural coverage is very low in comparison to ordered regions. Less than 2% of residues predicted to be disordered are covered by experimental models, and even together with our theoretical models, we could only cover 8.9% of all disordered residues. Moreover, most of the residues covered belong to linkers between ordered structural domains or short regions in protein termini. This low coverage of intrinsically disordered regions by structural models may in the future pose a considerable challenge to producing a comprehensive structural model of the spliceosome.

Assessment of model quality

For all models except pro forma constructs, we also independently evaluated their accuracy to determine how credible they were. To do this, we used two methods: MetaMQAPII (58) and QMEAN (59). Both of them provide a global score for the entire model (predicted RMSD for MetaMQAPII, QMEAN Z-score for QMEAN) as well as a local score for individual residues (in this analysis, only the MetaMQAPII score was used). Functionally relevant and evolutionarily conserved regions (e.g. binding interfaces) are typically predicted with a higher than average accuracy, in particular when comparative modeling is used. Consequently, even a model with a poor global score can be useful for functional considerations, if its functionally important parts are scored well and are likely to be accurate. Some readers may also be interested in scores that describe only the model’s quality with respect to a particular feature (e.g. secondary structure). To help describe different features of models, we recorded the mean values and standard deviations of QMEAN Z-scores for six QMEAN contributing factors. These values for all models are provided with the manuscript (Supplementary Table S4).

Table 4. Structural representations of regions of proteins of the human spliceosomal proteome

Feature | Major spliceosome snRNP | All proteins
Number of proteins | 45 | 252
Number of residues | 20 390 | 133 040
Number of ordered residues | 13 427 | 63 242
Number of non-redundant experimental models | 20 | 104
Number of non-redundant X-ray models | 11 | 43
Mean resolution of X-ray models (Å) | 2.20 | 2.08
Number of non-redundant NMR models | 9 | 61
Number of non-redundant theoretical models | 49 | 297
Number of non-redundant comparative models | 37 | 255
Number of non-redundant de novo models | 13 | 43
Total number of non-redundant representations | 139 | 803
Number of experimental models containing residues of more than one splicing protein (X-ray/NMR) | 9 (8/1) | 13 (11/2)
Total fraction of structural order covered (%) | 91.2 | 92.7
Total fraction of combined protein sequence covered (%) | 64.3 | 48.7

Figure 2. Coverage of structural order and disorder with different types of structural models. The values displayed on the graph are the number of residues covered by a given type of structural model, followed by the percentage value.

For comparison with theoretical models, we ‘predicted’ the global quality of experimentally determined structures (Supplementary Figure S1). As expected, both the X-ray and the NMR models we selected for our data set are scored highly by both MetaMQAPII and QMEAN, which is an indicator of the high accuracy of these structures (Table 5; for RMSD, the lower the score, the better the model; for the QMEAN Z-score, good models are scored higher). Mean QMEAN Z-scores for models of both types (0.42 for X-ray and 0.08 for NMR) compare favorably to mean QMEAN Z-scores of models across the entire PDB (−0.58 and −1.19, respectively) (67). As X-ray models in our database were scored slightly better than NMR models, we used scores for X-ray models as a benchmark with which to classify theoretical models into those ‘likely to be globally accurate’ or ‘unlikely to be globally accurate’. The worst-scored X-ray models in our data set have a predicted RMSD of 4.5 Å (PDB ID 2ok3, resolution 2.0 Å) and a QMEAN Z-score of −1.99 (PDB ID 2qfj, resolution 2.10 Å). Consequently, we divided all non-X-ray models into four classes depending on passing one or both thresholds: predicted RMSD ≤4.5 Å and QMEAN Z-score ≥−2.0 (Figure 3).

The majority of both NMR and theoretical models belong to the most reliable class (i.e. ‘scored not worse than the worst crystal structures in the data set’). These models are expected to be generally correct, although their local accuracy may vary. Models scored well by only one method should be treated with more caution than models scored well by both methods. However, poor scoring by one method may also be due to the model being either very short or very long. Models that are scored poorly by MetaMQAPII but well according to the QMEAN Z-score are usually short, while models that are scored high by MetaMQAPII and low by QMEAN are usually long. The mean length of a model scored well by both methods is 220 residues, but the mean length of a model scored well only by QMEAN is 70 residues and the mean length of a model scored well only by MetaMQAPII is 362 residues. Therefore, we urge the reader to consider the length of a model before using models scored poorly by only one method.

Over 40 models are scored poorly by both MetaMQAPII and QMEAN. These models may have been built on remotely related templates or did not fold well when modeled de novo, and are expected to contain various errors. Based on our previous experience, we believe that some of these cases may represent new protein folds or interesting variations of known folds that present a considerable challenge for protein modeling methods. Hence, while we regard these models as unreliable, we propose the corresponding proteins or domains as attractive targets both for experimental protein structure determination and for protein modeling with other advanced techniques.
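The four-class scheme described above can be illustrated with a minimal Python sketch. It assumes the two thresholds are inclusive (predicted RMSD of at most 4.5 Å, QMEAN Z-score of at least −2.0); the function name, class labels and constants are hypothetical and are not part of MetaMQAPII, QMEAN or SpliProt3D. The example values are taken from the figure captions and text of this article.

```python
# Thresholds derived from the worst-scored X-ray models in the data set.
# Treating them as inclusive bounds is an assumption of this sketch.
RMSD_MAX = 4.5     # predicted RMSD (MetaMQAPII), in angstroms
ZSCORE_MIN = -2.0  # QMEAN Z-score


def quality_class(pred_rmsd: float, qmean_z: float) -> str:
    """Assign a model to one of four classes by which thresholds it passes."""
    rmsd_ok = pred_rmsd <= RMSD_MAX
    z_ok = qmean_z >= ZSCORE_MIN
    if rmsd_ok and z_ok:
        return "both"
    if rmsd_ok:
        return "MetaMQAPII only"
    if z_ok:
        return "QMEAN only"
    return "neither"


# Values quoted elsewhere in this article (hypothetical labels as keys).
examples = {
    "FLJ35382 ubiquitin-fold region": (3.5, -1.33),
    "hPrp3 DUF1115": (3.7, -3.06),
    "SF3a66 all-beta domain (original model)": (4.7, -0.92),
    "hPrp2 PWI-like region": (5.8, -2.19),
}
for name, (rmsd, z) in examples.items():
    print(f"{name}: {quality_class(rmsd, z)}")
```

A model that passes only one threshold should be read together with its length, as discussed above: short models tend to pass only the QMEAN threshold and long models only the MetaMQAPII threshold.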

Database

The entire non-redundant set of representations (including selected representative models determined by experimental methods, and all theoretical models built with computational methods) is available as an online database, SpliProt3D, at http://iimcb.genesilico.pl/SpliProt3D. The web server allows for browsing, selecting and downloading the models. Proteins are also associated with sequence alignments annotated with predictions of intrinsic order versus disorder, predictions of secondary structure, protein-binding disorder, solvent accessibility and coiled-coils, as well as the positions of post-translational modifications. The database will be curated: new entries will be added and obsolete ones archived, following the progress in structure determination of new spliceosomal proteins and/or publication of new theoretical models with better predicted accuracy. We would like to encourage structural biologists working on structure determination or prediction for spliceosomal proteins to contact us to have their models included and referenced in our database.

Figure 3. Models of regions of human splicing proteins divided by quality. This bubble graph displays the numbers of models of different types that belong to different classes of quality. Mean length_comp is the mean length of a comparative model of a given quality class.

Table 5. Predicted quality of models of regions of human spliceosomal proteins

Feature | X-ray, mean (SD) | NMR, mean (SD) | Comparative, mean (SD) | De novo, mean (SD)
Number of models | 43 | 61 | 255 | 43
Predicted RMSD (MetaMQAPII) | 1.90 (0.84) | 3.85 (1.82) | 4.53 (1.96) | 4.02 (1.50)
Predicted GDT_TS (MetaMQAPII) | 78.56 (12.78) | 55.94 (19.45) | 47.28 (21.35) | 45.59 (15.85)
QMEAN total score | 0.805 (0.087) | 0.744 (0.110) | 0.585 (0.164) | 0.562 (0.132)
QMEAN Z-score | 0.42 (0.87) | 0.08 (0.86) | −1.30 (1.43) | −1.42 (1.33)







Comparison of predictions with the experimentally determined SF3A structure

After submission of this article for review, a crystal structure of the yeast U2 snRNP SF3A sub-complex was published (68), giving us an opportunity to compare some of our predictions with the independently determined experimental structure.

The structure of the yeast SF3A complex includes, in addition to several regions composed of individual secondary structure elements, three ordered domains for which an experimental structure had not been published before. One domain in the yeast protein Prp9 is >200 residues long (its counterpart in the human protein SF3a60 is situated roughly between residues 1–77, 129–244 and 310–372); it features a novel helical architecture. Originally, we made no tertiary structural predictions for this domain (i.e. our database contained only constructs), and it is highly unlikely that the structure of this domain could have been predicted accurately by a standard bioinformatics approach. Another domain in the yeast Prp9 is a zf-C2H2 zinc finger inserted into the long helical domain, whose counterpart in the human protein SF3a60 lacks the Zn-binding residues and is closely neighbored by another insertion, that of a SAP domain. Despite these differences, in our original model of this domain (with a predicted RMSD of 8.8 Å and a QMEAN Z-score of −1.93), we correctly predicted the fold and the position of nearly all residues in this zinc finger. We also correctly predicted the boundaries and the fold of an all-β domain in the human protein SF3a66, a counterpart of the yeast protein Prp11. The original comparative model of this domain had a predicted RMSD of 4.7 Å and a QMEAN Z-score of −0.92, with a medium reliability of the fold prediction. In practice, upon comparison, this translated to predicting the position of approximately half of the residues in the domain correctly. This analysis demonstrates the utility of the predictions, and shows that even models with relatively low predicted accuracy can, in fact, exhibit correct folds, spatial shapes and locations of some of the functionally important residues.

Given the availability of the new template, we generated new models for the human counterparts of the SF3A crystal structure, using the comparative approach. We also generated a new comparative model for a domain in the C-complex-related protein cactin (NY-REN-24/C19orf29, gi: 126723149), as this protein is predicted to have a domain with the same all-β fold as the SF3a66 domain. The new models have been deposited in the database, while the old models have been moved to the archive of the ‘obsolete’ entries and are still available for analysis.

Ubiquitin-related domains are most common in the proteins of the late stages of splicing

Given the known role of ubiquitin in controlling spliceosome assembly and dynamics (21,22), and the fact that ubiquitin-related domains are one of the largest groups of domains in splicing proteins, we were interested in learning how these domains were distributed across the different groups of splicing proteins. We found 19 potential or known ubiquitin-related domains in 15 splicing-related proteins, including 12 abundant proteins of the major spliceosome and one protein of the U11/U12 di-snRNP subunit of the minor spliceosome (Table 6 and Figure 4). These domains cover most of the main classes of ubiquitin-related domains, including ubiquitin fold domains, RING zinc finger/U-box domains that may act as ubiquitin ligases, a ubiquitin conjugating enzyme-like domain, a ubiquitin carboxyl-terminal hydrolase domain and the JAB1/MPN domain of protein U5-220K (hPrp8) described in (23). In several cases, such as that of the abundant C-complex-specific protein FLJ35382 (C1orf55) and the TREX complex protein THOC5, only similarity of a protein region to a known ubiquitin-related fold could be detected.

Table 6. Ubiquitin-related regions in the spliceosomal proteome

Type of domain | SCOP ID | PFAM ID | Protein | Protein region | Protein group
Ubiquitin | d.15.1 | Ubiquitin | SF3a120[a] | 689,785 | U2 snRNP
 | d.15.1 | Ubiquitin | U11/U12-25K (C16orf33) | 41,132 | U11/U12 di-snRNP
 | d.15.1 | SAP18 | SAP18[a] | 18,140 | EJC
 | d.15.1 | ubiquitin | UBL5 | 1,73 | B complex
 | d.15.1 | – | FLJ35382 (C1orf55)[a] | 7,74 | C complex
 | d.15.1 | XAP5 | XAP-5 (FAM50A)[a] | 197,283 | C complex
DWNN | d.15.2 | DWNN | RBQ-1 | 3,77 | Miscellaneous
RING zinc finger/U-box | g.44.1 | zf-UBP | U4/U6.U5-65K (USP39)[a] | 97,200 | U4/U6.U5 tri-snRNP
 | g.44.1 | U-box | hPRP19[a] | 1,60 | hPrp19/CDC5L
 | g.44.1 | Rtf2 | Cyp-60[a] | 36,94 | B-act complex
 | g.44.1 | Rtf2 | Cyp-60[a] | 101,161 | B-act complex
 | g.44.1 | zf-C3HC4 | RNF113A[a] | 256,319 | B-act complex
 | g.44.1 | Rtf2 | NOSIP[a] | 33,79 | C complex
 | g.44.1 | Rtf2 | NOSIP[a] | 217,286 | C complex
 | g.44.1 | DUF572 (ZZ) | CCDC130 | 43,117 | C complex
 | g.44.1 | U-box | RBQ-1 | 258,312 | Miscellaneous
UCH | d.3.1 | UCH | U4/U6.U5-65K (USP39)[a] | 220,556 | U4/U6.U5 tri-snRNP
UBC-like (RWD) | d.20.1 | – | THOC5 | 468,640 | TREX
JAB1/MPN | c.97.3 | JAB+PROCT | U5-220K (Prp8)[a] | 2064,2335 | U5 snRNP

[a] Abundant protein.






Ubiquitin-related domains are more abundant in proteins active in the late stages of splicing (B, B-act and C complexes). The ubiquitin-fold domain of protein SF3a120 is the only ubiquitin-related domain found in the U2 snRNP (its counterpart is found in the U11/U12 di-snRNP). On the other hand, as many as three proteins of the B/B-act complex (UBL5, Cyp-60 and RNF113A) and four proteins of the C complex (FLJ35382/C1orf55, XAP-5/FAM50A, NOSIP and CCDC130) contain ubiquitin-related domains, in addition to a domain in the U5 snRNP (the JAB1/MPN of U5-220K) and a protein in the U4/U6.U5 tri-snRNP (U4/U6.U5-65K). In summary, this distribution suggests that the late stages of splicing are probably under a stricter ubiquitin-based control than the early stages. This may be due to the fact that the earlier stages of splicing, such as intron/exon definition, are more dependent on weak, disorder-based interactions, while the later catalytic stages require precise subunit rearrangements.
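The counts quoted in this section can be re-derived from Table 6 with a short tally. The sketch below is only an illustration: the row list is transcribed by hand from Table 6 (protein name and protein group only), and the variable names are hypothetical.

```python
from collections import Counter, defaultdict

# One (protein, protein group) pair per row of Table 6, transcribed by hand.
TABLE6_ROWS = [
    ("SF3a120", "U2 snRNP"),
    ("U11/U12-25K", "U11/U12 di-snRNP"),
    ("SAP18", "EJC"),
    ("UBL5", "B complex"),
    ("FLJ35382", "C complex"),
    ("XAP-5", "C complex"),
    ("RBQ-1", "Miscellaneous"),
    ("U4/U6.U5-65K", "U4/U6.U5 tri-snRNP"),
    ("hPRP19", "hPrp19/CDC5L"),
    ("Cyp-60", "B-act complex"),
    ("Cyp-60", "B-act complex"),
    ("RNF113A", "B-act complex"),
    ("NOSIP", "C complex"),
    ("NOSIP", "C complex"),
    ("CCDC130", "C complex"),
    ("RBQ-1", "Miscellaneous"),
    ("U4/U6.U5-65K", "U4/U6.U5 tri-snRNP"),
    ("THOC5", "TREX"),
    ("U5-220K", "U5 snRNP"),
]

# Number of ubiquitin-related domains per protein group.
domains_per_group = Counter(group for _, group in TABLE6_ROWS)

# Distinct proteins per protein group (a protein may carry several domains).
proteins_per_group = defaultdict(set)
for protein, group in TABLE6_ROWS:
    proteins_per_group[group].add(protein)

all_proteins = {protein for protein, _ in TABLE6_ROWS}

print(len(TABLE6_ROWS), "domains in", len(all_proteins), "proteins")
for group, n_domains in domains_per_group.most_common():
    print(f"{group}: {n_domains} domain(s), {len(proteins_per_group[group])} protein(s)")
```

Counting domains and proteins separately matters here because Cyp-60, NOSIP, RBQ-1 and U4/U6.U5-65K each contribute two rows to the table.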

Zinc finger-like domains flanked by conserved intrinsically disordered regions in U2 snRNP SF3a120 and other splicing proteins

Our FR analysis detected that the human SF3A sub-complex contains, in addition to the zinc finger in protein SF3a60, another degenerate C2H2 (g.37.1)-type zinc finger in the middle conserved region of protein SF3a120 (conserved region: residues 217–530, PFAM domain PRP21_like_P; zinc finger: residues 407–435). In Saccharomyces cerevisiae, this zinc finger is entirely absent. However, in the majority of non-animal species, especially other fungi, amoeba and Apicomplexa, this zinc finger retains some of the cysteine and histidine zinc-binding residues (Figure 5A). The zinc finger remnant is surrounded on both sides by intrinsically unstructured regions that are in part predicted to form helical (potentially coiled-coil) structures. The short motifs lying on the distal ends of the disordered linkers are conserved. An additional coiled-coil region connects the N-terminal conserved motif with the previously described (69) second Surp module of SF3a120. Thus, the PRP21_like_P module consists of three motifs, the second of which is a zinc-finger remnant, connected by flexible linkers, with an N-terminal coiled coil that connects the N-terminal motif to the Surp region (Figure 5B). Structural modules of this type usually serve to simultaneously contact a binding partner of the protein in several locations. In the particular case of SF3a120, it has been suggested that both the U2 snRNA and an as yet unidentified splicing protein are potential partners (69).

Through a systematic search, we found several other examples of zinc finger and zinc finger-like domains embedded in conserved disordered regions in the spliceosomal proteome (Table 7). Alternatively, tandem zinc fingers can be separated, e.g. by predicted coiled-coil regions. The new zinc-finger domains we found usually belong to the zf-C2H2 (g.37.1) type, which can bind RNA and/or mediate protein–protein interactions. The pre-mRNA/mRNA-binding protein ARS2 contains a ZZ RING zinc finger, while the C complex protein NOSIP contains two RING zinc finger/U-box-like regions.

BLUF-like domain (DUF1115) of the U4/U6 di-snRNP protein 90K (hPrp3)

The C-terminal ordered domain of protein U4/U6-90K (hPrp3), which corresponds to PFAM domain DUF1115 (PFAM ID: PF06544; residues 540–683), was predicted in our analysis to have a ferredoxin-like fold. It is predicted to be related to the acylphosphatase/BLUF domain-like superfamily (SCOP ID: d.58.10). BLUF family domains have two additional helices in the C-terminus compared to acylphosphatase family domains. These helices are present in the DUF1115 domain, and so this domain is predicted to be a BLUF-like domain (Figure 6). This is an unusual assignment, because the BLUF domain is a FAD/FMN-binding blue light photoreceptor domain found primarily in bacteria. In Eukaryota, it is found almost exclusively in euglenids and Heterolobosea. On the other hand, DUF1115 is found exclusively in eukaryotes. However, very high scores of BLUF domain templates yielded by FR methods for the hPrp3 DUF1115 sequence suggest that this protein is definitely homologous to the BLUF family.

Nevertheless, DUF1115 differs from BLUF domains in some key features. The FAD/FMN-binding residues conserved in BLUF domains are not conserved in DUF1115, nor is a tryptophan residue whose position is altered depending on the excitation state of the photoreceptor (70) (Supplementary Figure S2). On the other hand, DUF1115 contains a disordered loop between the second α-helix and the fifth β-strand. The presence of this loop, though not its length, is conserved in DUF1115 domains. Moreover, a conserved tryptophan residue, W604 in hPrp3, is located next to the disordered loop.

Based on biochemical data, the DUF1115 domain may be a region of interaction of hPrp3 with the U5 snRNP protein hPrp6 and/or the U4/U6.U5 tri-snRNP protein U4/U6.U5-110K (SART-1) (71). However, it is also possible that this interaction proceeds through the disordered PRP3 domain of this protein (71). A possible alternative role for DUF1115 is suggested by the fact that, apart from proteins from the hPrp3 family, it is found only in a family of proteins containing the RWD domain. The RWD domain belongs to the ubiquitin conjugating enzyme superfamily (72). Hence, the hPrp3 DUF1115 may be a part of the spliceosomal ubiquitin-based system.

Figure 4. Ubiquitin-related structural regions of human splicing proteins. (A) Ubiquitin-fold region of protein FLJ35382 (C1orf55; residues 1–80). Predicted RMSD 3.5 Å, QMEAN Z-score −1.33. (B) RWD-like region of protein THOC5 (residues 458–641). Predicted RMSD 3.9 Å, QMEAN Z-score −1.85.

Figure 5. Architecture of the conserved middle region of protein SF3a120 (residues 217–530). (A) Alignment of the residues of a zinc-finger domain in the middle part of SF3a120 (residues 407–435). The ‘g.37.1’ annotation row displays residues predicted to form a part of a g.37.1 (zf-C2H2) zinc finger. The ‘jnetpred SF3a120’ annotation row displays predicted secondary structure elements of the human SF3a120 (ovals represent α-helices, while arrows represent β-strands). (B) Architecture of the middle region of SF3a120; disordered linkers denoted as ‘IDR linker’ (intrinsically disordered region-linker). (C) Model of the middle region.

Table 7. Zinc-finger domains flanked by or embedded in predicted disordered regions

PFAM domain | Protein | Protein group | Region | SCOP superfamily ID | PFAM domain of template | SCOP description | Confidence | Region-superfamily similarity
PRP21_like_P | SF3a120[a] | U2 snRNP SF3A | 406,435 | g.37.1 | zf-U11-48K | β–β–α zinc fingers | High | High
LUC7 | LUC7B1 | A complex | 30,74 | g.66.1 | zf-CCCH | CCCH zinc finger | High | High
LUC7 | LUC7B1 | A complex | 186,232 | g.37.1 | zf-C2H2_jaz | β–β–α zinc fingers | High | High
DUF572 | CCDC130 | C complex | 43,117 | g.44.1 | ZZ | RING/U-box | High | High
Rtf2 | NOSIP[a] | C complex | 33,79 | g.44.1 | RING | RING/U-box | High | High
Rtf2 | NOSIP[a] | C complex | 217,286 | g.44.1 | zf-C3HC4 | RING/U-box | High | High
Fra10Ac1 | Fra10Ac1 | C complex | 166,220 | d.325.1 | Ribosomal_L28 | L28p-like | Low[b] | Low
ARS2 | ASR2B[a] | pre-mRNA/mRNA-binding | 714,738 | g.37.1 | zf-C2H2 | β–β–α zinc fingers | High | High

[a] Abundant protein.
[b] Alternative templates: FYVE, fn1.

Figure 6. BLUF-like region of protein U4/U6-90K (hPrp3) (domain DUF1115, residues 540–683). The position of the conserved residue W604 is displayed. Predicted RMSD 3.7 Å, QMEAN Z-score −3.06.

N-terminal PWI-like domains of the helicases hPrp22 (DHX8), hPrp2 (DHX16) and hBrr2 (U5-200K)

hPrp22 (DHX8) and hPrp2 (DHX16) are RNA helicases that function in the remodeling of the spliceosome (6). According to our predictions, these two helicases contain N-terminal ordered helical bundles with a PWI superfamily fold (SCOP superfamily a.188.1) and similarity to the PFAM PWI domain (Figures 7 and 8). PWI is a nucleic acid-binding domain first described in the splicing protein SRm160 (73,74). PWI is also found in the animal protein U4/U6-90K (hPrp3). The hPrp22 and hPrp2 PWI-like bundles (hPrp22: residues 1–92 or 1–120; hPrp2: 1–95) are not found in a search with the profile of the PFAM PWI domain, possibly because their eponymous PWI tripeptide motifs are degenerated. In hPrp22 and its homologs, only the third position of this motif is conserved: [x][x][IV], while in hPrp2 and its homologs, the second and third positions are usually conserved: [x][WFY][IV]. However, PFAM displays several putative hPrp2/hPrp22 homologs when queried for proteins that contain PWI domains. Furthermore, stable binding to nucleic acids by PWI requires an adjacent basic-rich region (74). We found potential candidates for such ancillary regions both in hPrp22 and in hPrp2 (hPrp22: residues 93–116; hPrp2: residues 120–132).

We also found a PWI-like helical bundle in the N-terminus of the human protein U5-200K (hBrr2; residues 258–338; Figure 7). This helical bundle is conserved across the majority of eukaryotes, and is found, for instance, in the S. cerevisiae Brr2. The PWI-like domain of U5-200K retains relatively well conserved second and third positions of the tripeptide PWI motif: [x][WFY][ILV]. Notably, if correct, this prediction represents the first case in which a PWI-like domain is located in the middle of a protein. Usually, as is the case for SRm160, hPrp3, hPrp22 and hPrp2, a PWI domain is located either in the immediate N-terminus or in the immediate C-terminus of a protein. There are at least three candidate basic-rich regions in the vicinity of the U5-200K PWI-like domain (residues 254–259; 343–349; 373–386).

Sequences of proteins from the hPrp22 (DHX8) and hPrp2 (DHX16) families are so similar that we could not easily separate them in a clustering analysis (Supplementary Figure S3). The most important discriminant between the two families appears to be the presence of an S1 RNA-binding domain (PDB ID: 2eqs; DOI:10.2210/pdb2eqs/pdb, manuscript to be published) between the N-terminal PWI-like bundle and the C-terminal helicase domains. This domain is present in hPrp22 and its homologs, but not in hPrp2 and its homologs. This led us to the hypothesis that Prp2, with the PWI-like domain, was the ancestral protein, which then underwent the insertion of the S1 domain. Nevertheless, the PWI-like domains of hPrp22 and hPrp2 differ in several aspects.

The first difference lies in the above-mentioned degree of degeneration of the tripeptide PWI motif, which is larger in hPrp22 and its homologs than in hPrp2 and its homologs. In an extreme case, the N-terminus of the Prp22 protein of S. cerevisiae and the related organism Eremothecium (Ashbya) gossypii is located inside the motif, which is therefore incomplete. The degeneration of the PWI motif may be offset by the heavy conservation of a [DE][FY] motif in the second helix of the bundle. The main reason for the conservation of the PWI motif in canonical PWI domains is that it stabilizes the structure of the PWI domain (74). It is possible that the conservation of the [DE][FY] motif is sufficient to guarantee the stabilization of the bundle in conjunction with the conservation of the third position of the PWI motif.
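The family-specific degeneration of the PWI tripeptide can be made concrete by writing the three patterns quoted in this section as regular expressions. The sketch below is illustrative only: it assumes that [x] stands for any residue in one-letter code, and the dictionary and function names are hypothetical.

```python
import re

# Tripeptide patterns as quoted in the text; "[x]" is read as "any residue"
# (an assumption of this sketch) and written here as [A-Z].
FAMILY_PATTERNS = {
    "hPrp22": re.compile(r"[A-Z][A-Z][IV]"),     # only position 3 conserved
    "hPrp2": re.compile(r"[A-Z][WFY][IV]"),      # positions 2 and 3 conserved
    "U5-200K": re.compile(r"[A-Z][WFY][ILV]"),   # positions 2 and 3 conserved
}


def matching_families(tripeptide: str) -> list[str]:
    """Return the families whose quoted pattern the tripeptide satisfies."""
    tri = tripeptide.upper()
    return [name for name, pattern in FAMILY_PATTERNS.items()
            if pattern.fullmatch(tri)]


for tri in ("PWI", "SFL", "KAV", "AAA"):
    print(tri, matching_families(tri))
```

The canonical PWI tripeptide satisfies all three patterns, which illustrates why a match to the most permissive pattern alone carries little information about whether a region is PWI-like; in the analysis above, the fold prediction and the [DE][FY] motif carry most of the weight.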

Second, there is also a possible difference in either the number or the arrangement of helices comprising the PWI domain. SCOP describes superfamily a.188.1 as a ‘four-helix bundle’. However, in the structure of the PWI domain from protein SRm160, the bundle is followed by an additional short α-helix orthogonal to the bundle (PDB ID: 1mp1) (74). The presence of this α-helix is also predicted for the hPrp3 PWI domain, although it is missing from the available experimental structure (PDB ID: 1x4q; DOI:10.2210/pdb1x4q/pdb, manuscript to be published). Similarly, secondary structure predictions for hPrp2 also indicated that this protein is likely to contain an additional α-helix. However, for hPrp22, predictions of domain boundaries are less decisive. The hPrp22 PWI-like domain is either predicted to be a four-helix bundle (in which case it is confined to residues 1–92), or to contain an additional α-helix, but separated from the bundle by an intrinsically disordered region (in which case the domain spans residues 1–120). In either case, the helix arrangement is predicted to be different from that in hPrp2. Of note, the U5-200K PWI-like domain is predicted to be a five-helix domain.

Third, the pattern of evolutionary conservation of the PWI-like domains is different in hPrp22 and hPrp2. Fewer putative and confirmed hPrp2 homologs from different species have the PWI-like domain than do hPrp22 homologs. For instance, the functional analog of hPrp2 in S. cerevisiae, Prp2, is considered to be its homolog, but lacks the PWI-like domain. The Prp22 combination of PWI+S1 appears to be retained, while the Prp2 PWI is missing, also in putative homologs in organisms such as kinetoplastids (Trypanosoma brucei, Leishmania major), some Apicomplexa (Plasmodium falciparum, Babesia bovis, but not Tetrahymena thermophila, which has both), Trichomonas vaginalis and Entamoeba histolytica.

Altogether, the PWI-like domain of hPrp22 is more diverged from the canon, but more often retained, while the PWI-like domain of hPrp2 is less diverged from the canon, but more often completely lost. This result does not contradict the hypothesis that the Prp22 protein was formed by the insertion of the S1 domain into the ancestral Prp2. It rather suggests the possibility that some property of the ‘degenerated’ PWI-like domain ensured its retention in evolution. An in-depth structural study of this region may elucidate the reason why.

As hinted above, the U5-200K PWI-like domain is in many respects a ‘canonical’ PWI-like domain similar to that of hPrp2: it retains two out of three of the positions of the tripeptide PWI motif, and is predicted to be a five-helix domain. However, U5-200K is in general highly conserved, and unlike in hPrp2, this conservation also applies to its PWI-like domain.

The N-termini of S. cerevisiae Prp2 and Prp22 are dispensable for splicing (75,76), while the N-terminus of S. cerevisiae Brr2 was shown not to contact any of the proteins of the U4/U6.U5 tri-snRNP (71). Hence, the N-terminal PWI-like domains of hPrp2, hPrp22 and U5-200K are likely to have only a supporting role in splicing, one that is not revealed in the activity of the yeast proteins. We suggest that they may help in the correct positioning of the C-terminal helicase domains on the relevant snRNAs. Nevertheless, we could not find any data on the activity of the N-termini of hPrp2, hPrp22 and U5-200K. Furthermore, no experimental model of a PWI domain bound to RNA exists to which we could compare the mode of binding of the hPrp2, hPrp22 and U5-200K PWI-like domains. Hence, as far as this publication is concerned, the question of what is bound to the PWI-like domains of the splicing helicases remains open.

Figure 7. PWI-like regions of splicing helicases. (A) hPrp22 (DHX8; residues 1–120 shown, but domain may end at residue 92). Predicted RMSD 2.4 Å, QMEAN Z-score −2.76. (B) hPrp2 (DHX16; residues 1–95). Predicted RMSD 5.8 Å, QMEAN Z-score −2.19. (C) U5-200K (hBrr2; residues 259–338). Predicted RMSD 3.8 Å, QMEAN Z-score −0.79.

Figure 8. The PWI domain and PWI-like regions in splicing helicases. In all alignments, the ‘PWI’ annotation row displays the residues of the PWI motif conserved in a given protein. The ‘jnetpred (. . .)’ annotation row displays secondary structure elements predicted in the relevant human proteins (ovals represent α-helices, while arrows represent β-strands). Vertical lines indicate hidden columns (inserted residues present in only one or two sequences in the alignment). (A) Alignment of a ‘canonical’ PWI domain from protein SRm160. The ‘PDB ID: 1mp1’ annotation row displays the actual secondary structure elements found in the structure of the PWI domain of the human protein SRm160. (B) PWI-like region from protein hPrp22 (DHX8). The ‘disorder’ annotation row displays the position of a disordered region in the hPrp22 protein. (C) PWI-like region from protein hPrp2 (DHX16). (D) PWI-like region from protein U5-200K (hBrr2).

An N-terminal domain of the hPrp8 protein (U5-220K)

We could not confirm a published prediction of a bromo-domain encompassing hPrp8 residues 127–242 (a part of the N-terminal PFAM domain PRO8NT), originally made for yeast Prp8 residues 200–315 (77). In our view, the bromo-domain assignment is not supported by a consistent evolutionary conservation pattern. It encompasses 20 residues universally conserved in Prp8 homologs from all known species and nearly 100 residues conserved only in some eukaryotic Prp8 homologs. On the other hand, we were able to construct a de novo model for the most conserved part (residues 86–150) of the PRO8NT domain (Supplementary Figure S4). Quality evaluation indicates that the model of the putative Prp8 bromo-domain described in (77) has low predicted accuracy (predicted RMSD 8.7 Å, QMEAN Z-score −4.25) compared to our de novo model of residues 86–150 (predicted RMSD 2.4 Å, QMEAN Z-score −1.93). Altogether, although we cannot exclude the possibility that PRO8NT contains a bromo-domain, we suggest that further studies (ideally, experimental structure determination) will be required to provide a confident structural model of this region.

Other previously uncharacterized structural regions of abundant splicing proteins

We found several other new types of structured regions in abundant splicing proteins that we were able to assign to known folds and/or are similar to existing structures, with varying degree of confidence (Table 7). For instance, a region in the C-terminus of the hPrp19/CDC5L-related protein KIAA0560 (IBP160/Aquarius homolog; residues 453–1485) has a helicase architecture similar to the nonsense-mediated decay protein Upf1p (Figure 9). KIAA0560 is a 1485-residue-long protein, whose binding to pre-mRNA introns is necessary for the successful deposition of the exon junction complex on the pre-mRNA (78) and for successful release of box C/D snoRNAs (small nucleolar RNAs) from introns (14). Upf1p contains two RNA helicase domains (c.37.1), the first of which is interrupted twice by two insertions: an all-β and an all-α domain insertion (79). In KIAA0560, this first c.37.1 domain is interrupted three times: both of the original insertions are kept, but a third insertion, largely disordered, has appeared between them.

Another previously undescribed region lies in the C-terminus of the B complex protein TFIP11 (homolog of the yeast protein Spp382). The results of our FR analysis suggest that this region is a potential double-stranded RNA binding domain (dsRBD) (Figure 9). In other splicing proteins, such as the non-abundant A complex protein DHX9, dsRBD domains often occur in tandem, but the TFIP11 region does not have a partner. However, TFIP11 also contains another previously structurally uncharacterized region with a putative RNA-binding function, a G-patch domain. While the G-patch domain does not show sequence similarity to any other known domains, a highly scoring de novo model of this domain shows structural similarity to a dsRBD domain (Figure 9). In fact, in the non-abundant splicing-related protein SON, the G-patch domain occurs in tandem with a dsRBD domain partner. If the G-patch domain has a dsRBD-like fold, the TFIP11 G-patch domain could provide the functionality of a second tandem dsRBD-like domain for the previously undescribed putative dsRBD of TFIP11.

We were also able to construct highly scored de novo models with a clear structural similarity to known folds for ordered helical regions located at the N-termini of proteins hnRNP R and Q. No known structural domain is assigned to these regions, but our de novo models of these regions exhibit fairly high scores (for the region in protein hnRNP R: predicted RMSD 1.3 Å, QMEAN Z-score 0.12). Based on structural similarity scores yielded by the DALI server (51), these may be helix-turn-helix domains (Figure 9).

Other new putative structural domains are described in Table 8.

Table 8. New types of predicted structural regions in the human spliceosomal proteome that can be classified into known superfamilies

PFAM domain | Protein | Protein group | Region | SCOP superfamily ID | PFAM domain of template | SCOP description | Confidence | Region-superfamily similarity
– | KIAA0560 (A) | hPrp19/CDC5L-related | 1,452 | a.118.1 | Arm repeats | ARM repeat | Medium | Medium
– | KIAA0560 (A) | hPrp19/CDC5L-related | 453,1348 | – | Upf1p[a] | – | High | High
– | TFIP11 | B-complex | 771,837 | d.50.1 | dsrm | dsRNA-binding domain-like | Medium[b] | High
G-patch | LUCA15 (A) | A-complex | 741,815 | d.50.1 | dsrm | dsRNA-binding domain-like | Medium[c] | High
– | hnRNP R | hnRNP | 28,92 | a.4.14 | KorB (clan HTH) | KorB DNA-binding domain-like | Medium[d] | High
DUF2414 | ELG | pre-mRNA/mRNA-binding | 124,182 | d.58.7 | RNA_bind | RNA-binding domain, RBD | High | High
DUF1604 | Q9BRR8 | C-complex | 28,53 | b.34.2 | SH3_1 | SH3-domain | High | High
CTK3 | SR140 | U2 snRNP-related | 534,680 | a.118.9 | DUF618 | ENTH/VHS domain | High | High
Slu7 | hSlu7 (A) | step 2 factors | 424,457 | – | BTK motif | – | Low[e] | High
PRP38 | hPrp38 (A) | B-complex | 26,206 | a.96.1 | HhH-GPD | DNA-glycosylase | Low[f] | Medium
– | TRAP150 (A) | A-complex | 861,934 | – | Btz | – | High[g] | High
– | BCLAF1 | pre-mRNA/mRNA-binding | 827,899 | – | Btz | – | High[g] | High
DZF | NFAR | A-complex | 82,177 | d.218.1 | NTP_transf_2 | Nucleotidyl transferase | High[h] | High
DZF | NFAR | A-complex | 194,325 | a.160.1 | OAS1_C | PAP/OAS1 substrate-binding domain | High[h] | High

[a] Protein.
[b] Highly scored alternative template TcpQ (bacterial).
[c] De novo model, highly scored, structural similarity only (1DI2_B).
[d] De novo model, highly scored, structural similarity only (1R71_A).
[e] Short; BTK motif always found C-terminal to PH domains, which is not found in Slu7.
[f] Alternative templates: HtH motifs.
[g] Predicted disordered region.
[h] DZF is a member of clan NTP_transf.

Figure 9. Other previously uncharacterized structural regions of the spliceosomal proteome. (A) The C-terminus of protein KIAA0560 (AQR), structurally similar to protein Upf1p (residues 453–1485). RMSD 3.3 Å, QMEAN Z-score −4.97. (B) Dsrm-like region of protein TFIP11 (residues 701–838). Predicted RMSD 4.5 Å, QMEAN Z-score −2.28. (C) The G-patch domain of LUCA15 (residues 741–815). Predicted RMSD 3.0 Å, QMEAN Z-score −1.22. (D) HTH-like region of protein hnRNP R (residues 23–92). Predicted RMSD 1.3 Å, QMEAN Z-score 0.12.

Comparison of the human and Giardia lamblia spliceosomal proteome: setting priorities for spliceosome structure modeling

The human spliceosome, with its 119 abundant proteins, represents a fairly challenging target for both experimental and theoretical structural analyses. To round off our analysis, we wanted to put forth a candidate minimum set of structural regions in a functional spliceosome that, in our opinion, should be prioritized during the modeling of the structure of the complex.

In general, eukaryotic species with fewer introns have fewer splicing proteins. The yeast Saccharomyces cerevisiae has homologs of only 61 of the human abundant splicing-related proteins (2). On the other hand, S. cerevisiae also has some Saccharomycetes-specific splicing proteins, such as Prp24 (41), which do not appear in other fungi. In search of a ‘minimum’ set of regions to include in the model of a functional spliceosome, we turned to the extremely intron-scarce (80,81) parasitic organism G. lamblia, which is also known for its genome minimalism (82). This organism apparently underwent a reversed process with respect to the diversified and specialized human spliceosomal proteome, namely the loss of many genes encoding spliceosomal proteins.

The genome of G. lamblia ATCC50803 encodes homologs of only 30 human abundant splicing proteins (Table 9). Two more proteins can be found in G. lamblia P15. However, not all of these homologs may be involved in splicing. For instance, G. lamblia ATCC50803 possesses orthologs of U4/U6-15.5K and EIF4A3. In humans, U4/U6-15.5K is a component of the U4/U6 di-snRNP, where it binds to U4/U6-61K (hPrp31) (83), while EIF4A3 is a protein of the EJC (33). U4/U6-61K and all EJC proteins save EIF4A3 are missing in G. lamblia. However, the human U4/U6-15.5K protein also participates in box C/D snoRNP formation (83), where it binds a different protein, which does have a G. lamblia homolog, and the human EIF4A3 is an isoform of the eukaryotic translation initiation factor 4A. It is therefore possible that their orthologs in G. lamblia perform only these splicing-unrelated functions.

There is a pattern to the presence and absence of abundant splicing-related proteins and/or their domains and disordered regions in the G. lamblia proteome. Almost all the proteins of the U2 snRNPs are present in G. lamblia, as well as a homolog of U2AF35K, but only some core proteins of the U5 snRNP, such as Prp8 and Brr2. Snu114, which, according to the current understanding, is in other organisms the third member of the troika of U5 proteins essential to splicing (21), is an important absentee. Many proteins of the U1 snRNP and U4/U6 di-snRNP proteome are missing, as are all proteins specific to the human U4/U6.U5 tri-snRNP. The set of Step 2 factors is reduced to three RNA helicases, and these helicases are reduced to C-terminal regions of their human counterparts, with a common architecture. The G. lamblia helicases are also impossible to assign unambiguously to their human or yeast counterparts. Clustering analysis of helicase sequences from different organisms places the G. lamblia helicases away from any major cluster (Supplementary Figure S3). Finally, G. lamblia has very few homologs of human proteins of the auxiliary complexes, and only two non-snRNP stage-specific proteins (PRP38 and RNF113A) are present in this organism.

The snRNP protein homologs present in the G. lamblia proteome are shorter than their human counterparts. Three main types of structural features that are common in human spliceosomal proteins are largely absent from the G. lamblia spliceosomal proteome:

(i) intrinsically disordered proteins or disordered regions with possibly autonomous function (long protein disorder that does not form inter-domain linkers, including compositionally biased disorder and some regions of disorder with preformed structural elements); consequently, highly disordered proteins, such as the U4/U6.U5-specific proteins U4/U6.U5-110K and U4/U6.U5-27K;

(ii) short peptide regions that act as ligand partners for other splicing proteins (PRP4, SF3a60_bindingd, SF3b1 and the ULM-containing region of protein SF3b155), and their partners (PRP4 partner: U4/U6-20K; SF3a60_bindingd partner: second Surp domain of protein SF3a120, a protein that is missing entirely (see below); SF3b1 partner: p14; SF3b155 ULM partner: U2AF65K);

(iii) ubiquitin-related domains. This includes: the entire protein SF3a120 (which contains a ubiquitin domain in addition to the Surp domains); the U4/U6.U5-specific protein U4/U6.U5-65K, which contains the ubiquitin hydrolase domains zf-UBP and UCH; the zf-C3HC4 RING zinc finger of protein RNF113A. In contrast, the zf-CCCH zinc finger of RNF113A, which is a putative RNA-binding domain, is present.

In our analysis of intrinsic disorder in the human spliceosomal proteome (I.K. and J.M.B., submitted for publication), we discuss how disordered regions of splicing proteins are tied to functions of dynamics, assembly and regulation of the spliceosome. This is also the function of known ubiquitin-related regions. Hence, it appears that G. lamblia is missing most proteins and/or protein regions primarily responsible for splicing regulation and dynamics. On the other hand, G. lamblia retained pre-mRNA and snRNA-binding proteins and/or regions, as well as proteins that directly assist in splicing, such as the catalytic factor helicases. It also appears that this parasitic organism’s ubiquitin-based system of splicing control is reduced, rather than entirely missing. The C-terminal Mov34/MPN/JAB1 domain present in Prp8 from human or yeast (SCOP superfamily c.97.3), which may be implicated in a ubiquitin-based system (65), is absent from the G. lamblia Prp8 (84), but the corresponding region in the latter protein is predicted by FR analysis to be a domain with a ubiquitin-like fold (SCOP superfamily d.15.1).

It is possible that, like yeast, G. lamblia evolved its own specialized splicing proteins, which would not be detected in sequence similarity searches done with proteins from other organisms. Since G. lamblia is a parasite, it is also possible that it supplements some of its missing proteins (such as Snu114) from the host. Finally, it is also possible that some information was missed by our bioinformatics analysis but may be uncovered by an in-depth experimental analysis. With the caveat of the possibility of gaps in data (such as, possibly, Snu114), it is not single proteins that are missing, reduced or degenerated, but entire systems. The cropped set of proteins remaining in our G. lamblia spliceosomal proteome data set corresponds to a system much less dynamic than the human spliceosome, less precisely regulated and less able to adapt to variable conditions. However, such a spliceosome may still be functional. Hence, we propose that from a practical standpoint, the set of structural regions with homologs in G. lamblia is a good starting point for the higher order









Table 9. Human spliceosomal proteins with potential G. lamblia homologs, and these potential homologs

| Protein group | Human protein | GI of G. lamblia homolog | Human protein architecture | Giardia lamblia protein architecture |
| --- | --- | --- | --- | --- |
| Sm | Sm-B/B′ | 159117899 | LSM+G-rich disorder+poly-P disorder | LSM |
| Sm | Sm-D1 | 159116502 | LSM+G-rich disorder | LSM |
| Sm | Sm-D2 | 159111944 | LSM | LSM |
| Sm | Sm-D3 | 159107430 | LSM+G-rich disorder | LSM |
| Sm | Sm-E | 159110758 | LSM | LSM |
| Sm | Sm-F | 159114826 | LSM | LSM |
| Lsm | Lsm2 | 159109501 | LSM | LSM |
| Lsm | Lsm3 | 159118879 | LSM | LSM |
| Lsm | Lsm4 | 159110729 | LSM+G-rich disorder | LSM |
| U1 snRNP/U2 snRNP | U1-A/U2-B″ | 253745584 | (RRM_1)×2 | RRM_1 |
| U1 snRNP | U1-C | 308158556 | zf-U1+poly-P disorder | zf-U1 (a) |
| U2 snRNP | U2-A′ | 159115402 | (LRR_4)×2 | (LRR_4)×2 |
| U2 snRNP | SF3a66 | 159112716 | PRP4+zf-met+b.15.1+poly-P disorder | zf-met+b.15.1 |
| U2 snRNP | SF3a60 | 159115731 | SF3a60_bindingd+SAP+g.37.1+g.37.1 | zf-met (g.37.1)+g.37.1 (b) |
| U2 snRNP | SF3b155 | 253747536 | ULM+SF3b1+a.118.1 (HEAT) repeats | a.118.1 repeats (c) |
| U2 snRNP | SF3b145 | 159118535 | SAP+poly-P disorder+RS-like disorder+DUF382+PSP | DUF382+PSP |
| U2 snRNP | SF3b130 | 308162520 | WD40 repeats+CPSF_A | CPSF_A (d) |
| U2 snRNP | SF3b49 | 159117358 | (RRM_1)×2+poly-P disorder | (RRM_1)×2 |
| U2 snRNP | PHF5A | 159114698 | PHF5 | PHF5 |
| U2 snRNP-related | U2AF35 | 159112951 | zf-CCCH+RRM_1+zf-CCCH+G-rich disorder | zf-CCCH+RRM_1+zf-CCCH |
| U4/U6 di-snRNP | NHP2L1 | 159112698 | Ribosomal_L7Ae | Ribosomal_L7Ae (e) |
| U4/U6 di-snRNP | NHP2L1 | 159111753 | Ribosomal_L7Ae | Ribosomal_L7Ae (e) |
| U5 snRNP | U5-15K | 159116909 | DIM1 | DIM1 |
| U5 snRNP | U5-200K | 159109491 | a.188.1+(DEAD+Helicase_C+Sec63)×2 | DEAD+Helicase_C+Sec63 |
| U5 snRNP | U5-220K | 159109144 | PRO8NT+PROCN+RRM_4+U5_2-snRNA_bdg+U6-snRNA_bdg+PRP8_domainIV+c.97.3 (JAB+PROCT) | PRO8NT+PROCN+RRM_4+U5_2-snRNA_bdg+U6-snRNA_bdg+PRP8_domainIV+d.15.3 (f) |
| U2 snRNP-related | hPrp43 (DHX15) | | RS-like disorder+DEAD+Helicase_C+HA2+OB_NTP_bind (g) | |
| B-act complex | hPrp2 (DHX16) | | a.188.1+DEAD+Helicase_C+HA2+OB_NTP_bind (g) | |
| step 2 factors | hPrp22 (DHX8) | | a.188.1+RS-like disorder+S1+DEAD+Helicase_C+HA2+OB_NTP_bind (g) | |
| step 2 factors | hPrp16 (DHX38) | | RS-like disorder+DEAD+Helicase_C+HA2+OB_NTP_bind (g) | |
| | | 159108899 | | ATP11+DEAD+Helicase_C+HA2 (g, h) |
| | | 159113861 | | DEAD+Helicase_C+HA2+OB_NTP_bind (g) |
| | | 159117264 | | DEAD+Helicase_C+HA2 (g, h) |
| B complex | hPrp38A | 159116389 | PRP38+RS-like disorder | PRP38 |
| B-act complex | RNF113A | 159114937 | zf-CCCH+zf-C3HC4 | zf-CCCH |
| hPrp19/CDC5L | CCAP2 | 159115167 | Cwf_Cwc_15 | |
| EJC | EIF4A3 | 159117719 | DEAD+Helicase_C | DEAD+Helicase_C (i) |

Only abundant human splicing proteins with homologs in G. lamblia are shown. Predicted disordered regions with an independent function are included in italics. Ordered structural regions are usually described with their PFAM domains; SCOP IDs are used if the structural region does not correspond to a PFAM domain.
(a) Only in G. lamblia P15.
(b) SAP domain insertion is limited to animals and plants.
(c) Similarity to human SF3b155 only in C-terminal region (human SF3b155: 998–1304).
(d) Only in G. lamblia P15; WD40 repeat-like domain may be found via FR.
(e) May not participate in splicing (other possible human homologs: ribosomal protein L7, 15.5K).
(f) Ubiquitin-like fold (d.15) found in protein instead of c.97.3 domain.
(g) The human splicing helicases hPrp43, hPrp2, hPrp22 and hPrp16 and potential G. lamblia homologs cannot be unequivocally assigned to one another.
(h) OB_NTP_bind found via FR.
(i) May not participate in splicing (other possible human homolog: initiation factor EIF4A).
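
The architecture columns of Table 9 can also be compared programmatically. The following Python sketch splits each architecture string on "+" and reports the human elements that have no counterpart in the potential G. lamblia homolog. It is a minimal illustration under stated assumptions, not the procedure used in this work: the names TABLE9_ROWS, parse_architecture and lost_elements are hypothetical, the strings are copied from a few rows of Table 9, and rows that use the ×2 repeat notation or parenthesized annotations are left out because this simple parser does not expand them.

```python
from __future__ import annotations

from collections import Counter

# Selected rows copied from Table 9:
# human protein -> (human architecture, G. lamblia architecture).
# The dictionary name and the row selection are illustrative assumptions.
TABLE9_ROWS = {
    "Sm-B/B'": ("LSM+G-rich disorder+poly-P disorder", "LSM"),
    "Sm-D2": ("LSM", "LSM"),
    "SF3a66": ("PRP4+zf-met+b.15.1+poly-P disorder", "zf-met+b.15.1"),
    "SF3b145": ("SAP+poly-P disorder+RS-like disorder+DUF382+PSP", "DUF382+PSP"),
    "U2AF35": ("zf-CCCH+RRM_1+zf-CCCH+G-rich disorder", "zf-CCCH+RRM_1+zf-CCCH"),
    "hPrp38A": ("PRP38+RS-like disorder", "PRP38"),
    "RNF113A": ("zf-CCCH+zf-C3HC4", "zf-CCCH"),
}


def parse_architecture(arch: str) -> list[str]:
    """Split an architecture string such as 'LSM+G-rich disorder' into its elements."""
    return [part.strip() for part in arch.split("+") if part.strip()]


def lost_elements(human: str, giardia: str) -> list[str]:
    """Return human elements without a counterpart in the G. lamblia architecture.

    The comparison is a multiset difference, so a domain that occurs twice in
    the human protein must also occur twice in the homolog to count as retained.
    The order of the human architecture is preserved.
    """
    remaining = Counter(parse_architecture(giardia))
    lost = []
    for element in parse_architecture(human):
        if remaining[element] > 0:
            remaining[element] -= 1
        else:
            lost.append(element)
    return lost


if __name__ == "__main__":
    for protein, (human_arch, giardia_arch) in TABLE9_ROWS.items():
        missing = lost_elements(human_arch, giardia_arch)
        print(f"{protein}: {', '.join(missing) if missing else 'nothing lost'}")
```

For the rows chosen here, most of the reported elements are disordered regions, in line with the pattern described in the text; the multiset comparison is used so that repeated domains (for example the two zf-CCCH fingers of U2AF35) are not collapsed into one.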






structural modeling of the spliceosome, and constitutes an attractive list of targets for experimental structure determination.

CONCLUSIONS AND FUTURE PROSPECTS

This work has been intended to review the existing structural information about human spliceosomal proteins and to fill in gaps, providing a framework of reference for future structural analyses of the spliceosome. We used protein structure prediction methods to identify ordered spliceosomal protein structural elements either not characterized at all on the structural level or characterized insufficiently, and thus underreported in databases and literature. Examples of such un-/under-characterized elements include the zinc-finger domain in protein SF3a120 of the U2 snRNP, PWI-like domains in the essential splicing helicases hPrp22 (DHX8), hPrp2 (DHX16) and the U5 snRNP protein hBrr2 (U5-200K), and several ubiquitin-related regions in abundant splicing proteins. In the latter case, by combining database data with our results, we determined that ubiquitin processing-related domains are common especially in non-snRNP splicing factors active in the later stages of the splicing reaction. Having completed the characterization of ordered domains of splicing proteins, we constructed a minimum non-redundant set of experimental structural representations of the proteins of the human spliceosome and modeled most of the (potentially) ordered structural elements without experimental structural models. Confident high-resolution structural models can be assigned to over 90% of structural order in the spliceosome proteins, which corresponds to about 50% of all amino acid residues.

We analyzed the spliceosomal proteome of the intron-poor organism G. lamblia to determine a candidate minimum set of structural elements present in a functional spliceosome. We found that the G. lamblia spliceosome does not contain the majority of disordered regions found in the human splicing proteome, and has retained only a vestigial ubiquitin-based system of control. Overall, the G. lamblia spliceosome appears to be much simpler than the human or the yeast one, in accordance with this organism’s overall genomic minimalism and its genome’s intron-poorness.

The results of our analysis of the structural domains in proteins of the human spliceosome may be used to guide experimental characterization of these regions. The characterization of the reduced G. lamblia spliceosome may help set priorities in selecting the structural regions for experimental structural determination, and those to be included in a first draft of a model of a functional spliceosome. We suggest that in the event of modeling the structure of a functional spliceosome, the ordered protein regions found in G. lamblia proteins should take priority. Finally, as long as the corresponding structural information is absent, the models we constructed may be used in further structural studies, for instance in modeling the structure of the entire spliceosome. Models of non-‘core’ proteins can be used to broaden our understanding of alternative splicing. Our models, domain characterizations and suggested priorities thus form a framework of reference for future structural studies of the spliceosome, and in particular, for the modeling of the structure of the functional spliceosome.

Following the (near) completion of the parts list of the spliceosome, we are also advancing our understanding of the structure of these parts. This work provides working structural models for a majority of the parts that appear to be ordered regardless of their functional state. While experimental determination of high-resolution structures for all of these elements would be desirable, theoretical models can be used to design experiments or perform calculations/simulations that require protein structure as a basis. The next step in the structural analysis of the spliceosome would be to use integrative modeling techniques to generate three-dimensional pictures of the splicing machinery, in analogy to the previous work on the nuclear pore complex (85,86). The even greater challenge ahead will be to model the dynamics of the splicing cycle, for which an even closer union of experimental and theoretical techniques will be required.
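
The two coverage figures quoted in these conclusions (confident models for over 90% of structural order, corresponding to about 50% of all residues) jointly imply a rough bound on the ordered fraction of the proteome. The short Python sketch below makes that arithmetic explicit. The function name is hypothetical and the inputs are the rounded values from the text, so the result is indicative only.

```python
def implied_ordered_fraction(modeled_fraction_of_all: float,
                             coverage_of_order: float) -> float:
    """Return the ordered fraction f that satisfies
    coverage_of_order * f == modeled_fraction_of_all.

    modeled_fraction_of_all: share of all residues covered by confident models.
    coverage_of_order: share of ordered residues covered by confident models.
    """
    if not 0.0 < coverage_of_order <= 1.0:
        raise ValueError("coverage_of_order must lie in (0, 1]")
    if not 0.0 <= modeled_fraction_of_all <= coverage_of_order:
        raise ValueError("modeled_fraction_of_all must lie in [0, coverage_of_order]")
    return modeled_fraction_of_all / coverage_of_order


# Rounded values taken from the text: "about 50%" and "over 90%".
ordered = implied_ordered_fraction(0.50, 0.90)
disordered = 1.0 - ordered
print(f"ordered fraction <= {ordered:.2f}, remainder >= {disordered:.2f}")
```

Because the coverage of order is stated as "over 90%", dividing by exactly 0.90 gives an upper estimate of the ordered fraction (roughly 0.56) and a corresponding lower estimate of the remainder (roughly 0.44).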

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online: Supplementary Tables 1–4 and Supplementary Figures 1–4.

ACKNOWLEDGEMENTS

We thank Lukasz Kozlowski, Albert Bogdanowicz, Marcin Pawlowski, Geoff Barton, Jim Procter and Pascal Benkert for help with their software. We also thank Reinhard Luhrmann, Elżbieta Purta, Lukasz Kozlowski, Joanna Kasprzak, and Anna Czerwoniec for critical reading of the article, useful comments and suggestions.

FUNDING

EU 6th Framework Programme Network of Excellence EURASNET [EU FP6 contract no LSHG-CT-2005-518238]. J.M.B. has been additionally supported by the 7th Framework Programme of the European Commission [EC FP7, grant HEALTHPROT, contract number 229676], by the European Research Council [ERC, StG grant RNA+P=123D] and by the ‘Ideas for Poland’ fellowship from the Foundation for Polish Science. Computing power has been provided in part by the Interdisciplinary Centre for Mathematical and Computational Modeling of the University of Warsaw [grant number G27-4]. The funders had no role in study design, data collection and analysis, decision to publish or preparation of the article. Funding for open access charge: EC FP7 contract number 229676 (HEALTHPROT) and by ERC (RNA+P=123D).

Conflict of interest statement. None declared.

REFERENCES

1. Tarn,W.Y. and Steitz,J.A. (1996) A novel spliceosome containing U11, U12, and U5 snRNPs excises a minor class (AT-AC) intron in vitro. Cell, 84, 801–811.







2. Agafonov,D.E., Deckert,J., Wolf,E., Odenwalder,P., Bessonov,S., Will,C.L., Urlaub,H. and Luhrmann,R. (2011) Semi-quantitative proteomic analysis of the human spliceosome via a novel two-dimensional gel electrophoresis method. Mol. Cell Biol., 31, 2667–2682.

3. Zhou,Z., Licklider,L.J., Gygi,S.P. and Reed,R. (2002) Comprehensive proteomic analysis of the human spliceosome. Nature, 419, 182–185.

4. Jurica,M.S. and Moore,M.J. (2003) Pre-mRNA splicing: awash in a sea of proteins. Mol. Cell, 12, 5–14.

5. Luz Ambrosio,D., Lee,J.H., Panigrahi,A.K., Nguyen,T.N., Cicarelli,R.M. and Gunzl,A. (2009) Spliceosomal proteomics in Trypanosoma brucei reveal new RNA splicing factors. Eukaryot. Cell, 8, 990–1000.

6. Valadkhan,S. and Jaladat,Y. (2010) The spliceosomal proteome: at the heart of the largest cellular ribonucleoprotein machine. Proteomics, 10, 4128–4141.

7. Ren,L., McLean,J.R., Hazbun,T.R., Fields,S., Vander Kooi,C., Ohi,M.D. and Gould,K.L. (2011) Systematic two-hybrid and comparative proteomic analyses reveal novel yeast pre-mRNA splicing factors connected to Prp19. PLoS One, 6, e16719.

8. Bessonov,S., Anokhina,M., Krasauskas,A., Golas,M.M., Sander,B., Will,C.L., Urlaub,H., Stark,H. and Luhrmann,R. (2010) Characterization of purified human Bact spliceosomal complexes reveals compositional and morphological changes during spliceosome activation and first step catalysis. RNA, 16, 2384–2403.

9. Veretnik,S., Wills,C., Youkharibache,P., Valas,R.E. and Bourne,P.E. (2009) Sm/Lsm genes provide a glimpse into the early evolution of the spliceosome. PLoS Comput. Biol., 5, e1000315.

10. Kornblihtt,A.R., de la Mata,M., Fededa,J.P., Munoz,M.J. and Nogues,G. (2004) Multiple links between transcription and splicing. RNA, 10, 1489–1498.

11. Alexander,R. and Beggs,J.D. (2010) Cross-talk in transcription, splicing and chromatin: who makes the first call? Biochem. Soc. Trans., 38, 1251–1256.

12. Hsu,S.N. and Hertel,K.J. (2009) Spliceosomes walk the line: splicing errors and their impact on cellular function. RNA Biol., 6, 526–530.

13. Dreyfuss,G., Kim,V.N. and Kataoka,N. (2002) Messenger-RNA-binding proteins and the messages they carry. Nat. Rev. Mol. Cell Biol., 3, 195–205.

14. Hirose,T., Ideue,T., Nagai,M., Hagiwara,M., Shu,M.D. and Steitz,J.A. (2006) A spliceosomal intron binding protein, IBP160, links position-dependent assembly of intron-encoded box C/D snoRNP to pre-mRNA splicing. Mol. Cell, 23, 673–684.

15. Hogg,R., McGrail,J.C. and O’Keefe,R.T. (2010) The function of the NineTeen Complex (NTC) in regulating spliceosome conformations and fidelity during pre-mRNA splicing. Biochem. Soc. Trans., 38, 1110–1115.

16. Tange,T.O., Nott,A. and Moore,M.J. (2004) The ever-increasing complexities of the exon junction complex. Curr. Opin. Cell Biol., 16, 279–284.

17. Lewis,J.D. and Izaurralde,E. (1997) The role of the cap structure in RNA processing and nuclear export. Eur. J. Biochem., 247, 461–469.

18. Dziembowski,A., Ventura,A.P., Rutz,B., Caspary,F., Faux,C., Halgand,F., Laprevote,O. and Seraphin,B. (2004) Proteomic analysis identifies a new complex required for nuclear pre-mRNA retention and splicing. EMBO J., 23, 4847–4856.

19. Katahira,J. (2009) Regulation of nuclear export and cytoplasmic localization of mRNAs by NXF family proteins. Tanpakushitsu Kakusan Koso, 54, 2109–2113.

20. Zhang,N., Kaur,R., Lu,X., Shen,X., Li,L. and Legerski,R.J. (2005) The Pso4 mRNA splicing and DNA repair complex interacts with WRN for processing of DNA interstrand cross-links. J. Biol. Chem., 280, 40559–40567.

21. Wahl,M.C., Will,C.L. and Luhrmann,R. (2009) The spliceosome: design principles of a dynamic RNP machine. Cell, 136, 701–718.

22. Bellare,P., Small,E.C., Huang,X., Wohlschlegel,J.A., Staley,J.P. and Sontheimer,E.J. (2008) A role for ubiquitin in the spliceosome assembly pathway. Nat. Struct. Mol. Biol., 15, 444–451.

23. Pena,V., Liu,S., Bujnicki,J.M., Luhrmann,R. and Wahl,M.C. (2007) Structure of a multipartite protein-protein interaction domain in splicing factor prp8 and its link to retinitis pigmentosa. Mol. Cell, 25, 615–624.

24. Song,E.J., Werner,S.L., Neubauer,J., Stegmeier,F., Aspden,J., Rio,D., Harper,J.W., Elledge,S.J., Kirschner,M.W. and Rape,M. (2010) The Prp19 complex and the Usp4Sart3 deubiquitinating enzyme control reversible ubiquitination at the spliceosome. Genes Dev., 24, 1434–1447.

25. Mathew,R., Hartmuth,K., Mohlmann,S., Urlaub,H., Ficner,R. and Luhrmann,R. (2008) Phosphorylation of human PRP28 by SRPK2 is required for integration of the U4/U6-U5 tri-snRNP into the spliceosome. Nat. Struct. Mol. Biol., 15, 435–443.

26. Laskowski,R.A. and Thornton,J.M. (2008) Understanding the molecular machinery of genetics through 3D structures. Nat. Rev. Genet., 9, 141–151.

27. Stark,H. and Luhrmann,R. (2006) Cryo-electron microscopy of spliceosomal components. Annu. Rev. Biophys. Biomol. Struct., 35, 435–457.

28. Jurica,M.S. (2008) Detailed close-ups and the big picture of spliceosomes. Curr. Opin. Struct. Biol., 18, 315–320.

29. Magrane,M. and Consortium,U. (2011) UniProt Knowledgebase: a hub of integrated protein data. Database, 2011, bar009.

30. Finn,R.D., Mistry,J., Tate,J., Coggill,P., Heger,A., Pollington,J.E., Gavin,O.L., Gunasekaran,P., Ceric,G., Forslund,K. et al. (2010) The Pfam protein families database. Nucleic Acids Res., 38, D211–D222.

31. Pomeranz Krummel,D.A., Oubridge,C., Leung,A.K., Li,J. and Nagai,K. (2009) Crystal structure of human spliceosomal U1 snRNP at 5.5 Å resolution. Nature, 458, 475–480.

32. Leung,A.K., Nagai,K. and Li,J. (2011) Structure of the spliceosomal U4 snRNP core domain and its implication for snRNP biogenesis. Nature, 473, 536–539.

33. Bono,F., Ebert,J., Lorentzen,E. and Conti,E. (2006) The crystal structure of the exon junction complex reveals how it maintains a stable grip on mRNA. Cell, 126, 713–725.

34. Mazza,C., Segref,A., Mattaj,I.W. and Cusack,S. (2002) Large-scale induced fit recognition of an m(7)GpppG cap analogue by the human nuclear cap-binding complex. EMBO J., 21, 5548–5557.

35. Schellenberg,M.J., Edwards,R.A., Ritchie,D.B., Kent,O.A., Golas,M.M., Stark,H., Luhrmann,R., Glover,J.N. and MacMillan,A.M. (2006) Crystal structure of a core spliceosomal protein interface. Proc. Natl Acad. Sci. USA, 103, 1266–1271.

36. Berman,H.M., Westbrook,J., Feng,Z., Gilliland,G., Bhat,T.N., Weissig,H., Shindyalov,I.N. and Bourne,P.E. (2000) The Protein Data Bank. Nucleic Acids Res., 28, 235–242.

37. Makarov,E.M., Makarova,O.V., Urlaub,H., Gentzel,M., Will,C.L., Wilm,M. and Luhrmann,R. (2002) Small nuclear ribonucleoprotein remodeling during catalytic activation of the spliceosome. Science, 298, 2205–2208.

38. Behzadnia,N., Golas,M.M., Hartmuth,K., Sander,B., Kastner,B., Deckert,J., Dube,P., Will,C.L., Urlaub,H., Stark,H. et al. (2007) Composition and three-dimensional EM structure of double affinity-purified, human prespliceosomal A complexes. EMBO J., 26, 1737–1748.

39. Deckert,J., Hartmuth,K., Boehringer,D., Behzadnia,N., Will,C.L., Kastner,B., Stark,H., Urlaub,H. and Luhrmann,R. (2006) Protein composition and electron microscopy structure of affinity-purified human spliceosomal B complexes isolated under physiological conditions. Mol. Cell Biol., 26, 5528–5543.

40. Bessonov,S., Anokhina,M., Will,C.L., Urlaub,H. and Luhrmann,R. (2008) Isolation of an active step I spliceosome and composition of its RNP core. Nature, 452, 846–850.

41. Fabrizio,P., Dannenberg,J., Dube,P., Kastner,B., Stark,H., Urlaub,H. and Luhrmann,R. (2009) The evolutionarily conserved core design of the catalytic activation step of the yeast spliceosome. Mol. Cell, 36, 593–608.

42. Will,C.L., Schneider,C., Hossbach,M., Urlaub,H., Rauhut,R., Elbashir,S., Tuschl,T. and Luhrmann,R. (2004) The human 18S U11/U12 snRNP contains a set of novel proteins not found in the U2-dependent spliceosome. RNA, 10, 929–941.

43. Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.

44. Katoh,K., Kuma,K., Toh,H. and Miyata,T. (2005) MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Res., 33, 511–518.

45. Frickey,T. and Lupas,A. (2004) CLANS: a Java application for visualizing protein families based on pairwise similarity. Bioinformatics, 20, 3702–3704.

46. Kurowski,M.A. and Bujnicki,J.M. (2003) GeneSilico protein structure prediction meta-server. Nucleic Acids Res., 31, 3305–3307.

47. Lundstrom,J., Rychlewski,L., Bujnicki,J. and Elofsson,A. (2001) Pcons: a neural-network-based consensus predictor that improves fold recognition. Protein Sci., 10, 2354–2362.

48. Soding,J. (2005) Protein homology detection by HMM-HMM comparison. Bioinformatics, 21, 951–960.

49. Murzin,A.G., Brenner,S.E., Hubbard,T. and Chothia,C. (1995) SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol., 247, 536–540.

50. Tung,C.H. and Yang,J.M. (2007) fastSCOP: a fast web server for recognizing protein structural domains and SCOP superfamilies. Nucleic Acids Res., 35, W438–W443.

51. Holm,L. and Rosenstrom,P. (2010) Dali server: conservation mapping in 3D. Nucleic Acids Res., 38, W545–W549.

52. Sali,A., Potterton,L., Yuan,F., van Vlijmen,H. and Karplus,M. (1995) Evaluation of comparative protein modeling by MODELLER. Proteins, 23, 318–326.

53. Roy,A., Kucukural,A. and Zhang,Y. (2010) I-TASSER: a unified platform for automated protein structure and function prediction. Nat. Protoc., 5, 725–738.

54. Das,R. and Baker,D. (2008) Macromolecular modeling with rosetta. Annu. Rev. Biochem., 77, 363–382.

55. Kaufmann,K.W., Lemmon,G.H., Deluca,S.L., Sheehan,J.H. and Meiler,J. (2010) Practically useful: what the Rosetta protein modeling suite can do for you. Biochemistry, 49, 2987–2998.

56. Pettersen,E.F., Goddard,T.D., Huang,C.C., Couch,G.S., Greenblatt,D.M., Meng,E.C. and Ferrin,T.E. (2004) UCSF Chimera–a visualization system for exploratory research and analysis. J. Comput. Chem., 25, 1605–1612.

57. Guex,N. and Peitsch,M.C. (1997) SWISS-MODEL and the Swiss-PdbViewer: an environment for comparative protein modeling. Electrophoresis, 18, 2714–2723.

58. Pawlowski,M., Gajda,M.J., Matlak,R. and Bujnicki,J.M. (2008) MetaMQAP: a meta-server for the quality assessment of protein models. BMC Bioinformatics, 9, 403.

59. Benkert,P., Kunzli,M. and Schwede,T. (2009) QMEAN server for protein model quality estimation. Nucleic Acids Res., 37, W510–W514.

60. Zemla,A., Venclovas,C., Moult,J. and Fidelis,K. (2001) Processing and evaluation of predictions in CASP4. Proteins, (Suppl 5), 13–21.

61. Waterhouse,A.M., Procter,J.B., Martin,D.M., Clamp,M. and Barton,G.J. (2009) Jalview Version 2–a multiple sequence alignment editor and analysis workbench. Bioinformatics, 25, 1189–1191.

62. Ashburner,M., Ball,C.A., Blake,J.A., Botstein,D., Butler,H., Cherry,J.M., Davis,A.P., Dolinski,K., Dwight,S.S., Eppig,J.T. et al. (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet., 25, 25–29.

63. Maris,C., Dominguez,C. and Allain,F.H. (2005) The RNA recognition motif, a plastic RNA-binding platform to regulate post-transcriptional gene expression. FEBS J., 272, 2118–2131.

64. Clery,A., Blatter,M. and Allain,F.H. (2008) RNA recognition motifs: boring? Not quite. Curr. Opin. Struct. Biol., 18, 290–298.

65. Bellare,P., Kutach,A.K., Rines,A.K., Guthrie,C. and Sontheimer,E.J. (2006) Ubiquitin binding by a variant Jab1/MPN domain in the essential pre-mRNA splicing factor Prp8p. RNA, 12, 292–302.

66. Kielkopf,C.L., Lucke,S. and Green,M.R. (2004) U2AF homology motifs: protein recognition in the RRM world. Genes Dev., 18, 1513–1526.

67. Benkert,P., Biasini,M. and Schwede,T. (2011) Toward the estimation of the absolute quality of individual protein structure models. Bioinformatics, 27, 343–350.

68. Lin,P.C. and Xu,R.M. (2012) Structure and assembly of the SF3a splicing factor complex of U2 snRNP. EMBO J., 31, 1579–1590.

69. Kramer,A., Ferfoglia,F., Huang,C.J., Mulhaupt,F., Nesic,D. and Tanackovic,G. (2005) Structure-function analysis of the U2 snRNP-associated splicing factor SF3a. Biochem. Soc. Trans., 33, 439–442.

70. Yuan,H., Anderson,S., Masuda,S., Dragnea,V., Moffat,K. and Bauer,C. (2006) Crystal structures of the Synechocystis photoreceptor Slr1694 reveal distinct structural states related to signaling. Biochemistry, 45, 12687–12694.

71. Liu,S., Rauhut,R., Vornlocher,H.P. and Luhrmann,R. (2006) The network of protein-protein interactions within the human U4/U6.U5 tri-snRNP. RNA, 12, 1418–1430.

72. Andersen,K.M., Hofmann,K. and Hartmann-Petersen,R. (2005) Ubiquitin-binding proteins: similar, but different. Essays Biochem., 41, 49–67.

73. Blencowe,B.J. and Ouzounis,C.A. (1999) The PWI motif: a new protein domain in splicing factors. Trends Biochem. Sci., 24, 179–180.

74. Szymczyna,B.R., Bowman,J., McCracken,S., Pineda-Lucena,A., Lu,Y., Cox,B., Lambermon,M., Graveley,B.R., Arrowsmith,C.H. and Blencowe,B.J. (2003) Structure and function of the PWI motif: a novel nucleic acid-binding domain that facilitates pre-mRNA processing. Genes Dev., 17, 461–475.

75. Edwalds-Gilbert,G., Kim,D.H., Silverman,E. and Lin,R.J. (2004) Definition of a spliceosome interaction domain in yeast Prp2 ATPase. RNA, 10, 210–220.

76. Schneider,S. and Schwer,B. (2001) Functional domains of the yeast splicing factor Prp22p. J. Biol. Chem., 276, 21184–21191.

77. Dlakic,M. and Mushegian,A. (2011) Prp8, the pivotal protein of the spliceosomal catalytic center, evolved from a retroelement-encoded reverse transcriptase. RNA, 17, 799–808.

78. Ideue,T., Sasaki,Y.T., Hagiwara,M. and Hirose,T. (2007) Introns play an essential role in splicing-dependent formation of the exon junction complex. Genes Dev., 21, 1993–1998.

79. Chamieh,H., Ballut,L., Bonneau,F. and Le Hir,H. (2008) NMD factors UPF2 and UPF3 bridge UPF1 to the exon junction complex and stimulate its RNA helicase activity. Nat. Struct. Mol. Biol., 15, 85–93.

80. Roy,S.W. and Gilbert,W. (2006) The evolution of spliceosomal introns: patterns, puzzles and progress. Nat. Rev. Genet., 7, 211–221.

81. Nixon,J.E., Wang,A., Morrison,H.G., McArthur,A.G., Sogin,M.L., Loftus,B.J. and Samuelson,J. (2002) A spliceosomal intron in Giardia lamblia. Proc. Natl Acad. Sci. USA, 99, 3701–3705.

82. Morrison,H.G., McArthur,A.G., Gillin,F.D., Aley,S.B., Adam,R.D., Olsen,G.J., Best,A.A., Cande,W.Z., Chen,F., Cipriano,M.J. et al. (2007) Genomic minimalism in the early diverging intestinal parasite Giardia lamblia. Science, 317, 1921–1926.

83. Liu,S., Li,P., Dybkov,O., Nottrott,S., Hartmuth,K., Luhrmann,R., Carlomagno,T. and Wahl,M.C. (2007) Binding of the human Prp31 Nop domain to a composite RNA-protein platform in U4 snRNP. Science, 316, 115–120.

84. Grainger,R.J. and Beggs,J.D. (2005) Prp8 protein: at the heart of the spliceosome. RNA, 11, 533–557.

85. Alber,F., Dokudovskaya,S., Veenhoff,L.M., Zhang,W., Kipper,J., Devos,D., Suprapto,A., Karni-Schmidt,O., Williams,R., Chait,B.T. et al. (2007) Determining the architectures of macromolecular assemblies. Nature, 450, 683–694.

86. Alber,F., Dokudovskaya,S., Veenhoff,L.M., Zhang,W., Kipper,J., Devos,D., Suprapto,A., Karni-Schmidt,O., Williams,R., Chait,B.T. et al. (2007) The molecular architecture of the nuclear pore complex. Nature, 450, 695–701.






Intrinsic Disorder in the Human Spliceosomal Proteome

Iga Korneta1, Janusz M. Bujnicki1,2*

1 Laboratory of Bioinformatics and Protein Engineering, International Institute of Molecular and Cell Biology in Warsaw, Warsaw, Poland, 2 Bioinformatics Laboratory, Institute of Molecular Biology and Biotechnology, Faculty of Biology, Adam Mickiewicz University, Poznan, Poland

Abstract

The spliceosome is a molecular machine that performs the excision of introns from eukaryotic pre-mRNAs. This macromolecular complex comprises in human cells five RNAs and over one hundred proteins. In recent years, many spliceosomal proteins have been found to exhibit intrinsic disorder, that is to lack stable native three-dimensional structure in solution. Building on the previous body of proteomic, structural and functional data, we have carried out a systematic bioinformatics analysis of intrinsic disorder in the proteome of the human spliceosome. We discovered that almost a half of the combined sequence of proteins abundant in the spliceosome is predicted to be intrinsically disordered, at least when the individual proteins are considered in isolation. The distribution of intrinsic order and disorder throughout the spliceosome is uneven, and is related to the various functions performed by the intrinsic disorder of the spliceosomal proteins in the complex. In particular, proteins involved in the secondary functions of the spliceosome, such as mRNA recognition, intron/exon definition and spliceosomal assembly and dynamics, are more disordered than proteins directly involved in assisting splicing catalysis. Conserved disordered regions in spliceosomal proteins are evolutionarily younger and less widespread than ordered domains of essential spliceosomal proteins at the core of the spliceosome, suggesting that disordered regions were added to a preexistent ordered functional core. Finally, the spliceosomal proteome contains a much higher amount of intrinsic disorder predicted to lack secondary structure than the proteome of the ribosome, another large RNP machine. This result agrees with the currently recognized different functions of proteins in these two complexes.

Citation: Korneta I, Bujnicki JM (2012) Intrinsic Disorder in the Human Spliceosomal Proteome. PLoS Comput Biol 8(8): e1002641. doi:10.1371/journal.pcbi.1002641

Editor: Lilia M. Iakoucheva, University of California San Diego, United States of America

Received December 29, 2011; Accepted June 16, 2012; Published August 9, 2012

Copyright: © 2012 Korneta, Bujnicki. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: This work has been supported by the EU 6th Framework Programme Network of Excellence EURASNET (EU FP6 contract no LSHG-CT-2005-518238). J.M.B. has been additionally supported by the 7th Framework Programme of the European Commission (EC FP7, grant HEALTHPROT, contract number 229676), by the European Research Council (ERC, StG grant RNA+P=123D) and by the “Ideas for Poland” fellowship from the Foundation for Polish Science (FNP). Computing power has been provided in part by the Interdisciplinary Centre for Mathematical and Computational Modeling of the University of Warsaw [grant number G27-4]. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing Interests: The authors have declared that no competing interests exist.

* E-mail: [email protected]

Introduction

In eukaryotic cells and certain viruses that infect them, the

coding sequences (exons) of most protein-coding genes are

interrupted by noncoding regions (introns). Following the

transcription of an entire gene into a precursor messenger RNA

(pre-mRNA), the introns are excised and the exons are spliced

together to form a functional mRNA. The splicing reaction is

catalyzed by a large macromolecular ribonucleoprotein (RNP)

machine termed the spliceosome. The most common form of the

spliceosome is composed primarily of five small nuclear RNA

(snRNA) molecules: U1, U2, U4, U5 and U6, and 45 proteins,

arranged into snRNP particles. Seven mutually related Sm

proteins are common to all spliceosomal snRNPs apart from the

U6, which contains a set of related ‘‘like-Sm’’ (Lsm) proteins [1].

The Sm or Lsm proteins form a ring structure that acts as a

platform to support the snRNA [2]. Apart from Sm and Lsm

heptamers, all other proteins in the human snRNP subunits are

unique (review: [3]).

Apart from the snRNP proteins, approximately 80 proteins are

abundant in the human spliceosome and reported to be essential

to the process of spliceosome-dependent splicing [4], while results

of proteomics analyses [4–7] yield more than 200 proteins in toto.

Non-snRNP splicing factors are divided into independent protein

splicing factors and proteins that combine into multiprotein

complexes auxiliary to the spliceosome: the hPrp19/CDC5L

(NTC) complex, the exon-junction complex (EJC), the cap-binding

complex (CBC), the retention-and-splicing complex (RES), and the

transcription-export complex (TREX). Spliceosomal proteins are

richly phosphorylated and also undergo other types of

post-translational modifications (review: [8]).

A rare class of introns exists (<1% of all introns in human) that

are excised by the so-called minor spliceosome [9]. This low-

abundance spliceosome variant contains a U5 snRNP identical to

the one from the major spliceosome and four snRNPs with

U11, U12, U4atac, and U6atac snRNAs that are distinct

from, but structurally and functionally analogous to, U1, U2, U4,

and U6 snRNAs, respectively. Some proteins specific to the minor

spliceosome have been found [10].

The primary activity of the spliceosome, i.e. the excision of

introns and ligation of exons, requires the correct working of

several additional functionalities of the spliceosomal machinery:

recognition of the 5′ and 3′ splice sites (intron/exon definition),

mutual recognition of spliceosome subunits and correct spliceo-

some assembly, spliceosome remodeling and regulation (review:

[11]). In the course of the splicing reaction, the snRNP subunits

combine and detach from one another and from the pre-mRNA,

forming in turn the so-called E (entry), A, B, B* (B-activated), and

C complexes. For the major spliceosome, the U1 and the U2

snRNPs perform the initial scanning of the pre-mRNA for intron

PLoS Computational Biology | www.ploscompbiol.org 1 August 2012 | Volume 8 | Issue 8 | e1002641

sites, while the actual two-step splicing reaction occurs after the

addition of a U4/U6.U5 tri-snRNP entity and the elimination of the

U1 and U4 snRNPs from the complex, at the assembled interface of

the pre-mRNA substrate and U2, U5, and U6 snRNAs (complex

C). For the minor spliceosome, the U11/U12 di-snRNP performs

the role of the U1 and U2 snRNPs, while the U4atac/U6atac di-

snRNP performs the role of the U4/U6 di-snRNP (review: [12]).

The early recognition and assembly of the splicing reaction (E/A

complex formation) rely on the use of multiple weak binary

interactions to ensure flexibility. On the other hand, later stages of

the splicing reaction (B, B-act, C complexes) involve enzymatic

catalysis [11]. Each of the stages of the splicing reaction has its own

set of associated non-snRNP proteins [4].

Splicing has been associated with intrinsic protein disorder [13].

Intrinsically disordered regions (IDRs) lack stable, well-defined

three-dimensional structure (review: [14]). IDRs frequently contain

low-complexity regions and repeats, although they may also contain

conserved linear motifs embedded in the less conserved regions

(ELMs; [15]). IDRs are not necessarily completely unfolded. In

particular, some IDRs may contain stable preformed secondary

structure elements in isolation [16], while others may switch from

disorder to order (i.e. exhibit ‘‘dual personality’’) depending on the

environment, for instance upon binding to other proteins [17,18].

As they lack tertiary structure under many or all conditions,

IDRs are more flexible and plastic than the rigid structures of

globular domains. Disorder may increase the speed of intermo-

lecular binding and unbinding and make interactions weaker [14].

As a result of these properties, IDRs are found in a variety of

molecular functions, which include forming linkers between

structured domains, being sites of post-translational modifications,

and sites of protein-protein and protein-RNA recognition [19].

The large interaction capacity of IDRs predisposes them to

organizing the assembly of complexes; disorder is a characteristic

feature of ‘‘hub’’ proteins that interact with many partners, and,

notably for spliceosome research, disordered proteins are common

in large complexes [20]. Among RNP complexes, the ribosome in

particular illustrates an RNA-related structural function for

disordered proteins. Many ribosomal proteins contain long

disordered extensions attached to ordered globular bodies [21]

that, upon the formation of the ribosome complex, become

ordered and penetrate into the macromolecule core formed by the

rRNA [22,23]. In other words, the long disordered extensions

become the ‘‘mortar’’ of the macromolecule that fills in gaps in the

rRNA and stabilizes it.

The subject of intrinsic disorder of the spliceosome has not yet

been systematically analyzed for the entirety of the spliceosomal

proteome. As an essential step towards broadening our under-

standing of the functioning of the spliceosome, we have carried out

a bioinformatics analysis of intrinsic disorder within the human

spliceosomal proteome. We discovered that almost half of the

residues within the human spliceosomal proteins are disordered,

and that the distribution of intrinsic disorder is uneven across the

spliceosome. The spliceosome is divided into three layers: a rigid

inner core that performs the precise operations required to effect

splicing catalysis, a middle layer of disorder that acquires structure

in spliceosome-bound proteins, and a fluid outer layer of

disordered regions that do not acquire structure and that are

responsible for the establishment of a matrix of weak interactions

in the initial stages of the splicing process.

Results/Discussion

The human spliceosome is highly disordered

Initially, we predicted the average intrinsic disorder content of

122 core proteins of the major human spliceosome, including all

abundant proteins sensu Agafonov et al. [4] (Table S1). This

prediction was carried out in two stages. The initial fully automated

analysis, carried out via the GeneSilico MetaDisorder server [24],

estimated the intrinsic protein disorder content in the 122 human

spliceosomal proteins at 53.5%, and at 45.2% for 45 proteins of the

snRNP subunits of the major spliceosome (each Sm protein counted

once). Subsequently, we manually adjusted the predictions of order/

disorder boundaries of IDRs based on structural predictions yielded

by the GeneSilico MetaServer [25]. This manual correction shifted

the disorder estimate downwards in some cases by as much as 10%,

to an intrinsic disorder content estimate of 44.0% for all the 122

proteins of the major spliceosome, and 34.1% for the snRNP

proteins. Nevertheless, even after the correction, at least 98 out of

the 122 core spliceosomal proteins (80.3%) were predicted to

contain at least one IDR ≥30 residues.
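
The two summary statistics used in this paragraph, the fraction of disordered residues and the presence of at least one IDR of ≥30 residues, can be derived from any per-residue order/disorder call. The following is a minimal sketch, assuming the prediction is available as a string of 'D' (disordered) and 'O' (ordered) characters; this encoding and the function names are illustrative assumptions and do not reproduce the output format of the MetaDisorder server.

```python
from itertools import groupby

def disordered_runs(mask):
    """Return (start, length) for every maximal run of 'D' in the mask.

    `mask` is a hypothetical per-residue call: 'D' = disordered, 'O' = ordered.
    """
    runs, pos = [], 0
    for state, group in groupby(mask):
        length = len(list(group))
        if state == "D":
            runs.append((pos, length))
        pos += length
    return runs

def disorder_fraction(mask):
    """Fraction of residues called disordered (0.0 for an empty mask)."""
    return mask.count("D") / len(mask) if mask else 0.0

def has_long_idr(mask, min_len=30):
    """True if at least one disordered run is min_len residues or longer."""
    return any(length >= min_len for _, length in disordered_runs(mask))

# Made-up 100-residue protein: one 35-residue IDR and one 5-residue IDR.
toy = "O" * 50 + "D" * 35 + "O" * 10 + "D" * 5
print(disordered_runs(toy))    # [(50, 35), (95, 5)]
print(disorder_fraction(toy))  # 0.4
print(has_long_idr(toy))       # True
```

Counting the proteins in a set for which `has_long_idr` is true and dividing by the size of the set gives the kind of fraction quoted above (98 of 122, or 80.3%).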

An intrinsic disorder content estimate of 44.0% is twice the

average value for all human proteins as calculated on the basis of

genome-based predictions, which is 21.6% [26]. The predicted

fraction of 80.3% of proteins with at least one IDR ≥30 residues

contrasts against the calculated fraction of 35.2% for the entire

human proteome [26]. Although different methods of prediction

of intrinsic disorder content differ in their estimates, altogether the

human spliceosomal proteome contains a high amount of intrinsic

disorder. This finding will have a significant impact on further

studies involving spliceosomal proteins.

Early human spliceosomal proteins are more disordered than late proteins

To determine whether there was any variation of disorder

content throughout the complexes forming the spliceosome at

different stages of the splicing reaction, we analyzed the fraction of

predicted intrinsic disorder for different groups of proteins of the

spliceosome complex. For this analysis, we divided the spliceosome

proteins in our dataset into several groups based on proteomics

Author Summary

In eukaryotic cells, introns are spliced out of protein-coding mRNAs by a highly dynamic and extraordinarily plastic molecular machine called the spliceosome. In recent years, multiple regions of intrinsic structural disorder were found in spliceosomal proteins. Intrinsically disordered regions lack stable native three-dimensional structure in solutions, which makes them structurally flexible and/or able to switch between different conformations. Hence, intrinsically disordered regions are the ideal candidate responsible for the spliceosome’s plasticity. Intrinsically disordered regions are also frequently the sites of post-translational modifications, which were also proven to be important in spliceosome dynamics. In this article, we describe the results of a structural bioinformatics analysis focused on intrinsic disorder in the spliceosomal proteome. We systematically analyzed all known human spliceosomal proteins with regards to the presence and type of intrinsic disorder. Almost a half of the combined sequence of these spliceosomal proteins is predicted to be intrinsically disordered, and the type of intrinsic disorder in a protein varies with its function and its location in the spliceosome. The parts of the spliceosome that act earlier in the process are more disordered, which corresponds to their role in establishing a network of interactions, while the parts that act later are more ordered.

Disorder in the Spliceosomal Proteome


data, and additionally included eight proteins of the U11/U12 di-snRNP

of the minor spliceosome (Table S1). As most of the U11/U12

proteins are structurally and functionally related to proteins of the

U1 and U2 snRNPs [10], we expected that they would have a

similar IDR content to the U1 and U2 snRNP subunit proteins.

Different groups of spliceosome proteins differ in their predicted

disorder content (Figure 1). In particular, proteins of the U1 snRNP,

U2 SF3A, U11/U12 di-snRNP, U2-related and U4/U6.U5 tri-

snRNP-specific proteins are predicted to be more disordered than

average spliceosome proteins (>44.0% disorder content). Of these

groups of proteins, all apart from the U4/U6.U5 tri-snRNP-specific

proteins are ‘‘early’’ proteins associated with the early stages of

splicing. On the other hand, U2 SF3B, U4/U6 di-snRNP, U5

snRNP, Sm and Lsm proteins are predicted to be more ordered

than average (<44.0% disorder content). The Sm and Lsm proteins

comprise scaffolds for snRNA, and especially proteins of the U4/U6

di-snRNP and U5 snRNP may be responsible for assisting in

splicing catalysis. Among auxiliary protein complexes, the retention-

and-splicing (RES) complex, whose function is the retention of

unspliced pre-mRNAs in the nucleus [27], is predicted to be

extremely disordered (80.6%), while the cap-binding complex

(CBC) is more ordered than average (28.0%). Two other complexes,

hPrp19/CDC5L and EJC, both of which have multiple functions,

situate in between (40.5% and 53.6% disorder content, respective-

ly). Finally, while all the groups of transiently binding non-snRNP

spliceosomal proteins are predicted to be more disordered than

average for all spliceosomal proteins, the early A-complex proteins

are predicted to be the most disordered in this group, followed by B-

complex proteins, B-act complex proteins, and C-complex proteins.

Early human spliceosomal proteins contain more compositionally biased disorder than late proteins

As no external standardized annotation scheme was available

for IDRs in the spliceosomal proteins, we developed a classifica-

tion based on their predicted primary and secondary structure

features. We divided the spliceosomal IDRs into three classes:

regions with consistently predicted secondary structure (SS)

elements (henceforth ‘‘disorder with SS’’ or ‘‘IDR with SS’’), long

(≥25 residues) compositionally biased IDRs without predicted

secondary structure elements (henceforth ‘‘compositionally biased

disorder/IDR’’), and other IDRs, which we omitted from further

analyses (Figure S1). Several types of compositionally biased

regions without predicted SS elements that frequently appear

throughout the spliceosomal proteome had been previously

described in literature. For these compositionally biased IDR

types, we sought to define relevant standard IDR subclasses within

our classification (RS-like, poly-P/Q, G-rich; see Methods for

details).
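
The three-way classification described in this paragraph can be written down as a simple decision rule. The sketch below is illustrative only: it assumes that, for each IDR, its length and two yes/no annotations (consistently predicted secondary structure, compositional bias) have already been determined elsewhere; the function name, its parameters and the order in which the checks are applied are assumptions, not the authors' implementation.

```python
def classify_idr(length, has_predicted_ss, is_compositionally_biased,
                 min_biased_len=25):
    """Assign an IDR to one of the three classes described in the text.

    All inputs are hypothetical pre-computed annotations:
      length                    -- number of residues in the IDR
      has_predicted_ss          -- consistently predicted SS elements present
      is_compositionally_biased -- region shows compositional bias
    """
    if has_predicted_ss:
        return "disorder with SS"
    if is_compositionally_biased and length >= min_biased_len:
        return "compositionally biased disorder"
    return "other disorder"

print(classify_idr(40, True, False))   # disorder with SS
print(classify_idr(40, False, True))   # compositionally biased disorder
print(classify_idr(24, False, True))   # other disorder
```

The subclasses named in the text (RS-like, poly-P/Q, G-rich) would be a further subdivision of the second class; their criteria are given in Methods and are not reproduced here.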

Having annotated the IDRs, we analyzed the distribution of

different types of disorder across different groups of human

spliceosome proteins. Different groups of spliceosome proteins are

predicted to differ in the type of disorder they contain (Figure 2,

Figure S2). The heptameric complexes of Sm and Lsm proteins

are predicted to contain mainly compositionally biased disorder

without secondary structure elements (69.9% of all disorder).

Correspondingly, crystal structures of the Sm complex lack most of

the predicted disordered regions (example PDB ID: 2Y9A, [28])

and show a stable ungapped platform, which suggests that disorder

in Sm and Lsm proteins is located outside of the ordered torus.

Protein groups that are present earlier in the course of the splicing

process and that are in general highly disordered (U1, U2 SF3A,

U11/U12, U2-related, SR, hnRNP, A-complex proteins) are

predicted to contain more disorder with predicted compositional

bias and less disorder with SS than late proteins. Similarly to

2Y9A, the majority of predicted disorder of the U1 snRNP-specific

proteins included in the crystal structure of the U1 snRNP (PDB

ID: 3CW1; [29]) is missing from the crystal structure. Also

similarly to 2Y9A, almost all compositionally biased disorder is

missing from the structure, while almost all predicted disorder with

SS is present. Notably, the EJC, whose post-splicing functions

in exon ligation and mRNA transport involve mRNA binding,

also exhibits a high content of compositionally biased disorder

(62.9%). The RES complex also contains long regions of disorder

with very little predicted secondary structure, but we could not

unambiguously divide these regions into subregions with different

compositional bias.

Among different types of compositionally biased disorder, RS-

like IDRs are found in all groups of early proteins, while poly-P/Q

and miscellaneous noncharged IDRs are predicted to be

concentrated mainly in the U1, U2, U11/U12 and U2-related

proteins. Domain-length (≥100 residues) hnRNP-type G-rich

regions are found only in hnRNP proteins, but short (<100

residues) hnRNP-like G-rich regions are found, in addition to SR

and Sm proteins, in A-complex and U2-related proteins (Table

S2). Based on the widespread distribution of compositionally

biased IDRs in spliceosomal proteins, we speculate that interac-

tions mediated by these IDRs may be in fact more common and

important than suggested by the particular cases studied before. In

particular, the role of glycine-rich regions in many spliceosomal

proteins is unknown and requires further study. Based on the fact

that RS-like and glycine-rich disordered regions frequently appear

in the same proteins (e.g. SF2/ASF, TRAP150) and in proteins

that interact with each other and/or interact with the same RNA

(SR, hnRNP), we also suggest that these two types of regions may

interact with each other directly. If so, RS-like and glycine-rich

regions from other proteins may also interact with one another.

This interaction may be important for the regulation of splicing

and definition of intron/exon boundaries, and, by extension, for

the regulation of alternative splicing.

In contrast to early proteins, proteins of the later stages of

splicing are often predicted to contain high amounts of disorder

with SS. These proteins include proteins of the U5 snRNP and

U4/U6 di-snRNP, proteins specific to the U4/U6.U5 tri-snRNP

entity, hPrp19/CDC5L, step 2 catalytic factors, as well as B, B-act

and C-complex proteins. Most of these protein groups are also

predicted to be relatively ordered. In particular, for the isolated

proteins of the U5 snRNP, which is predicted to be the least

disordered of all the snRNP subunits, over a half of the disordered

residues are predicted to be in IDRs with SS. We suggest that, in

the case of proteins of larger complexes, disorder with SS may

acquire structure as the individual proteins of the complex come

together. If so, the U5 snRNP may be almost completely ordered

when the proteins come together in the complex. For the highly

disordered U4/U6.U5 tri-snRNP-specific proteins, high disorder

content coupled with a high content of disorder with SS suggests a

high potential for structure variability. We suggest that this

potential is exercised upon the assembly and disassembly of the tri-

snRNP. Among compositionally biased IDRs, only RS-like

domains are commonly found in the late proteins. Between

proteins of the U4/U6.U5 tri-snRNP, step 2 catalytic factors and

the abundant B, B-act and C complex stage-specific proteins, we

identified 12 RS-like IDRs, including a single RS-like IDR in the

central part of the U4/U6 di-snRNP protein U4/U6-90K and the

RS-like IDR on the N terminus of the U5 snRNP protein U5-

100K [30]. The broad distribution of the RS-like IDRs leads us to

propose that RS-like IDRs may be, in fact, a major driving force

behind spliceosome dynamics in addition to fulfilling their role in

the process of pre-mRNA recognition and intron/exon definition.



Figure 1. Intrinsic disorder content of the various groups of core spliceosome proteins. In deeper shades are marked the values for all proteins of the snRNP subunits of the major spliceosome (‘‘snRNP proteins, major spl.’’) and for all the proteins of the major spliceosome (‘‘all proteins, major spl.’’). The orange line indicates means calculated per-protein (disorder fraction was calculated for each protein first, and then a mean was taken out of this) while the green line indicates means calculated per-residue (the number of all disordered residues in a protein group divided by the total length of proteins in the group). Per-residue means are indicated above the line. Spliceosome protein groups are ordered according to per-residue means. doi:10.1371/journal.pcbi.1002641.g001
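
The per-protein and per-residue means defined in the caption of Figure 1 can differ considerably when proteins in a group vary in length. The sketch below illustrates the two definitions with made-up numbers; the data and function names are hypothetical and serve only to show the arithmetic.

```python
def per_protein_mean(proteins):
    """Mean of the per-protein disorder fractions.

    `proteins` is a list of (disordered_residues, length) pairs.
    """
    return sum(d / n for d, n in proteins) / len(proteins)

def per_residue_mean(proteins):
    """All disordered residues in the group divided by the total length."""
    return sum(d for d, _ in proteins) / sum(n for _, n in proteins)

# Made-up group: a short, mostly disordered protein and a long, mostly
# ordered one.
group = [(90, 100), (100, 1000)]
print(round(per_protein_mean(group), 3))  # 0.5
print(round(per_residue_mean(group), 3))  # 0.173
```

In this toy group the short protein dominates the per-protein mean, whereas the long protein dominates the per-residue mean, which is why the two lines in the figure need not coincide.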



Non-abundant proteins contain more compositionally biased disorder than core spliceosomal proteins

We repeated our IDR analysis for 122 additional proteins

consistently found in the results of proteomics analyses of the

major spliceosome (Table S1). The addition of these proteins

increased the overall predicted disorder content of the major

spliceosome proteome to 52.3%. Hence, the auxiliary spliceosomal

proteins have an even higher overall disorder content than the

core proteins.

For most protein groups, adding non-abundant proteins

changed IDR content values by less than 10% of the respective

lengths of proteins involved (Figure 3). In particular, non-

abundant early (A-complex and B-complex-associated) proteins

are, like abundant early proteins, estimated to be more disordered

than B-act proteins and C-complex proteins (59.5% and 58.4%

disorder content vs. 52.5% and 51.2%). Compared to abundant

proteins, non-abundant proteins are predicted to contain a larger

amount of long regions of compositional disorder (Table S2). RS-

like IDRs are again present in multiple proteins, including non-SR

proteins. In the case of the EJC, three non-abundant proteins,

acinus, pinin and RNPS1, supply the RS-like IDRs that are

missing from the EJC as defined only by abundant proteins. We

also found poly-P/Q regions, mainly in early (A-complex, U2

snRNP-related, pre-mRNA/mRNA-binding proteins and ‘‘mis-

cellaneous’’ proteins) and hnRNP proteins. Short hnRNP-like G-

rich regions are found predominantly in SR, A-complex, pre-

mRNA/mRNA-binding proteins and ‘‘miscellaneous’’ proteins, as

well as the EJC protein Aly/Ref. Most of the proteins that contain

hnRNP-like G-rich IDRs have been confirmed to bind RNA. In

short, the distribution of the non-hnRNP G-rich IDRs is similar to

the distribution of other compositionally biased IDRs, and the

distribution of compositionally biased IDRs in non-abundant

proteins is similar to their distribution in abundant proteins.

Some auxiliary proteins, such as the two RS-like IDR-rich

splicing coactivators SRm160/300, are both extremely long and

extremely disordered (SRm300: 2752 residues, predicted 98.1%

disorder content). In this particular case, the SRm160/300

proteins are thought to form a matrix promoting interactions

between splicing factors [31].

Compositionally biased disorder of spliceosome proteins (RS-like and glycine-rich) is associated with post-translational modifications (serine phosphorylation and arginine methylation)

We next considered the association of post-translational

modifications (PTMs) of human spliceosomal proteins with

intrinsic disorder. To do so, we compared our data on IDR

distribution throughout the human spliceosomal proteome with

Figure 2. Types of disorder in core spliceosomal proteins. Compositionally biased disorder (Y-axis) vs. disorder with SS (X-axis). Datapoints are colored according to predicted total per-residue disorder content. Groups of all proteins of the major spliceosome and all proteins of the snRNP subunits of the major spliceosome are indicated in bold. doi:10.1371/journal.pcbi.1002641.g002



PTM data from UniProt [32]. Four distinct PTMs are found in

UniProt data in large enough numbers to warrant numerical

analysis: phosphorylations (on various residues), lysine N-acetyla-

tions, other N-terminal acetylations and arginine methylations

(various types). Of these, N-terminal acetylation is a ubiquitous

cellular process not connected to splicing. 80–90% of human

proteins are acetylated on the N terminus [33].

82.6% of all PTMs of spliceosomal proteins found in UniProt

are phosphorylations (Table 1), of which phosphorylation on a

serine is the most common (78.9% of all phosphorylations),

followed by threonine (15.2%) and tyrosine (5.9%) phosphoryla-

tion. 32.2% of all phosphorylations are mapped to RS-like IDRs,

even though such regions comprise only 7.1% of the combined

length of the 252 spliceosome proteins. In the 122 core proteins of

the major spliceosome, which include fewer SR proteins, RS-like

IDRs comprise 3.2% of their combined length, but they

encompass as many as 23.0% of all phosphorylation sites. This

result suggests that the known cases of recorded functional

importance of phosphorylation of RS-like IDRs in non-SR

proteins may not be isolated, and that phosphorylation may be

as important a control mechanism for the function of these sites as

it is for the RS domains of SR proteins. 9.7% of PTMs are lysine

N-acetylations, which map to ordered and disordered regions in

proportions similar to the total amounts of order vs. disorder for

both the core 122 and all 252 proteins (0.6:0.4 order vs.

disorder), and therefore do not appear to be associated with either

order or disorder. Finally, UniProt registers 74 cases of arginine

methylations in the 252 spliceosome proteins (3.4% of all PTMs).

Almost all sites of arginine methylation are located in hnRNP

protein G-rich regions and shorter hnRNP-like G-rich regions in

Sm proteins, SR proteins and A-complex, pre-mRNA-binding and

miscellaneous RNA-binding proteins. Note that UniProt does not

list any arginine methylations for some proteins, such as Sm-D3,

that have been shown to contain methylated arginines [34] and

where we found a G-rich region (Table S2). Hence, arginine

methylations may be more widespread than indicated by database

data. The study of arginine methylation has so far been

overshadowed by that of the far more widespread

phosphorylation (see e.g. [8]). We suggest that the

importance of arginine methylation for spliceosomal proteins

should be considered in greater detail. In particular, the possibility

exists that, if RS-like IDRs (of SR and other proteins) interact with

Figure 3. Disorder in core vs. non-abundant spliceosome proteins. Blue bars indicate values of intrinsic disorder content for core proteins, green bars for both core and additional spliceosome proteins. The blue and green lines indicate means for given protein groups, calculated per-residue. In deeper shade, values for all core (blue) and all (green) proteins associated with the major spliceosome. doi:10.1371/journal.pcbi.1002641.g003



the hnRNP-like G-rich regions (of hnRNP and other proteins),

these interactions may be modulated by phosphorylation and by

methylation. UniProt registers also six cases of lysine methylations

at five unique residues, two of them in disordered regions and

three in ordered regions. Five of the six cases occur in proteins

with methylated arginines.
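
The over-representation of phosphorylation sites in RS-like IDRs can be checked directly from the counts in Table 1 and the length share quoted in the text. The sketch below is plain arithmetic on those reported numbers; the variable names are illustrative, and the ratio it computes is a simple share-of-sites to share-of-length comparison, not a statistical test used in the paper.

```python
# Phosphorylation counts per region class, copied from Table 1
# (252 spliceosome proteins).
phospho = {
    "Structural order": 158,
    "Disorder with SS": 326,
    "RS-like": 572,
    "Poly-P/Q": 137,
    "hnRNP-like G-rich": 82,
    "Noncharged": 43,
    "Charged": 49,
    "Other disorder": 412,
}

total = sum(phospho.values())
rs_share = phospho["RS-like"] / total

# Share of combined sequence length occupied by RS-like IDRs, as quoted
# in the text for the 252 proteins.
rs_length_share = 0.071

# Ratio of site share to length share (illustrative measure only).
enrichment = rs_share / rs_length_share

print(total)                     # 1779
print(round(100 * rs_share, 1))  # 32.2
print(round(enrichment, 1))      # 4.5
```

Under these numbers, RS-like IDRs carry roughly four and a half times as many phosphorylation sites as their share of sequence length alone would suggest.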

ULMs are associated with early proteins, while other disordered recognition motifs are found throughout splicing complexes and candidate hub proteins are associated with later stages of splicing

To further analyze the possible roles of disorder that may

acquire structure in the human spliceosome, we considered three

sources of information: data from experimentally determined

structures available in the Protein Data Bank (PDB) [35],

predictions of disordered PFAM [36] domains and predictions of

the most disordered proteins of the human spliceosome.

We browsed the experimentally determined structures of

spliceosomal protein complexes to find out which regions predicted

to be disordered in isolation were found to be ordered in a complex.

Short disordered ligand peptides (<30 residues) that acquire

structure upon binding larger partners are called Molecular

Recognition Features (MoRFs) [37], while larger sequence features

of this kind are called domain-length disordered recognition motifs

[16]. In the structures of spliceosomal protein complexes, we found

eight distinct regions that fit either definition (Table 2, Figure S3).

Three of these regions were the previously defined ULMs (UHM

Ligand Motifs), that is ligands for U2AF Homology Motif domains

[38] (ELM database: LIG_ULM_U2AF65_1). Experimental struc-

tures containing ULMs represented U2 snRNP, U2 snRNP-related

and A-complex proteins. Via a pattern recognition search, we found

additional candidate regions for ULMs, mainly in low-abundance

U2 snRNP-related proteins and A-complex proteins (Table S3).

The majority of these tentative ULMs were predicted to be

disordered. Although the presence of an individual ULM in a

sequence may not be significant, we suggest that the concentration

of sequences with ULM patterns at the early stage of the

spliceosome action may be functionally relevant, and that the

additional candidate ULMs may represent actual functional ULMs.

If so, these additional ULMs could represent a non-essential

extension of the essential UHM-ULM interactions, and UHM-

ULM interactions may form an accessory network to the network

created by compositionally biased IDRs (and their partners).

Notably, a list of candidate UHM partners for ULMs also contains

mainly early spliceosomal proteins [39].

Other recognition regions (U1snRNP70_N, SF3a60_bindingd,

SF3b1, PRP4, Btz, all of which we labeled after PFAM regions)

are found in complexes present at various stages of the splicing

reaction. Notably, the U1snRNP70_N region encompasses two

subregions, the C-terminal of which is the only predicted

disordered region shown through an experimental structure to

bind RNA. Via a profile search, we found two additional

candidate regions for the Btz motif and one additional candidate

PRP4 region. The candidate Btz regions are found in TRAP150,

an abundant A-complex protein, and its paralog BCLAF1, a low-

abundance pre-mRNA/mRNA-binding protein that has been

implicated in a wide range of processes [40]. The candidate PRP4

region is found in the U2 snRNP SF3A protein SF3a66. Unlike

the ULMs, which appear to be widespread and function in

multiple contexts at the early stage of splicing, non-ULM motifs

appear to have specific functions and bind specific partners.

To find other potential domain-length recognition motifs in

spliceosomal proteins, we considered the PFAM domains that

mapped to predicted IDRs. We found 51 such PFAM domains

(Table S4), which included both conserved disordered regions in

otherwise ordered proteins and the only conserved regions of

almost completely disordered proteins. We propose these domains

as targets for experimental structural analyses.

Notably, when we compared the list of disordered PFAM

domains with the list of the most disordered proteins in the

spliceosomal proteome, we found that this group includes two out

of three U4/U6.U5 tri-snRNP-specific proteins (U4/U6.U5-27K

and 110K), as well as several conserved proteins associated with

the B, B-act and C complex (e.g. MFAP1, RED, GCIP p29) that

are also abundant in the human spliceosomal proteome [4]

(Table 3; Figure S4). We suggest that the presence of conserved

motifs comprising disordered PFAM domains in these abundant

conserved highly disordered proteins may allow them to act as

‘‘hub’’ proteins. If so, these proteins may be crucial to spliceosome

dynamics. Targeted deletions of the conserved motifs within these

proteins may help elucidate their role.

Conserved disordered regions in spliceosomal proteins are less widespread and evolutionarily younger than essential ordered domains in the core of the spliceosome

As spliceosomal proteins found in human are typically

conserved throughout eukaryotes [41], we used the set of proteins

found in the human spliceosomal proteome to determine the

evolutionary path for the accumulation of order and disorder in

the spliceosomal proteome. We investigated whether conserved

Table 1. Post-translational modifications in 252 spliceosome proteins.

| Modification | Structural order | Disorder with SS | RS-like | Poly-P/Q | hnRNP-like G-rich | Noncharged | Charged | Other disorder | Total | Percent |
|---|---|---|---|---|---|---|---|---|---|---|
| Phosphorylation (*) | 158 | 326 | 572 | 137 | 82 | 43 | 49 | 412 | 1779 | 82.6% |
| Lysine N-acetylation | 127 | 30 | 12 | 4 | 6 | 0 | 3 | 27 | 209 | 9.7% |
| Other N-acetylation (**) | 14 | 20 | 1 | 0 | 1 | 2 | 2 | 44 | 84 | 3.9% |
| Arginine methylations (***) | 5 | 2 | 13 | 4 | 42 | 2 | 0 | 6 | 74 | 3.4% |
| Lysine methylations (****) | 3 | 0 | 2 | 0 | 0 | 0 | 0 | 1 | 6 | 0.3% |
| Cysteine methyl ester | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.0% |

(*) S, T and Y phosphorylation.
(**) N-terminal acetylation of MGASTV.
(***) Includes the keywords ‘‘dimethylarginine’’, ‘‘asymmetric dimethylarginine’’, ‘‘omega-N-methylarginine’’.
(****) Includes the keywords ‘‘N6-methyllysine’’, ‘‘N6, N6-dimethyllysine’’, ‘‘N6, N6, N6-trimethyllysine’’.
doi:10.1371/journal.pcbi.1002641.t001



ordered and disordered PFAM domains present in human

spliceosomal proteins were present in the last eukaryotic common

ancestor species (LECA), according to [42], and whether they are

currently ubiquitous outside of eukaryotes.

The majority of both ordered and disordered PFAM domains

were present in LECA (Table 4). However, while almost none of

the disordered domains are currently widespread in prokaryotes,

at least one-third of the ordered domains are. This suggests that,

unlike disordered domains, these ordered domains may have been

transferred to eukaryotes from prokaryotes, and may be, in fact,

older than LECA. Notably, the contribution of these evolutionarily

old domains is much higher in the ordered regions of the snRNP

proteins than in the general group of abundant proteins. As many

as 19 out of 29 (distinct) domains of the U4/U6.U5 tri-snRNP are

‘‘old’’ domains. Furthermore, the majority of the proteins of the

U4/U6.U5 tri-snRNP, including the Sm/Lsm proteins but not the

U4/U6.U5 tri-snRNP-specific proteins, either possess homologs

among bacterial and non-splicing-related eukaryotic proteins or

are composed of ubiquitous domains [1,43] (Table S5). The U5

snRNP contains ordered domains similar to those present in

maturase proteins of modern bacterial group II introns [44], from

which the spliceosome snRNAs and introns are predicted to have

evolved [45]. In consequence, this group of proteins/domains

has a strong potential to evolutionarily predate the eukaryotes.

Likewise, the C-terminal region of the splicing helicases hPrp2/

22/16/43 is also found in some bacterial helicases such as the

Escherichia coli HrpA and therefore is likely to be ancient [46]. We

suggest that the spliceosome likely accrued piecewise, and that

these evolutionarily old regions, which are also the most ordered

regions of the spliceosome, were recruited into the system first and

formed the structural and functional core of the spliceosome.

Disordered regions, as well as ordered domains only found in

eukaryotes, would in this scenario appear in the spliceosome later.
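
The prokaryotic shares quoted in this section can be rechecked from the domain counts reported in Table 4. The short Python sketch below does this for the ordered domains only; the variable and function names are illustrative and are not part of the original analysis.

```python
# Ordered PFAM domains: (found in prokaryotes in >100 copies, all domains).
# Counts are copied from Table 4; the names used here are illustrative.
ordered_counts = {
    "all proteins": (47, 124),
    "abundant proteins": (34, 86),
    "U4/U6.U5 tri-snRNP": (19, 29),
}


def percent(part, whole):
    """Share of `part` in `whole`, in percent, rounded to one decimal."""
    return round(100.0 * part / whole, 1)


shares = {group: percent(*pair) for group, pair in ordered_counts.items()}

for group, share in shares.items():
    print(f"{group}: {share}% of ordered domains are widespread in prokaryotes")
```

The tri-snRNP share is the largest of the three, which is the observation the argument about ‘‘old’’ domains rests on.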

The spliceosomal and the ribosomal proteomes have a similar fraction of disordered residues, but different types of intrinsic disorder

As the final step of our analysis, we compared the fractions and

distributions of intrinsic disorder in the proteomes of the subunits of

the human major spliceosome and the human and the Escherichia coli

ribosomes. The bacterial ribosome was chosen to supplement

structural information on disorder-to-order transition, as no crystal

structure of the human ribosome is presently available.

Our comparison revealed a number of similarities and

differences between the proteins of the human snRNP subunits

and both ribosomes (Table 5). The fraction of residues

predicted to be disordered is slightly higher in the ribosomal

proteins compared to proteins of the spliceosomal snRNP

subunits. The human ribosome contains more intrinsic disorder

than the E. coli one, in keeping with the overall higher disorder

content in eukaryotic proteins [47]. However, the types of the

predicted disorder in the ribosomes and in the spliceosome are

different. IDRs in ribosomal proteins are much shorter. While the

fraction of proteins with at least one IDR ≥30 residues is similar

between the human ribosome and the human spliceosome, the

fraction of proteins with at least one IDR ≥70 residues is twice as

high in the spliceosome subunits as in the human ribosome (Table 5, Figure S5).

Furthermore, the majority of intrinsic disorder in ribosomal

proteins is predicted to contain SS elements, while the majority of

intrinsic disorder in spliceosomal snRNP proteins is predicted not

to contain secondary structure. There are 15 distinct non-SS

IDRs ≥70 residues in the subunits of the human spliceosome, but

only three such regions in the human ribosome and none in the

bacterial ribosome. Disordered regions ≥70 residues without

Table 2. Regions predicted to be disordered, found to be ordered in experimentally solved complexes of spliceosomal proteins.

| Region | Type | Protein | Region (residues) | Protein group | Partner (*) | Predicted ordered/disordered status in isolation | Structure | Reference |
|---|---|---|---|---|---|---|---|---|
| N-U1snRNP70_N | MoRF | U1-70K | 8–22 | U1 snRNP | U1-C (zf-U1) | disordered, next to ordered helix | 3CW1 | [29] |
| C-U1snRNP70_N | short, RNA-binding | U1-70K | 63–89 | U1 snRNP | U1 snRNA | disordered | 3CW1 | [29] |
| ULM (**) | MoRF | SF3b155 | 333–342 | U2, SF3B | SPF45 (UHM) | disordered | 2PEH | [87] |
| ULM | MoRF | U2AF65 | 90–112 | U2 snRNP-related | U2AF35 (UHM) | disordered | 1JMT | [38] |
| ULM | MoRF | SF1 | 13–25 | A-complex (***) | U2AF65 (UHM) | disordered | 1O0P | [88] |
| SF3b1 | MoRF | SF3b155 | 377–415 | U2, SF3B | SF3b14a/p14 (RRM) | partially ordered | 2F9D | [89] |
| SF3a60_bindingd | Domain-length | SF3a60 | 71–106 | U2, SF3A | SF3a120 (Surp) | partially ordered | 2DT7 | [90] |
| PRP4 | Domain-length | U4/U6-60K | 107–137 | U4/U6 di-snRNP | U4/U6-20K | partially ordered | 1MZW | [91] |
| PRP4 (****) | Domain-length | Prp18 | 77–115 | step 2 factors | | ordered | 2DK4 | |
| Btz | Domain-length | MLN51 | 169–196, 215–230 | EJC | EIF4A3 | disordered, next to ordered helix | 2J0S | [92] |

(*) Domain names in brackets.
(**) ULMs correspond to the ELM motif LIG_ULM_U2AF65_1, defined by the pattern [KR]{1,4}[KR]-x{0,1}-[KR]W-x{0,1}.
(***) Non-abundant A-complex protein.
(****) The PRP4 region of Prp18 is ordered and its structure in isolation was solved. It is included in the table since the PRP4 region of U4/U6-60K is predicted to be partially disordered.
doi:10.1371/journal.pcbi.1002641.t002



secondary structure comprise 8.3% of the total mass of the snRNP

subunits of the major human spliceosome, but only 0.4% in the

human ribosome (Figure S6). Hence, intrinsic disorder in the

ribosomes is considerably more ‘‘structured’’ than the disorder in

the spliceosome. Both in the E. coli and in the human ribosomes,

the large subunit is predicted to contain a higher percentage of

disorder than the small subunit. However, the differences in the

fraction and type of disorder are less pronounced between the

ribosomal subunits than between the various subunits of the

spliceosome. The ribosome is therefore more homogeneous with

respect to the distribution of the intrinsic disorder of its proteins

than the spliceosome.
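
The summary statistics used in this comparison (fraction of disordered residues, mean IDR length, number of proteins with at least one long IDR) can be derived from a list of predicted IDR boundaries per protein. The following Python sketch illustrates one way to compute them. The function name, the input layout and the toy proteins are assumptions made for illustration only; they are not the data or the code used in this study.

```python
def idr_summary(proteins, thresholds=(30, 70)):
    """Summarize predicted disorder for a set of proteins.

    `proteins` maps a protein name to (length, [(start, end), ...]),
    where IDR coordinates are 1-based and inclusive.
    """
    total_length = sum(length for length, _ in proteins.values())
    idr_lengths = [
        end - start + 1
        for _, idrs in proteins.values()
        for start, end in idrs
    ]
    summary = {
        "fraction_disordered": sum(idr_lengths) / total_length,
        "mean_idr_length": (
            sum(idr_lengths) / len(idr_lengths) if idr_lengths else 0.0
        ),
    }
    for cutoff in thresholds:
        # Count proteins that carry at least one IDR of `cutoff` residues or more.
        summary[f"proteins_with_idr_ge_{cutoff}"] = sum(
            any(end - start + 1 >= cutoff for start, end in idrs)
            for _, idrs in proteins.values()
        )
    return summary


# Hypothetical toy input, not taken from the spliceosomal proteome.
toy = {
    "P1": (200, [(1, 80), (150, 170)]),
    "P2": (100, [(10, 44)]),
    "P3": (100, []),
}
toy_summary = idr_summary(toy)
```

Dividing the per-threshold counts by the number of proteins gives the fractions plotted in Figure S5.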

The inspection of crystal structures confirms the predicted

differences. 98.9% of predicted disordered residues of 51 E. coli

ribosomal proteins are found ordered in one or more crystal

structures of this ribosome. Only three proteins, L10, L7/L12 and

S1, are missing from all crystal structures of ribosomes deposited in

the PDB. Of these proteins, only L7/L12 contains an interdomain

linker that is confirmed not to acquire structure in a complex [48],

while only S1 contains a C-terminal disordered extension whose

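Table 3. ‘‘Most highly disordered’’ proteins in the spliceosomal proteome.

| Abundance | Protein | Disorder fraction | PFAM domains | Group |
|---|---|---|---|---|
| Abundant | SPF30 | 80.3% | SMN | U2 snRNP-related |
| Abundant | U4/U6.U5-110K | 87.9% | SART-1 | U4/U6.U5 tri-snRNP |
| Abundant | U4/U6.U5-27K | 76.8% | DUF1777 | U4/U6.U5 tri-snRNP |
| Abundant | CCAP2 | 78.2% | Cwf_Cwc_15 | hPrp19/CDC5L |
| Abundant | TRAP150 | 100.0% | | A-complex |
| Abundant | MFAP1 | 79.3% | MFAP1_C | B-complex |
| Abundant | RED | 79.5% | RED_N, RED_C | B-complex |
| Abundant | MGC23918 | 100.0% | cwf18 | B-act complex |
| Abundant | HSPC220 | 84.8% | Hep_59 | C-complex |
| Abundant | GCIP p29 | 93.0% | SYF2 | C-complex |
| Non-abundant | U11/U12-59K | 91.1% | | U11/U12 |
| Non-abundant | Npw38BP | 93.8% | Wbp11 | hPrp19/CDC5L |
| Non-abundant | MLN51 | 100.0% | Btz | EJC |
| Non-abundant | pinin | 92.3% | Pinin_SDK_N, Pinin_SDK_memA | EJC |
| Non-abundant | MGC13125 | 93.5% | Bud13 | RES |
| Non-abundant | C19orf43 | 88.6% | | A-complex |
| Non-abundant | FLJ10154 | 100.0% | | A-complex |
| Non-abundant | CCDC55 | 100.0% | DUF2040 | B-complex |
| Non-abundant | CCDC49 | 100.0% | CWC25 | B-complex |
| Non-abundant | PRCC | 100.0% | PRCC_Cterm | B-act complex |
| Non-abundant | DGCR14 | 86.1% | Es2 | C-complex |
| Non-abundant | DKFZP586O0120 | 100.0% | DUF1754 | C-complex |
| Non-abundant | FLJ22626 | 100.0% | SynMuv_product | C-complex |
| Non-abundant | LENG1 | 100.0% | Cir_N | C-complex |
| Non-abundant | BCLAF1 | 100.0% | | pre-mRNA/mRNA-binding |

Entries in this table simultaneously fulfill two conditions: they have a predicted disorder content >75%, and they do not contain any PFAM domains that correspond to ordered structural domains.
doi:10.1371/journal.pcbi.1002641.t003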

Table 4. Statistics of conserved ordered and disordered PFAM domains.

| | Ordered domains: all proteins | Ordered domains: abundant proteins | Ordered domains: U4/U6.U5 tri-snRNP (*) | Disordered domains: all proteins | Disordered domains: abundant proteins | Disordered domains: U4/U6.U5 tri-snRNP |
|---|---|---|---|---|---|---|
| all domains | 124 | 86 | 29 | 46 | 24 | 5 |
| domains found in LECA | 121 | 86 | 29 | 36 | 22 | 5 |
| domains found in prokaryotes (**) | 47 (37.9%) | 34 (39.5%) | 19 (65.5%) | 1 (0.0%) | 0 (0.0%) | 0 (0.0%) |

(*) Including the LSM domain present in Sm and Lsm proteins.
(**) In >100 copies.
doi:10.1371/journal.pcbi.1002641.t004



fate in a ribosome-bound form is unknown. This contrasts with the

experimentally determined structure of the U1 snRNP, which

reveals order for less than 10% of residues predicted to be

disordered in isolated U1 proteins.
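
The disorder-to-order figures quoted above (98.9% for the E. coli ribosome, less than 10% for the U1 snRNP) are fractions of predicted disordered residues that are resolved in a structure of the complex. A minimal Python sketch of this bookkeeping is given below; the function name and the set-based input format are assumptions for illustration, and the toy residue sets are hypothetical.

```python
def fraction_ordered_in_complex(predicted_disordered, resolved):
    """Fraction of predicted disordered residues resolved in a structure.

    Both arguments map a protein name to a set of residue numbers.
    Proteins absent from `resolved` (i.e. missing from every structure)
    still count toward the denominator.
    """
    total = sum(len(residues) for residues in predicted_disordered.values())
    if total == 0:
        return 0.0
    found = sum(
        len(residues & resolved.get(name, set()))
        for name, residues in predicted_disordered.items()
    )
    return found / total


# Hypothetical toy example: protein A is fully resolved, B only partly,
# and C is missing from the structure altogether.
predicted = {
    "A": set(range(1, 11)),
    "B": set(range(1, 11)),
    "C": set(range(1, 21)),
}
observed = {
    "A": set(range(1, 50)),
    "B": set(range(6, 30)),
}
toy_fraction = fraction_ordered_in_complex(predicted, observed)
```

Keeping unresolved proteins in the denominator is a design choice of this sketch; excluding them would raise the reported fraction.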

As described in the Introduction, the main function fulfilled by

IDRs in the ribosome is to be the ‘‘mortar’’ that fills in the gaps in

the rRNAs, while the RNA forms the bulk of the macromolecular

structure of the ribosome and defines its shape and catalytic center

[23,49]. Only in a few cases is a different function realized. For

instance, the flexible interdomain linker of protein L7/L12

interfaces the ribosome with ribosome-acting GTPases [48]. We

suggest that the prominence of the ‘‘mortar’’ function is the reason

both for the greater homogeneity of disorder types and their

spatial distribution in the ribosomes, and the prevalence of

disorder with SS in the ribosomes.

Although, in percentages, both the ribosomes and the spliceosome

contain a similar amount of SS disorder, so far, there is very

little structural evidence for the ‘‘mortar’’ function of the proteins of

the spliceosome. We found only one predicted disordered region

confirmed to bind RNA in all experimental structures of the

spliceosome (C-terminal part of the U1snRNP70_N region,

Table 2). Most experimental structures of splicing-related complexes

feature ordered domains on the protein side. It is possible that

novel structures will reveal binding interfaces wherein protein

disorder supports the RNA in a ‘‘mortar’’-like manner. However,

the ‘‘mortar’’ role of intrinsic disorder may be simply less important

in the spliceosome. The ribosomal RNA is longer in residues than

any given ribosomal protein, occupies more space and has a higher

molecular mass than all ribosomal proteins combined (Figure S6). In

comparison, the snRNAs are much shorter than the rRNAs. Being

shorter, they may be more likely to form a catalytically active form

unaided by proteins and thus be in less need of ‘‘mortar’’.

Summary and conclusions

The spliceosome has been called a ‘‘molecular machine’’ [11].

While useful, this metaphor may also be misleading, as it brings to

mind the image of a precise, assiduously controlled and operated

mechanism proceeding to perform the splicing reaction according

to discrete and precise steps. This mechanistic point of view of the

spliceosome action leaves very little room for uncertainty, randomness,

and fuzziness.

In this work, we made multiple predictions regarding individual

regions of human spliceosomal proteins as well as systematically

analyzed the fraction, distribution and types of disorder across the

various spliceosomal components. Summarizing, we found that

the spliceosome, far from being a uniformly ordered machine, can

be divided into three layers:

• An inner layer, which best fits the definition of a ‘‘machine’’. It

includes the ordered cores of U2 snRNP SF3B, U4/U6 di-snRNP

and U5 snRNP, as well as the Sm proteins of U1

snRNP and ordered C termini of the catalytic helicases. This

layer also includes snRNAs. Proteins from this layer mainly

assist the catalysis of the splicing reaction, and publications

regarding this layer stress relatively precise mechanisms, such

as kinetic proofreading [50]. Sm proteins, ordered proteins of

the U4/U6 di-snRNP and U5 snRNP, as well as the C termini

of catalytic helicases, are most likely the evolutionarily oldest

peptide elements of the spliceosome.

• A middle layer, which is associated mostly with ‘‘structured’’

disorder (disorder with SS). It contains an abundance of

domain-length disordered recognition motifs, disorder with

predicted secondary structure that can act as, e.g., preformed

structural elements and/or dual personality disorder, and long,

highly disordered proteins with conserved disordered regions.

Spatiotemporally, this layer is associated with U4/U6.U5

tri-snRNP-specific proteins, and B, B-act and C-complex

non-snRNP proteins. Functionally, this layer is associated with

spliceosome assembly, catalytic activation and dynamics.

Many of these regions are phosphorylated. In addition to

disorder with SS, this layer is also associated with some RS-like

IDRs that function in splicing dynamics, such as that of DDX23 [30]. This

Table 5. Features of intrinsic disorder in E. coli and human ribosomes and human major spliceosome snRNP subunits.

| Feature | Ribosome, E. coli | Ribosome, human | Major spliceosome, snRNP subunits, human |
|---|---|---|---|
| Number of proteins | 54 | 80 | 45 |
| Maximum protein length (aa) | 557 (S1) | 427 (L4) | 2335 (U5-220K/hPrp8) |
| Mean protein length (aa) | 132 | 170 | 453 |
| Fraction of predicted disorder (% of the combined lengths of proteins) | 37.7% | 47.0% | 34.1% |
| Number of proteins with at least one IDR ≥30 residues | 28 | 61 | 28 |
| Number of proteins with at least one IDR ≥70 residues | 1 | 19 | 23 |
| Mean IDR length (aa) | 28 | 39 | 93 |
| Fraction of predicted disordered residues with secondary structure (% predicted disorder) | 66.6% | 64.0% | 41.9% |
| Number of non-PSE IDRs ≥70 residues | 0 | 3 | 15 |
| Fraction of predicted disordered residues found in the crystal structure of the complex (% of predicted disorder) | 98.9% | n/a | <10% (U1 snRNP) |
| Minimal and maximal fractions of predicted disordered residues for individual subunits | 34.8% (small subunit) – 40.0% (large subunit) | 39.1% (small subunit) – 52.2% (large subunit) | 20.1% (U5 snRNP) – 65.5% (U1 snRNP) |
| Maximum RNA length (nt) | 2904 (23S) | 5070 (28S) | 188 (U2 snRNA) (*) |
| RNA fraction of total weight (% total weight) | 65.2% | 60.3% | 8.2% |

(*) Saccharomyces cerevisiae U1 snRNA is 570 nts long, while the U2 snRNA is 1172 nts long. Such exceptional lengths are restricted to the genus Saccharomyces.
doi:10.1371/journal.pcbi.1002641.t005



layer is also associated with ubiquitin-dependent systems.

Ubiquitin has been shown to control the dynamics of the

spliceosome in several cases [51]. Proteins of the spliceosome

contain many ubiquitin-related domains, and the majority of

these domains are found in the proteins associated with the

later stages of splicing [52].

• An outer layer, which is associated with mostly ‘‘unstructured’’

disorder. It is enriched in regions of long, compositionally biased

disorder that may function as sensors that the spliceosome

extends to the surrounding environment. These regions contain

interaction sites such as RS-like IDRs, hnRNP-like G-rich

regions, polyproline regions and ULMs. They may interact with

each other, or with small ordered structural domains such as the

Tudor domain (bound by hnRNP-like G-rich regions) and GYF

domain (bound by polyproline regions). On the other hand,

small RNA-binding domains present in this layer, such as RRM

(RNA Recognition Motif) and PWI, may aid in the binding of

the substrate pre-mRNA. The function of this layer is regulated

by phosphorylation (e.g. in RS-like IDRs) and methylation (e.g.

in hnRNP-like G-rich regions). Spatiotemporally, this layer is

associated with early (A-complex, U1, U2 SF3A, U11/U12,

U2-related) proteins, with SR, hnRNP proteins, and SRm160/300

proteins, and with RES complex proteins. Functionally, this

layer is associated with early recognition, intron/exon definition,

and alternative splicing regulation processes.

Full understanding of spliceosome activity requires information

about each of its elements, at different functional stages [11]. Our

predictions provide a number of testable functional hypotheses:

• We provide the proteins and positions of all types of

compositionally biased disordered regions in spliceosomal

proteins. Based on the colocation of two types of disordered

regions (RS-like and G-rich), we suggest that these regions may

interact with each other. As these two types of disordered

regions are found in multiple proteins throughout the human

spliceosomal proteome, we also suggest the possibility that many

more human spliceosomal proteins interact nonspecifically with

each other and the RNAs than previously suggested. Large-scale

deletions of compositionally biased regions may suggest essential

subsystems of this interaction network;

• We found that arginine methylation in spliceosomal proteins is

associated with intrinsically disordered regions. We also suggest

that arginine methylation and serine phosphorylation act in step

to regulate the interaction network based on compositionally

biased disordered regions. The elucidation of the effects of

post-translational modifications, such as conformational transitions

and molecular interactions that depend on the introduction or

removal of particular modifications, can also lead to an

improved understanding of regulatory mechanisms;

• We provide candidate ULM sequences that can bind known

and predicted UHM domains throughout the early stages of

splicing. These sequences may participate in the regulation of

particular instances of splicing;

• We suggest several abundant conserved proteins found in the

later stages of splicing that may function as ‘‘hub’’ proteins (e.g.

MFAP1, GCIP p29, U4/U6.U5 tri-snRNP proteins). Targeted

deletions of ordered motifs within these proteins may reveal

regions responsible for the formation of particular spliceosomal

complexes, their rearrangements, and interactions with

regulatory factors.

Our prediction that more than one-third of the residues of the

snRNPs are disordered has significant implications for the

structural studies of the spliceosome. While much progress has

been achieved in the determination of global shapes of various

spliceosomal assemblies by cryoEM [53], experimental structural

information is missing for many regions of spliceosomal proteins.

Intrinsic disorder in the spliceosome explains why: the functional

importance of disordered regions notwithstanding, their physicochemical

properties make them notorious spoilers of crystallization

experiments [54]. Our predictions of disorder may guide the

preparation of protein variants for crystallization that should be

limited to regions that are intrinsically ordered or at least predicted

to become ordered upon complex formation. For long disordered

regions without secondary structure, stable conformations may not

be obtained even in complexes. However, the structural characterization

of intrinsically disordered elements of the spliceosome

may require the application of completely different methods, such

as small angle X-ray or neutron scattering (SAXS or SANS)

experiments (review: [55]) and modeling with computational tools

such as the Ensemble Optimization Method [56]. The results of

our analyses will hopefully aid these efforts.

Methods

Data

Spliceosome proteins with GI identifiers supplied in Table S1 were downloaded from the NCBI Protein database. Protein names and identifiers were acquired from [4,6,7,57–61]. Division into abundant and non-abundant proteins was based on [4]. Assignment into protein groups was based mainly on [4], aided by information from [6,58–60]. ‘‘Miscellaneous’’ proteins were classified in primary sources, variably, as ‘‘miscellaneous proteins’’, ‘‘miscellaneous splicing factors’’, ‘‘additional proteins’’, ‘‘proteins not reproducibly detected’’, or ‘‘proteins not previously detected’’.

Prediction of intrinsic disorder and binding disorder

Initial predictions of intrinsic disorder were carried out using the

GeneSilico MetaDisorder server (http://iimcb.genesilico.pl/metadisorder/; [24]).

Subsequently, disorder boundaries yielded

by MetaDisorder were corrected manually based on predictions of

secondary structure and solvent accessibility yielded by the

GeneSilico MetaServer gateway (https://genesilico.pl/meta2/;

[25]). In particular, sequence regions predicted to exhibit stable

secondary structure and high fraction of solvent inaccessible

residues, and confidently aligned to experimentally determined

globular protein structures, were considered ordered regardless of

the primary disorder prediction. Prediction of binding disorder

was carried out using the ANCHOR server [62].

Assignment of disorder with predicted secondary structure

In disorder with SS, the disordered region is predicted to contain one or both types of canonical α and β SS elements. The predicted secondary structure may be either pre-formed in the disordered state or appear only upon the formation of a stable structure, e.g. upon binding to another molecule. This type of disorder also at times contains short ordered regions (Table 6, Figure S7).

We defined regions of disorder with SS (predicted intrinsic disorder with predicted secondary structure elements) as regions for which the majority of intrinsic disorder prediction methods on the MetaServer gateway yielded predictions of disorder and, simultaneously, the majority of secondary structure prediction methods yielded predictions of secondary structure elements. Multiple closely spaced secondary structure elements (connected by loops <20 residues) in a predicted disordered region were treated as elements of a single IDR with SS. If an IDR was predicted to contain α-helical elements and coiled-coil prediction methods aggregated on the MetaServer also yielded a prediction, the IDR was classified into the special class of disorder with coiled coils.
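
The assignment rule above can be sketched in Python as a per-residue majority vote followed by merging of closely spaced elements. This is an illustration under stated assumptions, not the procedure actually run on the MetaServer output: the function names, the boolean per-residue input format, and the requirement that the connecting loop itself be predicted disordered are all choices made for this sketch.

```python
def majority(calls):
    """Per-residue majority vote over equal-length lists of booleans."""
    n_methods = len(calls)
    return [2 * sum(column) > n_methods for column in zip(*calls)]


def runs(mask):
    """Return half-open (start, end) index pairs for runs of True."""
    found, start = [], None
    for i, flag in enumerate(mask):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            found.append((start, i))
            start = None
    if start is not None:
        found.append((start, len(mask)))
    return found


def idrs_with_ss(disorder_calls, ss_calls, max_loop=20):
    """Find IDRs with SS; elements joined by loops < max_loop are merged."""
    disordered = majority(disorder_calls)
    structured = majority(ss_calls)
    elements = runs([d and s for d, s in zip(disordered, structured)])
    merged = []
    for start, end in elements:
        if merged:
            prev_end = merged[-1][1]
            # Assumption: the loop must be short AND itself predicted disordered.
            if start - prev_end < max_loop and all(disordered[prev_end:start]):
                merged[-1] = (merged[-1][0], end)
                continue
        merged.append((start, end))
    return merged


# Hypothetical toy predictions for a 60-residue protein (0-based indices).
n = 60
disorder_calls = [[True] * n, [True] * n, [False] * n]
ss_track = [any(a <= i < b for a, b in [(5, 10), (15, 20), (50, 55)]) for i in range(n)]
ss_calls = [ss_track, ss_track, [False] * n]
toy_regions = idrs_with_ss(disorder_calls, ss_calls)
```

In the toy input, the first two elements are separated by a 5-residue loop and are merged, while the third lies 30 residues away and stays separate.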

Assignment of disorder with compositional bias

In compositionally biased disorder, the amino acid composition of the region deviates strongly from the usual. We estimated compositional bias based on the absolute frequencies of occurrence of residues, compared to their usual frequency in vertebrates, as reported on the website http://www.tiem.utk.edu/~gross/bioed/webmodules/aminoacid.htm (information from [63,64]). A residue was considered overrepresented if (a) the region under consideration displayed considerable compositional bias (at least one kind of residue occurred with a frequency >20% or five times higher than its usual frequency of occurrence in vertebrates) and (b) this particular residue occurred in the region with a frequency >20% or three times higher than the usual frequency of occurrence in vertebrates.
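
The two-step overrepresentation rule can be written down compactly. The Python sketch below reads ‘‘five (three) times higher’’ as a frequency exceeding five (three) times the background value, which is an interpretation of the text. The function name is illustrative, and the uniform 5% background used in the example is a placeholder, not the vertebrate frequency table used in this study.

```python
from collections import Counter


def overrepresented_residues(region, background, abs_cutoff=0.20,
                             region_fold=5.0, residue_fold=3.0):
    """Return the set of residues considered overrepresented in `region`.

    `background` maps one-letter residue codes to expected frequencies.
    """
    if not region:
        return set()
    freq = {aa: n / len(region) for aa, n in Counter(region).items()}

    def exceeds(aa, fold):
        return freq[aa] > abs_cutoff or freq[aa] > fold * background[aa]

    # (a) the region as a whole must show considerable compositional bias
    if not any(exceeds(aa, region_fold) for aa in freq):
        return set()
    # (b) each residue is then tested against the more lenient threshold
    return {aa for aa in freq if exceeds(aa, residue_fold)}


# Placeholder background: every residue at 5%. Real analyses would use
# measured frequencies instead of this uniform assumption.
UNIFORM = {aa: 0.05 for aa in "ACDEFGHIKLMNPQRSTVWY"}

rs_like = overrepresented_residues("RSRSRSRSRSPG", UNIFORM)
unbiased = overrepresented_residues("ACDEFGHIKLMNPQRSTVWY", UNIFORM)
```

With the placeholder background, the hypothetical RS-rich peptide reports arginine and serine, while the peptide containing each residue once reports nothing.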

For several types of compositionally biased IDRs with a

previous description in literature, we sought to define relevant

standard IDR subclasses within our classification (Table 6):

• RS-like: IDRs that are rich in arginine and serine residues.

These regions were shown to be intrinsically disordered [65].

They are predicted to have high solvent accessibility (Figure

S7). They may be phosphorylated on the serines [66]. RS-like

regions were found in splicing factors from the SR family (‘‘RS

domains’’) and in other spliceosomal proteins [67]. RS

domains of SR proteins bind other RS-like IDRs as well as

(pre-m)RNA and are crucial for the establishment of a network

of weak contacts at the initial stages of splicing and intron/exon

definition [66]. Phosphorylation of some RS domains

enhances their binding [68,69]. Phosphorylation of the RS-like

IDR of the U5 snRNP protein DDX23 is also required for its

stable association (with the U4/U6.U5 tri-snRNP) [30].

• polyP/Q: IDRs that contain repeats of proline or glutamine

residues. polyP/Q regions are capable of generating type II

poly-P or poly-Q helices [70] and may contain short linear

motifs involved in nonspecific binding of GYF and WW-type

domains [41]. They are predicted to have high solvent

accessibility (Figure S7). Several spliceosomal proteins, such

as the Sm protein SmB/B’, were shown to contain polyP/Q

regions that interact with GYF and WW-type domains.

Collectively, these regions are necessary for the formation of

complex A [71].

• hnRNP-like G-rich: IDRs that contain RGG and related

repeats ([RSY]GG, R[AGT][AGTFIVR]) and that can be classified

as short (≤100 residues) or long. These regions are

predicted to have low solvent accessibility (Figure S7), but do

not contain canonical higher order structures [72]. Repeats

that contain arginines may be methylated on these residues

[73]. Long G-rich IDRs were found in hnRNP proteins [74],

while shorter G-rich IDRs are found in other splicing proteins,

such as SmB/B’, SF2/ASF and U1-70K [73,75,76]. The

G-rich region of hnRNP A1 has been shown to bind itself and

other hnRNP proteins in vitro [77], to be necessary for the

binding of hnRNP A1 to the U2 and U4 snRNPs [78], and to

silence splicing [79]. Arginine-methylated G-rich regions may

interact with the Tudor domain of the SMN protein [80,81].

Arginine methylation of yeast U1-70K homolog decreases

binding of this protein by protein Npl3 [76].

We also defined two additional subclasses of compositionally

biased IDRs to complement the classes described above:

• ‘‘noncharged’’ disorder, which is rich in noncharged residues

(PQMGVWA);

• ‘‘charged’’ disorder, which is rich in charged residues (RKDE).

The ‘‘charged’’ compositionally biased disorder is similar to a

type of disorder with SS that has predictions for coiled-coil

secondary structure.

PTM data

Site identifiers of 2153 known or possible post-translational modifications, including 720 modifications of the 122 core proteins, were downloaded from UniProt [32]. The following post-translational modifications were included: serine, threonine and tyrosine phosphorylations, lysine N-acetylations, N-alpha-terminal N-acetylations of non-lysine residues (MGASTV), various arginine methylations and various lysine methylations. All site identifiers available were used in the analysis (i.e. including sites with a status note ‘‘By similarity’’ and sites identified as ‘‘Potential’’ or ‘‘Probable’’). 132 modification sites had a status note ‘‘Status = By similarity’’ and 8 had a status note ‘‘Status = Potential’’ or ‘‘Status = Probable’’. Removing sites identified ‘‘By similarity’’ and sites identified as ‘‘Potential’’ or ‘‘Probable’’ did not impact overall statistics. In the listing, different modifications at the same residue are considered separately (e.g. different possible arginine methylations), and the paper follows this model.
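
A cross-tabulation such as Table 1 follows from assigning each modification site to the structural class of the region that contains it. The Python sketch below shows one way to do this. The function name, the tuple-based input format, the default class label and the toy sites are assumptions for illustration; they do not reproduce the UniProt records used here.

```python
from collections import Counter, defaultdict


def tally_ptms(sites, regions, default_class="structural order"):
    """Count modification sites per (modification, region class).

    `sites` is an iterable of (protein, position, modification) tuples.
    `regions` maps a protein to a list of (start, end, region_class)
    tuples with 1-based inclusive coordinates. Residues outside every
    annotated region fall into `default_class`.
    """
    table = defaultdict(Counter)
    for protein, position, modification in sites:
        region_class = default_class
        for start, end, annotated_class in regions.get(protein, []):
            if start <= position <= end:
                region_class = annotated_class
                break
        # Different modifications at the same residue are counted separately.
        table[modification][region_class] += 1
    return table


# Hypothetical toy input.
toy_sites = [
    ("X", 5, "phosphorylation"),
    ("X", 50, "phosphorylation"),
    ("X", 50, "arginine methylation"),
    ("Y", 3, "lysine N-acetylation"),
]
toy_regions = {"X": [(40, 60, "RS-like")]}
toy_table = tally_ptms(toy_sites, toy_regions)
```

Row totals and percentages can then be obtained by summing each inner counter and dividing by the grand total.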

Table 6. Features of different IDR classes in the 130 spliceosomal proteins.

| IDR class | Description | Number of regions | Mean length | Compositional bias |
|---|---|---|---|---|
| disorder with SS | contains secondary structure | 95 (predicted to contain coiled coils), 115 (other types) | 64 aa (predicted to contain coiled coils), 55 aa (other types) | RKDE with additional MQW (predicted to contain coiled coils), no rule (other types) |
| compositionally biased, RS-like | biased towards arginine and serine residues | 35 | 65 aa | RS |
| compositionally biased, polyP/Q | noncharged with poly P/Q (P/Q(n), n≥3) repeats | 17 | 138 aa | PQMGVWA |
| compositionally biased, hnRNP G-rich | contains RGG and related repeats ([RSY]GG, R[AGT][AGTFIVR]) (*) | 4 (hnRNP proteins), 10 (other proteins) | 145 aa (hnRNP proteins), 56 aa (other proteins) | GRY |
| compositionally biased, noncharged | biased towards noncharged residues | 16 | 45 aa | PQMGVWA |
| compositionally biased, charged | biased towards charged residues | 9 | 57 aa | RKDE |

(*) [72]: XGG, where X aromatic or long aliphatic; arginine methylation data: R[AGT][AGTFIVR].
doi:10.1371/journal.pcbi.1002641.t006



Pattern recognition and motif search

Assignment of boundaries for hnRNP-like G-rich regions and for positions of candidate ULMs was based on pattern analysis. For hnRNP-like G-rich regions, the following patterns were used: [RSY]GG-x{1,50}-[RSY]GG-x{1,50}-[RSY]GG; R[AGT][AGTFIVR]-x{1,25}-RGG-x{1,25}-R[AGT][AGTFIVR]. For ULMs, the following pattern was used: [RK]{1,}-[RK]-x{0,1}-[RK]{1,}-x{0,1}-W-x{0,2}-[DE]{1,}. The ULM consensus pattern was based on the sequences of known ULMs found in experimentally determined structures of ULM complexes. This stringent pattern does not retrieve all of the bona fide ULMs in protein SF3b155 that display a weaker binding affinity to the U2AF65 partner than the ULM found in the experimentally determined structure [82]. We decided to use a stringent pattern in order to reduce the number of possible false positives compared to the more lenient pattern described in the literature [39]. The search for domain-length disordered recognition motifs was carried out with HHSEARCH [83].
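
The patterns above translate directly into regular expressions. The Python sketch below assumes that ‘‘x’’ denotes any residue and that hyphens only separate pattern positions; the helper names and the test sequences are hypothetical, and the sketch is not the search tool used in this study.

```python
import re

# Patterns quoted from the text above.
GRICH_PATTERNS = [
    "[RSY]GG-x{1,50}-[RSY]GG-x{1,50}-[RSY]GG",
    "R[AGT][AGTFIVR]-x{1,25}-RGG-x{1,25}-R[AGT][AGTFIVR]",
]
ULM_PATTERN = "[RK]{1,}-[RK]-x{0,1}-[RK]{1,}-x{0,1}-W-x{0,2}-[DE]{1,}"


def to_regex(pattern):
    """Convert the pattern notation to Python regex syntax.

    Assumes lowercase 'x' means any residue and '-' is only a separator.
    """
    return pattern.replace("-", "").replace("x", ".")


def find_matches(sequence, pattern):
    """Return (start, end, text) for each non-overlapping match.

    Coordinates are 1-based and inclusive, as in the residue ranges
    quoted in the tables.
    """
    return [
        (match.start() + 1, match.end(), match.group())
        for match in re.finditer(to_regex(pattern), sequence)
    ]


# Hypothetical test sequences, not taken from any spliceosomal protein.
ulm_hits = find_matches("AAKRKSRWDEAA", ULM_PATTERN)
grich_hits = find_matches("RGGAAYGGAASGG", GRICH_PATTERNS[0])
```

Note that `re.finditer` reports non-overlapping matches only, so overlapping candidates within one region would need a lookahead-based search.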

Assignment of PFAM domains in disordered regions and LECA presence for disordered PFAM domains

PFAM IDs were assigned on the PFAM website [36]. The list of

disordered domains present in LECA was established based on a

list of predicted LECA domains kindly provided by Prof. Adam

Godzik and Dr. Christian M. Zmasek [42].

Analysis of disorder and disorder-to-order transition in E. coli and human ribosome

E. coli and human ribosomal proteins were extracted from the

Ribosomal Protein Gene database (RPG) [84]. The following

crystal structures of E. coli ribosomes and ribosomal proteins were

used to determine disorder-to-order transitions: majority of

proteins: PDB ID: 2QAM (subunit 50S, resolution 3.21 Å) and

2QAN (subunit 30S, resolution 3.21 Å); protein L31: ribosomal

structure 2AW4; protein L1: ribosomal structure 3FIK. For

protein L7/L12, a dimer structure was used (PDB ID: 1RQU),

while for protein S1 only the one available structure of a single

domain was used (PDB ID: 2KHI).

Although a crystal structure of a eukaryotic ribosome has been

recently determined, many amino acid residues within this

structure are unassigned [85]. Hence, this structure is unsuitable

for the examination of sequences that alter their state between

order and disorder.

Visualization

Disorder and binding disorder plots were generated using the ANCHOR server (http://anchor.enzim.hu) [62]. Molecular structure graphics were produced with UCSF Chimera [86].

Supporting Information

Figure S1 The hierarchy of classification of intrinsic disorder in the spliceosomal proteome. ‘‘Compositionally

biased disorder’’ includes only disorder predicted not to contain

any secondary structure elements.

(TIF)

Figure S2 Types of disorder in core spliceosomal proteins. This figure shows the fractions of all types of disorder

with SS (left) and compositionally biased disorder (right) in various

groups of core spliceosomal proteins. Values are given as fractions of

total disorder. In this figure, disorder with SS is divided based on the

presence or absence of coiled coils and types of secondary structure.

(TIF)

Figure S3 MoRFs in the structures of spliceosome proteins. A: N-U1snRNP70_N (in yellow) and

C-U1snRNP70_N (in red) (protein U1-70K in the structure of U1

snRNP with removed Sm proteins, PDB ID: 3CW1). B: ULM

(protein SF3b155 in complex with SPF45, PDB ID: 2PEH). C:

ULM (protein U2AF65 in complex with U2AF35, PDB ID:

1JMT). D: SF3b1 (protein SF3b155 in complex with SF3b14a/

p14, PDB ID: 2F9D). E: SF3a60_bindingd (protein SF3a60 in

complex with SF3a120, PDB ID: 2DT7). F: Btz (protein MLN51

in the structure of the exon-junction complex, PDB ID: 2J0S).

(TIF)

Figure S4 Disorder plots for highly disordered spliceosome proteins. Example disorder plots created by the

ANCHOR server, http://anchor.enzim.hu. Red line: disorder

probability; blue line: probability of binding another molecule at

the residue; blue line at the bottom: another representation of the

binding probability (the darker the blue, the higher the

probability). A. MLN51 (EJC protein). The region corresponding

to the Btz MoRF lies between residues 169–230.

B. U4/U6.U5-110K. C. U4/U6.U5-27K.

(TIF)

Figure S5 IDR lengths in E. coli and human ribosome and human major spliceosome snRNP subunits. This

graph shows the fraction of proteins in the proteomes of the E. coli

(orange) and human ribosome (green) and the snRNP subunits of

the major spliceosome (blue) that contain at least one IDR of a given

length.

(TIF)

Figure S6 Structural regions in E. coli and human ribosome and human major spliceosome snRNP subunits. This graph shows the fractions of the total weight of the

three complexes taken up by different types of structural regions.

The Sm proteins were calculated four times each towards the

weight of the spliceosome.

(TIF)

Figure S7 Disorder plots for various types of IDRs found in spliceosome proteins. Example disorder plots

created by the ANCHOR server, http://anchor.enzim.hu. Red

line: disorder probability; blue line: probability of binding another

molecule at the residue; blue line at the bottom: another

representation of the binding probability (the darker the blue,

the higher the probability). A. IDR with SS: SF3b145, residues

738–818; B. RS-like IDR: protein 9G8, residues 121–215; C.

polyP/Q IDR: SF3a66, residues 216–307; D. hnRNP G-rich

IDR: hnRNPA1, residues 200–285. Interpretation of the plots: A

is predicted to contain short regions of order in regions of disorder,

B and C are predicted to be almost completely unfolded in

isolation and D is largely insoluble. A, B and C contain regions

predicted to be binding. In the case of the RS region, this

encompassed almost its entire length.

(TIF)

Table S1 Proteins of the human spliceosomes divided into groups.

(XLSX)

Table S2 Compositionally biased regions of spliceosome proteins.

(XLSX)

Table S3 Candidate ULMs, Btz and PRP4 regions in spliceosomal proteins.

(XLSX)



Table S4 PFAM domains that map to disordered regions in human spliceosomal proteins.

(XLSX)

Table S5 Conserved ordered regions in the core of the human spliceosome.

(XLSX)

Acknowledgments

We thank Łukasz Kozłowski for help with his software, Adam Godzik and

Christian Zmasek for the list of LECA domains, Ben Blencowe and

Christos Ouzounis for help with RS domains. IK thanks Peter Tompa for

the kind gift of his book on protein disorder. We thank Reinhard

Lührmann, Elżbieta Purta, Anna Czerwoniec, Łukasz Kozłowski, Joanna

Kasprzak, and Marcin Magnus for critical reading of the manuscript,

useful comments and suggestions.

Author Contributions

Conceived and designed the experiments: IK JMB. Performed the

experiments: IK. Analyzed the data: IK JMB. Contributed reagents/

materials/analysis tools: IK JMB. Wrote the paper: IK JMB.

References

1. Veretnik S, Wills C, Youkharibache P, Valas RE, Bourne PE (2009) Sm/Lsm genes provide a glimpse into the early evolution of the spliceosome. PLoS

Comput Biol 5: e1000315.

2. Kambach C, Walke S, Young R, Avis JM, de la Fortelle E, et al. (1999) Crystal

structures of two Sm protein complexes and their implications for the assembly

of the spliceosomal snRNPs. Cell 96: 375–387.

3. Valadkhan S, Jaladat Y (2010) The spliceosomal proteome: at the heart of the

largest cellular ribonucleoprotein machine. Proteomics 10: 4128–4141.

4. Agafonov DE, Deckert J, Wolf E, Odenwalder P, Bessonov S, et al. (2011) Semi-quantitative proteomic analysis of the human spliceosome via a novel two-

dimensional gel electrophoresis method. Mol Cell Biol 31: 2667–2682.

5. Zhou Z, Licklider LJ, Gygi SP, Reed R (2002) Comprehensive proteomic analysis of the human spliceosome. Nature 419: 182–185.

6. Jurica MS, Moore MJ (2003) Pre-mRNA splicing: awash in a sea of proteins.

Mol Cell 12: 5–14.

7. Bessonov S, Anokhina M, Krasauskas A, Golas MM, Sander B, et al. (2010)

Characterization of purified human Bact spliceosomal complexes reveals

compositional and morphological changes during spliceosome activation and first step catalysis. RNA 16: 2384–2403.

8. McKay SL, Johnson TL (2010) A bird’s-eye view of post-translational

modifications in the spliceosome and their roles in spliceosome dynamics. Mol Biosyst 6: 2093–2102.

9. Tarn WY, Steitz JA (1996) A novel spliceosome containing U11, U12, and U5

snRNPs excises a minor class (AT-AC) intron in vitro. Cell 84: 801–811.

10. Will CL, Schneider C, Hossbach M, Urlaub H, Rauhut R, et al. (2004) The

human 18S U11/U12 snRNP contains a set of novel proteins not found in the

U2-dependent spliceosome. RNA 10: 929–941.

11. Wahl MC, Will CL, Luhrmann R (2009) The spliceosome: design principles of a

dynamic RNP machine. Cell 136: 701–718.

12. Will CL, Luhrmann R (2005) Splicing of a rare class of introns by the U12-dependent spliceosome. Biol Chem 386: 713–724.

13. Xie H, Vucetic S, Iakoucheva LM, Oldfield CJ, Dunker AK, et al. (2007)

Functional anthology of intrinsic disorder. 1. Biological processes and functions of proteins with long disordered regions. J Proteome Res 6: 1882–1898.

14. Tompa P (2009) Structure and Function of Intrinsically Disordered Proteins.

Chapman & Hall.

15. Puntervoll P, Linding R, Gemund C, Chabanis-Davidson S, Mattingsdal M, et

al. (2003) ELM server: A new resource for investigating short functional sites in

modular eukaryotic proteins. Nucleic Acids Res 31: 3625–3630.

16. Tompa P, Fuxreiter M, Oldfield CJ, Simon I, Dunker AK, et al. (2009) Close

encounters of the third kind: disordered domains and the interactions of

proteins. Bioessays 31: 328–335.

17. Zhang Y, Stec B, Godzik A (2007) Between order and disorder in protein

structures: analysis of “dual personality” fragments in proteins. Structure 15: 1141–1147.

18. Dunker AK (2007) Another window into disordered protein function. Structure

15: 1026–1028.

19. Radivojac P, Iakoucheva LM, Oldfield CJ, Obradovic Z, Uversky VN, et al. (2007) Intrinsic disorder and functional proteomics. Biophys J 92: 1439–1456.

20. Hegyi H, Schad E, Tompa P (2007) Structural disorder promotes assembly of

protein complexes. BMC Struct Biol 7: 65.

21. Helgstrand M, Rak AV, Allard P, Davydova N, Garber MB, et al. (1999)

Solution structure of the ribosomal protein S19 from Thermus thermophilus.

J Mol Biol 292: 1071–1081.

22. Wimberly BT, Brodersen DE, Clemons WM, Jr., Morgan-Warren RJ, Carter

AP, et al. (2000) Structure of the 30S ribosomal subunit. Nature 407: 327–339.

23. Ban N, Nissen P, Hansen J, Moore PB, Steitz TA (2000) The complete atomic structure of the large ribosomal subunit at 2.4 Å resolution. Science 289: 905–920.

24. Kozlowski LP, Bujnicki JM (2012) MetaDisorder: a meta-server for the prediction of intrinsic disorder in proteins. BMC Bioinformatics 13: 111.

25. Kurowski MA, Bujnicki JM (2003) GeneSilico protein structure prediction meta-

server. Nucleic Acids Res 31: 3305–3307.

26. Ward JJ, Sodhi JS, McGuffin LJ, Buxton BF, Jones DT (2004) Prediction and

functional analysis of native disorder in proteins from the three kingdoms of life.

J Mol Biol 337: 635–645.

27. Dziembowski A, Ventura AP, Rutz B, Caspary F, Faux C, et al. (2004)

Proteomic analysis identifies a new complex required for nuclear pre-mRNA

retention and splicing. Embo J 23: 4847–4856.

28. Leung AK, Nagai K, Li J (2011) Structure of the spliceosomal U4 snRNP core

domain and its implication for snRNP biogenesis. Nature 473: 536–539.

29. Pomeranz Krummel DA, Oubridge C, Leung AK, Li J, Nagai K (2009) Crystal

structure of human spliceosomal U1 snRNP at 5.5 Å resolution. Nature 458:

475–480.

30. Mathew R, Hartmuth K, Mohlmann S, Urlaub H, Ficner R, et al. (2008)

Phosphorylation of human PRP28 by SRPK2 is required for integration of the

U4/U6-U5 tri-snRNP into the spliceosome. Nat Struct Mol Biol 15: 435–443.

31. Blencowe BJ, Bauren G, Eldridge AG, Issner R, Nickerson JA, et al. (2000) The

SRm160/300 splicing coactivator subunits. RNA 6: 111–120.

32. Magrane M, Consortium U (2011) UniProt Knowledgebase: a hub of integrated

protein data. Database 2011: bar009.

33. Hwang CS, Shemorry A, Varshavsky A (2010) N-terminal acetylation of cellular

proteins creates specific degradation signals. Science 327: 973–977.

34. Liu Q, Dreyfuss G (1995) In vivo and in vitro arginine methylation of RNA-

binding proteins. Mol Cell Biol 15: 2800–2808.

35. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, et al. (2000) The

Protein Data Bank. Nucleic Acids Res 28: 235–242.

36. Finn RD, Mistry J, Tate J, Coggill P, Heger A, et al. (2010) The Pfam protein

families database. Nucleic Acids Res 38: D211–222.

37. Vacic V, Oldfield CJ, Mohan A, Radivojac P, Cortese MS, et al. (2007)

Characterization of molecular recognition features, MoRFs, and their binding

partners. J Proteome Res 6: 2351–2366.

38. Kielkopf CL, Rodionova NA, Green MR, Burley SK (2001) A novel peptide

recognition mode revealed by the X-ray structure of a core U2AF35/U2AF65

heterodimer. Cell 106: 595–605.

39. Kielkopf CL, Lucke S, Green MR (2004) U2AF homology motifs: protein

recognition in the RRM world. Genes Dev 18: 1513–1526.

40. Sarras H, Alizadeh Azami S, McPherson JP (2010) In search of a function for

BCLAF1. ScientificWorldJournal 10: 1450–1461.

41. Collins L, Penny D (2005) Complex spliceosomal organization ancestral to

extant eukaryotes. Mol Biol Evol 22: 1053–1066.

42. Zmasek CM, Godzik A (2011) Strong functional patterns in the evolution of

eukaryotic genomes revealed by the reconstruction of ancestral protein domain

repertoires. Genome Biol 12: R4.

43. Staley JP, Woolford JL, Jr. (2009) Assembly of ribosomes and spliceosomes:

complex ribonucleoprotein machines. Curr Opin Cell Biol 21: 109–118.

44. Dlakic M, Mushegian A (2011) Prp8, the pivotal protein of the spliceosomal

catalytic center, evolved from a retroelement-encoded reverse transcriptase.

RNA 17: 799–808.

45. Michel F, Costa M, Westhof E (2009) The ribozyme core of group II introns: a

structure in want of partners. Trends Biochem Sci 34: 189–199.

46. Moriya H, Kasai H, Isono K (1995) Cloning and characterization of the hrpA

gene in the terC region of Escherichia coli that is highly similar to the DEAH

family RNA helicase genes of Saccharomyces cerevisiae. Nucleic Acids Res 23:

595–598.

47. Dunker AK, Obradovic Z, Romero P, Garner EC, Brown CJ (2000) Intrinsic

protein disorder in complete genomes. Genome Inform Ser Workshop Genome

Inform 11: 161–171.

48. Mulder FA, Bouakaz L, Lundell A, Venkataramana M, Liljas A, et al. (2004)

Conformation and dynamics of ribosomal stalk protein L12 in solution and on

the ribosome. Biochemistry 43: 5930–5936.

49. Brodersen DE, Nissen P (2005) The social life of ribosomal proteins. FEBS J 272:

2098–2108.

50. Valadkhan S (2007) The spliceosome: caught in a web of shifting interactions.

Curr Opin Struct Biol 17: 310–315.

51. Bellare P, Small EC, Huang X, Wohlschlegel JA, Staley JP, et al. (2008) A role

for ubiquitin in the spliceosome assembly pathway. Nat Struct Mol Biol 15: 444–

451.

52. Korneta I, Magnus M, Bujnicki JM (2012) Structural bioinformatics of the

human spliceosomal proteome. Nucleic Acids Res. E-pub ahead of print. doi:

10.1093/nar/gks347



53. Stark H, Luhrmann R (2006) Cryo-electron microscopy of spliceosomal components. Annu Rev Biophys Biomol Struct 35: 435–457.

54. Quevillon-Cheruel S, Leulliot N, Gentils L, van Tilbeurgh H, Poupon A (2007) Production and crystallization of protein domains: how useful are disorder predictions? Curr Protein Pept Sci 8: 151–160.

55. Bernado P, Svergun DI (2012) Structural analysis of intrinsically disordered proteins by small-angle X-ray scattering. Mol Biosyst 8: 151–167.

56. Bernado P, Mylonas E, Petoukhov MV, Blackledge M, Svergun DI (2007) Structural characterization of flexible proteins using small-angle X-ray scattering. J Am Chem Soc 129: 5656–5664.

57. Makarov EM, Makarova OV, Urlaub H, Gentzel M, Will CL, et al. (2002) Small nuclear ribonucleoprotein remodeling during catalytic activation of the spliceosome. Science 298: 2205–2208.

58. Behzadnia N, Golas MM, Hartmuth K, Sander B, Kastner B, et al. (2007) Composition and three-dimensional EM structure of double affinity-purified, human prespliceosomal A complexes. EMBO J 26: 1737–1748.

59. Deckert J, Hartmuth K, Boehringer D, Behzadnia N, Will CL, et al. (2006) Protein composition and electron microscopy structure of affinity-purified human spliceosomal B complexes isolated under physiological conditions. Mol Cell Biol 26: 5528–5543.

60. Bessonov S, Anokhina M, Will CL, Urlaub H, Luhrmann R (2008) Isolation of an active step I spliceosome and composition of its RNP core. Nature 452: 846–850.

61. Fabrizio P, Dannenberg J, Dube P, Kastner B, Stark H, et al. (2009) The evolutionarily conserved core design of the catalytic activation step of the yeast spliceosome. Mol Cell 36: 593–608.

62. Dosztanyi Z, Meszaros B, Simon I (2009) ANCHOR: web server for predicting protein binding regions in disordered proteins. Bioinformatics 25: 2745–2746.

63. King JL, Jukes TH (1969) Non-Darwinian evolution. Science 164: 788–798.

64. Dyer KF (1971) The quiet revolution: A new synthesis of biological knowledge. J Biol Edu 5: 15–24.

65. Haynes C, Iakoucheva LM (2006) Serine/arginine-rich splicing factors belong to a class of intrinsically disordered proteins. Nucleic Acids Res 34: 305–312.

66. Long JC, Caceres JF (2009) The SR protein family of splicing factors: master regulators of gene expression. Biochem J 417: 15–27.

67. Calarco JA, Superina S, O’Hanlon D, Gabut M, Raj B, et al. (2009) Regulation of vertebrate nervous system alternative splicing and development by an SR-related protein. Cell 138: 898–910.

68. Roscigno RF, Garcia-Blanco MA (1995) SR proteins escort the U4/U6.U5 tri-snRNP to the spliceosome. RNA 1: 692–706.

69. Xiao SH, Manley JL (1997) Phosphorylation of the ASF/SF2 RS domain affects both protein-protein and protein-RNA interactions and is necessary for splicing. Genes Dev 11: 334–344.

70. Cubellis MV, Caillez F, Blundell TL, Lovell SC (2005) Properties of polyproline II, a secondary structure element implicated in protein-protein interactions. Proteins 58: 880–892.

71. Kofler M, Schuemann M, Merz C, Kosslick D, Schlundt A, et al. (2009) Proline-rich sequence recognition: I. Marking GYF and WW domain assembly sites in early spliceosomal complexes. Mol Cell Proteomics 8: 2461–2473.

72. Steinert PM, Mack JW, Korge BP, Gan SQ, Haynes SR, et al. (1991) Glycine loops in proteins: their occurrence in certain intermediate filament chains, loricrins and single-stranded RNA binding proteins. Int J Biol Macromol 13: 130–139.

73. Bedford MT, Richard S (2005) Arginine methylation an emerging regulator of protein function. Mol Cell 18: 263–272.

74. Han SP, Tang YH, Smith R (2010) Functional diversity of the hnRNPs: past, present and perspectives. Biochem J 430: 379–392.

75. Sinha R, Allemand E, Zhang Z, Karni R, Myers MP, et al. (2010) Arginine methylation controls the subcellular localization and functions of the oncoprotein splicing factor SF2/ASF. Mol Cell Biol 30: 2762–2774.

76. Chen YC, Milliman EJ, Goulet I, Cote J, Jackson CA, et al. (2010) Protein arginine methylation facilitates cotranscriptional recruitment of pre-mRNA splicing factors. Mol Cell Biol 30: 5245–5256.

77. Cartegni L, Maconi M, Morandi E, Cobianchi F, Riva S, et al. (1996) hnRNP A1 selectively interacts through its Gly-rich domain with different RNA-binding proteins. J Mol Biol 259: 337–348.

78. Buvoli M, Cobianchi F, Riva S (1992) Interaction of hnRNP A1 with snRNPs and pre-mRNAs: evidence for a possible role of A1 RNA annealing activity in the first steps of spliceosome assembly. Nucleic Acids Res 20: 5017–5025.

79. Del Gatto-Konczak F, Olive M, Gesnel MC, Breathnach R (1999) hnRNP A1 recruited to an exon in vivo can function as an exon splicing silencer. Mol Cell Biol 19: 251–260.

80. Brahms H, Meheus L, de Brabandere V, Fischer U, Luhrmann R (2001) Symmetrical dimethylation of arginine residues in spliceosomal Sm protein B/B’ and the Sm-like protein LSm4, and their interaction with the SMN protein. RNA 7: 1531–1542.

81. Friesen WJ, Massenet S, Paushkin S, Wyce A, Dreyfuss G (2001) SMN, the product of the spinal muscular atrophy gene, binds preferentially to dimethylarginine-containing protein targets. Mol Cell 7: 1111–1117.

82. Thickman KR, Swenson MC, Kabogo JM, Gryczynski Z, Kielkopf CL (2006) Multiple U2AF65 binding sites within SF3b155: thermodynamic and spectroscopic characterization of protein-protein interactions among pre-mRNA splicing factors. J Mol Biol 356: 664–683.

83. Soding J (2005) Protein homology detection by HMM-HMM comparison. Bioinformatics 21: 951–960.

84. Nakao A, Yoshihama M, Kenmochi N (2004) RPG: the Ribosomal Protein Gene database. Nucleic Acids Res 32: D168–170.

85. Ben-Shem A, Jenner L, Yusupova G, Yusupov M (2010) Crystal structure of the eukaryotic ribosome. Science 330: 1203–1209.

86. Pettersen EF, Goddard TD, Huang CC, Couch GS, Greenblatt DM, et al. (2004) UCSF Chimera–a visualization system for exploratory research and analysis. J Comput Chem 25: 1605–1612.

87. Corsini L, Bonnal S, Basquin J, Hothorn M, Scheffzek K, et al. (2007) U2AF-homology motif interactions are required for alternative splicing regulation by SPF45. Nat Struct Mol Biol 14: 620–629.

88. Selenko P, Gregorovic G, Sprangers R, Stier G, Rhani Z, et al. (2003) Structural basis for the molecular recognition between human splicing factors U2AF65 and SF1/mBBP. Mol Cell 11: 965–976.

89. Schellenberg MJ, Edwards RA, Ritchie DB, Kent OA, Golas MM, et al. (2006) Crystal structure of a core spliceosomal protein interface. Proc Natl Acad Sci U S A 103: 1266–1271.

90. Kuwasako K, He F, Inoue M, Tanaka A, Sugano S, et al. (2006) Solution structures of the SURP domains and the subunit-assembly mechanism within the splicing factor SF3a complex in 17S U2 snRNP. Structure 14: 1677–1689.

91. Reidt U, Wahl MC, Fasshauer D, Horowitz DS, Luhrmann R, et al. (2003) Crystal structure of a complex between human spliceosomal cyclophilin H and a U4/U6 snRNP-60K peptide. J Mol Biol 331: 45–56.

92. Bono F, Ebert J, Lorentzen E, Conti E (2006) The crystal structure of the exon junction complex reveals how it maintains a stable grip on mRNA. Cell 126: 713–725.



Declarations

Summary in English


Introduction: The spliceosome

The spliceosome is a large molecular machine that in eukaryotic cells carries out the process of splicing –

the removal of introns (noncoding sequences) and the joining of exons (coding sequences) of the precursor

mRNA (pre-mRNA). The human spliceosome comprises 45 different proteins bound in protein-RNA

complexes called the “subunits” of the spliceosome, approximately 70-80 additional proteins found in the

spliceosome in large quantities (abundantly), and over 100 additional non-abundant proteins. Non-subunit

spliceosomal proteins may be essential to its operation, may participate in its function only in specific

instances, or may mediate between the process of splicing and other mRNA processing pathways. Non-

subunit proteins may be components of stable protein complexes or function as independent splicing

factors. Among the protein complexes functionally associated with the spliceosome are the

hPrp19/CDC5L complex as well as the EJC, CBP, TREX and RES complexes.

Introduction: Research project

My research project focused on the structural analysis and modeling of 252 human spliceosomal proteins,

including all proteins of the spliceosome subunits and all abundant non-subunit proteins. The project was

initiated by Professor Janusz M. Bujnicki, who is also the supervisor of this dissertation work. The work

completed in the project was supported by the EU 6th Framework Programme Network of Excellence

EURASNET (grant number LSHG-CT-2005-518238). Computing power has been provided in part by the

Interdisciplinary Centre for Mathematical and Computational Modeling of the University of Warsaw

(grant number G27-4).

Although the creation of an exhaustive structural representation of the protein part of the human

spliceosome has a value of its own, the research project was largely motivated by the vision of creating a

structural model of the entire spliceosome. At the moment when the project started, no high-resolution

experimental structures were available for larger regions of the spliceosome. Through combining

structural models of individual fragments of the complex with the results of experimental analyses such as

mass spectrometry and electron cryomicroscopy, it would be possible to obtain a structural model of the

spliceosome (that later could, in turn, aid in experimental work). Work towards a structural model of the

entire spliceosome is being continued by other members of Professor Bujnicki’s research

team.

First stage of the project

The first task I performed within the project was to systematically analyze the structured

regions of the proteins of the human spliceosome, as well as to review the existing high-resolution

experimental structures of the human splicing proteins and to construct high-resolution models for regions

of proteins without experimental representation.

In the 252 proteins, I detected 465 autonomous ordered structural domains that can be assigned to known

classes of structural domains. Furthermore, I discovered 25 ordered regions that could not be attributed to

known classes, but whose properties (such as coherence, length, prediction of structural order and

secondary structure elements) indicate that they may constitute autonomous domains. Among

the domains classified into a known type, several domains, including those of proteins of the subunits of

the spliceosome and other abundant proteins, had not been characterized prior to this work (e.g. PWI-type

domains found in proteins hBrr2, hPrp22 and hPrp2). Through a systematic characterization of ubiquitin-

related domains found in the human splicing factors, I concluded that these domains are common among

the human spliceosomal proteins, and are specific mainly to proteins found at the later stages of the

splicing process.



On the basis of the available experimental structures, I created a standardized library of 104 unique non-

overlapping structures. Taken together, these structures cover 20.6% of the total protein sequence

predicted to be ordered (14.3% of the total protein sequence). In addition, I constructed 255 comparative

models and 43 de novo models, which together cover three times as much of the total protein

sequence. Overall, the available experimental structures and the comparative and de novo models I

constructed cover more than 90% of the total length of the predicted ordered protein sequence (48.7% of

the total protein sequence). For the majority of disordered regions and the remaining ordered regions, I

constructed pro forma structures that will enable subsequent structural analysis of these fragments.

Domain detection and model construction were for me the hardest part of the project, for several reasons.

First, they were time-consuming and took up the lion’s share of the project’s time. Second, they required

more tenacity than genuine intellectual creativity. Finally, the ultimate determinant of the value of the

models – the creation of the model of the entire spliceosome (or, at the minimum, of sufficiently large

parts of it) – lay outside of the scope of my part of the project. Taken together, these three circumstances

vastly decreased my motivation at this stage of the project, and made it very hard for me to finish it.

Nevertheless, there were some exciting moments – the most satisfying of which was, of course, the

discovery of new structural domains in some of the most important proteins of the spliceosome, some of

which had been analyzed multiple times before (e.g. hBrr2). It is extremely satisfying to find something

that others before you have missed.

Second stage of the project

The second stage of the project consisted of an analysis of intrinsic structural disorder in the spliceosomal

proteins. Intrinsic protein disorder is defined as the lack of a stable tertiary structure of a given region of

protein while in solution in isolation, although it is possible that secondary structure elements are formed

and/or that the region acquires structure in certain conditions (for example, when the protein is bound in a

complex). The analysis of intrinsic disorder was not a part of the original project. It was only found to be

necessary after an initial analysis, when I realized that over a third of the total length of the proteins of the

subunits of the human spliceosome, and more than half the total length of all human splicing proteins, was

predicted to be intrinsically disordered.

The issue of structural disorder in the proteins of the spliceosome had not been systematically examined

prior to my analysis. Hence, the first step of the analysis was to collect as much previously published

information on the various putative functional forms of structural disorder in splicing proteins as possible.

Only after that could a systematic analysis of the human splicing proteins themselves follow. As a result

of this analysis, I discovered that the proteins of the spliceosome subunits, as well as abundant and non-

abundant proteins specific to different stages of the splicing process, differ in the content and type of

structural disorder. Proteins specific to the initial stages of the reaction, whose role is to create a network

of weak contacts between the subunits of the spliceosome and pre-mRNA, as well as between essential

and instance-specific splicing factors, contain a significant amount of structural disorder with no predicted

secondary structure elements, but exhibiting one of several characteristic types of amino acid

compositional bias. In contrast, proteins responsible for the dynamics of the process of splicing and the

interaction of the spliceosome subunits with one another contain a significant amount of structural

disorder with predicted elements of secondary structure.

During this stage of the project, I also performed several additional analyses, focusing on elements such

as: the correlation of the sites of post-translational modifications in human spliceosomal proteins with

regions of structural order or disorder; proteins with extremely high disorder content (>75%); and the comparative

evolutionary history of conserved ordered and disordered regions. I also compared intrinsic disorder in the

proteins of the human spliceosome with intrinsic disorder in the proteins of the human and bacterial

(Escherichia coli) ribosomes.
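
The ">75% disorder content" criterion mentioned above can be illustrated with a minimal sketch. It is not the pipeline used in the project: it assumes that each protein is described by a list of per-residue disorder scores between 0 and 1, and that a residue counts as disordered when its score is at least 0.5. Both the function names and the 0.5 per-residue threshold are assumptions made for illustration only; only the 75% content cutoff comes from the text.

```python
from typing import Sequence


def disorder_fraction(scores: Sequence[float], residue_threshold: float = 0.5) -> float:
    """Return the fraction of residues whose disorder score meets the threshold.

    residue_threshold is a hypothetical per-residue cutoff, not a value from the thesis.
    """
    if not scores:
        raise ValueError("scores must contain at least one residue")
    disordered = sum(1 for score in scores if score >= residue_threshold)
    return disordered / len(scores)


def is_highly_disordered(
    scores: Sequence[float],
    residue_threshold: float = 0.5,
    content_cutoff: float = 0.75,
) -> bool:
    """Flag a protein whose disorder content is strictly above the cutoff (>75%)."""
    return disorder_fraction(scores, residue_threshold) > content_cutoff


# Toy input: 8 of 10 residues carry a high disorder score.
example = [0.9] * 8 + [0.2] * 2
print(disorder_fraction(example))
print(is_highly_disordered(example))
```

In this sketch the cutoff is applied strictly, so a protein with exactly 75% disordered residues is not flagged; whether the original analysis treated the boundary this way is not stated in the text.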



This stage of the project was for me much more interesting than the first stage, because, upon combining

the results of various analyses, I was ultimately able to formulate a single coherent model (conceptual, not

structural) for a phenomenon that had not been described at all prior to that point: the presence and

function of intrinsic structural disorder in the human spliceosome. My model assumes the existence of a

hard, ordered, potentially evolutionarily ancient “core” comprising the ordered domains of proteins that

directly assist the process of splicing performed by the RNA of the spliceosome; a plastic “mantle”

containing a large amount of intrinsic disorder with secondary structure elements that can acquire or lose

structure depending on circumstances, and being responsible for the control of spliceosome dynamics

(putatively similar in this respect to the ubiquitin-related domains that I mentioned earlier); and a

relatively loose external “atmosphere” composed of intrinsic disorder without predicted secondary

structure and small ordered domains that bind RNA and protein intrinsic disorder, and active mainly in the

initial stages of the splicing process, that is, partner recognition and definition. These three “layers”

reflect the heterogeneity and complexity of the splicing reaction in humans.

My model can be used as a framework for further research into the phenomenon of intrinsic disorder in the

spliceosome. At the same time, the specific results of my analyses regarding particular protein regions, which I

presented together with the general model, can be verified experimentally (and, if necessary, be used to

correct the model).

Third stage of the project

In the third stage of the project, I compared the complement of proteins and protein domains found in the

human spliceosomal proteome with the known complement of proteins and protein domains found in the

spliceosomal proteome of the diplomonad Giardia lamblia. This species is characterized by genomic

minimalism, including a minimal number of introns in the genome.

By comparing the G. lamblia spliceosomal proteome with the human one, I was able to determine that the

G. lamblia spliceosomal proteome is missing most of the ubiquitin-related proteins and/or domains and

the majority of the structural disorder predicted to possess an independent function that is found in the human

spliceosomal proteome. On the other hand, the G. lamblia proteome contains the majority of conserved

domains from the “hard core” that directly assist the RNA during splicing catalysis and that had

probably been adapted into the spliceosome from pre-existing systems.

I find this analysis to be the most interesting part of my project – although very short, it brought about

results confirming the existence of a coherent “functionality” of the spliceosomal machine based on

ubiquitin-related domains, and another “functionality” based on intrinsically disordered regions of

proteins. The result of this analysis may help determine precedence in the modeling of the much more

complicated human spliceosome, because regions common to both human and G. lamblia proteomes

should probably be prioritized in the modeling process.

Publication of data

The structural models I have created (except for the pro forma structures) possess parameters adequate for

use in further research of the spliceosome, including the possibility of combining them with the results of

electron cryomicroscopy analyses of the spliceosome subunits in order to understand the structure of the

complex. The catalogue of structures and models is available at http://iimcb.genesilico.pl/SpliProt3D. The

website was developed by Marcin Magnus, M.Sc.

Although I did not participate in the programming of the website, I was one of its designers. From my

point of view, an interesting challenge at this stage was the necessity to create a clear visualization of the




combination of the sequence alignment of homologs of human proteins with a description of some of the

properties of the human protein, such as the predicted intrinsic disorder and secondary structure or the

position of known or predicted sites of posttranslational modifications. Pre-existing tools that aggregate

various types of data regarding protein properties (such as the GeneSilico metaserver) are powerful,

but are usually geared towards taking in as much data as possible at the expense of the aesthetics of the

presentation, and so are not well suited to data visualization. However, for the purpose of the creation of

the website as well as publication of the results, it was necessary to integrate data from alignments and

predictions in a compact form. I believe that the final result of my work, which I obtained using the

Jalview program, meets this challenge well.

Publication of results

The results of the project were published in two articles, “Structural Bioinformatics of the Human

Spliceosomal Proteome” (Korneta I., Magnus M., Bujnicki JM., 2012, doi: 10.1093/nar/gks347, PMID:

22573172) and “Intrinsic Disorder in the Human Spliceosomal Proteome” (Korneta I., Bujnicki JM., 2012

doi: 10.1371/journal.pcbi.1002641, PMID: 22912569), which comprise the dissertation.
