
V genes in primates from whole genome shotgun data

David N. Olivieri1,2 and Francisco Gambon-Deza3

1 School of Computer Science, University of Vigo, Ourense 32004, Spain.

2 Broad Institute of MIT and Harvard, Cambridge MA, 02142, USA.

3 Servicio Gallego de Salud (SERGAS), Inmunología, Hospital do Meixoeiro, 36210 Vigo.

Abstract

The adaptive immune system uses V genes for antigen recognition. The evolutionary diversification and selection processes within and across species and orders are poorly understood. Here, we studied the amino acid (AA) sequences obtained from translated in-frame V exons of immunoglobulins (IG) and T cell receptors (TR) from 16 primate species whose genomes have been sequenced. Multi-species comparative analysis supports the hypothesis that V genes in the IG loci undergo birth/death processes, thereby permitting rapid adaptability over evolutionary time. We also show that multiple cladistic groupings exist in the TRA (35 clades) and TRB (25 clades) V gene loci and that each primate species typically contributes at least one V gene to each of these clades. The results demonstrate that IG V genes and TR V genes have quite different evolutionary pathways; multiple duplications can explain the IG loci results, while co-evolutionary pressures can explain the phylogenetic results seen in genes of the TR loci. We describe how each of the 35 V gene clades of the TRA locus and 25 clades of the TRB locus must have specific and necessary roles for the viability of the species.

Keywords: Immunologic Repertoire, Primate Evolution

1. Introduction

The adaptive immune system contains natural molecular recognition machinery that is able to distinguish self from non-self and defend the body against infections (Janeway Jr, 1992). This molecular recognition system consists of two molecular structures, immunoglobulins (IG) and T lymphocyte receptors (TR). Immunoglobulins recognize antigen in soluble form and are composed of two types of molecular units, a heavy chain (IGH) and a light chain (either IGK or IGL). The recognition site is composed of the variable (V) domains present at the NH2-terminus of both chains. The antigen binding site is composed of two V domains, one from each chain. Within these protein domains, zones have been described that interact with antigen (called the complementarity determining regions, CDR), along with framework regions (FR). For IG, there are three CDR and three FR regions within each V domain. The interaction site with antigen consists of six CDR supported by six FR regions. The TR recognize antigens that are presented by the molecules of the major histocompatibility complex (MHC), as antigen-MHC molecular complexes. Despite this substantial difference between TR and IG with respect to the mechanism of antigen recognition, both possess similar structures. In particular, each has two chains possessing V domains at the site within the molecule that is responsible for antigen-MHC recognition. These domains are similar to those in IG, containing three FR regions and three CDR per chain (Janeway et al., 2001).

Preprint submitted to Elsevier July 7, 2014

bioRxiv preprint first posted online Jul. 8, 2014; doi: http://dx.doi.org/10.1101/006924. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.

Because the amino acid (AA) sequences of the IG and TR V domains are so similar, it has been hypothesized that all such sequences present today were derived from an ancestral gene (Hughes, 1994). This ancestral gene was assigned to immune recognition in the epoch coinciding with the origin of vertebrates. Evidence comes from the present-day IG and TR sequences in fish, whose structures have been maintained in all extant vertebrates (Ghaffari & Lobb, 1991).

Genes of the V domains are distributed across seven unique loci. The genes from three of these loci are used to construct IG chains, while the other four loci contain genes that encode TR chains (Janeway et al., 2005; Brack et al., 1978; Tonegawa, 1983; Davis & Bjorkman, 1988). The antigen recognition repertoire that an organism possesses is dictated by the total available set of these genes. These genes have two exons, one for the leader peptide and the other that encodes most of the V domain (the in-frame V exon) (The Immunoglobulin FactsBook (Lefranc & Lefranc, 2001a); The T cell receptor FactsBook (Lefranc & Lefranc, 2001b); Lefranc, 2014). Within the V exon, there are coding sequences for the first two complementarity determining regions (CDR1 and CDR2) for antigen recognition and the three framework regions (FR1, FR2, and FR3) (the international ImMunoGeneTics information system, http://www.imgt.org (Lefranc et al., 2009); IMGT/GENE-DB (Giudicelli & Lefranc, 2004)). A third complementarity determining region (CDR3) is generated through a gene rearrangement process, called VDJ recombination, whereby V exons are moved from their location in order to join with other gene segments, called D and J. This process is somatic and only occurs within lymphoid cells (Tonegawa, 1983).

The number of V exons for IG and TR is highly variable across different species, especially with respect to the IG loci. For example, there are approximately 600 IGHV genes in the microbat (Myotis lucifugus), while other mammals, such as those living in aquatic environments (e.g. seals, dolphins and walruses), have far fewer IGHV genes (Olivieri et al., 2013). From the V exon sequence data available at http://vgenerepertoire.org, the number of genes in the TRBV locus is approximately constant between species, while there is a large variance in the number of TRAV genes, particularly pronounced in the bovine species of the Laurasiatheria. The causes of this variability amongst species are presently unknown.

In this paper, we describe organizational and phylogenetic relationships of the amino acid (AA) sequences derived from the V exons of the order Primates uncovered from whole genome shotgun (WGS) datasets. In particular, we studied 16 representative primate species whose genomes have been sequenced in order to identify evolutionary patterns that could explain the present-day genomic repertoire of V genes. Our results show that in the IG loci, duplications and losses of V exons are common, while in the TRAV and TRBV loci, complex selection mechanisms may be responsible for conserving V exons between species.

2. Materials and methods

For these studies, we used genome data from whole genome shotgun (WGS) assemblies of species that are publicly available at the NCBI. For the majority of these species, the V genes have not been annotated, or only partial annotations have been performed in specific loci (IMGT Repertoire, http://www.imgt.org). All curated genes were entered in IMGT/GENE-DB (Giudicelli et al., 2005) and IMGT gene nomenclature has been provided to Gene at NCBI (Lefranc, 2014). We used our VgenExtractor bioinformatics tool [http://vgenerepertoire.org/downloads/] (Olivieri et al., 2013) to identify the in-frame V exon sequences from an analysis of these genome files by searching for well-established signatures and motifs. In particular, the algorithm scans large genome files and extracts candidate V exon sequences. These V exon sequences are delimited at the nucleotide level by an acceptor splice site at the 5′ end and by the V recombination signal (V-RS) at the 3′ end. Each V exon (and its translation) includes the second part of the signal peptide (L-PART2) and the V-REGION (see IMGT-ONTOLOGY 2012 for a detailed description (Giudicelli & Lefranc, 2012)). The IMGT unique numbering starts at the beginning of the V-REGION. The exons fulfill specific criteria: they are flanked at the 3′ end by the V recombination signal (V-RS), they have a reading frame without a stop codon, they are at least 280 bp long, and they contain two canonical cysteines and a tryptophan at specific positions (1st-CYS 23, 2nd-CYS 104 and CONSERVED-TRP 41, according to the IMGT unique numbering (Lefranc et al., 2003; Lefranc, 2011)).
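The acceptance criteria for a candidate exon can be illustrated with a minimal Python sketch. This is not the VgenExtractor implementation: the function names, the assumption that the candidate is already in frame, and the residue spacing windows in the regular expression are hypothetical choices made for illustration only, and the recombination signal test is omitted.

```python
import re
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Hypothetical spacing windows for Cys ... Trp ... Cys. A real check against
# IMGT positions 23, 41 and 104 would require a gapped alignment.
CYS_TRP_CYS = re.compile(r"C.{10,25}W.{50,80}C")

MIN_LENGTH_BP = 280  # minimum exon length stated in the text


def translate(nt_seq):
    """Translate an in-frame nucleotide sequence; unknown codons become 'X'."""
    nt_seq = nt_seq.upper()
    codons = (nt_seq[i:i + 3] for i in range(0, len(nt_seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def passes_v_exon_filters(nt_seq):
    """Return True if the candidate meets the length, stop-codon and motif tests."""
    if len(nt_seq) < MIN_LENGTH_BP:
        return False
    protein = translate(nt_seq)
    if "*" in protein:  # a stop codon interrupts the reading frame
        return False
    return CYS_TRP_CYS.search(protein) is not None
```

Relaxing the stop-codon test or the motif pattern in such a filter corresponds to the kind of modification that would be needed to search for pseudogenes.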

Since the VgenExtractor algorithm scans entire genomes by matching specific motif patterns along the exon sequence, a fraction of the extracted sequences can fulfill the conditions of our algorithm for being functional V genes, yet be structurally very different. Such sequences are easily discarded with a Blastp comparison against a V gene consensus sequence. We found that an ample threshold (evalue=1e-15) is sufficient for eliminating all sequences that are not V genes.
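As an illustration of this filtering step, the following sketch keeps only those candidates with a hit at or below the e-value threshold. It assumes, hypothetically, that the Blastp results were saved in the BLAST+ tabular format (`-outfmt 6`), whose default eleventh column is the e-value; the function name and the example identifiers are invented for the example.

```python
def filter_by_evalue(blast_tabular_lines, max_evalue=1e-15):
    """Return the set of query IDs having at least one hit with e-value <= max_evalue."""
    kept = set()
    for line in blast_tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query_id = fields[0]
        evalue = float(fields[10])  # default column 11 of -outfmt 6
        if evalue <= max_evalue:
            kept.add(query_id)
    return kept


# Example with two hypothetical candidates compared to a consensus sequence.
example = [
    "Vs1\tconsensus\t80.0\t95\t19\t0\t1\t95\t1\t95\t2e-40\t150",
    "Vs2\tconsensus\t30.0\t60\t42\t1\t1\t60\t5\t64\t1e-05\t40",
]
retained = filter_by_evalue(example)
```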

The VgenExtractor algorithm can be modified to identify pseudogenes by relaxing the motif filters or by relaxing the stop-codon condition. Nonetheless, this would only uncover a fraction of the complete set of pseudogenes, since others would match different criteria. A complete set of pseudogenes would remain elusive due to random alterations of sequences over evolutionary history. Thus, we limited all further study to functional genes, that is, those exons that possess the requirements seen in all V gene sequences annotated to date (IMGT Repertoire, http://www.imgt.org).

Once we identified functional V exons, we analyzed the set of translated amino acid (AA) sequences with a pipeline that we implemented within the Galaxy toolset (https://usegalaxy.org/). The steps of the workflow are as follows. First, we performed BLAST alignments of the AA sequences against V exon consensus sequences obtained from previously annotated V genes (IMGT Repertoire, http://www.imgt.org). Those sequences with a BLAST similarity score > 0.001 were retained, while other sequences were discarded. From the resulting AA translated V exon sequences, we performed multiple alignment with ClustalO (Sievers & Higgins, 2014) and phylogenetic comparison studies using SEAVIEW (Gouy et al., 2010). For the tree construction, we used a maximum likelihood algorithm and the LG matrix. Finally, we used MEGA5 (Tamura et al., 2011) and FigTree (http://tree.bio.ed.ac.uk/software/figtree/) to produce tree graphics.

We classified the V exon sequences into one of the 7 loci (IGHV, IGKV, IGLV, TRAV, TRBV, TRDV and TRGV) by obtaining a heuristic score based upon a BLASTP search against the NCBI NR protein database. The score is computed by mining the text descriptions of protein hits that have a similarity score above a predetermined threshold in order to obtain a relevant word frequency indicative of exon type. For each protein description, the word frequency is weighted by the BLAST similarity score, so that the most similar protein descriptions contribute more to the final locus classification.
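A score-weighted word frequency of this kind could be sketched as follows. The keyword table, the function name and the threshold parameter are hypothetical, since the actual vocabulary and threshold used in the classification are not given here; protein descriptions in NR are free text, so a practical keyword table would need manual tuning.

```python
from collections import defaultdict

# Hypothetical keyword lists, one per locus (illustrative only).
LOCUS_KEYWORDS = {
    "IGHV": ("heavy chain",),
    "IGKV": ("kappa",),
    "IGLV": ("lambda",),
    "TRAV": ("receptor alpha",),
    "TRBV": ("receptor beta",),
    "TRGV": ("receptor gamma",),
    "TRDV": ("receptor delta",),
}


def classify_locus(hits, min_score=0.0):
    """Classify one exon from its hits, given as (description, similarity_score) pairs.

    Each keyword occurrence is weighted by the score of the hit it came from.
    Returns the best-scoring locus, or None if no keyword was found.
    """
    totals = defaultdict(float)
    for description, score in hits:
        if score < min_score:
            continue  # ignore hits below the similarity threshold
        text = description.lower()
        for locus, keywords in LOCUS_KEYWORDS.items():
            occurrences = sum(text.count(keyword) for keyword in keywords)
            totals[locus] += score * occurrences
    if not totals or max(totals.values()) == 0:
        return None
    return max(totals, key=totals.get)
```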

We developed a Python analysis script (called Trozos, which can be freely downloaded at http://vgenextractor.org/downloads/), which we used for extracting the CDR and FR sequences from the AA translated V exon. First, we examined the V exon amino acid sequences and identified the two canonical cysteines and the conserved tryptophan, W41 (IMGT unique numbering (Lefranc et al., 2003; Lefranc, 2011)). The number of amino acids in the CDRs varies between V exons; however, we used the standardized IMGT nomenclature to define the regions. In particular, CDR1 contains six amino acids and begins after the position of the first cysteine (i.e., from cysteine + 3 to cysteine + 10). The CDR2 is defined as the sequence located between W41 + 15 and W41 + 22. The framework sequence regions are located between the CDRs. Stretches of sequences, which we indicate by (i), refer to sequences we obtained computationally. For a particular set of sequences, some parameter adjustment in the Trozos algorithm is necessary for consistency, and the final result is validated against a visualization of the sequence alignment. For a detailed study of the alignments and of the conservation sites, we used the software Jalview (Waterhouse et al., 2009).
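The offsets given above can be turned into a short sketch of how a translated V exon might be cut into regions. This is not the Trozos script itself: the function names, the regular expression used to locate the first cysteine and the conserved tryptophan, and the treatment of both offsets as inclusive bounds are assumptions made for illustration.

```python
import re

# Hypothetical anchor pattern: the first Cys followed by the conserved Trp
# within a plausible distance (the window is an illustrative assumption).
ANCHOR = re.compile(r"C.{10,25}?W")


def locate_anchors(aa_seq):
    """Return 0-based indices (cys, trp) of the two anchors, or None if absent."""
    match = ANCHOR.search(aa_seq)
    if match is None:
        return None
    return match.start(), match.end() - 1


def split_regions(aa_seq):
    """Cut a translated V exon into FR1, CDR1, FR2, CDR2 and FR3."""
    anchors = locate_anchors(aa_seq)
    if anchors is None:
        return None
    cys, trp = anchors
    cdr1_start, cdr1_end = cys + 3, cys + 10   # inclusive offsets from the text
    cdr2_start, cdr2_end = trp + 15, trp + 22  # inclusive offsets from the text
    regions = {
        "FR1": aa_seq[:cdr1_start],
        "CDR1": aa_seq[cdr1_start:cdr1_end + 1],
        "FR2": aa_seq[cdr1_end + 1:cdr2_start],
        "CDR2": aa_seq[cdr2_start:cdr2_end + 1],
        "FR3": aa_seq[cdr2_end + 1:],
    }
    # CDR1 and CDR2 concatenated, for analyses that treat the CDRs jointly.
    regions["CDRs"] = regions["CDR1"] + regions["CDR2"]
    return regions
```

The fixed offsets are the parameters that would need adjusting for a particular set of sequences, with the result checked against a visualization of the alignment.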

We obtained the V exon sequences of 16 primates whose WGS sequences are available at the NCBI. The primates included in our study and their corresponding abbreviated accession numbers are the following: the Lemuriformes: Daubentonia madagascariensis (AGTM01), Otolemur garnettii (AAQR03), Microcebus murinus (AAHY01); the Tarsiiformes: Tarsius syrichta (ABRT01); the New World monkeys: Callithrix jacchus (ACFV01), Saimiri boliviensis (AGCE01); the Old World monkeys: Macaca mulatta (AANU01), Macaca fascicularis (CAEC01), Chlorocebus sabaeus (AQIB01), Papio anubis (AHZZ01); and the Hominids: Nomascus leucogenys (ADFV01), Pongo abelii (ABGA01), Gorilla gorilla (CABD02), Pan paniscus (AJFE01), Pan troglodytes (AACZ03), and Homo sapiens (AADD01). Details of these WGS data sets are provided in Supplementary Table 1.

3. Results

Previous studies have described variations in the number of V genes between species (Guo et al., 2011; Niku et al., 2012). Likewise, we recently demonstrated the presence of distinct evolutionary processes between the IG and TR V genes (Olivieri et al., 2013). Nonetheless, the origin of this variation is still unknown. In order to understand the reason for this V gene number variation, we compared these genes amongst species of specific mammalian orders and families. Here we describe such a comparative study in primates. In particular, we studied the V exon sequences from the 16 primate species represented in Figure 1. We extracted the V exon sequences from public WGS datasets, listed in Table 1. We carried out studies in the five major simian branches. Nonetheless, there is a greater representation (six species) from the hominid group, simply because of the maturity of the available WGS data.

3.1. Immunoglobulin Genes

The IGHV locus has been described in many vertebrate species (Berman et al., 1988; Miller et al., 1998; Deza et al., 2009; Gambon-Deza et al., 2010). The joint study of all species with all V exon sequences has identified three evolutionary clans (IMGT (Lefranc, 2001)). The clans are defined by specific sequences in the Framework 1 (IMGT-FR1) and Framework 3 (IMGT-FR3) regions, which are influential in antigen recognition functionality (Kirkham et al., 1992). The IGKV and IGLV loci are less well studied and there is no published work that clearly indicates the existence of clans in these loci, as is the case in the IGHV locus.

[Figure 1 image not reproduced. Tree tips, from top to bottom: Papio anubis, Macaca fascicularis, Chlorocebus sabaeus, Macaca mulatta (Old World monkeys); Pan paniscus, Pan troglodytes, Homo sapiens, Gorilla gorilla, Pongo abelii, Nomascus leucogenys (Hominids); Saimiri boliviensis, Callithrix jacchus (New World monkeys); Tarsius syrichta (Tarsiiformes); Daubentonia madagascariensis, Otolemur garnettii, Microcebus murinus (Lemuriformes, prosimians). Time axis: 60 to 0 Mya, spanning the Eocene, the Oligocene, and the Miocene to Holocene.]

Figure 1: Phylogenetic tree of the primates included in the study. The tree was constructed from recent molecular phylogenetic data provided in (Perelman et al., 2011).

Table 1: Distribution of V-genes amongst the IG and TR loci.

Species                 IGHV  IGKV  IGLV  TRAV  TRBV  TRGV  TRDV   All
Lemuriformes
D. madagascariensis        4    10    10    40    17     2     5    88
O. garnettii               9    49    42    69    31     5    13   218
M. murinus                40    26    52    38    30     3     3   191
Tarsiiformes
T. syrichta               38    50    26    40    12     3     6   175
New World monkeys
C. jacchus                75    24    27    30    43     3     6   208
S. boliviensis            27    29    14    37    36     4     2   159
Old World monkeys
M. mulatta                63    65    51    45    57     5     6   292
M. fascicularis           67    66    48    44    56     4     7   292
C. sabaeus                64    63    44    36    53     3     7   270
P. anubis                 68    78    47    38    58     5     6   300
Hominids
N. leucogenys             28    33    20    42    37     4     5   169
P. abelii                 72    42    31    42    62     4     9   262
G. gorilla                44    30    22    41    49     3     5   194
P. paniscus               30    26    28    36    51     6     6   183
P. troglodytes            53    41    27    52    55     5     4   237
H. sapiens                44    46    33    42    49     3     6   193

From the 16 species of primates studied (Table 1), we obtained a total of 701 IGHV exon sequences. Two species (D. madagascariensis and O. garnettii) have markedly fewer IGHV genes than the average, while the rest of the primates have, on average, between 30 and 60 IGHV genes in this locus.

To compare the V exon sequences in extant primate species, we carried out multi-species phylogenetic analysis by first aligning the AA translated sequences with ClustalO (Sievers & Higgins, 2014) and then performing tree construction with FastTree (Price et al., 2010) (using maximum likelihood and the WAG matrix). Subsequently, we used FigTree or MEGA5 (Tamura et al., 2011) for visualization. The resulting phylogenetic trees show the presence of three major clades (Clans I, II and III), which have already been described in vertebrates (Kirkham et al., 1992; Lefranc & Lefranc, 2001a; Giudicelli & Lefranc, 1999) (see the IMGT clans, http://www.imgt.org) (Figure 2). Additionally, due to the large number of sequences in this study, we can discern the presence of subclades within each of the defined IGHV locus clans. Thus, the 182 primate V exon sequences found within Clan I form three subclades (A: 29 sequences, B: 22 sequences, and C: 131 sequences), the 139 sequences in Clan II form two subclades (A: 41 sequences and B: 98 sequences), and the 380 sequences in Clan III can be grouped into three subclades (A: 104 sequences, B: 94 sequences, and C: 182 sequences).
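As a quick arithmetic check, the subclade sizes quoted above add up to the stated clan totals and to the 701 IGHV sequences obtained in total. The short sketch below simply reproduces that bookkeeping; the variable names are illustrative.

```python
# Subclade sizes as quoted in the text.
subclades = {
    "Clan I": {"A": 29, "B": 22, "C": 131},
    "Clan II": {"A": 41, "B": 98},
    "Clan III": {"A": 104, "B": 94, "C": 182},
}

# Sum the subclades within each clan, then sum over the clans.
clan_totals = {clan: sum(counts.values()) for clan, counts in subclades.items()}
grand_total = sum(clan_totals.values())
```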

Clues about the origins of V gene sequences can be gained by observing the distribution of the primates amongst the clades. While Old World monkeys and Hominids have sequences in all clades, New World monkeys have no sequences within clade III-B, and Tarsiiformes and Lemuriformes have no sequences within IGHV Clan II (Table 2). We also observe that species typically have several V genes per clade, as seen in Figure 2, which may be due to recent duplication events.

Table 2: Distribution of IGHV exons across the clans (columns give the subclades of Clan I, Clan II and Clan III).

Species                  I-A   I-B   I-C  II-A  II-B III-A III-B III-C
Lemuriformes
D. madagascariensis        0     0     2     0     0     0     0     0
O. garnettii               2     0     0     0     0     2     4     1
M. murinus                 0     0     9     0     0     5    10    13
Tarsiiformes
T. syrichta                2     0    16     0     0     5     8     8
New World monkeys
C. jacchus                 7     2    18     1     2     3     0    40
S. boliviensis             2     2     8     2     1     2     0    10
Old World monkeys
M. mulatta                 2     1     4     4    14    14     9    15
M. fascicularis            2     2     6     3    15    13     9    17
C. sabaeus                 2     0    10     1     7    11     8     8
P. anubis                  2     3     9     4    18    12     7    13
Hominids
N. leucogenys              1     2     6     2     4     4     3     6
P. abelii                  3     3     9     6    18    13    10    10
G. gorilla                 1     2     9     3     6     6     6    11
P. paniscus                1     1     7     3     2     3     5     8
P. troglodytes             1     2     8     7     8     6     8    12
H. sapiens                 1     2    10     5     3     5     7    11

From each of the clades, we determined the 90% consensus sequences. These sequences are represented in the lower part of Figure 2, showing sequence conservation in the framework regions and the existence of motifs in these regions that can be used to define separate clades. As expected, the least conserved regions correspond to the CDRs.
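A 90% consensus of this kind can be computed column by column from an alignment. The sketch below is a minimal illustration rather than the procedure actually used: the function name is hypothetical, the input is assumed to be a list of equal-length aligned sequences, and alignment gaps are treated like any other symbol. A residue is reported only when it occurs in more than 90% of the sequences; otherwise the position is marked with an asterisk.

```python
from collections import Counter


def consensus(aligned_seqs, threshold=0.90):
    """Return the consensus string of a list of equal-length aligned sequences."""
    if not aligned_seqs:
        return ""
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("all aligned sequences must have the same length")
    out = []
    for column in zip(*aligned_seqs):
        residue, count = Counter(column).most_common(1)[0]
        # Keep the residue only if it is found in more than `threshold` of rows.
        out.append(residue if count / len(column) > threshold else "*")
    return "".join(out)
```

For example, twenty aligned sequences in which nineteen share a residue at some position (95%) keep that residue in the consensus, whereas a position shared by only eight of ten sequences is marked as variable.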

[Figure 2 image not reproduced. Tree panels: Clan I (subclades A, B, C), Clan II (subclades A, B) and Clan III (subclades A, B, C); scale bar 0.2. The consensus alignment from the lower panel follows. Columns correspond to FR1-IMGT, CDR1-IMGT, FR2-IMGT, CDR2-IMGT, FR3-IMGT and CDR3-IMGT, with strands A (1-15), B (16-26), BC (27-38), C (39-46), C' (47-55), C'C" (56-65), C" (66-74), D (75-84), E (85-96), F (97-104) and FG (from 105).]

Clan IIA   V*S QVTLKESGP*LVKP T*TLTLTCT*S GFSLS **G** **WIRQPP *KALEWLA* I*** D* K*YS*SLK* RL*I*KDTSK *QVVLTMTNMDP VDTATYYC A**
Clan IIB   VLVLS QV*L***GP*LVKP **TL*LTC*** G*S*S ***** **WIRQ*P GK*LEW*** I*** S*** **Y**SLK* R*T*S*DTSK *Q**L******* *DTA*YYC A**
Clan IA    VCA **QLVQS*AEVK*P GESL*ISC**S GYSF ***W I*WVRQ*P GKGLE**G* I*** DS* T*Y*PSFQG **TISAD*S* *T**LQW*SLKA SD*A*YYC A*
Clan IB    ***QV QLVQSG*E*K*PG* SVKVSCKASGY *FT *Y* *NW**QA* GQ*LEWMGW *NT* *G* P*YAQGF** *F*FS*DTS* ST*YLQISSLK* ED*A*YYC *R
Clan IC    **S*V QLVQSG*EV**PG* SVK*SCK*SGY TF* *** **WV*Q** **GLEW*G* ***MP **G* **Y*QKFQ* RVT*T*D*S* *T*YMEL*SLR* ED*A*YYC **
Clan IIIA  VQCEV QL*E*GGGLVQPGG SLRLSC**SGF TF* *** M*WV*QAP GKG*EWVG* *R*KA *G* **YA*SVKG RFTISRDDSK S***LQM**L*T EDTAVYYC *R
Clan IIIB  VQCEV QLVESGGGLVQPGG SL*LSCAASGF TFS *** M*WVRQA* GKGLEWV** ***K** *** **YA**VKG RFTISRDDSK N**YLQM*SLKT EDTAVYYC **
Clan IIIC  RVQC*V QLVESGGGL**PGG SLRLSC*ASSG FTF* *** M*W*RQAP GKGLEWV** I**** *** **Y*DSVKG RFTISR*N*K N*L*LQMNSL** EDTA*YYC **

Figure 2: The phylogenetic trees of the AA translated sequences from IGHV exons from 16 primate species. V exon sequences are obtained from whole genome shotgun (WGS) datasets using the VgenExtractor algorithm. Alignment of the amino acid sequences was performed with ClustalO, tree construction with FastTree using the WAG matrix, and visualization with FigTree. (Left:) the tree of all IGHV exon sequences; (Right:) a detailed view of the subclade, Clan I-C, showing the names of constituent species. The leaves of the taxon P. abelii are highlighted in order to illustrate the distribution of a particular species within the subclade. In the bottom part of the figure, the consensus sequences of each clade are given. The amino acids that are found in more than 90% of the sequences are marked by their letter, while the variable regions are represented by an asterisk ("*").

We found similar results in the IG light chain V genes. In particular, we found that across the primate species, there was wide variability in the number of IGKV genes (coding for the κ, or IGK, chains), ranging between 10 and 66 V genes. With respect to the number of IGLV genes (coding for the λ, or IGL, chain), we found a similar variability, between 10 and 51 V genes. Also, some variation exists in the IGKV/IGLV ratio across the primate species studied. These ratios are of interest, because it is well established that in humans and mice, there is more IGK than IGL, both in serum and with respect to the number of genes (i.e., in humans, there are 46 IGKV genes and 33 IGLV genes, while in serum approximately 70% of antibodies carry IGK chains, compared to 30% carrying IGL chains). In all primate species, we found a larger number of IGKV than IGLV genes, with only two exceptions: in prosimians, in which the ratio is unity, and in M. murinus, for which the ratio is inverted with respect to other primates.

From the 16 primate species in our study, we obtained a total of 629 IGKV exon sequences. The resulting phylogenetic tree from the corresponding AA translated sequences indicates the existence of two large (or principal) clades, Clade I (210 sequences) and Clade II (419 sequences), and two smaller clades having a lower number of sequences (Figure 3). The two principal clades have representative sequences from all the primates studied, except for prosimians, which do not have sequences in clade IIA.

We obtained a total of 522 IGLV exon sequences from the 16 primate species studied. These sequences group into five principal clades, each having representatives from all species. This clade structure may be significant, since it corresponds to the five clades in the IGLV locus that we described in a previous publication (Olivieri et al., 2014). Indeed, this structural conservation seen in the IGLV clades may have functional significance, because it is also maintained in distant reptile species.

From sequence alignments within each clade, we deduced motifs from the 90% consensus sequences (those sequences whose AA positions possess 90% similarity) given in Figure 3 (bottom). The AA shown in the consensus sequences are present in nearly all sequences, while the "*" represents sites of variability. Similar to what we showed for the IGHV exons, the clades are defined by sequence motifs present in the frameworks FR1 and FR3. The sequences in the FR2 region from different clades are similar, while variability can be detected in the regions that contain the CDRs.

Table 3: Distribution of V exons from IGKV and IGLV across clades and species (the first six columns are the IGKV clades, with mI and mII denoting the minor clades I and II; the last five columns are the IGLV clades).

                        IGKV                            | IGLV
Species                    I  IIA  IIB  IIC   mI  mII |    I   II  III   IV    V
Lemuriformes
D. madagascariensis        2    0    3    3    1    1 |    3    4    2    1    0
O. garnettii              17    0    2   25    2    1 |    5   13    4   16    4
M. murinus                15    0    2    8    0    1 |   13   24    3    5    7
Tarsiiformes
T. syrichta               15    3    1   27    0    3 |    6    8    5    4    3
New World monkeys
C. jacchus                 8    2    4    8    1    0 |   13   10    1    1    2
S. boliviensis            11    2    4   10    2    0 |    4    3    3    3    1
Old World monkeys
M. mulatta                19    9    6   29    1    1 |   14   22    6    5    4
M. fascicularis           20    9    6   28    1    1 |   10   22    6    5    5
C. sabaeus                21    5    3   15    1    0 |   12   16    6    3    7
P. anubis                 24   14    6   32    1    1 |   12   20    6    4    5
Hominids
N. leucogenys             15    2    1   14    1    0 |    6   11    0    1    2
P. abelii                  9    8    4   20    1    0 |   10   14    2    1    4
G. gorilla                 5    3    3   17    1    0 |    9    9    2    1    1
P. paniscus                9    3    2   12    0    0 |    8   11    3    3    3
P. troglodytes             6    4    3   27    1    0 |   10   11    1    3    2
H. sapiens                 1    2    4    7    1    0 |   10   14    3    2    4

[Figure 3 image not reproduced. Left tree: IGKV (kappa), with major clades I and II, subclades A, B and C, and minor clades I and II. Right tree: IGLV (lambda), with major clades I to V. Scale bars: 0.2. The consensus alignment from the lower panel follows. Columns correspond to FR1-IMGT (1-26), CDR1-IMGT (27-38), FR2-IMGT (39-55), CDR2-IMGT (56-65), FR3-IMGT (66-104) and CDR3-IMGT (105-117), with strands A (1-15), B (16-26), BC (27-38), C (39-46), C' (47-55), C'C" (56-65), C" (66-74), D (75-84), E (85-96), F (97-104) and FG.]

IGKV:
Clade I         **G D*VMTQ*PL*L**T* G***SISCR*S QSL**S****TY L*W**QKP GQ*P**LIY **.......S *R*SGVP.D RFSGSG*..G TDFTLKIS*V*A ED*GVYYC *Q****P
Clade IIA       SDT*G **V*TQSPATLS*SP GE**T*SCRAS QSV*.....S** LAWYQQKP GQAP*LLI* *A.......S *RATGIP.* RFSGSGS..G T*FTLTISSLEP ED**VY*C *******
Clade IIB       *** ****TQSP******* *****I*C*A* ***SI**G**** **WYQQ*P ***P***** **.......* ****G**.* RF*G***..G T*F**TI***** *D*A*Y*C *Q****P
Clade IIC       **C *IQMTQSPS*LSAS* GD*VTI*C*AS Q*I......*** L*WYQQKP G**P**LIY *A.......S *L**G*P.S RFSGSGS..G T***LTI**LQ* ED*A*YYC ******P
Clade minor I   *** ****TQSP**LA*** G*R*T**CK** *S*L**S**K** **W*QQ*P GQ*PK**** **.......S *R*SGVP.* R*SG**S..G TDFTLTIS**** *DV**YYC ****S*P
Clade minor II  V*G D****Q*PASL*A** GE**SISC*AS A*V......HGE *SW*RIKL GQ*LEPLIS HV.......T TLAPGVP.* *YS***S..G *SY*FSIS*L*P GDSG*YYC *HD*GW*

IGLV:
Clade I         S** ***LTQ***.VSV** GQ***ITC*G* ***......*** **W*QQK* *Q*PVL*IY **.......* *RPSGIP.* RFS*S*S..G ****LTI***** *DEADYYC ***D*****
Clade II        S*A ****TQ**S.*S*** ****T*SC*** S****...**** V*WYQQ** G**P****Y **.......* *RPSG**.* RFSGS**SS* **ASL*I*GL** EDEADYYC *********
Clade III       **S Q*VV*QE*S.****P G*TVTLTC**S *G*V*...**** **W*QQ** *Q*P**LI* *T.......* ******P.* *FSGS**..G *KAALT**GAQ* *DE**YYC *L*****IA
Clade IV        *** ***LTQ**S.ASAS* G*S**LTCTL* S**S.....*Y* **W*QQ** ***P***M* ****G..*V* **G*GIP.D *F*GS*S..G **RYLTI*N*Q* *DEA*Y*C ********
Clade V         SLS Q***TQP*S.*SA** GAS**L*CT** *****...**** **W*QQKP G*PPRYLL* ***D...S*K **GSGVP.* R*SGS*D*** N*G*L**S*LQ* EDEADYYC *****S*

Figure 3: The phylogenetic trees of the AA translated sequences from IGKV (left) and IGLV (right) exons from 16 primate species. V exon sequences are obtained from whole genome shotgun (WGS) datasets using the VgenExtractor algorithm. Alignment of the amino acid sequences was performed with ClustalO, tree construction with FastTree using the WAG matrix, and visualization with FigTree. The roots of the major clades are marked with Roman numerals. Significant subclades in the tree are identified to the right. In the bottom part of the figure, the consensus sequences of each clade are given. The amino acids that are found in more than 90% of the sequences are marked by their letter, while the variable regions are represented by an asterisk ("*").

3.2. The TR V genes

We used our gene calling algorithm, VgenExtractor, to obtain the TRV exon sequences and study the AA sequences of the V exons from the TRA and TRB loci in 16 different primate species. In particular, we obtained 670 TRAV exon sequences and found that the number of TRAV exons ranges between 30 (in C. jacchus) and 69 (in O. garnettii). From a phylogenetic study of the AA sequences of the TRAV exons, six major clades can be identified. Also, each of these clades has several subclades. From a detailed examination of these subclades, at least one sequence from each species is generally found in each subclade, indicating that these are clades of orthologous genes. Figure 4 shows the phylogenetic tree of the AA translated sequences from the TRAV exons, showing a natural grouping into 35 subclades. Most of these subclades have sequences from at least twelve species. In the figure, we zoomed in on specific subclades to expose the taxa from which the constituent sequences originate.

Table 5 lists the V exon sequence distribution for each species per clade. In general, each species is represented within each of the 35 clades with one or two genes. There are clades where the V exon sequences of some species are absent; however, this could be because we did not detect the V exon with our gene calling algorithm. In previous publications, we showed that our algorithm detects approximately 95% of V exon sequences (Olivieri et al., 2013).

Similar to the method we used in the IG loci, we determined the 90% consensus sequences for each of the 35 clades for V exons in the TRA locus. Clear differences can be seen between sequences from different clades, and conservation exists within the same clade. Table 4 shows the variability that exists within each of the V exon sequence regions (FR1, CDR1, FR2, CDR2 and FR3). Within each of these five regions, we determined the ratio of the number of variable positions (shown as '*') to the total number of positions, including the conserved amino acid sites. As expected, the CDR regions are those that exhibit the most AA site variability; however, CDR2 is less variable than CDR1, suggesting an underlying conservation process. When compared against the framework regions, the CDR2 region is only slightly more variable.

Table 4: Percentage of variable AA sites along the translated V exon primate sequences, derived from the alignments provided in Figures 4 and 6.

Locus    FR1   CDR1    FR2   CDR2    FR3
TRAV     32%    54%    30%    38%    30%
TRBV     34%    43%    27%    59%    31%
IGHV     24%    54%    31%    68%    29%
IGKV     44%    56%    35%    50%    32%
IGLV     50%    85%    51%    83%    39%
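The variability measure for a region can be illustrated with a minimal sketch that counts the positions marked '*' in a consensus segment. The function name is hypothetical, and ignoring the '.' padding characters that appear in some consensus strings is an assumption of this example; how the per-clade values were aggregated into the locus-level percentages is not specified here.

```python
def variable_fraction(consensus_region, gap_char="."):
    """Fraction of positions marked '*' among all residue positions of a segment."""
    residues = [ch for ch in consensus_region if ch != gap_char and not ch.isspace()]
    if not residues:
        return 0.0
    return residues.count("*") / len(residues)


# CDR1 segment of the IGKV Clade I consensus: 6 variable sites out of 12.
cdr1_fraction = variable_fraction("QSL**S****TY")
```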

For TRBV exon sequences, we obtained similar results with respect to clade grouping as we found for the TRA locus. In particular, we obtained 696 TRBV exon sequences, from which we can deduce 25 clades from a tree analysis. As in the previous cases discussed, we found that all clades contain V sequences from the majority of primate species. Table 6 shows the distribution of AA translated TRBV exon sequences across the different subclades for each species. As can be seen, each species contributes one or more sequences per clade. Figure 6 shows the phylogenetic tree of all the AA translated TRBV exon sequences, together with the alignment within each clade used to obtain the 90% consensus sequences. As before, we studied the variability within the canonical IMGT defined regions. The results are shown in Table 4, where it can be seen that the greatest variability is detected within the CDRs. In contrast to the TRAV genes, the CDR1 sequences of the TRBV genes have lower variability than those of CDR2.

The phylogenetic results from the TRBV and TRAV loci show that there are V exon sequences that are maintained throughout evolution across primate species, since each species contributes at least one such gene to the subclades of the tree. To confirm these results and to study which parts of the sequences are involved in the process of selection, we studied the framework and CDR sequences from each V exon sequence independently. To separate FR1, FR2, FR3, and the CDRs from the AA translated V exon sequences, we used our Python utility program, Trozos.py. Three of the resulting sequences correspond to the framework regions FR1, FR2 and FR3, while the fourth sequence is constructed by combining the two sequences from CDR1 and CDR2.


http://dx.doi.org/10.1101/006924

[Figure 4, tree panels: full tree of all TRAV exon sequences (left), collapsed tree with clades numbered 1 to 35 and groups labeled I to VI (center), and expanded representative subclades (right). Leaf labels have the form Vs<number>|<Genus_species>|trav.]

Consensus sequences per TRAV clade, in IMGT numbering (FR1 1-26, CDR1 27-38, FR2 39-55, CDR2 56-65, FR3 66-104, CDR3 from 105). Columns, left to right: leading residues shown before position 1, then positions 1-15, 16-26, 27-38, 39-46, 47-55, 56-65, 66-74, 75-84, 85-96, 97-104, and 105 onward.

Clade 1  GT* SNSVKQ.T*Q***SE GASVTMNCT** **G......YPT *FWYV*YP *KPLQ*LQ* E........T MEN.....S KNFG**NIKD KNSP**K*SV*V SDSA*YYC LL*DTVL*
Clade 2  *** GD*VTQTEG*VTL*E *****LNCTYQ **Y*.....**F *FWYVQ** *K*P*LLLK SSSE...*Q* ***.....* GF*A***KSD SSFHL*K*S*Q* SDSAVYYC ***
Clade 3  G** GDSV*QTEG**LLSE **SL*VNCSYE ***......YP* L*WYVQYP G*G**LLLK A*K*N..D** *SN.....K *FEA*Y**ET TSFHL*K*SV*E SDSAVY*C ALS
Clade 4  RT* G*SV*Q*EG**TLSE ***L*INCTYT ***......YP* LFWY*QYP GEG*QLLL* A***...*** G*N.....K GFEA*Y**ET TSFHL*K*SV** SDSAVY*C AL*
Clade 5  GLR AQ*V*QP***V*V*E G*PLT*KCTYS *SG......*PY LFWYVQ*P **GLQFLLK Y*TGD..NLV KG*.....Y GFEAEFNKSQ TSFHLKK*SAL* SDSA*YFC AV*
Clade 6  GTR AQ*VTQPEK*LSVF* GAPV*LKC*YS YSG......SP* LFWYV*YP *QRLQLLLR HI......SR ES*.....K GFTADLNK** TSFHL*K*FAQE EDSA*YYC ALS
Clade 7  GT* AQSVTQ*D****V*E *****LRCNYS SSS*.....*** LFWYVQYP NQGL*LLLK Y**G..**LV *GI.....* *F*AEF*KSE TSFHL*K*S*H* SD*A*YFC AV*
Clade 8  **R AQ*V*Q*******SE ****EL*CNYS Y***.....*** LFWYVQ*P *Q*LQLLL* ****..***V *GI.....K GFEAE***** *SF*L*K***** SD*A*YFC A**GA
Clade 9  **S LAKTTQ.PI*M*SYE GQEVNI*C*H* *IAT.....*** I*WY*QFP *QGPR*IIQ GYK.....*N **N.....E VASLF***DR KSSTL*LPR**L SD*AVYYC ***
Clade 10 T*I DAKTTQ.P*SMDC*E G*A*NLPCNHS TI**.....*EY **W*RQ** S**PQY**H GL*.....N* *TN.....* MASL*I**DR KSSTL*LPH*TL RD***YYC IVRV
Clade 11 I*G DAKTTQ.PNS*E**E EEPV*LPCNHS TISG.....**Y **WYRQ** *Q*PEYVIH GL*.....*N V**.....* MA*L*I**DR KSSTLIL***TL *D*AVY*C I*R
Clade 12 SS* SQ*VIQ*QPAIS*QE GET**LDC*** T***.....YY* **WYK**P ****I*LI* Q*T**..*T* ***.....* *YSV****A* *TI*LIIS*SQP EDSATYFC *L*E
Clade 13 *GI AQK*TQ******VQE KE*VTL*CTYD T***.....*Y* LFWYKQPS SGEMI**I* Q*SY..***N *TE.....G RYSLNFQKA* K***LVISASQ* *DSA*YFC A***
Clade 14 *SM **KVTQ****IS**E K**VTLDC*Y* ****.....*YY L*WYKQ** **E***L** **S*..*EQ* ***.....G RY**NFQK*T SS***TI*A*Q* *DSA*YFC AL**
Clade 15 *** AQTVTQ*Q*EMSV*E *E*VTL*CTY* *S**.....*Y* L*WYKQ*P S*QM***I* Q**Y..*QQN A**.....N RFSVNFQKAA KS*SL*IS*SQL *D*A*YFC A***
Clade 16 *T* GQ***Q.P*E*TA*E G**VQ*NCTYQ TS*......F*G L*WYQQ** G*AP**LSY **L....DGL ***.....G *FSSFL**S* *Y*YLLL**LQ* KDSASY*C AVR
Clade 17 V** *****Q*P**L***E G****LNC*** ***......*** **WFRQDP GKG**SL** IQS*...Q*E Q**.....* *****L*K** **S******S*P *DSATY*C A****LP
Clade 18 M*R G***EQSP*FL*V*E GD**VINCTYT DS*......STY *YWYKQ*P G**LQLL** I*SN...*D* KQD.....* RL*V*LNK** KHLSL*I*D*Q* *DSA*YFC A*S
Clade 19 V** GENVEQ*PSTL*VQ* GD**VI*CTYS DSA......S*Y FPWYKQEL GKGP***ID IRSN..**** NK*.....* R**V*LNK*A KH*SLHI***QP *DSA*YFC AA*
Clade 20 *** *E**GLH*PT**VQE GD*S*INC*YS *SA......S*Y **WYKQE* GKGPQ*I*D IRSN...**K ***.....* R*TV*LNKT* KHLSL*I**T** GDSAVYFC AE*
Clade 21 *SQ *K*VEQ******V*E ******NCTYS ***......*** F*WYRQ** *K*P*L*** *YS*...*** N**.....G RFT******S *Y*SL*IRDS** SDSATYLC A**
Clade 22 *SG KNQVEQSPQSL*ILE G*NCTLQCNYT V*P......F*N LRWYKQD* G*GP**L** MT*S...*N* *S*.....G RYTATLDA** K*SSLH*TA*QL SDSASYIC VV*
Clade 23 VNS QQGEE**.Q*LSIQE GENA**NCSYK *SI.......** LQWYRQ*S *R****L*L IRSN...*RE ***.....G RLR*TL*TS* KSSSL*I*A**A ADTA*YFC A**
Clade 24 V*S QKIEQN*.**L*I*E G**A**TCNYT *YS......P** *QWYRQDP G*G*VFLLL IREN...E*E K**.....* RL*VTFDTTL KQ**F*I*ASQP ADSATYLC A*D
Clade 25 VRS Q***Q*P.***I**E GE****NC*SS **L.......YS V*WYRQK* *E***FLM* LLKG...GEQ K*H.....D KI*A**NEKK QQSSL****SQ* *YS*TYFC **E
Clade 26 **S QE*EQSP.*SL**QE G**LTI*C*SS KTL.......Y* **WY*QKY GEGGLIFLM *L**..*GEE KSH.....* KITAKLDEKK QQS*LHITA**P SH*GIYLC G**
Clade 27 VSG QQLNQSP.QS***QE *EDVSMNCTSS S*F.......N* *LWYKQD* GEGPVLL** L*K*...GEL T*N.....G RL*AQFGITR KDSFLNISAS*P *DVG*YFC AG
Clade 28 V*T Q*LEQSP.*FLSIQE GE**T*YCNSS S*F.......** L*WYR**P GEGPVLL** *V**...GE* KK*.....K RLTFQFGDAR KDSSLHIT**QP GDTGLYLC AG
Clade 29 **G QQ**QIP.Q**H*QE GEDF*TYCN** **L.......** *QWYKQRP GG*PV*L** L*K*...GEV KK*.....K RLT**FGE** K*SSLHITA*QT *DVGTYFC A*AQCS*
Clade 30 V*S *LNVEQSPQSL*VQE GDSTNFTCSFP SS*......FYA LHWYRWET AK*P**LFV M*LN...GDE KK*.....G R***TLNTKE GYSYLYIKGSQ* EDSATYLC A*
Clade 31 VSS EDKV*Q*P**L*VHE *D**T**C*YE ***......F*S L*WYKQEK *AP*FLF*L *SSG....IE KKS.....G *LSSILD*** **S*LNITAT*T *DSA*YLCA*EAQCSL
Clade 32 **G E*QV***P**L**Q* G***S**C*Y* VS*......*** L*WYRQ** G*GP**L** *YS.....AG *EK.....* K*RL*A*L*K **S*L*IT***P EDSA*Y*C AV*
Clade 33 L*G E*KV*Q*PL*LSTQE G***TIYCNYS ***......S*R L*WYRQDP GKSLE*LFV LLSN...GAV KQE.....G *L*ASLDTKA RLS*LHI*A*** *LSATYFC AV
Clade 34 V**A*NEV*QSPQ*LT*QE GE*ITINCSYS *G*.......** L*WLQQ*P GGGIVSLF* LSS.....** KK*.....G RL*ATIN*QE *HSSLHITAS*P RDSA*YIC AV
Clade 35 *VR G**V*QSP**L***E G****L*CNFS ***.......** *QWF*QNP *G*LI*LF* ***.....GT KQ*.....G RL**T****E **S*L*I***Q* *DS**Y*C A***QCSP

Figure 4: The phylogenetic trees of the AA translated sequences from TRAV exons from 16 primate species. V exon sequences were obtained from whole genome shotgun (WGS) datasets using the Vgenextractor algorithm. Alignment of the amino acid sequences was performed with clustalO, tree construction with FastTree using the WAG matrix, and visualization with Figtree. The original tree with all V exon sequences is shown on the left. Clades are collapsed in the center tree to better illustrate the sequence similarities. Representative subclades, 8 and 23, are shown on the right to illustrate the distribution of constituent primate taxa. At the bottom of the figure, the consensus sequences of each clade are shown, where the amino acids that are found in more than 90% of the sequences are marked by their letter, while the variable positions are represented by an asterisk ("*").
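The 90% consensus rule described in the caption can be sketched as follows. This is an illustrative reading of the rule, not the script used to produce the figure: the function name `consensus` is hypothetical, the input is assumed to be a list of equal-length aligned sequences from one clade, and gap symbols are treated like any other character (so a gap present in more than 90% of the sequences is reported as a gap).

```python
from collections import Counter


def consensus(aligned, threshold=0.9, variable="*"):
    """Column-wise consensus of equal-length aligned AA sequences.

    A column is reported by its most frequent symbol when that symbol
    occurs in more than `threshold` of the sequences; otherwise the
    column is reported as `variable`.
    """
    if not aligned:
        return ""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")
    out = []
    for column in zip(*aligned):
        symbol, count = Counter(column).most_common(1)[0]
        out.append(symbol if count / len(aligned) > threshold else variable)
    return "".join(out)


# Toy clade: 19 identical fragments and one with a substitution.
mostly_conserved = consensus(["LFWYVQYP"] * 19 + ["LFWYVQFP"])
# Toy clade: the second position is conserved in exactly 90% only.
borderline = consensus(["ACD"] * 9 + ["AGD"])
```

Because the rule requires strictly more than 90%, a residue shared by exactly nine of ten sequences is reported as variable in this sketch.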


[Figure 5 image: phylogenetic tree of TRAV clades 18, 19 and 20 (scale bar 0.08), with each leaf labeled by its CDR1(i)---CDR2(i) string and sequence identifier (for example, TDSSSTY---IFSNMDM for Vs341|Homosapiens), and alignment panels marked Framework 1, CDR1(i), Framework 2, CDR2(i) and Framework 3.]

Figure 5: Detailed representation of clades 18, 19 and 20 formed from the alignment of the AA sequences of TRAV exon sequences of primates. The phylogenetic tree (left) shows the CDR1(i) and CDR2(i) sequences in order to demonstrate the similarity of these sequences between members of the same clade. The sequence alignment (right) is shown for clade 18 and clade 20. The regions that are marked have been defined by our analysis software, Trozos (see Material and Methods).


[Figure 6, tree panels: full tree of all TRBV exon sequences, collapsed tree with clades numbered 1 to 25 and groups labeled I to IX, and expanded representative subclades (scale bar 0.3).]

Consensus sequences per TRBV clade, in IMGT numbering (FR1 1-26, CDR1 27-38, FR2 39-55, CDR2 56-65, FR3 66-104, CDR3 from 105). Columns, left to right: leading residues shown before position 1, then positions 1-15, 16-26, 27-38, 39-46, 47-55, 56-65, 66-74, 75-84, 85-96, 97-104, and 105 onward.

Clade 1  *L ***VSQ*PSR*ICK* G*SV*IEC*** DFQ......ATT MFWYRQ** *Q*L*L*AT SN*G..S**T YEQG***.* *F*I*H*.*L T*S*LTV**AHP EDSSFY*C S**
Clade 2  V* ****SQKPSR**CQ* GTS**IQC*** SQ*.......** MFWY*Q*P G****L*AT ANQG..S*AT YE**F**.D KFPIS*P.NL *FST**VSN**P EDS**Y*C S**
Clade 3  ** DG*ITQSPKYLFRKE GQ*VTL*CEQN LNH.......DA MYWYRQDP GQGLRLIYY S**....V*D *QKGDI*.E GYSVSRE.*K *SFPLTVTSAQ* N*TAFYLC ASSI
Clade 4  P* EAQVTQNPRYLITVT GKKLTVTCSQN MNH.......*Y MSWYRQDP GLGLRQIYY S*N....V** *DKGD*P.E GY*VSRK.EK RNFPLI*ESP*P *QTSLY*C ASSL
Clade 5  LV ***VTQ**R*L*KR* GE*V*LEC*QD MDH.......** MFWYRQDP GLGLRLIYF S*D....*** *E*GD*P.* GY*VSR*.KK **FSL*L*SA*T *QTS*YLC ASS*SA
Clade 6  ** *A***Q*PR***I*T GK***L*CSQ* M*H.......** MYW***** G*******Y S**....**S TE*GD*S.* ***VSR*.** **FPLTLESA** **TS*YLC ASS*
Clade 7  *M DA*VTQTPRN*I*KT G**I*LECSQT **H.......** MYWYRQDP GLGL*LIYY S**....V*D **KGE*S.* GY*VSR*.*Q *KFSLSLE*A** NQTALYFC A*S*
Clade 8  H* DA*ITQ*PR*K*TET G**VTL*CHQT **H.......** M*WYRQD* G*GLRLI*Y S**....*** **K*EV*.D GY*VSRS.** E*F*LTLESA** SQTSVYFC A*S*
Clade 9  ** *A*VTQ*P****L** GQ**T**C*QD M*H.......** M*WYRQD* G*GLRLI*Y S**....*G* T**GEV*.* GY*VSR*.** **F*L*L*SA** SQTS*YFC ASS*ATV
Clade 10 *R *QT*HQWPA**VQP* GSPLSLECTV* GTS......NP* LYWYRQ** ***LQLLFY S**.....** Q**SE**.Q NLSASR*.Q* **F*LSSKKLLL SDSGFYLC AWS
Clade 11 H* **MVIQNPRYQ*T** *KPVTLSCSQN *NH.......** MYWYQQK* SQAPKLL** YYD....*** N*E*DT*.D NFQ**RP.NT SFC**DI*S*GL *D*A*YLC A*S*
Clade 12 *L DTAV*QTPKYL*TQ* G****LKCEQ* LGH.......** MYWYKQDS *K*LK*MF* Y*N*....** **NET**.* RFSP*S*.DK A*L*LHI*S*E* GDSAVY*C ASS*
Clade 13 P* ****TQTP*HLV*** **KK*L*CEQ* *GH.......** MYWY*Q** *K**E*MF* Y**....*** **N**VP.S RF*PE**.*S S*L*LH***LQP EDSA*YLC ASSQ
Clade 14 *V *AGV*Q*PR*LIK*K *E*A*L*CYP* **H.......*T VYWYQQ*P *Q**QFLIS *Y*....KMQ **KG*IP.* RF*AQQF.*D YHSE*N*SSLEL GDSA*Y*C ASS*
Clade 15 *V D*GVTQTPKHL*TA* GQ*VTLRCSPR SGD.......*S VYWY*QSL *Q*LQFLIQ YYN....G*E **KGNI*.E RFS*QQF.** **SELNLSSLEL GDSALYFC ASS*
Clade 16 ** **GVTQ*P**LIK*R GQQVTL*CSP* SGH.......** V*WYQQ** GQG*Q**** Y**....*** ***GNFP.* RFS**QF.** **SE*NV***** *DSALYLC ASSL
Clade 17 L* NAG**QNPRHLVRR* GQEA*L*CSP* KGH.......*H VYWY*QL* *EGLKFM*Y LQKE...**I DESGMP*.* *FSAEFP.KE GPS*L*IQQA** *DSA*YFC ASS*
Clade 18 SP GEEV*QTP*HLV*G* GQKA*LYC*PI *G*.......*Y *FWYQ*VL *KEFKFLIS FQN*...N*F D*TGMPK.* RFSAKC*.*N S*CSLEIQAT** *DSA*Y*C ASSQ
Clade 19 S* DT*VTQ*PR**V*** *QK*K*DCVP* K*H.......SY VYWY*K*L *EELKFL*Y *QN*...**I *K*E*IN.* RF*AQC*.*N S*C*LEIQSTE* GD***YFC A*S*
Clade 20 *F *A*VTQTPG*L*K*K G*K**M*C*P* *GH.......** **WYQQ*Q NKE***L** FQ**...*** **TE**K.* RFS**CP.** *PC*L*I*S**P GD*ALY*C ASS*
Clade 21 ** DAGV*Q*P*H*VTEM G**VT*RC*PI *GH.......** **WYRQT* **GLE*L*Y F***...*** DDS*MPK.D RFSA*MP.** ***TLKIQP*EP *DSA*Y*C AS*L
Clade 22 H* *AGVTQFPSH*VIEK GQ*VTLRCDPI SGH.......** L*WYRR*M GKE*KFL** F***...**Q DESGMP*.* RF*A**T.GG T*STL*VQ*AEL EDSG*YFC ASS*
Clade 23 *T EP*V*QTPSHQVT*M GQ*VIL*C*PI **H.......** FYWYRQI* GQK*EFL** F***...*I* **SEIF*.D *FS**R*.*G ***TLKI*STKL EDSA*YFC ASS*
Clade 24 ** EA*V*QSPRYKI*EK *Q*V*FWC*P* SGH.......*T LYWY*Q*L GQGP*LL** ****...**V DDSQLPK.D RFSAER*.KG V*STL*IQPA*L *DSA*YLC ASSL
Clade 25 H* *AGVSQ*P**K**** G**V***CDPI S*H.......** LYWY*Q** GQG*E*L*Y F***...*** **SGL**.* RF*A*R*.*G S*STL*IQ*T*Q *DSA*YLC ASS*ARA

Figure 6: The phylogenetic trees of the AA translated sequences from TRBV exons from 16 primate species. V exon sequences were obtained from whole genome shotgun (WGS) datasets using the Vgenextractor algorithm. Alignment of the amino acid sequences was performed with clustalO, tree construction with FastTree using the WAG matrix, and visualization with Figtree. The original tree with all V exon sequences is shown on the left. Clades are collapsed in the center tree to better illustrate the sequence similarities. Representative subclades, 3, 16, and 33, are shown on the right to illustrate the distribution of constituent primate taxa. At the bottom of the figure, the consensus sequences of each clade are shown, where the amino acids that are found in more than 90% of the sequences are marked by their letter, while the variable positions are represented by an asterisk ("*").


Table 5: Number of TRAV exons present in each clade by species in the phylogenetic tree defined in Figure 4.

Species columns: Dmad = D. madagascariensis, Ogar = O. garnettii, Mmur = M. murinus, Tsyr = T. syrichta, Cjac = C. jacchus, Sbol = S. boliviensis, Mmul = M. mulatta, Mfas = M. fascicularis, Csab = C. sabaeus, Panu = P. anubis, Nleu = N. leucogenys, Pabe = P. abelii, Ggor = G. gorilla, Ppan = P. paniscus, Ptro = P. troglodytes, Hsap = H. sapiens.

Clade Dmad Ogar Mmur Tsyr Cjac Sbol Mmul Mfas Csab Panu Nleu Pabe Ggor Ppan Ptro Hsap
1     0    0    0    0    1    1    0    0    0    0    0    1    0    1    1    1
2     1    2    1    1    1    1    1    1    1    1    1    1    0    0    0    0
3     1    1    1    1    1    1    1    1    1    0    1    1    1    0    1    1
4     3    4    1    0    0    0    1    1    1    1    1    1    1    1    1    1
5     1    2    0    1    1    1    1    1    1    1    1    1    1    1    1    1
6     0    0    0    0    1    2    1    1    1    1    1    3    1    1    2    2
7     2    1    1    2    0    3    2    3    2    3    3    2    2    2    2    3
8     4    8    5    4    2    3    2    2    1    2    3    3    4    3    4    3
9     1    2    5    1    1    1    1    1    1    1    1    1    1    1    1    1
10    1    0    2    1    1    1    1    1    1    1    1    1    1    1    1    1
11    1    1    1    1    1    1    2    2    1    1    1    2    1    0    2    1
12    0    1    0    0    1    1    0    0    0    0    1    0    0    0    0    0
13    2    2    0    2    1    1    1    1    1    1    1    1    1    1    1    1
14    2    5    1    2    0    0    1    1    1    1    1    1    1    1    2    1
15    1    15   0    1    1    0    2    2    2    2    2    2    2    2    2    2
16    0    2    0    1    1    2    2    2    1    2    1    3    2    1    2    2
17    1    1    0    2    1    1    1    1    3    0    2    2    1    1    1    1
18    1    1    1    1    0    1    1    1    1    1    1    1    1    1    1    1
19    0    1    0    0    1    1    1    1    0    1    1    1    1    1    1    1
20    1    0    2    2    0    0    1    1    1    1    1    1    1    1    1    1
21    3    5    3    2    3    3    2    3    2    3    3    2    3    3    3    3
22    1    1    1    1    1    0    1    1    1    1    1    1    1    1    1    1
23    1    1    1    1    0    0    1    1    1    1    1    1    1    1    2    1
24    1    0    1    1    1    1    1    1    1    1    1    1    1    1    1    1
25    0    0    0    1    0    0    1    1    0    1    1    1    1    1    1    0
26    1    1    1    1    1    1    1    1    1    1    1    0    1    1    1    1
27    0    0    0    0    1    1    1    1    1    1    1    1    1    1    1    1
28    0    0    0    1    1    1    1    1    1    1    0    0    1    0    2    1
29    1    1    3    1    0    0    2    2    1    1    1    1    1    0    3    1
30    1    0    0    0    1    1    2    2    1    1    1    0    1    1    2    1
31    1    1    1    1    1    1    1    1    1    1    1    1    1    1    2    1
32    0    0    0    2    1    1    1    1    1    1    2    1    2    2    2    2
33    1    0    0    1    0    1    1    1    1    1    1    1    1    1    1    1
34    1    0    0    0    1    1    1    1    0    1    1    1    1    1    1    1
35    4    7    4    2    1    1    2    1    1    1    1    1    1    1    1    1


Table 6: Number of TRBV genes present in each primate species in the phylogenetic clades defined in Figure 6.

Species columns: Dmad = D. madagascariensis, Ogar = O. garnettii, Mmur = M. murinus, Tsyr = T. syrichta, Cjac = C. jacchus, Sbol = S. boliviensis, Mmul = M. mulatta, Mfas = M. fascicularis, Csab = C. sabaeus, Panu = P. anubis, Nleu = N. leucogenys, Pabe = P. abelii, Ggor = G. gorilla, Ppan = P. paniscus, Ptro = P. troglodytes, Hsap = H. sapiens.

Clade Dmad Ogar Mmur Tsyr Cjac Sbol Mmul Mfas Csab Panu Nleu Pabe Ggor Ppan Ptro Hsap
1     1    1    0    0    1    1    1    1    1    1    1    1    1    2    2    2
2     1    1    1    1    1    1    1    1    1    1    1    1    1    2    2    1
3     0    0    0    1    1    1    1    1    1    1    0    1    1    1    1    1
4     0    1    0    0    1    1    1    1    1    1    1    1    1    1    1    1
5     1    1    1    1    1    1    3    1    1    1    1    2    1    1    1    1
6     1    0    0    0    1    2    1    1    1    1    1    3    2    2    2    2
7     0    1    1    1    1    0    1    1    1    1    0    1    1    1    1    1
8     1    2    1    0    2    3    2    2    3    3    2    3    2    2    3    2
9     0    4    2    0    3    3    3    5    6    7    4    8    6    8    8    6
10    1    1    1    0    1    1    1    1    1    1    1    1    1    1    1    1
11    1    1    1    0    1    1    1    1    1    1    1    1    1    0    1    1
12    0    2    4    1    2    1    2    4    3    4    3    3    2    2    2    1
13    1    3    4    0    4    1    4    3    3    3    4    4    2    3    3    3
14    1    1    0    1    1    1    1    1    1    1    1    0    1    1    1    1
15    0    0    0    1    1    1    1    1    1    1    1    1    1    0    1    1
16    1    2    5    0    5    4    5    7    8    8    4    7    6    7    7    6
17    1    1    0    0    1    1    1    1    0    0    0    1    1    1    1    1
18    1    0    0    0    1    1    1    1    1    1    1    1    0    1    1    1
19    1    1    1    0    1    1    1    1    1    1    0    3    2    1    1    1
20    1    1    1    0    1    1    1    1    1    1    1    1    2    2    2    2
21    1    2    1    1    4    3    4    3    3    3    1    5    3    2    2    3
22    0    0    1    0    1    1    1    1    1    1    1    1    1    1    1    1
23    1    1    1    1    1    1    1    3    3    3    3    4    1    1    1    1
24    0    1    1    2    2    0    2    3    3    3    1    2    2    2    3    3
25    1    3    3    1    4    4    4    9    6    9    2    5    7    6    6    5


Once the sequence fragments FR1, FR2, FR3, and the CDRs were separated, we studied whether each V exon of the TRV is unique within each species and whether it has an ortholog in other species (as suggested by the results of the phylogenetic trees). Figure 7 shows the particular case of a randomly selected V exon sequence for illustration (Vs367 of H. sapiens for the TRAV locus and Vs168 of M. mulatta for the TRBV locus; sequences can be obtained from http://vgenerepertoire.org). For example, in the case of the TRAV sequence shown (i.e., Vs367 of Homo sapiens), each of the four segments (CDRs, FR1, FR2 and FR3) differs significantly from the other TRAV sequences within the same species, appearing as an outlier in the boxplot of Figure 7. In 12 primate species, we found one or two sequences that are similar, indicating that they are orthologs. This phenomenon occurs in each of the segments, indicating the uniqueness of each V gene. We repeated the same experiment for the TRBV genes (i.e., the Vs168 exon sequence of Macaca mulatta, shown in Figure 7 (bottom)). The results are similar to those of the TRAV; however, unique V exon sequences were not found in the FR2 regions.
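The comparison behind this experiment can be sketched in a few lines. The exact scoring used for the figure is not specified beyond an "identity number", so the position-wise identity count below is an assumption, as are the function names and the toy fragments; only the query fragment (the FR2 fragment shown for Vs168 in Figure 7) is taken from the figure.

```python
from collections import defaultdict


def identity_count(a, b):
    """Number of positions at which two fragments carry the same residue.

    Fragments are compared position by position; if their lengths
    differ, the comparison stops at the end of the shorter one.
    """
    return sum(1 for x, y in zip(a, b) if x == y)


def identities_by_species(query, fragments):
    """Score a query fragment against (species, seq_id, fragment) records.

    Returns {species: [(score, seq_id), ...]} sorted from most to least
    similar, so that high-scoring outliers within each species stand
    out as candidate orthologs.
    """
    scores = defaultdict(list)
    for species, seq_id, fragment in fragments:
        scores[species].append((identity_count(query, fragment), seq_id))
    return {sp: sorted(vals, reverse=True) for sp, vals in scores.items()}


# Query: FR2 fragment shown for Vs168 (M. mulatta) in Figure 7.
query_fr2 = "MYWYRQDPGQGLRLIYY"
# Hypothetical fragments, invented for illustration only.
toy_fragments = [
    ("species_A", "seq_1", "MYWYRQDPGQGLRLIYY"),
    ("species_A", "seq_2", "VYWYQQSLDQGLQFLIQ"),
    ("species_B", "seq_3", "MYWYRQDPGLGLRLIYF"),
]
ranked = identities_by_species(query_fr2, toy_fragments)
```

In this toy input, `seq_1` matches the query at every position and therefore ranks first within `species_A`, which is the pattern that the boxplot outliers represent.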

The data generated from the phylogenetic trees suggest frequent changes in the gene loci of IG as well as a reduced permissiveness in the genes of the TR chains. The theory of birth and death of genes has been postulated as a mechanism that directs the evolutionary processes of these genes. In Figure 8 we studied this hypothesis by quantifying sequence similarities higher than 90 % in segments over 3000 bases in the IGHV, TRAV and TRBV loci between the orangutan, human and macaque species. The results show that in the IGHV locus there are more tracks and more crossovers than in the TR loci. Also, comparing species uncovers relationships in the IGHV locus over evolutionary time. For example, the number of IGHV tracks and crossovers is higher between human and macaque (more distantly related species) than between human and orangutan (more closely related species). In the TRAV and TRBV loci, the tracks are approximately parallel, indicating that in these loci fewer duplication/deletion events took place between speciation events, in contrast to what can be observed in the IGHV locus.
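One way to make "parallel" versus "crossing" tracks quantitative is to count pairs of tracks whose order is reversed between the two loci. The sketch below is an illustration of that idea only; it is not the analysis used for Figure 8, and the representation of a track as a pair of start coordinates is an assumption.

```python
def count_crossings(tracks):
    """Count pairs of tracks that cross between two loci.

    `tracks` is a list of (pos_a, pos_b) pairs: the start coordinate of
    a matching segment in locus A and in locus B. Two tracks cross when
    their order in locus A is the reverse of their order in locus B.
    Quadratic in the number of tracks, which is adequate for the few
    dozen tracks of a single locus.
    """
    ordered = sorted(tracks)
    crossings = 0
    for i in range(len(ordered)):
        for j in range(i + 1, len(ordered)):
            if ordered[i][0] < ordered[j][0] and ordered[i][1] > ordered[j][1]:
                crossings += 1
    return crossings


# Perfectly parallel tracks (conserved gene order).
parallel = count_crossings([(100, 120), (200, 230), (300, 310)])
# Fully reversed order: every pair of tracks crosses.
reversed_order = count_crossings([(100, 310), (200, 230), (300, 120)])
```

Under this measure, a locus with approximately parallel tracks scores near zero, while a locus reshaped by repeated duplication and deletion scores higher.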

4. Discussion

In previous studies (Olivieri et al., 2013, 2014), we showed data indicating a different evolutionary process between the V genes of IG and TR. In IG, the processes of birth and death are quite evident. We also highlighted the grouping of sequences into established IMGT clans and proposed new clades. For the species studied in this work, we describe the clustering of the IG light chain sequences into major clades. While these groupings are not as obvious as the three clans of the IGH chains, the large number of sequences supports their existence and allows us to establish these light chain clades with confidence. The grouping of the IGL chains into five clades is of particular interest, since these clades originated prior to the diversification of mammals and reptiles and have remained in both evolutionary lines for over 300 million years, suggesting a functional significance of each clade which is still unknown.

All loci containing V exons are very similar. Despite this wide similarity, there are stark evolutionary differences between the IG and TR loci. The IG loci exhibit a more pronounced rate of change than the TR loci. This is seen by comparing sequences between species of primates, where there is greater sequence conservation in the TR loci. Besides the evidence left as relics in genomic sequences, frequent duplications of IG genes generate recent clades with multiple


[Figure 7 image: boxplots per species (axis label: "Number of identical sequences") for Homo sapiens, Pan troglodytes, Pan paniscus, Gorilla gorilla, Pongo abelii, Nomascus leucogenys, Papio anubis, Chlorocebus sabaeus, Macaca fascicularis, Macaca mulatta, Saimiri boliviensis, Callithrix jacchus, Tarsius syrichta, Microcebus murinus, Otolemur garnettii and Daubentonia madagascariensis, with outlying sequences labeled by identifier. Panel titles:
TRAV, Vs367 Homo sapiens: CDRs(i) PSSNFYA---MTLNGDE; Frame 1(i) VSSILNVEQSPQSLHVQEGDSTNFTCSF; Frame 2(i) LHWYRWETAKSPEALFV; Frame 3(i) LNGDEKKKGRISATLNTKEGYSYLYIKGSQPEDSATYLCA.
TRBV, Vs168 Macaca_mulatta: CDRs(i) NLNHDAM---SQIVNDI; Frame 1(i) TMDGRITQSPKYLFRKEGQNVTLSCEQ; Frame 2(i) MYWYRQDPGQGLRLIYY; Frame 3(i) NDIQKGDIAEGYSVSRERKESFPLTVTSAQRNPTAFYLCASS.]

Figure 7: Participation of each of the canonical segments, FR1, FR2, FR3, CDR1, and CDR2, from the amino acid translated V exon sequences of the uniquely identical genes. From these sequences, we obtained the framework regions (1, 2 and 3) and other sequences created artificially with the two CDRs (1 and 2). A concrete case of sequences originating from a TRAV sequence (top) was compared (identity number) with the rest of the fragments obtained from all TRAV exons from all species. The same experiment was performed with specific sequences of a TRBV (bottom).


[Figure 8 image: three panels (IGHV, TRAV and TRBV), each showing the locus in P. abelii, H. sapiens and M. mulatta as parallel records (orangutan, human and macaque; scale bar 1000000) with numbered V exons and tracks connecting matching segments between species.]

Figure 8: Identical sequences within the IGHV, TRAV and TRBV loci from three species of primates: H. sapiens, P. abelii and M. mulatta. For each species, the V exon sequences were extracted from genomic segments available at the Ensembl repository (www.ensembl.org). For our analysis pipeline, we used Galaxy (http://galaxy.wur.nl). Sequences for the tracks were obtained in the following way: we performed a BLASTN search against the orangutan and macaque sequences (i.e., the database), with the human sequence as the query. We selected hits with sequence identity > 60% and alignment length > 3000 bases, and the figure was made with a custom Python script. The locations of the V exons, marked in red, were obtained with Vgenextractor (http://vgenerepertoire.org/).
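The hit selection described in the caption can be sketched as a filter over BLAST tabular output. The sketch assumes the hits were exported in the default 12-column tabular format (`-outfmt 6`); the export format actually used in the Galaxy pipeline is not stated, and the function name and example rows are hypothetical.

```python
import csv
import io

# Default column order of BLAST tabular output (-outfmt 6).
BLAST6_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def filter_tracks(handle, min_identity=60.0, min_length=3000):
    """Keep hits with identity > min_identity and length > min_length."""
    tracks = []
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines, comment lines and malformed rows.
        if len(row) < len(BLAST6_COLUMNS) or row[0].startswith("#"):
            continue
        hit = dict(zip(BLAST6_COLUMNS, row))
        identity = float(hit["pident"])
        length = int(hit["length"])
        if identity > min_identity and length > min_length:
            tracks.append({
                "query": hit["qseqid"],
                "subject": hit["sseqid"],
                "identity": identity,
                "length": length,
                "qstart": int(hit["qstart"]),
                "sstart": int(hit["sstart"]),
            })
    return tracks


# Hypothetical hits, invented for illustration only.
example = io.StringIO(
    "human\torangutan\t92.5\t3500\t200\t10\t1000\t4499\t2000\t5499\t0.0\t5000\n"
    "human\torangutan\t95.0\t1200\t50\t2\t9000\t10199\t9500\t10699\t0.0\t2000\n"
    "human\tmacaque\t55.0\t4000\t1500\t40\t100\t4099\t300\t4299\t1e-50\t900\n"
)
selected = filter_tracks(example)
```

Only the first example row passes both thresholds: the second is too short and the third falls below the identity cutoff. The `qstart` and `sstart` values of the retained hits give the two endpoints of each track.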


members. Evolution thus provides the organism with a defense mechanism: the rapid adaptation of IG chains to a rapidly changing external infectious environment.

In the TRA and TRB loci, there is a conservation of V exons and a low duplication permissiveness. In particular, we found a conservation of 35 TRAV exon sequences and 25 TRBV exon sequences. Nonetheless, in some species we did not detect a V exon in every one of these clades. This may be a methodological error (Vgenextractor only detects 95 % of the V exons from WGS data sets) or the system may be slightly redundant, permitting some V exon loss without compromising the survival of the individual. Similarly, in the TR loci we detected duplication events but never observed multiple duplications, such as those in the IG loci.

The uniqueness of each gene in the TR loci is of particular interest. The number of V genes in these loci is not arbitrary. The fact that a large repertoire variation can be generated by the process of VDJ recombination and somatic mutation has given rise to the assumption that a few V genes should be sufficient for somatic diversification. Previous publications (Suarez et al., 2006) suggest that a few V regions can generate nearly complete repertoires. The results of this work indicate, by contrast, that the genomic diversity of the V genes in the TR loci should have a functional basis maintained throughout evolution.

Our results also provide new insights into the evolution of CDR and framework regions. From an evolutionary point of view, the CDRs are sequence segments that should be permissive to mutations, while the framework regions should be less permissive since they provide a well defined structure. In general, when the AA sequences deduced from the V exons are aligned, the CDRs fall within the so-called hypervariable regions. However, when each clade is studied independently, the variability of the CDRs is similar to that of the framework regions (Figure 5), especially for the CDR2 of the TRAV locus and the CDR1 of the TRBV locus. These results show that the sequences of these CDRs are maintained in evolution and that they are no more permissive to mutations than the framework regions. The hypervariability found in the alignment of sequences of one species is due to the presence of different CDRs within each V exon, but each has an evolutionary ortholog maintained in other primate species. In Table 5, a column with the consensus sequences of the CDRs (i) is shown. These data indicate that the sequences of each TRAV exon may be positively selected for a specific, non-redundant function.

Why have these genes been maintained in the TRV loci? A probable explanation is that this maintenance is due to a co-evolution with interacting molecules, such as MHC, that provide a natural evolutionary pressure. This same evolutionary pressure may also condition the pairing of the TRA and TRB regions. Therefore, the evolution of each V region must be constrained by modifications that occur in the MHC molecules as well as by other changes in V region pairing. These same mechanisms do not occur in the IG loci, since antigen recognition by antibodies is not restricted by MHC molecules, which likely explains their greater permissiveness towards evolutionary modifications.

Why are there 25 TRBV and 35 TRAV? A possible explanation could be that a minimum number of genes is required to form the TRA/TRB pairs needed for T lymphocytes to recognize antigen presented by the large structural variations of MHC class I and class II molecules. Indeed, it is known that MHC can have multiple forms, particularly class II molecules. If this hypothesis were true, we would expect to find specific pairings of TRA/TRB for putative MHC molecules. Also, we would expect to find evidence of the association between V exons and the presence or absence of MHC genes in evolutionary studies.

A plausible explanation for the results we presented is that the MHC genes that coexist with the TRV genes must act as evolutionary guides. In this scenario, the capacity for the TR to recognize the MHC should be coded directly within the germline, while the antigen recognition of the TR-MHC complex is a consequence of random somatic variations in the individual (VDJ rearrangements and somatic mutation). Studies suggest that recognition of MHC is mediated by the CDR1 and CDR2, which are within the V exon, while the antigenic component is recognized by the CDR3 (encoded by D and J exons) (Marrack et al., 2008; Deng et al., 2012). Our data are consistent with this description and with the idea that the MHC recognition system must be encoded in the genome. This would explain the coevolution of both molecules. It is logical that the processes of somatic variability are directed towards antigen recognition and have a stochastic quality. If this were the case, the MHC recognition structures would be limited in order to accompany evolutionarily allowed changes. Our work points to the fact that these constraining structures are the FR and CDR amino acid sequences generated from the V exons.

5. References

Berman, J. E., Mellis, S., Pollock, R., Smith, C., Suh, H., Heinke, B., Kowal, C., Surti, U., Chess, L., Cantor, C. et al. (1988). Content and organization of the human ig vh locus: definition of three new vh families and linkage to the ig ch locus. The EMBO journal, 7, 727.

Brack, C., Hirama, M., Lenhard-Schuller, R., & Tonegawa, S. (1978). A complete immunoglobulin gene is created by somatic recombination. Cell, 15, 1–14.

Davis, M. M., & Bjorkman, P. J. (1988). T-cell antigen receptor genes and t-cell recognition. Nature, 334, 395–402.

Deng, L., Langley, R. J., Wang, Q., Topalian, S. L., & Mariuzza, R. A. (2012). Structural insights into the editing of germ-line–encoded interactions between t-cell receptor and mhc class ii by vα cdr3. Proceedings of the National Academy of Sciences, 109, 14960–14965.

Deza, F. G., Espinel, C. S., & Mompo, S. M. (2009). The immunoglobulin heavy chain locus in the reptile anolis carolinensis. Molecular immunology, 46, 1679–1687.

Gambon-Deza, F., Sanchez-Espinel, C., & Magadan-Mompo, S. (2010). Presence of an unique igt on the igh locus in three-spined stickleback fish (gasterosteus aculeatus) and the very recent generation of a repertoire of vh genes. Developmental & Comparative Immunology, 34, 114–122.

Ghaffari, S. H., & Lobb, C. J. (1991). Heavy chain variable region gene families evolved early in phylogeny. ig complexity in fish. The Journal of immunology, 146, 1037–1046.

Giudicelli, V., Chaume, D., & Lefranc, M.-P. (2005). Imgt/gene-db: a comprehensive database for human and mouse immunoglobulin and t cell receptor genes. Nucleic acids research, 33, D256–D261.

Giudicelli, V., & Lefranc, M. (2004). Imgt/gene-db. The molecular biology database collection. Nucl Acids Res, 32.

Giudicelli, V., & Lefranc, M.-P. (1999). Ontology for immunogenetics: the imgt-ontology. Bioinformatics, 15, 1047–1054.

Giudicelli, V., & Lefranc, M.-P. (2012). Imgt-ontology 2012. Frontiers in genetics, 3.

Gouy, M., Guindon, S., & Gascuel, O. (2010). Seaview version 4: a multiplatform graphical user interface for sequence alignment and phylogenetic tree building. Molecular biology and evolution, 27, 221–224.

Guo, Y., Bao, Y., Wang, H., Hu, X., Zhao, Z., Li, N., & Zhao, Y. (2011). A preliminary analysis of the immunoglobulin genes in the african elephant (loxodonta africana). PloS one, 6, e16889.

Hughes, A. L. (1994). The evolution of functionally novel proteins after gene duplication. Proceedings of the Royal Society of London. Series B: Biological Sciences, 256, 119–124.

Janeway, C. A., Travers, P., Walport, M., & Shlomchik, M. J. (2001). Immunobiology. Garland Science.

Janeway, C. A., Travers, P., Walport, M., & Shlomchik, M. J. (2005). Immunobiology: the immune system in health and disease. Garland Science New York.

Janeway Jr, C. A. (1992). The immune system evolved to discriminate infectious nonself from noninfectious self. Immunology today, 13, 11–16.

Kirkham, P., Mortari, F., Newton, J., & Schroeder Jr, H. (1992). Immunoglobulin vh clan and family identity predicts variable domain structure and may influence antigen binding. The EMBO journal, 11, 603.

Lefranc, M.-P. (2001). Nomenclature of the human immunoglobulin heavy (igh) genes. Experimental and clinical immunogenetics, 18, 100–116.

Lefranc, M.-P. (2011). From imgt-ontology description axiom to imgt standardized labels: for immunoglobulin (ig) and t cell receptor (tr) sequences and structures. Cold Spring Harbor Protocols, 2011, pdb–ip83.

Lefranc, M.-P. (2014). Immunoglobulin and t cell receptor genes: Imgt® and the birth and rise of immunoinformatics. Frontiers in immunology, 5.

Lefranc, M.-P., Giudicelli, V., Ginestoux, C., Jabado-Michaloud, J., Folch, G., Bellahcene, F., Wu, Y., Gemrot, E., Brochet, X., Lane, J. et al. (2009). Imgt®, the international immunogenetics information system®. Nucleic acids research, 37, D1006–D1012.

Lefranc, M.-P., & Lefranc, G. (2001a). The immunoglobulin factsbook. Gulf Professional Publishing.

Lefranc, M.-P., & Lefranc, G. (2001b). The T cell receptor FactsBook. Gulf Professional Publishing.

Lefranc, M.-P., Pommie, C., Ruiz, M., Giudicelli, V., Foulquier, E., Truong, L., Thouvenin-Contet, V., & Lefranc, G. (2003). Imgt unique numbering for immunoglobulin and t cell receptor variable domains and ig superfamily v-like domains. Developmental & Comparative Immunology, 27, 55–77.

Marrack, P., Scott-Browne, J. P., Dai, S., Gapin, L., & Kappler, J. W. (2008). Evolutionarily conserved amino acids in tcr v regions and mhc control their interaction. Annual review of immunology, 26, 171.

Miller, R. D., Grabe, H., & Rosenberg, G. H. (1998). Vh repertoire of a marsupial (monodelphis domestica). The Journal of Immunology, 160, 259–265.

Niku, M., Liljavirta, J., Durkin, K., Schroderus, E., & Iivanainen, A. (2012). The bovine genomic dna sequence data reveal three ighv subgroups, only one of which is functionally expressed. Developmental & Comparative Immunology, 37, 457–461.

Olivieri, D., Faro, J., von Haeften, B., Sanchez-Espinel, C., & Gambon-Deza, F. (2013). An automated algorithm for extracting functional immunologic v-genes from genomes in jawed vertebrates. Immunogenetics, 65, 691–702.

Olivieri, D., von Haeften, B., Sanchez-Espinel, C., Faro, J., & Gambon-Deza, F. (2014). Genomic v exons from whole genome shotgun data in reptiles. Immunogenetics, (pp. 1–14).

Perelman, P., Johnson, W. E., Roos, C., Seuanez, H. N., Horvath, J. E., Moreira, M. A., Kessing, B., Pontius, J., Roelke, M., Rumpler, Y. et al. (2011). A molecular phylogeny of living primates. PLoS genetics, 7, e1001342.

Price, M. N., Dehal, P. S., & Arkin, A. P. (2010). Fasttree 2–approximately maximum-likelihood trees for large alignments. PloS one, 5, e9490.

Sievers, F., & Higgins, D. G. (2014). Clustal omega, accurate alignment of very large numbers of sequences. In Multiple Sequence Alignment Methods (pp. 105–116). Springer.

Suarez, E., Magadan, S., Sanjuan, I., Valladares, M., Molina, A., Gambon, F., Díaz-Espada, F., & Gonzalez-Fernandez, A. (2006). Rearrangement of only one human ighv gene is sufficient to generate a wide repertoire of antigen specific antibody responses in transgenic mice. Molecular immunology, 43, 1827–1835.

Tamura, K., Peterson, D., Peterson, N., Stecher, G., Nei, M., & Kumar, S. (2011). Mega5: molecular evolutionary genetics analysis using maximum likelihood, evolutionary distance, and maximum parsimony methods. Molecular biology and evolution, 28, 2731–2739.

Tonegawa, S. (1983). Somatic generation of antibody diversity. Nature, 302, 575–581.

Waterhouse, A. M., Procter, J. B., Martin, D. M., Clamp, M., & Barton, G. J. (2009). Jalview version 2—a multiple sequence alignment editor and analysis workbench. Bioinformatics, 25, 1189–1191.

6. Supplementary Online Material

Table 1: WGS data for the 16 primates included in this study. NP/DS indicates no publication, direct submission.

Species              WGS & BioProject No.  PubMed PMID  Sequencing (coverage)                   Contig N50

Lemuriformes:
D. madagascariensis  AGTM01/PRJNA74997     22155688     Illumina (38×)                          3,653
O. garnettii         AAQR03/PRJNA16955     NP/DS        Illumina (137×)                         27,100
M. murinus           AAHY01/PRJNA11785     12040188     Celera                                  21,690

Tarsiformes:
T. syrichta          ABRT01/PRJNA20339     NP/DS        Sanger (2.07×)                          38,165

New World Monkeys:
C. jacchus           ACFV01/PRJNA20401     NP/DS        ABI 3730 (6.6×)                         29,273
S. boliviensis       AGCE01/PRJNA67945     NP/DS        Illumina HiSeq (80×)                    38,823

Old World Monkeys:
M. mulatta           AANU01/PRJNA12537     17431167     Sanger                                  25,707
M. fascicularis      CAEC01/PRJEA48347     21862625     454-FLXr, SOLiD                         8,925
C. sabaeus           AQIB01/PRJNA168621    NP/DS        454 Titanium; Illumina                  90,449
P. anubis            AHZZ01/PRJNA54005     NP/DS        Sanger: 2.5×; 454: 4.5×; Illumina: 85×  40,262

Hominids:
N. leucogenys        ADFV01/PRJNA13975     NP/DS        Sanger (5.6×)                           35,148
P. abelii            ABGA01/PRJNA20869     21270892     Sanger (6×)                             15,648
G. gorilla           CABD02/PRJNA169344    NP/DS        Sanger
P. paniscus          AJFE01/PRJNA49285     22722832     454 (26×)                               66,775
P. troglodytes       AACZ03/PRJNA13184     16136131     Sanger (6×)                             50,656
H. sapiens           ABBA01/PRJNA19621     17803354     Sanger                                  108,431
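For readers unfamiliar with the last column, the contig N50 can be computed as in the minimal sketch below. It assumes the standard definition (the largest length L such that contigs of length at least L contain at least half of the assembled bases). The function name and the example lengths are hypothetical and are not derived from the assemblies listed in Table 1.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are visited from longest to shortest; the N50 is the length
    of the contig at which the running total first reaches half of the
    total assembly size.
    """
    if not contig_lengths:
        raise ValueError("need at least one contig length")
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical contig lengths (in bases), for illustration only.
example_lengths = [2, 4, 6, 8, 10]
print(n50(example_lengths))
```

A higher N50 indicates a less fragmented assembly, which matters here because V exons can only be recovered intact from contigs that span them.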
