    3 Biotech (2021) 11:35 https://doi.org/10.1007/s13205-020-02583-w

    ORIGINAL ARTICLE

    Implications of genome simple sequence repeats signature in 98 Polyomaviridae species

    Rezwanuzzaman Laskar1 · Md Gulam Jilani1 · Safdar Ali1

    Received: 25 September 2020 / Accepted: 2 November 2020 / Published online: 6 January 2021 © King Abdulaziz City for Science and Technology 2021

    Abstract

    The analysis of simple sequence repeats (SSRs) in 98 genomes across four genera of the family Polyomaviridae was performed. The genome size ranged from 3962 (BM87) to 7369 bp (BM85), but most genomes were in the range of 5–5.5 kb. The GC% averaged 42% and ranged between 34.69 (BM95) and 52.35 (BM81). A total of 3036 SSRs and 223 cSSRs were extracted using IMEx, with incidence per genome ranging from 18 to 56 and from 0 to 7, respectively. The most prevalent mono-nucleotide repeat motif was “T” (48.95%) followed by “A” (33.48%). “AT/TA” was the most prevalent dinucleotide motif, closely followed by “CT/TC”. As expected, SSRs were concentrated in the coding region (77.6% of SSRs), and nearly half of these were in the Large T Antigen (LTA) gene. Notably, most viruses with humans, apes and related species as hosts exhibited mono-nucleotide repeats exclusively in the AT region, a proposed predictive marker for humans becoming a host of the virus in the course of its evolution. Each genome has a unique SSR signature, which is pivotal for viral evolution, particularly in terms of host divergence.

    Keywords Simple sequence repeats · Polyomaviridae · Prevalence · Distribution · Virus host · Evolution

    Introduction

    The genome of any organism is the key to understanding its functionality and evolutionary significance. Besides the sequence per se, each genome has features that provide crucial information, for instance the repeat or satellite sequences, which are classified on the basis of the length of the repeat motif. Simple sequence repeats (SSRs), also known as microsatellites, are the smallest of the satellite sequences. SSRs are ubiquitously present across the genomes of all organisms, albeit with different incidence, complexity and iterations. Ever since the identification of these repeats in multiple species, across coding and non-coding regions, their functional relevance has been explored at different levels (Gur-Arie et al. 2000; Kofler et al. 2008; Chen et al. 2012). Clinical relevance of SSRs in humans has also been reported. For instance, the expansion of these repeats through copy number alterations has been associated with enhancer amplification near oncogenes in cancer as well as with neuronal degradation in multiple neuropathies (Burguete et al. 2015; Hung et al. 2019). Based on iterations and intervening sequences, tandemly repeated SSRs may be classified into interrupted, pure, compound, interrupted compound, complex or interrupted complex (Chambers and MacAvoy 2000).

    Amongst various organisms, viruses are a unique platform to study SSRs owing to their small but rapidly evolving genomes. Further, the dependence of viruses on the host cell for survival makes them convenient to study in terms of genome features and evolution. SSRs have been reported to play a role in genome evolution (Bennetzen 2000) and host range in viruses (Alam et al. 2019).

    The present study focuses on the extraction and analysis of microsatellites from the genomes of 98 species of Polyomaviridae, a family of small, non-enveloped viruses that derives its name “Polyoma” from its ability to induce multiple tumors in its host. These viruses normally have mammals, birds and fish as their hosts (Ahsan and Shah 2006). The circular/linear genome generally encodes two types of proteins. The first are the early regulatory proteins, which include large tumour antigen (LTAg), small tumour antigen (STAg), middle tumour antigen (MTAg), alternative tumour antigen (ATAg) and putative alternative large tumour antigen (PALTAg). These are pivotal for replication, transcription and maturation of the virus during infection. The second category comprises the late structural proteins, which include the major capsid protein, viral protein 1 (VP1), and the minor capsid proteins, VP2 and VP3. As the name suggests, these are important for capsid formation (Moens et al. 2011; Meijden et al. 2015).

    Supplementary Information The online version contains supplementary material available at https://doi.org/10.1007/s13205-020-02583-w.

    * Safdar Ali [email protected]; [email protected]

    Rezwanuzzaman Laskar [email protected]

    Md Gulam Jilani [email protected]

    1 Clinical and Applied Genomics (CAG) Laboratory, Department of Biological Sciences, Aliah University, IIA/27, Newtown, Kolkata 700160, India

    http://orcid.org/0000-0003-3298-9282

    In this analysis, we extracted SSRs from polyomavirus genomes and studied their incidence, distribution and complexity to understand the genome SSR signature. Further, the role of SSRs in viral evolution, and the genome regions contributing to it, has been studied. This understanding of viral genomics holds the key to combating viral pathogenesis and host divergence.

    Materials and methods

    Genome sequences

    Whole-genome sequences of 98 species of the family Polyomaviridae across 4 different genera, as listed by ICTV (https://talk.ictvonline.org/ictv-reports/ictv_online_report/dsdna-viruses/w/polyomaviridae), were extracted from NCBI (http://www.ncbi.nlm.nih.gov/). These include Alphapolyomavirus (43 species), Betapolyomavirus (33 species), Gammapolyomavirus (9 species) and Deltapolyomavirus (4 species). The study also included 9 species yet to be assigned to a genus. The details of all the species included in the study (Genome type, Genera, Genome size, GC%, Host, Accession number) have been summarized in Supplementary file 1. All the genomes were double-stranded DNA, mostly circular except for 10 linear genomes. The information on all the known hosts of these viruses was obtained from the Virus-Host Database (https://www.genome.jp/virushostdb/note.html).

    Microsatellite extraction

    We used Imperfect Microsatellite Extractor (IMEx) for extracting SSRs, wherein mono- to hexa-nucleotide repeat motifs are uncovered, imperfect microsatellites are allowed and compound microsatellites (cSSR: multiple SSRs separated by a distance of less than or equal to dMAX) have a dMAX range of 10–50. The results therefore need to be assessed within these parameters.

    Microsatellite extraction was carried out using the ‘Advance-Mode’ of IMEx with the parameters reported for HIV (Mudunuri and Nagarajaram 2007; Chen et al. 2012) and as used for Mycobacteriophages (Alam et al. 2019). Briefly, the parameters included minimum repeat size, which was set as follows: 6 (mono-), 3 (di-), 3 (tri-), 3 (tetra-), 3 (penta-), 3 (hexa-). Two SSRs separated by a distance of less than or equal to dMAX are treated as a single cSSR. In other words, the maximum distance allowed between any two SSRs is called dMAX, which was set at 10 initially and subsequently varied to 20, 30, 40 and 50. All corresponding changes in cSSR incidence were recorded. It should be noted here that the maximum permissible dMAX value in IMEx is 50, because beyond that the fate of microsatellites is individualistic and hence clubbing them as a cSSR becomes irrelevant. Other parameters were set to the defaults.
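    The grouping rule described above can be illustrated with a minimal Python sketch. This is not IMEx code: the function name `group_cssrs`, the half-open `(start, end)` coordinates and the toy positions are assumptions made purely for illustration. The sketch merges neighbouring SSRs whose gap is at most dMAX and reports the resulting cSSR count for each dMAX value used in the study.

```python
def group_cssrs(ssrs, dmax=10):
    """Group SSR intervals into compound SSRs (cSSRs).

    ssrs: iterable of (start, end) half-open coordinates (hypothetical format).
    Returns a list of clusters, each holding two or more SSRs whose
    neighbouring gaps are all <= dmax.
    """
    clusters, current, current_end = [], [], None
    for start, end in sorted(ssrs):
        if current and start - current_end <= dmax:
            # Close enough to the running cluster: extend it.
            current.append((start, end))
            current_end = max(current_end, end)
        else:
            # Gap too large: keep the finished cluster only if it is compound.
            if len(current) > 1:
                clusters.append(current)
            current, current_end = [(start, end)], end
    if len(current) > 1:
        clusters.append(current)
    return clusters


# Toy SSR positions (not taken from any genome in the study).
toy_ssrs = [(100, 106), (110, 116), (140, 146), (150, 156), (400, 409)]

# dMAX values used in the study: 10, 20, 30, 40, 50.
cssr_counts = {d: len(group_cssrs(toy_ssrs, d)) for d in (10, 20, 30, 40, 50)}
print(cssr_counts)  # {10: 2, 20: 2, 30: 1, 40: 1, 50: 1}
```

    In this toy case the cSSR count falls from 2 to 1 once dMAX reaches 30, because two neighbouring cSSRs merge into one; a larger dMAX therefore does not always mean more cSSRs.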

    Statistical analysis

    All statistical analyses were performed in a spreadsheet using the Data Analysis ToolPak of MS Office Suite v2016. Linear regression was used to assess the correlation of the relative abundance and relative density of microsatellites with genome size and GC%.
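    The study ran its regressions in a spreadsheet; the same quantities can be sketched in plain Python. The function name `linear_regression` and the toy numbers below are hypothetical and are not data from the study. The sketch returns the least-squares slope and intercept together with Pearson's r, and does not compute a P value.

```python
import math


def linear_regression(x, y):
    """Least-squares fit y = slope * x + intercept, plus Pearson's r."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r


# Toy data (hypothetical): genome size in bp vs. SSR incidence.
genome_size = [4000, 5000, 5200, 5400, 7000]
ssr_incidence = [20, 31, 28, 35, 50]

slope, intercept, r = linear_regression(genome_size, ssr_incidence)
print(f"slope={slope:.4f}, intercept={intercept:.2f}, r={r:.3f}, r^2={r * r:.3f}")
```

    Reporting both r and r² makes it explicit which statistic a quoted correlation value refers to.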

    Dot plot analysis for host specificity

    Dot plot analysis of two nucleic acid/protein sequences using Genome Pair Rapid Dotter (GEPARD) highlights the presence of SSRs within the genomes (Krumsiek et al. 2007; Alam et al. 2019) and helps ascertain their evolutionary relationships in the context of repeats, reverse matches and conserved domains. We used GEPARD v1.40 (Krumsiek et al. 2007) to perform dot plot analysis between genomes grouped on the basis of their hosts.

    Evolutionary relationship

    For phylogenetic tree construction, the nucleotide sequences were aligned with the default specifications of MAFFT v6.861b (Katoh and Standley 2013) and the alignment was pruned by the gappyout method of trimAl v1.4.rev6 (Capella-Gutierrez et al. 2009), using the ETE3 v3.1.1 “build” function as implemented on GenomeNet (https://www.genome.jp/tools/ete/). To select the evolutionary model that best fits the alignment, we used pmodeltest v1.4 to compare the JC, K80, TrNef, TPM1, TPM2, TPM3, TIM1ef, TIM2ef, TIM3ef, TVMef, SYM, F81, HKY, TrN, TPM1uf, TPM2uf, TPM3uf, TIM1, TIM2, TIM3, TVM and GTR models before inferring the ML tree. The Maximum-Likelihood (ML) tree was inferred using RAxML v8.1.20 under the GTRGAMMAI model with default parameters (Stamatakis 2014) and 100 bootstrap replicates. The final tree for visualization was constructed using the web tool interactive Tree Of Life (Letunic and Bork 2019).

    Results

    Genome features

    The genome size ranged from 3962 (BM87) to 7369 bp (BM85), but most genomes were in the range of 5–5.5 kb. The GC%, with an average of 42%, ranged between 34.69 (BM95) and 52.35 (BM81) and exhibits much more diversity than genome size (Fig. 1a, Supplementary file 1). In essence, the Polyomaviridae genomes are mostly of similar size, but their composition in terms of GC% is much more variable. If we hypothesize that SSR incidence has an equal chance across the whole genome, irrespective of composition, then the same should be reflected in the motifs of the SSRs present. However, as discussed later, this is not the case: several species have mono-nucleotide motifs exclusively in the AT region.

    The correlation of genome size and GC content with various SSR features was ascertained. SSR incidence was found to be significantly correlated with genome size (r = 0.19, P < 0.05) and GC content (r = 0.08, P < 0.05). Though relative density and relative abundance were not significantly correlated with genome size (r = 0.01, P > 0.05 and r = 0.005, P > 0.05, respectively), significant correlation was observed with GC content (r = 0.20, P < 0.05 and r = 0.23, P < 0.05, respectively).

    [Figure 1. Panel A: SSR incidence, genome size, cSSR incidence and GC% plotted per genome (BM series). Panel B: RA (SSR), RA (cSSR), RD (SSR) and RD (cSSR) plotted per genome.]

    Fig. 1 a Genome features and SSR/cSSR incidence of Polyomaviridae genomes. Genome size is predominantly around 5–5.5 kb, as evident from the fairly constant level of red bars, whereas the corresponding GC variations (superimposed black bars) have a much broader range. In addition, note the diversity in SSR incidence in genomes of similar length. Furthermore, higher SSR incidence does not necessarily translate to more cSSRs. b Relative abundance (RA) and relative density (RD) of SSRs and cSSRs. RA is the number of microsatellites present per kb of the genome, whereas RD is the sequence space composed of microsatellites per kb of the genome. The varying peaks signify the presence of a unique SSR signature for each genome

    Further, cSSR incidence is significantly correlated with genome size (r = 0.06, P < 0.05), but its corresponding relative density (r = 0.0038, P > 0.05) and relative abundance (r = 0.004, P > 0.05) show no significant correlation with it. GC content is also significantly correlated with cSSR incidence (r = 0.06, P < 0.05), relative density (r = 0.11, P < 0.05) and relative abundance (r = 0.08, P < 0.05).

    Incidence of SSRs and cSSRs

    A total of 3036 SSRs and 223 cSSRs were extracted from the 98 species of Polyomaviridae (Supplementary files 2–4). The average distribution of SSRs and cSSRs per genome varied from 23 and 1.3 (Gammapolyomavirus) to 33 and 2.9 (Betapolyomavirus), respectively. Their distribution across genera has been summarized in Table 1.

    A maximum of 56 SSRs was present in BM85, whereas a minimum of 18 was present in BM80 and BM21. cSSR incidence ranged from 0 in seven species (BM99, BM82, BM76, BM59, BM24, BM21, BM14) to 7 in two species (BM85 and BM84) (Fig. 1a). Two interesting but contrasting observations can be made from this data. First, BM85 and BM84, each with 7 cSSRs, have 56 and 31 SSRs in genomes of 7369 and 4697 bp, respectively (Supplementary file 2). This essentially means that although a longer genome should ideally account for more SSRs, the eventual clustering of SSRs, reflected as cSSR incidence, remains the same. Thus, the SSR-rich regions of the genome are independent of genome size. The second aspect is that the above observation is not the norm, as is evident from the cSSR range of zero to seven. Multiple genomes of Polyomaviridae with varying numbers of SSRs have the same number of cSSRs. This is highlighted by 29 species having 2 cSSRs (Fig. 1a, Supplementary files 2–4), suggestive of a unique genome SSR signature.

    To further highlight the regularity of this anomaly, we looked into cSSR%, which is the percentage of SSRs present as part of cSSRs in a particular genome. Note that cSSR% varies not only across different genera but also within them, which argues against genus-specific clustering of SSRs (Fig. 2a). These variations are reflective of specific yet variable localizations and clustering of SSRs in a particular genome.
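    As a small worked illustration of the cSSR% definition, the sketch below divides the number of SSRs that are members of cSSRs by the total SSR count of a genome. The function name `cssr_percent` and the example counts are hypothetical.

```python
def cssr_percent(n_ssrs_in_cssrs, n_ssrs_total):
    """Percentage of a genome's SSRs that are members of a cSSR."""
    if n_ssrs_total == 0:
        return 0.0  # assumption: a genome without SSRs is reported as 0%
    return 100 * n_ssrs_in_cssrs / n_ssrs_total


# Hypothetical genome: 30 SSRs, of which 6 sit inside cSSRs.
print(cssr_percent(6, 30))  # 20.0
```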

    Relative abundance (RA) and relative density (RD) of SSRs and cSSRs

    RA is the number of microsatellites present per kb of the genome, whereas RD is the sequence space composed of microsatellites per kb of the genome. These values are therefore reflective of the number of iterations of the SSRs present. If the SSRs have a conserved tendency to be iterated, then higher incidence should correspond to elevated RD values; that is, a higher RA value should correspond to a higher RD value. As observed, BM65 has the highest RA and RD values for SSRs (9.32 and 80.4, respectively), which means that since more SSRs are present per kb of the genome, more of the genome is comprised of SSRs. The corresponding lowest values for RA and RD were 3.39 (BM21) and 26.5 (BM80), respectively (Fig. 1b, Supplementary files 2–4).

    Similarly, the cSSR relative abundance (cRA) and relative density (cRD) were also studied. Since there were 7 species with no cSSR (Fig. 1a), the minimum cRA and cRD values were zero for these species. The highest values for cRA and cRD were 1.490 (BM84) and 33.93 (BM95), respectively (Fig. 1b, Supplementary files 2–4). That the two maxima occur in different genomes may be due to the differential composition of the cSSRs.
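    The RA and RD definitions translate directly into two small functions. The sketch below uses hypothetical function names; the BM84 inputs (7 cSSRs, 4697 bp) are the values reported in the text, while the repeat lengths passed to `relative_density` are invented, since individual repeat lengths are not given here.

```python
def relative_abundance(n_repeats, genome_bp):
    """Number of repeats per kb of genome."""
    return n_repeats / (genome_bp / 1000)


def relative_density(repeat_lengths_bp, genome_bp):
    """Total bp covered by repeats per kb of genome."""
    return sum(repeat_lengths_bp) / (genome_bp / 1000)


# BM84: 7 cSSRs in a 4697 bp genome (values reported in the text).
print(f"{relative_abundance(7, 4697):.3f}")  # 1.490

# Hypothetical repeat lengths in a 1000 bp stretch.
print(f"{relative_density([6, 6, 8], 1000):.1f}")  # 20.0
```

    The first value agrees with the cRA of 1.490 reported for BM84, which is a quick consistency check on the definition.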

    Table 1 SSR and cSSR incidence across the different genera of Polyomaviridae

    S. No.  Genera              No. of species  SSR incidence  Average SSR per species  cSSR incidence  Average cSSR per species
    1       Alphapolyomavirus   43              1315           30.58                    80              1.86
    2       Betapolyomavirus    33              1090           33.03                    96              2.9
    3       Deltapolyomavirus   4               108            27                       6               1.5
    4       Gammapolyomavirus   9               208            23.11                    12              1.33
    5       Unassigned species  9               315            35                       29              3.22
            Total               98              3036                                    223

    [Figure 2. Panel A: cSSR% per genome (BM series), coloured by group: Alpha PV, Beta PV, Delta PV, Gamma PV, Unassigned Species. Panel B: percentage increase in cSSR incidence per genome with varying dMAX (dMAX20, dMAX30, dMAX40, dMAX50).]

    Fig. 2 a cSSR% in the studied Polyomaviridae genomes. Percentage of individual SSRs as part of cSSRs is cSSR%. The data for all the genera are differentially coloured. There is diversity not only across the genera but also among the genomes of the same genus. Interestingly, BM84, which has the highest cSSR%, is yet to be classified into any genus. b Percentage increase in cSSR incidence with increasing dMAX (10–50). Note the non-linearity in increase. Negative bars represent a decrease in cSSR incidence when two cSSRs merge into one with increasing dMAX

    dMAX and cSSR

    cSSR incidence is dependent on the allowed distance (dMAX) between two SSRs for them to be treated as one cSSR. Since cSSR is reflective of clustering of SSRs, and IMEx allows for dMAX values up to 50, we analyzed cSSR incidence of Polyomaviridae genomes by varying the dMAX value from its initial value of 10 to 20, 30, 40 and 50. Subsequently, the % increase at each step, relative to the cSSR incidence at the preceding dMAX value, was calculated using the given formula.

    This % increase was then plotted. Though the maximum increase is observed for most species when dMAX is increased from 10 to 20, as evident from the predominant black bar, it does not conform to a pattern per se (Fig. 2b). This means that even in species of the same family, SSRs chart their own path in terms of localization in each genome.
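    The stepwise % increase can be sketched in a few lines of Python, taking the change in cSSR incidence between two consecutive dMAX settings, dividing by the incidence at the lower setting and multiplying by 100. The function names and the toy counts are hypothetical; returning `None` when the lower-dMAX count is zero is an assumption of this sketch, as the text does not say how such genomes were handled.

```python
def percent_increase(previous, current):
    """% change in cSSR incidence between two consecutive dMAX settings."""
    if previous == 0:
        return None  # undefined when no cSSR exists at the lower dMAX
    return (current - previous) * 100 / previous


def stepwise_increase(counts_by_dmax):
    """Map each dMAX (except the first) to its % change from the previous one."""
    steps = sorted(counts_by_dmax)
    return {
        d: percent_increase(counts_by_dmax[p], counts_by_dmax[d])
        for p, d in zip(steps, steps[1:])
    }


# Hypothetical cSSR counts for one genome at dMAX = 10, 20, 30, 40, 50.
toy_counts = {10: 2, 20: 4, 30: 5, 40: 5, 50: 4}
print(stepwise_increase(toy_counts))
# {20: 100.0, 30: 25.0, 40: 0.0, 50: -20.0}
```

    The negative value at dMAX 50 corresponds to the case where two cSSRs merge into one as dMAX grows.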

    SSR motif types and their prevalence

    First, the contribution of different repeat motifs (mono- to hexa-nucleotide) to the overall SSR incidence was ascertained. The data were analysed separately for each of the genera. Moreover, the analysis was done in percentages and not absolute numbers to account for the variable number of species across genera. Note that the data from species with unassigned genera were not included herein. The contribution of mono-nucleotide repeat motifs ranged from 36% (Gammapolyomavirus) to 47% (Betapolyomavirus). Deltapolyomavirus had no incidence of penta- and hexa-nucleotide repeats, whereas Gammapolyomavirus lacked hexa-nucleotide repeats. This can be attributed to the fewer species in these genera. Gammapolyomavirus had the highest contribution from di-nucleotide repeats (39.42%) and was the only genus to have more di-nucleotide repeats than mono-nucleotide repeats (Fig. 3a, Supplementary files 2–3).

    We then looked into the motif composition of mono- and di-nucleotide repeats for their prevalence across the different genera of Polyomaviridae. For the mono-nucleotides, in the overall data the most prevalent repeat motif is “T” (48.95%) followed by “A” (33.48%). “T” also remains the most prevalent mono-nucleotide motif for Alpha-, Beta- and Delta-polyomavirus (47, 52 and 71 percent, respectively). However, Gammapolyomavirus has its highest contribution from “C” (34.67%) followed by “T” (33.33%) (Fig. 3b, Supplementary files 2–3). Interestingly, the same Gammapolyomavirus has its highest di-nucleotide repeat motif contribution from “AT/TA” (29.27%), while Alphapolyomavirus has its largest contribution from “CT/TC” (29.37%). Overall, “AT/TA” was the most prevalent dinucleotide repeat motif, closely followed by “CT/TC” (Fig. 3c; PV: polyomavirus).
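    A motif-composition tally of this kind can be sketched as follows. The helper names and the toy motif list are hypothetical; grouping cyclic shifts of a motif under one label (so that “AT” and “TA” are counted together as “AT/TA”) is an assumption inferred from the labels used in the text, not a description of the authors' workflow.

```python
from collections import Counter


def cyclic_class(motif):
    """Label a motif by all of its cyclic shifts, e.g. 'TA' -> 'AT/TA'."""
    rotations = sorted({motif[i:] + motif[:i] for i in range(len(motif))})
    return "/".join(rotations)


def motif_percentages(motifs, size):
    """Percentage share of each motif class among motifs of the given length."""
    counts = Counter(cyclic_class(m.upper()) for m in motifs if len(m) == size)
    total = sum(counts.values())
    return {label: 100 * n / total for label, n in counts.items()} if total else {}


# Toy list of SSR motifs (hypothetical, one entry per SSR).
toy_motifs = ["T", "T", "A", "C", "AT", "TA", "CT"]

print(motif_percentages(toy_motifs, 1))  # {'T': 50.0, 'A': 25.0, 'C': 25.0}
print(motif_percentages(toy_motifs, 2))
```

    Working in percentages rather than raw counts, as the text does, keeps genera with different numbers of species comparable.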

    \[ \%\ \text{increase} = \left[ \frac{\text{cSSR incidence at } \mathrm{dMAX}_{n} - \text{cSSR incidence at } \mathrm{dMAX}_{(n-10)}}{\text{cSSR incidence at } \mathrm{dMAX}_{(n-10)}} \right] \times 100 \]

    SSRs in coding regions

    The assessment of SSR distribution across the genome revealed that the non-coding region accounted for 679 SSRs (22.4%), whereas the coding region, comprising 32 proteins/putative genes/ORFs, housed 2357 SSRs (77.6%) (Supplementary file 2).

    Subsequently, we analyzed the SSR prevalence across different genes of the studied genomes. Six genes accounted for over 92% of SSRs. Overall, the LTAg gene alone accounted for over 47% of total SSRs, with the VP1 gene a distant second at around 16% (Fig. 3d). Thereafter, we dissected the data across different genera. Interestingly, though the LTAg gene takes the pole position in the housing of SSRs across genera, its contribution varied. In Betapolyomavirus, it accounted for one in every two SSRs (49.54%), while in Gammapolyomavirus approximately one in every three SSRs was housed in the LTAg gene (35%). This difference permeates to all the genes, albeit to a lesser extent (Fig. 3e, Supplementary files 2–3).

    SSRs (mono‑nucleotide) specificity and host range exclusivity

    The compilation of the contribution of different SSRs to overall incidence revealed an interesting observation. Eighteen species had mono-nucleotide SSRs composed one hundred percent of A/T. Further, the majority of these viruses had humans or members of the ape family as their hosts. To elucidate a possible pattern and its significance, we arranged all the studied species in decreasing order of the A/T contribution to their mono-nucleotide SSRs (Fig. 4, Supplementary files 1–2). Notably, viruses with humans, apes and related species as hosts have a much higher A/T mono-nucleotide SSR composition than those with birds and fishes as hosts (Fig. 4).
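    The ranking by A/T contribution can be sketched as below. This is a simplification and not the IMEx workflow: it detects only perfect mono-nucleotide runs of at least 6 bases with a regular expression (IMEx also allows imperfect repeats), treats each sequence as linear, and uses invented toy sequences and hypothetical function names.

```python
import re

# Perfect mono-nucleotide runs of length >= 6 (the minimum used in the study).
MONO_RUN = re.compile(r"A{6,}|C{6,}|G{6,}|T{6,}")


def mono_ssr_motifs(seq):
    """Return the motif base of every perfect mono-nucleotide run in seq."""
    return [m.group(0)[0] for m in MONO_RUN.finditer(seq.upper())]


def at_mono_share(seq):
    """Percentage of mono-nucleotide SSRs whose motif is A or T (None if no SSRs)."""
    motifs = mono_ssr_motifs(seq)
    if not motifs:
        return None
    return 100 * sum(m in "AT" for m in motifs) / len(motifs)


# Toy sequences (hypothetical, not real polyomavirus genomes).
toy_genomes = {
    "toy_1": "GC" + "A" * 7 + "CGCG" + "T" * 6 + "GC",
    "toy_2": "A" * 6 + "G" + "C" * 6,
    "toy_3": "ACGTACGT",
}

shares = {name: at_mono_share(seq) for name, seq in toy_genomes.items()}
ranked = sorted(
    ((name, share) for name, share in shares.items() if share is not None),
    key=lambda item: item[1],
    reverse=True,
)
print(ranked)  # [('toy_1', 100.0), ('toy_2', 50.0)]
```

    Genomes without any mono-nucleotide SSR are left out of the ranking here rather than being scored as zero, which is a choice of this sketch.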

    Using representative species (9 each) we thereon investigated whether the SSRs composition by A/T and the hosts reflect a pattern. Dot plot analysis was performed for nine species each with humans, apes and related species as hosts (Fig. 5a) and nine species with birds, fishes and other species as hosts (Fig. 5b). Interestingly, even though three species in Fig. 4 have 100% mono-nucleotide SSR contribution by A/T (same as Fig. 5a), the overall number of dots (reflective of repeat sequences) is higher for all the genomes of Fig. 5a, representing humans and related species as hosts.

    Phylogenetic tree of Polyomaviridae

    Subsequently, we constructed the phylogenetic tree of the 98 Polyomaviridae genomes and observed that all the viruses are not evolved together as per their hosts. However, hosts do

  • 3 Biotech (2021) 11:35

    1 3

    Page 7 of 12 35

    Fig.

    3 a

    SSR

    inci

    denc

    e an

    d m

    otif

    leng

    th. A

    n in

    crea

    se in

    repe

    at m

    otif

    resu

    lted

    in le

    sser

    inci

    denc

    e, in

    vers

    e pr

    opor

    tiona

    lity,

    which is expected. However, two observations should be noted. First, Gammapolyomavirus is the only genera where the highest incidence is of di-nucleotide repeat motifs. All others have mono-nucleotide motif as most represented along expected lines. Second, the fall in incidence from mono- to di-nucleotide motif SSRs is the least in Deltapolyomavirus. b Mono-nucleotide motif composition. In spite of varying GC percentage (Fig. 1), the mono-nucleotide motif composition is very much biased towards A/T across all genera. Total represents overall data. c Di-nucleotide motif composition. Though AT/TA is the most represented di-nucleotide repeat motif overall, it does not enjoy the same status across all genera, with Alphapolyomavirus being the exception. Here, CT/TC has the highest incidence closely followed by AT/TA. d Distribution of SSRs (%) across different proteins. Overall, LTAg accounted for over 47% of all SSRs in the coding region with VP1 coming a distant second at around 16%. Only the 6 proteins which accounted for the highest SSRs were included, the rest have been collectively taken as “Others”. e SSRs contribution (%) by proteins across different genera. Herein, subtle variations are visible. Though LTAg gene accounts for maximum SSRs in the coding genome across all the genera but the contributing percentage varies from 35% in Gammapolyomavirus to almost 50% in Betapolyomavirus


    reflect in the tree. Viruses with the same or related hosts cluster together at multiple places in the tree (Fig. 6). That not all viruses with human or other shared hosts follow this pattern indicates only that factors other than the host also contribute to genome evolution.

    We then superimposed the percentage contribution of the AT region to mono-nucleotide SSRs, the phylogenetic analysis and the known hosts. For clarity, host illustrations are shown only for species with > 90% of their mono-nucleotide SSRs in the AT region; the complete information is provided in Fig. 4. We hypothesize that the presence of mono-nucleotide repeats in the AT region contributes to viral host flexibility and interchangeability.

    Discussion

    Owing to the variable nature of the A/T and G/C regions of DNA, these sequences often exhibit specific attributes. The significance of AT repeats in strand slippage and copy number polymorphism is well documented (Katti et al. 2001). Although this implies that GC content is an important aspect of SSR studies, that is not necessarily the case, for two reasons. First, the uneven distribution of SSRs across a genome, as observed herein and reported for other genomes, is not determined by GC content (Chen et al. 2012; Alam et al. 2013, 2019). For instance, there are 18 species herein in which all mono-nucleotide SSRs are localized to the A/T region. These genomes have a maximum GC content of 52%, so in the most GC-rich case the remaining 48% of the genome houses one hundred percent of the mono-nucleotide repeats, which supports the argument. We believe that this unevenness in distribution is not random but purposeful, most probably related to host range, as discussed later. Second, the prevalence of repeats depends on the size of the repeat motif: what applies to mono-nucleotides does not hold for di-nucleotides, and it also varies from one genus to another. However, two exceptions, both in Gammapolyomavirus, deserve mention. First, it is the only genus in which “C” is the most frequent mono-nucleotide SSR, a deviation from the AT region being the hub for shorter repeat motifs. In contrast, it returns to expected lines with “AT/TA” being the most represented di-nucleotide repeat motif. Second, this genus has fewer species (nine), which may be viewed from multiple perspectives. Either the smaller number of species is the reason for the aberrant observation, or this uniqueness is the reason for fewer species in Gammapolyomavirus. We believe the latter.
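    The point that GC content alone does not fix where mono-nucleotide repeats sit can be illustrated with a minimal sketch. It is not the IMEx procedure used in this study: the regular-expression scan, the function names and the minimum tract length of 6 are assumptions made only for illustration, and the toy sequence is invented.

```python
import re


def gc_percent(seq):
    """Percentage of G and C bases in the sequence."""
    seq = seq.upper()
    return 100.0 * sum(seq.count(base) for base in "GC") / len(seq)


def mono_ssr_at_share(seq, min_repeats=6):
    """Find perfect mono-nucleotide tracts and report the A/T share.

    min_repeats is a hypothetical threshold chosen for this sketch.
    Returns (percent of tracts made of A or T, list of tracts), where each
    tract is (0-based start, base, length). The percent is None if no tract
    is found.
    """
    seq = seq.upper()
    # One base followed by at least (min_repeats - 1) copies of itself.
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_repeats - 1))
    tracts = [(m.start(), m.group(1), len(m.group(0)))
              for m in pattern.finditer(seq)]
    if not tracts:
        return None, tracts
    at_count = sum(1 for _, base, _ in tracts if base in "AT")
    return 100.0 * at_count / len(tracts), tracts


# Invented toy sequence: GC-rich overall, yet most tracts are A or T.
toy = "GGCC" + "A" * 7 + "GCGC" + "T" * 6 + "CCGG" + "C" * 6 + "GATC"
share, tracts = mono_ssr_at_share(toy)
print("GC%:", round(gc_percent(toy), 2))
print("A/T share of mono-nucleotide tracts (%):", round(share, 2))
```

    A genome in which every tract is an A or T run would give a share of 100, which is the situation described above for the 18 species.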

    Fig. 4 labels (species code and host, in the order shown): BM81; BM83 Serinus canaria; BM21 Mus musculus; BM76 Gallus gallus, Melopsittacus undulatus; BM77 Eurasian jackdaw; BM37 Pteropus vampyrus; BM82 Pyrrhula pyrrhula griseiventris; BM75 Anser sp.; BM2 Acerodon celebensis; BM99 Sparus aurata;
    BM66 Rattus norvegicus; BM84 Bos taurus; BM38 Rattus norvegicus; BM3 Artibeus planirostris; BM17 Homo sapiens; BM44 Desmodus rotundus; BM13 Homo sapiens; BM78 Cracticus torquatus; BM85 Centropristis striata; BM35 Pongo pygmaeus;
    BM19 Mesocricetus auratus; BM10 Dobsonia moluccensis; BM89 Sorex araneus; BM47 Equus caballus; BM92 Sturnira lilium; BM90 Sorex coronatus; BM58 Miniopterus africanus; BM91 Sorex minutus; BM98 Procyon lotor; BM25 Pan troglodytes verus;
    BM12 Gorilla gorilla gorilla; BM27 Pan troglodytes verus; BM93 Miniopterus schreibersii; BM15 Homo sapiens; BM29 Pan troglodytes verus; BM97 Ailuropoda melanoleuca; BM56 Meles meles; BM96 Rousettus aegyptiacus; BM45 Dobsonia moluccensis; BM26 Pan troglodytes verus;
    BM80 Lonchura maja; BM11 Eidolon helvum; BM5 Ateles paniscus; BM23 Otomops martiensseni; BM94 Miniopterus schreibersii; BM16 Homo sapiens; BM32 Piliocolobus badius; BM95 Canis familiaris; BM18 Macaca fascicularis; BM7 Carollia perspicillata;
    BM8 Chlorocebus pygerythrus; BM22 Otomops martiensseni; BM6 Cardioderma cor; BM53 Loxodonta africana; BM71 Homo sapiens; BM20 Molossus molossus; BM52 Leptonychotes weddellii; BM28 Pan troglodytes verus; BM57 Microtus arvalis; BM9 Chlorocebus pygerythrus;
    BM39 Acerodon celebensis; BM30 Pan troglodytes schweinfurthii; BM86 Delphinus delphis; BM61 Myotis lucifugus; BM49 Homo sapiens; BM70 Zalophus californianus; BM64 Pteronotus davyi; BM14 Homo sapiens; BM51 Homo sapiens; BM4 Artibeus planirostris;
    BM36 Procyon lotor; BM50 Homo sapiens; BM34 Pongo abelii; BM46 Dobsonia moluccensis; BM31 Papio cynocephalus; BM40 Artibeus planirostris polyomavirus 1; BM69 Vicugna pacos; BM41 Cebus albifrons; BM65 Pteronotus parnellii; BM54 Macaca mulatta;
    BM88 Trematomus pennellii; BM87 Rhynchobatus djiddensis; BM79 Erythrura gouldiae; BM74 Homo sapiens; BM73 Homo sapiens; BM72 Homo sapiens; BM68 Saimiri sciureus; BM67 Saimiri boliviensis; BM62 Pan troglodytes verus; BM60 Myodes glareolus;
    BM59 Mus musculus; BM55 Mastomys natalensis; BM48 Homo sapiens; BM43 Chlorocebus pygerythrus; BM42 Cercopithecus erythrotis; BM33 Piliocolobus rufomitratus; BM24 Pan troglodytes verus; BM100 Trematomus bernacchii; Pygoscelis adeliae.
    Colour scale: 25.49 to 100.00

    Fig. 4 Genomes in decreasing order of % A/T mono-nucleotide repeat motifs. Though not perfect, the similar values for humans and related species suggest that host range depends on SSR distribution across AT regions of the genome. The higher the contribution of mono-nucleotide repeat motifs from the AT region, the greater the chance that the virus has humans or related species as its host. The color gradient represents the percentage of A/T mono-nucleotide repeat motifs


    Fig. 5 Dot plot analysis of Polyomaviridae genomes with a human, apes or related species as hosts with mono-nucleotide repeat motif contribution of 100% from the AT region and b divergent hosts with varying mono-nucleotide repeats in the AT region


    [Phylogenetic tree figure: tip labels BM2 to BM100 with bootstrap values and branch lengths. Legend: Mono Nucleotide Repeat (SSR) AT%; Mono Nucleotide Repeat (SSR) GC%; Tree scale: 0.1; host symbols for Human, Ape, Monkey, Bat, Alpaca, Racoon, Rodent, Fish, Dolphin, Seal and Bird; a marker labelled “=100%”.]

    The study of cSSRs has always been relevant alongside SSRs owing to their involvement in functional aspects such as regulation of gene expression (Kashi and King 2006; Chen et al. 2011). Essentially, cSSR incidence reflects the accumulation of SSRs in the genome: a higher cSSR incidence means that SSRs are present in close proximity to each other. Because these are sources of variation and genome evolution (Kim et al. 2008; Madsen et al. 2008), we further examined cSSRs in terms of cSSR% and by varying dMAX. An increase in cSSR incidence with increasing dMAX is expected, and it was observed as well (Fig. 2b). However, the increase does not conform to any pattern, as is visible from the different lengths of the differently coloured lines, which is indicative of each genome's uniqueness. The few instances of a negative percentage arise when two independent cSSRs merge into one with increasing dMAX, leading to a decrease in cSSR incidence. Moreover, cSSR% varies not only across the genera of Polyomaviridae but also among species of the same genus (Fig. 2a). In spite of these variations, of all the reported cSSRs, only 17 are composed of three SSRs and 3 of four SSRs; all the rest comprise two SSRs only. Only one species, BM97, has two cSSRs of more than 3 SSRs each. Other genomes have a single representation only. All the above figures are for a dMAX of 10 (Supplementary file 4).
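    The merging effect described above, where a larger dMAX can lower the cSSR count, can be shown with a small sketch. This is an illustration under stated assumptions and not the IMEx implementation: SSRs are given as invented (start, end) coordinates, the gap is counted as the number of bases between two neighbouring SSRs, and the function name is hypothetical.

```python
def compound_ssrs(ssrs, dmax):
    """Group SSRs into compound SSRs (cSSRs).

    ssrs: list of (start, end) positions, 1-based and inclusive.
    dmax: largest gap (in bases) allowed between neighbouring SSRs
          for them to be joined into one cSSR.
    Returns only the groups that contain two or more SSRs.
    """
    ordered = sorted(ssrs)
    if not ordered:
        return []
    groups = [[ordered[0]]]
    for prev, nxt in zip(ordered, ordered[1:]):
        gap = nxt[0] - prev[1] - 1  # bases strictly between the two SSRs
        if gap <= dmax:
            groups[-1].append(nxt)  # close enough: extend the current group
        else:
            groups.append([nxt])    # too far: start a new group
    return [g for g in groups if len(g) >= 2]


# Invented coordinates: gaps between neighbours are 4, 15 and 5 bases.
toy_ssrs = [(10, 17), (22, 29), (45, 52), (58, 65)]
for dmax in (3, 10, 20):
    print("dMAX", dmax, "cSSRs:", len(compound_ssrs(toy_ssrs, dmax)))
```

    In this toy case the two separate cSSRs found at the middle dMAX value fuse into a single cSSR at the largest value, so the count falls even though no SSR has changed.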

    The prevalence of SSRs in the coding regions of viral genomes conforms to earlier reports (Alam et al. 2014, 2019). The distribution of around 78% of SSRs across coding regions is in accordance with other viral genomes, though the gene-specific data (Fig. 3d–e) exhibit features unique to Polyomaviridae genomes. The overlap of genes is reflected by the LTAg/STAg or VP2/VP3 representation. SSRs in these overlapping regions can be influential because an alteration there would affect two genes simultaneously. The cSSR constitution ranged from two to four SSRs, albeit with divergent motifs, as mentioned above. The distribution of SSRs failed to conform to a pattern. Thus, we can affirm that the genome-specific clustering of SSRs is not only unique but regulated as well. This may be an attempt by the genome to shield itself from changes, as clustering of SSRs would create hot-spots for mutations.
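    Why an SSR in an overlapping region matters can be sketched as a simple interval check. The gene coordinates below are hypothetical and are not taken from any genome in this study; the function name is likewise an assumption for illustration.

```python
def genes_hit(ssr, genes):
    """Return the names of all genes whose interval overlaps the SSR.

    ssr:   (start, end), 1-based and inclusive.
    genes: dict mapping gene name to (start, end), 1-based and inclusive.
    """
    s, e = ssr
    return [name for name, (gs, ge) in genes.items() if s <= ge and e >= gs]


# Hypothetical layout in which VP3 lies entirely within VP2.
toy_genes = {"VP2": (100, 1100), "VP3": (460, 1100), "VP1": (1200, 2300)}

print(genes_hit((500, 507), toy_genes))    # SSR inside the VP2/VP3 overlap
print(genes_hit((150, 157), toy_genes))    # SSR in VP2 only
print(genes_hit((1150, 1157), toy_genes))  # SSR between genes
```

    An SSR that returns two gene names is one where a single repeat expansion or contraction would alter both coding sequences at once.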

    Though the overall evolution of viruses is guided by multiple factors such as host range and genome features, the number and composition of mono-nucleotide SSRs showed a correlation with the hosts, and we believe the data provide a foundation for predicting the future hosts of any viral species. Our hypothesis stems from the fact that eighteen genomes had mono-nucleotide repeats restricted exclusively to the AT region. A closer analysis (Fig. 4) revealed a pattern suggesting humans or related species as hosts of those genomes. On widening our analysis, we can say with confidence that the contribution of mono-nucleotide SSRs from the AT region is pivotal for host range determination. Viruses are constantly expanding their hosts, as illustrated by HIV, which originated in monkeys, and Coronavirus, which originally had bats as hosts (19). Both monkeys and bats are hosts for Polyomavirus genomes in which mono-nucleotide SSRs come exclusively or near-exclusively from the AT region.

    Earlier studies on the evolution of Polyomavirus have suggested gene duplications and inversions as sources of variation in genome size and have also predicted their prior existence in invertebrate hosts, indicating a virus family that is evolving in terms of host (Buck et al. 2016). This becomes all the more relevant when we consider the organisms that, on the basis of this study, are suggested to share a common or interchangeable host range for viruses. These include monkeys (HIV) and bats (Coronavirus) (Parrish et al. 2008). We accept that the correlation between mono-nucleotide repeats from the AT region and host is not universal, which suggests other influencing factors, but its presence in species across genera demands further validation of the idea.

    To conclude, the incidence and distribution of SSRs in the Polyomaviridae genomes suggest a unique genome SSR signature that is defined by multiple factors, including GC content, evolutionary relation and coding/non-coding regions. We also propose the mono-nucleotide repeat distribution in the A/T region of the genome as a key parameter for host divergence to humans and related species. This needs to be ascertained in all known human-infecting viruses.

    Author contributions RL performed all the analysis of extracted SSRs and prepared all the figures and tables. MGJ carried out the extraction of microsatellites using IMEx. SA supervised the whole study and prepared the manuscript.

    Funding Not applicable.

    Compliance with ethical standards

    Conflict of interest The authors declare that they have no conflict of interest.

    Availability of data and material All data have been provided as supplementary material.

    References

    Ahsan N, Shah KV (2006) Polyomaviruses and human diseases. Adv Exp Med Biol 577:1–18. https://doi.org/10.1007/0-387-32957-9_1

    Alam CM, Singh AK, Sharfuddin C, Ali S (2013) In-silico analysis of simple and imperfect microsatellites in diverse tobamovirus genomes. Gene 530:193–200. https://doi.org/10.1016/j.gene.2013.08.046

    Alam CM, Singh AK, Sharfuddin C, Ali S (2014) Incidence, complexity and diversity of simple sequence repeats across potexvirus genomes. Gene 537:189–196. https://doi.org/10.1016/j.gene.2014.01.007

    Alam CM, Iqbal A, Sharma A et al (2019) Microsatellite diversity, complexity, and host range of mycobacteriophage genomes of the Siphoviridae family. Front Genetics. https://doi.org/10.3389/fgene.2019.00207

    Bennetzen JL (2000) Transposable element contributions to plant gene and genome evolution. Plant Mol Biol 42:251–269

    Buck CB, Doorslaer KV, Peretti A et al (2016) The ancient evolutionary history of polyomaviruses. PLoS Pathog 12:e1005574. https://doi.org/10.1371/journal.ppat.1005574

    Burguete AS, Almeida S, Gao F-B et al (2015) GGGGCC microsatellite RNA is neuritically localized, induces branching defects, and perturbs transport granule function. eLife 4:e08881. https://doi.org/10.7554/eLife.08881

    Capella-Gutierrez S, Silla-Martinez JM, Gabaldon T (2009) trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics 25:1972–1973. https://doi.org/10.1093/bioinformatics/btp348

    Chambers GK, MacAvoy ES (2000) Microsatellites: consensus and controversy. Comp Biochem Physiol B Biochem Mol Biol 126:455–476

    Chen M, Zeng G, Tan Z et al (2011) Compound microsatellites in complete Escherichia coli genomes. FEBS Lett 585:1072–1076. https://doi.org/10.1016/j.febslet.2011.03.005

    Chen M, Tan Z, Zeng G, Zeng Z (2012) Differential distribution of compound microsatellites in various Human Immunodeficiency Virus Type 1 complete genomes. Infect Genet Evol 12:1452–1457. https://doi.org/10.1016/j.meegid.2012.05.006

    Gur-Arie R, Cohen CJ, Eitan Y et al (2000) Simple sequence repeats in Escherichia coli: abundance, distribution, composition, and polymorphism. Genome Res 10:62–71

    Hung S, Saiakhova A, Faber ZJ et al (2019) Mismatch repair-signature mutations activate gene enhancers across human colorectal cancer epigenomes. eLife 8:e40760. https://doi.org/10.7554/eLife.40760

    Kashi Y, King DG (2006) Simple sequence repeats as advantageous mutators in evolution. Trends Genet 22:253–259. https://doi.org/10.1016/j.tig.2006.03.005

    Katoh K, Standley DM (2013) MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol Biol Evol 30:772–780. https://doi.org/10.1093/molbev/mst010

    Katti MV, Ranjekar PK, Gupta VS (2001) Differential distribution of simple sequence repeats in eukaryotic genome sequences. Mol Biol Evol 18:1161–1167. https://doi.org/10.1093/oxfordjournals.molbev.a003903

    Kim T-S, Booth JG, Gauch HG et al (2008) Simple sequence repeats in Neurospora crassa: distribution, polymorphism and evolutionary inference. BMC Genomics 9:31. https://doi.org/10.1186/1471-2164-9-31

    Kofler R, Schlötterer C, Luschützky E, Lelley T (2008) Survey of microsatellite clustering in eight fully sequenced species sheds light on the origin of compound microsatellites. BMC Genomics 9:612. https://doi.org/10.1186/1471-2164-9-612

    Krumsiek J, Arnold R, Rattei T (2007) Gepard: a rapid and sensitive tool for creating dotplots on genome scale. Bioinformatics 23:1026–1028. https://doi.org/10.1093/bioinformatics/btm039

    Letunic I, Bork P (2019) Interactive Tree Of Life (iTOL) v4: recent updates and new developments. Nucleic Acids Res 47:W256–W259. https://doi.org/10.1093/nar/gkz239

    Madsen BE, Villesen P, Wiuf C (2008) Short tandem repeats in human exons: a target for disease mutations. BMC Genomics 9:410. https://doi.org/10.1186/1471-2164-9-410

    Moens U, Ludvigsen M, Van Ghelue M (2011) Human polyomaviruses in skin diseases. In: Pathology research international. https://www.hindawi.com/journals/pri/2011/123491/. Accessed 3 May 2020

    Mudunuri SB, Nagarajaram HA (2007) IMEx: imperfect microsatellite extractor. Bioinformatics 23:1181–1187. https://doi.org/10.1093/bioinformatics/btm097

    Parrish CR, Holmes EC, Morens DM et al (2008) Cross-species virus transmission and the emergence of new epidemic diseases. Microbiol Mol Biol Rev 72:457–470. https://doi.org/10.1128/MMBR.00004-08

    Stamatakis A (2014) RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 30:1312–1313. https://doi.org/10.1093/bioinformatics/btu033

    van der Meijden E, Kazem S, Dargel CA et al (2015) Characterization of T antigens, including middle T and alternative T, expressed by the human polyomavirus associated with trichodysplasia spinulosa. J Virol 89:9427–9439. https://doi.org/10.1128/JVI.00911-15

