Poster sections: The Worm; Genome and Transcriptome Assembly; Genome and Transcriptome Annotation; Transcripts Involved in Regeneration in M. lignano; Additional Findings.

[Figure: regeneration time course. Amputation, followed by RNA extraction, RNA-seq and DESeq analysis at 0, 3, 6, 12, 24, 48 and 72 hours; heat map with color key.]

[Figure: pluripotency pathways from human/murine stem cells: Jak-Stat (LIF, LIFR, JAK, STAT3), MAPK (Grb2, Ras, Raf, Mek, Erk1/2), PI3K-Akt (IGF, IGF-IR, PI3K, Akt), TGFβ (Activin, Nodal, ACVRI/II, SMAD2/3, Bmp4, BMPRI/II, SMADs, SMAD4, TGFβ, TGFβR, p38, ID, DUSP9), Wnt (Wnt, Frizzled, Dvl, GSK-3β, β-catenin, Axin, APC), FGF2/FGFR, and the core transcriptional network (Klf4, c-Myc, Tbx3, Sox2, Nanog, Oct4). Legend: not found in M. lignano transcriptome; found in M. lignano transcriptome; factors with TGFβ-like domain found in M. lignano transcriptome; STAT3 not found, other STATs identified in M. lignano transcriptome; KLF4 not found, other KLFs identified in M. lignano transcriptome; BMP4 not found, other BMPs identified in M. lignano transcriptome.]

[Figure: distance tree of Myc and Max candidates with bootstrap support values; scale bar 0.3.]

[Figure: number of reciprocal BLAST hits for M. lignano, S. mediterranea, C. elegans and D. melanogaster.]

[Figure: alignment of spliced leader (SL) RNAs; number at the end of each sequence is its length in nt.]

Stephanostomum sp.  -ACC----TATACGGTT---CTCT-GCCGTGTA------TATTAGT-C-ATGGT-AAGAA
Haematolechus sp.   -ACC----TATACGGTT---CTCT-GCCGTGTA------TCAGTG--C-ATGGT-AAGAA
Fasciola sp.        AACC----TTAACGGTT---CTCTTGCCCTGTA------TATTAGTGC-ATGGTAAAGAA
S.mansoni           AACC----GTCACGGTT---TTAC--TCTTGTG------ATTTGTTGC-ATGGT-AAGAA
Echinococcus sp.    CACCG--TTAATCGGTC---CTTA--CCTTGCA------ATTTTGT---ATGGT-GAGTA
M.lignano           -GCCG--TAAAGACGGT---CTCTTACTGCGAAGACTCAATTTATTGC-ATGCT-CAGTA
S.med SL1           -GCCG--TTAGACGGTC---TTATCGAAATCTATAT---AAATCTTAT-ATGGT-ACGGA
S.med SL2           -GCCG--TTAGACGGTC---TTATCGAAATCTATAT---AAAAATTAT-ATGGT-GAGGA
Stylochus sp.       TGCCGTATTTGACGGTCTCAAAAATTTCGTGTTTATTGCAATAATTGCAATGGT-AAGCA
Notoplana sp.       TGCCGTATTTGACGGTCTCAAAAATTTCGTGTTTATTGCAATAATTGCAATGGT-AAGCA

Stephanostomum sp.  TCGAA-----TTCGAC------CTATGGTCGAATAA-ATTCTTTGGCTAG-CCTCT----
Haematolechus sp.   TCGAG-----TTCGACTCACATCGTTGGTCGAATAAGATTATTTGGCTAG-CCTCCACTC
Fasciola sp.        TCG-------TTGGAC------CATCGGTCCAAACCCATTATTTGGCTAG-CCTCCATTC
S.mansoni           CCG--------TCGAC------CAAGAATCGAAGTT--TTCTTTGGCAGC-CCTAACACA
Echinococcus sp.    TCGATGCAGCTCAGGCTG-TGCCTACGGAGCTGACCCAGTATTTGGCTGGTCCTT-----
M.lignano           TCGACCCAGCTTCATCAAAT-AAAAGAATGCGAATCGAATATACAGCCGAGCCCGACAAC
S.med SL1           CCG--------TTATC------CAACATTAGTTGGTTAATTTTTGACAGTCACTTGAATC
S.med SL2           CCG--------TTTGC------CAGCATTAGTTGGCTAATTTTTGACAGTAGCTTGCAT-
Stylochus sp.       TCAAAT-------GAT------CCAGTGTGATCGTCGAGTCTTTG--ACAGGCCG-----
Notoplana sp.       TCAAA--------GAT------CCA-TGTGATCGTCGAGTCTTTGACACAGGCCG-----

Stephanostomum sp.  ---TCGGGGGCTAA------   96
Haematolechus sp.   TGGTCGGGGGCTA-------  108
Fasciola sp.        TG--CAGAGGCTAAGAATCC  110
S.mansoni           ----CGGGG-----------   91
Echinococcus sp.    ----CGAGGGCC--------  105
M.lignano           TCGGCACTGTCTGCTCCGC-  130
S.med SL1           --ACAAGTGACTAT------  107
S.med SL2           --GCAAGTGACTAT------  106
Stylochus sp.       ----CGAGGCCTATAT----  111
Notoplana sp.       ----CAAGGCCTATTT----  111

[Figure: sequence complexity versus normalized frequency (×10^4) for C. elegans, D. melanogaster, H. sapiens, S. mansoni and M. lignano; series include M. lignano genome, 100 worms, and sorted stem/germ cells RNA.]

Genome             % of bases masked by TRF
M. lignano         24.8
C. elegans         6.8
D. melanogaster    4.7
H. sapiens         2.2
S. mansoni         0.3

[Figure: diagram of an adult worm. Labels: eyes (**), pharynx, testes, ovaries, egg, seminal vesicle, stylet, brain, vesicula granulorum, female antrum, gut.]

[Figure: animal phylogeny including Macrostomum lignano; groups: Cnidaria, Acoela, Ecdysozoa, Lophotrochozoa, Deuterostomia; scale bar 0.1.]
[Figure: flatworm phylogeny (Platyhelminthes); groups: Catenulida, Macrostomorpha, Lecithoepitheliata, Polycladida, Rhabdocoela, Proseriata, Tricladida, Bothrioplanida, Monogenea, Cestoda, Trematoda.]

[Figure: adult worm, head to tail, stained with DAPI and EdU. Labels: male gonopore, adhesive organs, eyes (**), T, O, DE; scale bar 50 μm.]

[Figure: M. lignano k-mer model. 23-mer coverage versus 23-mer frequency (×10^5); curves: observed, errors, p1 = 110, p2 = 220, p3 = 330, p4 = 440, composite.]

[Figure: contig size versus NG(x) % for the ML1 and ML2 assemblies.]

The free-living flatworm Macrostomum lignano, much like its better-known planarian relative Schmidtea mediterranea, has a nearly unlimited regenerative capacity. Following injury, this species can regenerate an almost entirely new organism. This is attributable to the presence of an abundant somatic stem cell population, the neoblasts. These cells are also essential for the ongoing maintenance of most tissues, as their loss leads to the rapid and irreversible degeneration of the animal. This set of unique properties makes flatworms attractive organisms for studying the evolution of pathways involved in self-renewal, fate specification, and regeneration. The use of Macrostomum lignano, or other flatworms, as models, however, is hampered by the lack of a well-assembled and annotated genome sequence, fundamental to modern genetic and molecular studies. Here we report the genomic sequence of Macrostomum lignano and an accompanying characterization of its transcriptome. The genome structure of Macrostomum lignano is remarkably complex, with ~75% of its sequence consisting of simple repeats and transposon sequences.
This has made high-quality assembly from Illumina reads alone impossible (N50 = 414 bp). We therefore obtained 130X coverage in long sequencing reads from the PacBio platform and combined this with more than 250X Illumina coverage to create a mixed assembly with a significantly improved N50 of 64 kb. We complemented the reference genome with an assembled and annotated transcriptome, and used both datasets in combination to probe gene expression patterns during regeneration, examining pathways important to stem cell function. Additionally, we found evidence of low levels of CpG methylation in the Macrostomum lignano genome and evidence of trans-splicing in the worm's transcriptome. Interestingly, we found that flatworms lack Myc, a pluripotency factor that is highly conserved in bilaterians and beyond (cnidarians, poriferans). As a whole, our data will provide a crucial resource for the community for the study not only of invertebrate evolution but also of regeneration and somatic pluripotency.

Genome and transcriptome of the regeneration-competent flatworm, Macrostomum lignano.

Wasik K.A.*¹, Gurtowski J.*¹, Zhou X.¹,², Ramos O.M.¹, Delas M.J.¹,³, Battistoni G.¹,³, El Demerdash O.¹, Falciatori I.¹,³, Vizoso D.B.⁴, Smith A.D.⁵, Ladurner P.⁶, Scharer L.⁴, McCombie W.R.¹, Hannon G.J.¹,³ and Schatz M.¹

1 Watson School of Biological Sciences, Howard Hughes Medical Institute, Cold Spring Harbor Laboratory, New York 11724, USA
2 Molecular and Cellular Biology Graduate Program, Stony Brook University, NY 11794
3 Cancer Research UK Cambridge Institute, Li Ka Shing Centre, University of Cambridge, Cambridge CB2 0RE, United Kingdom
4 Department of Evolutionary Biology, Zoological Institute, University of Basel, 4051 Basel, Switzerland
5 Molecular and Computational Biology, University of Southern California, Los Angeles, CA 90089, USA
6 Department of Evolutionary Biology, Institute of Zoology and Center for Molecular Biosciences Innsbruck, University of Innsbruck, A-6020 Innsbruck, Austria

A. Phylogenetic analysis of 23 animal species using partial sequences of 43 genes. Fig. modified from Egger et al. (2015).
B. Interference contrast image and a diagrammatic representation of an adult Macrostomum lignano.
C. Phylogenomic analysis of 27 flatworm species (21 free-living and 6 neodermatan) using >100,000 aligned amino acids. Fig. modified from Egger et al. (2015).
D. Electron micrograph of an M. lignano neoblast. Note the small rim of cytoplasm (yellow) and the lack of cytoplasmic differentiation. er - endoplasmic reticulum; mi - mitochondria; mu - muscle; ncl - nucleolus; nu - nucleus (red).
E. Immunofluorescence labeling of dividing neoblasts with EdU (red) in an adult worm. All cell nuclei are stained with DAPI (blue). T - testes; O - ovaries; DE - developing eggs; asterisks denote eyes.
F. Representation of 23-mer frequency and coverage in the Illumina sequencing data generated from DNA extracted from a population of adult worms. M. lignano shows an unusual 4-modal 23-mer distribution.
G. Comparison of the Illumina-only (ML1) and PacBio (ML2) assemblies. Contig length distribution (Log2 scale) over the M. lignano genome in the ML1 (green) and ML2 (red) assemblies. Note that the ML1 assembly covers only about 55% of the genome.
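The N50 and NG(x) statistics used to compare the ML1 and ML2 assemblies can be illustrated with a minimal sketch. It is not the authors' code: the function names `ngx` and `n50` and the toy contig lengths are hypothetical, and it assumes the usual definitions (NG(x) is the length of the shortest contig in the smallest set of longest contigs that together cover x% of the estimated genome size; N50 is the same with the assembly's own total length and x = 50).

```python
def ngx(contig_lengths, genome_size, x=50):
    """Return NG(x) for a list of contig lengths.

    Contigs are added from longest to shortest until they cover x% of
    genome_size; the length of the last contig added is NG(x).
    Returns 0 when the assembly never reaches x% of the genome.
    """
    target = genome_size * x / 100.0
    covered = 0
    for length in sorted(contig_lengths, reverse=True):
        covered += length
        if covered >= target:
            return length
    return 0


def n50(contig_lengths):
    """N50 is NG(50) computed against the assembly's own total length."""
    return ngx(contig_lengths, sum(contig_lengths), 50)


if __name__ == "__main__":
    toy_contigs = [80, 70, 50, 40, 30, 20]  # hypothetical lengths, total 290
    print(n50(toy_contigs))            # against the assembly length
    print(ngx(toy_contigs, 400, 50))   # against an assumed genome size of 400
    print(ngx(toy_contigs, 1000, 50))  # assembly too small to reach 50%
```

Because NG(x) is measured against the genome size rather than the assembly size, an assembly that covers only part of the genome has no NG(x) value beyond that fraction, which is one way to read the note that ML1 covers only about 55% of the genome.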
[Figure: circular overview of contigs C1 to C50. Tracks: contig size (×10 Kbp), contig ID, miRNA count, large repetitive elements, transcript abundance, tandem repeat unit size in bp, tandem repeat count, GC content, Illumina coverage.]

A. Schematic representation of the experimental design: 200 worms (per replicate) were amputated at a level between the brain and the gonads. The heads were allowed to regenerate, and regenerating animals were collected at different timepoints post amputation (0, 3, 6, 12, 24, 48 and 72 hours). RNA-Seq libraries from each timepoint were analyzed for differentially expressed genes. Below: a heat map of differentially expressed genes at the different regeneration timepoints. Each replicate is plotted separately. Downregulated and upregulated transcripts are shown in green and red, respectively. The scale covers Log2 values. The samples are grouped by complete-linkage clustering using Euclidean distance.
B. Known pluripotency pathways from H. sapiens and M. musculus were adapted from the Kyoto Encyclopedia of Genes and Genomes. Factors that had potential homologues in M. lignano are labeled.

A. Overview of the 50 largest contigs in the M. lignano genome, making up about 2.6% of the total assembly.
Different tracks denote (moving inwards): contig size × 10 Kbp; miRNA count (1-54 mapped miRNAs); large repetitive elements (RepeatScout) (1-4476 identified repeats); transcript count (1-43 mapped transcripts); tandem repeat unit size in base pairs (1-500); tandem repeat count (1-28); GC content (0-1); and Illumina coverage (4-160X). The color gradients correspond to the range of values for each track (lower values are lighter, higher values are darker).
B. Sequence complexity comparison across five organisms. Drosophila melanogaster has an abundance of very low complexity sequence, not found in the other species. Macrostomum lignano has a sizable amount of moderately complex sequence that is not found in the other species and that does not appear to be expressed.
C. Tandem Repeat Finder was run on five species to assess their tandem and low-complexity sequence composition. Macrostomum lignano had far more bases masked by Tandem Repeat Finder than the other organisms in the test set.
D. The number of reciprocal BLAST hits against the Homo sapiens transcriptome for four species: Macrostomum lignano, Schmidtea mediterranea, Caenorhabditis elegans, and Drosophila melanogaster. Only hits passing the E-value cutoff of ≤1e-10 are counted.

A. M. lignano lacks the Myc gene. Evolution of the Myc and Max gene families across different representatives of the animal phyla. Myc and Max gene candidates were retrieved based on reciprocal best BLASTp hits from the available transcriptomes. The distance tree was inferred using neighbor-joining based on the JTT sequence evolution model (1000 bootstrap replicates). Human USF proteins are used as an outgroup. The Myc branch is labeled in green, the Max branch in blue. dme – D. melanogaster, hsa – H. sapiens, lgi – L. gigantea, cte – C. teleta, hvu – H. vulgaris, aqu – A. queenslandica, cel – C. elegans, mli – M. lignano, mfu – M. fusca, mosp – Monocelis sp., psi – P. siphunculus, ltr – L. tremelaris, ece – E. celerrima, meli – M. lingua, msc – M. schmidtii, mili – M. lineare, nco – N. coelogynoporoides, rsp – Rhabdopleura sp., gvu – G. vulgaris, csp – Cerebratulus sp., pca – P. caudatus, sst – S. sthenum, cle – C. lemnae, pdu – P. dumerilii, sma – S. mediterranea, scma – S. mansoni. Transcript ID is next to each phylum name.
B. Trans-splicing in M. lignano. Alignment between the first 130 nt of M. lignano's putative SL RNA and SL RNAs from other flatworms. The conserved splice junction is indicated by an arrowhead. Spliced leader sequences are labeled in blue. The potential initiator AUG (last three nucleotides of the spliced leader) is labeled in green.

Conclusions:
• We have assembled and annotated a highly repetitive genome using a mix of PacBio and Illumina sequencing
• We have found that:
  - M. lignano's genome shows evidence of CpG methylation
  - It has retained a large number of homeobox genes compared to other flatworms
  - The transcriptome shows evidence of trans-splicing
  - Flatworms lost the very conserved Myc gene
• We have characterized the gene expression patterns during regeneration in M. lignano
• Wasik et al. PNAS (2015); doi: 10.1073/pnas.1516718112
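The reciprocal-best-hit criterion mentioned in the captions (for the cross-species hit counts and for retrieving Myc and Max candidates) can be sketched as follows. This is an illustrative outline under stated assumptions, not the pipeline used for the poster: the functions `best_hits` and `reciprocal_best_hits`, the tuple layout and all sequence IDs are hypothetical, hits are assumed to be already parsed from tabular BLAST output, and ranking by bit score is an assumption. Only the E-value cutoff of 1e-10 comes from the caption.

```python
def best_hits(hits, max_evalue=1e-10):
    """Map each query to its best subject among hits passing the cutoff.

    hits: iterable of (query, subject, evalue, bitscore) tuples.
    The best hit is taken to be the one with the highest bit score
    (an assumption; other rankings are possible).
    """
    best = {}
    for query, subject, evalue, bitscore in hits:
        if evalue > max_evalue:
            continue  # discard hits weaker than the E-value cutoff
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a, max_evalue=1e-10):
    """Return sorted (a, b) pairs that are each other's best hit."""
    forward = best_hits(a_vs_b, max_evalue)
    reverse = best_hits(b_vs_a, max_evalue)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


if __name__ == "__main__":
    # Hypothetical toy hits: M. lignano transcripts vs human, and the reverse.
    mli_vs_hsa = [
        ("mli_1", "hsa_A", 1e-50, 200.0),
        ("mli_1", "hsa_B", 1e-20, 90.0),
        ("mli_2", "hsa_B", 1e-30, 120.0),
        ("mli_3", "hsa_C", 1e-5, 40.0),   # fails the E-value cutoff
    ]
    hsa_vs_mli = [
        ("hsa_A", "mli_1", 1e-48, 195.0),
        ("hsa_B", "mli_1", 1e-18, 85.0),
        ("hsa_B", "mli_2", 1e-29, 118.0),
        ("hsa_C", "mli_3", 1e-4, 38.0),   # fails the E-value cutoff
    ]
    print(reciprocal_best_hits(mli_vs_hsa, hsa_vs_mli))
```

Requiring the hit to be best in both directions is stricter than a one-way search, which is why a factor can be reported as "not found" even when related family members (for example other STATs, KLFs or BMPs) are identified.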