Comprehensive analysis of epigenetic signatures of human transcription control†

Guillaume Devailly *a and Anagha Joshi b

Advances in sequencing technologies have enabled exploration of epigenetic and transcription profiles at a genome-wide level. The epigenetic and transcriptional landscape is now available in hundreds of mammalian cell and tissue contexts. Many studies have performed multi-omics analyses using these datasets to enhance our understanding of relationships between epigenetic modifications and transcription regulation. Nevertheless, most studies so far have focused on the promoters/enhancers and transcription start sites, and other features of transcription control including exons, introns and transcription termination remain under-explored. We investigated the interplay between epigenetic modifications and diverse transcription features using the data generated by the Roadmap Epigenomics project. A comprehensive analysis of histone modifications, DNA methylation, and RNA-seq data of about thirty human cell lines and tissue types allowed us to confirm the generality of previously described relations, as well as to generate new hypotheses about the interplay between epigenetic modifications and transcript features. Importantly, our analysis included previously under-explored features of transcription control, namely transcription termination sites, exon-intron boundaries, middle exons and exon inclusion ratio. We have made the analyses freely available to the scientific community at joshiapps.cbu.uib.no/perepigenomics_app/ for easy exploration, validation and hypotheses generation.

a GenPhySE, Université de Toulouse, INRAE, ENVT, 31326, Castanet Tolosan, France. E-mail: [email protected]
b Computational Biology Unit, Department of Clinical Science, University of Bergen, 5021, Bergen, Norway. E-mail: [email protected]
† Electronic Supplementary Information (ESI) available at the end of this document. See DOI: 00.0000/00000000.

bioRxiv preprint doi: https://doi.org/10.1101/2020.09.23.309625; this version posted September 23, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.

Background

Epigenetic modifications of the DNA sequence and DNA-associated proteins, along with the transcriptional machinery, are thought to be the main drivers shaping the mammalian genome during development and disease [1]. Epigenetic modifications include DNA methylation, histone variants and histone post-translational modifications (such as acetylations and methylations), and facilitate tissue-specific expression [2]. The advent and maturation of sequencing technologies have facilitated large-scale generation of epigenomic data across diverse organisms in multiple cell and tissue types. Accordingly, consortia were established to generate large epigenomic datasets, including ENCODE [3], Roadmap Epigenomics [2] and Blueprint epigenome [4] for human, modENCODE [5] for model organisms, and FAANG [6,7] for farm animal species. The International Human Epigenome Consortium (IHEC) was set up to gather reference maps of human epigenomes [8]. The data from these efforts have generated new findings through integrated analyses. Such analyses are facilitated by consortia data portals [8,9] as well as portals gathering data from multiple sources [10–14], which allow easy browsing as well as downloading of both sequences and processed data. In addition, many data portals include (or link to) genome browsers to provide online solutions for data exploration [15].

Several online tools have been developed to explore publicly available epigenomic data [16–20] to gain insights into mammalian epigenetic control. These tools used diverse computational frameworks ranging from data integration and visualisation (e.g. ChIP-Atlas allows visualisation of multiple histone modifications and transcription factor binding sites at a given genomic locus by using public ChIP-seq and DNase-seq data [21]) to semi-automated annotation of the genome (e.g. Segway performed genomic segmentation of human chromatin by integrating histone modifications, transcription-factor binding and open chromatin [22]). Though identification of functional elements from epigenetic data [2,22] has been highly effective in annotating the enhancer and promoter regions of a genome, these approaches failed to capture other transcription regulation features such as exon-intron boundaries and transcription termination features. For example, the Roadmap Epigenomics project mapped about 30 epigenetic modifications across human cell lines and tissues to gather a representative set of "complete" epigenomes [2]. Using this data, Kundaje et al. built a hidden Markov model based classifier to define 15 distinct chromatin states, including active or inactive promoters, active or inactive enhancers, condensed and quiescent states. Notably, this unsupervised approach did not lead to the definition of "exon" states, and even less to "exon-included" and "exon-excluded" states. This might be because epigenetic modifications enriched at enhancers and promoters have a strong signal (or peaks), while the ones abundant at gene bodies (DNA methylation, H3K36me3) are wide and diffuse. The promoter and enhancer features therefore dominate in epigenetic data analyses, hindering recovery of associations between epigenetic modifications and other transcription control events such as splicing (constitutive or alternative). Moreover, some transcription features might not have a strong correlation with any chromatin modification studied. For example, Curado et al. [23] have estimated that only 4% of differentially included exons were associated with changes in H3K9ac, H3K27ac, and/or H3K4me3 across 5 different cell lines. A gap therefore remains in genome-wide computational analyses towards getting a comprehensive overview of the associations between epigenetic modifications and transcription control features.

On the other hand, individual targeted studies have provided evidence for interplay between epigenetics and other transcriptional features. DNA methylation at gene bodies has been positively correlated with gene expression level [24,25]. Maunakea et al. [26] observed that DNA methylation was positively correlated with splicing at alternatively spliced exons, and proposed a mechanism involving the DNA methylation reader MECP2. Lev Maor et al. [27] observed that DNA methylation at exons can be either positively or negatively correlated with splicing depending on the exons, through a mechanism involving CTCF and MECP2. A causal role of DNA methylation in alternative splicing was established by drug-induced de-methylation [28], as well as by targeted DNA methylations and de-methylations by Shayevitch et al. [29]. Xu et al. [30,31] identified the H3K36me3 epigenetic modification as associated with alternative splicing. There is some evidence for epigenetic control at transcription termination as well: the loss of gene body DNA methylation was found to favour the usage of a proximal alternative poly-adenylation site by unmasking CTCF binding sites [32].

In summary, many large data integration approaches only allow extraction of epigenetic signatures for dominant features of transcription control (e.g. enhancer, promoter, TSS), missing many other transcription features (e.g. exon, intron, TTS). We therefore performed a systematic analysis of associations between epigenetic modifications and diverse transcription features. Using the Roadmap Epigenomics project data for 30 epigenetic modifications in about 30 cell and tissue contexts, we explored links between epigenetic modifications and transcription control. We confirmed previously known associations and generated novel observations. We have provided our analyses freely through a web application available at https://joshiapps.cbu.uib.no/perepigenomics_app/, allowing researchers to browse the results and generate working hypotheses.

Results

Exploration of epigenetic signatures of transcription control using the Roadmap Epigenomics project data

To explore the epigenetic signatures at transcription control sites, we first extracted three gene features: transcription start sites (TSS), transcription termination sites (TTS) and middle exons from GENCODE annotation version 29 [33] (Table 1). We classified genes in two different ways. Firstly, we partitioned all genes based on gene length into "long" (>3 kb), "short" (≤1 kb), and "intermediate" length genes. Secondly, we classified genes based on simplified GENCODE gene types, namely: protein coding, RNA genes, pseudogenes and other genes (see methods section).
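The length-based partition can be sketched as a simple threshold function. This is a minimal illustration, assuming gene length is measured in base pairs from the annotation; the function name and the example gene names and lengths are hypothetical and not taken from the study.

```python
def classify_gene_length(length_bp: int) -> str:
    """Assign a gene to one of the three length classes described in the text.

    Thresholds follow the text: short (<= 1 kb), long (> 3 kb),
    intermediate otherwise. How length is measured is an assumption.
    """
    if length_bp <= 1_000:
        return "short"
    if length_bp > 3_000:
        return "long"
    return "intermediate"


# Hypothetical genes and lengths, for illustration only.
gene_lengths = {"geneA": 850, "geneB": 1_000, "geneC": 2_400,
                "geneD": 3_000, "geneE": 45_000}
length_classes = {name: classify_gene_length(bp)
                  for name, bp in gene_lengths.items()}
print(length_classes)
```

Note that both boundary values (exactly 1 kb and exactly 3 kb) fall outside the "long" class under these thresholds.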

We obtained RNA sequencing data for about 30 cell or tissue types from the Roadmap data portal [2]. Normalised gene and exon expression levels (Transcripts Per Million, TPM) were calculated by pseudo-mapping of the reads to the human transcriptome using Salmon [34] in each sample. Moreover, for every middle exon in each cell type, the exon inclusion ratio (or ψ) was calculated (see methods), ranging from 0 to 1 (0 for exons not included, and 1 for exons included in all the transcripts).
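The exact definition of ψ is given in the methods section, which is not reproduced here. As a rough sketch only, one common transcript-based definition is assumed below: the summed TPM of a gene's transcripts that contain the exon, divided by the summed TPM of all transcripts of that gene. The function name, the data layout and the example values are hypothetical.

```python
from typing import Dict, Optional, Set


def exon_inclusion_ratio(exon_id: str,
                         transcript_tpm: Dict[str, float],
                         transcript_exons: Dict[str, Set[str]]) -> Optional[float]:
    """Return psi for one exon of one gene in one sample.

    ASSUMPTION: psi = TPM of transcripts containing the exon / TPM of all
    transcripts of the gene. Returns None when the gene is not expressed,
    because the ratio is then undefined.
    """
    total = sum(transcript_tpm.values())
    if total == 0:
        return None
    included = sum(tpm for tx, tpm in transcript_tpm.items()
                   if exon_id in transcript_exons[tx])
    return included / total


# Hypothetical gene with three transcripts; tx2 skips exon e2.
tpm = {"tx1": 6.0, "tx2": 2.0, "tx3": 0.0}
exons = {"tx1": {"e1", "e2", "e3"}, "tx2": {"e1", "e3"}, "tx3": {"e1", "e2", "e3"}}
psi_e1 = exon_inclusion_ratio("e1", tpm, exons)  # present in every transcript
psi_e2 = exon_inclusion_ratio("e2", tpm, exons)  # skipped by tx2
print(psi_e1, psi_e2)
```

Under this assumed definition, ψ is 1 when every expressed transcript includes the exon and 0 when none does, matching the range described in the text.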

The genome-wide histone modification and DNAseI profiles for the cell types corresponding to the transcriptome data were obtained from the Roadmap Epigenomics consortium. For the Whole Genome Bisulfite Sequencing (WGBS) data, we computed three new tracks for each sample (figure S1†): CpG nucleotide density (consistent across cell and tissue types), CpG methylation ratio (average ratio of methylation at CpG sites in the window), and CpG methylation density (number of methylated CpG sites in each window). The WGBS CpG coverage track was used as a control.
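The three WGBS-derived tracks can be illustrated with a small windowed computation. This is a sketch under stated assumptions: input is a list of per-CpG (position, methylation fraction) pairs for one chromosome, windows are fixed-size and non-overlapping, and mCpG density is taken as the sum of per-site methylation fractions. The window size of 250 bp, the function name and the example sites are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def wgbs_window_tracks(cpg_sites: List[Tuple[int, float]],
                       window_bp: int = 250) -> Dict[int, Dict[str, float]]:
    """Compute CpG density, mCpG ratio and mCpG density per window.

    cpg_sites: (genomic position, methylation fraction in [0, 1]) per CpG.
    Returns a dict keyed by window start; windows without CpGs are omitted.
    """
    per_window = defaultdict(list)
    for position, fraction in cpg_sites:
        per_window[position // window_bp].append(fraction)

    tracks = {}
    for index, fractions in sorted(per_window.items()):
        n_cpg = len(fractions)
        tracks[index * window_bp] = {
            "cpg_density": n_cpg,                   # number of CpG sites
            "mcpg_ratio": sum(fractions) / n_cpg,   # mean methylation ratio
            "mcpg_density": sum(fractions),         # (expected) methylated CpGs
        }
    return tracks


# Hypothetical CpG sites, for illustration only.
sites = [(10, 1.0), (120, 0.5), (240, 0.0), (300, 1.0), (310, 1.0)]
tracks = wgbs_window_tracks(sites, window_bp=250)
print(tracks)
```

In this formulation the CpG density depends only on the genome sequence, which is consistent with the text's remark that it is constant across cell and tissue types, while the two methylation tracks vary by sample.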

For each pair of transcription feature and epigenetic modification, the associations were explored at two levels: (1) cell or tissue level, for all genes (or exons) in each cell or tissue type, and (2) gene level, for each gene (or exon) across all cell or tissue types. Cell or tissue level analysis allows within-assay comparison of highly and lowly expressed genes (or exons), but depends on sequence and genomic context, which are unique to each gene or exon. Gene level analysis, in contrast, fixes the sequence and genomic context, but is more sensitive to technical variability across experiments.

The main observations from these analyses are summarised in Table 1 and elaborated in the sections below. We have developed a web application to allow exploration of the analyses at https://joshiapps.cbu.uib.no/perepigenomics_app/.

Transcription activity and epigenetic modifications near Transcription Start Sites

Many epigenetic modifications are enriched around the TSS of expressed genes, a region containing gene promoters. To investigate the link between transcription level and epigenetic modifications at the TSS, we generated stack profiles of each epigenetic modification around the TSS. When epigenetic modifications were sorted according to gene expression (figure 1A), most histone modifications studied were more abundant at highly expressed genes than at lowly expressed ones. Specifically, only one histone acetylation (H2BK20ac), three histone methylations (H3K9me3, H3K27me3, H3K36me3), and one histone variant (H2A.Z) did not show a positive correlation with gene expression level. We further noted that only H3K27me3 was more abundant at the TSSs of lowly or not expressed protein coding genes than at the TSSs of highly expressed protein coding genes. H3K27me3 was not present at the TSSs of not expressed, non-protein coding genes in any of the cell or tissue types, highlighting the fact that the associations between an epigenetic modification and transcriptional level may be gene-type specific.
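The cell or tissue level summary (genes sorted by expression, grouped into bins, and averaged per position with a standard error) can be sketched as follows. This is an illustration only: it assumes equal-sized bins of genes ranked by expression, which may differ from the binning used for the figures, and all names and values are hypothetical.

```python
import math
from statistics import mean, stdev
from typing import Dict, List


def sem(values: List[float]) -> float:
    """Standard error of the mean; 0.0 when fewer than two values."""
    if len(values) < 2:
        return 0.0
    return stdev(values) / math.sqrt(len(values))


def binned_average_profiles(expression: List[float],
                            profiles: List[List[float]],
                            n_bins: int = 5) -> List[Dict[str, list]]:
    """Average signal profiles around a feature per expression bin.

    expression[i] is the expression of gene i; profiles[i] is its signal at
    each position around the feature. Bins run from lowest to highest
    expression (ASSUMPTION: equal-sized bins by rank).
    """
    order = sorted(range(len(expression)), key=lambda i: expression[i])
    bin_size = math.ceil(len(order) / n_bins)
    n_positions = len(profiles[0])
    bins = []
    for b in range(n_bins):
        members = order[b * bin_size:(b + 1) * bin_size]
        if not members:
            continue
        columns = [[profiles[i][p] for i in members] for p in range(n_positions)]
        bins.append({
            "genes": members,
            "mean": [mean(col) for col in columns],
            "sem": [sem(col) for col in columns],
        })
    return bins


# Hypothetical data: 4 genes, 3 positions around the TSS, 2 bins.
expression = [0.0, 5.0, 1.0, 9.0]
profiles = [[0.0, 0.0, 0.0], [1.0, 4.0, 1.0], [0.0, 2.0, 0.0], [3.0, 8.0, 3.0]]
bins = binned_average_profiles(expression, profiles, n_bins=2)
print(bins)
```

With real data, each profile would be the signal of one assay in fixed-width windows around the TSS, and plotting the per-bin means with their SEM gives the kind of average profile shown in the bottom panels of figure 1A.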

We explored the trends in peak shapes and noted that a 'double hill' (or 'M' shape) with a gap at the exact location of the TSS was the most common shape. The 'gap' of ChIP-seq signal between the 'hills' was located exactly at a sharp peak of DNAseI signal around the TSS (figure 1A), denoting a very high DNA accessibility at the TSS. This suggests the presence of a nucleosome-free region at the promoter terminating at the TSS, with a first nucleosome positioned near the +1 of transcription. This double hill pattern was either symmetric or asymmetric depending on the mark and cell or tissue type (e.g. a stronger peak at the downstream hill than at the upstream hill for H3K4me3 in the small intestine, figure 1A). H3K79me2 was particularly strongly asymmetric (figure S2†), with a strong peak around 500 bp downstream of the TSS in 2 of the 3 cell lines for which the mark was studied in Roadmap Epigenomics.

Table 1 Summary of the associations between epigenetic modifications and transcription features in the Roadmap Epigenomics project data. Tick (or cross): whether (or not) an epigenetic assay shows an increase or decrease in signal at the feature. Pluses: positive associations. Minuses: negative associations. Zeroes: no correlation. NA: not available (data not available in enough cell types).

Fig. 1 Relationships between epigenetic modifications near the Transcription Start Sites (TSS) and gene expression level. A. Association between epigenetic modifications near the TSS and gene expression level in the small intestine. Upper part, from left to right: Gene expression level in all 47,812 autosomal genes annotated by Gencode. The first side bar indicates the gene type (green: protein coding genes, blue: pseudogenes, purple: RNA genes, red: other types of genes); the second side bar indicates the genes sorted according to expression level (figure 1A bottom, 5 bins in total, purple: highly expressed genes, green: lowly expressed genes). Stacked profiles of (i) DNAse-seq and respective control, of (ii) H3K4me3 ChIP-seq and respective input control, of (iii) CpG density, mCpG ratio (mCpG/CpG), mCpG density, and the WGBS coverage near the TSS, sorted according to the corresponding gene expression level. Bottom part, from left to right: Boxplot of gene expression levels in each of the 5 expression bins defined in the upper part. Average profiles of DNAse-seq and respective control, of H3K4me3 ChIP-seq and respective input control, CpG density, mCpG ratio, mCpG density and WGBS coverage, ± SEM (Standard Error of the Mean) for each bin of promoters. B-E. Association between epigenetic marks near the TSS and gene expression level across cell types. B. Regression of expression level (in log10(TPM+1)) of the MKRN3 gene and the mean DNA methylation ratio at CpG sites 500 bp around the TSS of the MKRN3 gene. Each dot corresponds to a cell type. The slope is negative and the correlation coefficient (R²) is greater than 0.75 for MKRN3. Similar regressions were generated for each gene and epigenetic modification pair. C-E. Distribution of the slopes from the gene regressions (as in B.) according to the r-squared correlation coefficients for DNAse-seq signal (C.), H3K4me3 ChIP-seq signal (D.) and mCpG ratio (E.) near the TSS of the corresponding genes. Numbers of genes and percentage of genes in each category are displayed below each box.

CpG density around the TSS was positively correlated with gene expression in all cell and tissue types in the dataset. CpG methylation ratio and CpG methylation density near gene TSS were negatively correlated with gene expression levels. While CpG methylation ratio showed a flat profile near the TSS of non-expressed genes, CpG methylation density transitioned from a gap at highly expressed genes to a peak at non-expressed genes at the TSS (figure 1A).

We further explored these trends for individual gene types. As protein-coding genes formed the majority of all genes, the above observations for all gene types were preserved when the analyses were restricted to protein coding genes only. Separating genes according to gene type highlighted differences in epigenetic profiles at lincRNAs and unprocessed pseudogenes compared to protein coding genes. Processed pseudogenes often had a different relationship between their expression level and the epigenetic status of their promoter. For example, expressed processed pseudogenes showed neither a DNAseI accessibility peak at the TSS, nor an enrichment of active epigenetic modifications at the TSS. While CpG density near the TSS was correlated with processed pseudogene expression level, their promoters did not show any decrease in DNA methylation ratio, resulting in a positive relation between DNA methylation density and processed pseudogene expression.

Expressed genes of the 'antisense' gene type often mirrored the epigenetic signature of the 'protein coding' gene type. For example, the H3K79me2 peak was pronounced 500 bp before the TSS of 'antisense' genes, whereas it peaked 500 bp after the TSS of protein coding genes. However, it is likely that the epigenetic signal observed at antisense genes is due to the corresponding 'sense' gene.

Altogether, gene expression level was positively or negatively correlated with many epigenetic modifications at gene promoters when comparing expressed and not expressed genes within a cell type. The associations between epigenetic modifications and gene expression in a cell or tissue type are promoter sequence and gene context dependent. For example, the CpG density at the TSS is a good predictor of both the gene expression level and the CpG methylation ratio and density (figure 1A). The investigation of associations across cell types is therefore important. To study the association between epigenetic modifications and transcription across different cell and tissue types, we calculated a linear regression between each epigenetic modification and gene expression level across cell or tissue types. Specifically, the epigenetic signal in a ±500 bp window around the TSS and the gene expression level (log10(TPM+1)) for each gene (figure 1B) were linearly regressed to obtain a slope and a linear correlation coefficient (R²). For example, the linear regression between the MKRN3 gene expression level and the average CpG methylation ratio near the TSS of the MKRN3 gene resulted in a slope of -1.41, an R² of 0.82, and a p-value of 3.4 × 10^-13. For each epigenetic modification, the distribution of slopes across all genes was plotted against their R² (figure 1C, 1D and 1E). The epigenetic modifications with positive slopes across cell or tissue types were: H3K4me3 (figure 1D), H3K36me3, H2BK5ac, H3K4ac, H3K9ac, H3K18ac, H3K27ac and chromatin accessibility measured by DNAseI digestion assay (figure 1C). DNA methylation showed a negative slope for the greater R² values (figure 1E).
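The gene level regression can be sketched for a single gene as an ordinary least-squares fit across cell types. This is a minimal illustration with assumptions: the epigenetic signal is used as the explanatory variable and log10(TPM+1) as the response (the text does not state which variable is which), the p-value is omitted, and the data values are invented for illustration rather than taken from the study.

```python
import math
from typing import List, Optional, Tuple


def gene_level_regression(signal: List[float],
                          tpm: List[float]) -> Optional[Tuple[float, float, float]]:
    """Least-squares fit of log10(TPM+1) against an epigenetic signal.

    Each list holds one value per cell or tissue type for the same gene.
    Returns (slope, intercept, r_squared), or None if either variable is
    constant across cell types (the fit is then undefined).
    """
    y = [math.log10(value + 1.0) for value in tpm]
    n = len(signal)
    mean_x = sum(signal) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in signal)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(signal, y))
    if sxx == 0 or syy == 0:
        return None
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared


# Invented values: mean mCpG ratio near the TSS and TPM in four cell types.
mcpg_ratio = [0.1, 0.4, 0.6, 0.9]
expression_tpm = [99.0, 9.0, 9.0, 0.0]
slope, intercept, r_squared = gene_level_regression(mcpg_ratio, expression_tpm)
print(slope, intercept, r_squared)
```

Repeating this fit for every gene and every epigenetic modification, then grouping the resulting slopes by R², would give distributions of the kind summarised in figure 1C-E.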

Transcription activity and epigenetic modifications near Transcription Termination Sites

We repeated the analyses described above at transcription termination sites (TTS). We noted that highly expressed genes tend to be longer than not expressed genes. In short genes, it is difficult to distinguish the effect of epigenetic modifications at the TSS from that at the TTS. We therefore defined three classes of genes: short (≤ 1 kb), long (> 3 kb), and intermediate length genes, to mitigate this gene-length effect.

While many epigenetic modifications showed a peak (or a gap) centred at the gene TSS, only two modifications were enriched at the gene TTS. First, H3K36me3 displayed a broad hill-shape profile, with a peak at the TTS (figure 2A). Levels of H3K36me3 at the TTS were positively correlated with gene expression level for different genes within a cell type (figure 2A), and also with the expression level of the same gene across cell types (figure 2C). In some samples, the H3K36me3 profile was slightly asymmetric at the TTS, with more signal in the gene body than after the TTS.

Second, DNA methylation density increased at the TTS (figure 2A). As DNA methylation ratio was nearly constant, the increase of DNA methylation density was mostly due to an increase of CpG density at the TTS. DNA methylation density at the TTS was positively correlated with gene expression level when comparing different genes within a cell type. A weak negative correlation between DNA methylation and gene expression level at the TTS was observed for a subset of genes when comparing the same gene across cell and tissue types (figure 2D). This is in agreement with recent findings [32]. The negative correlation between DNA methylation and gene expression was evident only in 'long' genes, and in protein coding genes.
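The link between the three WGBS tracks in this argument can be written out explicitly. The identity below is a sketch under the assumption that mCpG density in a window is the sum of per-site methylation fractions; the symbols n_w (number of CpG sites in window w) and m_i (methylation fraction of CpG site i) are introduced here for illustration and do not appear in the original text.

```latex
\[
\underbrace{\sum_{i=1}^{n_w} m_i}_{\text{mCpG density}}
\;=\;
\underbrace{n_w}_{\text{CpG density}}
\times
\underbrace{\frac{1}{n_w}\sum_{i=1}^{n_w} m_i}_{\text{mCpG ratio}}
\]
```

Under this reading, when the mCpG ratio is roughly constant across windows, the mCpG density varies in proportion to the CpG density, which is the reasoning used for the TTS above.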

Though most epigenetic modifications did not show enrichment at the TTS, four epigenetic modifications (DNAseI accessibility, H3K4me1, H3K79me1, H3K27ac) were positively correlated with the expression level of a gene across cell and tissue types. These epigenetic modifications showed no enrichment at the TTS, yet the change in expression level was associated with the change in epigenetic modification strength. These modifications could be reflecting broader chromatin organisational features such as topologically associating domains and A/B chromatin domains [35].


Fig. 2 Relationships between epigenetic modifications near the Transcription Termination Sites (TTS) and expression level. A. Association between epigenetic marks near the TTS and gene expression level in the spleen. Upper part, from left to right: Gene expression level in all 47,812 genes annotated by Gencode. The first side bar indicates the gene type (green: protein coding genes, blue: pseudogenes, purple: RNA genes, red: other types of genes); the second side bar indicates the 5 bins used in the figure 2A bottom panels (purple: highly expressed genes, green: lowly expressed genes). Stacked profiles of (i) H3K36me3 ChIP-seq and respective input control, of (ii) CpG density, mCpG ratio (mCpG/CpG), mCpG density, and the WGBS coverage near the TTS, sorted according to the corresponding gene expression level. Bottom part, from left to right: Boxplot of gene expression level in each of the 5 bins defined in the upper part. Then, average profiles of DNAse-seq and respective control, of H3K4me3 ChIP-seq and respective input control, CpG density, mCpG ratio, mCpG density and WGBS coverage ± SEM (Standard Error of the Mean) for each bin of promoters. B-D. Association between epigenetic marks near the TTS and gene expression level across cell types. B. Regression of expression level (in log10(TPM+1)) of the GATA3 gene with the mean H3K36me3 ChIP-seq signal 500 bp around the TTS of the GATA3 gene, where each dot corresponds to a cell or tissue type. The slope was positive and the correlation coefficient (R²) was greater than 0.75 in this case. Similar regressions were conducted for each gene and epigenetic modification pair. C. and D. Distribution of the slopes from the gene regressions (as in B.) according to the r-squared correlation coefficients for H3K36me3 modification (C.), and mCpG ratio (D.) near the TTS of the corresponding genes. Number of genes and percentage of genes in each category are displayed below each box.


Fig. 3 Relationships between epigenetic modifications near middle exon starts and exon expression level or exon inclusion ratio. A. Association between epigenetic marks near middle exon start sites and exon expression levels in the HUES64 cell line. Upper part, from left to right: middle exon expression levels in 16,811 middle exons annotated by Gencode. The side bar indicates the 5 bins used in the figure 3A bottom panels (purple: highly expressed exons, green: lowly expressed exons). Then: Stacked profiles of H3K4me3 ChIP-seq, H3K36me3, and input control, sorted according to the exon expression levels. Bottom part, from left to right: Boxplot of exon expression levels in each of the 5 bins defined in the upper part. Then, average profiles of H3K4me3 ChIP-seq, H3K36me3, and respective input control ± SEM (Standard Error of the Mean) for each bin of exons. B. Association between epigenetic marks near middle exon start sites and exon inclusion ratios in the HUES64 cell line. Upper part, from left to right: middle exon inclusion ratios in 17,517 middle exons annotated by Gencode. The side bar indicates the 5 bins used in the figure 3B bottom panels (purple: included exons, green: excluded expressed exons). Then: Stacked profiles of H3K4me3 ChIP-seq, H3K36me3, and respective input control, sorted according to the corresponding exon inclusion ratio. Bottom part, from left to right: Boxplot of exon inclusion ratios in each of the 5 bins defined in the upper part. Then, average profiles of H3K4me3 ChIP-seq, H3K36me3, and respective input control ± SEM (Standard Error of the Mean) for each bin of exons. C. and D. Distribution of the slopes from regressions between exon expression level and epigenetic mark presence at exon start (± 100 bp) according to the R² correlation coefficients for H3K4me3 signal (C.), and H3K36me3 (D.) near the start of the corresponding middle exons. E. Distribution of the slopes from regressions between exon inclusion ratio and epigenetic modification at exon start (± 100 bp) grouped by the R² correlation coefficients for H3K36me3 modification. Number of exons and percentage of exons in each category are displayed below each box.


Exon transcription and epigenetic modifications at middle exons

Several studies have found a correlation between DNA methylation26,27,36 or histone modifications23,30,31 and exon and splicing events, in either single or a few cell and tissue contexts. We explored whether these observations hold true in the Roadmap Epigenomics dataset. We focused on middle exons and excluded first and last exons of protein coding genes. A total of 16,811 middle exons were expressed in at least one cell or tissue type in the Roadmap dataset. The expression level of an exon was defined as the sum of the TPM of the transcripts including that exon. Similar to the TSS and TTS analyses, we performed epigenetic modification enrichment analysis at middle exons. H3K36me3 showed enrichment at middle exons, correlated with the exon expression level within a cell or tissue type (figure 3A). Changes in H3K36me3 at exons were also strongly associated with exon expression level across cell and tissue types (figure 3D). Some other epigenetic modifications showed a weak enrichment at exons, but this enrichment was similar to that of the input samples used as negative controls and therefore likely reflects a technical artefact. Epigenetic modifications including H3K4me1, H3K4me3 (figure 3C), H3K27ac, H3K79me1, H3K79me2, H3K9ac, and H4K8ac were correlated with exon expression level across cell and tissue types, but did not show any enrichment at the middle exons. The H3K4me3 signal peaked at the gene TSS and terminated around the start of the first internal exon (figure 3A), resulting in a transition from H3K4me3-marked chromatin to H3K36me3-marked chromatin near the beginning of the second exon of genes.

There was no difference between DNA methylation ratio at exons and introns, but exons had overall more CpG sites than introns, resulting in a higher DNA methylation density at middle exons than at surrounding introns. DNAse1 accessibility further decreased at exon start (figure S3†), from the already low surrounding accessibility, suggesting that the splicing acceptor site has even lower accessibility than surrounding regions.

Exon inclusion ratio and epigenetic modifications at middle exons

Exon expression level in a cell type reflects both constitutive and alternative splicing events. To study the association between epigenetic modifications and alternative splicing events, we calculated an exon inclusion ratio for each exon. The exon inclusion ratio (also known as Psi) was calculated from all transcripts of a given gene (see Methods) in each cell or tissue type, and exons were sorted by their inclusion ratio. Accordingly, we obtained Psi values for 17,517 exons, including 706 exons that were never included in the Roadmap datasets, but were part of genes that were expressed in this dataset. We checked whether epigenetic modifications were correlated with exon inclusion ratio and noted that only a few modifications, namely H3K27ac, H3K4me3, and H3K36me3, were associated with exon inclusion ratio within a cell type (figure 3B). DNA methylation was also associated with exon inclusion ratio. No epigenetic modification showed a significant association with the changes in inclusion ratio at the alternatively included exons (figure 3E, Table 1). There were nevertheless very weak associations for mCpG ratio, DNAse1, H3K4me1, H3K27me3, H3K9ac and H4K8ac.

A linear model for gene expression

So far, we analysed the associations between epigenetic modifications and transcription control in a pair-wise manner. In order to better model the combinatorial effect of modifications, we selected 6 epigenetic modifications (DNA methylation, H3K4me1, H3K4me3, H3K9me3, H3K27me3 and H3K36me3) for which epigenetic and transcriptome data were available for 27 cell types in the Roadmap dataset, and fitted a linear model at four transcriptional features: (i) epigenetic modifications around TSS and gene expression level (figure 4A), (ii) epigenetic modifications around TTS and gene expression level (figure 4B), (iii) epigenetic modifications around the start of middle exons and exon expression level (figure 4C), and (iv) epigenetic modifications around the start of middle exons and exon inclusion ratio (figure 4D).

At TSS, amongst the 6 marks studied, DNA methylation was associated with gene repression, while all studied histone methylations were associated with higher expression level (figure 4A). At TTS, we noted a similar pattern to that at TSS, with the difference that H3K36me3 at TTS was the modification most strongly associated with an increase in expression level, followed by H3K4me1 as the second most positively associated (figure 4B). We noted that when restricting the analysis to long genes only, TTS associations with epigenetic modifications were much weaker (figure S4†), highlighting that an important portion of the TTS associations was due to short genes for which the promoter marks might confound the TTS signal. H3K36me3 at middle exons was also strongly associated with exon expression levels (figure 4C). Finally, no significant associations with exon inclusion ratio were detected for any of the 6 selected epigenetic modifications (figure 4D).
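As an illustration of the kind of model described above, the following minimal sketch fits an ordinary least squares model of expression on several mark levels for one gene, with one observation per cell type. It is written in plain Python for readability and is not the authors' code (their scripts are in R); the function name `fit_linear_model` and its inputs are hypothetical, and the significance filter on slopes (p ≤ 0.01) used for figure 4 is deliberately not reproduced here.

```python
def fit_linear_model(mark_levels, expression):
    """Ordinary least squares for: expression ~ intercept + marks.

    mark_levels: one list of mark levels per cell type (hypothetical input).
    expression:  one expression value per cell type, e.g. log10(TPM + 1).
    Returns [intercept, slope_mark_1, ..., slope_mark_k].
    """
    design = [[1.0] + [float(v) for v in row] for row in mark_levels]
    k = len(design[0])
    # Normal equations (X^T X) beta = X^T y, stored as an augmented matrix.
    aug = [
        [sum(r[i] * r[j] for r in design) for j in range(k)]
        + [sum(r[i] * y for r, y in zip(design, expression))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("marks are collinear or constant across samples")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for row in range(k):
            if row != col:
                factor = aug[row][col]
                aug[row] = [a - factor * b for a, b in zip(aug[row], aug[col])]
    return [aug[i][k] for i in range(k)]
```

In such a sketch, the sign of each fitted slope would indicate whether a mark is positively or negatively associated with expression for that gene, once the other marks are accounted for.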

PEREpigenomics, a web resource to explore associations between epigenetic modifications and transcription features

Using the Roadmap dataset, we unravelled a range of associations between epigenetic modifications and transcription control. We have included a total of 9,024 stacked profiles of epigenetic modifications near TSS, TTS and middle exons, sorted according to the gene expression level, exon expression level or exon inclusion ratio, in about 30 different cell and tissue types. We have also provided stacked profiles for TSS and TTS for each gene type (protein-coding, RNA, pseudogene, other), as well as for short (≤ 1 kb), long (> 3 kb) and intermediate length genes. Users can furthermore generate regressions between an epigenetic modification at TSS or TTS and gene expression level across cell and tissue types for a gene of interest, and the same feature is available for middle exons and exon expression level or exon inclusion ratio.

We have made the analyses available to users through a web application at joshiapps.cbu.uib.no/perepigenomics_app/.

Fig. 4 Linear regression models including six epigenetic modifications characterised in 27 cell types. Each bar represents the number of genes or exons with a statistically significant slope (p ≤ 0.01), either positive (golden) or negative (deep blue). A. Linear regression model of gene expression level and the levels of 6 epigenetic modifications near their respective TSS (±500 bp). B. Linear regression model of gene expression level and the levels of 6 epigenetic modifications near their respective TTS (±500 bp). C. Linear regression model of middle exon expression level and the levels of 6 epigenetic modifications near their respective start (±100 bp). D. Linear regression model of middle exon inclusion ratio and the levels of 6 epigenetic modifications near their respective start (±100 bp).

Discussion

In summary, this multi-faceted analysis of the Roadmap Epigenomics data, freely available through a web application, has enabled us to confirm, on a large data set, associations between epigenetic modifications and transcription that were already known or had previously been observed in only one or a few cell types, and to formulate new hypotheses. Below, we specifically discuss epigenetic associations of previously under-explored transcription features, grouped according to the epigenetic modification.

Histone modifications

Of the 28 histone modifications studied near the TSS, 25 were positively correlated with gene expression across genes within a cell type, and a subset of them (16) were also correlated with gene expression when comparing the same gene across cell and tissue types (table 1). H3K27me3 was the only mark that was more abundant at the TSS of lowly or not expressed genes than at highly expressed protein coding genes. It is important to note that the H3K27me3 modification was not present at non-protein coding genes, highlighting that the negative correlation between H3K27me3 and gene expression is a gene type specific association. H3K9me3 has been associated with closed chromatin37,38, yet it was not enriched at the TSS of non-expressed genes in the Roadmap Epigenomics data set. Many active histone marks showed a TSS asymmetry, with higher signal downstream of the TSS than upstream of it. H3K79me1 and H3K79me2 were enriched downstream of the TSS of expressed genes (figure S2†), suggesting an association with transcription direction. Indeed, H3K79 methylations are catalysed by the DOT1L enzyme during transcription elongation39. It should be noted that their profiles were only available in 5 and 3 cell lines respectively, all derived from the embryonic stem cell line H1. H3K79me1 and H3K79me2 asymmetries were less marked in the H1-derived trophoblast, which could either reflect a relevant biological difference or be caused by experimental issues.

Among histone modifications, H3K36me3 displayed a unique profile across all studied transcription features. The H3K36me3 modification is positively correlated with gene expression at the TTS across genes within a cell type, and also when comparing the same gene across cell and tissue types. While largely absent from the TSS region, it is enriched at all middle exons and on the last exon. We noted that gene body H3K36me3 begins at the start of the second exon, where the H3K4me3 peak diminishes. We further explored a potential link between exonic H3K36me3 and alternative splicing. Our approach using exon inclusion ratio revealed that any such association was either weak or restricted to only a few exons. Similar observations have been made by others: Xu et al.31 noted that from about 3000 alternative splicing events, 800 were positively correlated with changes in H3K36me3 and 700 were negatively correlated with changes in H3K36me3. It should be noted that H3K36me3 modified genomic regions (wide peaks) tend to be an order of magnitude larger than an average exon (1000 bp vs 100 bp). Altogether, though 10 histone modifications showed some enrichment at exons, the associations between epigenetic modifications and changes in exon inclusion ratio were weak and/or limited to a subset of genes.

DNA methylation

CpG density is highly variable in the human genome, with CpG depleted or CpG poor regions spanning most of the genome. We noted that CpG density at TSS was strongly associated with gene expression across genes within a cell type, where most expressed genes had a CGI centred at their TSS. Accordingly, at TSS, the mCpG ratio, or the average methylation at CpG sites, was negatively associated with gene expression across genes within a cell type. This trend overlaps with the CpG density, as most CpG deserts are heavily methylated and most CGIs are unmethylated40 (figure S1†). An increase in mCpG ratio at TSSs was also associated with the down-regulation of gene expression in the gene level analysis. Non-promoter regions were methylated with mCpG ratios around 85%, and this ratio dropped to around 30% at the TSSs of the expressed protein-coding genes. We noted that this was not the case for expressed pseudogenes, whose TSSs had higher CpG density than surrounding regions, but remained methylated.

mCpG density is the number of methylated sites in a given genomic region, calculated as the product of the CpG density and the average mCpG ratio in that region. It has been shown that mCpG density, but not mCpG ratio, was the main driver of the binding of DNA methylation readers of the MBD family41. While many publications focus solely on the mCpG ratio as a metric to evaluate DNA methylation, we argue that multiple metrics provide complementary information. For example, while mCpG ratio stays constant across the promoters of repressed genes, mCpG density peaks near the TSS of repressed genes, suggesting that these regions might preferentially recruit repressive DNA binding proteins (e.g. MBP). DNA methylation density also peaks at the TSSs of expressed processed pseudogenes, as their CGIs remain methylated. At exons, mCpG ratio is as high as at introns, but the CpG density is higher at exons than at introns, resulting in a higher mCpG density.

It has been observed that GC-rich regions might be more difficult to sequence using some Illumina sequencing protocols, resulting in lower coverage at CGIs42. We noted this bias in about half of the WGBS samples, where WGBS coverage at TSS was anti-correlated with gene expression across genes within a cell type. Some WGBS samples were less affected by this bias, while a few even showed an inverse trend, with higher coverage at GC-rich regions. Fortunately, these biases leave mCpG ratio and mCpG density profiles largely unaffected, therefore preserving the validity of the analysis.

DNA accessibility and nucleosome positioning

The DNAse1 assay showed a narrow (<100 bp) peak before and at the TSS of expressed genes. This narrow peak of DNA accessibility matched the location of a dip in the bi-modal signal present in many histone modifications positively correlated with expression level (e.g. H3K4me3 and many histone acetylations). These observations suggest that there is a short nucleosome-free region before the TSS of expressed genes, with a nucleosome positioned just after the +1 of transcription. Such a nucleosome positioning effect is well described in yeast43 and in mammals44. Intriguingly, middle exon starts and TTS positions appear to be depleted of DNAse1 signal, even more so than the surrounding regions (figure S3†). This suggests that middle and last exon starts are particularly inaccessible regions, maybe due to nucleosome positioning45 or the presence of the splicing machinery at acceptor sites.

Analysis across cell and tissue types

For each gene and middle exon, we correlated the level of epigenetic modification with expression level or exon inclusion ratio across the different cell and tissue types in the Roadmap dataset. When regressing promoter DNA methylation with gene expression levels, 2,506 genes had a linear regression R² ≥ 0.75 (5.2% of all genes), including 1,173 protein coding genes (6.3% of protein coding genes). The average slope was negative, indicating that for an individual gene, a higher level of promoter DNA methylation is associated with a lower gene expression level. Similarly, 2,485 genes had a linear regression R² ≥ 0.75 (5.2% of all genes) when regressing H3K4me3 levels at the promoter and gene expression level, including 1,367 protein coding genes (7.3% of protein coding genes). The average slope in this case was positive, i.e. a higher level of promoter H3K4me3 was associated with a higher gene expression level. Nonetheless, most genes or exons did not have an R² ≥ 0.25. This might be because many genes and exons do not have large enough epigenetic or expression variability in this dataset. Furthermore, ChIP-seq and DNAse-seq peak height can also be biased by the epigenetic modifications (or accessibility) in the dominant fraction of alleles and cells, as well as by changes in ChIP efficiency due to hard-to-control experimental variations.

Conclusions

In summary, we performed a comprehensive analysis to study links between epigenetic modifications and transcription control using the Roadmap Epigenomics data. The Roadmap Epigenomics histone modifications, whole genome bisulfite sequencing, and RNA-seq data, across diverse human cell and tissue types, allowed us to confirm the generality of relations between epigenetic modifications and transcription control previously described in one or a few cell types, as well as to generate new hypotheses about the interplay between epigenetic modifications and transcript diversity. Importantly, our analysis focused on previously under-explored features of transcription control including transcription termination sites, exon-intron boundaries, middle exons and exon inclusion ratio. We have produced thousands of stacked profile plots of epigenetic marks around gene features, sorted according to gene expression level, exon expression level or exon inclusion ratio, and filterable by gene type. These plots are made freely available through a web application, joshiapps.cbu.uib.no/perepigenomics_app/. We hope this web application will serve the community as a resource (i) to validate known or previously described epigenetic modifications associated with transcription features and (ii) as an interactive tool to explore the data and generate novel hypotheses about epigenetic and transcriptional control.

Methods

Data retrieving

Gencode human annotation version 29 (main annotation file) was downloaded from Gencode as a gff3 file. Reads from RNA-seq data were retrieved from the European Nucleotide Archive using the Roadmap sample table as a reference. Roadmap Epigenomics whole genome bisulphite sequencing (WGBS) data sets were downloaded as bigwig files of fractional methylation and read coverage from the Roadmap data portal.

Histone modifications and DNAse1 data were downloaded as consolidated, not subsampled, tagAlign files from the Roadmap data portal.

Altogether, we retrieved 27 RNA-seq and corresponding WGBS datasets, 13 DNAse1 profiles (with matching controls), and 242 histone ChIP-seq datasets in 27 human cell lines or tissues (with 27 matching controls).

RNA sequencing analysis

RNA-sequencing reads were quantified using Salmon34 v12.0 by pseudoalignment to the human reference genome hg38 and annotations v29 provided by Gencode33. The parameters validateMappings, seqBias and gcBias were switched on, with biasSpeedSamp set to 5 and libType set to A. For samples with biological replicates, the median expression value (in TPM) across replicates was used for genes and transcripts. For each exon, the exon expression level was calculated as the sum of the TPM of the transcripts including this exon. Exon inclusion ratios were computed as the sum of the TPM of transcripts including the exon divided by the TPM value of the gene.
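The two exon-level quantities defined above can be sketched as follows. This is a minimal illustration in plain Python, not the authors' R code: the function name and the three input dictionaries are hypothetical, and the gene-level TPM is assumed here to be the sum of the TPM of its transcripts.

```python
from collections import defaultdict

def exon_expression_and_psi(transcript_tpm, transcript_gene, transcript_exons):
    """Return ({exon: expression}, {exon: inclusion ratio}).

    transcript_tpm:   {transcript_id: TPM}
    transcript_gene:  {transcript_id: gene_id}
    transcript_exons: {transcript_id: iterable of exon ids}
    """
    # Assumption: gene TPM is the sum of the TPM of its transcripts.
    gene_tpm = defaultdict(float)
    for tx, tpm in transcript_tpm.items():
        gene_tpm[transcript_gene[tx]] += tpm

    # Exon expression: sum of the TPM of the transcripts including the exon.
    exon_tpm = defaultdict(float)
    exon_gene = {}
    for tx, exons in transcript_exons.items():
        for exon in set(exons):
            exon_tpm[exon] += transcript_tpm.get(tx, 0.0)
            exon_gene[exon] = transcript_gene[tx]

    # Inclusion ratio (Psi): exon expression divided by gene expression.
    psi = {}
    for exon, tpm in exon_tpm.items():
        total = gene_tpm[exon_gene[exon]]
        psi[exon] = tpm / total if total > 0 else None  # undefined if gene is silent
    return dict(exon_tpm), psi
```

For a gene with two transcripts at 8 and 2 TPM, an exon present only in the first transcript would get an expression of 8 and an inclusion ratio of 0.8 under these assumptions.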

We considered each gene uniquely, by selecting a representative TSS (or TTS) per gene from all annotated TSSs (or TTSs). We sorted all transcripts of a gene according to their TSS (or TTS) genomic coordinates and selected the TSS (or TTS) of the transcript at the middle of the list, i.e. the median TSS (or TTS). The list of middle exons was obtained by taking the shortest transcript of each protein coding gene, then selecting genes with 3 or more exons, and excluding the first and last exons of each transcript. The shortest known transcript isoform was used to ensure that the list of middle exons could not contain first or last exons of other isoforms. Most annotated non-protein coding genes were monoexonic, and were excluded from the exonic analyses.
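The selection rules above can be sketched in a few lines of Python. The function names and input shapes are hypothetical, and the tie-breaking rule for genes with an even number of annotated sites (the lower-middle element is taken) is an assumption, since the text does not specify it.

```python
def representative_site(positions, strand="+"):
    """Pick the middle element of the TSS (or TTS) coordinates of a gene,
    ordered 5' to 3' with respect to the gene strand."""
    ordered = sorted(positions, reverse=(strand == "-"))
    # Assumption: with an even number of sites, take the lower-middle one.
    return ordered[(len(ordered) - 1) // 2]

def middle_exons(transcripts):
    """transcripts: list of (length, exons) for one protein coding gene,
    with exons ordered 5' to 3'. Uses the shortest transcript only."""
    _, exons = min(transcripts, key=lambda t: t[0])
    if len(exons) < 3:
        return []  # no middle exon once first and last exons are removed
    return list(exons[1:-1])
```

Using only the shortest isoform is what keeps first or last exons of other isoforms out of the middle exon list, as stated above.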

Epigenetic modifications data processing

For WGBS, three different tracks were generated from the FractionalMethylation.bigwig files using bedtools46 and rtracklayer47: number of CpG sites per window, mean DNA methylation ratio per window, and density of mCpG sites per window, using windows of 250 base pairs width, sliding by 100 base pairs. The WGBS coverage file was processed similarly to produce a fourth track serving as a control. No post-processing was done for the DNAse1 and histone tagAlign files.
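The three WGBS tracks can be illustrated with a small sliding-window sketch. The authors used bedtools and rtracklayer on bigwig files; the plain Python function below is only a hypothetical stand-in that takes per-CpG fractional methylation values directly, and it computes mCpG density as the sum of fractional methylation in the window, which equals the number of CpG sites multiplied by their mean methylation ratio.

```python
def wgbs_window_tracks(cpg_methylation, region_start, region_end, width=250, step=100):
    """cpg_methylation: {position: fractional methylation in [0, 1]} for CpG sites.

    Returns a list of (window_start, n_cpg, mean_ratio, mcpg_density) tuples
    for windows of `width` bp sliding by `step` bp within the region.
    """
    tracks = []
    for start in range(region_start, region_end - width + 1, step):
        values = [m for pos, m in cpg_methylation.items()
                  if start <= pos < start + width]
        n_cpg = len(values)                      # CpG density track
        mcpg_density = sum(values)               # n_cpg * mean ratio
        mean_ratio = mcpg_density / n_cpg if n_cpg else None  # undefined without CpG
        tracks.append((start, n_cpg, mean_ratio, mcpg_density))
    return tracks
```

A window with many partially methylated CpG sites can therefore have a higher mCpG density than a window with a single fully methylated site, even though its mCpG ratio is lower.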


Heatmap generation

Gene types were derived from Gencode33, where the pseudogene category contained all genes whose annotated type included the word "pseudogene", and the "other" type of gene was defined as neither protein coding, nor pseudogene, nor RNA gene. Genes were binned in 5 groups of equal sizes according to their expression values, and exons in 5 groups according to their expression values or exon inclusion ratio.

For genes with multiple Gencode annotations for TSSs (or TTSs), we sorted all TSSs (or TTSs) according to their genomic coordinate (5' to 3', taking into account their orientation) and took the TSS (or TTS) of the transcript in the middle of the list. The list of middle exons was obtained using protein coding genes. For genes with several annotated transcripts, the transcript with the smallest length was selected, first and last exons were filtered, and the remaining exons were kept. Stacked profiles of tracks centred at TSS or TTS were generated using a region of ± 2.5 kb around the TSS or TTS, using windows of 100 bases every 100 bases. Stacked profiles of tracks centred at middle exons were drawn using a region of ± 1 kb around middle exon starts, with windows of 50 bases every 50 bases. For histone modifications and DNAse1 data, and corresponding input controls, coverage was expressed as FPKM values. CpG density and mCpG density were defined as the number of (methylated) sites per window (250 bp for TSS or TTS, 100 bp for exons). The mCpG ratio ranged from 0 (all the CpG sites in the window fully unmethylated) to 1 (all the CpG sites in the window fully methylated), and the coverage was expressed as the number of reads covering a region. For each of the five gene (or exon) bins, we displayed the value distribution as boxplots, and the average profiles ± standard error of the mean (SEM) for each bin. Heatmaps were drawn by a custom script using the following packages: seqplots48, Repitools49, GenomicRanges50, rtracklayer47, and plotrix51.
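The binning and averaging steps can be sketched as below. This is an illustrative Python version using only the standard library, not the authors' plotting script; the function names are hypothetical, bin 0 is arbitrarily taken as the most highly expressed group, and the SEM is computed from the sample standard deviation.

```python
from math import sqrt
from statistics import mean, stdev

def bin_by_expression(expression, n_bins=5):
    """Assign each feature to one of n_bins groups of (near-)equal size,
    bin 0 holding the most highly expressed features."""
    ranked = sorted(expression, key=expression.get, reverse=True)
    return {name: rank * n_bins // len(ranked) for rank, name in enumerate(ranked)}

def average_profiles(profiles, bins):
    """profiles: {feature: [signal per window]}.
    Returns {bin: [(mean, SEM) per window]}."""
    grouped = {}
    for name, signal in profiles.items():
        grouped.setdefault(bins[name], []).append(signal)
    summary = {}
    for b, rows in grouped.items():
        summary[b] = [
            (mean(col), stdev(col) / sqrt(len(col)) if len(col) > 1 else 0.0)
            for col in zip(*rows)  # one column per window position
        ]
    return summary
```

Each bin thus yields one average curve with an error band, which corresponds to the bottom panels of the stacked profile figures.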

Regression analysis

At each gene TSS (or TTS), the epigenetic modification level in a sample was averaged in the ± 500 bp region around the TSS (or TTS) and a linear regression was calculated for each gene using expression values in log10(TPM + 1). Epigenetic marks at the ± 100 bp region around middle exon starts were regressed with either log10(TPM + 1) of the exon or the exon inclusion ratio. For each regression the slope and coefficient of determination (R²) were obtained, using dplyr52, purrr53, and broom54.
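A per-gene regression of this kind can be sketched as follows. The authors' pipeline relies on dplyr, purrr and broom in R; this plain Python equivalent is only illustrative, with a hypothetical function name, and it assumes that expression is the response variable and the epigenetic signal the predictor, which the text does not state explicitly.

```python
from math import log10

def regress(signal, tpm):
    """Least-squares fit of log10(TPM + 1) on an epigenetic signal,
    with one value per cell or tissue type.

    Returns (slope, r_squared), or None when the signal is constant.
    """
    y = [log10(v + 1) for v in tpm]
    n = len(signal)
    mean_x = sum(signal) / n
    mean_y = sum(y) / n
    sxx = sum((x - mean_x) ** 2 for x in signal)
    syy = sum((v - mean_y) ** 2 for v in y)
    sxy = sum((x - mean_x) * (v - mean_y) for x, v in zip(signal, y))
    if sxx == 0:
        return None  # slope is undefined without variation in the signal
    slope = sxy / sxx
    r_squared = (sxy ** 2) / (sxx * syy) if syy > 0 else 0.0
    return slope, r_squared
```

Applying such a function to every gene and mark pair, then grouping the slopes by R² category, would give distributions comparable in structure to those shown in the regression panels of the figures.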

Web application and script availability

PEREpigenomics is developed in Shiny55. Source code and data of the application are available at forgemia.inra.fr/guillaume.devailly/perepigenomics_app. Scripts used to process the data and generate the plots can be found at github.com/gdevailly/perepigenomicsAnalysis.

Conflicts of interest

The authors declare that they have no competing interests.

Authors' contributions

GD and AJ designed the analyses and the web applications, and wrote this manuscript. GD performed the analysis and developed the web application.

Acknowledgements

GD was funded for this work by the People Program (Marie Curie Actions FP7/2007-2013) under REA grant agreement No PCOFUND-GA-2012-600181. AJ is supported by the Bergen Research Foundation Grant no. BFS2017TMT01. The authors would like to thank Olaf Sarnow, Kjell Petersen and Stanislav Oltu for their help with the web server configuration, and Anna Mantsoki and Barry Horne for their help at the beginning of the project.

Abbreviations

• TSS: Transcription Start Sites

• TTS: Transcription Termination Sites

• TPM: Transcripts Per Million

• CpG: Cytosine - Guanine dinucleotide

• CGI: CpG island

• Psi: Exon inclusion ratio

• WGBS: Whole Genome Bisulfite Sequencing

• H3K9me3: tri-methylation of lysine 9 of histone 3

• H4K5ac: acetylation of lysine 5 of histone 4

Notes and references

1 J. Romanowska and A. Joshi, Genes, 2019, 10, 76.

2 A. Kundaje, W. Meuleman, J. Ernst, M. Bilenky, A. Yen, A. Heravi-Moussavi, P. Kheradpour, Z. Zhang, J. Wang, M. J. Ziller, V. Amin, J. W. Whitaker, M. D. Schultz, L. D. Ward, A. Sarkar, G. Quon, R. S. Sandstrom, M. L. Eaton, Y.-C. Wu, A. R. Pfenning, X. Wang, M. Claussnitzer, Y. Liu, C. Coarfa, R. A. Harris, N. Shoresh, C. B. Epstein, E. Gjoneska, D. Leung, W. Xie, R. D. Hawkins, R. Lister, C. Hong, P. Gascard, A. J. Mungall, R. Moore, E. Chuah, A. Tam, T. K. Canfield, R. S. Hansen, R. Kaul, P. J. Sabo, M. S. Bansal, A. Carles, J. R. Dixon, K.-H. Farh, S. Feizi, R. Karlic, A.-R. Kim, A. Kulkarni, D. Li, R. Lowdon, G. Elliott, T. R. Mercer, S. J. Neph, V. Onuchic, P. Polak, N. Rajagopal, P. Ray, R. C. Sallari, K. T. Siebenthall, N. A. Sinnott-Armstrong, M. Stevens, R. E. Thurman, J. Wu, B. Zhang, X. Zhou, A. E. Beaudet, L. A. Boyer, P. L. D. Jager, P. J. Farnham, S. J. Fisher, D. Haussler, S. J. M. Jones, W. Li, M. A. Marra, M. T. McManus, S. Sunyaev, J. A. Thomson, T. D. Tlsty, L.-H. Tsai, W. Wang, R. A. Waterland, M. Q. Zhang, L. H. Chadwick, B. E. Bernstein, J. F. Costello, J. R. Ecker, M. Hirst, A. Meissner, A. Milosavljevic, B. Ren, J. A. Stamatoyannopoulos, T. Wang and M. Kellis, Nature, 2015, 518, 317–330.

3 Nature, 2012, 489, 57–74.


4 D. Adams, L. Altucci, S. E. Antonarakis, J. Ballesteros, S. Beck, A. Bird, C. Bock, B. Boehm, E. Campo, A. Caricasole, F. Dahl, E. T. Dermitzakis, T. Enver, M. Esteller, X. Estivill, A. Ferguson-Smith, J. Fitzgibbon, P. Flicek, C. Giehl, T. Graf, F. Grosveld, R. Guigo, I. Gut, K. Helin, J. Jarvius, R. Küppers, H. Lehrach, T. Lengauer, Å. Lernmark, D. Leslie, M. Loeffler, E. Macintyre, A. Mai, J. H. Martens, S. Minucci, W. H. Ouwehand, P. G. Pelicci, H. Pendeville, B. Porse, V. Rakyan, W. Reik, M. Schrappe, D. Schübeler, M. Seifert, R. Siebert, D. Simmons, N. Soranzo, S. Spicuglia, M. Stratton, H. G. Stunnenberg, A. Tanay, D. Torrents, A. Valencia, E. Vellenga, M. Vingron, J. Walter and S. Willcocks, Nature Biotechnology, 2012, 30, 224–226.

5 S. E. Celniker, L. A. L. Dillon, M. B. Gerstein, K. C. Gunsalus, S. Henikoff, G. H. Karpen, M. Kellis, E. C. Lai, J. D. Lieb, D. M. MacAlpine, G. Micklem, F. Piano, M. Snyder, L. Stein, K. P. White and R. H. Waterston, Nature, 2009, 459, 927–930.

6 L. Andersson, A. L. Archibald, C. D. Bottema, R. Brauning, S. C. Burgess, D. W. Burt, E. Casas, H. H. Cheng, L. Clarke, C. Couldrey, B. P. Dalrymple, C. G. Elsik, S. Foissac, E. Giuffra, M. A. Groenen, B. J. Hayes, L. S. Huang, H. Khatib, J. W. Kijas, H. Kim, J. K. Lunney, F. M. McCarthy, J. C. McEwan, S. Moore, B. Nanduri, C. Notredame, Y. Palti, G. S. Plastow, J. M. Reecy, G. A. Rohrer, E. Sarropoulou, C. J. Schmidt, J. Silverstein, R. L. Tellam, M. Tixier-Boichard, G. Tosser-Klopp, C. K. Tuggle, J. Vilkki, S. N. White, S. Zhao and H. Zhou, Genome Biology, 2015, 16.

7 S. Foissac, S. Djebali, K. Munyard, N. Vialaneix, A. Rau, K. Muret, D. Esquerré, M. Zytnicki, T. Derrien, P. Bardou, F. Blanc, C. Cabau, E. Crisci, S. Dhorne-Pollet, F. Drouet, T. Faraut, I. Gonzalez, A. Goubil, S. Lacroix-Lamandé, F. Laurent, S. Marthey, M. Marti-Marimon, R. Momal-Leisenring, F. Mompart, P. Quéré, D. Robelin, M. S. Cristobal, G. Tosser-Klopp, S. Vincent-Naulleau, S. Fabre, M.-H. P.-V. der Laan, C. Klopp, M. Tixier-Boichard, H. Acloque, S. Lagarrigue and E. Giuffra, BMC Biology, 2019, 17.

8 D. Bujold, D. A. de Lima Morais, C. Gauthier, C. Côté,M. Caron, T. Kwan, K. C. Chen, J. Laperle, A. N. Markovits,T. Pastinen, B. Caron, A. Veilleux, P.-É. Jacques andG. Bourque, Cell Systems, 2016, 3, 496–499.e2.

9 C. A. Davis, B. C. Hitz, C. A. Sloan, E. T. Chan, J. M. David-son, I. Gabdank, J. A. Hilton, K. Jain, U. K. Baymuradov, A. K.Narayanan, K. C. Onate, K. Graham, S. R. Miyasato, T. R.Dreszer, J. S. Strattan, O. Jolanki, F. Y. Tanaka and J. M.Cherry, Nucleic Acids Research, 2017, 46, D794–D801.

10 M. Sánchez-Castillo, D. Ruau, A. C. Wilkinson, F. S. Ng,R. Hannah, E. Diamanti, P. Lombard, N. K. Wilson andB. Gottgens, Nucleic Acids Research, 2014, 43, D1117–D1123.

11 H. G. Stunnenberg, M. Hirst, S. Abrignani, D. Adams, M. de Almeida, L. Altucci, V. Amin, I. Amit, S. E. Antonarakis, S. Aparicio, T. Arima, L. Arrigoni, R. Arts, V. Asnafi, M. Esteller, J.-B. Bae, K. Bassler, S. Beck, B. Berkman, B. E. Bernstein, M. Bilenky, A. Bird, C. Bock, B. Boehm, G. Bourque, C. E. Breeze, B. Brors, D. Bujold, O. Burren, M. J. Bussemakers, A. Butterworth, E. Campo, E. C. de Santa-Pau, L. Chadwick, K. M. Chan, W. Chen, T. H. Cheung, L. Chiapperino, N. H. Choi, H.-R. Chung, L. Clarke, J. M. Connors, P. Cronet, J. Danesh, M. Dermitzakis, G. Drewes, P. Durek, S. Dyke, T. Dylag, C. J. Eaves, P. Ebert, R. Eils, J. Eils, C. A. Ennis, T. Enver, E. A. Feingold, B. Felder, A. Ferguson-Smith, J. Fitzgibbon, P. Flicek, R. S.-Y. Foo, P. Fraser, M. Frontini, E. Furlong, S. Gakkhar, N. Gasparoni, G. Gasparoni, D. H. Geschwind, P. Glažar, T. Graf, F. Grosveld, X.-Y. Guan, R. Guigo, I. G. Gut, A. Hamann, B.-G. Han, R. A. Harris, S. Heath, K. Helin, J. G. Hengstler, A. Heravi-Moussavi, K. Herrup, S. Hill, J. A. Hilton, B. C. Hitz, B. Horsthemke, M. Hu, J.-Y. Hwang, N. Y. Ip, T. Ito, B.-M. Javierre, S. Jenko, T. Jenuwein, Y. Joly, S. J. Jones, Y. Kanai, H. G. Kang, A. Karsan, A. K. Kiemer, S. C. Kim, B.-J. Kim, H.-H. Kim, H. Kimura, S. Kinkley, F. Klironomos, I.-U. Koh, M. Kostadima, C. Kressler, R. Kreuzhuber, A. Kundaje, R. Küppers, C. Larabell, P. Lasko, M. Lathrop, D. H. Lee, S. Lee, H. Lehrach, E. Leitão, T. Lengauer, Å. Lernmark, R. D. Leslie, G. K. Leung, D. Leung, M. Loeffler, Y. Ma, A. Mai, T. Manke, E. R. Marcotte, M. A. Marra, J. H. Martens, J. I. Martin-Subero, K. Maschke, C. Merten, A. Milosavljevic, S. Minucci, T. Mitsuyama, R. A. Moore, F. Müller, A. J. Mungall, M. G. Netea, K. Nordström, I. Norstedt, H. Okae, V. Onuchic, F. Ouellette, W. Ouwehand, M. Pagani, V. Pancaldi, T. Pap, T. Pastinen, R. Patel, D. S. Paul, M. J. Pazin, P. G. Pelicci, A. G. Phillips, J. Polansky, B. Porse, J. A. Pospisilik, S. Prabhakar, D. C. Procaccini, A. Radbruch, N. Rajewsky, V. Rakyan, W. Reik, B. Ren, D. Richardson, A. Richter, D. Rico, D. J. Roberts, P. Rosenstiel, M. Rothstein, A. Salhab, H. Sasaki, J. S. Satterlee, S. Sauer, C. Schacht, F. Schmidt, G. Schmitz, S. Schreiber, C. Schröder, D. Schübeler, J. L. Schultze, R. P. Schulyer, M. Schulz, M. Seifert, K. Shirahige, R. Siebert, T. Sierocinski, L. Siminoff, A. Sinha, N. Soranzo, S. Spicuglia, M. Spivakov, C. Steidl, J. S. Strattan, M. Stratton, P. Südbeck, H. Sun, N. Suzuki, Y. Suzuki, A. Tanay, D. Torrents, F. L. Tyson, T. Ulas, S. Ullrich, T. Ushijima, A. Valencia, E. Vellenga, M. Vingron, C. Wallace, S. Wallner, J. Walter, H. Wang, S. Weber, N. Weiler, A. Weller, A. Weng, S. Wilder, S. M. Wiseman, A. R. Wu, Z. Wu, J. Xiong, Y. Yamashita, X. Yang, D. Y. Yap, K. Y. Yip, S. Yip, J.-I. Yoo, D. Zerbino and G. Zipprich, Cell, 2016, 167, 1145–1149.

12 S. J. Marygold, M. A. Crosby and J. L. Goodman, Methods in Molecular Biology, Springer New York, 2016, pp. 1–31.

13 J. Chèneby, M. Gheorghe, M. Artufel, A. Mathelier and B. Ballester, Nucleic Acids Research, 2017, 46, D267–D275.

14 J. Chèneby, Z. Ménétrier, M. Mestdagh, T. Rosnet, A. Douida, W. Rhalloussi, A. Bergon, F. Lopez and B. Ballester, Nucleic Acids Research, 2019.

15 D. Li, S. Hsu, D. Purushotham, R. L. Sears and T. Wang, Nucleic Acids Research, 2019, 47, W158–W165.

16 C. Coarfa, C. S. Pichot, A. Jackson, A. Tandon, V. Amin, S. Raghuraman, S. Paithankar, A. V. Lee, S. E. McGuire and A. Milosavljevic, BMC Bioinformatics, 2014, 15, year.

17 F. Albrecht, M. List, C. Bock and T. Lengauer, Nucleic Acids Research, 2016, 44, W581–W586.

18 G. Devailly, A. Mantsoki and A. Joshi, Bioinformatics, 2016, 32, 3354–3356.

19 Y. He and T. Wang, Bioinformatics, 2017, 33, 3268–3275.

20 M. G. Dozmorov, Bioinformatics, 2017, 33, 3323–3330.

21 S. Oki, T. Ohta, G. Shioi, H. Hatanaka, O. Ogasawara, Y. Okuda, H. Kawaji, R. Nakaki, J. Sese and C. Meno, EMBO reports, 2018, 19, year.

22 M. M. Hoffman, O. J. Buske, J. Wang, Z. Weng, J. A. Bilmes and W. S. Noble, Nature Methods, 2012, 9, 473–476.

23 J. Curado, C. Iannone, H. Tilgner, J. Valcárcel and R. Guigó, Genome Biology, 2015, 16, year.

24 Y. Li, J. Zhu, G. Tian, N. Li, Q. Li, M. Ye, H. Zheng, J. Yu, H. Wu, J. Sun, H. Zhang, Q. Chen, R. Luo, M. Chen, Y. He, X. Jin, Q. Zhang, C. Yu, G. Zhou, J. Sun, Y. Huang, H. Zheng, H. Cao, X. Zhou, S. Guo, X. Hu, X. Li, K. Kristiansen, L. Bolund, J. Xu, W. Wang, H. Yang, J. Wang, R. Li, S. Beck, J. Wang and X. Zhang, PLoS Biology, 2010, 8, e1000533.

25 M. B. Stadler, R. Murr, L. Burger, R. Ivanek, F. Lienert, A. Schöler, C. Wirbelauer, E. J. Oakeley, D. Gaidatzis, V. K. Tiwari and D. Schübeler, Nature, 2011.

26 A. K. Maunakea, I. Chepelev, K. Cui and K. Zhao, Cell Research, 2013, 23, 1256–1269.

27 G. L. Maor, A. Yearim and G. Ast, Trends in Genetics, 2015, 31, 274–280.

28 X.-L. Ding, X. Yang, G. Liang and K. Wang, Scientific Reports, 2016, 6, year.

29 R. Shayevitch, D. Askayo, I. Keydar and G. Ast, RNA, 2018, 24, 1351–1362.

30 Y. Xu, Y. Wang, J. Luo, W. Zhao and X. Zhou, Nucleic Acids Research, 2017, 45, 12100–12112.

31 Y. Xu, W. Zhao, S. D. Olson, K. S. Prabhakara and X. Zhou, Genome Biology, 2018, 19, year.

32 V. Nanavaty, E. W. Abrash, C. Hong, S. Park, E. E. Fink, Z. Li, T. J. Sweet, J. M. Bhasin, S. Singuri, B. H. Lee, T. H. Hwang and A. H. Ting, Molecular Cell, 2020, 78, 752–764.e6.

33 A. Frankish, M. Diekhans, A.-M. Ferreira, R. Johnson, I. Jungreis, J. Loveland, J. M. Mudge, C. Sisu, J. Wright, J. Armstrong, I. Barnes, A. Berry, A. Bignell, S. C. Sala, J. Chrast, F. Cunningham, T. D. Domenico, S. Donaldson, I. T. Fiddes, C. G. Girón, J. M. Gonzalez, T. Grego, M. Hardy, T. Hourlier, T. Hunt, O. G. Izuogu, J. Lagarde, F. J. Martin, L. Martínez, S. Mohanan, P. Muir, F. C. P. Navarro, A. Parker, B. Pei, F. Pozo, M. Ruffier, B. M. Schmitt, E. Stapleton, M.-M. Suner, I. Sycheva, B. Uszczynska-Ratajczak, J. Xu, A. Yates, D. Zerbino, Y. Zhang, B. Aken, J. S. Choudhary, M. Gerstein, R. Guigó, T. J. P. Hubbard, M. Kellis, B. Paten, A. Reymond, M. L. Tress and P. Flicek, Nucleic Acids Research, 2018, 47, D766–D773.

34 R. Patro, G. Duggal, M. I. Love, R. A. Irizarry and C. Kingsford, Nature Methods, 2017, 14, 417–419.

35 E. Lieberman-Aiden, N. L. van Berkum, L. Williams, M. Imakaev, T. Ragoczy, A. Telling, I. Amit, B. R. Lajoie, P. J. Sabo, M. O. Dorschner, R. Sandstrom, B. Bernstein, M. A. Bender, M. Groudine, A. Gnirke, J. Stamatoyannopoulos, L. A. Mirny, E. S. Lander and J. Dekker, Science, 2009, 326, 289–293.

36 S. Shukla, E. Kavak, M. Gregory, M. Imashimizu, B. Shutinoski, M. Kashlev, P. Oberdoerffer, R. Sandberg and S. Oberdoerffer, Nature, 2011, 479, 74–79.

37 T. Kouzarides, Cell, 2007, 128, 693–705.

38 I. A. Tchasovnikarova, R. T. Timms, N. J. Matheson, K. Wals, R. Antrobus, B. Gottgens, G. Dougan, M. A. Dawson and P. J. Lehner, Science, 2015, 348, 1481–1485.

39 K. Wood, M. Tellier and S. Murphy, Biomolecules, 2018, 8, 11.

40 A. M. Deaton and A. Bird, Genes & Development, 2011, 25, 1010–1022.

41 T. Baubec, R. Ivánek, F. Lienert and D. Schübeler, Cell, 2013, 153, 480–492.

42 J. C. Dohm, C. Lottaz, T. Borodina and H. Himmelbauer, Nucleic Acids Research, 2008, 36, e105–e105.

43 K. Struhl and E. Segal, Nature Structural & Molecular Biology, 2013, 20, 267–273.

44 Y. Li, C. Li, S. Li, Q. Peng, N. A. An, A. He and C.-Y. Li, Proceedings of the National Academy of Sciences, 2018, 115, 8817–8822.

45 H. Guo, B. Hu, L. Yan, J. Yong, Y. Wu, Y. Gao, F. Guo, Y. Hou, X. Fan, J. Dong, X. Wang, X. Zhu, J. Yan, Y. Wei, H. Jin, W. Zhang, L. Wen, F. Tang and J. Qiao, Cell Research, 2016, 27, 165–183.

46 A. R. Quinlan and I. M. Hall, Bioinformatics, 2010, 26, 841–842.

47 M. Lawrence, R. Gentleman and V. Carey, Bioinformatics, 2009, 25, 1841–1842.

48 P. Stempor and J. Ahringer, Wellcome Open Research, 2016, 1, 14.

49 A. L. Statham, D. Strbenac, M. W. Coolen, C. Stirzaker, S. J. Clark and M. D. Robinson, Bioinformatics, 2010, 26, 1662–1663.

50 M. Lawrence, W. Huber, H. Pagès, P. Aboyoun, M. Carlson, R. Gentleman, M. T. Morgan and V. J. Carey, PLoS Computational Biology, 2013, 9, e1003118.

51 L. J, R-News, 2006, 6, 8–12.

52 H. Wickham, R. François, L. Henry and K. Müller, dplyr: A Grammar of Data Manipulation, 2020.

53 L. Henry and H. Wickham, purrr: Functional Programming Tools, 2019.

54 D. Robinson and A. Hayes, broom: Convert Statistical Analysis Objects into Tidy Tibbles, 2019.

55 W. Chang, J. Cheng, J. Allaire, Y. Xie and J. McPherson, shiny: Web Application Framework for R, 2018.


The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license. bioRxiv preprint doi: https://doi.org/10.1101/2020.09.23.309625; this version posted September 23, 2020.


Supplementary figures


Supplementary Figure S1: WGBS is analysed through CpG density, mCpG ratio and mCpG density metrics. The mCpG ratio and mCpG density metrics can behave differently when comparing regions with different CpG densities. A. In these hypothetical 1 kb genomic windows, the CpG-rich region (left) has a lower mCpG ratio than the CpG-poor region (right), but a higher mCpG density. B. Four tracks obtained from WGBS data (here from the H1-hESC cell line). The 36 kb region including the gene KLHL36 is shown. From top to bottom: the CpG density track, which is higher at the three CpG islands in the region; the mCpG ratio, showing high values (> 80%) everywhere but at the first CpG island; the mCpG density, showing low values everywhere but at the second and third CpG islands; finally, the WGBS coverage, used as a control track, which here shows a lower coverage at the three CpG islands.
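The contrast described in panel A can be checked with a small worked example. The sketch below is illustrative only: the CpG counts are hypothetical, each CpG is treated as either fully methylated or unmethylated, and densities are expressed per kb, which is an assumption since the caption does not state the units used in the analysis.

```python
from typing import NamedTuple


class CpgMetrics(NamedTuple):
    cpg_density: float   # CpG sites per kb (unit is an assumption)
    mcpg_ratio: float    # fraction of CpG sites that are methylated
    mcpg_density: float  # methylated CpG sites per kb (unit is an assumption)


def cpg_metrics(n_cpg: int, n_methylated: int, window_bp: int = 1000) -> CpgMetrics:
    """Compute the three WGBS summary metrics for one genomic window."""
    if n_cpg <= 0 or window_bp <= 0:
        raise ValueError("window must contain at least one CpG and one base")
    if not 0 <= n_methylated <= n_cpg:
        raise ValueError("n_methylated must lie between 0 and n_cpg")
    per_kb = 1000 / window_bp
    return CpgMetrics(
        cpg_density=n_cpg * per_kb,
        mcpg_ratio=n_methylated / n_cpg,
        mcpg_density=n_methylated * per_kb,
    )


# Hypothetical 1 kb windows, chosen only to mirror panel A.
cpg_rich = cpg_metrics(n_cpg=100, n_methylated=40)
cpg_poor = cpg_metrics(n_cpg=10, n_methylated=9)

# The CpG-rich window has the lower ratio but the higher methylated density.
print(cpg_rich.mcpg_ratio < cpg_poor.mcpg_ratio)
print(cpg_rich.mcpg_density > cpg_poor.mcpg_density)
```

With these made-up counts, the ratio and the density rank the two windows in opposite order, which is why both metrics are reported side by side.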


Supplementary Figure S2: H3K79me2 near the transcription start sites in H1-BMP4 derived mesoderm (A) or H1-BMP4 derived trophoblast (B). Upper parts, from left to right: gene expression levels of all 47,812 autosomal genes annotated by Gencode. The first side bar indicates the gene type (green: protein coding genes, blue: pseudogenes, purple: RNA genes, red: other types of genes); the second side bar indicates the genes sorted according to expression level (5 bins in total, purple: highly expressed genes, green: lowly expressed genes). Stacked profiles of H3K79me2 ChIP-seq and the respective input control, sorted according to the corresponding gene expression level. Bottom parts, from left to right: boxplot of gene expression levels in each of the 5 expression bins defined in the upper part. Average profiles of H3K79me2 ChIP-seq and the respective input control, ± SEM (Standard Error of the Mean), for each bin of promoters.
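The binning and averaging used for the bottom panels can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the bins are taken to be equal-sized groups of genes ranked by expression (the caption only says there are 5 bins), and the function names and toy data are hypothetical.

```python
import math
import statistics


def bin_by_expression(expression, n_bins=5):
    """Split gene indices into n_bins groups, from lowest to highest expression.

    Equal-sized bins are an assumption made for this sketch.
    """
    order = sorted(range(len(expression)), key=lambda i: expression[i])
    bins = []
    for b in range(n_bins):
        start = b * len(order) // n_bins
        stop = (b + 1) * len(order) // n_bins
        bins.append(order[start:stop])
    return bins


def mean_and_sem(profiles):
    """Position-wise mean and standard error of the mean across profiles."""
    n = len(profiles)
    means, sems = [], []
    for column in zip(*profiles):
        means.append(statistics.fmean(column))
        # The SEM is undefined for a single profile; report 0.0 in that case.
        sems.append(statistics.stdev(column) / math.sqrt(n) if n > 1 else 0.0)
    return means, sems


# Toy data: 10 genes, each with a 3-position signal profile around its TSS.
expression = [float(i) for i in range(10)]
profiles = [[e * 0.5, e * 1.0, e * 0.5] for e in expression]

bins = bin_by_expression(expression)
top_mean, top_sem = mean_and_sem([profiles[i] for i in bins[-1]])
low_mean, low_sem = mean_and_sem([profiles[i] for i in bins[0]])
print(top_mean, top_sem)
print(low_mean, low_sem)
```

Each bin then yields one average curve with an error band, which is the form of the bottom-right panels.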


Supplementary Figure S3: The DNase I profile around middle exon start sites in pancreas shows a narrow decrease in DNase accessibility near splicing acceptor sites. Upper part, from left to right: middle exon expression levels in 16,811 middle exons. The side bar indicates the 5 bins used in the bottom panels (purple: highly expressed exons, green: lowly expressed exons). Then: stacked DNase I profiles and control, sorted according to the exon expression levels. Bottom part, from left to right: boxplot of exon expression levels in each of the 5 bins defined in the upper part. Then, average DNase I profiles and control ± SEM (Standard Error of the Mean) for each bin of exons.


Supplementary Figure S4: Linear regression models including six epigenetic modifications characterised in 27 cell types, focusing on the 25,068 long genes (> 3 kb, amongst 47,812 analysed genes). Each bar represents the number of genes with a statistically significant slope (p ≤ 0.01), either positive (golden) or negative (deep blue). A. Linear regression model of long gene expression level and the levels of 6 epigenetic modifications near their respective TSS (±500 bp). B. Linear regression model of long gene expression level and the levels of 6 epigenetic modifications near their respective TTS (±500 bp).
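How a single gene ends up counted in a golden or deep blue bar can be illustrated with a deliberately simplified sketch. It departs from the figure in two labelled ways: it fits one epigenetic mark at a time rather than six jointly, and it uses a permutation test for the slope because the caption does not say how p-values were computed and this keeps the example dependency-free. All function names and the toy data are hypothetical.

```python
import random


def ols_slope(x, y):
    """Least-squares slope of y regressed on x (single predictor)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("predictor is constant; slope is undefined")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


def permutation_p_value(x, y, n_perm=2000, seed=0):
    """Two-sided permutation p-value for the slope (swapped-in technique)."""
    rng = random.Random(seed)
    observed = abs(ols_slope(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(ols_slope(x, shuffled)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


def classify_gene(mark_levels, expression, alpha=0.01):
    """Label one gene by the sign of its slope if p <= alpha."""
    slope = ols_slope(mark_levels, expression)
    if permutation_p_value(mark_levels, expression) > alpha:
        return "not significant"
    return "positive" if slope > 0 else "negative"


# Toy gene measured in 27 cell types: expression rises with the mark level.
mark = [i / 26 for i in range(27)]
expression = [2.0 * m + 0.05 * ((i % 3) - 1) for i, m in enumerate(mark)]

slope = ols_slope(mark, expression)
p_value = permutation_p_value(mark, expression)
label = classify_gene(mark, expression)
print(slope, p_value, label)
```

Repeating the classification over all long genes and counting the labels per mark would give bar heights of the kind shown in the figure, within the limits of the simplifications above.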
