    SUPPLEMENTAL DATA

    Supplemental Text
    Growth Conditions for Human Embryonic Stem Cells
    Quality Control for Human Embryonic Stem Cells
    Antibodies
    Chromatin Immunoprecipitation
    Array Design
    Data Normalization and Analysis
    Identification of Bound Regions
    Controlling for the Effect of Murine Embryonic Fibroblast Feeder Cells
    Comparing Bound Regions to Known and Predicted Genes
    Comparing Binding and Human Expression Data
    Estimating Error Rates
    Binding of Suz12, Eed and H3K27me3
    GO Classification for RNA Polymerase II and Suz12-Bound Genes
    Comparing Suz12-Bound Regions to Domains of Conservation
    Generating Suz12-deficient Mouse Cells and Analysis of their Expression Pattern
    Sample Preparation and Analysis of Differentiated Muscle
    Comparing Suz12 Binding with Oct4, Sox2 and Nanog Binding

    Index of Supplemental Tables
    Table S1. Regions bound by RNA polymerase II and their relationship to known and predicted genes.
    Table S2. HUGO/EntrezGene identifiers for RNA Pol II bound, annotated genes.
    Table S3. RNA polymerase II-bound regions that predict novel gene candidates.
    Table S4. Gene models bound by RNA polymerase II.
    Table S5. MicroRNA genes bound by RNA polymerase II and Suz12 in ES cells.
    Table S6. Expression of genes bound by RNA polymerase II in ES cells.
    Table S7. Regions bound by Suz12 and their relationship to known and predicted genes.
    Table S8. HUGO/EntrezGene identifiers for Suz12-bound, annotated genes.
    Table S9. Detection of Suz12, Eed and H3K27me3 occupancy using promoter arrays.
    Table S10. Enriched gene ontologies among RNA Pol II-bound and Suz12-bound genes.
    Table S11. Developmental transcription factors bound by Suz12.
    Table S12. Developmental signaling proteins bound by Suz12.
    Table S13. Expression of Suz12-bound genes during ES cell differentiation.
    Table S14. Genes bound by Suz12 in ES cells and upregulated in Suz12 -/- mouse cells.
    Table S15. Developmental regulators associated with PRC2 in ES cells and muscle.

    Index of Supplemental Figures
    Figure S1. Human H9 ES cells cultured on a low density of irradiated MEFs.
    Figure S2. Analysis of human ES cells for markers of pluripotency.
    Figure S3. Analysis of human ES cells for differentiation potential.
    Figure S4. The fraction of annotated promoters bound by RNA polymerase II or Suz12.
    Figure S5. Estimating error rates.
    Figure S6. Co-occupation of gene promoters by Suz12, Eed and H3K27me3.
    Figure S7. Protein domain classification of Suz12- and Pol II-bound transcription factors.
    Figure S8. Suz12 occupies large regions of DNA.
    Figure S9. H3K27me3 co-occupies large domains with Suz12.
    Figure S10. Generation of Suz12 -/- cells.
    Figure S11. Binding of Suz12 in differentiated muscle.
    Figure S12. Detection of genes bound by RNA polymerase II and Suz12 in human ES cell expression datasets.
    Figure S13. Relationship between size of Suz12 binding domain and RNA polymerase II co-occupancy and gene expression.
    Figure S14. Association of Oct4, Sox2 or Nanog with Suz12-bound promoters.
    Figure S15. Motifs associated with DNA regions that are bound by Oct4, Sox2, Nanog and Suz12 or bound by Oct4, Sox2 and Nanog.

    Supplemental References


    Supplementary Text

    Growth Conditions for Human Embryonic Stem Cells

    Human embryonic stem (ES) cells were obtained from WiCell (Madison, WI; NIH Code WA09) and grown as described (Cowan et al., 2004). Briefly, passage 34 cells were grown in KO-DMEM medium supplemented with serum replacement, basic fibroblast growth factor (FGF), recombinant human leukemia inhibitory factor (LIF) and a human plasma protein fraction. Detailed protocol information on human ES cell growth conditions and culture reagents is available at http://www.mcb.harvard.edu/melton/hues.

    In order to minimize any MEF contribution to our analysis, H9 cells were cultured on a low density of irradiated murine embryonic fibroblasts (ICR MEFs), resulting in a ratio of more than approximately 8:1 H9 cells to MEFs (Figure S1). The culture of H9 cells on low-density MEFs had no adverse effects on cell morphology, growth rate, or undifferentiated status as determined by immunohistochemistry for pluripotency markers (e.g. Oct4, SSEA-3, Tra-1-60; see below). In addition, H9 cells grown on a minimal feeder layer maintained the ability to generate derivatives of ectoderm, mesoderm, and endoderm upon differentiation (see below).

    Quality Control for Human Embryonic Stem Cells

    Immunohistochemical analysis of pluripotency markers

    For analysis of pluripotency markers, cells were fixed in 4% paraformaldehyde for 30 minutes at room temperature and incubated overnight at 4°C in blocking solution (5 ml Normal Donkey Solution:195 ml PBS + 0.1% Triton-X) (Figure S2). After a brief wash in PBS, cells were incubated with primary antibodies to Oct-3/4 (Santa Cruz sc-9081), SSEA-3 (MC-631) (Solter and Knowles, 1979), SSEA-4 (MC-813-70) (Solter and Knowles, 1979), Tra-1-60 (MAB4360; Chemicon International), and Tra-1-81 (MAB4381; Chemicon International) in blocking solution overnight at 4°C. Following incubation with primary antibody, cells were incubated with either rhodamine red or FITC-conjugated secondary antibody (Jackson Labs) for 2-5 hrs at 4°C. Nuclei were stained with 4’,6-diamidino-2-phenylindole dihydrochloride (DAPI). Epifluorescent images were obtained using a fluorescent microscope (Nikon TE300). Data are shown for Oct4 and SSEA-3. Our analysis indicated that >90% of the H9 cells were strongly positive for all pluripotency markers.

    Alkaline phosphatase activity of human ES cells was analyzed using the Vector Red Alkaline Phosphatase Substrate Kit (Cat. No. SK-5100; Vector Laboratories) according to manufacturer’s specifications and the reaction product was visualized using fluorescent microscopy.

    Teratoma formation

    Teratomas were induced by injecting 2-5 x 10^6 cells into the subcutaneous tissue above the rear haunch of 6 week old Nude Swiss (athymic, immunocompromised) mice. Eight to twelve weeks post-injection, teratomas were harvested and fixed overnight in 4% paraformaldehyde at 4°C. Samples were then immersed in 30% sucrose overnight before embedding the tissue in O.C.T. freezing compound (Tissue-Tek). Cryosections were obtained and 10 µm sections were incubated with the appropriate antibodies as above and analyzed for the presence of the following differentiation markers by confocal microscopy (LSM 210): neuronal class III β-tubulin, Tuj1 (ectoderm; MMS-435P Covance); striated muscle-specific myosin, MF20 (mesoderm; kind gift from D. Fischman); and alpha-fetoprotein (endoderm; DAKO) (Figure S3). Nuclei were stained blue with 4’,6-diamidino-2-phenylindole dihydrochloride (DAPI). Antibody reactivity was detected for markers of all three germ layers, confirming that the human embryonic cells used in our analysis had maintained differentiation potential.

    Embryoid bodies (EB)

    ES cells were harvested by enzymatic digestion and EBs were allowed to form by plating ~1 x 10^6 cells/well in suspension on 6-well non-adherent, low cluster dishes for 30 days. EBs were grown in the absence of leukemia inhibitory factor (LIF) and basic fibroblast growth factor (FGF) in culture medium containing 2x serum replacement. EBs were then harvested, fixed for 30 minutes in 4% paraformaldehyde at room temperature, and placed in 30% sucrose overnight prior to embedding the tissue in O.C.T. freezing compound (Tissue-Tek). Cryosections were obtained as described for teratoma formation. Confocal images were obtained for all three germ layer markers, again confirming that the H9 cells used in our analysis have maintained differentiation potential (data not shown; results similar to those shown in Figure S3).

    Antibodies

    RNA polymerase II-bound genomic DNA was isolated from whole cell lysate using 8WG16, a mouse monoclonal antibody (Thompson et al., 1989). This antibody preferentially binds a form of RNA polymerase II that lacks phosphorylation at the C-terminal domain of the largest subunit of polymerase (Patturajan et al., 1999; Cho et al., 2001; Jones et al., 2004), although this preference can be subject to experimental conditions.

    Suz12-bound genomic DNA was isolated from whole cell lysate with a Suz12 rabbit polyclonal antibody purchased from Upstate (07-379).

    Eed-bound genomic DNA was isolated from whole cell lysate using an Eed mouse monoclonal antibody previously described (Hamer et al., 2002).

    H3K27me3-bound genomic DNA was isolated from whole cell lysate using rabbit polyclonal antibody purchased from Abcam (AB6002). Chromatin immunoprecipitations against H3K27me3 were compared to reference DNA obtained by chromatin immunoprecipitation of total histone H3 (Abcam AB1791; epitope derived from C-terminal 100 amino acids of histone H3) to normalize for nucleosome density.

    Chromatin Immunoprecipitation

    Protocols describing all materials and methods can be downloaded from http://web.wi.mit.edu/young/hES_PRC.

    We performed independent immunoprecipitations for each whole-genome analysis. Human WA09 embryonic stem cells were grown to a final count of 5 x 10^7 – 1 x 10^8 cells for each location analysis reaction. Cells were chemically crosslinked by the addition of one-tenth volume of fresh 11% formaldehyde solution for 15 minutes at room temperature. Cells were rinsed twice with 1x PBS, harvested using a silicon scraper and flash frozen in liquid nitrogen. Cells were stored at –80°C prior to use.

    Cells were resuspended, lysed in lysis buffers and sonicated to solubilize and shear crosslinked DNA. Sonication conditions vary depending on cells, culture conditions, crosslinking and equipment. We used a Misonix Sonicator 3000 and sonicated at power 7 for 10 x 30 second pulses (90 second pause between pulses). Samples were kept on ice at all times.

    The resulting whole cell extract was incubated overnight at 4°C with 100 µl of Dynal Protein G magnetic beads that had been preincubated with approximately 10 µg of the appropriate antibody. For cases where suppliers did not provide information regarding antibody concentration, 20 µl of the supplied solution was used per reaction. The immunoprecipitation was allowed to proceed overnight.

    Beads were washed 5 times with RIPA buffer and 1 time with TE containing 50 mM NaCl. Bound complexes were eluted from the beads by heating at 65°C with occasional vortexing and crosslinking was reversed by overnight incubation at 65°C. Whole cell extract DNA (reserved from the sonication step) was also treated for crosslink reversal.

    Immunoprecipitated DNA and whole cell extract DNA were then purified by treatment with RNAse A, proteinase K and multiple phenol:chloroform:isoamyl alcohol extractions. Purified DNA was blunted, ligated to linker and amplified using a two-stage PCR protocol. Amplified DNA was labeled and purified using Bioprime random primer labeling kits (Invitrogen; immunoenriched DNA was labeled with Cy5 fluorophore, whole cell extract DNA was labeled with Cy3 fluorophore).

    Labeled DNA was mixed (5-6 µg each of immunoenriched and whole cell extract DNA) and hybridized to arrays in Agilent hybridization chambers for up to 40 hours at 40°C. Arrays were then washed and scanned. Whole genome arrays were hybridized in batches of 35 to 60 arrays.

    Slides were scanned using an Agilent DNA microarray scanner BA. PMT settings were set manually to normalize bulk signal in the Cy3 and Cy5 channel. For efficient batch processing of scans, we used GenePix (version 6.0) software. Scans were automatically aligned and then manually examined for abnormal features. Intensity data were then extracted in batch.

    Array Design

    Whole genome arrays

    We designed a set of 115 60-mer oligonucleotide arrays to cover the non-repeat masked region of the sequenced human genome. Arrays were produced by Agilent Technologies (www.agilent.com).

    Selection of regions and design of subsequences

    We tiled the genome with variable density: transcription units (defined below) were tiled with higher density and non-transcription regions were tiled with a slightly lower density.

    To define transcription units, we first selected transcripts from five different databases: RefSeq, Ensembl, MGC, VEGA (www.vega.sanger.ac.uk) and Broad (www.broad.mit.edu). The first three are commonly used databases for gene annotation; the last two are manually annotated databases covering subsets of the human genome from the Sanger Institute and Broad Institute, respectively. We also added all microRNAs from the Rfam database (Griffiths-Jones et al., 2003) and a small set of collected non-coding RNAs (manual selection).

    The entire collection of transcripts was sorted by chromosomal order. We then extended each transcript 10 kb upstream to capture proximal promoter regions. Each of these extended transcripts was considered a “transcription unit”. In cases where one or more transcription units overlapped, we merged the transcription units into a single, larger unit. We extracted DNA sequence for all transcription units. Separately, we extracted intervening genomic DNA (“intergenic units”) between transcription units. All sequences and coordinates are from the May 2004 build of the human genome (NCBI build 35), using the repeatmasked (-s) option.
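
    The extension and merging steps above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the half-open coordinate convention, the strand-aware choice of which end to extend, and the decision to merge units that merely abut are all assumptions.

```python
UPSTREAM_BP = 10_000  # 10 kb upstream extension stated in the text

def extend_upstream(start, end, strand):
    """Extend a transcript 10 kb upstream of its 5' end.

    Assumption: 'upstream' is strand-aware, so '+' transcripts grow to the
    left and '-' transcripts grow to the right. Coordinates are half-open.
    """
    if strand == "+":
        return max(0, start - UPSTREAM_BP), end
    return start, end + UPSTREAM_BP

def merge_units(transcripts):
    """Merge overlapping extended transcripts on one chromosome.

    transcripts: iterable of (start, end, strand).
    Returns a sorted list of (start, end) transcription units.
    """
    units = sorted(extend_upstream(s, e, st) for s, e, st in transcripts)
    merged = []
    for start, end in units:
        if merged and start <= merged[-1][1]:  # overlapping or abutting (assumption)
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(u) for u in merged]

def intergenic_units(units):
    """Intervals between consecutive transcription units."""
    return [(a_end, b_start) for (_, a_end), (b_start, _) in zip(units, units[1:])]

example = [(20_000, 30_000, "+"), (35_000, 40_000, "+"), (100_000, 110_000, "-")]
units = merge_units(example)
gaps = intergenic_units(units)
```

    In this toy input the first two transcripts fuse into one unit once the upstream extension is applied, and the interval between that unit and the third transcript becomes a single intergenic unit.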

    We then separated sequences into subsequences in order to efficiently process sequences for oligo selection. We first removed all unmasked regions 100 bp or smaller. The small size of these regions makes it more difficult to identify high quality oligos for use on the array. These small regions represented a small fraction of the genome and were often covered by neighboring probes designed against larger subsequences. For unmasked regions that were 101 to 300 bp long, we treated each strand (Watson and Crick) as a separate subsequence. This ensured that we would have two oligos to represent these subsequences if the region could not be covered by neighboring 60-mers. For regions that were 301 to 640 bp long, we divided the region into two, evenly sized subsequences. Unmasked regions greater than 640 bp were divided into evenly sized subsequences such that no individual subsequence was greater than 320 bp.
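
    The size rules above translate into a small piece of arithmetic. The sketch below is illustrative only: the function name, the half-open coordinates, the use of the smallest number of pieces that satisfies the 320 bp cap, and the "+" strand label attached to pieces from regions longer than 300 bp are assumptions not stated in the text.

```python
import math

def split_region(start, end):
    """Split one unmasked region [start, end) into subsequences.

    Returns a list of (start, end, strand) tuples following the size rules
    in the text. The strand label on pieces from regions > 300 bp is a
    placeholder (assumption); the text specifies strands only for 101-300 bp.
    """
    length = end - start
    if length <= 100:
        return []                                   # too small: dropped
    if length <= 300:
        return [(start, end, "+"), (start, end, "-")]  # one subsequence per strand
    # 301-640 bp: two even pieces; > 640 bp: fewest even pieces of <= 320 bp
    n = 2 if length <= 640 else math.ceil(length / 320)
    bounds = [start + round(i * length / n) for i in range(n + 1)]
    return [(bounds[i], bounds[i + 1], "+") for i in range(n)]

short = split_region(0, 100)
both_strands = split_region(0, 200)
halves = split_region(0, 500)
many = split_region(0, 1000)
```

    For a 1,000 bp region this yields four 250 bp pieces, since three pieces would each exceed 320 bp.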

    We used the program ArrayOligoSelector (AOS) (Bozdech et al., 2003) to score 60-mers for use on the array, but modified the oligo selection process. We had two primary reasons for this. First, AOS uses a relative quality scale in selecting oligos. For any particular subsequence, it generates scores based on four parameters to evaluate each 60-mer in the subsequence and looks for the best oligos within that set, ignoring the absolute quality of the oligo. As a result, lower quality oligos can be selected. Second, AOS does not have a parameter to set distance between oligos. Consequently, resolution is largely set by defining subsequence size but is still subject to highly variable placement within each subsequence. For instance, if the desired tiling density is 300 bp, we would select subsequences 300 bp long. For any two adjacent subsequences, probes could be separated by as little as 0 bp (both probes were placed near the shared subsequence border) or as much as 480 bp (both probes placed at opposite subsequence ends).

    To avoid selecting lower quality oligos, we ran AOS to derive scores for every 60-mer in all subsequences and then eliminated oligos based on these scores. AOS uses a scoring system for four criteria: GC content, self-binding, complexity and uniqueness. We selected the following ranges for each parameter: GC content between 30 percent and 100 percent, self-binding score less than 100, complexity score less than or equal to 24, uniqueness greater than or equal to –40.
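
    The four cutoffs can be expressed as a single predicate. In the sketch below the `OligoScore` record and its field names are hypothetical; they do not reproduce the AOS output format, and only the numeric thresholds come from the text.

```python
from dataclasses import dataclass

@dataclass
class OligoScore:
    """Hypothetical container for the four AOS scores of one 60-mer."""
    gc_percent: float
    self_binding: float
    complexity: float
    uniqueness: float

def passes_filters(o: OligoScore) -> bool:
    """Apply the absolute cutoffs listed in the text."""
    return (
        30 <= o.gc_percent <= 100   # GC content between 30% and 100%
        and o.self_binding < 100    # self-binding score less than 100
        and o.complexity <= 24      # complexity score at most 24
        and o.uniqueness >= -40     # uniqueness at least -40
    )

good = OligoScore(gc_percent=45, self_binding=60, complexity=20, uniqueness=-35)
low_gc = OligoScore(gc_percent=25, self_binding=60, complexity=20, uniqueness=-35)
not_unique = OligoScore(gc_percent=45, self_binding=60, complexity=20, uniqueness=-41)
qualified = [o for o in (good, low_gc, not_unique) if passes_filters(o)]
```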

    To achieve more uniform tiling, we instituted a method to find probes within a particular distance from each other for the transcription unit subsequences. We sorted all qualified probes into chromosomal order and identified gaps in the genomic sequence that were not covered by one or more 60-mers. These gaps typically represented regions that were repeat masked or generated regions of consistently low quality oligos. For our purposes, gaps that were greater than 640 bp long represented potential dead zones or “borders”. Based on empirical experience with genome-wide location analysis technology, we conservatively estimated that we would not identify binding events that occurred more than 320 bp away from the genomic location of any particular probe. As a result, gaps that were longer than 640 bp likely contained one or more basepairs within the gap that would not be detected even if we used the closest qualified oligos as probes. Using these borders, we split the set of all probes into “packages” containing all qualified probes between two borders.
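
    A minimal sketch of the border and package step follows. It assumes each qualified probe is represented by the start coordinate of its 60-mer and that a gap is measured from the end of one probe to the start of the next; both are assumptions, as is the function name.

```python
PROBE_LEN = 60    # 60-mer probes
BORDER_GAP = 640  # gaps longer than this are treated as borders (from the text)

def split_into_packages(starts):
    """Group qualified probe start positions into packages.

    A new package begins whenever the uncovered stretch between the end of
    the previous probe and the start of the next exceeds BORDER_GAP.
    """
    packages, current = [], []
    for pos in sorted(starts):
        if current and pos - (current[-1] + PROBE_LEN) > BORDER_GAP:
            packages.append(current)
            current = []
        current.append(pos)
    if current:
        packages.append(current)
    return packages

packages = split_into_packages([0, 100, 700, 1500, 1600])
```

    Here the uncovered stretch between the probes at 700 and 1500 is 740 bp, so a border falls between them; the 540 bp stretch between 100 and 700 does not create one.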

    For packages up to 300 bp long, we designed two probes where possible, one from each strand (Watson and Crick). This resulted in two different probes in the region, compensating for those instances where a small region would be found isolated by two borders from the nearest, potentially informative, neighboring probe. For packages greater than 301 bp long, we selected the first qualified probe in the package (lowest chromosomal coordinate), then selected the next qualified probe that was between 150 bp and 280 bp away. If there were multiple, eligible probes, we chose the most distal probe within the 280 bp limit. If there were no probes within this limit, we continued scanning until we found the next acceptable probe. The process was then repeated with the most recently selected probe. If the most recently selected probe was within 250 bp of the next border, we automatically selected the qualified probe closest to the next border. This ensured that we were selecting probes as close to the ends of packages as possible.
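
    The walk along a long package can be sketched as a greedy loop. This is one reading of the description, not the authors' implementation: the function name is hypothetical, distances are measured between probe start coordinates, and the position of the last qualified probe stands in for the "next border".

```python
def tile_package(starts, min_step=150, max_step=280, end_zone=250):
    """Greedy probe selection within one package (long-package case).

    starts: start coordinates of all qualified probes in the package.
    Assumption: the last qualified probe approximates the next border.
    """
    starts = sorted(starts)
    if not starts:
        return []
    last = starts[-1]
    chosen = [starts[0]]                 # lowest chromosomal coordinate first
    while chosen[-1] != last:
        cur = chosen[-1]
        if last - cur <= end_zone:       # close to the border: take the final probe
            chosen.append(last)
            break
        ahead = [p for p in starts if p - cur >= min_step]
        window = [p for p in ahead if p - cur <= max_step]
        # most distal probe inside the 150-280 bp window, else keep scanning
        chosen.append(window[-1] if window else ahead[0])
    return chosen

selected = tile_package([0, 100, 160, 270, 400, 900, 1000, 1100])
```

    In this example the probe at 270 is taken as the most distal candidate within 280 bp of the first probe, no candidate lies 150 to 280 bp beyond it so the scan continues to 900, and from there the package end is within 250 bp, so the final probe at 1100 is selected.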

    For intergenic unit tiling, we generated subsequences and identified borders and packages as described for genic tiling. We divided packages into evenly sized segments where the maximum segment size was 480 bp. We then selected the qualified probe closest to the midpoint of each segment.
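
    A short sketch of the intergenic rule is given below. The function name and the choices to locate a probe by its center and to skip segments that contain no qualified probe are assumptions.

```python
import math

def tile_intergenic(pkg_start, pkg_end, starts, max_segment=480, probe_len=60):
    """Pick, for each evenly sized segment, the probe nearest its midpoint.

    starts: start coordinates of qualified probes inside the package.
    Segments without any qualified probe contribute nothing (assumption).
    """
    length = pkg_end - pkg_start
    n = max(1, math.ceil(length / max_segment))   # fewest segments of <= 480 bp
    chosen = []
    for i in range(n):
        seg_start = pkg_start + i * length / n
        seg_end = pkg_start + (i + 1) * length / n
        mid = (seg_start + seg_end) / 2
        in_seg = [p for p in starts if seg_start <= p + probe_len / 2 < seg_end]
        if in_seg:
            chosen.append(min(in_seg, key=lambda p: abs(p + probe_len / 2 - mid)))
    return chosen

picked = tile_intergenic(0, 960, [10, 200, 400, 500, 700, 900])
```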

    All probes from both transcription unit and intergenic unit tiling were combined, grouped by chromosome and sorted by position.

    Compiled Probes and Controls

    The design process described above led to the production of a set of 115 Agilent microarrays containing a total of 4,652,484 features. Each array contains 40,457 features except for array #115, which contains 40,386 features. The probes are arranged such that array 1 begins with the left arm of chromosome 1, array 2 picks up where array 1 ends, array 3 picks up where array 2 ends, and so on. There are some gaps in coverage that reflect our inability to identify high quality unique 60-mers: these tend to be unsequenced regions, highly repetitive regions that are not repeat masked (such as telomeres or gene families) and certain regions that are probably genome duplications. We estimate that only 10% of the total, non-repeat masked region is not covered by probes. As an estimate of probe density, 95% of all 60-mers are within 450 bp of another 60-mer; 80% of all 60-mers are within 350 bp of another 60-mer.

    We added several sets of control probes (1,500 total) to the whole genome array designs. On each array, there are 40 oligos designed against five Arabidopsis thaliana genes that are printed in triplicate, and thus available for use with spike-in controls. These Arabidopsis oligos were BLASTed against the human genome and do not register any significant hits. Since E2F4 chromatin immunoprecipitations can be accomplished with a wide range of cell types and have provided a convenient positive control for ChIP-Chip experiments (for putative regulators where no prior knowledge of targets exists, for example), we added a total of 80 oligos representing four proximal promoter regions of genes that are known targets of the transcriptional regulator E2F4 (NM_001211, NM_002907, NM_031423, NM_001237). Each of the four promoters is represented by 20 different oligos that are evenly positioned across the region from 3 kb upstream to 2 kb downstream of the transcription start site. We also included a control probe set that provides a means to normalize intensities across multiple slides throughout the entire signal range. There are 384 oligos printed as intensity controls; based on test hybridizations, this set of oligos gives signal intensities that cover the entire dynamic range of the array. Twenty additional intensity controls, representing the entire range of intensities, were selected and printed fifteen times each for an additional 300 control features. We also incorporated 616 “gene desert” controls. To design these probes, we identified intergenic regions of 1 Mb or greater and designed probes in the middle of these regions. These are intended to identify genomic regions that are least likely to be bound by promoter-binding transcriptional regulators (by virtue of their extreme distance from any known gene). We have used these as normalization controls in situations where a factor binds to a large number of promoter regions. In addition to these 1,500 controls, there are 2,256 controls added by Agilent (standard) and 77 blank spots.
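
    The feature and control counts quoted in this section are internally consistent, which a few lines of arithmetic confirm. The variable names below are illustrative; every number is taken from the text.

```python
ARRAYS = 115
FEATURES_PER_ARRAY = 40_457   # arrays 1-114
FEATURES_LAST_ARRAY = 40_386  # array 115

total_features = (ARRAYS - 1) * FEATURES_PER_ARRAY + FEATURES_LAST_ARRAY

controls = {
    "arabidopsis": 40 * 3,           # 40 oligos printed in triplicate
    "e2f4": 4 * 20,                  # four promoters, 20 oligos each
    "intensity": 384,                # intensity controls printed once
    "intensity_replicated": 20 * 15, # 20 controls printed fifteen times
    "gene_desert": 616,
}
total_controls = sum(controls.values())
```

    The sums reproduce the 4,652,484 total features and the 1,500 control probes stated above.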

    Promoter Array

    This set of 10 arrays was designed to cover regions between -8 kb and +2 kb relative to the transcription start sites of 16,710 genes. See Boyer et al. (2005) for details of the design of the arrays.

    Transcription Factor Array

    This array was designed to cover regions between –5 kb and +5 kb relative to the transcription start sites of 2,288 human genes encoding transcription factors as determined by GO classifications and manual annotation. Probes were designed essentially as described above for the whole genome array, although tiling density was slightly improved (1 probe approximately every 250 bp). There are a total of 2,079 control spots on the transcription factor array. The 40 Arabidopsis oligos and 80 E2F4 oligos described above for the whole genome design are each printed once. A total of 404 intensity controls are printed twice. A total of 1,085 “gene desert” controls (described above in the whole genome design) are each printed once. The intensity controls and “gene desert” controls are expanded sets of the controls described above for the whole genome design.

    Data Normalization and Analysis

    We used GenePix software (Axon) to obtain background-subtracted intensity values for each fluorophore for every feature on the array. To obtain set-normalized intensities, we first calculated, for each slide, the median intensities in each channel for the set of 1,500 control probes described above and included on each array. For multiple slide sets (whole genome and promoter array), we then calculated the average of these median intensities for all slides. Intensities were then normalized such that the median intensity of each channel for an individual slide equaled the average of the median intensities of that channel across all slides.

    Among the Agilent controls is a set of negative control spots that contain 60-mer sequences that do not cross-hybridize to human genomic DNA. We calculated the median intensity of these negative control spots in each channel and then subtracted this number from the set-normalized intensities of all other features.

    To correct for different amounts of genomic and immunoprecipitated DNA hybridized to the chip, the set-normalized, negative control-subtracted median intensity value of the IP-enriched DNA channel was then divided by the median of the genomic DNA channel. This yielded a normalization factor that was applied to each intensity in the genomic DNA channel.
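
    The three normalization steps can be sketched with plain Python lists. This is a simplified illustration under stated assumptions: the function names and the index-based way of marking control spots are hypothetical, and each function handles one channel at a time.

```python
from statistics import mean, median

def set_normalize(slides, control_idx):
    """Scale each slide so its control-probe median equals the set average.

    slides: list of per-slide intensity lists for ONE channel.
    control_idx: positions of the control probes on every slide (assumption).
    """
    medians = [median(s[i] for i in control_idx) for s in slides]
    target = mean(medians)
    return [[v * target / m for v in s] for s, m in zip(slides, medians)]

def subtract_negative(values, negative_idx):
    """Subtract the median of the negative-control spots from every feature."""
    background = median(values[i] for i in negative_idx)
    return [v - background for v in values]

def scale_genomic(ip, genomic):
    """Rescale the genomic channel by median(IP) / median(genomic)."""
    factor = median(ip) / median(genomic)
    return [g * factor for g in genomic]

normalized = set_normalize([[10, 20, 30, 40], [20, 40, 60, 80]], control_idx=[0, 1, 2])
cleaned = subtract_negative(normalized[0], negative_idx=[0])
rescaled = scale_genomic(ip=[2, 4, 6], genomic=[1, 2, 3])
```

    After the first step both toy slides share the same control median, which is the property the text describes.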

    Next, we calculated the log of the ratio of intensity in the IP-enriched channel to intensity in the genomic DNA channel for each probe and used a whole chip error model (Hughes et al., 2000) to calculate confidence values for each spot on each array (single probe p-value). This error model functions by converting the intensity information in both channels to an X score which is dependent on both the absolute value of intensities and background noise in each channel. When available, replicate data were combined, using the X scores and ratios of individual replicates to weight each replicate’s contribution to a combined X score and ratio. The X scores for the combined replicate are assumed to be normally distributed, which allows for calculation of a p-value for the enrichment ratio seen at each feature. P-values were also calculated based on a second model assuming that, for any range of signal intensities, IP:control ratios below 1 represent noise (as the immunoprecipitation should only result in enrichment of specific signals) and the distribution of noise among ratios above 1 is the reflection of the distribution of noise among ratios below 1.

    Identification of Bound Regions

    Whole Genome Arrays

    To automatically determine bound regions in the datasets, we developed an algorithm to incorporate information from neighboring probes. For each 60-mer, we calculated the average X score of the 60-mer and its two immediate neighbors. If a feature was flagged as abnormal during scanning, we assumed it gave a neutral contribution to the average X score. Similarly, if an adjacent feature was beyond a reasonable distance from the probe (1000 bp), we assumed it gave a neutral contribution to the average X score. The distance threshold of 1000 bp was determined based on the maximum size of labeled DNA fragments put into the hybridization. Since the maximum fragment size was approximately 550 bp, we reasoned that probes separated by 1000 or more bp would not be able to contribute reliable information about a binding event halfway between them.

    This set of averaged values gave us a new distribution that was subsequently used to calculate p-values of average X (probe set p-values). If the probe set p-value was less than 0.001, the three probes were marked as potentially bound.

    As most probes were spaced within the resolution limit of chromatin immunoprecipitation, we next required that multiple probes in the probe set provide evidence of a binding event. Candidate bound probe sets were required to pass one of two additional filters: two of the three probes in a probe set must each have single probe p-values < 0.005 or the center probe in the probe set has a single probe p-value < 0.001 and one of the flanking probes has a single point p-value < 0.1. These two filters cover situations where a binding event occurs midway between two probes and each weakly detects the event or where a binding event occurs very close to one probe and is very weakly detected by a neighboring probe. For RNA polymerase II, this algorithm identified 22,912 bound probe sets of RNA polymerase II ChIP-enriched DNA across the genome.
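
    The probe-set logic can be sketched in two small functions. The names are hypothetical, p-values are taken as given inputs rather than derived from the error model, and a "neutral contribution" is interpreted here as adding zero to the three-probe sum, which is an assumption.

```python
def average_x(x, pos, flagged, i, max_dist=1000):
    """Average X score of probe i and its two immediate neighbors.

    Flagged features, missing neighbors, and neighbors 1000 bp or more away
    contribute zero to the sum (interpretation of 'neutral'; an assumption).
    """
    total = 0.0
    for j in (i - 1, i, i + 1):
        if j < 0 or j >= len(x):
            continue
        if flagged[j] or abs(pos[j] - pos[i]) >= max_dist:
            continue
        total += x[j]
    return total / 3

def is_bound(set_p, p_left, p_center, p_right,
             set_cut=0.001, pair_cut=0.005, center_cut=0.001, flank_cut=0.1):
    """Apply the probe-set cutoff and the two additional filters from the text."""
    if set_p >= set_cut:
        return False
    singles = (p_left, p_center, p_right)
    two_of_three = sum(p < pair_cut for p in singles) >= 2
    center_plus_flank = p_center < center_cut and min(p_left, p_right) < flank_cut
    return two_of_three or center_plus_flank

avg = average_x([3.0, 6.0, 9.0], [0, 300, 2000], [False, False, False], 1)
```

    The default thresholds are the ones given for RNA polymerase II on the whole genome arrays; other cutoffs can be supplied through the keyword arguments.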

    Individual probe sets that passed these criteria and were spaced closely together were collapsed into bound regions if the center probes of the probe sets were within 1000 bp of each other. This final step reduced the 22,912 peaks to 10,244 bound regions. The bound regions had a median size of 950 bp.
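
    Collapsing neighboring probe sets is a single pass over sorted center positions. In this sketch the function name is hypothetical, "within 1000 bp" is read as a distance of at most 1000 bp between consecutive centers, and a region is reported by its first and last center probe; all three are assumptions.

```python
def collapse_probe_sets(centers, max_gap=1000):
    """Chain bound probe sets whose center probes lie within max_gap bp.

    centers: center-probe coordinates of bound probe sets on one chromosome.
    Returns (first_center, last_center) for each bound region.
    """
    regions = []
    for c in sorted(centers):
        if regions and c - regions[-1][-1] <= max_gap:
            regions[-1].append(c)   # extend the current region
        else:
            regions.append([c])     # start a new region
    return [(r[0], r[-1]) for r in regions]

bound_regions = collapse_probe_sets([100, 900, 1800, 5000, 5500])
```

    Because consecutive centers are chained, a region can span more than 1000 bp in total, as in the first region of this example.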

    The ES cell line we used (H9) has a female karyotype (XX). Nineteen (0.18%) of the RNA polymerase II bound regions mapped to the Y chromosome and 6 of these correspond to the promoters of known genes. Each of these 6 genes (ASMTL, CXYorf2, HIT000024005, PLCXD1, PPP2R3B and SYBL1) is also present on the X chromosome, suggesting that all of these bound regions are duplicate measurements of X chromosome binding events caused by hybridization of X chromosome DNA to Y chromosome probes. Subtracting out these duplicates leaves 10,225 unique genomic regions bound by RNA polymerase II in ES cells.

    Peak finding for genome-wide Suz12 binding data was carried out as described above forRNA polymerase II with the following modifications. Probe sets were marked aspotentially bound if the p-value of average X (probe set p-values) was less than 0.0001 andprobe sets were required to pass one of two additional filters: two of the three probes in aprobe set must each have single probe p-values < 0.0005 or the center probe in the probeset has a single probe p-value < 0.0001 and one of the flanking probes has a single point p-

  • 11

    value < 0.01. This algorithm identified 16,438 bound probe sets of Suz12 ChIP-enrichedDNA across the genome. As before, individual probe sets that passed these criteria andwere spaced closely together were collapsed into bound regions if the center probes of theprobe sets were within 1,000 bp of each other. This final step reduced the 16,348 peaks to3,446 bound regions. The bound regions had a median size of 1,248 bp.

    Unlike RNA polymerase II, Suz12 was often associated with large regions of DNAstretching over multiple kilobases of contiguous sequence. 28% of Suz12-bound regionswere over 2 kb in size, compared with only 7% of RNA polymerase II-bound regions. Insome instances, multiple large regions were clustered in close proximity as shown for theHox clusters.

    Promoter ArrayProbe sets were marked as potentially bound if the p-value of average X (probe set p-values) was less than 0.001 and probe sets were required to pass one of two additionalfilters: two of the three probes in a probe set must each have single probe p-values < 0.005or the center probe in the probe set has a single probe p-value < 0.001 and one of theflanking probes has a single point p-value < 0.1. This algorithm identified 7,074 boundprobe sets of Suz12 ChIP-enriched DNA, 6,302 bound probe sets of Eed ChIP-enrichedDNA and 8,205 bound probe sets of H3K27me3 ChIP-enriched DNA. As before,individual probe sets that passed these criteria and were spaced closely together werecollapsed into bound regions if the center probes of the probe sets were within 1,000 bp ofeach other. This final step reduced the peaks to 1,415 (Suz12), 1,549 (Eed) and 1,885(H3K27me3).

    Transcription Factor Array

    Probe sets were marked as potentially bound if the p-value of average X (probe set p-values) was less than 0.001 and probe sets were required to pass one of two additional filters: two of the three probes in a probe set must each have single probe p-values < 0.005, or the center probe in the probe set has a single probe p-value < 0.001 and one of the flanking probes has a single point p-value < 0.1. As before, individual probe sets that passed these criteria and were spaced closely together were collapsed into bound regions if the center probes of the probe sets were within 1,000 bp of each other. This algorithm identified 465 bound probe sets (299 bound regions) of Suz12 ChIP-enriched DNA in muscle cells, 7,199 bound probe sets (645 bound regions) of Suz12 ChIP-enriched DNA in ES cells, 1,375 bound probe sets (455 bound regions) of H3K27me3 ChIP-enriched DNA in muscle cells and 5,455 bound probe sets (775 bound regions) of H3K27me3 ChIP-enriched DNA in ES cells.

    Controlling for the Effect of Murine Embryonic Fibroblast Feeder Cells

    We performed two sets of experiments to measure the contribution of the murine embryonic fibroblasts (MEFs) to the RNA polymerase II binding data. In the first experiment, we grew a population of MEFs isolated from E13.5 embryos, irradiated and replated the cells for 24 hours, treated the cells with formaldehyde to crosslink polymerase to DNA and performed a chromatin IP. This DNA was then purified and labeled exactly as described for samples of ES cells. Labeled DNA was hybridized to self-printed arrays and analyzed as described previously (Odom et al., 2004). The results indicate that mouse feeder cells are unlikely to contribute more than 1% false positives to RNA polymerase II chromatin immunoprecipitation results. Using our standard analysis, there are only 47 features that show enrichment with the mouse feeder cell RNA polymerase II chromatin immunoprecipitation. In contrast, there are typically 4,000-5,000 enriched features with human RNA polymerase II chromatin immunoprecipitation on self-printed arrays. In the second set of experiments, we obtained ES cells that were MEF-subtracted by preplating the cells on ungelatinized culture dishes for 1-2 hours at 37°C. The supernatant enriched for ES cells was then cross-linked as above and harvested for immunoprecipitation. The results were essentially the same with and without feeder cells. There are some differences, presumably due to the extra manipulations needed to separate the cells and the decreased cell number resulting from these manipulations. While it is technically possible that the oligonucleotide arrays perform differently from our self-printed arrays, these experiments generally suggest that the contribution of 8-12% of feeder cells is unlikely to have an effect on the final results.

    Comparing Bound Regions to Known and Predicted Genes

    The coordinates for the complete lists of RNA polymerase II-bound and Suz12-bound regions on the whole-genome arrays can be found in Table S1 and Table S7, respectively. Mapping the location of RNA polymerase II using genome-tiling arrays directly identified the physical location of active promoters in living cells, thus improving our confidence in transcription start sites previously inferred from RNA evidence. Mapping the location of Suz12 identified the location of genomic regions targeted by the chromatin regulator PRC2. This knowledge should be valuable for improving annotation of the genome and identifying regulatory elements that may not be detected by alternative methods.

    Comparisons to Known Genes

    The locations of RNA polymerase II-bound and Suz12-bound regions were compared relative to transcript start and stop coordinates of known genes compiled from five different databases: RefSeq (Pruitt et al., 2005), Mammalian Gene Collection (MGC) (Gerhard et al., 2004), Ensembl (Hubbard et al., 2005), University of California Santa Cruz (UCSC) Known Genes (genome.ucsc.edu) (Kent et al., 2002) and Human Invitational (H-Inv) full-length cDNAs (Imanishi et al., 2004). All coordinate information was downloaded in January 2005 from the UCSC Genome Browser (NCBI build 35). Of the 10,225 RNA polymerase II-bound regions, 6,741 (66%) occurred within 1 kb of gene starts from one of these 5 databases (Table S1). Of the 3,446 Suz12-bound regions, 2,113 (61%) occurred within 1 kb of gene starts from one of these 5 databases (Table S7).

    To convert bound transcription start sites to more useful gene names, we used conversion tables downloaded from UCSC and Ensembl to automatically assign EntrezGene (http://www.ncbi.nlm.nih.gov/entrez/) gene IDs and symbols to the RefSeq, MGC, Ensembl, UCSC Known Genes and H-Inv transcripts. Transcripts for which no EntrezGene annotation could be found in this manner were annotated manually. This resulted in a total of 7,106 EntrezGene genes being bound by RNA polymerase II (Table S2) and 1,893 EntrezGene genes being bound by Suz12 (Table S8).

    The Distribution of Distances to Known Genes

    The distances between each bound region and the closest RefSeq, Ensembl or MGC transcription start site were calculated and plotted as a histogram (Text: Figure 1E). As might be expected for RNA polymerase II, there is a higher frequency of binding events over the start sites of known genes. This distribution gradually tails off in both directions as the distance to the start site increases. Suz12 shows a similar but broader distribution, as a sizable subset of Suz12 binding events cover large regions of the genome. For comparison, the same distance calculation was made for all probes on chromosome 1.

    Fraction of transcription start sites bound by RNA polymerase II or Suz12 in ES cells

    We used several human gene databases to identify the fraction of annotated transcription start sites bound by RNA polymerase II or Suz12 in ES cells (Figure S4). For each database, we calculated the percentage of annotated transcription start sites that lie within 1 kb of a bound region (RNA polymerase II: MGC 42%, RefSeq 34%, Ensembl 28%, UCSC Known Genes 26% and H-Inv 26%; Suz12: MGC 6%, RefSeq 8%, Ensembl 7%, UCSC Known Genes 6% and H-Inv 4%).
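    The per-database percentage described above amounts to a distance test between each annotated start site and the nearest bound region. The sketch below shows one way to compute it; the function names are hypothetical, coordinates are assumed to be on a single chromosome, and a start site lying exactly 1,000 bp from a region edge is counted as bound, which is an assumption the text does not settle.

```python
from typing import List, Tuple


def distance_to_region(tss: int, start: int, end: int) -> int:
    """Distance in bp from a TSS to a bound region; 0 if the TSS lies inside it."""
    if start <= tss <= end:
        return 0
    return min(abs(tss - start), abs(tss - end))


def fraction_bound(tss_list: List[int], regions: List[Tuple[int, int]],
                   window: int = 1000) -> float:
    """Fraction of start sites lying within `window` bp of any bound region."""
    if not tss_list:
        return 0.0
    n_bound = sum(
        any(distance_to_region(tss, s, e) <= window for s, e in regions)
        for tss in tss_list
    )
    return n_bound / len(tss_list)
```

    A linear scan over regions is used here for readability; for genome-scale lists a sorted index or interval tree would be the more practical choice.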

    Comparisons to Predicted Genes

    The locations of bound regions were also compared relative to transcript start and stop coordinates of predicted genes compiled from eight different databases: GenScan (Burge and Karlin, 1997), GeneID (Parra et al., 2000), FirstEF (Davuluri et al., 2001), ACEview (www.aceview.org), ECgene (Kim et al., 2005), UniGene (www.ncbi.nlm.nih.gov/UniGene), UCSC RetroFinder (Kent et al., 2003) and Non-human mRNAs (Kent et al., 2002). These gene models are generally derived through ab initio computational gene modeling (GenScan, GeneID and FirstEF) or EST clustering and alignment to the human genome (ACEview, ECgene, UniGene, UCSC RetroFinder and Non-human mRNAs). All predictions were derived from downloads of coordinates of predicted human genes mapped to NCBI build 35 of the public human genome sequence from UCSC in January 2005. Of the 3,484 RNA polymerase II-bound regions not mapping to a known gene, 2,110 mapped within 1 kb of the start site of a predicted gene (Table S1). Therefore, a total of 8,851 (87%) RNA polymerase II-bound regions corresponded to a known or predicted transcription start site. Of the 1,353 Suz12-bound regions not mapping to a known gene, 1,158 mapped within 1 kb of the start site of a predicted gene (Table S7). Therefore, a total of 3,271 (95%) Suz12-bound regions corresponded to a known or predicted transcription start site.

    Candidate Novel Genes

    We reasoned that RNA polymerase II-bound regions located outside of known genes and relatively far from known transcription sites might represent novel genes. We identified 1,053 genomic regions bound by RNA polymerase II that lie over 10 kb away from and outside of any known gene (defined as being present in any one of the RefSeq, MGC, Ensembl, UCSC Known Genes or H-Inv databases) (Table S3). Of these, we calculated that 432 occur within 1 kb of the transcription start sites predicted by one or more of eight gene prediction algorithms (Table S4). These gene predictions are based on ab initio computational gene modeling (GenScan, GeneID and FirstEF) or EST clustering and alignment (ACEview (www.aceview.org), ECgene, UniGene (www.ncbi.nlm.nih.gov/UniGene), UCSC RetroFinder and Non-human mRNAs). All predictions were derived from downloads of coordinates of predicted human genes mapped to NCBI build 35 of the public human genome sequence from UCSC in January 2005.

    While we favor the interpretation that RNA polymerase II-bound regions that are relatively distant from annotated transcript start sites represent promoters for novel genes, there are several other possibilities. These regions could also represent new, distal start sites for known genes. These distal regions might represent enhancers that are captured via long-range interactions with RNA polymerase II bound to proximal promoters. Finally, these regions could also represent regions that are spatially, but not linearly, co-localized, similar to the localization of separate regions of chromosomes in the nucleolus.

    Promoters for miRNAs

    RNA polymerase II and Suz12 were also found associated with microRNA genes in ES cells. MicroRNAs (miRNAs) are a non-coding class of small RNAs with significant regulatory potential (Bartel, 2004). In a few cases, miRNA primary transcripts have been characterized and shown to have the hallmarks of RNA polymerase II transcripts (Lee et al., 2004), but due to the rapid processing of primary miRNA transcripts, the location and classification of the majority of miRNA promoters remain unknown. We found RNA polymerase II associated with genes specifying 66 miRNAs in human ES cells, representing 29% of all annotated miRNAs (Table S5). RNA polymerase II occupied the promoters of protein-coding genes harboring 35 intronic miRNAs, strengthening the proposal that miRNAs located within protein-coding genes are typically regulated by the promoters of the corresponding host genes. We also identified the promoters of 31 miRNAs that occur independently of protein-coding genes, providing global evidence that independently transcribed miRNAs are generally RNA polymerase II transcripts. This systematic identification of miRNA genes bound by RNA polymerase II overcomes many of the limitations to miRNA detection, such as the small size of the mature species and the cross-hybridization of closely related miRNAs.

    Similar analysis for Suz12-bound regions indicated that Suz12 binds the promoter regions of 34 miRNAs. These included mir-124, a miRNA preferentially expressed in brain tissue (Sempere et al., 2004) that can shift gene expression profiles towards that of brain (Lim et al., 2005). The observation that Suz12 occupies genes that specify both transcriptional and post-transcriptional regulators of development indicates that PRC2 functions to repress developmental transcriptional programs in ES cells at multiple levels.

    Bound regions were assigned to miRNAs as follows. MiRNA clusters (data from Rfam, May 2005) were divided into two classes: intronic (inside known genes in the same orientation) and independent. Intronic miRNA genes were classified as bound if the promoter of their host gene was bound. For genes with alternative promoters, a promoter upstream of the miRNA had to be bound. Intronic miRNAs appeared to be transcribed from the promoters of their host genes; we did not observe any other RNA polymerase II binding close to intronic miRNAs. Intergenic miRNAs were classified as bound if RNA polymerase II or Suz12 binding was identified within 10 kb upstream of the miRNA, unless the bound region could be attributed to a neighboring gene. However, in most cases, the bound region was detected much closer to the DNA encoding the miRNA stem loops.
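    The rule for intergenic miRNAs can be illustrated with a strand-aware window test. This is only a sketch under stated assumptions: the function names are hypothetical, coordinates refer to one chromosome, "upstream" is taken relative to the miRNA's strand, and bound regions already attributed to a neighboring gene are assumed to have been removed from the input beforehand.

```python
from typing import List, Tuple


def upstream_window(start: int, end: int, strand: str,
                    window: int = 10_000) -> Tuple[int, int]:
    """Return the interval covering `window` bp upstream of a miRNA locus."""
    if strand == "+":
        return max(0, start - window), start
    # On the minus strand, upstream means higher genomic coordinates.
    return end, end + window


def intergenic_mirna_bound(start: int, end: int, strand: str,
                           regions: List[Tuple[int, int]],
                           window: int = 10_000) -> bool:
    """True if any bound region overlaps the upstream window of the miRNA."""
    lo, hi = upstream_window(start, end, strand, window)
    return any(s <= hi and lo <= e for s, e in regions)
```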

    Comparing Binding and Human Expression Data

    Transcription of genes bound by RNA polymerase II and Suz12

    We collected 7 previously published ES cell expression datasets for comparison with our RNA polymerase II and Suz12 binding data. The expression data, gathered using massively parallel signature sequencing (MPSS) and Affymetrix gene expression arrays, were processed as follows:

    MPSS data: Three MPSS datasets were collected, two from a pool of the ES cell lines H1, H7 and H9 (Brandenberger et al., 2004; Wei et al., 2005) and one for HES-2 (Wei et al., 2005). For each study, only MPSS tags detected at or over 4 transcripts per million (tpm) were used. In addition, the data provided by Wei and colleagues (Wei et al., 2005) allowed us to select only those tags that could be mapped to a single unique location in the human genome. For tags without a corresponding EntrezGene ID, IDs were assigned using the gene name or RNA accession numbers provided by the authors.

    Gene expression microarray data: Four Affymetrix HG-U133 gene expression datasets were collected for the cell lines H1 (Sato et al., 2003), H9, HSF1 and HSF6 (Abeyta et al., 2004). Each cell line was analyzed by the authors in triplicate. EntrezGene IDs were assigned to the probe-sets using Affymetrix annotation or using RNA accession numbers provided by the authors. For each probe-set, we counted the number of “Present” calls in the three replicate array experiments performed for each cell line. Most genes are represented by more than one probe-set and, to enable comparison to MPSS and RNA polymerase II binding data, we then found the maximum number of P calls for each gene (defined by unique EntrezGene ID). A gene was defined as detected if it was called “Present” in at least 2 of the 3 replicate arrays.
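    The gene-level detection rule for the microarray data (count "Present" calls per probe-set, take the maximum over a gene's probe-sets, require at least 2 of 3) can be written compactly. The sketch below assumes calls are encoded as "P", "M" or "A" strings and that a probe-set to EntrezGene mapping is available as a dictionary; both the data layout and the function name are illustrative assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Set


def detected_genes(probe_calls: Dict[str, List[str]],
                   probe_to_gene: Dict[str, int],
                   min_present: int = 2) -> Set[int]:
    """Return EntrezGene IDs whose best probe-set is 'P' in >= min_present replicates."""
    best_count: Dict[int, int] = defaultdict(int)
    for probe, calls in probe_calls.items():
        gene = probe_to_gene.get(probe)
        if gene is None:
            continue  # probe-set without an EntrezGene annotation
        n_present = sum(call == "P" for call in calls)
        # Keep the maximum number of Present calls across the gene's probe-sets.
        best_count[gene] = max(best_count[gene], n_present)
    return {gene for gene, n in best_count.items() if n >= min_present}
```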

    This provided 7 lists of genes expressed in ES cells, 3 from MPSS data and 4 from microarray data. We found that microarray analysis of H9 ES cells detected transcripts for 78% of the genes bound by RNA polymerase II in H9 cells that were present on the Affymetrix arrays. In total, the 7 expression experiments detected transcripts for 88% of genes bound by RNA polymerase II (Table S6). In contrast to genes bound by RNA polymerase II, the expression of genes bound by Suz12 was detected more rarely (Figure S12). We found that 20% (±6%) of the genes bound by Suz12 alone in H9 ES cells were expressed, depending on the expression dataset used. The expression of some of these genes may be due to the incomplete shutdown of transcription by Suz12, variations in the genes bound by Suz12 in different cell culture conditions, or the detection of RNA transcripts that are present in a minority of differentiated cells. Transcription of genes bound by both Suz12 and RNA polymerase II is detected substantially more often than that of genes bound by Suz12 alone, consistent with the presence of RNA polymerase II.

    ES cell expression relative to differentiated cell types

    We examined the relative expression levels of genes associated with PRC2 and H3K27me3 in human ES cells (Text: Figure 2C). In order to compare ES cells with as many human cell and tissue types as possible, we combined the data from three studies, all performed using the Affymetrix HG-U133A platform: 3 replicates of H1 ES cells (Sato et al., 2003), 3 replicates each of H9, HSF1 and HSF6 ES cells (Abeyta et al., 2004) and 2 replicates of 79 other human cell and tissue types (Su et al., 2004). We extracted data from the original CEL files from each array and scaled the data to a median signal of 150 in GCOS (Affymetrix). We then exported the data, created expression ratios using the median gene expression of each gene across all arrays, transformed the data into log base 2 and median centered both genes and arrays (so that the median log2 expression ratio for each gene and each array is 0). EntrezGene IDs were assigned to each probe-set and, for genes with multiple probe-sets, the expression ratios were averaged. This resulted in a set of 12,968 unique genes. Of these, 604 were bound by Suz12, Eed and H3K27me3 at high confidence.
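    The ratio and centering steps described above can be sketched in plain Python. This is an illustration rather than the original pipeline: the function names are hypothetical, the input is assumed to be a genes-by-arrays table of positive scaled signals, and a single pass of array centering followed by gene centering is used. A single pass leaves gene medians at exactly 0 but array medians only approximately so; repeating the two steps would tighten both.

```python
import math
from statistics import median
from typing import List

Matrix = List[List[float]]


def log2_ratio_matrix(signal: Matrix) -> Matrix:
    """log2(signal / median signal of that gene across all arrays), row by row."""
    result = []
    for row in signal:
        gene_median = median(row)
        result.append([math.log2(value / gene_median) for value in row])
    return result


def center_arrays(matrix: Matrix) -> Matrix:
    """Subtract each array's (column's) median so every array median is 0."""
    column_medians = [median(column) for column in zip(*matrix)]
    return [[v - m for v, m in zip(row, column_medians)] for row in matrix]


def center_genes(matrix: Matrix) -> Matrix:
    """Subtract each gene's (row's) median so every gene median is 0."""
    centered = []
    for row in matrix:
        row_median = median(row)
        centered.append([v - row_median for v in row])
    return centered
```

    A typical call order under these assumptions would be `center_genes(center_arrays(log2_ratio_matrix(signal)))`.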

    In addition to examining the relative expression levels of Suz12-target genes in ES cells and differentiated cells, we also examined the Affymetrix absolute Present/Absent expression calls (Figure S12). Using this measurement, we found that RNA transcripts of Suz12-target genes were detected in ES cells much less frequently than RNA transcripts of RNA polymerase II target genes. However, in differentiated cells, RNA transcripts were detected for the two classes of genes more equally, indicating that many of the genes silenced by Suz12 in ES cells are transcriptionally active in differentiated cells.

    Inverse correlation between the size of the Suz12 binding domain and gene expression

    We found that, unlike RNA polymerase II, Suz12 was often associated with large regions of DNA stretching over multiple kilobases of contiguous sequence. For example, 28.3% of Suz12-bound regions were over 2 kb in size, compared with only 6.6% of RNA polymerase II-bound regions (Figure S8). To explore whether the size of the genomic region occupied by Suz12 had any functional implications, we measured how RNA polymerase II co-occupancy and gene expression varied according to Suz12 coverage (Figure S13). Suz12-bound regions were assigned to RefSeq genes if they occurred within 1 kb of a transcription start site. For genes associated with multiple bound regions, the regions were collapsed, unless the bound regions occupied alternative promoters, in which case the largest region was selected. Then, for each gene, we determined whether the gene was co-occupied by RNA polymerase II and whether or not the gene was transcribed. Genes had to pass one of two criteria to be classified as transcribed: either RNA transcripts could be detected in all three MPSS datasets or RNA transcripts could be detected in all four Affymetrix gene expression microarray datasets. We discovered that the greater the extent of Suz12 binding, the less frequently the gene was transcribed and the less frequently the target gene was occupied by RNA polymerase II. Genes associated with Suz12 over 4 kb of sequence were 8 times less likely to be transcribed in ES cells (from 24% of RefSeq genes to 3%) and 4 times less likely to be bound by RNA polymerase II (from 36% to 9%). This suggests that transcriptional repression of genes is facilitated by the presence of Suz12 across large regions of DNA.

    Expression changes upon ES cell differentiation

    We also compared the expression level of genes between pluripotent ES cells and differentiated ES cells (expression data from Sato et al., 2003). The pluripotent ES cells (H1 cell line) were cultured on Matrigel in MEF-conditioned medium and then differentiated (non-lineage directed) on Matrigel in non-conditioned medium for 26 days, and both samples were analyzed in triplicate on Affymetrix HG-U133A arrays. We extracted data from the original CEL files and scaled the data to a median signal of 150 in GCOS (Affymetrix). We then exported the data and, for each probe-set, calculated the ratio of the average signal in differentiated cells to the average signal in pluripotent cells. EntrezGene IDs were assigned to each probe-set and, for genes with multiple probe-sets, the expression ratios were averaged. We then selected only those genes that had transcripts detectable in either pluripotent or differentiated ES cells (gene called “P” in at least 2 of the 3 replicates), to avoid analyzing expression ratios consisting of only noise.

    To test whether Suz12-bound genes were preferentially upregulated upon differentiation (Text: Figure 6, Table S13), we compared the distribution of expression ratios for genes bound by Suz12 but not RNA Pol II with the distribution of expression ratios for all genes. As a control, we also compared the distribution of expression ratios for genes bound by neither Suz12 nor RNA Pol II (i.e. genes repressed by other means) with the distribution of expression ratios for all genes. We chose to present data for genes not bound by RNA polymerase II because this was the stricter comparison (genes bound by RNA polymerase II are less likely to increase in expression as they are already being transcribed). However, the preferential induction of genes bound by Suz12 is also apparent without first filtering for RNA polymerase II occupancy (data not shown).

    Estimating Error Rates

    We used sequence-specific PCR to estimate false positive rates for the whole-genome array data (Figure S5). For RNA polymerase II, a subset of the bound probe sets was selected and primer pairs were designed to amplify between 100 and 200 bp within each bound probe set. Primers were tested for specificity using BLAST and ePCR. A total of 192 primer pairs were selected, where each primer had 10 or fewer matches to the genome and the pair predicted a single amplicon. For RNA polymerase II IP samples, 10 ng of immunoenriched DNA was used as input to the PCR. For whole cell extract (WCE) samples, a range of unenriched DNA amounts (90, 30 and 10 ng of DNA) was used. The PCR was performed for 28 cycles and products were visualized on an agarose gel stained with SYBR Gold (Amersham) and quantified using ImageQuant (Amersham). Only PCR reactions giving single bands with intensities ordered according to the WCE concentration were used. Genomic regions were considered enriched if the 10 ng IP sample showed either 1.5-fold or greater enrichment compared to the 30 ng WCE sample or greater than 1-fold enrichment compared to the 90 ng WCE sample. Genomic regions were considered not enriched if the band intensity of the 10 ng IP was less than half that of the 30 ng WCE or less than that of the 10 ng WCE. A total of 119 primer pairs yielded a clear enriched/not enriched decision. 114 of these showed enrichment, indicating a false positive rate of 4.4%. Using this set of PCR results, we were also able to produce receiver-operator curves showing how changes in peak identification criteria would affect the false positive and false negative rates. The results suggest that our selected criteria are useful for maximizing the identification of true positives.

    Two lines of evidence suggest that the false negative rate is approximately 30%. Estimating a false negative rate is generally much more problematic than measuring a false positive rate because the measurement of a false negative rate assumes perfect knowledge of the true positives in the dataset. As every method will have its own error rate, determining a set of true positives is challenging, if not impossible. Despite this important caveat, we have used both sequence-specific PCR and a comparison with expression datasets to estimate a false negative rate.

    To obtain an estimate of the false positive rate for our sequence-specific PCR reactions, we designed 49 primer pairs against regions of the genome that had no indication of RNA polymerase II binding (p-value for average X and center probes X > 0.3) despite being in densely tiled regions. We reasoned that any substantial PCR amplification in this region was more likely to reflect a false positive in the PCR than to reflect binding of a very large fraction of the genome to the initiation form of RNA polymerase II. From these PCRs, we measured a false detection rate of ~9%. We then designed a series of PCR primers against probes ‘expressing’ a broad range of p-values between these absolute negatives and our positive list. 60 of these pairs produced positive PCR amplifications. Correcting for the expected false detection rate of the PCR, we calculate a probe-based false negative rate of ~33%.

    We also used sequence-specific PCR to estimate false positive and false negative rates for the whole-genome Suz12 array data. For estimating false positives, a total of 108 primer pairs yielded a clear enriched/not enriched decision. 105 of these showed enrichment, indicating a false positive rate of 2.8%. Correcting for the expected false detection rate of the PCR, we calculated a probe-based false negative rate of 27%.
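    The gel-based decision rule given for the RNA polymerase II PCRs, and the false positive arithmetic applied to the Suz12 counts, can be expressed as a short sketch. The function names and the "unusable"/"ambiguous" labels are assumptions for illustration; band intensities are taken to be arbitrary positive numbers, and the false positive rate is assumed to be the fraction of decided primer pairs that were not enriched.

```python
def classify_pcr(ip10: float, wce10: float, wce30: float, wce90: float) -> str:
    """Classify one primer pair from band intensities (10 ng IP vs. WCE titration)."""
    # Only reactions whose WCE bands increase with input amount are usable.
    if not (wce10 < wce30 < wce90):
        return "unusable"
    # Enriched: >= 1.5-fold over 30 ng WCE, or greater than the 90 ng WCE band.
    if ip10 >= 1.5 * wce30 or ip10 > wce90:
        return "enriched"
    # Not enriched: less than half the 30 ng WCE band, or below the 10 ng WCE band.
    if ip10 < 0.5 * wce30 or ip10 < wce10:
        return "not enriched"
    return "ambiguous"  # no clear enriched / not enriched decision


def false_positive_rate(n_enriched: int, n_decided: int) -> float:
    """Fraction of array-positive regions with a clear decision that PCR did not confirm."""
    return (n_decided - n_enriched) / n_decided


# Suz12 counts quoted in the text: 105 of 108 decided primer pairs were enriched.
suz12_fpr = false_positive_rate(105, 108)
print(f"{suz12_fpr:.1%}")
```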

    Binding of Suz12, Eed and H3K27me3

    We used a microarray containing probes for the promoters of 16,710 genes to measure the correlation between Suz12 binding, Eed binding and H3K27 methylation. This array detected binding of Suz12 to 1,039 genes, Eed to 909 genes and H3K27me3 to 1,007 genes (Text: Figure 2A, Table S9). Due to the strict significance threshold we use to define a DNA binding event (see Identification of Bound Regions section), any set of genes we define as being bound is conservative, with a false negative rate of ~30% (see Estimating Error Rates section). We therefore compared the binding ratios between Suz12, Eed and H3K27me3 to determine whether the genes that were only called bound by one factor were also bound by the other factors, although at a significance that fell below our strict threshold (Figure S6). For genes bound by any one of Suz12, Eed or H3K27me3, we aligned the binding ratios from our Suz12 IP, our Eed IP and our H3K27me3 IP. We found that the binding patterns of Suz12, Eed and H3K27me3 followed one another, even at genes where the binding of only one factor was highly significant by our analysis. From this we conclude that Suz12, Eed and H3K27me3 are present at essentially the same set of genes in ES cells, although we cannot rule out that there is specific binding by these factors at a small number of genes.

    The high degree of overlap between the Suz12, Eed, and H3K27me3 targets indicates that Suz12 defines an active PRC2 complex at these genes. As a critical subunit of the PRC2 complex, Suz12 has widely accepted roles in euchromatic gene silencing and dosage compensation, where Suz12 and H3K27me3 are transiently enriched on the Xi during X-inactivation (Plath et al., 2003; Silva et al., 2003; de la Cruz et al., 2005). However, alternative roles for Suz12 have been proposed that suggest Suz12 may function independently of PRC2 and H3K27me3. For example, Suz12 mutations are suppressors of position-effect variegation (PEV) and can interact with the heterochromatin protein 1α (HP1α), indicating a role in heterochromatin-linked gene silencing (Birve et al., 2001; Yamamoto et al., 2004). Suz12 is also required for germ cell development independent of other PcG proteins and can exhibit different protein expression profiles compared to Eed and EZH2 (Birve et al., 2001; de la Cruz et al., 2005). While the vast majority of Suz12 co-localizes with Eed to regions of H3K27 methylation, non-overlapping targets may be representative of these alternative Suz12 roles that are independent of other PcG proteins.

    GO Classification for RNA Polymerase II and Suz12 Bound Genes

    We identified Gene Ontology classification terms (http://www.geneontology.org) enriched for RNA polymerase II-bound and Suz12-bound genes (defined as being within 1 kb of an annotated TSS in either the RefSeq, MGC or Ensembl databases). Hypergeometric distributions were calculated to determine enriched terms, using for reference the total number of genes annotated to that GO term. Categories with p-values < 10^-5 are indicated in Table S10.
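    A hypergeometric enrichment p-value of the kind described above can be computed directly from binomial coefficients. The sketch below is a generic upper-tail test, not the authors' implementation: the function name is hypothetical, and the choice of gene universe (here simply `n_universe`) is an assumption the text does not spell out.

```python
from math import comb


def hypergeom_enrichment_p(k_bound_in_term: int, n_universe: int,
                           n_in_term: int, n_bound: int) -> float:
    """P(X >= k) for X ~ Hypergeometric(n_universe, n_in_term, n_bound).

    k_bound_in_term: bound genes annotated to the GO term
    n_universe:      all genes considered (assumed reference set)
    n_in_term:       genes in the universe annotated to the GO term
    n_bound:         bound genes in the universe
    """
    upper = min(n_in_term, n_bound)
    # Exact integer arithmetic; the single division at the end yields a float.
    numerator = sum(
        comb(n_in_term, i) * comb(n_universe - n_in_term, n_bound - i)
        for i in range(k_bound_in_term, upper + 1)
    )
    return numerator / comb(n_universe, n_bound)
```

    A term would then be reported when the returned value falls below the chosen cutoff; no multiple-testing correction is shown here.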

    Many of the classifications enriched for Suz12-bound genes were related to development, transcriptional regulation and signaling and are further described in the main text. Among the remaining enriched classifications, we noted an additional category of interest. Over 100 ion channel genes are bound by Suz12 (L-type calcium channels, voltage-gated and inward rectifying potassium channels). This is consistent with a role for PRC2 in blocking differentiation. L-type calcium channels are involved in the neural vs epidermal cell fate decision and direct activation of these channels results in neural induction (Moreau et al., 1994; Leclerc et al., 2001).

    We identified 252 annotated human homeodomain transcription factors using PFam (Bateman et al., 2002) and EntrezGene. Of these, 150 (60%) were bound by Suz12. Most of these were associated with extended domains of Suz12 binding. Given the considerable number of homeodomain transcription factors bound by Suz12, we searched for other families of transcription factors enriched in the set of Suz12-bound targets. Genes annotated to the molecular function GO terms GO:0003700 (transcription factor activity), GO:0030528 (transcription regulator activity), or GO:0003705 (RNA polymerase II transcription factor activity/enhancer binding); or the biological process GO terms GO:0006355 (regulation of transcription, DNA-dependent), or GO:0045449 (regulation of transcription) were defined as transcription factors. Suz12-bound genes in this set for which a SwissProt ID could be retrieved were input into the PANDORA software package (Kaplan et al., 2003) using domain annotation to search for enriched molecular domains at standard resolution. For reference, the same analysis was performed for transcription factor genes bound by RNA polymerase II. Results from the first level of classification are depicted in Figure S7, expressed as a percentage of the number of total transcription factors placed in that category by PANDORA.

    Comparing Suz12-Bound Regions to Domains of Conservation

    Regions of genomic conservation were obtained from the PhastCons database stored at UCSC (http://genome.ucsc.edu). PhastCons identifies genomic segments of conservation based on a two-state phylogenetic hidden Markov model with a state for conserved regions and a state for non-conserved regions. Each conserved element is assigned a log-odds score equal to its log probability under the conserved model minus its log probability under the non-conserved model. The elements are then assigned a conservation score, which is a log transformation of the log-odds score and scales from 0 to 1000 (http://genome.ucsc.edu/cgi-bin/hgTrackUi?g=phastConsElements). A LOD score of 100 corresponds to a conservation score of ~500. Conserved elements overlapping exons in the RefSeq and UCSC Known Gene databases were removed for most analyses. For cases in which Suz12-bound regions or TSS proximal regions (-8 kb to +2 kb around known start sites) contained multiple conserved elements, the top conservation score was used.

    To calculate the significance of overlap between Suz12 binding and conserved domains, we tested randomly generated genomic regions that reflected the variation in size of Suz12 binding regions and the array coverage of the genome. The randomizations were performed by finding, for each bound region, a random region of equal size in the genome that was present on the array. Each of these random regions was then tested to see if it overlapped a conserved element and scored as above. Multiple runs of the randomization were performed and P-values were determined by assuming a binomial distribution with an expectation derived from the randomized regions. For comparison, the same analysis was performed for RNA polymerase II.
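    The randomization test described above can be sketched in two parts: estimating the expected overlap fraction from size-matched random regions, and converting the observed overlap count into a binomial upper-tail probability. Everything below is an illustrative assumption rather than the original implementation: the function names, the half-open single-chromosome intervals, the representation of array coverage as a list of covered intervals (at least one of which must be long enough for each bound region), and the use of log-gamma terms to keep the binomial sum numerically stable.

```python
import random
from math import exp, lgamma, log
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on one chromosome


def overlaps_any(start: int, end: int, elements: List[Interval]) -> bool:
    """True if [start, end) overlaps at least one conserved element."""
    return any(s < end and start < e for s, e in elements)


def random_region(size: int, covered: List[Interval], rng: random.Random) -> Interval:
    """Draw a region of `size` bp lying entirely within array-covered sequence."""
    candidates = [(s, e) for s, e in covered if e - s >= size]
    # Weight each covered interval by its number of valid start positions.
    weights = [e - s - size + 1 for s, e in candidates]
    s, e = rng.choices(candidates, weights=weights, k=1)[0]
    start = rng.randint(s, e - size)
    return start, start + size


def expected_overlap_fraction(bound: List[Interval], covered: List[Interval],
                              elements: List[Interval], runs: int = 100,
                              seed: int = 0) -> float:
    """Fraction of size-matched random regions that overlap a conserved element."""
    rng = random.Random(seed)
    hits = total = 0
    for _ in range(runs):
        for s, e in bound:
            rs, re_ = random_region(e - s, covered, rng)
            hits += overlaps_any(rs, re_, elements)
            total += 1
    return hits / total


def binomial_upper_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p), summed in log space for stability."""
    if k <= 0:
        return 1.0
    if p <= 0.0:
        return 0.0
    if p >= 1.0:
        return 1.0
    def log_pmf(i: int) -> float:
        return (lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
                + i * log(p) + (n - i) * log(1 - p))
    return min(1.0, sum(exp(log_pmf(i)) for i in range(k, n + 1)))
```

    Under these assumptions, the P-value for an observed count of overlapping bound regions would be `binomial_upper_tail(observed, len(bound), expected_overlap_fraction(...))`.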

    Generating Suz12-deficient Mouse Cells and Analysis of their Expression Pattern

    Generation of Suz12 -/- mouse cells

    To generate Suz12-deficient mouse cells, a targeting vector was constructed (Figure S10) from a BAC DNA clone containing the Suz12 gene isolated from a mouse genomic library derived from R1 embryonic stem cells (O. Ohara and H. Koseki, unpublished). The targeting construct had two homology sequences, a 6.1 kb EcoRI/XbaI fragment that lies 5’ of the 11th exon of the locus and a 3.1 kb KpnI/XbaI fragment that lies 3’ of the 14th exon and extends just past the stop codon, removing DNA encoding amino acids 482-741 containing the VEFS domain, which is involved in interactions with EZH2 (Yamamoto et al., 2004). For the negative selection, the HSV tk cassette from the pPNT vector was added. Successful integration replaced a 6.0 kb fragment containing four exons with a neomycin-resistance (Neor) gene cassette from pHR68 (gifted by Dr. T. Kondo) in a reverse orientation relative to Suz12 transcription. This vector was introduced into R1 embryonic stem cells as described previously (Akasaka et al., 1996) and four homologous recombinants were obtained. Suz12 heterozygous ES cells were introduced into recipient blastocysts and germline transmission of the null allele was obtained with no apparent phenotypic differences between wildtype and heterozygous animals. Suz12 +/- mice were backcrossed six times onto a C57BL/6 background. Genotyping was performed by Southern blotting against BamHI-digested genomic DNA to detect the appearance of a 3.5 kb fragment (generated by cutting at a BamHI site introduced with the neo cassette) and the loss of a 6.2 kb fragment that would occur with endogenous BamHI sites (Figure S10).

    Suz12 -/- cell lines were derived from blastocysts from crosses between heterozygous Suz12 mutant animals based on conventional protocols (Hogan et al., 1994). Loss of wild-type Suz12 protein in Suz12 -/- cells was confirmed by Western blotting (Figure S10). The homozygous mutant cells display reduced ability to tri-methylate H3K27 (data not shown), indicating that PRC2 complex function is disrupted in these cells. The cells retain some characteristics of ES cells, such as cellular morphology, relatively normal levels of Oct4 and Nanog expression and the ability to proliferate in culture, while gaining some characteristics of differentiated cells, such as upregulation of developmental transcription factors (as described below). Moreover, Suz12 -/- embryos in this study arrest development at 7.75 dpc, similar to that previously described for Suz12, Eed, and Ezh2 null embryos (Schumacher et al., 1996; O'Carroll et al., 2001; Pasini et al., 2004).

    Microarray expression analysis

    Total RNA was purified from the two replicate wild-type mouse ES cell lines and the replicate Suz12 -/- cell lines using TRIzol. RNA from each Suz12 -/- cell line was labeled with Cy5 using the Low RNA Input Fluorescent Linear Amplification Kit (Agilent) and hybridized to Mouse Development Arrays (G4120A, Agilent) with Cy3 labeled total RNA from the corresponding wild-type cells. Each experiment was also repeated, swapping the dyes, giving a total of four expression datasets, with each of the two biological replicates being represented by two technical replicates. The arrays were scanned with an Agilent microarray scanner and the processed signals and expression ratios were obtained using Feature Extraction software (Agilent). We filtered the data to remove features with signal intensities not significantly above background in both channels. Average expression ratios were generated through inter-slide and intra-slide comparisons between the signals for Suz12 -/- cells and wild-type ES cells for each replicate. The average ratios between the self-self comparisons within each replicate set were also calculated and this population was then defined as the null-distribution. Expression ratios were then compared to this null-distribution and the number of standard deviations from the mean calculated. The expression of a gene was considered to be significantly altered in Suz12 -/- cells if the expression ratio between Suz12 -/- and wild-type ES cells was over 2 standard deviations from the mean of the null-distribution and the expression ratio for the same gene in the self-self comparisons was less than 1 standard deviation from the mean of the null-distribution.
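    The significance criterion in the last sentence can be illustrated with a short sketch. It assumes that ratios are handled as log ratios, that the null-distribution is simply the population of self-self average ratios, and that the sample standard deviation is used; the function name, gene labels and values are hypothetical.

```python
from statistics import mean, stdev


def significantly_altered(mutant_vs_wt, self_self, hi=2.0, lo=1.0):
    """Return {gene: z-score} for genes passing both thresholds.

    mutant_vs_wt: gene -> average log ratio (Suz12 -/- vs. wild-type).
    self_self:    gene -> average log ratio from self-self comparisons;
                  this population is treated as the null-distribution.
    """
    null_values = list(self_self.values())
    mu = mean(null_values)
    sigma = stdev(null_values)  # sample standard deviation (an assumption)
    altered = {}
    for gene, ratio in mutant_vs_wt.items():
        z = (ratio - mu) / sigma
        z_self = (self_self[gene] - mu) / sigma
        # Over 2 SD in the mutant comparison, under 1 SD in the self-self comparison.
        if abs(z) > hi and abs(z_self) < lo:
            altered[gene] = z
    return altered


# Hypothetical genes and values, for illustration only.
self_self = {"a": 0.0, "b": 0.1, "c": -0.1, "d": 0.05, "e": -0.05}
mutant_vs_wt = {"a": 1.0, "b": 0.12, "c": -0.5, "d": 0.0, "e": 0.1}
altered = significantly_altered(mutant_vs_wt, self_self)
```

    In this toy input, gene "c" shows a large change but is excluded because its own self-self ratio lies more than 1 standard deviation from the null mean, which is the purpose of the second threshold.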

    Comparing mouse expression data with human binding data and human expression data

    We reasoned that genes bound by Suz12 in human ES cells have orthologs in mice that should be upregulated in Suz12 -/- mouse cells. A significant overlap in the genes bound by Suz12 in human ES cells and the genes upregulated in Suz12 -/- mouse cells would support a role for Suz12 in the repression of its target genes in ES cells. We expected that the overlap in genes bound by Suz12 in human ES cells and genes upregulated in Suz12 -/- mouse cells would be incomplete because of 1) potential differences in Suz12 occupancy in human and mouse ES cells, 2) possible repression of PRC2 target genes by additional mechanisms, 3) the effects of the Suz12 -/- mutation on genes downstream of Suz12-target genes, due to the fact that many of these are transcriptional regulators, and 4) false positive and negative errors in both binding and expression analysis.

    The mouse microarrays contained 5341 features with a mouse EntrezGene ID. Features representing duplicate genes were averaged, giving single expression values for 4266 unique genes. Using the Agilent GeneName field, we then used the Homologene database to identify orthologous human genes. Homologene listed a human ortholog for 3971 of the 4266 mouse genes. We determined that 557 of these 3971 genes with human orthologs were significantly upregulated in Suz12 -/- mouse cells.
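    As a rough sketch of the feature-to-gene collapsing and ortholog mapping described above, assuming features arrive as (gene identifier, value) pairs and that the ortholog lookup is a plain dictionary (the identifiers and helper names below are hypothetical, not Homologene's actual format):

```python
from collections import defaultdict


def average_by_gene(features):
    """Collapse duplicate features into one mean expression value per gene."""
    grouped = defaultdict(list)
    for gene_id, value in features:
        grouped[gene_id].append(value)
    return {gene_id: sum(vals) / len(vals) for gene_id, vals in grouped.items()}


def map_to_orthologs(mouse_values, ortholog_table):
    """Keep only mouse genes with a listed human ortholog, keyed by the human gene."""
    return {
        ortholog_table[gene_id]: value
        for gene_id, value in mouse_values.items()
        if gene_id in ortholog_table
    }


# Hypothetical identifiers and values, for illustration only.
features = [("m1", 1.0), ("m1", 3.0), ("m2", 0.5), ("m3", -1.0)]
ortholog_table = {"m1": "h1", "m3": "h3"}
per_gene = average_by_gene(features)
human_keyed = map_to_orthologs(per_gene, ortholog_table)
```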

    We compared the set of 3971 human-mouse gene orthologs with the human Suz12 binding data and found that 346 of these 3971 genes were bound by Suz12 in human ES cells. By comparing the set of 346 bound genes with the set of 557 upregulated genes, we found that 70 (20%) of the 346 genes bound by Suz12 in human ES cells were upregulated in Suz12 -/- cells. This overlap is significant (p = 6x10^-4) and, given the complexities associated with human-mouse comparisons, strongly supports a role for Suz12 in the repression of its target genes in ES cells. Strikingly, 8 of the 10 most upregulated developmental transcription factors in Suz12 -/- cells were bound by Suz12 in human ES cells. The identities of the genes bound by Suz12 in human ES cells and upregulated in Suz12 -/- mouse cells are listed, together with their expression changes, in Table S14.
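    The text does not state which test produced the significance value for this overlap. One common choice for such a comparison is a one-sided hypergeometric test, sketched below purely as an assumption, using the counts reported above (3971 orthologs, 557 upregulated, 346 bound, 70 in both). The function name is hypothetical and the result need not match the reported value exactly.

```python
import math


def hypergeom_tail(k, N, K, n):
    """P(X >= k) when drawing n genes from N, of which K are 'successes'."""
    total = math.comb(N, n)
    upper = min(K, n)
    favorable = sum(
        math.comb(K, i) * math.comb(N - K, n - i) for i in range(k, upper + 1)
    )
    return favorable / total


N_ORTHOLOGS = 3971   # mouse genes with a human ortholog
N_UPREGULATED = 557  # significantly upregulated in Suz12 -/- cells
N_BOUND = 346        # bound by Suz12 in human ES cells
N_BOTH = 70          # bound and upregulated

expected_overlap = N_BOUND * N_UPREGULATED / N_ORTHOLOGS
p_overlap = hypergeom_tail(N_BOTH, N_ORTHOLOGS, N_UPREGULATED, N_BOUND)
percent_bound_upregulated = round(100 * N_BOTH / N_BOUND)
```

    Under these assumptions roughly 48 to 49 genes would be expected in the overlap by chance, compared with the 70 observed.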

    To determine the degree to which Suz12 bound genes were preferentially upregulated in Suz12 -/- cells (Text: Figure 6C), we performed the same analysis previously used to determine whether Suz12 bound genes were preferentially upregulated upon human ES cell differentiation (p.16 of Supplemental Data).

    To compare the expression changes that occur in Suz12 -/- cells with those that occur upon human H1 ES cell differentiation (Text: Figure 6D), we identified a set of 182 genes that were present in both filtered datasets and were bound by Suz12 in human ES cells (the human ES cell differentiation dataset was filtered as described on page 16).

    Sample Preparation and Analysis of Differentiated Muscle

    Primary Human Skeletal Muscle Cells (HSkMCs) were obtained from Cell Applications, Inc. (San Diego, CA) and expanded according to supplier's protocols in growth medium. Upon reaching confluence, cells were shifted to differentiation medium in plates coated with collagen to promote attachment of differentiating cells, and medium was replaced every 2 days. Growth and differentiation media were supplied by Cell Applications, Inc. After 6 days of differentiation, cells had fused to form multinucleated myotubes. Cells were crosslinked and ChIP experiments were performed as described above. ChIP-chip data were analyzed as described above.

    We observed Suz12 binding at a number of loci that were also Suz12-bound in ES cells, but loss of Suz12 and H3K27me3 was seen at several genes critical for the development of differentiated muscle tissue, including MyoD, Pax3, Pax7, and Six1. Previous work indicates Ezh2 is removed from the chromatin of the muscle-specific structural genes MCK and MHCII and replaced by transcriptional activators upon differentiation (Caretti et al., 2004). While Ezh2 levels appear to decline over the course of muscle differentiation, we note Suz12 binding in differentiated muscle tissue. We cannot rule out that Suz12 may be playing a different role in this terminally differentiated tissue than it plays in ES cells, and further work will be necessary to clarify this issue. However, the loss of Suz12 at factors necessary for development of muscle tissue and the retention of Suz12 binding at a number of genes important for other lineages suggest that removal of Suz12 accompanies the expression of key developmental regulators during cellular differentiation.

    Comparing Suz12 Binding with Oct4, Nanog and Sox2 Binding

    To explore how Suz12 might be targeted to genes, we compared our Suz12 binding data to our previous results from profiling the DNA binding of the ES cell transcription factors Oct4, Sox2 and Nanog (Boyer et al., 2005). We found that there was a significant overlap between the genes bound by Suz12 and the genes bound by Oct4, Sox2 or Nanog. Of the 1606 genes bound by Suz12 and present on the promoter arrays we used previously for Oct4, Sox2 and Nanog, 196 (12%) were also bound by Oct4, 148 (9%) by Sox2 and 271 (17%) by Nanog. 92 genes (6%) were bound by all three factors and a total of 342 (21%) by any one of the three factors. We found that genes bound by Suz12 and Oct4, Suz12 and Sox2 or Suz12 and Nanog were enriched for genes encoding transcriptional regulators of development when compared to genes bound by Suz12 alone (p = 5.4x10^-20, 3.5x10^-10 and 2.4x10^-18, respectively). There were 315 genes encoding developmental transcription factors that were present on the whole-genome and promoter arrays and bound by Suz12. Of these, 103 (33%) were also bound by Oct4, 81 (26%) were also bound by Sox2 and 107 (34%) were also bound by Nanog. 57 genes (18%) were bound by all three factors and a total of 143 (45%) were bound by any one of the three factors (Table S11).

    Although Oct4, Sox2 and Nanog are all indispensable for ES cell propagation, mutations in each regulator display slightly different phenotypes (reviewed in Chambers, 2004 and Boiani and Scholer, 2005), suggesting each may have unique contributions to stem cell identity. This led us to ask if there were differences in the association of Oct4, Sox2 or Nanog with Suz12 when the factors were considered individually or in pairs. Direct analysis of the bound sites, compared to randomized binding data, showed that Oct4 was more associated with Suz12 bound regions than either Sox2 or Nanog. This association was consistent whether we compared the factors alone or in pairs. Surprisingly, sites bound by Sox2 or Nanog but not Oct4 were not particularly associated with Suz12 (Figure S14). These data point out subtle differences in the association of Oct4, Sox2 or Nanog with Suz12 that may eventually help identify regulatory mechanisms specific to each transcription factor.

    Suz12, Oct4, Nanog and Sox2 binding and sequence conservation

    The observation that a set of repressed genes bound by Oct4, Sox2 and Nanog were occupied by Suz12 and contain highly conserved non-coding DNA sequences led us to examine whether the DNA-binding regulators occupy conserved sequence motifs that might contribute to PRC2 targeting at these genes. When we examined the Oct4, Sox2 and Nanog binding sites at developmental TFs bound by Suz12, we found that ~50% of their bound regions overlapped conserved elements that had LoD conservation scores > 100. The association of these DNA-binding transcription factors with conserved elements at a substantial fraction of Suz12 occupied sites suggests that they may have some role in targeting PRC2 to conserved elements in many genes encoding developmental regulators. Oct4, Sox2 and Nanog were not found at all PRC2-occupied genes, however, so additional regulators must also be involved in PRC2-mediated silencing in ES cells.

    Suz12, Oct4, Nanog and Sox2 binding and DNA motifs

    If Oct4, Sox2 and Nanog are involved in targeting PRC2 to genes, we expect that other factors must influence Suz12 binding and thus explain why Suz12 is observed at only a subset of the genes bound by these transcription factors. We used MEME (Bailey and Elkan, 1995) to search for additional DNA sequence motifs that might discriminate between genes bound by Oct4, Sox2, Nanog and Suz12 and genes bound by only Oct4, Sox2 and Nanog. There was one motif consisting of repeats of the dinucleotide GT that was specifically associated with the Oct4, Sox2, Nanog and Suz12-bound sites (Figure S15). This motif is similar to one previously associated with polycomb response elements in Drosophila (Ringrose et al., 2003). We also found DNA elements that were specifically associated with sites bound by Oct4, Sox2 and Nanog and not Suz12 (Figure S15). One of these elements contains DNA binding sites for multiple transcription factors, including HoxA3 and C/EBP. This suggests that Oct4, Sox2 and Nanog may act with other transcriptional regulators to positively regulate transcription at some genes, but in the absence of these other regulators, may recruit PcG proteins and thus negatively regulate transcription at other genes. Similar bimodal activities have been suggested for proteins involved in PcG targeting in Drosophila including GAGA and zeste (Kerrigan et al., 1991; Laney and Biggin, 1992; Strutt et al., 1997). Oct4, Sox2 and Nanog have previously been described as having both positive and negative roles in transcription (Yuan et al., 1995; Botquin et al., 1998; Nishimoto et al., 1999; Guo et al., 2002) and the association of Suz12 with only a subset of promoters bound by these regulators would be consistent with these observations.

    Index of Tables

    All tables can be found on the supporting website; the URLs below can be used to download the appropriate table.

    Table S1. Regions bound by RNA polymerase II and their relationship to known and predicted genes.
    http://web.wi.mit.edu/young/hES_PRC/TableS1.xls

    Table S2. HUGO/EntrezGene identifiers for RNA Pol II bound, annotated genes.
    http://web.wi.mit.edu/young/hES_PRC/TableS2.xls

    Table S3. RNA polymerase II-bound regions that predict novel gene candidates.
    http://web.wi.mit.edu/young/hES_PRC/TableS3.xls

    Table S4. Gene models bound by RNA polymerase II.
    http://web.wi.mit.edu/young/hES_PRC/TableS4.xls

    Table S5. MicroRNA genes bound by RNA polymerase II and Suz12 in ES cells.
    http://web.wi.mit.edu/young/hES_PRC/TableS5.xls

    Table S6. Expression of genes bound by RNA polymerase II in ES cells.
    http://web.wi.mit.edu/young/hES_PRC/TableS6.xls

    Table S7. Regions bound by Suz12 and their relationship to known and predicted genes.
    http://web.wi.mit.edu/young/hES_PRC/TableS7.xls

    Table S8. HUGO/EntrezGene identifiers for Suz12-bound, annotated genes.
    http://web.wi.mit.edu/young/hES_PRC/TableS8.xls

    Table S9. Detection of Suz12, Eed and H3K27me3 occupancy using promoter arrays.
    http://web.wi.mit.edu/young/hES_PRC/TableS9.xls

    Table S10. Enriched gene ontologies among RNA Pol II-bound and Suz12-bound genes.
    http://web.wi.mit.edu/young/hES_PRC/TableS10.xls

    Table S11. Developmental transcription factors bound by Suz12.
    http://web.wi.mit.edu/young/hES_PRC/TableS11.xls

    Table S12. Developmental signaling proteins bound by Suz12.
    http://web.wi.mit.edu/young/hES_PRC/TableS12.xls

    Table S13. Expression of Suz12-bound genes during ES cell differentiation.
    http://web.wi.mit.edu/young/hES_PRC/TableS13.xls

    Table S14. Genes bound by Suz12 in ES cells and upregulated in Suz12 -/- mouse cells.
    http://web.wi.mit.edu/young/hES_PRC/TableS14.xls

    Table S15. Developmental regulators associated with PRC2 in ES cells and muscle.
    http://web.wi.mit.edu/young/hES_PRC/TableS15.xls

    Figure Legends

    Figure S1. Human H9 ES cells cultured on a low density of irradiated murine embryonic fibroblasts.
    Bright-field image of H9 cell culture.

    Figure S2. Analysis of human ES cells for markers of pluripotency.
    Human embryonic stem cells were analyzed by immunohistochemistry for the characteristic pluripotency markers Oct4 and SSEA-3. For reference, nuclei were stained with DAPI. Our analysis indicated that >90% of the ES cell colonies were positive for Oct4 and SSEA-3. Alkaline phosphatase activity was also strongly detected in human ES cells.

    Figure S3. Analysis of human ES cells for differentiation potential.
    Teratomas were analyzed for the presence of markers for ectoderm (Tuj1), mesoderm (MF20) and endoderm (AFP). For reference, nuclei are stained with DAPI. Antibody reactivity was detected for derivatives of all three germ layers, confirming that the human embryonic stem cells used in our analysis have maintained differentiation potential.

    Figure S4. The fraction of annotated promoters bound by RNA polymerase II or Suz12.
    The fraction of unique gene transcription start sites that lie within 1 kb of a genomic region bound by RNA polymerase II and Suz12. The total number of start sites in each database is as follows: MGC n=17,188; RefSeq n=19,349; Ensembl n=30,121; UCSC Known Genes n=42,160; H-Inv n=42,777.

    Figure S5. Estimating error rates.
    a. Example gel images showing PCR products amplified from 16 genomic regions judged to be bound by RNA polymerase II using the whole-genome arrays. Each primer-pair was used to amplify unenriched, whole cell extract (WCE) DNA (90, 30 and 10 ng) and immunoenriched (IP) DNA (10 ng). Enrichment in the IP DNA is indicated by a “+” and a lack of enrichment by a “-”. PCR reactions judged to be inconclusive were labeled with an “N”.
    b. Example gel images showing PCR products amplified from 16 genomic regions judged not to be bound by RNA polymerase II using the whole-genome arrays. Each genomic region represents an annotated transcription start site.
    c. Receiver-operator curve for RNA polymerase II binding in human ES cells. Curve compares percentage of true positives and false positives in binding events called from ChIP/chip compared to RT-PCR amplifications of anti-Pol II ChIP DNA. ROC curves were determined for all regions of the genome (blue) and for the subset of regions located within 1 kb of known transcription start sites (red).

    Figure S6. Co-occupation of gene promoters by Suz12, Eed and H3K27me3.
    Suz12 occupancy (top panel), Eed occupancy (middle panel) and H3K27me3 occupancy (bottom panel) at transcription start sites. Each row represents a gene considered occupied by either Suz12, Eed or H3K27me3 using our high-confidence gene calling algorithm (see sections on Data Normalization and Analysis and Identification of Bound Regions). The same genes are illustrated in each of the three panels. Each column represents the data from an oligonucleotide probe positioned relative to the start site as indicated by the gene diagram below. The log binding ratios for each oligo are plotted for each protein; blue indicates enrichment of the immunoprecipitated factor (enrichment ratio >1). A scale for the binding ratios for each panel is shown. Each factor follows the same binding pattern. From this we conclude that Suz12, Eed and H3K27me3 are present at essentially the same set of genes and that our stringent gene calling algorithm sometimes calls a gene bound by one factor but not another factor because of the inherent false negative rate of ~30% (see Estimating Error Rates section).

    Figure S7. Protein domain classification of Suz12- and Pol II-bound transcription factors.
    Fraction of transcription factor categories bound by Suz12 (green) or RNA Pol II (blue). The percentage is expressed relative to all transcription factor genes assigned to that category by InterPro domain (PANDORA) annotation at the default resolution. Abbreviations are b-HLH (basic helix-loop-helix), NHR (nuclear hormone receptor), ETS (erythroblast transformation specific), b-Zip (basic leucine zipper), PHD finger (plant homeodomain finger), SMAD (Sma- and Mad-related) and FHA (forkhead-associated). n indicates the number of transcription factor genes assigned to a given category.

    Figure S8. Suz12 occupies large regions of DNA.
    Number of RNA polymerase II (blue bars, left-hand axis) and Suz12 (green bars, right-hand axis) bound regions of certain sizes (x axis). Unlike RNA polymerase II, Suz12 occupies over 2 kb of sequence at a significant number of genes.

    Figure S9. H3K27me3 co-occupies large domains with Suz12.
    a. Correlation between size of domains of Suz12 binding and H3K27me3 binding. The trend was calculated by computing the moving average of the size of H3K27me3 regions using a sliding window of 20 genes across the set of genes bound by Suz12 and H3K27me3 and ordered by size of Suz12 bound region. Sizes of bound regions were calculated from promoter arrays.
    b. Binding profile of H3K27me3 (black) across ~500 kb regions encompassing Hox clusters A-D. Unprocessed enrichment ratios for all probes within a genomic region are shown (ChIP vs. whole genomic DNA). Approximate Hox cluster region sizes are indicated within black bars.

    Figure S10. Generation of Suz12 -/- cells.
    a. Targeted deletion of the Suz12 locus. Homologous recombination was used to replace the 5’ portion of Suz12 with a neo cassette. Location of probe used for Southern blot verification in (b) is shown. Restriction enzymes are denoted B, BamHI; E, EcoRI; X, XbaI.
    b. Southern blot analysis of BamHI-digested genomic DNA from each genotype.
    c. Western blot analysis of whole cell extracts derived from each genotype. Immunoblots were probed with anti-Suz12 (top) or anti-Lamin B (bottom).
    d. Embryos generated from Suz12 heterozygous crosses were analyzed at different stages of development. At 7.75 dpc, normal as well as morphologically smaller embryos were evident. Genotyping analysis indicated that the abnormal embryos were homozygous for the Suz12 null allele, confirming that Suz12 is required for early development.

    Figure S11. Binding of Suz12 in differentiated muscle.
    a. Suz12 binding profiles across the muscle regulator MYOD1 gene in H9 human ES cells (green) and differentiated myotubes (grey). The plots show unprocessed enrichment ratios for all probes within a genomic region (ChIP vs. whole genomic DNA). Genes are shown to scale below plots (exons are represented by vertical bars). The start and direction of transcription are noted by arrows.
    b. H3K27me3 profiles across the muscle regulator MYOD1 gene in H9 human ES cells (black) and differentiated myotubes (blue). The plots show unprocessed enrichment ratios for all probes within a genomic region (ChIP vs. whole genomic DNA). Genes are shown to scale below plots (exons are represented by vertical bars). The start and direction of transcription are noted by arrows.
    c. Suz12 binding profiles across the muscle regulator PAX3 gene, as in a.
    d. H3K27me3 profiles across the muscle regulator PAX3 gene, as in b.
    e. Suz12 binding profiles across the muscle regulator PAX7 gene, as in a.
    f. H3K27me3 profiles across the muscle regulator PAX7 gene, as in b.

    Figure S12. Detection of genes bound by RNA polymerase II and Suz12 in human ES cell expression datasets.
    Percentages of genes that are bound by RNA polymerase II only, RNA polymerase II and Suz12, and Suz12 only that are detected in 7 ES cell expression datasets and one differentiated cell expression dataset. The first four ES cell datasets and the differentiated cell dataset were generated using gene expression arrays (H1: U133A arrays (Sato et al., 2003); H9, HSF1 and HSF6: U133A+B arrays (Abeyta et al., 2004); differentiated tissues: U133A arrays (Su et al., 2004)). The percentages are relative to the fraction of bound genes that are represented on the arrays. The last three ES cell datasets were generated using MPSS (Brandenberger et al., 2004; Wei et al., 2005).

    Figure S13. Relationship between size of Suz12 and RNA polymerase II co-occupancy and gene expression.
    The percentage of genes with detectable RNA (grey bars) and associated with RNA polymerase II (blue bars) as a function of the extent of Suz12 binding. The frequencies for genes not bound by Suz12 are indicated on the left as controls.

    Figure S14. Association of Oct4, Sox2 or Nanog with Suz12-bound regions.
    a. The percentage of Oct4-bound regions (purple arrow), Sox2-bound regions (red arrow) or Nanog-bound regions (green arrow) that overlap with Suz12-bound regions are shown along the x-axis. Comparisons were made between promoter array data from Boyer et al., 2005 and whole genome Suz12 data presented here. The dashed line indicates the distribution of the expected overlap based on randomized data. For comparison, we also show the results for a fourth transcription factor, E2F4 (blue arrow).
    b. The percentage of Sox2 and Oct4-cobound regions or Nanog and Oct4-bound regions (purple arrows) that overlap with Suz12-bound regions are shown along the x-axis. Comparisons were made between promoter array data from Boyer et al., 2005 and whole genome Suz12 data presented here. The dashed line indicates the distribution of the expected overlap based on randomized data. For comparison, we also show the results for Sox2-bound regions that are not bound by Oct4 (red arrow) and Nanog-bound regions that are not bound by Oct4 (green arrow).

    Figure S15. Motifs associated with DNA regions that are bound by Oct4, Sox2, Nanog and Suz12 or bound by Oct4, Sox2 and Nanog.
    a. Consensus sequence of a motif associated with DNA regions bound by Oct4, Sox2, Nanog and Suz12. This motif was found in approximately 50% of the regions bound by Oct4, Sox2, Nanog and Suz12 and enriched 4.8-fold compared to regions bound by Oct4, Sox2 and Nanog but not Suz12.
    b. Consensus sequence of a motif associated with DNA regions bound by Oct4, Sox2 and Nanog and not bound by Suz12. This motif was found in approximately 20% of the regions bound by Oct4, Sox2 and Nanog and enriched 3.0-fold compared to regions bound by Oct4, Sox2, Nanog and Suz12. Putative transcription factor binding sites are labeled and indicated by black lines. Binding sites were identified with P-Match (http://www.gene-regulation.com) using the input sequence CCTGTAATCCCAGC and cut-off selection for matrix group to minimize the sum of false positives and negatives.
    c. Consensus sequence of a motif associated with DNA regions bound by Oct4, Sox2 and Nanog and not bound by Suz12. This motif was found in approximately 15% of the regions bound by Oct4, Sox2 and Nanog and enriched 2.4-fold compared to regions bound by Oct4, Sox2, Nanog and Suz12. No putative transcription factor binding sites were identified when examined as described in b. using the input sequence ATCTCGGCTCACTG. More lenient selections for the cut-off selection indicate potential binding sites for C/EBP, HoxA3, CdxA, Msx-1 and v-Myb.

    Supplementary References

    Abeyta, M. J., Clark, A. T., Rodriguez, R. T., Bodnar, M. S., Pera, R. A., and Firpo, M. T. (2004). Unique gene expression signatures of independently-derived human embryonic stem cell lines. Hum Mol Genet 13, 601-608.

    Akasaka, T., Kanno, M., Balling, R., Mieza, M. A., Taniguchi, M., and Koseki, H. (1996). A role for mel-18, a Polycomb group-related vertebrate gene, during the anteroposterior specification of the axial skeleton. Development 122, 1513-1522.

    Bailey, T. L., and Elkan, C. (1995). The value of prior knowledge in discovering motifs with MEME. Proc Int Conf Intell Syst Mol Biol 3, 21-29.

    Bartel, D. P. (2004). MicroRNAs: genomics, biogenesis, mechanism, and function. Cell 116, 281-297.

    Bateman, A., Birney, E., Cerutti, L., Durbin, R., Etwiller, L., Eddy, S. R., Griffiths-Jones, S., Howe, K. L., Marshall, M., and Sonnhammer, E. L. (2002). The Pfam protein families database. Nucleic Acids Res 30, 276-280.

    Birve, A., Sengupta, A. K., Beuchle, D., Larsson, J., Kennison,