Analysis of ChIP-Seq Data with Partek® Genomics Suite 6.6™ 1
Chapter 1 Analysis of ChIP-Seq Data with Partek
Genomics Suite™ 6.6
Overview
ChIP-Sequencing technology (ChIP-Seq) uses high-throughput DNA sequencing to map
protein-DNA interactions across the entire genome. Partek® Genomics Suite™ (PGS)
offers convenient visualization and analysis of the high volumes of data generated by ChIP-
Seq.
In this tutorial, you will go through the PGS ChIP-Seq workflow and will analyze aligned
data from a ChIP sample versus a control sample in .bam format.
This tutorial will illustrate how to
Import ChIP-Seq data
Perform QA/QC of the samples
Detect and visualize peaks and enriched regions in the genome
Discover binding site motifs
Annotate enriched regions with overlapping genes
Visualize mapped sequence reads on the genome
Note: the workflow described is specific for PGS version 6.6. To upgrade to this version,
go to the Main menu and access Help > Check for Updates. The screenshots shown below
may vary slightly across hardware platforms and across different versions of PGS.
Description of the Data Set
The data for this tutorial come from Johnson et al. (2007), a study that maps the genomic binding sites
of the NRSF (neuron-restrictive silencer factor) transcription factor across the entire
genome. It includes two samples: an NRSF-enriched ChIP sample (chip.bam) and a control
sample without immuno-enrichment (mock.bam). The chip.bam file contains almost 1.7
million mapped reads, and the mock.bam file contains approximately 2.3 million mapped
reads. These bam files contain the aligned genomic locations and sequences of the
mappable reads. This dataset contains reads from a single-end (SE) library; the differences
in processing paired-end (PE) reads will also be discussed when applicable.
Data and associated files for this tutorial can be downloaded from the Next Generation
Sequencing tab on Help > On-line Tutorials from the PGS main menu.
Import Instructions
The steps below will briefly describe how to import the mapped reads of ChIP-Seq data
into PGS.
Step 1 – Download the Data
Download and unzip the tutorial data, then save the bam files on your computer. Due
to the large file sizes associated with NGS data, it is recommended that bam files
be accessed locally (not across the network). The first time a bam file is read in by
PGS, the file will be sorted to allow faster access; therefore, you must have write
permission on the bam files and in the bam file folder.
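Before importing, it can help to confirm the write permissions mentioned above. The following is a minimal sketch using only the Python standard library; it is a generic operating-system check and not a PGS feature, and the helper name can_sort_in_place is hypothetical.

```python
import os
import tempfile

def can_sort_in_place(bam_path):
    """Return True if both the file and its folder are writable."""
    folder = os.path.dirname(os.path.abspath(bam_path))
    return os.access(bam_path, os.W_OK) and os.access(folder, os.W_OK)

# Demonstration on a throwaway file; replace demo_bam with the real
# paths to chip.bam and mock.bam on your machine.
with tempfile.TemporaryDirectory() as folder:
    demo_bam = os.path.join(folder, "demo.bam")
    with open(demo_bam, "wb"):
        pass  # create an empty placeholder file
    writable = can_sort_in_place(demo_bam)
    missing = can_sort_in_place(os.path.join(folder, "absent.bam"))

print(writable, missing)
```

A file that does not exist is reported as not writable, so the same check also catches a mistyped path.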
Step 2 - Import Mapped Reads into PGS
Open the ChIP-Seq workflow within PGS by selecting it from the Workflows
drop-down in the upper right corner of the menu
Under Import from the ChIP-Seq workflow, select Import and manage samples
to invoke the Sequence Import wizard
Using the file browser on the left, navigate to the ChIP-Seq_Data folder
containing the bam files. For this tutorial, select chip.bam and mock.bam (Figure
1). Select OK
Figure 1: Selecting ChIP-Seq files. Date modified may be different than what is shown
In the Sequence Import dialog, specify the Output file, Species, and Genome
build. For this tutorial, set Species to Homo sapiens and Genome build to hg18.
The Output file will be the name of the parent spreadsheet. Select OK
The Bam Sample Manager (Figure 2) can be used to add new samples or files to the project
(Add samples), to remove samples (Remove selected samples), to associate (multiple) files
with particular samples (Manage samples), and to map the chromosome names from the
input files to the annotation files (Manage sequence names). Since none of these operations
are needed, select Close. If the bam file has not been sorted previously by PGS, you may
see the Sort bam files dialog; select OK to sort the files if this dialog box appears. While
the files are being sorted, you will see a message in the status bar at the bottom of the
window:
Figure 2: Bam Sample Manager, used to add bam files to the experiment or remove
them from it
The resulting spreadsheet is shown in Figure 3. Each sample will be on one row.
The number of aligned reads per sample is shown in column 2. The import
process is now finished
Figure 3: Viewing the spreadsheet after import. Each row contains a sample
Quality Control of Samples
In addition to any quality control that may have been performed when the data was
sequenced, it is a good idea to check the quality of the samples using PGS before analyzing
the data.
Examining the Distribution of Reads
BAM files contain both aligned and unaligned reads. The top-level spreadsheet in Figure 3
shows the number of reads that were aligned to the reference genome. A large number of
unaligned reads may be the result of poor quality sequence data or alignment problems
(wrong genome, alignment settings, etc.). You might also be interested in knowing how
many reads map to more than one location in the genome (if the aligner options supported
multiple-mapped reads).
In the QA/QC section of the ChIP-Seq workflow, select Alignments per read
A new spreadsheet called Alignment_Counts is generated (Figure 4). The titles of
columns 2 and 3 indicate that this is single-ended data. Column 2 shows the
number of unaligned reads (0 alignments per read), and column 3 shows the
number of reads that align exactly once to the genome (1 alignment per read). If
the BAM files had contained reads that mapped to more than one location in the
genome, these would be shown after column 3
Figure 4: Alignment_Counts spreadsheet. The unaligned reads had been removed from
these BAM files and the alignment options did not permit more than one mapping location
per read
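The tally in the Alignment_Counts spreadsheet can be illustrated with a short sketch. This is not how PGS reads BAM files; it assumes the alignment records have already been reduced to hypothetical (read_name, is_mapped) pairs, with an unaligned read appearing once with is_mapped set to False.

```python
from collections import Counter

def alignments_per_read(records):
    """Tally how many reads have 0, 1, 2, ... alignments.

    records is an iterable of (read_name, is_mapped) pairs, one per
    alignment record.
    """
    hits = Counter()
    for name, is_mapped in records:
        hits[name] += 1 if is_mapped else 0  # unaligned reads stay at 0
    # Count reads by their number of alignments, ordered by that number.
    return dict(sorted(Counter(hits.values()).items()))

# Toy records: r2 aligns to two locations, r3 is unaligned.
records = [
    ("r1", True),
    ("r2", True), ("r2", True),
    ("r3", False),
    ("r4", True),
]
counts = alignments_per_read(records)
print(counts)  # {0: 1, 1: 2, 2: 1}
```

The keys play the role of the spreadsheet columns: 0 alignments per read, 1 alignment per read, and so on.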
Strand Cross-Correlation
In short-read ChIP-Seq data, peaks are found upstream of the actual DNA-binding site
(upstream on both strands). In a good quality ChIP-Seq sample, the peaks on the forward
strand and the reverse strand are offset (phase-shifted) by the size of the “effective
fragment length.” The effective fragment length tends to be shorter than the length of the
fragmented DNA, the length of the size selection, and the pull-down length. Strand Cross-
Correlation calculates the correlation of the strand-specific read densities; the maximum
correlation should occur at the average size of the peak shift across all chromosomes.
For single-end reads, PGS will calculate the phase shift between the reads on the forward
strand and reads on the reverse strand using the method (Pearson cross-correlation)
described by Kharchenko et al. (2008). Note: the estimation of effective fragment length
for single-end reads can only be done on IP samples and not on mock controls since non-
enriched samples do not contain a phase shift. For paired-end reads, Strand Cross-
Correlation is calculated from the distribution of fragment lengths between the paired-ends
of the two reads.
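The idea behind the single-end calculation can be sketched as follows. This is a simplified illustration under stated assumptions, not the PGS implementation: the strand densities are toy per-base counts on one short sequence, and the function names are hypothetical. The reverse-strand density is shifted against the forward-strand density, and the Pearson correlation is computed at each shift.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a flat density carries no signal
    return cov / math.sqrt(vx * vy)

def strand_cross_correlation(fwd, rev, max_shift):
    """Correlate fwd[i] with rev[i + shift] for shift = 0..max_shift."""
    profile = {}
    for shift in range(max_shift + 1):
        profile[shift] = pearson(fwd[: len(fwd) - shift], rev[shift:])
    return profile

# Toy densities: each reverse-strand peak sits 30 bp after a forward peak.
fwd = [0] * 200
rev = [0] * 200
for position, height in [(50, 5), (51, 3), (120, 4)]:
    fwd[position] = height
    rev[position + 30] = height

profile = strand_cross_correlation(fwd, rev, max_shift=60)
best_shift = max(profile, key=profile.get)
print(best_shift)  # 30
```

The shift with the highest correlation is the estimate of the effective fragment length for this toy example.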
Under QA/QC from the ChIP-Seq Workflow, select Strand Cross-Correlation. If
you have not run this step previously, you will be asked if you would like to
create a new QA/QC child spreadsheet. If prompted, select Yes
After running Strand Cross-Correlation from the QA/QC workflow, the Strand
Separation of Samples viewer will appear (Figure 5)
Figure 5: Viewing the Strand Cross-Correlation plot to estimate effective fragment length
In Figure 5, the x-axis represents the phase-shift, and the y-axis represents the Pearson
correlation of the strand densities of the forward and reverse strands. Notice that in the IP
sample, the peak occurs at 111 bp, corresponding to an average effective fragment length
of 111 base pairs. The peak location can be determined by examining the values in the
strand_correlation spreadsheet, by mousing over the peak in the graph, or by sorting the
data in the spreadsheet.
The control sample (blue) does not have a similar peak because it does not have the phase-shift
property of IP samples. The control sample does have a small peak at 26 bp, which corresponds
to the sequencing read length. This is probably because some regions in the genome of the
control sample contain many reads stacked on top of each other, which creates a correlation
peak when the forward and reverse strands are shifted by the length of the reads. At the
sequencing read length, the IP sample will show a strand cross-correlation near 0.
The location and magnitude of the peaks in the cross-correlation plot can be used as a
measure of the quality of the enriched sample. Figure 5 shows a highly enriched sample
because the peak at 111 bp dominates the peak at the read length. If the dominant peak in
the IP-enriched sample occurs at the read length, the sample is poorly enriched or
contains very few binding sites. The plot in Figure 6 shows two IP samples with medium-level
enrichment. Multiple dominant peaks in the IP sample may indicate that there are several
populations of DNA fragment lengths, which will complicate peak calling (Kundaje 2010).
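The comparison described above, between the dominant peak and the correlation at the read length, can be sketched in a few lines. The correlation values below are invented for illustration only and are not taken from the tutorial data; the function name summarize_profile is hypothetical, and no numeric quality threshold is implied.

```python
def summarize_profile(profile, read_length):
    """Summarize a cross-correlation profile (shift in bp -> correlation)."""
    best_shift = max(profile, key=profile.get)
    return {
        "best_shift": best_shift,
        "best_corr": profile[best_shift],
        "read_length_corr": profile[read_length],
        # True suggests poor enrichment or very few binding sites.
        "dominant_peak_at_read_length": best_shift == read_length,
    }

# Invented example values (not from chip.bam or mock.bam).
ip_profile = {0: 0.01, 26: 0.02, 60: 0.10, 111: 0.30, 150: 0.12}
mock_profile = {0: 0.01, 26: 0.05, 60: 0.02, 111: 0.02, 150: 0.01}

ip_summary = summarize_profile(ip_profile, read_length=26)
mock_summary = summarize_profile(mock_profile, read_length=26)
print(ip_summary["best_shift"], ip_summary["dominant_peak_at_read_length"])
print(mock_summary["best_shift"], mock_summary["dominant_peak_at_read_length"])
```

In this toy case the IP profile peaks away from the read length, while the mock profile peaks at it, mirroring the behavior described for Figure 5.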
Figure 6: Example of medium-level enriched samples
Detecting Peaks and Enriched Regions
Regions that contain a binding site for the DNA-binding protein of interest will have many
sequence reads mapped to them. Since single-end reads only cover one end of a sequence
fragment, enriched regions will generally show two adjacent peaks. PGS will directionally
extend each SE read in the 3’ direction by the fragment length (extended reads) to facilitate
merging adjacent peaks into a single peak. For PE reads, the fragment length is defined
from the start of the 5’ end of the first read through the 3’ end of its paired read. For peak
detection, PGS divides the genome into windows (bins) of a user-defined size and counts
the number of (midpoints of) the reads that fall within each bin. PGS fits a zero-truncated
negative binomial to the bin counts and finds all regions that are significant at a user-defined
false discovery rate (FDR) threshold. See the ChIP-Seq white paper for more information on the
peak-finding algorithm and tips for setting the Fragment extension and window sizes.
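The read extension, the binning, and the form of the count distribution can be sketched as follows. This is an illustration under stated assumptions rather than the PGS algorithm: reads are hypothetical (position, read length, strand) tuples with a 0-based leftmost position, the negative binomial is parameterized by a mean and a size (dispersion) value chosen arbitrarily here, and the fitting and FDR steps are not shown.

```python
import math
from collections import Counter

def fragment_midpoint(pos, read_len, strand, frag_len):
    """Midpoint of a read extended toward its 3' end to frag_len bases."""
    if strand == "+":
        start = pos  # the 5' end is the leftmost base
    else:
        # The 5' end is the rightmost base; extend to the left.
        start = max(pos + read_len - frag_len, 0)
    return start + frag_len // 2

def bin_counts(reads, frag_len, window):
    """Count extended-read midpoints per window of the given size."""
    counts = Counter()
    for pos, read_len, strand in reads:
        midpoint = fragment_midpoint(pos, read_len, strand, frag_len)
        counts[midpoint // window] += 1
    return dict(counts)

def nb_pmf(k, mean, size):
    """Negative binomial probability of count k (mean/size form)."""
    p = size / (size + mean)
    log_pmf = (math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
               + size * math.log(p) + k * math.log(1 - p))
    return math.exp(log_pmf)

def zero_truncated_nb_pmf(k, mean, size):
    """Probability of count k given that the count is at least 1."""
    if k < 1:
        return 0.0
    return nb_pmf(k, mean, size) / (1.0 - nb_pmf(0, mean, size))

# A forward read at 1000 and a reverse read at 1084 (26 bp reads, 110 bp
# fragments) extend to the same fragment and share one midpoint.
reads = [(1000, 26, "+"), (1084, 26, "-"), (1010, 26, "+"), (5000, 26, "-")]
counts = bin_counts(reads, frag_len=110, window=100)
print(counts)  # {10: 3, 49: 1}

# The truncated probabilities over k = 1, 2, ... sum to 1.
total = sum(zero_truncated_nb_pmf(k, mean=2.0, size=1.5) for k in range(1, 500))
```

Extending both strands toward the fragment center is what merges the two adjacent strand-specific peaks into a single peak. The truncation reflects that only bins with at least one read enter the fit.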
Under Peak Analysis of the ChIP-Seq workflow, select Detect peaks. The Detect
peaks dialog will appear (Figure 7)
Specify the Fragment Extensions by setting the Maximum average fragment size
to 110. Maximum average fragment size is based on your experimental design: the
size of the fragment pulled-down in the immunoprecipitation step, the size used
during DNA fragmentation, the fragment length used for size selection, or the
effective fragment length. If you have used an antibody that binds the DNA as the
control antibody (rather than no-enrichment as the control), you could use