Analysis of ChIP-Seq Data with Partek® Genomics Suite 6.6™ 1

Chapter 1 Analysis of ChIP-Seq Data with Partek Genomics Suite™ 6.6

Overview

ChIP-Sequencing technology (ChIP-Seq) uses high-throughput DNA sequencing to map

protein-DNA interactions across the entire genome. Partek® Genomics Suite™ (PGS)

offers convenient visualization and analysis of the high volumes of data generated by ChIP-

Seq.

In this tutorial, you will go through the PGS ChIP-Seq workflow and will analyze aligned

data from a ChIP sample versus a control sample in .bam format.

This tutorial will illustrate how to:

Import ChIP-Seq data

Perform QA/QC of the samples

Detect and visualize peaks and enriched regions in the genome

Discover binding site motifs

Annotate enriched regions with overlapping genes

Visualize mapped sequence reads on the genome

Note: the workflow described is specific for PGS version 6.6. To upgrade to this version,

go to the Main menu and access Help > Check for Updates. The screenshots shown below

may vary slightly across hardware platforms and across different versions of PGS.

Description of the Data Set

The data for this tutorial are from Johnson et al. (2007), who mapped the genomic binding sites

of the NRSF (neuron-restrictive silencer factor) transcription factor across the entire

genome. It includes two samples: an NRSF-enriched ChIP sample (chip.bam) and a control

sample without immuno-enrichment (mock.bam). The chip.bam file contains almost 1.7

million mapped reads, and the mock.bam file contains approximately 2.3 million mapped

reads. These bam files contain the aligned genomic locations and sequences of the

mappable reads. This dataset contains reads from a single-end (SE) library; the differences

in processing paired-end (PE) reads will also be discussed when applicable.

Data and associated files for this tutorial can be downloaded from the Next Generation

Sequencing tab on Help > On-line Tutorials from the PGS main menu.

Import Instructions

The steps below will briefly describe how to import the mapped reads of ChIP-Seq data

into PGS.

Step 1 – Download the Data

Download, unzip the tutorial data, and save the bam files on your computer. Due

to the large file sizes associated with NGS data, it is recommended that bam files

be accessed locally (not across the network). The first time a bam file is read in by

PGS, the file will be sorted to allow faster access; therefore, you must have write

permission on the bam files and in the bam file folder.

Step 2 - Import Mapped Reads into PGS

Open the ChIP-Seq workflow within PGS by selecting it from the Workflows

drop-down in the upper right corner of the menu

Under Import from the ChIP-Seq workflow, select Import and manage samples

to invoke the Sequence Import wizard

Using the file browser on the left, navigate to the ChIP-Seq_Data folder

containing the bam files. For this tutorial, select chip.bam and mock.bam (Figure

1). Select OK

Figure 1: Selecting ChIP-Seq files. Date modified may be different than what is shown

In the Sequence Import dialog, specify the Output file, Species, and Genome

build. For this tutorial, set Species to Homo sapiens and Genome build to hg18.

The Output file will be the name of the parent spreadsheet. Select OK

The Bam Sample Manager (Figure 2) can be used to add new samples or files to the project

(Add samples), to remove samples (Remove selected samples), to associate (multiple) files

with particular samples (Manage samples), and to map the chromosome names from the

input files to the annotation files (Manage sequence names). Since none of these operations

are needed, select Close. If the bam file has not been sorted previously by PGS, you may

see the Sort bam files dialog; select OK to sort the files if this dialog box appears. While

the files are being sorted, you will see a message in the status bar at the bottom of the

window.


Figure 2: Bam Sample Manager used to add or remove additional bam files to the

experiment

The resulting spreadsheet is shown in Figure 3. Each sample will be on one row.

The number of aligned reads per sample is shown in column 2. The import

process is now finished

Figure 3: Viewing the spreadsheet after import. Each row contains a sample

Quality Control of Samples

In addition to any quality control that may have been performed when the data was

sequenced, it is a good idea to check the quality of the samples using PGS before analyzing

the data.

Examining the Distribution of Reads

BAM files can contain both aligned and unaligned reads. The top-level spreadsheet in Figure 3

shows the number of reads that were aligned to the reference genome. A large number of

unaligned reads may be the result of poor quality sequence data or alignment problems

(wrong genome, alignment settings, etc.). You might also be interested in knowing how

many reads map to more than one location in the genome (if the aligner options supported

multiple-mapped reads).


In the QA/QC section of the ChIP-Seq workflow, select Alignments per read

A new spreadsheet called Alignment_Counts is generated (Figure 4). The titles of

columns 2 and 3 indicate that this is single-ended data. Column 2 shows the

number of unaligned reads (0 alignments per read), and column 3 shows the

number of reads that align exactly once to the genome (1 alignment per read). If

the BAM files had contained reads that mapped to more than one location in the

genome, these would be shown after column 3
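The tally behind the Alignment_Counts spreadsheet can be pictured with a short sketch. It assumes each BAM record has already been reduced to a (read name, mapped or not) pair; this record layout and the function name are illustrative only and are not part of PGS.

```python
from collections import Counter

def alignments_per_read(records):
    """Count how many reads have 0, 1, 2, ... alignments.

    records: iterable of (read_name, is_mapped) pairs, one per record
    (a hypothetical, simplified view of a BAM file).
    """
    per_read = Counter()
    for name, is_mapped in records:
        # An unmapped record adds zero alignments but still registers the read.
        per_read[name] += 1 if is_mapped else 0
    # Histogram: number of alignments -> number of reads
    return Counter(per_read.values())

records = [("r1", True), ("r2", False), ("r3", True), ("r3", True), ("r4", True)]
histogram = alignments_per_read(records)
print(dict(sorted(histogram.items())))
```

In this toy input, one read is unaligned, two reads align once, and one read aligns twice, which corresponds to the column layout described above (0, 1, and then more alignments per read).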

Figure 4: Alignment_Counts spreadsheet. The unaligned reads had been removed from

these BAM files and the alignment options did not permit more than one mapping location

per read

Strand Cross-Correlation

In short-read ChIP-Seq data, peaks are found upstream of the actual DNA-binding site

(upstream on both strands). In a good quality ChIP-Seq sample, the peaks on the forward

strand and the reverse strand are offset (phase-shifted) by the size of the “effective

fragment length.” The effective fragment length tends to be shorter than the length of the

fragmented DNA, the length of the size selection, and the pull-down length. Strand Cross-

Correlation calculates the correlation of the strand-specific read densities; the maximum

correlation should occur at the average size of the peak shift across all chromosomes.

For single-end reads, PGS will calculate the phase shift between the reads on the forward

strand and reads on the reverse strand using the method (Pearson cross-correlation)

described by Kharchenko et al. (2008). Note: the estimation of effective fragment length

for single-end reads can only be done on IP samples and not on mock controls since non-

enriched samples do not contain a phase shift. For paired-end reads, Strand Cross-

Correlation is calculated from the distribution of fragment lengths between the paired-ends

of the two reads.
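To make the single-end calculation concrete, here is a minimal sketch of a strand cross-correlation on toy data. It is not the PGS implementation: it assumes read 5' positions on one chromosome are already available as plain integer lists, and all function names are hypothetical.

```python
import math

def strand_densities(starts, chrom_len):
    """Per-base count of read 5' ends for one strand."""
    density = [0] * chrom_len
    for pos in starts:
        density[pos] += 1
    return density

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def cross_correlation(forward, reverse, max_shift):
    """Correlate forward[i] with reverse[i + shift] for every shift up to max_shift."""
    result = {}
    for shift in range(max_shift + 1):
        result[shift] = pearson(forward[: len(forward) - shift], reverse[shift:])
    return result

# Toy data: every reverse-strand pile-up sits 100 bp downstream of a forward one.
forward = strand_densities([100, 100, 101, 400, 400, 700], 1000)
reverse = strand_densities([200, 200, 201, 500, 500, 800], 1000)
cc = cross_correlation(forward, reverse, max_shift=200)
best_shift = max(cc, key=cc.get)
print(best_shift)
```

The shift with the highest correlation plays the role of the effective fragment length. A real calculation would run per chromosome and combine the results, which this sketch leaves out.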

Under QA/QC from the ChIP-Seq Workflow, select Strand Cross-Correlation. If

you have not run this step previously, you will be asked if you would like to

create a new QA/QC child spreadsheet. If prompted, select Yes

After running Strand Cross-Correlation from the QA/QC workflow, the Strand

Separation of Samples viewer will appear (Figure 5)


Figure 5: Viewing the Strand Cross-Correlation plot to estimate effective fragment length

In Figure 5, the x-axis represents the phase-shift, and the y-axis represents the Pearson

correlation of the strand densities of the forward and reverse strands. Notice in the IP

sample, the peak occurs at 111 bp, corresponding to an average effective fragment length

of 111 base pairs. The peak location can be determined by examining the values in the

strand_correlation spreadsheet, by mousing over the peak in the graph, or by sorting the

data in the spreadsheet.

The control sample (blue) does not have a similar peak because it does not have the phase-

shift property of IP samples. The control sample does have a small peak at 26 bp which

corresponds to the sequencing read length. This is probably due to the fact that some

regions in the genome of the control sample contain many reads stacked up on each other

which will create a correlation peak when the forward and reverse strands are shifted by the

length of the reads. At the sequencing read length, the IP-sample will show a strand cross-

correlation near 0.

The location and magnitude of the peaks in the cross-correlation plot can be used as a

measure of the quality of the enriched sample. Figure 5 shows a highly enriched sample

because the peak at 111 bp dominates the peak at the read length. If the dominant peak in

the IP-enriched sample occurred at the read length, the sample was poorly enriched or

contained very few binding sites. The plot in Figure 6 shows two IP-samples with medium-

level enrichment. Multiple dominant peaks in the IP sample may indicate there are several

populations of DNA fragment lengths which will complicate peak calling (Kundaje 2010).


Figure 6: Example of medium-level enriched samples

Detecting Peaks and Enriched Regions

Regions that contain a binding site for the DNA-binding protein of interest will have many

sequence reads mapped to them. Since single-end reads only cover one end of a sequence

fragment, enriched regions will generally show two adjacent peaks. PGS will directionally

extend each SE read in the 3’ direction by the fragment length (extended reads) to facilitate

merging adjacent peaks into a single peak. For PE reads, the fragment length is defined

from the start of the 5’ end of the first read through the 3’ end of its paired read. For peak

detection, PGS divides the genome into windows (bins) of a user-defined size and counts

the number of (midpoints of) the reads that fall within each bin. PGS fits a zero-truncated

negative binomial to the bin counts and finds all regions that are significant at a user-defined false

discovery rate (FDR). See the ChIP-Seq white paper for more information on the peak-

finding algorithm and tips for setting the Fragment extension and window sizes.
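The read extension and binning steps described above can be sketched as follows. This is an illustration under stated assumptions, not the PGS code: coordinates are taken to be 0-based and half-open, the function names are hypothetical, and the zero-truncated negative binomial fit and the FDR cut-off are left out.

```python
from collections import Counter

def extend_read(start, end, strand, fragment_size):
    """Extend a single-end read in its 3' direction up to the fragment size."""
    if strand == "+":
        return start, max(end, start + fragment_size)
    # Reverse-strand reads extend toward lower coordinates; clamp at 0.
    return max(0, min(start, end - fragment_size)), end

def bin_counts(reads, fragment_size, window_size):
    """Count extended-read midpoints per window; returns window index -> count."""
    counts = Counter()
    for start, end, strand in reads:
        ext_start, ext_end = extend_read(start, end, strand, fragment_size)
        midpoint = (ext_start + ext_end) // 2
        counts[midpoint // window_size] += 1
    return counts

# Toy reads of length 26: two forward and two reverse reads flank one site,
# plus a single isolated read elsewhere.
reads = [
    (1000, 1026, "+"),
    (1010, 1036, "+"),
    (1090, 1116, "-"),
    (1100, 1126, "-"),
    (5000, 5026, "+"),
]
counts = bin_counts(reads, fragment_size=110, window_size=110)
print(dict(counts))
```

After extension, the forward and reverse reads around the site land in the same window, which is the reason for extending reads before counting.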

Under Peak Analysis of the ChIP-Seq workflow, select Detect peaks. The Detect

peaks dialog will appear (Figure 7)

Specify the Fragment Extensions by setting the Maximum average fragment size

to 110. Maximum average fragment size is based on your experimental design: the

size of the fragment pulled-down in the immunoprecipitation step, the size used

during DNA fragmentation, the fragment length used for size selection, or the

effective fragment length. If you have used an antibody that binds the DNA as the

control antibody (rather than no-enrichment as the control), you could use

a different fragment length for each sample with the Individual maximum

fragment sizes radio button. For experiments using a mock control (no

enrichment), use Maximum average fragment size. Tips for choosing fragment sizes are

given in the ChIP-Seq white paper

(http://www.partek.com/Tutorials/microarray/User_Guides/ChIP-Seq_White_Paper.pdf)

As this example uses the mock sample as the reference, select mock in the drop-

down list under Reference sample

The peak detection algorithm will divide the genome into windows and find

windows that are enriched with reads based on the FDR value. Set the Window

Size (base pairs) to 110

Peak Cut-off FDR determines the cut-off for the significance of peaks in the chip

sample. Lower cut-off values imply greater differences between the chip and

mock peaks; higher cut-offs lessen the difference in peak heights between the chip

and mock samples. Set the Peak Cut-off FDR to 1 false positive in 1000 (0.001)

Leave the remaining parameters with the default values and select OK

Note: As transcription factor binding sites tend to have localized and sharp clusters of

reads, the window size used during the analysis of a transcription factor study can be

left relatively small (approximately the same as the average fragment length), and the

option to allow for gaps between enriched windows need not be used. Subsequently,

in the Results reporting section, the Region in the window with most reads could also

be selected. Histone modification peaks, on the other hand, tend to be subtle, diffuse,

and spread-out. For that type of analysis, larger windows might be more suitable, and

neighboring windows may be combined (Within a gap distance of option) into larger

windows (under Window size and Results reporting, respectively). The exact settings

depend on the data and the experiment design, so fine tuning is recommended.

The More info link at the top of the dialog box displays a figure which demonstrates

the relationship between window and gap size. Try changing the How should windows

be merged or the Which regions should be reported? options; the blue bar underneath

each figure will reflect how regions are detected and reported with these settings.


Figure 7: Configuring the Peak Detection dialog

Figure 8: Viewing the detected peaks in the samples

The resulting spreadsheet (Figure 8) will appear. The spreadsheet is sorted by chromosome

number and genomic location. Each row represents one genomic region of peak enrichment

whereas the columns are:

1. Chromosome: Chromosome of region

2. Start: Start of region (inclusive)

3. Stop: End of region (exclusive)


4. Sample ID: The sample containing the enriched region

5. Interval Length: length of region, i.e., Stop – Start, in base pairs

6. Maximum Extended Reads in Window: The greatest number of (extended) reads in one of

the windows of a ChIP-Seq region

7. Reads per Million (RPM): Column 6 divided by the total number of aligned reads in the

sample (in millions). This column will help you compare peaks across samples, especially

when there is a large difference in the number of aligned reads between samples

8. Mann-Whitney p-value: Identifies separation between forward and reverse peaks for

single-end reads using the Mann-Whitney U-test. Lower p-values indicate better

separation. This p-value can be used when there was no control sample or to eliminate

reads due to PCR bias

9.-10. Total reads in region: Total number of (non-extended) reads for each sample (chip

and mock, respectively) in the given genomic region

11. p-value(Sample ID vs. mock): Compares each sample to the reference (mock in this

example) using a one-tailed binomial test. A low p-value means there are significantly

more reads in the sample specified in column 4 (that is, for each region) than in the mock

sample. This column is only included if a reference sample is specified in the Peak

Detection dialog (Figure 7)

12. scaled fold change (Sample ID vs. mock): Compares intensity of signal between each

sample (specified in column 4) to the reference sample (mock in this example). The fold-

change is scaled by a ratio of the number of reads for each sample (IP vs. control) on a per-

chromosome basis. Scaled fold changes > 1 indicate more enrichment in the IP-sample than

in the control sample. This column is only included if a reference sample is specified in the

Peak Detection dialog

13.-14. <Sample> overlap percent: Fraction of called region that overlaps a region from

the given sample where <Sample> is the name of the Sample ID in column 4. For example,

the values of 100% in column 13 and 0% in column 14 point to regions detected in the chip

sample, but not in the mock sample. Similarly, regions with the value of 100% in column

14 were detected in the mock sample (and thus might be excluded from downstream

analyses)
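The arithmetic behind columns 7, 11, and 12 can be illustrated with a small sketch. The exact formulas PGS uses are not given in this tutorial, so the null model for the binomial test, the direction of the per-chromosome scaling, and the guard against a zero mock count are all assumptions, and the function names are hypothetical.

```python
from math import comb

def reads_per_million(max_reads_in_window, total_aligned_reads):
    """Column 7: reads in the window per million aligned reads in the sample."""
    return max_reads_in_window / (total_aligned_reads / 1_000_000)

def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

def enrichment_p_value(chip_reads, mock_reads, chip_total, mock_total):
    """One-tailed binomial test for an excess of chip reads in a region.

    Assumed null model: each read in the region comes from the chip sample
    with probability chip_total / (chip_total + mock_total).
    """
    p_null = chip_total / (chip_total + mock_total)
    return binomial_upper_tail(chip_reads, chip_reads + mock_reads, p_null)

def scaled_fold_change(chip_reads, mock_reads, chip_chrom_total, mock_chrom_total):
    """Fold change of chip over mock, scaled by per-chromosome read totals.

    Assumption: the scale factor is mock_chrom_total / chip_chrom_total, and a
    mock count of zero is replaced by one to avoid division by zero.
    """
    scale = mock_chrom_total / chip_chrom_total
    return (chip_reads / max(mock_reads, 1)) * scale

# Library sizes roughly matching the tutorial data (1.7 and 2.3 million reads).
p_enriched = enrichment_p_value(40, 2, 1_700_000, 2_300_000)
p_balanced = enrichment_p_value(4, 6, 1_700_000, 2_300_000)
print(p_enriched < 0.001, p_balanced > 0.05)
```

A region with 40 chip reads and 2 mock reads yields a very small p-value, while a region with 4 chip reads and 6 mock reads does not, which mirrors how column 11 separates true enrichment from background.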

Create a list of enriched regions

You have just created a list of peaks found in both samples. In this section, you will create

a list that filters out peaks detected in the chip sample that also occur in the control (mock)

sample. This list will be used to search for motif binding sites.

Under Peak Analysis of the ChIP-Seq workflow, select Create a list of enriched

regions. The regions found in the IP sample that do not have many reads in the

control sample are of most interest. Use the List Creator functions to filter out

regions that have a high number of reads in the control (mock) sample by using

the p-value against the control


Select Specify New Criteria. Give the new criteria a name such as p-value

filtered, select the 1/regions (peaks) Spreadsheet, and choose Column 11. p-value(Sample ID

vs. mock). To keep only regions with significantly more reads in the sample than in the

control, include p-values that are significant with FDR of 0.05. The dialog should look

like Figure 9.

Select OK

Figure 9: Configure criteria dialog to filter out peaks that occur in the control sample

Before closing the List Creator dialog, Save the list you just created. The

spreadsheet should have 2473 rows. The resulting regions are those that have

significantly more reads in the chip sample than in the mock sample. Select Close

to exit the dialog
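Filtering at an FDR of 0.05 is stricter than keeping every raw p-value below 0.05. The sketch below uses the Benjamini-Hochberg step-up procedure to show the difference; whether the List Creator uses exactly this procedure is an assumption, and the region tuples are invented for illustration.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return one boolean per p-value: True if significant at the given FDR."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= (k / m) * fdr.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            cutoff_rank = rank
    keep = [False] * m
    for idx in order[:cutoff_rank]:
        keep[idx] = True
    return keep

# Hypothetical regions: (chromosome, start, stop, p-value versus mock)
regions = [
    ("chr1", 100, 210, 1e-8),
    ("chr1", 900, 1010, 0.04),
    ("chr2", 50, 160, 0.2),
    ("chr3", 10, 120, 0.001),
]
keep = benjamini_hochberg([r[3] for r in regions], fdr=0.05)
filtered = [r for r, k in zip(regions, keep) if k]
print(filtered)
```

Note that the region with a raw p-value of 0.04 is dropped: it is below 0.05 but does not survive the multiple-testing correction.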

Other List Creator operations (Figure 10) like the Venn Diagram and Union (Or) or

Intersection (And) of the lists could also be performed to create a list of “true” enriched

peaks. For instance, you could filter on the intersection between FDR and Peaks not in

mock or you may choose to filter by scaled fold change or apply a minimum number of

reads per million (RPM). The choice of how to create a list of “true” peaks is up to you and

may be different for different kinds of experimental designs.


Figure 10: List Creator commands

de novo Motif Discovery and Motif Search

Now that you have a list of enriched regions, you will learn how to find recurring patterns

or motifs in these regions. A transcription factor can bind to many sites throughout the

genome. These sites usually share a certain pattern in their sequences (consensus

sequence). By searching for these binding site motifs, you can determine the binding site

pattern and the locations of binding in the genome. PGS detects de novo motifs using the

Gibbs motif sampler (Neuwald et al., 1995).

A known database of transcription factors such as JASPAR (http://jaspar.cgb.ki.se/) can be

searched or de novo motifs may be identified using only the sequences from the identified

regions to find motifs.

Step 1 – de novo Motif Discovery

Under Peak Analysis, select Motif discovery. The two options for motif

discovery, Discover de novo motifs and Search for known motifs, will be

discussed separately

Select Discover de novo motifs and OK

Choose 1/p-value_filtered as the Spreadsheet with genomic regions. Use the

default settings: Number of Motifs 1, Discover motifs of length between 6 and 16

base pairs, and Result file: Motifs. Select OK. If the reference genome has not

been previously downloaded onto this computer, you may be asked if you would

like to download the .2bit reference genome. If prompted, select Automatically

download a .2bit file and OK if PGS is able to connect to the Internet properly. If

you do not have an Internet connection, choose one of the other two options:

(Manually specify a .2bit file or Create a .2bit file from reference fasta files). The


reference genome is required for determining which genes overlap the enriched

peak regions and for displaying the aligned sequences

A motif visualization plot (Figure 11) and two spreadsheets will be generated.

One spreadsheet, motifs (Motifs), contains information about the motif, and the

other, instances (Motifs_instances.txt), lists the genomic locations of the motif. If

your motif does not look exactly like Figure 11, select the Reverse button, which

will give you the reverse complement of the motif

Figure 11: Viewing the binding site motif for NRSF. Use the yellow arrows in the upper

right to cycle through views of all the motifs found (if more than one was found)

Description of Motif Output

Sequence Logo window

The Sequence Logo window (Figure 11) graphically displays the best motif found in the

peak regions of the data. In this case, the motif finder discovered a motif in the NRSF-

enriched regions that is 15 base pairs in length. The height of each position is the relative

entropy (in bits) and indicates the importance of a base at a particular location in the

binding site. The title CAG.ACC..GGA.AG is the consensus sequence for the sequence

logo. Dots represent positions that contain more than one base across all reads in the motif.

The dots can be replaced with letters by checking the Show nucleotide codes checkbox;

doing so will give characters representing the possible bases at that position. For a

description of the IUPAC nucleotide codes, please visit:

http://www.bioinformatics.org/sms/iupac.html.
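The two quantities shown in the logo, the height of each position and the dotted consensus title, can be reproduced from a table of base counts. The following is a minimal sketch that assumes a uniform background of 0.25 per base unless one is supplied; the count matrix is invented and the function names are hypothetical.

```python
import math

def column_information(counts, background=None):
    """Relative entropy (in bits) of one motif position versus the background."""
    background = background or {b: 0.25 for b in "ACGT"}
    total = sum(counts.values())
    bits = 0.0
    for base, count in counts.items():
        if count:
            p = count / total
            bits += p * math.log2(p / background[base])
    return bits

def consensus(count_matrix):
    """One letter where a single base occurs at a position, '.' otherwise."""
    letters = []
    for counts in count_matrix:
        present = [base for base, count in counts.items() if count > 0]
        letters.append(present[0] if len(present) == 1 else ".")
    return "".join(letters)

# Invented counts for a 4-position motif observed at 10 sites.
matrix = [
    {"A": 0, "C": 10, "G": 0, "T": 0},
    {"A": 10, "C": 0, "G": 0, "T": 0},
    {"A": 0, "C": 0, "G": 10, "T": 0},
    {"A": 5, "C": 0, "G": 0, "T": 5},
]
info = [column_information(col) for col in matrix]
title = consensus(matrix)
print(title, info)
```

A fully conserved position reaches the maximum of 2 bits, while a position split evenly between two bases carries 1 bit and appears as a dot in the consensus.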


Motifs spreadsheet

The motif information spreadsheet (Figure 12), entitled Motifs, lists the information about

the motif that was visualized using the sequence logo. This includes the Counts of bases in

each position of the pattern (column 1), the Consensus Sequence (column 2), the Motif ID

(column 3), the Log Likelihood Ratio of the motif (column 4), and the Background

frequency of each of the bases in all of the sequences of that motif. The Log Likelihood

Ratio scores the relative likelihood that the found pattern did not occur by chance.

Figure 12: Viewing the motif spreadsheet

You can (re)display the Sequence Logo of the motif by right clicking on a row header and

selecting Logo View. If more than one motif was found (in the de novo motif dialog, you

only requested one motif to be found), then the yellow arrows shown in Figure 11 may be

used to cycle through the motifs.

Motif_instances spreadsheet

The Motif_instances spreadsheet (Figure 13), a child of the Motifs spreadsheet and entitled

instances, details all of the locations of the motif(s) in the enriched regions. Each row lists

a putative binding site for a motif. The genomic location is given (chromosome, start, end,

and strand), along with the Motif ID, the sequence found at that location, and a score of

how likely that site is part of the motif. The list is sorted in order of descending score. The

larger the score, the more likely the site is a true instance of the motif.


Figure 13: Viewing the motif instances spreadsheet

Step 2 – Search JASPAR for Known Motifs

Repeat the Motif discovery step; however, select the Search for known motifs

radio button and OK. This will search the JASPAR database for motifs that are

over-represented (more than by chance) in the list of sequences in the significant

regions list. The JASPAR database will download automatically if needed during

the Search for known motifs step. Downloading the JASPAR database will create

a spreadsheet in your experiment named JASPAR.txt that contains all of the

species-specific motifs in the database. Visualization of the motifs is done by

right-clicking on a row in the JASPAR.txt spreadsheet and selecting Logo View.

The yellow arrows in the upper right corner may be used to cycle

through visualization of the motifs in the JASPAR database

The motif search should be performed on the p-value_filtered list. You may

search for a particular element in the database or all of the elements in the

database. For this tutorial, use the defaults and search for all of the motifs listed in

JASPAR database (Figure 14). Select OK

Alternatively, you can search the list of sequences for a single motif specified by a

valid nucleotide sequence (Search for motif) or, if you want to look for several motifs,

import them as a tab-delimited text file (Import motifs from text file).

This feature may also be used to import motifs from other databases to which you have

access (TRANSFAC®, custom database, etc.). Use the help button for the specification of

the format of the text file. Sequence Quality value is a number between 0 and 1 and


indicates how closely a sequence must match the pattern for it to be called an instance of

the pattern. The higher the value, the closer it must match the pattern to be called.

Figure 14: Search for JASPAR Motifs in Sequences dialog

Two resulting spreadsheets, similar to the spreadsheets in the de novo motif

discovery step, will be generated, the motif_summary (MotifSearch) spreadsheet

(Figure 15) and the motif_instances (MotifSearch.instance) spreadsheet

Sort the motif_summary spreadsheet by p-value by right-clicking on the p-value

column and selecting Sort Ascending


Figure 15: Motif_summary spreadsheet. Each motif from the JASPAR database (or other

input database used) will be shown. Probability of Occurrence (column 2) is the

probability of detecting a false positive for this motif in a random DNA sequence. Expected

Number of Occurrences (column 3) is the Probability of Occurrence times the total length

of the reads. Actual Number of Occurrences (column 4) is the count of sequences that

match the known motif in the reads. P-value (column 5) is the uncorrected p-value

(binomial test)

As you can see in Figure 15, REST (another name for NRSF) is at the top of the list. The

spreadsheet indicates that the expected number of by-chance occurrences of the

NRSF/REST motif is less than 1, but in fact, 1071 occurrences of the motif were observed,

resulting in a very low p-value (0). This motif agrees with the motif found in the de novo

motif detection step. Interestingly, other motifs appear a significant number of times in the

ChIP-Seq peaks and may represent possible co-factors.
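The relationship between columns 2 to 5 of the motif_summary spreadsheet can be sketched numerically. The example below treats every sequence position as an independent trial, which is an assumption, and uses invented numbers rather than the values in Figure 15. The binomial tail is summed in log space so that it stays stable for long sequences.

```python
import math

def log_binomial_pmf(k, n, p):
    """Natural log of the Binomial(n, p) probability of exactly k successes."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + k * math.log(p) + (n - k) * math.log1p(-p)

def motif_enrichment(prob_per_position, total_length, observed):
    """Return (expected occurrences, P(X >= observed)) under the binomial model."""
    expected = prob_per_position * total_length
    p_value = 0.0
    for k in range(observed, total_length + 1):
        term = math.exp(log_binomial_pmf(k, total_length, prob_per_position))
        p_value += term
        # Past the mean the terms only shrink; stop once they no longer matter.
        if k > expected and term <= p_value * 1e-17:
            break
    return expected, min(p_value, 1.0)

# Invented example: chance of a false match 1e-6 per position, 300,000 bp searched.
expected, p_value = motif_enrichment(1e-6, 300_000, observed=5)
print(expected, p_value)
```

When the observed count is far above the expected count, every term underflows and the function returns 0.0, which is consistent with a p-value being displayed as 0 in the spreadsheet.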

The motif_instances spreadsheet contains all instances of the motifs (with actual counts >0)

from the motif_summary spreadsheet.


Generating a list of regions containing the REST motif (Optional)

Because the motif_instances spreadsheet contains every instance of every motif identified,

you may wish to create a spreadsheet of just the REST instances that contains the locations

of each of the 1071 instances of the REST motif.

Select the motif_instances (MotifSearch.instance) spreadsheet in the

spreadsheet navigator

Select the Motif Name column header (column 5) in the spreadsheet

Right-click and select Find / Replace / Select as shown in Figure 16

Figure 16: Finding all REST peaks (step 1)

In the next dialog, at Find What, type in REST and choose Select All at the

bottom of the screen. This finds and selects the 1071 instances of the REST motif

as shown in Figure 17


Figure 17: Selecting all REST instances in motif_instances spreadsheet (step 2)

Close the dialog. You will notice that in the original spreadsheet, the focus has

shifted so that row 12848 is now highlighted and visible in the view.

Right-click on row 12848 and select Filter Include (Figure 18)

Figure 18: Including all REST instances that were identified by Find / Replace / Select

Notice now that the motif_instances spreadsheet has 1071 rows and that a filter

has been applied (Figure 19)


Figure 19: Filtered motif_instances spreadsheet contains 1071 REST instances. The black

and yellow bar at the far right shows that a filter has been applied to this spreadsheet

Filters are very powerful but will slow down spreadsheet operations on the original list.

Furthermore, the filter operation does not create a brand new spreadsheet. In order to create

a spreadsheet that only contains the REST instances, it is necessary to clone the original

spreadsheet with the filter applied, save the clone with a new name, and clear the filter

from the original spreadsheet.

Right click on motif_instances in the spreadsheet navigator and then select

Clone…

In the Clone Spreadsheet dialog, type REST for Name of resulting copy and

select 1/p-value_filtered/motif_summary (MotifSearch) from the pull-down

menu of Create as a child of spreadsheet

This creates a new spreadsheet in the spreadsheet navigator that has not been

saved (there is an * after the spreadsheet name). Save the spreadsheet by right-

clicking on the spreadsheet and selecting Save As… and type in REST as the File

name

To remove the filter from the original spreadsheet, right-click on motif_instances

(MotifSearch) in the spreadsheet navigator. Notice the yellow/black bar on the

right (also shown in Figure 19). Right-click anywhere in the yellow/black bar and

select Clear Filter. Now both the original spreadsheet and the REST spreadsheet

exist without filters
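If the motif_instances spreadsheet is exported as a tab-delimited text file, the same selection can be reproduced outside PGS. The sketch below is only an illustration: apart from Motif Name, the column headers are hypothetical, and it matches the motif name exactly, whereas the Find / Replace / Select dialog may match differently.

```python
import csv
import io

def filter_by_motif(tsv_text, motif_name, column="Motif Name"):
    """Return the rows of a tab-delimited table whose motif column equals motif_name."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row for row in reader if row[column] == motif_name]

# Invented excerpt of an exported motif_instances table.
example = (
    "Chromosome\tStart\tStop\tStrand\tMotif Name\tScore\n"
    "chr1\t1000\t1021\t+\tREST\t14.2\n"
    "chr1\t5000\t5010\t-\tSP1\t9.8\n"
    "chr2\t300\t321\t-\tREST\t12.7\n"
)
rest_rows = filter_by_motif(example, "REST")
print(len(rest_rows))
```

As with cloning the filtered spreadsheet, the result is a separate list and the original table is left unchanged.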

Finding Nearest Genomic Features

In this section, you will learn how to find genomic features (genes) that are near the IP-

enriched regions of the data. You will also learn how to classify the peak locations by gene

section (5’ UTR, 3’ UTR, Promoter, CDS).

Step 1 – Specify the Database

Make sure the spreadsheet that you want to overlap with genes is active. In this

case, you want to detect overlaps on the p-value_filtered spreadsheet, so select the

p-value_filtered spreadsheet

Under Peak Analysis, select Find nearest genomic feature. A dialog, similar to

Figure 20, will appear. Select RefSeq Transcripts. A download of the database

will be started if this information has not previously been downloaded onto your

computer. Leave the promoter region boundaries as default and select OK


Figure 20: Configuring the dialog for finding genes that overlap enriched regions of the

data

Step 2 – View the List of Nearest Genomic Features

The resulting spreadsheet (gene-list) (Figure 21) is a child of the p-value_filtered

spreadsheet. Each row represents a transcript with Transcript ID (column 5), Gene Symbol

(column 6), and genomic location of the transcript (columns 1-3). Distance to TSS (column

7) gives distance of each enriched region to the transcription start site (in base pairs;

positive means downstream and negative means upstream). Overlap with gene and region

are given in columns 8 and 9, respectively. Columns 10 and greater were already discussed

under the Detecting Peaks and Enriched Regions section.

Note: Percent overlap with gene is more likely to be high (close to 1) in cases where one

region covers several genes (for example, histone studies). Percent overlap with region is

likely to be high (close to 1) if a region is relatively small and is found completely within a

gene (for example, transcription factor binding studies). If both columns are close to 1, then

the gene and the region have nearly the same start and stop locations. If both columns are

small (close to 0) then the region doesn't overlap with the gene directly but the region

found likely covers only the promoter region.
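The overlap columns and the signed distance to the TSS can be worked through on a toy example. The sketch assumes half-open coordinates and measures the distance from the midpoint of the region; how PGS measures the distance is not stated here, so treat that choice, along with the function names, as hypothetical.

```python
def overlap_fractions(region, gene):
    """Return (fraction of the gene covered, fraction of the region inside the gene)."""
    r_start, r_stop = region
    g_start, g_stop = gene
    shared = max(0, min(r_stop, g_stop) - max(r_start, g_start))
    return shared / (g_stop - g_start), shared / (r_stop - r_start)

def distance_to_tss(region, tss, strand):
    """Signed distance from the region midpoint to the TSS.

    Positive means downstream of the TSS, negative means upstream,
    taking the strand of the transcript into account.
    """
    midpoint = (region[0] + region[1]) // 2
    offset = midpoint - tss
    return offset if strand == "+" else -offset

gene = (1000, 5000)
inside = (1500, 1610)    # small region fully within the gene
promoter = (800, 910)    # region just upstream of a forward-strand gene

print(overlap_fractions(inside, gene))
print(overlap_fractions(promoter, gene))
print(distance_to_tss(promoter, tss=1000, strand="+"))
```

The small region inside the gene has a low overlap with the gene but a full overlap with the region, the transcription factor pattern described in the note, while the upstream region overlaps neither and has a negative distance to the TSS.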

Another way to interpret the percent overlap with region and percent overlap with gene is

to use Peak Analysis > Classify regions by gene section. This step is left for you to try on

your own (the input should be a region list or filtered region list).


Figure 21: Identifying closest genomic features to regions spreadsheet

Visualize Reads and Enriched Regions

You have gone through the steps for importing data and detecting TF-enriched regions and

have identified potential binding sites within these enriched regions. You might explore the

functions of the genes these binding sites regulate by using other Biological Interpretation

tools like GO Enrichment or Pathway Analysis which are discussed in other tutorials.

In this section, viewing the ChIP-Seq data using PGS’s Genome Viewer will be explained.

For more information about the viewer, see NGS Chromosome Viewer.

Step 1 – Load the Data into the Viewer

Select the parent spreadsheet (WoldChipSeqBamFiles) containing the list of

samples in two rows: one for the chip sample and one for the mock sample

Under Visualization on the ChIP-Seq Workflow, select Plot chromosome view.

The left-hand side contains a list of tracks that can be visualized. The tracks that

are shown by default are (from the top) the transcript tracks, the sequence read

visualization tracks, and the cytoband track (Figure 22)

GO Enrichment tutorial: http://www.partek.com/Tutorials/microarray/GO_Enrichment/GO_Enrichment.pdf

NGS Chromosome Viewer user guide: http://www.partek.com/Tutorials/microarray/User_Guides/ChromosomeViewer.pdf


Figure 22: Viewing the sample reads on chromosome 1

To add additional tracks, select the New Track button on the left-hand side of the

viewer. Choose Add tracks from a list of spreadsheets and select Next

Add the p-value_filtered.txt track and add the Motifs_instances.txt track by

selecting the appropriate checkboxes. Uncheck Aligned Reads as these tracks are

already being displayed. Select Create

This will display the enriched regions found in the samples and the locations of the motif

instances from the de novo motif discovery (additionally, you could display the regions that

were found by searching the JASPAR database). If you have not gone through the steps for

peak detection and motif discovery, these tracks will not be available. The viewer in Figure

23 will appear.


Figure 23: Adding p-value and motif binding site tracks

The two resulting tracks display the detected regions at each location on the chromosome

for the NRSF-enriched sample (chip) and align them to the de novo discovered motif

binding sites. Switching between chromosomes is possible by selecting a chromosome

from the drop-down menu at the top of the window. The positions of the tracks can be

changed by dragging the names of the tracks on the left-hand side of the viewer to the

appropriate locations.

Step 2 – Explore the Data

Change the Genomic Scale via Zoom using the Mouse

Select the magnifying glass icon to zoom in on the data. You can zoom using one of

several methods: (1) clicking and drawing a box on the plot with the left mouse button,

(2) using the mouse scroll wheel, (3) using the magnifying glass icons at the bottom

of the screen, or (4) sliding the bar between the icons. Figure 24

shows a zoomed-in view of one of the enriched regions.

Selecting the home icon at the bottom of the screen will reset the view to show the whole

chromosome. Selecting the selection icon allows you to select a track and change the

properties of that track.

Select the selection icon and then select the chip track (or select the Bam Profile (chip) track from the

list of tracks in the left pane)

Under the Style tab, select Histogram, Alignments and select Color by Strands. Select

Apply and the viewer in Figure 24 will appear


The chip region indicates the location of the ChIP-Seq peak. Since only the ends of the

resulting fragments from the ChIP assay are read, enriched regions will generally contain

two peaks, one for the forward reads (shown in green) and one for the reverse reads (shown

in red). The control (mock.bam) does not contain an enriched region at this site. The two

motif binding site regions indicate that there are two potential binding sites in this region.

Figure 24: Viewing the zoomed-in view of an enriched region showing two possible

binding sites at this location
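The strand-specific pattern described above (a forward-read peak and a reverse-read peak flanking the binding site) can be sketched numerically. The example below is illustrative only and is not the peak-detection method used by PGS: the function name, the bin size, and the read positions are made up, and taking the midpoint between the two strand summits is just one simple way to estimate the binding location.

```python
from collections import Counter


def strand_peak_summits(reads, bin_size=10):
    """Locate the densest bin of 5' read ends on each strand.

    reads: iterable of (five_prime_position, strand), strand is '+' or '-'.
    Returns (forward_summit, reverse_summit) as bin-centre coordinates.
    Assumes at least one read on each strand.
    """
    histograms = {"+": Counter(), "-": Counter()}
    for position, strand in reads:
        histograms[strand][position // bin_size] += 1

    summits = {}
    for strand, histogram in histograms.items():
        # Highest count wins; ties are broken in favour of the leftmost bin.
        best_bin, _ = max(histogram.items(), key=lambda kv: (kv[1], -kv[0]))
        summits[strand] = best_bin * bin_size + bin_size // 2
    return summits["+"], summits["-"]


# Made-up 5' read ends around a hypothetical binding site near position 1000.
forward_reads = [(p, "+") for p in (905, 912, 914, 918, 921, 940, 960)]
reverse_reads = [(p, "-") for p in (1040, 1062, 1075, 1081, 1083, 1088, 1095)]

forward_summit, reverse_summit = strand_peak_summits(forward_reads + reverse_reads)
strand_offset = reverse_summit - forward_summit
site_estimate = (forward_summit + reverse_summit) // 2

print("forward summit:", forward_summit)
print("reverse summit:", reverse_summit)
print("offset between strand peaks:", strand_offset)
print("estimated binding position:", site_estimate)
```

In this toy data the forward reads pile up to the left of the site and the reverse reads to the right, which is the same arrangement the green and red histograms show in the viewer.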

Shortcut to Showing an Enriched Region

To go to an enriched region from the p-value spreadsheet, right-click on the row

header of the region of interest in the p-value_filtered spreadsheet and select

Browse to Location; this action will automatically go to the coordinates of the

region

You can also type the name of a gene in the text box at the top of the viewer next

to the magnifying glass, and the viewer will display the location of that gene. For

example, typing NEUROD1 goes immediately to the NEUROD1 gene (Figure

25)


NEUROD1 contains a binding site for the NRSF motif. Notice that the enriched region for

the NRSF transcription factor is within the NEUROD1 gene. As discussed in the Johnson

et al. paper, NRSF is implicated in the repression of NEUROD1, but it was unknown

exactly where the NRSF binding occurred. This data indicates that the binding site is within

the NEUROD1 gene itself, as shown by the orange box in the Regions track.

Figure 25: Viewing the zoomed-in view of NEUROD1 gene

You may also save the reads shown in the visible genome browser window: in selection

mode, right-click in the peak area and select Dump Displayed Reads to

Spreadsheet.
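Outside PGS, the same idea (keep only the reads that fall in the visible window and write them to a table) can be sketched in a few lines. This is a hedged illustration rather than a description of what Dump Displayed Reads to Spreadsheet does internally: the record fields, function names, and read coordinates are hypothetical, and real reads would come from the aligned .bam files rather than a hand-written list.

```python
import csv
import io


def reads_in_window(reads, chrom, start, end):
    """Keep reads that overlap the half-open window [start, end) on chrom."""
    return [
        read for read in reads
        if read["chrom"] == chrom and read["start"] < end and read["end"] > start
    ]


def dump_reads(rows, handle):
    """Write the selected reads as a tab-delimited table with a header row."""
    fields = ["chrom", "start", "end", "strand"]
    writer = csv.DictWriter(handle, fieldnames=fields, delimiter="\t")
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical aligned reads; the coordinates are invented for illustration.
toy_reads = [
    {"chrom": "chr1", "start": 100, "end": 136, "strand": "+"},
    {"chrom": "chr1", "start": 480, "end": 516, "strand": "-"},
    {"chrom": "chr1", "start": 990, "end": 1026, "strand": "+"},
    {"chrom": "chr2", "start": 120, "end": 156, "strand": "+"},
]

# Pretend the viewer currently shows chr1:450-1000.
visible = reads_in_window(toy_reads, "chr1", 450, 1000)

buffer = io.StringIO()
dump_reads(visible, buffer)
print(buffer.getvalue())
```

A read that only partly extends into the window is kept here; whether PGS treats partially visible reads the same way is not stated in this tutorial.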

Additional Analysis

In addition to the items covered in this tutorial, detecting SNPs in the ChIP-Seq sample is

possible. You may look for differences in nucleotides across the samples or against a

reference genome. This analysis is the same for all of the next-generation sequencing

workflows (ChIP-Seq, RNA-Seq, and DNA-Seq) and so is not covered in this tutorial.

Also, the ChIP-Seq results can be merged with gene expression data using the Genomic

Integration step in the ChIP-Seq workflow.

End of Tutorial

For additional assistance, contact our technical support staff at +1-314-878-2329 or email

[email protected].



References

Johnson, D. S., Mortazavi, A., Myers, R. M., & Wold, B. (2007). Genome-wide mapping of in vivo protein-DNA interactions. Science, 316.

Kharchenko, P. V., Tolstorukov, M. Y., & Park, P. J. (2008). Design and analysis of ChIP-Seq experiments for DNA-binding proteins. Nature Biotechnology, 26.

Kundaje, A. (2010). The phantom-peak coefficient as measure of *-seq data quality. Retrieved from ftp://encodeftp.cse.ucsc.edu/users/akundaje/phantomPeakQuality/ThePhantomPeakCoefficient.pdf

Neuwald, A. F., Liu, J. S., & Lawrence, C. E. (1995). Gibbs motif sampling: detection of bacterial outer membrane protein repeats. Protein Science, 4.

Tutorial last revised: Feb. 2012

Copyright 2012 by Partek Incorporated. All Rights Reserved. Reproduction of this material without express written consent

from Partek Incorporated is strictly prohibited.
