Forsharing cshl2011 sequencing

High-Resolution Views of Cancer Genomes

The Central Dogma

+

Your Nature Paper

Our First Experiment

Overview of BAC in the Genome

Sequencing a BAC

Sequence Coverage

Repeats

Repeats

Repeats are not created equal

Genomic Sequencing

Targeting the Exome

  Long oligos synthesized on arrays (DNA)

  RNA baits synthesized from DNA oligo template

  RNA baits hybridized to DNA sequencing library

  Targets captured using beads and biotin-labeled baits

  RNA bait degraded, leaving sequencing library enriched for target regions

Data Flow

  FASTQ files generated by Illumina pipeline

  Aligned to reference genome (hg18, excluding _random, unmapped, and hap) using Novoalign

  SAM/BAM used extensively

  Follow Broad Institute GATK pipeline for exome capture

  Use Picard Java library for quality assessment

  Processed BAM files available via local http for browsing

Data Pipeline....

  Samtools import

  Samtools sort

  Picard MarkDuplicates

  GATK Indel Realignment

  GATK Quality Recalibration

  Picard QC metrics

Realignment around Indels

  The problem

-  Aligners align each read independently

-  Potentially leads to increased error rates around indels

  A potential solution

-  Locally realign reads in regions that might harbor an indel

-  Goal is to align reads overlying indels more accurately, reducing errors in each read and, in turn, reducing SNV call error rates
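To make the "regions that might harbor an indel" idea concrete, here is a toy sketch that scans CIGAR strings for insertions or deletions and merges padded windows around them. It is an illustration under stated assumptions, not the GATK algorithm: the `(chrom, start, cigar)` read tuples, the `pad` parameter, and both function names are hypothetical.

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def indel_positions(start, cigar):
    """Yield reference positions of indels in one read.

    start is the 1-based leftmost aligned reference position.
    """
    ref_pos = start
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "I":
            yield ref_pos            # insertion: consumes no reference bases
        elif op == "D":
            yield ref_pos            # deletion: starts here, then skips ahead
            ref_pos += length
        elif op in "MN=X":
            ref_pos += length        # these operations consume reference bases
        # S, H and P consume no reference bases

def candidate_intervals(reads, pad=10):
    """Return merged (chrom, start, end) windows around observed indels.

    reads is an iterable of (chrom, start, cigar); pad is a hypothetical
    window half-width, not a value taken from any real tool.
    """
    raw = sorted(
        (chrom, max(1, pos - pad), pos + pad)
        for chrom, start, cigar in reads
        for pos in indel_positions(start, cigar)
    )
    merged = []
    for chrom, lo, hi in raw:
        if merged and merged[-1][0] == chrom and lo <= merged[-1][2]:
            merged[-1][2] = max(merged[-1][2], hi)   # extend overlapping window
        else:
            merged.append([chrom, lo, hi])
    return [tuple(m) for m in merged]
```

Reads whose alignments disagree about the same indel fall into one merged window, which is the kind of region a local realigner would then revisit as a group rather than read by read.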

Quality Recalibration

  Since most SNV callers will rely on quality scores to estimate error probabilities, having the best possible estimates for error rates is important

  Reported error rates from the Illumina sequencer generally reflect technical parameters of the base call process, but not other systematic biases

  Quality recalibration can include covariates to account for systematic biases

-  Cycle count, dinucleotide context, original quality, and sample/library variables
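The covariate idea can be sketched as an empirical lookup table: group bases by covariates, count mismatches against the reference, and convert the observed error rate back to a Phred-scaled quality. This is a simplified illustration, not the GATK implementation. It assumes only two covariates (reported quality and cycle), assumes mismatches at known variant sites have already been excluded, and uses a hypothetical function name and input layout.

```python
import math
from collections import defaultdict

def recalibrate(observations):
    """Build an empirical quality table from observed mismatches.

    observations: iterable of (reported_quality, cycle, is_mismatch).
    Returns {(reported_quality, cycle): empirical_quality}.
    """
    counts = defaultdict(lambda: [0, 0])   # [mismatches, total bases]
    for quality, cycle, is_mismatch in observations:
        bin_counts = counts[(quality, cycle)]
        bin_counts[0] += int(is_mismatch)
        bin_counts[1] += 1

    table = {}
    for key, (mismatches, total) in counts.items():
        # Add-one smoothing keeps the rate away from 0 in sparse bins.
        error_rate = (mismatches + 1) / (total + 2)
        table[key] = round(-10 * math.log10(error_rate))
    return table
```

For example, if 998 bases reported as Q30 at cycle 1 contain 9 mismatches, the smoothed error rate is 10/1000, so the table maps `(30, 1)` to Q20: the reported quality was optimistic for that bin.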

Variant Calling and Evaluation

A developing art

Sequencing Tumor/Normal Pairs

Good SNP

Suspect Variant

Somatic (tumor only) Variant

Likely False Positive (normal only)

LOH
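The parenthetical definitions above (tumor only, normal only) suggest a simple comparison of genotype calls at each site. The sketch below is one hedged reading of those labels, not the classifier used to produce the slides. The genotype codes `"ref"`, `"het"`, and `"hom"`, the function name, and the rule that LOH means heterozygous in normal and homozygous variant in tumor are all assumptions. "Good SNP" and "Suspect Variant" are not defined in the text, so they are not modeled.

```python
def classify_pair(normal, tumor):
    """Classify one site from paired genotype calls.

    normal, tumor: one of "ref", "het", "hom" (hypothetical genotype codes).
    """
    if normal == "ref" and tumor == "ref":
        return "no variant"
    if normal == "ref":
        return "somatic (tumor only)"
    if tumor == "ref":
        # Following the slide label; loss of the variant allele in the
        # tumor would look the same and is not separated out here.
        return "likely false positive (normal only)"
    if normal == "het" and tumor == "hom":
        return "LOH"
    return "shared (germline)"
```

A real caller would weigh read depth and base qualities in both samples before assigning any of these labels.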

NCI60 Exome Sequencing

No Normals Available!

Variants by Genomic Location

All Coding Variants

Type 1: in dbSNP, Type 2: not in dbSNP

Coding, novel (no dbSNP)

Copy Number from Exomes

Complete Genome Sequencing

Complete Genomics Data

Data

  Delivery

-  Via USB results

  Storage

  Sizes are LARGE

-  400GB per sample as delivered with raw reads included

  Should use 2-location backed-up storage

-  Not trivial to find such storage, so might resort to multiple USB drives

  Minimize:

-  Data movement

-  Keeping multiple copies indefinitely

Breakdown of Data Sizes

Data

  Delivery

  Storage

  Processing

  Data are typically tab-delimited text files, so Excel can be useful for examining individual small files

  Generally, command-line tools needed

  MacOS and Linux are the only supported operating systems, but Windows might work....

  Some analyses (snpdiff) require large memory

Directory Structure

Workflows

  Tumor/Normal

-  Copy Number

-  Structural Variation

-  Annotated Somatic Variants

  Germline

-  List of annotated genotypes per individual, summarized into a single file that can be used for filtering
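The germline summarization step can be pictured as a join of per-individual tab-delimited genotype tables into one table with a column per individual. The sketch below is a minimal illustration only: the column names (`chrom`, `pos`, `ref`, `alt`, `genotype`), the `"NA"` placeholder, and the function name are assumptions, not the actual file layout delivered with the data.

```python
import csv
import io

def summarize_genotypes(tables):
    """Merge per-individual genotype tables into one table.

    tables: {individual_name: tab-delimited text with a header row
             chrom, pos, ref, alt, genotype} (hypothetical layout).
    Returns (header, rows) with one genotype column per individual.
    """
    individuals = sorted(tables)
    merged = {}
    for name in individuals:
        reader = csv.DictReader(io.StringIO(tables[name]), delimiter="\t")
        for row in reader:
            key = (row["chrom"], int(row["pos"]), row["ref"], row["alt"])
            merged.setdefault(key, {})[name] = row["genotype"]

    header = ["chrom", "pos", "ref", "alt"] + individuals
    rows = [
        list(key) + [calls.get(name, "NA") for name in individuals]
        for key, calls in sorted(merged.items())
    ]
    return header, rows
```

With every individual in its own column, filters such as "variant in the child but absent from both parents" become a simple row-wise test on the single file.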

Germline Workflow

Germline Workflow

  Output

  Future directions

  Be “smarter” about inheritance framework

  Further refinements of comparison to other data types (exomes, SNP arrays, RNA-seq)

Tumor/Normal Workflow

Medvedev et al., Nature 2009

The Cancer Genome Atlas Research Network Nature 000, 1-8 (2008) doi:10.1038/nature07385

Frequent genetic alterations in three critical signalling pathways.

Chromatin

  Chromatin is the complex of protein and DNA that makes up the chromosomes. It is not a static structure.

  DNAse is an enzyme that cuts DNA at locations where DNA is accessible

  These “accessible” regions have been associated with open chromatin

  Regions of open chromatin are necessary for transcriptional and regulatory machinery to have access to gene neighborhoods and facilitate transcription

DNAse Hypersensitivity

  Method for finding regions of “open” chromatin

  In data published with the ENCODE consortium, DNAse hypersensitive (HS) sites were shown to be correlated with:

-  Histone modification

-  Transcription start sites

-  Early replicating regions

-  Transcription factor binding sites (experimentally determined by ChIP/chip, etc.)

Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. The ENCODE Consortium. Nature, 2007.

DNAse-chip Method

Crawford, G.E., Davis, S., Scacheri, P.C., Renaud, G., Halawi, M.J., Erdos, M.R., Green, R., Meltzer, P.S., Wolfsberg, T.G., and Collins, F.S. Nat Methods, 2006

DNAse-Seq Method

Crawford, G.E., Davis, S., Scacheri, P.C., Renaud, G., Halawi, M.J., Erdos, M.R., Green, R., Meltzer, P.S., Wolfsberg, T.G., and Collins, F.S. Nat Methods, 2006

DNAse Sites Relative to Genes

DNAse HS Sites and Gene Expression

  DNAse HS sites near transcription start sites are associated with actively transcribed genes.

  Distances between sequences in non-DNAse HS regions have an oscillating pattern with a period that corresponds to a single turn of the double helix

  DNAse is known to cut preferentially in the minor groove, which is exposed every 10.4 bases when DNA is wrapped around a nucleosome

  A nucleosome is wrapped by 147 base pairs of DNA

  Implication: Nucleosomes are positioned in a highly organized, precise manner
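The two numbers above can be tied together with a little arithmetic, and the "oscillating pattern" can be looked for by tabulating distances between cut positions. The sketch below is illustrative only: the function names and the `max_distance` parameter are hypothetical, and it is not the analysis used for the published figures.

```python
from collections import Counter

HELICAL_PERIOD_BP = 10.4    # bases per exposure of the minor groove (from the text)
NUCLEOSOME_DNA_BP = 147     # base pairs wrapped around one nucleosome (from the text)

def helical_turns(length_bp=NUCLEOSOME_DNA_BP, period_bp=HELICAL_PERIOD_BP):
    """Number of helical turns spanned by DNA of the given length."""
    return length_bp / period_bp

def distance_histogram(positions, max_distance):
    """Count pairwise distances (up to max_distance) between cut positions."""
    positions = sorted(positions)
    histogram = Counter()
    for i, left in enumerate(positions):
        for right in positions[i + 1:]:
            distance = right - left
            if distance > max_distance:
                break                     # positions are sorted, so stop early
            histogram[distance] += 1
    return histogram
```

`helical_turns()` gives roughly 14 turns per nucleosome. If cuts prefer the exposed minor groove, a histogram built from real cut positions would be expected to show peaks spaced about 10 bases apart.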

Nucleosome Positioning

The Last Mile

Technology