High-Resolution Views of Cancer Genomes
Jul 05, 2015
The Central Dogma
+
Your Nature Paper
Our First Experiment
Overview of BAC in the Genome
Sequencing a BAC
Sequence Coverage
Repeats
Repeats are not created equal
Genomic Sequencing
Targeting the Exome
Long oligos synthesized on arrays (DNA)
RNA baits synthesized from DNA oligo template
RNA baits hybridized to DNA sequencing library
Targets captured using beads and biotin-labeled baits
RNA bait degraded, leaving sequencing library enriched for target regions
Data Flow
FASTQ files generated by Illumina pipeline
Aligned to reference genome (hg18, excluding _random, unmapped, and hap) using Novoalign
SAM/BAM used extensively
Follow Broad Institute GATK pipeline for exome capture
Use Picard Java library for quality assessment
Processed BAM files available via local HTTP for browsing
Data Pipeline....
Samtools import
Samtools sort
Picard MarkDuplicates
GATK Indel Realignment
GATK Quality Recalibration
Picard QC metrics
Realignment around Indels
The problem
- Aligners align each read independently
- Potentially leads to increased error rates around indels
A potential solution
- Locally realign reads in regions that might harbor an indel
- Goal is to align reads overlying indels more accurately, reducing errors in each read and, in turn, reducing SNV call error rates
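A toy illustration of why independent alignment can go wrong near an indel. This is a sketch on made-up sequences, not the GATK realignment algorithm; the helper names `mismatches_ungapped` and `mismatches_with_deletion` are hypothetical. A read that overlaps a deletion only near its end is placed without a gap, which turns the deletion into a run of mismatches that can look like SNVs. Placing the same read with the deletion removes them.

```python
def mismatches_ungapped(ref, read, start):
    """Count mismatches when the read is placed on the reference with no gap."""
    window = ref[start:start + len(read)]
    return sum(1 for r, q in zip(window, read) if r != q)


def mismatches_with_deletion(ref, read, start, del_offset, del_len):
    """Count mismatches when del_len reference bases are skipped after
    del_offset read bases (the read is modeled as carrying a deletion)."""
    left = ref[start:start + del_offset]
    right_start = start + del_offset + del_len
    right = ref[right_start:right_start + len(read) - del_offset]
    return sum(1 for r, q in zip(left + right, read) if r != q)


# Made-up reference; the sample genome lacks the 3 bases at positions 12-14.
REF = "ACGTACGTTAGCGGGCATTCAGGA"
SAMPLE = REF[:12] + REF[15:]

# A read from the sample that covers the deletion only near its 3' end.
READ = SAMPLE[2:16]

ungapped = mismatches_ungapped(REF, READ, 2)
gapped = mismatches_with_deletion(REF, READ, 2, 10, 3)
print(ungapped, gapped)  # 4 0
```

In this example the ungapped placement produces four apparent mismatches in the last four bases of the read, while the gapped placement produces none.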
Quality Recalibration
Since most SNV callers will rely on quality scores to estimate error probabilities, having the best possible estimates for error rates is important
Reported error rates from the Illumina sequencer generally reflect technical parameters of the base call process, but not other systematic biases
Quality recalibration can include covariates to account for systematic biases
- Cycle count, dinucleotide context, original quality, and sample/library variables
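The idea of recalibrating against covariates can be sketched as follows. This is a minimal illustration on made-up data, not the GATK implementation: base calls are grouped by (reported quality, cycle, dinucleotide), and an empirical Phred-scaled quality is computed from the observed mismatch rate in each group. The function names and the add-one smoothing are assumptions made for the example.

```python
import math
from collections import defaultdict


def phred(error_rate):
    """Convert an error probability to a Phred-scaled quality score."""
    return -10.0 * math.log10(error_rate)


def empirical_qualities(observations):
    """Group base calls by covariates and compute an empirical quality per group.

    observations: iterable of (reported_quality, cycle, dinucleotide, is_mismatch).
    Returns {(reported_quality, cycle, dinucleotide): empirical_quality}.
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [mismatches, total]
    for reported_q, cycle, dinuc, is_mismatch in observations:
        entry = counts[(reported_q, cycle, dinuc)]
        entry[0] += int(is_mismatch)
        entry[1] += 1
    # Add-one smoothing (an assumption) keeps the rate away from 0 and 1.
    return {
        key: phred((mism + 1) / (total + 2))
        for key, (mism, total) in counts.items()
    }


# Made-up data: 998 bases reported as Q30 at cycle 50 after "AC"; 98 are mismatches.
calls = [(30, 50, "AC", i < 98) for i in range(998)]
table = empirical_qualities(calls)
print(round(table[(30, 50, "AC")], 1))
```

Here bases reported at Q30 (a nominal 1-in-1000 error rate) show an observed mismatch rate near 1 in 10, so the recalibrated quality for that group would be about Q10.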
Variant Calling and Evaluation
A developing art
Sequencing Tumor/Normal Pairs
Good SNP
Suspect Variant
Somatic (tumor only) Variant
Likely False Positive (normal only)
LOH
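One way to make these categories concrete is a small rule table over the genotype calls in the normal and the tumor. The rules below are assumptions chosen for illustration and are not the criteria used in the talk. In particular, "Suspect Variant" is used here only as a fallback for combinations the other rules do not cover, and `classify_pair` is a hypothetical helper.

```python
def classify_pair(normal_gt, tumor_gt):
    """Classify a site from genotype calls in a matched normal and tumor.

    Genotypes are "ref" (homozygous reference), "het", or "hom"
    (homozygous variant). The rules are illustrative assumptions.
    """
    valid = {"ref", "het", "hom"}
    if normal_gt not in valid or tumor_gt not in valid:
        raise ValueError("genotype must be one of 'ref', 'het', 'hom'")
    if normal_gt == "ref" and tumor_gt == "ref":
        return "No variant"
    if normal_gt == "ref":
        return "Somatic (tumor only) Variant"
    if tumor_gt == "ref":
        return "Likely False Positive (normal only)"
    if normal_gt == "het" and tumor_gt == "hom":
        return "LOH"
    if normal_gt == tumor_gt:
        return "Good SNP"
    return "Suspect Variant"


for pair in [("het", "het"), ("ref", "het"), ("het", "ref"), ("het", "hom")]:
    print(pair, "->", classify_pair(*pair))
```

Real callers weigh read depth, base quality, and tumor purity before making these calls; the sketch only shows the logic of comparing the two samples.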
NCI60 Exome Sequencing
No Normals Available!
Variants by Genomic Location
All Coding Variants
Type 1: in dbSNP, Type 2: not in dbSNP
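The Type 1 / Type 2 split amounts to a membership test against dbSNP. A minimal sketch, assuming variants are keyed by (chromosome, position, reference allele, alternate allele); the key format, the example records, and the helper name `split_by_dbsnp` are all made up for illustration.

```python
def split_by_dbsnp(variants, dbsnp_keys):
    """Partition variants into Type 1 (in dbSNP) and Type 2 (not in dbSNP)."""
    type1, type2 = [], []
    for variant in variants:
        (type1 if variant in dbsnp_keys else type2).append(variant)
    return type1, type2


# Made-up records: (chrom, pos, ref, alt).
dbsnp = {("chr1", 1000, "A", "G"), ("chr2", 2000, "C", "T")}
calls = [
    ("chr1", 1000, "A", "G"),
    ("chr3", 3000, "G", "A"),
    ("chr2", 2000, "C", "T"),
]

type1, type2 = split_by_dbsnp(calls, dbsnp)
print(len(type1), len(type2))  # 2 1
```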
Coding, novel (no dbSNP)
Copy Number from Exomes
Complete Genome Sequencing
Complete Genomics Data
Data
Delivery
Results delivered via USB
Storage Sizes are LARGE - 400GB per sample as delivered with raw reads included
Should use 2-location backed-up storage
- Not trivial to find such storage, so might resort to multiple USB drives
Minimize:
- Data movement
- Keeping multiple copies indefinitely
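A quick back-of-the-envelope check of what these numbers imply. The 400 GB per-sample figure and the two storage locations come from the slide; the cohort size and the decimal terabyte (1 TB = 1000 GB) are assumptions for the example, and `storage_needed_tb` is a hypothetical helper.

```python
GB_PER_SAMPLE = 400  # from the slide: as delivered, raw reads included


def storage_needed_tb(n_samples, n_locations=2, gb_per_sample=GB_PER_SAMPLE):
    """Total storage in decimal terabytes for n_samples kept at n_locations."""
    return n_samples * n_locations * gb_per_sample / 1000.0


# Hypothetical cohort of 10 samples stored in two locations.
print(storage_needed_tb(10))  # 8.0
```

Every extra copy kept indefinitely adds another 400 GB per sample, which is why the slide recommends minimizing both data movement and long-lived duplicates.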
Breakdown of Data Sizes
Data
Delivery Storage Processing
Data are typically tab-delimited text files, so Excel can be useful for examining individual small files
Generally, command-line tools needed
MacOS and Linux are the only supported operating systems, but Windows might work
Some analyses (snpdiff) require large memory
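For files too large for Excel, tab-delimited text can be streamed one row at a time so that memory use stays small. A minimal sketch using Python's standard `csv` module; the column names and rows are made up and do not reflect the vendor's actual file layout, and `count_rows_by_column` is a hypothetical helper.

```python
import csv
import io


def count_rows_by_column(handle, column):
    """Stream a tab-delimited file and tally the values found in one column."""
    counts = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        counts[row[column]] = counts.get(row[column], 0) + 1
    return counts


# Made-up example; in practice, pass an open file handle instead of StringIO.
example = io.StringIO(
    "chrom\tpos\tvarType\n"
    "chr1\t100\tsnp\n"
    "chr1\t250\tdel\n"
    "chr2\t75\tsnp\n"
)
print(count_rows_by_column(example, "varType"))
```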
Directory Structure
Workflows
Tumor/Normal
- Copy Number
- Structural Variation
- Annotated Somatic Variants
Germline
- List of annotated genotypes per individual, summarized into a single file that can be used for filtering
Germline Workflow
Output
Future directions
Be “smarter” about inheritance framework
Further refinements of comparison to other data types (exomes, SNP arrays, RNA-seq)
Tumor/Normal Workflow
Medvedev et al., Nature 2009
The Cancer Genome Atlas Research Network Nature 000, 1-8 (2008) doi:10.1038/nature07385
Frequent genetic alterations in three critical signalling pathways.
Chromatin
Chromatin is the complex of protein and DNA that makes up the chromosomes. It is not a static structure.
DNAse is an enzyme that cuts DNA at locations where DNA is accessible
These “accessible” regions have been associated with open chromatin
Regions of open chromatin are necessary for transcriptional and regulatory machinery to have access to gene neighborhoods and facilitate transcription
DNAse Hypersensitivity
Method for finding regions of “open” chromatin
In data published with the ENCODE consortium, DNAse hypersensitive (HS) sites were shown to be correlated with:
- Histone modification
- Transcription start sites
- Early replicating regions
- Transcription factor binding sites (experimentally determined by ChIP/chip, etc.)
Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. The ENCODE Consortium. Nature, 2007.
DNAse-chip Method
Crawford, G.E., Davis, S., Scacheri, P.C., Renaud, G., Halawi, M.J., Erdos, M.R., Green, R., Meltzer, P.S., Wolfsberg, T.G., and Collins, F.S. Nat Methods, 2006
DNAse-Seq Method
Crawford, G.E., Davis, S., Scacheri, P.C., Renaud, G., Halawi, M.J., Erdos, M.R., Green, R., Meltzer, P.S., Wolfsberg, T.G., and Collins, F.S. Nat Methods, 2006
DNAse Sites Relative to Genes
DNAse HS Sites and Gene Expression
DNAse HS sites near transcription start sites are associated with actively transcribed genes.
Distances between sequences in non-DNAse HS regions have an oscillating pattern with a period that corresponds to a single turn of the double helix
DNAse is known to cut preferentially in the minor groove, which is exposed every 10.4 bases when wrapped around a nucleosome
A nucleosome is wrapped by 147 base pairs when complexed with DNA
Implication: Nucleosomes are positioned in a highly organized, precise manner
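The arithmetic behind this periodicity can be checked with a short sketch. The 10.4-base helical period and the 147 bp nucleosome length come from the slides. The cut positions below are idealized and made up (one cut per helical turn on a single nucleosome), and `pairwise_distance_counts` is a hypothetical helper, so this illustrates the expected pattern and is not an analysis of real DNAse data.

```python
from collections import Counter
from itertools import combinations

HELIX_PERIOD_BP = 10.4  # minor groove exposed once per helical turn
NUCLEOSOME_BP = 147     # DNA wrapped around one nucleosome


def pairwise_distance_counts(positions, max_distance):
    """Count distances between all pairs of positions, up to max_distance."""
    counts = Counter()
    for a, b in combinations(sorted(positions), 2):
        distance = b - a
        if distance <= max_distance:
            counts[distance] += 1
    return counts


# Number of full helical turns that fit on one nucleosome.
turns = int(NUCLEOSOME_BP / HELIX_PERIOD_BP)

# Idealized cut sites: one per turn, rounded to whole base positions.
cuts = [round(k * HELIX_PERIOD_BP) for k in range(turns + 1)]

counts = pairwise_distance_counts(cuts, 35)
print(turns, sorted(counts))  # 14 [10, 11, 20, 21, 31, 32]
```

The observed distances cluster at about 10, 21, and 31 bases, that is, at multiples of the helical period, which is the oscillating pattern described above. Real data would show this as peaks in a noisy histogram.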
Nucleosome Positioning
The Last Mile