Background

Golden Helix
- Founded in 1998
- Genetic association software
- Analytic services
- Hundreds of users worldwide
- Over 900 customer citations in scientific journals

Products I Build with My Team
- SNP & Variation Suite (SVS)
  - SNP, CNV, NGS tertiary analysis
  - Import and deal with all flavors of upstream data
- VarSeq
  - Annotate and filter variants in gene panels, exomes and genomes for clinical labs and researchers.
- GenomeBrowse (Free!)
  - Visualization of everything with genomic coordinates. All standardized file formats.

Data Warehousing / Scientific Analysis => Columnar
You’ve got to know what regression means, what Naïve Bayes means, what k-Nearest Neighbors means. It’s all statistics.
All of that stuff turns out to be defined on arrays. It’s not defined on tables. The tools of future data scientists are going to be array-based tools. Those may live on top of relational database systems. They may live on top of an array database system, or perhaps something else. It’s completely open.
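As a small illustration of the point that these methods are defined on arrays, here is a minimal k-Nearest Neighbors sketch over plain Python lists. The function name `knn_predict`, the feature values and the "case"/"control" labels are all made up for the example; a real analysis would likely use an array library instead.

```python
from collections import Counter
import math

def knn_predict(train_x, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training rows."""
    # Euclidean distance from the query to every row of the training array.
    distances = [
        (math.dist(row, query), label)
        for row, label in zip(train_x, train_y)
    ]
    distances.sort(key=lambda pair: pair[0])
    # Majority vote over the labels of the k closest rows.
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Hypothetical data: each row is a sample, each column a measurement.
train_x = [[0.0, 0.1], [0.2, 0.0], [1.0, 1.1], [0.9, 1.0]]
train_y = ["control", "control", "case", "case"]

prediction = knn_predict(train_x, train_y, [0.95, 1.05], k=3)
```

Nothing here needs a table abstraction: the input is a samples-by-features array plus a label vector.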
“Dimensional Modeling”
- Fact tables & dimensional tables
- Fact tables often measurements over time
- Dimensional table goes into item details
- Denormalized data, complexity hidden
- Often many sources loaded into same warehouse
  - Logs
  - One or more relational databases (sales, customer-facing, etc.)
  - Vendor / payment information
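
The fact/dimension split can be sketched in a few lines. This is a toy example, not a real warehouse schema: the table names `fact_sales` and `dim_item`, their columns and all values are hypothetical.

```python
# Dimension table: one row of descriptive detail per item (hypothetical data).
dim_item = {
    1: {"name": "widget", "category": "hardware"},
    2: {"name": "gadget", "category": "hardware"},
    3: {"name": "manual", "category": "print"},
}

# Fact table: measurements over time, pointing at the dimension by key.
fact_sales = [
    {"date": "2015-01-01", "item_id": 1, "units": 4},
    {"date": "2015-01-01", "item_id": 3, "units": 1},
    {"date": "2015-01-02", "item_id": 2, "units": 2},
    {"date": "2015-01-02", "item_id": 1, "units": 5},
]

def units_by_category(facts, items):
    """Join each fact row to its dimension row and sum units per category."""
    totals = {}
    for row in facts:
        category = items[row["item_id"]]["category"]
        totals[category] = totals.get(category, 0) + row["units"]
    return totals

totals = units_by_category(fact_sales, dim_item)
```

The fact rows stay narrow and numeric, while the descriptive detail lives once in the dimension table and is pulled in at query time.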
Genomics (Other Life Science) Data
Data Warehouse Like
Gabe’s Adjusted “Moore’s Law” NGS Cost Graph
Sequencers: Versatile tools for science
Genomics is Big Data
5,000 public data repositories

Broad Institute:
- Process 40K samples/year
- 1000 people
- 51 High Throughput Sequencers
- 10+ PB of storage

1 Genome in Data
- ~300GB Compressed Sequence Data
- ~150MB Compressed Variant Data
- Seq data went through 5-6 steps
We Want Variants
Differences between your DNA and a reference come in many sizes:
- Single letter substitutions are called Single Nucleotide Polymorphisms (SNPs)
- Small “length polymorphisms” are called Insertions/Deletions (InDels)
- Large duplications/deletions are called Copy Number Variations (CNVs)
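
A rough way to see the three size classes is to compare the lengths of the reference and alternate alleles. This is only a sketch: the function name `classify_variant` and the 1000-base cutoff between "small" and "large" are assumptions for illustration, not definitions from the slides.

```python
def classify_variant(ref, alt, cnv_min_length=1000):
    """Classify a variant by allele lengths.

    The cnv_min_length cutoff is an illustrative assumption.
    """
    # Single letter substitution.
    if len(ref) == 1 and len(alt) == 1 and ref != alt:
        return "SNP"
    size = abs(len(ref) - len(alt))
    if size == 0:
        # Identical alleles or a multi-base substitution: outside the three classes.
        return "other"
    if size >= cnv_min_length:
        return "CNV"
    return "InDel"

snp = classify_variant("A", "G")
deletion = classify_variant("AT", "A")
duplication = classify_variant("A", "A" + "C" * 2000)
```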
The average European has ~3 million small variations relative to the reference. About 100K of those fall in the 30K “gene coding” regions (~2% of the genome).
Next Generation Sequencing Analysis
Primary Analysis
- Analysis of hardware generated data, software built by vendors
- Use FPGAs and GPUs to handle real-time optical or electrical signals from sequencing hardware

Secondary Analysis
- Filtering/clipping of “reads” and their qualities
- Alignment/Assembly of reads
- Recalibrating, de-duplication, variant calling on aligned reads

Tertiary Analysis (“Sense Making”)
- QA and filtering of variant calls
- Annotation (querying) variants to databases, filtering on results
- Merging/comparing multiple samples (multiple files)
- Visualization of variants in genomic context
- Statistics on matrices
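
The first two tertiary steps (QA filtering, then annotating against a database and filtering on the result) might look roughly like this. Everything here is hypothetical: the field names, the quality and depth thresholds, the gene names and the frequency cutoff are placeholders, not values from any real pipeline or product.

```python
# Hypothetical variant calls, as secondary analysis might emit them.
calls = [
    {"chrom": "1", "pos": 1000, "ref": "A", "alt": "G", "qual": 50.0, "depth": 30},
    {"chrom": "1", "pos": 2000, "ref": "C", "alt": "T", "qual": 8.0, "depth": 4},
    {"chrom": "2", "pos": 500, "ref": "G", "alt": "GA", "qual": 99.0, "depth": 45},
]

# Hypothetical annotation source keyed by (chrom, pos, ref, alt).
annotations = {
    ("1", 1000, "A", "G"): {"gene": "GENE_A", "frequency": 0.20},
    ("2", 500, "G", "GA"): {"gene": "GENE_B", "frequency": 0.001},
}

def passes_qa(call, min_qual=20.0, min_depth=10):
    """QA step: drop low-quality or low-coverage calls (assumed thresholds)."""
    return call["qual"] >= min_qual and call["depth"] >= min_depth

def annotate(call, source):
    """Annotation step: look the variant up and merge in any fields found."""
    key = (call["chrom"], call["pos"], call["ref"], call["alt"])
    return {**call, **source.get(key, {})}

def rare_variants(calls, source, max_frequency=0.01):
    """QA filter, annotate, then keep variants at or below max_frequency."""
    kept = []
    for call in calls:
        if not passes_qa(call):
            continue
        annotated = annotate(call, source)
        # Variants missing from the source are treated as frequency 0 (an assumption).
        if annotated.get("frequency", 0.0) <= max_frequency:
            kept.append(annotated)
    return kept

rare = rare_variants(calls, annotations)
```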
Algorithmic Classification
- How variant interacts with genes (85K transcripts)

Region Based
- Disease regions
- Gene lists
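
Region-based annotation boils down to asking which named intervals contain a variant's position. A minimal sketch, assuming half-open [start, end) coordinates and made-up region names; production tools typically use interval trees or indexed files rather than a linear scan.

```python
# Hypothetical regions: (chrom, start, end, name), half-open [start, end).
regions = [
    ("1", 100, 200, "GENE_A"),
    ("1", 150, 300, "disease_region_1"),
    ("2", 50, 80, "GENE_B"),
]

def build_index(regions):
    """Group regions by chromosome and sort each group by start position."""
    index = {}
    for chrom, start, end, name in regions:
        index.setdefault(chrom, []).append((start, end, name))
    for intervals in index.values():
        intervals.sort()
    return index

def regions_at(index, chrom, pos):
    """Return the names of all regions on `chrom` that contain `pos`."""
    hits = []
    for start, end, name in index.get(chrom, []):
        if start > pos:
            break  # sorted by start, so no later region can contain pos
        if pos < end:
            hits.append(name)
    return hits

index = build_index(regions)
overlapping = regions_at(index, "1", 160)
```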
Annotations are Hard!
HGVS is a standard that is not standard
- Tries to serve different goals
- Many representations of same variant
- Should not be used as IDs, but not many good alternatives

Transcripts
- Transcript set choice extremely important, hard to curate with meaningful attributes as well.

Public Data Curation
- ClinVar: multi-record lines
- NHLBI: MAF vs AAF, splitting “glob” fields
- 1kG: No genotype counts
- ExAC: Multi-allelic splitting, left-align
- COSMIC: No Ref/Alt, only HGVS
- dbNSFP: Abbreviations and aggregate scores

Versioning and Issues
- ClinVar missing variants in VCF
- dbSNP patches without version changes
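
To give a feel for the curation work, here is a sketch of multi-allelic splitting: one record with several comma-separated alternate alleles becomes one record per allele, each trimmed to its shortest form. The function names are hypothetical, and true left-alignment is deliberately left out because it needs the reference sequence around the variant.

```python
def trim(pos, ref, alt):
    """Trim shared suffix, then shared prefix, keeping at least one base per allele."""
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1  # dropping a leading base shifts the start position
    return pos, ref, alt

def split_multiallelic(chrom, pos, ref, alts):
    """Turn one multi-allelic record into one trimmed record per alternate allele."""
    return [(chrom, *trim(pos, ref, alt)) for alt in alts.split(",")]

# Hypothetical record: REF=CAG with two alternates at position 100.
records = split_multiallelic("1", 100, "CAG", "CAA,C")
```

After splitting, the first alternate reduces to a single-base substitution at position 102, while the deletion keeps its original coordinates. Without a step like this, the same variant can fail to match across sources that encode it differently.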
Splice Mutation
N-Glycanase Deficiency
http://www.ngly1.org/
Matthew Might and Matt Wilsey. The shifting model in clinical diagnostics: how next-generation sequencing and families are altering the way rare diseases are discovered, studied, and treated. Genetics in Medicine. March 2014.
Cancer is a disease of the genome
- “Molecular Targeted” drugs effective, usually side-effect free
- Required genetic testing to direct cancer treatment becoming affordable