A Distributed Annotation Pipeline for MSSNG
Building infrastructure to annotate 10,000 Autism Spectrum Disorder Genomes using Google’s Cloud.
Simon Twigger, Ph.D., BioTeam
ABOUT MSSNG
MSSNG is a collaboration between Google and Autism Speaks to create the world’s largest genomic database on autism. By sequencing the DNA of over 10,000 families affected by autism, MSSNG will answer the many questions we still have about the disorder.
Upcoming Release: 1711 Individuals (681 affected, 1030 unaffected)
https://mss.ng
General Process
Consented Families
DNA Samples
DNA Sequencer
Ref: ACGTGCGATCCTAGCTACG
Sub: ACGTGCGAACCTAGCTACG
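The Ref/Sub pair above differs at a single base. A minimal Python sketch of that comparison follows; the function name `find_substitutions` is made up for illustration, and real variant calling works on aligned reads with quality scores rather than on two equal-length strings.

```python
def find_substitutions(ref: str, sample: str):
    """Return (zero_based_position, ref_base, sample_base) for each mismatch."""
    if len(ref) != len(sample):
        raise ValueError("base-by-base comparison needs equal-length sequences")
    return [(i, r, s) for i, (r, s) in enumerate(zip(ref, sample)) if r != s]


REF = "ACGTGCGATCCTAGCTACG"  # reference sequence from the slide
SUB = "ACGTGCGAACCTAGCTACG"  # sample sequence from the slide

substitutions = find_substitutions(REF, SUB)
print(substitutions)  # -> [(8, 'T', 'A')]
```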
Total Variants in BigQuery
Find the Unique Variants

SELECT
  CONCAT(reference_name, '-', CAST(start AS STRING), '-', CAST(end AS STRING), '-', reference_bases, '-', alternate_bases) AS id,
  reference_name,
  start + CASE
    WHEN LENGTH(reference_bases) < LENGTH(alternate_bases) THEN 1
    WHEN LENGTH(reference_bases) > LENGTH(alternate_bases) THEN IF (reference_bases CONTAINS alternate_bases, LENGTH(alternate_bases) + 1, 1)
    ELSE 1
  END AS start_base_one,
  end + CASE
    WHEN LENGTH(reference_bases) < LENGTH(alternate_bases) THEN 0
    WHEN LENGTH(reference_bases) > LENGTH(alternate_bases) THEN 0
    ELSE 0
  END AS end_base_one,
  CASE
    WHEN LENGTH(reference_bases) < LENGTH(alternate_bases) THEN IF (alternate_bases CONTAINS reference_bases, '-', reference_bases)
    WHEN LENGTH(reference_bases) > LENGTH(alternate_bases) THEN IF (reference_bases CONTAINS alternate_bases, SUBSTR(reference_bases, 1 + LENGTH(alternate_bases)), reference_bases)
    ELSE reference_bases
  END AS reference_bases_one,
  CASE
    WHEN LENGTH(reference_bases) < LENGTH(alternate_bases) THEN IF (alternate_bases CONTAINS reference_bases, SUBSTR(alternate_bases, 1 + LENGTH(reference_bases)), alternate_bases)
    WHEN LENGTH(reference_bases) > LENGTH(alternate_bases) THEN IF (reference_bases CONTAINS alternate_bases, '-', alternate_bases)
    ELSE alternate_bases
  END AS alternate_bases_one
FROM FLATTEN ([mssng_20150303.variants], call)
WHERE call.FILTER = 'PASS'
  OR call.FILTER = 'VQSRTrancheSNP99.90to100.00'
  OR call.FILTER = 'VQSRTrancheINDEL99.90to100.00'
OMIT RECORD IF EVERY(alternate_bases IS NULL)
  OR EVERY(alternate_bases = '<NON_REF>')
  OR reference_bases = ''
GROUP EACH BY id, reference_name, start_base_one, end_base_one, reference_bases_one, alternate_bases_one;
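The CASE expressions in the query above convert the zero-based start to a one-based coordinate and replace the shared bases of an insertion or deletion with '-'. The Python sketch below mirrors that logic for a single variant so it can be checked by hand. The function name `normalize_variant` and the returned dictionary are illustrative assumptions, not part of the pipeline; the substring test (`in`) stands in for the query's CONTAINS operator.

```python
def normalize_variant(reference_name, start, end, ref, alt):
    """Mirror the query's CASE logic for one (ref, alt) pair."""
    # The id is built from the original, untrimmed values, as in the query.
    variant_id = f"{reference_name}-{start}-{end}-{ref}-{alt}"

    start_one, ref_one, alt_one = start + 1, ref, alt  # default: substitution
    if len(ref) < len(alt) and ref in alt:
        # Insertion: keep only the inserted bases.
        ref_one, alt_one = "-", alt[len(ref):]
    elif len(ref) > len(alt) and alt in ref:
        # Deletion: skip the shared bases and keep only the deleted ones.
        start_one = start + len(alt) + 1
        ref_one, alt_one = ref[len(alt):], "-"

    return {
        "id": variant_id,
        "reference_name": reference_name,
        "start_base_one": start_one,
        "end_base_one": end,  # the query adds 0 to end in every branch
        "reference_bases_one": ref_one,
        "alternate_bases_one": alt_one,
    }


# The G->T substitution whose id appears in the example annotation table.
snv = normalize_variant("X", 148037416, 148037417, "G", "T")
print(snv["id"], snv["start_base_one"])  # -> X-148037416-148037417-G-T 148037417
```

The coordinates in the second and third test cases used to check this sketch are invented; only the X-chromosome substitution comes from the slides.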
Unique Variants in BigQuery
38.5M Variants - Which one(s) are important?
Many different ways to prioritize, e.g. variants which are…
• Only found in specific family trees?
• In genes believed to be associated with ASD?
• Associated with biological pathways implicated in ASD?
• Damaging to a protein involved in a neurological pathway?
• In a relevant gene’s regulatory region?
• In a gene known to be pathogenic in other relevant diseases?
• Not seen in patients with some other disease?
• Only found in affected patients?
Next step is to annotate the variants with known data.
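One of the prioritization questions, "only found in affected patients?", can be sketched as a set comparison. This is an illustration under stated assumptions, not the MSSNG implementation: the variant names, sample names, and the function `only_in_affected` are all made up, and a real analysis would run against the call data in BigQuery.

```python
def only_in_affected(carriers_by_variant, affected):
    """Return variant ids whose carriers are all in the affected set."""
    return sorted(
        variant
        for variant, carriers in carriers_by_variant.items()
        if carriers and carriers <= affected  # non-empty subset of affected
    )


# Hypothetical toy data: which samples carry which variant.
carriers_by_variant = {
    "var-1": {"sample-A", "sample-B"},
    "var-2": {"sample-A", "sample-C"},
    "var-3": {"sample-C"},
}
affected = {"sample-A", "sample-B"}

print(only_in_affected(carriers_by_variant, affected))  # -> ['var-1']
```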
Variant Annotation
• Existing annotation pipeline written in Perl
• ANNOVAR used to associate variants with existing biological knowledge
• Moved entire infrastructure over to Google’s Cloud to allow it to use Google Genomics API and run entirely on the Google platform.
• Provide management tools to manage annotation databases and the annotation process
Annotation Jobs
38.5M Variants in ~36 hrs
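For a sense of scale, the figures above imply an average throughput that can be worked out directly. This is back-of-the-envelope arithmetic that treats the roughly 36 hours as exact; it says nothing about how the work was split across jobs.

```python
variants = 38_500_000  # unique variants annotated
hours = 36             # approximate wall-clock time from the slide

per_hour = variants / hours
per_second = variants / (hours * 3600)

print(f"{per_hour:,.0f} variants/hour")      # -> 1,069,444 variants/hour
print(f"{per_second:.0f} variants/second")   # -> 297 variants/second
```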
Example Variant Annotations
Field    Value                          Description
ID       X-148037416-148037417-G-T      Location and DNA change
Effect   nonsynonymous SNV              Predicted effect (Leu->Phe)
Symbol   AFF2                           AF4/FMR2 family, member 2
OMIM     Mental retardation, X-linked   Known disease phenotype
HPO      HP:0000152                     Human Phenotype Ontology IDs
dbSNP    rs371160275                    dbSNP identifier
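The ID field packs the chromosome, start, end, reference bases and alternate bases into one hyphen-separated string. A minimal sketch of splitting it back into typed fields is shown below. The `VariantId` class and `parse_variant_id` function are illustrative names, and the sketch assumes that none of the five fields contains a hyphen.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VariantId:
    reference_name: str
    start: int
    end: int
    reference_bases: str
    alternate_bases: str


def parse_variant_id(variant_id: str) -> VariantId:
    """Split 'name-start-end-ref-alt' into its five fields."""
    parts = variant_id.split("-")
    if len(parts) != 5:
        raise ValueError(f"expected 5 hyphen-separated fields, got {len(parts)}")
    name, start, end, ref, alt = parts
    return VariantId(name, int(start), int(end), ref, alt)


parsed = parse_variant_id("X-148037416-148037417-G-T")
print(parsed.reference_name, parsed.start, parsed.end)  # -> X 148037416 148037417
```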
10B observed variants
38M unique variants
?? causative variants
https://www.mss.ng/researchers
MSSNG’s philosophy is to promote and enable ‘open science’ research to lead to a better understanding of autism. We welcome you to join us.
https://www.mss.ng/poster