Page 1
Joaquín Dopazo
Computational Genomics Department,
Centro de Investigación Príncipe Felipe (CIPF),
Functional Genomics Node, (INB),
Bioinformatics Group (CIBERER) and
Medical Genome Project,
Spain.
Bioinformatics and NGS: an
indissoluble marriage for advancing
hearing loss research
http://bioinfo.cipf.es http://www.medicalgenomeproject.com http://www.babelomics.org http://www.hpc4g.org @xdopazo
Fundación Ramón Areces, Madrid, 5 March 2015
Page 2
Why are bioinformatics and NGS important? Lessons learned from the Spanish 1000 genomes project:
Rare and familial diseases sequencing initiative
• Metabolic (86 samples)
• Opitz
• Atypical fracture
• coQ10 deficiency
• Congenital disorder of glycosylation types I and II
• Maple syrup urine disease
• Pelizaeus-like
• 4 unknown syndromes
• Genetic (24 samples)
• Charcot-Marie-Tooth
• Rett Syndrome
• Neurosensorial (35 samples)
• Usher
• AD non-syndromic hearing loss
• AR non-syndromic hearing loss
• RP
• Mitochondrial (28 samples)
• Progressive external ophthalmoplegia
• Multi-enzymatic deficiency in mitochondrial
respiratory complexes
• CoQ disease
• Other
• APL (10 samples)
Autism (37 samples)
Mental retardation (autosomal recessive) (24)
Immunodeficiency (18)
Leber's congenital amaurosis (9)
Cataract (2)
RP(AR) (60)
RP(AD) (46)
Deafness (24)
CLAPO (4)
Skeletal Dysplasia (3)
Cantú syndrome (1)
Dubowitz syndrome (2)
Gorham-Stout syndrome (1)
Malpuech syndrome (4)
Hirschsprung’s disease (81)
Hereditary macrothrombocytopenia (3)
MTC (41)
Controls (301)
1044 samples = 183 samples + 200 controls + 360 samples + 301 controls
Page 3
Organization of the initiative
Diseases with:
• Unknown genes
• Known genes/mutations discarded
Search for:
• Novel genes
• Responsible genes known but unknown modifier genes
• Susceptibility genes
• Therapeutic targets
http://www.gbpa.es/
Data production Sequencing platforms Data analysis
Big-Data Team
science paradigm
Page 4
Data management, analysis
and storage
http://www.gbpa.es/
[Figure: data flow from samples to the prioritization report. Labels: Samples; Raw files (FastQ); Analysis pipeline; Processed files (BAM, VCF); DB; Storage; K-DB; ranked gene list (Gene 1, Gene 2, ..., Gene Y, each with placeholder annotation text); Prioritization report; Dialog with experts in the disease + validations]
Page 5
Pipeline of data analysis
Primary analysis
Initial QC → FASTQ file
• Sequence cleansing
• Base quality
• Remove adapters
• Remove duplicates
Mapping + QC → BAM file
• Mapping (HPG)
• Remove multiple-mapping reads
• Remove low-quality mapping reads
• Realigning
• Base quality recalibration
Variant calling + QC → VCF file
• Calling and labeling of missing values
• Calling SNVs and indels (GATK) using 6 statistics based on QC, strand bias and consistency (poor-QC calls are converted to missing values as well)
• Create multiple VCFs with missing values, SNVs and indels
Gene prioritization
Variant and gene prioritization + QC → Report
• Counts of sites with variants
• Variant annotation (function, putative effect, conservation, etc.)
• Inheritance analysis (including compound heterozygotes in recessive inheritance)
• Filtering by frequency with external controls (Spanish controls, dbSNP, 1000g, ESP) and annotation
• Multi-family intersection of genes and variants
• Function/network-based prioritization
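The inheritance analysis step, including compound heterozygotes under recessive inheritance, could be sketched roughly as follows. This is a minimal illustration and not the pipeline's own code: the record layout (a gene name plus alt-allele counts 0, 1 or 2 for child, mother and father), the function names and the example genes G1 to G3 are all assumptions made for the example, and both parents are assumed to be unaffected.

```python
from collections import defaultdict
from itertools import combinations


def _only_from(variant, carrier, other):
    """True if the alt allele is heterozygous in `carrier` and absent in `other`."""
    return variant[carrier] == 1 and variant[other] == 0


def recessive_candidates(variants):
    """Return {gene: model} for genes compatible with recessive inheritance in a trio.

    `variants` is a list of dicts with hypothetical keys
    gene, child, mother, father (alt-allele counts: 0, 1 or 2).
    """
    by_gene = defaultdict(list)
    for v in variants:
        by_gene[v["gene"]].append(v)

    hits = {}
    for gene, vs in by_gene.items():
        # Homozygous model: child 1/1, both parents heterozygous carriers.
        if any(v["child"] == 2 and v["mother"] == 1 and v["father"] == 1 for v in vs):
            hits[gene] = "homozygous"
            continue
        # Compound heterozygous model: two heterozygous variants in the child,
        # one inherited only from the mother and the other only from the father.
        het = [v for v in vs if v["child"] == 1]
        for a, b in combinations(het, 2):
            if (_only_from(a, "mother", "father") and _only_from(b, "father", "mother")) or (
                _only_from(a, "father", "mother") and _only_from(b, "mother", "father")
            ):
                hits[gene] = "compound heterozygous"
                break
    return hits


# Hypothetical trio: G1 is a compound heterozygote, G2 is homozygous,
# G3 has two variants inherited from the same parent and is rejected.
trio_variants = [
    {"gene": "G1", "child": 1, "mother": 1, "father": 0},
    {"gene": "G1", "child": 1, "mother": 0, "father": 1},
    {"gene": "G2", "child": 2, "mother": 1, "father": 1},
    {"gene": "G3", "child": 1, "mother": 1, "father": 0},
    {"gene": "G3", "child": 1, "mother": 1, "father": 0},
]
candidates = recessive_candidates(trio_variants)
```

Grouping by gene before testing pairs is what makes the compound heterozygous case visible; a per-variant filter alone would discard both G1 variants.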
Page 6
Pipeline of data analysis
Primary analysis (primary processing)
• Initial QC → FASTQ file
• Mapping → BAM file
• Variant calling → VCF file
Secondary analysis (successive filtering)
• Variant annotation (VARIANT annotation tool)
• Filtering by effect
• Filtering by MAF
• Filtering by family segregation
Gene prioritization (knowledge-based prioritization)
• Proximity to other known disease genes
• Functional proximity
• Network proximity
• Burden tests
• Other prioritization methods
Page 7
Variant annotation
HPG Variant, a suite of tools for HPC-based genomic variant annotation. VARIANT = VARIant ANnotation Tool. Tools implemented using OpenMP, Nvidia CUDA and MPI for large clusters.
• EFFECT: a cloud-based genomic variant effect predictor, available as a CLI and as a web application (http://variant.bioinfo.cipf.es, Medina 2012 NAR)
• VCF: a C library and tool that allows large VCF files to be analyzed with a low memory footprint: stats, filter, split, merge, etc. Example: hpg-variant vcf --stats --vcf-file ceu.vcf
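The idea of computing statistics over a large VCF with a low memory footprint can be illustrated with a streaming reader that keeps only running counters. The sketch below is independent of HPG Variant and does not reproduce its implementation or its output; the function name `vcf_stats` and the SNV/indel classification rule are assumptions chosen for the example.

```python
import io


def vcf_stats(lines):
    """Count SNVs and indels from an iterable of VCF lines, one line at a time."""
    stats = {"variants": 0, "snv": 0, "indel": 0, "other": 0}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, meta-information and the column header
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # not a complete data line
        ref, alts = fields[3], fields[4].split(",")
        for alt in alts:
            stats["variants"] += 1
            if len(ref) == 1 and len(alt) == 1 and alt in "ACGT":
                stats["snv"] += 1
            elif set(ref + alt) <= set("ACGTN") and len(ref) != len(alt):
                stats["indel"] += 1
            else:
                stats["other"] += 1  # symbolic alleles, MNPs, missing ALT, ...
    return stats


# A tiny in-memory VCF; with a real file, pass the open file handle instead.
example_vcf = (
    "##fileformat=VCFv4.1\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "1\t100\t.\tA\tT\t50\tPASS\t.\n"
    "1\t200\t.\tG\tGA\t50\tPASS\t.\n"
    "2\t300\t.\tCT\tC,CTT\t50\tPASS\t.\n"
)
stats = vcf_stats(io.StringIO(example_vcf))
```

Because the function only iterates over lines, memory use stays constant regardless of file size, which is the property the slide highlights.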
Annotations sought
Page 8
The knowledge database
CellBase (Bleda, 2012, NAR), a comprehensive integrative database and RESTful Web Services API, more than 250GB of data and 90 tables exported in TXT and JSON:
● Core features: genes, transcripts, exons, cytobands, proteins (UniProt),...
● Variation: dbSNP and Ensembl SNPs, HapMap, 1000Genomes, Cosmic, ...
● Functional: 40 OBO ontologies (Gene Ontology), Interpro, etc.
● Regulatory: TFBS, miRNA targets, conserved regions, etc.
● Systems biology: Interactome (IntAct), Reactome database, co-expressed genes.
NoSQL and scales to TB
Wiki: http://docs.bioinfo.cipf.es/projects/cellbase/wiki
Project: http://bioinfo.cipf.es/compbio/cellbase
Now available at the EBI: http://www.ebi.ac.uk/cellbase/webservices/rest/v3/
Page 9
Pipeline of data analysis (same diagram as on Page 6)
Sources of population frequencies for filtering by MAF:
• 1000 genomes
• EVS
• Local variants
Page 10
Use known variants and their population frequencies to filter out common variants.
• Typically dbSNP, 1000 genomes and the 6515 exomes from the ESP are used as sources of population frequencies.
• We sequenced 300 healthy controls (rigorously phenotyped) to add an extra filtering step to the analysis pipeline.
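A frequency filter that combines public sources with local controls might look like the sketch below. It is only an illustration under stated assumptions: the 1% threshold, the source names used as dictionary keys, the variant names and all frequency values are hypothetical and are not taken from the project.

```python
def is_rare(freqs, max_freq=0.01, sources=None):
    """Return True if the variant is rare (or absent) in every selected source.

    `freqs` maps a source name to the alternative-allele frequency,
    or to None when the variant is not reported in that source.
    """
    used = freqs if sources is None else {s: freqs.get(s) for s in sources}
    return all(f is None or f <= max_freq for f in used.values())


# Hypothetical candidate variants and frequencies.
candidate_freqs = {
    "var1": {"dbSNP": None, "1000g": 0.002, "ESP": 0.001, "local_controls": 0.0},
    "var2": {"dbSNP": 0.20, "1000g": 0.18, "ESP": 0.21, "local_controls": 0.19},
    "var3": {"dbSNP": None, "1000g": None, "ESP": None, "local_controls": 0.05},
}

public_sources = ["dbSNP", "1000g", "ESP"]
kept_without_local = [v for v, f in candidate_freqs.items() if is_rare(f, sources=public_sources)]
kept_with_local = [v for v, f in candidate_freqs.items() if is_rare(f)]
```

In this made-up example `var3` survives the public databases but is removed once the local controls are consulted, which is the kind of population-specific polymorphism the extra filtering step is meant to catch.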
Comparison of MGP controls to 1000g
Novembre et al., 2008. Genes mirror geography within Europe. Nature.
How important do you think local information is to detect disease genes?
Page 11
Filtering with or without local variants
Number of genes as a function of individuals in the study of a dominant disease: autosomal dominant Retinitis Pigmentosa
The use of local variants makes an enormous difference
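The curve described above (number of candidate genes as a function of the number of individuals) can be reproduced in miniature by intersecting, individual by individual, the sets of genes that still carry a candidate variant after filtering. This sketch assumes a dominant disease caused by a single shared gene; the gene names and the function name are hypothetical.

```python
def shared_candidate_genes(genes_per_individual):
    """Return (counts, shared): the number of genes still shared after adding
    each individual, and the final set of shared genes."""
    shared = None
    counts = []
    for genes in genes_per_individual:
        shared = set(genes) if shared is None else shared & set(genes)
        counts.append(len(shared))
    return counts, (shared if shared is not None else set())


# Hypothetical affected individuals, each with the genes that passed filtering.
affected = [
    {"GENE_A", "GENE_B", "GENE_C", "GENE_D"},
    {"GENE_A", "GENE_B", "GENE_D"},
    {"GENE_A", "GENE_D", "GENE_E"},
]
counts, shared = shared_candidate_genes(affected)
```

Filtering with local variants shrinks each individual's starting set, so the intersection converges on a short candidate list with fewer individuals.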
Page 12
The CIBERER Exome Server (CES): the first repository of variability of the Spanish population
http://ciberer.es/bier/exome-server/
Only one other similar initiative exists: the GoNL (http://www.nlgenome.nl/), and, more recently, one for the Finnish population
Page 13
Information provided
• Genotypes in the different reference populations
• Genomic coordinates, variation, gene
• SNP id, if any
Page 14
Information provided
• PolyPhen and SIFT pathogenicity indexes
• Phenotype, if available
Page 15
Variants can also be seen within their genomic context
GenomeMaps viewer (Medina et al., 2013, NAR) embedded in the application.
GenomeMaps is the official genome viewer of the ICGC (http://dcc.icgc.org/)
Page 16
Occurrence of pathological variants in “normal” population
• Reference genome is mutated
• Nine carriers in 1000 genomes
• One affected and 73 carriers in EVS
Page 17
Table of Spanish Frequencies (TSF)
DB of Spanish variants (DBSV)
Chr  Position  Ref  Alt  0/0  0/1  1/1
1    1365313   A    T    75   0    0
1    1484884   G    A    70   4    1
2    326252    T    C    25   35   15
[Figure: decision flow for CES input and use. Labels: CES input (VCFs); Internal; External; Unrelated? (DBSV); Spanish? (TSF); YES / NO; Counts; CES use; Regional; Other countries]
AIMs (ancestry-informative markers) are used to discard kinship and different ethnicity
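The genotype counts in the table above (columns 0/0, 0/1, 1/1) translate into an alternative-allele frequency by counting alleles: each heterozygote contributes one alternative allele, each alternative homozygote two, out of two alleles per individual. The sketch below assumes diploid autosomal positions; the function name is an assumption.

```python
def alt_allele_frequency(n_ref_hom, n_het, n_alt_hom):
    """Alternative-allele frequency from genotype counts at a diploid site."""
    n_individuals = n_ref_hom + n_het + n_alt_hom
    if n_individuals == 0:
        raise ValueError("no genotyped individuals at this position")
    return (n_het + 2 * n_alt_hom) / (2 * n_individuals)


# Rows copied from the table: chr, position, ref, alt, then 0/0, 0/1, 1/1 counts.
table_rows = [
    ("1", 1365313, "A", "T", 75, 0, 0),
    ("1", 1484884, "G", "A", 70, 4, 1),
    ("2", 326252, "T", "C", 25, 35, 15),
]
frequencies = [alt_allele_frequency(*row[4:]) for row in table_rows]
```

For the second row, for instance, this gives (4 + 2·1) / (2·75) = 6/150 = 0.04.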
Page 18
Organization of the database
Project D1 D2 … Case Control Counts
A x x f1
X x f2
B X X f3
X X f4
C X X f5
X X f6
X X f7
… … … … … … …
Organized in projects / diseases / case-control. Frequencies are calculated for each project-disease-status combination, and selections can be made as required. The items can be combined to maximize the pseudo-control sample size.
Example: frequencies f1, f2 and f5 can be used as pseudo-controls for studying disease D2. Under a less stringent scenario, f4 and f6 could also be used.
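Combining cells of this table into one pseudo-control set amounts to summing their genotype counts at a given position before computing a frequency. The sketch below follows the f1, f2, f5 (strict) and f4, f6 (less stringent) example from the text, but every genotype count in it is invented for illustration, and the function name is an assumption.

```python
def pool_pseudo_controls(cells, selected):
    """Sum (0/0, 0/1, 1/1) genotype counts over the selected cells and return
    the pooled counts together with the alternative-allele frequency."""
    ref_hom = sum(cells[label][0] for label in selected)
    het = sum(cells[label][1] for label in selected)
    alt_hom = sum(cells[label][2] for label in selected)
    n_individuals = ref_hom + het + alt_hom
    if n_individuals == 0:
        raise ValueError("no individuals in the selected cells")
    freq = (het + 2 * alt_hom) / (2 * n_individuals)
    return (ref_hom, het, alt_hom), freq


# Hypothetical genotype counts at one position for each project-disease-status cell.
cells = {
    "f1": (20, 1, 0),
    "f2": (30, 0, 0),
    "f4": (14, 1, 0),
    "f5": (24, 1, 0),
    "f6": (9, 1, 0),
}

strict_counts, strict_freq = pool_pseudo_controls(cells, ["f1", "f2", "f5"])
relaxed_counts, relaxed_freq = pool_pseudo_controls(cells, ["f1", "f2", "f5", "f4", "f6"])
```

The less stringent selection trades some risk of contamination by related phenotypes for a larger sample size (101 instead of 76 individuals in this made-up case).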
Page 19
Are we there yet? Variability spectrum of the Spanish population
A total of 131,897 variant positions unique to the Spanish population were detected across the 75 samples. Approximately 90,000 were singletons. 51,295 variants are non-synonymous changes and 18,450 are synonymous changes (the opposite pattern to that of variants shared with 1000g and EVS).
Page 20
CIBERER exome server roadmap and the Spanish 1000 genomes project (Spanex)
[Figure: roadmap timeline. Phases: Phase I, Phase II, Phase III. Data sets: CIBERER 76 samples; MGP 269 samples (healthy controls); more CIBERER samples; SPANEX: 1000 exomes (200 ongoing). Server contents: CES II 76+269+X (mixed); CES II 1000+76+269+X (mixed). Time axis: 2014-June, 2014, 2015, Today. Other label: 400]
Page 21
BiERapp: interactive web-based tool for easy candidate prioritization by successive filtering
SEQUENCING CENTER
Data preprocessing
VCF FASTQ
Genome Maps
BAM
BiERapp filters
No-SQL (Mongo) VCF indexing
Population frequencies Consequence types
Experimental design
BAM viewer and Genomic context ?
Easy scale-up
Page 22
NA19660 NA19661
NA19600 NA19685
BiERapp: the interactive filtering tool for easy candidate prioritization
http://bierapp.babelomics.org Aleman et al., 2014 NAR
Page 23
A proper filtering system must consider missing values
[Figure: the same four samples (NA19660, NA19661, NA19600, NA19685) shown twice with genotypes at one position. First panel: A/T, A/T, T/T, A/T. Second panel: ?/?, A/T, T/T, A/T, where one genotype is missing.]
An alternative allele can go unreported for two reasons:
a) The position was read and the reference allele was found
b) The position could not be read and/or it was low quality (missing value)
Most VCF files do not allow these two scenarios to be told apart.
We specifically include missing values
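One way to honour missing values during segregation filtering is to use a three-way outcome instead of a yes/no answer, so that an unread position is never silently treated as reference. The sketch below tests a recessive pattern; the family roles, the `"./."` encoding for a missing call and the function names are assumptions for illustration, not the behaviour of any specific tool.

```python
def alt_count(genotype):
    """Number of alternative alleles, or None if the call is missing."""
    if "." in genotype:
        return None
    return sum(allele != "0" for allele in genotype.replace("|", "/").split("/"))


def recessive_segregation(genotypes, affected, unaffected):
    """Return "consistent", "inconsistent" or "undetermined" for one variant.

    Affected samples must be 1/1 and unaffected samples must not be 1/1.
    A missing call never counts as reference: it makes the result undetermined
    unless another sample already contradicts the pattern.
    """
    undetermined = False
    for sample in affected:
        n = alt_count(genotypes.get(sample, "./."))
        if n is None:
            undetermined = True
        elif n != 2:
            return "inconsistent"
    for sample in unaffected:
        n = alt_count(genotypes.get(sample, "./."))
        if n is None:
            undetermined = True
        elif n == 2:
            return "inconsistent"
    return "undetermined" if undetermined else "consistent"


# Hypothetical family: one affected child, three unaffected relatives.
complete = {"father": "0/1", "mother": "0/1", "child": "1/1", "sibling": "0/1"}
with_missing = dict(complete, father="./.")
contradicted = dict(complete, child="0/1")

unaffected = ["father", "mother", "sibling"]
result_complete = recessive_segregation(complete, ["child"], unaffected)
result_missing = recessive_segregation(with_missing, ["child"], unaffected)
result_contradicted = recessive_segregation(contradicted, ["child"], unaffected)
```

Keeping "undetermined" as a separate state lets the analyst decide whether to keep such variants, rather than having them pass or fail by accident.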
Page 24
Successive filtering approach: an example with 3-Methylglutaconic aciduria syndrome
3-Methylglutaconic aciduria (3-MGA-uria) is a heterogeneous group of syndromes characterized by an increased excretion of 3-methylglutaconic and 3-methylglutaric acids.
WES with a successive filtering approach is enough to detect the new mutation in this case.
Page 25
Readjusting filtering thresholds
[Figure: iterative workflow. Labels: Primary analysis; VCF; BiERapp; Frequency; Deleteriousness; Experimental design; Gene (yes / no); Paper; GO enrichment; Network analysis; Pathway analysis]
Quite often the result is not conclusive, because there are either too many or too few candidates, and it depends completely on the disease and the experimental setup.
In our experience, easy interactivity in the filtering is the best asset for gene discovery.
Page 26
Results: 36 new disease variants in known genes and 27 disease variants in 13 new genes
[Figure labels: WES; IRDs; arRP (EYS); BBS; arRP; arRP (USH2); 3-MGA-uria (SERAC1); NBD (BCKDK)]
Page 27
Diagnostic by targeted resequencing (panels, real or virtual, of genes)
Collaboration with M.A. Moreno, Hospital Ramón y Cajal
http://team.babelomics.org
• Tool for defining panels
• Diagnostic mutations
• If no diagnostic variants appear, then variants of uncertain effect are studied
• Incidental findings can also be handled
• New filter based on local population variant frequencies
Page 28
Virtual panels are a reality
4813 genes with known
phenotypes.
• One physical panel
• As many virtual panels
as you need
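A virtual panel can be thought of as a named gene list applied as a filter to the variants obtained from the single physical panel. The sketch below illustrates that idea only: the panel contents are tiny illustrative subsets rather than the panels used in the project, and the variant records and function name are hypothetical.

```python
# Tiny illustrative gene subsets; real panels would be much larger.
VIRTUAL_PANELS = {
    "RP": {"RHO", "RP1", "EYS", "PRPF31"},
    "USH": {"USH2A", "MYO7A", "CDH23"},
}


def apply_virtual_panel(variants, panel_name, panels=VIRTUAL_PANELS):
    """Keep only the variants that fall in genes of the chosen virtual panel."""
    genes = panels[panel_name]
    return [v for v in variants if v["gene"] in genes]


# Hypothetical variants called once from the physical panel.
called_variants = [
    {"id": "v1", "gene": "RHO"},
    {"id": "v2", "gene": "MYO7A"},
    {"id": "v3", "gene": "EYS"},
    {"id": "v4", "gene": "CRX"},
]

rp_hits = [v["id"] for v in apply_virtual_panel(called_variants, "RP")]
ush_hits = [v["id"] for v in apply_virtual_panel(called_variants, "USH")]
```

Because the sequencing is done once, adding a new virtual panel only requires adding another gene list; no new laboratory work is implied.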
Page 29
Building virtual panels: an example with Inherited Retinal Dystrophies
Disease groups:
• LCA: Leber Congenital Amaurosis
• CORD/COD: Cone and cone-rod dystrophies
• CVD: Colour Vision Defects
• MD: Macular Degeneration
• ERVR/EVR: Erosive and Exudative Vitreoretinopathies
• USH: Usher Syndrome
• RP: Retinitis Pigmentosa
• NB: Night Blindness
• BBS: Bardet-Biedl Syndrome
[Figure: diagram of overlapping gene sets for the disease groups above. Gene groups as they appear in the figure:]
• CACNA1F, CACNA2D4
• GNAT2
• C2ORF71, C8ORF37, CA4, CERKL, CNGA1, CNGB1, DHDDS, EYS, FAM161A, IDH3B, KLHL7, IMPG2, MAK, NRL, PAP1, PDE6A, PDE6G, PRCD, PRF3, PRPF8, PRPF31, RBP3, RGR, ROM1, RP1, RP2, SNRNP200, TOPORS, TTC8, ZNF513
• PDE6B, RHO, SAG
• GRK1, GRM6, NYX, TRPM1
• CABP4
• LCA5, RD3
• CRB1, IMPDH1, LRAT, MERTK, RDH12, RPE65, SPATA7, TULP1
• CRX
• AIPL1, GUCY2D, RPGRIP1
• ADAM9, GUCA1A, HRG4/UNC119, KCNV2, PDE6H, PITPNM3, RAX2, RDH5, RIM1
• CNGA3, PDE6C
• BCP, GCP, RCP
• ABCA4, PROM1, PRPH2, RPGR
• RLBP1, SEMA4A
• C1QTNF5, EFEMP1, ELOVL4, HMNC1, RS1, TIMP3
• FSCN2, GUCA1B
• NR2E3 BEST1
• FZD4, KCNJ13, LRP5, NDP, TSPAN12, VCAN
• ABHD12, CDH23, CIB2, DFNB31, GPR98, HARS, MYO7A, PCDH15, USH1C, USH1G
• CLRN1, USH2A
• CEP290
• BBS1
• ARL6, BBS2, BBS4, BBS5, BBS7, BBS9, BBS10, BBS12, INPP5E, LZTFL1, MKKS, MKS1, SDCCAG8, TRIM32, TTC8
Page 30
Building virtual panels
Panel for RP
Page 31
Building virtual panels
Extended panel
for RP
Page 32
Building virtual panels
Super extended
panel for RP
Page 33
The final schema: diagnostic and discovery
[Figure: workflow diagram. Labels: sequencing platforms (MiSeq, IonTorrent, IonProton, HiSeq); Future technologies; Data preprocessing; Sequences; Sequence DB; Freqs.; Freq. popul.; Knowledge DB; Candidate prioritization; All; Disease; New variants; NO; Diagnostic; Therapeutic decision; New knowledge for future diagnostic]
Page 34
Implementation of tools for genomic big data
management in the IT4I Supercomputing
Center (Czech Republic)
The pipelines of primary and secondary analysis developed by the Computational Genomics Department have proven their efficiency in the analysis of more than 1000 exomes in a joint collaborative project of the CIBERER and the MGP.
A first pilot has been implemented in the IT4I supercomputing center, which aims to centralize the analysis of genomics data in the country.
Genomic data management solutions scalable to country size
Page 35
What is next?
Miniaturized sequencing devices (still far away from the clinic)…
…that will bring sequencing closer to the bedside
We only lack the bioinformatics to deal with it
Software development
• More than 150,000 experiments were analyzed with our tools during the last year. See the interactive map of use over the last 24 h: http://bioinfo.cipf.es/toolsusage
• Babelomics is the third most cited tool for functional analysis. It includes more than 30 tools for advanced, systems-biology-based data analysis.
• HPC on CPU, SSE4 and GPUs for NGS data processing: speedups of up to 40X.
• Genome Maps, an ultrafast genome viewer built with Google technology, is now part of the ICGC data portal.
• CellBase is now available at the EBI.
• Prototype running in the Czech Republic.
[Figure labels: Mapping; Visualization; Functional analysis; Variant annotation; CellBase knowledge database; Variant prioritization; NGS panels; Signaling network; Regulatory network; Interaction network; Diagnostic]
Page 37
The Computational Genomics Department at the Centro de Investigación Príncipe Felipe (CIPF),
Valencia, Spain, and…
...the INB, National Institute of
Bioinformatics (Functional Genomics
Node) and the BiER
(CIBERER Network of Centers for Rare
Diseases)
@xdopazo
@bioinfocipf