Bioinformatics and NGS for advancing in hearing loss research

Joaquín Dopazo

Computational Genomics Department,

Centro de Investigación Príncipe Felipe (CIPF),

Functional Genomics Node (INB),

Bioinformatics Group (CIBERER) and

Medical Genome Project,

Spain.

Bioinformatics and NGS: an indissoluble marriage for advancing in hearing loss research

http://bioinfo.cipf.es http://www.medicalgenomeproject.com http://www.babelomics.org http://www.hpc4g.org @xdopazo

Fundación Ramón Areces, Madrid, 5 March 2015

Why are bioinformatics and NGS important?

Lessons learned from the Spanish 1000 genomes project:

Rare and familial diseases sequencing initiative

• Metabolic (86 samples)

• Opitz

• Atypical fracture

• coQ10 deficiency

• Congenital disorder of glycosylation types I and II

• Maple syrup urine disease

• Pelizaeus-like

• 4 unknown syndromes

• Genetic (24 samples)

• Charcot-Marie-Tooth

• Rett Syndrome

• Neurosensorial (35 samples)

• Usher

• AD non-syndromic hearing loss

• AR non-syndromic hearing loss

• RP

• Mitochondrial (28 samples)

• Progressive external ophthalmoplegia

• Multi-enzymatic deficiency in mitochondrial respiratory complexes

• CoQ disease

• Other

• APL (10 samples)

Autism (37 samples)

Mental retardation (autosomal recessive) (24)

Immunodeficiency (18)

Leber's congenital amaurosis (9)

Cataract (2)

RP(AR) (60)

RP(AD) (46)

Deafness (24)

CLAPO (4)

Skeletal Dysplasia (3)

Cantú syndrome (1)

Dubowitz syndrome (2)

Gorham-Stout syndrome (1)

Malpuech syndrome (4)

Hirschsprung’s disease (81)

Hereditary macrothrombocytopenia (3)

MTC (41)

Controls (301)

1044 samples = 183 samples + 200 controls + 360 samples + 301 controls

Organization of the initiative

Diseases with:

• Unknown genes

• Known genes/mutations discarded

Search for:

• Novel genes

• Responsible genes known but unknown modifier genes

• Susceptibility genes

• Therapeutic targets

http://www.gbpa.es/

Data production Sequencing platforms Data analysis

Big-Data Team

science paradigm

Data management, analysis and storage


[Figure: data flow. Labels: samples; sequencing reads; raw files (FastQ); analysis pipeline; processed files (BAM, VCF); storage DB; knowledge DB (K-DB); prioritization report (list of genes); dialog with experts in the disease + validations]

Pipeline of data analysis

Primary analysis

Initial QC (FASTQ file)

• Sequence cleansing

• Base quality

• Remove adapters

• Remove duplicates

Mapping + QC (BAM file)

• Mapping (HPG)

• Remove multiple mapping reads

• Remove low quality mapping reads

• Realigning

• Base quality recalibrating

Variant calling + QC (VCF file)

• Calling and labeling of missing values

• Calling SNVs and indels (GATK) using 6 statistics based on QC, strand bias and consistency (poor-quality calls are converted to missing values as well)

• Create multiple VCF with missing, SNVs and indels

Gene prioritization

Variant and gene prioritization + QC (Report)

• Counts of sites with variants

• Variant annotation (function, putative effect, conservation, etc.)

• Inheritance analysis (including compound heterozygotes in recessive inheritance)

• Filtering by frequency with external controls (Spanish controls, dbSNP, 1000g, ESP) and annotation

• Multi-family intersection of genes and variants

• Function/Network-based prioritization
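The inheritance analysis step, including compound heterozygotes under recessive inheritance, can be illustrated with a minimal Python sketch. The data layout (one dict per variant with genotype strings for a child/father/mother trio), the gene and variant names, and the function names are assumptions made for illustration; this is not the pipeline's actual implementation.

```python
from itertools import combinations


def is_het(gt):
    """True for a called heterozygous genotype such as '0/1'."""
    alleles = gt.replace("|", "/").split("/")
    return len(alleles) == 2 and "." not in alleles and alleles[0] != alleles[1]


def is_hom_ref(gt):
    """True for a called homozygous-reference genotype."""
    return gt.replace("|", "/") == "0/0"


def compound_het_candidates(variants):
    """Return {gene: [(variant_id, variant_id), ...]} for one trio.

    A pair qualifies when the child is heterozygous at both sites and
    one variant comes from the father only, the other from the mother only.
    """
    inherited = {}
    for v in variants:
        if not is_het(v["child"]):
            continue
        if is_het(v["father"]) and is_hom_ref(v["mother"]):
            origin = "father"
        elif is_het(v["mother"]) and is_hom_ref(v["father"]):
            origin = "mother"
        else:
            continue  # ambiguous or missing parental genotype
        inherited.setdefault(v["gene"], []).append((v["id"], origin))

    hits = {}
    for gene, items in inherited.items():
        pairs = [(a[0], b[0]) for a, b in combinations(items, 2) if a[1] != b[1]]
        if pairs:
            hits[gene] = pairs
    return hits


# Hypothetical trio: GENE_A carries one paternal and one maternal variant.
trio_variants = [
    {"gene": "GENE_A", "id": "v1", "child": "0/1", "father": "0/1", "mother": "0/0"},
    {"gene": "GENE_A", "id": "v2", "child": "0/1", "father": "0/0", "mother": "0/1"},
    {"gene": "GENE_B", "id": "v3", "child": "0/1", "father": "0/1", "mother": "0/0"},
]
print(compound_het_candidates(trio_variants))
```

Variants with a missing parental genotype are skipped rather than assumed to be reference, so a gene is only reported when both parental origins are actually observed.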

Pipeline of data analysis

Primary processing (primary analysis)

• Initial QC (FASTQ file)

• Mapping (BAM file)

• Variant calling (VCF file)

Secondary analysis (successive filtering; gene prioritization)

• Variant annotation

• Filtering by effect

• Filtering by MAF

• Filtering by family segregation

Knowledge-based prioritization

• Proximity to other known disease genes: functional proximity, network proximity

• Burden tests

• Other prioritization methods

VARIANT annotation tool

Variant annotation

HPG Variant, a suite of tools for HPC-based genomic variant annotation (VARIANT = VARIant ANnotation Tool). Tools implemented using OpenMP, Nvidia CUDA and MPI for large clusters.

EFFECT: a cloud-based genomic variant effect predictor, available as a CLI and as a web application (http://variant.bioinfo.cipf.es, Medina 2012 NAR)

VCF: C library and tool that allows the analysis of large VCF files with a low memory footprint: stats, filter, split, merge, etc. Example: hpg-variant vcf --stats --vcf-file ceu.vcf

Annotations

sought

The knowledge database

CellBase (Bleda, 2012, NAR), a comprehensive integrative database and RESTful Web Services API, more than 250GB of data and 90 tables exported in TXT and JSON:

● Core features: genes, transcripts, exons, cytobands, proteins (UniProt),...

● Variation: dbSNP and Ensembl SNPs, HapMap, 1000Genomes, Cosmic, ...

● Functional: 40 OBO ontologies (Gene Ontology), Interpro, etc.

● Regulatory: TFBS, miRNA targets, conserved regions, etc.

● Systems biology: Interactome (IntAct), Reactome database, co-expressed genes.

NoSQL and scales to TB

Wiki: http://docs.bioinfo.cipf.es/projects/cellbase/wiki

Project: http://bioinfo.cipf.es/compbio/cellbase

Now available at the EBI: http://www.ebi.ac.uk/cellbase/webservices/rest/v3/



1000 genomes

EVS

Local variants

Use known variants and their population frequencies to filter out common variants.

• Typically dbSNP, 1000 genomes and the 6515 exomes from the ESP are used as sources of population frequencies.

• We sequenced 300 healthy controls (rigorously phenotyped) to add an extra filtering step to the analysis pipeline
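The effect of adding local controls as an extra frequency filter can be sketched in Python as follows. The source names, the 0.01 frequency threshold, the variant identifiers and the frequencies are all hypothetical values chosen for illustration.

```python
def passes_frequency_filter(freqs, max_af=0.01):
    """freqs maps a source name to the allele frequency observed there.

    None means the variant is absent from that source, which never
    excludes it. Any source above max_af excludes the variant.
    """
    return all(af is None or af <= max_af for af in freqs.values())


def rare_variants(candidates, max_af=0.01):
    """Return the ids of the candidates that survive the frequency filter."""
    return [c["id"] for c in candidates if passes_frequency_filter(c["freqs"], max_af)]


# Hypothetical candidates: var2 is rare worldwide but common locally.
candidates = [
    {"id": "var1", "freqs": {"1000g": None, "esp": None, "local_controls": None}},
    {"id": "var2", "freqs": {"1000g": 0.002, "esp": None, "local_controls": 0.08}},
    {"id": "var3", "freqs": {"1000g": 0.15, "esp": 0.12, "local_controls": 0.2}},
]

# Same candidates with the local control frequencies removed.
without_local = [
    {"id": c["id"],
     "freqs": {k: v for k, v in c["freqs"].items() if k != "local_controls"}}
    for c in candidates
]

print("external sources only:", rare_variants(without_local))
print("plus local controls:  ", rare_variants(candidates))
```

In this toy data the locally common variant survives when only external sources are used and is removed once the local controls are added, which is the kind of reduction the extra filtering step aims for.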

Comparison of MGP controls to 1000g

Novembre et al., 2008. Genes mirror geography within Europe. Nature

How important do you think local information is to detect disease genes?

Filtering with or without local variants

Number of genes as a function of the number of individuals in the study of a dominant disease: autosomal dominant retinitis pigmentosa

The use of local variants makes an enormous difference

The CIBERER Exome Server (CES): the first repository of variability of the Spanish population

http://ciberer.es/bier/exome-server/

Only one other similar initiative exists, the GoNL (http://www.nlgenome.nl/), and, more recently, one for the Finnish population

Information provided

• Genomic coordinates, variation, gene

• SNP id, if any

• Genotypes in the different reference populations

• PolyPhen and SIFT pathogenicity indexes

• Phenotype, if available

Variants can also be seen within their genomic context

GenomeMaps viewer (Medina et al., 2013, NAR) embedded in the application.

GenomeMaps is the official genome viewer of the ICGC (http://dcc.icgc.org/)

Occurrence of pathological variants in “normal” population

• Reference genome is mutated

• Nine carriers in 1000 genomes

• One affected and 73 carriers in EVS

Table of Spanish Frequencies (TSF)

DB of Spanish variants (DBSV)

Chr  Position  Ref  Alt  0/0  0/1  1/1

1    1365313   A    T    75   0    0

1    1484884   G    A    70   4    1

2    326252    T    C    25   35   15
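The genotype counts in the table are enough to derive an alternate allele frequency per position. A minimal Python sketch, using the three rows shown above (the function name is an assumption; how the server itself computes frequencies is not stated in the slides):

```python
def alt_allele_frequency(n_hom_ref, n_het, n_hom_alt):
    """Alternate allele frequency from diploid genotype counts.

    Each heterozygote carries one alternate allele, each alternate
    homozygote carries two, and every called sample contributes two alleles.
    """
    n_called = n_hom_ref + n_het + n_hom_alt
    if n_called == 0:
        return None  # no called genotypes at this position
    return (n_het + 2 * n_hom_alt) / (2 * n_called)


# Rows from the table: chr, position, ref, alt, counts of 0/0, 0/1 and 1/1.
rows = [
    ("1", 1365313, "A", "T", 75, 0, 0),
    ("1", 1484884, "G", "A", 70, 4, 1),
    ("2", 326252, "T", "C", 25, 35, 15),
]

for chrom, pos, ref, alt, hom_ref, het, hom_alt in rows:
    af = alt_allele_frequency(hom_ref, het, hom_alt)
    print(f"{chrom}:{pos} {ref}>{alt} alt allele frequency = {af:.3f}")
```

For the second row, for example, the 4 heterozygotes and 1 alternate homozygote give 6 alternate alleles out of 150, a frequency of 0.04.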

CES use

[Figure: decision flow for CES input. Labels: external; internal; VCFs; other countries; regional; Unrelated? (DBSV); Spanish? (TSF); YES/NO branches; counts]

AIM (ancestry-informative markers) are used to discard kinship and different ethnicity

Organization of the database

Project D1 D2 … Case Control Counts

A x x f1

X x f2

B X X f3

X X f4

C X X f5

X X f6

X X f7

… … … … … … …

Organized in projects / diseases / case-control. Frequencies are calculated for each project-disease-status, and selections can be done as required. The items can be combined to maximize pseudo-control sample size

Example: frequencies f1, f2, and f5 can be used as pseudo-controls for studying disease D2. Under a less stringent scenario f4 and f6 could also be used.
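Combining the selected items into a single pseudo-control set amounts to adding up their genotype counts. The sketch below assumes each item (f1, f2, ...) stores counts of 0/0, 0/1 and 1/1 genotypes for one variant; the numbers are invented for illustration and the selection itself is left to the user, as in the example above.

```python
def pool_counts(strata, selected):
    """Sum (hom_ref, het, hom_alt) genotype counts over the selected items."""
    total = [0, 0, 0]
    for name in selected:
        for i, n in enumerate(strata[name]):
            total[i] += n
    return tuple(total)


# Hypothetical genotype counts for one variant in each project/disease/status item.
strata = {
    "f1": (40, 8, 2),
    "f2": (30, 5, 0),
    "f4": (20, 3, 1),
    "f5": (55, 10, 1),
    "f6": (25, 4, 0),
}

strict = pool_counts(strata, ["f1", "f2", "f5"])
relaxed = pool_counts(strata, ["f1", "f2", "f4", "f5", "f6"])

print("strict pseudo-controls: ", strict, "samples:", sum(strict))
print("relaxed pseudo-controls:", relaxed, "samples:", sum(relaxed))
```

The relaxed selection gives a larger pseudo-control sample at the cost of including items that are less clearly unrelated to the disease under study.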

Are we there yet?

Variability spectrum of the Spanish population

A total of 131,897 variant positions unique to the Spanish population were detected across all 75 samples. Approximately 90,000 were singletons. 51,295 variants are non-synonymous changes and 18,450 are synonymous changes (the opposite pattern to that of variants shared with 1000g and EVS).

CIBERER exome server roadmap and the Spanish 1000 genomes project (Spanex)

[Figure: roadmap in three phases (Phase I, Phase II, Phase III) along a timeline labelled 2014-June, 2014, 2015 and Today. Labels: CIBERER, 76 samples; MGP, 269 samples (healthy controls); CES II, 76+269+X (mixed); more CIBERER samples; SPANEX: 1000 exomes (200 ongoing); CES II, 1000+76+269+X (mixed); 400]

BiERapp: interactive web-based tool for easy candidate prioritization by successive filtering

[Figure: BiERapp architecture. Labels: sequencing center; data preprocessing (FASTQ, BAM, VCF); NoSQL (Mongo) VCF indexing; BiERapp filters (population frequencies, consequence types, experimental design); Genome Maps as BAM viewer and genomic context; easy scale-up; example pedigree with samples NA19660, NA19661, NA19600 and NA19685]

BiERapp: the interactive filtering tool for easy candidate prioritization

http://bierapp.babelomics.org Aleman et al., 2014 NAR


[Figure: pedigree of samples NA19660, NA19661, NA19600 and NA19685, shown twice. First with genotypes A/T, A/T, T/T and A/T; then with the first genotype missing (?/?, A/T, T/T and A/T)]

A proper filtering system must consider missing values

An alternative allele can go unreported for two reasons:

a) The position was read and the reference allele was found

b) The position could not be read and/or it was of low quality (missing value)

Most VCF formats do not allow the two scenarios to be told apart.

We specifically include missing values
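The difference between "reference allele found" and "missing value" matters when checking family segregation. A minimal Python sketch for a recessive model, assuming genotypes are encoded as "0/0", "0/1", "1/1" and "./." for missing (the sample names and the three-way outcome are illustrative assumptions, not the tool's actual logic):

```python
MISSING = "./."


def recessive_segregation(genotypes, affected):
    """Classify a variant as 'pass', 'fail' or 'undetermined'.

    Affected samples must be homozygous for the alternate allele and
    unaffected samples must not be. A missing genotype is never read as
    reference: it only makes the result undetermined.
    """
    undetermined = False
    for sample, gt in genotypes.items():
        if gt == MISSING:
            undetermined = True
            continue
        hom_alt = gt == "1/1"
        if (sample in affected) != hom_alt:
            return "fail"
    return "undetermined" if undetermined else "pass"


affected = {"affected_child"}

complete = {"father": "0/1", "mother": "0/1",
            "affected_child": "1/1", "unaffected_child": "0/1"}
with_missing = dict(complete, father=MISSING)
contradicted = dict(complete, affected_child="0/1")

print(recessive_segregation(complete, affected))
print(recessive_segregation(with_missing, affected))
print(recessive_segregation(contradicted, affected))
```

Keeping "undetermined" as a separate outcome lets the analyst decide whether to keep such variants, instead of silently treating an unread position as a reference call.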

Successive filtering approach

An example with 3-Methylglutaconic aciduria syndrome

3-Methylglutaconic aciduria (3-MGA-uria) is a heterogeneous group of syndromes characterized by an increased excretion of 3-methylglutaconic and 3-methylglutaric acids.

WES with a consecutive filter approach is enough to detect the new mutation in this case.

[Figure: filtering workflow. Labels: primary analysis; VCF; BiERapp (frequency, deleteriousness, experimental design); Gene (yes/no); readjusting filtering thresholds; GO enrichment, network analysis, pathway analysis; paper]

Quite often the result is not conclusive, either because too many or because too few candidates remain.

And it is completely dependent on the disease and the experimental setup

In our experience, easy interactivity in the filtering is the best asset for gene discovery
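Successive filtering with readjustable thresholds can be sketched as a chain of filters that reports how many candidates remain after each step. Everything in this Python example is hypothetical: the variant records, the field names, the consequence terms and the 0.01 threshold.

```python
def successive_filtering(variants, filters):
    """Apply named filters in order and record the count left after each."""
    remaining = list(variants)
    trace = [("input", len(remaining))]
    for name, keep in filters:
        remaining = [v for v in remaining if keep(v)]
        trace.append((name, len(remaining)))
    return remaining, trace


def build_filters(max_af):
    """Frequency, deleteriousness and experimental-design filters."""
    damaging = {"stop_gained", "missense_variant"}
    return [
        ("frequency", lambda v: v["af"] <= max_af),
        ("deleteriousness", lambda v: v["effect"] in damaging),
        ("experimental design", lambda v: v["segregates"]),
    ]


# Hypothetical annotated variants.
variants = [
    {"id": "v1", "af": 0.0, "effect": "stop_gained", "segregates": True},
    {"id": "v2", "af": 0.2, "effect": "missense_variant", "segregates": True},
    {"id": "v3", "af": 0.001, "effect": "synonymous_variant", "segregates": True},
    {"id": "v4", "af": 0.005, "effect": "missense_variant", "segregates": False},
]

candidates, trace = successive_filtering(variants, build_filters(max_af=0.01))
for step, count in trace:
    print(f"{step}: {count} variants")
print("candidates:", [v["id"] for v in candidates])
```

Calling `build_filters` again with a different `max_af` and rerunning is the scripted equivalent of readjusting a threshold when too many or too few candidates remain; an interactive tool makes the same loop immediate.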

Results: 36 new disease variants in known genes and 27 disease variants in 13 new genes

[Figure: WES findings. Labels: IRDs; arRP (EYS); BBS; arRP; arRP (USH2); 3-MGA-uria (SERAC1); NBD (BCKDK)]

Diagnosis by targeted resequencing (panels, real or virtual, of genes)

Collaboration with M.A. Moreno, Hospital Ramón y Cajal

Tool for defining panels

Diagnostic mutations

If no diagnostic variants appear, then variants of uncertain effect are studied

Incidental findings can also be handled

New filter based on local population variant frequencies

http://team.babelomics.org

Virtual panels are a reality

4813 genes with known phenotypes.

• One physical panel

• As many virtual panels as you need
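The idea of one physical panel and many virtual panels can be sketched as gene-set filters applied at analysis time to the same sequenced variants. In this Python example the panel names, their gene membership and the variant records are illustrative assumptions only.

```python
def apply_virtual_panel(variants, panel_genes):
    """Keep only the variants that fall in the genes of the chosen panel."""
    return [v for v in variants if v["gene"] in panel_genes]


# Illustrative virtual panels defined on top of one sequenced gene set.
virtual_panels = {
    "usher": {"USH2A", "MYO7A", "CLRN1"},
    "rp": {"RHO", "EYS", "USH2A"},
}

# Hypothetical variants found with the physical panel.
variants = [
    {"id": "v1", "gene": "USH2A"},
    {"id": "v2", "gene": "RHO"},
    {"id": "v3", "gene": "MYO7A"},
    {"id": "v4", "gene": "CEP290"},
]

for name, genes in virtual_panels.items():
    hits = apply_virtual_panel(variants, genes)
    print(name, [v["id"] for v in hits])
```

Because the restriction happens in software, a panel can be widened or redefined later without sequencing the sample again.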

CACNA1F, CACNA2D4

GNAT2

RP

CORD/COD

CORD/COD

CVD

CVD

MD

LCA

ERVR/EVR

C2ORF71, C8ORF37, CA4, CERKL, CNGA1, CNGB1, DHDDS, EYS, FAM161A, IDH3B, KLHL7, IMPG2, MAK, NRL, PAP1, PDE6A, PDE6G, PRCD, PRF3, PRPF8, PRPF31, RBP3, RGR, ROM1, RP1, RP2, SNRNP200, TOPORS, TTC8, ZNF513

PDE6B, RHO, SAG

GRK1, GRM6, NYX, TRPM1

CABP4,

LCA5, RD3

CRB1, IMPDH1, LRAT, MERTK, RDH12, RPE65, SPATA7, TULP1

CRX

AIPL1, GUCY2D, RPGRIP1

ADAM9, GUCA1A, HRG4/UNC119, KCNV2, PDE6H, PITPNM3, RAX2, RDH5, RIM1

CNGA3, PDE6C

BCP, GCP, RCP

ABCA4, PROM1, PRPH2, RPGR

RLBP1, SEMA4A

C1QTNF5, EFEMP1, ELOVL4, HMNC1, RS1, TIMP3

FSCN2, GUCA1B

NR2E3 BEST1

FZD4, KCNJ13, LRP5, NDP, TSPAN12, VCAN

NB

ABHD12, CDH23, CIB2, DFNB31, GPR98, HARS, MYO7A, PCDH15, USH1C, USH1G

CLRN1, USH2A

USH

CEP290

BBS1

BBS

ARL6, BBS2, BBS4, BBS5, BBS7, BBS9, BBS10, BBS12, INPP5E, LZTFL1, MKKS, MKS1, SDCCAG8, TRIM32, TTC8

Building virtual panels

An example with Inherited Retinal Dystrophies

• LCA: Leber Congenital Amaurosis

• CORD/COD: Cone and cone-rod dystrophies

• CVD: Colour Vision Defects

• MD: Macular Degeneration

• ERVR/EVR: Erosive and Exudative Vitreoretinopathies

• USH: Usher Syndrome

• RP: Retinitis Pigmentosa

• NB: Night Blindness

• BBS: Bardet-Biedl Syndrome

Building virtual panels

[Figure: the gene and phenotype diagram above, shown three more times with a growing selection highlighted: panel for RP, extended panel for RP, and super extended panel for RP]

The final schema: diagnostic and discovery

[Figure: final schema. Labels: sequencing platforms (MiSeq, IonTorrent, IonProton, HiSeq) and future technologies; data preprocessing; sequences; sequence DB; freqs.; freq. popul.; knowledge DB; candidate prioritization; disease; all; new variants; NO; diagnostic; therapeutic decision; new knowledge for future diagnostic]

Genomic data management solutions scalable to country size

Implementation of tools for genomic big data management in the IT4I Supercomputing Center (Czech Republic)

The pipelines of primary and secondary analysis developed by the Computational Genomics Department have proven their efficiency in the analysis of more than 1000 exomes in a joint collaborative project of the CIBERER and the MGP

A first pilot has been implemented in the IT4I supercomputing center, which aims to centralize the analysis of genomics data in the country.

What is next?

Miniaturized sequencing devices (still far away from the clinic)…

…that will bring sequencing closer to the bedside

We only lack the bioinformatics to deal with it

Software development

See the interactive map of tool usage over the last 24h: http://bioinfo.cipf.es/toolsusage

Babelomics is the third most cited tool for functional analysis. It includes more than 30 tools for advanced, systems-biology-based data analysis

More than 150,000 experiments were analyzed in our tools during the last year

HPC on CPU, SSE4 and GPUs for NGS data processing: speedups of up to 40X

Genome Maps is now part of the ICGC data portal

Ultrafast genome viewer with Google technology

[Figure: software map. Labels: mapping; visualization; functional analysis; variant annotation; CellBase knowledge database; variant prioritization; NGS panels; signaling network, regulatory network, interaction network; diagnostic]

CellBase is now available at EBI

Prototype running in Czech Republic

The Computational Genomics Department at the Centro de Investigación Príncipe Felipe (CIPF), Valencia, Spain, and the INB, National Institute of Bioinformatics (Functional Genomics Node) and the BiER (CIBERER Network of Centers for Rare Diseases)

@xdopazo

@bioinfocipf
