Page 1
Introduction
Barbera van Schaik
Welcome
Scale of sequence data
DNA sequencing
Genome projects
Bioinformatics databases and tools
Databases
Sequence analysis
Handling sequence data
Computing
Application areas
Graduate School
Bioinformatics Sequence Analysis
Introduction
Barbera van Schaik
Bioinformatics Laboratory, KEBB
Academic Medical Center
[email protected]
March 9, 2020
1 / 50
Page 2
Related Graduate School courses
• DNA technology
• Unix
• Computing in R
• Practical biostatistics
• Advanced biostatistics
• Bioinformatics
• Bioinformatics Sequence Analysis
• Research Data Management
https://www.amc.nl/web/leren/graduate-school.htm
Page 3
In this course
Bioinformatics Sequence Analysis
You will learn what is behind commonly used methods for sequence analysis, how to analyze datasets with (reasonably) user-friendly interfaces, and get introduced to command-line tools for next-generation sequencing (NGS).
Page 4
Not in this course
1 Sequence assembly
2 Bisulphite sequencing
3 Protein sequence analysis
4 Metagenomics
Page 5
Bioinformatics Sequence Analysis
1 Introduction to sequence analysis
2 Sequencing techniques
3 Brief introduction to Linux and R (self-study)
4 NGS pre-processing
5 (Multiple) sequence alignment
6 Case: Neuroblastoma
7 Introduction to R2
8 Exome sequence analysis
9 RNAseq
The focus is on human data, but many techniques are also applicable to other organisms.
Page 6
Practical things
Certificate
• Attend all sessions (one day can be skipped; ask about the possibility of self-study)
• Active participation
Other things
• Lunch is not included
• Coffee is available at the machines with your AMC card
• Slides and exercises are published on https://bioinformatics.amc.nl/
Page 7
In this hour
Introduction
You will get an indication of the scale of sequence data, how to handle the data, where to find publicly available data and tools, and what can be done with NGS.
Page 8
Overview
1 Welcome
2 Scale of sequence data
  DNA sequencing
  Genome projects
3 Bioinformatics databases and tools
  Databases
  Sequence analysis
4 Handling sequence data
  Computing
  Application areas
Page 9
Sanger
Page 10
Automated sequencing
Page 11
Sequencing centers
Page 12
Next generation sequencing
Page 13
Genome projects
• Human Genome Project (HGP)
• 1000 genomes project (1000g)
• UK10K, followed by the 100K genomes project
• Personal genomes
Page 14
Human Genome Project
http://web.ornl.gov/sci/techresources/Human_Genome/index.shtml
Page 15
Human Genome Project
http://web.ornl.gov/sci/techresources/Human_Genome/index.shtml
Page 16
1000 genomes project
http://www.1000genomes.org/
Page 17
UK10K
4000 genomes
6000 exomes
http://www.uk10k.org/
Page 18
The 100K genomes project
The project will focus on patients with a rare disease and their families and patients with cancer. The first samples for sequencing are being taken from patients living in England with discussions taking place with Scotland, Wales and Northern Ireland about potential future involvement.
http://www.genomicsengland.co.uk/
Page 19
Personal genomes
100,000 genomes plus medical records
http://www.personalgenomes.org/
Page 20
Sequencers around the world
http://omicsmaps.com/
Page 21
Sequencers around the world 2015
http://omicsmaps.com/
Page 22
Big data
Page 23
DNA sequencing rate
Stephens et al. (2015) PLoS Biology
Page 24
GenBank, EMBL and DDBJ
International Nucleotide Sequence Database Collaboration
Daily exchange of sequence data
https://www.ncbi.nlm.nih.gov/
https://www.ebi.ac.uk/
http://www.ddbj.nig.ac.jp/
Page 25
Nucleotide sequence databases
From: http://www.davelunt.net/
Page 26
GenBank
Release 236 (Feb 2020) has 399,376,854,872 base pairs from 216,214,215 sequences. In addition, there are 1,206,720,688 WGS records containing 6,968,991,265,752 base pairs of sequence data.
https://www.ncbi.nlm.nih.gov/genbank/statistics/
The number of bases in GenBank has doubled approximately every 18 months
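To get a feeling for these numbers, the short Python sketch below computes the average record length from the release 236 figures and extrapolates the database size. The function names are made up for this illustration, and the extrapolation assumes that the 18-month doubling time simply continues, which is an assumption and not a forecast.

```python
# Figures quoted above for GenBank release 236 (Feb 2020).
BASES = 399_376_854_872
SEQUENCES = 216_214_215
WGS_BASES = 6_968_991_265_752
DOUBLING_MONTHS = 18  # approximate doubling time stated on the slide


def mean_sequence_length(bases: int, sequences: int) -> float:
    """Average number of base pairs per sequence record."""
    return bases / sequences


def projected_bases(current: int, months_ahead: float,
                    doubling_months: float = DOUBLING_MONTHS) -> float:
    """Extrapolate the size, assuming steady exponential growth."""
    return current * 2 ** (months_ahead / doubling_months)


if __name__ == "__main__":
    mean_len = mean_sequence_length(BASES, SEQUENCES)
    print(f"Mean record length: {mean_len:.0f} bp")
    print(f"WGS bases relative to traditional records: {WGS_BASES / BASES:.1f}x")
    for years in (1.5, 3, 6):
        estimate = projected_bases(BASES, years * 12)
        print(f"In {years} years: about {estimate:.2e} bp")
```

With a fixed doubling time, each additional 18 months multiplies the total by two, so 6 years corresponds to a factor of 2^4 = 16.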
Page 27
Core databases and derivatives
Nucleotide sequence databases
• Core (RNA, DNA): GenBank, EMBL, DDBJ
• RNA grouped per gene: UniGene
• Genome assemblies: human, model organisms, bacteria, plants, etc.
• Genome comparisons: conserved regions
• DNA motifs: protein binding sites, conserved regions, DNA structure, restriction sites
• Gene expression: Expressed Sequence Tags (ESTs), RNAseq
• Variants: SNPs, insertions and deletions, structural variants, allele databases
• Specialized databases: gene specific, disease specific, genome projects
• Metagenomics: microbiome, environment samples
• Protein translations
Page 28
Where to start?
https://www.oxfordjournals.org/nar/database/c/
Page 29
Sequence analysis
Sequence alignment
• Needleman-Wunsch
• Smith-Waterman
• BLAST
• BLAT
• ClustalW
• BWA, BFAST, Bowtie, TopHat, etc.
Sequence suites/packages
• EMBOSS package
• CLC bio workbench
• Galaxy
• R/Bioconductor
Hundreds of tools to analyse sequence data...
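As an illustration of the first method in the alignment list, here is a minimal Python sketch of Needleman-Wunsch global alignment. The scoring values (match +1, mismatch -1, linear gap penalty -1) are assumptions chosen for the example; real tools use substitution matrices and affine gap penalties, and this sketch is not how BLAST, BWA or the other listed tools work internally.

```python
def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = -1,
                     gap: int = -1) -> tuple[int, str, str]:
    """Global alignment of a and b; returns (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def substitution(i: int, j: int) -> int:
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the matrix: each cell takes the best of three possible moves.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + substitution(i, j),  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,                     # gap in b
                score[i][j - 1] + gap,                     # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + substitution(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[len(a)][len(b)], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    total, top, bottom = needleman_wunsch("ACGT", "AGT")
    print(total)
    print(top)
    print(bottom)
```

Smith-Waterman (local alignment) uses the same matrix idea, but cell scores are not allowed to drop below zero and the traceback starts at the highest-scoring cell.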
Page 30
Tools
https://academic.oup.com/nar/article/47/W1/W1/5524725
Page 31
Tools
Most tools are only available via the command line (on Linux systems)
Page 32
Open source
• Free as in freedom
• You can use, change, integrate, and review the code
• Open source allows sharing and promotes collaboration
• No vendor lock-in
Page 33
Open source
• Software
• Databases
• Journals
• Standards
• Hardware
• Art
• Money
• Drinks
• Medicine
• Fashion
• Education
https://en.wikipedia.org/wiki/Open_source
Page 34
Handling sequence data
Page 35
Buy a bigger cluster (centralized model)
Page 36
Dutch life science grid
http://surfsara.nl/
Page 37
Cloud computing
Page 38
HPC cloud at SURFsara
You will use a Linux environment that runs on the HPC cloud to get acquainted with command-line tools
Page 39
NGS application areas
Page 40
Whole genomes
• De novo sequencing
• Re-sequencing
• Copy number variations
• Rearrangements
• New insertions/deletions/mutations
Page 41
Structural variation
The Human Genome Structural Variation Working Group, Nature 2007
Page 42
SNP / haplotype analysis
Linkage studies
Forensic research
Page 43
Gene expression
https://en.wikipedia.org/wiki/Regulation_of_gene_expression
• Full-length transcripts
• EST sequencing
• 5’ transcript ends (5’-RATE, CAGE)
• SAGE ditag sequencing
• SAGE-like 3’ end sequencing
• Nebulized fragments
• ncRNA sequencing
Page 44
Epigenetics
• Treatment with sodium bisulfite
• Unmethylated cytosines change into uracil
• Methylated cytosines are unchanged
• Compare sequences with reference sequence
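The bisulfite principle can be mimicked in a few lines of Python. This is only a toy sketch: the function names are hypothetical, it looks at a single strand, it assumes the read is already aligned to the reference without gaps, and it relies on the fact that uracil is read as T after amplification and sequencing.

```python
def bisulfite_convert(seq: str, methylated: set[int]) -> str:
    """Simulate bisulfite treatment as seen after sequencing.

    Unmethylated C becomes uracil, which is read as T.
    Methylated C (0-based positions in `methylated`) stays C.
    """
    return "".join(
        "T" if base == "C" and pos not in methylated else base
        for pos, base in enumerate(seq.upper())
    )


def call_methylation(reference: str, read: str) -> dict[int, bool]:
    """For every C in the reference, report True if the read still shows C."""
    if len(reference) != len(read):
        raise ValueError("this sketch assumes an ungapped, same-length alignment")
    calls = {}
    for pos, (ref, obs) in enumerate(zip(reference.upper(), read.upper())):
        # Only C (methylated) or T (converted) are informative at a reference C.
        if ref == "C" and obs in ("C", "T"):
            calls[pos] = obs == "C"
    return calls


if __name__ == "__main__":
    reference = "ACGTCCGA"
    read = bisulfite_convert(reference, methylated={1, 5})
    print(read)
    print(call_methylation(reference, read))
```

In real data a T at a reference C can also be a genuine C-to-T variant or a sequencing error, so dedicated tools combine many reads per position before calling methylation.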
Page 45
Metagenomics and microbial diversity
• Study genomic content in a complex mixture of microorganisms (bacteria or viruses in some environment)
• Identify new species
Page 46
Paleogenomics
Sequencing of ancient DNA
• Mummies
• Sabretooth
• Mammoth
• Neanderthal
Page 47
Gene regulation
Page 48
Sequence analysis
Usually starts with sequence alignment or sequence assembly
Depending on the application, other tools/methods are used or developed
Page 49
With a click of a button...
... or perhaps not. You will find out during this course.
Computer exercises in sequence analysis:
1 Via web tools
2 Creating pipelines online
3 With command-line tools in a Linux environment
Page 50
Bioinformatics Sequence Analysis
1 Introduction to sequence analysis
2 Sequencing techniques
3 Brief introduction to Linux and R (self-study)
4 NGS pre-processing
5 (Multiple) sequence alignment
6 Case: Neuroblastoma
7 Introduction to R2
8 Exome sequence analysis
9 RNAseq