Bioinformatics
Jan T. Kim
Introduction
Mol. Bio. Basics
Plant Phylogeny
FMD Transmission
Sequence Analysis
Pairwise Alignment
BLAST
NGS
MSA
Bioinformatics and Advanced Programming
Jan T. Kim
BCS Advanced Programming SG, 11 Dec 2014
Abstract
Bioinformatics can be defined strictly as the science of information in biological systems, or more broadly as developing and applying computational tools for analysing biological data. The biological processes that generate this information, particularly evolution, are highly complex, and therefore analysis of biological information is often computationally challenging. I will present the following selected topics and highlight the advanced computing challenges they involve, and also outline advances in the biosciences that have been enabled by tackling these challenges.

Many bioinformatics analyses are based on DNA sequences, which today can be determined at very high volume through “Next Generation Sequencing” (NGS) techniques. As a result, the volume of publicly available sequence data has reached the range of petabytes. Searching this body of data requires highly optimised computational approaches, such as BLAST (“Basic Local Alignment Search Tool”).

More recently, NGS methods have emerged that generate very large numbers of “short reads”, i.e. strings of sequence. Central computational challenges resulting from these new technologies are “de novo assembly” of the original long sequence(s) from short reads, and mapping very large numbers of short reads to a known reference sequence.

Phylogeny analysis, i.e. reconstruction of ancestry relationships among species, is a classical field of bioinformatics which typically involves two steps: first a multiple alignment of the sequences is computed, which is subsequently used to compute a tree. Computing multiple alignments is an optimisation problem that can only be approximately solved.
The Pirbright Institute
Preventing and controlling viral diseases
Core funding:
Project funding by BBSRC and many others.
Outline
1 Introduction
  • Molecular Biology Basics
  • Resolving the Phylogeny of Land Plants
  • Reconstructing Foot and Mouth Disease Transmission Trees
2 Sequence Analysis
  • Pairwise Alignment
  • BLAST
3 “Next Generation” Sequencing Challenges
4 Multiple Sequence Alignment (MSA)
Bioinformatics: Definition(s)
• Scientific inquiry into information in biological systems.
• Computational analysis of biological data.
• Computer assisted mining of biological literature.
DNA: Structure
[Figure: structural formula of a short DNA strand, showing the sugar-phosphate backbone and the attached bases.]
Base Complementarity
http://commons.wikimedia.org/wiki/File:DNA_chemical_structure.svg
The “Central Dogma”
http://en.wikipedia.org/wiki/Central_dogma_of_molecular_biology
http://en.wikipedia.org/wiki/File:Cdmb.svg
The Success of Bioinformatics
• The Object: Information in biological systems:
“In living systems, a dynamics of information has gained control over the dynamics of energy, which determines the behavior of most non-living systems.”
[Langton, 1992]
• Genetic information is digital.
TACCGTCAC
CTACACCAT
ACCTACATG
TTCACATTA
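One way to make “digital” concrete: with four possible bases, each position carries at most 2 bits, so a DNA string can be packed into an integer and recovered exactly. The following is only a minimal sketch; the code table (A=0, C=1, G=2, T=3) and the function names `pack` and `unpack` are assumptions chosen for illustration.

```python
# Hypothetical 2-bit code; any one-to-one mapping of the bases to 0..3 would do.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(seq):
    """Pack a DNA string into an integer, using 2 bits per base."""
    value = 0
    for base in seq:
        value = (value << 2) | CODE[base]
    return value


def unpack(value, length):
    """Recover a DNA string of known length from its packed integer."""
    bases = []
    for _ in range(length):
        bases.append(BASES[value & 3])
        value >>= 2
    return "".join(reversed(bases))


seq = "TACCGTCAC"  # first sequence shown on the slide
packed = pack(seq)
print(packed.bit_length() <= 2 * len(seq))
print(unpack(packed, len(seq)) == seq)
```

The length has to be stored separately, because leading A bases are encoded as zero bits.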
Sequence Data Is Big Data
• NCBI-GenBank Flat File Release 204.0 (15 Oct 2014): ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt
  • 178,322,253 loci,
  • 181,563,676,918 bases.
[Crosswell and Thornton, 2012]
http://www.ebi.ac.uk/ena/about/statistics
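For a feeling of scale, the GenBank release figures quoted above can be turned into a back-of-the-envelope estimate. This sketch counts sequence characters only; annotation, headers and any other flat-file content are ignored, so actual file sizes will be considerably larger. The two storage schemes (one byte per base, two bits per base) are assumptions for illustration.

```python
# Figures from GenBank Flat File Release 204.0, as quoted on the slide.
loci = 178_322_253
bases = 181_563_676_918

mean_length = bases / loci              # average bases per locus
bytes_one_per_base = bases              # plain text: one byte per base
bytes_two_bits = (2 * bases + 7) // 8   # packed: 2 bits per base, rounded up

print(f"mean locus length: {mean_length:.1f} bases")
print(f"1 byte per base:   {bytes_one_per_base / 1e9:.1f} GB")
print(f"2 bits per base:   {bytes_two_bits / 1e9:.1f} GB")
```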
Land Plant Phylogeny
Angiosperms (flowering plants)
Gnetales
Gymnosperms
Phylogeny of MADS Proteins
[Figure: phylogenetic tree of MADS proteins with bootstrap values. Subfamily labels: AG, TM3, AGL6, AGL2, SQUA, GGM4, AGL17, CRM3, GLO, DEF, DEF/GLO, CRM1, AGL15. Legend: Angiosperms, Gnetales, Gymnosperms.]
The AG and AGL6 Subfamilies
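[Figure: AG and AGL6 subtrees of the MADS protein phylogeny. Bootstrap values shown: 98, 78, 100 (AG panel); 62, 46, 58, 62 (AGL6 panel).]
AG subfamily panel, taxa in figure order: ZMM1 (Zea), FBP11 (Petunia), BAG1 (Brassica), AG (Arabidopsis), PLE (Antirrhinum), GGM3 (Gnetum), DAL2 (Picea), GGM10 (Gnetum)
AGL6 subfamily panel, taxa in figure order: AGL13 (Arabidopsis), AGL6 (Arabidopsis), Zag5 (Zea), ZAG3 (Zea), PRMADS2 (Pinus), GBM1 (Ginkgo), PRMADS3 (Pinus), DAL1 (Picea), GGM11 (Gnetum), GGM9 (Gnetum), ZMM6 (Zea)
Legend: Angiosperms, Gnetales, Gymnosperms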
The DEF and GLO Subfamilies
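[Figure: DEF and GLO subtrees of the MADS protein phylogeny. Bootstrap values shown: 48, 52, 100, 100, 58.]
Taxa in figure order:
• NTGLO (Nicotiana), GLO (Antirrhinum), PI (Arabidopsis), DAPI (Delphinium), OSMADS2 (Oryza)
• PMADS1 (Petunia), NTDEF (Nicotiana), DEF (Antirrhinum), TM6 (Lycopersicon), BOBAP3 (Brassica), AP3 (Arabidopsis)
• DAL13 (Picea), GGM2 (Gnetum), GGM13 (Gnetum)
Legend: Angiosperms, Gnetales, Gymnosperms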
Conclusion: Land Plant Phylogeny
MADS-Box Genes Reveal That Gnetophytes Are More Closely Related to Conifers than to Flowering Plants [Winter et al., 1999].
Angiosperms
Gnetales
Gymnosperms
Transmission Trees
Consensus Sequence Data
site            sequences
p0 (ancestor)   2
p1b             6
p2b             7
p2c             3
p3b             8
p3c             2
p4b             2
p5              5
p6b             3
p7              8
p8              1
sum             47
combinations    967680

• 2007 FMD outbreak in UK
• 10 premises, 2 ancestor samples (p0)
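The two summary figures follow directly from the per-site counts: the sum is the total number of consensus sequences, and the product is the number of ways to pick exactly one sequence per site. A small check, with the dictionary name `sequences_per_site` chosen here for illustration:

```python
from math import prod

# Number of consensus sequences per site, as listed in the table above.
sequences_per_site = {
    "p0": 2, "p1b": 6, "p2b": 7, "p2c": 3, "p3b": 8, "p3c": 2,
    "p4b": 2, "p5": 5, "p6b": 3, "p7": 8, "p8": 1,
}

total = sum(sequences_per_site.values())
combinations = prod(sequences_per_site.values())

print(total)         # 47
print(combinations)  # 967680
```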
Reference Transmission Tree
[Figure: reference transmission tree connecting the premises p0, p1b, p2b, p2c, p3b, p3c, p4b, p5, p6b, p7 and p8.]
Based on a TCS genealogy of all 47 samples, and additional background knowledge.
Flowchart
consensus sequences
sample with 1 sequence / premise
Hamming distances
TCS genealogy
rooted genealogy
rpetal, rfugal, closest, MST
Analysis carried out for 1000 random samples containing one consensus sequence from each site.
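Three of the flowchart steps are easy to sketch in code: drawing one consensus sequence per site, computing Hamming distances, and building a minimum spanning tree (MST). The sketch below uses Prim's algorithm, which is one standard way to compute an MST and is an assumption here, not necessarily what the original analysis used. The TCS genealogy and the rpetal, rfugal and closest methods are not covered, the tree is left unrooted, and the toy sequences are invented for illustration.

```python
import random


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def sample_one_per_site(site_sequences, rng):
    """Pick one consensus sequence per site at random."""
    return {site: rng.choice(seqs) for site, seqs in site_sequences.items()}


def minimum_spanning_tree(sample):
    """Prim's algorithm on the complete graph of Hamming distances.

    Returns a list of (site_in_tree, new_site, distance) edges.
    """
    sites = sorted(sample)
    in_tree = {sites[0]}
    edges = []
    while len(in_tree) < len(sites):
        # Cheapest edge leaving the current tree; ties broken by site name.
        d, u, v = min(
            (hamming(sample[u], sample[v]), u, v)
            for u in in_tree
            for v in sites
            if v not in in_tree
        )
        edges.append((u, v, d))
        in_tree.add(v)
    return edges


# Toy data, invented for illustration (not the 2007 outbreak sequences).
toy = {
    "p0": ["ACGTACGT", "ACGTACGA"],
    "p1": ["ACGTACGA"],
    "p2": ["ACGAACGA", "ACGAACCA"],
    "p3": ["TCGAACCA"],
}
rng = random.Random(1)
sample = sample_one_per_site(toy, rng)
for u, v, d in minimum_spanning_tree(sample):
    print(u, "-", v, "distance", d)
```

Repeating the sampling step many times, as in the analysis described above, would yield a distribution of tree topologies rather than a single tree.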
Results: Tree Topology
[Figure: two panels for the rpetal method, with tree topologies on the x-axis (labels up to t63). One panel shows the frequency of each topology (axis 0 to 200), the other the topological distance to the reference tree (axis 0 to 10).]
radipetal: branch nodes merged towards root (p0)
Results: Tree Topology
[Figure: two panels for the rfugal method, with tree topologies on the x-axis (labels up to t39). One panel shows the frequency of each topology (axis 0 to 200), the other the topological distance to the reference tree (axis 0 to 10).]
radifugal: branch nodes merged away from root (p0)
Results: Tree Topology
[Figure: two panels for the closest method, with tree topologies on the x-axis (labels up to t41). One panel shows the frequency of each topology (axis 0 to 200), the other the topological distance to the reference tree (axis 0 to 10).]
closest: branch nodes merged towards closest premise
Results: Tree Topology
[Figure: two panels for the MST method, with tree topologies on the x-axis (labels up to t17). One panel shows the frequency of each topology (axis 0 to 200), the other the topological distance to the reference tree (axis 0 to 10).]
MST
Summary: FMD Transmission Trees
• Algorithms for constructing transmission trees:
  • from TCS genealogies: radipetal, radifugal, closest,
  • minimum spanning tree (MST).
• Comparison based on the 2007 outbreak:
  • Among the TCS-based methods, closest yields transmission trees with the best precision.
  • MST provides marginally better precision still.
• Outlook:
  • Try more sophisticated distance measures.
  • Include further transmission tree reconstruction methods.
  • Larger data sets, NGS “beyond the consensus”.
[Wright et al., 2011]
Pairwise Alignment: Idea
S = ACATCTCG
T = ACTGTA
alignment
S^a = ACATCTCG
      || | | :
T^a = AC-TGT-A
Formal Definition
• Extend sequences S and T to S^a and T^a by inserting gaps.
  • Aligned sequences have equal length: |S^a| = |T^a|.
  • Gaps cannot be paired with gaps.
• Biological background: homology; symbols in a column should derive from the same common ancestor.
• Match: column with equal symbols in S^a and T^a.
• Indel: column with a gap symbol in S^a or T^a.
• Mismatch: column with different (non-gap) symbols in S^a and T^a.
ACATCTCG
AC-TGT-A
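The definitions above translate almost line by line into a column classifier. This is a minimal sketch; the function name `classify_columns` and the use of `-` as the gap symbol are assumptions for illustration.

```python
GAP = "-"


def classify_columns(sa, ta):
    """Label each alignment column as 'match', 'mismatch' or 'indel'."""
    if len(sa) != len(ta):
        raise ValueError("aligned sequences must have equal length")
    labels = []
    for x, y in zip(sa, ta):
        if x == GAP and y == GAP:
            raise ValueError("gaps cannot be paired with gaps")
        if x == GAP or y == GAP:
            labels.append("indel")
        elif x == y:
            labels.append("match")
        else:
            labels.append("mismatch")
    return labels


for column, label in enumerate(classify_columns("ACATCTCG", "AC-TGT-A"), start=1):
    print(column, label)
```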
Scoring of Alignments
The score m(k) of column k is
• the space penalty m(k) = −g, if one symbol is the gap symbol (here: g = 2),
• otherwise the pair score m(k) = µ(s^a(k), t^a(k)), here

\[
\mu(x, y) =
\begin{cases}
1, & \text{if } x = y, \\
-1, & \text{otherwise.}
\end{cases}
\]

S^a   =  A  C  A  T  C  T  C  G
T^a   =  A  C  -  T  G  T  -  A
score:  +1 +1 −2 +1 −1 +1 −2 −1  = −2
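The column scores can be summed mechanically. A minimal sketch of the scoring scheme with g = 2 and the pair score µ as defined above; the names `mu` and `alignment_score` are chosen here for illustration.

```python
GAP = "-"


def mu(x, y):
    """Pair score: +1 for equal symbols, -1 otherwise."""
    return 1 if x == y else -1


def alignment_score(sa, ta, g=2):
    """Sum of column scores of two aligned sequences of equal length."""
    if len(sa) != len(ta):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    for x, y in zip(sa, ta):
        if x == GAP or y == GAP:
            total -= g          # space penalty
        else:
            total += mu(x, y)   # match or mismatch
    return total


print(alignment_score("ACATCTCG", "AC-TGT-A"))  # -2
```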
Optimal Alignments
• Objective: Find the alignment with maximal score.
• Problem: The number of alignments is

\[
\binom{|S| + |T|}{|S|} \cdot \binom{|S| + |T|}{|T|}
\]

  • Trying out all alignments is infeasible.
  • Naive recursion results in trying out all alignments.
Optimal Alignments
• Observation: A prefix alignment of an optimal alignment is optimal (as well).
  • Otherwise, a contradiction results: The optimal alignment could be improved by changing the prefix.
• Dynamic programming: Tabulate optimal scores of prefix alignments.
ACATCTCG
AC-TGT-A
Table of Prefix-Alignment Scores
       -     A     G     A     C
-    0.0  -2.0  -4.0  -6.0  -8.0
A   -2.0   1.0  -1.0  -3.0  -5.0
G   -4.0  -1.0   2.0   0.0  -2.0
C   -6.0  -3.0   0.0   1.0   1.0
The optimal alignment score is 1.0.
☞ Notice O(n²) complexity.
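The table can be filled row by row, since each entry depends only on its upper, left and upper-left neighbours. This is the scheme commonly known as the Needleman-Wunsch algorithm (a name the slides do not use). The sketch below assumes the scoring from the earlier slide (match +1, mismatch −1, g = 2); the function name is chosen for illustration.

```python
def prefix_score_table(s, t, g=2, match=1, mismatch=-1):
    """Optimal global alignment scores for all pairs of prefixes of s and t.

    table[i][j] is the best score for aligning s[:i] with t[:j].
    """
    rows, cols = len(s) + 1, len(t) + 1
    table = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        table[i][0] = -g * i          # s[:i] aligned against gaps only
    for j in range(1, cols):
        table[0][j] = -g * j          # t[:j] aligned against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if s[i - 1] == t[j - 1] else mismatch
            table[i][j] = max(
                table[i - 1][j - 1] + pair,  # align s[i-1] with t[j-1]
                table[i - 1][j] - g,         # s[i-1] against a gap
                table[i][j - 1] - g,         # t[j-1] against a gap
            )
    return table


table = prefix_score_table("AGC", "AGAC")
for row in table:
    print(row)
print("optimal score:", table[-1][-1])
```

The two nested loops visit (|S| + 1)(|T| + 1) cells with constant work per cell, which is where the quadratic complexity comes from.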
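Backtracking the Alignment
Each arrow points to the neighbouring cell(s) from which the score was obtained (← left, ↑ up, ↖ diagonal).
       -         A         G         A         C
-    0.0      -2.0 ←    -4.0 ←    -6.0 ←    -8.0 ←
A   -2.0 ↑     1.0 ↖    -1.0 ←    -3.0 ↖←   -5.0 ←
G   -4.0 ↑    -1.0 ↑     2.0 ↖     0.0 ←    -2.0 ←
C   -6.0 ↑    -3.0 ↑     0.0 ↑     1.0 ↖     1.0 ↖

Resulting alignment:
AG-C
AGAC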
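Backtracking can be sketched without storing arrows: starting from the bottom-right cell, one re-checks which neighbour could have produced the current score. The sketch assumes the same scoring as before (match +1, mismatch −1, g = 2); the tie-breaking order (diagonal, then up, then left) is an arbitrary choice made here, so other optimal alignments may exist.

```python
def align(s, t, g=2, match=1, mismatch=-1):
    """Global alignment by dynamic programming followed by traceback.

    Returns (aligned_s, aligned_t, score).
    """
    def pair(x, y):
        return match if x == y else mismatch

    n, m = len(s), len(t)
    f = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        f[i][0] = -g * i
    for j in range(1, m + 1):
        f[0][j] = -g * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            f[i][j] = max(f[i - 1][j - 1] + pair(s[i - 1], t[j - 1]),
                          f[i - 1][j] - g,
                          f[i][j - 1] - g)

    # Traceback from the bottom-right cell to the top-left cell.
    sa, ta = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and f[i][j] == f[i - 1][j - 1] + pair(s[i - 1], t[j - 1]):
            sa.append(s[i - 1]); ta.append(t[j - 1])   # diagonal step
            i, j = i - 1, j - 1
        elif i > 0 and f[i][j] == f[i - 1][j] - g:
            sa.append(s[i - 1]); ta.append("-")        # step up
            i -= 1
        else:
            sa.append("-"); ta.append(t[j - 1])        # step left
            j -= 1
    return "".join(reversed(sa)), "".join(reversed(ta)), f[n][m]


aligned_s, aligned_t, score = align("AGC", "AGAC")
print(aligned_s)
print(aligned_t)
print(score)
```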
BLAST: Basic Local Alignment Search Tool
• Objective: Given a query sequence, find similar sequences in a database.
• Size of database prohibits pairwise alignment of query to all entries.
• Algorithm outline [Altschul et al., 1997]:
  1 Scan for hits, i.e. gapless short word alignments exceeding a threshold score.
  2 Extend hits maximally to obtain HSPs (high-scoring segment pairs).
  3 Combine HSPs to (gapped) alignments.
• E-values indicate the number of HSPs with at least the given score that are expected by chance.
• Interesting HSPs have E ≪ 1.
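Steps 1 and 2 of the outline can be illustrated with a heavily simplified seed-and-extend sketch. It is not BLAST itself: hits are exact word matches rather than word pairs scored against a threshold, extension uses a simple drop-off rule, step 3 (gapped alignment) and E-values are omitted, and all names and parameter values (`k`, `xdrop`, `min_score`) are assumptions for illustration.

```python
from collections import defaultdict


def word_index(subject, k):
    """Map every length-k word of the subject to its start positions."""
    index = defaultdict(list)
    for pos in range(len(subject) - k + 1):
        index[subject[pos:pos + k]].append(pos)
    return index


def extend_hit(query, subject, qpos, spos, k, match=1, mismatch=-1, xdrop=2):
    """Extend an exact word hit without gaps in both directions.

    Extension in one direction stops once the running score has fallen
    more than `xdrop` below the best score seen in that direction.
    Returns (score, query_start, subject_start, length).
    """
    def extend(qi, si, step):
        best = score = 0
        best_len = length = 0
        while 0 <= qi < len(query) and 0 <= si < len(subject):
            score += match if query[qi] == subject[si] else mismatch
            length += 1
            if score > best:
                best, best_len = score, length
            elif best - score > xdrop:
                break
            qi += step
            si += step
        return best, best_len

    right_score, right_len = extend(qpos + k, spos + k, +1)
    left_score, left_len = extend(qpos - 1, spos - 1, -1)
    score = k * match + left_score + right_score
    return score, qpos - left_len, spos - left_len, left_len + k + right_len


def find_hsps(query, subject, k=4, min_score=6):
    """Seed with exact k-word hits, extend them, keep distinct high scorers."""
    index = word_index(subject, k)
    hsps = set()
    for qpos in range(len(query) - k + 1):
        for spos in index.get(query[qpos:qpos + k], []):
            hsp = extend_hit(query, subject, qpos, spos, k)
            if hsp[0] >= min_score:
                hsps.add(hsp)
    return sorted(hsps, reverse=True)


for score, qstart, sstart, length in find_hsps("TTACGTACGG", "CCACGTACGAA"):
    print(score, qstart, sstart, length)
```

The word index is what makes the search scale: the subject is scanned once to build it, and each query word is then looked up directly instead of being aligned against every database entry.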
BLAST: Search Engine for Sequences
http://blast.ncbi.nlm.nih.gov/
“Next Generation” Sequencing
• Long DNA sequences cannot be read like a tape.
• Short fragments from random genomic locations can be sequenced.
• NGS generates a very large number of (very) short sequencing reads.
http://www.illumina.com/systems/miseq.ilmn
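The sampling idea can be mimicked in a few lines. This is a toy sketch: reads are error-free, of fixed length, taken from one strand at uniformly random positions, and the “genome” is a random string. All of these are simplifying assumptions, as are the function names.

```python
import random


def sample_reads(genome, read_length, n_reads, rng):
    """Draw fixed-length reads from uniformly random start positions."""
    last_start = len(genome) - read_length
    starts = [rng.randint(0, last_start) for _ in range(n_reads)]
    return [(start, genome[start:start + read_length]) for start in starts]


def coverage(genome_length, reads):
    """Count how many reads cover each genome position."""
    depth = [0] * genome_length
    for start, read in reads:
        for pos in range(start, start + len(read)):
            depth[pos] += 1
    return depth


rng = random.Random(42)
genome = "".join(rng.choice("ACGT") for _ in range(200))  # toy "genome"
reads = sample_reads(genome, read_length=25, n_reads=40, rng=rng)
depth = coverage(len(genome), reads)
print("mean coverage:", sum(depth) / len(depth))
print("uncovered positions:", depth.count(0))
```

In a real experiment the start positions are unknown, which is why reads have to be assembled de novo or mapped to a reference sequence.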
Illumina NGS Sequencing
Massive numbers of sequencing reactions take place in one flow cell.
http://www.illumina.com/documents/products/techspotlights/techspotlight_sequencing.pdf
DNA is fragmented and adapters are ligated.
Fragments (with adapters) are attached to the slide in a flow cell.
The slide is studded with primers, facilitating bridge amplification . . .
. . . resulting in double stranded fragments . . .
. . . which are then denatured.
Multiple rounds of amplification result in a cluster from each initial fragment.
Reversible terminator nucleotides are added.
Incorporated nucleotides fluoresce at different wavelengths.
After removal of the terminator, the next nucleotide is added. . .
. . . and the fluorescent light is imaged.
Each image yields one base for each cluster.
image  read0  read1  read2  read3
1      G      T      C      A
2      GC     TA     CT     AG
3      GCT    TAA    CTT    AGC
4      GCTG   TAAG   CTTA   AGCC
5      GCTGA  TAAGT  CTTAG  AGCCG
Reads are further processed, e.g. in sequence assembly.
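The image-by-image read-out described above can be sketched in a few lines of Python. This is only a toy illustration, not Illumina's base-calling software: it assumes that each image has already been reduced to one called base per cluster, and the names `images` and `build_reads` are made up for this example.

```python
def build_reads(images):
    """Concatenate per-image base calls into one read per cluster.

    `images` is a list of imaging steps; each step is a list holding the
    base called for cluster 0, cluster 1, ... in that image.
    """
    n_clusters = len(images[0])
    reads = [""] * n_clusters
    for calls in images:
        if len(calls) != n_clusters:
            raise ValueError("every image must call one base per cluster")
        for cluster, base in enumerate(calls):
            reads[cluster] += base
    return reads


# Base calls taken from the four-cluster example on the slides.
images = [
    ["G", "T", "C", "A"],
    ["C", "A", "T", "G"],
    ["T", "A", "T", "C"],
    ["G", "G", "A", "C"],
    ["A", "T", "G", "G"],
]
for i, read in enumerate(build_reads(images)):
    print(f"read{i} = {read}")
```

Each pass of the outer loop corresponds to one image, so the read length equals the number of images taken.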
Applications of “Next Generation” Sequencing
• Novel genome sequencing
• Re-sequencing to discover genomic variation
  • Single nucleotide polymorphisms (SNPs), and their association with phenotypic traits,
  • Evolution of genomic variation patterns.
• Metagenomics
• *-Seq techniques
  • gene expression measurement: RNA-Seq
  • binding sites: ChIP-Seq
  • microRNA-Seq
Mapping NGS Reads
• Task: align billions of short reads to a known reference genome.
• Not feasible using dynamic programming.
• Feasible using advanced indexing of the reference.
  • e.g. Burrows-Wheeler transform
  • Pigeonhole principle
GACTAGAGTAGACGATGAGACCCATGACA
GGC  GAGTAGACGAT  GACCCATGATA
GGCT GAGTAGCCGATG   CCCATGACA
GGCT  AGTAGCCGATGAG CTCATGACA
GGCTAG GTAGACGATGAGA  CATGACA
GGCTAGA  AGACGATGAGA   ATGACA
GGCTAGAG AGCCGATGAGACC ATGACA
GGCTAGAGT GACGATGAGACCC
           CCGATGAGACCCAT
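To make the indexing idea concrete, here is a minimal sketch of seed-based mapping that relies on the pigeonhole principle. It swaps in a plain hash table of k-mers for the Burrows-Wheeler index used by production mappers, and the function names, the seed length `k` and the mismatch limit are all assumptions chosen for the example.

```python
from collections import defaultdict


def build_index(reference, k):
    """Map every k-mer of the reference to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def map_read(read, reference, index, k, max_mismatches):
    """Return sorted (position, mismatches) hits with <= max_mismatches.

    Pigeonhole principle: max_mismatches errors can spoil at most that
    many of (max_mismatches + 1) disjoint seeds, so at least one seed of
    a true hit matches the reference exactly and is found in the index.
    """
    n_seeds = max_mismatches + 1
    if len(read) < n_seeds * k:
        raise ValueError("read too short for the requested seeds")
    candidates = set()
    for s in range(n_seeds):
        offset = s * k
        for pos in index.get(read[offset:offset + k], []):
            start = pos - offset
            if 0 <= start <= len(reference) - len(read):
                candidates.add(start)
    hits = []
    for start in sorted(candidates):
        window = reference[start:start + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits


reference = "GACTAGAGTAGACGATGAGACCCATGACA"
index = build_index(reference, k=4)
print(map_read("GAGTAGACGAT", reference, index, k=4, max_mismatches=1))
print(map_read("GAGTAGCCGATG", reference, index, k=4, max_mismatches=1))
```

Only the candidate positions suggested by an exact seed hit are verified base by base, which is what makes this far cheaper than running dynamic programming for every read against the whole reference.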
Finding Single Nucleotide Polymorphisms (SNPs)
GACTAGAGTAGACGATGAGACCCATGACA
GGC  GAGTAGACGAT  GACCCATGATA
GGCT GAGTAGCCGATG   CCCATGACA
GGCT  AGTAGCCGATGAG CTCATGACA
GGCTAG GTAGACGATGAGA  CATGACA
GGCTAGA  AGACGATGAGA   ATGACA
GGCTAGAG AGCCGATGAGACC ATGACA
GGCTAGAGT GACGATGAGACCC
           CCGATGAGACCCAT
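A simple way to see how SNP candidates emerge from mapped reads is to count the bases observed at each reference position. The sketch below does this for the reads on the slide. The 0-based start positions are my reconstruction of the slide layout, and the thresholds `min_alt_reads` and `min_alt_fraction` are arbitrary illustrative values; real variant callers also use base qualities and statistical models.

```python
from collections import Counter


def pileup(reference, placed_reads):
    """Count the bases that mapped reads place on each reference position."""
    columns = [Counter() for _ in reference]
    for start, read in placed_reads:
        for offset, base in enumerate(read):
            columns[start + offset][base] += 1
    return columns


def call_variants(reference, placed_reads, min_alt_reads=2, min_alt_fraction=0.2):
    """Report (position, ref_base, alt_base, alt_count, depth) candidates."""
    calls = []
    for pos, counts in enumerate(pileup(reference, placed_reads)):
        depth = sum(counts.values())
        for base, count in sorted(counts.items()):
            if base == reference[pos]:
                continue
            if count >= min_alt_reads and count / depth >= min_alt_fraction:
                calls.append((pos, reference[pos], base, count, depth))
    return calls


reference = "GACTAGAGTAGACGATGAGACCCATGACA"
placed_reads = [  # (assumed 0-based start, read)
    (0, "GGC"), (5, "GAGTAGACGAT"), (18, "GACCCATGATA"),
    (0, "GGCT"), (5, "GAGTAGCCGATG"), (20, "CCCATGACA"),
    (0, "GGCT"), (6, "AGTAGCCGATGAG"), (20, "CTCATGACA"),
    (0, "GGCTAG"), (7, "GTAGACGATGAGA"), (22, "CATGACA"),
    (0, "GGCTAGA"), (9, "AGACGATGAGA"), (23, "ATGACA"),
    (0, "GGCTAGAG"), (9, "AGCCGATGAGACC"), (23, "ATGACA"),
    (0, "GGCTAGAGT"), (10, "GACGATGAGACCC"),
    (11, "CCGATGAGACCCAT"),
]
for call in call_variants(reference, placed_reads):
    print(call)
```

With these settings, a mismatch seen in a single read is treated as a likely sequencing error, while a mismatch supported by several reads is reported as a candidate variant.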
Aligning Reads to Reference Sequences
Example: RNA-Seq
Assembling NGS Reads
GCTGATGTGCCGCCTCACTCCGGTGG
               CACTCCGGTGG
             CTCACTCCTGTGG
GCTGATGTGCCACCTCA
   GATGTGCCGCCTCACTC
      GTGCCACCTCACTCCGG
                 CTCCGGTGG
• Many copies of a genome are fragmented.
• Each base has a quality value, giving its probability of being correct.
NGS Assembly: Example
C99A99G99A98G95C96A93
C99A99G99A99G99C99A97G95A94C96A89
A99G99A99C99A99A45C26T57A87A85G84T78
A99A99G99T99G99C98T99A96T91C88A82
C99T99A99T99C99A99A96C94T95
T99A99T99C99A99A94C97T95A91G88
A99A99C99T98A91G93
Reads placed at their overlapping positions, with the resulting consensus in the last line:
C99A99G99A98G95C96A93
C99A99G99A99G99C99A97G95A94C96A89
                  A99G99A99C99A99A45C26T57A87A85G84T78
                                          A99A99G99T99G99C98T99A96T91C88A82
                                                         C99T99A99T99C99A99A96C94T95
                                                            T99A99T99C99A99A94C97T95A91G88
                                                                        A99A99C99T98A91G93
C99A99G99A99G99C99A99G99A99C99A99A99C99T99A99A99G99T99G99C99T99A99T99C99A99A99C99T99A99G99
Consensus with the low-quality base (C26, covered by one read only) replaced by N:
C99A99G99A99G99C99A99G99A99C99A99A99N99T99A99A99G99T99G99C99T99A99T99C99A99A99C99T99A99G99
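The consensus step of this example can be sketched as a quality-weighted vote per column. The code below is an illustration under several assumptions: the read offsets are taken as given (the sketch does not compute the overlaps), the quality numbers are simply added up as scores, and the cut-off `min_support=30` is a hypothetical value chosen so that the lone C26 is masked.

```python
import re
from collections import defaultdict

TOKEN = re.compile(r"([ACGT])(\d+)")


def parse_read(text):
    """Turn 'C99A98...' into a list of (base, quality) pairs."""
    return [(base, int(quality)) for base, quality in TOKEN.findall(text)]


def consensus(placed_reads, min_support=30):
    """Quality-weighted consensus of reads placed at known offsets.

    For each column the qualities are summed per base; the best-supported
    base wins, or 'N' is emitted if its summed quality is below min_support.
    """
    columns = defaultdict(lambda: defaultdict(int))
    for offset, text in placed_reads:
        for i, (base, quality) in enumerate(parse_read(text)):
            columns[offset + i][base] += quality
    result = []
    for pos in range(max(columns) + 1):
        votes = columns.get(pos)
        if not votes:
            result.append("N")  # no read covers this position
            continue
        base, support = max(votes.items(), key=lambda item: item[1])
        result.append(base if support >= min_support else "N")
    return "".join(result)


placed_reads = [  # (offset in bases, read with per-base quality)
    (0, "C99A99G99A98G95C96A93"),
    (0, "C99A99G99A99G99C99A97G95A94C96A89"),
    (6, "A99G99A99C99A99A45C26T57A87A85G84T78"),
    (14, "A99A99G99T99G99C98T99A96T91C88A82"),
    (19, "C99T99A99T99C99A99A96C94T95"),
    (20, "T99A99T99C99A99A94C97T95A91G88"),
    (24, "A99A99C99T98A91G93"),
]
print(consensus(placed_reads))
```

Because the vote sums qualities over all reads covering a column, additional depth raises the support for a base, which is why low-quality positions can only be resolved where several reads overlap.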
NGS Sequence Assembly
• Assembly depends on overlaps among reads.
• Quality of bases must be taken into account.
• Reads that are too short are not informative.
• Repetitive sequences make assembly difficult.
• Insufficient depth results in multiple contigs.
• Sufficient depth is a key success factor:
  • Joining of contigs depends on sufficient overlap (N50 value).
  • Resolving low quality bases depends on depth.
  • Depth does not help resolve repetitive sequences.
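The N50 value mentioned above is a common summary of how contiguous an assembly is: it is the contig length L such that contigs of length L or longer contain at least half of all assembled bases. A minimal sketch, with made-up contig lengths and a hypothetical function name:

```python
def n50(contig_lengths):
    """Length L such that contigs of length >= L hold at least half of all bases."""
    if not contig_lengths:
        raise ValueError("need at least one contig")
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Made-up contig lengths (in bases) for illustration.
print(n50([80, 70, 50, 40, 30, 20]))
```

An assembly that breaks into many short contigs because of insufficient depth has a low N50, whereas successful joining of contigs raises it.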
NGS Assembly: Overlap Approach
[Figure: overlap graph of the reads ATTCCCGTA, CCCGTAA, TAATCTACGACTAAG, ATTAAGTCA, CTACGAT and GTCACAACC. Edges are labelled with overlap lengths, for example 6 for the suffix of ATTCCCGTA that matches the prefix of CCCGTAA (CCCGTA); the second build of the slide adds the weights for the remaining read pairs.]
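The edge weights of an overlap graph can be computed as the longest suffix of one read that equals a prefix of another. The following sketch uses exact matching only, so it is not meant to reproduce every label in the figure, which may tolerate mismatches; the function names are illustrative.

```python
def overlap(a, b, min_length=1):
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for length in range(min(len(a), len(b)), min_length - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def overlap_graph(reads, min_length=1):
    """Directed edges (a, b) -> overlap length for all ordered read pairs."""
    edges = {}
    for a in reads:
        for b in reads:
            if a != b:
                length = overlap(a, b, min_length)
                if length:
                    edges[(a, b)] = length
    return edges


reads = ["ATTCCCGTA", "CCCGTAA", "TAATCTACGACTAAG",
         "ATTAAGTCA", "CTACGAT", "GTCACAACC"]
for (a, b), length in sorted(overlap_graph(reads, min_length=2).items()):
    print(f"{a} -> {b}: {length}")
```

Comparing all ordered pairs takes time quadratic in the number of reads, which is one reason overlap assembly is practical mainly for smaller read sets.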
De Bruijn Graph
http://commons.wikimedia.org/wiki/File:DeBruijn-as-line-digraph.svg
NGS Assembly: De Bruijn Approach
Compeau, Pevzner & Tesler, Nature Biotechnology 29 (2011): 987–991, Fig. 3
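As a rough illustration of the de Bruijn approach, the sketch below breaks reads into k-mers, uses the (k-1)-mers as nodes and the k-mers as edges, and then follows the graph while the path is unambiguous. The reads, the choice `k=4` and the function names are invented for this example and are not taken from the cited figure. Real assemblers additionally deal with sequencing errors, reverse complements, repeats and Eulerian path finding.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, edges are k-mers."""
    graph = defaultdict(list)
    seen = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer not in seen:       # keep each distinct k-mer once
                seen.add(kmer)
                graph[kmer[:-1]].append(kmer[1:])
    return graph


def walk(graph, start):
    """Follow edges from `start` while the path is unambiguous."""
    path, node, visited = start, start, {start}
    while len(graph.get(node, [])) == 1:
        node = graph[node][0]
        if node in visited:
            break                      # cycle: stop rather than loop forever
        visited.add(node)
        path += node[-1]
    return path


reads = ["ATTCCCG", "CCCGTAA", "GTAATC"]   # toy reads, error-free
graph = de_bruijn(reads, k=4)
print(walk(graph, "ATT"))
```

Unlike the overlap approach, no pairwise comparison of reads is needed: each read only contributes its k-mers to a shared table, at the cost of holding that table in memory.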
Polymorphisms and de Bruijn Assembly
[Leggett et al., 2013, Fig. 1]
Summary: NGS Data Analysis
• Mapping to a reference sequence, using indexing
  • resequencing
  • detection of SNPs and other variants,
  • identification of genes (RNA-seq).
• De novo assembly of genomes or transcriptomes.
  • Resource intensive (particularly memory)
  • Overlap assembly: feasible with smaller sets
  • De Bruijn graph assembly of k-mers
• NGS metagenomics . . .
Software has limitations and is evolving rapidly.
Outline
1 Introduction
  Molecular Biology Basics
  Resolving the Phylogeny of Land Plants
  Reconstructing Foot and Mouth Disease Transmission Trees
2 Sequence Analysis
  Pairwise Alignment
  BLAST
3 “Next Generation” Sequencing Challenges
4 Multiple Sequence Alignment (MSA)
Multiple Alignment
• Extend pairwise approach?
  • 2 sequences: table of n^2 prefix alignments
  • 3 sequences: table of n^3 prefix alignments
  • Warning: Very large numbers ahead
• Aligning 100 sequences of 300 symbols: about 300^100 ≈ 5 × 10^247 prefix alignments.
• How much computing time does the universe have?
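The growth of the dynamic-programming table can be checked with a few lines of arithmetic. The sketch below counts n^k table cells for k sequences of length n and compares that with a deliberately generous compute budget. The figure of 10^18 cells per second is an assumption made up for the comparison, and the age of the universe is taken as roughly 13.8 billion years.

```python
import math


def dp_cells(n_sequences, length):
    """Number of prefix combinations in a full dynamic-programming table."""
    return length ** n_sequences


AGE_OF_UNIVERSE_S = 13.8e9 * 365.25 * 24 * 3600   # roughly 4.4e17 seconds
CELLS_PER_SECOND = 1e18                            # assumed, very optimistic

for k in (2, 3, 100):
    cells = dp_cells(k, 300)
    print(f"{k:3d} sequences: about 10^{math.log10(cells):.1f} table cells")

budget = AGE_OF_UNIVERSE_S * CELLS_PER_SECOND
print(f"cells computable since the Big Bang: about 10^{math.log10(budget):.1f}")
```

Even under these assumptions the budget falls short of the table size by more than two hundred orders of magnitude, which is why exact multiple alignment by this route is out of reach.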
Progressive Multiple Alignment
• Compute all pairwise alignments
• Use alignment dissimilarities to produce a guide tree.
• Align the most similar pair of sequences and merge them into a profile.
• Progressively align profiles.
• Result: All sequences aligned (and merged into one profile).
• Programs: Clustal, MUSCLE
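The first two steps of this procedure, pairwise dissimilarities and a guide tree, can be sketched as follows. Two simplifying assumptions are made here: the Levenshtein edit distance stands in for the dissimilarity derived from a pairwise alignment, and the tree is built by plain average-linkage clustering. This is not a description of how Clustal or MUSCLE are implemented, and the function names are illustrative.

```python
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance, used here as a simple alignment dissimilarity."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                # gap in b
                               current[j - 1] + 1,             # gap in a
                               previous[j - 1] + (ca != cb)))  # (mis)match
        previous = current
    return previous[-1]


def guide_tree(sequences):
    """Average-linkage clustering; returns nested tuples giving the merge order."""
    # cluster id -> (subtree, member sequences)
    clusters = {i: (seq, [seq]) for i, seq in enumerate(sequences)}
    next_id = len(sequences)
    while len(clusters) > 1:
        def linkage(pair):
            members_a, members_b = clusters[pair[0]][1], clusters[pair[1]][1]
            total = sum(edit_distance(x, y) for x in members_a for y in members_b)
            return total / (len(members_a) * len(members_b))
        a, b = min(combinations(sorted(clusters), 2), key=linkage)
        tree_a, members_a = clusters.pop(a)
        tree_b, members_b = clusters.pop(b)
        clusters[next_id] = ((tree_a, tree_b), members_a + members_b)
        next_id += 1
    return next(iter(clusters.values()))[0]


print(guide_tree(["ACAC", "ACCC", "AGT", "AGAT", "AGCT"]))
```

The innermost pairs of the returned tuple are the sequences to align first; each enclosing level corresponds to one later profile alignment.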
More Uses of Continuous Sequences
• Profile searches (mostly superseded by HMMs)
• Progressive multiple alignment
ACAC
ACCC
AGT
AGAT
AGCT

Profile of AGAT and AGCT:
a 1.0 0.0 0.5 0.0
c 0.0 0.0 0.5 0.0
g 0.0 1.0 0.0 0.0
t 0.0 0.0 0.0 1.0
- 0.0 0.0 0.0 0.0

Profile of ACAC and ACCC:
a 1.0 0.0 0.5 0.0
c 0.0 1.0 0.5 1.0
g 0.0 0.0 0.0 0.0
t 0.0 0.0 0.0 0.0
- 0.0 0.0 0.0 0.0

Profile of AGT, AGAT and AGCT (AGT aligned as AG-T):
a 1.0 0.0 0.3 0.0
c 0.0 0.0 0.3 0.0
g 0.0 1.0 0.0 0.0
t 0.0 0.0 0.0 1.0
- 0.0 0.0 0.3 0.0

Profile of all five sequences:
a 1.0 0.0 0.4 0.0
c 0.0 0.4 0.4 0.4
g 0.0 0.6 0.0 0.0
t 0.0 0.0 0.0 0.6
- 0.0 0.0 0.2 0.0
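The profile values shown for this example can be reproduced with a short sketch that computes column-wise symbol frequencies and merges two profiles weighted by the number of sequences behind each. The function names are illustrative, and the sequences are assumed to be already aligned to equal length (AGT as AG-T).

```python
ALPHABET = "acgt-"


def profile(aligned):
    """Column-wise relative symbol frequencies of equally long aligned sequences."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must have equal length")
    columns = []
    for i in range(length):
        column = [seq[i].lower() for seq in aligned]
        columns.append({s: column.count(s) / len(aligned) for s in ALPHABET})
    return columns


def merge(profile_a, n_a, profile_b, n_b):
    """Combine two equally long profiles built from n_a and n_b sequences."""
    total = n_a + n_b
    return [{s: (col_a[s] * n_a + col_b[s] * n_b) / total for s in ALPHABET}
            for col_a, col_b in zip(profile_a, profile_b)]


left = profile(["ACAC", "ACCC"])
right = profile(["AG-T", "AGAT", "AGCT"])
combined = merge(left, 2, right, 3)
for symbol in ALPHABET:
    print(symbol, " ".join(f"{column[symbol]:.1f}" for column in combined))
```

Since a profile is just a sequence of frequency vectors, it can be aligned like an ordinary sequence once a column-against-column score is defined, which is what makes the progressive strategy possible.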
Multiple Alignment
(program: clustalx) http://www.clustal.org/
Overview of Molecular Phylogeny
sequences → alignment → aligned sequences
aligned sequences → distance calculation → distance matrix → neighbor joining → tree
aligned sequences → parsimony → tree
aligned sequences → maximum likelihood → tree
Acknowledgements
• Kai-Uwe Winter, Thomas Münster, Luzie U. Wingen, Günter Theißen, Heinz Saedler
• Begoña Valdazo-Gonzalez, Nick Knowles, Don King
• Jan Gewehr, Thomas Martinetz, Daniel Polani, Simon Moxon, Vincent Moulton
• Anyela Camargo, Alessandra Devoto, John Turner
References
Altschul, S. F., Madden, T. L., Schäffer, A. A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D. J. (1997). Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Research, 25:3389–3402. http://nar.oupjournals.org/cgi/content/full/25/17/3389

Crosswell, L. C. and Thornton, J. M. (2012). ELIXIR: A distributed infrastructure for European biological data. Trends in Biotechnology, 30:241–242.

Langton, C. G. (1992). Preface. In Langton, C. G., Taylor, C., Farmer, J. D., and Rasmussen, S., editors, Artificial Life II, volume X of Santa Fe Institute Studies in the Sciences of Complexity, Proceedings, pages xiii–xviii, Redwood City, CA. Addison-Wesley.

Leggett, R. M., Ramirez-Gonzalez, R. H., Verweij, W., Kawashima, C. G., Iqbal, Z., Jones, J. D., Caccamo, M., and MacLean, D. (2013). Identifying and classifying trait linked polymorphisms in non-reference species by walking coloured de Bruijn graphs. PLoS One, 8:e60058.

Winter, K.-U., Becker, A., Münster, T., Kim, J. T., Saedler, H., and Theißen, G. (1999). MADS-box genes reveal that gnetophytes are more closely related to conifers than to flowering plants. Proceedings of the National Academy of Sciences, USA, 96:7342–7347.

Wright, C. F., Morelli, M. J., Thébaud, G., Knowles, N. J., Merzyk, P., Paton, D. J., Haydon, D. T., and King, D. P. (2011). Beyond the consensus: Dissecting within-host viral population diversity of foot-and-mouth disease virus by using next-generation genome sequencing. Journal of Virology, 85:2266–2275.
Thank You for your attention and participation