Creating Reference-Grade Human Genome Assemblies Tina Graves Lindsay GRC Workshop at Genome Informatics Sept 19, 2016
The Human Reference is a Work in Progress!
• The current reference, GRCh38, is not optimal for some regions of the genome and/or some individuals/ancestries.
• GRCh38 is composed of DNA from several individual humans.
• Allelic diversity and structural variation present major challenges when assembling a representative diploid genome.
• New technologies, methods, and resources since 2003 have allowed for substantial improvements in the reference genome.
• Additional high-quality reference sequences are needed to represent the full range of genetic diversity in humans.
UGT2B17 – Conflicting Alleles
[Figure: chr4 tiling paths across the UGT2B17 region in three assemblies (Xue Y et al, 2008). The figure also marks a GAP.]
• NCBI36 NC_000004.10 (chr4) Tiling Path: AC074378.4, AC079749.5, AC134921.2, AC147055.2, AC140484.1, AC019173.4, AC093720.2, AC021146.7 (genes: TMPRSS11E, TMPRSS11E2)
• GRCh37 NC_000004.11 (chr4) Tiling Path: AC074378.4, AC079749.5, AC134921.1, AC147055.2, AC093720.2, AC021146.7 (gene: TMPRSS11E)
• GRCh37: NT_167250.1 (UGT2B17 alternate locus): AC074378.4, AC140484.1, AC019173.4, AC226496.2, AC021146.7 (gene: TMPRSS11E2)
Samples to be Sequenced
Sequencing Plan
Definitions of Genome Level
• Platinum Genome
  • Haploid genome source
  • Contiguous, haplotype-resolved representation of entire genome
  • BAC library available
• Gold Genome
  • Diploid genome source
  • Part of a trio
  • Parents will be sequenced to help haplotype resolve some regions
  • BAC libraries available
  • Targeted regions sequenced using these BAC libraries
  • Will contain some haplotype resolved regions
CHM1: A Key Resource for Improving the Reference
• CHM1 cell line established from a haploid hydatidiform mole (complete, paternal; 46XX) (U. Surti)
• CHORI-17 BAC library (P. de Jong)
• CHORI-17 BAC end sequences (n=325,659)
• CHORI-17 multiple enzyme fingerprint map (1,560 fpc contigs)
• CHORI-17 BACs
  • >750 have been sequenced
  • 664 of them in GenBank as phase 3 sequence
• CHM1 WGS assembly
  • Initial assembly produced from >100X coverage of Illumina data
  • Initial PacBio assembly produced using ~54X of P5 PacBio data
  • Latest PacBio assembly produced using ~60X of P6 PacBio data
Assembly Assessment Methods
• Assemblies will run through NCBI QA pipeline
  • Assessed for contiguity, annotation, and concordance with the finished BACs
  • Assembly-to-assembly alignments will be generated between each PB assembly and GRCh38
• BioNano Genome Map
  • SV calls generated from comparing the BioNano data to each of the assemblies
  • Hybrid scaffolding conflicts will also point out potential assembly errors
• Alignment of the Illumina reads back to each of the assemblies
  • Heterozygous calls are likely indicative of a collapse in the assembly (for the haploid genomes)
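The last point can be illustrated with a minimal sketch: in a haploid genome, clusters of heterozygous calls from reads aligned back to the assembly suggest that two or more paralogous copies were collapsed into one. The function name, the 10 kb window size, and the threshold of 5 calls below are illustrative assumptions, not parameters from the talk, and the input is toy data.

```python
from collections import Counter


def flag_collapsed_windows(het_calls, window=10_000, min_calls=5):
    """Return (contig, window_start, n_calls) for windows rich in heterozygous calls.

    het_calls: iterable of (contig, 1-based position) heterozygous variant calls
    from reads aligned back to a haploid assembly. window and min_calls are
    hypothetical defaults chosen only for illustration.
    """
    counts = Counter()
    for contig, pos in het_calls:
        # Bin each call into a fixed-size, non-overlapping window
        # identified by its 0-based start offset.
        counts[(contig, (pos - 1) // window * window)] += 1
    flagged = [(c, start, n) for (c, start), n in counts.items() if n >= min_calls]
    return sorted(flagged)


# Toy input: six clustered calls on one contig, one isolated call on another.
calls = [("ctg1", p) for p in (10_050, 10_400, 11_200, 12_900, 15_000, 19_999)]
calls.append(("ctg2", 500))
print(flag_collapsed_windows(calls))  # [('ctg1', 10000, 6)]
```

Flagged windows would be candidates for manual review against segmental duplication annotation or finished BACs, not automatic proof of a collapse.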
Hybrid Scaffolds – PacBio and BioNano
Assembly | Seq Assem # of Contigs | Seq Assem Contig N50 (Mb) | Seq Assem Total Size (Gb) | BN Hybrid # of Scaffolds | BN Hybrid Scaffold N50 (Mb) | BN Hybrid Total Size (Gb)
CHM1 (P6) GCA_001297185, MGI CHM1 map (Jason’s version) | 3641 | 26.9 | 2.99 | 161 | 47.6 | 2.84
CHM1 (P6) GCA_001307025, MGI CHM1 map (Adam’s version) | 4850 | 20.6 | 2.94 | 221 | 40.04 | 2.82
[Figure: hybrid scaffolds built from PacBio contigs and BioNano contigs]
1q21 Region – GRCh38 vs GCA_001297185
[Figure: alignment of GRCh38 and GCA_001297185 across the 1q21 region with a Seg Dup track; scale bar: 1 Megabase]
[Figure: second view of the same comparison with a Seg Dup track; labels: 99.9+% identity, 99.1% identity]
CHM1 – Next Steps
• Move forward with improving GCA_001297185
• Based on alignment of BioNano data as well as comparisons to GRCh38, make additional breaks where possible
• Incorporate all finished BACs
• Final alignment to GRCh38 in order to produce chromosome AGPs and submit
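As a rough illustration of the last step, the sketch below builds AGP lines for one chromosome object from contigs that have already been ordered and oriented. It assumes the AGP v2.0 column layout, gaps of unknown size (type `U`, length 100) with `map` as linkage evidence, and hypothetical names (`build_agp_lines`, `chr_demo`, `ctgA`, `ctgB`); it is not the pipeline used for the CHM1 submission.

```python
def build_agp_lines(obj, components, gap_len=100):
    """Build tab-delimited AGP-style lines for one object.

    components: list of (component_id, length, orientation) in tiling order.
    A gap of unknown size ('U') is placed between neighbouring components.
    """
    lines, pos, part = [], 1, 1
    for i, (cid, length, orient) in enumerate(components):
        if i > 0:
            # Gap line: object, start, end, part number, 'U',
            # gap length, gap type, linkage, linkage evidence.
            end = pos + gap_len - 1
            fields = [obj, pos, end, part, "U", gap_len, "scaffold", "yes", "map"]
            lines.append("\t".join(map(str, fields)))
            pos, part = end + 1, part + 1
        # Component line: object, start, end, part number, 'W' (WGS contig),
        # component id, component start, component end, orientation.
        end = pos + length - 1
        fields = [obj, pos, end, part, "W", cid, 1, length, orient]
        lines.append("\t".join(map(str, fields)))
        pos, part = end + 1, part + 1
    return lines


# Toy example: two hypothetical contigs placed on one hypothetical object.
agp = build_agp_lines("chr_demo", [("ctgA", 1000, "+"), ("ctgB", 500, "-")])
print("\n".join(agp))
```

In practice the order and orientation would come from the alignment to GRCh38, and the output should be checked with an AGP validator before submission.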
First Gold Genome - NA19240
Initial Assembly Stats
# Seq Contigs: 3569
Max Contig Length: 20,393,869 bp
Total Assembly Size: 2,745,634,789 bp
N50: 6,003,115 bp
N90: 848,151 bp
N95: 345,457 bp
• NA19240 – Yoruban sample
• Generated >70X raw PacBio data
Publication Pending
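For readers unfamiliar with the N50, N90, and N95 values in the assembly stats above, here is a minimal sketch of how such Nx statistics are commonly computed from a list of contig lengths. The function name `nx` and the toy lengths are illustrative assumptions, not data from NA19240.

```python
def nx(lengths, x):
    """Return the Nx length: the largest L such that contigs of length >= L
    together cover at least x percent of the total assembly size."""
    total = sum(lengths)
    running = 0
    # Walk contigs from longest to shortest, accumulating covered bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 100 >= total * x:
            return length
    raise ValueError("lengths must be non-empty and x must be in (0, 100]")


# Toy contig lengths summing to 300.
toy = [80, 70, 50, 40, 30, 20, 10]
print(nx(toy, 50), nx(toy, 90), nx(toy, 95))  # 70 30 20
```

Because Nx depends on total assembly size, assemblies of different total size are not directly comparable on N50 alone.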
NA19240 BioNano Hybrid and SV Stats
Sample | Seq Assem # of Contigs | Seq Assem Contig N50 (Mb) | Seq Assem Total Size (Gb) | BN Hybrid # of Scaffolds | BN Hybrid Scaffold N50 (Mb) | BN Hybrid Total Size (Gb) | BN Hybrid Conflicts WGS | BN Hybrid Conflicts BN
NA19240 | 3569 | 6.01 | 2.75 | 421 | 14.78 | 2.74 | 49 | 60

Type | Potential mis-assemblies | Breaks made
Conflicts | 28 | 22
Ends | 13 | 5
Insertions | 5 | 2
Translocations | 74 | 14
Initial curated assembly = GCA_001524155.1
Finished BACs Resolve This Region
[Figure tracks: GRCh38, PB Assembly, BAC Alignments, Seg Dup]
Which Assembly is Best?
[Figure: HG00733 Puerto Rican Assembly Stats; Contig Length N50 (Mb, axis range 5.80 to 7.80) plotted against Total Assembly Size (Gb, axis range 2.815 to 2.850)]
• Use other sources to assess multiple assemblies
  • BioNano
  • Long linked reads
Genome Status
Data Source | Origin | Level of Coverage | Status
CHM1 | NA | Platinum | Assembly Improvement
CHM13 | NA | Platinum | Assembly Assessment
NA19240 | Yoruban | Gold | Paper in Review
HG00733 | Puerto Rican | Gold | Assembly Assessment
HG00514 | Han Chinese | Gold | Assembly Assessment
NA12878 | European | Gold | Assembly Assessment
HG01352 | Colombian | Gold | Assembly Assessment
HG02818 | Gambian | Gold | Data Generation Completed
HG02059 | Kinh Vietnamese | Gold | Data Generation Completed
NA19434 | Luhya | Gold | Data Generation
Acknowledgements
• The McDonnell Genome Institute at Washington University in St. Louis: Rick Wilson, Bob Fulton, Wes Warren, Karyn Meltz Steinberg, Vince Magrini, Sean McGrath, Derek Albracht, Milinn Kremitzki, Susan Rock, Debbie Scheer, Chad Tomlinson, Patrick Minx, Chris Markovic, Eddie Belter, Lee Trani, Sara Kohlberg, Susan Dutcher
• University of Washington: Evan Eichler
• NCBI: Valerie Schneider
• University of Pittsburgh School of Medicine (CHM1 and CHM13 cell line): Urvashi Surti
• BioNano Genomics: Palak Sheth, Alex Hastie
• Pacific Biosciences: Jason Chin, Nick Sisneros
• UCSF: Pui-Yan Kwok, Yvonne Lai, Chin Lin, Catherine Chu
• NHGRI: Adam Phillippy, Sergey Koren
• 10X Genomics: Deanna Church