Targeted genomic enrichment and SMRT sequencing of immune-related gene complexes
John Hammond
Jan 26, 2021
Innate immune gene variation: germ-line
encoded NK cell receptors and MHC class I
Haplotype variation, high polymorphism and variegated expression create NK
cell subsets with different specificities and functions
[Figure labels: LRC, NKC, MHC]
Genetically defined and inbred animals are key to dissecting genomic
function
This arm of the immune system is critical in controlling and resolving viral
infection
Diverse NK cell receptor systems are rapidly evolving under intense
selection pressure from rapidly evolving pathogens.
CD8 T cells
• Reference genome not well resolved in these regions
• Poor SNP coverage in hard to assemble repetitive
regions
• Reference genome presents only one haplotype of
many
Highly repetitive regions are difficult to
sequence with short read technology
Human MHC class I is highly diverse but
haplotypes do not vary in gene content
Gene       A      B      C      E    F    G
Alleles    4,200  5,091  3,854  27   30   60
Proteins   2,923  3,664  2,644  8    5    19
Nulls      200    150    144    1    0    3
A greater degree of
structural variation
in cattle
Sequence identity comparison of two
cattle class I genomic haplotypes
MHC class II
The current HD SNP chip does not interrogate
MHC variation
The cattle KIR complex has expanded and demonstrates all
the features of a functional immune complex
Key properties of KIR loci Human KIR Cattle KIR
Inhibitory and Activating
Activating genes disarmed
Functionally variable haplotypes ?
Polymorphic
Paired activating and inhibitory
receptors
Identity
The cattle NKC is largely correct in the reference assembly,
as determined by BAC clones and a new cattle reference assembly
Schwartz et al. 2017. Immunogenetics.
[Figure axes: Distance between SNPs; SNP position in the genome]
~280 kb
8 SNPs & 17 genes
The cattle natural killer complex
• missing SNP variation over the most diverse region
The identity between genes and gene blocks is too
high to map short reads over the KLRC region
[Figure axes: % of reads different from UMD3.1; Location on BTA5]
Illumina 250 bp PE reads
Enrichment and de novo assembly of
immune related gene complexes in cattle for
SNP discovery.
Cattle are arguably the most important livestock species: they
provide humans with meat, milk, hides, traction, manure, status
and security.
Reducing the burden of disease can have enormous positive
impact for food security and welfare.
Complex immune traits are phenotypically diverse, making
breeding/selection processes challenging, but there are many
opportunities!
Where are the immune genes known to be involved in bTB?
Prof Liz Glass, Roslin.
Used the Roche (Nimblegen) SeqCap EZ system
Targeted enrichment of Immune-related gene
cluster with Roche Nimblegen probes
Library prep needed considerable optimisation
Average pull down fragment was 5.5 kb
Four rounds of probe design and optimisation
Illumina set
First design to use “masking” as a way to deal with multiple variant targets for a single
genomic region, thus reducing over-capture of non-variant subregions.
NG1-Pilot set ~ 9Mb with 50 known matches
Update included an increased number of target regions and an increase in the number of variant
inputs per region. Match level stretched to 50 for coverage.
NG2 ~ 5Mb with 50 known matches
Similar strategy to 151221_BTAU_TPI_NiGen2_EZ_HX3. Size of NKC target decreased,
and MHCIIb,TPI, RP and IG regions removed.
NG3 ~ 5Mb. Main aim is to reduce the 70-80 % off-target mapping
Redesign of 170131_BTAU_DH_TPI_EZ_HX3. Match levels reduced to 3, mapping targets
against reference included, efficiencies estimated, probes in NKC region replicated 3x and
MHC replicated 2x.
2 animals
PacBio sequencing; 23 animals
NG1 probe set: good enrichment but still much off-target
[Figure panel labels, shown twice: Custom chromosome, NKC, MHC]
NG2 probe set: better enrichment but NKC dropped out
Probe performance NKC (NG2), LRC (NG2) and MHC (NG2)
[Figures: chr5/NKC and chr18/LRC (axis 0 to 120,000,000); chr23/MHC (axis 0 to 250,000,000)]
Blue = nucleotides binding to whole chromosome
Green = nucleotides binding to target area
• On-target efficiency sacrificed for overlapping probe
coverage, which is not entirely necessary
• Many off-target regions unsupported by probe
sequence: multimapping and polymorphism/variation
• Masking does not adequately reduce the redundancy
of the inputs but does allow probes with similar
sequences to hybridize to similar haplotype
sequences resulting in greater depth of coverage
Probe design summary
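The blue/green comparison in the probe performance plots can be read as an on-target fraction per chromosome. A minimal sketch of that calculation follows; the function name and all counts are hypothetical and are not values read off the slides.

```python
def on_target_fraction(target_bases: int, chromosome_bases: int) -> float:
    """Share of nucleotides captured on a chromosome that fall in the target area.

    target_bases     -- "green" value: nucleotides binding to the target area
    chromosome_bases -- "blue" value: nucleotides binding to the whole chromosome
    """
    if chromosome_bases <= 0:
        raise ValueError("chromosome_bases must be positive")
    if not 0 <= target_bases <= chromosome_bases:
        raise ValueError("target_bases must lie between 0 and chromosome_bases")
    return target_bases / chromosome_bases


# Hypothetical (green, blue) counts for illustration only; not taken from the slides.
hypothetical_counts = {
    "chr5/NKC": (30_000_000, 100_000_000),
    "chr18/LRC": (60_000_000, 80_000_000),
    "chr23/MHC": (150_000_000, 200_000_000),
}

for region, (green, blue) in hypothetical_counts.items():
    print(f"{region}: {on_target_fraction(green, blue):.0%} on target")
```

Comparing this fraction across probe designs is one way to make "better enrichment" and "NKC dropped out" quantitative.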
• Subreads from at least 2 SMRTcells combined for de
novo assembly with Canu
• Filtered subreads as input
• Default parameters, minreadlength >1kb
• For MHC, minreadlength >3kb improves assembly
• For the LRC this does not improve the assemblies
• gfa file as output from Canu screened with Bandage for contigs that
contain MHC or KIR genes/haplotypes,
which were then extracted and mapped to known
MHC/KIR haplotypes
Enrichment and De novo assembly
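The effect of the minimum read length threshold (1 kb by default, 3 kb for the MHC) can be illustrated with a small pre-filter on subreads. This is a sketch only: it does not call Canu, the FASTA parser and function names are hypothetical helpers, and treating the threshold as inclusive is an assumption.

```python
from typing import Iterable, Iterator, Tuple

Record = Tuple[str, str]  # (read name, sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (name, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def filter_by_length(records: Iterable[Record], min_length: int) -> Iterator[Record]:
    """Keep reads whose length is at least min_length (inclusive bound is an assumption)."""
    for name, seq in records:
        if len(seq) >= min_length:
            yield name, seq


# Toy input with a toy threshold; real subreads would use 1_000 or 3_000.
toy_fasta = [">read1", "ACGT", ">read2", "ACGTACGT", "ACGT"]
kept = list(filter_by_length(read_fasta(toy_fasta), min_length=5))
print([name for name, _ in kept])
```

With real data the same filter would be run once at 1,000 bp and once at 3,000 bp so that the resulting assemblies can be compared per region.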
A18 – “gene 6” reconstructed haplotype (labels: NC1, Gene6 6*01301, TRIM26)
• 252: 2 contigs (170kb, 53kb)
• 105991: 9 contigs
• 705983T: 7 contigs
A31 – “gene 1+2” reconstructed haplotype (labels: NC1, Gene1 *02101, Gene2 *02201, TRIM26)
• 200005: 8 contigs
• 604652: 8 contigs
Heterozygous A18/A31 (mixed reads from 252 and 200005 as input for de novo assembly)
• Shared regions assemble contigs that are more similar to the haplotype with
more reads
• Alleles for gene 6, gene 2, gene 1 identical to previous
• FALCON
MHC class I full-length bovine haplotypes
[Figure: gene order per haplotype; TRIM26 labelled; scale bar 20kb]
A14: P3 NC1 1 4 2 6 T
ARS14: P3 NC1 5 2 T
A11: 3 2 T
Angus: P3 NC1 3 2 6 T P6
Brahman: P3 3 2 T 3
A18: P3 NC1 6 T
A31: P3 NC1 1 2 T
Distances shown: 35kb 62kb 67kb 69kb 58kb 103kb
Columns: Breed | ID | known haplotype | de novo haplotype | allele haplotype | alleles
Hereford Dominette ? 02*07001; 05*07201
Friesian 252 A18 A18 A18 06*01301
Friesian 200005 A31 A31 A31 01*02101; 02*02201
Friesian 105991 A18 A18 A18 06*01301
Friesian Herman A14/? A14? 01*02301; 04*02401;02*02501; no 06*
Friesian 505204 A14 A14 A14? 01*02301;04*02401;02*02501; no 06*
Friesian 604652 A31 A31 A31 01*02101; 02*02201
Hereford Domino ? 02*06001;06*04001;05*07201
Friesian 705983T A18 A18 A18 06*01301
Highland 8052 ? new? 01*03102; new02*?
Sahiwal 83H ? new? new03*?
Friesian 206818 ? A14/A14 A14 01*02301;04*02401;02*02501;06*04001
Friesian 706886 ? A14? new01*;02*02501;04*02401
Friesian 206846 ? new? 01*01901;02*02501
Friesian 206853 ? A14/het? A14/? 01*02301;04*02401;02*02501;06*04001
Friesian 504805 A10/A14 01*02301;04*02401;03*00201; new02*
Friesian 504805 A10/A14 01*02301; 04*02401; 02*02501; 03*00201
Friesian dried706823 ? new? 01*02101;04*02401;02*02501;
Friesian 306812 ? new? new 01*;02*00801;04*02401
Friesian 159 A31 A31 01*02101; 02*02201
Most likely haplotype based on alleles
1-3bp differences to allele
• 200005 – reads from 2 SMRTcells* (>940,000 subreads, >1kb length)
200005_NG1+Sequel
KIR haplotype contains block A and B
• 200005 – reads from 2 SMRTcells* (>625,000 subreads, >3kb length)
*includes one Sequel run
• 252 – reads from 4 SMRTcells* (>863,000 subreads with >3kb length):
8 contigs, longest 118kb
KIR haplotype from 252 missing block B?
252_NG1
• 252 – reads from 1 SMRTcell (>486,000 subreads >1kb length):
8 contigs
One contig
*includes one Sequel run
De novo assembly of KIR from two other A18 animals
705983T_NG2+2rep
105991_NG2+2rep
• 705983T – reads from 2 SMRTcells (> 776,000 subreads >1kb length)
• 105991 – reads from 2 SMRTcells (> 606,000 subreads >1kb length)
• Also missing block B?
- Illumina reads from 125 Holstein bulls mapped to
immune-related gene cluster haplotypes (BWA)
- NKC, LRC, MHC
- SNPs called with x variant caller
- Filtered SNPs: QUAL > 900, strictly biallelic (no INDELs)
o Called for all individuals; alternative allele frequencies between 5%
and 95% (only NKC)
- Selected SNPs 10-15kb apart across region and based
on representation of mapping data
- Checked 50bp flanking region for repeats within the haplotype
o Transferability to other haplotypes (according to SNP
coordinate) checked
o SNPs checked for haplotype specificity, and gene specificity
(MHC only)
SNP selection using de novo assembled
haplotypes
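The filtering and spacing steps listed for SNP selection can be sketched as below. This is an illustration under stated assumptions rather than the pipeline used: the `Snp` record, the function names and the toy candidates are hypothetical, the greedy left-to-right thinning stands in for the manual choice "based on representation of mapping data", and the allele frequency window is applied to every SNP here although the slides apply it to the NKC only.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Snp:
    pos: int                # coordinate on the assembled haplotype
    qual: float             # variant caller QUAL score
    ref: str                # reference allele
    alts: Tuple[str, ...]   # alternative alleles
    alt_freq: float         # alternative allele frequency across animals


def passes_filters(snp: Snp, min_qual: float = 900.0,
                   min_af: float = 0.05, max_af: float = 0.95) -> bool:
    """QUAL > 900, strictly biallelic, no INDELs, AF between 5% and 95%."""
    if snp.qual <= min_qual:
        return False
    if len(snp.alts) != 1:
        return False  # not biallelic
    if len(snp.ref) != 1 or len(snp.alts[0]) != 1:
        return False  # INDEL rather than a single-base substitution
    return min_af <= snp.alt_freq <= max_af


def thin_by_spacing(snps: Iterable[Snp], min_spacing: int = 10_000) -> List[Snp]:
    """Greedy thinning (an assumption): keep SNPs at least min_spacing bp apart."""
    selected: List[Snp] = []
    for snp in sorted(snps, key=lambda s: s.pos):
        if not selected or snp.pos - selected[-1].pos >= min_spacing:
            selected.append(snp)
    return selected


# Toy candidates for illustration only.
candidates = [
    Snp(1_000, 950, "A", ("G",), 0.30),
    Snp(4_000, 990, "C", ("T",), 0.40),      # passes filters but too close to 1,000
    Snp(12_000, 800, "G", ("A",), 0.50),     # QUAL too low
    Snp(15_000, 1200, "G", ("A", "T"), 0.20),  # multiallelic
    Snp(18_000, 1500, "T", ("TA",), 0.50),   # INDEL
    Snp(20_000, 1000, "A", ("C",), 0.02),    # allele frequency below 5%
    Snp(26_000, 1100, "G", ("C",), 0.60),
]

panel = thin_by_spacing(s for s in candidates if passes_filters(s))
print([s.pos for s in panel])
```

The flanking-sequence, transferability and haplotype-specificity checks from the list are not modelled here; they would be further filters applied before thinning.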
New SNP panel over 3 different gene complexes being
used to increase the power of GWAS for complex
disease traits
The single SNP on the current Illumina SNP chip
The new cattle LRC SNP panel
UMD3.1
A14
A11
A18
A31
pink SNPs
blue SNPs
orange SNPs
MHC haplotype SNP selection
~280 kb
MAGOHB
KLRA KLRJ KLRC1-3
NKC haplotype SNP selection
First round of SNP selection successful
(~70 % success).
Established segregating
markers in a cohort of 1500
extreme phenotype bTB
resistant cattle, currently doing
GWAS
Acknowledgments
Immunogenetics
Nick Sanderson
Alasdair Allan
Mark Gibson
John Schwartz
Rebecca Philp
Clare Grant
Karen Billington
Paul Norman
Libby Guethlein
Peter Parham
Farbod Babrzadeh
John Young
Richard Borne
Doro Harrison
Elizabeth Morecroft
Kevan Hanson
William Mwangi
Giuseppe Maccari
Derek Bickhart
Timothy Smith
William Thompson
Juan Medrano
Denise Raterman
Cynthia Moehlenkamp
George Mayhew
BB/M027155/1, BB/J006211/1
GCRF Databases and Resources
Liz Glass