Illumin8er: Software for the Illumina GAII Ian Carr, Joanne Morgan, Phil Chambers, Alex Markham, David Bonthron& Graham Taylor Leeds Institute of Molecular.
Post on 02-Apr-2015
Illumin8er: Software for the Illumina GAII
Ian Carr, Joanne Morgan, Phil Chambers, Alex Markham, David Bonthron & Graham Taylor
Leeds Institute of Molecular Medicine, Leeds Teaching Hospitals & Cancer Research UK
Sipping from the hosepipe
The cost of DNA sequencing is plummeting
Current sequence output from an Illumina GAII is over 1 Gigabase per day
Managing the data is the single biggest challenge to bringing the benefits to patients and cost savings to the healthcare budget
The next biggest challenge is optimising the workflow to achieve cost efficiency
What should the software do?
Scan for and report mutations against a defined reference sequence.
Be able to handle bar-code sequence tags
Be easy to use
Report on data quality
Export to a database
Why Illumina?
Cost: 0002p per base
Capacity: 3.5 Gigabase per run
Simplicity: library>cluster station>sequence>data
500,000,000 bases per channel
Software requirements
Runs in MS Windows
User definable reference sequence
Quality scores
Automatic mutation calling: SNPs, indels
Speed
Initial data manipulation
Illuminator can transform data in prb.txt or seq.txt into fasta files
If tagged data is used, each tag is separated into an individual file.
The prb.txt files can be filtered for low quality data
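The tag separation step could look something like the sketch below. It is not Illumin8er's own code. It assumes each seq.txt line is whitespace-separated with the read in the last field, and that the bar-code tag is the first `tag_length` bases of the read; the function name and both of those conventions are hypothetical.

```python
from collections import defaultdict


def split_reads_by_tag(seq_lines, tag_length=0):
    """Group reads by leading tag and return one FASTA string per tag.

    Assumptions (not from the slides): the read is the last
    whitespace-separated field of each line, and the tag is the
    first `tag_length` bases of the read.
    """
    groups = defaultdict(list)
    for line in seq_lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        read = fields[-1]
        tag, insert = read[:tag_length], read[tag_length:]
        groups[tag].append(insert)

    fasta_by_tag = {}
    for tag, reads in groups.items():
        name = tag or "untagged"
        # One FASTA record per read, numbered from 1 within each tag
        fasta_by_tag[tag] = "".join(
            f">{name}_{i}\n{seq}\n" for i, seq in enumerate(reads, 1)
        )
    return fasta_by_tag
```

Each value in the returned dictionary can be written to its own file, giving one fasta file per tag.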
Reference files
Reference files are created from plain text files of the genomic sequence and a cDNA sequence in either a plain text file or a genbank web page.
If a genbank page is used, the SNP data in the page is also imported with the cDNA sequence.
The reference file contains the position of the exons and ORF relative to the genomic sequence to aid mutation annotation.
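As an illustration of why storing exon positions relative to the genomic sequence helps annotation, here is a minimal sketch of such a structure. The class, field names and coordinate convention (0-based, end-exclusive, forward strand, exons listed in order) are assumptions made for the example, not the program's actual file format.

```python
from dataclasses import dataclass


@dataclass
class Reference:
    genomic: str
    exons: list  # (start, end) pairs in genomic coordinates, sorted

    def cdna_position(self, genomic_pos):
        """Return the 0-based cDNA offset of a genomic position.

        Returns None when the position falls outside every exon.
        """
        offset = 0  # cDNA bases contributed by the exons already passed
        for start, end in self.exons:
            if start <= genomic_pos < end:
                return offset + genomic_pos - start
            offset += end - start
        return None
```

With exons at (2, 6) and (20, 25), genomic position 21 maps to cDNA offset 5, while position 10 is intronic and maps to None.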
Indexing the reference sequence
Each octamer in the reference sequence is mapped to an array of 65537 octamers (the extra one is for unmapped rubbish such as ‘nnnnnnnn’)
Some octamers have no positions in the reference while others have several.
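[Figure: the reference sequence GCTGGTGAGGGGTGGGGCAGGAGTGCTTGGGTTGTGGTGAAACATTGG mapped onto the octamer array. Array entries shown: aaaaaaaa, aaaaaaac, aaaaaaat, aaaaaaag, aaaaaaca, aaaaaacc, tttttttt, tttttttc, tttttttg (~65000 entries), plus nnnnnnnn]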
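The indexing described above can be sketched as follows. This is a minimal illustration rather than the program's implementation: the 2-bit base encoding (a, c, g, t = 0, 1, 2, 3), the 0-based positions and the function names are all assumptions. Only the array size of 65537, with one extra slot for octamers such as 'nnnnnnnn', comes from the slide.

```python
BASE_CODE = {"a": 0, "c": 1, "g": 2, "t": 3}  # assumed encoding order
UNMAPPED = 4 ** 8  # slot 65536, i.e. the 65537th entry


def octamer_slot(octamer):
    """Return the array slot for an octamer, or UNMAPPED if it has a non-ACGT base."""
    slot = 0
    for base in octamer.lower():
        if base not in BASE_CODE:
            return UNMAPPED
        slot = slot * 4 + BASE_CODE[base]
    return slot


def build_index(reference):
    """Map every octamer in the reference to the list of positions where it starts."""
    index = [[] for _ in range(UNMAPPED + 1)]
    for pos in range(len(reference) - 7):
        index[octamer_slot(reference[pos:pos + 8])].append(pos)
    return index
```

Most slots stay empty for a short reference, and repeated octamers collect several positions in one slot, which matches the observation that some octamers have no positions while others have several.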
Mapping reads with 3’ mismatches
Read: TGAGGGGTGGGGCAGGAGTGCTTGGGTTGTGGGAAA
Reference: GCTGGTGAGGGGTGGGGCAGGAGTGCTTGGGTTGTGGTGAAACATTGG
Positions where each octamer is found in the ref seq:
Octamer 1: 606, 2900, 5000
Octamer 2: 614, 8900
Octamer 3: 306, 622, 1400
Octamer 4: 1830, 2500
Match up positions where the octamers increase by 8: 606, 614 (+8bp), 622 (+8bp), NA (not +8bp)
3’ mismatches have a run of 3 footprints with the last octamer missing. This goes into array 2 (phase 2).
Mapping reads with 5’ mismatches
Read: GTGAGGGGGGGGCAGGAGTGCTTGGGTTGTGGTGAA
Reference: GCTGGTGAGGGGTGGGGCAGGAGTGCTTGGGTTGTGGTGAAACATTGG
Positions where each octamer is found in the ref seq:
Octamer 1: 5700
Octamer 2: 614, 8900
Octamer 3: 306, 622, 1400
Octamer 4: 630
Match up positions where the octamers increase by 8: NA (not +8bp), 614, 622 (+8bp), 630 (+8bp)
5’ mismatches have a run of 3 footprints with the first octamer missing. This goes into array 3 (phase 3).
Mapping reads with internal mismatches
Read: TGAGGGGTGGGGCAGAAGTGCTTGGGTTGTGGTGAA
Reference: GCTGGTGAGGGGTGGGGCAGGAGTGCTTGGGTTGTGGTGAAACATTGG
Positions where each octamer is found in the ref seq:
Octamer 1: 606, 2900, 5000
Octamer 2: 1664, 5900
Octamer 3: 306, 622, 1400
Octamer 4: 630
Match up positions where the octamers increase by 8: 606, no match (not +8bp), 622 (+16bp from 606), 630 (+8bp)
Internal mismatches have a run of 3 footprints with either the second or third octamer out of phase. This goes into array 4 (phase 4).
What each phase is used for
Phase 1 = perfect matches
Phase 2 = indels and small mutations at end of a read
Phase 3 = indels and small mutations at start of a read
Phase 4 = small mutations in the middle of read
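The phase assignment can be sketched as a lookup on which of a read's four consecutive octamers land 8 bp apart in the reference. The code below is an illustrative reading of the slides, not the program's own logic: the function name, the input format (one list of reference positions per octamer) and the tie-breaking rule (lowest phase, then lowest start) are assumptions.

```python
# Which octamers are "in phase" (True) for each phase described above
PHASE_BY_PATTERN = {
    (True, True, True, True): 1,    # perfect match
    (True, True, True, False): 2,   # last octamer missing (3' mismatch)
    (False, True, True, True): 3,   # first octamer missing (5' mismatch)
    (True, False, True, True): 4,   # second octamer out of phase
    (True, True, False, True): 4,   # third octamer out of phase
}


def classify_phase(hits):
    """Return (phase, start) for four lists of octamer positions, or None."""
    hit_sets = [set(h) for h in hits]
    # Every hit implies a candidate start for the first octamer
    starts = {p - 8 * i for i, s in enumerate(hit_sets) for p in s}
    best = None
    for start in sorted(starts):
        pattern = tuple(start + 8 * i in s for i, s in enumerate(hit_sets))
        phase = PHASE_BY_PATTERN.get(pattern)
        if phase is not None and (best is None or phase < best[0]):
            best = (phase, start)
    return best
```

Using the positions from the three worked examples, `classify_phase([[606, 2900, 5000], [614, 8900], [306, 622, 1400], [1830, 2500]])` gives phase 2, `classify_phase([[5700], [614, 8900], [306, 622, 1400], [630]])` gives phase 3, and `classify_phase([[606, 2900, 5000], [1664, 5900], [306, 622, 1400], [630]])` gives phase 4, each anchored at position 606.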
Small changes
These are found by looking at Phase 4 data.
Homozygous mutations are in Phase 4 but not Phase 1 (seen as a hole).
Heterozygous variants are seen in Phase 4 data, with the wt allele seen in Phase 1 data.
WT in Phase 1 data
Mut in Phase 4 data. (The wt allele is present due to sequencing errors elsewhere in the read.)
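The rule for small changes can be written as a short decision function. This is only a sketch of the logic stated above: the function name, the per-position depth inputs and the `min_reads` threshold are hypothetical, and a real caller would need to account for the sequencing-error background.

```python
def call_zygosity(phase1_depth, phase4_variant_depth, min_reads=1):
    """Classify one position from Phase 1 (wt) and Phase 4 (variant) read depth.

    `min_reads` is a hypothetical threshold, not a value from the slides.
    """
    has_wt = phase1_depth >= min_reads
    has_variant = phase4_variant_depth >= min_reads
    if has_variant and not has_wt:
        return "homozygous variant"    # a hole in Phase 1
    if has_variant and has_wt:
        return "heterozygous variant"  # wt in Phase 1, variant in Phase 4
    if has_wt:
        return "wild type"
    return "no call"
```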
InDels
Phase 2 data gets indels from end of the read while Phase 3 gets them from the start of the read.
In a perfect world, Phase 2 and Phase 3 data should mirror each other.
Global view
Data for a PCR product containing two exons; blue = exonic DNA, pink = protein coding DNA
The red and blue lines show the read depth of forward and reverse reads.
The lower panel shows the reference and deduced sequences around a point on the upper panel, selected by clicking on the panel with the mouse
Data view
Forward and Reverse sequences
Patient sequence
Patient’s other allele sequence
Score for each nucleotide
Reference genomic, cDNA and protein sequence
Read depth
Heterozygous base
Indel interface
Forward and Reverse sequences
Reference sequence
Patient sequences with indel at start and end of read
Consensus sequence of patient reads across indel
Alignment of patient and reference sequence to identify indel
Data export
The program can both export and import the alignment data as a plain text file
Create an updatable library of sequence variants
Export sequence variants as a text file
Create a LOVD import file for the sequence variants
Validation: BRCA1 & BRCA2
Illuminator detected all the mutations previously identified by dye terminator Sanger sequencing of the exons in BRCA1 and BRCA2 of 10 individuals. Each nucleotide had a read depth of at least 75 reads (approximately 6.6 × 10^3 sequences per gene). The alignment and mutation annotation took ~50 seconds per gene per person.
Conclusions
Illumin8er is
Easy to use
Rapid
Runs on Windows desktop
Uses standard Illumina output files
Reports mutations in a sensitive and specific manner
Next steps
Make freely available by download
http://dna.leeds.ac.uk/illumin8er/
Design compatible LOVD
Large scale validation trial