Sequencing the Maize (B73) Genome: Genome Sequencing Center, Maize Genome Sequencing Consortium
Mar 27, 2015
The Team
• WU Genome Sequencing Center (R. Wilson, PI)
– Bob Fulton, Pat Minx, Sandy Clifton
• Arizona Genomics Institute (R. Wing)
• Cold Spring Harbor Laboratory
– D. Ware, L. Stein
– R. McCombie, R. Martienssen
• Iowa State University (P. Schnable & S. Aluru)
• The Maize research community
The Plan
Progress Through Pipeline (progress as of 9/30/06)
[Bar chart, axis 0% to 100%: library_done 4311; shotgun_done 3106; prefin_done 2261; finished 41]
Progress Through Pipeline Across Time
[Line chart: number of clones (0 to 5000) by week (1 to 25) for library_done, shotgun_done, prefin_done, and finished]
Agenda
9:00 – 9:15 Introductions and Project Overview (Rick Wilson)
9:15 – 10:15 Plans and Progress – WU/AGI/CSHL/ISU Project
– Map and Tile Path Selection (Rod Wing)
– Library Construction and Production (Lucinda Fulton)
– Sequence Improvement (Bob Fulton, Dick McCombie, Rod Wing)
– Data Submission (Joanne Nelson)
– Annotation and Data Display (Doreen Ware)
– Outreach (Rick Wilson)
10:15 – 10:30 Break
10:30 – 11:00 Plans and Progress – DOE Project (Dan Rokhsar)
11:00 – 11:30 Future Plans and Collaborations
– Pat Schnable (by phone): retrotransposons
11:30 – Noon Executive Session
Noon – 1:00 Working Lunch and Discussion
1:00 Depart for Airport
BAC-by-BAC Strategy to Sequence the Maize Genome
Maize B73 Genome (2300 Mb)
BAC library construction (HindIII, EcoRI/MboI; 27X deep; 150 kb avg. insert)
BAC End Sequencing (~800,000)
Genetic Anchoring (in silico, overgo hybridization)
Fingerprinting (~460,000 BACs)
STC database
BAC physical maps (HICF & Agarose)
FPC databases (Agarose and HICF)
Choose a seed BAC
Shotgun sequencing and finishing
STC database search, FP comparison
Determine minimum overlap BACs
Complete maize genome sequence
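The library parameters quoted in the strategy above (2300 Mb genome, 27X depth, 150 kb average insert) imply a clone count that can be checked with simple arithmetic. The sketch below is only that arithmetic; the function names are illustrative and it assumes every clone carries an average-sized insert.

```python
def clones_for_depth(genome_mb: float, depth: float, insert_kb: float) -> int:
    """Clones needed so that total insert length equals depth x genome size."""
    genome_bp = genome_mb * 1_000_000
    insert_bp = insert_kb * 1_000
    return round(depth * genome_bp / insert_bp)


def depth_from_clones(n_clones: int, genome_mb: float, insert_kb: float) -> float:
    """Genome equivalents represented by n_clones average-sized inserts."""
    return n_clones * insert_kb * 1_000 / (genome_mb * 1_000_000)


if __name__ == "__main__":
    # Figures taken from the slide: 2300 Mb genome, 27X deep, 150 kb inserts.
    print(clones_for_depth(2300, 27, 150))
    # ~460,000 fingerprinted BACs, under the same average-insert assumption.
    print(round(depth_from_clones(460_000, 2300, 150), 1))
```

Under these assumptions a 27X library corresponds to roughly 414,000 clones, which is the same order as the ~460,000 fingerprinted BACs.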
Map Summary
1. Total Assembled Contigs: 721
– Equal to 2,150 Mb, 93.5% coverage of the 2300 Mb genome
– Anchored: 421 ctgs, 86.1% of the genome
– Average anchored contig size: 4.7 Mb
– Unanchored: 300 ctgs, 7.4% coverage
– Average unanchored contig size: 0.56 Mb
– 189 of the 300 unanchored contigs have fewer than 10 clones
– Largest anchored contig: 22.9 Mb in Chr9
– Largest unanchored contig: 6.7 Mb
2. Total FPC Markers: 25,924
– STS markers: 9,129
– Overgo markers: 14,877
– Anchored markers: 1,918
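The map summary figures can be cross-checked against each other. The following is a minimal sketch that recomputes the totals and the average contig sizes from the numbers on the slide; the helper name `mean_contig_mb` is illustrative, and the averages are derived from the rounded coverage percentages, so small rounding differences are expected.

```python
GENOME_MB = 2300  # genome size quoted on the slide


def mean_contig_mb(coverage_pct: float, n_contigs: int,
                   genome_mb: float = GENOME_MB) -> float:
    """Average contig size implied by a coverage percentage and contig count."""
    return genome_mb * coverage_pct / 100 / n_contigs


if __name__ == "__main__":
    # Contig counts: anchored + unanchored should equal the total (721).
    print(421 + 300)
    # Coverage: 86.1% + 7.4%, and 2,150 Mb as a share of 2300 Mb.
    print(round(86.1 + 7.4, 1), round(100 * 2150 / GENOME_MB, 1))
    # Average contig sizes implied by the coverage figures.
    print(round(mean_contig_mb(86.1, 421), 2), round(mean_contig_mb(7.4, 300), 2))
    # Marker classes: STS + overgo + anchored should equal 25,924.
    print(9129 + 14877 + 1918)
```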
MTP Selection
• Seed BACs: 4,000, done
• Mega Contig: 197, done
• Clone Walking from Seed BACs: 2,800 done; in progress
• Total clones picked = 6,997
• On track to deliver 1,000 clones/month until the maize MTP is complete
Flowchart for MTP picking and Library Construction
Clone selection (combine seed BAC and BAC end sequences with fingerprinting and trace files)
Clone picking (Resource Center)
MTP sequencing
GenBank BAC end sequence database
Library DNA production
Hfq sequencing
Clone verification
Clone shipping
Continue shotgun library construction at WashU
DNA shearing
Seed BAC database
MTP BAC end database
Library DNA production
Seed BAC Walking
In the agarose and HICF maps, select large clones next to the seed BAC
Blastn search of BAC end sequences against seed BAC sequences
Check blastn alignments for candidate clones
Check trace files for dye blobs
Check the Sulston score in the HICF map for overlap
Check agarose fingerprints to avoid overlap with large bands
Choose walking clone
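The checks above can be read as a filter followed by a ranking. The sketch below is one hypothetical way to express that logic, not the consortium's actual tooling: the `Candidate` fields, the thresholds, and the rule "prefer the smallest overlap with the seed, then the largest clone" are all assumptions made for illustration. It treats a lower Sulston score as stronger evidence of a real fingerprint overlap.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    name: str
    identity_pct: float     # blastn identity of the BAC end vs. the seed BAC
    hit_start_on_seed: int  # seed coordinate where the BAC end alignment starts
    dye_blob: bool          # True if the trace file shows a dye blob
    sulston_score: float    # HICF overlap score (lower = stronger overlap)
    est_size_kb: float      # clone size estimated from the agarose fingerprint


def choose_walking_clone(cands: List[Candidate], seed_length: int,
                         min_identity: float = 98.0,
                         max_sulston: float = 1e-10) -> Optional[Candidate]:
    """Return the passing candidate with the least overlap with the seed."""
    passing = [c for c in cands
               if c.identity_pct >= min_identity
               and not c.dye_blob
               and c.sulston_score <= max_sulston]
    if not passing:
        return None
    # Overlap for a rightward walk: from the BES hit to the end of the seed.
    return min(passing,
               key=lambda c: (seed_length - c.hit_start_on_seed, -c.est_size_kb))


if __name__ == "__main__":
    # Hypothetical candidates for a 150 kb seed BAC.
    cands = [
        Candidate("cand_A", 99.5, 140_000, False, 1e-15, 160),
        Candidate("cand_B", 99.8, 120_000, False, 1e-20, 170),
        Candidate("cand_C", 99.9, 148_000, True, 1e-18, 150),   # dye blob
        Candidate("cand_D", 95.0, 149_000, False, 1e-18, 150),  # weak alignment
    ]
    best = choose_walking_clone(cands, seed_length=150_000)
    print(best.name if best else "no candidate passed")
```

The agarose large-band check is not modeled here; in practice it would be another predicate in the filter.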
Minimum Tile Path Pipeline
• BAC end sequences of potential BACs are BLASTed against the seed BACs
• Results are classified based on location on the FPC
• A table of filtered BLAST results, with links to CMap and GBrowse, is created for each BAC
• BLAST results are imported into CMap and GBrowse with additional information such as trace files and FPCs
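As a rough illustration of the filtering step, the sketch below groups BLAST hits by seed BAC after applying identity and length cutoffs. It assumes the standard 12-column tabular BLAST output (query, subject, % identity, alignment length, mismatches, gap opens, query start/end, subject start/end, e-value, bit score); the cutoffs and the sequence identifiers are hypothetical, and the slides do not say which output format or thresholds the pipeline used.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class Hit(NamedTuple):
    bes_id: str
    seed_id: str
    identity: float
    length: int
    s_start: int
    s_end: int
    evalue: float


def filtered_hits_by_seed(lines: Iterable[str],
                          min_identity: float = 99.0,
                          min_length: int = 200) -> Dict[str, List[Hit]]:
    """Parse tabular BLAST lines and keep strong hits, grouped by seed BAC."""
    table: Dict[str, List[Hit]] = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        f = line.split("\t")
        hit = Hit(f[0], f[1], float(f[2]), int(f[3]),
                  int(f[8]), int(f[9]), float(f[10]))
        if hit.identity >= min_identity and hit.length >= min_length:
            table[hit.seed_id].append(hit)
    return dict(table)


if __name__ == "__main__":
    # Hypothetical identifiers; columns follow the 12-column tabular layout.
    rows = [
        ["bes_001.f", "seed_1", "99.80", "650", "1", "0", "1", "650",
         "140001", "140650", "0.0", "1200"],
        ["bes_002.r", "seed_1", "92.10", "300", "20", "3", "1", "300",
         "5000", "5299", "1e-90", "340"],
    ]
    hits = filtered_hits_by_seed("\t".join(r) for r in rows)
    for seed, seed_hits in hits.items():
        print(seed, [h.bes_id for h in seed_hits])
```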
Minimum Tile Path Pipeline Usage
• A table of alignments between the seed BAC and the BAC end sequences contains links to CMap and GBrowse.
• CMap displays the FPC data for the seed BAC and the potential next BACs.
• GBrowse provides an alignment of the BES with the seed sequence and displays the trace data.
Blast Results Table
Maize Production Sequencing
Shotgun of 19,000 BACs
Fosmid End Sequencing of 1 Million Reads
BAC End Sequencing of 220,000 clones
Maize BAC shotgun
BAC DNA received from AGI or prepared at the GSC
Small Scale Library Construction
Production Sequencing - 1,536 reads/project
Automated Shotgun_done
Maize BAC Shotgun Reads
[Chart: reads per month (axis 0 to 1,400,000), Feb-06 through Sep-06]
To date 3,106 BAC clones are shotgun_done
Maize Fosmid Sequencing
Fosmid trays 0001 to 0471 were received from the Messing lab.
Initial QC was fine, but the bulk shipment has failed to grow.
Stamping results of the original trays show no growth.
85 fosmid ligations, which represent ~250,000 clones, were received from the Messing lab; plating is underway.
GSC fosmid library construction has been completed and represents 1M clones.
Expected completion date is November of this year.
Maize BAC End Sequencing
BAC end sequencing will be completed next week
Total of 440,000 reads from two different libraries
Pass rate of 75% with an average read length of 600 bases
Paired end read rate is ~70%
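The BAC end sequencing figures above translate into approximate yields. This is a back-of-the-envelope sketch only: it assumes the 600-base average applies to passing reads, that 440,000 reads means two reads for each of 220,000 clones, and that the ~70% paired-end rate is the fraction of clones with both ends passing. None of these interpretations is stated on the slide.

```python
TOTAL_READS = 440_000
PASS_RATE = 0.75
AVG_READ_LEN = 600       # bases, assumed to describe passing reads
PAIRED_RATE = 0.70       # assumed: fraction of clones with both ends passing
GENOME_MB = 2300


def bes_yield():
    """Return (passing reads, sequence in Mb, paired clones, genome fraction)."""
    passing = round(TOTAL_READS * PASS_RATE)
    total_mb = passing * AVG_READ_LEN / 1_000_000
    clones = TOTAL_READS // 2            # two end reads per clone
    paired_clones = round(clones * PAIRED_RATE)
    return passing, total_mb, paired_clones, total_mb / GENOME_MB


if __name__ == "__main__":
    passing, total_mb, paired, fraction = bes_yield()
    print(passing, total_mb, paired, round(fraction, 3))
```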
Sequence Improvement Pipeline
• Shotgun_done triggers the prefinishing pipeline
• Initial identification of “do finish” regions
• Manual sorting and use of autoedit (Gordon) to break apart misassemblies
• Autofinish (Gordon) used to choose directed reactions for all gaps and regions of low quality in “do finish” regions
• Reassembly and 2nd iteration of prefinishing pipeline
• Final identification of “do finish” regions and handoff to finishing pipeline
Clone Improvement through the Prefinishing Pipeline
[Bar chart: axis 0 to 800; contigs per clone in bins from 1-5 ctg to 40+ ctg; series: before prefinish, after prefinish]
[Figure labels: End; Spanning Plasmids; Coverage (green); Repeat Tags; Do Finish; GSS sequence; EST sequence; Alignment with cDNA read pairs; Alignment with End Sequences]
Future Plans for Improved Throughput
•Automated Shotgun-done status assigning
•Overlap Evaluation at Prefinishing
•Addition of Fosmid End Pairs at Prefinishing
•Direct Sequencing for Unspanned Gaps
•Additional Finishing Staff Hired at all 3 Centers
Maize clone submissions
clone status            submission keywords
shotgun complete        HTGS_PHASE1; HTGS_FULLTOP
2 rounds of prefinish   HTGS_PHASE1; HTGS_PREFIN
in finishing            HTGS_PHASE1; HTGS_ACTIVEFIN
finished                HTGS_PHASE1; HTGS_IMPROVED
Query GenBank by keywords
zea mays[ORGN] AND HTGS_PREFIN[KYWD] AND WUGSC[CNTR]
zea mays[ORGN] AND HTGS_IMPROVED[KYWD] AND WUGSC[CNTR]
Restrict by date range:
zea mays[ORGN] AND WUGSC[CNTR] AND HTGS_FULLTOP[KYWD] AND 2006/09[PDAT]
zea mays[ORGN] AND WUGSC[CNTR] AND HTGS_FULLTOP[KYWD] AND 2006/09/26:2006/10/03[PDAT]
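Queries of this shape are easy to generate programmatically. The sketch below only builds the query strings from the field tags shown on the slide ([ORGN], [KYWD], [CNTR], [PDAT]); the function name and the status-to-keyword dictionary are illustrative, and no request is sent to GenBank.

```python
from typing import Optional

# Status-to-keyword mapping as listed on the "Maize clone submissions" slide.
STATUS_KEYWORD = {
    "shotgun complete": "HTGS_FULLTOP",
    "2 rounds of prefinish": "HTGS_PREFIN",
    "in finishing": "HTGS_ACTIVEFIN",
    "finished": "HTGS_IMPROVED",
}


def htgs_query(keyword: str, organism: str = "zea mays",
               center: str = "WUGSC", pdat: Optional[str] = None) -> str:
    """Build a keyword query; pdat is a date or a start:end range."""
    terms = [f"{organism}[ORGN]", f"{keyword}[KYWD]", f"{center}[CNTR]"]
    if pdat:
        terms.append(f"{pdat}[PDAT]")
    return " AND ".join(terms)


if __name__ == "__main__":
    print(htgs_query(STATUS_KEYWORD["finished"]))
    print(htgs_query(STATUS_KEYWORD["shotgun complete"],
                     pdat="2006/09/26:2006/10/03"))
```

The term order differs slightly from the date-restricted examples on the slide; since every term is joined with AND, the order should not change the result set.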
HTGS_IMPROVED submissions
Pick a clone name, any clone name:
DEFINITION Zea mays chromosome 4 clone CH201-11H16; ZMMBBc0011H16
Center project name: Z_AF-11H16
Improved sequence is annotated on submission record
Where possible, contigs have been ordered and oriented based on read pairing, and these regions are designated as scaffolds.
Small contigs (<2kb) that don’t represent a clone end, don’t contain improved sequence, or are not part of a scaffold are removed from the final submission.
Contigs are screened for bacterial contamination
FEATURES             Location/Qualifiers
     source          1..173904
                     /organism="Zea mays"
                     /mol_type="genomic DNA"
                     /db_xref="taxon:4577"
                     /chromosome="unknown"
                     /clone="CH201-112C8; ZMMBBc0112C08"
     misc_feature    1..51940
                     /note="scaffold_name:Scaffold1"
     misc_feature    1..36440
                     /note="assembly_name:Contig245 clone_end:left vector_side:T7"
     gap             36441..36540
                     /estimated_length=unknown
     misc_feature    36541..51940
                     /note="assembly_name:Contig240"
     misc_feature    51941..129231
                     /note="scaffold_name:Scaffold2"
     gap             51941..52040
                     /estimated_length=unknown
     misc_feature    52041..59371
                     /note="assembly_name:Contig250"
...........
     misc_feature    120342..122491
                     /note="Improved sequence."
     misc_feature    128142..129231
                     /note="Improved sequence."
     misc_feature    129232..139656
                     /note="scaffold_name:Scaffold3"
.....
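The scaffold, contig, and gap annotations in a record like this can be pulled out with a small parser. The sketch below is a simplified reader written for this excerpt only: it handles plain `start..end` locations and single-line qualifiers, and ignores anything else. For full GenBank records a dedicated parser (for example Biopython's SeqIO) would be the usual choice.

```python
import re
from typing import Dict, List

LOCATION = re.compile(r"^(\d+)\.\.(\d+)$")

# Shortened copy of the feature table shown on the slide.
EXCERPT = """\
FEATURES             Location/Qualifiers
     source          1..173904
                     /organism="Zea mays"
     misc_feature    1..51940
                     /note="scaffold_name:Scaffold1"
     misc_feature    1..36440
                     /note="assembly_name:Contig245 clone_end:left vector_side:T7"
     gap             36441..36540
                     /estimated_length=unknown
     misc_feature    36541..51940
                     /note="assembly_name:Contig240"
"""


def parse_features(text: str) -> List[Dict]:
    """Parse feature lines into dicts with key, start, end, and qualifiers."""
    features: List[Dict] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("FEATURES"):
            continue
        if line.startswith("/"):
            if features:  # attach the qualifier to the most recent feature
                key, _, value = line[1:].partition("=")
                features[-1]["qualifiers"][key] = value.strip('"')
            continue
        parts = line.split()
        match = LOCATION.match(parts[1]) if len(parts) == 2 else None
        if match:
            features.append({"key": parts[0],
                             "start": int(match.group(1)),
                             "end": int(match.group(2)),
                             "qualifiers": {}})
    return features


def length(feature: Dict) -> int:
    """Feature length in bases (coordinates are 1-based and inclusive)."""
    return feature["end"] - feature["start"] + 1


if __name__ == "__main__":
    for f in parse_features(EXCERPT):
        note = f["qualifiers"].get("note", "")
        print(f["key"], f["start"], f["end"], length(f), note)
```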
GenBank: 1005 HTGS_FULLTOP; 254 PREFIN_DONE; 1532 ACTIVE_FIN; 357 HTGS_IMPROVED
Ongoing work at CSHL
• BAC Annotations Levels
• Data Analysis
• Display
• Project Management
• Collaborations
BAC Data Analysis
• Ensembl Pipeline
• 3 inclusive phases of annotation
– Level I: Display BAC information
– Level II: Sequence-based annotations
– Level III: Integrative annotations
Shiran Pasternak, Apurva Narechania, Joshua Stein
Application of Mathematical Repeat Analysis
• Identifies novel repeats w/o dependence on curation.
– Based on frequency of 20-mers in JGI WGS sequence
• Correlates with presence of retroelements.
• Can modulate threshold to optimize application.
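The idea of frequency-based repeat detection can be sketched in a few lines. This is a plain-Python illustration, not the Vmatch-based method the project used: it counts k-mers in a set of reference reads and flags every base of a target sequence that falls inside a k-mer whose count is at or above a log threshold. The log base (10) and the function names are assumptions; the default threshold of 1.25 echoes the "Log 1.25" level named on the following slide.

```python
import math
from collections import Counter
from typing import Iterable, List


def kmer_counts(seqs: Iterable[str], k: int = 20) -> Counter:
    """Count every k-mer occurring in the reference sequences."""
    counts: Counter = Counter()
    for s in seqs:
        s = s.upper()
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts


def repeat_mask(target: str, counts: Counter, k: int = 20,
                log10_threshold: float = 1.25) -> List[bool]:
    """Mark bases covered by any k-mer with log10(count) >= threshold."""
    target = target.upper()
    masked = [False] * len(target)
    for i in range(len(target) - k + 1):
        c = counts.get(target[i:i + k], 0)
        if c > 0 and math.log10(c) >= log10_threshold:
            for j in range(i, i + k):
                masked[j] = True
    return masked


def masked_fraction(masked: List[bool]) -> float:
    """Fraction of bases flagged as repetitive."""
    return sum(masked) / len(masked) if masked else 0.0


if __name__ == "__main__":
    # Toy example with k=4 so the behaviour is visible on short strings.
    reference = ["AAAAAA"] * 10          # "AAAA" occurs 30 times
    mask = repeat_mask("CCCCAAAACCCC", kmer_counts(reference, k=4), k=4)
    print("".join("R" if m else "." for m in mask), masked_fraction(mask))
```

Raising or lowering `log10_threshold` corresponds to the threshold modulation mentioned above: a higher value masks only the most abundant k-mers.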
Exon Coverage at Log 1.25 Repeat Level
[Bar chart: % coverage (axis 0 to 100); data labels 90.21 and 0.15; categories TE and non-TE]
Apurva Narechania, Joshua Stein
Retroelement Annotation
• Classify retroelement families
• Current list covers ~68% of genome
• Ten most prevalent account for ~80% of retroelement sequences
– Ji, huck, opie, zeon, cinful, prem1, grande, xilon, gyma, giepum
Collaboration with Jeff Bennetzen and Philip SanMiguel
Goal is to visualize the history of transpositions
Giepum element interrupted by ji and opie in AC148166
Joshua Stein
Whole Genome Alignments
• Wobble Aware Bulk Aligner (WABA)*
– TIGR Transcripts Rice
– WABA alignments Maize
• Distinguishes between:
– low-similarity regions (grey)
– high-similarity regions (medium blue)
– high-similarity regions w/ wobble-base mismatch of coding regions (green)
*Kent, WJ & Zahler, A.M. (2000). Genome Res. 10:1115-25
Joshua Stein
Whole Genome Alignments
• BLASTZ* with AXTCHAIN** & CHAINNET**
– Sensitive gapped BLAST algorithm designed for aligning long sequences.
– Accommodates long gaps & overlapping gaps, inversions, translocations, & duplications
*Schwartz, S et al. (2003). Genome Res. 13:103-7
**Kent, WJ, et al. (2003). PNAS 100:11484-11489
Example of BLASTZ(net) display in Ensembl.
www.maizesequence.org
Sequenced BAC
FPC Contig
Virtual Bin
Core Bin Marker
Chromosome
Synteny Views
Main Navigation bar is accessible from every page
Contains multiple entry points to the genome
MapView
Displays statistics by chromosome and provides entry points based on a single chromosome
CytoView
Provides detailed information on features anchored to the FPC map.
The side bar highlights the location on the chromosome and provides page-specific functionality, including data export.
The detailed view is customizable; tracks can be added or removed by users.
Features have drop-down menus that contain general information as well as internal and external links.
ContigView
This view is based on BAC coordinates and displays annotation levels II and III.
The header contains the clone name in the physical map, the GenBank accession, and chromosome and FPC contig information.
The detailed view offers semantic zooming, is customizable, and provides links to other views and information resources.
SyntenyView
Upcoming Features
• Release
– October 2006
• BlastView
– December 2006
• BAC Annotation
– Level II: January 2007
– Level III annotation: April 2007
– WG alignments: June 2007
• BioMart
– January 2007
• NSF collaborations
– TwinScan annotations: March 2007
– Maize Optical Map: July 2007
– Full-length cDNAs: December 2007
• Notification System
– Users are notified
• When a region of interest is updated
• When markers are aligned to a specific sequence
– January, 2007
Hardware Environments
• Software
– Developed locally
– Managed with source control
– Frequent releases to staging environment
– Quarterly production releases
• Data
– Timed analysis on staging environment
– Mirrored weekly on production
Shiran Pasternak, Apurva Narechania
Quality Assurance
• Unit-testing framework
– Binary assertions
– Failure report and automatic notification
• Software Quality Control
– e.g., code retrieves correct data from the database
• Data Quality Control
– e.g., clone in GenBank record exists in FPC map
Shiran Pasternak
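A data quality check of the kind named above ("clone in GenBank record exists in FPC map") fits naturally into a unit-testing framework. The sketch below uses Python's standard `unittest` module as a stand-in; the slides do not say which framework the project used, and the clone sets here are hard-coded placeholders where a real check would query the databases.

```python
import unittest
from typing import Iterable, Set


def clones_missing_from_map(genbank_clones: Iterable[str],
                            fpc_clones: Iterable[str]) -> Set[str]:
    """Return clones that have a GenBank record but no entry in the FPC map."""
    return set(genbank_clones) - set(fpc_clones)


class DataQualityTest(unittest.TestCase):
    # Placeholder data; clone names are borrowed from the submission slides.
    FPC_CLONES = {"ZMMBBc0011H16", "ZMMBBc0112C08"}
    GENBANK_CLONES = ["ZMMBBc0011H16", "ZMMBBc0112C08"]

    def test_genbank_clones_exist_in_fpc_map(self):
        missing = clones_missing_from_map(self.GENBANK_CLONES, self.FPC_CLONES)
        # Binary assertion: passes only if no clone is missing from the map.
        self.assertEqual(missing, set(),
                         f"clones absent from FPC map: {sorted(missing)}")


def run_checks() -> unittest.TestResult:
    """Run the data QC suite and return the result for reporting."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(DataQualityTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Calling `run_checks()` returns a result object whose `failures` list could feed a failure report or notification step.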
Project Management
• Mantis Bug Tracker
– Manage tasks using priorities, severities, and resource allocations
– Automated submission of issues using feedback form
– Generation of progress reports
Project Management
• Wiki
– Enhances group communication
– Meeting notes, flowcharts, specification documents
– Maintains history of specifications and design decisions
– Seamless editing
Collaborations
• MaizeGDB (Iowa State University, University of Missouri)
– C. Lawrence
• Maize Optical Map (University of Wisconsin)
– D. Schwartz
• Maize Transposon Annotation (University of Georgia, Purdue)
– J. Bennetzen, P. SanMiguel
• Ensembl (EBI)
– E. Birney
• Vmatch for Mathematical Repeats (University of Hamburg)
– S. Kurtz
• Maize Full Length cDNA project (Arizona Genomics Institute)
– Y. Yu
• TwinScan (Danforth Plant Science Center)
– B. Barbazuk