BioSci 145B Lecture #5, 5/4/2004. © copyright Bruce Blumberg 2004. All rights reserved.


• Bruce Blumberg
  – 2113E McGaugh Hall
  – office hours Wed 12-1 PM (or by appointment)
  – phone 824-8573
  – [email protected]

• TA – Curtis Daly [email protected]
  – 2113 McGaugh Hall, 924-6873, 3116
  – Office hours Tuesday 11-12

• lectures will be posted on web pages after lecture
  – http://eee.uci.edu/04s/05705/ - link only here
  – http://blumberg-serv.bio.uci.edu/bio145b-sp2004
  – http://blumberg.bio.uci.edu/bio145b-sp2004


Genome sequencing

• The problem
  – Genome sizes for most eukaryotes are large (10^8-10^9 bp)
  – High quality sequences only about 600-800 bp/pass

• The solution
  – Break genome into lots of bits and sequence them all
  – Reassemble with computer

• The benefit
  – Rapid increase in information about genome size, gene comparisons, etc.
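To get a feel for the scale of the problem, the short Python sketch below estimates how many reads are needed for a given coverage. It assumes idealized, uniformly random reads; the coverage values (1x, 4x, 8x) and the 700 bp read length (midpoint of 600-800 bp) are illustrative choices, and the unsampled-base estimate is the usual Poisson approximation, not a figure from the lecture.

```python
import math

def reads_needed(genome_bp: float, read_bp: float, coverage: float) -> int:
    """Reads required so that total sequenced bases = coverage * genome size."""
    return math.ceil(coverage * genome_bp / read_bp)

def expected_unsequenced_fraction(coverage: float) -> float:
    """Poisson estimate of the fraction of bases covered by no read at all."""
    return math.exp(-coverage)

genome_bp = 1e9   # upper end of the 10^8-10^9 bp range
read_bp = 700     # assumed midpoint of 600-800 bp per pass

for coverage in (1, 4, 8):  # illustrative coverage levels
    n = reads_needed(genome_bp, read_bp, coverage)
    missed = expected_unsequenced_fraction(coverage)
    print(f"{coverage}x coverage: {n:,} reads, ~{missed:.2%} of bases unsampled")
```

Even under these ideal assumptions some bases are never sampled, which is one reason assembled sequences end up with gaps.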


Genome sequencing (contd)

• Shotgun sequencing NOT invented by Craig Venter
  – Messing 1981 first description of shotgun
  – Sanger lab developed current methods in 1983
  – approach
    • blast genome into small chunks
      – Shearing is usual
      – 4-cutters also used
    • clone these chunks
      – In the early days, try to make small insert libraries (0.5-1.5 kb)
      – Now typically make 3 library types
        » 3-5 kb, 8 kb plasmid
        » 40 kb fosmid - to jump repetitive sequences


Genome sequencing (contd)

• sequence + assemble by computer
  – A priori difficulties
    • how to assemble fragments
      – Software now very good
    • what to do about repeats?
      – Fosmids and BAC STC help a lot
    • how to get nice uniform distribution of sequences without too much redundancy?
      – Biggest problem, not really well resolved
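As a toy illustration of assembling fragments by computer, the sketch below repeatedly merges the two fragments with the longest suffix/prefix overlap. It is a simplification, not how production assemblers work: it assumes error-free reads from one strand, no read contained inside another, and no repeats longer than the minimum overlap. The function names and example reads are made up for illustration.

```python
def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of a that equals a prefix of b (0 if below min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Merge the best-overlapping pair until no overlap of at least min_len remains."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicate reads, keep order
    while True:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            return contigs  # nothing left to merge; remaining pieces are separate contigs
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)

# Hypothetical reads drawn from the made-up sequence ATGGCGTACGTTAGC
reads = ["CGTTAGC", "ATGGCGTAC", "CGTACGTTA"]
print(greedy_assemble(reads))
```

Reads that share no overlap stay as separate contigs, which corresponds to a gap in the assembly. A repeat longer than the reads would produce ambiguous overlaps, which is why longer-insert clones help.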


Genome sequencing (contd)

• Assembled sequences always have gaps of various sizes
  – how to cross these gaps?
    • Quickly and cost-effectively
  – Need to link sequences somehow
    • How depends on the size of the gaps to be crossed


Genome sequencing (contd)

• For small gaps (up to 8 kb or so)
  – often can close by sequencing both ends of clones

• For medium sized gaps (8-30 kb)
  – Primer walking across a linking clone (cosmid or fosmid)

• Large gaps require much more effort
  – Identify large insert clones that span gap
    • Typically from BAC end sequences
    • May have to screen libraries to find
  – Shotgun sequence these and assemble
  – Close any small gaps remaining with primer walking
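The gap-closing rules above can be summarized as a small lookup. This is only a sketch of the decision logic: the 8 kb and 30 kb cutoffs come from the slides (where they are approximate, "8 kb or so"), and the function name is hypothetical.

```python
def gap_closure_strategy(gap_kb: float) -> str:
    """Suggest a gap-closing approach from the approximate gap size in kb."""
    if gap_kb < 0:
        raise ValueError("gap size cannot be negative")
    if gap_kb <= 8:
        # small gaps: paired end reads from clones that span the gap
        return "sequence both ends of clones spanning the gap"
    if gap_kb <= 30:
        # medium gaps: walk across a cosmid or fosmid that links the contigs
        return "primer walk across a linking cosmid or fosmid clone"
    # large gaps: find a spanning large-insert clone, then shotgun and finish it
    return ("identify a spanning large-insert clone (e.g. from BAC end sequences), "
            "shotgun sequence it, then primer walk any small gaps left")

for size in (2, 15, 120):  # example gap sizes in kb
    print(f"{size} kb gap: {gap_closure_strategy(size)}")
```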



• Shotgun sequencing (contd)
  – How to minimize sequence redundancy (re-sequencing the same region)?
    • Best way to minimize redundancy is to map before you start
      – C. elegans was done this way - when the sequence was finished, it was FINISHED
        » mapping took almost 10 years
      – mapping much too tedious and nonprofitable for Celera
        » who cares about redundancy, let’s sequence and make $$
    • why does redundancy matter?
      – Finished sequence today costs about $0.50/base
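To put the $0.50/base figure in perspective, the arithmetic below multiplies it by a few genome sizes quoted elsewhere in this lecture. It is a back-of-the-envelope sketch that assumes the per-base price applies uniformly to the whole genome, which is a simplification.

```python
COST_PER_FINISHED_BASE = 0.50  # USD per base, the figure quoted in this lecture

# Genome sizes in bp as quoted elsewhere in this lecture
genomes = {
    "C. elegans": 97e6,
    "Drosophila melanogaster": 120e6,
    "Xenopus tropicalis": 1.7e9,
}

def finishing_cost(genome_bp: float, cost_per_base: float = COST_PER_FINISHED_BASE) -> float:
    """Cost in USD of finished sequence for a genome of the given size."""
    return genome_bp * cost_per_base

for name, size in genomes.items():
    print(f"{name}: ${finishing_cost(size) / 1e6:,.1f} million")
```

At this price every unnecessarily re-sequenced region is expensive, which is the point of the question on the slide.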



– Mapping by fingerprinting

– Mapping by hybridization



• Actual large insert fingerprinting gel
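Mapping by fingerprinting compares the restriction fragment band patterns of large-insert clones: clones that share many bands probably overlap. The sketch below counts shared bands within a size tolerance. The band sizes, the 2% tolerance, and the 50% threshold are all made-up illustrative values, and a real analysis would use a statistical score rather than a fixed cutoff.

```python
def shared_bands(fp_a: list[float], fp_b: list[float], tolerance: float = 0.02) -> int:
    """Count bands in fp_a matching a not-yet-used band in fp_b within a relative tolerance."""
    remaining = sorted(fp_b)
    shared = 0
    for size in sorted(fp_a):
        for k, other in enumerate(remaining):
            if abs(size - other) <= tolerance * max(size, other):
                shared += 1
                del remaining[k]  # each band in fp_b can be matched only once
                break
    return shared

def likely_overlap(fp_a, fp_b, min_fraction: float = 0.5, tolerance: float = 0.02) -> bool:
    """Call two clones overlapping if enough of the smaller fingerprint is shared."""
    if not fp_a or not fp_b:
        return False
    shared = shared_bands(fp_a, fp_b, tolerance)
    return shared / min(len(fp_a), len(fp_b)) >= min_fraction

# Hypothetical fragment sizes in bp read from a fingerprinting gel
clone_a = [1200, 2500, 3100, 4800, 7600]
clone_b = [2510, 3100, 4790, 9000, 650]
clone_c = [500, 1800, 5600, 8200]

print(likely_overlap(clone_a, clone_b))
print(likely_overlap(clone_a, clone_c))
```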


Traditional (map first) vs STC (map as you go along) mapping

Map before sequencing

Map as you go


The human genome

• On Feb 12, 2001, Celera and the Human Genome Project published “draft” human genome sequences
  – Celera -> 39114 (WGS)
  – Ensembl -> 29691 (map as you go)
  – Consensus from all sources ~30K

• Number of genes
  – C. elegans – 19,000
  – Arabidopsis – 25,000

• Predictions had been from 50-140k human genes
  – What’s up with that?
  – Are we only slightly more complicated than a weed?
  – How can we possibly get a human with less than 2x the number of genes as C. elegans?
  – Implications?

• UNRAVELING THE DNA MYTH: The spurious foundation of genetic engineering, Barry Commoner, Harper’s Magazine, Feb 2002


The human genome

• The answer
  – Somewhat sloppy science
  – Gene sets don’t overlap completely
  – Floor is 42K (= 42113)
  – 105,680 UniGene clusters from ESTs (down from 128,826 last year)


Genome sequencing(contd)

• Whole genome shotgun sequencing (Celera)
  – premise is that rapid generation of draft sequence is valuable
  – why bother trying to clone and sequence difficult regions?
    • Basically just forget regions of repetitive DNA - not cost effective
      – R0t analysis suggests not many genes there anyway
  – using this approach, genome was alleged to be 90% finished in 2001
    • More than 95% today
    • rule of thumb is that it takes at least as long to finish the last 5% as it took to get the first 95%
  – problems
    • sequence may never be complete, as the C. elegans sequence is
    • much redundant sequence with many sparse regions and lots of gaps
    • Fragment assembly for regions of highly repetitive DNA is dubious at best

• Map as you go method inherently more complete
  – Sets up for finishing since an ordered set of overlapping BACs is produced

• Both methods produce reasonable data given enough sequencing


The human genome

• How finished is the human genome sequence?
  – Draft sequence to high coverage
  – Chromosome by chromosome finishing now
    • Chr 22 – 1999
    • Chr 21 – 2000
    • Chr 20 – 2001
    • Chr 15 – 2003
    • Chr 6, 7, Y – 2003
    • Chr 13, 19 – 2004



• Knowing what we know now – how to approach a large new genome?
  – Xenopus tropicalis 1.7 Gb (about ½ human)
  – BAC end sequencing
  – Whole genome shotgun
  – Gaps closed with BACs
  – 8x coverage by end of 2004
  – Finishing dependent on additional funding


Genome sequencing

• DOE – Joint Genome Institute
  – http://www.jgi.doe.gov/
  – Numerous advances in sequencing technology
    • Increased pass rate from ~70% to > 90%
    • Lowered cost nearly 3 fold


Other sequencing technologies

• Sequencing by hybridization is most interesting
  – Construct a high-density microchip with all possible combinations of a short oligonucleotide
    • Up to 25-mers
    • By photolithography
      – Synthesized on chip directly
  – Label and hybridize fragment to be sequenced
  – Wash stringently
  – Read fluorescent spots
  – Reconstruct sequence by computer
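The "reconstruct sequence by computer" step can be sketched as chaining the oligos that lit up on the chip by their overlaps. The toy Python below assumes an idealized, error-free experiment in which every k-mer of the target is detected and every (k-1)-base overlap is unique (no repeats). The function names and the example sequence are made up for illustration.

```python
def possible_oligos(k: int) -> int:
    """Number of distinct oligos of length k that a complete chip would need."""
    return 4 ** k

def spectrum(seq: str, k: int) -> set[str]:
    """All k-mers present in seq, i.e. the spots expected to hybridize."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def reconstruct(kmers: set[str]) -> str:
    """Chain k-mers by (k-1)-base overlaps; fails if the spectrum is ambiguous."""
    by_prefix = {kmer[:-1]: kmer for kmer in kmers}
    suffixes = {kmer[1:] for kmer in kmers}
    starts = [kmer for kmer in kmers if kmer[:-1] not in suffixes]
    if len(starts) != 1 or len(by_prefix) != len(kmers):
        raise ValueError("ambiguous spectrum (repeats): no unique reconstruction")
    current = starts[0]
    seq = current
    for _ in range(len(kmers) - 1):
        current = by_prefix.get(current[1:])
        if current is None:
            raise ValueError("spectrum does not form a single chain")
        seq += current[-1]  # each step extends the sequence by one base
    return seq

target = "ATGGCGTACGTTAGC"  # hypothetical fragment
print(possible_oligos(8))
print(reconstruct(spectrum(target, 5)))
```

With shorter oligos the same fragment contains repeated overlaps and the reconstruction becomes ambiguous, which hints at why the method suits short, already-known sequences better than de novo work.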


Other sequencing technologies (contd)

• Sequencing by hybridization rarely used for de novo sequencing
  – Extremely fast and useful for resequencing something whose sequence you already know in order to identify mutations
  – Disease causing changes
    • e.g. in mitochondrial DNA
  – SNP discovery
  – Works best for examining sequence of <10 kb


Other sequencing technologies (contd)

• http://www.affymetrix.com/products/arrays/index.affx

• SNP discovery
  – Photo shows mitochondrial chip
  – Right panel shows pairs of normal (top) vs disease (bottom) (Leber’s Hereditary Optic Neuropathy)
    • Top 3 disease mutations
    • Bottom control with no change


Useful software for molecular biology (contd)

• NCBI – www.ncbi.nlm.nih.gov
  – main information and analysis resource
  – indispensable resource



• NCBI – Blast – how to find similar genes
  – www.ncbi.nlm.nih.gov/BLAST/



• Why pay Celera?


Practice midterm

1. (6 points) Your laboratory works on the strange organisms that live around hydrothermal vents in the deep ocean as a model system for the first multicellular organisms. Your PI has developed a new method of culturing such organisms, making it possible to grow the wormlike animals found around the vents in the laboratory. One of the first things that needs to be done is to construct the molecular tools that will be required to characterize your assigned animal, the Pompeii worm (Alvinella pompejana) which can survive an environment as hot as 80° C. The ultimate goal will be to establish an A. pompejana genome project including whole genome sequencing and mapping, an EST project and DNA microarrays.

The first goal is to make a genomic library. What type of library will you make, i.e., which type of vector? Justify your choice. What type of equipment will be required to make your library?


Practice midterm

1. answer

You should choose to make a BAC or PAC library. BAC is best for genome sequencing because it accepts large inserts, is stable, and the vector is small, which facilitates shotgun sequencing.

Not much equipment is required beyond standard molecular biology laboratory equipment, an electroporator, and PFGE (pulsed field gel electrophoresis) apparatus. PFGE is indispensable for isolating the large DNA needed to make good genomic libraries.


Practice midterm

2. (4 points) Describe a method to make a physical map of the A. pompejana genome in order to facilitate large-scale sequencing.

Use large insert genomic library to construct a map.

Map the clones by fingerprinting, map as you go, or hybridization.

Restriction mapping of the whole genome was NOT an acceptable answer.


Practice midterm

3. (5 points) You received an E. coli strain with the following genotype from a neighboring laboratory for the purposes of propagating your genomic library: mcrA, Δ(mrr-hsdRMS-mcrBC), ΔlacX74, deoR, recA1, araD139, Δ(ara-leu)7697, galU, galK, endA1, nupG (in every case above, the bacteria are DEFICIENT in the indicated gene product)

a) Is this a good strain for the type of genomic library you have chosen to make, i.e., does it have the necessary genetic markers for your library to be stable and readily screened?

b) If so, what are the desirable markers that the strain has? If not, which ones are missing?

c) Would the strain be suitable if you had made a YAC library? Why?

a) Yes, the strain is suitable for PAC and BAC libraries.

b) It is restriction deficient and deoR. Some also pointed out that the strain should have lacZΔM15 for blue-white selection if BACs were being used.

c) The strain is not suitable for a YAC library because yeast artificial chromosomes can only be propagated in YEAST.


Practice midterm

4. (5 points) A colleague has experimentally determined that the A. pompejana genome is 110 Mb – right between C. elegans (97 Mb) and Drosophila melanogaster (120 Mb). Describe a sequencing strategy that could allow the rapid generation of a draft genome sequence. How might you combine the mapping proposed in your answer to question 2 to facilitate the completion of the genome sequence?

Whole genome shotgun will generate a rapid draft sequence. Combining this with the whole genome map made in question 2 will enable closing gaps.


Practice midterm

5. (6 points) As a side project, you decide to see if the A. pompejana genome contains homeobox genes. You dig into the laboratory archives and find a cDNA probe that contains the Drosophila melanogaster Antennapedia homeobox. What is the best way to find whether the A. pompejana genome contains homeobox genes? If so, how will you isolate genomic clones containing these homeobox genes? Let’s say you find 8 A. pompejana homeobox genes. Describe a quick way to tell whether they are located in one or more clusters as in Drosophila or C. elegans?

Genomic Southern with A. pompejana DNA probed with the Antp homeobox to work out conditions.

Screen the genomic library you made with the Antp probe under these conditions.

Once you recover the 8 genes, hybridize them back to the large insert clones or to a Southern blot of a PFGE gel of genomic DNA digested with an 8-cutter. Note whether more than 1 homeobox gene maps to each clone or fragment.


Practice midterm

7. (6 points) Remember that you also need to provide material for the EST project. This means that it is time to make cDNA libraries, right? Assume that the libraries you make will be used for more than just EST sequencing. What sort of vector will you choose? Should you go to the trouble of enriching the library for full-length cDNAs? If so, how? Should the libraries be standard, normalized, or subtracted? Justify your answer. If normalized or subtracted libraries are required, describe generally how you will make them.

• Plasmid vector (NOT PAC or BAC)

• Yes, you should enrich for full-length cDNAs since the library will be used for multiple purposes

• Cap trap, oligo-capping or cap-affinity chromatography gets full-length mRNA, which should yield a library enriched for full-length cDNAs

• The libraries should be normalized since EST sequencing is contemplated and we don’t want to sequence the same thing many times

• Make normalized libraries by making driver from the library you wish to normalize, then hybridizing it back to ss-cDNA from that library to a low Cot value (5-10). After removing hybrids, use the remaining cDNA to make the normalized library
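For the Cot value in the last answer: Cot is the starting nucleotide concentration (mol/L) multiplied by the hybridization time (s). The sketch below estimates how long a hybridization would have to run to reach Cot 5-10. The 100 µg/mL driver concentration is a hypothetical example, and the conversion uses an approximate average mass of 330 g/mol per nucleotide, so treat the numbers as rough estimates.

```python
NT_MOLAR_MASS = 330.0  # g/mol, approximate average mass of one nucleotide (assumption)

def nucleotide_molarity(conc_ug_per_ml: float) -> float:
    """Convert a nucleic acid concentration in ug/mL to mol nucleotide per litre."""
    grams_per_litre = conc_ug_per_ml * 1e-3  # 1 ug/mL = 1 mg/L = 1e-3 g/L
    return grams_per_litre / NT_MOLAR_MASS

def cot(conc_ug_per_ml: float, hours: float) -> float:
    """Cot in mol * s / L after hybridizing for the given number of hours."""
    return nucleotide_molarity(conc_ug_per_ml) * hours * 3600

def hours_to_reach(target_cot: float, conc_ug_per_ml: float) -> float:
    """Hybridization time in hours needed to reach the target Cot."""
    return target_cot / (nucleotide_molarity(conc_ug_per_ml) * 3600)

driver_conc = 100.0  # ug/mL, hypothetical driver concentration
for target in (5, 10):
    print(f"Cot {target}: about {hours_to_reach(target, driver_conc):.1f} h at {driver_conc:.0f} ug/mL")
```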


Practice midterm

8. (4 points) What are the major differences between normalized and subtracted cDNA libraries? If you want to use a cDNA library to isolate genes expressed specifically in the tail of A. pompejana compared with the head, would it be better to normalize or subtract the probe that you will use? Explain your reasoning.

Normalized libraries are depleted in abundant genes and enhanced in rare genes by self-hybridization.

Subtracted libraries are depleted in genes that are common between two sources

A subtracted probe is appropriate here since you wish to identify genes specifically expressed in the tail.
