Automated sequencing machines, particularly those made by PE Applied Biosystems, use 4 colors, so they can read all 4 bases at once.

Jan 20, 2016


Gabriel Cain


All the Genes?

• Any human gene can now be found in the genome by similarity searching with over 95% certainty.

• However, the sequence still has many gaps
– unlikely to find an uninterrupted genomic segment for any gene
– still can’t identify pseudogenes with certainty

• This will improve as more sequence data accumulates


Finding Genes in Genome Sequence Is Not Easy

• About 2% of human DNA encodes functional genes.

• Genes are interspersed among long stretches of non-coding DNA.

• Repeats, pseudogenes, and introns complicate gene finding.


Impact on Bioinformatics

• Genomics produces high-throughput, high-quality data, and bioinformatics provides the analysis and interpretation of these massive data sets.

• It is impossible to separate genomics laboratory technologies from the computational tools required for data analysis.


Six basic questions about genomes

[1] how is a genome sequenced?

[2] when is the project finished?

[3] sequence one individual or many?

[4] what information is in the DNA?

[5] how many genes are in the genome?

[6] how can whole genomes be compared?


[1] Genome projects: sequencing strategies

Hierarchical shotgun method

Assemble contigs from various chromosomes, then sequence and assemble them. A contig is a set of overlapping clones or sequences from which a sequence can be obtained. The sequence may be draft or finished.

A contig map is thus a chromosome map showing the locations of those regions of a chromosome where contiguous DNA segments overlap. Contig maps are important because they make it possible to study a complete, and often large, segment of the genome by examining a series of overlapping clones, which together provide an unbroken succession of information about that region.

Scaffold: an ordered set of contigs placed on a chromosome.

Shotgun

An approach used to decode an organism's genome by shredding it into smaller fragments of DNA which can be sequenced individually. The sequences of these fragments are then ordered, based on overlaps in the genetic code, and finally reassembled into the complete sequence. The 'whole genome shotgun' method is applied to the entire genome all at once, while the 'hierarchical shotgun' method is applied to large, overlapping DNA fragments of known location in the genome.

http://www.genome.gov/glossary.cfm
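
The "order by overlaps, then reassemble" idea in the shotgun definition can be illustrated with a toy greedy assembler. This is only a sketch under strong assumptions: reads are error-free, all come from the same strand, and no repeat is longer than the minimum overlap. The function names and the min_len parameter are illustrative and are not taken from any real assembler, which would use far more elaborate methods.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b (0 if shorter than min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the largest suffix/prefix overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicate reads, keep order
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return contigs  # nothing left to merge; gaps leave several contigs
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
```

For example, greedy_assemble(["ACGATTAC", "TTACAATA", "AATAGGTT"]) merges the three reads into the single contig ACGATTACAATAGGTT. Reads that share no overlap of at least min_len bases stay as separate contigs, which is the toy analogue of a gap.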


3. Whole Genome Shotgun Sequencing

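[Figure: the genome is cut many times at random, and each cloned fragment is read from both ends to give forward-reverse linked reads of ~500 bp each, a known distance apart]

Clone types:

• plasmids (2 – 10 Kbp)

• cosmids (40 Kbp)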


ARACHNE: Whole Genome Shotgun Assembly

1. Find overlapping reads

2. Merge good pairs of reads into longer contigs

3. Link contigs to form supercontigs

4. Derive consensus sequence ..ACGATTACAATAGGTT..

http://www-genome.wi.mit.edu/wga/


[2] When is the project finished?

Get five to ten-fold coverage

Finished sequence: a clone insert is contiguously sequenced with high quality standard of error rate 0.01%. There are usually no gaps in the sequence.

Draft sequence: clone sequences may contain several regions separated by gaps. The true order and orientation of the pieces may not be known.
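
To see why five- to ten-fold coverage is a sensible target, a rough calculation helps. The sketch below uses the idealised Lander-Waterman assumption that reads land uniformly at random, in which case the expected fraction of bases hit by no read is about e^(-c) for coverage c. The 3 Gbp genome size and the read count are assumptions chosen for illustration; the ~500 bp read length is the one shown on the shotgun slide.

```python
import math


def coverage(num_reads: int, read_len: int, genome_len: int) -> float:
    """Average fold coverage c = N * L / G."""
    return num_reads * read_len / genome_len


def expected_uncovered_fraction(c: float) -> float:
    """Expected fraction of bases covered by no read, assuming uniformly random reads."""
    return math.exp(-c)


if __name__ == "__main__":
    # Assumed numbers: 30 million reads of 500 bp over a 3 Gbp genome.
    c = coverage(30_000_000, 500, 3_000_000_000)
    print(c, expected_uncovered_fraction(c))
    print(10.0, expected_uncovered_fraction(10.0))
```

With these assumed numbers the coverage is five-fold and the model leaves roughly 0.7% of bases unsequenced; at ten-fold coverage that drops to about 0.005%. Real projects fall short of this ideal because cloning bias and repeats make read placement non-uniform.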


Repetitive DNA sequences: five classes

[1] Interspersed repeats: transposon-derived repeats
-- 45% of human genome; LTR, SINE, LINE

[2] Processed pseudogenes

[3] Simple sequence repeats
-- micro- and minisatellites
-- ACAAACT, 11 million times in a Drosophila
-- Human genome has 50,000 CA dinucleotide repeats

[4] Segmental duplications (about 5% of human genome)

[5] Tandem repeats (e.g. telomeres, centromeres)
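
Simple sequence repeats such as CA dinucleotide runs are easy to locate with a regular expression. The following is a minimal sketch; the function name, the default unit "CA", and the min_copies threshold are illustrative choices, and the scan covers one strand only.

```python
import re


def find_simple_repeats(seq: str, unit: str = "CA", min_copies: int = 4):
    """Return (start, end, copies) for each run of at least min_copies tandem copies of unit."""
    pattern = re.compile(f"(?:{re.escape(unit)}){{{min_copies},}}")
    return [
        (m.start(), m.end(), (m.end() - m.start()) // len(unit))
        for m in pattern.finditer(seq.upper())
    ]
```

Coordinates are 0-based and end-exclusive, so seq[start:end] is the repeat itself. A dedicated tool such as RepeatMasker also handles imperfect repeats, which this sketch ignores.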


• LINE and SINE repeats. A LINE (long interspersed nuclear element) encodes a reverse transcriptase (RT) and perhaps other proteins. Mammalian genomes contain an old LINE family, called LINE2, which apparently stopped transposing before the mammalian radiation, and a younger family, called L1 or LINE1, many of which were inserted after the mammalian radiation (and are still being inserted). A SINE (short interspersed nuclear element) generally moves using RT from a LINE. Examples include the MIR elements, which co-evolved with the LINE2 elements. Since the mammalian radiation, each lineage has evolved its own SINE family. Primates have Alu elements and mice have B1, B2, etc. The process of insertion of a LINE or SINE into the genome causes a short sequence (7-21 bp for Alus) to be repeated, with one copy (in the same orientation) at each end of the inserted sequence. Alus have accumulated preferentially in GC-rich regions, L1s in GC-poor regions.


What is the function of nongenic DNA?

Hypotheses:

• Nongenic DNA performs essential functions, such as regulation of gene expression.

• Nongenic DNA is inert, genetically and physiologically. Excess DNA is incidental and is called “junk DNA.”

• Nongenic DNA is a functional parasite or selfish DNA (retrotransposons).

• Nongenic DNA has a structural function.


Classification of DNA

FUNCTIONAL (sequences that perform a function)

- Coding (translated into proteins)

- Non-coding (not translated)

* Transcribed (functions at the RNA level: ribosomal subunits)

* Not transcribed (functions at the DNA level: intron, promoter, enhancer, etc.)

NON-FUNCTIONAL (sequences that perform no function: “junk DNA”)


Gene-finding algorithms

Homology-based searches (“extrinsic”): rely on previously identified genes

Algorithm-based searches (“intrinsic”): investigate nucleotide composition, open reading frames, and other intrinsic properties of genomic DNA


[Figure: DNA → RNA → mature RNA → protein, with an intron labelled]


Homology-based searching: compare DNA to expressed genes (ESTs)

[Figure: DNA, RNA, and protein levels, with an intron labelled]


[Figure: DNA and RNA levels]

Algorithm-based searching: compare DNA in exons (unique codon usage) to introns (unique splice sites) to noncoding DNA. Identify open reading frames (ORFs).
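
The simplest intrinsic signal is the open reading frame. The sketch below scans all six frames for stretches that start at ATG and end at the first in-frame stop codon. It assumes the standard stop codons and an uninterrupted coding region, so it does not handle introns; the function names and the min_codons cut-off are illustrative.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq: str, min_codons: int = 3):
    """Return (strand, frame, start, end) for each ATG...stop ORF.

    Coordinates are 0-based, end-exclusive, and relative to the scanned strand
    (for "-" they refer to the reverse complement, not the input sequence).
    min_codons counts codons before the stop codon.
    """
    orfs = []
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first ATG opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3))
                    start = None  # look for the next ORF in this frame
    return orfs
```

For the short test sequence CCATGAAACCCGGGTAACC this reports one ORF on the forward strand in frame 2, spanning ATGAAACCCGGGTAA. In a real genome the cut-off would be far higher than three codons, since short ORFs occur by chance in every frame.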


[6] how can whole genomes be compared?

-- molecular phylogeny

-- You can BLAST (or PSI-BLAST) all the DNA and/or protein in one genome against another

-- We looked at TaxPlot and COG for bacterial (and for some eukaryotic) genomes


Orthologue & Paralogue

• Orthologue: homologous genes in different organisms that diverged through speciation; they often retain the same function.

• Paralogue: homologous genes in the same organism that originated from gene duplication.


Orthologue & Paralogue (four diagram slides)

[Figure: Gene A and Gene B diverge; labels Species 1, Species 2, Gene A, Gene B]

[Figure: Species 1 has Gene A and Gene B; Species 2 has Gene A and Gene B]

[Figure: Species 1 has Gene A; Species 2 has Gene A and Gene B]

[Figure: Species 1 has Gene A; Species 2 has Gene B]


Comparative Genomics Using ACT

The Artemis Comparison Tool


Artemis

• Artemis is a free DNA sequence viewer and annotation tool that allows visualization of sequence features and the results of analyses within the context of the sequence, and its six-frame translation.

• http://www.sanger.ac.uk/Software/Artemis/


Artemis comparison tool ACT

• Based on Artemis and written in Java.

• Allows visualisation of two or more sequences together with a comparison file.

• The comparison file can be BLASTn or tBLASTx output.

• Retains all the functionality of Artemis.
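
A comparison file is essentially a list of matching segments (HSPs) between the two sequences. The sketch below assumes the comparison was saved in the standard 12-column tab-separated BLAST tabular layout (query, subject, % identity, length, mismatches, gap opens, query start/end, subject start/end, e-value, bit score) and shows how such a file can be read and filtered by match length and identity. The Hsp type, the function names, and the two SAMPLE rows are made up for illustration; they are not real BLAST output and not part of ACT.

```python
from typing import NamedTuple


class Hsp(NamedTuple):
    query: str
    subject: str
    pident: float
    length: int
    qstart: int
    qend: int
    sstart: int
    send: int
    evalue: float
    bitscore: float


def parse_tabular(text: str) -> list[Hsp]:
    """Parse 12-column tab-separated BLAST output, skipping blank and comment lines."""
    hsps = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        f = line.split("\t")
        hsps.append(Hsp(f[0], f[1], float(f[2]), int(f[3]), int(f[6]), int(f[7]),
                        int(f[8]), int(f[9]), float(f[10]), float(f[11])))
    return hsps


def filter_hsps(hsps: list[Hsp], min_length: int = 0, min_pident: float = 0.0) -> list[Hsp]:
    """Keep only matches at least min_length long and min_pident percent identical."""
    return [h for h in hsps if h.length >= min_length and h.pident >= min_pident]


# Two hypothetical rows; the second is a reverse-strand match (sstart > send).
SAMPLE = "\n".join([
    "\t".join(["seq1", "seq2", "92.50", "400", "25", "5", "101", "500", "2101", "2500", "1e-150", "520"]),
    "\t".join(["seq1", "seq2", "81.00", "60", "9", "2", "900", "959", "3060", "3001", "2e-08", "58.4"]),
])
```

Calling filter_hsps(parse_tabular(SAMPLE), min_length=100) keeps only the first match. Filtering out short or weak matches in this way is what makes large-scale synteny visible in a comparison view.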


Running ACT

[Figure: workflow. Sequence 1 and Sequence 2 are compared with BLASTn or tBLASTx, and the output is reformatted with MSPcrunch]


[Figure: annotation pipeline; rows listed from top to bottom]

DNA sequence

RepeatMasker, Blastn, Halfwise, Blastx, Gene finders, tRNA scan

Repeats, Promoters, Pseudo-Genes, rRNA, Genes, tRNA

Fasta, BlastP, Pfam, Prosite, Psort, SignalP, TMHMM


The Annotation Process

[Figure: DNA SEQUENCE → ANALYSIS SOFTWARE → Annotator → Useful Information]


DNA in Artemis

[Figure: Artemis window, with labels for AT content, forward translations, reverse translations, and DNA and amino acids]


Gene structure

• In trypanosomatids
– Polycistronic structure
– Genes occur on a single strand at a time.
– Inflection points
– No splicing


Trypanosome gene structure


GENE STRUCTURE IN MALARIA

• Splicing

• No polycistronic units

• Can have small exons

• Low complexity regions


AT content

• Coding regions have higher GC content in AT rich genomes
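
A base-composition plot of this kind is just GC (or AT) content computed in a sliding window. The sketch below is one minimal way to do it; the function names and the window and step parameters are illustrative, and AT content is simply one minus the GC fraction.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in seq (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def sliding_gc(seq: str, window: int, step: int):
    """Return (window_start, gc_fraction) for each full window along seq."""
    return [
        (i, gc_fraction(seq[i:i + window]))
        for i in range(0, len(seq) - window + 1, step)
    ]
```

In an AT-rich genome, windows whose GC fraction rises clearly above the local background are candidate coding regions, which is the observation made on this slide.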


CODON USAGE

• Codon bias is different for each organism.

• Base composition is constrained in coding regions but not in non-coding regions.

• The codon usage for any particular gene can influence expression.


Codon usage

• All organisms have a preferred set of codons.

Codon   Malaria   Trypanosoma
GUU     0.41      0.28
GUC     0.06      0.19
GUA     0.42      0.14
GUG     0.11      0.39
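
Each column of the table sums to 1.00, so the values are the fractions with which the four valine codons are used in each organism. The sketch below shows how such fractions could be computed from an in-frame coding sequence. The function name is illustrative, and the codons are written in DNA form (GTT rather than GUU); RNA input is converted by replacing U with T.

```python
from collections import Counter

VALINE_CODONS = ("GTT", "GTC", "GTA", "GTG")


def synonymous_fractions(cds: str, codons=VALINE_CODONS) -> dict:
    """Fraction of each synonymous codon among all occurrences of that codon set.

    cds must be in frame, starting at the first base of a codon.
    """
    cds = cds.upper().replace("U", "T")
    counts = Counter(cds[i:i + 3] for i in range(0, len(cds) - 2, 3))
    total = sum(counts[c] for c in codons)
    if total == 0:
        return {c: 0.0 for c in codons}
    return {c: counts[c] / total for c in codons}
```

Run over all annotated genes of an organism, this gives one column of the table; a gene whose fractions differ sharply from the genome-wide values is using non-preferred codons.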


Codon Usage

• http://www.kazusa.or.jp/codon/


Codon Usage in Artemis

[Figure: Artemis codon usage plot, with the forward frames and the reverse frames labelled]


GC frame plot

• Plots the third position GC content of each frame of a DNA sequence.

• In coding DNA the GC content of the 3rd base is often higher.

• A good predictor of coding regions in malaria and trypanosomes.
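
The quantity behind a GC frame plot can be sketched in a few lines. The version below computes third-position GC content for each of the three forward frames over a whole sequence; a plot would apply the same calculation in a sliding window. The function name is illustrative and this is not Artemis code.

```python
def gc3_by_frame(seq: str) -> list:
    """GC fraction at the third codon position for forward frames 0, 1 and 2."""
    seq = seq.upper()
    result = []
    for frame in range(3):
        thirds = seq[frame + 2::3]  # every third base, starting at codon position 3
        gc = sum(base in "GC" for base in thirds)
        result.append(gc / len(thirds) if thirds else 0.0)
    return result
```

For the artificial sequence AAGAACAAGAAC the result is [1.0, 0.0, 0.0]: only frame 0 has G or C at every third position, so that frame stands out as the likely coding frame.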


Genefinding programs

• Genefinding software packages use hidden Markov models.

• Predict coding, intergenic and intron sequences

• Need to be trained on a specific organism.

• Never perfect!


Phat

Cawley et al. (2001) Mol. Bio. Para. 118 p167

http://www.stat.berkeley.edu/users/scawley/Phat/

• Based on a generalised hidden Markov model (GHMM)

• Free, and easy to install and run.

• Good at predicting multi-exon genes, but in some cases misses genes altogether and also over-predicts.


What is an HMM?

• A statistical model that represents a gene.

• Similar to a “weight matrix” that can recognise gaps and treat them in a systematic way.

• Has different “states” that represent introns, exons and intergenic regions.
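
To make the idea of "states" concrete, here is a toy two-state HMM decoded with the Viterbi algorithm. It is a sketch only: the two states (N for non-coding, C for coding), and every probability in it, are invented for illustration, whereas real gene finders use many more states (exons, introns, intergenic regions) with parameters trained on known genes.

```python
import math


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most probable state path for a non-empty observation sequence (log-space Viterbi)."""
    # scores[t][s]: best log-probability of any path that ends in state s at position t
    scores = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = []
    for symbol in obs[1:]:
        current, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: scores[-1][p] + math.log(trans_p[p][s]))
            current[s] = (scores[-1][prev] + math.log(trans_p[prev][s])
                          + math.log(emit_p[s][symbol]))
            pointers[s] = prev
        scores.append(current)
        back.append(pointers)
    # Trace back from the best final state
    state = max(states, key=lambda s: scores[-1][s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Invented parameters: non-coding is AT-rich, coding is GC-rich, states tend to persist.
STATES = ("N", "C")
START = {"N": 0.5, "C": 0.5}
TRANS = {"N": {"N": 0.9, "C": 0.1}, "C": {"N": 0.1, "C": 0.9}}
EMIT = {
    "N": {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1},
    "C": {"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4},
}
```

Decoding the sequence AAAATTTTGGGGCCCCGGGG with these parameters labels the first eight bases N and the remaining twelve C, placing the state change exactly where the base composition shifts.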


GlimmerM

Salzberg et al. (1999) Genomics 59 24-31

• Adaptation of the prokaryotic genefinder Glimmer.

Delcher et al. (1999) NAR 2 4363-4641

• Based on an interpolated HMM (IHMM).

• Only uses short chains of bases (Markov chains) to generate probabilities.

• Trained identically to Phat


GlimmerM

• Under-predicts splicing.

• Hardly ever misses a gene completely.

• Does over-predict.

• Free with licence.


Homology Data

• Coding regions are more conserved than non-coding regions due to selective pressure.

• Comparing all possible translations against all known proteins will give clues to known genes.

• Blastx


The Gene Prediction Process

[Figure: DNA SEQUENCE → ANALYSIS SOFTWARE (DNA plots, Phat, GlimmerM, BlastX, FASTA, ESTs) → Annotator → Good Gene Models]


T. brucei vs L. major (cont.)


T. brucei vs T. cruzi


L. major has a break in synteny that is conserved in T. brucei and T. cruzi

[Figure: synteny comparison; tracks labelled T. cruzi chr 3, T. brucei chr 1, L. major chr 12, and T. brucei chr 6]


The ACT Display

[Figure: annotated ACT window. Labels: zoom scroll bar, filter scroll bar, genome 1, genome 2, genome 3, BLAST HSPs]


ACT

• Designed for looking at complete bacterial genomes.


[Figure: three-way comparison of Knowlesi contigs, Falciparum chr 3, and Yoelii contigs (TIGR), with tblastx matches between adjacent pairs]


Software

• www.sanger.ac.uk/Software/Artemis

• www.sanger.ac.uk/Software/ACT

• www.genome.nghri.nih.gov/blastall

• www.cgr.ki.se/cgr/goups/sonnhammer/MSPcrunch.html