EST Clustering - BAMbioinformatics.iasri.res.in/BAMAST/Book.html/EbookNew/... · 2012. 10. 1. · Characterization of splice variants and alternative polyadenylation. In silico differential
EST Clustering
An expressed sequence tag (EST) is a short sub-sequence of a transcribed cDNA sequence.
ESTs may be used to identify gene transcripts, and are instrumental in gene discovery and
gene sequence determination. An EST is produced by one-shot sequencing of a cloned
mRNA (i.e. sequencing several hundred base pairs from one end of a cDNA clone taken
from a cDNA library). The resulting sequence is a relatively low-quality fragment whose
length is limited by current technology to approximately 500 to 800 nucleotides. Because
these clones consist of DNA that is complementary to mRNA, ESTs represent portions
of expressed genes. They may be present in the database either as the cDNA/mRNA sequence
or as its reverse complement, the template strand.
Fig1: Manufacture of EST
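Because an EST may be stored in either orientation, any comparison between two ESTs generally has to consider both strands. The following is a minimal sketch of that idea; the helper names are illustrative assumptions and are not taken from any particular EST pipeline.

```python
# Illustrative helpers (hypothetical names), assuming plain DNA strings
# over the alphabet A, C, G, T, N.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def both_orientations(seq: str) -> tuple[str, str]:
    """Return the sequence as stored and its reverse complement."""
    return seq, reverse_complement(seq)


est = "ATGGCCTTTAAA"
forward, reverse = both_orientations(est)
print(forward)  # ATGGCCTTTAAA
print(reverse)  # TTTAAAGGCCAT
```

A clustering tool would typically compare a query against both strings before deciding that two ESTs are unrelated.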
Overview of clustering and consensus generation
EST clustering is performed in stages that use progressively less definitive clustering
information. Initially, sequence identity provides a good guide to cluster membership.
Shared annotation then provides joining information of more variable quality. Thus
the number of accurately clustered ESTs depends heavily on a strategy that assigns
cluster membership based on verifiable criteria; sequence identity is currently the most
useful of these. Clustering can be performed with or without consensus sequence
generation. It is preferable, although more difficult, to generate a consensus sequence
from each cluster. This overview briefly describes processes that result in
consensus sequence generation.
What is EST clustering?
A cluster is fragmented EST data (DNA or protein) and, if known, gene sequence data,
consolidated, placed in correct context, and indexed by gene such that all expressed data
concerning a single gene is in a single index class, and each index class contains the
information for only one gene. The goal of the clustering process is to place
overlapping ESTs that tag the same transcript of the same gene in a single cluster. For
clustering, we measure the similarity (distance) between any two sequences. This measure is
then reduced to a simple binary decision: accept or reject the two sequences as members of
the same cluster.
Similarity can be measured using different algorithms:
Pairwise alignment algorithms:
(a) Smith-Waterman is the most sensitive, but time-consuming (e.g. cross_match)
(b) Heuristic algorithms, such as BLAST and FASTA, trade some sensitivity for speed
Non-alignment-based scoring methods:
d2_cluster algorithm: based on word comparison and composition (word identity
and multiplicity) (Burke et al., 1999). No alignments are performed, so it is fast.
Pre-indexing methods.
Purpose-built alignment-based clustering methods.
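The word-composition idea behind the non-alignment methods can be sketched as follows. This is a deliberately simplified illustration and not the d2_cluster implementation: it compares word counts over whole sequences rather than over sliding windows, and the function names, the word length k = 4 and the acceptance threshold are assumptions chosen for the toy data.

```python
from collections import Counter


def word_counts(seq: str, k: int) -> Counter:
    """Count every overlapping word of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def word_distance(a: str, b: str, k: int = 4) -> int:
    """Sum of squared differences in word multiplicity between a and b."""
    ca, cb = word_counts(a, k), word_counts(b, k)
    return sum((ca[w] - cb[w]) ** 2 for w in ca.keys() | cb.keys())


def same_cluster(a: str, b: str, k: int = 4, max_distance: int = 10) -> bool:
    """Reduce the distance to a binary accept/reject decision."""
    return word_distance(a, b, k) <= max_distance


est1 = "ACGTACGTTGCAAGCT"
est2 = "ACGTACGTTGCAAGCA"  # differs from est1 in the last base only
est3 = "TTTTTTTTTTTTTTTT"  # unrelated low-complexity sequence

print(word_distance(est1, est2))  # 2
print(same_cluster(est1, est2))   # True
print(same_cluster(est1, est3))   # False
```

No alignment is computed at any point, which is where the speed advantage of this family of methods comes from.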
Types of clustering
Loose and stringent clustering
ESTs by their nature contain a degree of erroneous sequence data, compounded by short length
and some mis-annotation. Stringent one-pass assembly methods tend to result in fewer,
shorter consensus sequences. Looser clustering systems result in larger, 'sloppier'
clusters, with various expressed forms represented within each cluster. Each
approach has its advantages and disadvantages. Stringent clustering provides greater initial
fidelity, at the cost of lower coverage of expressed gene data and a lower inclusion rate of
expressed gene forms. Loose clustering provides greater coverage and greater
inclusion of alternate expressed forms, at the cost of lower-fidelity data and the possible
inclusion of paralogous expressed genes.
(a) Stringent clustering:
Greater initial fidelity
One pass
Lower coverage of expressed gene data
Lower cluster inclusion of expressed gene forms
Shorter consensus sequences
(b) Loose clustering:
Lower initial fidelity
Multi-pass
Greater coverage of expressed gene data
Greater cluster inclusion of alternate expressed forms
Longer consensus sequences
Risk of including paralogs in the same gene index
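The trade-off between stringent and loose clustering can be illustrated with a toy example. The sketch below assumes a simple shared-word (Jaccard) similarity and transitive, single-linkage joining; the similarity measure, the thresholds 0.8 and 0.3, and all names are assumptions for illustration and do not correspond to any of the gene index pipelines.

```python
from itertools import combinations


def kmers(seq: str, k: int = 4) -> set:
    """Set of distinct overlapping words of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similarity(a: str, b: str, k: int = 4) -> float:
    """Jaccard similarity of the two word sets (0.0 to 1.0)."""
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    return len(ka & kb) / len(union) if union else 0.0


def cluster(seqs: list, min_similarity: float) -> list:
    """Join sequences transitively when a pair reaches min_similarity."""
    parent = list(range(len(seqs)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in combinations(range(len(seqs)), 2):
        if similarity(seqs[i], seqs[j]) >= min_similarity:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(seqs)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


seqs = [
    "ACGTACGTTGCAAGCT",  # 0
    "ACGTACGTTGCAAGCA",  # 1: near-identical to 0
    "ACGTACGTTGGGCCCC",  # 2: shares only the first part with 0 and 1
    "TTTTTTTTTTTTTTTT",  # 3: unrelated
]

print(cluster(seqs, min_similarity=0.8))  # [[0, 1], [2], [3]]
print(cluster(seqs, min_similarity=0.3))  # [[0, 1, 2], [3]]
```

With the stringent threshold, sequence 2 stays in its own cluster; with the loose threshold it joins sequences 0 and 1, much as an alternate expressed form (or, less desirably, a paralog) would.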
Supervised and unsupervised EST clustering
Supervised clustering
ESTs are classified with respect to known reference sequences or “seeds” (full-length
mRNAs, exon constructs from genomic sequences, previously assembled
EST cluster consensus sequences).
Unsupervised clustering
ESTs are classified without any prior knowledge.
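Supervised clustering can be sketched as assigning each EST to its best-matching seed. The example below is a minimal illustration under stated assumptions: it scores an EST by the fraction of its 8-letter words found in the seed (a containment score), and the sequences, gene names, threshold and function names are all hypothetical.

```python
def kmers(seq: str, k: int) -> set:
    """Set of distinct overlapping words of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def containment(est: str, seed: str, k: int = 8) -> float:
    """Fraction of the EST's words that also occur in the seed."""
    est_words = kmers(est, k)
    if not est_words:
        return 0.0
    return len(est_words & kmers(seed, k)) / len(est_words)


def assign_to_seeds(ests: dict, seeds: dict,
                    min_containment: float = 0.8, k: int = 8) -> dict:
    """Map each EST id to its best seed id, or None if no seed is close."""
    assignments = {}
    for est_id, est in ests.items():
        best_id, best_score = None, 0.0
        for seed_id, seed in seeds.items():
            score = containment(est, seed, k)
            if score > best_score:
                best_id, best_score = seed_id, score
        assignments[est_id] = best_id if best_score >= min_containment else None
    return assignments


# Made-up seed sequences standing in for full-length mRNAs.
seeds = {
    "geneA": "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG",
    "geneB": "ATGTTTCCCAAAGGGTTTACCCGGGAAATTTCCCTAA",
}
ests = {
    "est1": seeds["geneA"][5:25],    # fragment of geneA
    "est2": seeds["geneB"][10:30],   # fragment of geneB
    "est3": "GAGAGAGAGAGAGAGAGAGA",  # matches neither seed
}

print(assign_to_seeds(ests, seeds))
# {'est1': 'geneA', 'est2': 'geneB', 'est3': None}
```

ESTs left unassigned (here est3) are the ones an unsupervised method would still have to group among themselves.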
The three major gene indices use different EST clustering methods:
TIGR Gene Index uses a stringent and supervised clustering method, which generates
shorter consensus sequences and separates splice variants.
STACK uses a loose and unsupervised clustering method, producing longer
consensus sequences and including splice variants in the same index.
A combination of supervised and unsupervised methods with variable levels of
stringency is used in UniGene. No consensus sequences are produced.
Importance of ESTs:
ESTs represent the most extensive available survey of the transcribed portion
of genomes.
ESTs are indispensable for gene structure prediction, gene discovery and
genomic mapping.
Characterization of splice variants and alternative polyadenylation.
In silico differential display and gene expression studies (specific tissue
expression, normal/disease states).
SNP data mining.
High-volume and high-throughput data production at low cost.
Low data quality of ESTs:
High error rates (approximately 1 in 100) because sequences are read in a single pass.
Sequence compression and frame-shift errors, also due to single-pass reading.
A single EST represents only a partial gene sequence.
Not a defined gene/protein product.
Not curated in a highly annotated form.
High redundancy in the data, hence a huge number of sequences to analyze.
Improving ESTs: Clustering, Assembling and Gene indices:
The value of ESTs is greatly enhanced by clustering and assembling, which can solve many
problems associated with ESTs:
resolving redundancy can help to correct errors
longer and better annotated sequences
easier association with mRNAs and proteins
detection of splice variants
fewer sequences to analyze
Gene indices: All expressed sequences (as ESTs) concerning a single gene are grouped in a
single index class, and each index class contains the information for only one gene.
Different clustering/assembly procedures have been proposed with associated
resulting databases (gene indices):
UniGene (http://www.ncbi.nlm.nih.gov/UniGene)
TIGR Gene Indices (http://www.tigr.org/tdb/tgi.shtml)
STACK (http://www.sambi.ac.za/Dbases.html)
UniGene (http://www.ncbi.nlm.nih.gov/UniGene)
UniGene gene indices are available for a number of organisms. UniGene clusters are produced
with a supervised procedure: ESTs are clustered using GenBank CDS and mRNA data as
“seed” sequences. No attempt is made to produce contigs or consensus sequences.
UniGene uses pairwise sequence comparison at various levels of stringency
to group related sequences, placing closely related and alternatively spliced
transcripts into one cluster.
UniGene procedure:
(1) Screen for contaminants, repeats, and low-complexity regions in GenBank:
(e) The resulting clusters are called anchored clusters because their 3’ end is presumed
to be known.
(f) Ensure that 5’ and 3’ ESTs from the same cDNA clone belong to the same
cluster.
(g) ESTs that have not been clustered are reprocessed at a lower level of
stringency. ESTs added during this step are called guest members.
(h) Clusters of size 1 (containing a single sequence) are compared against the
rest of the clusters at a lower level of stringency and merged with the
cluster containing the most similar sequence.
(j) For each build of the database, cluster IDs change if clusters are split or
merged.
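The clone-based joining step, in which 5’ and 3’ ESTs from the same cDNA clone are forced into one cluster, can be sketched as a merge over shared clone identifiers. This is only an illustration of the idea, not UniGene's code; the data layout (clusters as lists of EST ids, plus an EST-to-clone mapping) and all names are assumptions.

```python
def merge_by_clone(clusters: list, clone_of: dict) -> list:
    """Merge clusters that contain ESTs derived from the same cDNA clone."""
    parent = list(range(len(clusters)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    first_cluster_for_clone = {}
    for idx, members in enumerate(clusters):
        for est in members:
            clone = clone_of.get(est)
            if clone is None:
                continue  # no clone annotation for this EST
            if clone in first_cluster_for_clone:
                parent[find(idx)] = find(first_cluster_for_clone[clone])
            else:
                first_cluster_for_clone[clone] = idx

    merged = {}
    for idx, members in enumerate(clusters):
        merged.setdefault(find(idx), []).extend(members)
    return sorted(sorted(m) for m in merged.values())


# Hypothetical EST ids and clone annotations.
clusters = [["est1_5p", "est2"], ["est1_3p"], ["est3"]]
clone_of = {
    "est1_5p": "cloneA",
    "est1_3p": "cloneA",
    "est2": "cloneB",
    "est3": "cloneC",
}

print(merge_by_clone(clusters, clone_of))
# [['est1_3p', 'est1_5p', 'est2'], ['est3']]
```

This is an instance of joining by shared annotation rather than by sequence identity, so its reliability depends on the quality of the clone annotation.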
TIGR Gene Indices (http://www.tigr.org/tdb/tgi)
(a) TIGR produces Gene Indices for a number of organisms
(b) TIGR Gene Indices are produced using strict supervised clustering methods.
(c) Clusters are assembled into consensus sequences, called tentative consensus (TC)
sequences, that represent the underlying mRNA transcripts.
(d) The TIGR Gene Indices building method tightly groups highly related
sequences and discards under-represented, divergent, or noisy sequences.
(e) TC sequences can be used for genome annotation, genome mapping, and
identification of orthologous/paralogous genes.
(f) TIGR Gene Indices characteristics:
separate closely related genes into distinct consensus sequences
separate splice variants into separate clusters
low level of contamination
TIGR Gene Indices procedure:
(a) EST sequences are retrieved from dbEST
(http://www.ncbi.nlm.nih.gov/dbEST)
(b) Sequences are trimmed to remove:
Vectors
polyA/T tails
adaptor sequences
bacterial sequences
(c) Retrieve tentative consensus sequences and singletons from the previous database build
(d) Supervised and strict clustering:
Use ETs, TCs, and CDSs as template;
Compare cleaned ESTs to the template using FLAST (a rapid pairwise
comparison program).
Sequences are grouped in the same cluster if both conditions are true:
(a) they share at least 95% identity over regions of 40 bases or longer