Sept. 21, 2006, 5:30 PM
Florida State University: Bioinformatics Workshop #1
An Introduction to Multiple Sequence Alignment & Analysis thru GCG’s SeqLab
Steven M. Thompson
Florida State University School of Computational Science (SCS)
But first a prelude: My definitions —
Biocomputing and computational biology are synonymous and describe the use of computers and computational techniques to analyze any biological system, from molecules, through cells, tissues, and organisms, all the way to populations.
Bioinformatics describes using computational techniques to access, analyze, and interpret the biological information in any of the available biological databases.
Sequence analysis is the study of molecular sequence data for the purpose of inferring the function, mechanism, interactions, evolution, and perhaps structure of biological molecules.
Genomics analyzes the context of genes or complete genomes (the total DNA content of an organism) within and across genomes.
Proteomics is the subdivision of genomics concerned with analyzing the complete protein complement, i.e. the proteome, of organisms, both within and between different organisms.
And a ‘way’ to think about it: the reverse biochemistry analogy, from a ‘virtual’ DNA sequence to actual molecular physical characterization, not the other way ‘round.
Using bioinformatics tools, you can infer all sorts of functional, evolutionary, and structural insights into a gene product, without the need to isolate and purify massive amounts of protein! Eventually you can go on to clone and express the gene based on that analysis using PCR techniques.
The computer and molecular databases are an essential part of this process.
[Table: The exponential growth of molecular sequence databases. Columns: Year, Base Pairs, Sequences.]
OK, well, how do you do it?
Back to multiple sequence alignment: Applicability?
Dynamic programming’s complexity increases exponentially with the number of sequences being compared:
N-dimensional matrix . . . . complexity = [sequence length]^[number of sequences]
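To make that scaling concrete, here is a minimal Python sketch that counts the cells in the N-dimensional matrix. The function name and the example length of 300 are illustrative assumptions, and the sketch assumes that all sequences have the same length.

```python
def dp_matrix_cells(sequence_length: int, number_of_sequences: int) -> int:
    """Count the cells in a full N-dimensional dynamic programming matrix.

    Assumes every sequence has the same length; this is only the
    [sequence length]^[number of sequences] estimate from the slide.
    """
    if sequence_length < 1 or number_of_sequences < 1:
        raise ValueError("both arguments must be positive")
    return sequence_length ** number_of_sequences


if __name__ == "__main__":
    # Each added sequence multiplies the work by the sequence length.
    for n in (2, 3, 4, 5):
        print(f"{n} sequences of length 300: {dp_matrix_cells(300, n):,} cells")
```

Going from two to five sequences of length 300 raises the cell count from 90,000 to more than two trillion, which is why exhaustive dynamic programming is rarely practical beyond a handful of sequences.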
See: MSA (‘global’ within ‘bounding box’) and PIMA (‘local’ portions only) on the multiple alignment page at the Baylor College of Medicine’s Search Launcher.
Specialized format conversion tools such as GCG’s SeqConv+ program and PAUPSearch wrapper.
Don Gilbert’s public domain ReadSeq program.
Still more complications:
Indels and missing data symbols (i.e. gaps) designation discrepancy headaches:
., -, ~, ?, N, or X
. . . . . Help!
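One way to tame the symbol discrepancy before moving an alignment between programs is to rewrite the gap characters to a single convention. The Python sketch below is a minimal illustration. The function name is hypothetical, and treating only `.`, `~`, and `-` as gaps is an assumption; check what each symbol means in your source format first.

```python
# Characters treated as gaps in this sketch (an assumption, see above).
GAP_SYMBOLS = {".", "~", "-"}


def normalize_gaps(aligned_sequence: str, gap_symbol: str = "-") -> str:
    """Rewrite every gap character in one aligned sequence to gap_symbol.

    '?', 'N', and 'X' are deliberately left untouched: they may denote
    missing or ambiguous data, and 'N' is also asparagine in proteins.
    """
    return "".join(
        gap_symbol if character in GAP_SYMBOLS else character
        for character in aligned_sequence
    )


if __name__ == "__main__":
    print(normalize_gaps("AC..GT~~"))
```

Leaving `?`, `N`, and `X` alone is the conservative choice: converting them automatically could silently turn real residues or missing-data markers into gaps.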
Web resources for pairwise, progressive multiple alignment:
http://www.techfak.uni-bielefeld.de/bcd/Curric/MulAli/welcome.html
http://pbil.univ-lyon1.fr/alignment.html
http://www.ebi.ac.uk/clustalw/
http://searchlauncher.bcm.tmc.edu/
However, problems with very large datasets and huge multiple alignments make doing multiple sequence alignment on the Web impractical after your dataset has reached a certain size. You’ll know it when you’re there!
If large datasets become intractable for analysis on the Web, what other resources are available?
Desktop software solutions: public domain programs are available, but . . . complicated to install, configure, and maintain. User must be pretty computer savvy. So, commercial software packages are available, e.g. in order of increasing power and complexity:
The sequence is in a local GCG format single sequence file in your UNIX account. (GCG Reformat and SeqConv+ programs)
The sequence is in a local GCG database in which case you ‘point’ to it by using any of the GCG database logical names. A colon, “:,” always sets the logical name apart from either an accession number or a proper identifier name or a wildcard expression and they are case insensitive.
The sequence is in a GCG format multiple sequence file, either an MSF (multiple sequence format) file or an RSF (rich sequence format) file. To specify sequences contained in a GCG multiple sequence file, supply the file name followed by a pair of braces, “{},” containing the sequence specification, e.g. a wildcard, {*}.
Finally, the most powerful method of specifying sequences is in a GCG “list” file. This is merely a list of other sequence specifications and can even contain other list files within it. The convention to use a GCG list file in a program is to precede it with an at sign, “@.” Furthermore, attribute information within list files can specify particular sequence aspects.
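The four specification styles above can be told apart by their punctuation alone. The Python sketch below is a hypothetical helper, not part of GCG, that classifies a specification string using only the conventions just described: a leading at sign, trailing braces, or a colon. The function name, the returned labels, and the sample file names are all assumptions made for illustration.

```python
from __future__ import annotations


def classify_sequence_spec(spec: str) -> tuple[str, dict[str, str]]:
    """Classify a GCG-style sequence specification by its punctuation."""
    spec = spec.strip()
    if not spec:
        raise ValueError("empty sequence specification")
    if spec.startswith("@"):
        # "@name" points at a list file of further specifications.
        return "list_file", {"file": spec[1:]}
    if spec.endswith("}") and "{" in spec:
        # "file{entries}" selects entries inside an MSF or RSF file.
        file_name, _, entries = spec[:-1].partition("{")
        return "multiple_sequence_file", {"file": file_name, "entries": entries}
    if ":" in spec:
        # "logical_name:entry" points into a database; logical names are
        # case insensitive, so fold them to upper case for comparison.
        logical_name, _, entry = spec.partition(":")
        return "database", {"logical_name": logical_name.upper(), "entry": entry}
    # Anything else is taken to be a single sequence file name.
    return "single_sequence_file", {"file": spec}


if __name__ == "__main__":
    for example in ("example.seq", "gb:*", "align.msf{*}", "@my.list"):
        print(example, "->", classify_sequence_spec(example))
```

The checks run from the most distinctive marker to the least, so a list file whose name happens to contain a colon is still recognized as a list file.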
!!NA_SEQUENCE 1.0
This is a small example of GCG single sequence format.
Always put some documentation on top, so in the future
you can figure out what it is you're dealing with! The
line with the two periods is converted to the checksum line.
example.seq Length: 77 July 21, 1999 09:30 Type: N Check: 4099 ..
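The checksum line is the one machine-readable line in that header, so it is the natural thing to look for when reading such a file. Here is a minimal Python sketch that pulls the fields out of it. The function name and the regular expression are assumptions inferred from the single example line above, and real GCG files may lay the line out differently.

```python
import re

# Pattern inferred from the one example line above (an assumption):
# name, Length, free-form date, Type, Check, then the two periods.
HEADER_PATTERN = re.compile(
    r"^\s*(?P<name>\S+)\s+Length:\s*(?P<length>\d+)\s+(?P<date>.*?)\s+"
    r"Type:\s*(?P<type>\S+)\s+Check:\s*(?P<check>\d+)\s+\.\.\s*$"
)


def parse_gcg_header(line: str):
    """Return the fields of a GCG checksum line, or None if it does not match."""
    match = HEADER_PATTERN.match(line)
    if match is None:
        return None
    return {
        "name": match.group("name"),
        "length": int(match.group("length")),
        "date": match.group("date"),
        "type": match.group("type"),
        "check": int(match.group("check")),
    }


if __name__ == "__main__":
    line = "example.seq Length: 77 July 21, 1999 09:30 Type: N Check: 4099 .."
    print(parse_gcg_header(line))
```

Documentation lines above the checksum line do not end in two periods, so they return None and a reader can skip them until the first match.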
Sequence databases, nucleic acids:
GENBANKPLUS:*       all of GenBank plus EST, HTC, and GSS
GBP:*               all of GenBank plus EST, HTC, and GSS
GENBANK:*           all of GenBank except EST, HTC, and GSS
GB:*                all of GenBank except EST, HTC, and GSS
BACTERIAL:*         GenBank bacteria and archaea
BA:*                GenBank bacteria and archaea
OTHERMAMMAL:*       GenBank other mammal
OM:*                GenBank other mammal
OTHERVERTEBRATE:*   GenBank other vertebrate
OV:*                GenBank other vertebrate
PLANT:*             GenBank plant and fungi
PL:*                GenBank plant and fungi
PRIMATE:*           GenBank primate
PR:*                GenBank primate
RODENT:*            GenBank rodent
RO:*                GenBank rodent
VI:*                GenBank viral
VIRAL:*             GenBank viral
TAGS:*              GenBank EST, HTC, and GSS
EST:*               GenBank EST Expressed Sequence Tags
HTC:*               GenBank High Throughput cDNA
HTG:*               GenBank High Throughput Genomic
SYNTHETIC:*         GenBank synthetic
SY:*                GenBank synthetic
UNANNOTATED:*       GenBank unannotated
UN:*                GenBank unannotated
REFSEQNUC:*         NCBI RefSeq transcriptomes
RS_RNA:*            NCBI RefSeq transcriptomes
HOMO:*              NCBI human RefSeq working draft
PAN:*               NCBI chimpanzee RefSeq working draft
DANIO:*             Sanger Zebrafish assembly
Sequence databases, amino acids:
UNIPROT:*           all of Swiss-Prot and all of SPTREMBL
UNI:*               all of Swiss-Prot and all of SPTREMBL
SWISSPROTPLUS:*     all of Swiss-Prot and all of SPTREMBL
SWP:*               all of Swiss-Prot and all of SPTREMBL
SWISSPROT:*         all of Swiss-Prot (fully annotated)
SWISS:*             all of Swiss-Prot (fully annotated)
SW:*                all of Swiss-Prot (fully annotated)
SPTREMBL:*          Swiss-Prot preliminary EMBL translations
GENPEPT:*           all of GenBank’s CDS translations
GP:*                all of GenBank’s CDS translations
SeqLab is the merger of Steve Smith’s Genetic Data Environment and GCG’s Wisconsin Package Interface:
GDE + WPI = SeqLab
Requires an X11-Windowing environment, either native on UNIX computers (including LINUX, but not included in default Apple Mac OS X installs, see Apple’s free X11 package or XDarwin), or with X-server emulation software on Windows personal computers.
FOR MORE INFO...
Explore my Web Home: http://bio.fsu.edu/~stevet/cv.html and contact me (stevet@[email protected]) for further bioinformatics assistance and collaboration.
Conclusions:
Gunnar von Heijne in his old but quite readable treatise, Sequence Analysis in Molecular Biology; Treasure Trove or Trivial Pursuit (1987), provides a very appropriate conclusion:
“Think about what you’re doing; use your knowledge of the molecular system involved to guide both your interpretation of results and your direction of inquiry; use as much information as possible; and do not blindly accept everything the computer offers you.”
He continues:
“. . . if any lesson is to be drawn . . . it surely is that to be able to make a useful contribution one must first and foremost be a biologist, and only second a theoretician . . . . We have to develop better algorithms, we have to find ways to cope with the massive amounts of data, and above all we have to become better biologists. But that’s all it takes.”
Many texts are becoming available in the field.
To ‘honk-my-own-horn’ a bit, check out:
Current Protocols in Bioinformatics
from John Wiley & Sons, Inc.,