IOB Workshop: Biopython
A programming toolkit for bioinformatics
Eric Talevich
Institute of Bioinformatics, University of Georgia
Mar. 29, 2012
Outline: Sequences and alignments; NCBI EUtils and BLAST; Phylogenetics; Protein structures
A workshop on bioinformatics programming using Biopython and the Python programming language, held at the University of Georgia in Spring 2010 and 2012. These workshops are part of a series for the Institute of Bioinformatics (IoB) and Bioinformatics Grad Student Association (BIGSA) at UGA.
Getting started with Biopython
Installing Python
Biopython is a library for the Python programming language.
First, you’ll need these installed:
Python 2.7 from http://python.org. It may already be installed on your computer. (Version 2.6 is OK, too.)
IDLE, a simple Integrated DeveLopment Environment. Usually bundled with the Python distribution.
Now, start an interactive session in IDLE. [1]
[1] On your own, check out IPython (http://ipython.scipy.org/). It’s an enhanced Python interpreter that feels somewhat like R.
read: Parse a one-element file and return the element
write: Write elements to a file
convert: Parse one format and immediately write another
Biopython uses the same I/O conventions for alignments (AlignIO), BLAST results (Blast), and phylogenetic trees (Phylo).
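To make the difference between an iterating parser and a one-record reader concrete, here is a minimal plain-Python sketch (Python 3, standard library only). It is not Biopython's implementation: parse_fasta and read_fasta are hypothetical names, and the records are plain (title, sequence) tuples rather than SeqRecords.

```python
import io


def parse_fasta(handle):
    """Yield (title, sequence) pairs lazily, one per FASTA record."""
    title, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if title is not None:
                yield title, "".join(chunks)
            title, chunks = line[1:], []
        elif line and title is not None:
            chunks.append(line)
    if title is not None:
        yield title, "".join(chunks)


def read_fasta(handle):
    """Return the only record in the handle; fail on zero or several."""
    records = list(parse_fasta(handle))
    if len(records) != 1:
        raise ValueError("expected exactly 1 record, found %d" % len(records))
    return records[0]


two_records = ">seq1\nACGT\nAC\n>seq2\nGGCC\n"
parsed = list(parse_fasta(io.StringIO(two_records)))
single = read_fasta(io.StringIO(">only\nTTAA\n"))
```

The same split applies to the real modules: use parse when a file may hold many elements, and read when exactly one is expected.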
The Seq object; SeqIO and the SeqRecord object
The SeqRecord object
SeqIO.parse returns an iterator of SeqRecords. A SeqRecord wraps a Seq object and attaches metadata.
1 Pass the file name to the SeqIO parser; specify FASTA format:
from Bio import SeqIO
seqrecs = SeqIO.parse("1ATP.fasta", "fasta")
print seqrecs
2 To see all records at once, convert the iterator to a list:
allrecs = list(seqrecs)
print allrecs[0]
print allrecs[0].seq
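One consequence of getting an iterator back is that it can be walked only once. The sketch below (Python 3, standard library only) shows this with a hypothetical stand-in generator, fake_parse, rather than a real parser.

```python
def fake_parse(titles):
    """Stand-in for a record parser: yields one item at a time."""
    for title in titles:
        yield title


records = fake_parse(["rec1", "rec2", "rec3"])
first_pass = list(records)   # consumes the iterator
second_pass = list(records)  # nothing left to read
```

Keeping the list from the first pass avoids re-parsing the file when the records are needed again.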
Example: Shuffled sequences
Given a real DNA sequence, create a “background” set of randomized sequences with the same composition.
Procedure:
1 Read the source sequence from a file (use Bio.SeqIO)
2 In a loop:
Shuffle the sequence (use random.shuffle from Python’s standard library)
Create a new SeqRecord from the shuffled sequence (because SeqIO.write works with SeqRecords)
3 Write the shuffled SeqRecords to another file
import random
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Read the single source record and remember its alphabet
orig_rec = SeqIO.read("gi2.gb", "genbank")
alphabet = orig_rec.seq.alphabet
out_recs = []
for i in xrange(1, 31):
    # Shuffle a mutable copy of the sequence
    nucleotides = list(orig_rec.seq)
    random.shuffle(nucleotides)
    new_seq = Seq("".join(nucleotides), alphabet)
    new_rec = SeqRecord(new_seq,
                        id="shuffle" + str(i))
    out_recs.append(new_rec)

# Write all 30 shuffled records to one FASTA file
SeqIO.write(out_recs, "gi2_shuffled.fasta", "fasta")
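A quick way to convince yourself that shuffling keeps the composition is to compare letter counts. This is a plain-Python sketch (Python 3, standard library only) on a made-up toy sequence; shuffled_copy is a hypothetical helper, and the seeded random.Random is only there to make the run repeatable.

```python
import random
from collections import Counter


def shuffled_copy(sequence, rng):
    """Return a shuffled copy of a string, leaving the original untouched."""
    letters = list(sequence)
    rng.shuffle(letters)
    return "".join(letters)


source = "ATGCGATACGCTTGA"  # toy sequence, not real data
rng = random.Random(42)
background = [shuffled_copy(source, rng) for _ in range(30)]

# Every shuffled sequence has exactly the same letter counts as the source
same_composition = all(Counter(s) == Counter(source) for s in background)
```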
Example: ORF translation
Split a set of unannotated DNA sequences into unique ORFs, translating in all 6 frames.
Biopython can help with each piece of this problem:
1 Parse the given unannotated DNA sequences (SeqIO.parse)
2 Get the template strand’s sequence (Seq.reverse_complement)
3 Translate both strands into protein sequences (Seq.translate)
4 Shift each strand by +1 and +2 for alternate reading frames (string-like Seq slicing)
5 Split sequences at stop codons (Seq.split("*"))
6 Write translated sequences to a new file (SeqIO.write)
def translate_six_frames(seq, table=1):
    """Translate a nucleotide sequence in 6 frames.

    Returns an iterable of 6 translated protein sequences.
    """
    rev = seq.reverse_complement()
    for i in range(3):
        # Coding (Crick) strand
        yield seq[i:].translate(table)
        # Template (Watson) strand
        yield rev[i:].translate(table)

def translate_orfs(sequences, min_prot_len=60):
    """Find and translate all ORFs in sequences.

    Translates each sequence in all 6 reading frames, splits sequences on
    stop codons, and produces an iterable of all protein sequences of
    length at least min_prot_len.
    """
    for seq in sequences:
        for frame in translate_six_frames(seq):
            for prot in frame.split("*"):
                if len(prot) >= min_prot_len:
                    yield prot

from Bio import SeqIO
from Bio.SeqRecord import SeqRecord

if __name__ == "__main__":
    import sys
    infile = sys.stdin
    outfile = sys.stdout
    # Parse FASTA from standard input and keep only the sequences
    records = SeqIO.parse(infile, "fasta")
    seqs = (rec.seq for rec in records)
    proteins = translate_orfs(seqs)
    # Wrap each translated ORF in a SeqRecord so SeqIO.write accepts it
    seqrecs = (SeqRecord(seq, id="orf" + str(i))
               for i, seq in enumerate(proteins))
    SeqIO.write(seqrecs, outfile, "fasta")
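To see what the six-frame step produces on a small input, here is a plain-Python sketch of the same idea (Python 3, standard library only, no Biopython). The function names and the toy sequence are illustrative assumptions; the codon table is the standard genetic code written out in TCAG order. As in the procedure above, an "ORF" here is simply a stretch between stop codons, with no start-codon check.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
# Standard genetic code: codons enumerated in TCAG order, "*" marks a stop
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; a trailing partial codon is dropped."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def six_frames(dna):
    rev = reverse_complement(dna)
    for offset in range(3):
        yield translate(dna[offset:])  # forward strand, shifted by offset
        yield translate(rev[offset:])  # reverse strand, shifted by offset


def orfs(dna, min_len=2):
    for frame in six_frames(dna):
        for peptide in frame.split("*"):
            if len(peptide) >= min_len:
                yield peptide


toy = "ATGGCCTAAATGAAA"  # toy input: ATG GCC TAA ATG AAA
frames = list(six_frames(toy))
found = list(orfs(toy))
```

The first frame of the toy input translates to "MA*MK", so splitting on "*" yields the peptides "MA" and "MK".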
AlignIO and the Alignment object
Alignment: a set of sequences with the same length and alphabet.
Use AlignIO just like SeqIO:
>>> from Bio import AlignIO
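The "same length" requirement is what lets an alignment be read column by column. A minimal plain-Python sketch (Python 3, standard library only) with made-up rows; alignment_columns is a hypothetical helper, not part of AlignIO:

```python
def alignment_columns(rows):
    """Return the columns of an alignment given as equal-length strings."""
    if len({len(row) for row in rows}) != 1:
        raise ValueError("all rows of an alignment must have the same length")
    return ["".join(column) for column in zip(*rows)]


rows = ["AC-T", "ACGT", "A-GT"]  # toy alignment, "-" marks a gap
columns = alignment_columns(rows)
```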
from Bio import Entrez, SeqIO
Entrez.email = "[email protected]"

# Fetch one GenBank record from the NCBI nucleotide database
handle = Entrez.efetch(db="nucleotide", id="M95169",
                       rettype="gb", retmode="text")
record = SeqIO.read(handle, "genbank")
handle.close()
print record
print record.features[10]
sliced = record[20000:]  # Last ~25% of the genome
print sliced

from Bio.Seq import Seq
from Bio.Alphabet import generic_protein
# Each "translation" qualifier is a list holding one protein string
translations = [f.qualifiers["translation"]
                for f in record.features[1:]]
proteins = [Seq(t[0], generic_protein)
            for t in translations]
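Under the hood, an EFetch call is an HTTP request to NCBI's E-utilities. The sketch below (Python 3, standard library only) just builds the kind of URL involved, without contacting the server. build_efetch_url is a hypothetical helper and the e-mail address is a placeholder; Biopython may add further parameters of its own, so treat this as an approximation rather than the exact request it sends.

```python
from urllib.parse import urlencode

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(db, record_id, rettype, retmode, email):
    """Assemble an EFetch query URL from its parameters (no network access)."""
    params = [("db", db), ("id", record_id), ("rettype", rettype),
              ("retmode", retmode), ("email", email)]
    return EFETCH_URL + "?" + urlencode(params)


# "you@example.org" is a placeholder address, not a real contact
url = build_efetch_url("nucleotide", "M95169", "gb", "text", "you@example.org")
```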
# Search for homologs of a protein sequence

from Bio import SeqIO
from Bio.Blast import NCBIWWW, NCBIXML

# Read and reformat the query sequence
seqrec = SeqIO.read('gi2.gb', 'gb')
query = seqrec.format('fasta')

# Submit an online BLAST query
# (This takes some time to run)
result_handle = NCBIWWW.qblast('blastx', 'nr', query)

# 1. Save the BLAST results as an XML file

with open('aprotinin_blast.xml', 'w') as save_file:
    save_file.write(result_handle.read())

result_handle.close()

# NB: The BLAST result handle can only be read once
# Reload it from the file
with open('aprotinin_blast.xml') as result_handle:
    blast_record = NCBIXML.read(result_handle)

# 2. Display a histogram of BLAST hit scores

def get_scores(alignments):
    for aln in alignments:
        for hsp in aln.hsps:
            yield hsp.score

scores = list(get_scores(blast_record.alignments))

# Draw the histogram
import pylab
pylab.hist(scores, bins=20)
pylab.title("Scores of %d BLAST hits" % len(scores))
pylab.xlabel("BLAST score")
pylab.ylabel("# hits")
# Save a copy for later, before show() can discard the figure
pylab.savefig('aprotinin_scores.png')
pylab.show()

# 3. Extract the sequences of high-scoring BLAST hits

from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

def get_hsps(alignments, threshold):
    for aln in alignments:
        for hsp in aln.hsps:
            if hsp.score >= threshold:
                yield SeqRecord(Seq(hsp.sbjct),
                                id=aln.accession)
                # Keep only the first qualifying HSP per hit
                break

best_seqs = get_hsps(blast_record.alignments, 321)
SeqIO.write(best_seqs, 'aprotinin.fasta', 'fasta')
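The effect of the break in get_hsps is easiest to see on toy data. This plain-Python sketch (Python 3, standard library only) uses namedtuples as stand-ins for the parsed BLAST objects; the accessions, scores, and sequences are all made up.

```python
from collections import namedtuple

# Stand-ins for the alignment and HSP objects in a parsed BLAST record
HSP = namedtuple("HSP", ["score", "sbjct"])
Alignment = namedtuple("Alignment", ["accession", "hsps"])


def first_hsp_per_hit(alignments, threshold):
    """Yield (accession, subject sequence) for the first HSP at or above
    the threshold in each alignment; hits with none are skipped."""
    for aln in alignments:
        for hsp in aln.hsps:
            if hsp.score >= threshold:
                yield aln.accession, hsp.sbjct
                break  # at most one sequence per hit


hits = [
    Alignment("HIT_A", [HSP(400.0, "MKV"), HSP(350.0, "MKL")]),
    Alignment("HIT_B", [HSP(100.0, "MRV")]),
    Alignment("HIT_C", [HSP(200.0, "MAA"), HSP(330.0, "MAV")]),
]
selected = list(first_hsp_per_hit(hits, 321))
```

Each hit contributes at most one sequence, which keeps the record IDs in the output file unique.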
Biopython has wrappers for other command-line programs in:
Bio.Blast.Applications: the Blast+ suite
Bio.Align.Applications: Muscle, ClustalW, ...
Bio.Emboss.Applications: needle, water, ...
Let’s re-align our BLAST results using Muscle, and format the alignment for use with stand-alone Phylip.
from Bio import AlignIO
from Bio.Align.Applications import MuscleCommandline
from StringIO import StringIO

# Construct the shell command
muscle_cmd = MuscleCommandline(input="aprotinin.fasta")
# Execute the command
# Get output (the alignment) and any error messages
muscle_out, muscle_err = muscle_cmd()

# Read the alignment back in
align = AlignIO.read(StringIO(muscle_out), "fasta")

# Format the alignment for Phylip
AlignIO.write([align], 'aprotinin.phy', 'phylip')
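For a feel of what a strict PHYLIP file looks like, here is a plain-Python sketch (Python 3, standard library only) that writes the simple sequential layout: a header with the number of sequences and columns, then one line per sequence with the name in a fixed 10-character field. to_strict_phylip is a hypothetical helper and is not how AlignIO writes its output; the point is that names longer than 10 characters get cut and can collide.

```python
def to_strict_phylip(records):
    """Format (name, aligned_sequence) pairs as sequential strict PHYLIP."""
    lengths = {len(seq) for _, seq in records}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    short_names = [name[:10] for name, _ in records]
    if len(set(short_names)) != len(short_names):
        raise ValueError("names collide after truncation to 10 characters")
    lines = [" %d %d" % (len(records), lengths.pop())]
    for short, (_, seq) in zip(short_names, records):
        lines.append(short.ljust(10) + seq)  # name padded to 10 characters
    return "\n".join(lines) + "\n"


text = to_strict_phylip([("seqA", "ACGT"), ("a_very_long_name", "AC-T")])
lines = text.splitlines()
```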
Phylogenetics
Phylogenetic tree I/O
Start with:
>>> from Bio import Phylo
Input and output of trees is just like SeqIO:
read, parse: single or multiple trees in Newick, Nexus and PhyloXML formats
write: to any of the formats supported by read/parse
convert: between two formats in one step
Use StringIO to load strings directly:
>>> from cStringIO import StringIO
>>> handle = StringIO("((A,B),(C,(D,E)));")
>>> tree = Phylo.read(handle, "newick")
What’s in a tree?
Make a tree with branch lengths:
>>> tree = Phylo.read(StringIO("((A:1,B:1):2,"
...     "(C:2,(D:1,E:1):1):1);"), "newick")
View the object structure of the entire tree:
>>> print tree
Draw an “ASCII-art” (plain text) representation:
>>> Phylo.draw_ascii(tree)
... OK, let’s do it properly now:
>>> Phylo.draw(tree)
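To show what the branch lengths in that Newick string encode, here is a small plain-Python sketch (Python 3, standard library only) that parses the string into nested tuples and sums the lengths from the root to each tip. parse_newick and tip_depths are hypothetical helpers, far simpler than Bio.Phylo's parser: no quoted labels, comments, or error recovery.

```python
def parse_newick(newick):
    """Parse a simple Newick string into (name, length, children) tuples."""
    text = "".join(newick.split())  # drop whitespace and line breaks
    pos = 0

    def read_until(stop_chars):
        nonlocal pos
        start = pos
        while len(text) > pos and text[pos] not in stop_chars:
            pos += 1
        return text[start:pos]

    def parse_clade():
        nonlocal pos
        children = []
        if text[pos:pos + 1] == "(":
            pos += 1  # step over "("
            children.append(parse_clade())
            while text[pos:pos + 1] == ",":
                pos += 1
                children.append(parse_clade())
            if text[pos:pos + 1] != ")":
                raise ValueError("expected ')' at position %d" % pos)
            pos += 1  # step over ")"
        name = read_until(":,();")
        length = 0.0  # missing branch lengths are treated as 0 here
        if text[pos:pos + 1] == ":":
            pos += 1
            length = float(read_until(",();"))
        return (name, length, children)

    return parse_clade()


def tip_depths(clade, depth_so_far=0.0):
    """Map each tip name to its summed branch length from the root."""
    name, length, children = clade
    depth = depth_so_far + length
    if not children:
        return {name: depth}
    depths = {}
    for child in children:
        depths.update(tip_depths(child, depth))
    return depths


tree = parse_newick("((A:1,B:1):2,(C:2,(D:1,E:1):1):1);")
depths = tip_depths(tree)
```

Every tip in this example sits at the same distance from the root (3.0), which is why the drawn tree has all its tips lined up.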
Modify the tree
Check the tree object for its methods:
>>> help(tree)
Try a few:
>>> tree.get_terminals()
>>> clade = tree.common_ancestor("A", "B")
>>> clade.color = "red"
>>> tree.root_with_outgroup("D", "E")
>>> tree.ladderize()
>>> Phylo.draw(tree)
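Ladderizing sorts the children of every clade by how many tips they contain, so that smaller clades come first. A minimal plain-Python sketch of the idea (Python 3, standard library only), using nested lists as a stand-in for a tree: a string is a tip, a list is an internal node. Unlike tree.ladderize(), which modifies the tree in place, this hypothetical version returns a new nested list.

```python
def count_tips(node):
    """Number of tips under a node (a string is a single tip)."""
    if isinstance(node, str):
        return 1
    return sum(count_tips(child) for child in node)


def ladderize(node, reverse=False):
    """Return a copy with each node's children sorted by tip count."""
    if isinstance(node, str):
        return node
    children = [ladderize(child, reverse) for child in node]
    return sorted(children, key=count_tips, reverse=reverse)


tree = [["C", ["D", "E"]], ["A", "B"]]
ladderized = ladderize(tree)
```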
External applications
Biopython wraps a number of external programs for phylogenetics. We’re not going to use them now, but here’s where to find them:
Bio.Phylo.PAML: PAML wrappers & helpers
Bio.Phylo.Applications: command-line wrapper for PhyML (PhymlCommandline); RAxML and others on the way. (Anything you’d like to see sooner?)
Bio.Emboss.Applications: other tools ported via Embassy, including Phylip
Protein structures
Going 3D: The PDB module
Load a structure:
>>> from Bio import PDB
>>> parser = PDB.PDBParser()
>>> struct = parser.get_structure('1ATP',
...                               '1ATP.pdb')
Inspect the object hierarchy:
>>> list(struct)
>>> model = struct[0]
>>> list(model)
>>> chain = model['E']
>>> list(chain)
>>> residue = chain[15]
>>> list(residue)
Figure: The “SMCRA” object hierarchy
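SMCRA stands for Structure, Model, Chain, Residue, Atom, and each level is indexed by the ID of the level below it. The nesting can be mimicked with plain dictionaries, as in this sketch (Python 3, standard library only). The coordinates are made up for illustration, and iter_atoms is a hypothetical helper; the real Bio.PDB objects carry much more than this.

```python
# Structure -> Model -> Chain -> Residue -> Atom, as nested dicts.
# All coordinates below are invented for illustration.
structure = {
    0: {                      # model index
        "E": {                # chain ID
            15: {             # residue number
                "N": (1.0, 2.0, 3.0),
                "CA": (2.0, 3.0, 4.0),
            },
            16: {
                "N": (3.0, 4.0, 5.0),
                "CA": (4.0, 5.0, 6.0),
            },
        },
    },
}


def iter_atoms(struct):
    """Walk the hierarchy and yield ((model, chain, residue, atom), coord)."""
    for model_id, model in struct.items():
        for chain_id, chain in model.items():
            for res_id, residue in chain.items():
                for atom_name, coord in residue.items():
                    yield (model_id, chain_id, res_id, atom_name), coord


ca_coord = structure[0]["E"][15]["CA"]  # same indexing order as above
atom_paths = [path for path, _ in iter_atoms(structure)]
```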
Extracting a peptide sequence
Get the amino acid sequence through a Polypeptide object:
>>> from Bio import PDB
>>> parser = PDB.PDBParser()
>>> struct = parser.get_structure('1ATP',
...                               '1ATP.pdb')
>>> ppb = PDB.PPBuilder()
>>> peptides = ppb.build_peptides(struct)
>>> for pep in peptides:
...     print pep.get_sequence()
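Conceptually, turning a stretch of residues into a sequence means mapping three-letter residue names to one-letter codes. A plain-Python sketch of that step (Python 3, standard library only); residues_to_sequence is a hypothetical helper, and the "X" fallback for unknown residue names is an assumption of this sketch, not necessarily what Bio.PDB does.

```python
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def residues_to_sequence(residue_names, unknown="X"):
    """Join one-letter codes for a list of three-letter residue names."""
    return "".join(THREE_TO_ONE.get(name.upper(), unknown)
                   for name in residue_names)


sequence = residues_to_sequence(["MET", "LYS", "VAL", "GLY"])
```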
Calculating RMSD
Given two aligned structures, filter a list of target residues for high RMS deviation.
Input:
list of residue positions (integers)
two equivalent chains from aligned protein models (residue numbers must match)
minimum RMSD value (float)
Output: list of residue positions, filtered
Procedure:
1 Extract coordinates of Cα atoms
2 If available (not glycine), extract Cβ coordinates, too
3 Use Bio.SVDSuperimposer to calculate the RMSD between coordinates
4 Compare to the given RMSD threshold
from Bio.SVDSuperimposer import SVDSuperimposer
from numpy import array

def filt_rms(resids, refchain, cmpchain, thresh=0.5):
    superimposer = SVDSuperimposer()
    for res in resids:
        refres = refchain[res]
        cmpres = cmpchain[res]
        # Always compare the alpha carbons
        coord1 = [refres['CA'].get_coord()]
        coord2 = [cmpres['CA'].get_coord()]
        if refres.has_id('CB') and cmpres.has_id('CB'):
            # Not glycine
            coord1.append(refres['CB'].get_coord())
            coord2.append(cmpres['CB'].get_coord())
        superimposer.set(array(coord1), array(coord2))
        # RMSD of the coordinates as given (the structures are already aligned)
        rmsd = superimposer.get_init_rms()
        if rmsd >= thresh:
            yield res
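Because the two structures are already aligned, the quantity compared to the threshold is just the root-mean-square distance between matching atoms, with no further fitting. A plain-Python sketch of that calculation (Python 3, standard library only); the rmsd function and the toy coordinates are illustrative assumptions, not the SVDSuperimposer code.

```python
import math


def rmsd(coords1, coords2):
    """Root-mean-square deviation between two equal-length lists of
    (x, y, z) points, with no superposition applied."""
    if len(coords1) != len(coords2) or not coords1:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (x1, y1, z1), (x2, y2, z2) in zip(coords1, coords2):
        total += (x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2
    return math.sqrt(total / len(coords1))


# Toy CA and CB positions for one residue in two models
ref_atoms = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
cmp_atoms = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
deviation = rmsd(ref_atoms, cmp_atoms)
```

With only one or two atoms per residue, a fitted RMSD would be close to zero, which is presumably why the unfitted value is the useful one here.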
Figure: Superimposed structures, with selected deviating residues
Further reading
Biopython tutorial: http://biopython.org/DIST/docs/tutorial/Tutorial.html
Biopython wiki: http://biopython.org/
This presentation: http://www.slideshare.net/etalevich/