Biopython programming workshop at UGA


IOB Workshop: Biopython

A programming toolkit for bioinformatics

Eric Talevich

Institute of Bioinformatics, University of Georgia

Mar. 29, 2012




Getting started




Installing Python

Biopython is a library for the Python programming language.

First, you’ll need these installed:

Python 2.7 from http://python.org. It may already be installed on your computer. (Version 2.6 is OK, too.)

IDLE, a simple Integrated DeveLopment Environment. Usually bundled with the Python distribution.

Now, start an interactive session in IDLE. [1]

[1] On your own, check out IPython (http://ipython.scipy.org/). It's an enhanced Python interpreter that feels somewhat like R.





Installing Python packages

Biopython is a Python package. There are a few standard ways to install Python packages:

From source: Download from PyPI [2], unpack and install with the included setup.py script.

easy_install: Install setuptools from source [3], then use the easy_install command to fetch and install all other packages by name:
$ easy_install <package name>

pip: Like easy_install, use pip [4] to manage packages:
$ pip install <package name>

[2] http://pypi.python.org/pypi/
[3] http://pypi.python.org/pypi/setuptools
[4] http://pypi.python.org/pypi/pip





Installing NumPy, matplotlib and Biopython

Biopython relies on a few other Python packages for extra functionality. We'll use these:

numpy: efficient numerical functions and data structures (for Bio.PDB)

matplotlib: plotting (for Bio.Phylo)

Then finally:

biopython: the reason we're here today

(Biopython, NumPy, matplotlib, setuptools and pip are also packaged for many Linux distributions.)




Testing

Check your Biopython installation:

>>> import Bio

>>> print Bio.__version__

Import a NumPy-based component:

>>> from Bio import PDB

Show a simple plot:

>>> from matplotlib import pyplot

>>> pyplot.plot(range(5), range(5))

>>> pyplot.show()




Let's start using Biopython

1 Sequences and alignments
    The Seq object
    SeqIO and the SeqRecord object

2 NCBI EUtils and BLAST
    EUtils: Entrez Programming Utilities
    NCBI Blast
    External programs

3 Phylogenetics

4 Protein structures





Sequences and Alignments





The Seq object

>>> from Bio.Seq import Seq

>>> myseq = Seq('AGTACACTGGT')

>>> myseq

Seq('AGTACACTGGT', Alphabet())

>>> print myseq

AGTACACTGGT

>>> myseq.transcribe()

Seq('AGUACACUGGU', RNAAlphabet())

>>> myseq.translate()

Seq('STL', ExtendedIUPACProtein())





A Seq object consists of:

data: the underlying Python character string

alphabet: DNA, RNA, protein, etc.

It supports most Python string methods:
>>> myseq.count('GT')
2

And some biology-specific methods, too:
>>> myseq.reverse_complement()
Seq('ACCAGTGTACT', Alphabet())

Intrigued? Read on:
>>> help(Seq)
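To see what the biology-specific methods compute, here is a plain-Python sketch of reverse complement and transcription on an ordinary string. It is not how Biopython implements Seq. The helper names reverse_complement and transcribe are hypothetical, and the sketch assumes uppercase, unambiguous DNA only.

```python
# Plain-string sketch of two Seq operations (not Biopython's implementation).
# Assumes uppercase, unambiguous DNA letters only.

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def reverse_complement(dna):
    """Walk the strand backwards, complementing each base."""
    return "".join(COMPLEMENT[base] for base in reversed(dna))

def transcribe(dna):
    """Rewrite the coding strand as RNA: T becomes U."""
    return dna.replace("T", "U")

myseq = "AGTACACTGGT"
rc = reverse_complement(myseq)   # same letters as the Seq example above
rna = transcribe(myseq)
```

Applying reverse_complement twice returns the original string, which is a handy sanity check for any implementation.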





SeqIO: Sequence Input/Output

Sequence data is stored in many different file formats. Bio.SeqIO supports:

abi      fastq    phylip     swiss
ace      genbank  pir        tab
clustal  ig       qual       uniprot-xml
embl     imgt     seqxml
emboss   nexus    sff
fasta    phd      stockholm

Manually fetch some data from the PDB website: [5]

1ATP.fasta: two protein sequences, FASTA format

1ATP.pdb: the 3D structure, for later

[5] http://www.rcsb.org/pdb/explore/explore.do?structureId=1ATP






The SeqIO API

SeqIO provides four functions:

parse: Iteratively parse all elements in the file

read: Parse a one-element file and return the element

write: Write elements to a file

convert: Parse one format and immediately write another

Biopython uses the same I/O conventions for alignments (AlignIO), BLAST results (Blast), and phylogenetic trees (Phylo).
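The difference between parse and read can be sketched with a toy FASTA reader. This is an illustration of the convention only, not Biopython's parser: the functions parse_fasta and read_fasta are hypothetical, records are plain (title, sequence) tuples instead of SeqRecords, and the input text is made up.

```python
# Sketch of the parse/read convention using a tiny FASTA reader.
# Not Biopython's parser: records here are plain (title, sequence) tuples.
from io import StringIO

def parse_fasta(handle):
    """Yield (title, sequence) for each record, one at a time."""
    title, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if title is not None:
                yield title, "".join(chunks)
            title, chunks = line[1:], []
        elif line and title is not None:
            chunks.append(line)
    if title is not None:
        yield title, "".join(chunks)

def read_fasta(handle):
    """Return the single record in the handle, or raise ValueError."""
    records = list(parse_fasta(handle))
    if len(records) != 1:
        raise ValueError("expected exactly one record, found %d" % len(records))
    return records[0]

# Made-up two-record input, for illustration only
example = u">seq1 first\nAGTACA\nCTGGT\n>seq2 second\nGATTACA\n"
records = list(parse_fasta(StringIO(example)))
```

The iterator form keeps memory use low on large files, while the read form fails loudly when a file that should hold one record holds none or several.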





The SeqRecord object

SeqIO.parse returns an iterator of SeqRecords. A SeqRecord wraps a Seq object and attaches metadata.

1 Pass the file name to the SeqIO parser; specify FASTA format:
from Bio import SeqIO
seqrecs = SeqIO.parse("1ATP.fasta", "fasta")
print seqrecs

2 To see all records at once, convert the iterator to a list:
allrecs = list(seqrecs)
print allrecs[0]
print allrecs[0].seq










Example: Shuffled sequences

Given a real DNA sequence, create a “background” set of randomized sequences with the same composition.

Procedure:

1 Read the source sequence from a file
    Use Bio.SeqIO

2 In a loop:
    Shuffle the sequence
        Use random.shuffle from Python's standard library
    Create a new SeqRecord from the shuffled sequence
        Because SeqIO.write works with SeqRecords

3 Write the shuffled SeqRecords to another file














import random
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

orig_rec = SeqIO.read("gi2.gb", "genbank")
alphabet = orig_rec.seq.alphabet
out_recs = []
for i in xrange(1, 31):
    nucleotides = list(orig_rec.seq)
    random.shuffle(nucleotides)
    new_seq = Seq("".join(nucleotides), alphabet)
    new_rec = SeqRecord(new_seq,
                        id="shuffle" + str(i))
    out_recs.append(new_rec)

SeqIO.write(out_recs, "gi2_shuffled.fasta", "fasta")
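A quick way to convince yourself that shuffling preserves composition is to count letters before and after. The sketch below works on plain strings with only the standard library; the helper name shuffled and the source sequence are made up for illustration.

```python
# Shuffling keeps the base composition: a plain-string check.
import random
from collections import Counter

def shuffled(seq, rng=random):
    """Return a shuffled copy of seq with the same letters."""
    letters = list(seq)
    rng.shuffle(letters)
    return "".join(letters)

source = "AGTACACTGGTAGTACACTGGT"   # made-up sequence
background = [shuffled(source) for _ in range(30)]

# Every shuffled copy has exactly the same letter counts as the source
same_composition = all(Counter(s) == Counter(source) for s in background)
```

Note that this kind of shuffle preserves single-letter frequencies only; dinucleotide frequencies are generally not preserved.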





Example: ORF translation

Split a set of unannotated DNA sequences into unique ORFs, translating in all 6 frames.

Biopython can help with each piece of this problem:

1 Parse the given unannotated DNA sequences (SeqIO.parse)

2 Get the template strand's sequence (Seq.reverse_complement)

3 Translate both strands into protein sequences (Seq.translate)

4 Shift each strand by +1 and +2 for alternate reading frames (string-like Seq slicing)

5 Split sequences at stop codons (Seq.split('*'))

6 Write translated sequences to a new file (SeqIO.write)
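Steps 4 and 5 rely only on string-like behavior, so they can be tried on ordinary Python strings first. In this sketch the DNA string, the translated string, and the min_len cutoff are all made up for illustration.

```python
# Steps 4 and 5 on plain strings (Seq objects slice and split the same way).
# The sequences below are made up for illustration.

dna = "ATGGCCTAAGGT"
# Step 4: the three forward reading frames are just slices
frames = [dna[i:] for i in range(3)]

# Step 5: a translated frame, with "*" marking stop codons
translated = "MKV*AAGT*M*LLPQR"
min_len = 4  # hypothetical cutoff; the workshop code defaults to 60
orfs = [pep for pep in translated.split("*") if len(pep) >= min_len]
```

The length filter is what keeps the output manageable: most stop-to-stop segments in random frames are very short.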





def translate_six_frames(seq, table=1):
    """Translate a nucleotide sequence in 6 frames.

    Returns an iterable of 6 translated protein
    sequences.
    """
    rev = seq.reverse_complement()
    for i in range(3):
        # Coding (Crick) strand
        yield seq[i:].translate(table)
        # Template (Watson) strand
        yield rev[i:].translate(table)





def translate_orfs(sequences, min_prot_len=60):
    """Find and translate all ORFs in sequences.

    Translates each sequence in all 6 reading frames,
    splits sequences on stop codons, and produces an
    iterable of all protein sequences of length at
    least min_prot_len.
    """
    for seq in sequences:
        for frame in translate_six_frames(seq):
            for prot in frame.split("*"):
                if len(prot) >= min_prot_len:
                    yield prot





from Bio import SeqIO
from Bio.SeqRecord import SeqRecord

if __name__ == "__main__":
    import sys
    infile = sys.stdin
    outfile = sys.stdout
    records = SeqIO.parse(infile, "fasta")
    seqs = (rec.seq for rec in records)
    proteins = translate_orfs(seqs)
    # Wrap each protein in a SeqRecord so SeqIO.write accepts it
    seqrecs = (SeqRecord(seq, id="orf" + str(i))
               for i, seq in enumerate(proteins))
    SeqIO.write(seqrecs, outfile, "fasta")





AlignIO and the Alignment object

Alignment: a set of sequences with the same length and alphabet.

Use AlignIO just like SeqIO:
>>> from Bio import AlignIO
>>> aln = AlignIO.read("PF01601.sto", "stockholm")
>>> print aln
SingleLetterAlphabet() alignment with 22 rows and 730 columns
NCTDAV-----LTYSSFGVCADGSIIA-VQPRNV-----SYDSV...HIQ Q1HVL3_CVH22/539-1170
NCTTAV-----MTYSNFGICADGSLIP-VRPRNS-----SDNGI...HVQ SPIKE_CVHNL/723-1356
NCTEPV-----LVYSNIGVCKSGSIGY-VPSQS------GQVKI...HVQ Q692M1_9CORO/740-1383
NCTEPA-----LVYSNIGVCKNGAIGL-VGIRN------TQPKI...HIQ Q0Q4F4_9CORO/729-1360
NCTSPR-----LVYSNIGVCTSGAIGL-LSPKX------AQPQI...HVQ Q0Q4F6_9CORO/743-1371
NCTNPV-----LTYSSYGVCPDGSITR-LGLTD------VQPHF...--T A4ULL0_9CORO/726-1328
NCTKPV-----LSYGPISVCSDGAIAG-TSTLQN-----TRPSI...KEW A6N263_9CORO/406-1035
ECDIPIGAGICASYHTVSLLRSTSQKSIVAYTMS------LGAD...HYT Q6T7X8_CVHSA/647-1255
...
DCE-PV-----ITYSNIGVCKNGAFVF-INVTH------SDGDV...HVH Q0PKZ5_CVPPU/797-1449





Snack Time





EUtils and BLAST





EUtils: Entrez Programming Utilities

Access NCBI's online services:
from Bio import Entrez, SeqIO
Entrez.email = "[email protected]"

Request a GenBank record:
handle = Entrez.efetch(db="protein", id="69316",
                       rettype="gb", retmode="text")
record = SeqIO.read(handle, "gb")

Specify multiple IDs in one query:
handle = Entrez.efetch(db="protein",
                       id="349839,349840",
                       rettype="fasta", retmode="text")
records = SeqIO.parse(handle, "fasta")



















Interlude: SeqRecord attributes

seq: the sequence (Seq) itself

id: primary ID for the sequence, e.g. accession number (string)

name: “common” name/id for the sequence, like GenBank LOCUS id

description: human-readable description of the sequence

letter_annotations: restricted dictionary of additional info about individual letters in the sequence, e.g. quality scores

annotations: dictionary of additional unstructured info

features: list of SeqFeature objects with more structured information, e.g. position of genes on a genome, domains on a protein sequence

dbxrefs: list of database cross-references (strings)





from Bio import Entrez, SeqIO
Entrez.email = "[email protected]"

handle = Entrez.efetch(db="nucleotide", id="M95169",
                       rettype="gb", retmode="text")
record = SeqIO.read(handle, "genbank")
handle.close()
print record
print record.features[10]
sliced = record[20000:]  # Last ~25% of the genome
print sliced

from Bio.Seq import Seq
from Bio.Alphabet import generic_protein
translations = [f.qualifiers["translation"]
                for f in record.features[1:]]
proteins = [Seq(t[0], generic_protein)
            for t in translations]





NCBI Blast

BLAST can be used either standalone or through NCBI’s server.

Online:
>>> from Bio.Blast import NCBIWWW
>>> result_handle = NCBIWWW.qblast(
...     'blastp', 'nr', query_string)

Standalone:

“Legacy” (blastall):
>>> from Bio.Blast.Applications import BlastallCommandline
>>> help(BlastallCommandline)

New hotness (Blast+):
>>> from Bio.Blast.Applications import NcbiblastpCommandline
>>> help(NcbiblastpCommandline)





Parsing BLAST output

BLAST produces reports in plain-text and XML format.

Biopython requests XML by default.

>>> from Bio.Blast import NCBIWWW, NCBIXML
>>> result_handle = NCBIWWW.qblast('blastp',
...     'nr', query_string)
>>> blast_record = NCBIXML.read(result_handle)
>>> print blast_record





# Search for homologs of a protein sequence

from Bio import SeqIO
from Bio.Blast import NCBIWWW, NCBIXML

# Read and reformat the query sequence
seq_rec = SeqIO.read('gi2.gb', 'gb')
query = seq_rec.format('fasta')

# Submit an online BLAST query
# (This takes some time to run)
result_handle = NCBIWWW.qblast('blastx', 'nr', query)





# 1. Save the BLAST results as an XML file

with open('aprotinin_blast.xml', 'w') as save_file:
    save_file.write(result_handle.read())

result_handle.close()

# NB: The BLAST result handle can only be read once
# Reload it from the file
with open('aprotinin_blast.xml') as result_handle:
    blast_record = NCBIXML.read(result_handle)





# 2. Display a histogram of BLAST hit scores

def get_scores(alignments):
    for aln in alignments:
        for hsp in aln.hsps:
            yield hsp.score

scores = list(get_scores(blast_record.alignments))

# Draw the histogram
import pylab
pylab.hist(scores, bins=20)
pylab.title("Scores of %d BLAST hits" % len(scores))
pylab.xlabel("BLAST score")
pylab.ylabel("# hits")

# Save a copy for later
# (do this before show(), which may discard the figure once the window closes)
pylab.savefig('aprotinin_scores.png')
pylab.show()





Figure: Histogram of BLAST scores generated by pylab





# 3. Extract the sequences of high-scoring BLAST hits

from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

def get_hsps(alignments, threshold):
    for aln in alignments:
        for hsp in aln.hsps:
            if hsp.score >= threshold:
                yield SeqRecord(Seq(hsp.sbjct),
                                id=aln.accession)
                break

best_seqs = get_hsps(blast_record.alignments, 321)
SeqIO.write(best_seqs, 'aprotinin.fasta', 'fasta')





Calling other external programs

Biopython has wrappers for other command-line programs in:

Bio.Blast.Applications: the Blast+ suite

Bio.Align.Applications: Muscle, ClustalW, ...

Bio.Emboss.Applications: needle, water, ...

Let's re-align our BLAST results using Muscle, and format the alignment for use with stand-alone Phylip.




from Bio import AlignIO
from Bio.Align.Applications import MuscleCommandline
from StringIO import StringIO

# Construct the shell command
muscle_cmd = MuscleCommandline(input="aprotinin.fasta")
# Execute the command
# Get output (the alignment) and any error messages
muscle_out, muscle_err = muscle_cmd()

# Read the alignment back in
align = AlignIO.read(StringIO(muscle_out), "fasta")

# Format the alignment for Phylip
AlignIO.write([align], 'aprotinin.phy', 'phylip')




Phylogenetics




Phylogenetic tree I/O

Start with:
>>> from Bio import Phylo

Input and output of trees is just like SeqIO:

read, parse: single or multiple trees in Newick, Nexus and PhyloXML formats

write: to any of the formats supported by read/parse

convert: between two formats in one step

Use StringIO to load strings directly:
>>> from cStringIO import StringIO

>>> handle = StringIO("((A,B),(C,(D,E)));")

>>> tree = Phylo.read(handle, "newick")
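To make the Newick notation less mysterious, here is a small recursive reader that turns a Newick string into nested Python lists. It is a sketch for illustration, not Bio.Phylo's parser: parse_newick and terminals are hypothetical names, branch lengths are read past and discarded, and the input is assumed to be well formed, unquoted, and terminated by a semicolon.

```python
# Minimal Newick reader: a sketch, not Bio.Phylo's parser.
# Handles unquoted names and optional ":length" suffixes only.

def parse_newick(text):
    """Return nested lists of leaf names for a Newick string."""
    pos = [0]  # mutable cursor shared with the nested helper

    def clade():
        if text[pos[0]] == "(":
            pos[0] += 1                      # consume "("
            children = [clade()]
            while text[pos[0]] == ",":
                pos[0] += 1                  # consume ","
                children.append(clade())
            pos[0] += 1                      # consume ")"
            node = children
        else:
            node = None
        # Read an optional label and ":length" up to the next delimiter
        start = pos[0]
        while text[pos[0]] not in ",();":
            pos[0] += 1
        label = text[start:pos[0]].split(":")[0]
        return label if node is None else node

    return clade()

def terminals(node):
    """Flatten nested lists into leaf names, left to right."""
    if isinstance(node, list):
        return [name for child in node for name in terminals(child)]
    return [node]

tree = parse_newick("((A,B),(C,(D,E)));")
leaves = terminals(tree)
```

Each pair of parentheses becomes one list, so the nesting of the result mirrors the clades of the tree.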




What’s in a tree?

Make a tree with branch lengths:
>>> tree = Phylo.read(StringIO("((A:1,B:1):2,"
...     "(C:2,(D:1,E:1):1):1);"), "newick")

View the object structure of the entire tree:
>>> print tree

Draw an “ASCII-art” (plain text) representation:
>>> Phylo.draw_ascii(tree)

... OK, let's do it properly now:
>>> Phylo.draw(tree)




Modify the tree

Check the tree object for its methods:
>>> help(tree)

Try a few:
>>> tree.get_terminals()

>>> clade = tree.common_ancestor("A", "B")

>>> clade.color = "red"

>>> tree.root_with_outgroup("D", "E")

>>> tree.ladderize()

>>> Phylo.draw(tree)




External applications

Biopython wraps a number of external programs for phylogenetics. We're not going to use them now, but here's where to find them:

Bio.Phylo.PAML: PAML wrappers & helpers

Bio.Phylo.Applications: command-line wrapper for PhyML (PhymlCommandline); RAxML and others on the way. (Anything you'd like to see sooner?)

Bio.Emboss.Applications: other tools ported via Embassy, including Phylip




Protein structures




Going 3D: The PDB module

Load a structure:


>>> from Bio import PDB

>>> parser = PDB.PDBParser()

>>> struct = parser.get_structure('1ATP',
...                               '1ATP.pdb')

Inspect the object hierarchy:

>>> list(struct)

>>> model = struct[0]

>>> list(model)

>>> chain = model['E']

>>> list(chain)

>>> residue = chain[15]

>>> list(residue)








Figure: The “SMCRA” object hierarchy




Extracting a peptide sequence

Get the amino acid sequence through a Polypeptide object:




>>> struct = parser.get_structure('1ATP',
...                               '1ATP.pdb')

>>> ppb = PDB.PPBuilder()

>>> peptides = ppb.build_peptides(struct)

>>> for pep in peptides:
...     print pep.get_sequence()




Calculating RMSD

Given two aligned structures, filter a list of target residues for high RMS deviation.

Input:
    list of residue positions (integers)
    two equivalent chains from aligned protein models; residue numbers must match
    minimum RMSD value (float)

Output: list of residue positions, filtered

Procedure:
    1 Extract coordinates of Cα atoms
    2 If available (not glycine), extract Cβ coordinates, too
    3 Use Bio.SVDSuperimposer to calculate the RMSD between coordinates
    4 Compare to the given RMSD threshold
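The quantity compared against the threshold in steps 3 and 4 can be sketched in plain Python. This computes the RMSD of the coordinates as given, with no superposition step, on the assumption that the structures are already aligned as the task states. The function name rmsd and the coordinates are made up for illustration; this is not Bio.SVDSuperimposer's code.

```python
# RMSD between two equal-length coordinate lists, without superposition.
# A sketch of the quantity compared in step 4; coordinates are made up.
from math import sqrt

def rmsd(coords1, coords2):
    """Root-mean-square deviation over paired (x, y, z) points."""
    if len(coords1) != len(coords2) or not coords1:
        raise ValueError("need two non-empty lists of equal length")
    total = 0.0
    for (x1, y1, z1), (x2, y2, z2) in zip(coords1, coords2):
        total += (x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2
    return sqrt(total / len(coords1))

# Hypothetical CA and CB positions for one residue in two models
ref = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
cmp_ = [(0.0, 0.0, 1.0), (1.5, 1.0, 0.0)]
value = rmsd(ref, cmp_)
deviates = value >= 0.5   # same comparison as step 4, threshold 0.5
```

With only one or two atoms per residue the value is dominated by single displacements, so the threshold acts much like a per-atom distance cutoff.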




from Bio.SVDSuperimposer import SVDSuperimposer
from numpy import array

def filt_rms(resids, refchain, cmpchain, thresh=0.5):
    sup = SVDSuperimposer()
    for res in resids:
        refres = refchain[res]
        cmpres = cmpchain[res]
        coord1 = [refres['CA'].get_coord()]
        coord2 = [cmpres['CA'].get_coord()]
        if refres.has_id('CB') and cmpres.has_id('CB'):
            # Not glycine
            coord1.append(refres['CB'].get_coord())
            coord2.append(cmpres['CB'].get_coord())
        sup.set(array(coord1), array(coord2))
        rmsd = sup.get_init_rms()
        if rmsd >= thresh:
            yield res




Figure: Superimposed structures, with selected deviating residues




Further reading

Biopython tutorial:
http://biopython.org/DIST/docs/tutorial/Tutorial.html

Biopython wiki:
http://biopython.org/

This presentation:
http://www.slideshare.net/etalevich/biopython-programming-workshop-at-uga





Thanks

'Preciate it.

Gracias

