10/7/2013 BCHB524 - 2013 - Edwards Sequence File Parsing using Biopython BCHB524 2013 Lecture 11

Sequence File Parsing using Biopython

Jan 21, 2016

reegan

Page 1: Sequence File Parsing using Biopython

10/7/2013 BCHB524 - 2013 - Edwards

Sequence File Parsing using Biopython

BCHB524 2013

Lecture 11

Page 2: Sequence File Parsing using Biopython

Review

Modules in the standard Python library:
  sys, os, os.path – access files, program environment
  zipfile, gzip – access compressed files directly
  urllib – access web resources (URLs) as files
  csv – read delimited, line-based records from files

Plus lots, lots more.

Page 3: Sequence File Parsing using Biopython

BioPython

Additional modules that make many common bioinformatics tasks easier:
  File parsing (many formats) and web retrieval
  Formal biological alphabets, codon tables, etc.
  Lots of other stuff…

Have to install separately:
  Not part of standard Python, or Enthought

biopython.org

Page 4: Sequence File Parsing using Biopython

Biopython: Fasta format

Most common biological sequence data format.

Header/description line:
  >accession description
  Multi-accession sometimes represented as accession1|accession2|accession3
  Lots of variations, no standardization
  No prescribed format for the description

Other lines:
  Sequence, one chunk per line.
  Usually all lines, except the last, are the same length.
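
To make the format concrete, here is a minimal sketch of a FASTA reader using only the standard library (Python 3 syntax). The names `read_fasta` and `split_header` and the sample records are made up for illustration; Bio.SeqIO does this job properly and should be preferred in practice.

```python
import io

def split_header(header):
    """Split a header into (accession, description) at the first whitespace."""
    parts = header.split(None, 1)
    accession = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    return accession, description

def read_fasta(handle):
    """Yield (accession, description, sequence) tuples from a FASTA handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one
            if header is not None:
                yield split_header(header) + ("".join(chunks),)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield split_header(header) + ("".join(chunks),)

# Made-up sample data: one multi-accession header, one bare accession
sample = io.StringIO(
    ">acc1|acc2 hypothetical protein\n"
    "MKTAYIAKQR\n"
    "QISFVKSHFS\n"
    ">acc3\n"
    "MSE\n"
)
records = list(read_fasta(sample))
for accession, description, sequence in records:
    # Multi-accession headers can be split further on "|"
    print(accession.split("|"), description, len(sequence))
```

Because the description has no prescribed format, the sketch only separates the first whitespace-delimited token from the rest and leaves further interpretation to the caller.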

Page 5: Sequence File Parsing using Biopython

BioPython: Bio.SeqIO

import Bio.SeqIO
import sys

# Check the input
if len(sys.argv) < 2:
    print >>sys.stderr, "Please provide a sequence file"
    sys.exit(1)

# Get the sequence filename
seqfilename = sys.argv[1]

# Open the FASTA file and iterate through its sequences
seqfile = open(seqfilename)
for seq_record in Bio.SeqIO.parse(seqfile, "fasta"):
    # Print out the various elements of the SeqRecord
    print "\n------NEW SEQRECORD------\n"
    print "seq_record.id:\n\t", seq_record.id
    print "seq_record.description:\n\t", seq_record.description
    print "seq_record.seq:\n\t", seq_record.seq
seqfile.close()
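
The slide code uses Python 2 print statements. Current Biopython releases require Python 3, so the following is a sketch of the same loop in Python 3 syntax. It parses a small in-memory FASTA string (made-up records) instead of a file named on the command line, so that it runs on its own.

```python
from io import StringIO
from Bio import SeqIO

# Two made-up records standing in for a real FASTA file
fasta_text = (
    ">sp1 first toy protein\n"
    "MKTAYIAK\n"
    "QRQISFVK\n"
    ">sp2 second toy protein\n"
    "MSE\n"
)

parsed = []
for seq_record in SeqIO.parse(StringIO(fasta_text), "fasta"):
    # Print out the various elements of the SeqRecord
    print("\n------NEW SEQRECORD------\n")
    print("seq_record.id:\n\t", seq_record.id)
    print("seq_record.description:\n\t", seq_record.description)
    print("seq_record.seq:\n\t", seq_record.seq)
    parsed.append((seq_record.id, seq_record.description, str(seq_record.seq)))
```

For FASTA input, `seq_record.description` holds the whole header line, including the accession, while `seq_record.id` holds only the first token.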

Page 6: Sequence File Parsing using Biopython

Biopython: Other formats

GenBank format:
  From NCBI, also the format for RefSeq sequences

UniProt/SwissProt flat-file format:
  From UniProt, for SwissProt and TrEMBL

UniProt-XML format:
  From UniProt, for SwissProt and TrEMBL

Use the gzip module to handle compressed sequence databases.

Page 7: Sequence File Parsing using Biopython

BioPython: Bio.SeqIO

import Bio.SeqIO
import sys

# Check the input
if len(sys.argv) < 2:
    print >>sys.stderr, "Please provide a sequence file"
    sys.exit(1)

# Get the sequence filename
seqfilename = sys.argv[1]

# Open the GenBank file and iterate through its sequences
seqfile = open(seqfilename)
for seq_record in Bio.SeqIO.parse(seqfile, "genbank"):
    # Print out the various elements of the SeqRecord
    print "\n------NEW SEQRECORD------\n"
    print "seq_record.id:\n\t", seq_record.id
    print "seq_record.description:\n\t", seq_record.description
    print "seq_record.seq:\n\t", seq_record.seq
seqfile.close()

Page 8: Sequence File Parsing using Biopython

BioPython: Bio.SeqIO

import Bio.SeqIO
import sys

# Check the input
if len(sys.argv) < 2:
    print >>sys.stderr, "Please provide a sequence file"
    sys.exit(1)

# Get the sequence filename
seqfilename = sys.argv[1]

# Open the SwissProt flat file and iterate through its sequences
seqfile = open(seqfilename)
for seq_record in Bio.SeqIO.parse(seqfile, "swiss"):
    # Print out the various elements of the SeqRecord
    print "\n------NEW SEQRECORD------\n"
    print "seq_record.id:\n\t", seq_record.id
    print "seq_record.description:\n\t", seq_record.description
    print "seq_record.seq:\n\t", seq_record.seq
seqfile.close()

Page 9: Sequence File Parsing using Biopython

BioPython: Bio.SeqIO

import Bio.SeqIO
import sys

# Check the input
if len(sys.argv) < 2:
    print >>sys.stderr, "Please provide a sequence file"
    sys.exit(1)

# Get the sequence filename
seqfilename = sys.argv[1]

# Open the UniProt-XML file and iterate through its sequences
seqfile = open(seqfilename)
for seq_record in Bio.SeqIO.parse(seqfile, "uniprot-xml"):
    # Print out the various elements of the SeqRecord
    print "\n------NEW SEQRECORD------\n"
    print "seq_record.id:\n\t", seq_record.id
    print "seq_record.description:\n\t", seq_record.description
    print "seq_record.seq:\n\t", seq_record.seq
seqfile.close()

Page 10: Sequence File Parsing using Biopython

BioPython: Bio.SeqIO and gzip

import Bio.SeqIO
import sys
import gzip

# Check the input
if len(sys.argv) < 2:
    print >>sys.stderr, "Please provide a sequence file"
    sys.exit(1)

# Get the sequence filename
seqfilename = sys.argv[1]

# Open the FASTA file and iterate through its sequences
seqfile = gzip.open(seqfilename)
for seq_record in Bio.SeqIO.parse(seqfile, "fasta"):
    # Print out the various elements of the SeqRecord
    print "\n------NEW SEQRECORD------\n"
    print "seq_record.id:\n\t", seq_record.id
    print "seq_record.description:\n\t", seq_record.description
    print "seq_record.seq:\n\t", seq_record.seq
seqfile.close()
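
One detail worth knowing when running this under Python 3: `gzip.open` defaults to binary mode ("rb") and returns bytes, whereas text-oriented parsing generally expects strings, so text mode ("rt") has to be requested explicitly. The sketch below uses only the standard library; the file name and the records are made up for illustration.

```python
import gzip
import os
import tempfile

# Made-up FASTA content for the demonstration
fasta_text = (
    ">toy1 made-up record\n"
    "MKTAYIAK\n"
    ">toy2 another made-up record\n"
    "MSE\n"
)

# Write a small gzip-compressed FASTA file to a temporary directory
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "toy.fasta.gz")
with gzip.open(path, "wt") as out:
    out.write(fasta_text)

# "rt" yields decoded text lines; the default "rb" would yield bytes
with gzip.open(path, "rt") as seqfile:
    headers = [line[1:].strip() for line in seqfile if line.startswith(">")]

print(headers)

# Clean up the temporary file and directory
os.remove(path)
os.rmdir(tmpdir)
```

A handle opened this way can be passed to a parser in place of the handle returned by a plain `open`.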

Page 11: Sequence File Parsing using Biopython

What about the other "stuff"?

BioPython makes it easy to get access to non-sequence information stored in "rich" sequence databases:
  Annotations
  Cross-references
  Sequence features
  Literature

Page 12: Sequence File Parsing using Biopython

BioPython: Bio.SeqIO

import Bio.SeqIO
import sys
import gzip

# Check the input
if len(sys.argv) < 2:
    print >>sys.stderr, "Please provide a sequence file"
    sys.exit(1)

# Get the sequence filename
seqfilename = sys.argv[1]

# Open the gzip-compressed UniProt-XML file and look at its first record
seqfile = gzip.open(seqfilename)
for seq_record in Bio.SeqIO.parse(seqfile, "uniprot-xml"):
    # What else is available in the SeqRecord?
    print "\n------NEW SEQRECORD------\n"
    print "repr(seq_record)\n\t", repr(seq_record)
    print "dir(seq_record)\n\t", dir(seq_record)
    break
seqfile.close()

Page 13: Sequence File Parsing using Biopython

BioPython: Bio.SeqRecord

import Bio.SeqIO
import sys
import gzip

# Check the input
if len(sys.argv) < 2:
    print >>sys.stderr, "Please provide a sequence file"
    sys.exit(1)

# Get the sequence filename
seqfilename = sys.argv[1]

# Open the gzip-compressed UniProt-XML file and look at its first record
seqfile = gzip.open(seqfilename)
for seq_record in Bio.SeqIO.parse(seqfile, "uniprot-xml"):
    # Print out the various elements of the SeqRecord
    print "\n------NEW SEQRECORD------\n"
    print "seq_record.annotations\n\t", seq_record.annotations
    print "seq_record.features\n\t", seq_record.features
    print "seq_record.dbxrefs\n\t", seq_record.dbxrefs
    print "seq_record.format('fasta')\n", seq_record.format('fasta')
    break
seqfile.close()
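
To see what these attributes look like without downloading a database, a SeqRecord can be built by hand. This is only a sketch in Python 3 syntax: the accession "TOY001", the annotation value, and the cross-reference "PDB:0XXX" are all made up, and a record parsed from a real UniProt-XML file will carry far more information.

```python
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# A hand-built record with made-up (hypothetical) identifiers
record = SeqRecord(
    Seq("MKTAYIAKQR"),
    id="TOY001",
    name="TOY001",
    description="made-up protein for illustration",
)

# annotations is a dictionary; dbxrefs and features are lists
record.annotations["organism"] = "Homo sapiens"
record.dbxrefs.append("PDB:0XXX")  # hypothetical cross-reference

print("annotations:", record.annotations)
print("features:", record.features)
print("dbxrefs:", record.dbxrefs)

# format('fasta') renders the record as FASTA text
fasta_out = record.format("fasta")
print(fasta_out)
```

Note that the FASTA rendering keeps only the id, description, and sequence; the annotations and cross-references are not written out in that format.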

Page 14: Sequence File Parsing using Biopython

BioPython: Random access

Sometimes you want to access the sequence records "randomly"…
  …to pick out the ones you want (by accession)

Why not make a dictionary, with accessions as keys and SeqRecords as values?
  Use SeqIO.to_dict(…)

What if you don't want to hold it all in memory?
  Use SeqIO.index(…)

Page 15: Sequence File Parsing using Biopython

BioPython: Bio.SeqIO.to_dict(…)

import Bio.SeqIO
import sys

# Check the input
if len(sys.argv) < 2:
    print >>sys.stderr, "Please provide a sequence file"
    sys.exit(1)

# Get the sequence filename
seqfilename = sys.argv[1]

# Open the sequence database
seqfile = open(seqfilename)

# Use to_dict to make a dictionary of sequence records
sprot_dict = Bio.SeqIO.to_dict(Bio.SeqIO.parse(seqfile, "uniprot-xml"))

# Close the file
seqfile.close()

# Access and print a sequence record
print sprot_dict['Q6GZV8']
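
The same idea on a self-contained toy input, as a sketch in Python 3 syntax. The accessions "TOY001" and "TOY002" are made up, and FASTA is used instead of UniProt-XML only to keep the sample short.

```python
from io import StringIO
from Bio import SeqIO

# Made-up records standing in for a sequence database
fasta_text = (
    ">TOY001 first made-up protein\n"
    "MKTAYIAK\n"
    ">TOY002 second made-up protein\n"
    "MSE\n"
)

# to_dict reads every record into memory, keyed by record id
toy_dict = SeqIO.to_dict(SeqIO.parse(StringIO(fasta_text), "fasta"))

print(sorted(toy_dict))
print(toy_dict["TOY002"].seq)
```

Since the whole dictionary lives in memory, this approach suits small to medium files; `to_dict` also raises a ValueError if two records share the same id.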

Page 16: Sequence File Parsing using Biopython

BioPython: Bio.SeqIO.index(…)

import Bio.SeqIO
import sys

# Check the input
if len(sys.argv) < 2:
    print >>sys.stderr, "Please provide a sequence file"
    sys.exit(1)

# Get the sequence filename
seqfilename = sys.argv[1]

# Use index to make an out-of-core dict of seq records
sprot_index = Bio.SeqIO.index(seqfilename, "uniprot-xml")

# Access and print a sequence record
print sprot_index['Q6GZV8']
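
A runnable sketch of the same pattern in Python 3 syntax. Unlike `to_dict`, `index` takes a filename rather than an open handle, so the example first writes a small temporary FASTA file. The file name and accessions are made up for illustration.

```python
import os
import tempfile
from Bio import SeqIO

# Write a small made-up FASTA file to index
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "toy.fasta")
with open(path, "w") as out:
    out.write(">TOY001 first\nMKTAYIAK\n>TOY002 second\nMSE\n")

# index() scans the file once and records where each record starts;
# records are parsed from disk only when they are looked up
toy_index = SeqIO.index(path, "fasta")

n_records = len(toy_index)
has_first = "TOY001" in toy_index
second_seq = str(toy_index["TOY002"].seq)
print(n_records, has_first, second_seq)

# Release the underlying file handle, then clean up
toy_index.close()
os.remove(path)
os.rmdir(tmpdir)
```

The returned object behaves like a read-only dictionary, so code written against `to_dict` can usually switch to `index` with little change.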

Page 17: Sequence File Parsing using Biopython

Exercises

Read through and try the examples from Chapters 2-5 of BioPython's Tutorial.

Download human proteins from RefSeq and compute amino-acid frequencies for the (RefSeq) human proteome.
  Which amino acid occurs the most? The least?
  Hint: access RefSeq human proteins from ftp://ftp.ncbi.nih.gov/refseq

Download human proteins from SwissProt and compute amino-acid frequencies for the SwissProt human proteome.
  Which amino acid occurs the most? The least?
  Hint: access SwissProt human proteins from http://www.uniprot.org/downloads -> "Taxonomic divisions"

How similar are the human amino-acid frequencies in RefSeq and SwissProt?
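
As a starting point for the frequency computation, here is a minimal sketch using `collections.Counter` from the standard library (Python 3 syntax). The function name `aa_frequencies` and the two toy sequences are made up; in the exercise, the sequences would instead come from the `seq` attribute of the records produced by Bio.SeqIO.parse.

```python
from collections import Counter

def aa_frequencies(sequences):
    """Return a dict mapping each residue letter to its relative frequency."""
    counts = Counter()
    for seq in sequences:
        # str() lets this accept plain strings or Seq-like objects
        counts.update(str(seq).upper())
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()}

# Two made-up toy sequences, not real proteins
toy_sequences = ["MKLL", "LLAK"]
freqs = aa_frequencies(toy_sequences)

# Report residues from most to least frequent
for aa in sorted(freqs, key=freqs.get, reverse=True):
    print(aa, round(freqs[aa], 3))

most_common = max(freqs, key=freqs.get)
```

Accumulating counts record by record keeps memory use small, which matters when the input is a whole proteome.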