BioPython Workshop
Gershon Celniker
Tel Aviv University
Dec 28, 2015
Introduction
• The Biopython Project is an international association of developers of freely available Python (http://www.python.org) tools for computational molecular biology.
• Python is an object-oriented, interpreted, flexible language that is becoming increasingly popular for scientific computing.
• Python is easy to learn, has a very clear syntax and can easily be extended with modules.
• The Biopython web site (http://www.biopython.org) provides an online resource for modules, scripts, and web links for developers of Python-based software for bioinformatics use and research.
• The goal of Biopython is to make it as easy as possible to use Python for bioinformatics by creating high-quality, reusable modules and classes.
• Biopython features include parsers for various bioinformatics file formats (BLAST, ClustalW, FASTA, GenBank, ...) and access to online services (NCBI, ExPASy, ClustalW, DSSP, MSMS, ...).
• Basically, we just like to program in Python and want to make it as easy as possible to use Python for bioinformatics by creating high-quality, reusable modules and scripts.
Introduction
• The full tutorial is located here: http://biopython.org/DIST/docs/tutorial/Tutorial.html
• Example files are located here: https://github.com/biopython/biopython/tree/master/Doc/examples
BioPython, let's try it!
FASTA format
http://en.wikipedia.org/wiki/FASTA_format
FASTA is pronounced "fast A", and stands for "FAST-All", because it works with any alphabet, an extension of "FAST-P" (protein) and "FAST-N" (nucleotide) alignment.
Let's write our first parsing script
Parsing sequence file formats
The example data come from a search for Cypripedioideae (this is the subfamily of lady slipper orchids). This search gave me only 94 hits, which I saved as a FASTA file, ls_orchid.fasta. The file begins:
>gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA
CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTGAATCCGGAGGACCGGTGTACTCAGCTCACCGGGGGCATTGCTCCCGTGGTGACCCTGATTTGTTGTTGGG
Notice that the FASTA format does not specify the alphabet, so Bio.SeqIO has defaulted to the rather generic SingleLetterAlphabet() rather than something DNA specific.
Let's write our first parsing script
Output:
gi|2765658|emb|Z78533.1|CIZ78533
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC', SingleLetterAlphabet())
740
...
gi|2765564|emb|Z78439.1|PBZ78439
Seq('CATTGTTGAGATCACATAATAATTGATCGAGTTAATCTGGAGGATCTGTTTACT...GCC', SingleLetterAlphabet())
592
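The script that produced this output is not preserved in the text. A minimal sketch in the style of the Biopython tutorial is given here. To stay self-contained it parses a short in-memory FASTA string (a hypothetical stand-in for ls_orchid.fasta, with shortened sequences), so the lengths it prints differ from the 740 and 592 above. Recent Biopython releases (1.78 and later) no longer attach an alphabet, so there the Seq representation prints without SingleLetterAlphabet().

```python
from io import StringIO

from Bio import SeqIO

# Hypothetical stand-in for ls_orchid.fasta: two records, sequences shortened.
FASTA_TEXT = """\
>gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA
CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG
>gi|2765564|emb|Z78439.1|PBZ78439
CATTGTTGAGATCACATAATAATTGATCGAGTTAATCTGGAGGATCTGTTTACT
"""

# With the real file, pass "ls_orchid.fasta" instead of StringIO(FASTA_TEXT).
records = list(SeqIO.parse(StringIO(FASTA_TEXT), "fasta"))

for record in records:
    print(record.id)         # identifier: first word of the '>' line
    print(repr(record.seq))  # Seq object holding the sequence
    print(len(record))       # sequence length
```

SeqIO.parse returns an iterator of SeqRecord objects; wrapping it in list() is only for convenience here and is best avoided for very large files.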
Sequence slicing
Output:
gi|2765658|emb|Z78533.1|CIZ78533
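The slicing code from this slide is not preserved. As an illustrative sketch only, the example here slices the first 54 bases of the Z78533.1 record shown earlier; the variable names are hypothetical. A Seq object supports the same start:stop:step slicing as a Python string, and slices are again Seq objects.

```python
from Bio.Seq import Seq

# First 54 bases of the Z78533.1 record shown earlier in these slides.
my_seq = Seq("CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG")

first_ten = my_seq[:10]       # slicing a Seq returns a new Seq
every_third = my_seq[0:12:3]  # start:stop:step, as for Python strings
reversed_seq = my_seq[::-1]   # a step of -1 reverses the sequence
first_base = my_seq[0]        # indexing returns a one-character string

print(first_ten)
print(every_third)
print(reversed_seq)
print(first_base)
```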
GC content exercise
Output:
My seq length:32
G:9
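The exercise code is not preserved. One possible solution is sketched here, assuming the slide used the 32-base example sequence from the Biopython tutorial (it has length 32 and 9 G's, which matches the output, but this is an assumption). The GC percentage is computed by counting, which works across Biopython versions; Bio.SeqUtils also ships a GC helper, but its name depends on the installed release.

```python
from Bio.Seq import Seq

# 32-base example from the Biopython tutorial (assumed; not confirmed by the slide).
my_seq = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC")

length = len(my_seq)
g_count = my_seq.count("G")
c_count = my_seq.count("C")

# GC content as a percentage of all bases.
gc_percent = 100.0 * (g_count + c_count) / length

print("My seq length:%i" % length)
print("G:%i" % g_count)
print("GC content: %.3f%%" % gc_percent)
```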
Transcription
Output:
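The code and output of this slide are not preserved. A minimal sketch of transcription with Bio.Seq follows, using the coding-strand example from the Biopython tutorial (an assumption about what the slide showed). Biopython's transcribe() works on the coding strand and simply replaces T with U.

```python
from Bio.Seq import Seq

# Coding strand example taken from the Biopython tutorial (assumed).
coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")

messenger_rna = coding_dna.transcribe()         # T -> U on the coding strand
back_to_dna = messenger_rna.back_transcribe()   # U -> T, recovers the DNA
template_dna = coding_dna.reverse_complement()  # the template strand

print(messenger_rna)
print(back_to_dna)
print(template_dna)
```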
Translation
Output:
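The code and output of this slide are not preserved. The sketch here translates the same tutorial coding sequence with the default (standard) genetic code; the choice of sequence is an assumption. Stop codons appear as "*" unless to_stop=True is passed, in which case translation ends at the first in-frame stop.

```python
from Bio.Seq import Seq

# Coding strand example taken from the Biopython tutorial (assumed).
coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")

protein = coding_dna.translate()                      # standard genetic code
protein_to_stop = coding_dna.translate(to_stop=True)  # stop at first stop codon

print(protein)          # MAIVMGR*KGAR*
print(protein_to_stop)  # MAIVMGR
```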
Translation tables
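The content of this slide is not preserved. As a sketch, the example here looks up two NCBI codon tables through Bio.Data.CodonTable and shows how the choice of table changes a translation; the sequence is the tutorial example and is an assumption. In the vertebrate mitochondrial code TGA encodes tryptophan rather than a stop.

```python
from Bio.Data import CodonTable
from Bio.Seq import Seq

standard_table = CodonTable.unambiguous_dna_by_name["Standard"]
mito_table = CodonTable.unambiguous_dna_by_name["Vertebrate Mitochondrial"]

print(standard_table.stop_codons)  # stop codons of the standard code
print(mito_table.stop_codons)      # stop codons of the mitochondrial code

# Coding strand example taken from the Biopython tutorial (assumed).
coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")

standard_protein = coding_dna.translate()
mito_protein = coding_dna.translate(table="Vertebrate Mitochondrial")

print(standard_protein)  # MAIVMGR*KGAR*
print(mito_protein)      # MAIVMGRWKGAR*
```

Printing a table object itself (for example print(standard_table)) shows the full codon grid.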
Translation – continued
Retrieving data from the net
Output:
O23729
CHS3_BROFI
RecName: Full=Chalcone synthase 3; EC=2.3.1.74; AltName: Full=Naringenin-chalcone synthase 3;
Seq('MAPAMEEIRQAQRAEGPAAVLAIGTSTPPNALYQADYPDYYFRITKSEHLTELK...GAE', ProteinAlphabet())
Length 394
['Acyltransferase', 'Flavonoid biosynthesis', 'Transferase']
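The code behind this output is not preserved. The fields shown (accession, entry name, description, sequence, length, keywords) are consistent with downloading Swiss-Prot entry O23729 through Bio.ExPASy and parsing it with Bio.SeqIO, as sketched here. The helper names fetch_swissprot_record and summarize are hypothetical, and the download is left commented out because it needs network access.

```python
from Bio import ExPASy, SeqIO


def fetch_swissprot_record(accession):
    """Download one Swiss-Prot entry from ExPASy and parse it into a SeqRecord."""
    handle = ExPASy.get_sprot_raw(accession)  # contacts the ExPASy server
    try:
        return SeqIO.read(handle, "swiss")
    finally:
        handle.close()


def summarize(record):
    """Collect the fields shown in the output above as a list of strings."""
    return [
        record.id,
        record.name,
        record.description,
        repr(record.seq),
        "Length %i" % len(record),
        str(record.annotations.get("keywords", [])),
    ]


# Usage (not run here because it requires network access):
# for line in summarize(fetch_swissprot_record("O23729")):
#     print(line)
```

Splitting the download from the formatting keeps the formatting step usable on records read from a local file as well.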
Parsing data from fasta – part B
Alignment
Blast
Plots
Plots - result
Going 3D: The PDB module
Bio.