2015 bioinformatics python_strings_wim_vancriekinge

Post on 11-Jan-2017


FBW06-10-2015

Wim Van Criekinge

Bioinformatics.be

Overview

What is Python?
Why Python 4 Bioinformatics?
How to Python
IDE: Eclipse & PyDev / Athena
Code Sharing: Git(hub)
Strings
Regular expressions

Python

• Programming languages are overrated
– If you are going into bioinformatics you will probably learn/need multiple
– If you know one you know 90% of a second
• Choice does matter but it matters far less than people think it does
• Why Python?
– Lets you start writing useful programs asap
– Built-in libraries, plus add-on packages incl. BioPython
– Free, most platforms, widely (scientifically) used
• Versus Perl?
– Incredibly similar
– Consistent syntax, indentation

Version 2.7 and 3.4 on athena.ugent.be

Eclipse IDE Components

Menubars: Full drop-down menus plus quick access to common functions
Editor Pane: This is where we edit our source code
Perspective Switcher: We can switch between various perspectives here
Outline Pane: This contains a hierarchical view of a source file
Package Explorer Pane: This is where our projects/files are listed
Miscellaneous Pane: Various components can appear in this pane; typically this contains a console and a list of compiler problems
Task List Pane: This contains a list of "tasks" to complete

Where is the workspace ?

GitHub: Hosted GIT

• Largest open source git hosting site
• Public and private options
• User-centric rather than project-centric
• http://github.ugent.be (use your UGent login and password)
– Accept invitation from Bioinformatics-I-2015
– URI: https://github.ugent.be/Bioinformatics-I-2015/Python.git

Run Install.py (is BioPython installed ?)

import pip
import sys
import platform
import webbrowser

print("Python " + platform.python_version() + " installed packages:")

installed_packages = pip.get_installed_distributions()
installed_packages_list = sorted(["%s==%s" % (i.key, i.version) for i in installed_packages])
print(*installed_packages_list, sep="\n")
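The question in the slide title (is BioPython installed?) can also be answered without listing every package. A minimal sketch, assuming BioPython's top-level import name is `Bio`; the helper name `is_installed` is made up for this illustration:

```python
import importlib.util

def is_installed(module_name):
    """Return True if a top-level module with this name can be imported."""
    return importlib.util.find_spec(module_name) is not None

# "Bio" is the import name used by BioPython
print("BioPython installed:", is_installed("Bio"))
```

This only checks that the module can be found; it does not import it or report its version.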

Control Structures

if condition:
    statements
[elif condition:
    statements] ...
else:
    statements

while condition:
    statements

for var in sequence:
    statements

break
continue

range

The range function specifies a range of integers:
range(start, stop) - the integers between start (inclusive) and stop (exclusive)

It can also accept a third value specifying the change between values.
range(start, stop, step) - the integers between start (inclusive) and stop (exclusive) by step

Example:
for x in range(5, 0, -1):
    print(x)
print("Blastoff!")

Output:
5
4
3
2
1
Blastoff!
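To see which integers a given call produces, the range can be turned into a list. A small sketch for Python 3 (where range is lazy); the variable names are illustrative only:

```python
# Wrap range() in list() to see the values it generates
forward = list(range(2, 6))
by_three = list(range(0, 10, 3))
countdown = list(range(5, 0, -1))

print(forward)     # [2, 3, 4, 5]
print(by_three)    # [0, 3, 6, 9]
print(countdown)   # [5, 4, 3, 2, 1]
```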

Exercise: How would we print the "99 Bottles of Beer" song?

Grouping Indentation

In Python:

for i in range(20):
    if i%3 == 0:
        print(i)
        if i%5 == 0:
            print("Bingo!")
    print("---")

Output:
0
Bingo!
---
---
---
3
---
---
---
6
---
---
---
9
---
---
---
12
---
---
---
15
Bingo!
---
---
---
18
---
---

while

while loop: Executes a group of statements as long as a condition is True.
Good for indefinite loops (repeat an unknown number of times).

Syntax:
while condition:
    statements

Example:
number = 1
while number < 200:
    print(number, end=" ")
    number = number * 2

Output:
1 2 4 8 16 32 64 128

if

if statement: Executes a group of statements only if a certain condition is true. Otherwise, the statements are skipped.

Syntax:
if condition:
    statements

Example:
gpa = 3.4
if gpa > 2.0:
    print("Your application is accepted.")

if/else

if/else statement: Executes one block of statements if a certain condition is True, and a second block of statements if it is False.

Syntax:
if condition:
    statements
else:
    statements

Example:
gpa = 1.4
if gpa > 2.0:
    print("Welcome to Mars University!")
else:
    print("Your application is denied.")

Multiple conditions can be chained with elif ("else if"):
if condition:
    statements
elif condition:
    statements
else:
    statements
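A possible elif chain built on the GPA example. This is only a sketch: the 3.5 threshold, the messages and the function name `admission_decision` are invented for illustration.

```python
def admission_decision(gpa):
    """Map a GPA to a message using an if/elif/else chain."""
    if gpa >= 3.5:          # hypothetical honours threshold
        return "Accepted with honors"
    elif gpa > 2.0:         # threshold taken from the slide example
        return "Accepted"
    else:
        return "Denied"

print(admission_decision(3.4))   # Accepted
print(admission_decision(1.4))   # Denied
```

Only the first branch whose condition is True runs, so the order of the tests matters.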

Logic

Many logical expressions use relational operators:

Operator  Meaning                    Example       Result
==        equals                     1 + 1 == 2    True
!=        does not equal             3.2 != 2.5    True
<         less than                  10 < 5        False
>         greater than               10 > 5        True
<=        less than or equal to      126 <= 100    False
>=        greater than or equal to   5.0 >= 5.0    True

Logical expressions can be combined with logical operators:

Operator  Example              Result
and       9 != 6 and 2 < 3     True
or        2 == 3 or -1 < 5     True
not       not 7 > 0            False

PI-thon.py

Introduction

Buffon's Needle is one of the oldest problems in the field of geometrical probability. It was first stated in 1777. It involves dropping a needle on a lined sheet of paper and determining the probability of the needle crossing one of the lines on the page. The remarkable result is that the probability is directly related to the value of pi.

https://www.youtube.com/watch?v=Vws1jvMbs64&feature=youtu.be
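The link between needle drops and pi can be tried out with a small simulation. The following is a sketch, not the course's PI-thon.py: it assumes a needle no longer than the line spacing, for which the crossing probability is 2 * needle_len / (pi * line_gap), and the function and parameter names are made up here.

```python
import math
import random

def estimate_pi(n_needles, needle_len=1.0, line_gap=1.0, seed=None):
    """Estimate pi by dropping needles on paper ruled with parallel lines.

    Assumes needle_len <= line_gap, so that
    P(cross) = 2 * needle_len / (pi * line_gap).
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_needles):
        # Distance from the needle's centre to the nearest line
        centre = rng.uniform(0, line_gap / 2)
        # Acute angle between needle and lines (math.pi is used only to sample it)
        angle = rng.uniform(0, math.pi / 2)
        # The needle crosses a line if its half-length reaches that line
        if centre <= (needle_len / 2) * math.sin(angle):
            hits += 1
    # Solve P(cross) = hits / n_needles for pi
    return 2 * needle_len * n_needles / (line_gap * hits)

print(estimate_pi(200000, seed=1))
```

The estimate converges slowly: the error shrinks roughly with the square root of the number of needles.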


Strings

string: A sequence of text characters in a program. Strings start and end with quotation mark " or apostrophe ' characters.
Examples:
"hello"
"This is a string"
"This, too, is a string. It can be very long!"

A string in ordinary quotes may not span across multiple lines or contain an unescaped " character.
"This is not
a legal String."
"This is not a "legal" String either."

A string can represent special characters by preceding them with a backslash.
\t tab character
\n new line character
\" quotation mark character
\\ backslash character

Example: "Hello\tthere\nHow are you?"

Strings

Indexes

Characters in a string are numbered with indexes starting at 0:
Example:
name = "P. Diddy"

index      0  1  2  3  4  5  6  7
character  P  .     D  i  d  d  y

Accessing an individual character of a string:
variableName[index]

Example:
print(name, "starts with", name[0])

Output:
P. Diddy starts with P

Strings

• "hello"+"world"    "helloworld"       # concatenation
• "hello"*3          "hellohellohello"  # repetition
• "hello"[0]         "h"                # indexing
• "hello"[-1]        "o"                # (from end)
• "hello"[1:4]       "ell"              # slicing
• len("hello")       5                  # size
• "hello" < "jello"  True               # comparison
• "e" in "hello"     True               # search
• "escapes: \n etc, \033 etc, \if etc"
• 'single quotes' """triple quotes""" r"raw strings"
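The operations listed above can be checked directly in an interpreter. A short sketch in Python 3; the variable name `s` is arbitrary:

```python
s = "hello"
print(s + "world")      # helloworld
print(s * 3)            # hellohellohello
print(s[0], s[-1])      # h o
print(s[1:4])           # ell
print(len(s))           # 5
print(s < "jello")      # True
print("e" in s)         # True
print(r"raw\n")         # raw\n  (the backslash is kept literally)
```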

String properties

len(string) - number of characters in a string (including spaces)
str.lower(string) - lowercase version of a string
str.upper(string) - uppercase version of a string

Example:
name = "Martin Douglas Stepp"
length = len(name)
big_name = str.upper(name)
print(big_name, "has", length, "characters")

Output:
MARTIN DOUGLAS STEPP has 20 characters

a.replace
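The slide only names `replace`. A short sketch of what the call does, using a DNA-to-RNA conversion as an assumed example (the variable names are illustrative):

```python
dna = "ATGGCCTAA"
rna = dna.replace("T", "U")   # returns a new string; dna itself is unchanged
print(rna)                    # AUGGCCUAA
print(dna)                    # ATGGCCTAA
```

Strings are immutable, so `replace` never modifies the original; the result has to be assigned to a name.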

Text processing

text processing: Examining, editing, formatting text. Often uses loops that examine the characters of a string one by one.

A for loop can examine each character in a string in sequence.

Example:
for c in "booyah":
    print(c)

Output:
b
o
o
y
a
h

Strings and numbers

ord(text) - converts a one-character string into a number.
Example: ord("a") is 97, ord("b") is 98, ...

Characters map to numbers using standardized mappings such as ASCII and Unicode.

chr(number) - converts a number into a one-character string.
Example: chr(99) is "c"

Exercise: Write a program that performs a rotation cipher.
e.g. "Attack" when rotated by 1 becomes "buubdl"
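One possible approach to the exercise, offered as a sketch rather than the intended solution: lower-case the text (as the "buubdl" example suggests), shift letters with ord and chr, and wrap around after "z". The function name `rotate` is an assumption.

```python
def rotate(text, shift):
    """Rotate each letter by `shift` positions; output is lower case."""
    result = []
    for c in text.lower():
        if "a" <= c <= "z":
            # Position in the alphabet, shifted and wrapped around at 26
            offset = (ord(c) - ord("a") + shift) % 26
            result.append(chr(ord("a") + offset))
        else:
            result.append(c)   # leave spaces, digits and punctuation alone
    return "".join(result)

print(rotate("Attack", 1))   # buubdl
```

Rotating by the negative shift undoes the encoding, e.g. rotate("buubdl", -1) gives "attack".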

Lists

• Flexible arrays, not Lisp-like linked lists
• a = [99, "bottles of beer", ["on", "the", "wall"]]
• Same operators as for strings
– a+b, a*3, a[0], a[-1], a[1:], len(a)
• Item and slice assignment
– a[0] = 98
– a[1:2] = ["bottles", "of", "beer"]
  -> [98, "bottles", "of", "beer", ["on", "the", "wall"]]
– del a[-1] # -> [98, "bottles", "of", "beer"]

More List Operations

>>> a = list(range(5)) # [0,1,2,3,4] (in Python 3, range() must be wrapped in list())
>>> a.append(5)        # [0,1,2,3,4,5]
>>> a.pop()            # [0,1,2,3,4]
>>> a.insert(0, 42)    # [42,0,1,2,3,4]
>>> a.pop(0)           # [0,1,2,3,4]
>>> a.reverse()        # [4,3,2,1,0]
>>> a.sort()           # [0,1,2,3,4]

Dictionaries

• Hash tables, "associative arrays"
– d = {"duck": "eend", "water": "water"}
• Lookup:
– d["duck"] -> "eend"
– d["back"] # raises KeyError exception
• Delete, insert, overwrite:
– del d["water"] # {"duck": "eend"}
– d["back"] = "rug" # {"duck": "eend", "back": "rug"}
– d["duck"] = "duik" # {"duck": "duik", "back": "rug"}

More Dictionary Ops

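• Keys, values, items (in Python 3 these return view objects; wrap them in list() to get a list):
– d.keys() -> ["duck", "back"]
– d.values() -> ["duik", "rug"]
– d.items() -> [("duck","duik"), ("back","rug")]
• Presence check:
– "duck" in d -> True; "spam" in d -> False
– Python 2 only: d.has_key("duck") -> 1; d.has_key("spam") -> 0 (has_key was removed in Python 3)
• Values of any type; keys almost any
– {"name":"Guido", "age":43, ("hello","world"):1, 42:"yes", "flag": ["red","white","blue"]}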

Dictionary Details

• Keys must be immutable:
– numbers, strings, tuples of immutables
  • these cannot be changed after creation
– reason is hashing (fast lookup technique)
– not lists or other dictionaries
  • these types of objects can be changed "in place"
– no restrictions on values
• Keys will be listed in arbitrary order in Python 2.7 and 3.4 (from Python 3.7 on, dictionaries keep insertion order)
– again, because of hashing
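The rule about immutable keys can be seen in a few lines. A small sketch; the dictionary name `codon_counts` and its contents are invented for illustration:

```python
codon_counts = {}
codon_counts[("A", "U", "G")] = 1       # a tuple of strings is hashable

try:
    codon_counts[["A", "U", "G"]] = 1   # a list is mutable, so not hashable
except TypeError as err:
    print("TypeError:", err)

print(codon_counts)
```

Only the tuple key ends up in the dictionary; the list key is rejected before anything is stored.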

Reference Semantics

• Assignment manipulates references
– x = y does not make a copy of y
– x = y makes x reference the object y references
• Very useful; but beware!
• Example:
>>> a = [1, 2, 3]
>>> b = a
>>> a.append(4)
>>> print(b)
[1, 2, 3, 4]

Changing a Shared List

a = [1, 2, 3]    a -> [1, 2, 3]
b = a            a, b -> [1, 2, 3] (both names reference the same list)
a.append(4)      a, b -> [1, 2, 3, 4]

Changing an Integer

a = 1        a -> 1
b = a        a, b -> 1
a = a+1      a -> 2 (new int object created by add operator (1+1)); b -> 1 (old reference deleted by assignment (a=...))
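When an independent list is wanted, a copy has to be made explicitly. A minimal sketch contrasting a second reference with a shallow copy; the name `c` is introduced here for illustration:

```python
a = [1, 2, 3]
b = a            # second name for the same list object
c = list(a)      # shallow copy: a new list holding the same items

a.append(4)
print(b)                 # [1, 2, 3, 4]
print(c)                 # [1, 2, 3]
print(a is b, a is c)    # True False
```

The `is` operator tests whether two names reference the same object, which is exactly what the diagrams illustrate.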

Example Function

def gcd(a, b):
    "greatest common divisor"
    while a != 0:
        a, b = b%a, a # parallel assignment
    return b

>>> gcd.__doc__
'greatest common divisor'
>>> gcd(12, 20)
4


What is a regular expression?

• A regular expression (regex) is simply a way of describing text.

• Regular expressions are built up of small units (atoms) which can represent the type and number of characters in the text

• Regular expressions can be very broad (describing everything), or very narrow (describing only one pattern).

Why would you use a regex?

• Often you wish to test a string for the presence of a specific character, word, or phrase
– Examples
  • “Are there any letter characters in my string?”
  • “Is this a valid accession number?”
  • “Does my sequence contain a start codon (ATG)?”
• The EcoRI restriction enzyme cuts at the consensus sequence GAATTC.
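Two of the questions above, sketched with Python's re module. The DNA fragment is made up for this illustration and is not one of the course sequences:

```python
import re

sequence = "CCGAATTCATGAAAGAATTCTT"   # made-up DNA fragment

# Does the sequence contain a start codon?
has_start = bool(re.search("ATG", sequence))
print(has_start)    # True

# Where does the EcoRI recognition site GAATTC occur?
sites = [m.start() for m in re.finditer("GAATTC", sequence)]
print(sites)        # [2, 14]
```

For fixed strings like these, `"ATG" in sequence` would work as well; regular expressions pay off once the pattern allows alternatives or repeats.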

Real world problems

• Match IP Addresses, email addresses, URLs
• Match balanced sets of parentheses
• Substitute words
• Tokenize
• Validate
• Count
• Delete duplicates
• Natural Language processing

RE in Python

• Unleash the power - built-in re module
• Functions
– to compile patterns
  • compile
– to perform matches
  • match, search, findall, finditer
– to perform operations on match object
  • group, start, end, span
– to substitute
  • sub, subn
• Metacharacters

Quantifiers

• [ATGC]
• You can specify the number of times you want to see an atom. Examples
– \d* : Zero or more times
– \d+ : One or more times
– \d{3} : Exactly three times
– \d{4,7} : At least four, and not more than seven
– \d{3,} : Three or more times
• We could rewrite /\d\d\d-\d\d\d\d/ as:
– /\d{3}-\d{4}/
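The quantifiers can be tried with re.findall. A short sketch; the slides write patterns Perl-style between slashes, whereas Python takes them as raw strings, and the sample text is invented:

```python
import re

text = "Call 555-1234 or 5551234, ext. 42"

print(re.findall(r"\d{3}-\d{4}", text))   # ['555-1234']
print(re.findall(r"\d+", text))           # ['555', '1234', '5551234', '42']
print(re.findall(r"\d{4,7}", text))       # ['1234', '5551234']
```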

Anchors

• Anchors force a pattern match to a certain location
– ^ : match at beginning of string
– $ : match at end of string
– \b : match at word boundary (between \w and \W)
• Example:
– /^\d\d\d-\d\d\d\d$/ : matches only if the whole string is a phone number of the form ddd-dddd
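The effect of the anchors shows when the same pattern is run with and without them. A minimal sketch with invented test strings:

```python
import re

anchored = re.compile(r"^\d\d\d-\d\d\d\d$")
unanchored = re.compile(r"\d\d\d-\d\d\d\d")

print(bool(anchored.search("555-1234")))          # True
print(bool(anchored.search("call 555-1234")))     # False: text before the number
print(bool(unanchored.search("call 555-1234")))   # True

# \b only matches where a word character meets a non-word character
print(re.findall(r"\bcat\b", "cat concat cats cat."))   # ['cat', 'cat']
```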

Grouping, capturing

• You can group atoms together with parentheses
– /cat+/ matches cat, catt, cattt
– /(cat)+/ matches cat, catcat, catcatcat
• Use as many sets of parentheses as you need
• match.group()
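The difference between the two cat patterns, and what match.group() returns, in a short sketch. The phone string is an invented example:

```python
import re

print(bool(re.fullmatch(r"cat+", "cattt")))      # True: + applies to the t only
print(bool(re.fullmatch(r"cat+", "catcat")))     # False
print(bool(re.fullmatch(r"(cat)+", "catcat")))   # True: + applies to the group

m = re.search(r"(\d{3})-(\d{4})", "phone: 555-1234")
print(m.group())     # 555-1234  (the whole match)
print(m.group(1))    # 555       (first pair of parentheses)
print(m.group(2))    # 1234      (second pair of parentheses)
```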

Regex.py

import re

line = "Cats are smarter than dogs"

matchObj = re.match( r'(.*) are (.*?) .*', line, re.M|re.I)

if matchObj:
    print("matchObj.group() : ", matchObj.group())
    print("matchObj.group(1) : ", matchObj.group(1))
    print("matchObj.group(2) : ", matchObj.group(2))
else:
    print("No match!!")

phone = "2004-959-559 # This is Phone Number"

# Delete Python-style comments
num = re.sub(r'#.*$', "", phone)
print("Phone Num : ", num)

# Remove anything other than digits
num = re.sub(r'\D', "", phone)
print("Phone Num : ", num)

text = 'abbaaabbbbaaaaa'
pattern = 'ab'

for match in re.finditer(pattern, text):
    s = match.start()
    e = match.end()
    print('Found "%s" at %d:%d' % (text[s:e], s, e))



Exercise 1

1. Which of the following 4 sequences (seq1/2/3/4)

a) contain a “Galactokinase signature”?

b) How many of them?

http://us.expasy.org/prosite/

>SEQ1
MGNLFENCTHRYSFEYIYENCTNTTNQCGLIRNVASSIDVFHWLDVYISTTIFVISGILNFYCLFIALYT

YYFLDNETRKHYVFVLSRFLSSILVIISLLVLESTLFSESLSPTFAYYAVAFSIYDFSMDTLFFSYIMIS LITYFGVVHYNFYRRHVSLRSLYIILISMWTFSLAIAIPLGLYEAASNSQGPIKCDLSYCGKVVEWITCS LQGCDSFYNANELLVQSIISSVETLVGSLVFLTDPLINIFFDKNISKMVKLQLTLGKWFIALYRFLFQMT NIFENCSTHYSFEKNLQKCVNASNPCQLLQKMNTAHSLMIWMGFYIPSAMCFLAVLVDTYCLLVTISILK SLKKQSRKQYIFGRANIIGEHNDYVVVRLSAAILIALCIIIIQSTYFIDIPFRDTFAFFAVLFIIYDFSILSLLGSFTGVAM MTYFGVMRPLVYRDKFTLKTIYIIAFAIVLFSVCVAIPFGLFQAADEIDGPIKCDSESCELIVKWLLFCI ACLILMGCTGTLLFVTVSLHWHSYKSKKMGNVSSSAFNHGKSRLTWTTTILVILCCVELIPTGLLAAFGK SESISDDCYDFYNANSLIFPAIVSSLETFLGSITFLLDPIINFSFDKRISKVFSSQVSMFSIFFCGKR

>SEQ2
MLDDRARMEA AKKEKVEQIL AEFQLQEEDL KKVMRRMQKE MDRGLRLETH EEASVKMLPT YVRSTPEGSE

VGDFLSLDLG GTNFRVMLVK VGEGEEGQWS VKTKHQMYSI PEDAMTGTAE MLFDYISECI SDFLDKHQMK HKKLPLGFTF SFPVRHEDID KGILLNWTKG FKASGAEGNN VVGLLRDAIK RRGDFEMDVV AMVNDTVATM ISCYYEDHQC EVGMIVGTGC NACYMEEMQN VELVEGDEGR MCVNTEWGAF GDSGELDEFL LEYDRLVDES SANPGQQLYE KLIGGKYMGE LVRLVLLRLV DENLLFHGEA SEQLRTRGAF ETRFVSQVES DTGDRKQIYN ILSTLGLRPS TTDCDIVRRA CESVSTRAAH MCSAGLAGVI NRMRESRSED VMRITVGVDG SVYKLHPSFK ERFHASVRRL TPSCEITFIE SEEGSGRGAA LVSAVACKKA CMLGQ

>SEQ3
MESDSFEDFLKGEDFSNYSYSSDLPPFLLDAAPCEPESLEINKYFVVIIYVLVFLLSLLGNSLVMLVILY

SRVGRSGRDNVIGDHVDYVTDVYLLNLALADLLFALTLPIWAASKVTGWIFGTFLCKVVSLLKEVNFYSGILLLACISVDRY LAIVHATRTLTQKRYLVKFICLSIWGLSLLLALPVLIFRKTIYPPYVSPVCYEDMGNNTANWRMLLRILP QSFGFIVPLLIMLFCYGFTLRTLFKAHMGQKHRAMRVIFAVVLIFLLCWLPYNLVLLADTLMRTWVIQET CERRNDIDRALEATEILGILGRVNLIGEHWDYHSCLNPLIYAFIGQKFRHGLLKILAIHGLISKDSLPKDSRPSFVGSSSGH TSTTL

>SEQ4
MEANFQQAVK KLVNDFEYPT ESLREAVKEF DELRQKGLQK NGEVLAMAPA FISTLPTGAE TGDFLALDFG

GTNLRVCWIQ LLGDGKYEMK HSKSVLPREC VRNESVKPII DFMSDHVELF IKEHFPSKFG CPEEEYLPMG FTFSYPANQV SITESYLLRW TKGLNIPEAI NKDFAQFLTE GFKARNLPIR IEAVINDTVG TLVTRAYTSK ESDTFMGIIF GTGTNGAYVE QMNQIPKLAG KCTGDHMLIN MEWGATDFSC LHSTRYDLLL DHDTPNAGRQ IFEKRVGGMY LGELFRRALF HLIKVYNFNE GIFPPSITDA WSLETSVLSR MMVERSAENV RNVLSTFKFR FRSDEEALYL WDAAHAIGRR AARMSAVPIA SLYLSTGRAG KKSDVGVDGS LVEHYPHFVD MLREALRELI GDNEKLISIG IAKDGSGIGA ALCALQAVKE KKGLA MEANFQQAVK KLVNDFEYPT ESLREAVKEF DELRQKGLQK NGEVLAMAPA FISTLPTGAE TGDFLALDFG GTNLRVCWIQ LLGDGKYEMK HSKSVLPREC VRNESVKPII DFMSDHVELF IKEHFPSKFG CPEEEYLPMG FTFSYPANQV SITESYLLRW TKGLNIPEAI NKDFAQFLTE GFKARNLPIR IEAVINDTVG TLVTRAYTSK ESDTFMGIIF GTGTNGAYVE QMNQIPKLAG KCTGDHMLIN MEWGATDFSC LHSTRYDLLL DHDTPNAGRQ IFEKRVGGMY LGELFRRALF HLIKVYNFNE GIFPPSITDA WSLETSVLSR MMVERSAENV RNVLSTFKFR FRSDEEALYL WDAAHAIGRR AARMSAVPIA SLYLSTGRAG KKSDVGVDGS LVEHYPHFVD MLREALRELI GDNEKLISIG IAKDGSGIGA ALCALQAVKE KKGLA
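PROSITE writes signatures in its own pattern notation, which maps closely onto regular expressions. The following is a sketch of such a conversion, not a complete PROSITE parser; the function name `prosite_to_regex` is an assumption, and the pattern and protein below are made up for illustration (they are not the Galactokinase signature, which should be looked up on the PROSITE site).

```python
import re

def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    pattern = pattern.rstrip(".")          # PROSITE patterns end with a period
    parts = []
    for element in pattern.split("-"):     # elements are separated by dashes
        prefix = suffix = ""
        if element.startswith("<"):        # < anchors to the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):          # > anchors to the C-terminus
            suffix, element = "$", element[:-1]
        # Split off a repetition count such as (3) or (2,4)
        m = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, count = m.group(1), m.group(2)
        if core == "x":                    # x means any amino acid
            core = "."
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]" # {P} means anything but P
        parts.append(prefix + core + ("{" + count + "}" if count else "") + suffix)
    return "".join(parts)

motif = "C-x(2,4)-C-[LIV]-{P}."            # made-up pattern for illustration
regex = prosite_to_regex(motif)
print(regex)                               # C.{2,4}C[LIV][^P]

protein = "AACGGCLAKKCAAACVT"              # made-up sequence
hits = [m.start() for m in re.finditer(regex, protein)]
print(hits)                                # [2, 10]
```

Note that re.finditer reports non-overlapping matches only, and that spaces or line breaks inside a pasted sequence have to be removed before searching.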

Exercise 1

2. Find the answer in ultimate-sequence.txt?

>ultimate-sequence
ACTCGTTATGATATTTTTTTTGAACGTGAAAATACTT

TTCGTGCTATGGAAGGACTCGTTATCGTGAAGTTGAACGTTCTGAATGTATGCCTCTTGAAATGGAAAATACTCATTGTTTATCTGAAATTTGAATGGGAATTTTATCTACAATGTTTTATTCTTACAGAACATTAAATTGTGTTATGTTTCATTTCACATTTTAGTAGTTTTTTCAGTGAAAGCTTGAAAACCACCAAGAAGAAAAGCTGGTATGCGTAGCTATGTATATATAAAATTAGATTTTCCACAAAAAATGATCTGATAAACCTTCTCTGTTGGCTCCAAGTATAAGTACGAAAAGAAATACGTTCCCAAGAATTAGCTTCATGAGTAAGAAGAAAAGCTGGTATGCGTAGCTATGTATATATAAAATTAGATTTTCCACAAAAAATGATCTGATAA

Exercise 2

my %AA1 = ( 'UUU','F','UUC','F','UUA','L','UUG','L','UCU','S','UCC','S','UCA','S','UCG','S','UAU','Y','UAC','Y','UAA','*','UAG','*','UGU','C','UGC','C','UGA','*','UGG','W',

'CUU','L','CUC','L','CUA','L','CUG','L','CCU','P','CCC','P','CCA','P','CCG','P','CAU','H','CAC','H','CAA','Q','CAG','Q','CGU','R','CGC','R','CGA','R','CGG','R',

'AUU','I','AUC','I','AUA','I',

'AUG','M','ACU','T','ACC','T','ACA','T','ACG','T','AAU','N','AAC','N','AAA','K','AAG','K','AGU','S','AGC','S','AGA','R','AGG','R',

'GUU','V','GUC','V','GUA','V','GUG','V','GCU','A','GCC','A','GCA','A','GCG','A','GAU','D','GAC','D','GAA','E','GAG','E','GGU','G','GGC','G','GGA','G',

'GGG','G' );

Exercise 2

AA1 = {'UUU':'F','UUC':'F','UUA':'L','UUG':'L','UCU':'S','UCC':'S','UCA':'S','UCG':'S','UAU':'Y','UAC':'Y','UAA':'*','UAG':'*','UGU':'C','UGC':'C','UGA':'*','UGG':'W','CUU':'L','CUC':'L','CUA':'L','CUG':'L','CCU':'P','CCC':'P','CCA':'P','CCG':'P','CAU':'H','CAC':'H','CAA':'Q','CAG':'Q','CGU':'R','CGC':'R','CGA':'R','CGG':'R','AUU':'I','AUC':'I','AUA':'I','AUG':'M','ACU':'T','ACC':'T','ACA':'T','ACG':'T','AAU':'N','AAC':'N','AAA':'K','AAG':'K','AGU':'S','AGC':'S','AGA':'R','AGG':'R','GUU':'V','GUC':'V','GUA':'V','GUG':'V','GCU':'A','GCC':'A','GCA':'A','GCG':'A','GAU':'D','GAC':'D','GAA':'E','GAG':'E','GGU':'G','GGC':'G','GGA':'G','GGG':'G' }
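A codon dictionary like AA1 can drive a simple translation loop. The sketch below rebuilds the same 64 codon-to-amino-acid pairs compactly so that it runs on its own; the function name `translate` and the choice to stop at the first stop codon are assumptions, not part of the exercise text.

```python
from itertools import product

bases = "UCAG"
amino_acids = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
# Same 64 codon -> amino acid pairs as the AA1 dictionary, built compactly
AA1 = {"".join(c): aa for c, aa in zip(product(bases, repeat=3), amino_acids)}

def translate(rna):
    """Translate an RNA string codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        aa = AA1[rna[i:i + 3]]
        if aa == "*":          # stop codon
            break
        protein.append(aa)
    return "".join(protein)

print(translate("AUGGCCUGGUAA"))   # MAW
```

For a DNA input, the sequence would first need its T characters replaced by U, since the dictionary keys are RNA codons.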

Exercise 2

Translations

Python way:
tab = str.maketrans("ACGU","UGCA")
sequence = sequence.translate(tab)[::-1]
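Here "translation" means character-by-character substitution: the table swaps each base for its complement, and the [::-1] slice reverses the result, giving the reverse complement. A small sketch wrapping the two lines in a function; the name `reverse_complement` is chosen for this illustration:

```python
def reverse_complement(rna):
    """Return the reverse complement of an RNA string (A<->U, C<->G)."""
    tab = str.maketrans("ACGU", "UGCA")   # maps each base to its complement
    return rna.translate(tab)[::-1]       # complement, then reverse

print(reverse_complement("AUGC"))   # GCAU
```

For DNA the same approach works with the table built from "ACGT" and "TGCA".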

http://www.pythonchallenge.com
