2015 bioinformatics python_strings_wim_vancriekinge
Post on 11-Jan-2017
FBW06-10-2015
Wim Van Criekinge
Bioinformatics.be
Overview
What is Python?
Why Python 4 Bioinformatics?
How to Python
IDE: Eclipse & PyDev / Athena
Code Sharing: Git(hub)
Strings
Regular expressions
Python
• Programming languages are overrated
  – If you are going into bioinformatics you will probably learn/need multiple
  – If you know one you know 90% of a second
• Choice does matter but it matters far less than people think it does
• Why Python?
  – Lets you start useful programs asap
  – Built-in libraries, incl BioPython
  – Free, most platforms, widely (scientifically) used
• Versus Perl?
  – Incredibly similar
  – Consistent syntax, indentation
Version 2.7 and 3.4 on athena.ugent.be
Eclipse IDE Components
Menubars: Full drop-down menus plus quick access to common functions
Editor Pane: This is where we edit our source code
Perspective Switcher: We can switch between various perspectives here
Outline Pane: This contains a hierarchical view of a source file
Package Explorer Pane: This is where our projects/files are listed
Miscellaneous Pane: Various components can appear in this pane; typically this contains a console and a list of compiler problems
Task List Pane: This contains a list of “tasks” to complete
Where is the workspace ?
GitHub: Hosted GIT
• Largest open source git hosting site
• Public and private options
• User-centric rather than project-centric
• http://github.ugent.be (use your Ugent login and password)
  – Accept invitation from Bioinformatics-I-2015
  – URI: https://github.ugent.be/Bioinformatics-I-2015/Python.git
Run Install.py (is BioPython installed ?)
    import pip
    import sys
    import platform
    import webbrowser

    print("Python " + platform.python_version() + " installed packages:")

    # Note: pip.get_installed_distributions() was removed in pip 10,
    # so this script only runs with older pip versions.
    installed_packages = pip.get_installed_distributions()
    installed_packages_list = sorted(["%s==%s" % (i.key, i.version) for i in installed_packages])
    print(*installed_packages_list, sep="\n")
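On a newer Python (3.8 or later), the same check can be sketched with the standard-library importlib.metadata module instead of pip internals. This is not the course's Install.py; the function name installed_packages and the package name "biopython" used in the lookup are assumptions made for illustration.

```python
import platform
from importlib import metadata  # in the standard library since Python 3.8


def installed_packages():
    """Return a sorted list of 'name==version' strings for installed distributions."""
    entries = set()
    for dist in metadata.distributions():
        meta = dist.metadata
        name = meta.get("Name") if meta is not None else None
        if name:  # skip distributions whose metadata is missing or broken
            entries.add("%s==%s" % (name.lower(), meta.get("Version")))
    return sorted(entries)


packages = installed_packages()
print("Python " + platform.python_version() + " installed packages:")
print(*packages, sep="\n")

# Assumption: BioPython is distributed under the name "biopython".
print("BioPython installed:", any(p.startswith("biopython==") for p in packages))
```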
Control Structures
    if condition:
        statements
    [elif condition:
        statements] ...
    else:
        statements

    while condition:
        statements

    for var in sequence:
        statements

    break
    continue
range
The range function specifies a range of integers:
  range(start, stop) - the integers between start (inclusive) and stop (exclusive)

It can also accept a third value specifying the change between values:
  range(start, stop, step) - the integers between start (inclusive) and stop (exclusive) by step

Example:
    for x in range(5, 0, -1):
        print(x)
    print("Blastoff!")

Output:
    5
    4
    3
    2
    1
    Blastoff!
Exercise: How would we print the "99 Bottles of Beer" song?
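One possible approach to the exercise, as a sketch: count down with range and a negative step, and build each verse with string formatting. The helper names bottles and verses and the exact wording of the lyrics are choices made here, not part of the slides.

```python
def bottles(n):
    """Return 'n bottle(s)' with the correct singular or plural form."""
    return "%d bottle%s" % (n, "" if n == 1 else "s")


def verses(start=99):
    """Build the song as a list of verses, counting down from start to 1."""
    song = []
    for n in range(start, 0, -1):  # start, start-1, ..., 1
        remaining = bottles(n - 1) if n > 1 else "no more bottles"
        song.append(
            "%s of beer on the wall, %s of beer.\n"
            "Take one down and pass it around, %s of beer on the wall."
            % (bottles(n), bottles(n), remaining)
        )
    return song


# Print only the last three verses here; verses() gives all 99.
print("\n\n".join(verses(3)))
```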
Grouping Indentation
In Python:
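    for i in range(20):
        if i%3 == 0:
            print(i)
            if i%5 == 0:
                print("Bingo!")
        print("---")

Output (each item is printed on its own line; shown here space-separated):
    0 Bingo! --- --- ---
    3 --- --- ---
    6 --- --- ---
    9 --- --- ---
    12 --- --- ---
    15 Bingo! --- --- ---
    18 --- ---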
while
while loop: Executes a group of statements as long as a condition is True.
  good for indefinite loops (repeat an unknown number of times)

Syntax:
    while condition:
        statements

Example:
    number = 1
    while number < 200:
        print(number, end=" ")
        number = number * 2

Output:
    1 2 4 8 16 32 64 128
if
if statement: Executes a group of statements only if a certain condition is true. Otherwise, the statements are skipped.

Syntax:
    if condition:
        statements

Example:
    gpa = 3.4
    if gpa > 2.0:
        print("Your application is accepted.")
if/else
if/else statement: Executes one block of statements if a certain condition is True, and a second block of statements if it is False.

Syntax:
    if condition:
        statements
    else:
        statements

Example:
    gpa = 1.4
    if gpa > 2.0:
        print("Welcome to Mars University!")
    else:
        print("Your application is denied.")

Multiple conditions can be chained with elif ("else if"):
    if condition:
        statements
    elif condition:
        statements
    else:
        statements
Logic
Many logical expressions use relational operators:

Operator  Meaning                   Example      Result
==        equals                    1 + 1 == 2   True
!=        does not equal            3.2 != 2.5   True
<         less than                 10 < 5       False
>         greater than              10 > 5       True
<=        less than or equal to     126 <= 100   False
>=        greater than or equal to  5.0 >= 5.0   True

Logical expressions can be combined with logical operators:

Operator  Example            Result
and       9 != 6 and 2 < 3   True
or        2 == 3 or -1 < 5   True
not       not 7 > 0          False
PI-thon.py
Introduction
Buffon's Needle is one of the oldest problems in the field of geometrical probability. It was first stated in 1777. It involves dropping a needle on a lined sheet of paper and determining the probability of the needle crossing one of the lines on the page. The remarkable result is that the probability is directly related to the value of pi.
https://www.youtube.com/watch?v=Vws1jvMbs64&feature=youtu.be
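The link between the crossing probability and pi can be illustrated with a small Monte Carlo simulation. This is a sketch and not the course's PI-thon.py: the function name estimate_pi and its parameters are assumptions, and it relies on the standard result that a needle of length l dropped on lines spaced d apart (with l <= d) crosses a line with probability 2l / (pi * d).

```python
import math
import random


def estimate_pi(n_drops, needle=1.0, spacing=1.0, seed=None):
    """Estimate pi from n_drops simulated needle drops (requires needle <= spacing)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_drops):
        # Distance from the needle's centre to the nearest line.
        centre = rng.uniform(0.0, spacing / 2.0)
        # Random direction without using pi: a uniform point in the quarter
        # unit disk has a uniformly distributed angle between 0 and 90 degrees.
        while True:
            u, v = rng.random(), rng.random()
            r2 = u * u + v * v
            if 0.0 < r2 <= 1.0:
                break
        sin_theta = v / math.sqrt(r2)
        # The needle crosses a line if its half-length, projected
        # perpendicular to the lines, reaches the nearest line.
        if centre <= (needle / 2.0) * sin_theta:
            hits += 1
    if hits == 0:
        raise ValueError("no crossings observed; use more drops")
    # P(cross) = 2*needle / (pi*spacing)  =>  pi ~ 2*needle*n / (spacing*hits)
    return 2.0 * needle * n_drops / (spacing * hits)


print(estimate_pi(100000, seed=42))
```

The estimate converges slowly: the error shrinks roughly with the square root of the number of drops, so many drops are needed for each extra digit.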
Strings
string: A sequence of text characters in a program. Strings start and end with quotation mark " or apostrophe ' characters.
Examples:
    "hello"
    "This is a string"
    "This, too, is a string. It can be very long!"

A string may not span across multiple lines or contain a " character.
    "This is not
    a legal String."
    "This is not a "legal" String either."

A string can represent special characters by preceding them with a backslash.
    \t  tab character
    \n  new line character
    \"  quotation mark character
    \\  backslash character

Example: "Hello\tthere\nHow are you?"
Strings
Indexes
Characters in a string are numbered with indexes starting at 0:
Example:
    name = "P. Diddy"

    index      0  1  2  3  4  5  6  7
    character  P  .     D  i  d  d  y

Accessing an individual character of a string:
    variableName[index]

Example:
    print(name, "starts with", name[0])

Output:
    P. Diddy starts with P
Strings
• "hello"+"world"     "helloworld"       # concatenation
• "hello"*3           "hellohellohello"  # repetition
• "hello"[0]          "h"                # indexing
• "hello"[-1]         "o"                # (from end)
• "hello"[1:4]        "ell"              # slicing
• len("hello")        5                  # size
• "hello" < "jello"   True               # comparison
• "e" in "hello"      True               # search
• "escapes: \n etc, \033 etc, \if etc"
• 'single quotes' """triple quotes""" r"raw strings"
String properties
len(string) - number of characters in a string (including spaces)
str.lower(string) - lowercase version of a string
str.upper(string) - uppercase version of a string

Example:
    name = "Martin Douglas Stepp"
    length = len(name)
    big_name = str.upper(name)
    print(big_name, "has", length, "characters")

Output:
    MARTIN DOUGLAS STEPP has 20 characters
a.replace
Text processing
text processing: Examining, editing, formatting text.
  often uses loops that examine the characters of a string one by one

A for loop can examine each character in a string in sequence.

Example:
    for c in "booyah":
        print(c)

Output:
    b
    o
    o
    y
    a
    h
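In a bioinformatics setting the same character-by-character loop can compute, for instance, the GC content of a sequence. The sketch below assumes a hypothetical helper named gc_content and a made-up example sequence.

```python
def gc_content(seq):
    """Return the fraction of bases in seq that are G or C (0.0 for an empty string)."""
    if not seq:
        return 0.0
    gc = 0
    for base in seq.upper():  # examine the characters one by one
        if base in "GC":
            gc += 1
    return gc / len(seq)


print(gc_content("ATGCGGCTTA"))  # made-up sequence; 5 of 10 bases are G or C -> 0.5
```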
Strings and numbers
ord(text) - converts a single character into its number. Example: ord("a") is 97, ord("b") is 98, ...
Characters map to numbers using standardized mappings such as ASCII and Unicode.
chr(number) - converts a number into a one-character string. Example: chr(99) is "c"
Exercise: Write a program that performs a rotation cypher. e.g. "Attack" when rotated by 1 becomes "buubdl"
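One possible solution to the exercise, as a sketch built on ord and chr. The function name rotate is an assumption; following the example on the slide, the output is lowercased, and characters other than a-z are left unchanged.

```python
def rotate(text, shift):
    """Rotate each letter of text by shift positions in the alphabet (lowercase output)."""
    result = []
    for c in text.lower():
        if "a" <= c <= "z":
            offset = (ord(c) - ord("a") + shift) % 26  # wrap around after 'z'
            result.append(chr(ord("a") + offset))
        else:
            result.append(c)  # spaces, digits and punctuation stay as they are
    return "".join(result)


print(rotate("Attack", 1))    # buubdl
print(rotate("buubdl", -1))   # attack
```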
Lists
• Flexible arrays, not Lisp-like linked lists
  • a = [99, "bottles of beer", ["on", "the", "wall"]]
• Same operators as for strings
  • a+b, a*3, a[0], a[-1], a[1:], len(a)
• Item and slice assignment
  • a[0] = 98
  • a[1:2] = ["bottles", "of", "beer"]
    -> [98, "bottles", "of", "beer", ["on", "the", "wall"]]
  • del a[-1] # -> [98, "bottles", "of", "beer"]
More List Operations
>>> a = list(range(5))   # [0,1,2,3,4]
>>> a.append(5)          # [0,1,2,3,4,5]
>>> a.pop()              # [0,1,2,3,4]
>>> a.insert(0, 42)      # [42,0,1,2,3,4]
>>> a.pop(0)             # [0,1,2,3,4]
>>> a.reverse()          # [4,3,2,1,0]
>>> a.sort()             # [0,1,2,3,4]
Dictionaries
• Hash tables, "associative arrays"
  • d = {"duck": "eend", "water": "water"}
• Lookup:
  • d["duck"] -> "eend"
  • d["back"] # raises KeyError exception
• Delete, insert, overwrite:
  • del d["water"] # {"duck": "eend"}
  • d["back"] = "rug" # {"duck": "eend", "back": "rug"}
  • d["duck"] = "duik" # {"duck": "duik", "back": "rug"}
More Dictionary Ops
• Keys, values, items:
  • d.keys() -> ["duck", "back"]
  • d.values() -> ["duik", "rug"]
  • d.items() -> [("duck","duik"), ("back","rug")]
  (in Python 3 these return view objects; wrap them in list() to get a list)
• Presence check:
  • d.has_key("duck") -> 1; d.has_key("spam") -> 0 (Python 2 only)
  • "duck" in d -> True; "spam" in d -> False (Python 2 and 3)
• Values of any type; keys almost any
  • {"name":"Guido", "age":43, ("hello","world"):1, 42:"yes", "flag": ["red","white","blue"]}
Dictionary Details
• Keys must be immutable:
  – numbers, strings, tuples of immutables
    • these cannot be changed after creation
  – reason is hashing (fast lookup technique)
  – not lists or other dictionaries
    • these types of objects can be changed "in place"
  – no restrictions on values
• Keys will be listed in arbitrary order
  – again, because of hashing
  – (since Python 3.7, dictionaries preserve insertion order)
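The key restriction can be seen directly: strings, numbers and tuples of immutables work as keys, while a list raises a TypeError. The dictionary name and its contents below are made up for illustration.

```python
annotations = {}
annotations["AUG"] = "start codon"          # string key
annotations[("chr1", 1042)] = "variant"     # tuple of immutables as key
annotations[42] = "yes"                     # integer key

try:
    annotations[["A", "U", "G"]] = "oops"   # a list is mutable, so it cannot be hashed
except TypeError as err:
    error_message = str(err)

print(error_message)
# From Python 3.7 on, iteration follows insertion order.
print(list(annotations))
```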
Reference Semantics
• Assignment manipulates references
  • x = y does not make a copy of y
  • x = y makes x reference the object y references
• Very useful; but beware!
• Example:
    >>> a = [1, 2, 3]
    >>> b = a
    >>> a.append(4)
    >>> print(b)
    [1, 2, 3, 4]
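[Figure: Changing a Shared List. After a = [1, 2, 3] and b = a, both names refer to the same list object; a.append(4) changes that one object, so b also shows [1, 2, 3, 4].]
[Figure: Changing an Integer. After a = 1 and b = a, the statement a = a+1 creates a new int object via the add operator (1+1); the old reference is deleted by the assignment (a=...), and b still refers to 1.]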
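Both cases can be checked with the is operator, which tests whether two names refer to the same object. The sketch also shows one common way to get an independent (shallow) copy of a list; the variable names are arbitrary.

```python
a = [1, 2, 3]
b = a            # b refers to the same list object as a
c = list(a)      # c is a new list with the same items (a shallow copy)

a.append(4)      # changes the one object shared by a and b

print(b)         # [1, 2, 3, 4]
print(c)         # [1, 2, 3]
print(a is b)    # True
print(a is c)    # False

x = 1
y = x
x = x + 1        # rebinds x to a new int object; y is unaffected
print(x, y)      # 2 1
```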
Example Function
    def gcd(a, b):
        "greatest common divisor"
        while a != 0:
            a, b = b%a, a    # parallel assignment
        return b

    >>> gcd.__doc__
    'greatest common divisor'
    >>> gcd(12, 20)
    4
What is a regular expression?
• A regular expression (regex) is simply a way of describing text.
• Regular expressions are built up of small units (atoms) which can represent the type and number of characters in the text
• Regular expressions can be very broad (describing everything), or very narrow (describing only one pattern).
Why would you use a regex?
• Often you wish to test a string for the presence of a specific character, word, or phrase
  – Examples
    • “Are there any letter characters in my string?”
    • “Is this a valid accession number?”
    • “Does my sequence contain a start codon (ATG)?”
• The EcoRI restriction enzyme cuts at the consensus sequence GAATTC.
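A few of these questions, sketched with Python's re module. The DNA string is made up for illustration, and the sketch only locates the GAATTC recognition site; it does not model where the enzyme actually cuts.

```python
import re

dna = "CCATGGAATTCTTATGAATTCG"   # made-up sequence for illustration

# "Are there any letter characters in my string?"
has_letters = re.search(r"[A-Za-z]", dna) is not None

# "Does my sequence contain a start codon (ATG)?"
has_start_codon = re.search(r"ATG", dna) is not None

# Start positions (0-based) of every EcoRI recognition site.
ecori_sites = [m.start() for m in re.finditer(r"GAATTC", dna)]

print(has_letters, has_start_codon)  # True True
print(ecori_sites)                   # [5, 15]
```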
Real world problems
• Match IP Addresses, email addresses, URLs
• Match balanced sets of parentheses
• Substitute words
• Tokenize
• Validate
• Count
• Delete duplicates
• Natural Language processing
RE in Python
• Unleash the power - built-in re module
• Functions
  – to compile patterns
    • compile
  – to perform matches
    • match, search, findall, finditer
  – to perform operations on match object
    • group, start, end, span
  – to substitute
    • sub, subn
• Metacharacters
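A short tour of the functions listed above, as a sketch on a made-up string. Note the difference between match, which only tries the start of the string, and search, which scans the whole string.

```python
import re

pattern = re.compile(r"\d+")           # compile once, reuse many times
text = "exon 12 spans 340 to 512"      # made-up text

print(pattern.match(text))             # None: the string does not start with a digit
first = pattern.search(text)           # first hit anywhere in the string
print(first.group(), first.start(), first.end(), first.span())  # 12 5 7 (5, 7)
print(pattern.findall(text))           # ['12', '340', '512']
spans = [m.span() for m in pattern.finditer(text)]
print(spans)                           # [(5, 7), (14, 17), (21, 24)]
print(pattern.sub("N", text))          # exon N spans N to N
print(pattern.subn("N", text))         # ('exon N spans N to N', 3)
```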
Quantifiers
• [ATGC] : a character class, matches any one of A, T, G or C
• You can specify the number of times you want to see an atom. Examples
  • \d* : Zero or more times
  • \d+ : One or more times
  • \d{3} : Exactly three times
  • \d{4,7} : At least four, and not more than seven
  • \d{3,} : Three or more times
• We could rewrite /\d\d\d-\d\d\d\d/ as:
  – /\d{3}-\d{4}/
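The quantifiers can be tried out in Python, where patterns are written as raw strings rather than between slashes. The test strings below are made up for illustration.

```python
import re

# The long and the short spelling accept exactly the same strings.
phone_long = re.compile(r"\d\d\d-\d\d\d\d")
phone_short = re.compile(r"\d{3}-\d{4}")
samples = ["555-1234", "55-1234", "555-123", "5555-1234"]
agree = all(
    bool(phone_long.fullmatch(s)) == bool(phone_short.fullmatch(s)) for s in samples
)
print(agree)                                          # True

# A character class with a quantifier: runs of one or more nucleotides.
print(re.findall(r"[ATGC]+", "xxATGCCGTAyyGGzz"))     # ['ATGCCGTA', 'GG']

# {4,7} is greedy: it takes up to seven digits and leaves the rest.
print(re.findall(r"\d{4,7}", "12 1234 12345678"))     # ['1234', '1234567']
print(re.findall(r"\d{3,}", "12 123 12345"))          # ['123', '12345']
```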
Anchors
• Anchors force a pattern match to a certain location
  • ^ : match at beginning of string
  • $ : match at end of string
  • \b : match at word boundary (between \w and \W)
• Example:
  • /^\d\d\d-\d\d\d\d$/ : matches only strings that consist entirely of a phone number of the form ddd-dddd
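A sketch of the anchors in Python, with made-up test strings. One detail worth knowing: in Python, $ also matches just before a trailing newline, so re.fullmatch is the stricter way to require that the whole string fits the pattern.

```python
import re

candidates = ["555-1234", "call 555-1234", "555-12345", "555-1234\n"]

anchored = [bool(re.search(r"^\d{3}-\d{4}$", s)) for s in candidates]
print(anchored)   # [True, False, False, True]  ($ tolerates the trailing newline)

strict = [bool(re.fullmatch(r"\d{3}-\d{4}", s)) for s in candidates]
print(strict)     # [True, False, False, False]

# \b only matches where a word starts or ends.
whole_words = re.findall(r"\bATG\b", "ATG CATG ATGC ATG")
print(len(whole_words))   # 2
```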
Grouping, capturing
• You can group atoms together with parentheses
  • /cat+/ matches cat, catt, cattt
  • /(cat)+/ matches cat, catcat, catcatcat
• Use as many sets of parentheses as you need
• match.group()
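The difference between the two patterns, and the way parentheses capture text that match.group() can retrieve, can be sketched as follows. The phone-number string is made up for illustration.

```python
import re

# Without parentheses, + applies only to the final "t".
print(bool(re.fullmatch(r"cat+", "cattt")))        # True
print(bool(re.fullmatch(r"cat+", "catcat")))       # False

# With parentheses, + applies to the whole group "cat".
print(bool(re.fullmatch(r"(cat)+", "catcatcat")))  # True

# Each pair of parentheses captures a part of the match.
m = re.search(r"(\d{3})-(\d{4})", "call 555-1234 now")
print(m.group())    # 555-1234  (the whole match, same as m.group(0))
print(m.group(1))   # 555
print(m.group(2))   # 1234
print(m.groups())   # ('555', '1234')
```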
Regex.py
    import re

    line = "Cats are smarter than dogs"

    matchObj = re.match(r'(.*) are (.*?) .*', line, re.M|re.I)

    if matchObj:
        print("matchObj.group() : ", matchObj.group())
        print("matchObj.group(1) : ", matchObj.group(1))
        print("matchObj.group(2) : ", matchObj.group(2))
    else:
        print("No match!!")

    phone = "2004-959-559 # This is Phone Number"

    # Delete Python-style comments
    num = re.sub(r'#.*$', "", phone)
    print("Phone Num : ", num)

    # Remove anything other than digits
    num = re.sub(r'\D', "", phone)
    print("Phone Num : ", num)

    text = 'abbaaabbbbaaaaa'
    pattern = 'ab'

    for match in re.finditer(pattern, text):
        s = match.start()
        e = match.end()
        print('Found "%s" at %d:%d' % (text[s:e], s, e))
References
• http://docs.python.org/
• http://code.activestate.com/recipes/langs/python/
• http://www.regular-expressions.info/
• http://www.dabeaz.com/ply/ply.html
• Mastering Regular Expressions by Jeffrey E. F. Friedl
• Python Cookbook by Alex Martelli, Anna Martelli & David Ascher
• Text processing in Python by David Mertz
Exercise 1
1. Which of the following 4 sequences (seq1/2/3/4)
a) contains a “Galactokinase signature”
b) How many of them?
http://us.expasy.org/prosite/
>SEQ1
MGNLFENCTHRYSFEYIYENCTNTTNQCGLIRNVASSIDVFHWLDVYISTTIFVISGILNFYCLFIALYT
YYFLDNETRKHYVFVLSRFLSSILVIISLLVLESTLFSESLSPTFAYYAVAFSIYDFSMDTLFFSYIMIS LITYFGVVHYNFYRRHVSLRSLYIILISMWTFSLAIAIPLGLYEAASNSQGPIKCDLSYCGKVVEWITCS LQGCDSFYNANELLVQSIISSVETLVGSLVFLTDPLINIFFDKNISKMVKLQLTLGKWFIALYRFLFQMT NIFENCSTHYSFEKNLQKCVNASNPCQLLQKMNTAHSLMIWMGFYIPSAMCFLAVLVDTYCLLVTISILK SLKKQSRKQYIFGRANIIGEHNDYVVVRLSAAILIALCIIIIQSTYFIDIPFRDTFAFFAVLFIIYDFSILSLLGSFTGVAM MTYFGVMRPLVYRDKFTLKTIYIIAFAIVLFSVCVAIPFGLFQAADEIDGPIKCDSESCELIVKWLLFCI ACLILMGCTGTLLFVTVSLHWHSYKSKKMGNVSSSAFNHGKSRLTWTTTILVILCCVELIPTGLLAAFGK SESISDDCYDFYNANSLIFPAIVSSLETFLGSITFLLDPIINFSFDKRISKVFSSQVSMFSIFFCGKR
>SEQ2
MLDDRARMEA AKKEKVEQIL AEFQLQEEDL KKVMRRMQKE MDRGLRLETH EEASVKMLPT YVRSTPEGSE
VGDFLSLDLG GTNFRVMLVK VGEGEEGQWS VKTKHQMYSI PEDAMTGTAE MLFDYISECI SDFLDKHQMK HKKLPLGFTF SFPVRHEDID KGILLNWTKG FKASGAEGNN VVGLLRDAIK RRGDFEMDVV AMVNDTVATM ISCYYEDHQC EVGMIVGTGC NACYMEEMQN VELVEGDEGR MCVNTEWGAF GDSGELDEFL LEYDRLVDES SANPGQQLYE KLIGGKYMGE LVRLVLLRLV DENLLFHGEA SEQLRTRGAF ETRFVSQVES DTGDRKQIYN ILSTLGLRPS TTDCDIVRRA CESVSTRAAH MCSAGLAGVI NRMRESRSED VMRITVGVDG SVYKLHPSFK ERFHASVRRL TPSCEITFIE SEEGSGRGAA LVSAVACKKA CMLGQ
>SEQ3
MESDSFEDFLKGEDFSNYSYSSDLPPFLLDAAPCEPESLEINKYFVVIIYVLVFLLSLLGNSLVMLVILY
SRVGRSGRDNVIGDHVDYVTDVYLLNLALADLLFALTLPIWAASKVTGWIFGTFLCKVVSLLKEVNFYSGILLLACISVDRY LAIVHATRTLTQKRYLVKFICLSIWGLSLLLALPVLIFRKTIYPPYVSPVCYEDMGNNTANWRMLLRILP QSFGFIVPLLIMLFCYGFTLRTLFKAHMGQKHRAMRVIFAVVLIFLLCWLPYNLVLLADTLMRTWVIQET CERRNDIDRALEATEILGILGRVNLIGEHWDYHSCLNPLIYAFIGQKFRHGLLKILAIHGLISKDSLPKDSRPSFVGSSSGH TSTTL
>SEQ4
MEANFQQAVK KLVNDFEYPT ESLREAVKEF DELRQKGLQK NGEVLAMAPA FISTLPTGAE TGDFLALDFG
GTNLRVCWIQ LLGDGKYEMK HSKSVLPREC VRNESVKPII DFMSDHVELF IKEHFPSKFG CPEEEYLPMG FTFSYPANQV SITESYLLRW TKGLNIPEAI NKDFAQFLTE GFKARNLPIR IEAVINDTVG TLVTRAYTSK ESDTFMGIIF GTGTNGAYVE QMNQIPKLAG KCTGDHMLIN MEWGATDFSC LHSTRYDLLL DHDTPNAGRQ IFEKRVGGMY LGELFRRALF HLIKVYNFNE GIFPPSITDA WSLETSVLSR MMVERSAENV RNVLSTFKFR FRSDEEALYL WDAAHAIGRR AARMSAVPIA SLYLSTGRAG KKSDVGVDGS LVEHYPHFVD MLREALRELI GDNEKLISIG IAKDGSGIGA ALCALQAVKE KKGLA MEANFQQAVK KLVNDFEYPT ESLREAVKEF DELRQKGLQK NGEVLAMAPA FISTLPTGAE TGDFLALDFG GTNLRVCWIQ LLGDGKYEMK HSKSVLPREC VRNESVKPII DFMSDHVELF IKEHFPSKFG CPEEEYLPMG FTFSYPANQV SITESYLLRW TKGLNIPEAI NKDFAQFLTE GFKARNLPIR IEAVINDTVG TLVTRAYTSK ESDTFMGIIF GTGTNGAYVE QMNQIPKLAG KCTGDHMLIN MEWGATDFSC LHSTRYDLLL DHDTPNAGRQ IFEKRVGGMY LGELFRRALF HLIKVYNFNE GIFPPSITDA WSLETSVLSR MMVERSAENV RNVLSTFKFR FRSDEEALYL WDAAHAIGRR AARMSAVPIA SLYLSTGRAG KKSDVGVDGS LVEHYPHFVD MLREALRELI GDNEKLISIG IAKDGSGIGA ALCALQAVKE KKGLA
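PROSITE writes its signatures in its own pattern syntax, which maps closely onto regular expressions. The sketch below converts the common elements of that syntax into a Python regex. The function names are assumptions, the converter deliberately ignores rarer syntax, and the pattern and sequences used here are toy examples, not the Galactokinase signature; look up the actual signature on PROSITE.

```python
import re


def prosite_to_regex(pattern):
    """Convert a simple PROSITE-style pattern into a Python regex string.

    Handles: '-' separators, x (any residue), [..] (allowed residues),
    {..} (forbidden residues), (n) and (n,m) repeats, '<' and '>' anchors.
    """
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("x", ".")                      # x: any amino acid
        element = element.replace("{", "[^").replace("}", "]")   # {P}: anything but P
        element = element.replace("(", "{").replace(")", "}")    # (2,4): repeat count
        element = element.replace("<", "^").replace(">", "$")    # sequence ends
        parts.append(element)
    return "".join(parts)


def clean(seq):
    """Remove spaces and line breaks from a pasted sequence."""
    return "".join(seq.split())


toy_pattern = "A-x(2)-[ST]-{P}-G"        # toy pattern for illustration only
regex = prosite_to_regex(toy_pattern)
print(regex)                             # A.{2}[ST][^P]G

toy_seq = clean("MKAQL SRGW")            # toy sequence with a space, as in SEQ2 and SEQ4
hits = [(m.start(), m.group()) for m in re.finditer(regex, toy_seq)]
print(hits)                              # [(2, 'AQLSRG')]
```

Counting the hits per sequence with re.finditer (or len(re.findall(...))) then answers both parts of the question.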
Exercise 1
2. Find the answer in ultimate-sequence.txt ?
>ultimate-sequence
ACTCGTTATGATATTTTTTTTGAACGTGAAAATACTT
TTCGTGCTATGGAAGGACTCGTTATCGTGAAGTTGAACGTTCTGAATGTATGCCTCTTGAAATGGAAAATACTCATTGTTTATCTGAAATTTGAATGGGAATTTTATCTACAATGTTTTATTCTTACAGAACATTAAATTGTGTTATGTTTCATTTCACATTTTAGTAGTTTTTTCAGTGAAAGCTTGAAAACCACCAAGAAGAAAAGCTGGTATGCGTAGCTATGTATATATAAAATTAGATTTTCCACAAAAAATGATCTGATAAACCTTCTCTGTTGGCTCCAAGTATAAGTACGAAAAGAAATACGTTCCCAAGAATTAGCTTCATGAGTAAGAAGAAAAGCTGGTATGCGTAGCTATGTATATATAAAATTAGATTTTCCACAAAAAATGATCTGATAA
Exercise 2
my %AA1 = ( 'UUU','F','UUC','F','UUA','L','UUG','L','UCU','S','UCC','S','UCA','S','UCG','S','UAU','Y','UAC','Y','UAA','*','UAG','*','UGU','C','UGC','C','UGA','*','UGG','W',
'CUU','L','CUC','L','CUA','L','CUG','L','CCU','P','CCC','P','CCA','P','CCG','P','CAU','H','CAC','H','CAA','Q','CAG','Q','CGU','R','CGC','R','CGA','R','CGG','R',
'AUU','I','AUC','I','AUA','I',
'AUG','M','ACU','T','ACC','T','ACA','T','ACG','T','AAU','N','AAC','N','AAA','K','AAG','K','AGU','S','AGC','S','AGA','R','AGG','R',
'GUU','V','GUC','V','GUA','V','GUG','V','GCU','A','GCC','A','GCA','A','GCG','A','GAU','D','GAC','D','GAA','E','GAG','E','GGU','G','GGC','G','GGA','G',
'GGG','G' );
Exercise 2
AA1 = {'UUU':'F','UUC':'F','UUA':'L','UUG':'L','UCU':'S','UCC':'S','UCA':'S','UCG':'S','UAU':'Y','UAC':'Y','UAA':'*','UAG':'*','UGU':'C','UGC':'C','UGA':'*','UGG':'W','CUU':'L','CUC':'L','CUA':'L','CUG':'L','CCU':'P','CCC':'P','CCA':'P','CCG':'P','CAU':'H','CAC':'H','CAA':'Q','CAG':'Q','CGU':'R','CGC':'R','CGA':'R','CGG':'R','AUU':'I','AUC':'I','AUA':'I','AUG':'M','ACU':'T','ACC':'T','ACA':'T','ACG':'T','AAU':'N','AAC':'N','AAA':'K','AAG':'K','AGU':'S','AGC':'S','AGA':'R','AGG':'R','GUU':'V','GUC':'V','GUA':'V','GUG':'V','GCU':'A','GCC':'A','GCA':'A','GCG':'A','GAU':'D','GAC':'D','GAA':'E','GAG':'E','GGU':'G','GGC':'G','GGA':'G','GGG':'G' }
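A sketch of how such a codon table can be used to translate an RNA sequence. To keep the block self-contained, the table is rebuilt here in a compact form (named CODON_TABLE, an assumption) that should hold the same 64 entries as AA1; the function name translate and the example sequences are also made up.

```python
from itertools import product

BASES = "UCAG"
# Amino acids for UUU, UUC, UUA, UUG, UCU, ... in UCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(rna, stop_at_stop=True):
    """Translate rna codon by codon, starting at position 0.

    Trailing bases that do not fill a codon are ignored. With stop_at_stop,
    translation ends at the first stop codon; otherwise stops appear as '*'.
    """
    protein = []
    for i in range(0, len(rna) - 2, 3):
        aa = CODON_TABLE[rna[i:i + 3]]
        if aa == "*" and stop_at_stop:
            break
        protein.append(aa)
    return "".join(protein)


print(translate("AUGGCCUAA"))                        # MA
print(translate("AUGGCCUAA", stop_at_stop=False))    # MA*

# A DNA coding strand can be converted to RNA first.
print(translate("ATGTGGTTT".replace("T", "U")))      # MWF
```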
Exercise 2
Translations
Python way:
    # complement each base, then reverse the string: the reverse complement
    tab = str.maketrans("ACGU", "UGCA")
    sequence = sequence.translate(tab)[::-1]
http://www.pythonchallenge.com