May 19, 2015
Homework
● ATCurve.py
  ● take an input string from the user
  ● check if the sequence only contains DNA
    – if not, prompt for a new sequence
  ● calculate a running average of AT content along the sequence. Window size should be 3, and the step size should be 1. Print one value per line.
● Note: you need to include several runtime examples to show that all parts of the code work.
ATCurve.py - thinking
● Take input from user:
  ● raw_input
● Check for the presence of anything other than A, T, C, G
  ● use sets – very easy
● Calculate AT – window = 3, step = 1
  ● iterate over string in slices of three
ATCurve.py
# The variable valid records whether the string is OK or not.
valid = False
while not valid:
    # Prompt the user for input using raw_input() and store it in a string,
    # then convert all characters to uppercase
    test_string = raw_input("Enter string: ")
    upper_string = test_string.upper()

    # Figure out if anything other than A, T, G, C is present
    dnaset = set("ATGC")
    upper_string_set = set(upper_string)

    if len(upper_string_set - dnaset) > 0:
        print "Non-DNA present in your string, try again"
    else:
        valid = True

if valid:
    # Slide a window of size 3 along the string, one position at a time
    for i in range(0, len(upper_string) - 3 + 1, 1):
        at_sum = 0.0
        # count(sub, start, end) excludes end, so i + 3 covers three characters
        at_sum += upper_string.count("A", i, i + 3)
        at_sum += upper_string.count("T", i, i + 3)
        print at_sum / 3
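The windowing step can also be packaged as a function, which makes it easier to produce the runtime examples the assignment asks for. The following is a minimal sketch only: the name at_curve and its parameters are illustrative and not part of the assignment, and the code is written so that it runs under Python 3 as well as the Python 2 used in these slides.

```python
def at_curve(sequence, window=3, step=1):
    """Return the AT fraction of each window along an uppercase DNA string."""
    values = []
    # The last window starts at len(sequence) - window, hence the + 1
    for i in range(0, len(sequence) - window + 1, step):
        chunk = sequence[i:i + window]
        values.append((chunk.count("A") + chunk.count("T")) / float(window))
    return values


# Runtime example: windows are ATG, TGC and GCA
for value in at_curve("ATGCA"):
    print(value)
```

A string shorter than the window gives an empty list, so nothing is printed for it.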
Homework
● CodonFrequency.py
  ● take an input string from the user
  ● if the sequence only contains DNA
    – find a start codon in your string
    – if a start codon is present
      ● count the occurrences of each three-mer from start codon and onwards
      ● print the results
CodonFrequency.py - thinking
● First part – same as earlier
● Find start codon: locate index of AUG
  ● Note, can simplify and find ATG
● If start codon is found:
  ● create dictionary
  ● for slice of three in input[StartCodon:]:
    – get codon
    – if codon is in dict:
      ● add to count
    – if not:
      ● create key-value pair in dict
CodonFrequency.py
sequence = raw_input("Type a piece of DNA here: ")

if len(set(sequence) - set("ATGC")) > 0:
    print "Not a valid DNA sequence"
else:
    atg = sequence.find("ATG")
    if atg == -1:
        print "Start codon not found"
    else:
        codondict = {}
        # Step through complete codons; len(sequence) - 2 keeps the last full one
        for i in xrange(atg, len(sequence) - 2, 3):
            codon = sequence[i:i+3]
            if codon not in codondict:
                codondict[codon] = 1
            else:
                codondict[codon] += 1

        for codon in codondict:
            print codon, codondict[codon]
CodonFrequency.py w/ stopcodon
sequence = raw_input("Type a piece of DNA here: ")

if len(set(sequence) - set("ATGC")) > 0:
    print "Not a valid DNA sequence"
else:
    atg = sequence.find("ATG")
    if atg == -1:
        print "Start codon not found"
    else:
        codondict = {}
        for i in xrange(atg, len(sequence) - 2, 3):
            codon = sequence[i:i+3]
            # Stop codons in DNA spelling; the validated input contains no U
            if codon in ['TAG', 'TAA', 'TGA']:
                break
            elif codon not in codondict:
                codondict[codon] = 1
            else:
                codondict[codon] += 1

        for codon in codondict:
            print codon, codondict[codon]
Results
[karinlag@freebee]/projects/temporary/cees-python-course/Karin% python CodonFrequency2.py
Type a piece of DNA here: ATGATTATTTAAATG
ATG 1
ATT 2
TAA 1
[karinlag@freebee]/projects/temporary/cees-python-course/Karin% python CodonFrequency2.py
Type a piece of DNA here: ATGATTATTTAAATGT
ATG 2
ATT 2
TAA 1
[karinlag@freebee]/projects/temporary/cees-python-course/Karin%
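The "if codon is in dict / if not" step from the thinking slide can be written more compactly with dict.get, which returns a default when the key is missing. This is a sketch only: count_codons is a made-up helper name, it does not stop at stop codons, and it counts every complete codon from the first ATG to the end of the string. It runs under Python 3 as well as Python 2.

```python
def count_codons(sequence):
    """Count complete three-mers from the first ATG to the end of the string."""
    counts = {}
    start = sequence.find("ATG")
    if start == -1:
        return counts  # no start codon: nothing to count
    for i in range(start, len(sequence) - 2, 3):
        codon = sequence[i:i + 3]
        # get() returns 0 the first time a codon is seen
        counts[codon] = counts.get(codon, 0) + 1
    return counts


counts = count_codons("CCATGAAATTTAAATAG")
for codon in sorted(counts):
    print(codon + " " + str(counts[codon]))
```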
Working with files
● Reading – get info into your program
● Parsing – processing file contents
● Writing – get info out of your program
Reading and writing
● Three-step process
  ● Open file
    – create file handle – reference to file
  ● Read or write to file
  ● Close file
    – will be automatically closed on program end, but bad form to not close
Opening files
● Opening modes:
  ● "r" - read file
  ● "w" - write file (an existing file is overwritten)
  ● "a" - append to end of file
● fh = open("filename", "mode")
  ● fh = filehandle, reference to a file, NOT the file itself
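The three modes can be tried out on a scratch file. The sketch below is an illustration only: the temporary directory and the file name example.txt are made up for the example, and the code runs under Python 3 as well as Python 2.

```python
import os
import tempfile

# Hypothetical scratch file, used only for this illustration
path = os.path.join(tempfile.mkdtemp(), "example.txt")

fh = open(path, "w")   # "w" creates the file, or empties an existing one
fh.write("first line\n")
fh.close()

fh = open(path, "a")   # "a" keeps the contents and adds to the end
fh.write("second line\n")
fh.close()

fh = open(path, "r")   # "r" opens an existing file for reading
contents = fh.read()
fh.close()

print(contents)
```

Each step follows the same open, read or write, close pattern; only the mode string changes.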
Reading a file
● Three ways to read
  ● read([n]) - n = bytes to read, default is all
  ● readline() - read one line, incl. newline
  ● readlines() - read file into a list, one element per line, including newline
Reading example
● Log on to freebee, and go to your area
● do cp ../Karin/fastafile.fsa .
● open python
● Q: what does the response mean?
>>> fh = open("fastafile.fsa", "r")
>>> fh
Read example
● Use all three methods to read the file. Print the results.
  ● read
  ● readlines
  ● readline
● Q: what happens after you have read the file?
● Q: What is the difference between the three?
Read example
>>> fh = open("fastafile.fsa", "r")
>>> withread = fh.read()
>>> withread
'>This is the description line\nATGCGCTTAGGATCGATAGCGATTTAGA\nTTAGCGGA\n'
>>> withreadlines = fh.readlines()
>>> withreadlines
[]
>>> fh = open("fastafile.fsa", "r")
>>> withreadlines = fh.readlines()
>>> withreadlines
['>This is the description line\n', 'ATGCGCTTAGGATCGATAGCGATTTAGA\n', 'TTAGCGGA\n']
>>> fh = open("fastafile.fsa", "r")
>>> withreadline = fh.readline()
>>> withreadline
'>This is the description line\n'
>>>
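The empty list in the session above appears because read() leaves the file position at the end of the file, so there is nothing left for readlines() to return. Reopening the file is one way to start over; seek(0) on the same handle is another. The sketch below assumes a temporary copy of the file with the contents shown above (the temporary path is made up for the example) and runs under Python 3 as well as Python 2.

```python
import os
import tempfile

# Recreate the example file in a scratch directory (hypothetical location)
path = os.path.join(tempfile.mkdtemp(), "fastafile.fsa")
fh = open(path, "w")
fh.write(">This is the description line\nATGCGCTTAGGATCGATAGCGATTTAGA\nTTAGCGGA\n")
fh.close()

fh = open(path, "r")
whole = fh.read()             # reads everything, position is now at the end
after_read = fh.readlines()   # nothing left to read
fh.seek(0)                    # move the position back to the start
first = fh.readline()         # one line, including its newline
rest = fh.readlines()         # the remaining lines as a list
fh.close()

print(after_read)
print(rest)
```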
Parsing
● Getting information out of a file
● Commonly used string methods
  ● split([character]) – default is whitespace
  ● replace("in string", "put into instead")
  ● "string character".join(list)
    – joins all elements in the list with string character as a separator
    – common construction: ''.join(list)
  ● slicing
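A short sketch of these string methods on made-up data (the tab-separated line and the variable names are invented for the example; the code runs under Python 3 as well as Python 2):

```python
line = "ATG\tM\tMethionine\n"   # made-up tab-separated line

# replace() removes the newline, split() cuts the line at each tab
fields = line.replace("\n", "").split("\t")

# join() glues list elements together with the given separator
joined = "-".join(fields)
glued = "".join(["ATG", "CGC", "TTA"])

# slicing picks out part of a string
first_two = glued[0:2]

# split() without an argument splits on any run of whitespace
words = "two  words here".split()

print(fields)
print(joined)
```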
Type conversions
● Everything that comes on the command line or from a file is a string
● Conversions:
  ● int(X)
    – string cannot have decimals
    – floats will be truncated (the decimals are dropped)
  ● float(X)
  ● str(X)
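The conversions can be checked directly. A small sketch with invented variable names, which runs under Python 3 as well as Python 2:

```python
window = int("3")          # string with digits only: fine
fraction = float("0.25")   # string with decimals: use float()
text = str(2.5)            # number back to a string

# int() on a float drops the decimals, for negative numbers too
truncated_pos = int(3.7)
truncated_neg = int(-3.7)

# int() on a string with decimals raises ValueError
try:
    int("3.7")
    decimal_string_ok = True
except ValueError:
    decimal_string_ok = False

# Going through float() first works
via_float = int(float("3.7"))

print(truncated_pos)
print(truncated_neg)
```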
Parsing example
● Continue using fastafile.fsa
● Print only the description line to screen
● Print the whole DNA string
>>> fh = open("fastafile.fsa", "r")
>>> firstline = fh.readline()
>>> print firstline[1:-1]
This is the description line
>>> sequence = ''
>>> for line in fh:
...     sequence += line.replace("\n", "")
...
>>> print sequence
ATGCGCTTAGGATCGATAGCGATTTAGATTAGCGGA
>>>
Accepting input from command line
● Need to be able to specify file name on command line
● Command line parameters are stored in a list called sys.argv – the program name is at index 0
● Usage:
  ● python pythonscript.py arg1 arg2 arg3....
● In script:
  ● at the top of the file, write import sys
  ● arg1 = sys.argv[1]
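Because sys.argv is an ordinary list of strings, it can be checked and converted like any other list. The sketch below is illustrative only: parse_args is a made-up helper, and the argument order (file name, then window size) is an assumption for the example. It runs under Python 3 as well as Python 2.

```python
import sys


def parse_args(argv):
    """Return (filename, windowsize) from an argv-style list, or None if too short."""
    # argv[0] is the program name, so two arguments need a list of length 3
    if len(argv) < 3:
        return None
    # Everything on the command line is a string, so convert the number
    return argv[1], int(argv[2])


# In a real script the call would be parse_args(sys.argv)
print(parse_args(["pythonscript.py", "fastafile.fsa", "3"]))
print(parse_args(["pythonscript.py"]))
```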
Batch example
● Read fastafile.fsa with all three methods
● Per method, print method, name and sequence
● Remember to close the file at the end!
Batch example
import sys
filename = sys.argv[1]

# using readline
fh = open(filename, "r")
firstline = fh.readline()
name = firstline[1:-1]
sequence = ''
for line in fh:
    sequence += line.replace("\n", "")
fh.close()
print "Readline", name, sequence

# using readlines()
fh = open(filename, "r")
inputlines = fh.readlines()
fh.close()
name = inputlines[0][1:-1]
sequence = ''
for line in inputlines[1:]:
    sequence += line.replace("\n", "")
print "Readlines", name, sequence

# using read
fh = open(filename, "r")
inputlines = fh.read()
fh.close()
name = inputlines.split("\n")[0][1:-1]
sequence = "".join(inputlines.split("\n")[1:])
print "Read", name, sequence
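One way to make sure a file is always closed, not used in these slides, is Python's with statement, which closes the handle when the indented block ends. The sketch below is an illustration under assumptions: read_fasta is a made-up helper that expects a file with a single FASTA record, and the scratch file recreates the contents of fastafile.fsa in a temporary location. It runs under Python 3 as well as Python 2.

```python
import os
import tempfile

# Recreate the example file in a scratch directory (hypothetical location)
path = os.path.join(tempfile.mkdtemp(), "fastafile.fsa")
with open(path, "w") as out:
    out.write(">This is the description line\nATGCGCTTAGGATCGATAGCGATTTAGA\nTTAGCGGA\n")


def read_fasta(filename):
    """Return (name, sequence) for a file holding a single FASTA record."""
    with open(filename, "r") as fh:
        # First line is the description: drop the leading ">" and the newline
        name = fh.readline()[1:-1]
        # Remaining lines are sequence: strip newlines and glue together
        sequence = "".join(line.replace("\n", "") for line in fh)
    # The file is already closed here
    return name, sequence


name, sequence = read_fasta(path)
print(name)
print(sequence)
```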
Classroom exercise
● Modify the ATCurve.py script so that it accepts the following input on the command line:
  ● fasta filename
  ● window size
● Let the user input an alternate filename if the file contains anything other than A, T, G, C
● Print results to screen
ATCurve2.py
import sys

# Define filename and window size
filename = sys.argv[1]
windowsize = int(sys.argv[2])

# The variable valid records whether the string is OK or not.
valid = False
while not valid:
    fh = open(filename, "r")
    inputlines = fh.readlines()
    fh.close()
    name = inputlines[0][1:-1]
    sequence = ''
    for line in inputlines[1:]:
        sequence += line.replace("\n", "")
    upper_string = sequence.upper()

    # Figure out if anything other than A, T, G, C is present
    dnaset = set("ATGC")
    upper_string_set = set(upper_string)

    if len(upper_string_set - dnaset) > 0:
        print "Non-DNA present in your file, try again"
        filename = raw_input("Type in filename: ")
    else:
        valid = True

if valid:
    for i in range(0, len(upper_string) - windowsize + 1, 1):
        at_sum = 0.0
        at_sum += upper_string.count("A", i, i + windowsize)
        at_sum += upper_string.count("T", i, i + windowsize)
        print i + 1, at_sum / windowsize
Writing to files
● Similar procedure as for read
● Open file, mode is "w" or "a"
● fh.write(string)
  – Note: one single string
  – No newlines are added
● fh.close()
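Since write() takes one single string and adds no newline, numbers have to be converted with str() and the "\n" has to be written explicitly. A minimal sketch with made-up values and a made-up scratch file name, which runs under Python 3 as well as Python 2:

```python
import os
import tempfile

# Hypothetical output file and made-up numbers, for illustration only
path = os.path.join(tempfile.mkdtemp(), "atcurve.txt")
values = [0.667, 0.333, 0.333]

fh = open(path, "w")
# enumerate(values, 1) numbers the values starting from 1
for position, value in enumerate(values, 1):
    # Build one string per line and end it with a newline
    fh.write(str(position) + " " + str(value) + "\n")
fh.close()

# Read the file back to see what was written
fh = open(path, "r")
lines = fh.readlines()
fh.close()

print(lines)
```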
ATCurve3.py
● Modify the previous script so that you have the following on the command line
  ● fasta filename for input file
  ● window size
  ● output file
● Output should be in the format
  ● number, AT content
  ● number is the 1-based position of the first nucleotide in the window
ATCurve3.py
import sys

# Define filename, window size and output file
filename = sys.argv[1]
windowsize = int(sys.argv[2])
outputfile = sys.argv[3]

# (reading and validating the sequence is the same as in ATCurve2.py)

if valid:
    fh = open(outputfile, "w")
    for i in range(0, len(upper_string) - windowsize + 1, 1):
        at_sum = 0.0
        at_sum += upper_string.count("A", i, i + windowsize)
        at_sum += upper_string.count("T", i, i + windowsize)
        # write() needs one string and adds no newline of its own
        fh.write(str(i + 1) + " " + str(at_sum / windowsize) + "\n")
    fh.close()
Homework: TranslateProtein.py
● Input files are in /projects/temporary/cees-python-course/Karin
  ● translationtable.txt - tab separated
  ● dna31.fsa
● Script should:
  ● Open the translationtable.txt file and read it into a dictionary
  ● Open the dna31.fsa file and read the contents.
  ● Translate the DNA into protein using the dictionary
  ● Print the translation in a fasta format to the file TranslateProtein.fsa. Each protein line should be 60 characters long.
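Two building blocks for this homework can be sketched separately from the file handling. The sketch makes assumptions that are not in the assignment: parse_table and wrap are made-up helper names, and the table is assumed to have the codon in the first tab-separated column and the amino acid in the second, so check the real translationtable.txt before relying on it. The code runs under Python 3 as well as Python 2.

```python
def parse_table(lines):
    """Build a codon -> amino acid dictionary from tab-separated lines.

    Assumes column 1 is the codon and column 2 the amino acid.
    """
    table = {}
    for line in lines:
        fields = line.replace("\n", "").split("\t")
        if len(fields) >= 2:   # skip empty or incomplete lines
            table[fields[0]] = fields[1]
    return table


def wrap(sequence, width=60):
    """Split a string into pieces of at most width characters."""
    return [sequence[i:i + width] for i in range(0, len(sequence), width)]


table = parse_table(["ATG\tM\n", "AAA\tK\n"])   # made-up two-codon table
protein = "".join(table[codon] for codon in ["ATG", "AAA", "AAA"])
print(protein)

# 130 characters wrap into lines of 60, 60 and 10
for piece in wrap("M" * 130):
    print(len(piece))
```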