Day3

File handling

Karin Lagesen

[email protected]

Homework

● ATCurve.py
  ● take an input string from the user
  ● check if the sequence only contains DNA – if not, prompt for a new sequence
  ● calculate a running average of AT content along the sequence. Window size should be 3, and the step size should be 1. Print one value per line.
● Note: you need to include several runtime examples to show that all parts of the code work.

ATCurve.py - thinking

● Take input from user:
  ● raw_input
● Check for the presence of !ATCG
  ● use sets – very easy
● Calculate AT – window = 3, step = 1
  ● iterate over string in slices of three

ATCurve.py

# variable valid is used to see if the string is ok or not.
valid = False
while not valid:
    # prompt user for input using raw_input() and store in string,
    # convert all characters into uppercase
    test_string = raw_input("Enter string: ")
    upper_string = test_string.upper()

    # Figure out if anything else than ATGCs are present
    dnaset = set(list("ATGC"))
    upper_string_set = set(list(upper_string))

    if len(upper_string_set - dnaset) > 0:
        print "Non-DNA present in your string, try again"
    else:
        valid = True

if valid:
    # the last window starts at len - 3, so the range must end at len - 3 + 1
    for i in range(0, len(upper_string) - 3 + 1, 1):
        at_sum = 0.0
        # the end index of count() is exclusive: i+3 covers three characters
        at_sum += upper_string.count("A", i, i + 3)
        at_sum += upper_string.count("T", i, i + 3)
        print at_sum / 3
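The check and the window loop above can also be wrapped in functions, which makes it easier to try several inputs without retyping them. This is only a minimal sketch, not the course solution: the names `is_dna` and `at_curve` are made up for illustration, and `print()` is called with a single argument so that the code runs under both Python 2 and Python 3.

```python
def is_dna(sequence):
    """Return True if the sequence contains only A, T, G and C."""
    return len(set(sequence.upper()) - set("ATGC")) == 0


def at_curve(sequence, window=3):
    """Return the AT fraction of every window, moving one base at a time."""
    sequence = sequence.upper()
    values = []
    # The last window starts at len(sequence) - window, hence the + 1.
    for i in range(len(sequence) - window + 1):
        piece = sequence[i:i + window]
        at_count = piece.count("A") + piece.count("T")
        values.append(at_count / float(window))
    return values


# Windows of "ATGCAT": ATG, TGC, GCA, CAT
for value in at_curve("ATGCAT"):
    print(value)
```

A sequence shorter than the window gives an empty list, so nothing is printed in that case.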

Homework

● CodonFrequency.py
  ● take an input string from the user
  ● if the sequence only contains DNA
    – find a start codon in your string
    – if start codon is present
      ● count the occurrences of each three-mer from start codon and onwards
      ● print the results

CodonFrequency.py - thinking

● First part – same as earlier
● Find start codon: locate index of AUG
  ● Note, can simplify and find ATG
● If start codon is found:
  ● create dictionary
  ● for slice of three in input[StartCodon:]:
    – get codon
    – if codon is in dict:
      ● add to count
    – if not:
      ● create key-value pair in dict

CodonFrequency.py

input = raw_input("Type a piece of DNA here: ")

if len(set(input) - set(list("ATGC"))) > 0:
    print "Not a valid DNA sequence"
else:
    atg = input.find("ATG")
    if atg == -1:
        print "Start codon not found"
    else:
        codondict = {}
        # step through the string three characters at a time, from the ATG.
        # With the bound len(input)-3, a codon that ends exactly at the last
        # base is not counted; len(input)-2 would include it.
        for i in xrange(atg, len(input) - 3, 3):
            codon = input[i:i+3]
            if codon not in codondict:
                codondict[codon] = 1
            else:
                codondict[codon] += 1

        for codon in codondict:
            print codon, codondict[codon]

CodonFrequency.py w/ stopcodon

input = raw_input("Type a piece of DNA here: ")

if len(set(input) - set(list("ATGC"))) > 0:
    print "Not a valid DNA sequence"
else:
    atg = input.find("ATG")
    if atg == -1:
        print "Start codon not found"
    else:
        codondict = {}
        for i in xrange(atg, len(input) - 3, 3):
            codon = input[i:i+3]
            # The stop codons are spelled with U (RNA) here. The validated
            # input contains only A, T, G and C, so for DNA input this test
            # would need TAG, TAA and TGA in order to match.
            if codon in ['UAG', 'UAA', 'UGA']:
                break
            elif codon not in codondict:
                codondict[codon] = 1
            else:
                codondict[codon] += 1

        for codon in codondict:
            print codon, codondict[codon]

Results

[karinlag@freebee]/projects/temporary/cees-python-course/Karin% python CodonFrequency2.py
Type a piece of DNA here: ATGATTATTTAAATG
ATG 1
ATT 2
TAA 1
[karinlag@freebee]/projects/temporary/cees-python-course/Karin% python CodonFrequency2.py
Type a piece of DNA here: ATGATTATTTAAATGT
ATG 2
ATT 2
TAA 1
[karinlag@freebee]/projects/temporary/cees-python-course/Karin%
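In both runs above, TAA is counted like any other codon: the input is checked to contain only A, T, G and C, so a comparison against codons spelled with U cannot match. The sketch below is one possible variant, assuming the intent is to stop at the DNA spellings TAG, TAA and TGA and to include a codon that ends at the last base. The helper name `count_codons` is hypothetical, and the code is written to run under Python 2 and Python 3.

```python
STOP_CODONS = ("TAG", "TAA", "TGA")  # DNA spellings of UAG, UAA, UGA


def count_codons(dna):
    """Count in-frame codons from the first ATG up to the first stop codon.

    Returns None if no start codon is found. The stop codon itself is
    not counted (an assumption made for this sketch).
    """
    start = dna.find("ATG")
    if start == -1:
        return None
    counts = {}
    # len(dna) - 2 lets the loop reach a codon that ends at the last base.
    for i in range(start, len(dna) - 2, 3):
        codon = dna[i:i + 3]
        if codon in STOP_CODONS:
            break
        counts[codon] = counts.get(codon, 0) + 1
    return counts


result = count_codons("ATGATTATTTAAATG")
for codon in sorted(result):
    print(codon + " " + str(result[codon]))
```

With this variant the first input from the runs above gives ATG 1 and ATT 2, because counting stops at TAA.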

Working with files

● Reading – get info into your program
● Parsing – processing file contents
● Writing – get info out of your program

Reading and writing

● Three-step process
  ● Open file
    – create file handle – reference to file
  ● Read or write to file
  ● Close file
    – will be automatically closed on program end, but it is bad form not to close
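The three steps can be sketched as follows. The file name is made up, and a temporary folder is used so the example does not touch any real files. The code runs under Python 2 and Python 3.

```python
import os
import tempfile

# A throwaway file so the example is self-contained (name is made up).
path = os.path.join(tempfile.mkdtemp(), "example.txt")

fh = open(path, "w")          # 1. open: fh is a handle, a reference to the file
fh.write("first line\n")      # 2. write through the handle
fh.close()                    # 3. close: makes sure the data reaches the disk

fh = open(path, "r")          # the same three steps for reading
contents = fh.read()
fh.close()

print(fh.closed)              # the handle knows that it has been closed

# A closed handle can no longer be used for reading.
try:
    fh.read()
    read_after_close = "worked"
except ValueError:
    read_after_close = "refused"
print(read_after_close)
```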

Opening files

● Opening modes:
  ● "r" - read file
  ● "w" - write file (an existing file is overwritten)
  ● "a" - append to end of file
● fh = open("filename", "mode")
● fh = filehandle, reference to a file, NOT the file itself
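A small sketch of how the three modes differ in practice. The file name is hypothetical and lives in a temporary folder. The code runs under Python 2 and Python 3.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "modes.txt")  # hypothetical file name

fh = open(path, "w")      # "w" creates the file, or empties an existing one
fh.write("one\n")
fh.close()

fh = open(path, "a")      # "a" keeps the contents and adds to the end
fh.write("two\n")
fh.close()

fh = open(path, "r")      # "r" reads; the file must already exist
after_append = fh.read()
fh.close()

fh = open(path, "w")      # opening with "w" again discards "one" and "two"
fh.write("three\n")
fh.close()

fh = open(path, "r")
after_rewrite = fh.read()
fh.close()

print(repr(after_append))
print(repr(after_rewrite))
```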

Reading a file

● Three ways to read
  ● read([n]) - n = bytes to read, default is all
  ● readline() - read one line, incl. newline
  ● readlines() - read file into a list, one element per line, including newline

Reading example

● Log on to freebee, and go to your area
● do cp ../Karin/fastafile.fsa .
● open python
● Q: what does the response mean?

>>> fh = open("fastafile.fsa", "r")
>>> fh

Read example

● Use all three methods to read the file. Print the results.
  ● read
  ● readlines
  ● readline
● Q: what happens after you have read the file?
● Q: What is the difference between the three?

Read example

>>> fh = open("fastafile.fsa", "r")
>>> withread = fh.read()
>>> withread
'>This is the description line\nATGCGCTTAGGATCGATAGCGATTTAGA\nTTAGCGGA\n'
>>> withreadlines = fh.readlines()
>>> withreadlines
[]
>>> fh = open("fastafile.fsa", "r")
>>> withreadlines = fh.readlines()
>>> withreadlines
['>This is the description line\n', 'ATGCGCTTAGGATCGATAGCGATTTAGA\n', 'TTAGCGGA\n']
>>> fh = open("fastafile.fsa", "r")
>>> withreadline = fh.readline()
>>> withreadline
'>This is the description line\n'
>>>
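The empty list in the session above shows that a file handle remembers how far it has read: after read() the position is at the end of the file, so readlines() has nothing left to return. The session reopens the file to start over. Another option, not covered on the slides, is the standard file method seek(0), which moves the position back to the start. The sketch below recreates the example file in a temporary folder so that it is self-contained.

```python
import os
import tempfile

# Recreate a file with the same contents as fastafile.fsa in the session above.
path = os.path.join(tempfile.mkdtemp(), "fastafile.fsa")
fh = open(path, "w")
fh.write(">This is the description line\n"
         "ATGCGCTTAGGATCGATAGCGATTTAGA\n"
         "TTAGCGGA\n")
fh.close()

fh = open(path, "r")
first_pass = fh.readlines()    # reads every line; position is now at the end
second_pass = fh.readlines()   # nothing left to read, so this is []
fh.seek(0)                     # move the position back to the start
third_pass = fh.readlines()    # the whole file again
fh.close()

print(len(first_pass))
print(second_pass)
print(third_pass == first_pass)
```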

Parsing

● Getting information out of a file
● Commonly used string methods
  ● split([character]) – default is whitespace
  ● replace("in string", "put into instead")
  ● "string character".join(list)
    – joins all elements in the list with string character as a separator
    – common construction: ''.join(list)
  ● slicing
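The string methods listed above, applied to one line of text. The line and its contents are made up for illustration. The code runs under Python 2 and Python 3.

```python
line = "gene1\t120\tATGCCG\n"        # made-up tab-separated line

stripped = line.replace("\n", "")    # replace: remove the newline
fields = stripped.split("\t")        # split on tabs
words = "  two   words ".split()     # no argument: split on any whitespace
joined = "-".join(fields)            # join the list with "-" as separator
sequence = "".join(["ATG", "CCG"])   # empty separator: plain concatenation
codon = fields[2][0:3]               # slicing: the first three characters

print(fields)
print(words)
print(joined)
print(sequence)
print(codon)
```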

Type conversions

● Everything that comes on the command line or from a file is a string
● Conversions:
  ● int(X)
    – string cannot have decimals
    – floats are truncated toward zero
  ● float(X)
  ● str(X)
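A short sketch of the conversions and their limits. Note that int() applied to a float drops the fractional part, so a negative float moves toward zero. The variable names are made up, and the code runs under Python 2 and Python 3.

```python
window = int("3")          # string -> int
fraction = float("0.25")   # string -> float
label = str(42)            # number -> string

whole_up = int(2.7)        # float -> int drops the fractional part
whole_down = int(-2.7)     # toward zero, so this is -2, not -3

try:
    int("2.7")             # a string with a decimal point is rejected
    message = "converted"
except ValueError:
    message = "int() cannot parse '2.7'"

# One way around it: go through float first.
via_float = int(float("2.7"))

print(window + 1)
print(label + "!")
print(message)
```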

Parsing example

● Continue using fastafile.fsa
● Print only the description line to screen
● Print the whole DNA string

>>> fh = open("fastafile.fsa", "r")
>>> firstline = fh.readline()
>>> print firstline[1:-1]
This is the description line
>>> sequence = ''
>>> for line in fh:
...     sequence += line.replace("\n", "")
...
>>> print sequence
ATGCGCTTAGGATCGATAGCGATTTAGATTAGCGGA
>>>

Accepting input from command line

● Need to be able to specify file name on command line
● Command line parameters are stored in a list called sys.argv – the program name is element 0
● Usage:
  ● python pythonscript.py arg1 arg2 arg3....
● In script:
  ● at the top of the file, write import sys
  ● arg1 = sys.argv[1]
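A minimal sketch of reading arguments from sys.argv. The helper `parse_args` and the argument names are hypothetical. A hand-made list stands in for sys.argv so that the example runs without command line arguments. The code runs under Python 2 and Python 3.

```python
import sys


def parse_args(argv):
    """Return (filename, number) from a sys.argv-style list."""
    # argv[0] is the script name, so the real arguments start at index 1.
    if len(argv) != 3:
        raise ValueError("usage: python pythonscript.py <filename> <number>")
    filename = argv[1]
    number = int(argv[2])      # arguments always arrive as strings
    return filename, number


# In a real script this would be parse_args(sys.argv).
filename, number = parse_args(["pythonscript.py", "fastafile.fsa", "3"])
print(filename)
print(number + 1)
```

Checking the length of the list first gives a readable message instead of an IndexError when an argument is missing.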

Batch example

● Read fastafile.fsa with all three methods
● Per method, print method, name and sequence
● Remember to close the file at the end!

Batch example

import sys
filename = sys.argv[1]

# using readline
fh = open(filename, "r")
firstline = fh.readline()
name = firstline[1:-1]
sequence = ''
for line in fh:
    sequence += line.replace("\n", "")
fh.close()
print "Readline", name, sequence

# using readlines()
fh = open(filename, "r")
inputlines = fh.readlines()
fh.close()
name = inputlines[0][1:-1]
sequence = ''
for line in inputlines[1:]:
    sequence += line.replace("\n", "")
print "Readlines", name, sequence

# using read
fh = open(filename, "r")
inputlines = fh.read()
fh.close()
name = inputlines.split("\n")[0][1:-1]
sequence = "".join(inputlines.split("\n")[1:])
print "Read", name, sequence

Classroom exercise

● Modify ATCurve.py script so that it accepts the following input on the command line:
  ● fasta filename
  ● window size
● Let the user input an alternate filename if it contains !ATGC
● Print results to screen

ATCurve2.py

import sys
# Define filename
filename = sys.argv[1]
windowsize = int(sys.argv[2])

# variable valid is used to see if the string is ok or not.
valid = False
while not valid:
    fh = open(filename, "r")
    inputlines = fh.readlines()
    fh.close()
    name = inputlines[0][1:-1]
    sequence = ''
    for line in inputlines[1:]:
        sequence += line.replace("\n", "")
    upper_string = sequence.upper()

    # Figure out if anything else than ATGCs are present
    dnaset = set(list("ATGC"))
    upper_string_set = set(list(upper_string))

    if len(upper_string_set - dnaset) > 0:
        print "Non-DNA present in your file, try again"
        filename = raw_input("Type in filename: ")
    else:
        valid = True

if valid:
    for i in range(0, len(upper_string) - windowsize + 1, 1):
        at_sum = 0.0
        at_sum += upper_string.count("A", i, i + windowsize)
        at_sum += upper_string.count("T", i, i + windowsize)
        print i + 1, at_sum / windowsize

Writing to files

● Similar procedure as for read
● Open file, mode is "w" or "a"
● fh.write(string)
  – Note: one single string
  – No newlines are added
● fh.close()
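Because write() takes one single string and adds no newline, numbers have to be converted with str() and the "\n" has to be added by hand. A minimal sketch with made-up values and a hypothetical file name in a temporary folder (runs under Python 2 and Python 3):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "output.txt")  # hypothetical name

values = [0.5, 0.25, 1.0]      # made-up numbers to write

fh = open(path, "w")
for position, value in enumerate(values):
    # write() takes a single string: convert numbers with str() and
    # add the newline yourself.
    fh.write(str(position + 1) + " " + str(value) + "\n")
fh.close()

# Read the file back to check what was written.
fh = open(path, "r")
written = fh.read()
fh.close()
print(written)
```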

ATCurve3.py

● Modify previous script so that you have the following on the command line
  ● fasta filename for input file
  ● window size
  ● output file
● Output should be in the format
  ● number, AT content
  ● number is the 1-based position of the first nucleotide in the window

ATCurve3.py

import sys
# Define filename
filename = sys.argv[1]
windowsize = int(sys.argv[2])
outputfile = sys.argv[3]

# (reading and validating the input file is the same as in ATCurve2.py)

if valid:
    fh = open(outputfile, "w")
    for i in range(0, len(upper_string) - windowsize + 1, 1):
        at_sum = 0.0
        at_sum += upper_string.count("A", i, i + windowsize)
        at_sum += upper_string.count("T", i, i + windowsize)
        fh.write(str(i + 1) + " " + str(at_sum / windowsize) + "\n")
    fh.close()

Homework: TranslateProtein.py

● Input files are in /projects/temporary/cees-python-course/Karin
  ● translationtable.txt - tab separated
  ● dna31.fsa
● Script should:
  ● Open the translationtable.txt file and read it into a dictionary
  ● Open the dna31.fsa file and read the contents
  ● Translate the DNA into protein using the dictionary
  ● Print the translation in fasta format to the file TranslateProtein.fsa. Each protein line should be 60 characters long.
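Two building blocks that may help with this homework, sketched under stated assumptions and not as a full solution. The column layout of translationtable.txt is not shown on the slides, so the sketch ASSUMES one codon and one amino acid letter per line, separated by a tab. The function names `table_from_lines` and `wrap` are made up, and the code works on lists of strings, so the file reading is left out. It runs under Python 2 and Python 3.

```python
def table_from_lines(lines):
    """Build a codon -> amino acid dictionary from tab-separated lines.

    ASSUMPTION: each line looks like "ATG<TAB>M"; the real layout of
    translationtable.txt may differ.
    """
    table = {}
    for line in lines:
        fields = line.replace("\n", "").split("\t")
        if len(fields) >= 2:          # skip empty or incomplete lines
            table[fields[0]] = fields[1]
    return table


def wrap(sequence, width=60):
    """Split a sequence into pieces of at most `width` characters."""
    pieces = []
    for i in range(0, len(sequence), width):
        pieces.append(sequence[i:i + width])
    return pieces


table = table_from_lines(["ATG\tM\n", "AAA\tK\n"])   # made-up two-line table
print(table["ATG"])
for piece in wrap("M" * 130):
    print(len(piece))
```

Lines returned by readlines() can be passed straight to `table_from_lines`, since each element still ends with its newline.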
