File handling Karin Lagesen [email protected]

Day3

May 19, 2015

karinlag

Day 3 of a Python intro course for biologists.
Theme: how to work with files
Page 1: Day3

File handling

Karin Lagesen

[email protected]

Page 2: Day3

Homework

● ATCurve.py
  ● take an input string from the user
  ● check if the sequence only contains DNA; if not, prompt for a new sequence
  ● calculate a running average of AT content along the sequence. Window size should be 3, and the step size should be 1. Print one value per line.
● Note: you need to include several runtime examples to show that all parts of the code work.

Page 3: Day3

ATCurve.py - thinking

● Take input from user:
  ● raw_input
● Check for the presence of characters other than A, T, C and G
  ● use sets: very easy
● Calculate AT content: window = 3, step = 1
  ● iterate over string in slices of three

Page 4: Day3

ATCurve.py

# The variable valid records whether the string is OK or not.
valid = False
while not valid:
    # Prompt the user for input using raw_input() and store it in a string,
    # then convert all characters to uppercase
    test_string = raw_input("Enter string: ")
    upper_string = test_string.upper()

    # Figure out if anything other than A, T, G and C is present
    dnaset = set(list("ATGC"))
    upper_string_set = set(list(upper_string))

    if len(upper_string_set - dnaset) > 0:
        print "Non-DNA present in your string, try again"
    else:
        valid = True

if valid:
    # Slide a window of size 3 along the sequence, one position at a time.
    # The "+ 1" makes sure the last full window is included.
    for i in range(0, len(upper_string) - 3 + 1, 1):
        at_sum = 0.0
        # count(x, start, end) counts x within upper_string[start:end],
        # so the end index must be i + 3 to cover three characters
        at_sum += upper_string.count("A", i, i + 3)
        at_sum += upper_string.count("T", i, i + 3)
        print at_sum / 3

Page 5: Day3

Homework

● CodonFrequency.py
  ● take an input string from the user
  ● if the sequence only contains DNA
    – find a start codon in your string
    – if a start codon is present
      ● count the occurrences of each three-mer from the start codon and onwards
      ● print the results

Page 6: Day3

CodonFrequency.py - thinking

● First part: same as earlier
● Find start codon: locate index of AUG
  ● Note, can simplify and find ATG
● If start codon is found:
  ● create dictionary
  ● for slice of three in input[StartCodon:]:
    – get codon
    – if codon is in dict:
      ● add to count
    – if not:
      ● create key-value pair in dict

Page 7: Day3

CodonFrequency.py

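input = raw_input("Type a piece of DNA here: ")

# Check that the input contains nothing but A, T, G and C
if len(set(input) - set(list("ATGC"))) > 0:
    print "Not a valid DNA sequence"
else:
    # find() returns the index of the first ATG, or -1 if there is none
    atg = input.find("ATG")
    if atg == -1:
        print "Start codon not found"
    else:
        codondict = {}
        # Step through the sequence in threes from the start codon.
        # Note: with this upper bound, a codon that ends exactly at the
        # end of the input is not counted.
        for i in xrange(atg, len(input) - 3, 3):
            codon = input[i:i+3]
            if codon not in codondict:
                codondict[codon] = 1
            else:
                codondict[codon] += 1

        for codon in codondict:
            print codon, codondict[codon]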

Page 8: Day3

CodonFrequency.py w/ stopcodon

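input = raw_input("Type a piece of DNA here: ")

# Check that the input contains nothing but A, T, G and C
if len(set(input) - set(list("ATGC"))) > 0:
    print "Not a valid DNA sequence"
else:
    atg = input.find("ATG")
    if atg == -1:
        print "Start codon not found"
    else:
        codondict = {}
        # Note: with this upper bound, a codon that ends exactly at the
        # end of the input is not counted.
        for i in xrange(atg, len(input) - 3, 3):
            codon = input[i:i+3]
            # Stop codons are spelled with RNA letters here. Since the input
            # is restricted to ATGC, this test never matches; the DNA
            # spellings would be TAG, TAA and TGA.
            if codon in ['UAG', 'UAA', 'UGA']:
                break
            elif codon not in codondict:
                codondict[codon] = 1
            else:
                codondict[codon] += 1

        for codon in codondict:
            print codon, codondict[codon]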

Page 9: Day3

Results

[karinlag@freebee]/projects/temporary/cees-python-course/Karin% python CodonFrequency2.py
Type a piece of DNA here: ATGATTATTTAAATG
ATG 1
ATT 2
TAA 1
[karinlag@freebee]/projects/temporary/cees-python-course/Karin% python CodonFrequency2.py
Type a piece of DNA here: ATGATTATTTAAATGT
ATG 2
ATT 2
TAA 1
[karinlag@freebee]/projects/temporary/cees-python-course/Karin%

Page 10: Day3

Working with files

● Reading: get info into your program
● Parsing: processing file contents
● Writing: get info out of your program

Page 11: Day3

Reading and writing

● Three-step process
  ● Open file
    – create file handle (a reference to the file)
  ● Read or write to file
  ● Close file
    – will be closed automatically when the program ends, but it is bad form not to close it yourself
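
The three steps can be sketched as below. The file name example.txt and its contents are made up for illustration, and a temporary directory is used so nothing is left in your area. The with block at the end is an alternative that is not on the slides: it closes the file for you when the block ends. The code is written so that it also runs under Python 3.

```python
import os
import tempfile

# Hypothetical file in a temporary directory (illustration only).
path = os.path.join(tempfile.mkdtemp(), "example.txt")

# Step 1: open the file and get a file handle.
fh = open(path, "w")
# Step 2: write to the file through the handle.
fh.write("ATGC\n")
# Step 3: close the file so the contents are safely on disk.
fh.close()

# The same three steps for reading.
fh = open(path, "r")
contents = fh.read()
fh.close()

# Alternative: a with block closes the file automatically.
with open(path, "r") as fh2:
    contents_again = fh2.read()

print(contents == contents_again)
```

After close(), the handle still exists as a variable, but reading from it or writing to it raises an error.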

Page 12: Day3

Opening files

● Opening modes:
  ● "r" - read file
  ● "w" - write file
  ● "a" - append to end of file
● fh = open("filename", "mode")
● fh = filehandle, reference to a file, NOT the file itself
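
A small sketch of how the three modes behave, assuming a made-up file called modes.txt in a temporary directory. The point to notice is that "w" empties an existing file, while "a" keeps what is already there.

```python
import os
import tempfile

# Hypothetical file in a temporary directory (illustration only).
path = os.path.join(tempfile.mkdtemp(), "modes.txt")

fh = open(path, "w")   # "w" creates the file, or empties it if it exists
fh.write("first\n")
fh.close()

fh = open(path, "a")   # "a" keeps the existing contents and adds at the end
fh.write("second\n")
fh.close()

fh = open(path, "r")   # "r" only allows reading
after_append = fh.read()
fh.close()

fh = open(path, "w")   # opening with "w" again discards the old contents
fh.write("third\n")
fh.close()

fh = open(path, "r")
after_rewrite = fh.read()
fh.close()
```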

Page 13: Day3

Reading a file

● Three ways to read
  ● read([n]) - n = bytes to read, default is all
  ● readline() - read one line, incl. newline
  ● readlines() - read file into a list, one element per line, including newline
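
A minimal sketch of the three methods used one after another on the same file handle, assuming a small made-up file small.fsa. Each call continues from where the previous one stopped. Note that under Python 3, which this sketch also runs on, n counts characters rather than bytes when the file is opened in text mode.

```python
import os
import tempfile

# Hypothetical two-line fasta file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "small.fsa")
fh = open(path, "w")
fh.write(">seq1\nATGC\nGGTA\n")
fh.close()

fh = open(path, "r")
first_five = fh.read(5)        # only the first five characters
rest_of_line = fh.readline()   # continues where read(5) stopped
next_line = fh.readline()      # one full line, newline included
remaining = fh.readlines()     # list with whatever lines are left
at_end = fh.read()             # nothing left to read
fh.close()
```

Here first_five is ">seq1", rest_of_line is just the newline that ended the first line, next_line is "ATGC\n", remaining is ["GGTA\n"], and at_end is an empty string.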

Page 14: Day3

Reading example

● Log on to freebee, and go to your area
● do cp ../Karin/fastafile.fsa .
● open python

● Q: what does the response mean?

>>> fh = open("fastafile.fsa", "r")
>>> fh

Page 15: Day3

Read example

● Use all three methods to read the file. Print the results.

● read
● readlines
● readline

● Q: what happens after you have read the file?

● Q: What is the difference between the three?

Page 16: Day3

Read example

>>> fh = open("fastafile.fsa", "r")
>>> withread = fh.read()
>>> withread
'>This is the description line\nATGCGCTTAGGATCGATAGCGATTTAGA\nTTAGCGGA\n'
>>> withreadlines = fh.readlines()
>>> withreadlines
[]
>>> fh = open("fastafile.fsa", "r")
>>> withreadlines = fh.readlines()
>>> withreadlines
['>This is the description line\n', 'ATGCGCTTAGGATCGATAGCGATTTAGA\n', 'TTAGCGGA\n']
>>> fh = open("fastafile.fsa", "r")
>>> withreadline = fh.readline()
>>> withreadline
'>This is the description line\n'
>>>

Page 17: Day3

Parsing

● Getting information out of a file
● Commonly used string methods
  ● split([character]) - default is whitespace
  ● replace("in string", "put into instead")
  ● "string character".join(list)
    – joins all elements in the list with string character as a separator
    – common construction: ''.join(list)
  ● slicing
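
The string methods above can be tried on a couple of made-up lines. The tab-separated line and its values are invented for illustration; the header line is the one from fastafile.fsa.

```python
# A made-up tab-separated line, as it might come from a file.
line = "gene1\tATG CGC\tTTA\n"

fields = line.split("\t")              # split on tabs only
words = "ATG CGC  TTA\n".split()       # default: split on any whitespace
no_newline = line.replace("\n", "")    # remove the trailing newline

codons = ["ATG", "CGC", "TTA"]
joined = "-".join(codons)              # separator between the elements
glued = "".join(codons)                # empty separator: glue together

header = ">This is the description line\n"
description = header[1:-1]             # slice off the ">" and the newline
```

Note that split("\t") leaves the newline on the last field, while split() without arguments removes it along with all other whitespace.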

Page 18: Day3

Type conversions

● Everything that comes on the command line or from a file is a string

● Conversions:
  ● int(X)
    – string cannot have decimals
    – floats will be truncated toward zero
  ● float(X)
  ● str(X)
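
A short sketch of the conversions, using made-up values. It also shows what happens when int() is given a string with decimals, and one possible way around it (going via float() first).

```python
window_text = "3"              # e.g. something typed on the command line
window = int(window_text)      # string -> integer
ratio = float("0.75")          # string -> float
label = str(42)                # number -> string

truncated = int(2.9)           # the fractional part is dropped
negative = int(-2.9)           # also dropped, so the result moves toward zero

# A string with decimals cannot be converted directly with int().
try:
    int("2.9")
    decimal_string_ok = True
except ValueError:
    decimal_string_ok = False

via_float = int(float("2.9"))  # convert to float first, then to int
```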

Page 19: Day3

Parsing example

● Continue using fastafile.fsa
● Print only the description line to screen
● Print the whole DNA string

>>> fh = open("fastafile.fsa", "r")
>>> firstline = fh.readline()
>>> print firstline[1:-1]
This is the description line
>>> sequence = ''
>>> for line in fh:
...     sequence += line.replace("\n", "")
...
>>> print sequence
ATGCGCTTAGGATCGATAGCGATTTAGA
>>>

Page 20: Day3

Accepting input from command line

● Need to be able to specify file name on command line

● Command line parameters are stored in a list called sys.argv; the program name is element 0
● Usage:
  ● python pythonscript.py arg1 arg2 arg3....
● In script:
  ● at the top of the file, write import sys
  ● arg1 = sys.argv[1]
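
A minimal sketch of picking arguments out of an argv-style list. The function name parse_args and the choice of a filename plus a window size are assumptions made for this example. The list is written out by hand here so the sketch can be run as it stands; in a real script you would import sys and pass sys.argv instead.

```python
def parse_args(argv):
    """Return (filename, windowsize) from an argv-style list.

    argv[0] is the program name, so the real arguments start at index 1.
    """
    if len(argv) < 3:
        raise ValueError("usage: python pythonscript.py filename windowsize")
    filename = argv[1]
    windowsize = int(argv[2])   # arguments always arrive as strings
    return filename, windowsize


# Simulates: python pythonscript.py fastafile.fsa 3
filename, windowsize = parse_args(["pythonscript.py", "fastafile.fsa", "3"])
```

Checking the length of the list first gives a readable message instead of an IndexError when an argument is forgotten.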

Page 21: Day3

Batch example

● Read fastafile.fsa with all three methods
● Per method, print method, name and sequence
● Remember to close the file at the end!

Page 22: Day3

Batch example

import sys
filename = sys.argv[1]

# using readline
fh = open(filename, "r")
firstline = fh.readline()
# drop the leading ">" and the trailing newline
name = firstline[1:-1]
sequence = ''
for line in fh:
    sequence += line.replace("\n", "")
fh.close()
print "Readline", name, sequence

# using readlines()
fh = open(filename, "r")
inputlines = fh.readlines()
fh.close()
name = inputlines[0][1:-1]
sequence = ''
for line in inputlines[1:]:
    sequence += line.replace("\n", "")
print "Readlines", name, sequence

# using read
fh = open(filename, "r")
inputlines = fh.read()
# split("\n") already removes the newline, so only the ">" is sliced off
name = inputlines.split("\n")[0][1:]
sequence = "".join(inputlines.split("\n")[1:])
print "Read", name, sequence

fh.close()

Page 23: Day3

Classroom exercise

● Modify ATCurve.py script so that it accepts the following input on the command line:
  ● fasta filename
  ● window size
● Let the user input an alternate filename if the file contains characters other than ATGC
● Print results to screen

Page 24: Day3

ATCurve2.py

import sys

# Define filename and window size
filename = sys.argv[1]
windowsize = int(sys.argv[2])

# The variable valid records whether the sequence is OK or not.
valid = False
while not valid:
    fh = open(filename, "r")
    inputlines = fh.readlines()
    fh.close()
    name = inputlines[0][1:-1]
    sequence = ''
    for line in inputlines[1:]:
        sequence += line.replace("\n", "")
    upper_string = sequence.upper()

    # Figure out if anything other than A, T, G and C is present
    dnaset = set(list("ATGC"))
    upper_string_set = set(list(upper_string))

    if len(upper_string_set - dnaset) > 0:
        print "Non-DNA present in your file, try again"
        filename = raw_input("Type in filename: ")
    else:
        valid = True

if valid:
    for i in range(0, len(upper_string) - windowsize + 1, 1):
        at_sum = 0.0
        at_sum += upper_string.count("A", i, i + windowsize)
        at_sum += upper_string.count("T", i, i + windowsize)
        # i + 1 is the 1-based position of the first nucleotide in the window
        print i + 1, at_sum / windowsize

Page 25: Day3

Writing to files

● Similar procedure as for read
● Open file, mode is "w" or "a"
● fh.write(string)
  – Note: one single string
  – No newlines are added
● fh.close()
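
A small sketch of the two notes above, assuming a made-up output file out.txt in a temporary directory: write() takes exactly one string, so numbers must be converted with str(), and you have to add "\n" yourself wherever a line should end.

```python
import os
import tempfile

# Hypothetical output file in a temporary directory (illustration only).
path = os.path.join(tempfile.mkdtemp(), "out.txt")

position = 1
at_content = 2.0 / 3

fh = open(path, "w")
# Build one single string: convert the numbers and add the newline yourself.
fh.write(str(position) + " " + str(at_content) + "\n")
# Without "\n" the next write continues on the same line.
fh.write("no newline here")
fh.write(" so this ends up on the same line\n")
fh.close()

# Read the file back to see how many lines were actually written.
fh = open(path, "r")
written_lines = fh.readlines()
fh.close()
```

Three calls to write() produce only two lines in the file, because the second call did not end with a newline.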

Page 26: Day3

ATCurve3.py

● Modify previous script so that you have the following on the command line
  ● fasta filename for input file
  ● window size
  ● output file
● Output should be in the format
  ● number, AT content
  ● number is the 1-based position of the first nucleotide in the window

Page 27: Day3

ATCurve3.py

import sys

# Define filename, window size and output file
filename = sys.argv[1]
windowsize = int(sys.argv[2])
outputfile = sys.argv[3]

# (reading and validating the sequence is not shown on this slide)

if valid:
    fh = open(outputfile, "w")
    for i in range(0, len(upper_string) - windowsize + 1, 1):
        at_sum = 0.0
        at_sum += upper_string.count("A", i, i + windowsize)
        at_sum += upper_string.count("T", i, i + windowsize)
        fh.write(str(i + 1) + " " + str(at_sum / windowsize) + "\n")
    fh.close()

Page 28: Day3

Homework: TranslateProtein.py

● Input files are in /projects/temporary/cees-python-course/Karin
  ● translationtable.txt - tab separated
  ● dna31.fsa
● Script should:
  ● Open the translationtable.txt file and read it into a dictionary
  ● Open the dna31.fsa file and read the contents
  ● Translate the DNA into protein using the dictionary
  ● Print the translation in fasta format to the file TranslateProtein.fsa. Each protein line should be 60 characters long.
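
As a starting point, here is a hedged sketch of the building blocks, not a finished solution. It assumes that each line of the translation table looks like codon, tab, amino acid; the real layout of translationtable.txt may differ. The function names, the three example rows, and the choice of "X" for unknown codons are all made up for this illustration. Reading the two files and writing the output file are left to you.

```python
def read_table(lines):
    """Build a codon -> amino acid dictionary from tab-separated lines."""
    table = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            table[fields[0]] = fields[1]
    return table


def translate(dna, table):
    """Translate codon by codon; an incomplete last codon is ignored."""
    protein = ""
    for i in range(0, len(dna) - 2, 3):
        # "X" marks codons missing from the table (a choice made here).
        protein += table.get(dna[i:i + 3], "X")
    return protein


def wrap(sequence, width=60):
    """Cut a long string into pieces of at most width characters."""
    return [sequence[i:i + width] for i in range(0, len(sequence), width)]


# Three example rows in the assumed layout (illustration only).
example_lines = ["ATG\tM\n", "AAA\tK\n", "TGG\tW\n"]
table = read_table(example_lines)
protein = translate("ATGAAATGGAA", table)   # the trailing "AA" is ignored
wrapped = wrap("M" * 130)                   # pieces of 60, 60 and 10
```

In the full script, the lines would come from fh.readlines() on translationtable.txt, and each piece returned by wrap() would be written to TranslateProtein.fsa followed by "\n".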