Module Overview
• We will learn how to read, create, and modify files
– Pay special attention to pickled files
• They are very easy to use!
The file system
• Provides long-term storage of information
• Will store data in stable storage (disk)
• Cannot be RAM because:
– Dynamic RAM loses its contents when powered off
– Static RAM is too expensive
– System crashes can corrupt contents of the main memory
Overall organization
• Data managed by the file system are grouped in user-defined data sets called files
• The file system must provide a mechanism for naming these data
– Each file system has its own set of conventions
– All modern operating systems use a hierarchical directory structure
Windows solution
• Each device and each disk partition is identified by a letter
– A: and B: were used by the floppy drives
– C: is the first disk partition of the hard drive
– If the hard drive has no other disk partition, D: denotes the DVD drive
• Each device and each disk partition has its own hierarchy of folders
Windows solution
[Diagram: separate folder hierarchies for C:, D: (second disk), and F: (flash drive); folders shown include Windows, Users, and Program Files]
UNIX/LINUX organization
• Each device and disk partition has its own directory tree
– Disk partitions are glued together through the mount operation to form a single tree
• Typical user does not know where her files are stored
UNIX/LINUX organization
[Diagram: the root partition (/) contains directories such as bin and usr; the "magic" mount attaches the other partition at usr]
• Second partition can be accessed as /usr
Mac OS organization
• Similar to Windows
– Disk partitions are not merged
– Represented by separate icons on the desktop
Accessing a file (I)
• Your Python programs are stored in a folder, AKA directory
– On my home PC it is
C:\Users\Jehan-Francois Paris\Documents\Courses\1306\Python
• All files in that directory can be directly accessed through their names
– "myfile.txt"
Accessing a file (II)
• Files in subdirectories can be accessed by specifying first the subdirectory
– Windows style:
• "test\\sample.txt"
– Note the double backslash
– Linux/Unix/Mac OS X style:
• "test/sample.txt"
– Generally works on Windows too
Why the double backslash?
• The backslash is an escape character in Python
– Combines with its successor to represent non-printable characters
• '\n' represents a newline
• '\t' represents a tab
– Must use '\\' to represent a plain backslash
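The escape rules above can be checked directly in the interpreter. The short sketch below also shows raw strings (the r prefix), which are not covered on these slides but are a common alternative to doubling every backslash in a Windows path.

```python
# Each escape sequence below is a single character, even though it is typed as two.
newline = '\n'
tab = '\t'
backslash = '\\'
print(len(newline), len(tab), len(backslash))   # prints: 1 1 1

# A raw string (prefix r) turns off escape processing, so one backslash is enough.
escaped_path = "test\\sample.txt"
raw_path = r"test\sample.txt"
print(escaped_path == raw_path)                 # prints: True
```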
Accessing a file (III)
• For other files, must use full pathname
– Windows style:
• "C:\\Users\\Jehan-Francois Paris\\Documents\\Courses\\1306\\Python\\myfile.txt"
Accessing file contents
• Two-step process:
– First we open the file
– Then we access its contents
• Read
• Write
• When we are done, we close the file.
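A minimal sketch of the open / access / close cycle. It uses Python's with statement, which closes the file automatically when the block ends; the slides close files by hand, so treat this as an alternative rather than the course's method. The scratch folder and file name are made up so the sketch runs anywhere.

```python
import os
import tempfile

# Hypothetical scratch file, created only so the sketch can run anywhere.
folder = tempfile.mkdtemp()
path = os.path.join(folder, "myfile.txt")

# Step 1: open; step 2: write; the file is closed when the with block ends.
with open(path, "w") as fh:
    fh.write("To be or not to be\n")

# Same steps for reading.
with open(path, "r") as fh:
    contents = fh.read()

print(contents, end="")
print(fh.closed)      # prints: True
```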
What happens at open() time?
• The system verifies
– That you are an authorized user
– That you have the right permission
• Read permission
• Write permission
• Execute permission exists but doesn't apply
and returns a file handle / file descriptor
The file handle
• Gives the user
– Direct access to the file
• No directory lookups
– Authority to execute the file operations whose permissions have been requested
Python open()
• open(name, mode = 'r', buffering = -1)
where
– name is name of file
– mode is permission requested
• Default is 'r' for read only
– buffering specifies the buffer size
• Use system default value (code -1)
The modes
• Can request
– 'r' for read-only
– 'w' for write-only
• Always overwrites the file
– 'a' for append
• Writes at the end
– 'r+' or 'a+' for updating (read + write/append)
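The difference between 'w' and 'a' is easiest to see by writing to the same file several times. This is a small sketch; the scratch file and the helper names write_with_mode and read_all are made up for the illustration.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "modes.txt")   # hypothetical scratch file

def write_with_mode(mode, text):
    """Open path in the given mode, write text, and close the file."""
    fh = open(path, mode)
    fh.write(text)
    fh.close()

def read_all():
    """Return the whole contents of path as one string."""
    fh = open(path, "r")
    text = fh.read()
    fh.close()
    return text

write_with_mode("w", "first\n")
write_with_mode("w", "second\n")     # 'w' discards the previous contents
after_w = read_all()
write_with_mode("a", "third\n")      # 'a' adds at the end
after_a = read_all()
print(repr(after_w))   # prints: 'second\n'
print(repr(after_a))   # prints: 'second\nthird\n'
```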
Examples
• f1 = open("myfile.txt")
same as
f1 = open("myfile.txt", "r")
• f2 = open("test\\sample.txt", "r")
• f3 = open("test/sample.txt", "r")
• f4 = open("C:\\Users\\Jehan-Francois Paris\\Documents\\Courses\\1306\\Python\\myfile.txt")
Reading a file
• Three ways:– Global reads– Line by line– Pickled files
Global reads
• fh.read()
– Returns whole contents of file specified by file handle fh
– File contents are stored in a single string that might be very large
Example
• f2 = open("test\\sample.txt", "r")
bigstring = f2.read()
print(bigstring)
f2.close() # not required
Output of example
• To be or not to be that is the question
Now is the winter of our discontent
– Exact contents of file 'test\sample.txt'
Line-by-line reads
• for line in fh : # do not forget the colon
    # anything you want
fh.close() # not required
Example
• f3 = open("test/sample.txt", "r")
for line in f3 : # do not forget the colon
    print(line)
f3.close() # not required
Output
• To be or not to be that is the question
Now is the winter of our discontent
– With one or more extra blank lines
Why?
• Each line ends with an end-of-line marker
• print(…) adds an extra end-of-line
Trying to remove blank lines
• print('----------------------------------------------------')
f5 = open("test/sample.txt", "r")
for line in f5 : # do not forget the colon
    print(line[:-1]) # remove last char
f5.close() # not required
print('-----------------------------------------------------')
The output
• ----------------------------------------------------
To be or not to be that is the question
Now is the winter of our disconten
-----------------------------------------------------
• The last line did not end with an EOL!
A smarter solution (I)
• Only remove the last character if it is an EOL
– if line[-1] == '\n' :
      print(line[:-1])
  else :
      print(line)
A smarter solution (II)
• print('----------------------------------------------------')
fh = open("test/sample.txt", "r")
for line in fh : # do not forget the colon
    if line[-1] == '\n' :
        print(line[:-1]) # remove last char
    else :
        print(line)
print('-----------------------------------------------------')
fh.close() # not required
It works!
• ----------------------------------------------------
To be or not to be that is the question
Now is the winter of our discontent
-----------------------------------------------------
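The same result can be obtained without the if/else test by using the string method rstrip('\n'), which removes trailing newlines only when they are present. This is an alternative sketch, not the approach taken on the slides; the list lines stands in for the file so the example is self-contained.

```python
# Stand-in for the lines of test/sample.txt; the last line has no EOL.
lines = ["To be or not to be that is the question\n",
         "Now is the winter of our discontent"]

# rstrip('\n') removes trailing newline characters and nothing else.
stripped = [line.rstrip('\n') for line in lines]
for line in stripped:
    print(line)
```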
Making sense of file contents
• Most files contain more than one data item per line
– COSC 713-743-3350
UHPD 713-743-3333
• Must split lines
– mystring.split(sepchar)
where sepchar is a separation character
• returns a list of items
Splitting strings
• >>> text = "Four score and seven years ago"
>>> text.split()
['Four', 'score', 'and', 'seven', 'years', 'ago']
• >>> record = "1,'Baker, Andy', 83, 89, 85"
>>> record.split(',')
['1', "'Baker", " Andy'", ' 83', ' 89', ' 85']
Not what we wanted!
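One way to keep 'Baker, Andy' in one piece is to let the standard csv module do the splitting, since it understands quoted fields. This is only a sketch of an alternative; the slides go on to handle quoting by hand. The quotechar and skipinitialspace settings are chosen to match this particular record.

```python
import csv

record = "1,'Baker, Andy', 83, 89, 85"

# quotechar="'" because this record quotes the name with single quotes;
# skipinitialspace=True drops the blank that follows each comma.
reader = csv.reader([record], quotechar="'", skipinitialspace=True)
fields = next(reader)
print(fields)   # prints: ['1', 'Baker, Andy', '83', '89', '85']
```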
Example
# how2split.py
print('----------------------------------------------------')
f5 = open("test/sample.txt", "r")
for line in f5 :
    words = line.split()
    for xxx in words :
        print(xxx)
f5.close() # not required
print('-----------------------------------------------------')
Output
• ----------------------------------------------------
To
be
…
of
our
discontent
-----------------------------------------------------
Other separators (I)
• Commas
– CSV Excel format
• Values are separated by commas
• Strings are stored without quotes
– Unless they contain a comma
• "Doe, Jane", freshman, 90, 90
– Quotes within strings are doubled
Other separators (II)
• Tabs ('\t')
– Advantages:
• Your fields will appear nicely aligned
• Spaces, commas, … are not an issue
– Disadvantage:
• You do not see them
– They look like spaces
Why it is important
• When you must pick your file format, you should decide how the data inside the file will be used:
– People will read them
– Other programs will use them
– Will be used by people and machines
An exercise
• Converting our output to CSV format
– Replacing tabs by commas
• Easy
– Will use string replace function
First attempt
• fh_in = open('grades.txt', 'r') # the 'r' is optional
buffer = fh_in.read()
newbuffer = buffer.replace('\t', ',')
fh_out = open('grades0.csv', 'w')
fh_out.write(newbuffer)
fh_in.close()
fh_out.close()
print('Done!')
The output
• Alice 90 90 90 90 90
Bob 85 85 85 85 85
Carol 75 75 75 75 75
becomes
• Alice,90,90,90,90,90
Bob,85,85,85,85,85
Carol,75,75,75,75,75
Dealing with commas (I)
• Work line by line
• For each line
– split input into fields using TAB as separator
– store fields into a list
• Alice 90 90 90 90 90
becomes
['Alice', '90', '90', '90', '90', '90']
Dealing with commas (II)
– Put within double quotes any entry containing one or more commas
– Output list entries separated by commas
• ['"Baker, Alice"', '90', '90', '90', '90', '90']
becomes
"Baker, Alice",90,90,90,90,90
Dealing with commas (III)
• Our troubles are not over:
– Must store somewhere all lines until we are done
– Store them in a list
Dealing with double quotes
• Before wrapping items containing commas in double quotes, replace
– All double quotes by pairs of double quotes
– 'Aguirre, "Lalo" Eduardo'
becomes
'Aguirre, ""Lalo"" Eduardo'
then
'"Aguirre, ""Lalo"" Eduardo"'
General organization (I)
• linelist = [ ]
• for line in file
– itemlist = line.split(…)
– linestring = '' # empty string
– for each item in itemlist
• remove any trailing newline
• double all double quotes
• if item contains comma, wrap
• add to linestring
General organization (II)
• for line in file
– …
– for each item in itemlist
• double all double quotes
• if item contains comma, wrap
• add to linestring
– append linestring to linelist
General organization (III)
• for line in file
– …
– remove last comma of linestring
– add newline at end of linestring
– append linestring to linelist
• for linestring in linelist
– write linestring into output file
The program (I)
• # betterconvert2csv.py
""" Convert tab-separated file to csv
"""
fh = open('grades.txt','r') # input file
linelist = [ ] # global data structure
for line in fh : # outer loop
    itemlist = line.split('\t')
    # print(str(itemlist)) # just for debugging
    linestring = '' # start afresh
The program (II)
• # (inside the outer loop)
    for item in itemlist : # inner loop
        item = item.replace('"','""') # for quotes
        if item[-1] == '\n' : # remove it
            item = item[:-1]
        if ',' in item : # wrap item
            linestring += '"' + item +'"' + ','
        else : # just append
            linestring += item +','
    # end of inside for loop
The program (III)
• # must replace last comma by newline
    linestring = linestring[:-1] + '\n'
    linelist.append(linestring)
# end of outside for loop
fh.close()
fhh = open('great.csv', 'w')
for line in linelist :
    fhh.write(line)
fhh.close()
Notes
• Most print statements used for debugging were removed
– Space considerations
• Observe that the inner loop adds a comma after each item
– Wanted to remove the last one
• Must also add a newline at end of each line
The input file
• Alice 90 90 90 90 90
Bob 85 85 85 85 85
Carol 75 75 75 75 75
Doe, Jane 90 90 90 80 70
Fulano, Eduardo "Lalo" 90 90 90 90
The output file
• Alice,90,90,90,90,90
Bob,85,85,85,85,85
Carol,75,75,75,75,75
"Doe, Jane",90,90,90,80,70
"Fulano, Eduardo ""Lalo""",90,90,90,90
Mistakes being made (I)
• Mixing lists and strings:
– Earlier draft of program declared
• linestring = [ ]
and did
• linestring.append(item)
– Outcome was
• ['Alice,', '90,', … ]
instead of
• 'Alice,90, …'
Mistakes being made (II)
• Forgetting to add a newline
– Output was a single line
• Doing the append inside the inner loop:
– Output was
• Alice,90
Alice,90,90
Alice,90,90,90
…
Mistakes being made (III)
• Forgetting that strings are immutable:
– Trying to do
• linestring[-1] = '\n'
instead of
• linestring = linestring[:-1] + '\n'
– Bigger issue:
• Do we have to remove the last comma?
Could we have done better? (I)
• Make the program more readable by decomposing it into functions
– A function to process each line of input
• do_line(line)
– Input is a string ending with newline
– Output is a string in CSV format
– Should call a function processing individual items
Could we have done better? (II)
– A function to process individual items
• do_item(item)
– Input is a string
– Returns a string
• With double quotes "doubled"
• Without a newline
• Within quotes if it contains a comma
The new program (I)
• def do_item(item) :
    item = item.replace('"','""')
    if item[-1] == '\n' :
        item = item[:-1]
    if ',' in item :
        item ='"' + item +'"'
    return item
The new program (II)
• def do_line(line) :
    itemlist = line.split('\t')
    linestring = '' # start afresh
    for item in itemlist :
        linestring += do_item(item) +','
    # replace last comma by newline, as in the first version
    linestring = linestring[:-1] + '\n'
    return linestring
The new program (III)
• fh = open('grades.txt','r')
linelist = [ ]
for line in fh :
    linelist.append(do_line(line))
fh.close()
The new program (IV)
• fhh = open('great.csv', 'w')
for line in linelist :
    fhh.write(line)
fhh.close()
Why it is better
• Program is decomposed into small modules that are much easier to understand
– Each fits on a PowerPoint slide
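For comparison, Python's standard csv module applies the same quoting rules (wrap fields that contain a comma, double any double quotes). The sketch below is an alternative to the hand-written program, not a replacement for the exercise. The function name tsv_to_csv is made up, and io.StringIO objects stand in for the input and output files so the example runs without grades.txt.

```python
import csv
import io

def tsv_to_csv(tsv_text):
    """Convert tab-separated text to CSV text using the standard csv module."""
    source = io.StringIO(tsv_text, newline='')
    target = io.StringIO(newline='')
    # QUOTE_NONE: treat double quotes in the input as ordinary characters
    reader = csv.reader(source, delimiter='\t', quoting=csv.QUOTE_NONE)
    # The writer quotes a field only when it contains a comma or a quote
    writer = csv.writer(target, lineterminator='\n')
    for row in reader:
        writer.writerow(row)
    return target.getvalue()

sample = 'Alice\t90\t90\nDoe, Jane\t90\t80\nFulano, Eduardo "Lalo"\t90\t90\n'
result = tsv_to_csv(sample)
print(result, end='')
```

With real files, the two StringIO objects would be replaced by files opened with newline='' as the csv documentation recommends.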
The break statement
• Makes the program exit the loop it is in
• In next example, we are looking for first instance of a string in a file
– Can exit as soon as it is found
Example (I)
• searchstring = input('Enter search string:')
found = False
fh = open('grades.txt')
for line in fh :
    if searchstring in line :
        print(line)
        found = True
        break
Example (II)
• if found == True :
    print("String %s was found" % searchstring)
else :
    print("String %s NOT found " % searchstring)
Flags
• A variable like found
– That can either be True or False
– That is used in a condition for an if or a while
is often referred to as a flag
A dumb mistake
• Unlike C and its family of languages, Python does not let you write
– if found = True
for
– if found == True
• There are still cases where we can make mistakes!
Example
• >>> b = 5
>>> c = 8
>>> a = b = c
>>> a
8
• >>> a = b == c
>>> a
True
HANDLING EXCEPTIONS
When a wrong value is entered
• When user is prompted for
– number = int(input("Enter a number: "))
and enters
– a non-numerical string
a ValueError exception is raised and the program terminates
• Python programs can catch errors
The try… except pair (I)
• try:
    <statements being tried>
except Exception as ex:
    <statements catching the exception>
• Observe
– the colons
– the indentation
The try… except pair (II)
• try:
    <statements being tried>
except Exception as ex:
    <statements catching the exception>
• If an exception occurs while the program executes the statements between the try and the except, control is immediately transferred to the statements after the except
A better example
• done = False
while not done :
    filename = input("Enter a file name: ")
    try :
        fh = open(filename)
        done = True
    except Exception as ex:
        print('File %s does not exist' % filename)
print(fh.read())
An Example (I)
• done = False
while not done :
    try :
        number = int(input('Enter a number:'))
        done = True
    except Exception as ex:
        print('You did not enter a number')
print("You entered %.2f." % number)
input("Hit enter when done with program.")
A simpler solution
• done = False
while not done :
    myinput = input('Enter a number:')
    if myinput.isdigit() :
        number = int(myinput)
        done = True
    else :
        print('You did not enter a number')
print("You entered %.2f." % number)
input("Hit enter when done with program.")
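The two approaches do not accept exactly the same inputs: isdigit() is False for a string with a minus sign, while int() converts it. The sketch below compares them on a few fixed strings instead of prompting the user; parse_int is a hypothetical helper, and it catches ValueError specifically rather than every Exception.

```python
def parse_int(text):
    """Return the integer written in text, or None if text is not an integer."""
    try:
        return int(text)
    except ValueError:       # raised by int() for strings such as 'abc'
        return None

# Columns: the input string, what isdigit() says, what int() makes of it
for text in ['42', '-5', 'abc', '']:
    print(repr(text), text.isdigit(), parse_int(text))
```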
PICKLED FILES
Pickled files
• import pickle
– Provides a way to save complex data structures in a file
– Sometimes said to provide a serialized representation of Python objects
Basic primitives (I)
• dump(object, fh)
– appends a serialized representation of object to the file with file handle fh
– object is virtually any Python object
– fh is the handle of a file that must have been opened in 'wb' mode
• b is a special option that allows writing or reading binary data
Basic primitives (II)
• target = load(filehandle)
– assigns to target the next pickled object stored in file filehandle
– target is virtually any Python object
– filehandle is the handle of a file that was opened in 'rb' mode
Example (I)
• >>> mylist = [ 2, 'Apples', 5, 'Oranges']
• >>> mylist
[2, 'Apples', 5, 'Oranges']
• >>> fh = open('testfile', 'wb') # b is for BINARY
• >>> import pickle
• >>> pickle.dump(mylist, fh)
• >>> fh.close()
Example (II)
• >>> fhh = open('testfile', 'rb') # b is for BINARY
• >>> theirlist = pickle.load(fhh)
• >>> theirlist
[2, 'Apples', 5, 'Oranges']
• >>> theirlist == mylist
True
What was stored in testfile?
• Some binary data containing the strings 'Apples' and 'Oranges'
Using ASCII format
• Can request a pickled representation of objects that only contains printable characters
– Must specify protocol = 0
• Advantage:
– Easier to debug
• Disadvantage:
– Takes more space
Example
• import pickle
mydict = {'Alice': 22, 'Bob' : 27}
fh = open('asciifile.txt', 'wb') # MUST be 'wb'
pickle.dump(mydict, fh, protocol = 0)
fh.close()
fhh = open('asciifile.txt', 'rb')
theirdict = pickle.load(fhh)
fhh.close()
print(mydict)
print(theirdict)
The output
• {'Bob': 27, 'Alice': 22}
{'Bob': 27, 'Alice': 22}
– In Python 3.7 and later, dictionaries keep insertion order, so 'Alice' is printed first
What is inside asciifile.txt?
• (dp0VBobp1L27LsVAlicep2L22Ls.
Dumping multiple objects (I)
• import pickle
fh = open('asciifile.txt', 'wb')
for k in range(3, 6) :
    mylist = [i for i in range(1,k)]
    print(mylist)
    pickle.dump(mylist, fh, protocol = 0)
fh.close()
Dumping multiple objects (II)
• fhh = open('asciifile.txt', 'rb')
lists = [ ] # initializing list of lists
while 1 : # means forever
    try:
        lists.append(pickle.load(fhh))
    except EOFError :
        break
fhh.close()
print(lists)
Dumping multiple objects (III)
• Note the way we test for end-of-file (EOF)
– while 1 : # means forever
      try:
          lists.append(pickle.load(fhh))
      except EOFError :
          break
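The EOF test above can be packaged into a reusable function. The sketch below is one possible way to do it: load_all is a hypothetical generator, and an in-memory io.BytesIO buffer stands in for asciifile.txt so the example runs without touching the disk.

```python
import io
import pickle

def load_all(fh):
    """Yield every pickled object in fh until end-of-file is reached."""
    while True:
        try:
            yield pickle.load(fh)
        except EOFError:      # no more pickled objects in the file
            return

# io.BytesIO stands in for a file opened in 'wb' and then 'rb' mode
buffer = io.BytesIO()
for k in range(3, 6):
    pickle.dump(list(range(1, k)), buffer)
buffer.seek(0)                # go back to the start before reading
lists = list(load_all(buffer))
print(lists)   # prints: [[1, 2], [1, 2, 3], [1, 2, 3, 4]]
```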
The output
• [1, 2]
[1, 2, 3]
[1, 2, 3, 4]
[[1, 2], [1, 2, 3], [1, 2, 3, 4]]
What is inside asciifile.txt?
• (lp0L1LaL2La.(lp0L1LaL2LaL3La.(lp0L1LaL2LaL3LaL4La.
Practical considerations
• You rarely pick the format of your input files
– May have to do format conversion
• You often have to use specific formats for your output files
– Often dictated by program that will use them
• Otherwise stick with pickled files!