
Teaching Python programming with automatic assessment and feedback provision

Hans Fangohr, Neil O'Brien
Faculty of Engineering and the Environment, University of Southampton

Anil Prabhakar, Arti Kashyap
IIT Madras, IIT Mandi

arXiv:1509.03556v1 [cs.CY] 11 Sep 2015

Abstract—We describe a method of automatic feedback provision for students learning programming and computational methods in Python. We have implemented, used and refined this system since 2009 for growing student numbers, and summarise the design and experience of using it. The core idea is to use a unit testing framework: the teacher creates a set of unit tests, and the student code is tested by running these tests. With our implementation, students typically submit work for assessment, and receive feedback by email within a few minutes after submission. The choice of tests and the reporting back to the student is chosen to optimise the educational value for the students. The system very significantly reduces the staff time required to establish whether a student's solution is correct, and shifts the emphasis of computing laboratory student contact time from assessing correctness to providing guidance. The self-paced nature of the automatic feedback provision supports a student-centred learning approach. Students can re-submit their work repeatedly and iteratively improve their solution, and enjoy using the system. We include an evaluation of the system and data from using it in a class of 425 students.

Index Terms—Automatic assessment tools, automatic feedback provision, programming education, Python, self-assessment technology

1 INTRODUCTION

1.1 Context

Programming skills are key for software engineering and computer science but increasingly relevant for computational science outside computer science as well, for example in engineering, natural and social science, mathematics and economics. The learning and teaching of programming is a critical part of a computer science degree and becoming more and more important in taught and research degrees of other disciplines.

This paper focuses on an automatic submission, testing and feedback provision system that has been designed, implemented, used and further developed at the University of Southampton since 2009 for undergraduate and postgraduate programming courses. While the primary target group of students in this setting were engineers, the same system could be used to benefit the learning of computer science students.

1.2 Effective teaching of programming skills

One of the underpinning skills for computer science, software engineering and computational science is programming. A thorough treatment of the existing literature on teaching introductory programming was given by Pears et al. [1], while a previous review focused mainly on novice programming and topics related to novice teaching and learning [2]. Here, we motivate the use of an automatic assessment and feedback system in the context of teaching introductory programming skills.

Programming is a creative task: given the constraints of the programming language to be used, it is the choice of the programmer what data structure to use, what control flow to implement, what programming paradigm to use, how to name variables and functions, how to document the code, and how to structure the code that solves the problem into smaller units (which potentially could be re-used). Experienced programmers value this freedom and gain satisfaction from developing a 'beautiful' piece of code or finding an 'elegant' solution. For beginners (and teachers) the variety of 'correct' solutions can be a challenge.

Given a particular problem (or student exercise), for example to compute the solution of an ordinary differential equation, there are a number of criteria that can be used to assess the computer program that solves the problem:

1) correctness: does the code produce the correct answer? (For numerical problems, this requires some care: for the example of the differential equation, we would expect for a well-behaved differential equation that the numerical solution converges towards the exact solution as the step-width is reduced towards zero.)

2) execution time performance: how fast is the solution computed?

3) memory consumption: how much RAM is required to compute the solution?

4) robustness: how robust is the implementation with respect to missing/incorrect input values, etc?

5) elegance, readability, documentation: how long is the code? Is it easy for others to understand? Is it easy to extend? Is it well documented, or is the choice of algorithm, data structures and naming of objects sufficient to document what it does?

The first aspect – correctness – is probably most important: it is better to have a slow piece of code that produces the correct answer, than to have one that is very fast but produces a wrong answer. When teaching and providing feedback, in particular to beginners, one tends to focus on correctness of the solution. However, the other criteria 2 to 5 are also important.

We demonstrate in this paper that the assessment of criteria 1 to 4 can be automated in day-to-day teaching of large groups of students. While the higher-level aspects such as elegance, readability and documentation of item 5 do require manual inspection of the code from an experienced programmer, we find that the teaching of the high level aspects benefits significantly from automatic feedback as all the contact time with experienced staff can be dedicated to those points, and no time is required to check the criteria 1 to 4.

1.3 Automatic feedback provision and assessment

Over the past two decades interest has been rapidly growing in utilising new technologies to enhance the learning and feedback provision processes in higher education. In 1997, Price and Petre considered the importance of feedback from an instructor to students learning programming, especially looking into how electronic assignment handling can contribute to Internet-based teaching of programming [3]. Their study compares feedback given manually by several instructors to cohorts of conventional and Internet learning students, only a small fraction of which involved running the students' submissions. For the functional programming language Scheme, Saikkonen et al. described a system that assesses programming exercises with the possibility to analyse individual procedures and metrics such as run time [4]. A feedback system called "submit" for code in Java was introduced in 2003, which worked by allowing users to upload code, which would be compiled and (if the compilation was successful) run, with the output displayed for comparison with model output provided by the lecturer; the lecturer would manually grade the work later, and the system would also display this information [5]. Recognising the popularity of test-driven development and adopting that approach in programming courses, Stephen Edwards implemented a system, web-CAT, that would assess both the tests and the code written by students [6]. Shortly thereafter, another group produced a tool for automatically assessing the style of C++ programs [7], which students were encouraged to use, and which was also used by instructors when manually assessing assignments; it was found that the students started to follow many important style guidelines once the tool was made available.

By 2005 there was sufficient interest in the field of automatic assessment systems that multiple reviews were published [8], [9], highlighting the emergence of evidence that automatic assessment can lead to increased student performance [10], [11]. Another benefit realised with automatic assessment systems is greater ease in detecting plagiarism, tools for the purpose having been included in several of the systems surveyed. Also reported that year was CourseMarker [12], which can mark C++ and Java programs, and uses a Java client program to provide a graphical user interface to students.

A more recent review of automatic assessment systems [13], which highlighted newer developments, recommended that future systems devote more attention to security, and that future literature describe more completely how the systems work. A work from MIT CSAIL and Microsoft introduces a model in which the system – provided with a reference implementation of a solution, and an error model consisting of potential corrections to errors that students may make – automatically derives minimal corrections to students' incorrect solutions [14]. Another relatively recent development is the adoption of distributed, web-based training and assessment systems [15], as well as the increasingly popular "massive open online courses" or MOOCs [16]. A current innovation in the field is the nbgrader project [17], an open-source project that is designed for generating and grading assignments in IPython notebooks [18].

1.4 Outline

In this work, we describe the motivation, design, implementation and effectiveness of an automatic feedback system for Python programming exercises used in undergraduate teaching for engineers. We aim to address the shortcomings of the current literature as outlined in the review [13] by detailing our implementation and security model, as well as providing sample testing scripts, inputs and outputs, and usage data from the deployed system. We combine the provision of the technical software engineering details of the testing and feedback system with motivation and explanation of its use in an educational setting, and data on student reception based on 6 years of experience of employing the system in multiple courses and countries.

In Sec. 2, we provide some historic context of how programming was taught prior to the introduction of the automatic testing system described here. Sec. 3 introduces the new method of feedback provision, initially from the student's perspective – students are the users from a software engineering point of view – then providing more detail on design and implementation. Based on our use of the system over multiple years, we have composed results, statistics and a discussion of the system in Sec. 4, before we close with a summary in Sec. 5.

2 TRADITIONAL DELIVERY OF PROGRAMMING EDUCATION

In this section, we describe the learning and teaching methods used in the Engineering degree programmes at the University of Southampton before the automatic feedback system was introduced.

2.1 Programming languages used

We taught languages such as C and MATLAB to students in Engineering as their first programming languages until 2004, when we introduced Python [19] into the curriculum. Over time, we have moved to teaching Python as a versatile language [20], [21] that is relatively easy to learn [22] and useful in a wide variety of applications [23], [24]. We teach C to advanced students in later years as a compiled and fast language.

2.2 Lectures

Lectures that introduce a programming language to beginners are typically scheduled over a duration of 12 weeks, with two 45 minute lectures per week. This is combined with a scheduled computing laboratory (90 minutes) every week (Sec. 2.3), and an additional and optional weekly "help session" (Sec. 2.4).

The lectures introduce new material, demonstrate what one can do with new commands, and how to use programming elements or numerical methods. In nearly all lectures, new commands and features are used and demonstrated by the lecturer in live-coding of small programs; often with involvement of the students. The lectures are thus a mixture of traditional lectures and a tutorial-like component where the new material is applied to solve a problem, and – while only the lecturer has a keyboard which drives a computer with display output connected to a data projector – all students contribute, or are at least engaged, in the process of writing a piece of code.

2.3 Computing laboratories

However, for the majority of students the actual learning takes place when they carry out programming exercises themselves.

To facilitate this, computer laboratory sessions (90 minutes every week) are arranged in which each student has one computer, and works at their own pace through a number of exercises. Teaching staff are available during the session, and we have found that about 1 (teaching assistant) demonstrator per 10 students is required for this set up.

The lecturer and demonstrators (either academics or postgraduate students) fulfil three roles in these laboratory sessions:

(i) to provide help and advice when students have difficulties or queries while carrying out the self-paced exercises,

(ii) to establish whether a student's work is correct (i.e. does the student's computer program do what it is meant to do), and

(iii) to provide feedback to the student (in particular: what they should change for future programs they write).

Typically, prior to introducing the automatic testing system in 2009, the teaching assistants were spending 90% of their time on activity (ii), i.e. checking students' code for correctness, and only the remaining 10% of their time could be used on (i) and (iii), while the educational value is overwhelmingly in (i) and (iii).

In practical terms, the assessment and feedback provision was done in pairs consisting of one demonstrator and one student looking through the student's files on the student's computer at some point during the subsequent computing laboratory session. The feedback and assessment was thus delivered one week after the students had completed the work.

2.4 Help session

In the weekly voluntary help session, computers and teaching staff are available for students if they need support exceeding the normal provision, would like to discuss their solutions in more depth, or seek inspiration and tasks to study topics well beyond the expected material.

3 NEW METHOD OF AUTOMATIC FEEDBACK PROVISION

3.1 Overview

In 2009, we introduced an automatic feedback provision system that checks each student's code for correctness and provides feedback to the student within a couple of minutes of having completed the work. This takes a huge load off the demonstrators who consequently can spend most of their time helping students to do the exercises (item (i) in Sec. 2.3) and providing additional feedback on completed and assessed solutions (item (iii) in Sec. 2.3). Due to the introduction of the system the learning process can be supported considerably more effectively, and we could reduce the number of demonstrators from 1 per 10 students as we had pre-2009, to 1 demonstrator per 20 students, and still improve the learning experience and depth of material covered. There was no change to the scheduled learning activities, i.e. the weekly lectures (Sec. 2.2), computing laboratory sessions (Sec. 2.3), and help sessions (Sec. 2.4) remain.

In Sec. 3.2 "Student's perspective" we show a typical example of a very simple exercise, along with correct and incorrect solutions, and the feedback that those solutions give rise to. Later sections detail the system design and work flow (Sec. 3.3) and in particular the implementation of the student code testing (Sec. 3.4), with reference to this example exercise.

3.2 Student’s perspective

Once a student completes a programming exercise in the computing laboratory session, they send an email to a dedicated email account that has been created for the teaching course, and attach the file containing the code they have written. The subject line is used by the student to identify the exercise; for example "Lab 4" would identify the 4th practical session. The system receives the student's email, and the next thing that the student sees is an automatically generated email confirmation of the submission (or, should the submission not be valid, an error message is emailed instead, explaining why the submission was invalid. Invalid submissions can occur for example if emails are sent from email accounts that are not authorised to submit code). At this stage, the student's code is enqueued for testing, and after a short interval, the student receives another email containing their assessment results and feedback by email. Where problems are detected, this email also includes details of what the problems were. Typically, the student will receive feedback in their inbox within two to three minutes of sending their email.

We shall use the following example exercise, which is typical of one that we might use in an introductory Python laboratory, as the basis for our case study:


Please define the following functions in the file training1.py and make sure they behave as expected. You also should document them suitably.

1) A function distance(a, b) that returns the distance between numbers a and b.

2) A function geometric_mean(a, b) that returns the geometric mean of two numbers, i.e. the edge length that a square would have so that its area equals that of a rectangle with sides a and b.

3) A function pyramid_volume(A, h) that computes and returns the volume of a pyramid with base area A and height h.

We show a correct solution to question 3 of this example exercise in Listing 1. If a student who is enrolled on the appropriate course submits this, along with correct responses to the other questions, by email to the system, they will receive feedback as shown in Listing 2.

def pyramid_volume(A, h):
    """Calculate and return the volume of a pyramid
    with base area A and height h."""
    return (1./3.) * A * h

Listing 1: A correct solution to question 3 of the example exercise

Dear Neil O’Brien,

Testing of your submitted code has been completed:

Overview
========

test_distance       : passed -> 100% ; with weight 1
test_geometric_mean : passed -> 100% ; with weight 1
test_pyramid_volume : passed -> 100% ; with weight 1

Total mark for this assignment: 3 / 3 = 100%.

(Points computed as 1 + 1 + 1 = 3)

-----------------------------------------------

This message has been generated automatically. Should
you feel that you observe a malfunction of the system,
or if you wish to speak to a human, please contact the
course team ([email protected]).

Listing 2: email response to correct submission, additional line wrapping due to column width

If the student submits an incorrect solution, for example with a mistake in question 3 as shown in Listing 3, they will instead receive the feedback shown in Listing 4. Of course the students must learn to interpret this style of feedback in order to gain the maximum benefit, but this is in itself a useful skill, as we discuss more fully in Section 4.8.2, and comments from the testing code assist the students, as discussed in Section 3.4.5. The submission in Listing 3 is incorrect because integer division is used rather than the required floating-point division. These exercises are based on Python 2, where the "/" operator represents integer division if both operands are of integer type, as is common in many programming languages (in Python 3, the "/" operator represents floating point division even if both operands are of type integer).

def pyramid_volume(A, h):
    """Calculate and return the volume of a pyramid
    with base area A and height h."""
    return (A * h) / 3

Listing 3: An incorrect solution to question 3 of the example exercise, using integer division
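As a quick illustration of the pitfall behind Listing 3 (our own sketch, not part of the original exercise material), the following three lines run under both Python versions and show where the behaviour differs:

# integer_division_demo.py: under Python 2, "/" on two integers truncates; under Python 3 it does not.
print(1 / 3)        # Python 2: 0          Python 3: 0.3333333333333333
print(1.0 / 3.0)    # both versions:       0.3333333333333333
print(1 // 3)       # both versions: floor division, always 0 here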

Within the testing feedback in Listing 4, the student code is visible in the name space s, i.e. the function s.pyramid_volume is the function defined in Listing 3. The function correct_pyramid_volume is visible to the test system but students cannot see its implementation in the feedback they receive – this allows us to define tests that compute complicated values for comparison with those computed by the student's submission, without revealing the implementation of the reference computation to the students.
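The paper does not show how the name s is bound; a minimal sketch of what the top of such a test module could look like (the file name test_training1.py and the reference implementation below are our assumptions) is:

# test_training1.py: sketch of the test-side setup (hypothetical file name)
import importlib

# import the student's submitted file (training1.py) under the short name "s",
# so tests can refer to s.pyramid_volume, s.distance, ...
s = importlib.import_module("training1")

def correct_pyramid_volume(A, h):
    """Reference implementation; never shown to students in the feedback."""
    return A * h / 3.0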

3.3 Design and Implementation

The design is based on three different processes that are started periodically (every minute) and communicate with each other via file system based task queues:

1) An incoming queue of incoming student submissions, initial validation and extraction of files and required tests to run (see high level flow chart in Fig. 1a)

2) A queue of outgoing messages that need to be delivered to the users and administrators which – in our email based user interface – decouples the actual testing queue from availability of the email servers (flow chart in Fig. 1b).


Dear Neil O’Brien,

Testing of your submitted code has been completed:

Overview
========

test_distance       : passed -> 100% ; with weight 1
test_geometric_mean : passed -> 100% ; with weight 1
test_pyramid_volume : failed ->   0% ; with weight 1

Total mark for this assignment: 2 / 3 = 67%.

(Points computed as 1 + 1 + 0 = 2)

Test failure report
====================

test_pyramid_volume
-------------------
def test_pyramid_volume():

    # if height h is zero, expect volume zero
    assert s.pyramid_volume(1.0, 0.0) == 0.

    # tolerance for floating point answers
    eps = 1e-14

    # if we have base area A=1, height h=1,
    # we expect a volume of 1/3.:
    assert abs(s.pyramid_volume(1., 1.) - 1./3.) < eps

    # another example
    h = 2.
    A = 4.
    assert abs(s.pyramid_volume(A, h) -
               correct_pyramid_volume(A, h)) < eps

    # does this also work if arguments are integers?
>   assert abs(s.pyramid_volume(1, 1) - 1. / 3.) < eps
E   assert 0.3333333333333333 < 1e-14
E    +  where 0.3333333333333333 = abs((0 - (1.0/3.0)))
E    +  where 0 = <function pyramid_volume at 0x7f0ce1af4e60>(1, 1)
E    +  where <function pyramid_volume at 0x7f0ce1af4e60> = s.pyramid_volume

Listing 4: email response to incorrect solution

3) A queue of tests to be run, where the actual testing of the code takes place in a restricted environment (flow chart in Fig. 1c)

We describe how these work together in more detail in the following sections.

The system is implemented in Python, and primarily tests Python code (in Section 4.6 we discuss generalisation of the system to test code in other languages).
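The paper does not reproduce the code of the file system based task queues; the self-contained sketch below (the directory layout and function names are our own assumptions) shows one way such a queue can be realised with one JSON file per job, so that independent cron-driven processes can hand work to each other:

import json
import os
import time
import uuid

QUEUE_DIR = "/var/lib/autofeedback/testing-queue"   # hypothetical location

def enqueue(job):
    """Write one job (a dict) as a JSON file into the queue directory."""
    name = "%d-%s.json" % (time.time(), uuid.uuid4().hex)
    path = os.path.join(QUEUE_DIR, name)
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(job, f)
    os.rename(tmp, path)   # atomic rename so readers never see partial files

def dequeue():
    """Return (path, job) for the oldest job, or None if the queue is empty."""
    entries = sorted(f for f in os.listdir(QUEUE_DIR) if f.endswith(".json"))
    if not entries:
        return None
    path = os.path.join(QUEUE_DIR, entries[0])
    with open(path) as f:
        return path, json.load(f)

# example: the incoming-mail process might enqueue a testing job like
# enqueue({"user": "abc1g13", "exercise": "lab 4", "files": ["training1.py"]})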

3.3.1 Email receipt and incoming queue process

Each course that uses the automatic feedback provision system has a dedicated email account set up to receive submissions. At the University of Southampton, for a course with code ABC, the email address would be [email protected].

As the subject line, the student has to use a pre-defined string (such as lab 1), which is specified in the assignment instructions, so the testing system can identify which submission this is. The identity of the student is known through the email address of the sender.

The testing system accesses the email inbox every minute, and downloads all incoming mails from it using standard tools such as fetchmail, or getmail combined with cron. These incoming mails are then processed sequentially as summarised in the flow chart in Fig. 1a:

1) The email is copied, for backup purposes, to an archive of all incoming mail for the given course and year.

2) The email is checked for validity in the following respects:

a) the student must be known on this course (this is checked using a list of students enrolled on the course, provided by the student administration office); submissions from students who are not enrolled are logged for review by an administrator in case the student list was not correct or a student has transferred between courses;

b) the subject line of the email must relate to a valid exercise for the course;

c) all required files must be attached to the email, and these must be named as per the instructions for the exercise.

3) If the email is invalid (i.e., one or more of the above criteria are not met), an error report is created and enqueued in the outgoing email queue for delivery. The email explains why the submission is not valid, inviting the student to correct the problems and re-submit their work.

4) For a valid submission, the attachments of the incoming email containing the student's code are saved and

5) an item is placed into the testing queue, including the exercise that is to be tested, the student's user name, and names and paths of the files that were submitted.

6) For a valid submission, an email to the student is enqueued in the outgoing message queue that confirms receipt of the submission; the student can use this to evidence their submission and submission time, and it reassures the students that all required files were present, and that the submission has entered the system.


7) For both valid and invalid submissions, the email is removed from the incoming queue.

[Fig. 1 comprises three flow charts: (a) the incoming queue process, (b) the outgoing email queue process, and (c) the testing queue process.]

Fig. 1: Flow charts illustrating the work flow in each process. Processes are triggered every minute via a cronjob entry, and don't start until their previous instance has completed.
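A compact sketch of the validation step in Fig. 1a is given below. It is our own illustration, not the system's actual code: the maildir path, the example address and the exercise-to-file mapping are assumptions, and only the standard library mailbox and email handling are used.

import mailbox

ENROLLED = {"[email protected]"}                # in practice loaded from the student admin list
VALID_SUBJECTS = {"training 2", "lab 2"}                 # exercises defined for the course
REQUIRED_FILES = {"lab 2": ["training1.py"]}             # hypothetical exercise -> file mapping

def validate(msg):
    """Return (ok, reason); mirrors checks 2a-2c described above."""
    sender = msg.get("From", "").lower()
    subject = msg.get("Subject", "").strip().lower()
    if not any(addr in sender for addr in ENROLLED):
        return False, "sender not enrolled on this course"
    if subject not in VALID_SUBJECTS:
        return False, "subject line does not name a valid exercise"
    attached = [p.get_filename() for p in msg.walk() if p.get_filename()]
    missing = [f for f in REQUIRED_FILES.get(subject, []) if f not in attached]
    if missing:
        return False, "missing attachment(s): %s" % ", ".join(missing)
    return True, "ok"

# example: iterate over a local maildir filled by fetchmail/getmail
for message in mailbox.Maildir("/var/mail/abc-course"):
    print(validate(message))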

3.3.2 Outgoing messages

The implementation of sending error messages and feedback reports to the students, and any other messages to administrators, is realised through a separate queue and process for outgoing messages (see Fig. 1b and discussion of this design in Sec. 3.4.7). This process is also used for weekly emails informing students about the overall progress (Sec. 3.4.6).

We note in passing that all automatically generated messages invite the student to contact the course leader, other teaching staff or the administrator of the feedback provision system should they not understand the email or feel that the system has malfunctioned; help can be sought by email or in person during the timetabled teaching activities.
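The outgoing queue process of Fig. 1b can be sketched as follows (again our own simplified illustration; the directory layout, SMTP host and message format are assumptions). A message file is only removed after the SMTP server has accepted it, so messages accumulate harmlessly while the mail service is unavailable:

import json
import os
import smtplib
from email.mime.text import MIMEText

OUTGOING_DIR = "/var/lib/autofeedback/outgoing-queue"   # hypothetical location
SMTP_HOST = "localhost"

def send_pending():
    """Try to deliver every queued message; keep any that fail for the next run."""
    for name in sorted(os.listdir(OUTGOING_DIR)):
        path = os.path.join(OUTGOING_DIR, name)
        with open(path) as f:
            job = json.load(f)           # {"from": ..., "to": ..., "subject": ..., "body": ...}
        msg = MIMEText(job["body"])
        msg["Subject"] = job["subject"]
        msg["From"] = job["from"]
        msg["To"] = job["to"]
        try:
            server = smtplib.SMTP(SMTP_HOST)
            server.sendmail(job["from"], [job["to"]], msg.as_string())
            server.quit()
        except (smtplib.SMTPException, IOError, OSError):
            continue                      # leave the file in place; retried next minute
        os.remove(path)                   # delivered: remove from the queue

if __name__ == "__main__":
    send_pending()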

3.4 Design and implementation of student code testing

The testing queue shown in Fig. 1c processes submissions that have been enqueued by the incoming mail processing script. The task is to execute a number of predefined tests against the student code in a secure environment, using unit testing tools to establish correctness of the student submission. As we use Python for these courses in computation for science and engineering, we can plug into the testing capabilities that come with Python, and those that are provided by third party tools, such as nose [25] and pytest [26]. We have chosen the py.test tool because we have more experience with this system.

Here, we provide a brief overview of the testing process, which is invoked every minute (unless an instance started earlier has not completed execution yet), with Sections 3.4.2 to 3.4.7 providing more details on the requirements and chosen design and implementation.

For each testing job found in the queue, the following steps are carried out (see the flow diagram in Fig. 1c):

1) The student files to be tested are copied to a sand-boxed location on the file system with limited access permissions (Sec. 3.4.1).

2) A dedicated local user with minimal privileges tries to import the code in a Python process to check for correct syntax.

3) If the import fails due to syntax errors an error message is prepared for the user and injected into the outgoing message queue. (See also Sec. 4.4.1 for a discussion.) The job is removed from the testing queue and the process moves to the next item in the queue.

4) If the import succeeds, the tests are run on the submitted code in the restricted environment (Sec. 3.4.2 to 3.4.4).

5) Output files (that the student code may produce) and testing logs are archived, marks extracted and all data are stored in a database which may be used by the lecturer to discover the marks for each student, for each question and assignment.

6) A feedback message for the student is prepared and injected into the outgoing message queue containing the test results (Sec. 3.4.5). This provides the student with a score for each question in the assignment, and where mistakes were found, provides details of the particular incorrect behaviour that was discovered. Listing 4 shows an example of such feedback.

7) The test job is removed from the queue.

We discuss additional weekly feedback to students in Sec. 3.4.6 and the system's dependability in Sec. 3.4.7.

3.4.1 Security measures

By the nature of the testing system, it contains student data (names, email addresses, and submissions), and it is incumbent upon the developers and administrators to take all reasonable measures to safeguard these data against unauthorised disclosure or modification. We also require the system to maintain a high availability and reliability. The risks that we need to guard against can largely be divided into two categories: (i) genuine mistakes made by students in their code, and (ii) attempts by students – or others who have somehow gained access to a student's email account – to intentionally access or change their own or other students' work, assigned marks, or other parts of the testing system.

Experience shows that some of the most common genuine mistakes made by students include cases such as unterminated loops, which would execute indefinitely. Due to the serialisation of the tests in our system, this problem, if left unchecked, would stop the system processing any further submissions until an administrator corrected it. However, we have applied a POSIX resource limit [27], [28] on CPU time to ensure that student work consuming more than a reasonable and fixed limit is terminated by the system. We catch any such terminations, and in this case we have adopted a policy of informing the student by email, and giving them the opportunity to re-submit an amended version of their work. We apply similar resource limits on both disk space consumption and virtual memory size, in order that loops which would output large amounts of data to stdout, stderr, or a file on disk, or which interminably append to a list or array resulting in its consumption of unreasonable amounts of memory, are also prevented from causing an undue impact on the testing machine's resources.
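The exact mechanism is not given in the paper; one way to apply such POSIX limits from Python (a sketch using the standard resource module, with limit values and the test file name chosen arbitrarily here) is to set them in the child process just before the test run starts:

import resource
import subprocess

def limit_resources():
    """Applied in the child process between fork and exec (via preexec_fn)."""
    resource.setrlimit(resource.RLIMIT_CPU, (30, 30))                    # seconds of CPU time
    resource.setrlimit(resource.RLIMIT_AS, (512 * 2**20, 512 * 2**20))   # virtual memory, bytes
    resource.setrlimit(resource.RLIMIT_FSIZE, (10 * 2**20, 10 * 2**20))  # largest file written, bytes

# run the tests for one submission with the limits in place
proc = subprocess.Popen(
    ["py.test", "test_training1.py"],       # hypothetical test file name
    preexec_fn=limit_resources,
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
)
out, err = proc.communicate()
print("tests exited with status", proc.returncode)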

We address the potential that submitted code could attempt to maliciously access data about another student (or parts of the system) with a multi-faceted approach:

1) We execute the tests on the student code under a separate local user account on the server that performs the tests. This account has minimal permissions on the file system.

2) We create a separate directory for each submission that we test, and run the tests within this directory.

3) The result of the two previous points, assuming that all relevant file system permissions are configured correctly, means that no student submission may read or modify any other student's submissions or marks, nor can it read the code comprising the testing system.

4) The environment variables available to processes running as the test user are limited to a small set of pre-defined variables, so that no sensitive data will be disclosed through that mechanism.

5) We do not provide the students information about the file system layout, local account names, etc. on the host that runs the tests, to reduce the chance that students know of the locations of sensitive data on the file system.
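Combining points 1, 2 and 4 above, a launch of a single test run could be sketched as follows. This is our own illustration under several assumptions (the user name, paths, environment whitelist and the use of sudo are not taken from the paper):

import os
import shutil
import subprocess
import tempfile

TEST_USER = "autotest"                      # hypothetical low-privilege account
CLEAN_ENV = {"PATH": "/usr/bin:/bin",       # minimal whitelisted environment
             "LANG": "C"}

def run_submission(submitted_file, test_file):
    """Copy the submission into a fresh directory and run py.test there."""
    workdir = tempfile.mkdtemp(prefix="submission-")
    os.chmod(workdir, 0o755)                # allow the test user to enter the directory
    shutil.copy(submitted_file, workdir)
    shutil.copy(test_file, workdir)
    cmd = ["sudo", "-u", TEST_USER,         # drop to the unprivileged test user
           "py.test", os.path.basename(test_file)]
    return subprocess.call(cmd, cwd=workdir, env=CLEAN_ENV)

# e.g. run_submission("training1.py", "test_training1.py")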

3.4.2 Iterative testing of student code

We have split the exercises on our courses into questions, and arranged to test each question separately. Within a question, the testing process stops if any of the test criteria are not satisfied. This approach was picked to encourage an iterative process whereby students are guided to focus on one mistake at a time, correct it, and get further feedback, which improves the learning experience. This approach is similar to that taken by Tillmann et al. [29], where the iterative process of supplying code that works towards the behaviour of a model solution for a given exercise is so close to gaming that it "is viewed by users as a game, with a byproduct of learning". Our process resembles test-driven development strategies and familiarises the students with test-driven development [30] in a practical way.

3.4.3 Defining the tests

There are an indefinite number of both correct and incorrect ways to answer an exercise, and to test correctness using a regression testing framework requires some skill and experience in constructing a suitably rigorous test case for the exercise. We build on our experience before and after the introduction of the testing system, ongoing feedback from interacting with the students and reviewing their submissions to design the best possible unit testing for the learning experience. This includes testing for correctness but also structuring tests in a didactically meaningful order. Comments added in the testing code will be visible to the students when a test fails, and can be used to provide guidance to the learners as to what is tested for, and what the cause of any failure may be (if desired).

Considering question 3) in the example exercise we introduced in Section 3.2, the tests that we carry out on the student's function include the following:

1) Volume must be 0 when h is 0.
2) Volume must be 0 when A is 0.
3) If we have A = 1 and h = 3, volume must be 1.
4) If we have A = 3 and h = 1, volume must be 1.
5) If we have A = 1.0 (as a float) and h = 1.0 (as a float), volume must be 1/3.
6) If we test another combination of values of floating-point numbers A and h then the returned volume must be A * h / 3.0.
7) If we have A = 1 (as an integer) and h = 1 (as an integer), volume must be 1/3.
8) The function must have a documentation string; this must contain several words, one of which is "return".

In this very simple example, we set up the first group of criteria (1–6) to determine that the student has implemented the correct formula to solve the problem at hand. Criterion 7 tests for the common mistake of using integer division where floating-point division is required. The final criterion concerns coding style. In this example, it is a strict requirement that the code is documented to at least some minimal standard, and the student will gain no marks for a question that is answered without a suitable documentation string.

Our implementation of the tests described above is given in Listing 5. In implementing these criteria, we avoid testing for exact equality of floating point numbers at any point in the testing process. Instead we define some tolerance (e.g. eps = 1e-14), and require that the magnitude of the difference between the result of the student's code and the required answer be below this tolerance. This avoids failing student submissions which have e.g. performed accumulation operations in a different order and concomitantly suffered differing floating-point round-off effects. As exercises become more complex and related to numerical methods, a different tolerance may have to be chosen.
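A two-line illustration (our own, not from the paper) of why exact comparison is avoided:

# summing the same values in a different order gives a slightly different float
print(0.1 + 0.2 + 0.3 == 0.3 + 0.2 + 0.1)                   # False
print(abs((0.1 + 0.2 + 0.3) - (0.3 + 0.2 + 0.1)) < 1e-14)   # True: within tolerance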

We order the criteria so those that are most likely to pass are tested earlier, and we have chosen to stop the testing process at the first error encountered. This encourages students to address and correct one error at a time in an iterative process, if required, which is possible thanks to the short timescale between their submitting work and receiving feedback (see Sec. 3.4.2).


def test_pyramid_volume():

    # if height h is zero, expect volume zero
    assert s.pyramid_volume(1.0, 0.0) == 0.

    # if base A is zero, expect volume zero
    assert s.pyramid_volume(0.0, 1.0) == 0.

    # if base has area A=1, and the height is h=3,
    # we expect a volume of 1:
    assert s.pyramid_volume(1.0, 3.0) == 1.

    # if base has area A=3, and the height is h=1,
    # we expect a volume of 1:
    assert s.pyramid_volume(3.0, 1.0) == 1.

    # acceptable tolerance for floating point answers
    eps = 1e-14

    # if base has area A=1, and the height is h=1,
    # we expect a volume of 1/3.:
    assert abs(s.pyramid_volume(1., 1.) - 1./3.) < eps

    # another example
    h = 2.
    A = 4.
    assert abs(s.pyramid_volume(A, h) -
               correct_pyramid_volume(A, h)) < eps

    # does this also work if arguments are integers?
    eps = 1e-14
    assert abs(s.pyramid_volume(1, 1) - 1./3.) < eps

    # is the function documented well
    docstring_test(s.pyramid_volume)

Listing 5: testing code for example question

The implementation of the tests for py.test is based on assert statements, which are True when the student's code passes the relevant test, and False otherwise. The final criterion, that the documentation string must exist and pass certain tests, is handled by asserting that a custom function that we provide to check the documentation string returns True. Of course, the tests must be developed carefully to suit the exercise they apply to, and to exercise any likely weaknesses in the students' answers, such as the chance that integer division would be used in the implementation of the formula for the volume of a pyramid discussed above.
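The docstring_test helper is not shown in the paper; a minimal sketch of what such a check could look like, matching criterion 8 above (the exact rules used in the real system are not known to us), is:

def docstring_test(func, min_words=3):
    """Assert that func has a docstring of several words, one of which is 'return'."""
    doc = func.__doc__
    assert doc is not None, "function %s has no documentation string" % func.__name__
    words = doc.split()
    assert len(words) >= min_words, "documentation string is too short"
    assert any(w.lower().startswith("return") for w in words), \
        "documentation string should mention what the function returns"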

3.4.4 Clean code and PEP 8

In addition to the hard syntactic requirements of a programming language, there are often recommendations on how to style and lay out source code. We find that it is very efficient to introduce this to students from the very beginning of their programming learning journey.

For Python, the so-called "PEP 8 Style Guide" for Python Code [31] is useful guidance, and electronic tools are available to check that code follows these voluntary recommendations for clean code. PEP 8 has recommendations for the number of spaces around operators, before and after commas, the number of empty lines between functions, class definitions, etc.

We use the pep8 utility [32] to assess the conformance of the student's entire submission file (which will usually consist of answers to several questions like the above) with the PEP 8 Style Guide. Our system counts the number of errors that are found, N_err, and penalises the student's total score according to a policy (e.g. we may choose a policy of multiplying the raw mark that could be obtained for full PEP 8 conformance by 2^(-N_err), or of implementing any other desired mark adjustment as a function of that value).
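As an illustration of such a policy (our own sketch; it assumes the pep8 command line tool is installed and prints one violation per line of output), the penalty 2^(-N_err) could be applied as follows:

import subprocess

def pep8_error_count(filename):
    """Run the pep8 checker and count reported violations (one per line of output)."""
    proc = subprocess.Popen(["pep8", filename],
                            stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, _ = proc.communicate()
    return len(out.decode("utf-8").splitlines())

def apply_style_penalty(raw_mark, n_err):
    """Halve the mark for every PEP 8 violation: raw_mark * 2**(-n_err)."""
    return raw_mark * 2.0 ** (-n_err)

# example: 2 violations halve the mark twice: 3.0 * 2**(-2) = 0.75
print(apply_style_penalty(3.0, 2))
# in the real flow the count would come from pep8_error_count("training1.py")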

3.4.5 Results and feedback provision to student

The results of the testing process are written to machine-readable files by py.test. For each tested submission, the report is parsed by our system, with one of a number of results being possible: the student code may have run completely, in which case we have a pass result or a fail result for each of the defined tests. Otherwise the student code may have terminated with an error, which is most likely due to a resource limit being exceeded causing the operating system to abort the process, as discussed in Section 3.4.1.

The number of questions that were answered correctly (i.e. have no failed assertions in the associated tests) is counted and stored in a database. If there were incorrect answers, we extract a backtrace from the py.test output which we incorporate into the email that is sent to the student. The general format of the results email is to give a per-question mark, with a total mark for the submission, and then to detail any errors that were encountered. In the calculation of the mark for the assessment, questions can be given different weights to reflect greater importance or challenges of particular questions. For the example shown in Listing 2 all questions have the same weight of 1.
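The weighting scheme amounts to a weighted average over per-question pass/fail results; a small sketch (ours, mirroring the numbers in Listings 2 and 4) is:

def total_mark(results):
    """results: list of (test_name, passed, weight) tuples -> percentage mark."""
    earned = sum(weight for _, passed, weight in results if passed)
    total = sum(weight for _, _, weight in results)
    return 100.0 * earned / total

# the failed submission of Listing 4: 2 of 3 equally weighted tests pass -> 67%
results = [("test_distance", True, 1),
           ("test_geometric_mean", True, 1),
           ("test_pyramid_volume", False, 1)]
print(int(round(total_mark(results))))   # 67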

We described and illustrated a typical question, which might form part of an assignment, in Section 3.2. As shown in Listing 4, when an error is encountered, the results that are sent to the student include the portions of the testing code for the question in which the error was found that have passed successfully, and then indicate with the > character the line whose assertion failed (in this case the 7th-last line shown). This is followed by a backtrace which illustrates that, in this case, the submitted pyramid_volume function returned 0 when it was expected to return an answer of 1/3 ± 1 × 10^-14. The report also includes several comments, which are introduced in the testing code (shown in Listing 5), and assist students in working out what was being tested when the error was found. Here, the comment "does this also work if arguments are integers?" shows the student that we are about to test their work with integer parameters; that should prompt them to check for integer division operations. If they do not succeed in doing this, they will be able to show their feedback to a demonstrator or academic, who can use the feedback to locate the error in the student's code swiftly, then help the student find the problem, and discuss ways to improve the code.

3.4.6 Statistical reporting to lecturers and routine performance feedback to students

The system records all pertinent data about each submission including the user who made the submission, as well as the date and time of the submission, and the mark awarded. We use this data to further engage students with the learning process, by sending out a weekly email summary of their performance to date, as shown in Listing 6. This includes a line for each exercise whose deadline has passed, which reminds the student of their mark and whether their submission was on time or not. For a student who has submitted no work, a different reminder is sent out, requesting they submit work, and giving contact details of the course leader, asking them to make contact if they are experiencing problems. Messages are sent via the outgoing queue (Sec. 3.3.2).

We also monitor missing submissions in the first couple of weeks very carefully and contact students individually who appear not to have submitted any work. Occasionally, they are registered on the wrong course, but similarly some students just need a little bit of extra help with their first ever programming exercises and by expecting the first submitted work at the end of the first or second teaching week, we can intervene early in the semester and help those students get started with the exercises and follow the remainder of the course.

Dear Neil O'Brien,

Please find below your summary of submissions and
preliminary marks for the weekly laboratory sessions
for course ABC, as of Fri Jan 30 17:06:44 2015.

lab 2 : 25% Details: 1.00 / 4.00, submitted before deadline
lab 3 : 31% Details: 1.25 / 4.00, submitted before deadline
lab 4 :  0% Details: 4.00 / 4.00, but submission at 2014-11-14 20:39:02 was late by 4:39:02.
lab 5 : 80% Details: 4.00 / 5.00, submitted before deadline
lab 6 : 77% Details: 3.06 / 4.00, submitted before deadline
lab 7 : 75% Details: 3.00 / 4.00, submitted before deadline

The average mark over the listed labs is 48%.

With kind regards,

The teaching team ([email protected])

Listing 6: Typical routine feedback email

After the deadline for each set of exercises, the course lecturers will generally flick through the code that students have submitted (or at least 10 to 20 randomly chosen submissions if the number of students is large). This helps the teacher in identifying typical patterns and mistakes in the students' solutions, which can be discussed, analysed and improved effectively in the next lecture: once all student specific details are removed from the code (such as name, login and email address), submitted (and anonymised) code can be shown in the next lecture. We find that students clearly enjoy this kind of discussion and code review jointly carried out by students and lecturer in the lecture theatre, in particular where there is the possibility that their anonymised code is being shown (although only they would know).

The data for the performance of the whole class is made available to the lecturer through private web pages which allow quick navigation to each student and all their submissions, files and results. Key data is also made available as a spreadsheet, and a number of graphs showing the submission activity (some are shown in Figs. 3, 4 and 5 and discussed in Section 4).

3.4.7 Dependability and resilience

The submission system is a critical piece of infrastructure for the delivery of those courses that have adopted it as their marking and feedback system; this means that its reliability and availability must be maximised. We have taken several measures to reduce the risk of downtime and service outages, and also to reduce the risk of data loss to a low level.

The machine on which the system is installed is a virtual machine which is hosted on centrally managed University infrastructure. This promises good physical security for the host machines, and high-availability features of the hypervisor, such as live migration [33], improve resilience against possible individual hardware failures. To combat the possibility of the data (especially the student submissions) being lost we have instituted a multi-tier backup system, which backs up the system's data to multiple physical locations and to multiple destination storage media, so that the probability of losing data should be very small.

The remaining potential single point of failure is the University's email system, which is required for any student to be able to submit work or receive feedback. In the case that the email system were to fail close to a deadline, we would have the choice of extending the deadline to allow submissions after the service was restored, and/or manual intervention to update marks where students could demonstrate that their submission was ready on time, depending on the lecturer's chosen policy.

The internal architecture of the testing system was designed to be as resilient as possible, and to limit the potential impact of any faults. A key approach to this goal is the use of various (file system based) queues (Fig. 1) that decouple the different stages of submission handling and testing so that e.g. a failure of the system's ability to deliver emails would not impede testing submissions already received. Emails are received into a local mailbox and are processed one item at a time so this is the first effective queue; receipt of emails can continue even if the testing process has halted. Valid submissions from processed emails are then entered into a queue for testing, the entries of which are processed sequentially. The receipt and testing processes generate emails, which are placed into an outgoing mail queue and are sent regularly, the queue items being removed only after successful transmission. This way, if the outgoing email service is unavailable, mails will accumulate in the queue and be sent en masse when the service is restored.

Another key design decision was that each individual part of the receipt and testing process is carried out sequentially for each submission and is protected by lock files. Prior to processing received messages, the system checks for existing locks; if these exist, the processing doesn't start, and the event is logged (receipt of other emails continues as the receiving and processing are separate processes). If no locks are found, a lock is created, which is removed upon successful processing; any unexpected termination of the processing code will result in a lock file being left behind, so that we can investigate what went wrong and make any required corrections before restarting the system. The testing process itself is likewise protected by locking. A separate watch dog process alerts the administrator if lock files have stayed in place for more than a few minutes – typically each process completes within a minute.
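A minimal sketch of such a lock-file guard that a cron-started process could run through (our own illustration; the lock file path is an assumption):

import os
import sys

LOCKFILE = "/var/lib/autofeedback/testing.lock"   # hypothetical location

def acquire_lock():
    """Create the lock file atomically; return False if another run holds it."""
    try:
        fd = os.open(LOCKFILE, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except OSError:
        return False          # lock exists: a previous instance is (still) running
    os.write(fd, str(os.getpid()).encode("utf-8"))
    os.close(fd)
    return True

if not acquire_lock():
    sys.exit(0)               # skip this run; the event would be logged in practice

# ... process the queue here ...

os.remove(LOCKFILE)           # reached only after successful processing; a crash
                              # leaves the lock behind for an administrator to inspect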

In practice we have developed the system to the point that we have not had an unexpected failure that required us to manually clean up and unlock in two years of production use, but should an unexpected bug be found, this design ensures that at most one submission will be affected (a copy will have been made before any processing was carried out, so even in this case there would be no loss of student data).

4 RESULTS

4.1 Testing system deployment

The automatic testing system was first used at the University of Southampton's Highfield Campus in the academic year 2009/2010 for teaching about 85 Aerospace engineers, and has been used every year since for growing student numbers, reaching 425 students in 2014/2015. The Southampton deployment now additionally serves another cohort of students who study at the University of Southampton Malaysia Campus (USMC) and there is a further deployment at the Indian Institute of Technology (IIT) Mandi and Madras campuses, where the system has been integrated with the Moodle learning management system, as described in Section 4.5.

The testing system has also been used in a number of smaller courses at Southampton, typically of approximately 20 students, such as one-week intensive Python programming courses offered to PhD students. It also serves Southampton's courses in advanced computational methods where around 100 students have submitted assignments in C, as described in Section 4.6.


[Fig. 2 shows two boxes: a "Training Assignment" (Exercises: Question T1, Question T2, ...) with voluntary, formative feedback and assessment (not contributing to the final mark), followed by a "Laboratory Assignment" (Exercises: Question L1, Question L2, ...) with compulsory, summative feedback and assessment (first submission contributing to the final mark).]

Fig. 2: Overview of the structure of the weekly computer laboratory session: A voluntary set of training exercises is offered to the students as a "training" assignment on which they receive feedback and a mark, followed by a compulsory set of exercises in the same topic area as the "laboratory" assignment which is marked and contributes to each student's final mark for the course. Automatic feedback is provided for both assignments and repeat submissions are invited.

4.2 Case study: Introduction to Computing

In this section, we present and discuss experience and pertinent statistics from the production usage of the system in teaching our first-year computing course, in which programming is a key component. In 2014/15, there were about 425 students in their first semester of studying Acoustic Engineering, Aerospace Engineering, Mechanical Engineering, and Ship Science.

4.2.1 Course structure

The course is delivered through weekly lectures (Sec. 2.2) and weekly self-paced student exercises (Sec. 2.3) with a completion deadline a day before the next lecture takes place (to allow the lecturer to sight submissions and provide generic feedback in the lecture the next day). Students are offered a 90 minute slot (which is called "computing laboratory" in Southampton) in which they can carry out the exercises, and teaching staff are available to provide help. Students are allowed and able to start the exercise before that laboratory session, and use the submission and testing system anytime before, during and after that 90 minute slot.

Each weekly exercise is split into two assignments: a set of "training" exercises and a set of assessed "laboratory" exercises. This is summarised in Fig. 2.

The training assignment is checked for correctness and marked using the automatic system, but whilst we record the results and feed back to the students, they do not influence the students' grades for the course. Training exercises are voluntary, but the students are encouraged to complete them in order to practise the skills they are currently learning and to prepare for the following assessed exercise, which tests broadly similar skills.

Students can repeatedly re-submit their (modified) code, for example until they have removed all errors from it, or they may wish to submit different implementations to get feedback on those.

The assessed laboratory assignment is the second part of each week's exercises. For these, the students attempt to develop a solution as perfect as possible before submitting it by email to the testing system. This "laboratory" submission is assessed and marks and feedback are provided to the student. These marks are recorded as the student's mark for that week's exercises, and contribute to the final course mark. The student is allowed (and encouraged) to submit further solutions, which will be assessed and feedback provided, but it is the first submission that is recorded as the student's mark for that laboratory.

The main assessment of the course is done through a programming exam at the end of the semester in which students write code on a computer in a 90 minute session, without Internet access but having an editor and Python interpreter to test the code they write. Each weekly assignment contributes of the order of one percent to the final mark, i.e. 10% overall for a 10 week course. Each laboratory session can be seen as a training opportunity for the exam as the format and expectations are similar.

4.2.2 Student behaviour: exploiting learning opportunities from multiple submissions

In Figure 3a, we illustrate the distribution of submission counts for "training 2", which is the voluntary set of exercises from week 2 of the course.

The bar labelled 1 with height 92 shows that 92 students have submitted the training assignment exactly once, the bar labelled 2 shows that 76 students submitted their training assignment exactly twice, and so on.


[Fig. 3 panels: (a) training2 – number of submissions for training2 (1 to 15) against number of students (0 to 80); (b) lab2 – number of submissions for lab2 (1 to 8) against number of students (0 to 300).]

Fig. 3: Histogram illustrating the distribution of submission counts per student for the (a) voluntary training and (b) assessed laboratory assignment (see text in Sec. 4.2.2).

The sum over all bars is 316, which is the total number of students participating in this voluntary training assignment. 87 students submitted four or more times, and several students submitted 10 or more times. This illustrates that our concept of students being free to make step-wise improvements where needed and rapidly get further feedback has been successfully realised.

We can contrast this to Figure 3b, which shows the same data for the compulsory laboratory assignment in week 2 ("lab2"). This submission attracts marks which contribute to the students' overall grades for the course. In this case the students are advised that while they are free to submit multiple times for further feedback, only the mark recorded for their first submission will count towards their score for the course. For lab 2, 423 students submitted work, of whom 314 submitted once only.

However, 64 students submitted a second revised version, and a significant minority of 45 students submitted three or more times to avail themselves of the benefits of further feedback after revising their submissions, even though the subsequent submissions do not affect their mark.

Significant numbers of students choose to submit their work for both voluntary and compulsory assignments repeatedly, demonstrating that the system offers the students an extended learning opportunity that the conventional cycle of submitting work once, having it marked once by a human, and moving to the next exercise does not provide.

The proportion of students submitting multiple times for the assessed laboratory assignment (Figure 3b) is smaller than for the training exercise (Figure 3a) and likely highlights the difference between the students' approaches to formative and summative assessment. It is also possible that students need more iterations to learn new concepts in the training assignment before applying the new knowledge in the laboratory assignment, contributing to the difference in resubmissions. The larger number of students submitting for the assessed assignment (423 ≈ 100%) compared with the number of students submitting for the training assignment (316 ≈ 74%) shows that the incentive of having a mark contribute to their overall grade is a powerful one.

4.2.3 Student behaviour: timing of submissions

In Figure 4 we show the submission timelines for all the voluntary "training" assignments that the students were offered every week. There are ten such scheduled assignments in total, and for each a line is shown. The assignments may be identified by their chronological sequence, as discussed in the following paragraphs.

In Figure 5, we show the same data but for the compulsory and assessed laboratory assignments (see Fig. 2 and Sec. 4.2.1 for a detailed explanation of the "training" and "laboratory" assignments).

Plot a) in Figure 4 and a) in Figure 5 show the "unique" student submission counts for every exercise. By unique, we mean that only the first submission that any individual student makes for a given assignment is counted in the graph. In contrast, subplots b) in Figure 4 and b) in Figure 5 show the "non-unique" submissions that include every submission made, even repeat submissions from any particular student for the same assignment.


[Fig. 4(a) plot: cumulative unique submissions (0 to 400) against time (Oct 2014 to Jan 2015), one curve per assignment training1 to training11, with annotations L1 to L10, EA, SS and EX.]

(a) Unique submissions of voluntary training exercises, showing the number of students participating as a function of time.

[Fig. 4(b) plot: cumulative non-unique submissions (0 to 1000) against time (Oct 2014 to Jan 2015), one curve per assignment training1 to training11, with annotations as in (a).]

(b) Non-unique submissions of voluntary training exercises, showing the total number of submissions as a function of time.

Fig. 4: Submissions of voluntary training assignments as a function of time for (a) unique student participation for each assignment, (b) total number of submissions for each assignment. Labels L1 . . . L10 and associated dashed vertical lines indicate time-tabled computing laboratory sessions 1 to 10 at Southampton; EA – end of autumn term; SS – start of spring term (Christmas break is between these dates); EX – exam. (See Sec. 4.2.3 for details.)


[Fig. 5(a) plot: cumulative unique submissions (0 to 450) against time (Oct 2014 to Jan 2015), one curve per assignment lab2 to lab10, with annotations L1 to L10, S2 to S10, M2 to M10, EA, SS and EX.]

(a) Unique submissions of compulsory assessed laboratory assignments, showing the number of students participating as a function of time.

[Fig. 5(b) plot: cumulative non-unique submissions (0 to 700) against time (Oct 2014 to Jan 2015), one curve per assignment lab2 to lab10, with annotations as in (a).]

(b) Non-unique submissions of compulsory assessed laboratory assignments, showing the total number of submissions as a function of time.

Fig. 5: Submissions of compulsory and assessed laboratory assignments as a function of time for (a) unique student participation for each assignment, (b) total number of submissions for each assignment. Labels as in Fig. 4. Additional labels S2 to S10 (vertical dashed lines) indicate submission deadlines in Southampton; M2 to M10 (vertical dash-dotted lines in the lower part of the plot) show submission deadlines for students in Malaysia.


The unique plots allow us to gauge the total number of students submitting work to a given assignment (as a function of time), and the non-unique plots allow us to see the total number of submissions made by the entirety of the student body together.

We discuss the labels and annotations in Figure 4 first, but they apply similarly to Figure 5. The dashed vertical lines represent time-tabled computing laboratory sessions lasting 90 minutes where the students are invited to carry out the voluntary training and assessed laboratory assignment for that week in the presence of and with support from teaching staff. These time-tabled sessions, in which every student has a computer available to write their code, are labelled L1 to L10 in the figures.

The coloured symbols which are connected by straight lines count the number of submissions. In Figure 4 a), the "training 1" assignment submissions are shown in blue, the next week's "training 2" assignment submissions are shown in green, etc. There are ten scheduled laboratory sessions, and ten associated voluntary training assignments. There is one additional assignment at the end of the course which is offered to help revision for the exam, shown on the very right in Figure 4 a) and b) in yellow without symbols.

Figure 5 shows submission counts for the compulsory assessed laboratory exercises. There were 9 such assessed assignments, starting in the second week of the course: while there is a voluntary training assignment in week 1, there is no assessed laboratory assignment, which gives the students some time to familiarise themselves with the teaching material and submission system. From week 2 onward, there is one voluntary training assignment and one assessed assignment every week up to and including week 10. The students were given submission deadlines for the assessed laboratory assignments, and these deadlines for Southampton students are shown in the plots as dotted vertical lines labelled S2 to S10.

The course was delivered simultaneously at the University of Southampton (UoS) Highfield Campus in the United Kingdom – where about 400 students were taught – and the University of Southampton Malaysia Campus (USMC) in Malaysia – where a smaller group of about 25 students was taught. While following the same lecture material and assignments, these two campuses, due to different local arrangements and time zones, taught the course to different schedules, and the effect of this division is visible in all of the figures. The Malaysia students have different deadlines from the Southampton students, and these are shown as shorter vertical dash-dotted lines, labelled M2 to M10, towards the bottom of each plot in Figure 5. The deadlines of students in Malaysia (M) and Southampton (S) follow local holidays and other constraints, although they often fully coincide (S6 to S9), or are delayed by one week (S2 to S5).

We now discuss the actual data presented, starting with the voluntary training assignment submissions in Figure 4. Looking at Figure 4a we see that the first training exercise had the largest number of submissions of any of the training exercises. About 300 of these submissions occurred during the first hands-on taught session L1, reflecting a large number of students who followed the recommended learning procedure of completing the voluntary exercises and doing so during the computing laboratory session in the presence of teaching staff, and who had sufficient resource and instruction available to do so.

The corresponding bursts of submissions during the computing laboratory sessions L2 and L3 decrease to about 175 and 150 submissions, respectively. The total number of students participating in these voluntary assignments in the first three weeks decreases from about 400 in week 1, to about 310 and about 270 in weeks 2 and 3, respectively. The total number of unique submissions reaches its minimum of about 80 in week 4 (the purple data set associated with hands-on session L4), and then starts to increase again for the remainder of the course.

We see in Figure 5 a) that the compulsory submissions remain high, so this drop in the voluntary submissions is no reason for concern, and may reflect that students understand the learning methods and choose the learning activities that suit their own preferences and strengths. The data may also suggest an opportunity to make the assignments slightly more challenging, as students seem to feel very confident in tackling them.

In addition to the burst of submissions during the time-tabled sessions L1 to L10, we also note a significant number of submissions both before and after these sessions in Figure 4 a) and b), re-emphasising the flexibility that the system affords students as to where and when they submit their work. Anecdotal evidence and written feedback from the students (Sec. 4.3), combined with the submission data, suggest that some students will do the exercises as soon as they become available, while others prefer to do this during the weekend or evening hours. Many students see the offered computing laboratory sessions as an opportunity to seek support, which they make use of if they feel this will benefit their learning.

Figure 4a shows that as the examination date (labelled EX at the right-hand side of the graphs) approached, a relatively small number of students started to submit solutions to the training exercises they had not submitted before, as part of their revision and exam preparation. The same tendency is visible in Figure 4b with a slightly larger increase due to repeat submissions that cannot be seen in the graph in Fig. 4a.

We now discuss Figure 5, which shows the same type of data as Figure 4 but for the compulsory assessed laboratory assignments rather than the voluntary training assignments. The most notable difference is that the total number of submissions remained high for all the assignments, reflecting that these assignments are not voluntary and do contribute to the final course mark. There is a slow decline of submissions (from about 425 to 375 during the course, corresponding to approximately 10%) which is not unexpected and includes students leaving their degree programme altogether, suspending their studies for health reasons, etc.

The vast majority of first submissions for the compulsory laboratory assignments, which contribute to the overall course marks, occur in advance of the deadline, as illustrated in Figure 5a where the deadlines are shown as vertical dotted lines.

The assessed assignment timings in Figure 5a show that submissions take place in different phases. The trend is visible in all the lines, but most clearly where the submission deadlines in Malaysia coincide with those in Southampton, i.e. laboratory sessions 6, 7, 8 and 9: the first submissions are received after the assignments have been published, and then a steady stream of submissions comes in, leading to an approximately straight diagonal line in Figure 5a. The second set of submissions is received during the associated laboratory session (shown as a dashed line), where many students complete the work in the timetabled session. Following that, there is again a steady stream of submissions up to the actual deadline (shown as a dotted line), where submissions accumulate. Very few (first) submissions are received after the deadline. The submissions in the second phase can be used to estimate student attendance in the laboratory sessions (see discussion in 4.8.4).

The University of Southampton Christmas break is also apparent, a period during which there are few new unique submissions (Fig. 5a), but slightly more new non-unique submissions (Fig. 5b) from students revising over the holiday and re-submitting assignments they had submitted before. It is reasonable to assume that they have re-written the code as an exam preparation exercise.

Trends seen in the voluntary submission data in Figure 4, such as a notable rise in the non-unique (i.e. repeat) submissions across all assignments in the days leading up to the exam, are also evident in Figure 5.

4.3 Feedback from students

While overall ratings of our courses using the automatic testing and feedback system are very good, it is hard to distinguish the effect of the testing system from that of, for example, an enthusiastic team of teachers, which would also achieve good ratings when using more conventional assessment and feedback methods.

We invited feedback explicitly on the automatic feedback system, asking for voluntary provision of (i) reasons why students liked the system and (ii) reasons why students disliked the system. The replies are not homogeneous enough to compile statistical summaries, but we provide a representative selection of comments we have received below.

4.3.1 I like the testing system because. . .

The following items of feedback were given by the students when offered to complete the sentence "I like the testing system because. . . " as part of the course evaluation:

1) because we can get quick feedback
2) it is very quick
3) it provides a quick response
4) immediate effect
5) quick response
6) it gives very quick feedback on whether code has the desired effect
7) it provides speedy feedback, even if working at home in the evening
8) it worked and you could submit and re-submit at your own pace
9) I like the introduction to the idea of automated unit testing.
10) concise, straight to the point, no mess, no fuss. "Got an error? Here's where it is. FIX IT!"
11) it was easy to read output to find bugs in programs
12) you can see where you went wrong
13) very informative, quick response
14) it reassures me quickly about what I do
15) it gave quick feedback and allowed for quick re-assessment once changes were made
16) feedback on quality of the code
17) it is fast and easy to use
18) it indicates where the errors are and we can submit our work as many times as we want
19) it is quick and automatic
20) it is automated and impartial
21) gives quick feedback, for training lets you test things quickly
22) it saves time and can give feedback very quickly. The re-submission of training exercises is very useful.

We briefly summarise and discuss these points: the most frequent student feedback is on the immediate feedback that the system provides. Some student comments mention explicitly the usefulness of the system's feedback, which allows them to identify the errors they have made more easily (items 10, 11, 12, 18). In addition to these generic endorsements, some students mention explicitly advantages of test-driven development such as re-assurance regarding correctness of code (item 14), quick feedback on refactoring (15), the indirect introduction of unit tests through the system (9), and help in writing clean code (16). It is worth noting that Agile methods and test-driven development had not been introduced to the students at the time they provided the above feedback. Further student feedback welcomes the ability to re-submit code repeatedly (items 8, 15, 22) and the flexibility to do so at any time (7). Interestingly, one student mentions the objectiveness of the system (20) – presumably this comment is based on experience with assessment systems where a set of markers manually assess submissions, which naturally display some variety in rigour and the application of marking guidelines.

4.3.2 I dislike the testing system because. . .

The following items of feedback were given by the students when offered to complete the sentence "I dislike the testing system because. . . " as part of the course evaluation:

1) error messages not easy to understand
2) it takes some time to understand how to interpret it
3) sometimes difficult to understand what was wrong
4) it complains (gives failures) for picky reasons like wrong function names and missing docstrings. That's not a complaint, it is only a machine.
5) it is a bit unforgiving
6) it is extremely [strict] about PEP 8
7) tiny errors in functions would result in complete failure of test.

Several comments (items 1 to 3) state that the feedback from the automatic testing system is hard to understand. This refers to test-failure reports such as the one shown in Listing 4. Indeed, the learning curve at the beginning of the course is quite steep: the first 90 minute lecture introduces Python, Hello World and functions, and demonstrates feedback from the testing system to prepare students for their self-paced exercises and the automatic feedback they will receive. However, a systematic explanation of the assert statements, True and False values, and exceptions only takes place after the students have used the testing system repeatedly. The reading of error messages is of course a key skill (and its importance is often underestimated by these non-computer science students), and we like to think that the early introduction of error messages from the automatic testing is overall quite useful. In practice, most students use the hands-on computing laboratory sessions to learn and understand the error messages with the help of teaching staff before these are covered in greater detail in the lectures. See also Sec. 4.8.2.

A second set of comments relates to the harshness and unforgiving nature of the automatic tests (items 4 to 7). Item 7 refers to the assessment method of not awarding any points for one of multiple exercises that form an assignment if there is any mistake in that exercise, and is a criticism of the assessment as part of the learning process.

For items 4 to 6 it is not clear whether these statements relate to the feedback on the code or the assessment. If the comments relate to the code, then they reflect a lack of understanding (and thus a shortcoming in our teaching) of the importance of documenting code and the importance of getting everything right in developing software (and not just approximately right).

4.3.3 Generic comments

The following comments on the feedback system were provided by students unprompted, i.e. as part of generic feedback on the course, and are in line with the more detailed points made above:

1) Fantastic real-time feedback with online submission of exercises.
2) Loved the online submission.
3) Really like the online submission system with very quick feedback.
4) Description in the feedback by automated system can be unclear.
5) Instant feedback on lab and training exercises was welcome.
6) Autotesting feature is VERY useful! Keep it and extend it!
7) The automatic feedback is fairly useful, once you have worked out how to understand it.

In the context of enthusiastic endorsements of the testing system, we would like to add our subjective observation from teaching the course that many students seem to regard the process of making their code pass the automatic tests as a challenge or game which they play against the testing system, and that they experience great enjoyment when they pass all the tests – be it in the first or a repeat submission (see also Section 3.4.2). As students like this game, they very much look forward to being able to start the next set of exercises, which is a great motivation to actively follow and participate in all the teaching activities.

4.4 Issues

During the years of using the automatic testing system, we have experienced a number of issues which are unique to the automated method of assessment described here. We summarise them and our response to each challenge below.

4.4.1 Submissions including syntax errors

When a student submits a file containing a syntax error, our testing code (here driven by the py.test framework) is unable to import the submission, and therefore testing cannot commence. Technically, such a submission is not a valid Python program (because it contains at least one syntax error). We ask the students to always test their work thoroughly before submitting, which should detect syntax errors first, and such submissions should not occur.

However, in practice, and given the large number of submissions (about 20 assignments per student, and currently 500 students per year), occasionally students will either forego the testing to save time, or will inadvertently introduce syntax errors such as additional spaces or indentation between checking their work and submitting it. From a purely technical point of view, the system is able to recognise this situation when it arises and we could state that any such submission is incorrect, and therefore assign a zero mark. However, these submissions may represent significant effort and contain a lot of valid code (for multiple exercises submitted in one file), so we have adopted a policy of allowing re-submission in such a scenario: if a syntax error is detected on import of the submission, the student is automatically informed about this, and re-submission is invited.
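A check of this kind could, for example, be implemented by attempting to parse the submission before any tests are run. The sketch below is illustrative only: the real system reacts to the failed import driven by py.test, and the wording of the message is invented.

import ast


def syntax_error_message(path):
    """Return None if the file parses as Python, otherwise a short message
    that could be emailed to the student, inviting re-submission."""
    with open(path) as f:
        source = f.read()
    try:
        ast.parse(source, filename=path)
    except SyntaxError as err:
        return ("Your submission could not be tested because of a syntax "
                "error in line {}: {}. Please correct the error and "
                "re-submit.".format(err.lineno, err.msg))
    return None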

4.4.2 Submissions in undeclared non-ASCII character encoding

We noticed an increasing trend, especially among international students, for submitting files in 8-bit character sets other than ASCII. Such files are accepted by the Python 2 interpreter so long as the encoding is declared in the first lines of the file according to PEP 263 [34]; but many of the students who were using non-ASCII characters were not declaring their encodings at all. Our first response was to update our system to check for this situation, and upon discovering it, to send an automated email to the student concerned with a suggestion that they declare their encoding and re-submit. More recently, we have begun recommending the use of the Spyder [35] environment, whose default behaviour is to annotate the encoding of the file in question in a PEP 263-compliant manner. This has now virtually eliminated the occurrence of character encoding issues. For the few cases where these still arise, the automatic suggestion email, and (if required) personal support in scheduled laboratory and help sessions, enable the students to understand and overcome the issue.
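The encoding check can be sketched as follows; this is an illustration under the assumption that submissions are read as raw bytes and that the PEP 263 declaration is expected in the first two lines (the message text is invented).

import re

# regular expression for a coding declaration, following PEP 263
CODING_RE = re.compile(br"coding[:=]\s*([-\w.]+)")


def encoding_problem(path):
    """Return a message if the file contains non-ASCII bytes but no coding
    declaration in its first two lines; otherwise return None."""
    with open(path, "rb") as f:
        raw = f.read()
    try:
        raw.decode("ascii")
        return None                      # pure ASCII: nothing to do
    except UnicodeDecodeError:
        pass
    if any(CODING_RE.search(line) for line in raw.splitlines()[:2]):
        return None                      # encoding is declared
    return ("Your file contains non-ASCII characters but does not declare "
            "an encoding (see PEP 263). Please declare the encoding and "
            "re-submit.")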

4.4.3 PEP 8 style checker issues

As described in Section 3.4.3, we take advantage of the pep8 utility [32] to assess the conformance of the students' submissions against the style recommendations of PEP 8.
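As an illustration, the style check might be driven by calling the pep8 command-line tool and counting its output lines, as in the sketch below; the decision of which checks to ignore is a matter of course policy and is not shown.

import subprocess


def pep8_warnings(path):
    """Run the command-line pep8 checker on a submission and return the
    list of reported style violations (one per line of output)."""
    proc = subprocess.Popen(["pep8", path],
                            stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE,
                            universal_newlines=True)
    out, _ = proc.communicate()
    return [line for line in out.splitlines() if line.strip()]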

Students find following style guidelines a lot harder than adapting to the hard syntactic and semantic requirements of the programming language, as they can solve the given exercises so that their code exhibits correct functional behaviour while not necessarily following the style guidelines. In our experience, it is critical to help students to adapt their own style habits to recommended guidelines, for example through tools that flag up non-conforming constructs immediately while editing code. One such freely available tool for Python is the Spyder [35] development environment, for which PEP 8-compatibility highlighting can be activated [36]. By encouraging all students to use this environment – on university machines and in installations of the software on their own machines, for which we provide recommendations [37] – we find that they generally pick up the PEP 8 guidelines quickly. As with so many things, if introduced early on, they soon embrace the approach and use it without additional effort in the future. Consequently, we penalise submissions that are not PEP 8 compliant from the second week onward.

One issue that arises with integrating PEP 8 guidelines into the assessment is that different software release versions of the pep8 tool may yield different numbers of warnings; this is partly due to changes in the view on what represents good coding style over time, and partly due to bugs being fixed in the pep8 tool itself. This can result in unexpected warnings from the PEP 8-related tests. As a practical measure, we ensured that we are using the latest version of the pep8 checking tool, and have elected to omit those tests that are treated differently by other recent versions. The student body will generally report any such deviations between the PEP 8 behaviour on their own computer and the testing system, and help in identifying any potential problems here.

4.5 Integration with Moodle

Moodle (Modular Object-Oriented Dynamic Learning Environment) [38] is a widely-used open source learning management system which can be used to deliver course content and host online learning activities. It is designed to support both teaching and learning activities. The Indian Institute of Technology (IIT) Mandi and IIT Madras use Moodle to manage the courses at the institute level. When running a course, instructors can add resources and activities for their students to complete, e.g. a simple page with downloadable documents or submission of the assignments by a prescribed time and date.

It was envisaged that integrating the automatic feedback provision system with Moodle would simplify the use of the automatic feedback system for IIT instructors and students, by allowing them to submit and retrieve feedback through the Moodle interface that they already use routinely instead of using email, thus replacing the incoming queue process (Fig. 1a). Outgoing messages to administrators are still emailed using the outgoing email queue (Fig. 1b). The testing process queue (Fig. 1c) is used as in the Southampton deployment that is described in the main part of this paper.

In integrating the assessment system with the IIT Moodle deployment, we have used the Sharable Content Object Reference Model (SCORM), which is a set of technical standards for e-learning software products. The user front end is provided through the browser-based Moodle user interface, while scripts at the back end make the connection to the automatic assessment system. The results are then fetched from the system and made visible to the student and the instructor. Using Moodle also helps the IIT to leverage the security that is already a part of the SCORM protocols.

The implementation at IIT is via a Moodle plugin designed such that, when a student submits an assignment, the plugin collects the global file ID of the submission and creates a copy of the file outside the Moodle stack. The plugin then invokes a Python script through exec(), transferring the location of the file and the file ID to the script. This Python script then acts as a user of the automatic feedback and assessment system, and directly enqueues the file for processing. The job ID inside the automatic assessment system engine is returned to the Moodle plugin, which maintains a database mapping job IDs to file IDs. After the file is processed by the automatic assessment system, the results are saved as files that are named after the (unique) job ID. When students access their results through Moodle, the relevant job IDs are retrieved from the database, allowing the corresponding results file to be opened, converted to HTML and published in a new page.
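A much simplified sketch of the back-end script's role is shown below; the queue directory, database location and table layout are hypothetical and stand in for the plugin's actual storage, and the job ID is generated locally here for illustration only (in the deployed system the assessment engine returns its own job ID).

import shutil
import sqlite3
import sys
import uuid

QUEUE_DIR = "/srv/autotest/incoming"        # hypothetical queue directory
DB_PATH = "/srv/autotest/moodle_jobs.db"    # hypothetical mapping database


def enqueue_submission(file_path, moodle_file_id):
    """Copy a Moodle submission into the testing queue and record which
    Moodle file ID the job belongs to."""
    job_id = uuid.uuid4().hex
    shutil.copy(file_path, "{}/{}.py".format(QUEUE_DIR, job_id))
    db = sqlite3.connect(DB_PATH)
    db.execute("CREATE TABLE IF NOT EXISTS jobs (job_id TEXT, file_id TEXT)")
    db.execute("INSERT INTO jobs VALUES (?, ?)", (job_id, moodle_file_id))
    db.commit()
    db.close()
    return job_id


if __name__ == "__main__":
    # invoked by the Moodle plugin as: python enqueue.py <path> <file_id>
    print(enqueue_submission(sys.argv[1], sys.argv[2]))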


4.6 Testing of other languages

There should be no conceptual barriers to using the automatic testing and feedback system for testing code written in other languages that provide unit testing frameworks. In particular, JUnit [39] for Java could be used instead of py.test for Java programming courses. In this case, the execution of the actual tests (and the writing of the tests to run) would need to be done in Java, but the remaining framework implemented in Python could remain (mostly) unchanged, providing the student submission handling and receipts, the separation of testing jobs, a limited-privilege, limited-resource runtime environment, the maintenance of a database of results, and the automatic emailing of feedback to students.

As part of our education programme in computational science [40], we are interested in testing C code that students write in our advanced computational methods courses, in which they are introduced to C programming. Students learn in particular how to combine C and Python code to benefit from Python's effectiveness as a high level language while achieving high execution performance by implementing performance-critical sections in C.

We are exploring a set of lightweight options towards automating the testing of the C code within the given framework and our education setting:

1) Firstly, we compile the submitted C code using gcc, capturing and parsing its standard output and standard error to determine the number of errors and warnings generated.

2) We then run the generated executable under the same security restrictions as we use for Python, capturing its standard output and error, and potentially comparing them to known-correct examples.

3) We are also using the ctypes library to make functions compiled from students' C code available within Python, so that they may be tested with tests defined the same way as for native Python code (see Listing 5); a sketch of this approach follows this list.
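As a rough illustration of options 1 and 3, the sketch below compiles a hypothetical student file student.c (assumed to contain a function int square(int x)) into a shared library and tests it through ctypes with a py.test-style test; it is a sketch only, not the course's own test code.

import ctypes
import subprocess


def compile_to_shared_library(c_path, lib_path="./student.so"):
    """Compile the submitted C file with gcc; return (success, gcc messages)."""
    proc = subprocess.Popen(
        ["gcc", "-Wall", "-shared", "-fPIC", "-o", lib_path, c_path],
        stdout=subprocess.PIPE, stderr=subprocess.PIPE,
        universal_newlines=True)
    _, messages = proc.communicate()     # gcc warnings and errors arrive on stderr
    return proc.returncode == 0, messages


def test_student_square():
    # the exercise is assumed to ask for a C function: int square(int x)
    ok, messages = compile_to_shared_library("student.c")
    assert ok, "compilation failed:\n" + messages
    lib = ctypes.CDLL("./student.so")
    lib.square.restype = ctypes.c_int
    lib.square.argtypes = [ctypes.c_int]
    # squaring a small positive integer must give the exact result
    assert lib.square(4) == 16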

The system that we built for testing Python code is modular enough that the above can be incorporated into the test work-flow for the courses where it is required. We note that it is now necessary to handle segmentation faults that may arise from calling the student's C code: these may be treated similarly to the cases where resource limits are exceeded in testing Python code, causing the OS to terminate the process; the student's marks may be updated if required, or a re-submission invited, in line with the course leader's chosen educational policy.
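One possible way to run the generated executable under resource limits and to detect termination by a signal such as SIGSEGV is sketched below; the limits chosen are arbitrary examples, and POSIX behaviour is assumed.

import resource
import signal
import subprocess


def run_with_limits(command, cpu_seconds=5, memory_bytes=200 * 1024 * 1024):
    """Run the student's executable with CPU and memory limits; report
    whether it was killed by a signal (e.g. a segmentation fault)."""

    def set_limits():    # applied in the child process before exec
        resource.setrlimit(resource.RLIMIT_CPU, (cpu_seconds, cpu_seconds))
        resource.setrlimit(resource.RLIMIT_AS, (memory_bytes, memory_bytes))

    proc = subprocess.Popen(command, stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE,
                            universal_newlines=True,
                            preexec_fn=set_limits)
    out, err = proc.communicate()
    if proc.returncode < 0:              # negative: terminated by a signal
        if -proc.returncode == signal.SIGSEGV:
            return out, err, "terminated by segmentation fault"
        return out, err, "terminated by signal {}".format(-proc.returncode)
    return out, err, None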

4.7 Pre-marking exams

As well as assessing routine laboratory assignments, the system is also used to support exam marking. The format of the exam for our first-year introductory programming course is a 90 minute session which the students spend at a computer, in a restricted environment. They are given access to the Spyder Python development environment to be able to write and run code, but have no access to the Internet, and have to write code to answer exam questions which follow the format experienced in the weekly assignments. At the end of the exam, all the students' code files are collected electronically for assessment. We pre-test the exam code files using the marking system with an appropriate suite of tests, and then distribute the automatically assigned marks, the detailed test results and the source code to the examiners for manual marking. This enables the examiners to save significant amounts of time because it is immediately apparent when students achieve full marks and, where errors are found, the system's output assists in swiftly locating them. It also increases objectivity compared to leaving all the assessment to be done by hand, possibly by a team of markers who would each have to interpret and apply a mark scheme to the exam code files.
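The pre-marking step can be thought of as a batch run of the same machinery over the collected exam files, along the lines of the following sketch; the glob pattern, the test module test_exam.py and the canonical file name exam_answer.py that it is assumed to import are all hypothetical.

import csv
import glob
import shutil
import subprocess


def premark_exam_files(submission_glob="exam/*.py", report_path="premarks.csv"):
    """Run the exam test suite over every collected exam file and record a
    one-line summary per submission for the examiners."""
    with open(report_path, "w") as f:
        writer = csv.writer(f)
        writer.writerow(["submission", "py.test exit code"])
        for path in sorted(glob.glob(submission_glob)):
            shutil.copy(path, "exam_answer.py")   # file the tests import
            ret = subprocess.call(["py.test", "test_exam.py"])
            writer.writerow([path, ret])          # 0 means all tests passed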

The system has also been used to receive coursework submissions for a course leader who decided to assess the work exclusively by hand. In this case, the system was configured simply to receive the submission, identify the user, store the submission, and log the date and time of submission of the coursework.

4.8 Discussion

In this section, we discuss key aspects of the design, use and effectiveness of the automatic testing system to support learning of programming.

4.8.1 Key benefits of automatic testing

A key benefit of using the automatic testing system is to reduce the amount of repeated algorithmic work that needs to be carried out by teaching staff. In particular, establishing the correctness of student solutions, and providing basic feedback on their code solutions, is now virtually free (once the testing code has been written) as it can be done automatically.


This allowed us to very significantly increase the number of exercises that students carry out as part of the course, which helped the students to more actively engage with the content and resulted in deeper learning and greater student satisfaction.

The marking system frees teaching staff time that would otherwise have been devoted to manual marking, and which can now be used to repeat material where necessary, explain concepts, discuss elegance, cleanness, readability and effectiveness of code, and suggest alternative or advanced solution designs to those who are interested, without having to increase the number of contact hours.

Because of the more effective learning through active self-paced exercises, we have also been able to increase the breadth and depth of materials in some of our courses without increasing contact time or student time devoted to the course.

4.8.2 Quality of automatic feedback provision

The quality of the feedback provision involves two main aspects: (i) the timeliness, and (ii) the usefulness, of the feedback.

The system typically provides feedback to students within 2 to 3 minutes of their submission (inclusive of an email round-trip time on the order of a couple of minutes). This speed of feedback provision allows and encourages students to iteratively improve submissions where problems are detected, addressing one issue at a time, and learning from their mistakes each time.

This near-instant feedback is almost as good as one could hope for, and is a very dramatic improvement on the situation without the system in place (where the provision of feedback would be within a week of the deadline, when an academic or demonstrator is available in the next practical laboratory session).

The usefulness of the feedback is dependent upon the student's ability to understand it, and this is a skill that takes time and practice to acquire. We elected to use the traceback output provided by py.test in the feedback emails that are sent to students in the case of a test failure, as per the example in Listing 4. The traceback, combined with our helpful comments in the test definitions, allows a student to understand under precisely which circumstances their code failed, and also to understand why we are testing with that particular set of parameters. Although interpreting the tracebacks is not a skill that is immediately obvious, especially to students who have never programmed before, it is a skill that is usually quickly acquired, and one which all competent programmers should be well-versed in. We suggest that it is an advantage to encourage students to develop this ability at an early stage of their learning. Students at Southampton are well-supported in acquiring these skills, including timetabled weekly laboratories and help sessions staffed by academics and demonstrators. Once the students master reading the output, the usefulness of the feedback is very good: it pinpoints exactly where the error was found, and provides the rationale for the choice of test case as well.
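To give a flavour of how such explanatory comments reach the student, the sketch below shows a didactic test in this spirit; the module name submission and the exercise (a function mean(a, b) returning the arithmetic mean) are invented, and a failing assert would be reported together with these source lines in the py.test traceback.

from submission import mean     # hypothetical student module and function


def test_mean_of_integers():
    # 3 and 4 are both integers: a common mistake is integer division,
    # which would return 3 instead of the exact mean 3.5
    assert mean(3, 4) == 3.5


def test_mean_is_symmetric():
    # swapping the arguments must not change the result
    assert mean(10, 2) == mean(2, 10)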

A third aspect of the quality of feedback and assessment is objectivity. Because all of the submissions are tested against the same criteria, the system also improves the objectivity of our marking compared to having several people each interpreting the mark scheme and applying their interpretations to student work.

4.8.3 Flexible learning opportunities

A further enhancement to the student experience is that the system allows and promotes flexible working. Feedback is available to students from anywhere in the world (assuming they have Internet access), at any time of day or night, rather than being restricted to the locations and hours at which laboratory sessions are scheduled. This means that the most confident students are free to work at their own pace and convenience. Those students who wish for more guidance and support can avail themselves of the full resources in the time-tabled sessions. All the students can repeat training exercises multiple times, dealing with one error at a time as errors are discovered. They may also repeat and re-submit assessed laboratory exercises to gain additional feedback and deeper understanding, but in line with our policies, this does not change their recorded marks for assessed work.

4.8.4 Large classes

We have found that the assessment system is invaluable as our student numbers grow between years. Once exercises and didactic testing code are developed, the automatic testing and feedback provision does not require additional staff time to process, assess and feed back on student submissions when student numbers grow from year to year. Additional teaching staff in the practical sessions are required to maintain the student-staff ratio, but the automatic system reduces the overall burden very significantly, and has helped us to deliver the training in the face of an increase from 85 to 425 students enrolled in our first-year introduction to computing course.

The flexible learning that the system allows (see Sect. 4.8.3) holds opportunities for more efficient space use. In the weekly hands-on computing laboratories, we currently provide all students their own computer for 90 minutes in the presence of teaching staff. With large student numbers, depending on the local facilities, this can become a timetabling and resource challenge.

We know from student attendance behaviour that the first two weeks see nearly all students attending the hands-on computing laboratories, but that student attendance in the computing laboratories declines significantly after week two, as – for example – the best students will often have completed and submitted the exercise before the time-tabled laboratory session, and some students will only come to the laboratory session to get help on a particular problem that they could not solve on their own, needing 15 minutes of attendance rather than 90. As a result, it should be possible to 'over-book' computing laboratory spaces, as is common in the airline industry, for example based on the assumption that only a fraction of the students will make regular use of the laboratory sessions in the later weeks. Figure 5a and its discussion show supporting data on student laboratory attendance. We have not made use of this yet.

4.8.5 Student satisfaction

Student feedback on the automatic testing and learning with it has been overall very positive. We believe that the increased number of practical exercises is an effective way to educate students to become better programmers, and it is gratifying for teaching staff to see students enjoying the learning experience.

4.8.6 Software design

Our system design of having multiple loosely coupled processes that process student submissions with clearly defined sub-tasks, and pass jobs from one to another through file-system based queues, has provided a robust system, which allowed us to connect it with other tools, such as for example the Moodle front-end for code submission in Madras and Mandi.

5 SUMMARY

We have reported on the automatic marking and feedback system that we developed and deployed for teaching programming to large classes of undergraduates. We provided statistics from one year of use of our live system, illustrating that the students took good advantage of the "iterative refinement" model that the system was conceived to support, and that they also benefited from increased flexibility and choice regarding when they work on, and submit, assignments. The system has also helped reduce staff time spent on administration and manual marking duties, so that the available time can be spent more effectively supporting those students who need it. Attempting to address some of the shortcomings of other literature in the field as perceived by a recent review article, we provided copious technical details of our implementation. With increasing class sizes forecast for the future, we foresee this system continuing to provide us value and economy whilst giving students the benefit of prompt, efficient and impartial feedback. We also envisage further refining the system's capabilities at assessing submissions in languages other than Python.

Acknowledgements

This work was supported by the British Council, and the Engineering and Physical Sciences Research Council (EPSRC) Doctoral Training grants EP/G03690X/1 and EP/L015382/1. We provide data shown in the figures in the supplementary material.

REFERENCES

[1] A. Pears, S. Seidman, L. Malmi, L. Mannila, E. Adams, J. Bennedsen, M. Devlin, and J. Paterson, "A survey of literature on the teaching of introductory programming," in Working Group Reports on ITiCSE on Innovation and Technology in Computer Science Education, ser. ITiCSE-WGR '07. New York, NY, USA: ACM, 2007, pp. 204–223. [Online]. Available: http://doi.acm.org/10.1145/1345443.1345441

[2] A. Robins, J. Rountree, and N. Rountree, "Learning and teaching programming: A review and discussion," Computer Science Education, vol. 13, no. 2, pp. 137–172, 2003. [Online]. Available: http://www.tandfonline.com/doi/abs/10.1076/csed.13.2.137.14200

[3] B. Price and M. Petre, "Teaching programming through paperless assignments: An empirical evaluation of instructor feedback," SIGCSE Bull., vol. 29, no. 3, pp. 94–99, Jun. 1997. [Online]. Available: http://doi.acm.org/10.1145/268809.268849


[4] R. Saikkonen, L. Malmi, and A. Korhonen, "Fully automatic assessment of programming exercises," in Proceedings of the 6th Annual Conference on Innovation and Technology in Computer Science Education, ser. ITiCSE '01. New York, NY, USA: ACM, 2001, pp. 133–136. [Online]. Available: http://doi.acm.org/10.1145/377435.377666

[5] A. Venables and L. Haywood, "Programming students need instant feedback!" in Proceedings of the Fifth Australasian Conference on Computing Education - Volume 20, ser. ACE '03. Darlinghurst, Australia: Australian Computer Society, Inc., 2003, pp. 267–272. [Online]. Available: http://dl.acm.org/citation.cfm?id=858403.858436

[6] S. H. Edwards, "Teaching software testing: Automatic grading meets test-first coding," in Companion of the 18th Annual ACM SIGPLAN Conference on Object-oriented Programming, Systems, Languages, and Applications, ser. OOPSLA '03. New York, NY, USA: ACM, 2003, pp. 318–319. [Online]. Available: http://doi.acm.org/10.1145/949344.949431

[7] K. Ala-Mutka, T. Uimonen, and H.-M. Järvinen, "Supporting students in C++ programming courses with automatic program style assessment," Journal of Information Technology Education: Research, vol. 3, no. 1, pp. 245–262, January 2004. [Online]. Available: http://www.editlib.org/p/111452

[8] C. Douce, D. Livingstone, and J. Orwell, "Automatic test-based assessment of programming: A review," J. Educ. Resour. Comput., vol. 5, no. 3, Sep. 2005. [Online]. Available: http://doi.acm.org/10.1145/1163405.1163409

[9] K. M. Ala-Mutka, "A survey of automated assessment approaches for programming assignments," Computer Science Education, vol. 15, no. 2, pp. 83–102, 2005. [Online]. Available: http://dx.doi.org/10.1080/08993400500150747

[10] D. Woit and D. Mason, "Effectiveness of online assessment," SIGCSE Bull., vol. 35, no. 1, pp. 137–141, Jan. 2003. [Online]. Available: http://doi.acm.org/10.1145/792548.611952

[11] J. English, "Experience with a computer-assisted formal programming examination," SIGCSE Bull., vol. 34, no. 3, pp. 51–54, Jun. 2002. [Online]. Available: http://doi.acm.org/10.1145/637610.544432

[12] C. A. Higgins, G. Gray, P. Symeonidis, and A. Tsintsifas, "Automated assessment and experiences of teaching programming," J. Educ. Resour. Comput., vol. 5, no. 3, Sep. 2005. [Online]. Available: http://doi.acm.org/10.1145/1163405.1163410

[13] P. Ihantola, T. Ahoniemi, V. Karavirta, and O. Seppälä, "Review of recent systems for automatic assessment of programming assignments," in Proceedings of the 10th Koli Calling International Conference on Computing Education Research, ser. Koli Calling '10. New York, NY, USA: ACM, 2010, pp. 86–93. [Online]. Available: http://doi.acm.org/10.1145/1930464.1930480

[14] R. Singh, S. Gulwani, and A. Solar-Lezama, "Automated feedback generation for introductory programming assignments," in Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation, ser. PLDI '13. New York, NY, USA: ACM, 2013, pp. 15–26. [Online]. Available: http://doi.acm.org/10.1145/2491956.2462195

[15] E. Verdú, L. M. Regueras, M. J. Verdú, J. P. Leal, J. P. de Castro, and R. Queirós, "A distributed system for learning programming on-line," Computers & Education, vol. 58, no. 1, pp. 1–10, 2012. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S036013151100193X

[16] K. Masters, "A brief guide to understanding MOOCs," The Internet Journal of Medical Education, vol. 1, no. 2, 2011.

[17] Jupyter Development Team, "nbgrader documentation," Online, 2015, accessed at http://nbgrader.readthedocs.org/en/stable/, 4th August 2015.

[18] F. Perez and B. Granger, "IPython: A system for interactive scientific computing," Computing in Science & Engineering, vol. 9, no. 3, pp. 21–29, May 2007.

[19] H. Fangohr, "Teaching computational engineering using Python," in EuroPython 2005, June 2005, Gothenburg, Sweden.

[20] D. Griffiths and P. Barry, Head First Programming: A learner's guide to programming using the Python language. Sebastopol, CA: O'Reilly Media, 2009.

[21] A. Bogdanchikov, M. Zhaparov, and R. Suliyev, "Python to learn programming," Journal of Physics: Conference Series, vol. 423, no. 1, p. 012027, 2013. [Online]. Available: http://stacks.iop.org/1742-6596/423/i=1/a=012027

[22] H. Fangohr, "A comparison of C, MATLAB, and Python as teaching languages in engineering," in Computational Science – ICCS 2004, ser. Lecture Notes in Computer Science, M. Bubak, G. van Albada, P. Sloot, and J. Dongarra, Eds. Springer Berlin Heidelberg, 2004, vol. 3039, pp. 1210–1217. [Online]. Available: http://dx.doi.org/10.1007/978-3-540-25944-2_157

[23] Python Software Foundation, "Applications for Python," Online, 2015, accessed at https://www.python.org/about/apps/, 31 March 2015.

[24] H. Fangohr, "Exploiting real-time 3D visualisation to enthuse students: A case study of using Visual Python in engineering," in Computational Science – ICCS 2006, ser. Lecture Notes in Computer Science, V. Alexandrov, G. van Albada, P. Sloot, and J. Dongarra, Eds. Springer Berlin Heidelberg, 2006, vol. 3992, pp. 139–146. [Online]. Available: http://dx.doi.org/10.1007/11758525_19

[25] J. Pellerin, "nose is nicer testing for python," Online, 2015, accessed at http://nose.readthedocs.org/en/latest/, 31 March 2015.

[26] H. Krekel, "pytest: helps you write better programs," Online, 2015, accessed at http://pytest.org/latest/, 31 March 2015.

[27] The IEEE and The Open Group, "POSIX.1-2008 [The Open Group Base Specifications Issue 7]," 2013 Edition, available online: http://pubs.opengroup.org/onlinepubs/9699919799/.

[28] W. R. Stevens and S. A. Rago, Advanced programming in the UNIX environment, 3rd ed. Addison-Wesley, 2013.

[29] N. Tillmann, J. de Halleux, T. Xie, S. Gulwani, and J. Bishop, "Teaching and learning programming and software engineering via interactive gaming," in Software Engineering (ICSE), 2013 35th International Conference on, May 2013, pp. 1117–1126.

[30] K. Beck, Test Driven Development: By Example, 1st ed. Addison-Wesley, 2003.

[31] G. van Rossum, B. Warsaw, and N. Coghlan, "PEP 8 - Style Guide for Python Code," Online, 2015, accessed at https://www.python.org/dev/peps/pep-0008/, 31 March 2015.

[32] "pep8 software," Online, 2015, https://pypi.python.org/pypi/pep8.

[33] A. Mashtizadeh, E. Celebi, T. Garfinkel, and M. Cai, "The design and evolution of live storage migration in VMware ESX," in Proceedings of the 2011 USENIX Conference on USENIX Annual Technical Conference, ser. USENIXATC'11. Berkeley, CA, USA: USENIX Association, 2011, pp. 14–14. [Online]. Available: http://dl.acm.org/citation.cfm?id=2002181.2002195

[34] M.-A. Lemburg and M. von Löwis, "PEP 263 - Defining Python Source Code Encodings," Online, 2001, accessed at https://www.python.org/dev/peps/pep-0263/, 15 Aug 2015.


[35] "Spyder – Scientific PYthon Development EnviRonment," Online, 2015, accessed at https://github.com/spyder-ide/spyder, 7 August 2015.

[36] H. Fangohr, "Spyder tutorial," Online, 2014, accessed at http://www.southampton.ac.uk/∼fangohr/blog/spyder-the-python-ide-spyder-23.html.

[37] ——, "Anaconda installation summary," Online, 2014, accessed at http://www.southampton.ac.uk/∼fangohr/blog/installation-of-python-spyder-numpy-sympy-scipy-pytest-matplotlib-via-anaconda.html, 30 July 2015.

[38] "Modular Object-Oriented Dynamic Learning Environment," Online, 2015, accessed at http://moodle.org, 30 August 2015.

[39] "JUnit," Online, 2002, accessed at http://junit.org, 4 August 2015.

[40] H. Fangohr, "Anaconda installation summary," Online, 2015, accessed at http://www.southampton.ac.uk/∼fangohr/blog/essential-tools-for-computational-science-and-engineering.html, 31 August 2015.
