
Constructing Qualitative Event Models Automatically from Video Input

J Fernyhough and A G Cohn and D C Hogg
School of Computer Studies, University of Leeds, LS2 9JT, UK
[email protected]

Abstract

We describe an implemented technique for generating event models automatically, based on qualitative reasoning and a statistical analysis of video input. Using an existing tracking program which generates labelled contours for objects in every frame, the view from a fixed camera is partitioned into semantically relevant regions based on the paths followed by moving objects. The paths are indexed with temporal information so objects moving along the same path at different speeds can be distinguished. Using a notion of proximity based on the speed of the moving objects and qualitative spatial reasoning techniques, event models describing the behaviour of pairs of objects can be built, again using statistical methods. The system has been tested on a traffic domain and learns various event models expressed in the qualitative calculus which represent human observable events. The system can then be used to recognise subsequent selected event occurrences or unusual behaviours.

1 Introduction

Dynamic scene analysis has traditionally been quantitative and typically generates large amounts of temporally evolving data. Recently, increasing interest has been shown in higher level approaches to representing and reasoning with such data using conceptual and qualitative approaches (e.g. [6,16]). A rich selection of qualitative representation and reasoning systems already exists [39,8], although there are relatively few real-world applications. One motivation for the present work was the desire to apply qualitative spatio-temporal reasoning techniques to real-world dynamic scene analysis.

The support of the EPSRC under grant GR/K65041 is gratefully acknowledged. Our thanks also for the comments from three anonymous referees.

Preprint submitted to Elsevier Preprint, 14 April 1999


The other principal motivation was to ensure that the system could build its own models of the world, thus eliminating the need for tedious hand built models.

The information provided by existing tracking applications is, by nature, quantitative, with the position and spatial extent of objects usually provided in screen coordinates. However, using the approximate zone or region rather than the exact location will collapse broadly similar behaviours into equivalence classes to provide a generic model. Of course a scene cannot be arbitrarily segmented into regions; rather, the regions should be conceptually relevant to the physical structure of the domain rather than arbitrary. Our domains of interest are those (typically natural outdoor scenes) where the movement of objects is somewhat stylized, i.e. domains in which objects tend to comply with a number of default behaviours, like the movement of vehicles on a road (see figure 1). Such scenes are observed by a static camera over an extended period to provide training data for the learning processes. Although an appropriate spatial model has been found [25], such representations have had to be generated by hand. However, we have outlined an effective learning strategy [13] which can automatically generate a similar spatial representation from the observation of object movements, extending that model. We will describe this work more fully in section 2.

Fig. 1. Example of test domains viewed from a static camera.

Given such a conceptually relevant representation of space, it becomes possible to determine abnormal behaviour patterns from subsequent continued observation of objects travelling within the domain. The spatial model is obtained from the statistical evidence of observed behaviours, in which the quantity of "normal" behaviour is significantly greater than that of abnormal behaviour. Thus the locations where abnormal behaviour has occurred during the learning cycle should not adversely affect the spatial model (this should include apparent movements caused by noise, e.g. shadows). Should any unusual behaviour occur after the training period (for example, a motor-way crash), the default behaviour and movement of domain objects may change radically, indicating an unusual situation.

However, visual surveillance is not just concerned with abnormal behaviour patterns. To conduct a full behavioural analysis, the system has to be capable of recognizing (and interpreting) normal behaviour patterns.


Typically (e.g. [25,33,3]), systems designed to recognize sequences of situated actions (events) are provided with a priori system knowledge of event models that can be used to recognize instances of particular events. Rather than providing event models as a priori system knowledge, we propose an event learning strategy. Of course, like any learning system, there is a built-in inductive bias that influences the kind of behaviours that can be learned; this is a standard feature of learning programs. In particular, the choice of representation language will affect what can be, and is, learned: a system cannot learn what it cannot represent.

In this introduction we have provided a broad outline of the research context for this paper. Next, we briefly describe a technique for learning semantically relevant spatio-temporal regions. The body of the paper demonstrates the effectiveness of this spatio-temporal model in a qualitative event learning system [12,28], supported by experimental results. Then we compare and contrast our approach with alternative techniques in the literature. There then follows a discussion of the underlying assumptions and of the methodology proposed. In conclusion we discuss how our present system might be extended and further evaluated. The material in this paper is largely from [12] and has been reported on briefly previously in [13,28].

2 Generation of Semantic Regions

In [13], we outlined how a (hierarchical) region based model of space, corresponding to the underlying spatial structure of a domain, may be automatically constructed in real time from the extended observation of objects moving within the domain. Region types include leaf regions, which define the underlying structure of space, and composite regions, which are constructed by concatenating adjacent leaf regions and which describe areas of behavioural significance (such as a lane or a give-way zone). This automatically generated model is similar to the hand generated models of [25].

We employ an existing tracking application (the simple background tracker in [5] is used to collect training data) that provides the position and shape descriptions of moving objects as well as associating each object with its own label (which is maintained throughout the period the object remains within the scene). Object paths are constructed from the area covered by an object travelling through the domain. These paths are then merged into a database before statistical analysis indicates which entries are too infrequent to be included in the spatial model. Leaf regions for the spatial representation are obtained from the combination of the remaining paths stored in the database.


However, some form of attentional control mechanism is often employed in visual surveillance applications (e.g. [21,36]) to help identify potentially interesting objects. We follow this route and have developed a technique to identify when one object is "close" to another, since behaviours involving more than one object are typically between objects that are "close" in some sense. Since objects in our intended domain are generally moving at varying speeds, a simple static notion of closeness is not particularly appropriate. We therefore choose to make closeness of one object to another depend on the speed of the reference object. 1 This is achieved by extending the spatial model [13] to incorporate temporal information [12,28]: when we construct the database of paths used by objects travelling through the scene, we also incorporate point coordinates at regular time intervals 2 that can be used later to form regions which sub-divide the composite regions (paths) within the spatial model into equi-temporal regions (ETRs). The spatial extent of an ETR is affected by the velocity of objects as well as the distance from the camera (i.e. size due to camera perspective). The main feature, however, is that an object takes approximately the same time 3 to traverse each ETR in a composite ETR path. Thus if, for example, the same path is traversed at very different speeds (e.g. during and out of busy periods), then more than one set of ETRs will be generated and effectively there will be two paths, spatially identical, but with different ETRs (e.g. the busy period path will have more but smaller ETRs because the traffic moves more slowly). Having generated ETR indexed paths, a new moving object occurrence can be identified with a particular path and the ETRs can be used to identify other "close" objects. This spatio-temporal model appears to be unique in the literature. Note that this gives us a qualitative notion of closeness (whether two objects are in the same or adjacent ETRs, for example) which matches our overall qualitative methodology.

1 Arguably, as discussed in the final section, it should be a function of the speeds of both objects, but we have only experimented with the simpler notion to date, and this appears to work well provided both objects have similar speeds.

2 We select a two second interval for traffic domains since the UK "Highway Code" [20, rule 57] recommends a two second gap as a minimum inter-vehicle distance.

3 Note that it is common in natural language to equate time and distance: the answer to "how far is it to the shops?" could be "about one mile" or "about 10 minutes".

2.1 Details of the ETR generation process

A tracking process [5] accepts live video images from a static camera. Shape descriptions corresponding to all moving objects within the scene are produced on a frame-by-frame basis. Real-time analysis of the dynamic scene data is performed to build a database of paths used by objects. Further information pertaining to time is also stored in a second (temporal) database. At the end of the training period, data stored in the two databases is processed to generate the (leaf, composite and temporal) regions required for the spatio-temporal model.


A diagram outlining this system is shown in figure 2.


Fig. 2. Overview of the temporally extended method.

There are three main stages:

- A tracking process obtains shape descriptions of moving objects (subsection 2.2).
- Temporal path generation builds a model corresponding to the course taken by moving objects, complete with a sequence of temporal intervals where each interval has the same passage duration for that object. Subsequently, the database of paths and the database of temporal interval sequences are updated with information contained in the model (subsection 2.3).
- Region generation accesses the database of paths and the database of temporal interval sequences so that leaf, composite and temporal regions can be constructed for the spatio-temporal model of the domain (subsection 2.4).

2.2 Tracking

It is possible to specify exactly how many frames are to be processed each second (up to 30). Thus, the exact duration between one frame and another can be calculated, allowing the precise number of frames in any period of time to be ascertained. This is equivalent to providing a time-stamp for each frame, which would have been equally acceptable. The actual frame count selected is 25 frames/second as this is currently the standard frame rate for PAL full-motion video.


Unfortunately, the tracker used for the experiments 4 does not handle occlusion well, so we chose the test domain to limit the amount of occlusion occurring. Although it is important to handle occlusion, this is peripheral to the main interest of this investigation at present, so we have felt justified in selecting the domain and viewing angles to avoid undue occlusions.

4 The full model-based tracker of Baumberg and Hogg [5] does handle occlusion but this is intended for pedestrian scenes and thus did not function so well with motor vehicles.

2.3 Temporal Path Generation

A single spatial path is constructed and then different ETRs are overlaid on top of this. This allows greater reliability in the spatial extent since objects at all speeds will contribute to it (of course if, say, faster objects regularly take a path that is sufficiently different, this will show up in the path database as a separate path for spatial reasons rather than speed reasons). The spatial extent of an object's path is determined by the combination of all pixels occupied by the convex hull 5 of the object's silhouette along its course through the domain. However, for the temporal sub-divisions, and to account for camera perspective and speed variations over the length of the path, it becomes necessary to maintain a list of point coordinates indicating the location of the object at regular intervals of time. These temporal point coordinates will allow the temporal regions to be constructed subsequently.

5 We take the convex hull of the object at each frame since this produces a smoother path, and a moving object naturally fills out its own convex hull, at least when moving along a straight line.

As previously mentioned, two seconds is a reasonable value to identify "close" objects. Therefore, the location of an object needs to be recorded at two second (or 50 frame) intervals. The centroid point of an object's outline, which is readily available from the tracking process, is an appropriate selection for the temporal point coordinate; it should not cause bias in subsequent processes when locating objects in the relevant temporal region. Figure 3 shows two example paths complete with temporal points located at two second intervals.

On completion, an object's path is merged into the database of existing paths after searching the database for any equivalent entries. The path equivalence test is based on the percentage overlap (we have found a figure of 80% works well) of constituent pixels of the new path and the existing database path. If an equivalent path is not discovered in the database then:


Fig. 3. Two paths complete with temporal point intervals.

- the temporal interval sequence, associated with the new path, is added to a second (temporal) database containing all the alternative temporal interval sequences.
- a link between the (new) temporal database entry and its path is created.
- the new path is added to the path database.

Otherwise, an equivalent path has been discovered and should be revised to incorporate the information contained in the new path; the new path is combined with the database path using a function analogous to addition. This provides a frequency value for each constituent pixel (of the database path) indicating the number of contributing equivalent paths. Subsequently, the path threshold operation can be applied to generate the composite region.

The temporal interval sequence for the new path also requires merging into the database of temporal interval sequences. A temporal equivalence test is performed on the existing temporal interval sequences contained in the database. Not all database entries need be checked: only those associated with the equivalent (updated) database path. For this purpose, each database path entry contains a list of links (relations) to associated temporal interval sequences contained in the temporal database (as shown in figure 4). Should no equivalent temporal interval sequences be discovered, the new temporal interval sequence is added to the temporal database along with an associated link to the database path entry.

Temporal equivalence requires a different type of test to that of path equivalence. Unlike the generated object paths, there are no constituent pixels to coincide, so it is not possible to check a percentage overlap value. Instead, it is necessary to match points in both temporal interval sequences. Objects entering the domain should essentially appear in (approximately) the same location for a particular path. Thus, to check whether two temporal interval sequences are equivalent, all that should be necessary is to check that the number of intervals corresponds and that the length of the corresponding intervals is approximately the same in each sequence.
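As a concrete illustration of the path equivalence test and the merge-by-addition described above, the following sketch assumes paths are held as 2-D arrays: the new path as a binary raster and the database path as a per-pixel count of contributing paths. The names and the exact direction of the overlap measure are our own; the paper fixes only the 80% figure.

import numpy as np

OVERLAP_THRESHOLD = 0.80  # 80% of constituent pixels must coincide

def paths_equivalent(new_path: np.ndarray, db_path: np.ndarray) -> bool:
    """Percentage-overlap test: the new binary path must share at least
    80% of its pixels with the (binarised) database path."""
    shared = np.logical_and(new_path > 0, db_path > 0).sum()
    return shared / max((new_path > 0).sum(), 1) >= OVERLAP_THRESHOLD

def merge_paths(new_path: np.ndarray, db_path: np.ndarray) -> np.ndarray:
    """Merge by a function analogous to addition: each pixel of the
    database path accumulates the number of contributing equivalent paths."""
    return db_path + (new_path > 0).astype(db_path.dtype)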



Fig. 4. Structure of path and temporal database.

Unfortunately, the tracking process does not always detect the initial appearance of an object. For example, a small vehicle, entering in the distance, combined with light reflections, may not have enough presence to be detected immediately. This means that in the temporal equivalence test a starting point needs to be identified before matching the lengths of the remaining temporal intervals.

It is most unlikely that the starting points and subsequent interval distances will exactly coincide, although that would make the process simpler. Instead, these matches must be approximately the same. More formally, a threshold or tolerance space [35] is required to provide reasonable matches. The value for the tolerance space changes with each interval to be matched and is calculated from the mean duration of the corresponding temporal intervals to be matched in the two interval sequences.

When checking for a starting position in each sequence, the test is for two corresponding point locations, not interval lengths. However, a tolerance space is still appropriate and is calculated from the mean length of the temporal intervals on either side of the focus points in the two sequences. If the focus point is the initial point in the interval sequence there is no prior interval, so only the next interval length is considered (in that sequence).


The actual value obtained for the current tolerance space is a 20% threshold value calculated from the mean temporal interval lengths. Figure 5 shows an example to demonstrate this calculation. From the diagram:

$t = \frac{\ell_1 + \ell_2 + \ell_3 + \ell_4}{4} \times 20\%$

This threshold value appears reasonable: if the threshold value were higher, more matches would be found, and fewer if lower. So, if corresponding starting points can be determined and the remaining temporal intervals are approximately the same lengths, then the two temporal interval sequences are seen to be equivalent.
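A minimal sketch of this 20% tolerance test, assuming interval lengths are held as numbers of frames or seconds; the starting-point alignment described above is omitted and all names are illustrative.

def tolerance(lengths):
    """Tolerance space: 20% of the mean of the interval lengths being
    compared (figure 5)."""
    return 0.20 * sum(lengths) / len(lengths)

def intervals_match(len_new, len_db):
    """Corresponding intervals match when their lengths differ by no more
    than the tolerance computed from those same intervals."""
    return abs(len_new - len_db) <= tolerance([len_new, len_db])

def sequences_equivalent(seq_new, seq_db):
    """Equivalent if the number of intervals corresponds and every pair of
    corresponding interval lengths is approximately the same."""
    return len(seq_new) == len(seq_db) and all(
        intervals_match(a, b) for a, b in zip(seq_new, seq_db))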


Fig. 5. Calculation of tolerance space for temporal intervals.

When two temporal interval sequences are determined to be equivalent, the temporal database entry should be updated. Beginning with the initial points matched in both interval sequences, the mean position for the two points is calculated and the temporal database entry updated. The mean position takes into account all temporal points that have contributed to its location, not just the two current points; otherwise, each new temporal interval sequence would have a greater effect on the final location of each point in the database entry. Therefore the number of contributors for each point in the temporal interval sequence is also required in the temporal database entries. The calculation is then:

$(x_i, y_i) = \frac{(x_i, y_i) \times N_i + (v_j, w_j)}{N_i + 1}$
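The update formula transcribes directly into code; the sketch below is ours, with the contributor count N_i held alongside each stored point.

def update_temporal_point(x_i, y_i, n_i, v_j, w_j):
    """Fold the matching point (v_j, w_j) into the stored mean (x_i, y_i),
    weighting the stored point by its N_i previous contributors."""
    x_new = (x_i * n_i + v_j) / (n_i + 1)
    y_new = (y_i * n_i + w_j) / (n_i + 1)
    return x_new, y_new, n_i + 1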


where (x_i, y_i) is the ith temporal point in the database entry which matches the jth temporal point, (v_j, w_j), in the equivalent temporal interval sequence, and N_i is the total number of contributors for the ith temporal point in the database entry.

For clarification purposes, a sketch of the algorithm is provided in figure 6.

for each frame
    receive object descriptions
    for each object in frame
        generate convex hull of object shape
        update object path raster
        every 2 seconds
            record temporal point coordinate
for each completed object path
    search path database for an equivalent entry
    if equivalent entry found
        merge new path with database path
        search temporal database entries for equivalent entry
        if temporal equivalent entry found
            update temporal database entry
        else
            add new temporal database entry
    else
        add new path database entry
        add new temporal database entry

Fig. 6. Sketch algorithm of path and temporal database generation.

2.4 Region Generation

This phase accesses the database of paths and the database of temporal interval sequences so that leaf, composite and temporal regions can be constructed for the spatio-temporal model of the domain. In the first subsection below, we will describe the purely spatial component of this operation (i.e. without considering the temporal information); the following subsection then discusses the full spatio-temporal model.

2.4.1 Spatial Region Generation

At any time during the training period it is possible to generate regions for the spatial model. Effectively this halts the database generation process (although it may be resumed) and uses that information to build the regions.


A new region model can be created during the path generation stage each time a path becomes complete and is merged into the database. However, it is unclear how useful this continuous region generation may be. The spatial model may change frequently and the latest underlying region map may differ substantially from that in the previous state. Without an accurate mapping between the adjacent states, object behaviours may prove difficult to interpret.

When regions are generated only as required, path verification may also be accomplished. Each database path is tested against all other paths in the database to verify that no path equivalences have been created through the database update process; the merging of equivalent paths may alter the original shape enough that a previously unmatched path may now be found equivalent. Should any "new" equivalences be discovered they are merged together as before.

Although this step is not entirely necessary, it has the advantage that a previously statistically "weak" path may be strengthened by a "new" equivalence. Without this operation, such paths would only be strengthened with extra training; essentially, this step allows a shorter training period and as such provides an advantage over continuous region generation.

Alternatively, this operation could be performed during the database update process. The resulting database entry, after a new path is merged into the database, could then be reprocessed to check for any further equivalences. However, this operation may prove to be the bottleneck for real-time processing. It is possible that several database merges may be necessary before previously unmatched paths become equivalent. This means that several database update checks may be required. However, if the test is left until the start of the region generation stage, then any equivalent paths can be found in a single "verification" pass. In fact, experimental results have shown that fewer database checks and updates are made when using a single path verification process rather than continuous update.

To reduce "noise", any path with a uniformly low frequency distribution is discarded. Although a low frequency distribution may represent infrequent object movement rather than "noise", it is also possible that abnormal or unusual behaviour is being displayed. In some applications this information may be useful; however, the method described here relies on behavioural evidence and it is safe to reject these paths as they are not statistically frequent enough.

The remaining paths are then processed to obtain a binary representation of the "best" or most "common" route used; this depends on the database path update function being "addition" rather than "or". Thresholding is used to provide a binary representation, where the threshold is selected from the cumulative frequency histogram of each database path and the percentage overlap value employed in the test for path equivalence. An 80% overlap value is required to merge a path into the database and indicates the percentage of pixels shared by equivalent paths. This is reflected in the cumulative frequency histogram, where the "common" path forms the highest 80% of the histogram. So, the frequency value found at 20% of the histogram provides the value for the threshold operation.
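One plausible reading of this threshold selection is sketched below: with the database path stored as per-pixel frequencies, the "common" path is the top 80% of the cumulative frequency histogram, so the cut is taken at the 20th percentile of the non-zero frequencies. The array layout and names are our own.

import numpy as np

def binary_path(db_path: np.ndarray, overlap: float = 0.80) -> np.ndarray:
    """Threshold a frequency raster so that the "common" path, i.e. the
    highest `overlap` fraction of the cumulative frequency histogram,
    survives as a binary composite-region mask."""
    freqs = db_path[db_path > 0]
    threshold = np.percentile(freqs, (1.0 - overlap) * 100.0)
    return db_path >= threshold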


These binary path representations express the composite regions for the spatial model; they describe each area of similar behavioural significance formed by objects following the same course through the domain. The leaf regions can be completely defined by how the binary path representations overlap. Each binary path is allocated a unique identification before being added to the region map. Overlapping segments form separate leaf regions and are re-assigned a new unique identification. When all the paths have been processed each leaf region will have been identified and labelled.

Occasionally, adjacent paths may share small areas of common ground, perhaps from shadows or the occasional large vehicle. This can generate very small regions that are not actually useful, and the last step in leaf region generation is to remove such small regions by merging them with an adjacent region. The most appropriate adjacent region for the merge is obtained by considering the smoothness of the resulting merged regions. Smoothness is checked by considering the boundary of the small region and the proportion shared with the adjacent leaf regions. The adjacent region sharing the highest proportion of the small region's boundary is selected for the merge; e.g. if the small region has a border length of seven pixels and shares five with region A and only two with region B, the combination with region B would form a "spike" whereas region A may have a "local concavity" filled and subsequently be smoother (see figure 7). Figure 8 displays the leaf regions obtained for some test domains.

To complete the purely spatial model, it is necessary to discover the union of leaf regions which make up each composite region (based on the binary representations of the database paths). A complication in this process results from the previous merge of small "useless" regions, which may now be part of a larger leaf region that should not be a member of the composite region for the path under consideration. Each composite region should contain only those leaf regions that are completely overlapped by the path it represents. A selection of composite regions is displayed in figure 8 along with the identified leaf regions.

When complete, the spatial model is in raster format. Although this may be suitable for some applications, for storage a vector representation is much more efficient. A raster-vector conversion is therefore applied to the raster data and then output to a "map-file" (as used by [24]).


Fig. 7. Merge operation for "useless" small regions.

The obtained spatial model is then composed of composite regions, leaf regions, line segments and points.

2.4.2 Spatio-Temporal Region Generation

Leaf regions and composite regions are constructed as described in the previous subsection.

The temporal database entries associated with each of these paths are then processed to generate sets of temporal regions for the relevant composite region. Similarly to the path database, each set of temporal database entries belonging to a particular path is verified to ensure that no equivalences have been created through the update process. Should any "new" equivalences be discovered they are merged together as described in the previous subsection. However, the calculation for the mean location of the points in the temporal interval sequences has to be generalized to take into account the number of contributors to the point location in both sequences, thus:

$(x_i, y_i) = \frac{(x_i, y_i) \times N_i + (v_j, w_j) \times N_j}{N_i + N_j}$

where (x_i, y_i) is the ith temporal point in one database entry which matches the jth temporal point, (v_j, w_j), in the equivalent temporal database entry. N_i is the total number of contributors for the ith temporal point in the database entry and N_j is the total number of contributors for the jth temporal point in the equivalent database entry.


Fig. 8. Test domains: (a) road junction, (b) dual carriage-way and (c) pedestrian scene, displaying identified leaf regions along with a selection of composite regions (indicated by wider borders).


The temporal verification stage also ensures that the temporal points within the interval sequence are all positioned within the boundary of the generated composite region. It is possible that the threshold operation applied to a path (to obtain the composite region) will leave some of these points outside the resulting area. Should this occur the entire interval sequence is discarded. Typically, this only occurs if the interval sequence has a low statistical frequency; otherwise, the mean location of each point, obtained from the combination of more frequent equivalent temporal interval sequences, is likely to place those points within the boundary of the resulting composite region. As the next step removes infrequently occurring temporal interval sequences from the database, no significant information is discarded.

Each composite region will now have at least one temporal interval sequence left (should there be none then the composite region itself is invalid and should be discarded). Should the composite region have more than one associated interval sequence, this represents objects travelling at different speeds along the path and thus contains relevant information. For example, push bikes typically travel slower than motor bikes but are likely to travel along similar paths; or, at different times of the day when the traffic is heavier or lighter, the typical travelling speed changes.

The spatial extent of a temporal region is bounded by the line segments obtained from the composite region border and the points to either side of a temporal interval. The line segments (from the composite region boundary) provide the (intrinsic) left and right edges for the temporal region, whereas the (intrinsic) front and rear edges are obtained by generating lines passing through the points at either side of the temporal interval.

Although the initial temporal interval has a start point, it is not used to bound the first temporal region (in the composite region) because it occurs at the entry location for new objects; any object entering a composite region should enter into the first temporal region whether before the first point or not. Typically, this will only occur if a small object is detected earlier than normal, which is unlikely. As such, the spatial extent of the first temporal region is bounded to the left, right and rear by the line segments for the composite region boundary and to the front by the second point in the interval sequence (i.e. the point at the end of the first temporal interval).

Although the left and right edges (obtained from the line segments for the composite region boundaries) are already known for the temporal region, the front and rear edges have to be constructed. This is achieved by considering each temporal point, (x_i, y_i), in turn along with the previous point, (x_{i-1}, y_{i-1}), and next point, (x_{i+1}, y_{i+1}).


The gradient, m_1, of the line joining the previous and next temporal points can be calculated with ease:

$m_1 = \frac{y_{i+1} - y_{i-1}}{x_{i+1} - x_{i-1}}$

When multiplying the gradients of any two perpendicular lines we know that:

$m_1 \times m_2 = -1$

Therefore, the gradient of any line perpendicular to the line joining the previous and next temporal points is:

$m_2 = \frac{x_{i-1} - x_{i+1}}{y_{i+1} - y_{i-1}}$

In turn, it is now possible to define the equation of a perpendicular line that passes through the current temporal point:

$y - y_i = \frac{x_{i-1} - x_{i+1}}{y_{i+1} - y_{i-1}} (x - x_i)$

Using the equation of the perpendicular line it then becomes possible to find the location of the points which intersect the composite region boundary, providing the "corner" points for the temporal region (see figure 9).

There is a special case for the last point in the temporal interval sequence. Unlike the first point in the temporal interval sequence, objects are still travelling along the path after the last point; they just leave the domain in less than the 2 seconds required for a complete interval. This means that there is a final temporal region at the end of a composite region which occurs after the last temporal interval. In this situation, there are no further temporal points to obtain the line gradient from. Instead, the current (last) and previous temporal points are used in the gradient calculation, rather than the next and previous temporal points, thus:

$m_1 = \frac{y_i - y_{i-1}}{x_i - x_{i-1}}$

where i is the last point in the temporal interval sequence. The remaining calculations are then followed as before.

The spatio-temporal model is complete when each temporal database entry associated with a composite region has been processed. A summary of the region generation algorithm is provided in figure 10.
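A sketch of the edge construction for one temporal point, following the formulas above; the handling of the degenerate case of a horizontal chord (where the perpendicular is vertical) is our addition.

def perpendicular_through(prev_pt, cur_pt, next_pt):
    """Line through cur_pt perpendicular to the chord joining prev_pt and
    next_pt. Returns (m2, c) for y = m2*x + c, or ("vertical", x) when the
    chord is horizontal and the perpendicular has no finite gradient."""
    (x0, y0), (xi, yi), (x2, y2) = prev_pt, cur_pt, next_pt
    if y2 == y0:                 # horizontal chord: the edge is vertical
        return ("vertical", xi)
    m2 = (x0 - x2) / (y2 - y0)   # from m1 * m2 = -1
    return (m2, yi - m2 * xi)

Intersecting this line with the composite region's boundary segments then yields the "corner" points of figure 9.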


Fig. 9. Obtaining edge points for temporal regions.

at end of training period
    verify path database entries
    for each verified and statistically significant path database entry
        verify associated temporal database entries
        if path still valid
            threshold database path matrix
            update region map with threshold data
    find and merge small regions with relevant adjacent region
    for each verified and statistically significant path database entry
        identify leaf regions in each composite region
        for each associated temporal database entry
            find corner points for temporal regions

Fig. 10. Sketch algorithm of region generation process.

2.5 Experimental Results

The video image sequence used for the traffic junction is about 10 minutes in length and averages 5 or 6 objects each frame. For the dual carriage-way, again about 10 minutes of video footage is used, this time with up to 20 objects in each frame. In comparison, the pedestrian scene is roughly double the length, with at most 3 objects in any frame and often with periods of no object movement.


At the end of the training period the traffic junction has entered 200 paths into the database, which reduces to 70 after checking for equivalences. Of these paths, 28 prove frequent enough to be used in region generation, giving 28 composite regions and initially over 400 leaf regions. The removal of small regions reduces this number to around 150. After only 2 minutes, many of the significant routes have already been identified, with 16 paths strong enough to be considered composite regions, generating a total of 87 leaf regions. For the dual carriage-way approximately 150 leaf regions are obtained from 21 recognized paths, and in the pedestrian scene about 120 leaf regions are generated from 23 recognized paths.

These results rely on three threshold parameters we were unable to eliminate from the system. Thresholds remain necessary for the overlap value in the path equivalence test, the actual threshold operation used to obtain binary path representations, and the size of leaf regions that are to be merged into an adjacent region. As previously indicated, the overlap value for path equivalence and the path threshold operation are linked, one being the dual of the other. Experimental results indicated that an overlap value of 80% was suitable for each test domain. It is possible that the percentage overlap value is related to the camera angle for the scene. As the angle is reduced, objects in adjacent lanes will naturally overlap more. This means that when attempting the path equivalence test a higher overlap percentage value will be required to distinguish equivalent paths from those that are actually adjacent lanes. The value used to determine small regions is passed on from the tracking program: here the minimum tracked object size is 10 pixels, otherwise problems can arise. Ten pixels is less than 0.02 percent of the total image area, so it is quite conservative.

The system maintains real-time performance during the database update stages and is only marginally slower when generating regions (which can still be generated at any time). Results are successful, providing a number of alternative sets of temporal regions for the majority of the composite regions. Although some composite regions only show a single set of temporal regions, this is still acceptable; typically, objects travel at the same speed along that path. A selection of temporal region sets, contained within their composite regions, are displayed for the dual carriage-way in figure 11.

3 Event Learning


Fig. 11. Resulting temporal regions.

To demonstrate the effectiveness of the spatio-temporal model, we present a qualitative event learning strategy (in contrast to the usual method of providing event models as a priori system information) that uses the contextually relevant features of the spatio-temporal model. The style of the learning procedure is very similar to the semantic region learning system described above, in that the input data consists of temporally extended descriptions which are then matched against existing descriptions, and frequency counts are used to eliminate noise.


As noted at the beginning of this paper, we use qualitative spatial relations to describe a behaviour. Our reasons for doing this were not only as an exercise in validating the applicability of the representations being developed in the qualitative spatial reasoning literature, but also because we believe that by using qualitative relationships, whole classes of broadly similar behaviours may be readily identified. Another reason is that the use of such a representation allows the communication of the behaviour to a human audience rather naturally, since natural language spatio-temporal descriptions tend to be qualitative in nature (e.g. [4]). Indeed, previous work where event descriptions have been manually supplied to an event recognition system has often used such a qualitative representation language (e.g. [21,11,14]).

Of course, the particular choice of representation primitives will affect what can be represented, and thus learned. A wide variety of qualitative spatial calculi have been proposed [8]; for pairs of interacting objects, obvious relationships of interest are the orientation of one from the other and their directions of motion. We are assuming physical objects, so there is no possibility of their overlapping, and so mereological relations are of no interest, though in other domains (e.g. in geographical domains where objects may overlap, such as the habitats of different species), it might be appropriate to represent mereological relationships such as "part of" or "overlapping". Since we are restricting our attention to close objects, it does not seem worthwhile recording the distance between pairs of objects, though this could be done (and would then allow behaviours such as a "tailgating" car repeatedly "closing the gap" to be represented and thus perhaps learned). 6 Qualitative speed information is represented implicitly via the ETRs. Qualitative acceleration is not currently represented since we chose to focus on what we saw as the key relationships in our first experimentation.

6 In the limit, distinguishing between zero and non-zero distance gives a topological distinction, i.e. whether or not two regions are connected. There are many qualitative spatial calculi for representing and reasoning about such relationships [8].

Using the attention control mechanism, "close" objects are identified and the qualitative relationships for relative position and relative direction of motion are maintained in object relationship history lists. When an object leaves the domain, the associated history lists are verified and added to a database. On completion of the training period, the database can be statistically analyzed to determine which sequences of relationships occur sufficiently frequently to be considered as the basis for an event model.

A diagram outlining this approach is shown in figure 12. There are five main stages:


Fig. 12. Overview of the system.

(1) The same tracking process previously used obtains shape descriptions of moving objects.
(2) A classification stage allows the identification of qualitative position and direction from the quantitative information provided by the tracking application and the spatial model (§ 3.1).
(3) Object history generation uses an attentional control mechanism to identify "close" objects, and the qualitative relationships to those objects are added to the object history (§ 3.2).
(4) Object history verification analyzes each object history (case) to ensure that all relationship transitions are valid (§ 3.3).
(5) In the event database revision, each valid object history is added to the case-base. On completion of the training period, statistical analysis can determine event models from the object histories contained in the case-base (§ 3.4). Finally, the induced event models can be used to analyse subsequent new camera input which has been classified.

3.1 Classification

The first step in generating a history for each object is to correctly identify the position of each object within the spatio-temporal model and to classify its direction and velocity.


For each object location, the composite region being occupied has to be established along with the correct ETR within that composite region. This classification of position can be seen as data reduction, converting the unnecessarily detailed quantitative location into a more desirable qualitative location.

To this end, the database containing the spatio-temporal model is processed to produce a (two dimensional) leaf region map where each position indicates the leaf region occupying that pixel in the scene. Region borders lie between pixels and thus will cause no classification problems. Shape descriptions for each tracked object can be processed to provide a silhouette "mask" which can be located on the leaf region map. Any corresponding points will indicate the set of leaf regions overlapped by the object. To reduce potential errors (see below), the number of points overlapping each leaf region is also counted, and if less than a predetermined threshold that leaf region will be ignored. In this case, the predetermined threshold is 10% of the object size. Each object and leaf region has to have a minimum size of 10 pixels (from the tracking application parameters and the removal of small regions). Therefore, 10% of the minimum size is a single pixel, the smallest discernible unit.

If more than one composite region is identified then the principle of momentum is applied to choose the same composite region as in the previous frame (an arbitrary choice 7 is made initially).

7 Or one could use the statistical likelihood of each path (which is readily available from the path generation phase).

The potential errors, mentioned above, that may be created by not pruning out leaf regions with minimal occupancy include misidentifying the correct composite region. By removing those leaf regions, the "core" leaf regions being overlapped still remain and prove sufficient for the identification of the correct composite region.

The distance between the centroid point of the object at its current location and where it was two seconds ago allows the correct set of ETRs of the selected composite region to be identified, along with the particular ETR where the object currently is. (Special techniques have to be used in the first two seconds of an object's appearance.)

A deictic frame of reference based on the position of the camera is used to classify the direction being taken by a moving object. This allows information provided directly from the tracking application to be used in the classification procedure. With each object description (after the first), a quantitative direction vector is computed, indicating the direction just taken by that object. For the domains in mind, a qualitative classification into eight 45 degree zones (see figure 13) [19] is suitable, and the appropriate quantitative to qualitative conversion is easily performed.
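The quantitative-to-qualitative direction conversion can be sketched as below. The zone numbering is illustrative (figure 13 fixes the paper's own labelling), with each 45 degree zone centred on a compass direction.

import math

def direction_zone(dx: float, dy: float) -> int:
    """Classify a quantitative direction vector into one of eight
    45-degree zones, numbered 0..7 here for illustration."""
    angle = math.degrees(math.atan2(dy, dx)) % 360.0
    return int(((angle + 22.5) % 360.0) // 45.0)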


Fig. 13. Direction classification.

3.2 Object Histories

Once the position (with respect to the ETR within a composite region of the spatial model) and (qualitative) direction of each object in the current frame have been classified, the history for each object can be updated. By history, we refer to the sequence of relationships between each object and any other (potentially) interacting (i.e. "close") objects on each object's course through the domain. Such relationships are modelled qualitatively such that only critical changes are recorded.

As discussed above, there are two relationships modelled between "close" objects which are recorded in each history item:

- The relative position of the "close" object with respect to the reference object. There are eight possible classifications: ahead, ahead-left, adjacent-left, behind-left, behind, behind-right, adjacent-right and ahead-right (as illustrated in figure 14). This is similar to the orientation model proposed by [34], such that objects in the "lines of travel" are either ahead of or behind the reference object. However, the "lines of travel" do not rely on the current trajectory of the reference object but rather on the composite region currently occupied by that object, which may include curves rather than just straight lanes.
- The relative direction of movement between the two objects. The number of "interesting" relative directions is presently limited to six: same, opposing, (perpendicular) towards from the left and right, and (perpendicular) away to the left and right (as illustrated in figure 15).

The temporal extent created by the attention control mechanism incorporates the ETR occupied by the reference object and the ETRs immediately in front of and behind the occupied one. Since the two objects may be towards the edges of their respective ETRs, this brings any object sharing the same composite region within nearly four seconds' distance into the temporal extent. 8


Fig. 14. Position of "close" objects relative to the reference object.


Fig. 15. Relative direction of motion.

To identify "close" objects in adjacent paths, the temporal extent is broadened; in the present implementation this is achieved by extending the leading and trailing edges of the temporal extent to each side by the width of the edge. Figure 16 shows an example of a temporal extent created in this manner.

Fig. 16. Example of a temporal extent generated by the attention control mechanism.

The attention control mechanism is thus capable of identifying any objects contained within the bounds of the temporal extent. The centroid point of an object is the determining point of occupancy.

8 This of course ignores any motion of the other object.
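At its core, this attention mechanism yields a qualitative closeness test. A stripped-down sketch for two objects on the same composite path, ignoring the broadening into adjacent paths described above (all names are ours):

def qualitatively_close(path_ref, etr_ref, path_other, etr_other):
    """The temporal extent covers the reference object's ETR plus the ETRs
    immediately in front and behind, so another object on the same
    composite path is "close" when its ETR index differs by at most one."""
    return path_ref == path_other and abs(etr_ref - etr_other) <= 1

Because the extent is built from the reference object's own path and ETRs, closeness defined this way need not be commutative, a point taken up at the end of this section.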


Fig. 17. Classification of relative qualitative position using sub-regions obtained from the temporal extent.

Following the identification of "close" objects, it is necessary to classify the relative qualitative relationships (position and direction of motion) from the reference object to an identified "close" object. This is accomplished by splitting the temporal extent (generated by the attention control mechanism) into nine sub-regions, which allow the "close" object to be classified into the path of the reference object or a neighbouring path. 9 The sub-region that the identified "close" object occupies 10 determines the relative qualitative position (cf. figure 17).

Calculating the relative direction of motion is simple: the actual directions of motion for the two interacting objects have already been converted to qualitative values.

9 Only 8 positions are relevant; the 9th position is the central location occupied by the reference object.

10 As noted above, the centroid of the object is used to determine membership of the object in a region; this is only ambiguous if it lies on the boundary of two subregions, in which case the previously occupied subregion is chosen (frames are sufficiently close that an object will not "jump" a subregion); by choosing the previously occupied subregion, a change of relation will not occur until it is forced to.


These two values now need to be compared to determine the relative direction.

The relevant object history maintained for the reference object is then updated if it already exists, or created if not. The qualitative relationship tuple is compared with the most recent history item. If the history item matches the relationship tuple then an associated count is incremented to indicate the total number of matches for the current relationship. Otherwise the new relationship pair is appended to the history list.

Each reference object may interact with several other objects and a separate history is maintained for each. The same procedure has to be followed for each reference object, both finding "close" objects and obtaining the relative qualitative relationships. Since each object can be travelling at a different velocity, and along a different width of path, the associated temporal extents will be different. This means that "close" is not necessarily commutative, and although one object may be deemed "close" to another, the reverse is not always true. Similarly, even if the objects are deemed "close" in both situations, the resulting relationships may not correspond. Each object has its own frame of reference which may be different to that of the "close" object. Relative position depends on the reference object's frame of reference, and if it is different to the other object's then the identified relationships may not correspond. This might seem to be unfortunate, but it could be argued that it is a useful feature of the representation (as well as a necessary consequence of our approach to calculating closeness); indeed there is evidence in the spatial reasoning literature that closeness is not a symmetric relationship [29], but depends, for example, on the nature and size of the objects involved: one might say "Versailles is close to Paris", but not "Paris is close to Versailles".

3.3 Object History Verification

When an object leaves the domain its associated object history lists can be merged with the database. However, to ensure that the object history is valid and free from extraneous relationships caused by tracking "noise", the object history is analysed in a verification procedure that relies on the statistical data provided when generating the history.

First of all, the verification procedure checks whether or not the history has sufficient statistical strength to be considered. If the history sequence refers to a relatively short interaction between two objects, then that interaction could be between elements of tracked "noise" and not actual objects. Since it is not possible at present to determine the object types, such short interactions are discarded. Over the entire training period, sufficient object histories will be processed so discarding potentially "risky" histories should not adversely affect learning.
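As a concrete illustration of the history update described in section 3.2, and of the run-length counted tuples that the verification steps below operate on, here is a minimal sketch (the list-of-[tuple, count] layout is our own):

def update_history(history, rel):
    """Append the (relative position, relative direction) tuple, or bump
    the frame count on the most recent item if it is unchanged."""
    if history and history[-1][0] == rel:
        history[-1][1] += 1
    else:
        history.append([rel, 1])

# Reproducing the pruning example discussed next:
h = []
for rel in ([("behind", "same")] * 23 + [("behind-left", "same")]
            + [("behind", "same")] * 74):
    update_history(h, rel)
# h == [[("behind", "same"), 23], [("behind-left", "same"), 1],
#       [("behind", "same"), 74]]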


Next, the history sequence is analysed to locate (potentially) irrelevant items, i.e. history items that only occur in a sequence of one or two frames and lie between two matching items that appear for significantly more frames. For example, in the sequence ..., ((behind, same) 23), ((behind-left, same) 1), ((behind, same) 74), ... the relationship tuple (behind-left, same) only occurs in a single frame and splits a significantly longer sequence of (behind, same). It is important to remember that a single frame takes (1/25)th of a second, which is essentially negligible, and this pruning operation only strengthens the re-combined relationship (behind, same).

If the aberrant relationship tuple occurs between two that are not the same, the removal process is more complicated. The transition from one relationship tuple to the other has to respect the underlying assumption that motion is continuous. This is achieved by checking that the transition is allowed by reference to a continuity network (cf. [8]; also known as a conceptual neighbourhood) for the qualitative spatial relations involved. A continuity network for the qualitative relative position relations is depicted in figure 18, and for direction of motion relations in figure 19.
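A sketch of the continuity test, with the relative-position neighbourhoods read off the ring structure depicted in figure 18 below; the table encoding is ours.

POSITION_NEIGHBOURS = {
    "ahead":          {"ahead-left", "ahead-right"},
    "ahead-left":     {"ahead", "adjacent-left"},
    "adjacent-left":  {"ahead-left", "behind-left"},
    "behind-left":    {"adjacent-left", "behind"},
    "behind":         {"behind-left", "behind-right"},
    "behind-right":   {"behind", "adjacent-right"},
    "adjacent-right": {"behind-right", "ahead-right"},
    "ahead-right":    {"adjacent-right", "ahead"},
}

def valid_transition(pos_a: str, pos_b: str) -> bool:
    """A transition between history items respects continuity when the two
    relative positions are identical or are conceptual neighbours."""
    return pos_a == pos_b or pos_b in POSITION_NEIGHBOURS[pos_a]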


Fig. 18. Continuity Network for qualitative relative position.

Similarly, the final verification step involves checking that all adjacent items in the history respect the continuity network. In this example:

... ((infront, same) 34) ((right, same) 13) ((infront-right, same) 23) ...

there is no direct transition between the positional relationships infront and right.


Fig. 19. Continuity Network for qualitative relative direction of motion.

The continuity network (figure 18) shows that the only possible transitions from infront are to infront-left or infront-right. As with Hernandez' [19] topological/orientation model, simultaneous changes in both relationships (relative position and relative direction of motion) may occur (i.e. (right, opposing) may change directly to (behind-right, perp-away-right)).

For single discrepancies, it may be possible to "fix" the history by inserting a missing relationship tuple or removing an extraneous one (though in general this is not uniquely possible). However, if the number of discrepancies is large, it is easier and safer to discard the entire sequence rather than trying to "fix" the history and then including the conglomerate sequence in the database.

Figure 20 provides several examples of actual history sequences and the resulting history sequence after verification. In the first example (a), "Infront Right in Same Direction" only occurs in a single frame and the relationship on either side is identical. The relationship is therefore discarded and the remaining two are merged, and similarly in example (b). The third example (c) is somewhat more complex and results in a history consisting of three items, although perhaps just "Behind in the Same Direction" may have been more appropriate. Had the minimum number of frames been set higher, that would have been the result.

3.4 Event Database Revision

Following the verification stage, an object history list will represent a sequence of relationship tuples between two interacting objects depicting a single event or a composite event episode. An event will usually be represented by a single relationship tuple indicating simple behaviour patterns such as following, being followed, travelling alongside left... 11

11 Of course, the system does not generate these English event names.

(a)  52: Infront in Same Direction
      1: Infront Right in Same Direction
     61: Infront in Same Direction
     ==> 113: Infront in Same Direction

(b)  16: Behind Right in Same Direction
      9: Right in Same Direction
     18: Infront Right in Same Direction
      1: Infront in Same Direction
     17: Infront Right in Same Direction
     ==>  16: Behind Right in Same Direction
           9: Right in Same Direction
          35: Infront Right in Same Direction

(c)   2: Behind in Same Direction
      1: Behind in Opposing Direction
     15: Behind in Same Direction
      1: Behind Right in Same Direction
      1: Behind in Same Direction
      3: Behind Right in Same Direction
      3: Behind in Same Direction
      2: Behind Right in Same Direction
      2: Behind Left in Same Direction
      1: Behind Right in Same Direction
      2: Behind Left in Same Direction
      1: Behind Right in Same Direction
     16: Behind in Same Direction
      2: Behind Left in Same Direction
      1: Behind in Same Direction
      2: Behind Left in Same Direction
      2: Behind in Same Direction
      1: Behind Left in Same Direction
     12: Behind in Same Direction
     ==>  18: Behind in Same Direction
           3: Behind Right in Same Direction
          34: Behind in Same Direction

Fig. 20. Example relationship history sequences along with the results from verification.

3.4 Event Database Revision

Following the verification stage, an object history list will represent a sequence of relationship tuples between two interacting objects depicting a single event or a composite event episode. An event will usually be represented by a single relationship tuple indicating simple behaviour patterns such as following, being followed, travelling alongside left, and so on (of course, the system does not generate these English event names). Although a single event may occur through multiple relationships (for example, pulled out behind would require (behind, same) and (behind-right, same)), such events usually follow a simpler relationship. In the example of pulled out behind, the reference object would have been followed before the other object pulled out. Thus the object history represents two events or a (composite) event episode.

This observation is important when updating the database. Not only is it necessary to search for equivalent database entries, it is also necessary to search the database for entries that match a (continuous) subset of the new entry. Such subsets represent simpler event patterns which compose the new event episode. If an equivalent entry is not found, the database also needs searching for entries that the new history is a subset of. In this situation, the new history represents an event pattern that has not previously been discovered. However, the new event may already be modelled within one or more composite event sequences.

Using a qualitative representation scheme for the relationships eases the database search. An entry is only equivalent (to the new object history) if the relationship tuple sequences are identical (i.e. each relationship tuple must appear in the matching sequence in the same order). The equivalence test does not include the item count; that was only necessary for the verification procedure. If an exact database entry is discovered, a "hit" count is incremented, otherwise a new entry is inserted into the database. The "hit" count indicates the number of times that particular sequence of relationship tuples has occurred in the training period and provides statistical information that will allow event models to be constructed.

This "hit" count is the reason why it is necessary to search for matching subsets. The first subset search finds the less complex event sequences that compose the new sequence (i.e. discovers the matching events in an event episode). Although these less complex event sequences have not occurred on their own, they are part of a more complex behaviour pattern that requires these less complex sequences. Thus the "hit" count on those matching subsets is also incremented.

The final subset search, looking for database entries that the new sequence forms a subset of, is only necessary if an equivalent entry is not discovered. This test searches the database for more complex sequences (event episodes) that the new entry contributes towards. Rather than updating the "hit" count on the existing database entries, the "hit" count associated with the new entry is incremented. The search is not necessary if an equivalent entry was initially discovered, because this process would have been performed when the entry first appeared and subsequently updated by the previous search mechanism.

At the end of a training period, any sufficiently frequent database entry represents the sequence of relationships in an event model.
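The revision step can be sketched as follows (Python, illustrative only; the data structures, and the amount by which a new entry's count is credited, are assumptions, since the paper leaves them implicit).

    # Hypothetical sketch of event-database revision. A key is the ordered
    # tuple of relationship tuples; frame counts are ignored, as specified.

    def is_contiguous_subset(shorter, longer):
        """True if `shorter` occurs as a contiguous subsequence of `longer`."""
        n = len(shorter)
        return any(longer[i:i + n] == shorter for i in range(len(longer) - n + 1))

    def revise_database(db, history):
        """db maps relationship-tuple sequences to "hit" counts."""
        seq = tuple(rel for rel, _count in history)
        if seq in db:
            db[seq] += 1                  # exact match: bump its hit count
            new_entry = False
        else:
            db[seq] = 1                   # a newly discovered event (episode)
            new_entry = True
        # Simpler sequences composing this episode also occurred, so their
        # hit counts are incremented too.
        for known in db:
            if known != seq and is_contiguous_subset(known, seq):
                db[known] += 1
        # Only for new entries: credit the new entry for the episodes it is
        # already embedded in (the increment amount is assumed here).
        if new_entry:
            for known in db:
                if known != seq and is_contiguous_subset(seq, known):
                    db[seq] += db[known]
        return db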

3.5 Experimental Results

Over a 15 minute training period observing object interactions on a dual carriageway (i.e. divided highway), over 60 distinct relationship sequences were captured in the case-base. Subsequent analysis determined that, of these, 25 proved to be sufficiently frequent to represent events. By far the most observed behaviour was "following", where the only relationships contained in the sequence show the focus object "behind" the interacting object and travelling in the same direction. Unfortunately, the most complex "overtake" sequence, where an object starts behind a second, pulls out and goes all the way around to finish in front of the other vehicle, was not discovered, although the less complex version, where the objects start and finish in adjacent lanes, is modelled, as are other subsets like pulling out behind and pulling in front. It would appear that in this particular domain the observed area is not large enough to obtain all the necessary information to form the more complex behaviour patterns that we would like to discern.

To demonstrate the effectiveness of the event models discovered in the learning process, a demonstration program has been set up that allows the user to specify a particular event to watch for. The event models can be loaded from an "event-info" file along with the spatio-temporal map-file. These are interpreted into the desired format, allowing the user to cycle through a list of event models and to decide which event sequence the program should watch out for. Currently a diagrammatic presentation has not been provided. Instead, the user is shown a linguistic description of the transitions composing the event. For example, an overtake event episode would be described as:

- travelling behind-right in the same direction.
- travelling right in the same direction.
- travelling infront-right in the same direction.

To allow the simultaneous interpretation of the observed actions, the event sequence is represented as a state transition network (similar to that used by [3]). Figure 21 displays the overtake sequence as a state transition network. The attention control mechanism isolates objects in the same vicinity, which are then categorized with the correct relationship tuple; this is then checked against the starting state in the state transition network: if they match, then the event episode has potentially been initiated. To show a potential event episode, the relevant objects are coloured on the display: green indicates a target object in a relationship and blue represents the reference object potentially involved in the event episode.

Fig. 21. Overtake state transition network.

If the last state in the transition network is reached, the reference object turns red to indicate that the event episode has been recognized (a demonstration of this can be viewed at http://www.scs.leeds.ac.uk/qsr/ivc.html). Multiple windows could show several different event types being recognised simultaneously. Figure 22 shows a sequence of frames illustrating the recognition of the overtake sequence of relationships shown above.

Fig. 22. Sequence of frames showing the recognition of the overtake manoeuvre. (Original in colour; see http://www.scs.leeds.ac.uk/qsr/ivc.html.)
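Read off figure 21, the overtake recogniser amounts to a small table-driven state machine. The sketch below is one Python rendering of it; the state names and the policy for relations outside the table are reconstructions from the figure's labels, not the published code.

    # Hypothetical sketch of figure 21. Both objects are assumed to be
    # travelling in the same direction throughout.
    OVERTAKE_NET = {
        # state: {observed positional relation: next state}
        "A": {"behind-right": "A", "right": "B"},
        "B": {"right": "B", "behind-right": "A", "ahead-right": "OVERTAKE"},
    }
    START, ACCEPT, REJECT = "A", "OVERTAKE", "NO OVERTAKE"

    def recognise(relations):
        """Feed successive positional relations through the network."""
        state = START
        for rel in relations:
            state = OVERTAKE_NET.get(state, {}).get(rel, REJECT)
            if state == ACCEPT:
                return True   # episode recognised: the display turns red
            if state == REJECT:
                return False  # the relations left the overtake pattern
        return False          # episode still in progress / incomplete

    # e.g. recognise(["behind-right", "right", "right", "ahead-right"]) -> True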

4 Related Work

Although there is increasing interest in work which provides conceptual descriptions of images and video, as evidenced by recent workshops such as [1,16,6], much of the work assumes manually provided models rather than learning them as we propose. Of those that do learn, ours appears to be unique in its use of a qualitative spatio-temporal representation calculus. In the rest of this section we briefly survey some of this work and attempt to relate it to our approach.

We have already mentioned the work of Howarth and Buxton (e.g. [25,26,21]), which has inspired much of our approach. In many ways our work can be thought of as complementary: learning the semantic regions and the event models which have to be manually provided in their approach. A key feature of the approach, as exemplified in [21], is the use of Bayesian networks to recognise events. Again, the conditional probabilities to enable this are explicitly provided; we have not explicitly considered this, but this kind of knowledge (e.g. the conditional probability of the current event being an "overtake", given an "alongside-consecutive" relationship) could be derived from the knowledge base garnered by our techniques.

Kollnig et al [17] scanned a German dictionary to obtain a final set of 67 verbs and verb phrases indicating motion; they then associated (by hand) a precondition, monotonicity condition and postcondition with each of these; these were expressed qualitatively, but implemented via fuzzy sets. In more recent work, Haag et al [33,32] extend this approach with more structured event descriptions (which they call "situation subgraphs"), processed using a fuzzy metric temporal logic programming approach.

Another example of a system which requires predefined events in order to be able to recognise them is that of Bremond and Medioni [11]. They identify mobile objects with regions and compute a number of measures (e.g. height, width, speed, motion direction, current location, set of previous locations, distance to a reference object). They then define scenarios such as "a car avoids a checkpoint" using an automaton with conditions defined over these measures. These conditions appear to be essentially qualitative in nature, as in our work; e.g. "goes toward", "stops before the checkpoint".

Ayers and Shah [10] also require prior knowledge to be able to recognise events, in their case monitoring human behaviours in an office environment. Essential to their system is "an accurate description of the layout of the scene ... information about the location of entrances"; similarly, "scene change detection is only performed in areas that have been declared as interesting". In addition, they predefine a list of actions and recognition conditions, such as "put down object", which is recognised when "person moves away from object being tracked". Finally, they provide a "transition state model" which specifies the possible transitions between the various actions. All this has to be manually provided, and it is the recognition of the difficulty of providing such models which motivates our research.

Jebara and Pentland [2] recognise the need not to have to tediously describe behaviour models directly but to enable a machine to learn them. In many ways the motivation of their work is similar to ours, but the underlying descriptive apparatus is entirely different: their time series predictions give probabilistic estimations of the values of the parameters of the "blobs" (describing shape and position) at each consecutive time frame, which is clearly a much lower level representation than ours. It is entirely adequate for the task they have in mind, enabling a computer to mimic a human's facial and gestural behaviour, but it is certainly not at a qualitative level. Equally, our system can predict qualitative behaviour, but not at the precise quantitative level of their system.

Oliver et al [36] consider the problem of learning models to predict interactions between humans, using Hidden Markov Model techniques. Trajectories are modelled using a first order Extended Kalman filter that predicts successive blob positions. Rather than have the system automatically discover possible event types, they pre-identify a set (five in the reported work) of event types and train the system using specific instances of these event types (in different positions and orientations etc.).

Johnson [31] also learns the behaviour of moving objects, but again at a quantitative level, as a pdf over possible sequences of object configurations; in the simplest case the configurations are instantaneous point and velocity on the ground plane or in the image plane. This work has recently been extended [30] to learn joint behaviour of people (e.g. facial expressions) with the aim of generating a plausible virtual partner in an interaction.

Finally, it is worth pointing to work on learning qualitative conceptual descriptions in static scenes, such as that described by Walischewski [18], who uses a simple qualitative spatial calculus to represent the layout of structured documents (such as envelopes).

5 Discussion

Superficially, it might be thought that, since the domains we have chosen to be of interest are assumed to be somewhat regular with respect to the behaviour patterns within them, a traditional "expert systems" approach might be (more) suitable, rather than the learning approach we have promulgated here. However, as has been widely observed in the knowledge based systems literature, the acquisition of rule bases is not always easy, and this has driven many researchers to investigate ways of acquiring rule bases via Machine Learning techniques, as we suggest here.

Thus our work can be seen as complementary to the works described in the previous section (in particular that of Buxton and Howarth [7,25,26,23,27,22,21]), in that we suggest a technique to learn the kind of models which these works presuppose.

However, it might be argued that we still have to supply background knowledge; in our case it is the language of the qualitative calculus, and what is or can be learned will clearly be crucially affected by the choice of this representation language. This is inevitable: any learning system must have some kind of inductive bias supplied to it, and in our case it is the choice of the qualitative representation language. We hypothesise, however, that the class of learnable events will be relatively insensitive to the actual choice (though obviously their precise representation will vary). The verification of this hypothesis remains as future work; for the present we are content that by selecting a small number of qualitative spatio-temporal primitives, we are able to learn a potentially very large class of events. Moreover, even if the actual event descriptions are relatively simple to specify (cf. figure 21), the advantage of our approach is that there may be very many of them, and a manual method would be both tedious and/or prone to error or incompleteness.

Of course, learning systems such as ours may suffer from error or incompleteness too, but at the very least they could provide an independent check on manually specified rules. The history of knowledge acquisition is full of examples of the "knowledge acquisition bottleneck", and we believe the domain of video interpretation is no less susceptible to this. Our approach also has the advantage that it learns the statistical frequency of the domain events, which would not be present if a human simply encoded the event descriptions by hand.

Our thesis is that we can automatically identify "interesting" behaviour. There are (at least) two notions of "interesting":

(1) "interesting" because the behaviour is frequent (or at least sufficiently frequent to be picked up by our techniques).
(2) "interesting" because it is very rare and thus not picked up by the statistical methods.

We can recognise both types of behaviour: the first because we have built an explicit model, and the second because a "run time" event history does not match an existing model and thus potentially represents an unusual behaviour occurrence (see the sketch below). We would not claim, however, to have properly analysed or defined what constitutes an "interesting behaviour"; that certainly deserves further work, but we think our technique is an approach worth exploring. More generally, there is the question of how much we may have been (inadvertently) influenced in our system design by what we hoped and expected the system to produce: how general really is our approach? If we applied our techniques to some totally new domain which has not been investigated before, how well would they work? This is a good question to which only future work will bring forth the answer; it is really a question of our choice of inductive bias, as already mentioned.

Notice that although the possible space of behaviours is large (depending on the length of the event sequences, the number of different spatial distinctions we make, the number of permissible actors in a behaviour, and the degree of focus), we are not searching it; rather, the actual data we obtain selects certain elements from the space. There is no search in our program (except in the sense of searching the database of existing behaviours to match a newly observed one).

Another issue of concern in learning a probabilistic model is determining when sufficient training has taken place. We have not specifically addressed this problem yet. To a certain extent this is a function for the user to address, i.e. they must ensure that the training takes place for sufficiently long, and at the right times, to allow a sufficiently accurate model to be built. Statistics can easily be output to aid in these decisions, e.g. the rate at which new paths and new events are being added to the system, or the rate of change of the relative probabilities of the learned concepts.
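The second notion of "interesting" can be made concrete as a mismatch test. The following is a minimal sketch under assumed data structures (Python; the names are hypothetical): a run-time history is flagged as unusual when it matches no learned model, neither exactly nor as a fragment of a learned episode.

    # Hypothetical sketch: flag a run-time history as "unusual" (the second
    # notion of "interesting") when it matches no learned event model.

    def is_unusual(history, event_models):
        """history: list of (relationship_tuple, frame_count) pairs;
        event_models: set of relationship-tuple sequences that were
        sufficiently frequent during training."""
        seq = tuple(rel for rel, _ in history)
        if seq in event_models:
            return False          # a learned (frequent) behaviour
        # Tolerate partial observations: a contiguous fragment of a
        # learned episode is not unusual, merely incomplete.
        n = len(seq)
        for model in event_models:
            if any(model[i:i + n] == seq for i in range(len(model) - n + 1)):
                return False
        return True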

This issue is addressed in the approach of [36], who use pretraining with synthetic models amongst other techniques, but this in turn begs the question of how the synthetic models are derived.

At present our techniques for attentional control are relatively crude: we simply consider objects that are "close" to a reference object, where "close" takes account of the speed of the reference object. As already remarked, this may sometimes be overly restrictive (e.g. when a chase is being conducted at a distance). Moreover, the actual metric cutoff for closeness will clearly be domain dependent; our choice of a two second gap was influenced by the UK "Highway Code" recommendation, but is clearly not necessarily appropriate for all traffic situations, let alone all domains.

It is also worth discussing our approach to handling noise. Much "noise" has already been eliminated at lower levels; for example, regions extracted in each frame are only considered at the higher level if coherent motion is being described (i.e. the region representing the mobile object has been tracked for a number of frames). In this way, regions which appear due to inaccurate segmentation never reach the event learning stage. Similarly, the semantic regions give a basis for rejecting motion not following a path learned in this phase, the implication being that any such "movement" is noise. Within the event database revision phase, we use the conceptual neighbourhood diagrams to filter out event histories missing more than one intermediate relation, since we view such histories as being too noisy. Finally, the frequency counts attached to the event descriptors are a way of expressing a confidence in the detected event; in fact, learned events with very low frequency counts are discarded as noise. Of course this raises the question of how to interpret an event at runtime (i.e. after learning is complete) which has not been encountered before (cf. our second notion of "interesting" above): is this simply noise or a new event type? There is no easy answer to this question. One could imagine developing generic descriptions of event types, classifying them into event classes (based perhaps on the structure of the event and its constituent relations); a new event type which fell into an existing class might then be classified as such rather than as noise. Exploration of this problem and such a resolution remains as future work.

This leads on to consideration of how we interpret the frequency counts. We have not yet conducted detailed investigations in this area. For the experimental data described above, the most frequent event occurred more than 100 times, while the most infrequent of the "sufficiently frequent" occurred around 30 times. We took a count of three as the minimum to be regarded as an event to be used in subsequent surveillance. We found a large gap between the frequency of events with very few occurrences and those with rather more. A reasonable approach to determining the meaning of "sufficiently frequent" might be to look for such a gap.
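As an illustration of that heuristic, a threshold could be placed at the widest gap in the sorted hit counts. The sketch below (Python, illustrative only) compares successive counts by ratio, i.e. gaps on a log scale, so that the long tail of very frequent events does not dominate; this refinement is an assumption, not taken from the paper.

    # Hypothetical sketch: choose a "sufficiently frequent" cutoff at the
    # widest (log-scale) gap in the sorted hit counts.

    def frequency_threshold(hit_counts):
        counts = sorted(set(hit_counts))
        if len(counts) < 2:
            return counts[0] if counts else 0
        ratio, upper = max((b / a, b) for a, b in zip(counts, counts[1:]))
        return upper

    # e.g. with counts {1, 2, 3, 30, 45, 100} the widest log-scale gap is
    # between 3 and 30, so entries occurring >= 30 times become event models.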

6 Future Work

Although the prototype described here appears to work well, there is plenty of opportunity for further work; some ideas are mentioned here in conclusion.

- First, it is clear that we should evaluate the existing system in more than one domain; in the first instance we intend to apply it to other traffic scenes. Given a sufficiently rich variety of traffic scenes, we would hope to be able to reproduce and indeed verify the set of traffic event types claimed to be exhaustive in [38]. Following this, we would investigate its application to other domains with more or less perceived regularity of behaviour.
- Improvements in the underlying tracking technology would improve the quality of the data available, which should result in wider applicability and better quality learned models. For example, the tracking application which was available for our use at the time of the experimentation was incapable of handling occlusion (i.e. situations when, due to camera perspective, two objects overlap); in such situations, one of the object labels will be lost. Also, the tracker is not model based, meaning that it is unable to recognize the difference between actual objects moving in the scene and scene variations due to camera vibration or "noise". Finally, note that the tracking application applied throughout this paper provides two-dimensional shape descriptions. If we were to utilize a three-dimensional model based tracker, the same region generation methodology would be capable of generating a three-dimensional spatial representation (plus a further dimension for the spatio-temporal model); of course, the current application programs would have to be extended to cope with the extra information, but the underlying method would remain the same. Similarly, the event learning strategy could also be extended to incorporate three-dimensional spatial relationships.
- The learned event types could be represented in diagrammatic form or as natural language descriptions (see [15] for existing work in this area).
- At present, velocity information is only represented implicitly as membership in different ETRs associated with a particular path. One could model velocity knowledge explicitly (in a qualitative manner, of course), either by classifying velocities into a finite discrete set of velocity ranges (perhaps induced from the different ETRs learned), or by simply recording relative velocity information (same, faster, slower; this three-way qualitative distinction is widespread in the qualitative reasoning literature [39] and has, by and large, proved very useful, despite its apparent simplicity). This would allow, for example, a simple following behaviour to be distinguished from the first part of an overtaking behaviour, allowing better prediction.

- The spatio-temporal model could be extended to represent acceleration changes, so that corresponding event types could be learned. This would require an extension to the event modelling language, as would recognising events involving (temporarily) stationary objects.
- Another possibly useful extension to the modelling language would be to use a calculus for representing and reasoning about regions with indeterminate boundaries [9].
- Currently, the system only examines relationships between two objects and, as such, only learns events involving two interacting objects. More complex behaviour involving several interacting objects (for example queueing) is currently not modelled as a single event between multiple vehicles. Instead, several events between two vehicles are modelled, which does not sufficiently represent the more complex behaviour being observed. The system could benefit from being enhanced to model relationships between three or more interacting objects; how these relationships could be modelled requires more thought.
- The attention control mechanism described within this paper only identifies potentially interacting objects through proximity based on the temporal region occupied by the object under attention. From this mechanism, the event learning strategy relies on the assumption that events occur between "close" objects. Further investigation is required to determine when this is sufficient and what further attention control techniques might be required. For instance, in some situations two interacting objects may not be "close" according to this definition, e.g. when one vehicle is being chased by a police car at a distance, or when two objects are travelling at vastly different speeds: a pedestrian crossing a road may not be in "close" proximity to moving vehicles, but that pedestrian has examined the environment and decided that an accident will not be caused by crossing the road at that time.
- From just a single static camera, the application domain is fairly limited. This could be extended by combining several cameras with (slightly) overlapping views to allow following the object movements over an extended observation area. If the connection between camera positions is unknown, the system could build the spatial model for each of the views and then combine them into a single area by finding the overlapping features.
- The techniques presented here may be applied in many possible domains and situations. The most obvious application is that of visual surveillance. For example, as a security system in a parking lot, the event system combined with the ETRs might easily identify unusual behaviour (e.g. a person not following the usual pedestrian paths or spending too long next to a vehicle). In police surveillance, the ETRs may immediately identify speeding traffic.

However, other applications are also possible; e.g., from the spatial model construction, one can identify areas with minimal occupation. This might be useful when refurbishing a shopping centre or other public area: the spatial model would show typical behaviour patterns that might allow new features to be placed with the minimal amount of disruption.

References

[1] A. Bobick, editor. IEEE Computer Society Workshop on The Interpretation of Visual Motion (held in conjunction with CVPR'98), Workshop Notes. http://vismod.www.media.mit.edu/~bobick/ivm-site/, 1998.
[2] A. Jebara and A. Pentland. Action reaction learning: analysis and synthesis of human behaviour. In IEEE Workshop on The Interpretation of Visual Motion. http://vismod.www.media.mit.edu/~bobick/ivm-site/, 1998.
[3] E. André, G. Herzog, and T. Rist. On the simultaneous interpretation of real world image sequences and their natural language description: the system SOCCER. In Proc. ECAI-88, pages 449-454, Munich, 1988.
[4] M. Aurnague and L. Vieu. A three-level approach to the semantics of space. In C. Zelinsky-Wibbelt, editor, The Semantics of Prepositions: From Mental Processing to Natural Language Processing. Mouton de Gruyter, Berlin, 1993.
[5] A. M. Baumberg and D. C. Hogg. Learning flexible models from image sequences. In European Conf. on Computer Vision, pages 299-308, May 1994.
[6] H. Buxton, editor. ECCV Workshop on Conceptual Descriptions from Images, Cambridge, UK, 1996.
[7] H. Buxton and R. Howarth. Spatial and temporal reasoning in the generation of dynamic scene descriptions. In Rodríguez [37], pages 107-115.
[8] A. G. Cohn. Qualitative spatial representation and reasoning techniques. In G. Brewka, C. Habel, and B. Nebel, editors, KI-97: Advances in Artificial Intelligence, LNCS 1303, pages 1-30. Springer Verlag, 1997.
[9] A. G. Cohn and N. M. Gotts. A mereological approach to representing spatial vagueness. In L. C. Aiello, J. Doyle, and S. Shapiro, editors, Principles of Knowledge Representation and Reasoning: Proc. 5th Conference, pages 230-241. Morgan Kaufmann, 1996.
[10] D. Ayers and M. Shah. Monitoring human behavior in an office environment. In IEEE Workshop on The Interpretation of Visual Motion. http://vismod.www.media.mit.edu/~bobick/ivm-site/, 1998.
[11] F. Brémond and G. Medioni. Scenario recognition in airborne video imagery. In IEEE Workshop on The Interpretation of Visual Motion. http://vismod.www.media.mit.edu/~bobick/ivm-site/, 1998.

[12] J. H. Fernyhough. Generation of Qualitative Spatio-Temporal Representations from Visual Input. PhD thesis, 1997.
[13] J. H. Fernyhough, A. G. Cohn, and D. C. Hogg. Generation of semantic regions from image sequences. In Fourth European Conference on Computer Vision, volume 1065 of Lecture Notes in Computer Science, pages 475-484. Springer Verlag, 1996.
[14] G. Herzog. Utilizing interval-based event representations for incremental high-level scene analysis. Technical Report Bericht Nr. 91, Universität des Saarlandes, 1992.
[15] G. Herzog, C.-K. Sung, E. André, W. Enkelmann, H.-H. Nagel, T. Rist, W. Wahlster, and G. Zimmermann. Incremental natural language description of dynamic imagery. In C. Freksa and W. Brauer, editors, Wissensbasierte Systeme: 3. Int. GI-Kongress, pages 153-162. Springer, 1989.
[16] H. Buxton and A. Mukerjee, editors. ICCV-98 Workshop on Conceptual Description of Images (CDI-98), Workshop Notes, 1998.
[17] H. Kollnig, H.-H. Nagel, and M. Otte. Association of motion verbs with vehicle movements extracted from dense optical flow fields. In Proc. ECCV. Springer, 1994.
[18] H. Walischewski. Learning and interpretation of the layout of structured documents. In G. Brewka, C. Habel, and B. Nebel, editors, KI-97: Advances in Artificial Intelligence, LNCS 1303, pages 409-412. Springer Verlag, 1997.
[19] D. Hernández. Qualitative Representation of Spatial Knowledge, volume 804 of Lecture Notes in Artificial Intelligence. Springer-Verlag, 1994.
[20] HMSO. The Highway Code. Her Majesty's Stationery Office, HMSO Publications Centre, PO Box 276, London, SW8 5DT, 1996.
[21] R. J. Howarth. Interpreting a dynamic and uncertain world: task-based control. Artificial Intelligence, 100:5-85, 1998.
[22] R. Howarth. Using cellular topology to unite space and time. In Rodríguez [37], pages 161-169.
[23] R. J. Howarth. The control of spatial reasoning for high-level vision: a traffic surveillance example. In Workshop on Spatial Concepts: Connecting Cognitive Theories with Formal Representations, ECAI-92, August 1992.
[24] R. J. Howarth. Spatial Representation and Control for a Surveillance System. PhD thesis, Queen Mary and Westfield College, University of London, July 1994.
[25] R. J. Howarth and H. Buxton. An analogical representation of space and time. Image and Vision Computing, 10(7):467-478, October 1992.

[26] R. J. Howarth and H. Buxton. Analogical representation of spatial events for understanding traffic behaviour. In B. Neumann, editor, Proc. 10th European Conf. on Artificial Intelligence, pages 785-789. John Wiley & Sons Ltd, 1992.
[27] R. J. Howarth and H. Buxton. Selective attention in dynamic vision. In Proceedings of the Thirteenth IJCAI Conference, pages 1579-1584, August 1993.
[28] J. Fernyhough, A. G. Cohn, and D. C. Hogg. Building qualitative event models automatically from visual input. In Proc. ICCV. IEEE, 1998.
[29] K. J. Holyoak and W. A. Mah. Cognitive reference points in judgments of symbolic magnitude. Cognitive Psychology, 14:328-352, 1982.
[30] N. Johnson, A. Galata, and D. C. Hogg. The acquisition and use of interaction behaviour models. In Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'98), pages 866-871. IEEE Computer Society Press, 1998.
[31] N. Johnson and D. C. Hogg. Learning the distribution of object trajectories for event recognition. Image and Vision Computing, 14(8):609-615, 1996.
[32] M. Haag and H. Nagel. Incremental recognition of traffic situations from video image sequences. In ICCV-98 Workshop on Conceptual Description of Images (CDI-98), 1998.
[33] M. Haag, W. Theilmann, K. Schaefer, and H. Nagel. Integration of image sequence evaluation and fuzzy metric temporal logic programming. In G. Brewka, C. Habel, and B. Nebel, editors, KI-97: Advances in Artificial Intelligence, LNCS 1303, pages 301-312. Springer Verlag, 1997.
[34] A. Mukerjee and G. Joe. A qualitative model for space. In Proceedings AAAI-90, pages 721-727. Morgan Kaufmann, Los Altos, 1990.
[35] A. Mukerjee and T. Schnorrenberg. Hybrid systems: reasoning across scales in space and time. In AAAI Symposium on Principles of Hybrid Reasoning, pages 15-17, November 1991.
[36] N. Oliver, B. Rosario, and A. Pentland. Statistical modeling of human interactions. In IEEE Workshop on The Interpretation of Visual Motion. http://vismod.www.media.mit.edu/~bobick/ivm-site/, 1998.
[37] R. V. Rodríguez, editor. Proceedings on Spatial and Temporal Reasoning, IJCAI-95 Workshop, Montréal, Canada, August 1995.
[38] U. C. von Seelen. Ein Formalismus zur Beschreibung von Bewegungsverben mit Hilfe von Trajektorien. Diplomarbeit, Fakultät für Informatik der Universität Karlsruhe, 1988.
[39] D. S. Weld and J. De Kleer, editors. Readings in Qualitative Reasoning About Physical Systems. Morgan Kaufmann, San Mateo, CA, 1990.