Multimodal Segmentation on a Large Interactive Tabletop: Extending Interaction on Horizontal Surfaces with Gaze

Joshua Newn, Eduardo Velloso, Marcus Carter, Frank Vetere
Microsoft Research Centre for Social NUI
The University of Melbourne
[joshua.newn][evelloso][marcusc][f.vetere]@unimelb.edu.au
ABSTRACT
Eye tracking is a promising input modality for interactive tabletops. However, issues such as eyelid occlusion and the viewing angle at distant positions present significant challenges for remote gaze tracking in this setting. We present the results of two studies that explore how gaze interaction can be enabled. Our first study contributes the results of an empirical investigation of gaze accuracy on a large horizontal surface, finding gaze to be unusable close to the user (due to eyelid occlusion), accurate at arm's length, and only precise horizontally at large distances. In consideration of these results, we propose two solutions for the design of interactive systems that utilise remote gaze tracking on the tabletop: multimodal segmentation, and the use of X-Gaze, our novel technique, to interact with out-of-reach objects. Our second study evaluates and validates both these solutions in a Video-on-Demand application, presenting immediate opportunities for remote-gaze interaction on horizontal surfaces.
Author Keywords
Interactive tabletop; large horizontal surfaces; eye tracking; smooth pursuit; gaze interaction; multimodal interaction.
ACM Classification Keywords
H.5.2. Information Interfaces and Presentation: User Interfaces: Input devices and strategies
INTRODUCTION
Touch is the most widely supported input modality for interactive tabletops, providing a precise, spatial, and natural means of interacting with digital content [2, 6]. As the sizes of interactive surfaces increase, reachability becomes a problem, as touch input can only be used within physical reach [31]. To mitigate this, various novel interaction techniques that extend touch on tabletops have been proposed in the literature: through emulation of a mouse [5], the use of tools/widgets [1, 7], or by introducing an additional input device/modality [4, 16, 18, 32].
Figure 1: Multimodal Segmentation. Left: Input modalities for each segmented region. Right: The application built in relation to where each modality/interaction technique works best. This is an animated figure and is best viewed in Adobe Reader.
Recently, eye tracking has been shown to be a promising technology for remote interaction with very large vertical displays, as gaze is fast, naturally drawn to objects of interest, and able to interact with objects beyond our reach [29, 30, 34]. However, remote gaze tracking is inaccurate on large horizontal surfaces. As the results from our first study demonstrate, it is constrained for two reasons. At the far end, the wide visual angle leads to inaccuracies, while at the end closer to the user, the eyelid tends to cover the pupil as the user looks down. To get around these constraints, HCI researchers have typically used wearable eye trackers to simulate perfect gaze tracking on tabletops [3, 14]. This approach limits real-world deployment, where remote gaze tracking is often the only suitable option, e.g., in unsupervised walk-up-and-use interactive surfaces found in museums [10] or in collaborative spaces where impromptu interactions take place [39].
This paper contributes the initial steps towards enabling interaction on a large interactive tabletop with gaze input. Our first study identifies the gaze estimation accuracy and precision on a large tabletop display with a low-cost remote eye tracker. The results show that, whereas gaze pointing works well at the centre of the screen, the accuracy decreases exponentially with distance. However, even at the far end of the table, horizontal eye movements are still tracked with high precision. Based on these results, we contribute a multi-level interaction architecture in which we divide the table into three regions that employ different input modalities, i.e. multimodal segmentation (see Figure 1-Left). In the first, the Touch Region, users interact using touch gestures. In the second, the Gaze+Touch Region, users point with gaze combined with indirect touch gestures. In the third, the X-Gaze Region, users select targets with X-Gaze, a novel technique we developed in this work using a horizontal smooth pursuit tracking algorithm adapted from Vidal et al. [36]. Our second study evaluates both solutions, multimodal segmentation and X-Gaze, through a Video-on-Demand application (see Figure 1-Right) that demonstrates how they can be used in a cohesive interface on an interactive tabletop. Lastly, we present the immediate future opportunities arising from these solutions.
RELATED WORK
A considerable amount of research in HCI has explored the use and design of large (>1 m²) interactive tabletops. Touch has been the common input modality on tabletops, reflecting its ease of use, intuitiveness, and social advantages [2, 6]. However, touch is limited by the user's reach, which undermines these advantages. Gaze input presents a possible solution [12], yet HCI researchers have typically used wearable eye trackers to simulate perfect gaze tracking on tabletops [3, 14, 37]. This approach is likely due to the inherent challenge of remotely detecting gaze on a horizontal plane. As a result, few studies have explored the use of remote gaze tracking on tabletops, and those predominantly in the context of intent prediction [17, 41].
Holman [12] first proposed the idea of using remote gaze as a possible solution to the inherent problems of interacting with large tabletop surfaces (e.g. reachability). Based on this idea, an eye-tracking tabletop interface (ETTI) and an accompanying game were developed and evaluated by Yamamoto et al. [40, 41]. Beyond demonstrating the ability to predict a user's intention based on gaze detection, their experiment highlighted two developments for interaction with horizontal surfaces using gaze. First, it showed the potential for unsupervised walk-up-and-use interaction using remote gaze, such as in museums: the natural habitat of multi-user large horizontal surfaces [10]. Second, their initial empirical study of gaze detection accuracy showed that remote gaze can be used in a horizontal configuration with high accuracy. However, their system was designed so that all parts of the projected surface could be reached (within arm's length) by any single user.
Mauderer et al. [17] demonstrated the use of remote gaze on a larger surface (53-inch), where reachability starts to become a problem, which their proposed technique attempted to address. By combining touch and gaze using two existing interaction techniques, MAGIC [42] and Superflick [20], the operable area was extended, allowing gaze-assisted object placement through a flick gesture. Their results showed a high error in the y-axis compared to the x-axis across different conditions, warranting further investigation. Accordingly, our first study sets out to empirically characterise
Figure 2: The diagram illustrates the theoretical regions on the tabletop corresponding to a 5-degree visual angle near to and far from the user on a 1 m projected surface.
remote gaze tracking on a large horizontal surface. Nevertheless, these works highlight two challenges faced when using remote gaze tracking on horizontal surfaces. First, commercial eye trackers are built for vertical displays, so their underlying algorithms start from this assumption. Second, the differences in visual angle when a user is positioned on one side of the surface cause difficulties in gaze tracking.
The combination of remote gaze and touch has been explored in various ways. Stellmach and Dachselt [29] combined touch (using a hand-held device) and remote gaze input with distant displays, demonstrating the potential for multimodal gaze-supported interaction. Pfeuffer et al.'s [19] work emphasises the complementary combination of gaze for selection and multi-touch for manipulation on the same horizontal surface. Their Gaze-touch technique allowed objects within reach to be interacted with using touch, while objects out of reach were selectable using gaze and manipulated by touch gestures in the reachable space. Results showed that when touch is combined with gaze, interaction speed increased and, as physical mid-air hand movements decreased, fatigue decreased as well. Serim and Jacucci [25] illustrated that gaze input could be used to extend interaction beyond gaze pointing as seen in related works (e.g. [19, 29, 30, 34]). Their results show that gaze can support interface use with decreased reliance on visual guidance. For large interfaces, this means that touch input combined with gaze could be used differently, i.e. distinguishing whether the user touches where she is or is not looking. As demonstrated in the literature, combining touch with gaze shows great promise, motivating us to explore this area further. Thus, the contributions in this paper aim to extend this existing area of literature.
STUDY 1: CHARACTERISING EYE GAZE ON TABLETOPS
We noted two significant challenges for remote gaze tracking in this setting. First, commercial eye trackers are built for vertical displays, so their underlying algorithms start from this assumption. Second, the gaze inaccuracy increases with distance from the user on a horizontal surface due to the differences in visual angle. Here, we identify a third challenge: when users look down at the edge of the table closest to them, the eyelids cover the pupil, thereby deteriorating the accuracy. Focusing on the second challenge, the further from the user, the wider the estimation error is for a given visual angle (see Figure 2). Using the derived equation below, it is possible to predict how inaccurate (Error) the gaze estimation will be depending on the distance to the user (Distance), the given visual angle (θ), and the height of the eyes (Height).
Error = [ Height × tan(θ + arctan(Distance / Height)) − Distance ] × 100
Figure 2 shows the theoretical error curve at scale for a 5-degree visual angle at a 60 cm eye height on a 1 m surface. To characterise the gaze estimation on the tabletop in practice, we designed a study that builds on the gaze estimation accuracy evaluation by Yamamoto et al. [41] but on a much larger surface, simultaneously investigating the high error rates in the y-axis found in Mauderer et al.'s [17] evaluation. The aim of this study is therefore to empirically characterise the limitations of remote gaze on a large horizontal surface. This was done by measuring, for each ground truth point displayed, the error between it and the estimated gaze point.
Hardware Configuration
As shown in Figure 2, a face-down short-throw projector was used to create a 45-inch (1000 × 563 mm) horizontal display with a resolution of 1920 × 1080 px on the tabletop. A low-cost eye tracker (Tobii EyeX, 30 Hz) was mounted on the table surface at the edge of the projection, facing the user at a 40-50 degree angle and 25 cm away from the short edge of the table. As the eye tracker's specified working distance is 45-80 cm (tracker to eyes), this positioning was suitable and able to accommodate participants of different seated heights; for each participant, the angle of the tracker was adjusted to point towards her eyes from below for better tracking performance.
Study Procedure
Upon arrival, participants signed an informed consent form and completed a simple demographics questionnaire. Participants were asked to sit comfortably in front of the eye tracker, which we first calibrated using its default 9-point procedure. Participants were then asked to look at 63 circular target points (size: 20 px, approx. 1 cm in diameter) displayed sequentially for 4 seconds each. All participants looked at the same sequence, row by row. To make the task more comfortable, each point was presented with a grow/shrink animation.
Results
We recruited 10 participants (7M/3F), aged 25 to 52 years (mean = 32); 6 participants wore glasses, and only 2 had previous experience with eye tracking. For each of the 63 points, we computed the difference between the estimated coordinates of the gaze point and the corresponding coordinates of the ground truth. Figure 3 shows that the error in each direction depends on the position of the ground truth on the tabletop. To elaborate, vertical error refers to the deviation (in pixels) of the estimated gaze coordinates from the ground truth point in terms of the Y coordinate; likewise, horizontal error refers to the deviation in the X coordinate. With the points arranged in 9 rows of 7, Figure 3(a) shows the error for the 9 rows in terms of the y-axis, while Figure 3(b) shows the same in terms of the x-axis.
We found that in the y-axis, the error increases exponentially with the angle between the gaze direction and the tabletop, as predicted by the error estimation equation. However, there is a slight increase in the error at the positions immediately closest to the user, likely due to the eyelids partially covering the pupil. The x-axis presented a much smaller
Figure 3: Horizontal and vertical error. The vertical error increases exponentially with the distance to the user. The horizontal error is much smaller in magnitude and follows a slender U-shaped curve.
estimation error, with a curve following a slender U-shape. This is also expected, as the distance is at its minimum when users are looking straight ahead, increasing as they look towards either side. In any case, this represents consistently low error throughout the x-axis in comparison to the y-axis. We then evaluated how the distance to the user affects the tracking precision. The effect of the Y coordinate of the ground truth points on the mean horizontal and vertical standard deviations of the measurements was tested using Pearson's product-moment correlation. A significant correlation between the Y coordinate and the vertical standard deviation was found (Pearson's r(7) = 0.84, p = 0.0045), but no significant correlation with the horizontal standard deviation (p = 0.22). This means that for each target point further away from the user, the recorded gaze estimation points deviated more from the target point. As the result is highly significant (p < 0.01), it confirms our original prediction that gaze inaccuracy increases with distance in this setting. As mentioned, there is no statistical significance in the x-axis, meaning that the deviations remained consistent throughout and were not affected by changes in horizontal distance, showing that the X coordinate remained fairly accurate and usable for all 63 points.
To confirm this, we conducted a follow-up study in which the eye tracker was moved to the centre of the long edge of the table to obtain a wider surface. We collected gaze data from 10 participants (6M/4F), aged 22 to 52 years (mean = 33.2). Four participants wore glasses and three had no experience with eye trackers. In addition, three participants had taken part in our initial study. Figure 3(c) shows a similar plot to Figure 3(a), showing the consistency of error at the positions closest to the user in both studies. More importantly, Figure 3(d) shows that the tracking error remains consistently low in this configuration; we tested the correlation between the Y coordinate and the horizontal standard deviation and found no significant correlation (p = 0.8955). These results strengthened our confidence in using the X coordinate for interaction on a large tabletop, especially at the end furthest from the user.
In summary, these results show that the gaze estimation at the far end of the tabletop is highly inaccurate and that the accuracy decreases exponentially. However, the results also show that even though the vertical precision deteriorates as the distance to the user increases, the horizontal precision remains the same. For interface design, this suggests that techniques that require high accuracy (e.g. gaze pointing) are not well suited for the far end of the tabletop, but techniques that only require a consistently precise estimate, such as Pursuits [36], can still work if only the horizontal direction is considered. Instead of comparing the absolute positions of the gaze point and the target, Pursuits compares the relative movement of the eyes with moving targets on the screen. In the next section, we describe how Pursuits can be adapted to extend interaction at the far end of the tabletop with gaze. Finally, it appears that there is some tracking deterioration at the edge closer to the user due to eyelid occlusion.
STUDY 2: INTERACTIVE SYSTEM EVALUATION
The results from our first study showed that vertical gaze estimation is considerably affected by the distance to the user, but that the relative movement in the horizontal axis can still be used for interaction, even at the furthest area of the tabletop. Moreover, the findings also showed that the estimation error is at its minimum at the centre of the table, increasing again as it gets closer to the user due to eyelid occlusion. With knowledge of these constraints, we designed a system that segments the tabletop into three distinct regions, each with an interaction technique that overcomes its inherent shortfalls and builds on its opportunities, which we call multimodal segmentation. Further, we draw on guidelines by Shen et al. [27] that consider occlusion and reach, the two overarching problems our work aims to address. The purpose of the system is to evaluate the viability of (1) gaze-only interaction at the far end of the tabletop and (2) interacting with a large surface using multimodal segmentation.
Segmented Regions
In the region closest to the user (Touch Region), the gaze estimation error is high due to eyelid occlusion, but this is not necessarily a problem since the user can still interact with the region using touch. Touch is a well-suited interaction modality for reachable areas on horizontal surfaces; for this reason, only touch was adopted in this region. In the central region of the tabletop (Gaze+Touch Region), touch-based interaction techniques become awkward due to the need to reach out, whereby the user needs to move in order to reach sufficiently far. However, it is the region with the smallest gaze estimation error, meaning that both the X and Y gaze coordinates are at their most accurate and thus usable for gaze pointing in this region. Therefore, the interaction technique used in this region draws from Pfeuffer et al.'s Gaze-touch, where we combine gaze pointing with indirect touch confirmation, i.e. touch is used for manipulation in the region close to the user, while gaze is used for selection beyond the reach of the user [19].
In the area furthest from the user (X-Gaze Region), the user cannot physically reach targets, and the gaze estimation error is high in the vertical axis. However, the still-precise horizontal axis can be taken advantage of by displaying targets that move only in the horizontal direction and correlating their X coordinate with the X coordinate of the eyes. This approach has been used in previous works using 2D movements to enable interaction with public displays [36] and with smart watches [9], using both X and Y coordinates. We adapted the approach by using the X coordinate alone to overcome the limitations of remote gaze estimation at far distances on horizontal surfaces that we identified in our first study.
Smooth Pursuit Eye Movements
Smooth pursuit eye movements have recently been proposed as a solution for contexts where calibration and precise pointing are challenging, such as with public displays, smart watches, and smart homes [9, 35, 36]. The technique works by correlating the smooth movement of the eyes with moving targets on the interface to detect where the user is looking, leveraging the smooth movement our eyes naturally perform when we follow a moving object. It is known that our eyes are naturally drawn towards objects of interest, such as moving objects [13]. The Pursuits technique is suitable for scenarios where gaze tracking is inherently inaccurate, such as on horizontal surfaces, as what matters is not exactly where the user is looking but the movement the eyes make when fixated on a moving object. The details of how the Pursuits algorithm works can be found in Vidal et al. [36].
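As a rough illustration of how such a correlation step could look when restricted to the horizontal axis (as X-Gaze does), consider the sketch below. It is our reconstruction, not the authors' implementation: the window length and threshold take the 2-second and 0.95 values reported later in the Discussion, the channel names are invented, and a Pearson correlation over a sliding window stands in for the Pursuits correlation step [36].

```python
from collections import deque
from scipy.stats import pearsonr

WINDOW = 60       # samples: a 2 s activation window at the tracker's 30 Hz
THRESHOLD = 0.95  # minimum correlation for a selection (the robust setting)

gaze_x = deque(maxlen=WINDOW)
target_x = {name: deque(maxlen=WINDOW) for name in ("news", "music", "sport")}

def on_frame(gaze_sample_x, target_positions):
    """Feed one tracker frame; return the selected tag name, or None."""
    gaze_x.append(gaze_sample_x)
    for name, x in target_positions.items():
        target_x[name].append(x)
    if len(gaze_x) < WINDOW:
        return None                      # wait until the window is full
    best_name, best_r = None, THRESHOLD
    for name, xs in target_x.items():
        r, _ = pearsonr(gaze_x, xs)      # compares relative movement, not position
        if r > best_r:
            best_name, best_r = name, r
    return best_name
```

Because the correlation compares trajectories rather than absolute positions, a constant offset in the gaze estimate (the dominant error at the far end of the table) does not affect the result.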
Here, we note that in the X-Gaze Region, the optimal parameters are not known, as the Pursuits algorithm had not previously been implemented on a horizontal surface, let alone one that is large in size. In theory, a horizontal adaptation of the algorithm is well suited to visual interaction. Collewijn and Tamminga [8] suggested that, due to the extensive practice we get following everyday objects, whose motion tends to be horizontal, our ability to perform horizontal smooth pursuit is likely to be better than vertical smooth pursuit. Similar results are seen in Rottach et al.'s study [22], in which horizontal, vertical, and diagonal smooth pursuit eye movements were compared. Furthermore, existing implementations of Pursuits use gaze as the sole input; this paper presents an example of how Pursuits can be combined as part of a multimodal system.
Application
To illustrate how these regions can be combined into one cohesive application, we built a Video-on-Demand application that allows users to explore an online video library in a multimodal fashion (see Figure 1). In the X-Gaze Region, multiple tags are displayed that correspond to different video channels. These tags move left and right in distinct patterns, and the smooth pursuit correlation algorithm [36] is used to select them. However, not considering the Y coordinate of the eye gaze and tag movements substantially reduces the number of possible trajectories for different targets. For example, in 2D, even if two objects present the same horizontal movement, the selection can be disambiguated by the vertical axis. To compensate for this, the X-Gaze Region is divided into two side-by-side sub-regions (see Figure 4). The absolute X coordinate is used to estimate which sub-region the user is looking at, and moving targets are presented only in that sub-region. The relative movement of the targets is then compared with the relative movement of the eyes to select a specific tag. This way, only one of the sub-regions presents any movement at a time, with movement starting depending on the absolute horizontal coordinate of the gaze point. Moreover, sub-regioning doubles the number of visible tags, presenting more selectable channel options.

Figure 4: X-Gaze technique illustration.
When a tag is selected, the Gaze+Touch Region is populated with videos that satisfy that query. The thumbnails of the videos are displayed along with their titles and summary descriptions. To select a video, the user looks at its thumbnail and touches the Play button displayed in the Touch Region at the very bottom of the table, together with other playback controls. The video is shown in the Touch Region. The user can also use the Next 10 button in the Touch Region to repopulate the Gaze+Touch Region with another 10 videos.
Hardware Configuration
We used the same hardware configuration as in the previous study, but with added touch capabilities using an overhead Kinect mounted on top of the projector (see Figure 5-Right). Touch events were detected using the Ubi Displays Toolkit [11], which matched touch points on the table to the pixels being projected. The eye tracker was moved to the bottom of the Gaze+Touch Region, as there was no longer a need to track the user's eyes in the Touch Region, and touch gestures could have occluded the eye tracker in its previous location. Moving the eye tracker further from the user enlarges the area in which gaze can be used, and therefore the possible interaction area. In the enhanced setup used in this study, a 25% increase in display size (45-inch to 60-inch) was achieved by moving the eye tracker further forward from the user. A larger display means that more content can be displayed at any one time. The placement of the segmented regions in relation to the hardware and user is shown in Figure 5-Right.
Data Collection
The screen was recorded using Open Broadcaster Software (OBS), alongside the recording from a video camera that captured the side view of the user and the interface, as shown in Figure 5-Right. This generated video recordings for analysis, showing the actions of the users synchronised with what they saw on the display. Throughout the tasks, participants were asked to employ the think-aloud technique. These data primarily provide insights into the participants' thinking and whether they understood the system. Both interview and think-aloud data were included as part of the combined video recording, which was transcribed and coded using thematic analysis. Observations were made from a desk placed behind the participant, viewing the actions on the interface through a display clone. Notes were taken during both the tasks and the semi-structured interviews for use in the data analysis. Participants were then asked to complete a post-study questionnaire with a five-point semantic differential scale consisting of 15 dimensions (e.g. "I felt uncomfortable" versus "I felt comfortable") (see Figure 6), focusing on the X-Gaze technique, multimodal segmentation, and the overall user experience, in line with the goals of the evaluation.

Figure 5: Study 2 application and hardware configuration. Left: Highlight selection used in the Gaze+Touch Region, with a reachability demonstration shown. Right: Enhanced setup used, with added touch capabilities.
Study Procedure
Upon arrival, participants were asked to complete the consent form and a simple demographics questionnaire. Participants were then asked to sit comfortably in front of the eye tracker and undergo a standard calibration using the default 9-point procedure. Upon a successful calibration, participants were asked to perform three tasks sequentially. The tasks increased in difficulty, encouraged the participant to use the system as a whole, and were adapted from the functionality (i.e. find similar, remove odd) of a prior study on video searching [28]. In the first task, we explained how the system worked through a live tutorial, guiding participants through the steps before inviting them to explore the video library. Here, participants were encouraged to use all regions on the tabletop to familiarise themselves and to get over the initial learning curve, while we avoided giving away how the interface should be used. This allowed us to observe how participants initially approached the techniques without the complexities of a difficult task. It was followed by a short semi-structured interview to gain insights into participants' impressions, their overall perception of the system, the interaction techniques used in each region, and any difficulties that they had experienced so far.
The following tasks were presented one after another, and participants were asked to stop if they could not find all videos after 5 minutes. In the second task, the participant was asked to find similar videos, requiring the user to jump between the channels while looking for, e.g., cat videos, which are not all in a single channel. A target cat video belongs to the channel but may not be immediately obvious at first. For example, in one channel the thumbnail is a cartoon illustration of a cat instead of a picture, while in another the word "cat" appears in the video title but the thumbnail does not display a cat. In the third task, the videos were shuffled among the channels and the participants were asked to find the videos that did not belong to their respective channels. Among the 10 videos displayed, any number could be ones that did not belong to that channel, encouraging participants to go back and forth between the channels to contrast the videos with the dominant type of video in each channel. The placement of the content in relation to the tasks therefore required participants not only to navigate within a channel but also to switch between different channels, encouraging the use of the X-Gaze technique implicitly rather than explicitly. Once completed, the participants were asked to complete the post-study questionnaire, followed by a second semi-structured interview.

Figure 6: Study 2 post-study questionnaire results. The median score is shown for each dimension.
Results
A total of 13 participants took part in the study (4F/9M), aged between 24 and 38 years (mean = 28.2). However, one participant (P7) was omitted due to an inability to calibrate with the eye tracker's default calibration. All participants had little to no experience with eye tracking. Two participants wore corrective contact lenses and one wore glasses. All participants either completed all tasks within 5 minutes or gave up by asking where the remaining videos were located. Figure 6 shows the results of the post-study questionnaire, in which participants rated each dimension from 1 (strongly unfavourable) to 5 (strongly favourable). The semi-structured interviews provided further insight into these scores.
The X-Gaze technique in the distant region was quickly highlighted, with 10 out of 12 participants reporting that the technique, being minimal, was easy to learn, and that it felt natural enough that there was little room for improvement (e.g. "I looked at things and it worked right away" [P8]). This corresponds with the low improvement scores but high learnability scores in the questionnaire. However, three participants (P4, P9, P13) reported that they felt the technique was somewhat slow. Overall, participants praised the novelty of the technique, along with the ability to accurately select distant targets solely with gaze, despite the technique being unusual. For example, P3 mentioned: "You follow something with your eyes and it then selects. There is no real-world equivalent. It's easy and it's effective, but it doesn't feel like something I would normally do in my life. Though, it was intuitive, so I wasn't thinking about it too much."
In the Gaze+Touch Region, we chose not to use a gaze pointer (or cursor); instead, the tiles were highlighted (blue) when selected by dwell time, as shown in Figure 5-Left. When the gaze estimate was accurate, this provided effective, subtle feedback, but when the tracking was particularly inaccurate, it created a flickering nuisance, as noted in observation and confirmed in our video analysis. This is despite the centre of the screen being the area of the tabletop where gaze is most accurate, showing that gaze tracking in this configuration is still substantially less accurate than on vertical screens. This was expected to some extent and was compensated for by the use of large targets for the video thumbnails. However, half the participants still reported that the targets at the edges of the screen were more difficult to select than the ones in the centre. A subset of these participants (P3, P8, P12) also reported that it was sometimes harder to select videos on the bottom row (closer to the user) than on the top row (further from the user). This difficulty in selection led to the low gaze accuracy score. On the other hand, participants P4 and P5 mentioned that when reading the video titles, the button would highlight suddenly (due to delayed dwell-time activation), causing "a shock to the eyes" and contributing to eye fatigue. Further, P4 mentioned: "When I wanted to look at something, I just wanted to look, but when I wanted to select something, I wanted it to be fast." This reflects the Midas Touch problem, an inherent and expected problem when using the eye gaze modality for interaction [13].
In contrast, a few participants (P3, P4, P12) explicitly noted the ability to select beyond their physical reach when interacting with out-of-reach areas, which reflects our motivation to use eye gaze to extend reachability on large tabletops. Their positive responses are as follows:

P3: "It made sense that you can touch things in front of you and everything else [demonstrated the use of eye gaze by selecting objects that were out of reach]. It's pretty unique to be able to select stuff that is so far away."

P4: "I think it works when it's a big screen like that. Like you don't want to go all the way and touch it."

P12: "When it's here [shows touch area], it's easy to use your hands, but when it's over there [points at out-of-reach area], it's hard, so yeah. It's actually very good. I really like it, it's easy and I don't want to be [participant stretches hand out]."
Overall, participants enjoyed using the system, citing the seamless transition between the regions, and rated it highly in all dimensions, as shown in Figure 6. In the next section, we discuss the issues that arose from our evaluation, followed by the implications of our solutions.
DISCUSSION
This paper demonstrates the immediate opportunities for enabling gaze interaction on large horizontal surfaces. The findings from the first study characterised the use of remote gaze on large interactive surfaces, in this case a large tabletop. This empirical investigation showed the different accuracies in different parts of the tabletop: the highest accuracy was found in the centre, while at the far end of the tabletop accuracy remained only along the horizontal axis. This led to multimodal segmentation, the formation of regions, with each region employing an interaction technique in accordance with its strengths and limitations. Subsequently, the characterisation informed the development of X-Gaze, our novel interaction technique that enables interaction on a large tabletop where the vertical axis is unusable, especially at the far end. In the second study, both multimodal segmentation and X-Gaze were evaluated through a Video-on-Demand application that illustrates how these regions can be used together in a cohesive interface on a large interactive tabletop. The evaluation showed positive results, yielding many interesting insights into how the system can be developed in the future. In this section, we draw attention to both solutions and discuss their implications in light of our findings.
X-Gaze: Findings and Opportunities
We presented the design, development, and evaluation of X-Gaze, a novel interaction technique we developed in direct response to the need to interact with distant targets using only X coordinates. In our evaluation, participants responded favourably to the technique: its low overhead made it minimal and therefore easy to learn, with little room for improvement, and it enabled the selection of out-of-reach targets. This highlights that even techniques that look unusual at first can still provide an effective means of input to interactive systems.
Some participants found that using the technique felt slower after adapting to it, especially after completing the first task. This is a common effect in eyes-only techniques that require the user to follow a target or to dwell on it for a certain amount of time, due to a trade-off between the system's responsiveness and its robustness to errors [15]. Another possibility is that participants transitioned to being expert users after successfully using the technique during the first task, and that these expert users then felt slow due to the thresholds in place to prevent false activations. Our implementation privileged robustness, using a 2-second activation window and a 0.95 correlation threshold, substantially higher than previous works [9, 36], which came at the cost of responsiveness. Future implementations could consider giving participants control over the activation window time, reducing it as users become more adept at using the technique.
The Pursuits technique [36], from which X-Gaze was adapted, relies on interfaces that are highly dynamic, i.e. interfaces with objects of interest moving with different trajectories and speeds. The authors referred to this as a potential limitation, as the constant movement might be a source of confusion or fatigue for users over longer periods of use. X-Gaze provides a possible solution to this by leveraging gaze-aware regions, which is possible with large surfaces because the general gaze direction and point-of-regard can be estimated. Only when a gaze-aware region is activated do its objects start to move, which further provides the user with feedback that gaze has been detected in that region. The application used in our evaluation had two gaze-aware sub-regions, a number that could easily be increased.
Furthermore, Vidal et al. [36] state that their technique is potentially unsuitable for objects that contain more than a short segment of text, as it may be difficult to read and follow moving text at the same time, and that objects that move too slowly may also cause poor performance. In our implementation, we demonstrated that text could be read, as it moves in only one direction and at a slow speed. As the coordinates of only one axis are used, the slow speed of the objects does not harm performance; rather, it provides users with more control. As mentioned, a high correlation threshold could be sustained, which is likely attributable to the fact that humans perform horizontal smooth pursuit better than vertical smooth pursuit [8, 22]. X-Gaze thus addresses some limitations of Pursuits, particularly in situations where gaze accuracy becomes a problem with distance and, consequently, where reachability becomes a problem too. We note that reachability is an issue when interacting with vertical surfaces as well [10], which presents opportunities for exploring the use of X-Gaze on large vertical displays, especially in public settings.
Multimodal Segmentation: A Viable Strategy
The informed decision to divide the tabletop into segmented regions can be viewed as a divide-and-conquer strategy, in which an interaction technique was employed to conquer each segment in accordance with its strengths. Our Video-on-Demand application illustrated how these different regions can work together cohesively. This cohesion was largely achieved by the use of a flat hierarchical structure that formed a natural division between the levels. It was also made possible by having at least three levels, each progressively further from the user. These levels are visible at all times, afforded by the large display, which encouraged the serendipitous discovery of content. Moreover, this visibility, together with the techniques employed in each region, allowed the user to quickly jump between the different channels, meaning that the user does not need to traverse up and down the hierarchy. For instance, while watching a video in the region closest to the user, the user can select another channel and scan for the next video while the current video keeps playing. The seamless transition between the regions builds on the use of a familiar drill-down navigational structure commonly seen in interface design. This allowed users to bring the desired content within their reach as their interaction moved from the furthest region to the region closest to them. The two highest median scores in the questionnaire were given to content and organisation, which were well received by participants.
However, the evaluation encountered some difficulties with gaze tracking, especially in the Gaze+Touch Region (centre of the tabletop). The deterioration in accuracy reflects the curve in Figure 3(b), which shows how the accuracy deteriorates towards the edges. Likewise, some participants encountered difficulty in selecting the videos closer to them, most likely due to the error of 200 to 400 pixels in the vertical axis shown in Figure 3(a), where there is some degree of tracking deterioration due to eyelid occlusion.
Focusing on the interaction techniques employed in the regions, this work presents the first instance in which two types of gaze interaction were combined in the same interface, i.e. gaze pointing and a natural gaze behaviour (smooth pursuit eye movement). Previous works that employ the former typically relied on accurate tracking, while those that employ the latter were used in scenarios where accurate pointing was challenging. In this study, both types were identified through the characterisation and the intended use of a large surface, leading to the employment of both. In the evaluation, participants had no issues switching between the two types of interaction, and were observed to do so somewhat unconsciously. This demonstrates that there are opportunities in combining both types of interaction. Lastly, humans naturally divide spaces on computers, such as by grouping windows and icons on desktop computers. On tabletops, whether digital or non-digital, it has been shown that there is a natural segmentation, which we aim to investigate as part of our future work and discuss further in the next section.

Figure 7: Regions in relation to the Theory of Tabletop Territoriality by Scott et al. [24] in a multi-user setting.
EXTENSIONS TO MULTIUSER TABLETOP SCENARIOS
Previous studies have articulated numerous advantages of using tabletops for collaborative work. Tables in general provide a large and natural interface for supporting human-to-human interaction, with characteristics that afford the gathering of people for face-to-face communication around a shared surface [26]. However, despite its advantages for subtle communication, sharing the surface has certain drawbacks related to both space and access. Ryall et al. [23] observed that the actions of people in this setting often conflict with one another, both intentionally and accidentally. On occasion, users want to use the whole table, but sometimes privacy becomes an issue when users want to interact with displayed elements without sharing them with other users, which raises the problem of undesired access in specific situations [21]. Therefore, multiuser tabletops should allow users to protect their data against undesired access.
A potential solution is to increase the size of the interactive surface and divide the tabletop interface into territories in accordance with the Theory of Tabletop Territoriality [24]. The authors observed three distinct territories on the table, namely personal, storage, and group (shared). This division limits other users from reaching into the personal space of another user. The personal territory is typically determined by the user's reach, while the storage territory is within their extended reach [31]. However, placing and reaching for objects outside one's personal and storage territories (e.g. in the group territory) then becomes a problem. Likewise, placing or obtaining an object in another user's personal territory, even with agreement from the owner, becomes a problem. Currently proposed techniques such as I-Grabber [1] allow the user to seamlessly interact with both out-of-reach territories from the user's current location without blocking the territory of another user. This once again raises the problem of undesired access, suggesting that it might be useful to implement authorisation and privacy protocols to prevent distant users from seeing or manipulating one's workspace without permission [38].

Figure 8: Potential application of X-Gaze in a multi-user setting. A user can initiate the movement while the other user selects using X-Gaze.
Our contributions present immediate opportunities to balance the issues of private personal territory, reachability, and having a form of authorisation protocol for seeking permission to access content from another user. Coincidentally, the regions defined in this work echo the territories of the tabletop presented by Scott et al. [24] (see Figure 7). In relation to these regions, the X-Gaze technique can be used to support multi-user sessions both for passing objects and as a method of authorisation. In their personal and immediate storage territories, users can interact using touch gestures, which are well suited for private tasks such as reading, writing, and annotating. In the group territory, users interact using the combination of gaze and touch. This not only makes users aware of each other's attention but allows them to directly interact with public content at a distance. Users can also use combinations of gaze and indirect touch gestures, as exemplified in Gaze+RST [34], to move items back and forth between this space and their personal space. Finally, X-Gaze elegantly supports protecting a user's personal territory from interference by other users [33]. For the user on the other side of the table to select an item, that item must move. If the owner of the personal space authorises the other user to interact with that content, she can move the object from side to side a few times. If the other user follows it with her eyes, she is able to select that object, which can then be transferred to her personal territory. To avoid unwanted selections, the user moving the object can casually gaze into the other user's personal territory, which also naturally facilitates feedback, an important function in human-computer interaction. For instance, the user passing the object will anticipate the object reappearing in the other user's personal territory, and when this occurs it provides the user with closure that the interaction has been successful.
Consequently, this form of authorisation presents a form of social contract and can be expanded to facilitate simultaneous exchange between two users, which is best demonstrated by way of example. Take Monopoly, and picture a digitised version on a large interactive tabletop with two users sitting opposite one another. The group space displays the board, while each user holds virtual deed cards to their properties and their virtual money notes in their personal territories. In the scenario where one user purchases a property from another, one user can move the virtual deed card while the other moves the money notes. Both users consequently gaze into the personal territory of one another, forming a social contract in which both users authorise one another simultaneously upon agreed terms, thereby facilitating a natural exchange. Without neglecting the opportunities in the group territory, it is possible that gaze awareness can be employed here, which could change the collaborative social experience in this setting. Tse et al. [32] mention that monitoring the gaze of others lets us know where they are looking and where attention is directed. More importantly, gaze awareness happens easily and naturally in co-located tabletop settings, as users can easily gauge what another user is gazing at, in addition to gaze being a great indicator of attention. Therefore, making gaze explicitly visible on a shared surface has some interesting connotations. For example, when solving a large jigsaw puzzle collaboratively, one user can say to another, "Can you get me that piece?". The other user will know which piece was referred to simply by looking where the first user is looking on the surface, serving as an implicit pointing method. Alternatively, it can be used competitively, for example in a game of chess, where one user who is aware of the other user's intentions through their visible gaze may infer whether they will change their strategy or try to trick their opponent. Nevertheless, we hope to further explore how a multi-user gaze-enabled tabletop can support collaborative tasks using our interaction architecture.
CONCLUSION
This paper contributes a first step towards enabling gaze interaction on large tabletops; for the first time, gaze-based interaction has been used effectively over a large distance on a horizontal surface (1 m²). This was achieved by first identifying, in our first study, that gaze along the x-axis (parallel to the user) remained accurate enough to be usable, unlike along the y-axis (perpendicular to the user). We highlight this first-time characterisation as a key contribution of this paper.
This informed the development of two solutions, which we evaluated in a second study. First, X-Gaze, a novel gaze-based interaction technique that leverages natural gaze behaviour, was developed to enable gaze-only interaction at the far end of the tabletop. It is important to note that there are natural limitations to gaze: even maturing enabling technology will not change the way our eyes work, and it is crucial that gaze-based techniques adhere to this fact. Consequently, novel techniques such as X-Gaze show great potential, for example in enabling users to select out-of-reach objects with low overhead and high precision. Second, we demonstrated how multimodal segmentation and X-Gaze can be incorporated into an interface design through a Video-on-Demand application, which we evaluated in a user study. Our findings showed that participants overall enjoyed using the system thanks to the seamless transition between the regions and, more importantly, showed that our solutions can be used in practice. They thereby address two specific problems with respect to remote eye tracking: (1) the inaccuracy of eye tracking on horizontal surfaces at long distances (i.e. beyond physical reach) and (2) the problem of eyelid occlusion at short distances. In future work, we will expand the solutions formed in this paper to support collaborative multi-user environments.
REFERENCES
1. Abednego, M., Lee, J.-H., Moon, W., and Park, J.-H. I-Grabber: Expanding physical reach in a large-display tabletop environment through the use of a virtual grabber. In Proc. of ITS '09, ACM (2009), 61–64.
2. Ardito, C., Buono, P., Costabile, M. F., and Desolda, G. Interaction with large displays: A survey. ACM Comput. Surv. 47, 3 (Feb. 2015), 46:1–46:38.
3. Bader, T., Vogelgesang, M., and Klaus, E. Multimodal integration of natural gaze behavior for intention recognition during object manipulation. In Proc. of the 2009 Int. Conf. on Multimodal Interfaces, ICMI-MLMI '09, ACM (2009), 199–206.
4. Banerjee, A., Burstyn, J., Girouard, A., and Vertegaal, R. Pointable: An in-air pointing technique to manipulate out-of-reach targets on tabletops. In Proc. of ITS '11, ACM (2011), 11–20.
5. Bartindale, T., Harrison, C., Olivier, P., and Hudson, S. E. SurfaceMouse: Supplementing multi-touch interaction with a virtual mouse. In Proc. of TEI '11, ACM (2011), 293–296.
6. Benko, H., Morris, M. R., Brush, A. B., and Wilson, A. D. Insights on interactive tabletops: A survey of researchers and developers. Tech. Rep. MSR-TR-2009-22, March 2009.
7. Bezerianos, A., and Balakrishnan, R. The Vacuum: Facilitating the manipulation of distant objects. In Proc. of CHI '05, ACM (2005), 361–370.
8. Collewijn, H., and Tamminga, E. P. Human smooth and saccadic eye movements during voluntary pursuit of different target motions on different backgrounds. The Journal of Physiology 351, 1 (1984), 217–250.
9. Esteves, A., Velloso, E., Bulling, A., and Gellersen, H. Orbits: Gaze interaction for smart watches using smooth pursuit eye movements. In Proc. of UIST '15, ACM (2015), 457–466.
10. Geller, T. Interactive tabletop exhibits in museums and galleries. Computer Graphics and Applications, IEEE 26, 5 (Sept 2006), 6–11.
11. Hardy, J., and Alexander, J. Toolkit support for interactive projected displays. In Proc. of MUM '12, ACM (2012), 42:1–42:10.
12. Holman, D. Gazetop: Interaction techniques for gaze-aware tabletops. In CHI '07 Extended Abstracts on Human Factors in Computing Systems, CHI EA '07, ACM (2007), 1657–1660.
13. Jacob, R. J. K. What you look at is what you get: Eye movement-based interaction techniques. In Proc. of CHI '90, ACM (1990), 11–18.
14. Lander, C., Gehring, S., Krüger, A., Boring, S., and Bulling, A. GazeProjector: Accurate gaze estimation and seamless gaze interaction across multiple displays. In Proc. of UIST '15 (2015), 395–404.
15. Majaranta, P. Communication and Text Entry by Gaze. IGI Global, 2012.
16. Marquardt, N., Jota, R., Greenberg, S., and Jorge, J. A. The continuous interaction space: Interaction techniques unifying touch and gesture on and above a digital surface. In Proc. of INTERACT '11, Springer-Verlag (2011), 461–476.
17. Mauderer, M., Daiber, F., and Krüger, A. Combining touch and gaze for distant selection in a tabletop setting. In CHI 2013: Workshop on Gaze Interaction in the Post-WIMP World (2013).
18. Parker, J. K., Mandryk, R. L., and Inkpen, K. M. Integrating point and touch for interaction with digital tabletop displays. Computer Graphics and Applications, IEEE 26, 5 (Sept 2006), 28–35.
19. Pfeuffer, K., Alexander, J., Chong, M. K., and Gellersen, H. Gaze-touch: Combining gaze with multi-touch for interaction on the same surface. In Proc. of UIST '14, ACM (2014), 509–518.
20. Reetz, A., Gutwin, C., Stach, T., Nacenta, M., and Subramanian, S. Superflick: A natural and efficient technique for long-distance object placement on digital tables. In Proc. of GI '06, Canadian Information Processing Society (2006), 163–170.
21. Remy, C., Weiss, M., Ziefle, M., and Borchers, J. A pattern language for interactive tabletops in collaborative workspaces. In Proc. of EuroPLoP '10, ACM (2010), 9:1–9:48.
22. Rottach, K. G., Zivotofsky, A. Z., Das, V. E., Averbuch-Heller, L., Discenna, A. O., Poonyathalang, A., and Leigh, R. Comparison of horizontal, vertical and diagonal smooth pursuit eye movements in normal human subjects. Vision Research 36, 14 (1996), 2189–2195.
23. Ryall, K., Forlines, C., Shen, C., Morris, M. R., and Everitt, K. Experiences with and observations of direct-touch tabletops. In Proc. of TABLETOP '06, IEEE Computer Society (2006), 89–96.
24. Scott, S. D., Carpendale, S., and Inkpen, K. M. Territoriality in collaborative tabletop workspaces. In Proc. of CSCW '04, ACM (2004), 294–303.
25. Serim, B., and Jacucci, G. Pointing while looking elsewhere: Designing for varying degrees of visual guidance during manual input. In Proc. of CHI '16, ACM (2016), 5789–5800.
26. Shen, C. From clicks to touches: Enabling face-to-face shared social interface on multi-touch tabletops. In Proc. of OCSC '07, Springer-Verlag (2007), 169–175.
27. Shen, C., Ryall, K., Forlines, C., Esenther, A., Vernier, F. D., Everitt, K., Wu, M., Wigdor, D., Morris, M. R., Hancock, M., and Tse, E. Informing the design of direct-touch tabletops. IEEE Comput. Graph. Appl. 26, 5 (Sept. 2006), 36–46.
28. Smeaton, A. F., Lee, H., Foley, C., and McGivney, S. Collaborative video searching on a tabletop. Multimedia Syst. 12, 4-5 (Mar. 2007), 375–391.
29. Stellmach, S., and Dachselt, R. Look & touch: Gaze-supported target acquisition. In Proc. of CHI '12, ACM (2012), 2981–2990.
30. Stellmach, S., and Dachselt, R. Still looking: Investigating seamless gaze-supported selection, positioning, and manipulation of distant targets. In Proc. of CHI '13, ACM (2013), 285–294.
31. Toney, A., and Thomas, B. H. Considering reach in tangible and table top design. In Proc. of TABLETOP '06, IEEE (2006), 2 pp.
32. Tse, E., Greenberg, S., Shen, C., and Forlines, C. Multimodal multiplayer tabletop gaming. Computers in Entertainment (CIE) 5, 2 (Apr. 2007).
33. Tse, E., Histon, J., Scott, S. D., and Greenberg, S. Avoiding interference: How people use spatial separation and partitioning in SDG workspaces. In Proc. of CSCW '04, ACM (2004), 252–261.
34. Turner, J., Alexander, J., Bulling, A., and Gellersen, H. Gaze+RST: Integrating gaze and multitouch for remote rotate-scale-translate tasks. In Proc. of CHI '15, ACM (2015), 4179–4188.
35. Velloso, E., Wirth, M., Weichel, C., Esteves, A., and Gellersen, H. AmbiGaze: Direct control of ambient devices by gaze. In Proc. of DIS '16, ACM (2016), 812–817.
36. Vidal, M., Bulling, A., and Gellersen, H. Pursuits: Spontaneous interaction with displays based on smooth pursuit eye movement and moving targets. In Proc. of UbiComp '13, ACM (2013), 439–448.
37. Voelker, S., Matviienko, A., Schöning, J., and Borchers, J. Combining direct and indirect touch input for interactive workspaces using gaze input. In Proc. of SUI '15, ACM (2015), 79–88.
38. Voelker, S., Weiss, M., Wacharamanotham, C., and Borchers, J. Dynamic portals: A lightweight metaphor for fast object transfer on interactive surfaces. In Proc. of ITS '11, ACM (2011), 158–161.
39. Wigdor, D., Jiang, H., Forlines, C., Borkin, M., and Shen, C. WeSpace: The design development and deployment of a walk-up and share multi-surface visual collaboration system. In Proc. of CHI '09, ACM (2009), 1237–1246.
40. Yamamoto, M., Komeda, M., Nagamatsu, T., and Watanabe, T. Development of eye-tracking tabletop interface for media art works. In Proc. of ITS '10, ACM (2010), 295–296.
41. Yamamoto, M., Komeda, M., Nagamatsu, T., and Watanabe, T. Hyakunin-Eyesshu: A tabletop hyakunin-isshu game with computer opponent by the action prediction based on gaze detection. In Proc. of NGCA '11, ACM (2011), 5:1–5:4.
42. Zhai, S., Morimoto, C., and Ihde, S. Manual and gaze input cascaded (MAGIC) pointing. In Proc. of CHI '99, ACM (1999), 246–253.