Finding Time Together: Detection and Classification of Focused Interaction in Egocentric Video

Sophia Bano, Jianguo Zhang, Stephen J. McKenna
{s.bano, j.n.zhang, s.j.z.mckenna}@dundee.ac.uk
Computer Vision and Image Processing Group, Computing, School of Science and Engineering, University of Dundee, United Kingdom
International Conference on Computer Vision (ICCV) 2017

1. Introduction

Focused Interaction (FI)
• Co-present individuals with a mutual focus of attention interact by establishing face-to-face engagement and direct conversation [1]

Hypothesis
• Fusing multimodal features improves overall FI detection

Challenges
• Face-to-face engagement is often not maintained
• The conversational partner is not always present in the video frame
• Varying illumination
• Varied scenes

Examples from our Focused Interaction Dataset

Existing methods
• Off-line processing of video clips or photo streams captured under quite constrained conditions, with interacting people always in view [2, 3, 4]

2. Focused Interaction Detection using Multimodal Features

3. Evaluation

Focused Interaction Dataset
• 19 egocentric videos (378 minutes) captured using a shoulder-mounted GoPro Hero4 at 18 different locations and with 16 conversational partners

Observations
• Fusing multimodal features is useful for discriminating no FI from FI (walk) when using an SVM with RBF kernel
• Face-track and VAD scores are significant for discriminating FI (non-walk)

Limitations
• Sound from the nearby surroundings influenced the VAD
• Low-illumination scenarios affected the face tracker

HOG: Histogram of Oriented Gradients
KLT: Kanade-Lucas-Tomasi tracker
HOOF: Histogram of Oriented Optical Flow [5]
VAD: Voice Activity Detection [6]

References
[1] E. Goffman. Encounters: Two Studies in the Sociology of Interaction. Bobbs-Merrill, 1961.
[2] M. Aghaei, M. Dimiccoli, P. Radeva. With whom do I interact? Detecting social interactions in egocentric photostreams. IEEE ICPR, 2016.
[3] S. Alletto, G. Serra, S. Calderara, F.
Solera, R. Cucchiara. From ego to nos-vision: Detecting social relationships in first-person views. IEEE CVPRW, 2014.
[4] A. Fathi, J. K. Hodgins, J. M. Rehg. Social interactions: A first-person perspective. IEEE CVPR, 2012.
[5] R. Chaudhry, A. Ravichandran, G. Hager, R. Vidal. Histograms of oriented optical flow and Binet-Cauchy kernels on nonlinear dynamical systems for the recognition of human actions. IEEE CVPR, 2009.
[6] M. Van Segbroeck, A. Tsiartas, S. Narayanan. A robust frontend for VAD: exploiting contextual, discriminative and spectral cues of human voice. INTERSPEECH, 2013.

4. Conclusion and Future Work
• Automatic online classification of Focused Interaction in continuous egocentric videos
• Multimodal features: face track, VAD and camera-motion profile
• Best performance with multimodal feature fusion and an SVM with RBF kernel
• Future work: use recurrent neural networks for classification and extend this work to identify conversational partners

Acknowledgements
This work is supported by the UK Engineering and Physical Sciences Research Council (EPSRC) under grant EP/N014278/1: ACE-LP: Augmenting Communication using Environmental Data to drive Language Prediction.
Figure captions
• Examples from the dataset: FI in which the conversational partners are in the camera's field of view; FI in which the conversational partner is no longer in the field of view because the interaction occurred while walking; an outdoor night-time FI scenario with weak visual cues due to low illumination.
• Camera setup: shoulder-mounted GoPro.
• Pipeline (Section 2): the input video is split into video and audio streams; face detection (HOG) and tracking (KLT) scores and a camera-motion feature (HOOF) are extracted from the video stream, and an audio-based feature (VAD) from the audio stream; the features are concatenated, temporally windowed, and classified with an SVM (linear/RBF kernel) into No FI, FI-NW (non-walk) and FI-W (walk).
• Multimodal feature timelines over a 200 s sequence: HOOF (bins), VAD scores and face-tracker score.
• Results: classification timelines for linear-kernel and RBF-kernel SVMs on a sequence containing computer work, FI-NW, searching for documents, and walk-turn around-walk segments.

https://ace-lp.ac.uk/
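The camera-motion feature in the pipeline is HOOF [5]: optical-flow vectors vote into orientation bins weighted by magnitude, and the histogram is L1-normalised. A minimal sketch of this idea (simplified: the original HOOF binning is additionally symmetric about the vertical axis, which this sketch omits; the bin count and toy flow field are illustrative):

```python
import numpy as np

def hoof(flow, n_bins=32):
    """Simplified Histogram of Oriented Optical Flow.

    flow: (H, W, 2) array of per-pixel (dx, dy) flow vectors.
    Each vector votes into an orientation bin, weighted by its
    magnitude; the histogram is L1-normalised so it is invariant
    to the number of moving pixels.
    """
    dx, dy = flow[..., 0].ravel(), flow[..., 1].ravel()
    mag = np.hypot(dx, dy)
    ang = np.arctan2(dy, dx)  # orientation in [-pi, pi]
    bins = np.linspace(-np.pi, np.pi, n_bins + 1)
    idx = np.clip(np.digitize(ang, bins) - 1, 0, n_bins - 1)
    hist = np.bincount(idx, weights=mag, minlength=n_bins)
    total = hist.sum()
    return hist / total if total > 0 else hist

# Toy flow field: every pixel moving roughly rightwards,
# so the mass concentrates in one orientation bin.
flow = np.zeros((4, 4, 2))
flow[..., 0], flow[..., 1] = 1.0, 0.1
h = hoof(flow)
```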
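The pipeline in Section 2 concatenates the face-track, VAD and camera-motion features, aggregates them over temporal windows, and classifies with an SVM. A runnable sketch on synthetic data (feature dimensions, window length, step size and class means are illustrative assumptions, not values from the paper):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def temporal_windows(feats, win=10, step=5):
    # Aggregate per-frame feature vectors over sliding temporal windows
    return np.array([feats[s:s + win].mean(axis=0)
                     for s in range(0, len(feats) - win + 1, step)])

# Toy per-frame features for the three classes (0 = No FI,
# 1 = FI non-walk, 2 = FI walk); 34-D = hypothetical 32-D HOOF
# + VAD score + face-tracker score, fused by concatenation.
rng = np.random.default_rng(0)
segments, labels = [], []
for cls, mean in enumerate([0.0, 1.0, 2.0]):
    frames = rng.normal(mean, 0.3, size=(120, 34))
    windows = temporal_windows(frames)
    segments.append(windows)
    labels += [cls] * len(windows)

X = np.vstack(segments)
y = np.array(labels)

# RBF-kernel SVM, the best-performing classifier reported on the poster
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(X, y)
```

Standardising before the RBF kernel matters here because the concatenated modalities live on very different scales (e.g. tracker scores vs. normalised HOOF bins).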