Finding Time Together: Detection and Classification of Focused Interaction in Egocentric Video

Sophia Bano, Jianguo Zhang, Stephen J. McKenna
{s.bano, j.n.zhang, s.j.z.mckenna}@dundee.ac.uk
Computer Vision and Image Processing Group, Computing, School of Science and Engineering, University of Dundee, United Kingdom
International Conference on Computer Vision (ICCV) 2017

1. Introduction

Focused Interaction (FI)
• Co-present individuals with a mutual focus of attention interact by establishing face-to-face engagement and direct conversation [1]

Hypothesis
• Fusing multimodal features improves overall FI detection

Challenges
• Face-to-face engagement is often not maintained
• The conversational partner is not always present in the video frame
• Varying illumination
• Varied scenes

Examples from our Focused Interaction Dataset

Existing methods
• Off-line processing of video clips or photo streams captured under quite constrained conditions, with interacting people always in view [2, 3, 4]

2. Focused Interaction Detection using Multimodal Features

3. Evaluation

Focused Interaction Dataset
• 19 egocentric videos (378 minutes) captured using a shoulder-mounted GoPro Hero4 at 18 different locations and with 16 conversational partners

Observations
• Fusing multimodal features is useful for discriminating no FI from FI (walk) when using an SVM with RBF kernel
• Face-track and VAD scores are significant for discriminating FI (non-walk)

Limitations
• Sound from the nearby surroundings influenced the VAD
• Low-illumination scenarios affected the face tracker

HOG: Histogram of Oriented Gradients
KLT: Kanade-Lucas-Tomasi tracker
HOOF: Histogram of Oriented Optical Flow [5]
VAD: Voice Activity Detection [6]

References
[1] E. Goffman. Encounters: Two Studies in the Sociology of Interaction. Bobbs-Merrill, 1961.
[2] M. Aghaei, M. Dimiccoli, P. Radeva. With whom do I interact? Detecting social interactions in egocentric photostreams. IEEE ICPR, 2016.
[3] S. Alletto, G. Serra, S. Calderara, F.
Solera, R. Cucchiara. From ego to nos-vision: Detecting social relationships in first-person views. IEEE CVPRW, 2014.
[4] A. Fathi, J. K. Hodgins, J. M. Rehg. Social interactions: A first-person perspective. IEEE CVPR, 2012.
[5] R. Chaudhry, A. Ravichandran, G. Hager, R. Vidal. Histograms of oriented optical flow and Binet-Cauchy kernels on nonlinear dynamical systems for the recognition of human actions. IEEE CVPR, 2009.
[6] M. Van Segbroeck, A. Tsiartas, S. Narayanan. A robust frontend for VAD: exploiting contextual, discriminative and spectral cues of human voice. INTERSPEECH, 2013.

4. Conclusion and Future Work
• Automatic online classification of Focused Interaction in continuous egocentric videos
• Multimodal features: face track, VAD and camera-motion profile
• Best performance with multimodal feature fusion and an SVM with RBF kernel
• Future work: use recurrent neural networks for classification and extend this work to identify conversational partners

Acknowledgements
This work is supported by the UK Engineering and Physical Sciences Research Council (EPSRC) under grant EP/N014278/1: ACE-LP: Augmenting Communication using Environmental Data to drive Language Prediction.
Figure captions
• Examples from the dataset: FI in which the conversational partners are in the camera's field of view; FI in which the conversational partner is no longer in the field of view because the interaction occurred while walking; an outdoor night-time FI scenario with weak visual cues due to low illumination.
• Camera setup: shoulder-mounted GoPro.
• Pipeline (Section 2): the input video is split into video and audio streams; face detection (HOG) and tracking (KLT) scores and a camera-motion feature (HOOF) are extracted from the video stream, and an audio-based feature (VAD) from the audio stream; the features are concatenated, temporally windowed, and classified with an SVM (linear/RBF kernel) into No FI, FI-NW (non-walk) and FI-W (walk).
• Multimodal feature timelines over a 200 s sequence: HOOF (bins), VAD scores and face-tracker score.
• Results: classification timelines for linear-kernel and RBF-kernel SVMs on a sequence containing computer work, FI-NW, searching for documents, and walk-turn around-walk segments.

https://ace-lp.ac.uk/
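The camera-motion feature in the pipeline is HOOF [5]: optical-flow vectors vote into orientation bins weighted by magnitude, and the histogram is L1-normalised. A minimal sketch of this idea (simplified: the original HOOF binning is additionally symmetric about the vertical axis, which this sketch omits; the bin count and toy flow field are illustrative):

```python
import numpy as np

def hoof(flow, n_bins=32):
    """Simplified Histogram of Oriented Optical Flow.

    flow: (H, W, 2) array of per-pixel (dx, dy) flow vectors.
    Each vector votes into an orientation bin, weighted by its
    magnitude; the histogram is L1-normalised so it is invariant
    to the number of moving pixels.
    """
    dx, dy = flow[..., 0].ravel(), flow[..., 1].ravel()
    mag = np.hypot(dx, dy)
    ang = np.arctan2(dy, dx)  # orientation in [-pi, pi]
    bins = np.linspace(-np.pi, np.pi, n_bins + 1)
    idx = np.clip(np.digitize(ang, bins) - 1, 0, n_bins - 1)
    hist = np.bincount(idx, weights=mag, minlength=n_bins)
    total = hist.sum()
    return hist / total if total > 0 else hist

# Toy flow field: every pixel moving roughly rightwards,
# so the mass concentrates in one orientation bin.
flow = np.zeros((4, 4, 2))
flow[..., 0], flow[..., 1] = 1.0, 0.1
h = hoof(flow)
```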
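The pipeline in Section 2 concatenates the face-track, VAD and camera-motion features, aggregates them over temporal windows, and classifies with an SVM. A runnable sketch on synthetic data (feature dimensions, window length, step size and class means are illustrative assumptions, not values from the paper):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def temporal_windows(feats, win=10, step=5):
    # Aggregate per-frame feature vectors over sliding temporal windows
    return np.array([feats[s:s + win].mean(axis=0)
                     for s in range(0, len(feats) - win + 1, step)])

# Toy per-frame features for the three classes (0 = No FI,
# 1 = FI non-walk, 2 = FI walk); 34-D = hypothetical 32-D HOOF
# + VAD score + face-tracker score, fused by concatenation.
rng = np.random.default_rng(0)
segments, labels = [], []
for cls, mean in enumerate([0.0, 1.0, 2.0]):
    frames = rng.normal(mean, 0.3, size=(120, 34))
    windows = temporal_windows(frames)
    segments.append(windows)
    labels += [cls] * len(windows)

X = np.vstack(segments)
y = np.array(labels)

# RBF-kernel SVM, the best-performing classifier reported on the poster
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(X, y)
```

Standardising before the RBF kernel matters here because the concatenated modalities live on very different scales (e.g. tracker scores vs. normalised HOOF bins).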