Finding Time Together: Detection and Classification of Focused Interaction in Egocentric Video
Sophia Bano, Jianguo Zhang, Stephen J. McKenna, {s.bano, j.n.zhang, s.j.z.mckenna}@dundee.ac.uk
Computer Vision and Image Processing Group, Computing, School of Science and Engineering, University of Dundee, United Kingdom
International Conference on Computer Vision 2017
1. Introduction
Focused Interaction (FI)
• Co-present individuals with a mutual focus of attention interact by establishing face-to-face engagement and direct conversation [1]
Hypothesis
• Fusion of multimodal features can improve overall FI detection
Challenges
• Face-to-face engagement often not maintained
• Conversational partner not always present in the video frame
• Varying illumination
• Varied scenes
Examples from our Focused Interaction Dataset
Existing methods
• Off-line processing of video clips or photo streams captured under fairly constrained conditions, with interacting people always in view [2, 3, 4]
2. Focused Interaction Detection using Multimodal Features
3. Evaluation
Focused Interaction Dataset
• 19 egocentric videos (378 mins) captured using a shoulder-mounted GoPro Hero4 at 18 different locations and with 16 conversational partners
Observations
• Fusion of multimodal features is useful for discriminating No FI from FI (walk) when using an SVM with an RBF kernel
• Face track and VAD scores are significant for discriminating FI (non-walk)
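For context on the kernel choice above: an RBF-kernel SVM compares feature vectors via k(x, y) = exp(-γ‖x - y‖²). A minimal NumPy sketch of the kernel computation (the function name and γ value are illustrative choices, not from the poster):

```python
import numpy as np

def rbf_kernel_matrix(X, Y, gamma=0.1):
    """Pairwise RBF kernel: K[i, j] = exp(-gamma * ||X[i] - Y[j]||^2)."""
    # Squared Euclidean distances via ||x||^2 - 2 x.y + ||y||^2.
    sq = (X ** 2).sum(axis=1)[:, None] - 2.0 * X @ Y.T + (Y ** 2).sum(axis=1)[None, :]
    return np.exp(-gamma * np.maximum(sq, 0.0))

# Two toy 2-D feature vectors at distance 1.
X = np.array([[0.0, 0.0], [1.0, 0.0]])
K = rbf_kernel_matrix(X, X, gamma=1.0)
# K[0, 0] = 1 (zero distance); K[0, 1] = exp(-1)
```

In practice one would hand such a kernel (or simply `kernel='rbf'`) to an off-the-shelf SVM implementation such as scikit-learn's `svm.SVC`.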
Limitations
• Sound from nearby surroundings influenced the VAD
• Low illumination scenarios affected the face tracker
HOG: Histogram of Oriented Gradient
KLT: Kanade-Lucas-Tomasi Tracker
HOOF: Histogram of Oriented Optical Flow [5]
VAD: Voice Activity Detection [6]
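As a concrete illustration of the HOOF feature [5]: optical-flow vectors are binned by orientation, each weighted by its magnitude, and the histogram is L1-normalised. A minimal NumPy sketch (the bin count and function name are our own choices; the original formulation additionally folds symmetric directions, omitted here for brevity):

```python
import numpy as np

def hoof(flow, n_bins=8):
    """Sketch of a Histogram of Oriented Optical Flow (HOOF).

    flow: (H, W, 2) array of per-pixel (dx, dy) flow vectors.
    Returns an L1-normalised orientation histogram in which each
    vector contributes its magnitude to its orientation bin.
    """
    dx, dy = flow[..., 0].ravel(), flow[..., 1].ravel()
    mag = np.hypot(dx, dy)                      # vector magnitudes
    ang = np.arctan2(dy, dx)                    # angles in [-pi, pi]
    idx = np.floor((ang + np.pi) / (2.0 * np.pi) * n_bins).astype(int)
    idx = np.clip(idx, 0, n_bins - 1)           # fold ang == pi into last bin
    hist = np.bincount(idx, weights=mag, minlength=n_bins)
    total = hist.sum()
    return hist / total if total > 0 else hist

# Uniform rightward flow: all mass falls into a single bin.
flow = np.zeros((4, 4, 2))
flow[..., 0] = 1.0
h = hoof(flow)
```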
References
[1] E. Goffman. Encounters: Two studies in the sociology of interaction. Bobbs-Merrill, 1961.
[2] M. Aghaei, M. Dimiccoli, P. Radeva. With whom do I interact? Detecting social interactions in egocentric photostreams. IEEE ICPR, 2016.
[3] S. Alletto, G. Serra, S. Calderara, F. Solera, R. Cucchiara. From ego to nos-vision: Detecting social relationships in first-person views. IEEE CVPRW, 2014.
[4] A. Fathi, J. K. Hodgins, J. M. Rehg. Social interactions: A first-person perspective. IEEE CVPR, 2012.
[5] R. Chaudhry, A. Ravichandran, G. Hager, R. Vidal. Histograms of oriented optical flow and Binet-Cauchy kernels on nonlinear dynamical systems for the recognition of human actions. IEEE CVPR, 2009.
[6] M. Van Segbroeck, A. Tsiartas, S. Narayanan. A robust frontend for VAD: exploiting contextual, discriminative and spectral cues of human voice. INTERSPEECH, 2013.
4. Conclusion and Future Work
• Automatic online classification of Focused Interaction in continuous, egocentric videos
• Multimodal features: face track, VAD and camera motion profile
• Best performance with multimodal feature fusion and SVM with RBF kernel
• Future work involves using recurrent neural networks for classification and extending this work to identify conversational partners
Acknowledgements
This work is supported by the UK Engineering and Physical Sciences Research Council (EPSRC) under grant EP/N014278/1: ACE-LP: Augmenting Communication using Environmental Data to drive Language Prediction.
An outdoor night-time FI scenario with weak visual cues due to low illumination
[Results: Linear Kernel SVM]
https://ace-lp.ac.uk/
FI in which conversational partners are in the field of view of the camera
FI in which the conversational partner is no longer in the field of view as the interaction occurred while walking
[Results: RBF Kernel SVM]
[Figure: example ground-truth timeline with segments Camera setup, Computer work (FI-NW), Searching for documents, and Walk / turn around / walk (FI-NW); classes: No Focused Interaction, Focused Interaction (non-walk), Focused Interaction (walk)]
[Figure: feature profiles over Time (0-200 sec): HOOF (bins), VAD scores, and face tracker score]
[Pipeline diagram: Input video → Video Stream and Audio Stream. Video Stream → Face detection (HOG) and tracking (KLT), and Camera motion feature (HOOF); Audio Stream → Audio-based feature (VAD). Feature concatenation → Temporal windowing → Classification using SVM (Linear/RBF) → No FI / FI-NW (non-walk) / FI-W (walk)]
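The Section 2 detection pipeline (per-frame face-track, VAD and HOOF features concatenated, temporally windowed, then classified by an SVM) can be sketched as follows; the window length, step size and mean pooling are our own assumptions, since the poster does not state them:

```python
import numpy as np

def concat_features(face_score, vad_score, hoof_hist):
    """Concatenate per-frame multimodal features into one vector per frame.

    face_score: (T,) face-tracker score;  vad_score: (T,) VAD score;
    hoof_hist:  (T, B) HOOF histogram.    Returns (T, 2 + B).
    """
    return np.column_stack([face_score, vad_score, hoof_hist])

def temporal_windows(features, win=30, step=15):
    """Mean-pool frame features over sliding temporal windows."""
    pooled = [features[s:s + win].mean(axis=0)
              for s in range(0, features.shape[0] - win + 1, step)]
    return np.stack(pooled)

# Toy example: 120 frames, each with an 8-bin HOOF histogram.
rng = np.random.default_rng(0)
T, B = 120, 8
X = concat_features(rng.random(T), rng.random(T), rng.random((T, B)))
W = temporal_windows(X, win=30, step=15)    # one descriptor per window
```

Each row of W would then be classified into No FI / FI-NW / FI-W by an SVM (e.g. scikit-learn's `svm.SVC` with a linear or RBF kernel).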