Local Relationship Learning with Person-specific Shape Regularization for Facial Action Unit Detection

Xuesong Niu 1,3, Hu Han 1,2, Songfan Yang 5,6, Yan Huang 6, Shiguang Shan 1,2,3,4
1 Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing 100190, China
2 Peng Cheng Laboratory, Shenzhen, China
3 University of Chinese Academy of Sciences, Beijing 100049, China
4 CAS Center for Excellence in Brain Science and Intelligence Technology, Shanghai, China
5 College of Electronics and Information Engineering, Sichuan University, Chengdu, Sichuan, China
6 TAL Education Group, Beijing, China
[email protected], {hanhu, sgshan}@ict.ac.cn, {yangsongfan, galehuang}@100tal.com

Abstract

Encoding individual facial expressions via action units (AUs) coded by the Facial Action Coding System (FACS) has been found to be an effective approach to resolving the ambiguity among different expressions. While a number of methods have been proposed for AU detection, robust AU detection in the wild remains a challenging problem because of the diverse baseline AU intensities across individual subjects and the weak appearance signal of AUs. To resolve these issues, in this work we propose a novel AU detection method that utilizes local information and the relationships among individual local face regions. Through such local relationship learning, we expect to leverage rich local information to improve the robustness of AU detection against the potential perceptual inconsistency of individual local regions. In addition, considering the diversity in the baseline AU intensities of individual subjects, we further regularize local relationship learning via person-specific face shape information, i.e., reducing the influence of person-specific shape information and obtaining more AU-discriminative features. The proposed approach outperforms the state-of-the-art methods on two widely used public-domain AU detection datasets (BP4D and DISFA).

1. Introduction

Facial expression is a natural and powerful means of human communication, and is highly associated with a person's intention, attitude, or mental state. Therefore, facial expression analysis has wide potential applications in diagnosing mental health [32], improving e-learning experiences [30], and detecting deception [12]. However, direct facial expression recognition in the wild can be challenging because of ambiguities between several expressions.

Figure 1: Each single local facial region defined for AUs in FACS (red circles) can be ambiguous because of face variations in pose, illumination, etc.; therefore, taking into account the relationships among multiple related face regions (yellow circles) provides more robustness than using individual local regions separately. At the same time, person-specific face shape information also influences AU detection performance, e.g., detection of AU4 (Brow Lowerer) is highly influenced by the eye-eyebrow distance, which may vary significantly among subjects. We therefore aim to reduce the influence of such person-specific shape information on the AU detection task, i.e., through regularization during feature learning.
One of the effective methods for resolving this ambiguity is to represent individual expressions using the Facial Action Coding System (FACS) [10], in which each expression is identified as a specific configuration of multiple basic facial action units (AUs).
... regularization terms $m_j$ for the $C$ AUs, which are further used for refining the AU occurrence probabilities predicted by L-Net.
where $P_{center}$ is the center point of the two eyes, and $d$ is the interpupillary distance (IPD).
The normalized landmarks are used as the input to our P-Net in order to predict the person-specific shape regularization terms (see Fig. 4). We expect P-Net to learn only the AU-independent, person-specific face shape information, so we propose a regularization loss $L_r$ that aims to orthogonalize the features learned by P-Net and the features used for AU detection by L-Net. The loss is formulated as
$L_r = |f_{au} \cdot f_s| \qquad (3)$
where $\cdot$ denotes the inner product of two vectors, $f_{au}$ is the average of the $k$ local features generated from the stem network, and $f_s$ is the last-layer feature of P-Net used for predicting the regularization terms. For each input image, we calculate the regularization terms $m_1, m_2, \cdots, m_C$ for all the $C$ AUs and use them to refine the probabilities predicted by L-Net. The final predicted probabilities $p_1, p_2, \cdots, p_C$ of LP-Net for all the $C$ AUs can be written as
$p_j = \sigma\Big(\frac{1}{k}\sum_{i=1}^{k} \mathrm{LSTM}_j(f_i) + m_j\Big), \quad j = 1, 2, \cdots, C \qquad (4)$
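A minimal PyTorch sketch of the prediction in Eq. (4) is given below. The module name LocalRelationHead, the layer sizes, and the treatment of the $k$ local features as an LSTM input sequence are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn

class LocalRelationHead(nn.Module):
    """Sketch of Eq. (4): an LSTM reads the k local features in sequence,
    a linear layer maps each step's hidden state to C per-AU scores
    (LSTM_j(f_i)), the scores are averaged over the k steps, shifted by the
    person-specific terms m_j from P-Net, and squashed by a sigmoid."""

    def __init__(self, feat_dim=512, hidden_dim=256, num_aus=12):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_aus)  # per-step, per-AU scores

    def forward(self, local_feats, m):
        # local_feats: (B, k, feat_dim), the k local features from the stem net
        # m: (B, C), person-specific shape regularization terms from P-Net
        h, _ = self.lstm(local_feats)      # (B, k, hidden_dim)
        scores = self.fc(h)                # (B, k, C): LSTM_j(f_i)
        avg = scores.mean(dim=1)           # (B, C): (1/k) * sum_i LSTM_j(f_i)
        return torch.sigmoid(avg + m)      # (B, C): p_j as in Eq. (4)
```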
AU prediction is a multi-label binary classification problem, and for most AU prediction benchmarks the occurrences of AUs are highly imbalanced [9, 27, 39]. To better handle such a multi-label, imbalanced problem, we use a binary cross-entropy loss $L_{au}$ with Selective Learning [16] as our loss function
$L_{au} = -\frac{1}{C}\sum_{j=1}^{C} w_j \left[\hat{p}_j \log p_j + (1 - \hat{p}_j)\log(1 - p_j)\right] \qquad (5)$
where $\hat{p}_j$ represents the ground-truth probability of occurrence of the $j$-th AU, with 1 denoting occurrence and 0 denoting non-occurrence, and $p_j$ is the probability predicted by our LP-Net in Eq. (4). The weight $w_j$ is a balancing parameter calculated in each batch using the Selective Learning strategy [16]. The overall loss function of the proposed LP-Net can be written as
$L_{all} = L_{au} + \lambda L_r \qquad (6)$

where $\lambda$ is a hyper-parameter that balances the influence of the two losses.
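Eqs. (3), (5), and (6) can be sketched together in PyTorch as follows. The tensor shapes and the exact form of the per-batch Selective Learning weights w are assumptions made for illustration.

```python
import torch

def lp_net_loss(p_pred, p_true, f_au, f_s, w, lam=1.0):
    """Sketch of Eqs. (3), (5), and (6).

    p_pred: (B, C) predicted AU probabilities p_j from Eq. (4)
    p_true: (B, C) binary ground-truth AU labels (the \\hat{p}_j in Eq. (5))
    f_au:   (B, D) average of the k local features from the stem network
    f_s:    (B, D) last-layer feature of P-Net
    w:      (C,)  per-AU balancing weights computed per batch
                  (Selective Learning [16])
    """
    # Eq. (3): |f_au . f_s|, orthogonality between AU and shape features
    l_r = (f_au * f_s).sum(dim=1).abs().mean()

    # Eq. (5): weighted binary cross-entropy, averaged over the C AUs
    eps = 1e-7
    p = p_pred.clamp(eps, 1.0 - eps)
    bce = -(p_true * p.log() + (1.0 - p_true) * (1.0 - p).log())  # (B, C)
    l_au = (w * bce).mean()

    # Eq. (6): overall loss
    return l_au + lam * l_r
```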
4. Experimental Results
In this section, we provide experimental evaluations on two public-domain AU detection databases and give a detailed analysis of the experimental results.
4.1. Experimental Settings
4.1.1 Database
We evaluate our LP-Net on two spontaneous databases, BP4D [45] and DISFA [28], which have been widely used for facial AU detection. BP4D is a spontaneous facial expression database containing 328 videos of 41 participants (23 females and 18 males). Each subject is involved in 8 sessions, and their spontaneous facial actions are captured with both 2D and 3D videos. 12 AUs are coded for the 328 videos, and there are about 140,000 frames with AU labels of occurrence or absence. DISFA consists of 27 videos from 12 females and 15 males. Each subject is asked to watch a 4-minute video to elicit facial AUs. 12 AUs are labeled with AU intensity from 0 to 5 for each video. About 130,000 frames are used in the final experiments. Following the experimental setting of [8, 23, 24, 35], we conduct subject-exclusive 3-fold cross-validation on BP4D, and further fine-tune the best model trained on BP4D for AU detection on DISFA under a subject-exclusive 3-fold validation protocol. For the DISFA database, 8 of the 12 AUs are used for evaluation; frames with AU intensity equal to or greater than 2 are selected as positive samples, and the rest as negative samples.
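The DISFA label conversion described above amounts to thresholding the intensity annotations; a small sketch follows, where the function name and array layout are assumed.

```python
import numpy as np

def binarize_disfa_labels(intensity, threshold=2):
    """Frames with AU intensity >= 2 become positive (1), the rest negative (0).

    intensity: (num_frames, num_aus) integer AU intensities in [0, 5]
    """
    return (intensity >= threshold).astype(np.int64)
```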
4.1.2 Image Pre-processing
For each input image, the CE-CLM facial landmark detector is used to estimate 68 facial landmarks (see Fig. 4). Then, following the idea of Baltrusaitis et al. [2], all faces are aligned and masked using a similarity transform based on the detected landmarks to reduce variations in pose and scale. All aligned face images are resized to 240 × 240 and then randomly cropped to 224 × 224 for training. Images center-cropped from the aligned faces are used for testing. We also use random horizontal flipping and random rotation for data augmentation.
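This pipeline maps directly onto torchvision transforms; a sketch is given below, where the rotation range is an assumed value since it is not specified above.

```python
from torchvision import transforms

# Training: resize to 240 x 240, flip, rotate, random 224 x 224 crop.
train_transform = transforms.Compose([
    transforms.Resize((240, 240)),
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(10),     # assumed +/- 10 degree range
    transforms.RandomCrop(224),
    transforms.ToTensor(),
])

# Testing: resize, then center crop.
test_transform = transforms.Compose([
    transforms.Resize((240, 240)),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])
```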
4.1.3 Training
We incrementally train each part of our LP-Net. First, we pre-train the stem network on the face recognition database VGGFace2 [6], and then train it on the AU databases using an Adam optimizer with an initial learning rate of 0.001. After that, we add the L-Net module and jointly train the stem network and L-Net, with initial learning rates of 0.0005 for the stem network and 0.001 for L-Net. Next, we add the P-Net and jointly train the whole network, with an initial learning rate of 0.0005 for the stem network and L-Net and 0.001 for P-Net. Each training stage runs for at most 30 epochs with a batch size of 100. The balancing parameter $\lambda$ for the regularization loss $L_r$ is set to 1. All implementations are based on PyTorch [31].
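The final joint-training stage, with different learning rates per module, can be sketched with Adam parameter groups; the placeholder module definitions below only serve to make the sketch self-contained.

```python
import torch
import torch.nn as nn

# Placeholders standing in for the real stem network, L-Net, and P-Net.
stem_net = nn.Sequential(nn.Conv2d(3, 64, 7, stride=2), nn.ReLU())
l_net = nn.LSTM(512, 256, batch_first=True)
p_net = nn.Sequential(nn.Linear(136, 64), nn.ReLU())  # 136 = 68 landmarks x 2

# One Adam optimizer, per-module learning rates via parameter groups.
optimizer = torch.optim.Adam([
    {"params": stem_net.parameters(), "lr": 5e-4},  # stem network: 0.0005
    {"params": l_net.parameters(),    "lr": 5e-4},  # L-Net: 0.0005
    {"params": p_net.parameters(),    "lr": 1e-3},  # newly added P-Net: 0.001
])
```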
4.1.4 Evaluation Metrics
We evaluate the performance of all methods using the F1-frame score [19]. The F1-frame score is the harmonic mean of the precision and recall of frame-based AU detection and has been widely used for AU detection. For each method, the F1-frame scores for all AUs are calculated and then averaged (denoted as Avg.) for evaluation.
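A minimal sketch of this metric, computed here with scikit-learn as an assumed tooling choice:

```python
import numpy as np
from sklearn.metrics import f1_score

def f1_frame(y_true, y_pred):
    """Per-AU F1-frame scores and their average (Avg.).

    y_true, y_pred: (num_frames, num_aus) binary arrays
    """
    scores = [f1_score(y_true[:, j], y_pred[:, j])
              for j in range(y_true.shape[1])]
    return scores, float(np.mean(scores))
```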
4.2. Results
4.2.1 Comparisons with the State-of-the-art
We first compare our LP-Net against the state-of-the-art methods under the same subject-exclusive three-fold cross-validation protocol. The traditional methods LSVM [11], JPML [46], APL [48], and CPM [43], and the deep learning methods DRML [47], EAC-Net [24], ROI [23], DSIN [8], and JAA-Net [35] are used for comparison. Since we focus on image-based AU detection in this work, video-based methods such as ROI-LSTM [23] are not included. We also notice that some methods, such as DSIN [8], used per-AU threshold tuning, while most of the other baseline methods did not. For fair comparison, we therefore report the performance of all methods without per-AU threshold tuning. For the baseline methods LSVM [11], JPML [46], APL [48], and CPM [43], we directly use the results reported in [24, 35, 47].
Table 1 shows the results of different methods on the BP4D database. It can be seen that our LP-Net outperforms all the baseline approaches on this challenging spontaneous facial expression database. Compared with the state-of-the-art methods based on deeply-learned local features, such as ROI [23], DRML [47], JAA-Net [35], and DSIN [8], our LP-Net achieves the best or second-best detection performance for most of the 12 AUs annotated in BP4D. We also achieve the best performance in terms of average F1-frame score. At the same time, our LP-Net outperforms the person-specific AU detection models, such as CPM [43], by a large margin, which indicates that our P-Net is very effective in dealing with the challenge of diverse baseline AU intensities among different subjects.
When comparing with the state-of-the-art methods [35, 8], we also find that the performance of our LP-Net drops when the facial regions of the AUs are small, as for AU1 and AU2. The reason is that the local features are generated from the last layer of the Stem-Net; they are high-level in semantics and may not be sensitive enough to represent small regions. However, although the performance drops when directly using the Stem-Net for local feature generation, the computational complexity is significantly reduced, because our LP-Net needs neither additional backbone networks [8] for local feature generation nor an additional branch to enhance the local features [35].
Experimental results on the DISFA database are reported
in Table 2. It can be observed that our LP-Net again outper-
forms all the state-of-the-art methods. We achieve the best
performance on most AUs, as well as the average F1-frame
score for all AUs. These results suggest that our LP-Net has
a good generalization ability.
4.2.2 Ablation Study
We conduct an ablation study to investigate the effectiveness of each part of our LP-Net. Table 3 shows the F1-frame scores for each AU, as well as the average F1-frame score, for the individual ablation experiments on BP4D.
Choice of Stem Network: In our LP-Net, the stem network is used for local feature generation. We choose ResNet as the stem network, and three commonly used variants (ResNet-18, ResNet-34, and ResNet-50) have been considered. The results are shown in Table 3. We can see that ResNet-34 outperforms ResNet-18, improving the average F1-frame score from 52.9 to 53.7, indicating that a deeper network can provide richer features for AU detection. However, when the network is further deepened to ResNet-50, the performance drops to 52.5. A possible reason is that the AU databases contain limited subjects, and a very deep network may suffer from over-fitting. We use ResNet-34 in the following experiments.
Data Balancing with Selective Learning: Since collecting and annotating AUs for a large face database is costly, most AU databases are highly imbalanced. After applying the Selective Learning strategy [16] for data balancing, the average F1-frame score on BP4D improves from 53.7 to 55.2, indicating the effectiveness of Selective Learning in our LP-Net.
Data Augmentation and Model Pre-training: Because of the difficulties of AU data collection, there are usually limited subjects in AU databases. Data augmentation and model
Table 1: F1-frame scores (in %) for 12 AUs reported by the proposed LP-Net and the state-of-the-art methods on the BP4D database. The best results are indicated with brackets and bold; the second-best with brackets alone.